10-minute guide
Introduction
filterx is a command-line tool for filtering lines from files. It is inspired by grep, awk, and bioawk.
Features
- 🚀 Filter lines by column-based expressions
- 🎨 Support for multiple input formats, e.g. vcf/sam/fasta/fastq/gff/bed/csv/tsv
- 🎉 Cross-platform support
- 📦 Easy to install
- 📚 Rich documentation
Installation
Install by Cargo
cargo
cargo install filterx
# or
cargo install --git https://github.com/dwpeng/filterx.git
Install from Pip
Download from GitHub Release
GitHub Prebuilt Binary
https://github.com/dwpeng/filterx/releases
Quick Start
This section provides a brief overview of filterx's capabilities. We'll use this sample CSV file for many examples:
example.csv
name,age
Alice,20
Bob,30
Charlie,40
Basic Filtering
Filter rows where age is greater than 25. If your file has a header, use the -H flag.
filterx csv example.csv -H -e 'age > 25'

# Output
# Bob,30
# Charlie,40
For files without headers, use col(index) to refer to columns (0-indexed).
# Assuming example-no-header.csv is example.csv without the header
filterx csv example-no-header.csv -e 'col(1) > 25'

# Output
# Bob,30
# Charlie,40
Combine conditions with the and operator, or by providing multiple -e flags. To keep the header in the output, use --oH.
filterx csv example.csv -H --oH -e 'age > 25 and name == "Bob"'

# Output
# name,age
# Bob,30
Column Manipulation
filterx provides powerful functions to manipulate columns.
Select columns: Use select to choose which columns to output and in what order.
filterx csv example.csv -H --oH -e 'select(age, name)'

# Output
# age,name
# 20,Alice
# 30,Bob
# 40,Charlie
Create new columns: Use alias to add a new column.
filterx csv example.csv -H -e 'alias(age_plus_10) = age + 10'

# Output
# Alice,20,30
# Bob,30,40
# Charlie,40,50
Rename columns: Use rename to change a column's name.
filterx csv example.csv -H --oH -e 'rename(name, person_name)'

# Output
# person_name,age
# Alice,20
# Bob,30
# Charlie,40
Row Manipulation
Get the first N rows: Use head.
filterx csv example.csv -H -e 'head(1)'

# Output
# Alice,20
Get the last N rows: Use tail.
filterx csv example.csv -H -e 'tail(1)'

# Output
# Charlie,40
Drop rows with missing values: Use drop_null.
missing.csv
name,age
Alice,20
Bob,
,30
filterx csv missing.csv -H --oH -e 'drop_null(age)'

# Output
# name,age
# Alice,20
# ,30
String and Sequence Manipulation
Get string length: Use len.
filterx csv example.csv -H -e 'len(name) > 4'

# Output
# Charlie,40
Extract with regex: Use extract to get parts of a string.
filterx csv names.csv -H -e 'alias(age) = extract(name, "(\d+)")'

# Output
# name,age
# Alice-20,20
# Bob-30,30
Bioinformatics example: Calculate GC content and sequence length.
bio_example.fasta
>seq1
ATGC
>seq2
GATTACA
filterx fasta bio_example.fasta -e 'print("{name}\tGC={gc(seq)}\tlen={len(seq)}")'

# Output
# seq1 GC=0.5 len=4
# seq2 GC=0.2857142857142857 len=7
In-place Editing
Functions ending with an underscore (_) modify the data directly (in place). For example, to convert a sequence to its reverse complement:
filterx fasta bio_example.fasta -e 'revcomp_(seq)'

# Output
# >seq1
# GCAT
# >seq2
# TGTAATC
Advanced Output Formatting
Use print with f-string-like syntax to create custom output. You can even call other functions inside print.
filterx csv example.csv -H \
  -e 'age > 25' \
  -e 'print("{name} is {age} years old. Name length: {len(name)}")'

# Output
# Bob is 30 years old. Name length: 3
# Charlie is 40 years old. Name length: 7
Functions
Column
alias
When adding a new column to the table, the new column name must be wrapped in the alias() function. alias takes a column name as its argument and can only be used in a column-creation statement.
Create a new column named c with value a + 1:
example
filterx c -H test.csv -e "alias(c) = a + 1"

# output
a,b,c
1,2,2
3,4,4
cast
Convert values in a column to another type. cast must be given the cast_type and the column_name; in the example below the type is part of the function name, cast_str_(b).
test.csv
a,b,c
a,1,1.1
b,2,2.2
c,3,3.3
Example1
filterx csv -H --oH test.csv -e 'cast_str_(b);alias(d) = a + b'

# Output
a,b,c,d
a,1,1.100,a1
b,2,2.200,b2
c,3,3.300,c3
col
Select a column by name or index.
For a csv with header, you can directly use the column name to select the column.
data.csv
name,age
Alice,30
Bob,25
Example1
filterx csv data.csv -H --oH -e "select(col(name))"

# Output:
name
Alice
Bob
For a csv without header, you can use the column index to select the column. The index starts from 0.
Example2
filterx csv data.csv --oH -e "select(col(0))"

# Output:
Alice
Bob
col can also be used to select multiple columns with a regex.
data.csv
a1,a2,a3,b
1,2,3,4
5,6,7,8
Example3
filterx csv data.csv --oH -e "select(col('^a\d+$'))"

# Output:
a1,a2,a3
1,2,3
5,6,7
drop_null
Drop rows with null values.
test.csv
name,age
Alice,20
Bob,
,30
filterx csv test.csv -H --oH -e 'drop_null(age)'

# output
name,age
Alice,20
,30
Multiple columns are allowed:
filterx csv test.csv -H --oH -e 'drop_null(age, name)'

# output
name,age
Alice,20
If no column is specified, drop_null applies to all columns: any row with a null value is dropped.
filterx csv test.csv -H --oH -e 'drop_null()'

# output
name,age
Alice,20
dup
Deduplicate rows by the values in one or more columns. There are 4 functions:
- dup: drop duplicated items in a column and keep the first occurrence.
- dup_last: drop duplicated items in a column and keep the last occurrence.
- dup_any: drop duplicated items in a column and keep any one occurrence.
- dup_none: drop duplicated items in a column and keep no occurrence.
data.csv
name,country
Alice,USA
Bob,UK
Alice,Canada
Alice,USA
Example
filterx csv -H --oH data.csv -e 'dup(name)'

# Output
name,country
Alice,USA
Bob,UK
dup supports multiple columns.
Example
filterx csv -H --oH data.csv -e 'dup(name, country)'

# Output
name,country
Alice,USA
Bob,UK
Alice,Canada
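The keep-first, keep-last, and keep-none behaviours described above can be sketched in plain Python. This is only an illustration of the semantics, not filterx's implementation: the helper name dedup and its keep parameter are hypothetical, and the row order that filterx produces for dup_last is an assumption.

```python
from collections import Counter

def dedup(rows, key, keep="first"):
    """Deduplicate rows on key(row).

    keep="first" mirrors dup, keep="last" mirrors dup_last,
    keep="none" mirrors dup_none (drop every row whose key repeats).
    """
    if keep == "none":
        counts = Counter(key(r) for r in rows)
        return [r for r in rows if counts[key(r)] == 1]
    if keep == "last":
        # Walk backwards so the last occurrence wins, then restore order.
        return dedup(rows[::-1], key, "first")[::-1]
    seen = set()
    out = []
    for r in rows:
        k = key(r)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

rows = [("Alice", "USA"), ("Bob", "UK"), ("Alice", "Canada"), ("Alice", "USA")]
print(dedup(rows, key=lambda r: r[0]))               # [('Alice', 'USA'), ('Bob', 'UK')]
print(dedup(rows, key=lambda r: r[0], keep="none"))  # [('Bob', 'UK')]
print(dedup(rows, key=lambda r: r))                  # dedup on (name, country)
```

Using the whole row as the key corresponds to passing both columns, as in the dup(name, country) example.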
fill
Fill in a value where the existing value is null or NA. There are three functions:
fill_null: fill value if the value is null.
fill_na: fill value if the value is NA.
fill: same as fill_null.
test.csv
name,age
Alice,20
Bob,
Charlie,30
filterx csv test.csv -H --oH -e 'fill_null_(age, 0)'

# output
name,age
Alice,20
Bob,0
Charlie,30
header
Print the table headers and exit.
data.csv
a1,a2,a3,b
1,2,3,4
5,6,7,8
Example
filterx csv data.csv -e "header()"

# Output
index name type
0 a1 i64
1 a2 i64
2 a3 i64
3 b i64
is_null
Keep only the rows where a column is null. Use is_not_null to keep only the rows where it is not null.
test.csv
name,age
Alice,20
Bob,
Charlie,30
Example1
filterx csv test.csv -H --oH -e 'is_null(age)'

# output
name,age
Bob,
Example2
filterx csv test.csv -H --oH -e 'is_not_null(age)'

# output
name,age
Alice,20
Charlie,30
occ
Filter rows by occurrence.
occ is a function that filters rows by the number of times a value occurs in a column.
occ_lte: keep rows whose value occurs at most the given number of times
occ_gte: keep rows whose value occurs at least the given number of times
occ: same as occ_gte
data.csv
a,b,c
1,2,3
2,2,1
2,2,3
2,1,3
Example1
filterx csv -H --oH data.csv -e 'occ(a, 2)'

# Output
a,b,c
2,2,1
2,2,3
2,1,3
occ supports multiple columns.
Example2
filterx csv -H --oH data.csv -e 'occ_lte(a, b, 1)'

# Output
a,b,c
1,2,3
2,1,3
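As a rough illustration of the counting rule, the sketch below keeps rows whose value (or combination of values) occurs at least, or at most, n times. The function occ_filter and its mode parameter are hypothetical names chosen for this sketch and are not part of filterx; columns are addressed by position for brevity.

```python
from collections import Counter

def occ_filter(rows, cols, n, mode="gte"):
    """Keep rows whose values in `cols` occur >= n (gte) or <= n (lte) times."""
    def key(row):
        return tuple(row[c] for c in cols)

    counts = Counter(key(r) for r in rows)
    if mode == "gte":
        return [r for r in rows if counts[key(r)] >= n]
    return [r for r in rows if counts[key(r)] <= n]

# Same rows as data.csv above, as (a, b, c) tuples.
rows = [(1, 2, 3), (2, 2, 1), (2, 2, 3), (2, 1, 3)]
print(occ_filter(rows, [0], 2))          # [(2, 2, 1), (2, 2, 3), (2, 1, 3)]
print(occ_filter(rows, [0], 1, "lte"))   # [(1, 2, 3)]
print(occ_filter(rows, [0, 1], 2))       # [(2, 2, 1), (2, 2, 3)]
```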
print
Format and print the given value.
data.csv
name,age
alice,34
bob,23
Example1
filterx csv -H --oH data.csv -e "print('{name} is {age} years old')"

# Output
alice is 34 years old
bob is 23 years old
data.fastq
@seq1
ACGT
+
IIII
@seq2
TGCA
+
IIII
Example2
filterx fq data.fastq -e "print('>{name}
{seq}')"

# Output
>seq1
ACGT
>seq2
TGCA
print can also call functions in format strings.
Example3
filterx fq data.fastq -e "print('{name} {gc(seq)} {len(seq)}')"

# Output
seq1 0.5 4
seq2 0.5 4
rename
Rename a column in the table.
data.csv
name,age
alice,34
bob,23
Example
filterx csv -H --oH data.csv -e 'rename(name, first_name)'

# Output
first_name,age
alice,34
bob,23
rm
Delete one or more columns from a table.
data.csv
a1,a2,a3,b
1,2,3,4
5,6,7,8
Example
filterx csv data.csv -e "rm(a1)"

# Output
a2,a3,b
2,3,4
6,7,8
select
Selects the columns of a table to output. The columns are specified by their names.
data.csv
name,age
alice,34
bob,23
Example1
filterx csv -H --oH data.csv -e 'select(name)'

# Output
name
alice
bob
Example2
filterx csv -H --oH data.csv -e 'select(age, name)'

# Output
age,name
34,alice
23,bob
sort
Sort the rows of the table by one or more columns. Two variants, Sort and sorT, make the direction explicit:
- Sort: from high to low
- sorT: from low to high
- sort: same as sorT
data.csv
a,b,c
1,2,3
3,2,1
2,1,3
Example1
filterx csv -H --oH data.csv -e 'sort(a)'

# Output
a,b,c
1,2,3
2,1,3
3,2,1
Example2
filterx csv -H --oH data.csv -e 'Sort(a)'

# Output
a,b,c
3,2,1
2,1,3
1,2,3
sort supports multiple columns.
data.csv
a,b,c
1,2,3
3,2,1
3,1,2
2,1,3
Example3
filterx csv -H --oH data.csv -e 'sort(a, b)'

# Output
a,b,c
1,2,3
2,1,3
3,1,2
3,2,1
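Sorting on several columns amounts to comparing rows by a tuple of the chosen columns, with later columns breaking ties. A minimal Python sketch of that idea follows; sort_rows and its descending flag are hypothetical names for illustration, and how filterx orders ties internally is not stated in this guide.

```python
def sort_rows(rows, cols, descending=False):
    """Sort rows by the given column positions; later columns break ties."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in cols), reverse=descending)

# Same rows as the second data.csv above, as (a, b, c) tuples.
rows = [(1, 2, 3), (3, 2, 1), (3, 1, 2), (2, 1, 3)]
print(sort_rows(rows, [0, 1]))
# [(1, 2, 3), (2, 1, 3), (3, 1, 2), (3, 2, 1)]
print(sort_rows(rows, [0, 1], descending=True))
# [(3, 2, 1), (3, 1, 2), (2, 1, 3), (1, 2, 3)]
```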
Number
abs
Compute the absolute value of a number.
Example1
filterx csv -H --oH test.csv -e 'abs_(a)'

# Output
a
2
0
1
Example2
filterx csv -H --oH test.csv -e 'abs(a) > 1'

# Output
a
2
Row
head
Get the first n rows of a file.
test.csv
a,b,c
1,2,3
4,5,6
7,8,9
Example1
filterx csv -H --oH test.csv -e "head(2)"

# output
a,b,c
1,2,3
4,5,6
Example2
filterx csv -H --oH test.csv -e "b > 2;head(1)"

# output
a,b,c
4,5,6
tail
Get the last n rows of a file.
test.csv
a,b,c
1,2,3
4,5,6
7,8,9
Example1
filterx csv -H --oH test.csv -e "tail(2)"

# output
a,b,c
4,5,6
7,8,9
Example2
filterx csv -H --oH test.csv -e "b > 2;tail(1)"

# output
a,b,c
7,8,9
Sequence
gc
Compute the GC content of a sequence, i.e. the fraction of bases that are G or C.
test.fa
>seq1
ATGC
>seq2
ATGCC
example
filterx fasta test.fa -e "gc(seq) > 0.5"

# output
>seq2
ATGCC
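GC content is the number of G and C bases divided by the sequence length. The sketch below shows the arithmetic in Python; gc_content is a hypothetical helper, and treating lowercase bases like uppercase ones is an assumption of this sketch rather than documented filterx behaviour.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (case-insensitive here)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGC"))   # 0.5
print(gc_content("ATGCC"))  # 0.6
```

With the threshold gc(seq) > 0.5 from the example, only the second sequence passes, because 0.5 is not strictly greater than 0.5.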
hpc
Compute the HPC (homopolymer compression) of a sequence: each run of contiguous identical characters is compressed into a single character.
example
filterx fa test.fa -e "hpc_(seq)"

# output
>seq1
ATGCT
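Homopolymer compression can be sketched in a few lines of Python with itertools.groupby, which groups consecutive equal characters. The function name hpc mirrors the filterx function for readability only, and the input sequence is a made-up example.

```python
from itertools import groupby

def hpc(seq: str) -> str:
    """Collapse every run of identical characters into one character."""
    return "".join(base for base, _run in groupby(seq))

print(hpc("AAATTGCCCT"))  # ATGCT
```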
phred
Detect the Phred quality encoding of a FASTQ file.
There are two kinds of Phred encoding: phred33 and phred64.
test.fq
@seq1
ATGC
+
IIII
@seq2
ATGCC
+
IIIII
example
filterx fq test.fq -e "phred()"

# output
phred: phred64
qual
Compute the mean quality of a sequence.
test.fq
@seq1
ATGC
+
IIII
@seq2
ATGCC
+
IIIII
example
filterx fq test.fq -e "print('{name}: {qual(seq)}')"
output
# output
seq1: 4.245115
seq2: 3.9658952
revcomp
Compute the reverse complement of a sequence.
test.fa
>seq1
ATGC
>seq2
ATGCC
example
filterx fa test.fa -e "revcomp_(seq)"

# output
>seq1
GCAT
>seq2
GGCAT
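The reverse complement swaps each base for its partner (A with T, C with G) and then reverses the string. A minimal Python sketch follows; it handles only the four standard bases, and leaving other characters unchanged is an assumption of the sketch.

```python
# Translation table for the four standard bases, upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

print(revcomp("ATGC"))   # GCAT
print(revcomp("ATGCC"))  # GGCAT
```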
to_fasta
Convert a fastq file to fasta format.
test.fq
@seq1
ATGC
+
IIII
@seq2
ATGCC
+
IIIII
example
filterx fq test.fq -e "to_fasta()"

# output
>seq1
ATGC
>seq2
ATGCC
For a fastq file, you can directly use the --no-qual option to convert it to fasta format.
example
filterx fq test.fq --no-qual

# output
>seq1
ATGC
>seq2
ATGCC
to_fastq
Convert a FASTA file to a FASTQ file. Quality scores will be set to ?.
test.fa
>aaa comment1
ctatgctatctatcatc
>bbb comment2
aaa
bbb
CCC
example
filterx fa test.fa -e "to_fq()"

# output
@aaa
ctatgctatctatcatc
+
?????????????????
@bbb
aaabbbCCC
+
?????????
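The conversion shown above joins multi-line sequences, keeps only the record name, and writes one placeholder quality character per base. The Python sketch below reproduces that output shape; fasta_to_fastq is a hypothetical helper, and dropping the header comment is inferred from the example output rather than from filterx's documentation.

```python
def fasta_to_fastq(text: str, qual_char: str = "?") -> str:
    """Convert FASTA text to FASTQ text with a constant quality character."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            name = parts[0] if parts else ""  # keep the name, drop the comment
            chunks = []
        else:
            chunks.append(line)  # multi-line sequences are joined
    if name is not None:
        records.append((name, "".join(chunks)))
    return "".join(f"@{n}\n{s}\n+\n{qual_char * len(s)}\n" for n, s in records)

fasta = ">aaa comment1\nctatgctatctatcatc\n>bbb comment2\naaa\nbbb\nCCC\n"
print(fasta_to_fastq(fasta), end="")
```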
String
extract
Extracts a substring from a string by using a regular expression.
filterx csv test.csv -H -e 'alias(age) = extract(name, "(\d+)")'

# Output
name,age
Alice-20,20
Bob-30,30
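The same extraction can be sketched with Python's re module. Returning the first capture group when the pattern has one, and the whole match otherwise, is an assumption made for this sketch; the guide only shows a pattern with a single group.

```python
import re

def extract(value: str, pattern: str):
    """Return the first capture group of the first match, or None if no match."""
    match = re.search(pattern, value)
    if match is None:
        return None
    # Fall back to the whole match when the pattern has no groups.
    return match.group(1) if match.groups() else match.group(0)

print(extract("Alice-20", r"(\d+)"))  # 20
print(extract("Bob-30", r"(\d+)"))    # 30
print(extract("Carol", r"(\d+)"))     # None
```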
len
Compute the length of a string.
test.fasta
>seq1
ACGT
>seq2
AGCTGGG
Example1
filterx fa test.fasta -e "print('{name} {len(seq)}')"

# output
seq1 4
seq2 7
lower
Converts a string to lowercase.
test.fasta
>seq1
ACGT
>seq2
AGCTGGG
Example1
filterx fa test.fasta -e "print('{lower(seq)}')"

# output
acgt
agctggg
Example2
filterx fa test.fasta -e "alias(lower_seq) = lower(seq)" \
  -e "print('{lower_seq} {seq}')"

# output
acgt ACGT
agctggg AGCTGGG
replace
Replace a substring with another string. There are two kinds of function: replace and replace_one. The difference between them is that replace replaces all occurrences of the substring, while replace_one replaces only the first occurrence.
Example1
filterx fa test.fa -e "replace_(seq, 'A', 'G')"

# output
>seq1
GGGGGGCC
Example2
filterx fa test.fa -e "replace_one_(seq, 'A', 'G')"

# output
>seq1
GGGGAACC
Regular expressions are also supported.
Example3
filterx csv -H --oH test.csv -e "replace_(a, 'p{3,}', 'pp')"

# output
a
apple
apple
rev
Gets the reverse of a string.
Example
filterx fa test.fa -e "rev_(seq)"

# output
>seq1
GCTA
slice
Extract a substring from a string.
test.fa
>seq1
ACGTCTGATGCATCTAGTCTACAG
Example1
# get the first 5 characters
filterx fa test.fa -e "slice_(seq, 5)"

# output
>seq1
ACGTC
Example2
# get a subsequence starting at offset 1 with length 5
filterx fa test.fa -e "slice_(seq, 1, 5)" # offset starts from 0, so 1 is the second character

# output
>seq1
CGTCT
strip
Remove a prefix/suffix pattern from a string. Two further functions remove only the prefix or only the suffix: lstrip and rstrip.
Example1
filterx fa test.fa -e "strip_(seq, 'A')"

# output
>seq1
GTCG
Example2
filterx fa test.fa -e "lstrip_(seq, 'A')"

# output
>seq1
GTCGAA
Example3
filterx fa test.fa -e "rstrip_(seq, 'A')"

# output
>seq1
AAGTCG
trim
trim removes a given number of characters from the start and from the end of a string.
test.fa
>seq1
ACGTCTGATGCATCTAGTCTACAG
Example1
# trim 5bp from start and 3bp from end
filterx fa test.fa -e "trim_(seq, 5, 3)"

# output
>seq1
TGATGCATCTAGTCTA
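The example trims 5 bp from the start and 3 bp from the end, which is plain slicing. A short Python sketch of that arithmetic follows; the helper name trim and the choice to return an empty string when the two cuts overlap are assumptions of this sketch.

```python
def trim(seq: str, left: int, right: int) -> str:
    """Drop `left` characters from the start and `right` from the end."""
    end = len(seq) - right
    return seq[left:end] if end > left else ""

print(trim("ACGTCTGATGCATCTAGTCTACAG", 5, 3))  # TGATGCATCTAGTCTA
```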
upper
Converts a string to uppercase.
test.fasta
>seq1
acgt
>seq2
agctggg
Example1
filterx fa test.fasta -e "print('{upper(seq)}')"

# output
ACGT
AGCTGGG
Example2
filterx fa test.fasta -e "alias(upper_seq) = upper(seq)" \
  -e "print('{upper_seq} {seq}')"

# output
ACGT acgt
AGCTGGG agctggg
width
Reformats a string to a given width, i.e. width characters per line.
test.fa
>aaa comment1
ctatgctatctatcatc
>bbb comment2
aaabbbCCC
Example
filterx fa test.fa -e "width_(seq, 3)"

# Output:
>aaa comment1
cta
tgc
tat
cta
tca
tc
>bbb comment2
aaa
bbb
CCC
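Wrapping a sequence to a fixed width means cutting it into consecutive chunks of that many characters, with a shorter final chunk if the length is not a multiple of the width. A minimal Python sketch follows; wrap is a hypothetical helper used only to illustrate the chunking.

```python
def wrap(seq: str, width: int) -> list[str]:
    """Split seq into consecutive chunks of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [seq[i:i + width] for i in range(0, len(seq), width)]

print("\n".join(wrap("ctatgctatctatcatc", 3)))
```

Joining the chunks with newlines gives the six lines shown for the first record above.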