10-minute guide

Introduction

filterx is a command-line tool for filtering lines from files. It is inspired by grep, awk, and bioawk.

Features

  • 🚀 Filters lines by column-based expressions

  • 🎨 Supports multiple input formats, e.g. vcf/sam/fasta/fastq/gff/bed/csv/tsv

  • 🎉 Cross-platform support

  • 📦 Easy to install

  • 📚 Rich documentation

Installation

Install by Cargo

cargo
cargo install filterx
# or
cargo install --git git@github.com:dwpeng/filterx.git

Install from Pip

pip
pip install filterx

Download from GitHub Releases

GitHub Prebuilt Binary
https://github.com/dwpeng/filterx/releases

Quick Start

This section provides a brief overview of filterx's capabilities. We'll use this sample CSV file for many examples:

example.csv
name,age
Alice,20
Bob,30
Charlie,40

Basic Filtering

Filter rows where age is greater than 25. If your file has a header, use the -H flag.

filterx csv example.csv -H -e 'age > 25'

# Output
# Bob,30
# Charlie,40

For files without headers, use col(index) to refer to columns (0-indexed).

# Assuming example-no-header.csv is example.csv without the header
filterx csv example-no-header.csv -e 'col(1) > 25'

# Output
# Bob,30
# Charlie,40

Combine conditions with the and operator, or by providing multiple -e flags. To keep the header in the output, use --oH.

filterx csv example.csv -H --oH -e 'age > 25 and name == "Bob"'

# Output
# name,age
# Bob,30

Column Manipulation

filterx provides powerful functions to manipulate columns.

Select columns: Use select to choose which columns to output and in what order.

filterx csv example.csv -H --oH -e 'select(age, name)'

# Output
# age,name
# 20,Alice
# 30,Bob
# 40,Charlie

Create new columns: Use alias to add a new column.

filterx csv example.csv -H -e 'alias(age_plus_10) = age + 10'

# Output
# Alice,20,30
# Bob,30,40
# Charlie,40,50

Rename columns: Use rename to change a column's name.

filterx csv example.csv -H --oH -e 'rename(name, person_name)'

# Output
# person_name,age
# Alice,20
# Bob,30
# Charlie,40

Row Manipulation

Get the first N rows: Use head.

filterx csv example.csv -H -e 'head(1)'

# Output
# Alice,20

Get the last N rows: Use tail.

filterx csv example.csv -H -e 'tail(1)'

# Output
# Charlie,40

Drop rows with missing values: Use drop_null.

missing.csv
name,age
Alice,20
Bob,
,30

filterx csv missing.csv -H --oH -e 'drop_null(age)'

# Output
# name,age
# Alice,20
# ,30

String and Sequence Manipulation

Get string length: Use len.

filterx csv example.csv -H -e 'len(name) > 4'

# Output
# Alice,20
# Charlie,40

Extract with Regex: Use extract to get parts of a string.

names.csv
name
Alice-20
Bob-30

filterx csv names.csv -H --oH -e 'alias(age) = extract(name, "(\d+)")'

# Output
# name,age
# Alice-20,20
# Bob-30,30

Bioinformatics Example: Calculate GC content and sequence length.

bio_example.fasta
>seq1
ATGC
>seq2
GATTACA

filterx fasta bio_example.fasta -e 'print("{name}\tGC={gc(seq)}\tlen={len(seq)}")'

# Output
# seq1	GC=0.5	len=4
# seq2	GC=0.2857142857142857	len=7

In-place Editing

Functions ending with an underscore (_) modify the data directly (in-place). For example, to convert a sequence to its reverse complement:

filterx fasta bio_example.fasta -e 'revcomp_(seq)'

# Output
# >seq1
# GCAT
# >seq2
# TGTAATC

Advanced Output Formatting

Use print with f-string-like syntax to create custom output. You can even call other functions inside print.

filterx csv example.csv -H \
     -e 'age > 25' \
     -e 'print("{name} is {age} years old. Name length: {len(name)}")'

# Output
# Bob is 30 years old. Name length: 3
# Charlie is 40 years old. Name length: 7

Functions

Column

alias

When adding a new column to the table, the new column name must be wrapped in the alias() function. alias takes a column name as its argument and can only be used in a column-creation statement.

test.csv
a,b
1,2
3,4

Create a new column named c with the value a + 1:

example
filterx csv test.csv -H --oH -e "alias(c) = a + 1"

# output
a,b,c
1,2,2
3,4,4

cast

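Convert the values in a column to another type. cast must be invoked with a cast type and a column name; in the example below, the type is part of the function name (cast_str_) and the column name is the argument.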

test.csv
a,b,c
a,1,1.1
b,2,2.2
c,3,3.3

Example1
filterx csv -H --oH test.csv -e 'cast_str_(b);alias(d) = a + b'

# Output
a,b,c,d
a,1,1.100,a1
b,2,2.200,b2
c,3,3.300,c3

col

Select a column by name or index.

For a csv with header, you can directly use the column name to select the column.

data.csv
name,age
Alice,30
Bob,25

Example1
filterx csv data.csv -H --oH -e "select(col(name))"

# Output:
name
Alice
Bob

For a csv without header, you can use the column index to select the column. The index starts from 0.

data.csv
Alice,30
Bob,25

Example2
filterx csv data.csv --oH -e "select(col(0))"

# Output:
Alice
Bob

col can also select multiple columns by matching their names against a regular expression.

data.csv
a1,a2,a3,b
1,2,3,4
5,6,7,8

Example3
filterx csv data.csv -H --oH -e "select(col('^a\d+$'))"

# Output:
a1,a2,a3
1,2,3
5,6,7

drop_null

Drop rows with null values.

test.csv
name,age
Alice,20
Bob,
,30

filterx csv test.csv -H --oH -e 'drop_null(age)'

# output
name,age
Alice,20
,30

Multiple columns are allowed:

filterx csv test.csv -H --oH -e 'drop_null(age, name)'

# output
name,age
Alice,20

If no column is specified, drop_null applies to all columns: any row containing a null value is dropped.

filterx csv test.csv -H --oH -e 'drop_null()'

# output
name,age
Alice,20
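
The three drop_null calls above can be read as "drop a row if any of the listed columns is empty". The following Python sketch only models that reading; it is not filterx's implementation, and both the function name and the choice to treat an empty string as null are assumptions made for illustration.

```python
def drop_null(rows, columns=None):
    """Keep rows in which every checked column has a value.

    rows: list of dicts mapping column name -> string value.
    columns: names to check; None (or empty) means check all columns.
    An empty string or None is treated as null (an assumption).
    """
    kept = []
    for row in rows:
        checked = columns if columns else list(row)
        if all(row[name] not in ("", None) for name in checked):
            kept.append(row)
    return kept


rows = [
    {"name": "Alice", "age": "20"},
    {"name": "Bob", "age": ""},
    {"name": "", "age": "30"},
]

print(drop_null(rows, ["age"]))          # Alice and the unnamed row remain
print(drop_null(rows, ["age", "name"]))  # only Alice remains
print(drop_null(rows))                   # only Alice remains
```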

dup

Remove duplicate items from one or more columns. There are four functions:

  • dup - Remove duplicates, keeping the first occurrence.

  • dup_last - Remove duplicates, keeping the last occurrence.

  • dup_any - Remove duplicates, keeping any one occurrence.

  • dup_none - Remove every row whose value is duplicated, keeping no occurrence.

data.csv
name,country
Alice,USA
Bob,UK
Alice,Canada
Alice,USA

Example
filterx csv -H --oH data.csv -e 'dup(name)'

# Output
name,country
Alice,USA
Bob,UK

dup supports multiple columns.

Example
filterx csv -H --oH data.csv -e 'dup(name, country)'

# Output
name,country
Alice,USA
Bob,UK
Alice,Canada
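
To make the difference between the dup variants concrete, here is a small Python sketch of the keep-first, keep-last, and keep-none behaviours on the same data. It is an illustration only: the dedup function and its keep parameter are hypothetical names, and the row order it produces for the keep-last case is an assumption rather than documented filterx behaviour.

```python
from collections import Counter


def dedup(rows, columns, keep="first"):
    """Remove rows whose key (the values of `columns`) repeats.

    keep="first": keep the first row for each key (like dup).
    keep="last":  keep the last row for each key (like dup_last).
    keep="none":  drop every row whose key occurs more than once (like dup_none).
    """
    keys = [tuple(row[name] for name in columns) for row in rows]
    if keep == "none":
        counts = Counter(keys)
        return [row for row, key in zip(rows, keys) if counts[key] == 1]
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last', or 'none'")
    pairs = list(zip(rows, keys))
    if keep == "last":
        pairs.reverse()  # scan from the end so the last occurrence wins
    seen = set()
    kept = []
    for row, key in pairs:
        if key not in seen:
            seen.add(key)
            kept.append(row)
    if keep == "last":
        kept.reverse()  # restore the original relative order
    return kept


rows = [
    {"name": "Alice", "country": "USA"},
    {"name": "Bob", "country": "UK"},
    {"name": "Alice", "country": "Canada"},
    {"name": "Alice", "country": "USA"},
]

print(dedup(rows, ["name"]))             # Alice/USA, Bob/UK
print(dedup(rows, ["name", "country"]))  # Alice/USA, Bob/UK, Alice/Canada
print(dedup(rows, ["name"], keep="none"))  # Bob/UK
```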

fill

Fill in a value where the existing value is null or NA. There are two related functions, fill_null and fill_na:

  • fill_null: fill in a value where the existing value is null.
  • fill_na: fill in a value where the existing value is NA.
  • fill: same as fill_null.

test.csv
name,age
Alice,20
Bob,
Charlie,30

filterx csv test.csv -H --oH -e 'fill_null_(age, 0)'

# output
name,age
Alice,20
Bob,0
Charlie,30

header

Print the table headers and exit.

data.csv
a1,a2,a3,b
1,2,3,4
5,6,7,8

Example
filterx csv data.csv -H -e "header()"

# Output
index   name    type
0       a1      i64
1       a2      i64
2       a3      i64
3       b       i64

is_null

Filter rows by whether a column contains a NULL value.

  • is_null : keep rows with NULL values

  • is_not_null : keep rows without NULL values

test.csv
name,age
Alice,20
Bob,
Charlie,30

Example1
filterx csv test.csv -H --oH -e 'is_null(age)'

# output
name,age
Bob,

Example2
filterx csv test.csv -H --oH -e 'is_not_null(age)'

# output
name,age
Alice,20
Charlie,30

occ

Filter rows by occurrence.

occ filters rows by the number of times a value occurs in a column.

  • occ_lte: keep rows whose value occurs at most the given number of times
  • occ_gte: keep rows whose value occurs at least the given number of times
  • occ : same as occ_gte

data.csv
a,b,c
1,2,3
2,2,1
2,2,3
2,1,3

Example1
filterx csv -H --oH data.csv -e 'occ(a, 2)'

# Output
a,b,c
2,2,1
2,2,3
2,1,3

occ supports multiple columns.

Example2
filterx csv -H --oH data.csv -e 'occ_lte(a, b, 1)'

# Output
a,b,c
1,2,3
2,1,3
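
Occurrence filtering amounts to counting how often each value (or combination of values) appears and then comparing that count with a threshold. The Python sketch below models this for the single-column case; occ_filter and its at_least flag are hypothetical names, and this is not how filterx implements it.

```python
from collections import Counter


def occ_filter(rows, columns, n, at_least=True):
    """Keep rows whose key occurs >= n times (at_least=True) or <= n times."""
    keys = [tuple(row[name] for name in columns) for row in rows]
    counts = Counter(keys)
    if at_least:
        return [row for row, key in zip(rows, keys) if counts[key] >= n]
    return [row for row, key in zip(rows, keys) if counts[key] <= n]


rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "b": 2, "c": 1},
    {"a": 2, "b": 2, "c": 3},
    {"a": 2, "b": 1, "c": 3},
]

# a == 2 occurs three times, so those three rows pass a ">= 2" filter.
print(occ_filter(rows, ["a"], 2))
# a == 1 occurs once, so only that row passes a "<= 1" filter.
print(occ_filter(rows, ["a"], 1, at_least=False))
```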

print

Format and print the given value.

data.csv
name,age
alice,34
bob,23

Example1
filterx csv -H --oH data.csv -e "print('{name} is {age} years old')"

# Output
alice is 34 years old
bob is 23 years old

data.fastq
@seq1
ACGT
+
IIII
@seq2
TGCA
+
IIII

Example2
filterx fq data.fastq -e "print('>{name}
{seq}')"

# Output
>seq1
ACGT
>seq2
TGCA

print can also call functions in format strings.

Example3
filterx fq data.fastq -e "print('{name}	{gc(seq)}	{len(seq)}')"

# Output
seq1	0.5	4
seq2	0.5	4

rename

Rename a column in the table.

data.csv
name,age
alice,34
bob,23

Example
filterx csv -H --oH data.csv -e 'rename(name, first_name)'

# Output
first_name,age
alice,34
bob,23

rm

Delete one or more columns from a table.

data.csv
a1,a2,a3,b
1,2,3,4
5,6,7,8

Example
filterx csv data.csv -H --oH -e "rm(a1)"

# Output
a2,a3,b
2,3,4
6,7,8

select

Selects the columns of a table to output. The columns are specified by their names.

data.csv
name,age
alice,34
bob,23

Example1
filterx csv -H --oH data.csv -e 'select(name)'

# Output
name
alice
bob

Example2
filterx csv -H --oH data.csv -e 'select(age, name)'

# Output
age,name
34,alice
23,bob

sort

Sort the rows of a table by one or more columns. There are two other variants, Sort and sorT:

  • Sort: from high to low

  • sorT: from low to high

  • sort: same as sorT

data.csv
a,b,c
1,2,3
3,2,1
2,1,3

Example1
filterx csv -H --oH data.csv -e 'sort(a)'

# Output
a,b,c
1,2,3
2,1,3
3,2,1

Example2
filterx csv -H --oH data.csv -e 'Sort(a)'

# Output
a,b,c
3,2,1
2,1,3
1,2,3

sort supports multiple columns.

data.csv
a,b,c
1,2,3
3,2,1
3,1,2
2,1,3

Example3
filterx csv -H --oH data.csv -e 'sort(a, b)'

# Output
a,b,c
1,2,3
2,1,3
3,1,2
3,2,1
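
Sorting by several columns means ordering rows by the first column and breaking ties with the next one. The Python sketch below reproduces the ordering of Example3 with a tuple sort key. sort_rows is a hypothetical helper, and how Sort treats ties across multiple columns is an assumption here.

```python
def sort_rows(rows, column_indexes, descending=False):
    """Sort rows by the given columns, earlier columns taking priority."""
    return sorted(
        rows,
        key=lambda row: tuple(row[i] for i in column_indexes),
        reverse=descending,
    )


rows = [(1, 2, 3), (3, 2, 1), (3, 1, 2), (2, 1, 3)]

# Like sort(a, b): ascending by column a, then by column b.
print(sort_rows(rows, [0, 1]))
# Descending order simply reverses the comparison in this sketch.
print(sort_rows(rows, [0, 1], descending=True))
```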

Number

abs

Compute the absolute value of a number.

test.csv
a
-2
0
1

Example1
filterx csv -H --oH test.csv -e 'abs_(a)'

# Output
a
2
0
1

Example2
filterx csv -H --oH test.csv -e 'abs(a) > 1'

# Output
a
-2

Row

head

Get the first n rows of a file.

test.csv
a,b,c
1,2,3
4,5,6
7,8,9

Example1
filterx csv -H --oH test.csv -e "head(2)"

# output
a,b,c
1,2,3
4,5,6

Example2
filterx csv -H --oH test.csv -e "b > 2;head(1)"

# output
a,b,c
4,5,6

tail

Get the last n rows of a file.

test.csv
a,b,c
1,2,3
4,5,6
7,8,9

Example1
filterx csv -H --oH test.csv -e "tail(2)"

# output
a,b,c
4,5,6
7,8,9

Example2
filterx csv -H --oH test.csv -e "b > 2;tail(1)"

# output
a,b,c
7,8,9

Sequence

gc

Compute the GC content of a sequence.

test.fa
>seq1
ATGC
>seq2
ATGCC

example
filterx fasta test.fa -e "gc(seq) > 0.5"

# output
>seq2
ATGCC
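
GC content is the fraction of bases that are G or C, which is why only seq2 (3 of 5 bases) passes the filter above. The Python sketch below shows the arithmetic; it is an illustration, not filterx's code, and treating lowercase bases like uppercase ones is an assumption.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


for name, seq in [("seq1", "ATGC"), ("seq2", "ATGCC")]:
    # seq1 has 2 of 4 bases as G/C, seq2 has 3 of 5.
    print(name, gc_content(seq), gc_content(seq) > 0.5)
```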

hpc

Compute the homopolymer-compressed (HPC) form of a sequence: each run of contiguous identical characters is compressed into a single character.

test.fa
>seq1
ATGCCCCCT

example
filterx fa test.fa -e "hpc_(seq)"

# output
>seq1
ATGCT
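
Homopolymer compression collapses each run of repeated characters to one character. A minimal Python sketch of the idea, using itertools.groupby to find the runs (the function name hpc is reused here for readability only and is not filterx's implementation):

```python
from itertools import groupby


def hpc(seq: str) -> str:
    """Collapse every run of identical characters into one character."""
    # groupby yields one (character, run) pair per run of equal characters.
    return "".join(base for base, _run in groupby(seq))


print(hpc("ATGCCCCCT"))
```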

phred

Detect the Phred quality encoding of a FASTQ file.

There are two kinds of Phred encoding: phred33 and phred64.

test.fq
@seq1
ATGC
+
IIII
@seq2
ATGCC
+
IIIII

example
filterx fq test.fq -e "phred()"

# output
phred: phred64
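
For background, the two encodings differ only in the ASCII offset added to each quality score: 33 for phred33 and 64 for phred64. The Python sketch below decodes a quality string under either offset. It shows the standard encoding arithmetic only and says nothing about how filterx decides which encoding a file uses.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


# The same characters decode to different scores under each offset.
print(phred_scores("IIII", offset=33))
print(phred_scores("IIII", offset=64))
```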

qual

Compute the mean quality of a sequence.

test.fq
@seq1
ATGC
+
IIII
@seq2
ATGCC
+
IIIII

example
filterx fq test.fq -e "print('{name}: {qual(seq)}')"

# output
seq1: 4.245115
seq2: 3.9658952

revcomp

Compute the reverse complement of a sequence.

test.fa
>seq1
ATGC
>seq2
ATGCC

example
filterx fa test.fa -e "revcomp_(seq)"

# output
>seq1
GCAT
>seq2
GGCAT
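
The reverse complement swaps each base for its partner (A with T, C with G) and then reverses the string. A short Python sketch of that operation follows. It handles only the four standard bases in either case, which is a simplification, and it is not filterx's implementation.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


for seq in ["ATGC", "ATGCC"]:
    print(seq, "->", revcomp(seq))
```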

to_fasta

Convert a fastq file to fasta format.

test.fq
@seq1
ATGC
+
IIII
@seq2
ATGCC
+
IIIII

example
filterx fq test.fq -e "to_fasta()"

# output
>seq1
ATGC
>seq2
ATGCC

For a FASTQ file, you can also use the --no-qual option to convert it directly to FASTA format.

example
filterx fq test.fq --no-qual

# output
>seq1
ATGC
>seq2
ATGCC

to_fastq

Convert a FASTA file to a FASTQ file. Every quality score is set to ?.

test.fa
>aaa comment1
ctatgctatctatcatc
>bbb comment2
aaa
bbb
CCC

example
filterx fa test.fa -e "to_fastq()"

# output
@aaa
ctatgctatctatcatc
+
?????????????????
@bbb
aaabbbCCC
+
?????????

String

extract

Extracts a substring from a string by using a regular expression.

test.csv
name
Alice-20
Bob-30

filterx csv test.csv -H --oH -e 'alias(age) = extract(name, "(\d+)")'

# Output
name,age
Alice-20,20
Bob-30,30
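
In the example above, the pattern (\d+) captures the first run of digits in each name. The Python sketch below does the same with the standard re module. Returning the first capture group, and falling back to the whole match when the pattern has no group, are assumptions made for this illustration, and Python's regex syntax may differ in details from the engine filterx uses.

```python
import re
from typing import Optional


def extract(text: str, pattern: str) -> Optional[str]:
    """Return the first captured group of the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # Use group 1 when the pattern defines a group, else the whole match.
    return match.group(1) if match.groups() else match.group(0)


for name in ["Alice-20", "Bob-30"]:
    print(name, extract(name, r"(\d+)"))
```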

len

Compute the length of a string.

test.fasta
>seq1
ACGT
>seq2
AGCTGGG

Example1
filterx fa test.fasta -e "print('{name}	{len(seq)}')"

# output
seq1	4
seq2	7

lower

Converts a string to lowercase.

test.fasta
>seq1
ACGT
>seq2
AGCTGGG

Example1
filterx fa test.fasta -e "print('{lower(seq)}')"

# output
acgt
agctggg

Example2
filterx fa test.fasta -e "alias(lower_seq) = lower(seq)" \
                      -e "print('{lower_seq}	{seq}')"

# output
acgt	ACGT
agctggg	AGCTGGG

replace

Replace a substring with another string. There are two functions: replace and replace_one. replace replaces all occurrences of the substring, while replace_one replaces only the first occurrence.

test.fa
>seq1
AAGGAACC

Example1
filterx fa test.fa -e "replace_(seq, 'A', 'G')"

# output
>seq1
GGGGGGCC

Example2
filterx fa test.fa -e "replace_one_(seq, 'A', 'G')"

# output
>seq1
GAGGAACC

Regular expressions are also supported.

test.csv
a
apppppple
apple

Example3
filterx csv -H --oH test.csv -e "replace_(a, 'p{3,}', 'pp')"

# output
a
apple
apple
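
The difference between replacing every match and replacing only the first one can be sketched with Python's re.sub and its count argument. replace_all and replace_first are hypothetical names for this illustration, and Python's regex dialect is only assumed to agree with filterx's for these simple patterns.

```python
import re


def replace_all(text: str, pattern: str, repl: str) -> str:
    """Replace every match of pattern in text."""
    return re.sub(pattern, repl, text)


def replace_first(text: str, pattern: str, repl: str) -> str:
    """Replace only the first match of pattern in text."""
    return re.sub(pattern, repl, text, count=1)


print(replace_all("AAGGAACC", "A", "G"))        # every A becomes G
print(replace_all("apppppple", "p{3,}", "pp"))  # a run of 3+ p's becomes pp
print(replace_first("banana", "an", "AN"))      # only the first "an" changes
```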

rev

Gets the reverse of a string.

test.fa
>seq1
ATCG

Example
filterx fa test.fa -e "rev_(seq)"

# output
>seq1
GCTA

slice

Extract a substring from a string.

test.fa
>seq1
ACGTCTGATGCATCTAGTCTACAG

Example1
# get the first 5 characters
filterx fa test.fa -e "slice_(seq, 5)"

# output
>seq1
ACGTC

Example2
# get the subsequence starting at offset 1 with length 5
filterx fa test.fa -e "slice_(seq, 1, 5)" # offset starts from 0, so 1 is the second character

# output
>seq1
CGTCT

strip

Remove a prefix/suffix pattern from a string. Two other functions remove only the prefix or only the suffix: lstrip and rstrip.

test.fa
>seq1
AAGTCGAA

Example1
filterx fa test.fa -e "strip_(seq, 'A')"

# output
>seq1
GTCG

Example2
filterx fa test.fa -e "lstrip_(seq, 'A')"

# output
>seq1
GTCGAA

Example3
filterx fa test.fa -e "rstrip_(seq, 'A')"

# output
>seq1
AAGTCG

trim

trim removes a given number of characters from the start and from the end of a string.

test.fa
>seq1
ACGTCTGATGCATCTAGTCTACAG

Example1
# trim 5bp from start and 3bp from end
filterx fa test.fa -e "trim_(seq, 5, 3)"

# output
>seq1
TGATGCATCTAGTCTA
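
The example removes a fixed number of bases from each end of the sequence. A minimal Python sketch of that operation, written with explicit indexes so that trimming zero bases from the end also works. The function is an illustration, and returning an empty string when the trims overlap is an assumption.

```python
def trim(seq: str, left: int, right: int) -> str:
    """Remove `left` characters from the start and `right` from the end."""
    end = len(seq) - right
    # If the two trims meet or overlap, nothing is left.
    return seq[left:end] if end > left else ""


print(trim("ACGTCTGATGCATCTAGTCTACAG", 5, 3))
```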

upper

Converts a string to uppercase.

test.fasta
>seq1
acgt
>seq2
agctggg

Example1
filterx fa test.fasta -e "print('{upper(seq)}')"

# output
ACGT
AGCTGGG

Example2
filterx fa test.fasta -e "alias(upper_seq) = upper(seq)" \
                      -e "print('{upper_seq}	{seq}')"

# output
ACGT	acgt
AGCTGGG	agctggg

width

Reformats a string to a given width, so that each line holds at most width characters.

test.fa
>aaa comment1
ctatgctatctatcatc
>bbb comment2
aaabbbCCC

Example
filterx fa test.fa -e "width_(seq, 3)"

# Output:
>aaa comment1
cta
tgc
tat
cta
tca
tc
>bbb comment2
aaa
bbb
CCC
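
Wrapping a sequence to a fixed width is a matter of cutting it into consecutive chunks, with a shorter final chunk when the length is not a multiple of the width. A small Python sketch of that chunking follows. wrap is a hypothetical helper, not filterx's implementation.

```python
from typing import List


def wrap(seq: str, width: int) -> List[str]:
    """Split seq into consecutive lines of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [seq[i:i + width] for i in range(0, len(seq), width)]


for line in wrap("ctatgctatctatcatc", 3):
    print(line)
```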