test.csv
1name,age
2Alice,20
3Bob,30
4Charlie,40
test-without-header.csv
1Alice,20
2Bob,30
3Charlie,40

col

get column by name or index, for index, it starts from 0.

1filterx csv test.csv -H -e 'col(name) == "Alice"'
2# equivalent to
3filterx csv test-without-header.csv -e 'col(0) == "Alice"'
4
5# output
6# Alice,20

col can be used as column name, so you can use it in any need of column name. And it also can be used in col function.

1filterx csv test.csv -H -e 'col(col(0)) == "Alice"'
WARNING

while using index, the csv file should not have header.

1filterx csv test.csv -H -e 'col(0) == "Alice"' # error

col can be used to slelect multiple columns by regex.

1filterx csv test.csv -H -e 'col("^ind\d+$")'

means select all columns start with ind and followed by numbers.

select

get columns by name, output will follow the order of selection.

1filterx csv test.csv -H -e 'select(age, name)'
2
3# output
4# 20,Alice
5# 30,Bob
6# 40,Charlie

alias

create a new column from a literal value or an expression or a column.

1filterx csv test.csv -H --oH -e 'alias(new_col) = 10'
2
3# output
4# name,age,new_col
5# Alice,20,10
6# Bob,30,10
7# Charlie,40,10
1filterx csv test.csv -H --oH -e 'alias(new_col) = age + 10'
2
3# output
4# name,age,new_col
5# Alice,20,30
6# Bob,30,40
7# Charlie,40,50
why not create a new column directly

alias is not a function, it is a statement. I will know where the new column is created, and it is more readable.

del

delete columns by name. It is more like select but if you don't want some columns, use delete will be more memory efficient.

1filterx csv test.csv -H --oH -e 'delete(age)'
2# output
3# name
4# Alice
5# Bob
6# Charlie

rename

rename columns by name.

1filterx csv test.csv -H --oH -e 'rename(name, new_name)'
2# output
3# new_name,age
4# Alice,20
5# Bob,30
6# Charlie,40

sort

sort by column(s), it can sort by multiple columns. This function provide three ways to sort.

The default way (sort) is from low to high, and you can use Sort to sort from high to low, and sorT to sort from low to high.

1filterx csv test.csv -H --oH -e 'sort(age)'
2# output
3# name,age
4# Alice,20
5# Bob,30
6# Charlie,40
7
8
9filterx csv test.csv -H --oH -e 'Sort(age)'
10# output
11# name,age
12# Charlie,40
13# Bob,30
14# Alice,20

sort support multiple columns.

multi-sort.csv
1a,b,c
21,2,3
31,1,3
41,1,2
1filterx csv multi-sort.csv -H --oH -e 'sort(a, b)'
2# output
3# a,b,c
4# 1,1,2
5# 1,1,3
6# 1,2,3

print headers of files. so that you can know the column names to manipulate them.

test.vcf
1##fileformat=VCFv4.2
2##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
3##reference=file://some/path/human_g1k_v37.fasta
4##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
5##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Dummy">
6##ALT=<ID=DEL,Description="Deletion">
7##contig=<ID=1,assembly=b37,length=249250621>
8##contig=<ID=2,assembly=b37,length=249250621>
9#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA001
101	5	.	C	a	.	PASS	.	GT	0/1
111	5	.	C	t	.	PASS	.	GT	0/1
121	7	.	T	a	.	PASS	.	GT	.
131	10	.	G	a	.	PASS	.	GT	0/1
141	12	.	GACA	ga	.	PASS	.	GT	0/1
151	16	.	T	taaa	.	PASS	.	GT	1/1
161	19	.	A	c	.	PASS	.	GT	0/1
171	61	.	C	a	.	PASS	.	GT	0/1
182	61	.	AGAG	aa	.	PASS	.	GT	0/1
192	119	.	AAA	t	.	PASS	.	GT	0/1
202	179	.	G	gacgtacgt	.	PASS	.	GT	0/1
212	200	.	A	<DEL>	.	PASS	END=210	GT	1/0
222	300	.	A	.	.	PASS	END=310;MinDP=10	GT	0/1
232	320	.	A	<*>	.	PASS	END=330;MinDP=20	GT	0/1
242	481	.	T	c,a	.	PASS	.	GT	0/2
1filterx vcf test.vcf -e 'header()'
2
3# output
40   chrom   str
51   pos     u32
62   id      str
73   ref     str
84   alt     str
95   qual    f32
106   filter  str
117   info    str
128   format  str
139   na001   str

cast_xxx

cast column to specific type. The following types are supported.

1str
2string
3u8
4u16
5u32
6u64
7i8
8i16
9i32
10i64
11f32
12f64
13bool
14int # cast to i32
15float # cast to f32

so full function name is cast_str, cast_string, cast_u8, cast_u16, cast_u32, cast_u64, cast_i8, cast_i16, cast_i32, cast_i64, cast_f32, cast_f64, cast_bool, cast_int, cast_float.

if you want to cast a column to i32, you can use cast_i32, or cast inplace by cast_i32_ (the same for other types).

1filterx csv test.csv -H --oH -e 'cast_i32(age)'
2
3# output
4# name,age
5# Alice,20
6# Bob,30
7# Charlie,40

fill_null

fill missing values with a specific value. it is useful when you want to fill missing values with a specific value.

test.vcf
1#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ind1	ind2	ind3	ind4
2chr1	100	.	C	A	.	.	.	GT	0	0	0	.
3chr1	200	.	C	A	.	.	.	GT	1	1	1	.
4chr1	300	.	C	A	.	.	.	GT	1	1	1	1
5chr1	400	.	C	A,T	.	.	.	GT	2	1	1	.
6chr1	500	.	C	A	.	.	.	GT	0	0	1	1
7chr1	600	.	C	A	.	.	.	GT	0	0	1	.
1filterx vcf test.vcf -e 'fill_null_(cast_int(col("^ind\d+$")), 0)'
2
3# output
4# CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ind1	ind2	ind3	ind4
5# chr1	100	.	C	A	.	.	.	GT	0	0	0	0
6# chr1	200	.	C	A	.	.	.	GT	1	1	1	0
7# chr1	300	.	C	A	.	.	.	GT	1	1	1	1
8# chr1	400	.	C	A,T	.	.	.	GT	2	1	1	0
9# chr1	500	.	C	A	.	.	.	GT	0	0	1	1
10# chr1	600	.	C	A	.	.	.	GT	0	0	1	0

This is a more complex example, let's talk about it.

  • col("^ind\d+$") will get all columns start with ind and followed by numbers.

  • cast_int will cast all these columns to i32.

  • fill_null_ will fill missing values with 0 inplace.

drop_null

drop rows with missing values.

test.csv
1name,age
2Alice,20
3Bob,
4Charlie,40
1filterx csv test.csv -H --oH -e 'drop_null_(age)'
2
3# output
4# name,age
5# Alice,20
6# Charlie,40

is_null & is_not_null

check if a column is null or not.

is_null(col("age")) will filter out all rows with missing values.

dup

dup has 4 types:

  • dup: dup all rows but keep the first one.

  • dup_none: dup all rows and keep none of duplicated rows.

  • dup_any: dup all rows and keep random one of duplicated rows.

  • dup_last: dup all rows and keep last one of duplicated rows.

test.csv
1name,age
2Alice,15
3Bob,30
4Charlie,40
5Alice,20
1filterx csv test.csv -H --oH -e 'dup(name)'
2# output
3# name,age
4# Alice,15
5# Bob,30
6# Charlie,40

abs

get the absolute value of a column.

test.csv
1name,age
2Alice,-15
3Bob,30
1filterx csv test.csv -H --oH -e 'abs_(age)'
2
3# output
4# name,age
5# Alice,15
6# Bob,30