test.csv
1name,age
2Alice,20
3Bob,30
4Charlie,40
test-without-header.csv
1Alice,20
2Bob,30
3Charlie,40
col
get column by name or index, for index, it starts from 0.
1filterx csv test.csv -H -e 'col(name) == "Alice"'
2# equivalent to
3filterx csv test-without-header.csv -e 'col(0) == "Alice"'
4
5# output
6# Alice,20
col
can be used as column name, so you can use it in any need of column name. And it also can be used in col
function.
1filterx csv test.csv -H -e 'col(col(0)) == "Alice"'
WARNING
while using index, the csv file should not have header.
1filterx csv test.csv -H -e 'col(0) == "Alice"' # error
col
can be used to slelect multiple columns by regex.
1filterx csv test.csv -H -e 'col("^ind\d+$")'
means select all columns start with ind
and followed by numbers.
select
get columns by name, output will follow the order of selection.
1filterx csv test.csv -H -e 'select(age, name)'
2
3# output
4# 20,Alice
5# 30,Bob
6# 40,Charlie
alias
create a new column from a literal value or an expression or a column.
1filterx csv test.csv -H --oH -e 'alias(new_col) = 10'
2
3# output
4# name,age,new_col
5# Alice,20,10
6# Bob,30,10
7# Charlie,40,10
1filterx csv test.csv -H --oH -e 'alias(new_col) = age + 10'
2
3# output
4# name,age,new_col
5# Alice,20,30
6# Bob,30,40
7# Charlie,40,50
why not create a new column directly
alias
is not a function, it is a statement. I will know where the new column is created, and it is more readable.
del
delete columns by name. It is more like select
but if you don't want some columns, use delete
will be more memory efficient.
1filterx csv test.csv -H --oH -e 'delete(age)'
2# output
3# name
4# Alice
5# Bob
6# Charlie
rename
rename columns by name.
1filterx csv test.csv -H --oH -e 'rename(name, new_name)'
2# output
3# new_name,age
4# Alice,20
5# Bob,30
6# Charlie,40
sort
sort by column(s), it can sort by multiple columns. This function provide three ways to sort.
The default way (sort
) is from low to high, and you can use Sort
to sort from high to low, and sorT
to sort from low to high.
1filterx csv test.csv -H --oH -e 'sort(age)'
2# output
3# name,age
4# Alice,20
5# Bob,30
6# Charlie,40
7
8
9filterx csv test.csv -H --oH -e 'Sort(age)'
10# output
11# name,age
12# Charlie,40
13# Bob,30
14# Alice,20
sort support multiple columns.
multi-sort.csv
1a,b,c
21,2,3
31,1,3
41,1,2
1filterx csv multi-sort.csv -H --oH -e 'sort(a, b)'
2# output
3# a,b,c
4# 1,1,2
5# 1,1,3
6# 1,2,3
print headers of files. so that you can know the column names to manipulate them.
test.vcf
1##fileformat=VCFv4.2
2##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
3##reference=file://some/path/human_g1k_v37.fasta
4##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
5##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Dummy">
6##ALT=<ID=DEL,Description="Deletion">
7##contig=<ID=1,assembly=b37,length=249250621>
8##contig=<ID=2,assembly=b37,length=249250621>
9#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA001
101 5 . C a . PASS . GT 0/1
111 5 . C t . PASS . GT 0/1
121 7 . T a . PASS . GT .
131 10 . G a . PASS . GT 0/1
141 12 . GACA ga . PASS . GT 0/1
151 16 . T taaa . PASS . GT 1/1
161 19 . A c . PASS . GT 0/1
171 61 . C a . PASS . GT 0/1
182 61 . AGAG aa . PASS . GT 0/1
192 119 . AAA t . PASS . GT 0/1
202 179 . G gacgtacgt . PASS . GT 0/1
212 200 . A <DEL> . PASS END=210 GT 1/0
222 300 . A . . PASS END=310;MinDP=10 GT 0/1
232 320 . A <*> . PASS END=330;MinDP=20 GT 0/1
242 481 . T c,a . PASS . GT 0/2
1filterx vcf test.vcf -e 'header()'
2
3# output
40 chrom str
51 pos u32
62 id str
73 ref str
84 alt str
95 qual f32
106 filter str
117 info str
128 format str
139 na001 str
cast_xxx
cast column to specific type. The following types are supported.
1str
2string
3u8
4u16
5u32
6u64
7i8
8i16
9i32
10i64
11f32
12f64
13bool
14int # cast to i32
15float # cast to f32
so full function name is cast_str
, cast_string
, cast_u8
, cast_u16
, cast_u32
, cast_u64
, cast_i8
, cast_i16
, cast_i32
, cast_i64
, cast_f32
, cast_f64
, cast_bool
, cast_int
, cast_float
.
if you want to cast a column to i32, you can use cast_i32
, or cast inplace by cast_i32_
(the same for other types).
1filterx csv test.csv -H --oH -e 'cast_i32(age)'
2
3# output
4# name,age
5# Alice,20
6# Bob,30
7# Charlie,40
fill_null
fill missing values with a specific value. it is useful when you want to fill missing values with a specific value.
test.vcf
1#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind1 ind2 ind3 ind4
2chr1 100 . C A . . . GT 0 0 0 .
3chr1 200 . C A . . . GT 1 1 1 .
4chr1 300 . C A . . . GT 1 1 1 1
5chr1 400 . C A,T . . . GT 2 1 1 .
6chr1 500 . C A . . . GT 0 0 1 1
7chr1 600 . C A . . . GT 0 0 1 .
1filterx vcf test.vcf -e 'fill_null_(cast_int(col("^ind\d+$")), 0)'
2
3# output
4# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind1 ind2 ind3 ind4
5# chr1 100 . C A . . . GT 0 0 0 0
6# chr1 200 . C A . . . GT 1 1 1 0
7# chr1 300 . C A . . . GT 1 1 1 1
8# chr1 400 . C A,T . . . GT 2 1 1 0
9# chr1 500 . C A . . . GT 0 0 1 1
10# chr1 600 . C A . . . GT 0 0 1 0
This is a more complex example, let's talk about it.
-
col("^ind\d+$")
will get all columns start with ind
and followed by numbers.
-
cast_int
will cast all these columns to i32.
-
fill_null_
will fill missing values with 0 inplace.
drop_null
drop rows with missing values.
test.csv
1name,age
2Alice,20
3Bob,
4Charlie,40
1filterx csv test.csv -H --oH -e 'drop_null_(age)'
2
3# output
4# name,age
5# Alice,20
6# Charlie,40
is_null & is_not_null
check if a column is null or not.
is_null(col("age")) will filter out all rows with missing values.
dup
dup has 4 types:
-
dup: dup all rows but keep the first one.
-
dup_none: dup all rows and keep none of duplicated rows.
-
dup_any: dup all rows and keep random one of duplicated rows.
-
dup_last: dup all rows and keep last one of duplicated rows.
test.csv
1name,age
2Alice,15
3Bob,30
4Charlie,40
5Alice,20
1filterx csv test.csv -H --oH -e 'dup(name)'
2# output
3# name,age
4# Alice,15
5# Bob,30
6# Charlie,40
abs
get the absolute value of a column.
test.csv
1name,age
2Alice,-15
3Bob,30
1filterx csv test.csv -H --oH -e 'abs_(age)'
2
3# output
4# name,age
5# Alice,15
6# Bob,30