Skip to content
Stefan Schroedl edited this page Mar 22, 2015 · 5 revisions

Tabulator: unix shell command line tools for delimited data files with header

Description and Features

Project Tabulatoris a set of Unix command line tools for quick and convenient batch processing of tabular text files (a.k.a., tab-delimited, csv, or flat file format). We assume rows constitute records, split into a fixed number of columns; the first line signifies a header specifying the column names.

  • Command line arguments can refer to column names.
  • Delimiter characters are detected.
  • Compressed files are recognized and treated appropriately.
  • All scripts can be run either with file arguments or on stdin, to support pipe constructs.
  • For each script, the -h and -m switches provide more detailed help and list possible options.
  • Most (but not all) scripts translate the given command line options into a sequence of Unix shell commands (awk, cut, join and sort).

tblmap

Per-line computation (corresponding to the Map operation within the MapReduce scheme). For example, create new columns based on functions of other columns; remove or reorder columns; filter rows based on a column expression.

Examples:

  • Print columns rep and sales for lines where column region has value us:

tblmap −s’region=="us"’ −k’rep,sales’

  • Compute ratio of columns sales and clients, output as single column:

tblmap −k ’’ −c’sales_per_client=sales/client’

  • Cut out of columns whose names contains the string internal:

tblmap −irk’internal’

  • Shuffle a file randomly by generating a random number per line:

cat file | tblmap −c"myrand=rand()" | tblcmd "sort −k myrand −t’,’" | tblmap −irk"myrand"

tblred

Compute ("reduce") aggregations (sum, average, etc) over groups of rows identified by key column(s).

Example: computes for each line the proportion of column sales to the total for all lines with the same value of column region.

tblred -k'region' 'sales_ratio=sales/sum(sales)' sales.txt

tbldesc

Represent a snapshot of the type of data in a file. Summarize columns (e.g., proportion of character/numeric content, min/mean/median/max, missing values). A target column can be specified, in which case the correlation and predictive strength of all other columns are calculated (feature selection). The columns can be ordered by any of the output characteristics.

Example: Suppose file contains the following lines:

         name,sex,house_nr,height,shoe_size
         arthur,m,42,181,11.5
         berta,f,101,163,8.5
         chris,m,1333,175,10
         don,m,77,185,12.5
         elisa,f,204,166,7

Then tbldesc −k shoe_size file prints:

  COL NAME            | TYPE        NA  CHAR   INT     MEAN      MED      MIN      MAX      STD |   MAR R2ADJ   CORR  PVAL  VALUES 
=======================================================================================================================================
  5   shoe_size       | num        0.0   0.0  40.0      9.9       10        7     12.5     2.22 | 100.0 100.0  100.0   0.0  5 uniq:
                      |                                                                         |                           '8.5'x1=>8.5,
                      |                                                                         |                           '7'x1=>7,
                      |                                                                         |                           '12.5'x1=>12.
                      |                                                                         |                           5,
                      |                                                                         |                           '10'x1=>10,
                      |                                                                         |                           '11.5'x1=>11.
                      |                                                                         |                           5
_______________________________________________________________________________________________________________________________________
  4   height          | int        0.0   0.0 100.0      174      175      163      185     9.43 | 100.0  83.8   93.7   1.9  5 uniq:
                      |                                                                         |                           '163'x1=>8.5,
                      |                                                                         |                           '166'x1=>7,
                      |                                                                         |                           '181'x1=>11.5
                      |                                                                         |                           ,
                      |                                                                         |                           '185'x1=>12.5
                      |                                                                         |                           , '175'x1=>10 
_______________________________________________________________________________________________________________________________________
  2   sex             | char       0.0 100.0   0.0                                              | 100.0  71.0          4.6  2 uniq:
                      |                                                                         |                           'm'x3=>11.3,
                      |                                                                         |                           'f'x2=>7.75
_______________________________________________________________________________________________________________________________________
  1   name            | char       0.0 100.0   0.0                                              | 100.0   0.0        100.0  5 uniq:
                      |                                                                         |                           'don'x1=>12.5
                      |                                                                         |                           ,
                      |                                                                         |                           'elisa'x1=>7,
                      |                                                                         |                           'berta'x1=>8.
                      |                                                                         |                           5,
                      |                                                                         |                           'arthur'x1=>1
                      |                                                                         |                           1.5,
                      |                                                                         |                           'chris'x1=>10
_______________________________________________________________________________________________________________________________________
  3   house_nr        | int        0.0   0.0 100.0      351      101       42     1333      552 | 100.0 −32.7   −6.8  91.4  5 uniq:
                      |                                                                         |                           '42'x1=>11.5,
                      |                                                                         |                           '204'x1=>7,
                      |                                                                         |                           '77'x1=>12.5,
                      |                                                                         |                           '101'x1=>8.5,
                      |                                                                         |                           '1333'x1=>10
 _______________________________________________________________________________________________________________________________________

tbljoin

Join files (in the relational algebra sense), taking care of necessary column matching and pre-sorting before performing unix join. Options for inner, left, right, and outer joins are available.

Examples:

Suppose file1 is

         name,street,house
         zorro,desert road,5
         john,main st,2
         arthur,pan−galactic bypass,42
         arthur,main st,15

and file2 is

         name,street,phone
         john,main st,654−321
         arthur,main st,121−212
         john,round cir,123−456

Then tbljoin file1 file2 gives

         name,street,house,phone
         arthur,main st,15,121−212
         john,main st,2,654−321

tbljoin −r file1 file2 gives

         house,name,street,phone
         42,arthur,main st,121−212
         2,john,main st,654−321
         NA,john,round cir,123−456

and tbljoin −lrn’?’ file1 file2 gives

         name,street,house,phone
         arthur,main st,15,121−212
         arthur,pan−galactic bypass,42,?
         john,main st,2,654−321
         john,round cir,?,123−456
         zorro,desert road,5,?

tblsort

Interface for unix sort command.

Example:

Suppose file is

         string_col,int_col,float_col
         b,0,1.43634633463252
         ab,123346646,13.4E04
         d,123346646,13.4E04
         aa,2323232,-1.2E-3

Then tblsort -k'float_col/g,string_col/r' file is equivalent to a sort key specification sort -k3g -k1r; it generates:

         string_col,int_col,float_col
         aa,2323232,-1.2E-3
         b,0,1.43634633463252
         d,123346646,13.4E04
         ab,123346646,13.4E04

tblcolumn

Format data to make it better human-readable. Calls unix column command; additionally, numerical columns are reformatted and aligned.

Example:

Suppose file is

         string_col,int_col,float_col
         shrt,0,1.43634633463252
         looooong striiiiiiiiiing,123346646,13.4E04
         medium,2323232,-1.2E-3

Then tblcolumn file generates:

string_col                int_col___  float_col_
shrt                               0        1.43635        
looooong striiiiiiiiiing   123346646   134000.             
medium                       2323232       -0.0012   

tblless

Shortcut script: call tblcolumn, and page through its output.

tblcmd

Run any command (e.g., sort, uniq) on the body of file, keep the header. Column name and delimiter substitution is supported on command line arguments.

Example:

Sort a csv file with header by column user_name:

tblcmd ’sort −t"?delim" −k ?user_name’

tblsplit

Split a file into several subfiles based on the value of a key column.

Example: Suppose file is

        continent,country
        americas,us
        americas,mx
        europe,de
        europe,fr

Then tblsplit −rk’continent’ file generates two files:

file.select.continent=americas:

        country
        us
        mx

and file.select.continent=europe:

        country
        de
        fr

tbluniq

Identify or remove duplicate (or isomorphic) columns.

A column y is dependent on x, written x−>y, if there is a function f with f(x)=y for all x. x is isomorphic to y, if x−>y and y−>x.
The script tries to find dependencies and isomorphisms. The dependency graph can be printed out.

Example:

Suppose file is

        key1,key2,key3,key4,key5,key6
        1,1,1,1,1,1
        1,2,4,2,1,2
        3,2,4,3,1,3
        4,4,3,4,1,4
        5,5,5,5,1,5

Then tbluniq −v file gives

        duplicates:
        {key4,key6}
        key1,key2,key3,key4,key5
        1,1,1,1,1
        1,2,4,2,1
        3,2,4,3,1
        4,4,3,4,1
        5,5,5,5,1

and tbluniq −xv file gives

        isomorphic:
        {key2,key3}
        {key4,key6}
        dependencies:
        {key4,key6} −> key1
        {key4,key6} −> {key2,key3}
        key1 −> key5
        {key2,key3} −> key5
        key1,key2,key4,key5
        1,1,1,1
        1,2,2,1
        3,2,3,1
        4,4,4,1
        5,5,5,1

tblcat

Concatenate files without repeating header.

tblhist

Histogram of column values. Also supports "ascii plotting".

tbltranspose

Transpose rows and columns.

tbltex

Format csv data into LaTex format.