ESS-color

ESS-Color is a bioinformatics tool for constructing a compressed representation of sets of k-mer sets (i.e., a compressed colored dBG).

Requirements

  • Linux operating system (64 bit)
  • GCC >= 4.8 or a C++11 capable compiler
  • Snakemake
  • Git
  • CMake 3.12+
  • Rust (for ggcat)
  • KMC

Quick start

First, install all the prerequisites and make sure their executables are in your PATH. Then, build the additional executables from source:

git clone https://github.com/medvedevgroup/ESSColor.git
cd ESSColor
bash compile.sh

Move or copy ALL the executables in ESSColor/bin to a directory that is already in your PATH. For instance, if /usr/bin is already in your PATH, run the command mv ESSColor/bin/* /usr/bin to move all ESS-Color executables there. Alternatively, add the location of ESSColor/bin to your PATH.
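Before running the pipeline, it can help to confirm that every required executable is reachable through PATH. The following is a minimal Python sketch of such a check. The executable names in the list (in particular `genmatrix`, `ggcat`, `kmc`, and `snakemake`) are assumptions about how the tools are named on your system, and `missing_tools` is a hypothetical helper that is not part of ESS-Color.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = [
    "essColorCompress",
    "essColorDecompress",
    "genmatrix",
    "ggcat",
    "kmc",
    "snakemake",
]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found in PATH:", ", ".join(missing))
    else:
        print("All required executables were found in PATH.")
```

If anything is reported as missing, either copy it into a directory on PATH or add its directory to PATH as described above.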

Rust and ggcat Installation

ESS-Color uses a modified implementation of ESS-Compress. We replace the unitig construction step in ESS-Compress with GGCAT because of its optimized implementation. To install ggcat, first install Rust.

To install rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup toolchain install nightly

To install ggcat:

git clone https://github.com/algbio/ggcat --recursive
cd ggcat/
cargo install --path crates/cmdline/ --locked

If the current ggcat version does not work with ESS-Color, please use the following commit (tested during release of manuscript):

git clone https://github.com/algbio/ggcat
cd ggcat/
git checkout dd64634a27467b9e56c8f7aad619eae7f4e7917a
git submodule init
git submodule update --recursive
cargo install --path crates/cmdline/ --locked

The binary is automatically copied to $HOME/.cargo/bin.

Usage details

ESSColorCompress: compression of a set of k-mer sets

Syntax: ./essColorCompress [parameters] 

mandatory arguments:
-k [int]          k-mer size (must be >=4)
-i [input-file]   Path to input file. Input file is a single text file containing the list of multiple fasta/fastq files (one file per line)
-o [output-dir]   Path to output directory. [warning: this directory is also used as temp directory, so make sure it does not contain input files]

optional arguments:
-a [int]          Default=1. Sets a threshold X such that k-mers appearing fewer than X times in the input dataset are filtered out.
-j [int]          Default=1. Number of threads.
-p [output-prefix]   Default="esscolor". Prefix of the output compressed cdBG.

Upon successful completion, the output directory will contain the compressed colored dBG file <output-prefix>.tar.gz.

ESSColorDecompress: decompression of ESSColor representation

Syntax: ./essColorDecompress [parameters] 

mandatory arguments:
-i [input-file]    compressed cdBG generated by `essColorCompress`.   

optional arguments:
-h                   Print this Help

Description of decompressed output

The decompressed folder contains 3 files.

  • simplitigs.fa
    • A FASTA file with the set of simplitigs corresponding to the union ESS.
  • meta.txt
    • A text file with a header indicating the k-mer size, followed by C rows, each giving the name of one sample.
  • matrix.txt
    • The ordered color matrix in plain text.

Quick start with a step-by-step example

Compression

Preparing the input

$ cd example/

$ mkdir -p output_test

Let's say your 4 gzipped FASTA files are stored in folder example/mini_k18c4m7, named

sample0.fa.gz
sample1.fa.gz
sample2.fa.gz
sample3.fa.gz

If you wish to run essColorCompress on all 4 ".fa.gz" files, first make a list named list_mini_k18c4m7 containing the absolute paths of all 4 files, one per line.

$ ls $PWD/mini_k18c4m7/*.fa.gz > list_mini_k18c4m7
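The same list can be produced from a script, which makes it easy to fail early when no input files are found. This is a minimal Python sketch; `write_input_list` is a hypothetical helper, and the `*.fa.gz` pattern and the sorting by file name are assumptions chosen to match the example files above.

```python
from pathlib import Path


def write_input_list(sample_dir, list_path, pattern="*.fa.gz"):
    """Write the absolute path of every matching file, one per line, sorted by name."""
    files = sorted(Path(sample_dir).resolve().glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {sample_dir}")
    Path(list_path).write_text("".join(f"{path}\n" for path in files))
    return files


if __name__ == "__main__":
    try:
        paths = write_input_list("mini_k18c4m7", "list_mini_k18c4m7")
        print(f"wrote {len(paths)} paths")
    except FileNotFoundError as err:
        print(err)
```

Run it from the example/ directory so that the relative folder name resolves correctly.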

Performing the compression given the input list

Now, to compress this list using 8 threads and k-mer size 18, writing the output to directory output_test/, run the following command:

$ essColorCompress -i list_mini_k18c4m7 -k 18 -o output_test/ -j 8

Upon successful completion, the output directory will contain a file called esscolor.tar.gz which is the compressed colored dBG.

Decompression

Generating text output

Run $ essColorDecompress -i esscolor.tar.gz

Output matrix.txt contains the non-labeled color matrix (in ESS order). Output simplitigs.fa contains the simplitigs in FASTA format.

Let's look at the first simplitig:

$ head -n 2 simplitigs.fa

The first simplitig looks like this:

>
AAAAACAAAAAAAAAAAATTT

Let's look at the first 4 rows of the non-labeled color matrix:

$ head -n 4 matrix.txt

The first 4 rows of the non-labeled color matrix look like this:

1100
0001
0001
0001

The color vectors in text should be read in MSB order, i.e., the leftmost character corresponds to the first sample. So, color vector 1100 indicates that the first k-mer AAAAACAAAAAAAAAAAA is present in sample0.fa.gz and sample1.fa.gz and absent from the other two. The 2nd to 4th k-mers (AAAACAAAAAAAAAAAAT, AAACAAAAAAAAAAAATT, AACAAAAAAAAAAAATTT) are present only in sample3.fa.gz.
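The pairing of k-mers and color vectors in this example can be reproduced with a few lines of Python. This is only a sketch: `kmers_of` and `samples_in_row` are hypothetical helpers, and the code assumes (as the example suggests) that the rows of matrix.txt follow the k-mers in the order in which they occur along the simplitigs.

```python
def kmers_of(simplitig, k):
    """Return the k-mers of a simplitig in left-to-right order."""
    return [simplitig[i:i + k] for i in range(len(simplitig) - k + 1)]


def samples_in_row(row, sample_names):
    """Map a text color vector to sample names; the leftmost character is the first sample."""
    return [name for bit, name in zip(row, sample_names) if bit == "1"]


if __name__ == "__main__":
    names = ["sample0.fa.gz", "sample1.fa.gz", "sample2.fa.gz", "sample3.fa.gz"]
    simplitig = "AAAAACAAAAAAAAAAAATTT"  # first simplitig from the example
    rows = ["1100", "0001", "0001", "0001"]  # first 4 rows of matrix.txt
    for kmer, row in zip(kmers_of(simplitig, 18), rows):
        print(kmer, samples_in_row(row, names))
```

In a real run, the sample names would be taken from meta.txt rather than written out by hand.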

Other usage

Color matrix generation

If you only want to obtain the color matrix from a list of KMC databases, you can use the "genmatrix" module. (WARNING: In the current version of genmatrix, the k-mer size must be at most 32. To support k>32, we use an alternative pipeline based on joinCounts.)

genmatrix [OPTION...]
-c, --count-list arg  [Mandatory] Path to a file listing the KMC databases, one database per line
-d, --debug-verif     Debug flag to verify that the output corresponds to the input (time-consuming)
-o, --outmatrix arg   [Mandatory] Path to the output color matrix
-l, --spss arg        [Mandatory] Path to the corresponding union SPSS
-s, --strout          String output

Command example for a matrix of 100 E. coli samples:

$ genmatrix -c db_list.txt -o matrix.bin -l kmers.bin

To generate the matrix in plain text:
$ genmatrix -c db_list.txt -l simplitigs.fa -o matrix.txt -s

To generate the matrix in binary:
$ genmatrix -c db_list.txt -l simplitigs.fa -o matrix.bin

The file db_list.txt must contain the paths to the KMC databases, one path per line. Each path can be absolute or relative to the execution directory. The file simplitigs.fa must contain the same k-mers, de-duplicated, in FASTA format.

The file matrix.bin contains the color matrix. The matrix has one row per k-mer and C columns (one per sample). The columns are in the same order as the databases in the db_list.txt file. In string format, rows are separated by '\n' characters. Each row consists of C characters (100 in this example), each 0 or 1 depending on the absence or presence of the row's k-mer in the column's sample.

In binary format, each row occupies a multiple of 64 bits that is large enough to hold all C columns. For our 100 samples, a row is composed of 128 bits (16 bytes). The xth bit of the yth byte corresponds to the sample $y*8+x$. There is no separator between successive rows.
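To make the binary layout concrete, here is a minimal Python sketch that decodes rows of a binary matrix. It rests on two assumptions that the description leaves open: the row width is the smallest multiple of 64 bits that fits all samples, and bit x of a byte is counted from the least significant bit. The function names are hypothetical and not part of genmatrix.

```python
def row_size_bytes(num_samples):
    """Bytes per row, assuming the smallest multiple of 64 bits that fits all samples."""
    return ((num_samples + 63) // 64) * 8


def decode_row(row_bytes, num_samples):
    """Return the indices of the samples whose bit is set in one binary row.

    Assumption: bit x of byte y is counted from the least significant bit,
    and it corresponds to sample y*8+x.
    """
    present = []
    for sample in range(num_samples):
        y, x = divmod(sample, 8)
        if (row_bytes[y] >> x) & 1:
            present.append(sample)
    return present


def read_rows(path, num_samples):
    """Yield the decoded sample indices for each row of a binary matrix file."""
    size = row_size_bytes(num_samples)
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(size)
            if len(chunk) < size:
                break  # end of file (or a truncated trailing row)
            yield decode_row(chunk, num_samples)
```

For example, `for samples in read_rows("matrix.bin", 100): ...` would iterate over the rows of the 100-sample matrix. If the decoded columns look reversed within each byte, the bit-order assumption is the first thing to revisit.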

The file kmers.bin contains the list of k-mers corresponding to the rows of the matrix. In string format, there is one k-mer per line.

In binary format, all the values in the file are 64 bits wide. Each 64-bit value is stored as 8 bytes in little-endian order. The first value is k, the second is the number n of k-mers, and the following n values are the k-mers themselves.
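The header and the k-mer values of such a file can be read with Python's `struct` module, as in the sketch below. `read_kmer_file` is a hypothetical helper, and treating the values as unsigned integers is an assumption. The k-mers are returned as raw integers because the nucleotide encoding is not described here.

```python
import io
import struct


def read_kmer_file(stream):
    """Read k, n, and the n encoded k-mers from a stream of little-endian 64-bit values."""
    header = stream.read(16)
    if len(header) < 16:
        raise ValueError("file too short to contain k and n")
    k, n = struct.unpack("<2Q", header)  # assumed unsigned 64-bit values
    body = stream.read(8 * n)
    if len(body) < 8 * n:
        raise ValueError("file ends before all k-mers were read")
    kmers = list(struct.unpack(f"<{n}Q", body))
    return k, n, kmers


if __name__ == "__main__":
    # Synthetic data for illustration only: k=18, n=2, two made-up k-mer values.
    demo = io.BytesIO(struct.pack("<4Q", 18, 2, 5, 7))
    print(read_kmer_file(demo))
```

To read a real file, open it in binary mode and pass the file object, e.g. `with open("kmers.bin", "rb") as f: read_kmer_file(f)`.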

How to cite

If using ESS-Color in your research, please cite
