Rapid Elimination of Useless SEquences
REUSE is a k-mer based tool for filtration of reads in sequencing datasets that match a reference sequence.
reuse build
takes a reference fasta file as input, and ouputs a hashed index file. reuse filter
takes FASTA/FASTQ file inputs, along with the hashed index file, and outputs k-mer filtered reads in a user-specified format. Common applications of REUSE include filtration of host, contamination, PhiX or ribosomal sequences.
REUSE will run on most unix-based systems including Linux and Mac OS. Prerequisites include:
- zlib1g-dev
- libpthread-stubs0-dev
- libbz2-1.0
- libseqan2-dev
- c++ (≥14)
- cmake (≥3.5)
Download the pre-compiled binary from https://github.com/chorltsd/REUSE/releases/latest, extract and then run the reuse
binary:
wget https://github.com/chorltsd/REUSE/releases/latest/reuse_linux-x64.tar.gz
tar xzvf reuse_linux-x64.tar.gz
cd reuse
./reuse -h
Alternatively, this repository can be cloned and compiled using cmake:
git clone https://github.com/chorltsd/REUSE.git
cd REUSE
cmake .
reuse build [options] -o <output_file>
Example:
reuse build hg38.fa hg38
reuse filter -x hg38 -U input.fq -o filtered.fq
-i <input_file> = reference in. A comma-separated list of FASTA files containing the reference sequences to index (default: read from STDIN)
-o <output_file>= File to save index k-mer dataset to disk
-p/--threads = Number of threads to use (default: 1)
-r = Maximum RAM usage in MB (default: 400)
-k = k-mer length (default: 21)
-m = Mask k-mers found in this fasta file from the reference database. This option is used to minimize false positive filtering of related species or species of interest.
-g = Compress index when saving to disk. May take longer to generate the index and load when searching.
-h/--help = Print usage information and quit
-v/--version = Print version information and quit
-Eliminate all reads or read pairs when 1 or more reference k-mers are found within the read. Optionally, retain only those reads with matching k-mers.
reuse filter [options] -x <index> -1 <m1> -2 <m2>
-x <index>
The index file for the reference dataset, generated with reuse build
-1 <m1> Comma-separated list of files containing either unpaired reads or mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. If -2 <m2> is specified, it is assumed that -1 <m1> represents mate 1s. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in . Reads may be a mix of different lengths. If - is specified, REUSE will read the mate 1s from the “standard in” or “stdin” filehandle. Reads may be in FASTQ or FASTA format, and may be gzipped or bzip2ed. REUSE detects these formats by default.
-2 <m2> Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in . Reads may be a mix of different lengths. If - is specified, resuse will read the mate 2s from the “standard in” or “stdin” filehandle. Reads may be in FASTQ or FASTA format, and may be gzipped or bzip2ed. REUSE detects these formats by default.
-o <output> = Save reads not matching the k-mer filter to .fast(q/a) for single-end reads, or _1.fast(q/a) and _2.fast(q/a) for paired-end reads.
-f <filtered> = Save reads matching the k-mer filter to <filtered>.fast(q/a) for single-end reads, or <filtered>_1.fast(q/a) and <filtered>_2.fast(q/a) for paired-end reads. By default, reads are discarded.
-g = Compress outputted reads with gzip
-z = Compress outputted reads with alternate command, such as "bzip2"
-r = Maximum RAM usage in MB (default: 400)
-p/--threads = Number of threads to use (default: available number of threads)
-l = Log file
-k = Minimum number of k-mers per read to filter it (default: 1)
-s = Split pairs
By default, reads are output to STDOUT in the same format as they are input (eg. FASTQ input=FASTQ output). Paired end reads are interleaved before output to STDOUT. Please see the option -o
for further details.
REUSE will run fastest with filtration after the first k-mer is found (-mk 1), maximum thread and RAM usage, and a lower k-mer size. Lower k-mer sizes reduce the index size but are less specific at differentiating species.
This project is licensed under the GNU General Public License - see the LICENSE.md file for details