Assembly Artifacts Analyses

Link to code and scripts

Tool	Version / Git commit	Link / Source
Modified-directed-bcalm (for directed unitig generation)	e4d67f	GitHub link
Custom python simulator (simu_rand.py)	-	GitHub link
All other code and scripts	-	GitHub link

Other softwares used

Tool	Purpose	Version / Git commit	Link / Source
bcalm2	Bidirected unitig generation	b8cde9	GitHub link
Jellyfish 2	Canonical and non-canonical k-mer counting	029416	GitHub link
ABySS	Assembler	2.0.2	Conda: `conda install -c bioconda abyss`
SPAdes	Assembler	3.15.3	GitHub link
MEGAHIT	Assembler	1.2.9	Binary link
GATB-Minia	Assembler	831ba4	GitHub link
Bowtie 2	Alignment	c04cfb	Github link
ART	Read simulator	2.6.0	Link
CAMISIM	Metagenomic simulation	001a33	GitHub link

Instructions to reproduce all results in paper

Reproducing most of the results in the paper is pretty straightforward. The readers should be able to reproduce the results following the descriptions provided in the paper with their own scripts. Still, here we provide the scripts we used along with instructions. The scripts and instructions are included to serve as guidelines only, and by no means are complete pipelines to reproduce the results. Some scripts might need minor changes, i.e. you may need to put correct path of certain files.

Reproducing results in Table 1

The presence of unsafe and mis-assembled unitigs in human chromosome 1 is evaluated using simulated error-free reads. For the scripts, check scripts/exp1 folder.

Dataset

Dataset	Download
T2T v1 chr 1	link

Method

From reference fasta file, remove all N and uppercase and make a file called "ref.sset" containing a single string of alphabet A, C, G, T only. cat chr1.fa | sed '/^>/d' > ref.sset

The file pipe_invalid_d_read.sh simulates read at a certain coverage C (simple random model), then outputs sequenced segments, reads, unitigs (mis-assembled, unsafe and all others).
pipe_invalid_d_read.sh [folder containing ref.sset] [k-mer-size] [coverage]

If the reference file ref.sset is in the folder t2t_chr1, then for coverage 20x, k=71, generate the unitigs using following command: pipe_invalid_d_read.sh t2t_chr1/ 71 20

This outputs the following 4 important files. The file "unitigs.sset" contains all unitigs, one unitig per line. The file "segfile.sset" contains sequenced segments, one segment per line. The file "invalids_true" contains mis-assembled unitigs, one unitig per line. The file "invalids" contains unsafe unitigs, one unitig per line.

To obtain Figure 3 cases,

First get a file named "gtouch" containing all unitigs and touching sequenced segments or reads.

Syntax: bash gtouch_maker.sh <k-mer-size> <reference-string> <output-directory> <unitigs:one-unitig-per-line> <reads/segments:one-read-per-line>

Example: bash gtouch_maker.sh 71 t2t_chr1/ref.sset out/ t2t_chr1/unitigs.sset t2t_chr1/segfile.sset

An example to show the contents of "gtouch" file (generated with k=3):

AAATTTA TTTTTTTTAAATTTA
AAATTTA GGGGAAATTTA
GGG GGGGAAAAAAA
CCAAA CCA

Reverse complements are not considered. All k-mers in unitigs and all segment/read are substrings of reference.

Contains 2 columns separated by " " (space).
Column 1: unitig (directed).
Column 2: a substring (i.e. segment/simulated read) of reference genome that contains at least one k-mer of the unitig (column 1).

Grouped by column 1: That means if you want all the substrings touching unitig AAAAAAA, then start with first occurrence of AAAAAAA in column 1, and then scan subsequent lines untill you get a different unitig (i.e. GGG in the example). That gives you list of all substrings touching at least a k-mer from the unitig.

Now, execute the command: python pretty_printer.py 71 gtouch invalids_true ref.sset

The output will indicate how many Figure 3 cases (labelled as "number of misjoins") are there.

Reproducing Table 3 Results

Compile kmer_operations.cpp : g++ kmer_operations.cpp -o kmer_operations Then, execute: kmer_operations <k> <cov> <invalid_spades> <invalid_bcalm> <con/uni> denotes k-mer size, denotes coverage, <invalid_spades> denotes path to misassembled unitigs/contigs generated by spades (or, other assembler), <invalid_bcalm> denotes path to misassembled unitigs predicted by our theorem. <con/uni> denotes label of output, use it to distinguish between contig and unitig. The output file includes number of k-mers shared between each pair of misassembled unitigs (or, contigs) generated by assembler and misassembled unitigs predicted by our theorem. Filter out the rows with numbers less than threshold using awk to generate the numbers in table 3 and 4.

Reproducing Table 5 Results

Use bcalm2 to generate bidirected unitigs, and modified-directed-bcalm to generated double directed unitigs. Then, generate the table following the definitions in the paper.

Reproducing Table 6 Results

Dataset Generation

We use chr4 from grch38.

Dataset	Download
grch38	link

To generate synthetic dataset from chr4: python artificial_palin.py ref.fa

This generates ~7 million bp long contig from chr4 where around 700K locations were picked randomly and palindromes were introduced.

Generate reads of length 100 with 1x coverage, and then build unitigs. For k=31,
art_sp_invalid.sh t2t_chr1 31 1 100

CAMI simulated dataset generation and reproducing Tables 2 and 4

Dataset: Download the 30 reference genomes by using the accessions listed in supplementary of the paper.

Follow the instructions to download and install CAMISIM.
CAMISIM requires a configuration file (the one we used is located in cami_config directory). Now, execute the simulation python metagenomesimulation.py spades_config.ini. This results in paired-end reads from 30 reference genomes listed in CAMI low complexity dataset in the supplementary of metaSPAdes paper.
Concatenate all paired-end reads and convert to a fasta file named read.fa.

cat *.fq > read.fq
fq_to_fa.sh read.fq > read.fa

You can get number of total reads using this command: echo "$(cat read.fa | wc -l) / 2" | bc

Now, generate double directed unitigs using the following command. For k=55, dd_unitigs.sh read.fa 55. This generates unitigs in a file named unid55.fa.
Concatenate all references to ref.fa, then build bowtie2 index from it bowtie2-build -p -f ref.fa refindex
Align unitig to reference. We use options to align exactly (no errors or mismatch tolerated). The ones that are unaligned are misassembled bowtie2 -f -x refindex -U unid55.fa -S unid55_ref.sam --un misassembled55.fa --score-min 'C,0,-1' -p 8
Align reads to references and save the alignment sam file (read_ref.sam) Extract only the relevant columns from sam file using: cat read_to_ref.sam | grep -v @ | cut -f3,4,10 > read_to_ref.awkresult; cp read_to_ref.awkresult read_to_ref.samcut; cat read_to_ref.samcut | awk '{print $0 " " 150}' > read_to_ref.awkresult. Here 150 is the read length. Now, generate segments from the reads for k=55 using: python segment.py read_to_ref.awkresult ref.fa 55 segs.sset. This outputs a file called segs.sset which contains one segment per line.
Convert the file containing one segment per line to a fasta file. Then build a bowtie2 index from it. sset_to_fa.sh segs.sset > segs.fa bowtie2-build -p -f segs.fa segindex
Align unitig to segments. We use options to align exactly (no errors or mismatch tolerated). The ones that are unaligned are unsafe bowtie2 -f -x segindex -U unid55.fa -S unid55_seg.sam --un unsafe55.fa --score-min 'C,0,-1' -p 8

Now, use instructions to reproduce Table 1 and 3, and apply it to cami dataset to generate data for Table 2 and 4 respectively.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Assembly Artifacts Analyses

Link to code and scripts

Other softwares used

Instructions to reproduce all results in paper

Reproducing results in Table 1

Dataset

Method

Reproducing Table 3 Results

Reproducing Table 5 Results

Reproducing Table 6 Results

Dataset Generation

CAMI simulated dataset generation and reproducing Tables 2 and 4

About

Releases

Packages

Languages

medvedevgroup/assembly-artifacts-paper-experiments

Folders and files

Latest commit

History

Repository files navigation

Assembly Artifacts Analyses

Link to code and scripts

Other softwares used

Instructions to reproduce all results in paper

Reproducing results in Table 1

Dataset

Method

Reproducing Table 3 Results

Reproducing Table 5 Results

Reproducing Table 6 Results

Dataset Generation

CAMI simulated dataset generation and reproducing Tables 2 and 4

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages