Releases · pairwise-alignment/pa-bench

This release hosts the datasets used for benchmarking. Datasets are provided in the .seq format, in which each pair of sequences to be aligned is written as a `>` line followed by a `<` line:

>CTGGGGTTACAGGCATGCACCAGCACGCC...
<CTGGGGTTACAGGCATGCACCAGCACGCC...
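The format above can be read with a few lines of Python. This is a minimal sketch, assuming each sequence sits on a single line, that `>` marks the first sequence of a pair and `<` the second, and that blank lines can be ignored; the function name `read_seq_pairs` is hypothetical and not part of pa-bench.

```python
from typing import Iterable, Iterator, Tuple


def read_seq_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (a, b) sequence pairs from lines in the .seq format.

    Assumption: every '>' line is immediately followed by its '<' partner.
    """
    first = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if first is not None:
                raise ValueError("two '>' lines in a row")
            first = line[1:]
        elif line.startswith("<"):
            if first is None:
                raise ValueError("'<' line without a preceding '>' line")
            yield first, line[1:]
            first = None
        else:
            raise ValueError(f"unexpected line prefix: {line[:1]!r}")
    if first is not None:
        raise ValueError("trailing '>' line without a matching '<' line")


# In-memory example; an open file object can be passed the same way.
example = [">ACGT\n", "<ACGA\n"]
print(list(read_seq_pairs(example)))  # [('ACGT', 'ACGA')]
```

Because the function is a generator, the very long sequences in the larger datasets are only held in memory one pair at a time.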

`ont-500k.zip`: ONT reads of length >500kbp @ 6.1% divergence

Contains 50 .seq files (seq01.seq .. seq50.seq) each containing a single alignment. This dataset contains only read errors.

This dataset was created by downloading some reads (this download, 300GB total) used for v1.1 of CHM13, and aligning them back to the reference.
See Snakefile for details.

`ont-500k-genvar.zip`: ONT reads of length >500kbp @ 7.2% divergence, including genetic variation

Contains 48 .seq files each containing a single alignment. This dataset includes genetic variation and large gaps.

This dataset is reused directly from BiWFA and is also available in the BiWFA repository. We provide it here for completeness, with the only change that seq[1-9].seq have been renamed to seq[01-09].seq. It was generated by the BiWFA authors by taking ONT MinION reads from Bowden et al. (2019), filtering them for length at least 500kbp, and aligning them to the CHM13 v1.1 assembly.
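The renaming from seq[1-9].seq to seq[01-09].seq can be reproduced roughly as follows. This is only an illustrative sketch, not the script that was actually used; the helper names `zero_pad_name` and `rename_in_dir` are hypothetical.

```python
import re
from pathlib import Path


def zero_pad_name(name: str, width: int = 2) -> str:
    """Map e.g. 'seq7.seq' to 'seq07.seq'; leave other names unchanged."""
    match = re.fullmatch(r"seq(\d+)\.seq", name)
    if match is None:
        return name
    return f"seq{int(match.group(1)):0{width}d}.seq"


def rename_in_dir(directory: Path) -> None:
    """Rename every seq<N>.seq file in `directory` to its zero-padded form."""
    for path in sorted(directory.glob("seq*.seq")):
        target = path.with_name(zero_pad_name(path.name))
        if target != path:
            path.rename(target)
```

Zero-padding makes the lexicographic order of the file names match their numeric order.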

`ont-10k.zip` `ont-50k.zip`: ONT reads of length <10k and <50k @ 12% divergence

Each archive contains 50 .seq files, with 100 sequence pairs per file for `ont-10k.zip` and 200 for `ont-50k.zip`. Pairs are sorted by edit distance, with the closest pairs in 00.seq.

These datasets were reused from BiWFA and only modified to split them into multiple files.

`ont-1k.zip`: ONT reads of length <1k @ 10% divergence

Contains 50 .seq files with ~250 sequence pairs each, sorted by increasing edit distance.

This dataset is reused from WFA.

`sars-cov-2.zip`: 10000 pairs of length 30k @ 1.5% divergence

This dataset was generated by downloading 500MB of SARS-CoV-2 genomes. We stripped all non-ACTG characters, sampled 10000 random pairs, sorted them by edit distance, and split them into 50 files each containing 200 pairs.
Average divergence is 1.5%.
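The steps described above (strip, sample, sort, split) can be sketched as follows. This is a hedged illustration rather than the script that produced the dataset: all function names are hypothetical, the sampling scheme (two distinct genomes per pair, fixed seed) is an assumption, and the pair count is assumed to be divisible by the file count. The quadratic edit-distance routine is only practical for short inputs.

```python
import random
import re
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def strip_non_acgt(seq: str) -> str:
    """Remove every character that is not A, C, G or T."""
    return re.sub(r"[^ACGT]", "", seq)


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the classic O(len(a) * len(b)) DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match / substitution
        prev = cur
    return prev[-1]


def make_dataset(genomes: Sequence[str], num_pairs: int, num_files: int,
                 seed: int = 0) -> List[List[Pair]]:
    """Sample pairs, sort them by edit distance, and split them into files."""
    rng = random.Random(seed)
    cleaned = [strip_non_acgt(g) for g in genomes]
    pairs = [tuple(rng.sample(cleaned, 2)) for _ in range(num_pairs)]
    pairs.sort(key=lambda p: edit_distance(p[0], p[1]))
    per_file = num_pairs // num_files  # assumes an exact division
    return [pairs[i * per_file:(i + 1) * per_file] for i in range(num_files)]


def format_seq(pairs: Sequence[Pair]) -> str:
    """Serialize pairs in the .seq format ('>' line, then '<' line)."""
    return "".join(f">{a}\n<{b}\n" for a, b in pairs)
```

With `num_pairs=10000` and `num_files=50` this yields 200 pairs per file, matching the layout described above. For 30k-long genomes, the edit distances would in practice be computed with a fast aligner instead of the plain DP shown here.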

`illumina.seq.tar.bz`: Illumina reads of length 100 @ 0.3% divergence

Contains a single .seq file with 100,000 pairs. Reused from WFA.
