Releases: pairwise-alignment/pa-bench
A*PA2 evals
This release is a placeholder for downloads related to the A*PA2 paper.
results.zip contains the results used by evals/astarpa2/evals.ipynb. See evals/astarpa2/README.md for more info.
A*PA evals
This release is a placeholder for downloads related to the A*PA paper.
results.zip contains the results used by evals/astarpa/evals.ipynb. See evals/astarpa/README.md for more info.
Datasets
This release hosts the datasets used for benchmarking. Datasets are provided in the .seq format, in which each pair of sequences to be aligned is written as a line starting with > followed by a line starting with <, like so:
>CTGGGGTTACAGGCATGCACCAGCACGCC...
<CTGGGGTTACAGGCATGCACCAGCACGCC...
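As an illustration of the format above, here is a minimal Python sketch of a reader for .seq files. The function name read_seq_pairs and the strict error handling are assumptions made for this example, not part of pa-bench itself; it only relies on the layout shown above, where a line starting with > is followed by a line starting with <.

```python
from typing import Iterable, Iterator, Optional, Tuple


def read_seq_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (a, b) sequence pairs from the lines of a .seq file.

    Hypothetical helper: assumes each pair is a '>' line followed by a '<' line.
    """
    first: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if first is not None:
                raise ValueError("found two '>' lines in a row")
            first = line[1:]
        elif line.startswith("<"):
            if first is None:
                raise ValueError("found a '<' line without a preceding '>' line")
            yield first, line[1:]
            first = None
        else:
            raise ValueError(f"unexpected line prefix: {line[:1]!r}")
    if first is not None:
        raise ValueError("file ends with an unpaired '>' line")
```

Since an open file is an iterable of lines, usage could look like `with open("seq01.seq") as f: pairs = list(read_seq_pairs(f))`.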
ont-500k.zip: ONT reads of length >500kbp @ 6.1% divergence
Contains 50 .seq files (seq01.seq .. seq50.seq), each containing a single alignment. This dataset contains only read errors.
This dataset was created by downloading some reads (this download, 300GB total) used for v1.1 of CHM13, and aligning them back to the reference.
See Snakefile for details.
ont-500k-genvar.zip: ONT reads of length >500kbp @ 7.2% divergence, including genetic variation
Contains 48 .seq files, each containing a single alignment. This dataset includes genetic variation and large gaps.
This dataset is reused directly from BiWFA and is also available in the BiWFA repository. We provide it here for completeness; the only change is that seq[1-9].seq have been renamed to seq[01-09].seq. It was generated by the BiWFA authors by taking ONT MinION reads from Bowden et al. (2019), keeping only reads of length at least 500kbp, and aligning them to the CHM13 v1.1 assembly.
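The renaming described above amounts to zero-padding the file index so that names sort in numeric order. A minimal sketch in Python, assuming a hypothetical helper pad_seq_name (the actual renaming may well have been done by hand or with a shell one-liner):

```python
import re


def pad_seq_name(name: str, width: int = 2) -> str:
    """Zero-pad the index in names like 'seq7.seq' to 'seq07.seq'.

    Hypothetical helper; names that do not match 'seq<digits>.seq'
    are returned unchanged.
    """
    match = re.fullmatch(r"seq(\d+)\.seq", name)
    if match is None:
        return name
    return f"seq{int(match.group(1)):0{width}d}.seq"
```

With padded names, a plain lexicographic sort lists seq02.seq before seq10.seq.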
ont-10k.zip, ont-50k.zip: ONT reads of length <10k and <50k @ 12% divergence
Each contains 50 .seq files, with 100 and 200 sequence pairs per file, respectively. Pairs are sorted by edit distance, with the closest pairs in 00.seq.
These datasets were reused from BiWFA and only modified to split them into multiple files.
ont-1k.zip: ONT reads of length <1k @ 10% divergence
Contains 50 .seq files with ~250 sequence pairs each, sorted by increasing edit distance.
This dataset is reused from WFA.
sars-cov-2.zip: 10000 pairs of length 30k @ 1.5% divergence
This dataset was generated by downloading 500MB of SARS-CoV-2 genomes. We stripped all non-ACTG characters, sampled 10000 random pairs, sorted them by edit distance, and split them into 50 files each containing 200 pairs.
Average divergence is 1.5%.
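The generation steps for this dataset can be sketched as follows. This is an illustration under stated assumptions, not the script that was actually used: the function names, the fixed seed, sampling each pair as two distinct genomes, and treating lowercase letters as non-ACTG are all choices made for the example. The quadratic edit distance shown here is only practical for short test inputs; for 30k-length genomes a dedicated aligner would be used instead.

```python
import random
import re
from typing import List, Tuple


def strip_non_actg(seq: str) -> str:
    """Drop every character other than uppercase A, C, G, T (assumption: case-sensitive)."""
    return re.sub(r"[^ACGT]", "", seq)


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the textbook O(|a|*|b|) dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def make_dataset(genomes: List[str], num_pairs: int, num_files: int,
                 seed: int = 0) -> List[List[Tuple[str, str]]]:
    """Clean genomes, sample random pairs, sort by edit distance, split into files."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    cleaned = [strip_non_actg(g) for g in genomes]
    pairs = [tuple(rng.sample(cleaned, 2)) for _ in range(num_pairs)]
    pairs.sort(key=lambda p: edit_distance(p[0], p[1]))
    per_file = num_pairs // num_files  # any remainder is dropped in this sketch
    return [pairs[i * per_file:(i + 1) * per_file] for i in range(num_files)]
```

Calling make_dataset(genomes, 10000, 50) would yield 50 groups of 200 pairs each, matching the layout described above, with the closest pairs in the first group.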
illumina.seq.tar.bz: Illumina reads of length 100 @ 0.3% divergence
Contains a single .seq file with 100'000 pairs. Reused from WFA.