Commit
Added readme, inlined operations, new median function that is way faster.

- Improved performance of `median()` ~5% fasterish. New tests.
- Inlined `operations.run()` (thanks for suggestion @molpopgen!)
- New example.
Showing 6 changed files with 175 additions and 11 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
![CI tests](https://github.com/vsbuffalo/granges/workflows/rust/badge.svg) | ||
|
||
## The GRanges Rust library and command line tool | ||
|
||
GRanges is a Rust library for working with genomic ranges and their associated | ||
data. It aims to make it easy to write extremely performant genomics tools that | ||
work with genomic range data (e.g. BED, GTF/GFF, VCF). Internally, GRanges | ||
uses the *very* fast [coitrees](https://github.com/dcjones/coitrees/) interval | ||
tree library written by Daniel C. Jones for overlap operations. In preliminary | ||
benchmarks, GRanges tools can be 10%-30% faster than similar functionality in | ||
[bedtools2](https://github.com/arq5x/bedtools2) (see benchmark and caveats | ||
below). | ||
|
||
GRanges is inspired by ["tidy"](https://www.tidyverse.org) data analytics | ||
workflows, as well as Bioconductor's | ||
[GenomicRanges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003118) | ||
and | ||
[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html). | ||
GRanges uses a similar *method-chaining* pipeline approach to manipulate | ||
genomic ranges, find overlapping genomic regions, and compute statistics. | ||
For example, you could implement your own `bedtools map`-like functionality | ||
in relatively few lines of code: | ||
|
||
```rust | ||
// Create the "right" GRanges object. | ||
let right_gr = bed5_gr | ||
// Convert to interval trees. | ||
.into_coitrees()? | ||
// Extract out just the score from the additional BED5 columns. | ||
.map_data(|bed5_cols| { | ||
bed5_cols.score | ||
})?; | ||
|
||
// Compute overlaps and combine scores into mean. | ||
let results_gr = left_gr | ||
// Find overlaps | ||
.left_overlaps(&right_gr)? | ||
// Summarize overlap data | ||
.map_over_joins(mean_score)?; | ||
``` | ||
|
||
However, unlike GenomicRanges, GRanges is a *compile-time* generic Rust library. | ||
It is generic in the sense that it works with *any* data container type that | ||
stores data associated with genomic ranges: a `Vec<U>` of some type, an | ||
[ndarray](https://docs.rs/ndarray/latest/ndarray/) `Array2`, a | ||
[polars](https://pola.rs) dataframe, etc. GRanges allows the user to do | ||
common genomics data processing tasks in a few lines of Rust, and then lets the | ||
Rust compiler optimize it. | ||
|
||
As a proof-of-concept, GRanges also provides the command line tool `granges` | ||
built on this library's functionality. This command line tool is intended for | ||
benchmarks against comparable command line tools and for large-scale | ||
integration tests against other software to ensure that GRanges is bug-free. | ||
The `granges` tool currently provides a subset of the features of other great | ||
bioinformatics utilities like | ||
[bedtools](https://bedtools.readthedocs.io/en/latest/). | ||
|
||
## Preliminary Benchmarks | ||
|
||
In an attempt to combat "benchmark hype", this section details the results of | ||
some preliminary benchmarks in an honest and transparent way. On our lab | ||
server, with 100,000 ranges per operation and n = 100 samples: | ||
|
||
``` | ||
command bedtools time granges time granges speedup (%) | ||
------------ --------------- -------------- --------------------- | ||
map_multiple 293.41 s 129.99 s 55.6963 | ||
map_max 131.97 s 111.39 s 15.5938 | ||
adjust 127.36 s 59.25 s 53.4811 | ||
filter 113.70 s 101.01 s 11.1607 | ||
map_min 116.84 s 108.16 s 7.42583 | ||
flank 150.95 s 86.25 s 42.8602 | ||
map_mean 116.50 s 143.70 s -23.3524 | ||
map_sum 110.60 s 111.39 s -0.71377 | ||
windows 476.12 s 67.82 s 85.7555 | ||
map_median 154.51 s 104.37 s 32.4551 | ||
``` | ||
|
||
The worse performance of `map_mean`, upon closer inspection, is largely driven | ||
by [aberrant replicates](https://github.com/vsbuffalo/granges/issues/2). Note too | ||
that unlike `bedtools`, `granges` *always* requires a genome file of chromosome | ||
lengths for validation; this disk I/O adds to GRanges' benchmark times. | ||
|
||
|
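The speedup column in the benchmark table appears to be the relative reduction in run time, (bedtools time - granges time) / bedtools time x 100. That formula is inferred from the numbers, not stated in the README. A minimal Rust sketch (the function name `speedup_percent` is hypothetical) recomputes a few rows from the rounded times shown, so the last digits can differ slightly from the table:

```rust
/// Relative reduction in run time, as a percentage of the baseline.
/// Positive values mean the candidate is faster than the baseline.
fn speedup_percent(baseline_secs: f64, candidate_secs: f64) -> f64 {
    (baseline_secs - candidate_secs) / baseline_secs * 100.0
}

fn main() {
    // (command, bedtools seconds, granges seconds), copied from the table.
    let rows = [
        ("map_multiple", 293.41, 129.99),
        ("map_mean", 116.50, 143.70),
        ("windows", 476.12, 67.82),
    ];
    for (command, bedtools, granges) in rows {
        println!("{command:<12} {:>8.2} %", speedup_percent(bedtools, granges));
    }
}
```

Under this reading, a negative value such as the one for `map_mean` means `granges` took longer than `bedtools` on that command.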
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
use granges::{prelude::*, join::CombinedJoinDataLeftEmpty}; | ||
|
||
// Our overlap data processing function. | ||
pub fn mean_score(join_data: CombinedJoinDataLeftEmpty<Option<f64>>) -> f64 { | ||
// Get the "right data" -- the BED5 scores. | ||
let overlap_scores: Vec<f64> = join_data.right_data.into_iter() | ||
// drop missing values ('.' in BED); `flatten` keeps only the `Some` scores | ||
.flatten().collect(); | ||
|
||
// calculate the mean | ||
let score_sum: f64 = overlap_scores.iter().sum(); | ||
score_sum / (overlap_scores.len() as f64) | ||
} | ||
|
||
|
||
fn try_main() -> Result<(), granges::error::GRangesError> { | ||
// Mock sequence lengths (e.g. "genome" file) | ||
let genome = seqlens!("chr1" => 100, "chr2" => 100); | ||
|
||
// Create parsing iterators to the left and right BED files. | ||
let left_iter = Bed3Iterator::new("tests_data/bedtools/map_a.txt")?; | ||
let right_iter = Bed5Iterator::new("tests_data/bedtools/map_b.txt")?; | ||
|
||
// Filter out any ranges from chromosomes not in our genome file. | ||
let left_gr = GRangesEmpty::from_iter(left_iter, &genome)?; | ||
let right_gr = GRanges::from_iter(right_iter, &genome)?; | ||
|
||
// Create the "right" GRanges object. | ||
let right_gr = right_gr | ||
// Convert to interval trees. | ||
.into_coitrees()? | ||
// Extract out just the score from the additional BED5 columns. | ||
.map_data(|bed5_cols| { | ||
bed5_cols.score | ||
})?; | ||
|
||
// Compute overlaps and combine scores into mean. | ||
let results_gr = left_gr | ||
.left_overlaps(&right_gr)? | ||
.map_over_joins(mean_score)?; | ||
|
||
results_gr.to_tsv(None::<String>, &BED_TSV)?; | ||
Ok(()) | ||
} | ||
|
||
fn main() { try_main().unwrap(); } | ||
|
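To make the logic of this example explicit without the library, here is a dependency-free sketch of a `bedtools map`-style mean: for every left range, collect the scores of the overlapping right ranges and average them. Everything in it is an assumption for illustration: the `Range` struct, the `overlaps` and `mean_of_overlaps` names, 0-based half-open coordinates as in BED, and a linear scan in place of the interval trees GRanges uses. It also returns `None` when no scores overlap, whereas `mean_score` above divides by a length of zero in that case and so evaluates to NaN.

```rust
/// A genomic range with an optional score (`None` stands for '.' in BED).
#[derive(Debug)]
struct Range {
    chrom: &'static str,
    start: u32, // 0-based, inclusive
    end: u32,   // exclusive
    score: Option<f64>,
}

/// Two half-open ranges overlap if they share a chromosome and
/// each one starts before the other ends.
fn overlaps(a: &Range, b: &Range) -> bool {
    a.chrom == b.chrom && a.start < b.end && b.start < a.end
}

/// Mean of the non-missing scores of all `right` ranges overlapping `left`.
/// Returns `None` when there is nothing to average.
fn mean_of_overlaps(left: &Range, right: &[Range]) -> Option<f64> {
    let scores: Vec<f64> = right
        .iter()
        .filter(|r| overlaps(left, r))
        .filter_map(|r| r.score)
        .collect();
    if scores.is_empty() {
        None
    } else {
        Some(scores.iter().sum::<f64>() / scores.len() as f64)
    }
}

fn main() {
    let right = vec![
        Range { chrom: "chr1", start: 0, end: 10, score: Some(2.0) },
        Range { chrom: "chr1", start: 5, end: 20, score: Some(4.0) },
        Range { chrom: "chr1", start: 8, end: 12, score: None },
        Range { chrom: "chr2", start: 0, end: 10, score: Some(9.0) },
    ];
    let left = vec![
        Range { chrom: "chr1", start: 6, end: 9, score: None },
        Range { chrom: "chr1", start: 50, end: 60, score: None },
    ];
    for l in &left {
        println!("{}:{}-{} -> {:?}", l.chrom, l.start, l.end, mean_of_overlaps(l, &right));
    }
}
```

The linear scan costs time proportional to the number of left ranges times the number of right ranges, which is the cost an interval tree is meant to avoid. The sketch shows what is computed, not how GRanges computes it.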