Skip to content

Commit

Permalink
Added readme, inlined operations, new median function that is way fas…
Browse files Browse the repository at this point in the history
…ter.

 - Improved perfomance of `median()` ~5% fasterish. New tests.
 - Inlined `operations.run()` (thanks for suggestion @molpopgen!)
 - New example.
  • Loading branch information
vsbuffalo committed Feb 26, 2024
1 parent 2e32a5a commit 181046b
Show file tree
Hide file tree
Showing 6 changed files with 228 additions and 11 deletions.
1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ bytes = "1.5.0"
ndarray-npy = { version = "0.8.1", optional = true }
num-traits = "0.2.18"
lazy_static = "1.4.0"
medians = "3.0.6"

[features]
# cli = [ "clap" ] # TODO make feature
Expand Down
137 changes: 137 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
![CI tests](https://github.com/vsbuffalo/granges/workflows/Rust/badge.svg)

## The GRanges Rust library and command line tool

GRanges is a Rust library for working with genomic ranges and their associated
data. It aims to make it easy to write extremely performant genomics tools that
work with genomic range data (e.g. BED, GTF/GFF, VCF, etc). Internally, GRanges
uses the *very* fast [coitrees](https://github.com/dcjones/coitrees/) interval
tree library written by Daniel C. Jones for overlap operations. In preliminary
benchmarks, GRanges tools can be 10%-30% faster than similar functionality in
[bedtools2](https://github.com/arq5x/bedtools2) (see benchmark and caveats
below).

GRanges is inspired by ["tidy"](https://www.tidyverse.org) data analytics
workflows, as well as Bioconductor's
[GenomicRanges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003118)
and
[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html).
GRanges uses a similar *method-chaining* pipeline approach to manipulate
genomic ranges, find overlapping genomic regions, and compute statistics.
For example, you could implement your own `bedtools map`-like functionality
in relatively few lines of code:

```rust
// Create the "right" GRanges object.
let right_gr = bed5_gr
// Convert to interval trees.
.into_coitrees()?
// Extract out just the score from the additional BED5 columns.
.map_data(|bed5_cols| {
bed5_cols.score
})?;

// Compute overlaps and combine scores into mean.
let results_gr = left_gr
// Find overlaps
.left_overlaps(&right_gr)?
// Summarize overlap data
.map_over_joins(mean_score)?;
```

where `mean_score()` is:

```rust
pub fn mean_score(join_data: CombinedJoinDataLeftEmpty<Option<f64>>) -> f64 {
// Get the "right data" out of the join -- the BED5 scores.
let overlap_scores: Vec<f64> = join_data.right_data.into_iter()
// filter out missing values ('.' in BED)
.filter_map(|x| x).collect();

// Calculate the mean score.
let score_sum: f64 = overlap_scores.iter().sum();
score_sum / (overlap_scores.len() as f64)
}
```

Note that GRanges is a *compile-time* generic Rust library, so the code above
will be heavily optimized by the compiler. Rust uses *zero-cost abstractions*,
meaning high-level code like this is compiled and optimized so that it would be
just as performant as if it were written in a low-level language.

GRanges is *generic* in the sense that it works with *any* data container type
that stores data associated with genomic data: a `Vec<U>` of some type, an
[ndarray](https://docs.rs/ndarray/latest/ndarray/) `Array2`,
[polars](https://pola.rs) dataframe, etc. GRanges allows the user to write do
common genomics data processing tasks in a few lines of Rust, and then lets the
Rust compiler optimize it.

As a proof-of-concept, GRanges also provides the command line tool `granges`
built on this library's functionality. This command line tool is intended for
benchmarks against comparable command line tools and for large-scale
integration tests against other software to ensure that GRanges is bug-free.
The `granges` tool currently provides a subset of the features of other great
bioinformatics utilities like
[bedtools](https://bedtools.readthedocs.io/en/latest/).

## Preliminary Benchmarks

In an attempt to combat "benchmark hype", this section details the results of
some preliminary benchmarks in an honest and transparent way. On our lab
server, here are two runs with 100,000 ranges per operation, and n = 100 samples:

```
# run 1
command bedtools time granges time granges speedup (%)
------------ --------------- -------------- ---------------------
map_multiple 293.41 s 129.99 s 55.6963
map_max 131.97 s 111.39 s 15.5938
adjust 127.36 s 59.25 s 53.4811
filter 113.70 s 101.01 s 11.1607
map_min 116.84 s 108.16 s 7.42583
flank 150.95 s 86.25 s 42.8602
map_mean 116.50 s 143.70 s -23.3524
map_sum 110.60 s 111.39 s -0.71377
windows 476.12 s 67.82 s 85.7555
map_median 154.51 s 104.37 s 32.4551
# run 2
command bedtools time granges time granges speedup (%)
------------ --------------- -------------- ---------------------
map_multiple 348.14 s 124.29 s 64.2979
map_max 126.27 s 107.05 s 15.2267
adjust 126.51 s 59.23 s 53.1802
filter 114.66 s 97.28 s 15.1542
map_min 118.61 s 113.02 s 4.72037
flank 165.31 s 80.97 s 51.0211
map_mean 133.39 s 108.22 s 18.867
map_sum 120.28 s 128.68 s -6.98736
windows 476.22 s 90.89 s 80.9151
map_median 106.83 s 104.37 s 2.30808
```

The worse performance of `map_mean` in run 1, upon closer inspection, is
largely driven by [aberrant
replicates](https://github.com/vsbuffalo/granges/issues/2). (The server is
under heavy use by others, which adds variance to these benchmarks.) Note too
that unlike `bedtools`, `granges` *always* requires a genome file of chromosome
lengths for validation; this disk I/O adds to GRange's benchmark times.

Here is another benchmark with smaller test datasets (10,000 ranges) and many
more replicates:

```
command bedtools time granges time granges speedup (%)
------------ --------------- -------------- ---------------------
map_multiple 133.41 s 77.81 s 41.6736
adjust 60.23 s 40.06 s 33.4794
map_min 67.11 s 59.31 s 11.6281
map_mean 64.76 s 59.29 s 8.43342
map_max 67.04 s 59.30 s 11.5468
map_sum 66.89 s 60.01 s 10.2835
map_median 65.63 s 59.47 s 9.37911
flank 84.42 s 57.62 s 31.7415
filter 77.95 s 56.65 s 27.3318
windows 291.41 s 51.47 s 82.3359
```

47 changes: 47 additions & 0 deletions examples/mean_score.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
use granges::{prelude::*, join::CombinedJoinDataLeftEmpty};

// Our overlap data processing function.
pub fn mean_score(join_data: CombinedJoinDataLeftEmpty<Option<f64>>) -> f64 {
// Get the "right data" -- the BED5 scores.
let overlap_scores: Vec<f64> = join_data.right_data.into_iter()
// filter out missing values ('.' in BED)
.filter_map(|x| x).collect();

// calculate the mean
let score_sum: f64 = overlap_scores.iter().sum();
score_sum / (overlap_scores.len() as f64)
}


fn try_main() -> Result<(), granges::error::GRangesError> {
// Mock sequence lengths (e.g. "genome" file)
let genome = seqlens!("chr1" => 100, "chr2" => 100);

// Create parsing iterators to the left and right BED files.
let left_iter = Bed3Iterator::new("tests_data/bedtools/map_a.txt")?;
let right_iter = Bed5Iterator::new("tests_data/bedtools/map_b.txt")?;

// Filter out any ranges from chromosomes not in our genome file.
let left_gr = GRangesEmpty::from_iter(left_iter, &genome)?;
let right_gr = GRanges::from_iter(right_iter, &genome)?;

// Create the "right" GRanges object.
let right_gr = right_gr
// Convert to interval trees.
.into_coitrees()?
// Extract out just the score from the additional BED5 columns.
.map_data(|bed5_cols| {
bed5_cols.score
})?;

// Compute overlaps and combine scores into mean.
let results_gr = left_gr
.left_overlaps(&right_gr)?
.map_over_joins(mean_score)?;

results_gr.to_tsv(None::<String>, &BED_TSV)?;
Ok(())
}

fn main() { try_main().unwrap(); }

4 changes: 2 additions & 2 deletions src/commands.rs
Original file line number Diff line number Diff line change
Expand Up @@ -455,7 +455,7 @@ pub fn granges_map(
// Process all the overlaps.
let result_gr = left_join_gr.map_over_joins(|join_data| {
// Get the "right data" -- the BED5 scores
let overlap_scores: Vec<f64> = join_data
let mut overlap_scores: Vec<f64> = join_data
.right_data
.into_iter()
// Filter out the `None` values.
Expand All @@ -465,7 +465,7 @@ pub fn granges_map(
// Run all operations on the scores.
operations
.iter()
.map(|operation| operation.run(&overlap_scores))
.map(|operation| operation.run(&mut overlap_scores))
.collect::<Vec<DatumType>>()
})?;

Expand Down
46 changes: 39 additions & 7 deletions src/data/operations.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,18 +14,19 @@ use super::DatumType;
use crate::traits::IntoDatumType;

/// Calculate the median.
///
/// This will clone and turn `numbers` into a `Vec`.
pub fn median<F: Float + Sum>(numbers: &[F]) -> Option<F> {
pub fn median<F: Float + Sum>(numbers: &mut [F]) -> Option<F> {
if numbers.is_empty() {
return None;
}
let mut numbers = numbers.to_vec();
numbers.sort_by(|a, b| a.partial_cmp(b).unwrap());
let mid = numbers.len() / 2;
if numbers.len() % 2 == 0 {
Some((numbers[mid - 1] + numbers[mid]) / F::from(2.0).unwrap())
numbers.select_nth_unstable_by(mid-1, |a, b| a.partial_cmp(b).unwrap());
let lower = numbers[mid-1];
numbers.select_nth_unstable_by(mid, |a, b| a.partial_cmp(b).unwrap());
let upper = numbers[mid];
Some((lower + upper) / F::from(2.0).unwrap())
} else {
numbers.select_nth_unstable_by(mid, |a, b| a.partial_cmp(b).unwrap());
Some(numbers[mid])
}
}
Expand All @@ -42,7 +43,8 @@ pub enum Operation {
}

impl Operation {
pub fn run<T: IntoDatumType + Copy>(&self, data: &[T]) -> DatumType
#[inline(always)]
pub fn run<T: IntoDatumType + Copy>(&self, data: &mut [T]) -> DatumType
where
T: Float + Sum<T> + ToPrimitive + Clone + ToString,
{
Expand Down Expand Up @@ -88,3 +90,33 @@ impl Operation {
}
}
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn test_median_empty() {
let mut numbers: Vec<f64> = vec![];
assert_eq!(median(&mut numbers), None);
}

#[test]
fn test_median_odd() {
let mut numbers = vec![2.0, 3.0, 1.0];
assert_eq!(median(&mut numbers), Some(2.0));
}

#[test]
fn test_median_even() {
let mut numbers = vec![1.0, 2.0, 3.0, 4.0];
assert_eq!(median(&mut numbers), Some(2.5));
}

#[test]
fn test_median_negative() {
let mut numbers = vec![-3.0, -1.0, -2.0];
assert_eq!(median(&mut numbers), Some(-2.0));
}

}
4 changes: 2 additions & 2 deletions src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,7 @@
//! let left_join_gr = left_gr.left_overlaps(&right_gr)?;
//!
//! // Process all the overlaps.
//! let result_gr = left_join_gr.map_over_joins(|join_data| {
//! let results_gr = left_join_gr.map_over_joins(|join_data| {
//! // Get the "right data" -- the BED5 scores.
//! let overlap_scores: Vec<f64> = join_data.right_data.into_iter()
//! // filter out missing values ('.' in BED)
Expand All @@ -185,7 +185,7 @@
//! // Write to a TSV file, using the BED TSV format standards
//! // for missing values, etc.
//! let tempfile = tempfile::NamedTempFile::new().unwrap();
//! result_gr.to_tsv(Some(tempfile.path()), &BED_TSV)?;
//! results_gr.to_tsv(Some(tempfile.path()), &BED_TSV)?;
//! # Ok(())
//! # }
//! # fn main() { try_main().unwrap(); }
Expand Down

0 comments on commit 181046b

Please sign in to comment.