Skip to content

Commit

Permalink
Added readme, inlined operations, new median function that is way fas…
Browse files Browse the repository at this point in the history
…ter.

 - Improved perfomance of `median()` ~5% fasterish. New tests.
 - Inlined `operations.run()` (thanks for suggestion @molpopgen!)
 - New example.
  • Loading branch information
vsbuffalo committed Feb 26, 2024
1 parent 2e32a5a commit 0f37b24
Show file tree
Hide file tree
Showing 6 changed files with 175 additions and 11 deletions.
1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ bytes = "1.5.0"
ndarray-npy = { version = "0.8.1", optional = true }
num-traits = "0.2.18"
lazy_static = "1.4.0"
medians = "3.0.6"

[features]
# cli = [ "clap" ] # TODO make feature
Expand Down
84 changes: 84 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
![CI tests](https://github.com/vsbuffalo/granges/workflows/rust/badge.svg)

## The GRanges Rust library and command line tool

GRanges is a Rust library for working with genomic ranges and their associated
data. It aims to make it easy to write extremely performant genomics tools that
work with genomic range data (e.g. BED, GTF/GFF, VCF, etc). Internally, GRanges
uses the *very* fast [coitrees](https://github.com/dcjones/coitrees/) interval
tree library written by Daniel C. Jones for overlap operations. In preliminary
benchmarks, GRanges tools can be 10%-30% faster than similar functionality in
[bedtools2](https://github.com/arq5x/bedtools2) (see benchmark and caveats
below).

GRanges is inspired by ["tidy"](https://www.tidyverse.org) data analytics
workflows, as well as Bioconductor's
[GenomicRanges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003118)
and
[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html).
GRanges uses a similar *method-chaining* pipeline approach to manipulate
genomic ranges, find overlapping genomic regions, and compute statistics.
For example, you could implement your own `bedtools map`-like functionality
in relatively few lines of code:

```rust
// Create the "right" GRanges object.
let right_gr = bed5_gr
// Convert to interval trees.
.into_coitrees()?
// Extract out just the score from the additional BED5 columns.
.map_data(|bed5_cols| {
bed5_cols.score
})?;

// Compute overlaps and combine scores into mean.
let results_gr = left_gr
// Find overlaps
.left_overlaps(&right_gr)?
// Summarize overlap data
.map_over_joins(mean_score)?;
```

However unlike GenomicRanges, GRanges is a *compile-time* generic Rust library.
It is generic in the sense that it works with *any* data container type that
stores data associated with genomic data: a `Vec<U>` of some type, an
[ndarray](https://docs.rs/ndarray/latest/ndarray/) `Array2`,
[polars](https://pola.rs) dataframe, etc. GRanges allows the user to write do
common genomics data processing tasks in a few lines of Rust, and then lets the
Rust compiler optimize it.

As a proof-of-concept, GRanges also provides the command line tool `granges`
built on this library's functionality. This command line tool is intended for
benchmarks against comparable command line tools and for large-scale
integration tests against other software to ensure that GRanges is bug-free.
The `granges` tool currently provides a subset of the features of other great
bioinformatics utilities like
[bedtools](https://bedtools.readthedocs.io/en/latest/).

## Preliminary Benchmarks

In an attempt to combat "benchmark hype", this section details the results of
some preliminary benchmarks in an honest and transparent way. On our lab
server, with 100,000 range ranges per operation and n = 100 samples:

```
command bedtools time granges time granges speedup (%)
------------ --------------- -------------- ---------------------
map_multiple 293.41 s 129.99 s 55.6963
map_max 131.97 s 111.39 s 15.5938
adjust 127.36 s 59.25 s 53.4811
filter 113.70 s 101.01 s 11.1607
map_min 116.84 s 108.16 s 7.42583
flank 150.95 s 86.25 s 42.8602
map_mean 116.50 s 143.70 s -23.3524
map_sum 110.60 s 111.39 s -0.71377
windows 476.12 s 67.82 s 85.7555
map_median 154.51 s 104.37 s 32.4551
```

The worse performance of `map_mean`, upon closer inspection, is largely driven
by [aberrant replicates](https://github.com/vsbuffalo/granges/issues/2). Note too
that unlike `bedtools`, `granges` *always* requires a genome file of chromosome
lengths for validation; this disk I/O adds to GRange's benchmark times.


47 changes: 47 additions & 0 deletions examples/mean_score.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
use granges::{prelude::*, join::CombinedJoinDataLeftEmpty};

// Our overlap data processing function.
pub fn mean_score(join_data: CombinedJoinDataLeftEmpty<Option<f64>>) -> f64 {
// Get the "right data" -- the BED5 scores.
let overlap_scores: Vec<f64> = join_data.right_data.into_iter()
// filter out missing values ('.' in BED)
.filter_map(|x| x).collect();

// calculate the mean
let score_sum: f64 = overlap_scores.iter().sum();
score_sum / (overlap_scores.len() as f64)
}


fn try_main() -> Result<(), granges::error::GRangesError> {
// Mock sequence lengths (e.g. "genome" file)
let genome = seqlens!("chr1" => 100, "chr2" => 100);

// Create parsing iterators to the left and right BED files.
let left_iter = Bed3Iterator::new("tests_data/bedtools/map_a.txt")?;
let right_iter = Bed5Iterator::new("tests_data/bedtools/map_b.txt")?;

// Filter out any ranges from chromosomes not in our genome file.
let left_gr = GRangesEmpty::from_iter(left_iter, &genome)?;
let right_gr = GRanges::from_iter(right_iter, &genome)?;

// Create the "right" GRanges object.
let right_gr = right_gr
// Convert to interval trees.
.into_coitrees()?
// Extract out just the score from the additional BED5 columns.
.map_data(|bed5_cols| {
bed5_cols.score
})?;

// Compute overlaps and combine scores into mean.
let results_gr = left_gr
.left_overlaps(&right_gr)?
.map_over_joins(mean_score)?;

results_gr.to_tsv(None::<String>, &BED_TSV)?;
Ok(())
}

fn main() { try_main().unwrap(); }

4 changes: 2 additions & 2 deletions src/commands.rs
Original file line number Diff line number Diff line change
Expand Up @@ -455,7 +455,7 @@ pub fn granges_map(
// Process all the overlaps.
let result_gr = left_join_gr.map_over_joins(|join_data| {
// Get the "right data" -- the BED5 scores
let overlap_scores: Vec<f64> = join_data
let mut overlap_scores: Vec<f64> = join_data
.right_data
.into_iter()
// Filter out the `None` values.
Expand All @@ -465,7 +465,7 @@ pub fn granges_map(
// Run all operations on the scores.
operations
.iter()
.map(|operation| operation.run(&overlap_scores))
.map(|operation| operation.run(&mut overlap_scores))
.collect::<Vec<DatumType>>()
})?;

Expand Down
46 changes: 39 additions & 7 deletions src/data/operations.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,18 +14,19 @@ use super::DatumType;
use crate::traits::IntoDatumType;

/// Calculate the median.
///
/// This will clone and turn `numbers` into a `Vec`.
pub fn median<F: Float + Sum>(numbers: &[F]) -> Option<F> {
pub fn median<F: Float + Sum>(numbers: &mut [F]) -> Option<F> {
if numbers.is_empty() {
return None;
}
let mut numbers = numbers.to_vec();
numbers.sort_by(|a, b| a.partial_cmp(b).unwrap());
let mid = numbers.len() / 2;
if numbers.len() % 2 == 0 {
Some((numbers[mid - 1] + numbers[mid]) / F::from(2.0).unwrap())
numbers.select_nth_unstable_by(mid-1, |a, b| a.partial_cmp(b).unwrap());
let lower = numbers[mid-1];
numbers.select_nth_unstable_by(mid, |a, b| a.partial_cmp(b).unwrap());
let upper = numbers[mid];
Some((lower + upper) / F::from(2.0).unwrap())
} else {
numbers.select_nth_unstable_by(mid, |a, b| a.partial_cmp(b).unwrap());
Some(numbers[mid])
}
}
Expand All @@ -42,7 +43,8 @@ pub enum Operation {
}

impl Operation {
pub fn run<T: IntoDatumType + Copy>(&self, data: &[T]) -> DatumType
#[inline(always)]
pub fn run<T: IntoDatumType + Copy>(&self, data: &mut [T]) -> DatumType
where
T: Float + Sum<T> + ToPrimitive + Clone + ToString,
{
Expand Down Expand Up @@ -88,3 +90,33 @@ impl Operation {
}
}
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn test_median_empty() {
let mut numbers: Vec<f64> = vec![];
assert_eq!(median(&mut numbers), None);
}

#[test]
fn test_median_odd() {
let mut numbers = vec![2.0, 3.0, 1.0];
assert_eq!(median(&mut numbers), Some(2.0));
}

#[test]
fn test_median_even() {
let mut numbers = vec![1.0, 2.0, 3.0, 4.0];
assert_eq!(median(&mut numbers), Some(2.5));
}

#[test]
fn test_median_negative() {
let mut numbers = vec![-3.0, -1.0, -2.0];
assert_eq!(median(&mut numbers), Some(-2.0));
}

}
4 changes: 2 additions & 2 deletions src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,7 @@
//! let left_join_gr = left_gr.left_overlaps(&right_gr)?;
//!
//! // Process all the overlaps.
//! let result_gr = left_join_gr.map_over_joins(|join_data| {
//! let results_gr = left_join_gr.map_over_joins(|join_data| {
//! // Get the "right data" -- the BED5 scores.
//! let overlap_scores: Vec<f64> = join_data.right_data.into_iter()
//! // filter out missing values ('.' in BED)
Expand All @@ -185,7 +185,7 @@
//! // Write to a TSV file, using the BED TSV format standards
//! // for missing values, etc.
//! let tempfile = tempfile::NamedTempFile::new().unwrap();
//! result_gr.to_tsv(Some(tempfile.path()), &BED_TSV)?;
//! results_gr.to_tsv(Some(tempfile.path()), &BED_TSV)?;
//! # Ok(())
//! # }
//! # fn main() { try_main().unwrap(); }
Expand Down

0 comments on commit 0f37b24

Please sign in to comment.