fixed zenodo link-only issue and new doc
vsbuffalo committed Sep 1, 2023
1 parent 0d3df12 commit 7ffd20b
Showing 5 changed files with 265 additions and 18 deletions.
222 changes: 217 additions & 5 deletions README.md
@@ -1,13 +1,225 @@
![CI tests](https://github.com/vsbuffalo/sciflow/workflows/CI/badge.svg)


![SciDataFlow logo](https://github.com/vsbuffalo/sciflow/blob/a477fc3a7e612ff4c5d89f3b43e2826b8c90f3b8/logo.png)

# SciFlow -- Facilitating the Flow of Data in Science
# SciDataFlow — Facilitating the Flow of Data in Science

**Problem 1**: Have you ever wanted to reuse and build upon a research project's output or supplementary data, but couldn't find it?

**SciDataFlow** solves this issue by making it easy to **unite** a research project's *data* with its *code*. Often, code for open computational projects is managed with Git and stored on a site like GitHub. However, a lot of scientific data is too large to be stored on these sites, and instead is hosted by sites like [Zenodo](http://zenodo.org) or [FigShare](http://figshare.com).

**Problem 2**: Does your computational project have dozens or even hundreds of intermediate data files you'd like to keep track of? Do you want to see if these files are changed by updates to computational pipelines?

SciDataFlow also solves this issue by keeping a record of the information necessary to detect when data has changed. This is stored alongside the information needed to retrieve data from and push data to remote data repositories. All of this is kept in a simple [YAML](https://yaml.org) "Data Manifest" (`data_manifest.yml`) file that SciDataFlow manages. This file is stored in the main project directory and is meant to be checked into Git, so that the Git commit history can be used to see changes to data. The Data Manifest is a simple, minimal, human- and machine-readable specification. But you don't need to know the specifics — the simple `sdf` command line tool handles it all for you.
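
To give a flavor of what this looks like, here is a hypothetical sketch of a manifest. The field names below are illustrative assumptions, not the actual SciDataFlow schema:

```bash
$ cat data_manifest.yml
# hypothetical sketch of a Data Manifest; field names are illustrative only
files:
  - path: data/population_sizes.tsv
    tracked: true
    md5: 8cb9d10b
```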

## A Simple Workflow Example

The user interacts with the Data Manifest through the fast and concurrent command line tool `sdf`, written in the inimitable [Rust language](https://www.rust-lang.org). The `sdf` tool has a Git-like interface. If you know Git, using it will be easy. For example, to initialize SciDataFlow for a project, you'd use:

```bash
$ sdf init
```

Registering a file in the manifest:

```bash
$ sdf add data/population_sizes.tsv
Added 1 file.
```

To check whether a file has changed, we'd use `sdf status`:

```bash
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
population_sizes.tsv current 3fba1fc3 2023-09-01 10:38AM (53 seconds ago)
```

Now, let's imagine a pipeline runs and changes this file:

```bash
$ bash tools/computational_pipeline.sh # changes data
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
population_sizes.tsv changed 3fba1fc3 → 8cb9d10b 2023-09-01 10:48AM (1 second ago)

```

If these changes are good, we can tell the Data Manifest it should update its record of this version:

```bash
$ sdf update data/population_sizes.tsv
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
population_sizes.tsv current 8cb9d10b 2023-09-01 10:48AM (6 minutes ago)

```

**⚠️Warning**: SciDataFlow does not do data *versioning*. Unlike Git, it does not keep an entire history of data at each commit. Thus, **data backup must be managed by separate software**. SciDataFlow is still in alpha phase, so it is especially important you backup your data *before* using SciDataFlow. A tiny, kind reminder: you as a researcher should be doing routine backups *already* — losing data due to either a computational mishap or hardware failure is always possible.

## Pushing Data to Remote Repositories

SciDataFlow also saves researchers' time when submitting supplementary data to services like Zenodo or FigShare. Simply link the remote service (you'll first need to get an API access token from the service's website):

```bash
$ sdf link data/ zenodo <TOKEN> --name popsize_study
```

You only need to link a remote once. SciDataFlow will look for a project on the remote with this name first (see `sdf link --help` for more options). SciDataFlow stores the authentication keys for all remotes in `~/.scidataflow_authkeys.yml` so you don't have to remember them.
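
The authentication file itself is just a small YAML mapping from remote name to key. The layout below is an assumption for illustration; `sdf` creates and manages this file for you, so you never need to edit it by hand:

```bash
$ cat ~/.scidataflow_authkeys.yml
# hypothetical layout; sdf writes and reads this file for you
zenodo: <TOKEN>
```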

SciDataFlow knows you probably don't want to upload *every* file that you're keeping track of locally. Sometimes you just want to use SciDataFlow to track local changes. So, in addition to registering files in the Data Manifest, you can also tell SciDataFlow you'd like to *track* them:

```bash
$ sdf track data/population_sizes.tsv
```

Now, you can check the status on remotes too with:

```bash
$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
population_sizes.tsv current, tracked 8cb9d10b 2023-09-01 10:48AM (14 minutes ago) not on remote
```

Then, to upload these files to Zenodo, all we'd do is:

```bash
$ sdf push
Info: uploading file "data/population_sizes.tsv" to Zenodo
Uploaded 1 file.
Skipped 0 files.
```

## Retrieving Data from Remotes

A key feature of SciDataFlow is that it can quickly reunite a project's *code* repository with its *data*. Imagine a colleague had a small repository containing the code to lift a recombination map over to a new reference genome, and you'd like to use her methods. However, you also want to check that you can reproduce her pipeline on your system, which first involves re-downloading all the input data (in this case, the original recombination map and liftover files).

First, you'd clone the repository:

```bash
$ git clone git@github.com:mclintock/maize_liftover
$ cd maize_liftover/
```

Then, as long as a `data_manifest.yml` exists in the root project directory (`maize_liftover/` in this example), SciDataFlow is initialized. You can verify this by using:

```bash
$ sdf status --remotes
Project data status:
0 files local and tracked by a remote (0 files only local, 2 files only remote), 2 files total.

[data > Zenodo]
recmap_genome_v1.tsv deleted, tracked 7ef1d10a exists on remote
recmap_genome_v2.tsv deleted, tracked e894e742 exists on remote
```

Now, to retrieve these files, all you'd need to do is:

```bash
$ sdf pull
Downloaded 2 files.
- recmap_genome_v1.tsv
- recmap_genome_v2.tsv
Skipped 0 files.
```

Note that if you run `sdf pull` again, it will not re-download the files (this is to avoid overwriting the local versions, should they have been changed):

```bash
$ sdf pull
No files downloaded.
Skipped 2 files. Reasons:
Remote file is identical to local file: 2 files
- recmap_genome_v1.tsv
- recmap_genome_v2.tsv
```

If the file has changed, you can pull in the remote's version with `sdf pull --overwrite`. However, `sdf pull` is also lazy; it will not download the file if the MD5s haven't changed between the remote and local versions.
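
For example, deliberately replacing a locally modified file with the remote's version would look like this (a sketch; output omitted):

```bash
# force the remote copy over the changed local file
$ sdf pull --overwrite
```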

Downloads with SciDataFlow are fast and concurrent thanks to the [Tokio Rust Asynchronous Universal download MAnager](https://github.com/rgreinho/trauma) crate. If your project has a lot of data across multiple remotes, SciDataFlow will pull all data in as quickly as possible.

## Retrieving Data from Static URLs

Often we also want to retrieve data from URLs. For example, many genomic resources are available for download from the [UCSC](http://genome.ucsc.edu) or [Ensembl](http://ensembl.org) websites as static URLs. We want a record of where these files come from in the Data Manifest, so we want to combine a download with an `sdf add`. The command `sdf get` does this all for you — let's imagine you want to get all human coding sequences. You could do this with:

```bash
$ sdf get https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
⠄ [================> ] 9639693/22716351 (42%) eta 00:00:08
```

Now, it would show up in the Data Manifest:

```bash
$ sdf status --remotes
Project data status:
0 files local and tracked by a remote (1 file only local, 0 files only remote), 1 file total.

[data > Zenodo]
Homo_sapiens.GRCh38.cds.all.fa.gz current, untracked fb59b3ad 2023-09-01 3:13PM (43 seconds ago) not on remote
```

Note that files downloaded from URLs are not automatically tracked with remotes. You can do this with `sdf track <FILENAME>` if you want. Then, you can use `sdf push` to upload this same file to Zenodo or FigShare.
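
Putting those two steps together might look like the following sketch (the `data/` path is an assumption about where the file was downloaded; output omitted):

```bash
# start tracking the downloaded file, then upload it to the linked remote
$ sdf track data/Homo_sapiens.GRCh38.cds.all.fa.gz
$ sdf push
```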

Since modern computational projects may require downloading potentially *hundreds* or even *thousands* of annotation files, the `sdf` tool has a simple way to do this in bulk: tab-delimited or comma-separated value files (e.g. those with the suffixes `.tsv` and `.csv`, respectively). The big-picture idea of SciDataFlow is that it should take mere seconds to pull in all data needed for a large genomics project (or astronomy, or ecology, whatever). Here's an example TSV file full of links:

```bash
$ cat human_annotation.tsv
type url
cdna https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
fasta https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.alt.fa.gz
cds https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
```

Note that this has a header, and the URLs are in the second column. To get this data, we'd use:

```bash
$ sdf bulk human_annotation.tsv --column 1 --header
⠁ [ ] 0/2 (0%) eta 00:00:00
⠉ [====> ] 9071693/78889691 (11%) eta 00:01:22
⠐ [=========> ] 13503693/54514783 (25%) eta 00:00:35
```

**Column indices are zero-indexed**, and `sdf bulk` assumes no header by default. Note that in this example, only two files are downloading — this is because `sdf` detected that the CDS file already existed. SciDataFlow tells you this with a little message at the end:

```bash
$ sdf bulk human_annotation.tsv --column 1 --header
3 URLs found in 'human_annotation.tsv'.
2 files were downloaded, 2 added to manifest (0 were already registered).
1 file was skipped because it existed (and --overwrite was not specified).
```

### Adding Metadata

Some data repository services like Zenodo support deposition metadata, such as a title, description, and list of authors.

The larger vision of SciDataFlow is to change how data flows through scientific projects. The way scientific data is currently shared is fundamentally **broken**, and it prevents the reuse of important **scientific assets**. By lowering the barrier to sharing and retrieving scientific data, SciDataFlow hopes to improve the *reuse* of data. For example, suppose one of your projects required

Or perhaps you're in
the midst of a large computational project, and need a better way to track the
data going into a project, intermediate data, and whether data has changed
during different runs of a pipeline.

Perhaps you're submitting a manuscript to a journal, and you need to go through
a large project's directory to find and upload supplementary data to a data
repository service like [Zenodo](http://zenodo.org) or
[FigShare](http://figshare.com). It can be labor intensive to manually find and
upload each of these data files necessary to make a project reproducible.

Or maybe a colleague created a small but important *scientific asset*, such as
lifting a recombination map to a new reference genome version.


SciDataFlow is both (1) a minimal specification of the data used in a scientific
project and (2) a fast, concurrent command line tool to retrieve and upload
scientific data from multiple repositories (e.g. FigShare, Zenodo, etc.).

## Philosophy

16 changes: 13 additions & 3 deletions src/lib/api/zenodo.rs
@@ -265,13 +265,19 @@ impl ZenodoAPI {

pub async fn find_deposition(&self) -> Result<Option<ZenodoDeposition>> {
let depositions = self.get_depositions().await?;
let matches_found: Vec<_> = depositions.into_iter().filter(|a| a.title == self.name).collect();
let mut matches_found: Vec<_> = depositions.into_iter().filter(|a| a.title == self.name).collect();
if !matches_found.is_empty() {
if matches_found.len() > 1 {
return Err(anyhow!("Found multiple Zenodo Depositions with the \
title '{}'", self.name));
} else {
return Ok(Some(matches_found[0].clone()));
// We need to do one more API call, to get the full listing
// with the bucket URL.
let partial_deposition = matches_found.remove(0);
let url = format!("deposit/depositions/{}", partial_deposition.id);
let response = self.issue_request::<HashMap<String,String>>(Method::GET, &url, None, None).await?;
let deposition: ZenodoDeposition = response.json().await?;
return Ok(Some(deposition));
}
} else {
return Ok(None);
@@ -317,7 +323,11 @@ impl ZenodoAPI {
};

self.deposition_id = Some(info.id as u64);
self.bucket_url = info.links.bucket;
let bucket_url = info.links.bucket;
if bucket_url.is_none() {
return Err(anyhow!("Internal Error: ZenodoAPI::find_deposition() did not return an entry with a bucket_url."));
}
self.bucket_url = bucket_url;

Ok(())
}
3 changes: 2 additions & 1 deletion src/lib/data.rs
@@ -904,7 +904,8 @@ impl DataCollection {
println!("Uploaded {}.", pluralize(num_uploaded as u64, "file"));
let num_skipped = overwrite_skipped.len() + current_skipped.len() +
messy_skipped.len() + untracked_skipped.len();
println!("Skipped {} files:", num_skipped);
let punc = if num_skipped > 0 { "." } else { ":" };
println!("Skipped {}{}", pluralize(num_skipped as u64, "file"), punc);
if !untracked_skipped.is_empty() {
println!(" Untracked: {}", pluralize(untracked_skipped.len() as u64, "file"));
for path in untracked_skipped {
3 changes: 2 additions & 1 deletion src/lib/download.rs
@@ -84,7 +84,8 @@ impl Downloads {
.build();
downloader.download(&downloads).await;
if show_total {
println!("Downloaded {}.", pluralize(total_files as u64, "file"));
let punc = if total_files > 0 { "." } else { ":" };
println!("Downloaded {}{}", pluralize(total_files as u64, "file"), punc);
}
for download in downloads {
if let Some(msg) = success_status {
39 changes: 31 additions & 8 deletions src/lib/utils.rs
@@ -208,44 +208,67 @@ struct FileCounts {
local: u64,
remote: u64,
both: u64,
total: u64
total: u64,
#[allow(dead_code)]
messy: u64
}

fn get_counts(rows: &BTreeMap<String,Vec<StatusEntry>>) -> Result<FileCounts> {
let mut local = 0;
let mut remote = 0;
let mut both = 0;
let mut total = 0;
let mut messy = 0;
for files in rows.values() {
for file in files {
total += 1;
match (&file.local_status, &file.remote_status) {
(None, None) => {
match (&file.local_status, &file.remote_status, &file.tracked) {
(None, None, _) => {
return Err(anyhow!("Internal Error: get_counts found a file with both local/remote set to None."));
},
(Some(_), None) => {
(None, Some(_), None) => {
remote += 1;
},
(Some(_), None, Some(false)) => {
local += 1;
},
(None, Some(_)) => {
(Some(_), None, None) => {
local += 1;
},
(None, Some(_), Some(true)) => {
remote += 1;
},
(Some(_), Some(_)) => {
(None, Some(_), Some(false)) => {
local += 1;
},
(Some(_), Some(_), Some(true)) => {
both += 1;
},
(Some(_), Some(_), Some(false)) => {
messy += 1;
},
(Some(_), None, Some(true)) => {
remote += 1;
},
(Some(_), Some(_), None) => {
messy += 1;
}
}
}
}
Ok(FileCounts { local, remote, both, total })
Ok(FileCounts { local, remote, both, total, messy })
}

pub fn print_status(rows: BTreeMap<String,Vec<StatusEntry>>, remote: Option<&HashMap<String,Remote>>,
all: bool) {
println!("--> {:?}", rows);
println!("{}", "Project data status:".bold());
let counts = get_counts(&rows).expect("Internal Error: get_counts() panicked.");
println!("{} on local and remotes ({} only local, {} only remote), {} total.\n",
println!("{} local and tracked by a remote ({} only local, {} only remote), {} total.\n",
pluralize(counts.both as u64, "file"),
pluralize(counts.local as u64, "file"),
pluralize(counts.remote as u64, "file"),
//pluralize(counts.messy as u64, "file"),
pluralize(counts.total as u64, "file"));

// this brings the remote name (if there is a corresponding remote) into
