simple, minimal, human and machine readable specification. But you don't need
to know the specifics — the simple `sdf` command line tool handles it all for
you.

## A Simple Workflow Example

If you'd like to follow the example along, first [install
SciDataFlow](#installing-scidataflow).

The user interacts with the Data Manifest through the fast and concurrent
command line tool `sdf`, written in the inimitable [Rust
language](https://www.rust-lang.org). The `sdf` tool has a Git-like interface.
If you know Git, using it will be easy, e.g. to initialize SciDataFlow for a
project you'd use:

```console
$ sdf init
```

Registering a file in the manifest:

```console
$ sdf add data/population_sizes.tsv
Added 1 file.
```

Checking to see if a file has changed, we'd use `sdf status`:

```console
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      current      3fba1fc3   2023-09-01 10:38AM (53 seconds ago)
```

Now, let's imagine a pipeline runs and changes this file:

```console
$ bash tools/computational_pipeline.sh # changes data
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      changed      3fba1fc3 → 8cb9d10b   2023-09-01 10:48AM (1 second ago)
```

If these changes are good, we can tell the Data Manifest it should update its
record of this version:

```console
$ sdf update data/population_sizes.tsv
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      current      8cb9d10b   2023-09-01 10:48AM (6 minutes ago)
```

**⚠️ Warning**: SciDataFlow does not do data *versioning*. Unlike Git, it does
not keep an entire history of data at each commit. Thus, **data backup must be
managed by separate software**. SciDataFlow is still in its alpha phase, so it is
especially important that you back up your data *before* using SciDataFlow. A tiny,
kind reminder: as a researcher, you should be doing routine backups *already* —
losing data to either a computational mishap or hardware failure is always
possible.
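
Since backups are explicitly out of SciDataFlow's scope, here is one minimal way to snapshot a project's `data/` directory with standard Unix tools before updating the manifest. This is purely an illustration, not an `sdf` feature:

```shell
# Illustrative only: SciDataFlow does not version data, so snapshot it
# yourself with ordinary tools before running `sdf update`.
# Archive data/ into a date-stamped tarball.
tar -czf "data_backup_$(date +%Y%m%d).tar.gz" data/
```

Any backup tool works here; the point is only that the snapshot happens *before* the manifest's record of the file is updated.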

## Pushing Data to Remote Repositories

SciDataFlow also saves researchers' time when submitting supplementary data to
services like Zenodo or FigShare. Simply link the remote service (you'll need
to first get an API access token from their website):

```console
$ sdf link data/ zenodo <TOKEN> --name popsize_study
```

You only need to link a remote once. SciDataFlow will look for a project on the
remote with this name first (see `sdf link --help` for more options).
SciDataFlow stores the authentication keys for all remotes in
`~/.scidataflow_authkeys.yml` so you don't have to remember them.

SciDataFlow knows you probably don't want to upload *every* file that you're
keeping track of locally. Sometimes you just want to use SciDataFlow to track
local changes. So, in addition to registering files in the Data Manifest,
you can also tell SciDataFlow that you'd like to *track* them:

```console
$ sdf track data/population_sizes.tsv
```

Now, you can check the status on remotes too with:

```console
$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 population_sizes.tsv      current, tracked      8cb9d10b   2023-09-01 10:48AM (14 minutes ago)      not on remote
```

Then, to upload these files to Zenodo, all we'd do is:

```console
$ sdf push
Info: uploading file "data/population_sizes.tsv" to Zenodo
Uploaded 1 file.
Skipped 0 files.
```

## Retrieving Data from Remotes

A key feature of SciDataFlow is that it can quickly reunite a project's *code*
repository with its *data*. Imagine a colleague had a small repository
containing the code to lift a recombination map over to a new reference genome,
and you'd like to use her methods. However, you also want to check that you can
reproduce her pipeline on your system, which first involves re-downloading all
the input data (in this case, the original recombination map and liftover
files).

First, you'd clone the repository:

```console
$ git clone git@github.com:mclintock/maize_liftover
$ cd maize_liftover/
```

Then, as long as a `data_manifest.yml` exists in the root project directory
(`maize_liftover/` in this example), SciDataFlow is initialized. You can verify
this by using:

```console
$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 recmap_genome_v1.tsv      deleted, tracked      7ef1d10a      exists on remote
 recmap_genome_v2.tsv      deleted, tracked      e894e742      exists on remote
```

Now, to retrieve these files, all you'd need to do is:

```console
$ sdf pull
Downloaded 1 file.
 - population_sizes.tsv
Skipped 0 files. Reasons:
```

Note that if you run `sdf pull` again, it will not redownload the file (this is
to avoid overwriting the local version, should it have been changed):

```console
$ sdf pull
No files downloaded.
Skipped 1 file. Reasons:
  Remote file is identical to local file: 1 file
 - population_sizes.tsv
```

If the file has changed, you can pull in the remote's version with `sdf pull
--overwrite`. However, `sdf pull` is also lazy: it will not download the file
if the MD5s haven't changed between the remote and local versions.
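
The change detection in the examples above is driven by MD5 digests like `8cb9d10b`. As an illustration (not an `sdf` command), you can compute the same kind of digest yourself with standard tools; the assumption here is that the short identifiers shown by `sdf status` are a prefix of the full 32-character MD5:

```shell
# Illustration: compute a file's MD5 digest; the first 8 hex characters
# are assumed to correspond to the short form shown in the status output.
md5sum data/population_sizes.tsv | cut -c1-8
```

If the digest is unchanged on both sides, there is nothing to transfer, which is why repeated `sdf pull` runs are cheap.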

Downloads with SciDataFlow are fast and concurrent, thanks to the [Tokio Rust
Asynchronous Universal download MAnager](https://github.com/rgreinho/trauma)
crate. If your project has a lot of data across multiple remotes, SciDataFlow
will pull all the data in as quickly as possible.

## Retrieving Data from Static URLs

Often we also want to retrieve data from URLs. For example, many genomic
resources are available for download from the [UCSC](http://genome.ucsc.edu) or
[Ensembl](http://ensembl.org) websites as static URLs. We want a record of
where these files come from in the Data Manifest, so we want to combine a
download with an `sdf add`. The command `sdf get` does this all for you — let's
imagine you want to get all human coding sequences. You could do this with:

```console
$ sdf get https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
⠄ [================>                      ] 9639693/22716351 (42%) eta 00:00:08
```

Now, it would show up in the Data Manifest:

```console
$ sdf status --remotes
Project data status:
0 files local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 Homo_sapiens.GRCh38.cds.all.fa.gz      current, untracked      fb59b3ad   2023-09-01 3:13PM (43 seconds ago)      not on remote
```

Note that files downloaded from URLs are not automatically tracked with remotes.
You can do this with `sdf track <FILENAME>` if you want. Then, you can use `sdf
push` to upload this same file to Zenodo or FigShare.

Since modern computational projects may require downloading potentially
*hundreds* or even *thousands* of annotation files, the `sdf` tool has a simple
way to do this: tab-delimited or comma-separated value files (e.g. those with
suffixes `.tsv` and `.csv`, respectively). The big-picture idea of SciDataFlow
is that it should take mere seconds to pull in all data needed for a large
genomics project (or astronomy, or ecology, whatever). Here's an example TSV
file full of links:

```console
$ cat human_annotation.tsv
type	url
cdna	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
fasta	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.alt.fa.gz
cds	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
```

Note that this has a header, and the URLs are in the second column. To get this data, we'd use:

```console
$ sdf bulk human_annotation.tsv --column 2 --header
⠁ [                                       ] 0/2 (0%) eta 00:00:00
⠉ [====>                                  ] 9071693/78889691 (11%) eta 00:01:22
⠐ [=========>                             ] 13503693/54514783 (25%) eta 00:00:35
```

**Column indices are one-based**, and `sdf bulk` assumes no header by
default. Note that in this example, only two files are downloading — this is
because `sdf` detected that the CDS file already existed. SciDataFlow tells you this
with a little message at the end:

```console
$ sdf bulk human_annotation.tsv --column 2 --header
3 URLs found in 'human_annotation.tsv'.
2 files were downloaded, 2 added to manifest (0 were already registered).
1 file was skipped because it already existed (and --overwrite was not specified).
```
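
Before kicking off a large bulk download, it can help to preview exactly which URLs will be fetched. This is a sketch with standard shell tools, not an `sdf` feature; it assumes the tab-delimited layout of the example file above:

```shell
# Print the URLs in column 2, skipping the header row -- a quick
# preview of what `sdf bulk --column 2 --header` would fetch.
awk -F'\t' 'NR > 1 { print $2 }' human_annotation.tsv
```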

Note that one can also download files from URLs that are in the Data Manifest.
Suppose that you clone a repository that has no remotes, but each file entry
has a URL set. Those can be retrieved with:

```console
$ sdf pull --urls  # if you want to overwrite any local files, use --overwrite
```

These may or may not be `tracked`; tracking only indicates whether to *also*
manage them with a remote like Zenodo or FigShare. In cases where the data file
can be reliably retrieved from a stable source (e.g. a website like the UCSC
Genome Browser or Ensembl), you may not want to duplicate it by also tracking
it. If you want to pull in *everything*, use:

```console
$ sdf pull --all
```

## Adding Metadata

Some data repository services like Zenodo allow data depositions to be
associated with a creator's metadata (e.g. full name, email, affiliation).
SciDataFlow automatically propagates this from a file in
`~/.scidataflow_config`. You can set your user metadata (which should be done
early on, sort of like with Git) with:

```console
$ sdf config --name "Joan B. Scientist" --email "[email protected]" --affiliation "UC Berkeley"
```

Projects can also store metadata, such as a title and description. This is
kept in the Data Manifest. You can set it manually with:

```console
$ sdf metadata --title "genomics_analysis" --description "A re-analysis of Joan's data."
```

## SciDataFlow Assets

Good scientific workflows should create shareable **Scientific Assets** that
are *trivial* to download and build upon in your own scientific work.
SciDataFlow makes this possible, since in essence each `data_manifest.yml` file
is like a minimal recipe specification for how to also *retrieve* data. The
`sdf asset` command simply downloads a `data_manifest.yml` from
SciDataFlow-Assets, another GitHub repository, or a URL. After this is
downloaded, all files can be retrieved in one line:

```console
$ sdf asset nygc_gatk_1000G_highcov
$ sdf pull --all
```

The idea of SciDataFlow-Assets is to have an open, user-curated collection of
these recipes at https://github.com/scidataflow-assets. Please contribute
an Asset when you release new data with a paper!

SciDataFlow also has [extensive
documentation](https://vsbuffalo.github.io/scidataflow-doc/) full of
examples of how to use the various subcommands.

## SciDataFlow's Vision

## Installing SciDataFlow

The easiest way to install SciDataFlow is to use the easy install script, which
detects whether you have Rust on your system and, if not, installs it. Then it
installs SciDataFlow via Rust's incredible `cargo` system. To run the easy
install script:

```console
$ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | bash
```

If you are security-conscious, you can check the MD5 or SHA-256 digests as below:

```console
$ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | md5
75d205a92b63f30047c88ff7e3de1a9f

$ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | sha256sum
0a654048b932a237cb93a9359900919188312867c3b7aeea23843272bc616a71  -
```
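
A more cautious variant of the install, sketched with standard tools: download the script once, verify it against the SHA-256 digest shown above (which may change as the script is updated), and only execute it if the digest matches:

```shell
# Download the installer once, check it against the published SHA-256
# digest, and run it only if the digest matches.
curl -fsSL -o easy_install.sh \
    https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh
echo "0a654048b932a237cb93a9359900919188312867c3b7aeea23843272bc616a71  easy_install.sh" \
    | sha256sum --check - && bash easy_install.sh
```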

If you'd like to install the Rust Programming Language manually, [see this
page](https://www.rust-lang.org/tools/install), which instructs you to run: