From 0dff76a93b59a8a8f91ffe30df4446d8d8c1bc5b Mon Sep 17 00:00:00 2001
From: Vince Buffalo
Date: Thu, 16 Nov 2023 12:26:01 -0800
Subject: [PATCH] new shortened readme, since doc is on different site

---
 README.md | 309 +----------------------------------------------------
 1 file changed, 4 insertions(+), 305 deletions(-)

diff --git a/README.md b/README.md
index 32cc1cd..6ae1c02 100644
--- a/README.md
+++ b/README.md
@@ -31,297 +31,11 @@

simple, minimal, human- and machine-readable specification. But you don't need
to know the specifics — the simple `sdf` command line tool handles it all for
you.

## A Simple Workflow Example

If you'd like to follow along with the example, first [install
SciDataFlow](#installing-scidataflow).

The user interacts with the Data Manifest through the fast and concurrent
command line tool `sdf`, written in the inimitable [Rust
language](https://www.rust-lang.org). The `sdf` tool has a Git-like interface;
if you know Git, using it will be easy. For example, to initialize SciDataFlow
for a project, you'd use:

```console
$ sdf init
```

Registering a file in the manifest:

```console
$ sdf add data/population_sizes.tsv
Added 1 file.
```

To check whether a file has changed, we'd use `sdf status`:

```console
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      current      3fba1fc3   2023-09-01 10:38AM (53 seconds ago)
```

Now, let's imagine a pipeline runs and changes this file:

```console
$ bash tools/computational_pipeline.sh # changes data
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      changed      3fba1fc3 → 8cb9d10b   2023-09-01 10:48AM (1 second ago)
```

If these changes are good, we can tell the Data Manifest to update its record
of this version:

```console
$ sdf update data/population_sizes.tsv
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      current      8cb9d10b   2023-09-01 10:48AM (6 minutes ago)
```

**⚠️ Warning**: SciDataFlow does not do data *versioning*. Unlike Git, it does
not keep an entire history of data at each commit. Thus, **data backup must be
managed by separate software**. SciDataFlow is still in its alpha phase, so it
is especially important that you back up your data *before* using SciDataFlow.
A tiny, kind reminder: as a researcher, you should be doing routine backups
*already* — losing data to either a computational mishap or hardware failure
is always possible.

## Pushing Data to Remote Repositories

SciDataFlow also saves researchers time when submitting supplementary data to
services like Zenodo or FigShare. Simply link the remote service (you'll first
need to get an API access token from their website):

```console
$ sdf link data/ zenodo --name popsize_study
```

You only need to link a remote once. SciDataFlow will first look for a project
on the remote with this name (see `sdf link --help` for more options).
SciDataFlow stores the authentication keys for all remotes in
`~/.scidataflow_authkeys.yml` so you don't have to remember them.

SciDataFlow knows you probably don't want to upload *every* file that you're
keeping track of locally. Sometimes you just want to use SciDataFlow to track
local changes. So, in addition to registering files in the Data Manifest, you
can also tell SciDataFlow you'd like to *track* them:

```console
$ sdf track data/population_sizes.tsv
```
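Tracking is per-file, so projects with many data files can lean on the shell
for the repetition. A minimal sketch, assuming the files have already been
added to the manifest and all sit under `data/` with a `.tsv` extension (a
hypothetical layout):

```console
$ for f in data/*.tsv; do sdf track "$f"; done
```

This simply calls `sdf track` once per file, exactly as in the single-file
example above.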
Now, you can check the status on remotes too with:

```console
$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 population_sizes.tsv      current, tracked      8cb9d10b   2023-09-01 10:48AM (14 minutes ago)      not on remote
```

Then, to upload these files to Zenodo, all we'd do is:

```console
$ sdf push
Info: uploading file "data/population_sizes.tsv" to Zenodo
Uploaded 1 file.
Skipped 0 files.
```

## Retrieving Data from Remotes

A key feature of SciDataFlow is that it can quickly reunite a project's *code*
repository with its *data*. Imagine a colleague had a small repository
containing the code to lift a recombination map over to a new reference
genome, and you'd like to use her methods. However, you also want to check
that you can reproduce her pipeline on your system, which first involves
re-downloading all the input data (in this case, the original recombination
map and liftover files).

First, you'd clone the repository:

```console
$ git clone git@github.com:mclintock/maize_liftover
$ cd maize_liftover/
```

Then, as long as a `data_manifest.yml` exists in the root project directory
(`maize_liftover/` in this example), SciDataFlow is initialized. You can
verify this by using:

```console
$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 recmap_genome_v1.tsv      deleted, tracked      7ef1d10a      exists on remote
 recmap_genome_v2.tsv      deleted, tracked      e894e742      exists on remote
```

Now, to retrieve these files, all you'd need to do is:

```console
$ sdf pull
Downloaded 1 file.
 - population_sizes.tsv
Skipped 0 files. Reasons:
```

Note that if you run `sdf pull` again, it will not redownload the file (this
is to avoid overwriting the local version, should it have been changed):

```console
$ sdf pull
No files downloaded.
Skipped 1 files. Reasons:
  Remote file is identical to local file: 1 file
   - population_sizes.tsv
```

If the file has changed, you can pull in the remote's version with `sdf pull
--overwrite`. However, `sdf pull` is also lazy; it will not download the file
if the MD5s haven't changed between the remote and local versions.

Downloads with SciDataFlow are fast and concurrent thanks to the [Tokio Rust
Asynchronous Universal download MAnager](https://github.com/rgreinho/trauma)
crate. If your project has a lot of data across multiple remotes, SciDataFlow
will pull all data in as quickly as possible.
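Taken together, reuniting a colleague's code with its data and re-running the
analysis takes only a handful of commands. A minimal sketch, reusing the
hypothetical repository and pipeline script names from the examples above:

```console
$ git clone git@github.com:mclintock/maize_liftover
$ cd maize_liftover/
$ sdf pull                               # fetch all data tracked in data_manifest.yml
$ bash tools/computational_pipeline.sh   # re-run the analysis
$ sdf status                             # see which outputs changed relative to the manifest
```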
## Retrieving Data from Static URLs

Often we also want to retrieve data from URLs. For example, many genomic
resources are available for download from the [UCSC](http://genome.ucsc.edu)
or [Ensembl](http://ensembl.org) websites as static URLs. We want a record of
where these files come from in the Data Manifest, so we want to combine a
download with an `sdf add`. The command `sdf get` does this all for you — let's
imagine you want to get all human coding sequences. You could do this with:

```console
$ sdf get https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
⠄ [================>                      ] 9639693/22716351 (42%) eta 00:00:08
```

Now, it would show up in the Data Manifest:

```console
$ sdf status --remotes
Project data status:
0 files local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 Homo_sapiens.GRCh38.cds.all.fa.gz      current, untracked      fb59b3ad   2023-09-01 3:13PM (43 seconds ago)      not on remote
```

Note that files downloaded from URLs are not automatically tracked with
remotes. You can do this with `sdf track <filename>` if you want. Then, you
can use `sdf push` to upload this same file to Zenodo or FigShare.

Since modern computational projects may require downloading potentially
*hundreds* or even *thousands* of annotation files, the `sdf` tool has a
simple way to do this in bulk: tab-delimited or comma-separated value files
(e.g. those with suffixes `.tsv` and `.csv`, respectively). The big-picture
idea of SciDataFlow is that it should take mere seconds to pull in all data
needed for a large genomics project (or astronomy, or ecology, whatever).
Here's an example TSV file full of links:

```console
$ cat human_annotation.tsv
type	url
cdna	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
fasta	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.alt.fa.gz
cds	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
```

Note that this has a header, and the URLs are in the second column. To get
this data, we'd use:

```console
$ sdf bulk human_annotation.tsv --column 2 --header
⠁ [                                       ] 0/2 (0%) eta 00:00:00
⠉ [====>                                  ] 9071693/78889691 (11%) eta 00:01:22
⠐ [=========>                             ] 13503693/54514783 (25%) eta 00:00:35
```

**Column indices are one-indexed**, and `sdf bulk` assumes no header by
default. Note that in this example, only two files are downloading — this is
because `sdf` detected that the CDS file already existed. SciDataFlow tells
you this with a little message at the end:

```console
$ sdf bulk human_annotation.tsv --column 2 --header
3 URLs found in 'human_annotation.tsv'.
2 files were downloaded, 2 added to manifest (0 were already registered).
1 files were skipped because they existed (and --overwrite was not specified).
```
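Because these bulk files are plain TSVs, they are also easy to build
programmatically. A minimal sketch that generates a one-entry version of the
example file above from the shell (the Ensembl URL is the same one used
earlier):

```console
$ printf 'type\turl\n' > human_annotation.tsv
$ printf 'cds\thttps://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz\n' >> human_annotation.tsv
$ sdf bulk human_annotation.tsv --column 2 --header
```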
Note that one can also download files from URLs that are in the Data Manifest.
Suppose that you clone a repository that has no remotes, but each file entry
has a URL set. Those files can be retrieved with:

```console
$ sdf pull --urls  # if you want to overwrite any local files, use --overwrite
```

These may or may not be `tracked`; tracking only indicates whether to *also*
manage them with a remote like Zenodo or FigShare. In cases where the data
file can be reliably retrieved from a stable source (e.g. a website like the
UCSC Genome Browser or Ensembl), you may not want to duplicate it by also
tracking it. If you want to pull in *everything*, use:

```console
$ sdf pull --all
```

## Adding Metadata

Some data repository services like Zenodo allow data depositions to be
associated with a creator's metadata (e.g. full name, email, affiliation).
SciDataFlow automatically propagates this from a file in
`~/.scidataflow_config`. You can set your user metadata (which, as with Git,
should be done early on) with:

```console
$ sdf config --name "Joan B. Scientist" --email "joanbscientist@berkeley.edu" --affiliation "UC Berkeley"
```

Projects can also store metadata, such as a title and description. This is
kept in the Data Manifest. You can set this manually with:

```console
$ sdf metadata --title "genomics_analysis" --description "A re-analysis of Joan's data."
```

## SciDataFlow Assets

Good scientific workflows should create shareable **Scientific Assets** that
are *trivial* to download and build upon in your own scientific work.
SciDataFlow makes this possible, since in essence each `data_manifest.yml`
file is a minimal recipe for how to *retrieve* data, too. The `sdf asset`
command simply downloads a `data_manifest.yml` from SciDataFlow-Assets,
another GitHub repository, or a URL. After this is downloaded, all files can
be retrieved in one line:

    $ sdf asset nygc_gatk_1000G_highcov
    $ sdf pull --all

The idea of SciDataFlow-Assets is to have an open, user-curated collection of
these recipes at https://github.com/scidataflow-assets. Please contribute an
Asset when you release new data with a paper!

## Documentation

SciDataFlow has [extensive
documentation](https://vsbuffalo.github.io/scidataflow-doc/) full of examples
of how to use the various subcommands.

## SciDataFlow's Vision

@@ -390,21 +104,6 @@
asset; all it takes is a mere `sdf pull --overwrite`.

## Installing SciDataFlow

The easiest way to install SciDataFlow is to use the easy install script,
which detects whether you have Rust on your system and, if not, installs it.
Then it installs SciDataFlow via Rust's incredible `cargo` system. To run the
easy install script:

    $ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | bash

If you are security-conscious, you can check the MD5 or SHA-256 digests as
below:

    $ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | md5
    75d205a92b63f30047c88ff7e3de1a9f

    $ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | sha256sum
    0a654048b932a237cb93a9359900919188312867c3b7aeea23843272bc616a71  -

If you'd like to install the Rust Programming Language manually, [see this
page](https://www.rust-lang.org/tools/install), which instructs you to run:
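    # the command that page shows (the standard rustup installer):
    $ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh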