simple, minimal, human and machine readable specification. But you don't need
to know the specifics — the simple `sdf` command line tool handles it all for
you.

## A Simple Workflow Example

If you'd like to follow the example along, first [install
SciDataFlow](#installing-scidataflow).

The user interacts with the Data Manifest through the fast and concurrent
command line tool `sdf`, written in the inimitable [Rust
language](https://www.rust-lang.org). The `sdf` tool has a Git-like interface.
If you know Git, using it will be easy, e.g. to initialize SciDataFlow for a
project you'd use:

```console
$ sdf init
```

Registering a file in the manifest:

```console
$ sdf add data/population_sizes.tsv
Added 1 file.
```

Checking to see if a file has changed, we'd use `sdf status`:

```console
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      current      3fba1fc3   2023-09-01 10:38AM (53 seconds ago)
```

Now, let's imagine a pipeline runs and changes this file:

```console
$ bash tools/computational_pipeline.sh # changes data
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      changed      3fba1fc3 → 8cb9d10b   2023-09-01 10:48AM (1 second ago)
```

If these changes are good, we can tell the Data Manifest it should update its
record of this version:

```console
$ sdf update data/population_sizes.tsv
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.

[data]
 population_sizes.tsv      current      8cb9d10b   2023-09-01 10:48AM (6 minutes ago)
```

**⚠️ Warning**: SciDataFlow does not do data *versioning*. Unlike Git, it does
not keep an entire history of data at each commit. Thus, **data backup must be
managed by separate software**. SciDataFlow is still in its alpha phase, so it is
especially important that you back up your data *before* using SciDataFlow. A tiny,
kind reminder: as a researcher, you should be doing routine backups *already* —
losing data to either a computational mishap or hardware failure is always
possible.
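
Since backups are explicitly out of SciDataFlow's scope, here is one minimal way to snapshot a project's `data/` directory with standard Unix tools before updating the manifest. This is purely an illustration, not an `sdf` feature:

```shell
# Illustrative only: SciDataFlow does not version data, so snapshot it
# yourself with ordinary tools before running `sdf update`.
# Archive data/ into a date-stamped tarball.
tar -czf "data_backup_$(date +%Y%m%d).tar.gz" data/
```

Any backup tool works here; the point is only that the snapshot happens *before* the manifest's record of the file is updated.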

## Pushing Data to Remote Repositories

SciDataFlow also saves researchers' time when submitting supplementary data to
services like Zenodo or FigShare. Simply link the remote service (you'll need
to first get an API access token from their website):

```console
$ sdf link data/ zenodo <TOKEN> --name popsize_study
```

You only need to link a remote once. SciDataFlow will look for a project on the
remote with this name first (see `sdf link --help` for more options).
SciDataFlow stores the authentication keys for all remotes in
`~/.scidataflow_authkeys.yml` so you don't have to remember them.

SciDataFlow knows you probably don't want to upload *every* file that you're
keeping track of locally. Sometimes you just want to use SciDataFlow to track
local changes. So, in addition to registering files in the Data Manifest,
you can also tell SciDataFlow that you'd like to *track* them:

```console
$ sdf track data/population_sizes.tsv
```

Now, you can check the status on remotes too with:

```console
$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 population_sizes.tsv      current, tracked      8cb9d10b   2023-09-01 10:48AM (14 minutes ago)      not on remote
```

Then, to upload these files to Zenodo, all we'd do is:

```console
$ sdf push
Info: uploading file "data/population_sizes.tsv" to Zenodo
Uploaded 1 file.
Skipped 0 files.
```

## Retrieving Data from Remotes

A key feature of SciDataFlow is that it can quickly reunite a project's *code*
repository with its *data*. Imagine a colleague had a small repository
containing the code to lift a recombination map over to a new reference genome,
and you'd like to use her methods. However, you also want to check that you can
reproduce her pipeline on your system, which first involves re-downloading all
the input data (in this case, the original recombination map and liftover
files).

First, you'd clone the repository:

```console
$ git clone git@github.com:mclintock/maize_liftover
$ cd maize_liftover/
```

Then, as long as a `data_manifest.yml` exists in the root project directory
(`maize_liftover/` in this example), SciDataFlow is initialized. You can verify
this by using:

```console
$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 recmap_genome_v1.tsv      deleted, tracked      7ef1d10a      exists on remote
 recmap_genome_v2.tsv      deleted, tracked      e894e742      exists on remote
```

Now, to retrieve these files, all you'd need to do is:

```console
$ sdf pull
Downloaded 1 file.
 - population_sizes.tsv
Skipped 0 files. Reasons:
```

Note that if you run `sdf pull` again, it will not redownload the file (this is
to avoid overwriting the local version, should it have been changed):

```console
$ sdf pull
No files downloaded.
Skipped 1 file. Reasons:
  Remote file is identical to local file: 1 file
 - population_sizes.tsv
```

If the file has changed, you can pull in the remote's version with `sdf pull
--overwrite`. However, `sdf pull` is also lazy: it will not download the file
if the MD5s haven't changed between the remote and local versions.
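
The change detection in the examples above is driven by MD5 digests like `8cb9d10b`. As an illustration (not an `sdf` command), you can compute the same kind of digest yourself with standard tools; the assumption here is that the short identifiers shown by `sdf status` are a prefix of the full 32-character MD5:

```shell
# Illustration: compute a file's MD5 digest; the first 8 hex characters
# are assumed to correspond to the short form shown in the status output.
md5sum data/population_sizes.tsv | cut -c1-8
```

If the digest is unchanged on both sides, there is nothing to transfer, which is why repeated `sdf pull` runs are cheap.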

Downloads with SciDataFlow are fast and concurrent, thanks to the [Tokio Rust
Asynchronous Universal download MAnager](https://github.com/rgreinho/trauma)
crate. If your project has a lot of data across multiple remotes, SciDataFlow
will pull all the data in as quickly as possible.

## Retrieving Data from Static URLs

Often we also want to retrieve data from URLs. For example, many genomic
resources are available for download from the [UCSC](http://genome.ucsc.edu) or
[Ensembl](http://ensembl.org) websites as static URLs. We want a record of
where these files come from in the Data Manifest, so we want to combine a
download with an `sdf add`. The command `sdf get` does this all for you — let's
imagine you want to get all human coding sequences. You could do this with:

```console
$ sdf get https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
⠄ [================>                      ] 9639693/22716351 (42%) eta 00:00:08
```

Now, it would show up in the Data Manifest:

```console
$ sdf status --remotes
Project data status:
0 files local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 Homo_sapiens.GRCh38.cds.all.fa.gz      current, untracked      fb59b3ad   2023-09-01 3:13PM (43 seconds ago)      not on remote
```

Note that files downloaded from URLs are not automatically tracked with remotes.
You can do this with `sdf track <FILENAME>` if you want. Then, you can use `sdf
push` to upload this same file to Zenodo or FigShare.

Since modern computational projects may require downloading potentially
*hundreds* or even *thousands* of annotation files, the `sdf` tool has a simple
way to do this: tab-delimited or comma-separated value files (e.g. those with
suffixes `.tsv` and `.csv`, respectively). The big-picture idea of SciDataFlow
is that it should take mere seconds to pull in all data needed for a large
genomics project (or astronomy, or ecology, whatever). Here's an example TSV
file full of links:

```console
$ cat human_annotation.tsv
type	url
cdna	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
fasta	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.alt.fa.gz
cds	https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
```

Note that this has a header, and the URLs are in the second column. To get this data, we'd use:

```console
$ sdf bulk human_annotation.tsv --column 2 --header
⠁ [                                       ] 0/2 (0%) eta 00:00:00
⠉ [====>                                  ] 9071693/78889691 (11%) eta 00:01:22
⠐ [=========>                             ] 13503693/54514783 (25%) eta 00:00:35
```

**Column indices are one-based**, and `sdf bulk` assumes no header by
default. Note that in this example, only two files are downloading — this is
because `sdf` detected that the CDS file already existed. SciDataFlow tells you this
with a little message at the end:

```console
$ sdf bulk human_annotation.tsv --column 2 --header
3 URLs found in 'human_annotation.tsv'.
2 files were downloaded, 2 added to manifest (0 were already registered).
1 file was skipped because it already existed (and --overwrite was not specified).
```
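
Before kicking off a large bulk download, it can help to preview exactly which URLs will be fetched. This is a sketch with standard shell tools, not an `sdf` feature; it assumes the tab-delimited layout of the example file above:

```shell
# Print the URLs in column 2, skipping the header row -- a quick
# preview of what `sdf bulk --column 2 --header` would fetch.
awk -F'\t' 'NR > 1 { print $2 }' human_annotation.tsv
```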

Note that one can also download files from URLs that are in the Data Manifest.
Suppose that you clone a repository that has no remotes, but each file entry
has a URL set. Those can be retrieved with:

```console
$ sdf pull --urls  # if you want to overwrite any local files, use --overwrite
```

These may or may not be `tracked`; tracking only indicates whether to *also*
manage them with a remote like Zenodo or FigShare. In cases where the data file
can be reliably retrieved from a stable source (e.g. a website like the UCSC
Genome Browser or Ensembl), you may not want to duplicate it by also tracking
it. If you want to pull in *everything*, use:

```console
$ sdf pull --all
```

## Adding Metadata

Some data repository services like Zenodo allow data depositions to be
associated with a creator's metadata (e.g. full name, email, affiliation).
SciDataFlow automatically propagates this from a file in
`~/.scidataflow_config`. You can set your user metadata (which should be done
early on, sort of like with Git) with:

```console
$ sdf config --name "Joan B. Scientist" --email "[email protected]" --affiliation "UC Berkeley"
```

Projects can also store metadata, such as a title and description. This is
kept in the Data Manifest. You can set it manually with:

```console
$ sdf metadata --title "genomics_analysis" --description "A re-analysis of Joan's data."
```

## SciDataFlow Assets

Good scientific workflows should create shareable **Scientific Assets** that
are *trivial* to download and build upon in your own scientific work.
SciDataFlow makes this possible, since in essence each `data_manifest.yml` file
is like a minimal recipe specification for how to also *retrieve* data. The
`sdf asset` command simply downloads a `data_manifest.yml` from
SciDataFlow-Assets, another GitHub repository, or a URL. After this is
downloaded, all files can be retrieved in one line:

```console
$ sdf asset nygc_gatk_1000G_highcov
$ sdf pull --all
```

The idea of SciDataFlow-Assets is to have an open, user-curated collection of
these recipes at https://github.com/scidataflow-assets. Please contribute
an Asset when you release new data with a paper!

SciDataFlow also has [extensive
documentation](https://vsbuffalo.github.io/scidataflow-doc/) full of
examples of how to use the various subcommands.

## SciDataFlow's Vision

## Installing SciDataFlow

The easiest way to install SciDataFlow is to use the easy install script, which
detects whether you have Rust on your system and, if not, installs it. Then it
installs SciDataFlow via Rust's incredible `cargo` system. To run the easy
install script:

```console
$ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | bash
```

If you are security-conscious, you can check the MD5 or SHA-256 digests as below:

```console
$ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | md5
75d205a92b63f30047c88ff7e3de1a9f

$ curl https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh | sha256sum
0a654048b932a237cb93a9359900919188312867c3b7aeea23843272bc616a71  -
```
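
A more cautious variant of the install, sketched with standard tools: download the script once, verify it against the SHA-256 digest shown above (which may change as the script is updated), and only execute it if the digest matches:

```shell
# Download the installer once, check it against the published SHA-256
# digest, and run it only if the digest matches.
curl -fsSL -o easy_install.sh \
    https://raw.githubusercontent.com/vsbuffalo/scidataflow/main/easy_install.sh
echo "0a654048b932a237cb93a9359900919188312867c3b7aeea23843272bc616a71  easy_install.sh" \
    | sha256sum --check - && bash easy_install.sh
```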

If you'd like to install the Rust Programming Language manually, [see this
page](https://www.rust-lang.org/tools/install), which instructs you to run: