Feature request: Optionally give an error and exit when a remote file has a different md5 than the original data manifest #10
Comments
Hi Nick, I'm so glad to hear you're liking SciDataFlow, and that you're using it in a bunch of projects! I think I see what you're getting at here but let me check. SDF currently has
Yep, you got it. My original idea for how to implement it was indeed a command-line argument, as in:

sdf status --locked  # exit code 1 if a file is changed

That said, though I hadn't considered it before, having it as a global option is appealing too, both from an under-the-hood implementation perspective and from a usage perspective. From the implementation perspective, you could add a boolean locked field to DataFile:

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct DataFile {
    pub path: String,
    pub tracked: bool,
    pub md5: String,
    pub size: u64,
    pub url: Option<String>,
    pub locked: bool, // new locked field
}

From a usage perspective, I could see it being useful like this:

# initialize the project
sdf init
# add files you'd like to track, with it being important that file3 never changes
sdf add file1
sdf add file2
sdf add --locked file3
#
# run a bunch of other project code to generate reproducible results
#
sdf status # gives error if the locked file3 was changed

Hopefully that makes sense! Curious which of these implementations is more appealing to you.
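To make the per-file idea concrete, here is a minimal sketch of how a status check could use the locked field. It is not SciDataFlow's actual code: the helper names changed_locked_files and status_exit_code are hypothetical, the serde derives are dropped so the snippet compiles with the standard library alone, and the current MD5 values are assumed to have been computed elsewhere and passed in as a map.

```rust
use std::collections::HashMap;

/// Simplified manifest entry (same fields as the struct above, minus the serde derives).
#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub path: String,
    pub tracked: bool,
    pub md5: String,
    pub size: u64,
    pub url: Option<String>,
    pub locked: bool, // new locked field
}

/// Hypothetical helper: paths of locked files whose current MD5 differs
/// from the MD5 recorded in the manifest.
/// `current_md5` maps path -> MD5 of the file as it exists now (assumed to be
/// computed elsewhere); files missing from the map are skipped.
pub fn changed_locked_files<'a>(
    files: &'a [DataFile],
    current_md5: &HashMap<String, String>,
) -> Vec<&'a str> {
    files
        .iter()
        .filter(|f| f.locked)
        .filter(|f| current_md5.get(&f.path).map_or(false, |m| *m != f.md5))
        .map(|f| f.path.as_str())
        .collect()
}

/// Hypothetical helper: exit code for a status run, 1 if any locked file
/// changed and 0 otherwise. Changes to unlocked files never cause a failure.
pub fn status_exit_code(files: &[DataFile], current_md5: &HashMap<String, String>) -> i32 {
    if changed_locked_files(files, current_md5).is_empty() {
        0
    } else {
        1
    }
}
```

With this shape, a changed file1 or file2 would still be reported as modified but would not fail the run; only a changed file3 would make the status call return a non-zero exit code.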
This looks good to me! A few comments/ideas:
Hi Vince,
Love the tool so far. I've been integrating it into more of my repos, especially publication code, and agree that a tool like this that is language-ecosystem agnostic was sorely needed. So, thanks for your work!
I have a Nextflow pipeline I'm developing that starts off by pulling some auxiliary data files for use downstream. Previously, I was just using wget with a hardcoded URL, whereas now I have the URLs in a data manifest. For the sake of a publication, I'd like SciDataFlow to give an error if the data manifest URL links to a file with a different md5 than was previously in the manifest. Think of it as an immutable or "locked" asset in the data manifest. This would enhance reproducibility in that the workflow would prevent future users from using different auxiliary data than was used in the manuscript.

Curious if you see any value in this enhancement, or if other tools might be better suited here. If you do see some value here, I'd be happy to work on a PR for your review, probably involving an optional "locked" clap parameter in sdf pull, sdf status, or both, where the default behavior would be to leave the asset unlocked and mutable.

Thanks again for your good work,
--Nick
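As a rough illustration of the requested behavior on the pull side, the sketch below compares the MD5 of a freshly downloaded file with the MD5 in the manifest and returns an error only when the asset is locked. Everything here is an assumption for discussion: verify_pulled and LockedMismatch are hypothetical names, not part of SciDataFlow, and the downloaded file's MD5 is assumed to be computed elsewhere.

```rust
use std::fmt;

/// Hypothetical error: a locked asset's downloaded content no longer
/// matches the MD5 recorded in the data manifest.
#[derive(Debug, PartialEq)]
pub struct LockedMismatch {
    pub path: String,
    pub expected_md5: String,
    pub actual_md5: String,
}

impl fmt::Display for LockedMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "locked file '{}' changed: manifest md5 {}, remote md5 {}",
            self.path, self.expected_md5, self.actual_md5
        )
    }
}

impl std::error::Error for LockedMismatch {}

/// Hypothetical check run after a download.
/// Returns Ok(true) if the hashes match, Ok(false) if they differ but the
/// asset is unlocked (the default, mutable behavior), and Err if they differ
/// and the asset is locked.
pub fn verify_pulled(
    path: &str,
    manifest_md5: &str,
    downloaded_md5: &str,
    locked: bool,
) -> Result<bool, LockedMismatch> {
    // Hex digests are compared case-insensitively.
    if manifest_md5.eq_ignore_ascii_case(downloaded_md5) {
        return Ok(true);
    }
    if locked {
        Err(LockedMismatch {
            path: path.to_string(),
            expected_md5: manifest_md5.to_string(),
            actual_md5: downloaded_md5.to_string(),
        })
    } else {
        Ok(false)
    }
}
```

A caller could print the error and exit with a non-zero status, which is what would let a Nextflow process fail fast when the remote file has drifted from what the manuscript used.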