Dear Jake:
I read your paper with interest, and I do have some comments. I will send them here, because I am still learning to use GitHub. (I made a co-author write a book on it, so I have some experience, but still pretty limited.)
Of course, I agree completely with the emphasis on reproducible workflow. But having worked on very long-term, complex data collection and analysis projects, I put the emphasis on somewhat different workflow challenges than the ones that emerge in your article. Your article focuses principally on a reproducible workflow for a single collaboration and/or a single article. As you are aware, though, long-term data collection and analysis projects involve decisions that cannot be handled within the framework you lay out, and for many people working in areas that require original data collection, this will be the case. The main, and chronologically prior, decisions have to do with directory structure, an issue you hardly touch on. I think the tools currently available to us do not yet solve the inherent directory-structure issues, especially those that cut across multiple “projects,” and some of the advice you give (about where to keep data and how to name paths) may be misguided over the long term.
Let me give you an example to illustrate and then comment on what I see as the outstanding issues.
I have 3 GB of data on Italian corruption, divided across more than 5,000 files, that was collected, cleaned, assembled, and analyzed over more than ten years with at least five separate groups of collaborators, producing about a dozen separate publications. I started collecting the data before GitHub even existed. Nonetheless, I can find any piece of data in these files in less than five minutes. This is because I have a very clear, standardized, and well-documented directory structure for organizing the raw data, for cleaning it, for assembling and merging, for creating new variables, and then for returning to correct errors in the original data that may be discovered years down the line.
This kind of large-scale, complex data collection and assembly process cannot be managed within either GitHub or Dropbox, in my experience. Dropbox is particularly pernicious for anything other than sharing a small number of files with a group, because it requires moving the pertinent directories out of their original location on the computer. You note that it is easy to exceed the Dropbox size constraints, which is true, but even more problematic is that moving files into a Dropbox folder inevitably means forgetting where to put them back, especially if you have to assemble into a single Dropbox folder many files from disparate locations. So Dropbox is terrific for very tightly organized, short-term, limited file sharing, but it will cause chaos for more complex collaborations. (In fact, I use SugarSync for giving collaborators access to files that I am reluctant to move out of place.)
As I see it, GitHub is also inadequate to this particular task. Git is terrific for version control and for collaboration on plain-text files (e.g., .tex, .R, or .do). It is problematic for collaboration on binary or compressed files (e.g., .docx and .xlsx), whose changes cannot be merged line by line, which leads to conflicts and lost work. More importantly, I do not see a way to seamlessly lay Git on top of the kind of complex directory structure required for multiple collaborations that use the same underlying data. The main reason is as follows.
Say I have the thousands and thousands of raw data files that I have collected for a complex data project (like Italian corruption). There is one subdirectory in the project directory that assembles the data; this subdirectory itself includes 200+ .do files and reaches into dozens of other directories to grab the raw data. One reason the assembly process involves 200+ files is that Stata has (or used to have) a limit on how long a do-file could be, which was easily exceeded when it took thousands of lines of code to assemble a complex dataset. And even if that is no longer true, or if the same limitation does not hobble R, modularity requires dividing the code into small chunks. There is also a master.do file in the assembly directory, which fulfills a function similar to GNU make; i.e., it allows me to run the entire data cleaning and assembly process from a single file, which also includes the lines of code that produce the public version of the dataset and the codebook for Dataverse. But these files can only be run in place, because they have a lot of absolute (rather than relative) path names in them. Why is this? First, because when I started coding I was not aware of the importance of relative path names (but I am now, so that objection no longer stands); second, because if you are going to run thousands of files thousands of times in order to generate a clean dataset, and you want to work in a modular fashion, there have to be lots of absolute path names embedded in the files. That is, I don’t want to have to re-run thousands of files every time I’m working on one small area of the data cleaning process; but to avoid that, I have to embed an absolute path name in each segment of the process. You could remove them when you’re done, I suppose, but you’re never done when you think you are.
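To make the structure concrete, here is a minimal hypothetical sketch of what I mean; the directory and file names are invented for illustration, not copied from my actual project. The master file runs the whole assembly, and each module can also be run on its own while I am working on that step, precisely because the path to the project root is written into it:

```stata
* master.do -- hypothetical sketch of the assembly driver (all names invented)
* Running this regenerates the cleaned dataset and the public Dataverse version.
global PROJ "C:/data/corruption"          // absolute root on the master machine

do "$PROJ/assembly/prep_contracts.do"     // pulls raw files from $PROJ/raw/...
do "$PROJ/assembly/merge_sources.do"
do "$PROJ/assembly/clean_variables.do"
do "$PROJ/assembly/make_public.do"        // writes the public dataset and codebook

* clean_variables.do -- one module; runnable by itself because it re-declares
* the project root rather than inheriting it from master.do
global PROJ "C:/data/corruption"
use "$PROJ/working/merged.dta", clear
* ... recodes and new variables ...
save "$PROJ/working/clean.dta", replace
```

The same layout would work with the root defined once per machine, so that everything below it is effectively relative, which is the discipline I now try to follow; the sketch just shows why, as the files currently stand, they only run in place.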
Those absolute path names tie the process of dataset creation to an identifiable location on one computer, making it very difficult to move the project to GitHub, for instance. I can of course take the resulting dataset and move it to GitHub, setting up a separate repo for each collaboration/paper that uses the final assembled dataset. (Although obviously, over many years and many projects, there will actually be multiple “final” datasets.) But then when we discover some kind of error early in the dataset cleaning and assembly process, I have to go back to the original location and fix the problem there, regenerating the dataset. If I don’t, I risk forgetting that I fixed the problem and working with erroneous data later in one of the other collaborations. I do not see any way to incorporate the underlying dataset assembly process into each of the subsequent collaborations; that has to remain separate, and in a single definable location on my computer (or on the “master” project computer).
So the obstacle I encounter is that I do not see any way to incorporate into the Git workflow the distinct processes of (1) data collection and dataset assembly and (2) the multiple, different collaborations that follow. There may be ways to do this that I am unfamiliar with, but I have never seen a description in the literature on reproducible and transparent workflows that takes into account the genuine complexity of the data collection and assembly process in our discipline.
Reproducible workflow for a single article is now easy; the combination of Dropbox, Git, R, and Markdown allows that to happen. Integrating coding into writing with R and Markdown, which I am just learning to do, is particularly beautiful for attaching analysis to written output. My students are all doing this, or at least claim they can. But it does not speak to the separate set of very complex organizational and management problems that arise out of the processes of data collection, cleaning, dataset assembly, and subsequent collaborations. It is possible to confine the collaborative analysis and writing process to a reproducible workflow, but I do not see how to make the entire process, from the initial data collection onward, transparent and reproducible within a single unified framework.
Likewise, my students still have a lot of trouble conceptualizing the process from beginning to end, and because of that they are not able to set up a standardized, well-documented, and identifiable directory structure; they forget basic things, like putting the raw data in a separate directory and adding a README that records the source of the data (including the download date, if applicable). They still have trouble remembering to label files appropriately (lead with the date, so the files sort chronologically); they are better able to incorporate basic concepts of modularity into their work (distinguishing parts of the cleaning and assembly process with different prefixes, e.g. prep_, merge_, clean_, analy_), but their ability to work using only relative and interchangeable path names (interchangeable across analysts) is still weak. To me, these habits are actually more important over the long term than integrating analysis and writing. I can figure out what I did in an article I wrote ten years ago even without R and Markdown, because I have a good directory structure that is (generally, but not always!) well documented. A clear, consistent, well-documented directory structure is, in other words, more important than almost anything else, yet it is not the focus of discussions of reproducibility.
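To give them something concrete, this is the kind of layout I ask for at the start of a project; it is a hypothetical sketch, with invented names, rather than a description of any particular project:

```
corruption/
├── raw/                       # original files, never edited by hand
│   ├── README.md              # source, URL, and download date for each file
│   └── 2008-03-14_contracts.csv
├── assembly/
│   ├── master.do              # runs the entire cleaning and assembly process
│   ├── prep_contracts.do
│   ├── merge_sources.do
│   └── clean_variables.do
├── working/                   # intermediate datasets; can always be regenerated
├── final/                     # assembled dataset(s), codebook, Dataverse version
└── docs/                      # codebooks, variable notes, coding decisions
```

Nothing about this is clever; the point is simply that the structure is decided once, documented, and kept stable for years.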
So I would emphasize, more than you and Maarten currently do, the importance of simply being really organized and of thinking about the complex processes of workflow over the long term. Things like Git and R and Markdown are helpful (but not essential) tools; of these, I would say that version control is the most important, whereas R/Markdown is simply a convenience. But nothing beats having a good directory structure.
Another set of issues that impinges on the process, and that you do not discuss, regards what should be made available publicly. I have examined a number of materials made available via the AER, which now requires accompanying datasets and code. They are of highly varying quality; in many cases it is simply the code that produces the final tables, and documentation of the hundreds of preceding decisions that went into coding up the data is not included. Clearly, we need to begin a conversation about the standards that should be used for making “reproducible” research public, and this is a distinct topic. I don’t expect your article to tackle it, but I think you should mention that collaborating with your future selves is different from figuring out what to post on a public site to accompany an article, or how to prepare a well-documented dataset for Dataverse.
So these are my thoughts. I loved your piece, and I learned a lot from it — for instance, I did not know about GNU make, and now I will use that. Also, I will explore some of the software you mention (Flow, Asana) and see if they might work for me.