-
Notifications
You must be signed in to change notification settings - Fork 12
/
workflow.Rmd
137 lines (101 loc) · 13.2 KB
/
workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
output:
html_document:
theme: journal
toc: true
includes:
after_body: ../linking_script.html
---
<!-- title: "10 Things You Need to Know About Project Workflow" -->
<!-- author: "Methods Guide Author: Matthew Lisiecki" -->
1. People have limited memories
==
There are limits to human memory. Since most experimental research requires at least months if not years of design, monitoring, analysis, and reporting, few researchers can maintain mental oversight of all of a project’s moving pieces over time. Introduce additional investigators into the mix, and the questions of who did what, when, and why (if not how) multiply and become harder to answer. As replication becomes more important (by the original project team or outside researchers), maintaining a written record of decisions, actions, and questions becomes essential. Bowers and Voors (2016)[^1] provide a framework and steps for improving project’s workflow; this guide draws upon their paper and upon additional tools aimed at documenting the important choices made by researchers and effectively communicating those choices to the project team.
[^1]: Bowers, J. & Voors, M. 2016. How to Improve Your Relationship With Your Future Self. Revista de Ciencia Política. Vol. 36, Issue 3.
2. Effective data analysis is aided by coding
==
The best way to leave an easy-to-retrace path (for your future self and others) in data analysis is to produce all outputs through code. Opening Excel to make one change to one table may seem the quickest way to complete the task, but when you are trying to remember what you did months later, you will be glad to have updated your R script or do file to make the desired edit. To maximize re-traceability, leave comments in your code file to explain the purpose of each line, as in the example below.
```{r, eval=FALSE}
# This file produces a plot relating the explanatory variable to the outcome.
## Read the data
thedata <- read.csv(“Data/thedata-15-03-2011.csv”)
## begin writing to the pdf file
please-open-pdf(“g1.pdf”)
please-plot(outcome by explanatory using thedata. red lines
please.)
please-add-a-line(using model1)
## Note to self: a quadratic term does not add to the substance
## model2 <- please-fit(outcome by explanatory+explanatory^2
using thedata
## summary(abs(fitted(model1)-fitted(model2)))
## stop writing to the pdf file
please-close-pdf()
```
Coding has other benefits. It saves time by systematically completing repetitive tasks that could be done manually but with greater time investment. It leads to smoother collaboration by communicating progress and next steps. Finally, it helps researchers avoid making mistakes by combining multiple steps into one file where the order of steps and the relationships among steps is clear.
3. Coding is communication: to yourself, to co-investigators, and to the future
==
Data analysis requires a series of decisions, large and small. These decisions will be made both by individual researchers and by research teams. The why and how questions that will arise when you or others attempt to reproduce your work can only be answered if you can remember and communicate about these decisions.
One way to record decisions made during the coding process is to use the **comment** feature of whichever analysis software you are using (for example, lines beginning with * in Stata, # in R are not executed but are understood as notes). Comments are unexecuted text within the script, and can explain the purpose of each line or section to your collaborators or future self. You can see some commented code in the chunk of R code in point 2.
Coding and communication can be linked even tighter by combining them in the same file. Using markup systems like R Markdown (see section 4 below) or R+LaTeX[^2], you can place a code chunk that produces a table or figure right where that table/figure will be in the document. If you make a change to the code that produces that table/figure, it appears in the document without any additional work. The R code in point 2 starts with ` ```{r codechunk1}` which tells R+markdown that the next bit of text is R code to be executed and ends with ` ``` ` which tells R+markdown to go back to interpreting the text as prose. For example, in markdown one uses two asterisks to make text **bold** (```**like this**```) but two asterisks would confuse R (there is no R command with two asterisks), so the idea of a code chunk allows two or more kinds of text to be mixed in a single document. Coauthors and editors can also review the manuscript together with the tables and figures, rather than scrolling back and forth between the two. When it comes time to produce the final version of the document, the code chunks can be hidden to display only the relevant table/figure.[^3] The idea is to type the document in plain text but that external programs (including R) can compile so that they end up looking like nice, formatted PDFs. You can use Markdown on Windows and Mac OSX, and you can also edit the file in an ordinary text editor. Jupyter notebooks are another recent open source approach to mixing prose text and code text which emphasize the python language but which can also include R.[^4]
[^2]: Called, anachronistically, Sweave
[^3]: Note that all of the EGAP methods guides that produce figures or tables are produced this way (see https://github.com/egap/methods-guides)
[^4]: There are many ways to type text that includes code such that parenthesis are automatically closed, spelling is checked, or even help is provided about the code itself. RStudio is an integrated development environment (IDE) that includes a nice editor. Some of us use older tools that are more widespread such as vim, or eMacs. Others any of the contemporary improvements on those old tools such as Atom.
4. A useful tool for combining code and writeup: R Markdown
==
R Markdown is a format for making documents within R using a plain text format. When compiled, these text formats end up looking like nice, formatted PDFs. You can use Markdown on Windows and Mac OSX, and you can also edit the file in an ordinary text editor.
For more information, see the [Metaketa III committee R Markdown Manual](https://github.com/egap/methods-guides/blob/master/workflow/metaketa3_Rmarkdown.pdf).
5. Data analysis requires clear routes between inputs and outputs
==
Nearly every script begins by calling in the relevant data for analysis. This should be done with a line that produces the same output on every computer. This, in turn, requires that collaborators need to have access to a shared file structure for storing project materials, about which more in section 7.
Within that shared file structure, the organization of files can be a very helpful map for identifying what went where. Nagler [^5] outlines the principle of modularity, which suggest that you separate files by function (namely making data cleaning/processing/recoding/merging distinct from analysis). Storing the original data file in a folder labeled “raw” and the cleaned/recoded datafile in one marked “clean” tells you and your collaborators where to find each without searching too hard. If the folder structure starts to get overly complicated, consider using a README file (a brief text file) to explain where the relevant input and output files for your analysis are stored.
[^5]: Nagler, Jonathan. 1995. “Coding Style and Good Computing Practices”. PS: Political Science and Politics 28(3): 488-92.
6. Version control keeps track of who did what and when, and makes sure that work is not overwritten
==
Even teams with perfect file structure in their shared folder can run into issues without effective version control. Version control allows teams to:
* Determine what changed between different versions of the same document;
* Experiment within a document, delete the parts that did not work, and merge in the parts that did;
* Produce multiple versions of the same document (such as one for a journal and one for a conference);
* Work on a file without undoing the work that someone else on your team happened to do at the same time
One way to practice effective version control is to use the “track changes” function within word processors. This allows one author to make edits in a way that others can quickly identify, review, and “accept” or “reject.” In Google Docs[^6], this feature (which Google calls “suggestion” mode) can be combined with the ability for multiple authors to work on the document at once time (assuming that everyone has a reliable internet connection).
[^6]: https://docs.google.com
In situations with less reliable internet access, Dropbox[^7] can provide shared access to a folder or series of folders where project materials are stored. One author can make an edit and sync their account, and then another can view the latest version. Authors cannot make changes at the same time, but they can hand off to each other with greater ease. This can be aided by adopting a file naming structure that tells you when the latest changes were made. Naming a file ```document.docx``` does not communicate this; naming it ```yyyymmdd_document.docx``` does (it also allows you to sort the folder by name and see the most recently edited version first. If the versions start to pile up, consider creating an archive subfolder to store all but the most recent version.
[^7]: https://www.dropbox.com
7. A useful tool for version control: GitHub
==
When collaboration involves plain text files (i.e., code or files that combine code and prose), we recommend following the best practices of community of software developers (who make a living from working together in teams to write reliable code in an efficient way). Currently, the standard tool to enable this collaboration is Github.
GitHub is a code hosting platform to streamline issues related to version control and working as a group. It lets you and others work together on projects from anywhere. The core GitHub workflow includes each user copying, or “cloning”, a version of a folder (or folder structure) to their own desktop, so that the latest version from the repository is on a local device, making changes to the desired files, and pushing those files to the shared repository. At the beginning of a work session one “pulls” or “synchronizes” any changes made by other team members online with the local copy. The git system tracks all changes, and contains functionality to revert back to previous versions as needed.
For more information, see [Metaketa III committee GitHub Manual](https://github.com/egap/methods-guides/blob/master/workflow/metaketa3_github.pdf).
8. Your future self--and others--will want to be able to access your data
==
Science involves learning from the work of others in addition to remembering about the past work of oneself. If one works in such a way that a future self can remember, then, often that work will be easy to use by others as they themselves attempt to extend the science or even just learn how to apply your good ideas to their own context.[^8] There are a number of tools to make storing and sharing files for replication easier, including GitHub and the Center for Open Science’s Open Science Framework (OSF). OSF recently ran a [replication project](http://www.socialsciencesreplicationproject.com) aimed at understanding “predictors of reproducibility of research results, and how low reproducibility may inhibit efficient accumulation of knowledge.”
Data should be stored in places like a Dataverse or the ICPSR or other places with a plan for daily backups band also a plan for data to be preserved and made available for many years into the future (in contrast to your favorite homemade website).
[^8]: King, Gary. 1995. “Replication, Replication”. PS: Political Science and Politics 28(3): 444-452.
9. To minimize error, build testing into the code
==
Testing ensures that errors from bugs and typos are caught before it is too late, and including that testing in your coding makes sure that it is done.
Below is a simple example of this testing on a function designed to multiply a number by 2.
```{r, eval=FALSE}
## The test function:
test.times.2.fn <- function(){
## This function tests times.2.fn
if (times.2.fn(thenumber = 4) == 8 &
times.2.fn(thenumber = -4) == -8) {
print(“It works!”)
} else { print(“It does not work!”)
}
}
## The actual function is:
times.2.fn <- function(thenumber){
## This function multiplies a scalar number by 2
## thenumber is a scalar number
thenumber+2
```
What happens when running the test function?
```{r, eval=FALSE}
test.times.2.fn()
[1] “It does not work!”
```
This tells you to go back and review the function, where we see a + instead of a *. Testing can be even more effective if we code the script to stop when an error occurs instead of generating an error message. In R, the stopifnot function will do just that.
10. Research should remove speculation
==
Ultimately, the point of doing research is to move things from the realm of opinion into the realm of knowledge. Thus, effective research should have findings that are not a matter of opinion, and at which others can arrive. If, in the future, somebody disagrees with your analysis, you can hand them the raw data files and analysis scripts and watch them arrive at the same result.