% File: collab4ss.tex
% Created: Tue Feb 22 02:00 PM 2011 C
% Last Change: Tue Feb 22 02:00 PM 2011 C
%
\documentclass[]{article}
\usepackage{hyperref}
\author{Mark M. Fredrickson \and Paul F. Testa \and Nils B. Weidmann} % \and you!
\title{Collaboration for Social Scientists, or Software is the Easy Part}
\begin{document}
\maketitle
\section{Collaboration Basics}
% text files/documents of simple structure
% sideband communication (``Hey! I'm working on that!'')
% place in context of other articles in TPM
In this article, we consider two different modes of collaboration: synchronous
and asynchronous. When working synchronously, contributors are both working on
\emph{the same portions of the research at the same time}. We provide some
suggestions for maximizing time spent working together. Of course, virtually
any research project will require collaborators to spend time working on
either different portions of the project or working on the same sections but
at different times. We label this form of collaboration asynchronous.
Asynchronous collaboration requires more careful attention to dividing labor,
and we spend more time providing software solutions in this domain. These
suggestions are based on what has worked \emph{for us}. They are grounded in
experience, and we think they are useful techniques for any team to adopt, but
we have also found that software is the easy part of any collaboration.
Hopefully, adopting some of these techniques will help your team get past the
technical details faster and down to the real business of producing research.
\section{Synchronous Collaboration}
While it might appear that only collaborators at the same institution, or
those who can frequently meet face to face, will benefit from synchronous
collaboration techniques, many of these techniques rely only on networked
computers or can be applied over video chat or speakerphone.
We begin by importing some techniques from software engineering. In recent
years, so-called ``Agile'' approaches to programming and project management
have become popular, especially at start-ups and younger
development shops. Many techniques fall under the umbrella of ``Agile''
methods, including suggestions for organizing teams, minimizing unnecessary
meetings, and communicating frequently changing client requests. While social
scientists could benefit from these suggestions, we tend to be our own clients
and work in smaller teams than programmers. One technique we think would
benefit social scientists is \emph{pair programming}, a practice popularized
by ``eXtreme Programming'' (XP). Pair
programming places two programmers at the same computer: one screen, one
keyboard, two heads. One programmer takes the lead to write software, while
the second provides suggestions, acts as a sounding board, catches errors, and
questions assumptions made by the first programmer.
While it may sound wasteful to place two collaborators in front of a single
computer and have them both focus on the same task, the technique can lead to
\emph{more} code being written and \emph{higher quality} software as well. The
key insight is that typing is rarely the bottleneck for producing code. Having
a second person on hand to help with the concepts, design, and implementation
cuts down on time spent chasing dead-ends or time wasted on simple bugs.
If you have ever spent hours on a problem only to spot your mistake while
explaining it to someone else, you will appreciate the immediate benefits of
pair programming.
In practice, pair programming need not have both collaborators staring at the same
screen at the same time. One programmer may be writing code, while the other
looks into API documentation, writes unit tests, or provides documentation,
but is immediately available to support the first programmer. Additionally,
collaborators need not be in the same physical space. There are several tools
for real-time co-editing of documents. Wikipedia provides a fairly detailed
list of \href{http://en.wikipedia.org/wiki/Collaborative_real-time_editor}
{collaborative real-time
editors}. All of these editors allow multiple authors to simultaneously edit
documents, which may even be a useful feature to pair programmers in the same
physical space. Since this issue of TPM is strongly encouraging learning and
using a text editor, you may wish to favor editors that allow for simultaneous
editing. At a minimum, \href{http://www.gnu.org/software/screen/}{GNU Screen}
provides an immediate solution for Emacs and VIM users who wish to pair program.
More advanced uses may require an editor plugin or a separate editor.
In addition to managing file editing, some real-time editors also facilitate
verbal communication. If your editor does not provide
this service, a call via \href{http://www.skype.com}{Skype} or
\href{http://chat.google.com}{Google Chat} can fulfill communication needs. Of
course, these tools can also be of use to collaborators even if they forgo the
pair programming model.
% pair programming
% other agile methods
% screen & google docs
% irc/aim/jabber
\section{Asynchronous Collaboration}
% dropbox/shared files
% wiki
Even the closest of collaborators need to spend some time apart. The primary
challenge for effective asynchronous collaboration is ensuring that this time
is spent contributing to the final product and not wondering where one's
changes went and why one's partner is working off a draft from three months ago.
\subsection{Shared Files}
Services such as \href{http://www.dropbox.com}{Dropbox} provide a relatively
simple and elegant solution to the challenge of keeping collaborators on the
same page. Users download a desktop application that creates a Dropbox folder
on their computer. Files stored in this folder are available online through
the user's account, as well as on any other computer on which the user has
installed Dropbox or with which he or she has shared the folder. Changes made
to a file are automatically synchronized with the user's online account and
other computers, ensuring that collaborators are always working off the most
recent copy.
Think of Dropbox as the iPhone of file-sharing services (in fact, there are
several Dropbox applications for the iPhone and other mobile devices). It is
relatively easy to use, synchronizes changes automatically, and offers a
degree of version control (up to thirty days of history for free users and
unlimited for paying customers). Dropbox is well suited to less technically
savvy social scientists. More advanced users and frequent collaborators may
chafe at the limitations of its ``freemium'' service and find that some of its
more user-friendly features promote bad collaboration habits.
\subsection{Version Control}
While shared files solve the problem of all collaborators having
access to common resources, simple file servers provide no guarantee that
collaborators will not unintentionally overwrite each other's changes.
Consider, for example, the following scenario: both you and your collaborator
are working on the same \LaTeX\ file. You are editing the abstract, while your
partner changes a few lines in the conclusion. You save your work to the
shared area; unknown to you, your partner saved her work only a few minutes
before. Even though you were working in an entirely different part of the
file, your changes overwrite those of your partner, silently dropping her work
and reverting to the old conclusion.
This is exactly what so-called version control (VC) systems are designed to avoid.
Developed for software engineering, these systems enable multiple authors to
work on the same documents and safely merge their changes.
Version control systems come in two flavors: centralized and decentralized.
The basic setup of a
centralized version control system is simple. Collaborators work off a central repository, where
the version-controlled documents reside. Upon joining a project, each collaborator
obtains a working copy of each of these documents. Versions are tracked by means of
revision numbers, assigned and maintained by the VC system. When you obtain a document
(``check out''), your local copy is assigned the revision number of the repository at
that time.
As the project proceeds, collaborators change different parts of their working copies
of the same document, which brings us to the situation described above: how do your
modified abstract and the conclusion written by your collaborator end up in the same
document? Let's assume that you are finished writing the abstract, while your partner
is still working on the conclusion. Your and your collaborator's working copies are
both currently in revision 22. You upload your changes to the central repository, a
step that in the VC world is called a ``commit''. When you do so, a new revision
(rev.~23) is created in the central repository. Once your partner attempts to upload
the new document with the conclusion, the VC system prevents her from doing so, because
the working copy she is using is not up to date and still in revision 22. Prior to
committing her changes, she has to update her working copy to the current revision of
the repository (rev.~23). This is done by performing an ``update'' operation in the VC
system, during which the updated parts of the main document (the abstract) are merged
into your partner's local working copy while, and this is important, preserving
her newly written conclusion.\footnote{Things become trickier if you and your
partner modify the same parts of a document (for example, you both provide an
abstract). In this case, human intervention is required: the VC system highlights the
conflicting changes, but lets you decide what should be the final version.
VC systems (generally) assess changes on a line-by-line basis. You can
minimize your conflicts by editing the fewest lines possible for any change
and by making frequent check-ins. You may find it helpful to use hard wrapping
in your editor of choice when editing \texttt{.tex} files, at say 80
characters. This will automatically break long sentences into smaller chunks
from the perspective of the VC system and provide for fewer headaches down the
road.}
Various implementations of this centralized version control model exist. One of the most
popular systems is \href{http://subversion.tigris.org/}{Subversion}, which is available
free of charge (both the server required to set up a central repository and a command
line client to check out and maintain a local working copy). However, various
alternatives exist that make life easier for less tech-savvy people.
\href{http://www.projectlocker.com/}{ProjectLocker} offers Subversion hosting, and a
free account can store up to 300 MB. If the documents you put under version control
mainly comprise \LaTeX\ and R code, this is more than enough.
\href{http://tortoisesvn.tigris.org/}{TortoiseSVN} is a graphical Subversion client for
Windows users that integrates nicely with Windows Explorer. A similar project exists
for OS X (\href{http://scplugin.tigris.org/}{SCPlugin}), but due to its early stage of
development, some people may prefer commercial products
(\href{http://www.zennaware.com/cornerstone/}{Cornerstone} or
\href{http://versionsapp.com/}{Versions}).
As the name implies, decentralized version control systems spread out the work
of managing repositories while still emphasizing collaboration. The main
difference is that instead of a single, server-side repository, each
collaborator has a local repository and a local working copy. The advantage of
this mode of version control is that each collaborator can work offline, make
small, frequent, and fast commits, and still communicate with other
repositories (usually by pushing to a server as with a centralized system).
By their decentralized nature, distributed VCs make ``branching'' a natural
technique. ``Branching'' refers to making parallel copies of the repository to
try out ideas, make complex changes, or just play around in a sandbox. For
example, say you wish to add an additional data source to the paper to see if
it adds strength to the argument, but you also will be editing the document
simultaneously. By adding a branch, you have a safe place to add data and
change wide sections of the analysis, but can still edit text on the main
document. If the additional data source proves useful, you can merge the
branch back to the main code base, maintaining any edits. If the additional
data is not useful, you have done no harm to your primary document. In a
distributed version control system, every collaborator is working on his or
her own branch, so these merges are natural and well supported.\footnote{It is
fair to note that systems like Subversion support branching; however, SVN
places more burden on the user to merge branches than the distributed systems.
This situation is improving for SVN and will likely be a non-issue in the
future.}
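As a concrete sketch, the branch-and-merge workflow described above might look as follows in Git. This is illustrative only: it builds a throwaway repository in a temporary directory, and all file and branch names (\texttt{paper.tex}, \texttt{extra.R}, \texttt{new-data}) are hypothetical.

```shell
# Illustrative only: a throwaway Git repository in a temporary directory.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"

echo "Original conclusion." > paper.tex   # main line of the paper
git add paper.tex
git commit -q -m "draft"
main=$(git symbolic-ref --short HEAD)     # master or main, depending on Git version

git checkout -q -b new-data               # sandbox branch for the extra data source
echo "# hypothetical analysis of the new data" > extra.R
git add extra.R
git commit -q -m "try additional data source"

git checkout -q "$main"                   # meanwhile, keep editing text on the main line
echo "Revised conclusion." > paper.tex
git commit -q -am "edit text"

git merge -q --no-edit new-data           # the new data proved useful: merge it back
```

Had the new data not worked out, deleting the branch (\texttt{git branch -D new-data}) would have left the main document untouched.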
The downside of distributed systems is that they require slightly more
overhead than a centralized system. To commit changes in a centralized system,
a user pushes his changes to a central server, and they are immediately
available to other users. In a decentralized system, a user commits locally,
then pushes to a remote server. This remote server can either be commonly
shared or unique to the individual collaborator. In the latter case, the other
collaborators will need to pull in changes from their associates' online
repositories. To some degree this additional complexity can be hidden by
tools. A number of online services provide tooling
around distributed version control systems
(\href{http://www.github.com}{GitHub} for \href{http://git-scm.com/}{Git},
\href{https://launchpad.net/}{Launchpad} for
\href{http://bazaar.canonical.com/en/}{Bazaar}, and
\href{https://bitbucket.org/}{BitBucket} for
\href{http://mercurial.selenic.com/}{Mercurial} --- three popular distributed
systems). As with Subversion, client-side tools exist for working with these
version control systems.
So which to use? In our experience, centralized version control (specifically
SVN) is the easiest for collaborators to use, but has required the most work
to set up the server. If you have technical support through your institution
or plan to use an online service, the maturity and simplicity of SVN could be
a big benefit to your team. On the other hand, distributed version control
systems can be quick to get up and running. We have combined a distributed
system with Dropbox, where collaborators pushed to a shared repository on
Dropbox. This was not a perfect solution, as two simultaneous pushes could
conflict, but it still provided more guarantees than just directly editing the
shared files. This was also a good solution for incorporating a collaborator
with less technical ability. This collaborator edited the copy in Dropbox, and the
other collaborators pushed and pulled to their own local repositories. Single
users can also make quick use of distributed systems to have an undo stack of
previous documents and a place to create sandbox branches. In both the local
user and the Dropbox case, it would be easy to start pushing to a shared
remote server for more serious collaboration.
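One way to set up the Dropbox arrangement described above is to keep a \emph{bare} Git repository inside the shared folder and have each collaborator clone from it. The sketch below is hypothetical: a temporary directory stands in for the Dropbox folder, and the repository and file names are invented for illustration.

```shell
# Illustrative only: a temporary directory stands in for the shared
# Dropbox folder; collaborators clone from and push to the bare repo.
set -e
base=$(mktemp -d)
mkdir -p "$base/Dropbox"
git init -q --bare "$base/Dropbox/paper.git"   # the shared repository

git clone -q "$base/Dropbox/paper.git" "$base/local"
cd "$base/local"
git config user.email "you@example.com"
git config user.name "Your Name"

printf '%s\n' '\documentclass{article}' > paper.tex
git add paper.tex
git commit -q -m "start paper"
git push -q origin HEAD                        # publish to the shared folder
```

Each collaborator clones the same path from his or her own Dropbox folder; the simultaneous-push caveat noted above still applies.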
% make files to explicitly state dependences
\subsection{Scripts}
While version control systems will keep track of changes and ensure that the
source is in a consistent state, the source code alone (and here we include
things such as \texttt{.tex} files and data) does not completely describe
\emph{how} to create the research. Consider, for
example, creating a figure for a paper. Being a good collaborator, you take
the time to write the figure-generating code in \texttt{figure.R} and insert
it into the main \LaTeX\ file using
\verb|\includegraphics|. The figure relies on data in \texttt{data.csv} and
some code in \texttt{models.R}. While it is straightforward for the original
author to create this graphic, will it be obvious to others in the project
that if either the data or models change, the figure should also change? For
lengthy projects, even the original author may forget which files depend on
others.
Again borrowing from software engineering, we suggest the use of build files to
solve this problem. Build files explicitly state dependencies between files
and explain how to generate artifacts, such as PDF files. As an added benefit,
build files automate the creation of artifacts and ensure that files are built
in the proper order. Most importantly, build files ensure that artifacts are
updated when source documents, including data, change. Returning to the figure
example above, we could notate the necessary conditions for updating the
figure and the main PDF with the following GNU \texttt{Makefile}\footnote{We
use the classic and widely available GNU \texttt{make} system, but other build
systems exist. Some of these systems, for example \texttt{Rake} for Ruby,
allow more programming and customization within the build scripts. Your team
may benefit from this extended functionality.}:
\begin{verbatim}
paper.pdf: figure.pdf paper.tex
	latexmk -pdf paper.tex

figure.pdf: figure.R data.csv
	R --silent --file=figure.R
\end{verbatim}
The unindented lines indicate \emph{targets}, with a list of dependencies
after the colon and a build command on the following indented line.
The \texttt{make} command checks each target and compares the
time stamp on the target with the time stamp on each dependency. If any
dependency is newer than the target, \texttt{make} rebuilds the dependency
using its command (perhaps recursively rebuilding further dependencies) and
then rebuilds the target using its command. For example, if \texttt{data.csv}
is updated, \texttt{make} will automatically rebuild \texttt{figure.pdf}
before rebuilding \texttt{paper.pdf}.
Using a \texttt{Makefile} reduces the amount of knowledge any individual on
the team must have about creating artifacts. Instead of having to
remember and manually carry out the build process, all a collaborator has to
do is type \texttt{make} at the command line, and he will be certain to have a
properly built version of an artifact, say a PDF document.
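To see the timestamp logic in action without \LaTeX\ or R installed, here is a toy version of the same idea: \texttt{cat} stands in for the real build commands, and all file names are hypothetical.

```shell
# Toy demonstration of make's timestamp-based rebuilds: cat stands in
# for latexmk and R so the example runs anywhere.
set -e
dir=$(mktemp -d)
cd "$dir"
echo "1,2,3" > data.csv
printf 'paper.pdf: figure.pdf\n\tcat figure.pdf > paper.pdf\n\nfigure.pdf: data.csv\n\tcat data.csv > figure.pdf\n' > Makefile

make -s           # first run builds figure.pdf, then paper.pdf
touch data.csv    # simulate updated data
make -s           # both artifacts are rebuilt in dependency order
```

Note that the recipe lines written by \texttt{printf} begin with a tab character (\verb|\t|), which \texttt{make} requires.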
To some degree, literate programming tools, such as \texttt{Sweave}, minimize
the need to establish a clear dependency tree in a build script. \texttt{Sweave}
chunks take the place of
having separate files for loading and transforming data, building models, and
generating figures. Since files are evaluated top-down, there is an implicit
dependency structure, with later chunks depending on earlier chunks. In order
to weave the file, all chunks are rebuilt, guaranteeing any changes in early
code chunks flow downstream.
While we are heavy users of \texttt{Sweave}, we still think explicit build
scripts have a role to play. First, certain computations can be time
consuming but do not need to be frequently updated. Simulations and
bootstrapping within a \texttt{Sweave} document increase the time between
making an edit (perhaps to the text) and producing the final PDF. Writing
these computations in separate \texttt{.R} files eliminates the need to rerun the
computations when no code or data has changed.\footnote{We are aware of caching
mechanisms for \texttt{Sweave} chunks, but we are concerned that these systems
do not include specific dependency management and therefore still require more
rebuilding than is necessary with explicit build scripts.} Second, even when
using a single \texttt{Sweave} file to merge text and code, projects of a
reasonable size will include additional files, especially data. These files
may have their own build steps or simply be dependencies for the
\texttt{Sweave} file. Encapsulating these relationships in a build script is
still useful, even when using \texttt{Sweave}. Finally, even when using
\texttt{Sweave}, your team may wish to split up the work into different files
for logical or practical reasons. While version control systems are powerful
tools, sometimes the best way to work with collaborators is to divide the work
into separate tasks, with each person working in a different file. Build
scripts help with
merging the separate files into a unified whole.
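A build script for the first point might look like the following fragment. The file names are hypothetical: the slow simulations live in \texttt{sims.R} and are rerun only when that file or its data change, while the \texttt{Sweave} document simply loads the cached \texttt{sims.rda}.

```make
# Hypothetical fragment: slow simulations are cached in sims.rda and
# rerun only when sims.R or data.csv changes.
paper.pdf: paper.tex
	latexmk -pdf paper.tex

paper.tex: paper.Rnw sims.rda
	R CMD Sweave paper.Rnw

sims.rda: sims.R data.csv
	R --silent --file=sims.R
```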
While build files indicate which files should be updated when data change, we
encourage social scientists to write scripts to update data as well. Again
borrowing from software engineering, we have found ``database migrations'' to
be a useful technique in capturing exactly how and why data should
change.\footnote{As the name implies, database migrations are most often
applied to databases, as compared to flat files, such as \texttt{.csv} or
\texttt{.dta} files. While beyond the scope of this article, relational databases
are an
underused tool in the social sciences. We hope to provide suggestions on using
true databases in a research context in future articles.}
While a version control system could capture how a \texttt{.csv} file changes
from one commit to the next, a migration provides the exact steps by which the
data are manipulated. For example, consider downloading data from
\href{http://electionstudies.org}{the ANES} and
\href{http://www.census.gov}{the United States Census} and joining them into a
single table for analysis in \texttt{R}. The most familiar approach might be
to load both datasets into \texttt{R}, use the \texttt{merge} function to
combine them into a single table, and then save the result to a \texttt{.rda}
file. We suggest two alternatives, both of which could be considered
``migrations,'' that provide more information about the steps undertaken to
combine the disparate data sources.
The first technique we call ``one file to rule them all.'' In this scenario,
you add your ANES and Census data to your version control repository, along
with a file \texttt{data.R}. This file contains the code to load, clean, and
merge the data, possibly saving a \texttt{data.rda} file in the process. You can
enter \texttt{data.rda} as a dependency in your \texttt{Makefile}:
\begin{verbatim}
analysis.tex: analysis.Rnw data.rda
	R CMD Sweave analysis.Rnw

data.rda: data.R anes.csv census.csv
	R --silent --file=data.R
\end{verbatim}
If you or your collaborators need to edit the data at a later time, you can
update \texttt{data.R}. Since it is included in the \texttt{Makefile},
downstream files will be appropriately updated as well.
For more complex data needs, you may consider employing a series of migrations,
each building off the previous one. As a convention, label your migrations in
order: \texttt{001\_load\_data.R}, \texttt{002\_fix\_coding.R}, etc. Rather than
having a single file manage all data manipulations, each migration is a
separate file that loads, manipulates, and saves the updated data. To run the
migrations, collaborators run the scripts in sequence starting with
\texttt{001\_\ldots}. This migration strategy has been most successful, for us,
when using mixed languages to update the data. Here, for example, is a listing
of migrations on a project that pulls in hate crime data, survey information,
and Census information from the web and builds a relational
database:\footnote{\texttt{.sql} files are written in SQL, the relational
database query language; \texttt{.clj} files are written in Clojure, a Lisp
dialect for the JVM.}
\begin{verbatim}
001_initialize.sql
002_populate.clj
003_remove_redundant_data.sql
004_connect_census_tables.sql
005_hate_groups.clj
006_splc_hatewatch_events.clj
007_coding_events # a directory of relevant files
\end{verbatim}
As an example, \texttt{004\_connect\_census\_tables.sql} contains:
\begin{verbatim}
-- see the .schema for what is in these tables
-- fips gets duplicated as fips:1, fips:2, ...
CREATE VIEW census AS
SELECT * FROM
counties c LEFT JOIN census_area_pop ap ON c.fips = ap.fips
LEFT JOIN census_employment e ON c.fips = e.fips
LEFT JOIN census_foreign_moved fm ON c.fips = fm.fips
LEFT JOIN census_income i ON c.fips = i.fips
LEFT JOIN census_language_education le ON c.fips = le.fips
LEFT JOIN census_occupation o ON c.fips = o.fips
LEFT JOIN census_race r ON c.fips = r.fips;
\end{verbatim}
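Running numbered migrations in order is easy to script. The sketch below is a minimal, hypothetical runner: it creates two dummy shell-script migrations in a temporary directory and executes them in lexical order; a real mixed-language project would dispatch on file extension to \texttt{sqlite3}, \texttt{Rscript}, a Clojure runner, and so on.

```shell
# Minimal migration runner (hypothetical): shell globs sort lexically,
# so 001_... always runs before 002_....
set -e
dir=$(mktemp -d)
cd "$dir"
printf 'echo 001 >> log\n' > 001_load_data.sh
printf 'echo 002 >> log\n' > 002_fix_coding.sh

for f in [0-9][0-9][0-9]_*.sh; do
  sh "$f"    # a real runner would pick an interpreter per extension
done
```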
Splitting migrations into separate files adds overhead, but it provides a
finer-grained record of changes, even within a version control system. We
consider either a single file (like the \texttt{data.R} example above) or
multiple files a good solution. Your team should select the method that best
suits your work style and the amount of data cleaning and manipulation
required for your project.
\section{Conclusion}
While we have stressed software throughout this article, the technology is the
easiest part of collaboration. Habits, conventions, and best practices are much
harder to achieve. At the same time, software suggests (or makes easier)
certain methods of collaboration. Using a version control system requires
collaborators to think about which files to add to the shared repository and
which files are transitory or local. Similarly, build files help us
communicate to our collaborators the steps necessary to build documents.
Alternatively, we could simply write detailed \texttt{README} files, but these
tools add value above and beyond pure description, though they serve a similar
purpose.
Nevertheless, agreeing to use \texttt{SVN} or \texttt{make} is a relatively
simple decision; adhering to best practices is much harder. Version control
frees us from calling our collaborators on the phone and saying, ``Don't touch
this file; I'm working on it.'' But just because one can check in a file
without merge conflicts does not mean the document is in a good state. We can
still make mutually non-conflicting changes that lead to disastrous results.
Working through such disasters is still a matter of communication between
collaborators. But even such conflicts are usually quickly resolved. More
difficult is maintaining common style and usage throughout a
project.\footnote{You may wish to adopt a style guide for both coding and
text.
\href{http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html}
{Google publishes a style guide for
R}.
Style guides for the English language are numerous.}
This article has focused on collaboration, but these suggestions also have
benefits for \emph{replication}. In a sense, someone replicating your research
is simply a future collaborator. Encapsulating changes within a version
control system, providing build scripts, and adhering to a consistent style
guide all make it easier for a future researcher to replicate our work. These
tools also make it easier for other researchers to evaluate our work by
adjusting assumptions, using different data sets, or applying new methods to
our data. Writing code with a partner at your elbow ensures that at least one
other person understands what the code is doing, raising the probability that
someone replicating your work will understand it too. Good practices for
collaboration make it easier for future partners (i.e., replicators) to
understand what we have done and how to do it.
% good collaboration leads to reproducible research
% ``collaborating with replicators'' or something like that
% \section{Further Reading}
% TODO appendix and a bib file
\end{document}