-
Notifications
You must be signed in to change notification settings - Fork 4
/
datasetup.Rmd
187 lines (134 loc) · 6.43 KB
/
datasetup.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
---
title: Setup data for analysis of the GSA Auctions experiment^[`r paste("<",system('grep -n github .git/config | cut -d "=" -f 2',intern=TRUE),">",sep="")`]
author: Jake
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
html_document:
highlight: pygments
keep_md: true
---
```{r include=FALSE, cache=FALSE}
# Some customization. You can alter or delete as desired (if you know what you are doing).
# knitr settings to control how R chunks work.
## To make the html file do
## render("datasetup.Rmd",output_format=html_document(fig_retina=FALSE))
## To make the pdf file do
## render("datasetup.Rmd",output_format=pdf_document())
require(knitr)
opts_chunk$set(
tidy=FALSE, # display code as typed
size="small", # slightly smaller font for code
echo=TRUE,
results='markup',
strip.white=TRUE,
fig.path='figs/fig',
cache=FALSE,
highlight=TRUE,
width.cutoff=132,
size='footnotesize',
out.width='.9\\textwidth',
message=FALSE,
comment=NA)
```
# Design
For each item posted for auction with no previous bids, an algorithm identifies a population of
potential bidders (people who had previously used the system who had bid on an
item of the same type in the past 3 months). Within this group, half are assigned to receive an email informing them that said item's auction is ending soon. The other half receives no email.
# Outcomes
The key outcome for now is whether or not a person makes a bid.
# Pre-processing of files
I don't think that I found any semi-colons all alone in the files that matter
to us here.^[These next bits of bash script are not run here because they only need
to be run once per set of files. Probably these lines belong in a
pre-processing file, maybe just a bash script.]
```{r eval=FALSE, engine='bash'}
cd Data
egrep '[^;];[^;]' EMAILQ\ _1101.TXT
egrep '[^;];[^;]' BIDS_1101.TXT
```
So, this means that I can just read them into R with a single ';' as a separator and then delete empty columns. It might eventually be faster to pre-process the files with 'sed' or 'python' or something.
It turns out I was having trouble reading in BIDS_1101.TXT (one of the lines had some seemingly hidden characters that was causing fread to bomb), so I did a little cleanup of that file in the process of diagnosing the problem.
First, replace all ';;;' or ';;;;' with ';' (since I had already verified that we never had a ';' by itself). I discovered one weird case with 4 ';'
```{r eval=FALSE, engine='bash'}
egrep '[^;;;;];;;;[^;;;;]' *.TXT # first just check for four in a row
sed -i.bak1 's/;;;;/;/g' BIDS_1101.TXT
sed -i.bak2 's/;;;/;/g' BIDS_1101.TXT
```
Then, remove excess whitespace at end of line (this is just to make the process of looking for the problem easier) and excess dots
```{r eval=FALSE,engine='bash'}
sed -i.bak3 's/ *$//' BIDS_1101.TXT
sed -i.bak4 's/\.\.\./\./' BIDS_1101.TXT
```
# Reading the files into R
```{r}
## Specific file names need to be changed by hand when the files change unless we automate this some how.
library(data.table,quietly=TRUE) ## using data.table for speed with large datasets
emailq<-fread("Data/EMAILQ\ _1101.TXT",sep=";",header=TRUE,colClasses="character")
## emailq<-emailq[,which(unlist(lapply(emailq, function(x){!all(is.na(x))}))),with=F]
emailq<-emailq[,which(unlist(lapply(emailq, function(x){!all(x=="")}))),with=F]
setnames(emailq,names(emailq),make.names(gsub("\\.","",names(emailq)))) ## make nicer names
setnames(emailq,"REG.SALE.LOT..","RegSaleLot")
str(emailq)
```
We can see that the experimental pool varied a lot across the different items including some items for which only one email was eligible for sending. In those cases, the only email eligible was sent except for a case with SENT=="E" (I'm not sure what E means). So, we will have to remove these rows from consideration since there is no comparison group.
```{r}
regSaleLotsTab<-table(emailq[,RegSaleLot])
sort(regSaleLotsTab,decreasing=TRUE)[1:10]
sort(table(emailq[,RegSaleLot]))[1:10]
## SENT is Control:
table(emailq[SENT.DTE=="00000000",SENT])
emailq[RegSaleLot %in% names(regSaleLotsTab[regSaleLotsTab==1]),list(RegSaleLot,SENT)]
emailq <- emailq[!(RegSaleLot %in% names(regSaleLotsTab[regSaleLotsTab==1])),]
stopifnot(any(table(emailq[,RegSaleLot])!=1))
```
Notice that the same person can be in the experiment multiple times (either as
treatment or control). This means that the causal estimand will be a bit
difficult to define and also that statistical inference will not be
straightforward.
```{r}
table(table(emailq$USER.ID))
```
There is a BID column in 'emailq' but I don't know what it is.
In the BIDS dataset, we have a column for REG, for SALE.NUMBER and LOT. We want a `r unique(nchar(emailq[,RegSaleLot]))` character code from the bids data so that we can combine information about the people making bids and the bids themselves (whether or not a control or treatment assigned email address bid on an item).
```{r}
bids<-fread("Data/BIDS_1101.TXT",sep=";",header=TRUE,colClasses="character")
setnames(bids,names(bids),make.names(gsub("\\.","",names(bids)))) ## make nicer names
str(bids)
## Testing on a small piece of the dataset
set.seed(12345)
blah<-bids[sample(1:nrow(bids),10),]
blah[,RegSaleLot:=paste(REG,SALE.NUMBER,LOT,sep=""),by=1:nrow(blah)]
stopifnot(unique(nchar(blah[,RegSaleLot]))==15)
bids[,RegSaleLot:=paste(REG,SALE.NUMBER,LOT,sep=""),by=1:nrow(bids)]
stopifnot(unique(nchar(bids[,RegSaleLot]))==15)
```
Is there a user id from the email file that exists in the bids file showing that we have bid for a given sale?
First, make a unique id for user and sale:
```{r}
bids[,ID:=paste(RegSaleLot,USER.ID,sep=""),by=1:nrow(bids)]
emailq[,ID:=paste(RegSaleLot,USER.ID,sep=""),by=1:nrow(emailq)]
stopifnot(all(table(emailq$ID)==1)) ## good, uniquely identifies rows of emailq
```
Then collapse bid to the sale/user level (sometimes people gave many bids for the same sale):
```{r}
bids[,bidamt:=as.numeric(BID.AMT)]
bidsItemUser<-bids[,mean(bidamt),by=ID]
setnames(bidsItemUser,"V1","meanbidamt")
stopifnot(all(table(bidsItemUser$ID)==1))
```
Then add the mean bid amt to the data recording whether an email was sent:
```{r}
setkey(bidsItemUser,ID)
setkey(emailq,ID)
## Merge keeping the rows of emailq
mergedat<-bidsItemUser[emailq]
stopifnot(all(table(mergedat$ID)==1))
str(mergedat)
summary(mergedat$meanbidamt)
table(!is.na(mergedat$meanbidamt),mergedat$SENT,exclude=c())
```
Save the data for analysis:
```{r}
write.csv(mergedat,file="Data/mergedat.csv")
save(mergedat,file="Data/mergedat.rda")
```