forked from jamesdunham/dgo
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
211 lines (163 loc) · 8.34 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
---
output:
md_document:
variant: markdown_github
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# dgirt
```{r, knitr-options, echo = FALSE}
# rmarkdown::render("~/projects/dgirt/README.Rmd")
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
fig.path = "README-",
cache = FALSE)
```
[![Build Status](https://travis-ci.org/jamesdunham/dgirt.svg?branch=master)](https://travis-ci.org/jamesdunham/dgirt)
dgirt is an R package for dynamic group-level IRT models, as developed in [Caughey and Warshaw
2014](http://pan.oxfordjournals.org/content/early/2015/02/04/pan.mpu021.full.pdf+html):
> Over the past eight decades, millions of people have been surveyed on their political opinions. Until recently,
> however, polls rarely included enough questions in a given domain to apply scaling techniques such as IRT models at
> the individual level, preventing scholars from taking full advantage of historical survey data. To address this
> problem, we develop a Bayesian group-level IRT approach that models latent traits at the level of demographic and/or
> geographic groups rather than individuals. We use a hierarchical model to borrow strength cross-sectionally and
> dynamic linear models to do so across time. The group-level estimates can be weighted to generate estimates for
> geographic units. This framework opens up vast new areas of research on historical public opinion, especially at the
> subnational level.
## Installation
```{r, install-dgirt, results = "hide", message = FALSE}
devtools::install_github("jamesdunham/dgirt")
```
Get updates by reinstalling. dgirt is in early stages and under development.
See [NEWS](NEWS.md), last updated 2016-01-28.
## Quick start
* `wrangle` prepares data
* `dgirt` fits models
* `poststratify` reweights estimates
## Use
`state_opinion` is a dataset included with `dgirt` in which rows correspond to survey
responses from individuals.
```{r, load-and-attach, results = "hide", message = FALSE}
library(dgirt)
data(state_opinion)
```
Data formatted in this way need to be restructured with the `wrangle` function. We'll pass `state_opinion` to
`wrangle`'s `data` argument, which takes a list of data to be used in modeling. Our table of survey responses will be an
element named `level1`, for the lowest hierarchical level in the model. (Note: at the moment, there are limitations on
model specifications. Level-one data is required and a second hierarchical level is optional.)
We'll use the `vars` argument to identify variables of importance in `data` (e.g. which represent item responses).
`vars` is a list of named character vectors with (at least) these elements:
* `items`: Names of item response variables in `data$level1`.
* `groups`: Names of respondent characteristic variables in `data$level1`. (Note: at this time, `wrangle` requires
that the user exclude the geographic indicator from `groups` and name it instead in `geo_id`. Modeling any group
predictor is coming.)
* `time_id`: Name of time period variable in `data$level1`.
* `geo_id`: Name of geographic identifier variable in `data$level1`.
* `survey_id`: Name of survey identifier variable in `data$level1`.
* `survey_weight`: Name of weight variable in `data$level1`.
The names of the item response variables start with "Q\_", so we'll pass them using `grep`.
```{r, wrangle, message = FALSE}
state_opinion_fmt = wrangle(
data = list(level1 = state_opinion),
vars = list(items = grep("^Q_", colnames(state_opinion), value = TRUE),
groups = c("race"),
time_id = "year",
geo_id = "state",
survey_id = "source",
survey_weight = "weight"),
filters = list(periods = c(2006:2010)))
```
This output omits verbose messages. `wrangle` returns a list of objects that `dgirt` expects as its first argument.
We'll also set its `n_iter` and `n_chain` arguments to minimize its run time, but otherwise rely on the defaults.
`dgirt()` calls `rstan()`, which reports any problems it encounters when compiling the model and sampling. Reporting is
verbose and not all messages indicate problems. If sampling is successful, `dgirt()` returns an object of class
`stanfit`. (See rstan documentation.)
A short trial run is often a good idea.
```{r, dgirt-rstan-trial, message = FALSE}
dgirt_estimates = dgirt(state_opinion_fmt, n_iter = 3, n_chain = 1)
```
We omit verbose messages here. Now, a longer run:
```
# Not run
dgirt_estimates = dgirt(state_opinion_fmt, n_iter = 2000, n_chain = 4)
```
To examine the the `dgirt()` results we can use `extract_dgirt()`, which attaches labels to the saved parameters
according to the variable names originally passed to `wrangle()` and any factor levels. Right now, `extract_dgirt()`
shows only the posterior means.
Note that `dgirt()` returns a `stanfit` object when `method = "rstan"` (as above, the default) and a list of point
estimates if `method = "optimize"` (details below); `extract_dgirt()` only works with `stanfit` objects. (The
inconsistency in return types isn't desirable and will change in the future.)
The group means can be found as `theta_bar`.
```{r, dgirt-extract}
dgirt_extract = extract_dgirt(dgirt_estimates, state_opinion_fmt)
head(dgirt_extract$theta_bar)
```
## `cmdstan`
We can use the `method` argument of `dgirt` to choose an alternative to MCMC sampling if `cmdstan` is available. See
http://mc-stan.org/interfaces/cmdstan.html for installation instructions. For example, setting `method = "optimize"`
will call `cmdstan optimize`.
First, a trial run.
```{r, dgirt-optimize}
optimize_estimates = dgirt(state_opinion_fmt, n_iter = 20, method = "optimize",
init_range = 0.5)
head(optimize_estimates$theta_bar)
```
And now a longer run.
```
# Not run
optimize_estimates = dgirt(state_opinion_fmt, n_iter = 20000,
method = "optimize", init_range = 0.5)
```
## `poststratify`
`poststratify()` can reweight estimates from `dgirt()` (if `method = "optimize"`) or `extract_dgirt()` (if `method =
"rstan"`, the default). `postratify()` returns weighted means for groups or arbitrary aggregations of groups.
The `state_demographics` dataset contains population proportions for demographic strata by year. At the moment, it's
necessary to relabel the group factor levels in the `dgirt()` results to match those in the population proportion data.
```{r, load-demographics}
data(state_demographics)
head(state_demographics)
state_demographics$race = factor(state_demographics$race, labels = c("white", "black", "other"))
```
Now we pass these data, the same `groups` argument as used originally with `wrangle`, and a vector of variable names as
`strata` that define aggregations of interest in the data. For exposition we'll set two optional variables. We give the
name of the variable in the demographic data for the population proportion as `prop_var`. And passing a variable name to
`summands` will test the demographic data for whether population proportions sum to one within groups defined by the
values of that variable.
```{r}
dgirt_extract$theta_bar$year = as.integer(dgirt_extract$theta_bar$year)
group_means = poststratify(
group_means = dgirt_extract$theta_bar,
targets = state_demographics,
groups = c("race"),
strata = c("state", "year"),
prop_var = "proportion",
summands = "year")
head(group_means)
```
The same approach works after `dgirt()` if `method = "optimize"`.
```{r}
optimize_estimates$theta_bar$year = as.integer(optimize_estimates$theta_bar$year)
optimize_group_means = poststratify(
group_means = optimize_estimates$theta_bar,
targets = state_demographics,
groups = c("race"),
strata = c("state", "year"),
prop_var = "proportion",
summands = "year")
head(optimize_group_means)
```
## `plot_means`
We can quickly plot group means with `plot_means`. It can handle the `theta_bar` element of the value of
`poststratify()`. (Figures omitted.)
```{r, results = "hide", warning = FALSE, message = FALSE, fig.show = 'hide'}
plot_means(group_means, "year", "state", jitter = TRUE)
```
Or that of `dgirt_extract()` where dgirt(`method = "rstan"`),
```{r, results = "hide", warning = FALSE, message = FALSE, fig.show = 'hide'}
plot_means(dgirt_extract$theta_bar, "year", "state", jitter = TRUE)
```
Or the value of `dgirt()` in the case of `method = "optimize"`.
```{r, results = "hide", warning = FALSE, message = FALSE, fig.show = 'hide'}
plot_means(optimize_estimates$theta_bar, "year", "state", jitter = TRUE)
```