This repository has been archived by the owner on Sep 15, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 3
/
R4SME 10 logistic covariates.Rmd
271 lines (209 loc) · 8.24 KB
/
R4SME 10 logistic covariates.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
---
title: "R for SME 10: Logistic regression (covariates)"
author: Andrea Mazzella [link](https://github.com/andreamazzella)
output: html_notebook
---
# Part 1: mortality
## Basics
Load packages
```{r}
library(haven)
library(epiDisplay)
library(magrittr)
library(tidyr)
library(dplyr)
```
## Data import, exploration, management
Make sure you have the mortality.dta dataset in the same folder as this .rmd
```{r Import data}
mortality<-read_dta("./mortality.dta")
mortality %<>% mutate_if(is.labelled,as_factor)
```
```{r Explore data}
glimpse(mortality)
summary(mortality)
View(mortality)
```
## Data analysis
1.
Does onchocercal infection appear to be associate with death?
```{r}
# Cross-tabulate "died" and "mfpos"
mortality %$% tabpct(died, mfpos, percent = "row", graph = F)
```
Run a logistic regression model to estimate the crude OR for onchocercal infection.
```{r}
# Logistic regression
glm(died ~ mfpos,
data = mortality,
family = binomial()) %>%
logistic.display()
# Alternative: chi2 test
mortality %$% cc(died, mfpos, graph = F)
```
Crude OR for death in infected vs non-infected = 1.63 (1.08-2.48).
Wald's test, p = 0.021
Chi2 test, p = 0.02
2.
Let's check if age is a potential confounder. Is our exposure (onchocerchal infection) somehow associated with age?
```{r}
# Cross-tabulate mfpos against agegrp
mortality %$% tabpct(agegrp, mfpos, percent = "col", graph = F)
```
Yes, it looks like the older people are affected more frequently. So we might analyse age group as a potential confounder.
3.
Now stratify this association by age group.
```{r}
# Stratified 2x2 tables
mortality %>%
filter(agegrp == "15-34") %$%
tabpct(died, mfpos, percent = "col", graph = F)
mortality %>%
filter(agegrp == "35-54") %$%
tabpct(died, mfpos, percent = "col", graph = F)
mortality %>%
filter(agegrp == "55-64") %$%
tabpct(died, mfpos, percent = "col", graph = F)
mortality %>%
filter(agegrp == "65+") %$%
tabpct(died, mfpos, percent = "col", graph = F)
```
Let's get a summary estimate of the association between our exposure and our outcome, accounting for age. First, let's do it the old-fashioned way with dear Mantel-Haenszel.
```{r}
# Mantel-Haenszel OR and chi2 test
mortality %$%
mhor(died, mfpos, agegrp, graph = F)
```
- The MHOR for death in infected vs non-infected, accounting for age, is 1.50 (0.98-2.30); compare with crude 1.63 (1.08-2.48) from the logistic regression / chi2 test. It appears that age group is indeed a confounder in this association.
- Whilst the stratum-specific ORs are dissimilar, their CIs overlap, and the homogeneity test has a large p-value. Therefore, age group is not an effect modifier in the relation between infection and death.
- Now that we have adjusted for age, the p-value for the chi2 test has increased to 0.06.
4.
Now let's do the same but with logistic regression.
```{r}
# Logistic regression with a confounder
glm(died ~ mfpos + agegrp,
data = mortality,
family = binomial()) %>%
logistic.display()
```
- Unlike Stata, R also gives you the crude OR.
- The adjusted OR are very similar to the ones calculated with MH.
- There's also a Wald's test p-value and a LR-test p-value.
5.
(Summarise the results of your analysis)
6.
Now let's run a more complex model, that also includes visual impairment (vimp).
```{r}
# Logistic regression with two confounders
glm(died ~ vimp + mfpos + agegrp,
data = mortality,
family = binomial()) %>%
logistic.display()
```
- The OR associated with vimp means: once you account for infection and age group, the OR of death in visual impairment is 2.28 (1.44-3.58), and there is very strong evidence for this association (Wald's p < 0.001).
- Does onchocercal infection appear to confound the association between visual impairment and odds of death?
7.
- The OR associated with mfpos means: once you account for visual impairment and age group, the OR of death in microfilarial infection is 1.46 (0.95-2.23). There is weak evidence for this association (Wald's p = 0.09).
8.
Perform a likelihood ratio test of the H0 that, after accounting for *age* and onchocercal infection, there is no association between visual impairment and odds of death.
```{r}
# Create the first model with all confounders
logit_vimp_mfpos_agegrp <- glm(died ~ vimp + mfpos + agegrp,
data = mortality,
family = binomial())
# Create a simpler regression without the exposure
logit_mfpos_agegrp <- glm(died ~ mfpos + agegrp,
data = mortality,
family = binomial())
# Likelihood ratio test
lrtest(logit_vimp_mfpos_agegrp, logit_mfpos_agegrp)
```
- The LRT p < 0.001. Therefore there is very strong evidence for an association between visual impairment and odds of death, after accounting for age and onchocercal infection.
- We could also have done a Wald's test (and I think this was already in the logistic model at question 6, along with a LRT!)
9.
Perform a likelihood ratio test of the H0 that, after accounting for onchocercal infection and *visual impairment*, there is no association between visual impairment and odds of death.
```{r}
# Create a simpler regression without age
logit_vimp_mfpos <- glm(died ~ vimp + mfpos,
data = mortality,
family = binomial())
# LRT
lrtest(logit_vimp_mfpos_agegrp, logit_vimp_mfpos)
```
The LRT p << 0.001. Therefore there is very, *very* strong evidence for an association between age and odds of death (don't you say?) after accounting for visual impairment and onchocercal infection.
- It wouldn't have been possible to use a Wald's test, because the age group variable is not binary. And if you check in the output of the model at question 6, there is no single Wald's test, but there is only a LRT.
10.
What do you conclude about the relationship between vimp and death?
# Part 2: Mwanza
## Data import, exploration & management
Make sure you have the mwanza.dta dataset in the same folder as this .rmd, and load it. It contains data on HIV infection among women in Mwanza, Tanzania.
```{r}
# Import the dataset
mwanza <- read_dta("./mwanza.dta")
```
```{r}
#Familiarise yourself with the data
View(mwanza)
glimpse(mwanza)
summary(mwanza)
```
The ed variable represents education but it's represented in four different categories. We need to transform it into a binary variable.
```{r}
# Tabulate all possible values of ed
mwanza %$% table(ed)
# Recategorise and label education level
mwanza <- mwanza %>%
mutate(ed2 = as.factor(
case_when(
ed == 1 ~ "none",
ed == 2 ~ "any formal",
ed == 3 ~ "any formal",
ed == 4 ~ "any formal"
)
))
# Check it worked ok
mwanza %$% table(ed,ed2)
```
11.
Cross-tabulate binary education against HIV infection and get the crude OR.
```{r}
# 2x2 table with crude OR
mwanza %$% cc(case, ed2, graph = F)
```
12.
We need to use the religion variable, but let's make sure that there are no missing values first.
```{r}
mwanza %$% table(rel)
# Replace rel "9" with "NA"
mwanza$rel <- na_if(mwanza$rel, 9)
```
13.
Estimate the OR for HIV/education adjusted for religion, using the Mantel-Haenszel approach
```{r}
# M-H OR
mwanza %$% mhor(case, ed2, rel, graph = F)
```
- The M-H-adjusted OR is 0.52 (0.35-0.77). There is strong evidence for this association (M-H chi2 p = 0.001).
14.
Let's do the same with logistic regression
```{r}
# Create a model controlling for religion
logit_confounder <- glm(case ~ ed2 + rel,
data = mwanza,
family = binomial())
logistic.display(logit_confounder)
```
- The adjusted OR is 0.49 (0.34,0.72). p < 0.001. This is similar to the MH results above.
15.
Perform a LRT on the null hypothesis that, after controlling for religion, there is no association between education and HIV status.
```{r}
# Create a crude model excluding the missing value in the religion variable
logit_crude <- mwanza %>%
drop_na(rel) %>%
glm(case ~ ed2,
data = .,
family = binomial())
# LRT
lrtest(logit_confounder, logit_crude)
```
- LRT, p < 0.001