This repository has been archived by the owner on Sep 15, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 3
/
R4SME 12 logistic quantitative.Rmd
334 lines (253 loc) · 8.64 KB
/
R4SME 12 logistic quantitative.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "R for SME 12: Logistic regression (quantitative exposures)"
author: Andrea Mazzella [link](https://github.com/andreamazzella)
output: html_notebook
---
## Basics
Load packages
```{r Load packages}
library(haven)
library(pubh)
library(rstatix)
library(epiDisplay)
library(magrittr)
library(tidyverse)
```
# Part 1: mortality
## Data import, exploration, management
1.
Make sure you have the mortality.dta dataset in the same folder as this .rmd
```{r Import data}
mortality<-read_dta("./mortality.dta")
mortality %<>% mutate_if(is.labelled,as_factor)
```
1a.
Explore the data.
What type of variables are "systolic" and "died"?
```{r Explore data}
glimpse(mortality)
summary(mortality)
View(mortality)
```
- systolic is a quantitative variable
- died is a binary variable coded as 0/1, but its class is "double" and not "factor".
```{r Transform "died" into a factor variable}
mortality %<>%
mutate(died = factor(
died,
levels = c(0, 1),
labels = c("alive", "dead")
))
```
Plot systolic BP against death. What does the plot show?
```{r Visualising}
# Scatterplot
mortality %>% ggplot(aes(x = systolic, y = died)) +
geom_point() +
theme_bw() +
labs(title = "Systolic blood pressure and death",
x = "Systolic blood pressure",
y = "Death")
## I am not sure why that's what they wanted us to do in the STATA practical. A scatteplot is only helpful with two continuous variables. With 1 continuous and 1 categorical, you use Box plots, which can give you a vague idea of how did sBP compare in dead vs alive.
# Box-and-whiskers plot
mortality %>% ggplot(aes(x = died, y = systolic)) +
geom_boxplot() +
labs(title = "Systolic blood pressure and death",
y = "Systolic blood pressure",
x = "Death") +
theme_bw() +
coord_flip()
```
1b.
Group systolic into three levels and add labels.
- <120: normal
- 120-139: pre-hypertension
- 140+: hypertension
```{r Categorise "systolic"}
mortality %<>%
mutate(systolic_grp = cut(
systolic,
breaks = c(0, 119, 139, +Inf),
labels = c("normal", "pre-hypertension", "hypertension")
))
# Check it worked ok
mortality %$% table(systolic, systolic_grp)
```
Explore the relation between systolic_grp and died.
```{r Crosstab systolic_grp and died}
mortality %$% tabpct(systolic_grp, died, percent = "row", graph = F)
```
- A higher percentage of hypertensives died than people with normal systolic BP.
1c, 1d.
Calculate the log{odds} of death in each sBP group, and plot them manually on a graph, then draw a line by eye to fit these points.
_Issue:_ I don't know how to get odds in R.
_Workaround:_ odds_trend() gives you the OR for each group of systolic BP and the baseline (normal systolic BP):
```{r OR}
# OR
odds_trend(died ~ systolic_grp, data = mortality)
# Test for trend
death_by_BP <- mortality %$% table(died, systolic_grp)
prop_trend_test(death_by_BP)
```
- pre-hypertension: OR 1.32 (0.87-2.00)
- hypertension: OR 3.63 (2.38-5.52)
- this command also gives you a graph (but not on the log scale)
- chi2 test for trend: p <0.001
1e.
Fit a logistic regression model for death, with exposure being grouped sBP treated as a quantitative variable. In order to do this, you simply need to transform the variable back to a numeric variable, with values of 1, 2, 3. Once you do that, the glm() command is the same.
```{r Logistic regression with a linear relation}
# Recode systolic_grp as numeric
mortality %<>%
mutate(systolic_quant = as.numeric(systolic_grp))
# Check
mortality %$% table(systolic_grp, systolic_quant)
# Logistic regression with linear relation - log odds scale
logit_linear <- glm(died ~ systolic_quant,
data = mortality,
family = binomial())
logit_linear
```
- What is the interpretation of the estimates for grouped sBP?
1f.
Now do the same in the odds (ie: non-log) scale.
```{r}
# Logistic regression with linear relation - odds scale display
logistic.display(logit_linear)
```
- What is the interpretation of the estimates for grouped sBP now?
1g.
Fit a model with grouped sBP assuming factors and perform a LRT on the following hypotheses:
- H0: the effect of grouped sBP on log(odds) of death is linear;
- H1: the relationship is not linear.
```{r}
logit_nonlin <- glm(died ~ systolic_grp,
data = mortality,
family = binomial())
logistic.display(logit_nonlin)
# LRT
lrtest(logit_linear, logit_nonlin)
```
1h.
Summarise your findings
- There is very strong evidence for a linear association between grouped sBP and death; OR = 1.86 (1.49,2.32) Wald's p < 0.001
- There is only weak evidence for a non-linear relationship between systolic BP and death (LRT p = 0.06)
1i.
Recode the sBP groups so that it's a quantitative variable, and each group contains a mid-point value in mmHg of that group.
Reminder of systolic BP categories:
- <120: normal
- 120-139: pre-hypertension
- 140+: hypertension
```{r}
# Recode the variable to numerical with (arbitrary) midpoints
mortality %<>%
mutate(systolic_grp_mid = as.numeric(
case_when(
systolic_grp == "normal" ~ 110,
systolic_grp == "pre-hypertension" ~ 130,
systolic_grp == "hypertension" ~ 150
)
))
# Check you haven't made a mess
mortality %$% table(systolic_grp, systolic_grp_mid)
```
Now fit the equivalent model to 1f.
```{r}
glm(died ~ systolic_grp_mid,
data = mortality,
family = binomial()) %>%
logistic.display()
```
- What is the interpretation of the estimate for grouped sBP now?
2a.
Consider a model with
- outcome: death;
- exposure: visual impairment;
- confounder: age (agegrp)
Does our estimate of effect of visual impairment differ if we treat age as quantitative rather than categorical ?
Fit a linear and a non-linear model.
```{r}
mortality %$% class(agegrp)
# Transform agegrp into a quantitative variable
mortality %<>%
mutate(agegrp_lin = as.numeric(agegrp))
mortality %$% table(agegrp, agegrp_lin)
mortality %$% class(agegrp_lin)
# Quantitative model
logit_linear2 <- glm(died ~ vimp + agegrp_lin,
data = mortality,
family = binomial())
logistic.display(logit_linear2)
# Categorical model
logit_nonlin2 <- glm(died ~ vimp + agegrp,
data = mortality,
family = binomial())
logistic.display(logit_nonlin2)
# LRT
lrtest(logit_linear2, logit_nonlin2)
```
- The adjusted OR are very similar in the two models
- LRT p-value = 0.786
- There is no evidence that for non-linearity.
2b.
Perform a LRT for the interaction between vimp and agegrp treated as a linear effect.
Is there evidence that the association of visual impairment on death differs by age group, where age group is treated as a linear effect?
```{r}
mortality %<>%
mutate(
agegrp2 =
case_when(
agegrp == "15-34" ~ 0,
agegrp == "35-54" ~ 1,
agegrp == "55-64" ~ 2,
agegrp == "65+" ~ 3
)
)
mortality %$% table(agegrp2, agegrp)
# Logistic regression with interaction
glm(died ~ vimp * agegrp2,
data = mortality,
family = binomial()) %>%
logistic.display()
# LRT
# Calculate stratum-specific ORs for visual impairment for the four age group strata
```
3.
Using the Mwanza dataset (MWANZA.dta), investigate the association between HIV infection and number of injections in the past year.
```{r}
# Import the dataset
mwanza <- read_dta("./mwanza.dta")
```
```{r}
#Familiarise yourself with the data
View(mwanza)
glimpse(mwanza)
summary(mwanza)
```
The variable "inj" is coded as follows:
1: does not inject
2-4: increasing number of injections per year
9: missing value
```{r}
mwanza %<>%
mutate(hiv = as.factor(case_when(case == 1 ~ "HIV+",
case == 0 ~ "negative")))
# Replace inj "9" with "NA"
mwanza$inj <- na_if(mwanza$inj, 9)
table(mwanza$inj)
mwanza %$% tabpct(hiv, inj, graph = F)
# Logistic regression including the zero group
glm(hiv ~ inj,
data = mwanza,
family = binomial()) %>%
logistic.display()
```
Remember that the zero group should be excluded in order to confirm that a trend with increasing numbers is not simply induced by a difference between the zero category and the rest.
```{r}
# Logistic regression excluding the zero group
mwanza %>%
filter(inj != 1) %>% # this line temporarily removes all observations with a value of `1` in "inj"
glm(hiv ~ inj,
data = .,
family = binomial()) %>%
logistic.display()
```