This repository has been archived by the owner on Sep 15, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 3
/
R4SME 03 survival
339 lines (259 loc) · 9.39 KB
/
R4SME 03 survival
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
---
title: "R for SME 3: Survival analysis"
author: Andrea Mazzella [link](https://github.com/andreamazzella)
output: html_notebook
---
# R Prerequisites
- Using assignment operator (<-)
- Difference between factors, integers, doubles
- Using %>% and %$% pipes
- Basic understanding of ggplot
If you need to learn about these, check out Daniel Carter's "R 4 STEPH" repository on github: https://github.com/danieljcarter/r4steph.
# Part 1: Introduction
Before analysing the Trinindad dataset, let's learn how to use R's stset equivalent with a sample dataset, "ovarian".
## A) Basics
1. Shift+click on this link, great tutorial:
https://www.youtube.com/watch?v=Wmc8bQoL-J0
:)
2. Install these packages if you haven't already
```{r Install packages}
install.packages("survival")
install.packages("surviminer")
install.packages("magrittr")
install.packages("lubridate")
install.packages("tidyverse")
```
3. Load packages
```{r Load packages}
library(survival)
library(survminer) # to get Kaplan-Meier plots in a ggplot-like style
library(magrittr) # for the %$% pipe
library(lubridate) # for easier calculations with dates
library(haven) # to import dta files
library(tidyverse) # as usual
```
## B) Data exploration
4. Load the "ovarian" dataframe, part of {survival}, and explore it. It contains data from a trial comparing two treatments for ovarian cancer.
- How many observations does it have?
- How many variables?
- What do these variables code?
- Which type are these variables?
- Which way of exploring data do you prefer?
```{r Explore data}
?ovarian #to get more information on the dataframe
View(ovarian)
glimpse(ovarian)
summary(ovarian)
```
## C) Create a survival item
In the R {survival} package, the equivalent command of Stata's "stset" is *Surv()*. Unlike Stata, it doesn't require dates as input; it requires pre-calculated time intervals (so, number of days). In the ovarian dataset, this is already (conveniently...) calculated for you.
```{r Survival item}
ovarian_surv <-
Surv(time = ovarian$futime, #days until event/censoring
event = ovarian$fustat) # outcome (usually 0 = no outcome, 1 = outcome; or TRUE/FALSE, or 1/2)
ovarian_surv
```
As you can see, Surv() creates a list of numbers (which represent the days of follow-up) followed by a "+" if the observation is censored. These numbers are the equivalent of the _t variable that Stata creates once you stset your dataset.
## D) Create a survival table
*survfit()* requires a formula containing your survival item. For the simplest survival table containing the whole dataset, this is "~ 1".
- How many deaths occurred?
- What was the median time of death in years?
```{r Survival table}
survfit(ovarian_surv ~ 1, data = ovarian)
```
If you want to divide the cohort into two groups, for example for performance status, you put the relevant categorical variable after "~":
```{r Survival table by PS}
survfit(ovarian_surv ~ ecog.ps, data = ovarian)
```
## E) Create Kaplan-Meier plots
*ggsurvplot()* is a function from {survminer} that allows you to plot KM curves using ggplot's functions.
It takes a survfit() table as an object.
The following code generates the simplest KM plot, with the whole cohort.
- What do you think the "+" represent in the graph?
```{r KM curve}
ggsurvplot(fit = survfit(ovarian_surv ~ 1, data = ovarian))
```
The "+" represent censored observations.
You can see that, unlike Stata, the 95% CI are automatically shown in the plot.
In order to stratify by a categorical variable, you need to change the formula within the survfit() function.
```{r KM curves by PS}
ggsurvplot(fit = survfit(ovarian_surv ~ ecog.ps, data = ovarian))
```
There are many options to customise these plots.
- What do these do?
- What do you think the p-value represents?
- Why are the confidence intervals huge?
- How much cooler is this than with Stata?
```{r Better KM curves}
ggsurvplot(
fit = survfit(ovarian_surv ~ ecog.ps, data = ovarian),
pval = TRUE,
censor = FALSE,
xlab = "Days",
legend.title = "ECOG performance status",
legend.labs = c("1", "2")
)
```
You can also change the direction of the curve to upwards with the option "fun" set to "event".
- What else changes in the graph?
```{r KM curves with events}
ggsurvplot(
fit = survfit(ovarian_surv ~ ecog.ps, data = ovarian),
fun = "event",
pval = TRUE,
censor = FALSE,
xlab = "Days",
legend.title = "ECOG performance status",
legend.labs = c("1", "2")
)
```
If you want more info on ggsurvplot(), check out this cheatsheet: [link](https://rpkgs.datanovia.com/survminer/survminer_cheatsheet.pdf).
## F) Calculating the probability of survival beyond a certain number of years
You use summary() on survfit and you specify the amount of _days_.
```{r}
summary(survfit(ovarian_surv ~ 1, data = ovarian), times = 365.25)
```
- What is the probability of survival after 2 years?
```{r}
```
To check the _median_ probability of survival, you simply launch survfit() as in point D) above.
## G) Log-rank test
You use *survdiff()* with a formula similar to the one you used in survfit() above. Obviously you have to group the data by another categorical variable, you can't run the test on the whole dataset.
- Is survival different according to performance status?
```{r}
survdiff(ovarian_surv ~ ecog.ps, data = ovarian)
```
# Part 2: SME practical 3
## 1
Read the Trinidad data (trinmlsh.dta; you need to put it into the folder where this .rmd file is) and familiarise yourself with it.
- What are the outcome variables?
- What are the follow-up time variables and which type are they?
- Bonus question: why is the way in which variables such as "smokenum" are coded slightly problematic?
```{r}
```
Code below
```{r Data import}
# Import the dataset
trin <- read_dta("./trinmlsh.dta")
# Familiarise yourself with the data
View(trin)
glimpse(trin)
summary(trin)
```
## 2
Examine the overall survival experience of these men, i.e. analyse the outcome called death.
Use the Kaplan-Meier method to produce a survival curve for overall mortality in the complete cohort.
Before we can make a survival curve, we need to Surv() the data. But the trinindad dataset does not have the times in the format that Surv() wants.
First, you need to calculate the difference in years between the dates of entering and exiting the study. Because calculations involving dates are weird, you need to use some special code, like {lubridate}'s *%--%* pipe, which calculates a time interval.
The *%<>%* pipe combines an assignment with a %>% pipe (the first line is equivalent to: "trin <- trin %>%").
```{r}
trin %<>%
mutate(followup =
as.duration(timein %--% timeout) / dyears(1))
trin %$% summary(followup)
```
Then, you can create a survival object with Surv(), as you did earlier.
```{r}
trinsurv <- trin %$% Surv(time = followup,
event = death)
```
Now you can make a Kaplan-Meier plot for overall mortality.
```{r}
```
Code:
```{r}
ggsurvplot(
fit = survfit(trinsurv ~ 1, data = trin),
censor = F,
xlab = "Years"
)
```
## 3
Examine the cumulative survival probability among these patients at 1, 3, and 5 years.
```{r}
```
Code:
```{r}
survfit(trinsurv ~ 1, data = trin) %>%
summary(., times = c(1, 3, 5))
# c(1,3,5) means: do this for time 1, time 3, and time 5.
```
## 4
Make the same K-M curve but without confidence intervals.
```{r}
```
Code:
```{r}
ggsurvplot(
fit = survfit(trinsurv ~ 1, data = trin),
censor = F,
conf.int = F,
xlab = "Years"
)
```
## 5
From smokenum, create a new variable called "smokstatus" that identifies the participants who were active smokers at entry into the study.
Tip: you might use mutate() to create a new variable and as.factor() to assign values and labels.
```{r}
```
Code:
```{r}
# Unify non-smokers + ex-smokers vs current smokers
trin %<>%
mutate(smokstatus = as.factor(
case_when(
smokenum == "0" ~ "non-smoker",
smokenum == "1" ~ "non-smoker",
smokenum == "2" ~ "active smoker",
smokenum == "3" ~ "active smoker",
smokenum == "4" ~ "active smoker",
smokenum == "5" ~ "active smoker"
)
))
# Check you've done it well
summary(trin$smokstatus)
trin %$% table(smokstatus, smokenum)
```
Now compare the survival curves of active smokers vs non-smokers.
- Do they look different?
```{r}
```
Code:
```{r}
ggsurvplot(
fit = survfit(trinsurv ~ smokstatus, data = trin),
censor = F,
conf.int = F,
xlab = "Years",
legend.title = "Smoking status",
legend.labs = c("non-smokers", "active smokers")
)
```
## 6
Use a log-rank test to compare these two survival curves.
- Is there evidence to say they are different?
```{r}
```
Code:
```{r}
survdiff(trinsurv ~ smokstatus, data = trin)
```
## 7-16
(The same functions with a different dataset + some functions from practical 2)
## 17
Plot cumulative mortality instead of cumulative survival (= get the curve to go upwards)
```{r}
```
Code:
```{r}
ggsurvplot(
fit = survfit(trinsurv ~ smokstatus, data = trin),
fun = "event",
pval = TRUE,
censor = FALSE,
xlab = "Years",
legend.title = "Smoking status",
legend.labs = c("non-smokers", "active smokers"))
```
## 18 onwards
(Only appliable to Stata's *stset* function)