# Logistic regression {#tidystats-logistic-regression}
```{r, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
fig.showtext = TRUE
)
```
This chapter covers logistic regression, one member of the family of generalized linear models.
## The problem
Suppose we have a dataset containing students' GRE scores and their admission status (admit = 1 means the student was admitted; admit = 0 means the student was not admitted).
```{r logistic-regression-1, message=FALSE, warning=FALSE}
library(tidyverse)
gredata <- read_csv("demo_data/gredata.csv")
gredata
```
Can we use a student's GRE score to **predict** admission status? Answering this question calls for a logistic regression model,
$$
\begin{align*}
\text{admit}_{i} &\sim \mathrm{Binomial}(1, p_{i}) \\
\text{logit}(p_{i}) &= \log\Big(\frac{p_{i}}{1 - p_{i}}\Big) = \alpha + \beta \cdot \text{gre}_{i} \\
\text{equivalent to,} \quad p_{i} &= \frac{1}{1 + \exp[- (\alpha + \beta \cdot \text{gre}_{i})]} \\
& = \frac{\exp(\alpha + \beta \cdot \text{gre}_{i})}{1 + \exp (\alpha + \beta \cdot\text{gre}_{i})} \\
\end{align*}
$$
Here $p_i$ is the probability of being admitted. The linear combination of the predictor $\text{gre}$ **models** $\log\Big(\frac{p_{i}}{1 - p_{i}}\Big)$, i.e., the log-odds.
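Incidentally, base R already provides this pair of functions: `qlogis()` computes the logit $\log\big(\frac{p}{1 - p}\big)$ and `plogis()` computes its inverse $\frac{1}{1 + \exp(-x)}$. A minimal sketch (the value 0.3 is just an arbitrary example) confirms that they are inverses of each other:
```{r logistic-regression-logit-check}
p <- 0.3          # an arbitrary probability
x <- qlogis(p)    # logit: log(p / (1 - p))
plogis(x)         # inverse logit recovers 0.3
```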
We translate this model into code with the **glm** function,
```{r logistic-regression-2}
model_logit <- glm(admit ~ gre,
data = gredata,
family = binomial(link = "logit")
)
summary(model_logit)
```
The estimated coefficient on gre is
```{r logistic-regression-3}
coef(model_logit)[2]
```
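For a tidier view of the full coefficient table, `broom::tidy()` also works on glm objects; setting `exponentiate = TRUE` reports the estimates on the odds scale instead. This is only shown as an alternative to `summary()`:
```{r logistic-regression-tidy}
broom::tidy(model_logit)                       # log-odds scale
broom::tidy(model_logit, exponentiate = TRUE)  # odds scale
```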
How should we interpret this 0.003582?
## Interpreting the model output
To better understand the model output, we express the coefficient on three different scales.
Let $p$ be the probability of admission.
| num | scale                 | formula                |
|:----|:----------------------|:-----------------------|
| 1   | The log-odds scale    | $\log \frac{p}{1 - p}$ |
| 2   | The odds scale        | $\frac{p}{1 - p}$      |
| 3   | The probability scale | $p$                    |
### The log-odds scale
The **coefficient** reported by the model (0.003582) is in fact on the log-odds scale. Concretely, the coefficient (0.003582) means: for every one-unit increase in the GRE score, the `log-odds(admission probability)` increases by 0.003582. (Note that it is not the admission probability that increases by 0.003582, but its log-odds.)
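We can verify this interpretation directly with `predict()`. In the sketch below the two GRE scores 500 and 501 are arbitrary; any pair one point apart would do:
```{r logistic-regression-logodds-check}
one_point_apart <- tibble(gre = c(500, 501))
log_odds_pred <- predict(model_logit, newdata = one_point_apart, type = "link")
diff(log_odds_pred)   # equals the gre coefficient, about 0.003582
```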
```{r logistic-regression-4, eval=FALSE, include=FALSE}
library(effects)
effect_link <- Effect("gre", mod = model_logit)
plot(effect_link,
type = "link",
main = "gre effect plot\n(log odds scale)"
)
```
To see this more clearly, let's plot the log-odds against the GRE scores.
```{r logistic-regression-5}
logit_log_odds <- broom::augment_columns(
model_logit,
data = gredata,
type.predict = c("link")
) %>%
rename(log_odds = .fitted)
```
```{r logistic-regression-6}
library(latex2exp)
logit_log_odds %>%
ggplot(aes(x = gre, y = log_odds)) +
geom_path(color = "#771C6D", size = 2) +
labs(title = "Log odds",
subtitle = "This is linear!",
x = NULL,
y = TeX("$log \\frac{p}{1 - p}$")) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
axis.title.y = element_text(angle = 90)
)
```
The plot shows that the relationship between the GRE score and `log-odds(admission probability)` is linear, and the slope is exactly the coefficient reported by the model, 0.003582.
### The odds scale
The second scale, the odds scale, may be a little easier to grasp than the first. We start by exponentiating the coefficient, `exp(0.003582) = 1.003588`:
```{r logistic-regression-7}
exp(0.003582)
```
The value 1.003588 means: for every one-unit increase in the GRE score, the `odds(admission probability)` is multiplied by 1.003588; for a two-unit increase, it is multiplied by (1.003588 * 1.003588). In other words, the effect is multiplicative.
Sometimes people prefer to express this as a `percentage increase`:
(exp(0.003582) - 1) x 100% = (1.003588 - 1) x 100% = 0.36%
That is, for every one-point increase in the GRE score, the `odds(admission probability)` grows by 0.36 percent.
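The multiplicative interpretation can also be checked numerically (again using the arbitrary scores 500 and 501): the ratio of the two predicted odds equals `exp()` of the gre coefficient.
```{r logistic-regression-odds-check}
one_point_apart <- tibble(gre = c(500, 501))
odds_pred <- exp(predict(model_logit, newdata = one_point_apart, type = "link"))
odds_pred[2] / odds_pred[1]   # ratio of odds for a one-point increase
exp(coef(model_logit)[2])     # the same value, about 1.003588
```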
Likewise, let's plot the `odds(admission probability)` against the GRE scores.
```{r logistic-regression-8}
logit_odds <- broom::augment_columns(
model_logit,
data = gredata,
type.predict = c("link")
) %>%
rename(log_odds = .fitted) %>%
mutate(odds_ratio = exp(log_odds))
```
```{r logistic-regression-9}
logit_odds %>%
ggplot(aes(x = gre, y = odds_ratio)) +
geom_line(color = "#FB9E07", size = 2) +
labs(title = "Odds",
subtitle = "This is curvy, but it's a mathy transformation of a linear value",
x = NULL,
y = TeX("$\\frac{p}{1 - p}$")) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
axis.title.y = element_text(angle = 90)
)
```
### The probability scale
The third scale is the probability scale. Because the model assumes that the GRE score has a linear relationship with `log-odds(admission probability)`, the GRE score clearly cannot have a linear relationship with the `admission probability` itself; that relationship is nonlinear. Let's first see what it looks like.
```{r logistic-regression-10, eval=FALSE, include=FALSE}
library(effects)
effect_response <- Effect("gre", mod = model_logit)
plot(effect_response,
type = "response",
main = "gre effect plot\n(probability scale)"
)
```
```{r logistic-regression-11}
logit_probs <- broom::augment_columns(
model_logit,
data = gredata,
type.predict = c("response")
) %>%
rename(pred_prob = .fitted)
```
```{r logistic-regression-12}
logit_probs %>%
ggplot(aes(x = gre, y = pred_prob)) +
#geom_point(aes(x = gre, y = admit)) +
geom_line(color = "#CF4446", size = 2) +
labs(title = "Predicted probabilities",
       subtitle = "Plug values of X into ",
       x = "X (value of explanatory variable)",
       y = TeX("$\\hat{P}(Y)$")) +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
```
We can see that the effect of the GRE score on the admission probability is **positive and nonlinear**. Specifically,
- at a GRE score of about 200, the admission probability is roughly 0.1;
- at a GRE score of about 500, the admission probability is roughly 0.25;
- at a GRE score of about 800, the admission probability is close to 0.5.
::: {.rmdnote}
Note how the predictor's effect looks on each of the three scales:
- on the `log_odds scale`, a one-unit change in the predictor shifts `log_odds()` by a fixed amount (the effect is additive);
- on the `odds scale`, a one-unit change in the predictor multiplies `odds()` by a fixed factor (the effect is multiplicative);
- on the `probability` scale, a one-unit change in the predictor no longer changes `probability` by a fixed amount.
:::
Which scale to use for interpreting the model output depends on the context: the first is easy to compute but hard to interpret, while the third is the easiest to interpret but is no longer linear.
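A small numerical illustration of the last point (the starting scores 300, 500, and 700 below are chosen arbitrarily): a one-point increase in GRE always adds 0.003582 to the log-odds, but the resulting change in probability depends on where you start.
```{r logistic-regression-prob-change}
starts <- c(300, 500, 700)
tibble(
  gre    = starts,
  p_from = predict(model_logit, newdata = tibble(gre = starts),     type = "response"),
  p_to   = predict(model_logit, newdata = tibble(gre = starts + 1), type = "response")
) %>%
  mutate(p_change = p_to - p_from)   # not a fixed amount across rows
```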
## Prediction
### Prediction versus fitted values
Let's first get acquainted with two commonly used functions, `predict()` and `fitted()`.
```{r logistic-regression-14}
gredata %>%
mutate(
pred = predict(model_logit),
fitted = fitted(model_logit)
)
```
In a linear model, `predict()` and `fitted()` used this way return identical results. In a glm, however, they differ: `predict()` returns values on the `log_odds(admission probability)` scale, whereas `fitted()` returns the `admission probability` itself. To make the two agree, we apply the inverse log-odds transformation to the output of `predict()`,
$$p = \frac{\exp(\eta)}{1 + \exp(\eta)}$$
where $\eta$ is the log-odds value returned by `predict()`. Concretely,
```{r logistic-regression-15}
gredata %>%
mutate(
pred = predict(model_logit),
fitted = fitted(model_logit),
pred2 = exp(pred) / (1 + exp(pred) )
)
```
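Incidentally, this inverse-logit transformation is exactly what base R's `plogis()` computes, so `plogis(predict(model_logit))` should reproduce `fitted(model_logit)`:
```{r logistic-regression-plogis}
gredata %>%
  mutate(
    pred   = predict(model_logit),
    fitted = fitted(model_logit),
    pred2  = plogis(pred)   # same as exp(pred) / (1 + exp(pred))
  )
```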
The point of this little exercise is simply to show that, in a glm, `predict()` and `fitted()` are not the same.
If we want `predict()` to return the `admission probability` directly, there is an easier way: the `type = "response"` option of `predict()` does it for us.
```{r logistic-regression-16}
gredata %>%
mutate(
pred = predict(model_logit, type = "response"),
fitted = fitted(model_logit)
)
```
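A quick sanity check confirms that the two columns now agree (up to numerical tolerance):
```{r logistic-regression-response-check}
all.equal(predict(model_logit, type = "response"), fitted(model_logit))
```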
### Predicting new data
Sometimes we want to use the fitted model to **predict** the probability of admission for a given set of GRE scores.
```{r logistic-regression-17}
newdata <- tibble(
gre = c(550, 660, 700, 780)
)
newdata
```
As mentioned earlier, the `type` argument of `predict()` has several options, `type = c("link", "response", "terms")`;
see also Chapter \@ref(tidystats-marginaleffects).
- `type = "link"` predicts the log_odds, which is simply `coef(model_logit)[1] + gre * coef(model_logit)[2]`
```{r logistic-regression-18}
newdata %>%
mutate(
pred_log_odds = predict(model_logit, newdata = newdata, type = "link"),
#
pred_log_odds2 = coef(model_logit)[1] + gre * coef(model_logit)[2]
)
```
- `type = "response""`,预测的是probabilities, 实际上就是`exp(pred_log_odds) / (1 + exp(pred_log_odds) )`
```{r logistic-regression-19}
newdata %>%
mutate(
pred_log_odds = predict(model_logit, newdata = newdata, type = "link")
) %>%
mutate(
pred_prob = predict(model_logit, newdata = newdata, type = "response"),
#
pred_prob2 = exp(pred_log_odds) / (1 + exp(pred_log_odds) )
)
```
- `type = "terms"`,返回一个矩阵,具体计算为:模型的**设计矩阵**中心化以后,与模型返回的系数相乘而得到的新矩阵^[https://stackoverflow.com/questions/37963904/what-does-predict-glm-type-terms-actually-do]。
```{r logistic-regression-20}
predict(model_logit, gredata, type = 'terms') %>%
as_tibble() %>%
head()
```
Let's reproduce this computation by hand:
```{r logistic-regression-21}
X <- model.matrix(admit ~ gre, data = gredata)
X %>%
as.data.frame() %>%
mutate(
across(everything(), ~ .x - mean(.x))
) %>%
transmute(
term = coef(model_logit)[2] * gre
) %>%
head()
```
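As a further check, the matrix returned by `type = "terms"` carries a `"constant"` attribute; adding that constant back to the row sums should recover the `type = "link"` predictions:
```{r logistic-regression-terms-check}
terms_pred <- predict(model_logit, gredata, type = "terms")
head(rowSums(terms_pred) + attr(terms_pred, "constant"))  # terms + constant
head(predict(model_logit, gredata, type = "link"))        # link-scale predictions
```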
```{r logistic-regression-22, echo = F}
# remove the objects
# rm(list=ls())
rm(gredata, logit_log_odds, logit_odds, logit_probs, model_logit, newdata, X)
```
```{r logistic-regression-23, echo = F, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```