-
Notifications
You must be signed in to change notification settings - Fork 220
/
tidyverse_forcats.Rmd
330 lines (237 loc) · 7.67 KB
/
tidyverse_forcats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
# 因子型变量 {#tidyverse-forcats}
```{r, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
fig.showtext = TRUE
)
```
本章介绍R语言中的因子类型数据。因子型变量常用于数据处理和可视化中,尤其在希望不以字母顺序排序的时候,因子就格外有用。
## 什么是因子
因子是把数据进行**分类**并标记为不同层级(level,有时候也翻译成因子水平, 我个人觉得翻译为层级,更接近它的特性,因此,我都会用层级来描述)的数据对象,他们可以存储字符串和整数。因子类型有三个属性:
- 存储类别的数据类型
- 离散变量
- 因子的层级是有限的,只能取因子层级中的值或缺失(NA)
## 创建因子
```{r forcats-1, message=FALSE, warning=FALSE}
library(tidyverse)
library(palmerpenguins)
```
```{r forcats-2}
income <- c("low", "high", "medium", "medium", "low", "high", "high")
factor(income)
```
因子层级会自动按照字符串的字母顺序排序,比如`high low medium`。也可以指定顺序,
```{r forcats-3}
factor(income, levels = c("low", "high", "medium") )
```
不属于因子层级中的值, 比如这里因子层只有`c("low", "high")`,那么income中的"medium"会被当作缺省值NA
```{r forcats-4}
factor(income, levels = c("low", "high") )
```
相比较字符串而言,因子类型更容易处理,因此很多函数会自动的将字符串转换为因子来处理,但事实上,这也会造成,不想当做因子的却又当做了因子的情形,最典型的是在R 4.0之前,`data.frame()`中`stringsAsFactors`选项,默认将字符串类型转换为因子类型,但这个默认也带来一些不方便,因此在R 4.0之后取消了这个默认。在tidyverse集合里,有专门处理因子的宏包`forcats`,因此,本章将围绕`forcats`宏包讲解如何处理因子类型变量,更多内容可以参考[这里](https://r4ds.had.co.nz/factors.html)。
```{r forcats-5}
library(forcats)
```
## 调整因子顺序
前面看到因子层级是按照字母顺序排序
```{r forcats-6}
x <- factor(income)
x
```
也可以指定顺序
```{r forcats-7}
x %>% fct_relevel( c("high", "medium", "low"))
```
或者让"medium" 移动到最前面
```{r forcats-8}
x %>% fct_relevel( c("medium"))
```
或者让"medium" 移动到最后面
```{r forcats-9}
x %>% fct_relevel("medium", after = Inf)
```
可以按照字符串第一次出现的次序
```{r forcats-10}
x %>% fct_inorder()
```
按照其他变量的中位数的升序排序
```{r forcats-11}
x %>% fct_reorder(c(1:7), .fun = median)
```
## 应用
调整因子层级有什么用呢?
这个功能在ggplot可视化中调整分类变量的顺序非常方便。这里为了方便演示,我们假定有数据框
```{r forcats-12}
d <- tibble(
x = c("a","a", "b", "b", "c", "c"),
y = c(2, 2, 1, 5, 0, 3)
)
d
```
先画个散点图看看吧
```{r forcats-13}
d %>%
ggplot(aes(x = x, y = y)) +
geom_point()
```
我们看到,横坐标上是a-b-c的顺序。
### fct_reorder()
`fct_reorder()`可以让x的顺序按照x中每个分类变量对应y值的中位数升序排序,具体为
- a对应的y值`c(2, 2)` 中位数是`median(c(2, 2)) = 2`
- b对应的y值`c(1, 5)` 中位数是`median(c(1, 5)) = 3`
- c对应的y值`c(0, 3)` 中位数是`median(c(0, 3)) = 1.5`
因此,x的因子层级的顺序调整为c-a-b
```{r forcats-14}
d %>%
ggplot(aes(x = fct_reorder(x, y, .fun = median), y = y)) +
geom_point()
```
当然,我们可以加一个参数`.desc = TRUE`让因子层级变为降序排列b-a-c
```{r forcats-15}
d %>%
ggplot(aes(x = fct_reorder(x, y, .fun = median, .desc = TRUE), y = y)) +
geom_point()
```
但这样会造成x坐标标签一大串,因此建议可以写`mutate()`函数里
```{r forcats-16}
d %>%
mutate(x = fct_reorder(x, y, .fun = median, .desc = TRUE)) %>%
ggplot(aes(x = x, y = y)) +
geom_point()
```
我们还可以按照y值中最小值的大小降序排列
```{r forcats-17}
d %>%
mutate(x = fct_reorder(x, y, .fun = min, .desc = TRUE)) %>%
ggplot(aes(x = x, y = y)) +
geom_point()
```
### fct_rev()
按照因子层级的逆序排序
```{r forcats-18}
d %>%
mutate(x = fct_rev(x)) %>%
ggplot(aes(x = x, y = y)) +
geom_point()
```
### fct_relevel()
```{r forcats-19}
d %>%
mutate(
x = fct_relevel(x, c("c", "a", "b"))
) %>%
ggplot(aes(x = x, y = y)) +
geom_point()
```
## 可视化中应用
可能没说明白,那就看企鹅柱状图吧
```{r forcats-20}
ggplot(penguins, aes(y = species)) +
geom_bar()
```
```{r forcats-21}
ggplot(penguins, aes(y = fct_rev(species))) +
geom_bar()
```
```{r forcats-22a, eval=FALSE}
penguins %>%
count(species) %>%
pull(species)
penguins %>%
count(species) %>%
mutate(species = fct_relevel(species, "Chinstrap", "Gentoo", "Adelie")) %>%
pull(species)
```
```{r forcats-22}
# Move "Chinstrap" in front, rest alphabetic
ggplot(penguins, aes(y = fct_relevel(species, "Chinstrap"))) +
geom_bar()
```
```{r forcats-23}
# Use order "Chinstrap", "Gentoo", "Adelie"
ggplot(penguins, aes(y = fct_relevel(species, "Chinstrap", "Gentoo", "Adelie"))) +
geom_bar()
```
```{r forcats-24}
penguins %>%
mutate(species = fct_relevel(species, "Chinstrap", "Gentoo", "Adelie")) %>%
ggplot(aes(y = species)) +
geom_bar()
```
```{r forcats-25}
ggplot(penguins, aes(y = fct_relevel(species, "Adelie", after = Inf))) +
geom_bar()
```
```{r forcats-26}
# Use the order defined by the number of penguins of different species
# The order is descending, from most frequent to least frequent
penguins %>%
mutate(species = fct_infreq(species)) %>%
ggplot(aes(y = species)) +
geom_bar()
```
```{r forcats-27}
penguins %>%
mutate(species = fct_rev(fct_infreq(species))) %>%
ggplot(aes(y = species)) +
geom_bar()
```
```{r forcats-28}
# Reorder based on numeric values
penguins %>%
count(species) %>%
mutate(species = fct_reorder(species, n)) %>%
ggplot(aes(n, species)) +
geom_col()
```
## 作业
- 画出的2007年美洲人口寿命的柱状图,要求从高到低排序
```{r forcats-29}
library(gapminder)
gapminder %>%
filter(
year == 2007,
continent == "Americas"
)
```
```{r forcats-30, eval=FALSE, echo = FALSE}
gapminder %>%
filter( year == 2007, continent == "Americas") %>%
mutate( country = fct_reorder(country, lifeExp)) %>%
ggplot(aes(lifeExp, country)) +
geom_point()
```
- 这是四个国家人口寿命的变化图
```{r forcats-31}
gapminder %>%
filter(country %in% c("Norway", "Portugal", "Spain", "Austria")) %>%
ggplot(aes(year, lifeExp)) + geom_line() +
facet_wrap(vars(country), nrow = 1)
```
- 要求给四个分面排序,按每个国家寿命的中位数
```{r forcats-32, eval=FALSE, echo = FALSE}
gapminder %>%
filter(country %in% c("Norway", "Portugal", "Spain", "Austria")) %>%
mutate(country = fct_reorder(country, lifeExp)) %>% # default: order by median
ggplot(aes(year, lifeExp)) + geom_line() +
facet_wrap(vars(country), nrow = 1)
```
- 要求给四个分面排序,按每个国家寿命差(最大值减去最小值)
```{r forcats-33, eval=FALSE, echo = FALSE}
gapminder %>%
filter(country %in% c("Norway", "Portugal", "Spain", "Austria")) %>%
# order by custom function: here, difference between max and min
mutate(country = fct_reorder(country, lifeExp, function(x) { max(x) - min(x) })) %>%
ggplot(aes(year, lifeExp)) + geom_line() +
facet_wrap(vars(country), nrow = 1)
```
```{r forcats-34, echo = F}
# remove the objects
# rm(list=ls())
rm(d, income, x)
```
```{r forcats-35, echo = F, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```