forked from yufree/datadown
-
Notifications
You must be signed in to change notification settings - Fork 0
/
10-beiyesi.Rmd
1409 lines (1134 loc) · 44.1 KB
/
10-beiyesi.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# 贝叶斯统计 {#bayes}
```{r include=FALSE}
library(knitr)
opts_chunk$set(echo=T,cache = T, warning = FALSE, message = FALSE)
options(digits = 3)
```
## 贝塔分布
- 贝塔分布的本质是概率分布的分布
- 棒球击球率的预测问题,你不可能预测一个刚打出本垒下一个也击中,会有一个先验概率
- 这个概率可以用一个参数 $\alpha$ 与 $\beta$ 的贝塔分布来描述,例如一共打了300个球,81个击中,219个击空,那么 $\alpha$ 为81,$\beta$ 为219
- 均值为$\frac{\alpha}{\alpha + \beta} = \frac{81}{81+219} = 0.27$
- 概率密度分布图,从图上我们可以看出一个大约在0.2-0.35的概率区间,表示击球的先验概率空间可能的取值
```{r beta}
library(ggplot2)
x <- seq(0,1,length=100)
db <- dbeta(x, 81, 219)
ggplot() + geom_line(aes(x,db)) + ylab("Density of beta")
```
## 为什么击球的概率分布符合贝塔分布?
- 设想球员A打了一个球打中了,那么在没有先验知识的情况下我会认为他击中概率为1
- 这个球员又打中了一个球,那么还是1
- 但第三个没打中,我们会认为他击中概率是0吗?
- 一般而言,这类连续击球问题可以用二项分布来描述,例如10个球打中8个的概率,我们假设这个击球概率为q,那么这个概率应该是个q的函数:
$$f(q) \propto q^a(1-q)^b$$
- q对于一个实际问题是确定的常数,所以出现这个场景的概率实际上是a与b的函数
- 为了保障这个概率函数累积为1,需要除一个跟a与b有关的数
- 这个数可以用贝塔函数$B(a,b)$来表示,数学证明[略](https://en.wikipedia.org/wiki/Conjugate_prior#Example)
- 如果接着打了一个中了,那么如何更新这个概率?
- 根据贝叶斯公式,最后推导出的结果如下:
$$Beta(\alpha+1,\beta+0)$$
- 那么我们对这个击球率的估计就略高了一点,这是贝塔分布的神奇之处,形式非常简单,理解也很直观
## 先验与后验
- 如果我们后续观察的击球少,那么不太容易影响到对概率的先验估计
```{r beta1}
x <- seq(0,1,length=100)
db <- dbeta(x, 81+1, 219)
ggplot() + geom_line(aes(x,db)) + ylab("Density of beta")
```
- 如果后续观察了大量的击球都中了,那么概率会偏向后面数据量的那一部分
```{r beta2}
x <- seq(0,1,length=100)
db <- dbeta(x, 81+1000, 219)
ggplot() + geom_line(aes(x,db)) + ylab("Density of beta")
```
- 这是贝叶斯分析的核心思想,通过证据更新经验
- 最后得到的均值(后验0.83)一定是介于经验值(先验0.27)与证据值(全击中就是1)之间
- 贝塔分布天然适合描述一个对概率的估计场景
- 另一种不那么严谨的理解方法是如果一个概率是稳定的,那么多次实验的结果差别不会太大,则有:
$$\frac{a}{b} = \frac{c}{d} = \frac{a+b}{c+d}$$
- 如果每次实验的概率持平,那么不存在不确定度;但如果前面实验的次数少而后面实验的次数多,那么概率会偏重于后面,这就是贝塔分布想说明的事
## 经验贝叶斯
- 对于两个球员,一个打了10个球中了4个,另一个打了1000个球中了300个,一般击中概率0.2,你会选哪一个?
- 我们对于小样本量的统计推断会有天然的不信任,如何通过统计量来描述?
- 下面用MLB的数据说明,首先提取出球员的击球数据:
```{r MLB}
library(dplyr)
library(tidyr)
library(Lahman)
# 拿到击球数据
career <- Batting %>%
filter(AB > 0) %>%
anti_join(Pitching, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
mutate(average = H / AB)
# 把ID换成球员名字
career <- Master %>%
tbl_df() %>%
dplyr::select(playerID, nameFirst, nameLast) %>%
unite(name, nameFirst, nameLast, sep = " ") %>%
inner_join(career, by = "playerID") %>%
dplyr::select(-playerID)
# 展示数据
career
# 击球前5
career %>%
arrange(desc(average)) %>%
head(5) %>%
kable()
# 击球后5
career %>%
arrange(average) %>%
head(5) %>%
kable()
```
- 如果仅考虑击球率会把很多板凳球员与运气球员包括进来,一个先验概率分布很有必要
- 那么考虑下如何得到,经验贝叶斯方法认为如果估计一个个体的参数,那么这个个体所在的整体的概率分布可作为先验概率分布
- 这个先验概率分布可以直接从数据中得到,然后我们要用极大似然或矩估计的方法拿到贝塔分布的两个参数:
```{r ebayes}
career_filtered <- career %>%
filter(AB >= 500)
m <- MASS::fitdistr(career_filtered$average, dbeta,
start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
# 看下拟合效果
ggplot(career_filtered) +
geom_histogram(aes(average, y = ..density..), binwidth = .005) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("Batting average")
```
## 从整体到个人
- 当我们估计个人的击球率时,整体可以作为先验函数,个人的数据可以通过贝塔分布更新到个体
- 那么如果一个人数据少,我们倾向于认为他是平均水平;数据多则认为符合个人表现
- 这事实上是一个分层结构,经验贝叶斯推断里隐含了这么一个从整体到个人的过程
```{r ebayes2}
career_eb <- career %>%
mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0))
# 击球率高
career_eb %>%
arrange(desc(eb_estimate)) %>%
head(5) %>%
kable()
# 击球率低
career_eb %>%
arrange(eb_estimate) %>%
head(5) %>%
kable()
# 整体估计
ggplot(career_eb, aes(average, eb_estimate, color = AB)) +
geom_hline(yintercept = alpha0 / (alpha0 + beta0), color = "red", lty = 2) +
geom_point() +
geom_abline(color = "red") +
scale_colour_gradient(trans = "log", breaks = 10 ^ (1:5)) +
xlab("Batting average") +
ylab("Empirical Bayes batting average")
```
- 数据点多会收缩到$x=y$,也就是个人的击球率;数据点少则回归到整体击球率
- 这就是经验贝叶斯方法的全貌:先估计整体的参数,然后把整体参数作为先验概率估计个人参数
## 可信区间与置信区间
- 经验贝叶斯可以给出点估计,但现实中我们可能更关心区间估计
- 一般这类区间估计可以用二项式比例估计来进行,不过没有先验经验的限制置信区间大到没意义
- 经验贝叶斯会给出一个后验分布,这个分布可以用来求可信区间
```{r ci}
library(broom)
# 给出后验分布
career <- Batting %>%
filter(AB > 0) %>%
anti_join(Pitching, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
mutate(average = H / AB)
career <- Master %>%
tbl_df() %>%
dplyr::select(playerID, nameFirst, nameLast) %>%
unite(name, nameFirst, nameLast, sep = " ") %>%
inner_join(career, by = "playerID")
career0 <- Batting %>%
filter(AB > 0) %>%
anti_join(Pitching, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB), year = mean(yearID)) %>%
mutate(average = H / AB)
career2 <- Master %>%
tbl_df() %>%
dplyr::select(playerID, nameFirst, nameLast, bats) %>%
unite(name, nameFirst, nameLast, sep = " ") %>%
inner_join(career0, by = "playerID")
career_eb <- career %>%
mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0))
career_eb <- career_eb %>%
mutate(alpha1 = H + alpha0,
beta1 = AB - H + beta0)
# 提取洋基队的数据
yankee_1998 <- c("brosisc01", "jeterde01", "knoblch01", "martiti02", "posadjo01", "strawda01", "willibe02")
yankee_1998_career <- career_eb %>%
filter(playerID %in% yankee_1998)
# 提取可信区间
yankee_1998_career <- yankee_1998_career %>%
mutate(low = qbeta(.025, alpha1, beta1),
high = qbeta(.975, alpha1, beta1))
yankee_1998_career %>%
dplyr::select(-alpha1, -beta1, -eb_estimate) %>%
knitr::kable()
# 绘制可信区间
yankee_1998_career %>%
mutate(name = reorder(name, average)) %>%
ggplot(aes(average, name)) +
geom_point() +
geom_errorbarh(aes(xmin = low, xmax = high)) +
geom_vline(xintercept = alpha0 / (alpha0 + beta0), color = "red", lty = 2) +
xlab("Estimated batting average (w/ 95% interval)") +
ylab("Player")
# 对比置信区间与可信区间
career_eb <- career_eb %>%
mutate(low = qbeta(.025, alpha1, beta1),
high = qbeta(.975, alpha1, beta1))
set.seed(2016)
some <- career_eb %>%
sample_n(20) %>%
mutate(name = paste0(name, " (", H, "/", AB, ")"))
frequentist <- some %>%
group_by(playerID, name, AB) %>%
do(tidy(binom.test(.$H, .$AB))) %>%
dplyr::select(playerID, name, estimate, low = conf.low, high = conf.high) %>%
mutate(method = "Confidence")
bayesian <- some %>%
dplyr::select(playerID, name, AB, estimate = eb_estimate,
low = low, high = high) %>%
mutate(method = "Credible")
combined <- bind_rows(frequentist, bayesian)
combined %>%
mutate(name2 = reorder(name, -AB)) %>%
ggplot(aes(estimate, name2, color = method, group = method)) +
geom_point() +
geom_errorbarh(aes(xmin = low, xmax = high)) +
geom_vline(xintercept = alpha0 / (alpha0 + beta0), color = "red", lty = 2) +
xlab("Estimated batting average") +
ylab("Player") +
labs(color = "")
```
- 可信区间与置信区间很大的区别在于前者考虑了先验概率进而实现了区间的收缩,后者则可看作无先验贝塔分布给出的区间估计,频率学派目前没有很好的收缩区间估计的方法
## 后验错误率
- 现实问题经常不局限于估计,而是侧重决策,例如如果一个球员的击球率高于某个值,他就可以进入名人堂(击球率大于0.3),这个决策常常伴随区间估计而不是简单的点估计
```{r lp}
# 以 Hank Aaron 为例
career_eb %>%
filter(name == "Hank Aaron") %>%
do(data_frame(x = seq(.27, .33, .0002),
density = dbeta(x, .$alpha1, .$beta1))) %>%
ggplot(aes(x, density)) +
geom_line() +
geom_ribbon(aes(ymin = 0, ymax = density * (x < .3)),
alpha = .1, fill = "red") +
geom_vline(color = "red", lty = 2, xintercept = .3)
# 提取该球员数据
career_eb %>% filter(name == "Hank Aaron")
# 计算其不进入名人堂的概率
pbeta(.3, 3850, 8818)
```
- 后验错误率(Posterior Error Probability)可类比经典假设检验中的显著性水平$\alpha$
- 后验包括率(Posterior Inclusion Probability)可类比经典假设检验中的置信水平$1-\alpha$
```{r ap}
# 所有球员的后验错误率分布,大部分不超过0.3
career_eb <- career_eb %>%
mutate(PEP = pbeta(.3, alpha1, beta1))
ggplot(career_eb, aes(PEP)) +
geom_histogram(binwidth = .02) +
xlab("Posterior Error Probability (PEP)") +
xlim(0, 1)
# 后验错误率与击球率的关系
career_eb %>%
ggplot(aes(eb_estimate, PEP, color = AB)) +
geom_point(size = 1) +
xlab("(Shrunken) batting average estimate") +
ylab("Posterior Error Probability (PEP)") +
geom_vline(color = "red", lty = 2, xintercept = .3) +
scale_colour_gradient(trans = "log", breaks = 10 ^ (1:5))
```
- 后验错误率高于0.3的多数是击球率与击球数都高的人,因为贝叶斯方法惩罚了击球数低的人
## 错误发现率
- 错误发现率(FDR)可用来控制一个整体决策,保证整体犯错的概率低于某个数值,错误发现率越高,越可能把假阳性包括进来
- 假如我们把进入名人堂的决策作为一个整体,则可允许一定的整体错误率,因为每个人的后验错误率可以计算且期望值线性可加和,我们可以得到一个整体的错误率
```{r FDR}
# 取前100个球员
top_players <- career_eb %>%
arrange(PEP) %>%
head(100)
# 总错率率
sum(top_players$PEP)
# 平均错误率
mean(top_players$PEP)
# 错误率随所取球员的变化
sorted_PEP <- career_eb %>%
arrange(PEP)
mean(head(sorted_PEP$PEP, 50))
mean(head(sorted_PEP$PEP, 200))
```
- 错误率在排序后前面低后面高,但这个错误率不特指某个球员,而是包含到某个球员的整体犯错的概率
## q值
- q值定义为排序后累积到某个样本的整体平均错误率,类似多重比较中对整体错误率控制的p值
```{r qvalue}
# 生成每个球员的q值
career_eb <- career_eb %>%
arrange(PEP) %>%
mutate(qvalue = cummean(PEP))
# 观察不同q值对名人堂球员数的影响
career_eb %>%
ggplot(aes(qvalue, rank(PEP))) +
geom_line() +
xlab("q-value cutoff") +
ylab("Number of players included")
# 观察小q值部分
career_eb %>%
filter(qvalue < .25) %>%
ggplot(aes(qvalue, rank(PEP))) +
geom_line() +
xlab("q-value cutoff") +
ylab("Number of players included")
```
- 200个人进入名人堂可能有约1/4的球员不合适,如果是50个人进入名人堂那么基本不会犯错
- q值是一个整体而非个体的平均错误率,具有累积性,不代表q值大的那一个就是错的
- q值在频率学派的多重比较里也有定义,虽然没有空假设(有先验概率),但实质等同
## 贝叶斯视角的假设检验
- 前面描述的是击球率如何求,如何进行区间估计与多个体的错误率控制,面向的个体或整体,那么如何解决比较问题
- 设想多个球员,我们考虑如何去比较他们击球率。如果两个球员击球率的概率密度曲线比较接近,那么即便均值有不同我们也无法进行区分;如果重叠比较少,那么我们有理由认为他们之间的差异显著
- 贝叶斯视角下如何定量描述这个差异是否显著?
### 模拟验证
- 单纯取样比大小然后计算比例
```{r sim}
# 提取两人数据
aaron <- career_eb %>% filter(name == "Hank Aaron")
piazza <- career_eb %>% filter(name == "Mike Piazza")
# 模拟取样10万次
piazza_simulation <- rbeta(1e6, piazza$alpha1, piazza$beta1)
aaron_simulation <- rbeta(1e6, aaron$alpha1, aaron$beta1)
# 计算一个人超过另一个人的概率
sim <- mean(piazza_simulation > aaron_simulation)
sim
```
### 数值积分
- 两个概率的联合概率分布,然后积分一个队员大于另一个的概率
```{r int}
d <- .00002
limits <- seq(.29, .33, d)
sum(outer(limits, limits, function(x, y) {
(x > y) *
dbeta(x, piazza$alpha1, piazza$beta1) *
dbeta(y, aaron$alpha1, aaron$beta1) *
d ^ 2
}))
```
### 解析解
- 两个贝塔分布一个比另一个高是有含有贝塔函数的解析解的:
$$p_A \sim \mbox{Beta}(\alpha_A, \beta_A)$$
$$p_B \sim \mbox{Beta}(\alpha_B, \beta_B)$$
$${\rm Pr}(p_B > p_A) = \sum_{i=0}^{\alpha_B-1}\frac{B(\alpha_A+i,\beta_A+\beta_B)}{(\beta_B+i) B(1+i, \beta_B) B(\alpha_A, \beta_A) }$$
```{r betaa}
h <- function(alpha_a, beta_a,
alpha_b, beta_b) {
j <- seq.int(0, round(alpha_b) - 1)
log_vals <- (lbeta(alpha_a + j, beta_a + beta_b) - log(beta_b + j) -
lbeta(1 + j, beta_b) - lbeta(alpha_a, beta_a))
1 - sum(exp(log_vals))
}
h(piazza$alpha1, piazza$beta1,
aaron$alpha1, aaron$beta1)
```
### 正态近似求解
- 贝塔分布在$\alpha$与$\beta$比较大时接近正态分布,可以直接用正态分布的解析解求,速度快很多
```{r betan}
h_approx <- function(alpha_a, beta_a,
alpha_b, beta_b) {
u1 <- alpha_a / (alpha_a + beta_a)
u2 <- alpha_b / (alpha_b + beta_b)
var1 <- alpha_a * beta_a / ((alpha_a + beta_a) ^ 2 * (alpha_a + beta_a + 1))
var2 <- alpha_b * beta_b / ((alpha_b + beta_b) ^ 2 * (alpha_b + beta_b + 1))
pnorm(0, u2 - u1, sqrt(var1 + var2))
}
h_approx(piazza$alpha1, piazza$beta1, aaron$alpha1, aaron$beta1)
```
## 比例检验
- 这是个列联表问题,频率学派对比两个比例
```{r fl}
two_players <- bind_rows(aaron, piazza)
two_players %>%
transmute(Player = name, Hits = H, Misses = AB - H) %>%
knitr::kable()
prop.test(two_players$H, two_players$AB)
```
- 贝叶斯学派对比两个比例
```{r bp}
credible_interval_approx <- function(a, b, c, d) {
u1 <- a / (a + b)
u2 <- c / (c + d)
var1 <- a * b / ((a + b) ^ 2 * (a + b + 1))
var2 <- c * d / ((c + d) ^ 2 * (c + d + 1))
mu_diff <- u2 - u1
sd_diff <- sqrt(var1 + var2)
data_frame(posterior = pnorm(0, mu_diff, sd_diff),
estimate = mu_diff,
conf.low = qnorm(.025, mu_diff, sd_diff),
conf.high = qnorm(.975, mu_diff, sd_diff))
}
credible_interval_approx(piazza$alpha1, piazza$beta1, aaron$alpha1, aaron$beta1)
```
- 多个球员对比一个
```{r mp}
set.seed(2016)
intervals <- career_eb %>%
filter(AB > 10) %>%
sample_n(20) %>%
group_by(name, H, AB) %>%
do(credible_interval_approx(piazza$alpha1, piazza$beta1, .$alpha1, .$beta1)) %>%
ungroup() %>%
mutate(name = reorder(paste0(name, " (", H, " / ", AB, ")"), -estimate))
f <- function(H, AB) broom::tidy(prop.test(c(H, piazza$H), c(AB, piazza$AB)))
prop_tests <- purrr::map2_df(intervals$H, intervals$AB, f) %>%
mutate(estimate = estimate1 - estimate2,
name = intervals$name)
all_intervals <- bind_rows(
mutate(intervals, type = "Credible"),
mutate(prop_tests, type = "Confidence")
)
ggplot(all_intervals, aes(x = estimate, y = name, color = type)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
xlab("Piazza average - player average") +
ylab("Player")
```
- 置信区间与可信区间的主要差异来自于经验贝叶斯的区间收敛
## 错误率控制
- 如果我打算交易一个球员,那么如何筛选候选人?
- 先选那些击球率更好的球员
```{r fdre}
# 对比打算交易的球员与其他球员
career_eb_vs_piazza <- bind_cols(
career_eb,
credible_interval_approx(piazza$alpha1, piazza$beta1,
career_eb$alpha1, career_eb$beta1)) %>%
dplyr::select(name, posterior, conf.low, conf.high)
career_eb_vs_piazza
# 计算q值
career_eb_vs_piazza <- career_eb_vs_piazza %>%
arrange(posterior) %>%
mutate(qvalue = cummean(posterior))
# 筛选那些q值小于0.05的
better <- career_eb_vs_piazza %>%
filter(qvalue < .05)
better
```
- 这样我们筛到一个可交易的群体,总和错误率不超过5%
## 影响因子
- 击球率高除了能力影响外还有可能是因为得到的机会多或者光环效应,例如一开始凭运气打得好,后面给机会多,通过经验累积提高了击球率
```{r if, echo=TRUE}
career %>%
filter(AB >= 20) %>%
ggplot(aes(AB, average)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_log10()
```
- 击球数低方差会大,这比较正常,很多人挂在起跑线上了
- 直接使用经验贝叶斯方法会导致整体向均值收敛,这高估了新手的数据
```{r ife}
prior_mu <- alpha0 / (alpha0 + beta0)
career_eb %>%
filter(AB >= 20) %>%
gather(type, value, average, eb_estimate) %>%
mutate(type = plyr::revalue(type, c(average = "Raw",
eb_estimate = "With EB Shrinkage"))) %>%
ggplot(aes(AB, value)) +
geom_point() +
scale_x_log10() +
geom_hline(color = "red", lty = 2, size = 1.5, yintercept = prior_mu) +
facet_wrap(~type) +
ylab("average") +
geom_smooth(method = "lm")
```
- 为了如实反应这种情况,我们应该认为击球率符合贝塔分布,但同时贝塔分布的两个参数受击球数的影响,击球数越多,越可能击中
- 这个模型可以用贝塔-二项式回归来描述
$$\mu_i = \mu_0 + \mu_{\mbox{AB}} \cdot \log(\mbox{AB})$$
$$\alpha_{0,i} = \mu_i / \sigma_0$$
$$\beta_{0,i} = (1 - \mu_i) / \sigma_0$$
$$p_i \sim \mbox{Beta}(\alpha_{0,i}, \beta_{0,i})$$
$$H_i \sim \mbox{Binom}(\mbox{AB}_i, p_i)$$
### 拟合模型
- 寻找拟合后的模型参数,构建新的先验概率
```{r fitbb}
library(gamlss)
# 拟合模型
fit <- gamlss(cbind(H, AB - H) ~ log(AB),
data = career_eb,
family = BB(mu.link = "identity"))
# 展示拟合参数
td <- tidy(fit)
td
# 构建新的先验概率
mu_0 <- td$estimate[1]
mu_AB <- td$estimate[2]
sigma <- exp(td$estimate[3])
# 看看AB对先验概率的影响
crossing(x = seq(0.08, .35, .001), AB = c(1, 10, 100, 1000, 10000)) %>%
mutate(density = dbeta(x, (mu_0 + mu_AB * log(AB)) / sigma,
(1 - (mu_0 + mu_AB * log(AB))) / sigma)) %>%
mutate(AB = factor(AB)) %>%
ggplot(aes(x, density, color = AB, group = AB)) +
geom_line() +
xlab("Batting average") +
ylab("Prior density")
```
### 求后验概率
```{r fitpo}
# 计算所有拟合值
mu <- fitted(fit, parameter = "mu")
sigma <- fitted(fit, parameter = "sigma")
# 计算所有后验概率
career_eb_wAB <- career_eb %>%
dplyr::select(name, H, AB, original_eb = eb_estimate) %>%
mutate(mu = mu,
alpha0 = mu / sigma,
beta0 = (1 - mu) / sigma,
alpha1 = alpha0 + H,
beta1 = beta0 + AB - H,
new_eb = alpha1 / (alpha1 + beta1))
# 展示拟合后的击球率
ggplot(career_eb_wAB, aes(original_eb, new_eb, color = AB)) +
geom_point() +
geom_abline(color = "red") +
xlab("Original EB Estimate") +
ylab("EB Estimate w/ AB term") +
scale_color_continuous(trans = "log", breaks = 10 ^ (0:4))
# 对比
library(tidyr)
lev <- c(raw = "Raw H / AB", original_eb = "EB Estimate", new_eb = "EB w/ Regression")
career_eb_wAB %>%
filter(AB >= 10) %>%
mutate(raw = H / AB) %>%
gather(type, value, raw, original_eb, new_eb) %>%
mutate(mu = ifelse(type == "original_eb", prior_mu,
ifelse(type == "new_eb", mu, NA))) %>%
mutate(type = factor(plyr::revalue(type, lev), lev)) %>%
ggplot(aes(AB, value)) +
geom_point() +
geom_line(aes(y = mu), color = "red") +
scale_x_log10() +
facet_wrap(~type) +
xlab("At-Bats (AB)") +
ylab("Estimate")
```
- 矫正后我们的数据更复合现实了,其实这是贝叶斯分层模型的一个简单版本,通过考虑更多因素,我们可以构建更复杂的模型来挖掘出我们所需要的信息
### 考虑更多因素
- 现在我们听说左利手跟右利手的表现可能不一样,所以我们要对模型进行完善,考虑把左右手参数加入模型
```{r career2}
# 展示数据
career2 %>%
count(bats)
# 排除NA
career3 <- career2 %>%
filter(!is.na(bats)) %>%
mutate(bats = relevel(bats, "R"))
# 重建模型
fit2 <- gamlss(cbind(H, AB - H) ~ log(AB) + bats,
data = career3,
family = BB(mu.link = "identity"))
# 观察参数
tidy(fit2)
sigma <- fitted(fit2, "sigma")[1]
crossing(bats = c("L", "R"),
AB = c(1, 10, 100, 1000, 10000)) %>%
augment(fit2, newdata = .) %>%
rename(mu = .fitted) %>%
crossing(x = seq(.1, .36, .0005)) %>%
mutate(alpha = mu / sigma,
beta = (1 - mu) / sigma,
density = dbeta(x, alpha, beta)) %>%
ggplot(aes(x, density, color = factor(AB), lty = bats)) +
geom_line() +
labs(x = "Batting average",
y = "Prior density",
color = "AB",
lty = "Batting hand")
```
- 存在先验概率的情况下,可以考虑考察随着击球数增长左右手的不同
```{r lrcom}
crossing(bats = c("L", "R"),
AB = c(10, 100, 1000, 10000)) %>%
augment(fit2, newdata = .) %>%
mutate(H = .3 * AB,
alpha0 = .fitted / sigma,
beta0 = (1 - .fitted) / sigma,
alpha1 = alpha0 + H,
beta1 = beta0 + AB - H,
estimate = alpha1 / (alpha1 + beta1),
conf.low = qbeta(.025, alpha1, beta1),
conf.high = qbeta(.975, alpha1, beta1),
record = paste(H, AB, sep = " / ")) %>%
ggplot(aes(estimate, record, color = bats)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
labs(x = "Estimate w/ 95% credible interval",
y = "Batting record",
color = "Batting hand")
```
- 另一个要考虑的因素是不同年份的平均击球率可能也有起伏
```{r batyear, eval=F}
career3 %>%
mutate(decade = factor(round(year - 5, -1))) %>%
filter(AB >= 500) %>%
ggplot(aes(decade, average)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Batting average")
# 用样条插值来进行拟合
library(splines)
fit3 <- gamlss(cbind(H, AB - H) ~ 0 + ns(year, df = 5) + bats + log(AB),
data = career3,
family = BB(mu.link = "identity"))
# 观察在击球数1000上先验概率的变化
plot_gamlss_fit <- function(f) {
career3 %>%
dplyr::select(year, bats) %>%
distinct() %>%
filter(bats != "B") %>%
mutate(AB = 1000) %>%
augment(f, newdata = .) %>%
rename(mu = .fitted) %>%
mutate(sigma = fitted(fit3, "sigma")[1],
alpha0 = mu / sigma,
beta0 = (1 - mu) / sigma,
conf_low = qbeta(.025, alpha0, beta0),
conf_high = qbeta(.975, alpha0, beta0)) %>%
ggplot(aes(year, mu, color = bats, group = bats)) +
geom_line() +
geom_ribbon(aes(ymin = conf_low, ymax = conf_high), linetype = 2, alpha = .1) +
labs(x = "Year",
y = "Prior distribution (median + 95% quantiles)",
color = "Batting hand")
}
plot_gamlss_fit(fit3)
```
- 同时另一个问题是这些因素会交互影响
```{r interact, eval=F}
fit4 <- gamlss(cbind(H, AB - H) ~ 0 + ns(year, 5) * bats + log(AB),
data = career3,
family = BB(mu.link = "identity"))
plot_gamlss_fit(fit4)
Pitching %>%
dplyr::select(playerID, yearID, GS) %>%
distinct() %>%
inner_join(dplyr::select(Master, playerID, throws)) %>%
count(yearID, throws, wt = GS) %>%
filter(!is.na(throws)) %>%
mutate(percent = n / sum(n)) %>%
filter(throws == "L") %>%
ggplot(aes(yearID, percent)) +
geom_line() +
geom_smooth() +
scale_y_continuous(labels = scales::percent_format()) +
xlab("Year") +
ylab("% of games with left-handed pitcher")
```
- 左右手之间的差距伴随年份在逐渐减少
```{r yearcompare,eval=F}
players <- crossing(year = c(1915, 1965, 2015),
bats = c("L", "R"),
H = 30,
AB = 100)
players_posterior <- players %>%
mutate(mu = predict(fit4, what = "mu", newdata = players),
sigma = predict(fit4, what = "sigma", newdata = players, type = "response"),
alpha0 = mu / sigma,
beta0 = (1 - mu) / sigma,
alpha1 = alpha0 + H,
beta1 = beta0 + AB - H)
players_posterior %>%
crossing(x = seq(.15, .3, .001)) %>%
mutate(density = dbeta(x, alpha1, beta1)) %>%
ggplot(aes(x, density, color = bats)) +
geom_line() +
facet_wrap(~ year) +
xlab("Batting average") +
ylab("Posterior density") +
ggtitle("Posterior distributions for batters with 30 / 100")
```
- 经验贝叶斯对先验概率的估计类似频率学派,但进行的又是贝叶斯分析
## 混合概率模型
- 用击球概率为例,击球手跟非击球手的概率分布是不一样的,那么实际看到的总体球员概率分布应该是一个混合在一起的两个独立分布
```{r mixermodel1}
# 找出投球3次以上的人
pitchers <- Pitching %>%
group_by(playerID) %>%
summarize(gamesPitched = sum(G)) %>%
filter(gamesPitched > 3)
# 参考上一章节的发现找出击球率稳定的选手
career <- Batting %>%
filter(AB > 0, lgID == "NL", yearID >= 1980) %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB), year = mean(yearID)) %>%
mutate(average = H / AB,
isPitcher = playerID %in% pitchers$playerID)
# 链接上名字
career <- Master %>%
tbl_df() %>%
dplyr::select(playerID, nameFirst, nameLast, bats) %>%
unite(name, nameFirst, nameLast, sep = " ") %>%
inner_join(career, by = "playerID")
```
### 期望最大算法
- 将一个分布拆成两个,可以使用期望最大算法
```{r EM}
set.seed(2017)
# 先随机分为两组
starting_data <- career %>%
filter(AB >= 20) %>%
dplyr::select(-year, -bats, -isPitcher) %>%
mutate(cluster = factor(sample(c("A", "B"), n(), replace = TRUE)))
# 观察效果
starting_data %>%
ggplot(aes(average, color = cluster)) +
geom_density()
library(VGAM)
fit_bb_mle <- function(x, n) {
# dbetabinom.ab 是用n、alpha与beta作为参数的二项贝塔分布的似然度函数
ll <- function(alpha, beta) {
-sum(dbetabinom.ab(x, n, alpha, beta, log = TRUE))
}
m <- stats4::mle(ll, start = list(alpha = 3, beta = 10), method = "L-BFGS-B",
lower = c(0.001, .001))
ab <- stats4::coef(m)
data_frame(alpha = ab[1], beta = ab[2], number = length(x))
}
# 看下初始参数
fit_bb_mle(starting_data$H, starting_data$AB)
# 看下随机分拆后的参数并生成各分组样本数的先验概率
fits <- starting_data %>%
group_by(cluster) %>%
do(fit_bb_mle(.$H, .$AB)) %>%
ungroup() %>%
mutate(prior = number / sum(number))
fits
```
- 算法优化的期望是将这两个分布分拆开,前面一次分拆已经产生微弱差异,下面就通过贝叶斯思想对数据更新这个差异重新分组让两者分开
```{r EMbayes}
assignments <- starting_data %>%
dplyr::select(-cluster) %>%
crossing(fits) %>%
mutate(likelihood = prior * VGAM::dbetabinom.ab(H, AB, alpha, beta)) %>%
group_by(playerID) %>%
top_n(1, likelihood) %>%
ungroup()
# 去除掉原有分组,根据更新的后验概率重新分组
assignments
# 观察更新后概率分布
ggplot(assignments, aes(average, fill = cluster)) +
geom_histogram()
```
- 不断重复这个过程,最终分拆数据(其实就是第一步分拆最重要,后面直接收敛了)
```{r EMrepeat}
set.seed(1987)
iterate_em <- function(state, ...) {
fits <- state$assignments %>%
group_by(cluster) %>%
do(mutate(fit_bb_mle(.$H, .$AB), number = nrow(.))) %>%
ungroup() %>%
mutate(prior = number / sum(number))
assignments <- assignments %>%
dplyr::select(playerID:average) %>%
crossing(fits) %>%
mutate(likelihood = prior * VGAM::dbetabinom.ab(H, AB, alpha, beta)) %>%
group_by(playerID) %>%
top_n(1, likelihood) %>%
ungroup()
list(assignments = assignments,
fits = fits)
}
library(purrr)
# 使用purrr包存储中间结果
iterations <- accumulate(1:5, iterate_em, .init = list(assignments = starting_data))
assignment_iterations <- iterations %>%
map_df("assignments", .id = "iteration")
# 观察收敛过程
assignment_iterations %>%
ggplot(aes(average, fill = cluster)) +
geom_histogram() +
facet_wrap(~ iteration)
fit_iterations <- iterations %>%
map_df("fits", .id = "iteration")
# 两个分布的收敛过程
fit_iterations %>%
crossing(x = seq(.001, .4, .001)) %>%
mutate(density = prior * dbeta(x, alpha, beta)) %>%
ggplot(aes(x, density, color = iteration, group = iteration)) +
geom_line() +
facet_wrap(~ cluster)
```
### 分配
- 得到每个选手在两个分布中后验概率后要对其进行分配,这里我们认为拆分出的两个分布其实就是是否是击球手的两个分组,由于两组重叠较多,直接分配会有困难
```{r assign}
# 找6个击球数100的选手进行分配
batter100 <- career %>%
filter(AB == 100) %>%
arrange(average)
batter100
# 前面算法得到的最终结果
fit_iterations
final_parameters <- fit_iterations %>%
filter(iteration == 1)
final_parameters
# 观察球员位置
final_parameters %>%
crossing(x = 0:45) %>%
mutate(density = prior * VGAM::dbetabinom.ab(x, 100, alpha, beta)) %>%
ggplot(aes(x, density)) +
geom_line(aes(color = cluster)) +
geom_vline(aes(xintercept = H), data = batter100, lty = 2) +
geom_text(aes(x = H, y = -.022, label = name), data = batter100, hjust = 1, vjust = 1, angle = 270) +
labs(x = "H (out of 100 at-bats)",
y = "Likelihood of this H out of 100 hits")
# 根据贝叶斯理论,我们可以用在A分组的似然度比上两个分组似然度的和得到后验概率
final_parameters %>%
crossing(H = 1:40) %>%
transmute(H, cluster, likelihood = prior * VGAM::dbetabinom.ab(H, 100, alpha, beta)) %>%
spread(cluster, likelihood) %>%
mutate(probabilityA = A /(A + B)) %>%
ggplot(aes(H, probabilityA)) +
geom_line() +
geom_vline(aes(xintercept = H), data = batter100, lty = 2) +
geom_text(aes(x = H, y = 0, label = name), data = batter100, hjust = 1, vjust = 1, angle = 270) +
labs(x = "H (out of 100 at-bats)",