-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.Rmd
540 lines (410 loc) · 15.6 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
---
title: "Study *Tidy Finance with R*"
author: "Shitao"
date: "`r Sys.Date()`"
output:
rmdformats::downcute:
use_bookdown: true
self_contained: true
default_style: "light"
downcute_theme: "default"
number_sections: true
code_folding: "show"
css: "style.css"
code_download: true
includes:
before_body: header.html
after_body: footer.html
bibliography: references.bib
link-citations: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
fig.align = "center",
fig.asp = 0.618,
comment = "#>",
collapse = TRUE,
cache = TRUE)
options(digits = 3)
rmdformats::downcute(
use_bookdown = TRUE
)
# ggplot2 global theme
library(ggplot2)
theme_set(theme_bw() + theme(legend.position = "bottom"))
```
# Introduction {.unnumbered}
这是[Tidy Finance with R](https://www.tidy-finance.org)的学习记录,使用的R软件版本为`r paste0(version$major, ".", version$minor)`(`r paste(version$year, version$month, version$day, sep = "-")`),下载源码[^1]可点击页面最顶端标题右侧"Code \> Download Rmd"。
[^1]: 在安装后所需的包后,运行源码可生成该文件的文本内容。由于配置了css等样式文件,如果需要完全重现该文件,请在[Github](https://github.com/Shitao5/tidy-finance)下载该仓库后进行Knit。
对内容有疑问或建议的小伙伴,欢迎在[这里](https://github.com/Shitao5/tidy-finance)提交Issues或者Pull requests,帮助完善,在下先行谢过了。🤝🤝
# Introduction to Tidy Finance
说到Tidy(整洁),首先加载两个tidy开头的包:
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(tidyquant)
```
这里补充一下管道操作符`%>%`的知识,其作用是将前一步生成的结果作为下一步函数的第一个参数[^2]:
[^2]: 也可以不是第一个,这里先按下不表。 <!-- 💥记得到需要的时候表的哈!💥 -->
- `x %>% f()`等同于`f(x)`
- `x %>% f(y)`等同于`f(x, y)`
- `x %>% f() %>% g()`等同于`g(f(x))`
- `f(x) %>% g(y) %>% t(z)`等同于`t(g(f(x), y), z)`
是不是有点像"然后"的意思?它可以让写代码、读代码都安装从上到下、从左到右的顺序一步一步地进行,且不会生成很多的中间变量,大部分时候可以从数据到结果"一管到底",大大增加数据分析的流畅性。快捷键为`Ctrl + Shift + M`。在4.1以上的R版本中,自带了原生的管道:`|>`,它的主要作用于`%>%`大致相同,不过`%>%`功能更丰富。
揣着这个效率神器,我们出发!🚀
## 单个股票分析
读取苹果公司股票价格数据,`tq_get()`函数默认使用雅虎数据库,因此在国内需要连接外网后使用[^3]。
[^3]: 或者使用[RStudio Cloud](https://rstudio.cloud)运行代码。
```{r data-price, eval=FALSE, include=FALSE}
# 如不能连接外网,请运行这个代码块。
# save(price, file = "data/price.rda")
load("data/price.rda")
```
```{r}
price <- tq_get("AAPL", # Apple Stock
get = "stock.price",
from = "2000-01-01",
to = "2022-06-30")
price
```
简单的可视化,呈现价格走势:
```{r fig_AAPL}
price %>%
ggplot(aes(date, adjusted)) +
geom_line() +
labs(x = NULL, y = NULL,
title = "AAPL stock prices",
subtitle = "Prices in USD, adjusted for dividend payments and stock splits")
```
根据以下公式,用调整后价格计算日收益率(Daily Return)。
```{=tex}
\begin{equation}
\text{daily return} = \frac{p_t - p_{t-1}}{p_{t-1}} = \frac{p_t}{p_{t-1}} - 1
\end{equation}
```
计算前需先将数据根据日期排序,由于`r dplyr::arrange(price, date)$date[1]`为第一天,因此计算收益率时产生缺失值(NA),可以直接剔除。
```{r}
returns <- price %>%
arrange(date) %>%
mutate(ret = adjusted / lag(adjusted) - 1) %>%
select(symbol, date, ret) %>%
drop_na(ret)
returns
```
对日收益率进行汇总统计:
```{r}
summary(returns$ret)
```
可以看到最高日收益率为`r max(returns$ret) * 100`%,最低日收益达到`r min(returns$ret) * 100`%,而平均日收益率为`r mean(returns$ret) * 100`%,略高于0%。
绘制日收益率柱状图,更直观地展示数据。图中的红色竖线代表日收益率的5%分位数(`r quantile(returns$ret, .05)`)[^4]。
[^4]: 5%分位数与风险价值密切相关,也是监管机构通常监控的风险评估。
```{r fig_ret}
quantile_05 <- quantile(returns$ret, .05)
returns %>%
ggplot(aes(ret)) +
geom_histogram(bins = 100) +
geom_vline(aes(xintercept = quantile_05),
color = "red", linetype = "dashed") +
scale_x_continuous(labels = scales::percent) +
labs(x = NULL, y = NULL,
title = "Distribution of daily AAPL return",
subtitle = "The dotted vertical line indicates the historical 5% quantile")
```
按年分组后,对日收益率进行分组统计:
```{r paged.print=TRUE}
returns %>%
mutate(ret = ret * 100) %>%
group_by(year = year(date)) %>%
summarise(across(ret,
list(
daily_mean = mean,
daily_sd = sd,
daily_min = min,
daily_max = max
),
.names = "{.fn}"
)) %>%
print(n = Inf)
```
## 多个股票分析
基于整洁数据[@wickham2014]的特点,可以很方便地将单个股票的分析拓展到多个股票。接下来分析目前[道琼斯工业指数](https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average)中所有的成分股:
```{r message=FALSE, warning=FALSE}
ticker <- tq_index("DOW") # 获取成分股信息
ticker
```
获取所有成分股在2000-01-01到2022-06-30的股票信息:
```{r}
index_prices <- tq_get(ticker,
get = "stock.prices",
from = "2000-01-01",
to = "2022-06-30")
index_prices
```
共获得了来自30家公司的`r nrow(index_prices)`行数据,对调整价格进行可视化,查看总体走势:
```{r}
index_prices %>%
ggplot(aes(date, adjusted, color = symbol)) +
geom_line() +
labs(x = NULL, y = NULL, color = NULL,
title = "DOW index stock prices",
subtitle = "Prices in USD, adjusted for dividend payments and stock splits") +
theme(legend.position = "none")
```
对30家公司的日收益率进行汇总统计:
```{r}
all_returns <- index_prices %>%
group_by(symbol) %>%
mutate(ret = adjusted / lag(adjusted) - 1) %>%
select(symbol, date, ret) %>%
drop_na(ret) %>%
ungroup() # 解除分组
all_returns %>%
mutate(ret = ret * 100) %>%
group_by(symbol) %>%
summarise(across(
ret,
list(
daily_mean = mean,
daily_sd = sd,
daily_min = min,
daily_max = max
),
.names = "{.fn}"
))
```
到此为止,就可以下载和处理任意股票组合的价格数据了。比如用`ticker <- tq_index("SP500")`修改`ticker`为标普500的成分股信息,就可以分析标普500中股票价格数据。因为有很多数据需要下载,这个操作比较耗时,这里不演示。
## 其他数据汇总形式
道琼斯总交易量:
```{r}
volume <- index_prices %>%
mutate(volume_usd = volume * close / 1e9) %>%
group_by(date) %>%
summarise(volume = sum(volume_usd))
volume %>%
ggplot(aes(date, volume)) +
geom_line() +
labs(x = NULL, y = NULL,
title = "Aggregate daily trading volume in billion USD")
```
可利用45°线比较第$t$天与第$t-1$天的交易量:
```{r warning=FALSE}
volume %>%
ggplot(aes(lag(volume), volume)) +
geom_point() +
geom_abline(aes(intercept = 0, slope = 1),
linetype = "dotted", color = "red",
size = .8) +
labs(x = "Previous day aggregate trading volume (billion USD)",
y = "Aggregate trading volume (billion USD)",
title = "Trading volume on DOW Index versus previous day volume")
```
纯纯目测可以发现:交易量大的日子后面往往也是交易量大的日子。
::: {.bd-callout .bd-callout-info}
<h4>我看45°线</h4>
将股票价格记作$p$,这里的$x$轴为$p_{t-1}$,$y$轴为$p_t$,在一阶自回归模型中,$\beta_1$(线性回归系数)相比45°线可以显示更多信息。后面作者讲深入应该会涉及。
:::
## 投资组合问题
最佳投资组合追求高回报、低风险:
> The standard framework for optimal portfolio selection considers investors that prefer higher future returns but dislike future return volatility (defined as the square root of the return variance): the *mean-variance investor*.
```{r}
# 清洗掉行数较少的企业的所有股票
index_prices <- index_prices %>%
group_by(symbol) %>%
mutate(n = n()) %>%
ungroup() %>%
filter(n == max(n)) %>%
select(-n)
# 计算每个企业每月的回报率
returns <- index_prices %>%
mutate(month = floor_date(date, "month")) %>%
group_by(symbol, month) %>%
summarise(price = last(adjusted), .groups = "drop_last") %>%
mutate(ret = price / lag(price) - 1) %>%
drop_na(ret) %>%
select(-price)
```
将收益率转为$(T \times N)$的矩阵,计算样本回报率均值矩阵$\hat{\mu}$和样本协方差矩阵$\hat{\Sigma}$:
$$\hat\mu = \frac{1}{T}\sum\limits_{t=1}^T r_t$$
$$\hat\Sigma = \frac{1}{T-1}\sum\limits_{t=1}^T (r_t - \hat\mu)(r_t - \hat\mu)'$$
```{r}
return_matrix <- returns %>%
pivot_wider(names_from = symbol,
values_from = ret) %>%
select(-month)
mu <- colMeans(return_matrix)
Sigma <- cov(return_matrix)
```
::: {.bd-callout .bd-callout-warning}
<h4>难点警告</h4>
下面的内容理解还不透彻,先记录。
:::
接下来计算最小方差的投资组合权重$\omega_\text{mvp}$和它的预期收益$\omega_\text{mvp}'\mu$和波动率$\sqrt{\omega_\text{mvp}'\Sigma\omega_\text{mvp}}$。$\omega_\text{mvp}$是以下问题的解:
$$\omega_\text{mvp} = \arg\min w'\Sigma w \\
\text{ s.t. } \sum\limits_{i=1}^Nw_i = 1$$
```{r}
N <- ncol(return_matrix)
iota <- rep(1, N)
mvp_weights <- solve(Sigma) %*% iota # solve用于求逆
mvp_weights <- mvp_weights / sum(mvp_weights)
tibble(
expected_ret = t(mvp_weights) %*% mu,
volatility = sqrt(t(mvp_weights) %*% Sigma %*% mvp_weights)
)
```
尝试寻找一个可以获得三倍于最小方差组合的预期收益的投资组合权重:
```{r}
mu_bar <- 3 * t(mvp_weights) %*% mu
C <- as.numeric(t(iota) %*% solve(Sigma) %*% iota)
D <- as.numeric(t(iota) %*% solve(Sigma) %*% mu)
E <- as.numeric(t(mu) %*% solve(Sigma) %*% mu)
lambda_tilde <- as.numeric(2 * (mu_bar - D / C) / (E - D^2 / C))
efp_weights <- mvp_weights +
lambda_tilde / 2 * (solve(Sigma) %*% mu - D * mvp_weights)
```
## 有效边界
只要有两个有效的投资组合,就可以通过它们权重的线性组合得到有效边界。有效边界描述了在每个风险水平下可以实现的最高预期收益。
```{r}
c <- seq(from = -.4, to = 1.9, by = .01)
res <- tibble(
c = c,
mu = NA,
sd = NA
)
for (i in seq_along(c)) {
w <- (1 - c[i]) * mvp_weights + (c[i]) * efp_weights
res$mu[i] <- 12 * 100 * t(w) %*% mu
res$sd[i] <- 12 * sqrt(100) * sqrt(t(w) %*% Sigma %*% w)
}
```
```{r}
res %>%
ggplot(aes(sd, mu)) +
geom_point() +
geom_point(
data = res %>% filter(c %in% c(0, 1)),
size = 4
) +
geom_point(
data = tibble(
mu = 12 * 100 * mu,
sd = 12 * 10 * sqrt(diag(Sigma))
),
aes(sd, mu), size = 1, color = "blue"
) +
labs(
x = "Annualized standard deviation (in percent)",
y = "Annualized expected return (in percent)",
title = "Efficient frontier for Dow Jones constituents",
subtitle = "Thick dots indicate the location of the minimum variance and efficient tangency portfolio"
)
```
# 访问和处理金融数据
## Fama-French data
```{r message=FALSE}
library(lubridate)
library(scales)
library(frenchdata)
```
定义获取和储存的金融数据的起止日期,方便未来更新:
```{r}
start_date <- ymd("1960-01-01")
end_date <- ymd("2020-12-31")
```
`frenchdata`包提供了从[Prof. Kenneth French](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html)金融数据库下载和读取数据集的功能。
```{r message=FALSE, warning=FALSE}
# 月度数据
factors_ff_monthly_raw <- download_french_data("Fama/French 3 Factors")
factors_ff_monthly <- factors_ff_monthly_raw$subsets$data[[1]] %>%
transmute(
month = floor_date(ymd(str_c(date, "01")), "month"),
rf = as.numeric(RF) / 100,
mkt_excess = as.numeric(`Mkt-RF`) / 100,
smb = as.numeric(SMB) / 100,
hml = as.numeric(HML) / 100
) %>%
filter(month >= start_date & month <= end_date)
# 每日数据
factors_ff_daily_raw <- download_french_data("Fama/French 3 Factors [Daily]")
factors_ff_daily <- factors_ff_daily_raw$subsets$data[[1]] |>
transmute(
date = ymd(date),
rf = as.numeric(RF) / 100,
mkt_excess = as.numeric(`Mkt-RF`) / 100,
smb = as.numeric(SMB) / 100,
hml = as.numeric(HML) / 100
) |>
filter(date >= start_date & date <= end_date)
industries_ff_monthly_raw <- download_french_data("10 Industry Portfolios")
industries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>
mutate(month = floor_date(ymd(str_c(date, "01")), "month")) |>
mutate(across(where(is.numeric), ~ . / 100)) |>
select(month, everything(), -date) |>
filter(month >= start_date & month <= end_date)
```
可以用`get_french_data_list()`查看Kenneth French主页上的投资组合汇报时间序列数据。
## q-facrors
```{r}
factors_q_monthly_link <-
"http://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2020.csv"
factors_q_monthly <- read_csv(factors_q_monthly_link) |>
mutate(month = ymd(str_c(year, month, "01", sep = "-"))) |>
select(-R_F, -R_MKT, -year) |>
rename_with(~ str_remove(., "R_")) |>
rename_with(~ str_to_lower(.)) |>
mutate(across(-month, ~ . / 100)) |>
filter(month >= start_date & month <= end_date)
```
## 宏观经济预测因素
```{r}
library(readxl)
library(googledrive)
```
`googledrive`包用于连接谷歌硬盘。通常需要认证后才能用R与谷歌硬盘交互,而读取公共链接存储的数据,无需认证。
```{r}
drive_deauth()
macro_predictors_link <-
"https://drive.google.com/file/d/1ACbhdnIy0VbCWgsnXkjcddiV8HF4feWv/view"
drive_download(
macro_predictors_link,
path = "data/macro_predictors.xlsx"
)
```
数据被放置在了当前工作目录下的data文件夹[^project]。使用`readxl`包进行读取:
[^project]: 笔记采用项目进行管理,因此所有本地文件均为相对路径,方便项目迁移。
```{r}
macro_predictors <- read_xlsx("data/macro_predictors.xlsx",
sheet = "Monthly"
) %>%
mutate(month = ym(yyyymm)) %>%
filter(month >= start_date & month <= end_date) %>%
mutate(across(where(is.character), as.numeric)) %>%
mutate(
IndexDiv = Index + D12,
logret = log(IndexDiv) - log(lag(IndexDiv)),
Rfree = log(Rfree + 1),
rp_div = lead(logret - Rfree, 1), # Future excess market return
dp = log(D12) - log(Index), # Dividend Price ratio
dy = log(D12) - log(lag(Index)), # Dividend yield
ep = log(E12) - log(Index), # Earnings price ratio
de = log(D12) - log(E12), # Dividend payout ratio
tms = lty - tbl, # Term spread
dfy = BAA - AAA # Default yield spread
) %>%
select(month, rp_div, dp, dy, ep, de, svar,
bm = `b/m`, ntis, tbl, lty, ltr,
tms, dfy, infl
) %>%
drop_na()
```
读取数据至内存后,可以删除下载下来的文件:
```{r}
file.remove("data/macro_predictors.xlsx")
```
## 其他宏观经济数据
利用`tidyquant`包从圣路易斯联邦储备银行提供的联邦储备经济数据(FRED)中下载数据,例如下载CPI数据:
```{r}
cpi_monthly <- tq_get("CPIAUCNS",
get = "economic.data",
from = start_date,
to = end_date)
```
# References