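Time-indexed tibble data needs somewhat different cleaning from an ordinary tibble. Below we try two packages on the same data set:
【Data】
The sample is tick-level, high-frequency data for the Shenzhen Composite Index (300106.SZ) on February 22 and 23, 2018. The data can be downloaded, but it is for study and practice only; do not use it for any commercial purpose.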
- # SETUP
- library(tidyverse)
- # load data
- load("i106.RData")
- i106 %>%
- as_tibble() %>%  # tbl_df() is deprecated in current dplyr; as_tibble() is its replacement
- head(3)
- #> datetime close volume
- #2018-02-22 09:25:03 1754.307 155945300
- #2018-02-22 09:30:00 1754.148 12515800
- #2018-02-22 09:30:03 1756.871 116001200
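For readers who do not have `i106.RData` at hand, a stand-in with the same three columns can be simulated. The sketch below uses base R only. The name `fake_ticks`, the 3-second spacing, the UTC time zone and the random-walk prices are all assumptions made for illustration, not properties of the real data.

```r
# Hypothetical stand-in for i106: synthetic ticks, NOT real index data.
set.seed(42)

# One tick every 3 seconds from 09:30:00 to 15:00:00
# (the lunch break is ignored to keep the sketch short).
make_day <- function(day) {
  seq(as.POSIXct(paste(day, "09:30:00"), tz = "UTC"),
      as.POSIXct(paste(day, "15:00:00"), tz = "UTC"),
      by = 3)
}
datetime <- c(make_day("2018-02-22"), make_day("2018-02-23"))

fake_ticks <- data.frame(
  datetime = datetime,
  close    = 1754 + cumsum(rnorm(length(datetime), sd = 0.3)),  # random walk
  volume   = round(runif(length(datetime), 1e6, 2e7))           # arbitrary volumes
)

str(fake_ticks)
```

With a data frame like this, `as_tibble(fake_ticks)` gives an object with the same column layout as the `i106` data used in the rest of the post.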
【Using the tidyquant package】
- library(tidyquant)
- # Convert to one-minute price data
- i106 %>%
- tq_transmute(select = close, mutate_fun = to.minutes) %>%
- head(3)
- #> datetime close
- #2018-02-22 09:25:03 1754.307
- #2018-02-22 09:30:57 1758.003
- #2018-02-22 09:31:57 1755.187
- # Convert to daily arithmetic returns
- i106 %>%
- tq_transmute(select = close,
- mutate_fun = periodReturn,
- period = "daily",
- type = "arithmetic")
- #> datetime daily.returns
- #2018-02-22 15:00:03 0.010066826
- #2018-02-23 15:00:03 0.001806975
- # Add the MACD indicator columns
- i106 %>%
- tq_mutate(select = close,
- mutate_fun = MACD,
- col_rename = c("MACD", "Signal")) %>%
- tail(3)
- #> datetime close volume MACD Signal
- #2018-02-23 14:57:00 1774.756 7101900 0.009579118 0.008594071
- #2018-02-23 14:57:03 1774.770 1073700 0.009929062 0.008861070
- #2018-02-23 15:00:03 1775.169 219719500 0.011883175 0.009465491
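The first two tidyquant calls above both reduce to "keep the last tick of each period", and the daily return is then the percentage change between those closes. A minimal base-R sketch of that idea follows. The helper `last_per_period()` and the toy prices are hypothetical, and measuring the first day against the first tick in the sample is an assumption chosen because it is consistent with the output shown above.

```r
# Toy ticks with made-up prices (not the real index data).
ticks <- data.frame(
  datetime = as.POSIXct(c("2018-02-22 09:25:03", "2018-02-22 09:30:00",
                          "2018-02-22 09:30:57", "2018-02-22 15:00:03",
                          "2018-02-23 09:30:00", "2018-02-23 15:00:03"),
                        tz = "UTC"),
  close = c(100, 100.5, 101, 102, 103, 105)
)

# Hypothetical helper: keep the last tick of every period.
# Assumes rows are already sorted by time.
last_per_period <- function(df, fmt) {
  key <- format(df$datetime, fmt)           # period label for every tick
  df[!duplicated(key, fromLast = TRUE), ]   # last row within each label
}

minute_bars <- last_per_period(ticks, "%Y-%m-%d %H:%M")  # one row per minute
daily       <- last_per_period(ticks, "%Y-%m-%d")        # one row per day

# Arithmetic return: close_t / close_{t-1} - 1.
# Assumption: the first day is compared with the first tick in the sample.
prev_close <- c(ticks$close[1], head(daily$close, -1))
daily$daily.returns <- daily$close / prev_close - 1

print(minute_bars)
print(daily)
```

Using a format string as the period key keeps the helper generic: changing the string switches between minute, hour and day bars without touching the rest of the code.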
【Using the tibbletime package (version 0.0.2)】
- library(tibbletime)  # install version 0.0.2 manually; the latest release, 0.1.1, is awkward to use
- # Sample the data once every 45 seconds
- i106 %>%
- as_tbl_time(index = datetime) %>%
- as_period(45~second) %>%
- slice(1:5)
- # A time tibble: 5 x 3
- # Index: datetime
- #   datetime            close    volume
- # * <dttm>              <dbl>     <dbl>
- # 1 2018-02-22 09:25:03 1754. 155945300.
- # 2 2018-02-22 09:30:00 1754.  12515800.
- # 3 2018-02-22 09:30:18 1758.   8272000.
- # 4 2018-02-22 09:31:03 1758.  10190700.
- # 5 2018-02-22 09:31:48 1755.  10640000.
- # Build 45-second candlestick (OHLC) bars
- i106 %>%
- as_tbl_time(index = datetime) %>%
- time_collapse(period = 45~second) %>%
- group_by(datetime) %>%
- summarise(open= first(close),
- high= max(close),
- low = min(close),
- close = last(close)) %>%
- slice(1:5)
- # A time tibble: 5 x 5
- # Index: datetime
- #   datetime             open  high   low close
- # * <dttm>              <dbl> <dbl> <dbl> <dbl>
- # 1 2018-02-22 09:25:03 1754. 1754. 1754. 1754.
- # 2 2018-02-22 09:30:15 1754. 1758. 1754. 1758.
- # 3 2018-02-22 09:31:00 1758. 1759. 1758. 1758.
- # 4 2018-02-22 09:31:45 1758. 1758. 1756. 1756.
- # 5 2018-02-22 09:32:30 1755. 1756. 1754. 1754.
- # Rolling summary of the data
- # First write the function that computes the summary
- summary_df <- function(x) {
- data.frame(rolled_summary_type = c("mean", "sd", "min", "max", "median"),
- rolled_summary_val = c(mean(x), sd(x), min(x), max(x), median(x)))
- }
- # Then turn it into a rolling version
- rolling_summary <- rollify(~summary_df(.x), window = 5, unlist = FALSE)
- i106 %>%
- mutate(summary_list_col = rolling_summary(close)) %>%
- filter(!is.na(summary_list_col)) %>%
- unnest(summary_list_col)  # name the list column explicitly
- # A tibble: 47,440 x 5
- #    datetime            close    volume rolled_summary_type rolled_summary_val
- #    <dttm>              <dbl>     <dbl> <fct>                            <dbl>
- #  1 2018-02-22 09:30:09 1758. 10882600. mean                           1756.
- #  2 2018-02-22 09:30:09 1758. 10882600. sd                                1.75
- #  3 2018-02-22 09:30:09 1758. 10882600. min                            1754.
- #  4 2018-02-22 09:30:09 1758. 10882600. max                            1758.
- #  5 2018-02-22 09:30:09 1758. 10882600. median                         1757.
- #  6 2018-02-22 09:30:12 1758.  8906900. mean                           1757.
- #  7 2018-02-22 09:30:12 1758.  8906900. sd                                1.56
- #  8 2018-02-22 09:30:12 1758.  8906900. min                            1754.
- #  9 2018-02-22 09:30:12 1758.  8906900. max                            1758.
- # 10 2018-02-22 09:30:12 1758.  8906900. median                         1757.
- # ... with 47,430 more rows
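The 45-second candlestick step above can also be sketched without tibbletime, which may help when a different package version behaves differently. The base-R version below uses made-up prices and assumes clock-aligned bins labelled by their start time, so the timestamps it produces will not necessarily match the labels tibbletime prints.

```r
# Toy ticks: one every 3 seconds for two minutes, made-up prices.
ticks <- data.frame(
  datetime = as.POSIXct("2018-02-22 09:30:00", tz = "UTC") + seq(0, 117, by = 3),
  close    = 100 + (0:39) %% 7
)

# Assign each tick to a 45-second bin (seconds since the epoch, rounded down).
bin <- floor(as.numeric(ticks$datetime) / 45) * 45

# tapply() returns results in sorted bin order, matching sort(unique(bin)).
bars <- data.frame(
  datetime = as.POSIXct(sort(unique(bin)), origin = "1970-01-01", tz = "UTC"),
  open     = as.numeric(tapply(ticks$close, bin, function(x) x[1])),
  high     = as.numeric(tapply(ticks$close, bin, max)),
  low      = as.numeric(tapply(ticks$close, bin, min)),
  close    = as.numeric(tapply(ticks$close, bin, function(x) x[length(x)]))
)

print(bars)
```

Rounding the numeric timestamp down is only one way to define the bins. Anchoring them at the first observation instead would be a one-line change to how `bin` is computed.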
【Summary】
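- The strength of tidyquant is that the cost of moving over from quantmod, TTR, xts and similar packages is low, which is good news for anyone already familiar with them. Its weakness is that the functions are fairly standardized, so customized data manipulation is harder, for example the 45-second slicing done in the tibbletime examples above.
- tibbletime handles more customized data cleaning, but its releases are not yet stable and rolling calculations are slow, so it still needs time to mature.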
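For the rolling summary in particular, one possible fallback while the package matures is plain base R. The sketch below uses `embed()` to lay out every 5-observation window as a matrix row and then summarises the rows. The input vector is made up, and no speed comparison with `rollify()` is claimed here.

```r
# Made-up price series for illustration.
x <- c(1, 2, 4, 7, 11, 16, 22)
window <- 5

# embed() returns one row per window, most recent value first.
w <- embed(x, window)

roll <- data.frame(
  end_index = window:length(x),    # position of the last tick in each window
  mean      = rowMeans(w),
  sd        = apply(w, 1, sd),
  min       = apply(w, 1, min),
  max       = apply(w, 1, max),
  median    = apply(w, 1, median)
)

print(roll)
```

As in the `rollify()` example, the first window ends at the fifth observation, so the result has `length(x) - window + 1` rows.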