Original poster: oliyiyi

Ebola Analytics and Data Science Lessons

The Ebola epidemic remains a major global health threat and continues to grow rapidly. The world's top epidemiologists and medical professionals are already examining the Ebola data from all angles, so my goal in this post is to draw some data science lessons rather than to discover something new. Still, we do find new results on changes in the Ebola death rate in Liberia and Sierra Leone, and on what they imply about the extent of undercounting of Ebola cases.

Ebola is presently an example of "small data" (let's hope it stays this way), but its data has many problems which lend themselves to useful lessons that can be applied to Big Data as well.

I have extracted the data for the 3 most affected countries (Guinea, Liberia, and Sierra Leone) from

The first lesson is to examine the data closely: the report date is usually a few days later than the date the data refers to, and the data dates sometimes differ between countries. Thus we have to keep a separate date for each country's data. The next figure shows the growth in the number of reported cases.


Fig 1. Total Number of Ebola Cases in West Africa, as of Oct 14, 2014

The first observation we can make is that the growth in the number of reported cases in Liberia - the most affected country - has slowed down since the middle of September.

Is it because the epidemic has slowed down, or because overwhelmed local health officials are not reporting all cases? If the epidemic has slowed down, we can expect a slowdown in the number of deaths as well.


Fig 2. Total Number of Ebola Deaths in West Africa, as of Oct 14, 2014

The grim Figure 2 shows no slowdown in deaths in Liberia, but it does show a significant jump in Sierra Leone deaths starting in late September.

Liberia

Next, we examine Liberia in more detail and look at the reported number of cases/day and number of deaths/day there.


Fig 3. Liberia Ebola Cases and Deaths per day (reported), Jun-Oct 2014.

The raw data is very noisy, since the reports come at different intervals. The next idea is to "smooth" the data by examining changes at regular intervals, and weekly intervals are the logical choice.
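As a minimal sketch of where the noise comes from, the per-day figures can be derived from cumulative totals reported at uneven intervals by dividing each increase by the number of days since the previous report. The function name `per_day_rates` and the sample counts are assumptions for illustration; the numbers are placeholders, not the reported data.

```python
from datetime import date


def per_day_rates(reports):
    """Average new count per day between consecutive cumulative reports.

    reports: list of (date, cumulative_count) tuples sorted by date.
    Returns a list of (end_date, new_count_per_day) tuples.
    """
    rates = []
    for (d0, c0), (d1, c1) in zip(reports, reports[1:]):
        days = (d1 - d0).days
        if days <= 0:
            raise ValueError("report dates must be strictly increasing")
        rates.append((d1, (c1 - c0) / days))
    return rates


# Placeholder cumulative counts for illustration only, NOT actual figures.
reports = [
    (date(2014, 9, 9), 1000),
    (date(2014, 9, 17), 1400),   # 8 days after the previous report
    (date(2014, 9, 23), 1700),   # 6 days after the previous report
]

for day, rate in per_day_rates(reports):
    print(day, rate)
```

Because the gap between reports varies, a late report lumps several days of cases into one interval, which is one reason the per-day series looks jumpy before it is resampled to weekly intervals.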

Since we see changing trends, a single global estimator will not work, even though we can fit a polynomial function to the total number of cases with R=0.987.

Instead, we will take the simplest approach and use local interpolation.

There is actual data for Liberia for Oct 7, Sep 30, and Sep 23 (all Tuesdays), but not for Sep 16. There is data for Sep 17 and Sep 9, so we can use these 2 points to estimate the data for Sep 16. Since the number of cases appears to grow exponentially, we will not use linear interpolation of the raw counts.

Instead we will take logarithms (which do look piecewise linear) and do a linear interpolation of the logarithms. For example, to get the estimate for the log of cases for DateB between Date1 and Date2, we use


log(Cases(DateB)) = log(Cases(Date2)) +
( log(Cases(Date1)) - log(Cases(Date2)) ) * (DateB - Date2) / (Date1 - Date2)


and then convert the log to the number of cases.
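The log-linear interpolation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the charts: the function name `log_linear_interpolate` is made up here, and the case counts are placeholders rather than the reported Liberia figures. Note that the interpolated log starts from log(Cases(Date2)) and adds the scaled difference.

```python
import math
from datetime import date


def log_linear_interpolate(date1, cases1, date2, cases2, date_b):
    """Estimate cases on date_b by interpolating log(cases) linearly in time."""
    if cases1 <= 0 or cases2 <= 0:
        raise ValueError("case counts must be positive to take logarithms")
    span_days = (date1 - date2).days
    if span_days == 0:
        raise ValueError("date1 and date2 must differ")
    # Fraction of the way from date2 to date1 (0 at date2, 1 at date1).
    fraction = (date_b - date2).days / span_days
    log_estimate = math.log(cases2) + (math.log(cases1) - math.log(cases2)) * fraction
    # Convert the interpolated log back to a number of cases.
    return math.exp(log_estimate)


# Placeholder counts for illustration only, NOT actual reported figures.
estimate = log_linear_interpolate(
    date(2014, 9, 17), 2000,   # Date1 and its hypothetical cumulative cases
    date(2014, 9, 9), 1000,    # Date2 and its hypothetical cumulative cases
    date(2014, 9, 16),         # DateB, the missing Tuesday
)
print(round(estimate))
```

With exponential growth between the two known points, the estimate lies below what straight-line interpolation of the raw counts would give, which is why the logarithms are interpolated instead.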

The following chart confirms that the estimated cases fit very nicely into the overall pattern.


Fig 4. Liberia Ebola reported and estimated Cases using log-linear approach, Jun-Oct 2014.

The data science lesson here is that the extra effort to estimate missing values is justified if it results in a simpler model.

Now we can look at the data at weekly intervals, which are indicated on the X axis of Fig 4.


Fig 5. Liberia Ebola Weekly Data (reported and estimated), Jun - Oct 2014

The number of cases/day declines sharply, but the number of deaths/day does not decline correspondingly. Since cases are more likely to be under-reported than deaths, the more likely conclusion is that the number of cases is heavily under-reported.

So the data science lesson here is to remember that data is only an approximation of the real world, and try to understand the process of data collection and measurement.

Sierra Leone

We can also do a similar analysis for Sierra Leone for July - October (in June the number of cases there was not reported reliably).


Fig 6. Sierra Leone Ebola Weekly Data (reported and estimated), July - Oct 2014.

We note continued growth in cases/day since Sep 10 and a sharp jump in deaths/day in early October - perhaps the sharpness of the jump is due to late reporting.

Death Rate

Next, we examine the Ebola death rate.

It was reported that the Ebola death rate has risen to 70%, but no supporting data was provided.

Some analyses I saw computed the death rate by comparing the number of cases diagnosed on day X with the number of deaths reported on the same day. This, however, is incorrect, since it takes Ebola a number of days to kill its victims.

Wikipedia says death from Ebola usually occurs within 7 to 16 days. Thomas Eric Duncan, the first Ebola patient diagnosed in the US, succumbed to Ebola about 13 days after his first hospital visit. The treatment for Ebola victims in West Africa is certainly less advanced than what he received, so a reasonable range for death from Ebola is about 1-2 weeks from diagnosis.

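So we can compute the death rate with a 1-week delay by dividing deaths(date=X) by cases(date=X - 1 week), that is, by the cases already reported a week before those deaths, and similarly for a 2-week delay.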
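A small sketch of the delayed death rate, assuming that deaths reported on a given date are compared with the cumulative cases reported `lag_days` earlier (the patients who had time to die). The function name `lagged_death_rate` and the weekly totals are hypothetical placeholders, not the reported data.

```python
from datetime import date, timedelta


def lagged_death_rate(cases, deaths, day, lag_days=7):
    """Cumulative deaths on `day` divided by cumulative cases `lag_days` earlier.

    cases, deaths: dicts mapping date -> cumulative total.
    Returns None when either value is missing or the case count is zero.
    """
    earlier = day - timedelta(days=lag_days)
    if day not in deaths or earlier not in cases or cases[earlier] == 0:
        return None
    return deaths[day] / cases[earlier]


# Placeholder weekly totals for illustration only, NOT actual figures.
cases = {
    date(2014, 9, 16): 2000,
    date(2014, 9, 23): 2600,
    date(2014, 9, 30): 3000,
}
deaths = {date(2014, 9, 30): 1500}

for lag in (0, 7, 14):
    rate = lagged_death_rate(cases, deaths, date(2014, 9, 30), lag)
    print(f"{lag:2d}-day delay: {rate:.1%}")
```

While cumulative cases keep growing, the same-day ratio is the smallest of the three, so it under-estimates the death rate; a longer delay divides by an earlier, smaller case count and gives a higher rate.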

The next chart looks at the Ebola death rate in Liberia with same-day (an under-estimate), 1-week, and 2-week delays. The chart does not show data before July 8, when there were fewer than 110 reported cases and both the 1-week and 2-week death rates were higher than 100% - suggesting a significant undercount of initial cases.


Fig 7. Liberia Ebola Death rate, same day, 1-week, and 2-weeks delay, July - Oct 2014

We see that as the number of cases grows, the death rate converges to around 60-65% for delays between 1 and 2 weeks.

Here is a similar chart for Sierra Leone death rate.


Fig 8. Sierra Leone Ebola Death rate, same day, 1-week, and 2-weeks delay, July - Oct 2014

In Sierra Leone there is a consistent gap between the same-day, 1-week, and 2-week death rates, which suggests more consistent reporting than in Liberia. All 3 death rates decline from July up to Oct 1, then increase in the last 2 weeks. The Sierra Leone death rate is lower than Liberia's, which suggests either better treatment or more under-reporting.

What can we infer about the spread of Ebola from this data?

Not much, since we have not used additional sources of data, such as geo-location.

The Oct 15, 2014 report from WHO has a very good chart showing the latest geographic spread of Ebola by county - see below.


Fig 9. Geographical distribution of new cases and total cases in Guinea, Liberia, and Sierra Leone, as of Oct 12, 2014.

We can see where the disease grows faster.

Other analyses used air traffic data to predict the spread of Ebola via airline travel - see below.


Fig 10. Air traffic data to predict the Ebola spread

Data Science Lesson: Bring as many additional data sources as you can to get a better understanding of the problem.

What do you think about Ebola analysis and Data Science lessons?

Let me know in the comments below.
