Data Science: The End of Statistics?
This question was recently posted by Larry Wasserman on the Normal Deviate blog (see extract below). Larry is a statistics and machine learning professor at Carnegie Mellon University.
Here is my answer:
Data science is more than statistics: it also encompasses computer science and business concepts, and it's far more than a set of techniques and principles. I could imagine a data scientist not having a degree - this is not possible for a statistician. But the core of the issue, in my opinion, is explained below.
- I am one of the guys who contribute to the adoption of the keyword data science. Ironically, I'm a pure statistician (Ph.D. in statistics, 1993, in computational statistics), although I have changed a lot since 1993 and am now an entrepreneur. The reason I tried hard to move away from being called a statistician, to being called something (anything) else, is the American Statistical Association: they killed the keyword statistician and limited the career prospects of future statisticians by associating it almost exclusively with the pharmaceutical industry and small data (where most of the ASA's revenue comes from). They missed the boat - on purpose, I believe - of the new statistical revolution that came along with big data over the last 15 years.
- Statisticians should be very familiar with computer science, big data and software: 10 billion rows with 10,000 variables should not scare a true statistician. On the cloud (or even on my laptop as streaming data), it gets processed really fast. The first step is data reduction, but even if you must keep all observations and variables, it is still feasible. And good computer scientists also produce confidence intervals - you don't need to be a statistician for that, just use the First AnalyticBridge Theorem (if you are curious, check out the Second AnalyticBridge Theorem). The distinction between computer scientist and statistician is getting thinner and fuzzier over the years. The things you did not learn at school (in statistics classes), you can still learn online.
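To make the streaming point concrete, here is a minimal sketch of a confidence interval for a mean, computed in a single pass and without ever holding the data in memory. To be clear, this is not the First AnalyticBridge Theorem: it is the textbook normal-approximation interval, with the running variance maintained by Welford's one-pass update. The class name `StreamingMean` and the default `z=1.96` (roughly 95% coverage for large samples) are my own illustrative choices.

```python
import math


class StreamingMean:
    """Running mean and variance over a stream (Welford's one-pass update)."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def confidence_interval(self, z=1.96):
        """Normal-approximation interval for the mean (assumes a large sample)."""
        if self.n < 2:
            raise ValueError("need at least two observations")
        variance = self.m2 / (self.n - 1)            # unbiased sample variance
        half_width = z * math.sqrt(variance / self.n)
        return self.mean - half_width, self.mean + half_width


# Usage: feed values one at a time, e.g. while reading a large file line by line.
stream = StreamingMean()
for value in (1.0, 2.0, 3.0, 4.0, 5.0):
    stream.update(value)
low, high = stream.confidence_interval()
```

Memory use stays constant no matter how many rows go through, which is why the number of observations alone is not the obstacle. The interval does assume independent observations and a sample large enough for the normal approximation to be reasonable.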
This diagram misses a few key concepts - including business and domain knowledge
Here's the article:
As I see newspapers and blogs filled with talk of “Data Science” and “Big Data” I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.
The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming. I like what Karl Broman says:
When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.
If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.
Well put.
Maybe I am just pessimistic and am just imagining that statistics is getting left out. Perhaps, but I don’t think so. It’s my impression that the attention and resources are going mainly to Computer Science. Not that I have anything against CS of course, but it is a tragedy if Statistics gets left out of this data revolution.
Two questions come to mind:
- Why do statisticians find themselves left out?
- What can we do about it?