Original poster: oliyiyi

The Data Science Puzzle, Explained


There is no dearth of articles around the web comparing and contrasting data science terminology. There are all sorts of articles written by all types of people relaying their opinions to anyone who will listen. It's almost overwhelming.

So let me set the record straight, for those wondering if this is one of those types of posts. Yes. Yes it is.

Why another one? I think that, while there may be an awful lot of opinion pieces defining and comparing these related terms, the fact is that much of this terminology is fluid, is not entirely agreed-upon, and, frankly, being exposed to other people's views is one of the best ways to test and refine your own.

So, while one may not agree entirely (or even minimally) with my opinion on much of this terminology, there may still be something one can get out of this. Several concepts central to data science will be examined. Or, at least, central in my opinion. I will do my best to put forth how they relate to one another and how they fit together as individual pieces of a larger puzzle.

As an example of somewhat divergent opinions, and prior to considering any of the concepts individually, KDnuggets' Gregory Piatetsky-Shapiro has put together the following Venn diagram which outlines the relationship between the very same data science terminology we will be considering herein. The reader is encouraged to compare this Venn diagram with Drew Conway's now famous data science Venn diagram, as well as my own discussion below and modified process/relationship diagram near the bottom of the post. I think that, while differences exist, the concepts line up with some degree of similarity (see the previous few paragraphs).





We will now give treatment to the same 6 selected core concepts as depicted in the above Venn diagram, and provide some insight as to how they fit together into a data science puzzle. First, we quickly dispense with one of the biggest buzz terms of the past decade.

Big Data

There are all sorts of articles available defining big data, and I won't spend much time on this concept here. I will simply state that big data could very generally be defined as datasets of a size "beyond the ability of commonly used software tools to capture, manage, and process." Big data is a moving target; this definition is both vague and accurate enough to capture its central characteristic.

As for the remaining concepts we will investigate, it's good to gain some initial understanding of their search term popularities and N-gram frequencies, in order to help separate the hard fact from the hype. Given that a pair of these concepts are relatively new, the N-gram frequencies for our 'older' concepts from 1980 to 2008 are shown above.



The more recent Google Trends show the rise of 2 new terms, the continued upward trend of 2 others, and the gradual, but noticeable, decline of the last. Note that big data was not included in the above graphics due to it already being quantitatively analyzed to death. Read on for further insights into the observations.

Machine Learning

According to Tom Mitchell in his seminal book on the subject, machine learning is "concerned with the question of how to construct computer programs that automatically improve with experience." Machine learning is interdisciplinary in nature, and employs techniques from the fields of computer science, statistics, and artificial intelligence, among others. The main artifacts of machine learning research are algorithms which facilitate this automatic improvement from experience, algorithms which can be applied in a variety of diverse fields.
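To make "automatically improve with experience" concrete, here is a minimal sketch, not anything taken from Mitchell's book. It assumes a hypothetical task (recover the hidden rule y = 2x + 1), a hypothetical learner (online gradient descent on a straight line), and mean squared error on held-out examples as the performance measure. The function names, learning rate, and sample sizes are all illustrative choices.

```python
import random

def train_linear(examples, lr=0.05):
    """Fit y ~ w*x + b by online gradient descent, one example at a time."""
    w, b = 0.0, 0.0
    for x, y in examples:
        err = (w * x + b) - y   # prediction error on this example
        w -= lr * err * x       # gradient step for the slope
        b -= lr * err           # gradient step for the intercept
    return w, b

def mean_squared_error(model, examples):
    """Performance measure: average squared error over a set of examples."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)

rng = random.Random(0)  # fixed seed so runs are repeatable

def make_examples(n):
    # Hypothetical task: the hidden rule is y = 2x + 1, with x in [-1, 1].
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

train = make_examples(500)
test = make_examples(100)

# "Experience" here is simply the number of training examples seen.
errors = {n: mean_squared_error(train_linear(train[:n]), test)
          for n in (5, 50, 500)}

for n, e in errors.items():
    print(f"examples seen: {n:3d}  test error: {e:.6f}")
```

The program itself never changes; only the amount of experience does, and the held-out error should shrink as the learner sees more examples.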

I don't think there is anyone who would doubt that machine learning is a central aspect of data science. I give the term data science detailed treatment below, but if you consider that at a very high level its goal is to extract insight from data, machine learning is the engine which allows this process to be automated. Machine learning has a lot in common with classical statistics, in that it uses samples to infer and make generalizations. Where statistics has more of a focus on the descriptive (though it can, by extrapolation, be predictive), machine learning has very little concern with the descriptive, and employs it only as an intermediate step in order to be able to make predictions. Machine learning is often thought to be synonymous with pattern recognition; while that really won't get much disagreement from me, I believe that the term pattern recognition implies a much less sophisticated and more simplistic set of processes than machine learning actually is, which is why I tend to shy away from it.

Machine learning has a complex relationship with data mining.


2
oliyiyi posted on 2016-7-6 08:13:48

Data Mining

Fayyad, Piatetsky-Shapiro & Smyth define data mining as "the application of specific algorithms for extracting patterns from data." This demonstrates that, in data mining, the emphasis is on the application of algorithms, as opposed to on the algorithms themselves. We can define the relationship between machine learning and data mining as follows: data mining is a process, during which machine learning algorithms are utilized as tools to extract potentially-valuable patterns held within datasets.
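As a rough illustration of "applying a specific algorithm to extract patterns from data", the sketch below counts which item pairs frequently co-occur in a set of transactions. This is a deliberately simple co-occurrence count that I have swapped in for illustration; it is not an algorithm from Fayyad, Piatetsky-Shapiro & Smyth, and the function name `frequent_pairs`, the `min_support` threshold, and the toy baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs that co-occur in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort and deduplicate so ("a", "b") and ("b", "a") count as one pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical toy dataset; the item names are illustrative only.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

patterns = frequent_pairs(baskets, min_support=3)
for pair, n in sorted(patterns.items()):
    print(pair, n)
```

The emphasis is on the process: choose a dataset, choose an algorithm, apply it, and inspect the patterns that come out. The algorithm itself is treated as an off-the-shelf tool.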

Data mining, as a sister term of machine learning, is also critical to data science. Before the explosion of the term data science, in fact, data mining enjoyed much greater success as a Google search term. Google Trends data stretching back a further 5 years than that shown in the above graphic shows that data mining was once much more popular. Today, however, data mining seems to be split as a concept between machine learning and data science itself. If one were to endorse the above explanation, that data mining is a process, then it makes sense to view data science as both a superset of data mining and a successor term.

Deep Learning

Deep learning is a relatively new term, although it has existed prior to the dramatic uptick in online searches of late. Enjoying a surge in research and industry, due mainly to its incredible successes in a number of different areas, deep learning is the process of applying deep neural network technologies - that is, neural network architectures with multiple hidden layers - to solve problems. Deep learning is a process, like data mining, which employs deep neural network architectures, which are particular types of machine learning algorithms.
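For readers who have not seen one, the sketch below shows what "multiple hidden layers" looks like structurally: a forward pass through a small fully connected network with two hidden layers. It is an untrained toy with random weights, written in plain Python. The layer sizes, the ReLU activation, and all function names are my own assumptions for illustration, not a reference to any particular deep learning library.

```python
import random

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] and zero biases (untrained, for shape only)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Apply ReLU after every hidden layer; leave the output layer linear."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return dense(x, weights, biases)

rng = random.Random(42)
sizes = [4, 8, 8, 3]  # input, two hidden layers, output (arbitrary sizes)
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

output = forward([0.5, -1.2, 3.0, 0.0], layers)
print(len(layers) - 1, "hidden layers ->", len(output), "outputs")
```

Training such a network (adjusting the weights from data) is where the machine learning happens; the stacked layers shown here are only the architecture that deep learning applies.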

Deep learning has racked up an impressive collection of accomplishments of late. In light of this, it's important to keep a few things in mind, at least in my opinion:

  • Deep learning is not a panacea - it is not an easy one-size-fits-all solution to every problem out there
  • It is not the fabled master algorithm - deep learning will not displace all other machine learning algorithms and data science techniques, or, at the very least, it has not yet proven so
  • Tempered expectations are necessary - while great strides have recently been made in all types of classification problems, notably computer vision and natural language processing, as well as reinforcement learning and other areas, contemporary deep learning does not scale to working on very complex problems such as "solve world peace"
  • Deep learning and artificial intelligence are not synonymous

Deep learning can provide an awful lot to data science in the form of additional processes and tools to help solve problems, and when observed in that light, deep learning is a very valuable addition to the data science landscape.

Artificial Intelligence

Most people find a precise, and oftentimes even a broad, definition of artificial intelligence difficult to put their finger on. I am not an artificial intelligence researcher, and so my answer here may differ wildly from that of someone who is, or may even upset folks in other fields. I have philosophized on the idea of AI a lot over the years, and I have come to the conclusion that artificial intelligence, at least the concept of it which we generally think of when we do think of it, does not actually exist.

In my opinion, AI is a yardstick, a moving target, an unattainable goal. Whenever we get on a path toward AI achievements, somehow these accomplishments seem to morph into being referred to as something else.

I once read something like the following: If you asked AI researchers in the 1960s what their idea of AI was, there would probably be consensus that a small device that fit in our pockets, which could help anticipate our next moves and desires, and had the entirety of human knowledge readily available at will, was true AI. But we all carry smartphones today, and very few of us would refer to them as artificial intelligence.

Where does AI fit into data science? Well, as I have stated that I don't believe that AI is really anything tangible, I guess it's hard to say that it fits in anywhere. But there are a number of areas related to data science and machine learning where AI has provided motivation, which at times is just as valuable as the tangible; computer vision certainly comes to mind, as does contemporary deep learning research, which have both benefited from the Artificial Intelligence Ethos at some point, if not indefinitely.

AI may well be the research and development apparatus with the deepest pockets which never actually produces anything in its namesake industry. While I would say that drawing a straight line from AI to data science may not be the best way to view the relationship between the 2, many of the intermediary steps between the 2 entities have been developed and refined by AI in some form.

Data Science

So, after discussing these related concepts and their place within data science, what exactly is data science? To me, this is the toughest concept of the lot to attempt to define precisely. Data science is a multifaceted discipline, which encompasses machine learning and other analytic processes, statistics and related branches of mathematics, and increasingly borrows from high performance scientific computing, all in order to ultimately extract insight from data and use this new-found information to tell stories. These stories are often accompanied by pictures (we call them visualizations), and are aimed at industry, research, or even just at ourselves, with the purpose of gleaning some new idea from The Data.

Data science employs all sorts of different tools from a variety of related areas (see everything you've read above here). Data science is both synonymous with data mining and a superset of concepts which includes data mining.

Data science yields all sorts of different outcomes, but they all share the common aspect of insight. Data science is all of this and more, and to you it may be something else completely... and we haven't even covered acquiring, cleaning, wrangling, and pre-processing data yet! And by the way, what even is data? And is it always big?

I think my idea of the data science puzzle, at least the version of it which can be represented by the above diagram, jibes well with Piatetsky-Shapiro's Venn diagram at the top of this post. I would also suggest that it is mostly in agreement with Drew Conway's data science Venn diagram, though I would add one caveat: I believe his very well-reasoned and useful graphic is actually referring to data scientists, as opposed to data science. This may be splitting hairs, but I don't think the { field | discipline | concept } of data science, itself, encompasses hacking skills; I believe this is a skill that scientists possess in order to allow them to do data science. Admittedly, this may be quibbling over semantics, but it makes sense in my mind.

Of course, this is not a full picture of the landscape, which is constantly evolving. For example, I recall reading, not very long ago, that data mining was a sub-field of business intelligence! Even with differences in opinions, I really can't imagine this being a valid idea today (it was difficult to accept a few years ago, to be honest).

And there you have it: some of your favorite terms bent out of shape in new ways you won't forgive me for. If you're furious right now and can't wait to tell me how wrong I am, remember the point of this post: you have just read one man's opinion. In that spirit, feel free to sound off in the comments with your (potentially heated and sharply) contrasting views. Otherwise, I hope that this has either exposed new readers to the puzzle which is data science or forced them to look at their own version of this puzzle in their heads.



3
wmjxy2425 posted on 2016-7-6 08:32:05


There is no dearth of articles around the web comparing and contrasting data science terminology. There are all sorts of articles written by all types of people relaying their opinions to anyone who will listen. It's almost overwhelming.

4
smartlife (employment verified) posted on 2016-7-6 08:32:28


5
wmjxy2425 posted on 2016-7-6 08:32:32
So, while one may not agree entirely (or even minimally) with my opinion on much of this terminology, there may still be something one can get out of this. Several concepts central to data science will be examined. Or, at least, central in my opinion. I will do my best to put forth how they relate to one another and how they fit together as individual pieces of a larger puzzle.

6
wmjxy2425 posted on 2016-7-6 08:33:04
So it only works once, huh

7
sqy posted on 2016-7-6 08:35:56


Bump!!!!!


8
lnulxg posted on 2016-7-6 08:40:52


I admire your work very much, and thanks a lot

Classy and impressive!!

10
colongkong posted on 2016-7-6 08:54:27
Data mining and deep learning are both important components of big data!