Original poster: jerker

[Discussion] Interview with Cathy O'Neil, author of Doing Data Science (《数据科学实战》) (iTuring interview)

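Cathy O'Neil is a senior data scientist at Johnson Research Labs. She holds a PhD in mathematics from Harvard University, was a postdoc in the mathematics department at MIT and a professor at Barnard College, and has published many papers on arithmetic algebraic geometry. She worked as a quant at the hedge fund D.E. Shaw, a well-known global investment management firm, and later joined RiskMetrics, a software company that specializes in assessing risk for banks and hedge funds.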


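Cathy is a mathematician who later became a data scientist. Her personal blog, http://mathbabe.org/, is very popular. In its "about" section she says she has long been hoping for a better answer to this question: what can a non-academic mathematician do to make the world a better place? She and Rachel Schutt, an adjunct professor in the statistics department at Columbia University, wrote Doing Data Science based on a course called "Introduction to Data Science".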


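When Rachel (senior vice president of data science at News Corp) proposed the course "Introduction to Data Science" to the university, she happened to meet Cathy, who was then working as a data scientist at a startup. Cathy strongly supported Rachel's attempt to launch the course. Later, all the entries about "Introduction to Data Science" on Cathy's and Rachel's blogs became the raw material for Doing Data Science.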


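Doing Data Science is not a textbook on machine learning. Quite the opposite: it introduces data science in depth and from many angles. It surveys the existing field and tries to sketch a panorama of the discipline. It is also a book that presents data science comprehensively from a humanist perspective. The two authors care not only about tools, mathematics, models, algorithms, and code, but also about the human considerations that run through all of these. How do you bring humanism into data science? When you build models and design algorithms, recognize the role you play as an individual, and think about what people have that computers lack, such as ethical judgment. Before releasing a new statistical model to the world, think about how it will affect other people's lives.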

Source and book details

Source: iTuring community (http://www.ituring.com.cn/article/195789)
Book details: https://bbs.pinggu.org/thread-3645553-1-1.html


iTuring: What attracts you most about being a data scientist? Have you found any clues to this question: what can a non-academic mathematician do to make the world a better place?


I love data! I love seeing how we can learn about the way things work by measuring them. I particularly enjoy figuring out how to quantify something we only vaguely knew, or to compare the effects of two things that until that moment seemed incomparable.


My biggest clue is that I need to spend more time and energy making sure data scientists think before they act. Data science is powerful and influential and can be used for evil or good. We need to recognize that fact.


iTuring: Many readers are inspired by your blog, mathbabe. Have you been inspired by them in turn through your interactions?


Absolutely! My readers have brought me on many strange and intellectually stimulating journeys. I am thankful every day to them for doing so.


iTuring: In your view, what qualities and educational background best qualify a person to work in data science?


It really depends. I've written the book Doing Data Science with a mathematical background in mind, but honestly a data science team should also have people with backgrounds in philosophy and ethics who learn more scientific approaches. We need diversity of thought to solve a problem well.


iTuring: Some people believe that applications based on big data actually reinforce people's reliance on old habits, which keeps them from trying out new experiences. Do you agree?


That can be true. For example, a resume or application sorting algorithm that simply learns from historical data and then regenerates that old-fashioned decision-making is merely codifying all the biases that the system had, whether it's sexism or an over-reliance on certain college degrees. I suggest that people try to figure out what it is that they are actually looking for, and how they can locate those skills while staying as unbiased as possible. We should at least attempt this.


iTuring: Many companies have benefited enormously from big data analysis, but other companies that use big data to formulate policies and strategies have benefited little or have even failed. What mistakes do they make in the process?


Often they think that big data is magical. Of course it's not, you need good questions, and moreover you don't just need big data, you need the right data, which often hasn't been collected.


iTuring: Big data is used largely for prediction. Do you think accidental events can be predicted from deterministic data?


This question is a bit vague, but if I understand it correctly it's asking whether something that is fundamentally unpredictable can be predicted. I guess not! However, it's of course true that even stochastic processes have some underlying characteristics. For example, if you have a waiting time process, you can talk about when you'd be "surprised" that the event hasn't occurred, after defining what surprises you.
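The waiting-time remark can be made concrete with a small sketch. It assumes the waiting time is exponentially distributed with a known mean, which is an assumption of this example and not something stated in the interview. Under that assumption P(T > t) = exp(-t / mean), and "surprised" can be defined as the point where that probability falls to a threshold `alpha` that you pick yourself. The function names and the numbers are hypothetical.

```python
import math

def survival_probability(t, mean_wait):
    """P(T > t) for an exponential waiting time with the given mean (assumed model)."""
    return math.exp(-t / mean_wait)

def surprise_time(mean_wait, alpha=0.05):
    """Time t at which P(T > t) has dropped to alpha, i.e. when we call ourselves surprised."""
    return -mean_wait * math.log(alpha)

# Hypothetical numbers: events arrive every 10 minutes on average,
# and we define "surprised" as a 5% or smaller chance of waiting this long.
mean_wait = 10.0
threshold = surprise_time(mean_wait, alpha=0.05)
print(f"Surprised after {threshold:.1f} minutes without an event")
```

The individual event stays unpredictable. What the sketch quantifies is only how unusual a long wait is, once "surprise" has been defined.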


iTuring: NoSQL arose to give better access to web data, while traditional databases have come up with the concept of Data Space, in which the data comes first and the model comes afterwards. How is this technology applied nowadays? Are there similar topics that most people are not familiar with?


Generally speaking, big data uses unstructured and dirty data, at least to develop models. After the models go into production, there's sometimes a standard database being used, and definitely by the time the results and daily reports are being made, it is using standard databases.


I tend to ignore the details of this kind of data storage question, not because it's uninteresting but because it is rapidly changing. When I need to work on a new project, I go figure out what the current best technology is.


iTuring: In machine learning, training data is usually given. From an engineering standpoint, what is the most important (or trickiest) thing when extracting training data from a database: the traits of the data, its size, or the way it is extracted?


Really hard to say in general! Of course, sometimes you need just a huge amount of training data to train your model, and other times not so much but you need to be careful you are pulling a representative sample.
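One common way to pull a sample that stays representative is stratified sampling: draw the same fraction from every group so the group proportions are preserved. The sketch below is an illustration of that technique, not a method described in the interview. The `segment` field, the 90/10 split, and the function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from each group so group proportions are preserved."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # Take at least one row per group so small groups are not dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 90 rows in segment "A" and 10 rows in segment "B".
rows = [{"id": i, "segment": "A" if i < 90 else "B"} for i in range(100)]
sample = stratified_sample(rows, key=lambda r: r["segment"], fraction=0.1)
print(len(sample), "rows sampled")
```

A plain random draw of ten rows could easily miss segment "B" altogether, whereas the stratified draw keeps it in proportion.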


For my part I almost always split my data by timestamp when possible. I train my models on the earlier data, then I test them on later data.
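A timestamp-based split of this kind might look like the minimal sketch below. The record layout, the field names, and the cutoff date are hypothetical and serve only to illustrate the idea. The point is that nothing at or after the cutoff is visible during training.

```python
from datetime import datetime

def split_by_time(records, cutoff):
    """Return (train, test): records strictly before the cutoff, and those at or after it."""
    train = [r for r in records if r["timestamp"] < cutoff]
    test = [r for r in records if r["timestamp"] >= cutoff]
    return train, test

# Hypothetical records; the field names are illustrative only.
records = [
    {"timestamp": datetime(2015, 1, 10), "x": 1.0, "y": 0},
    {"timestamp": datetime(2015, 2, 14), "x": 2.5, "y": 1},
    {"timestamp": datetime(2015, 3, 3), "x": 0.7, "y": 0},
    {"timestamp": datetime(2015, 4, 1), "x": 3.1, "y": 1},
]

train, test = split_by_time(records, cutoff=datetime(2015, 3, 1))
print(len(train), "training rows;", len(test), "test rows")
```

Compared with a random split, this keeps information from the future out of the training set, so the test score is closer to what the model would achieve on data it has not yet seen.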


iTuring: To extract the key factors of a model, data analysts often need a good understanding of a particular business. Is there an easy way to get it, or is it an inevitable part of the job?


It is truly inevitable; only domain experts will be able to guide the modeling, at least near the beginning, when there are still easily achieved goals. Later on, when all the domain expertise has been included, it might become less domain specific.


iTuring: Data science has advanced greatly in recent years. Do you think any of the content in your book needs to be updated? And what content will remain unchanged for a long time?


Certainly! This is a fast moving field which I wanted to explain as an overview. If I rewrote this book today every chapter would be different. Even so, the overall approach of learning what you need, and being technical without losing sight of the human impact, will remain. As things progress, the techniques will get better and more mathematically complex, so in some sense this is the best time to be a data scientist.




2
shgby posted on 2015-4-10 22:49:31
Doing Data Science

3
冰蝶、 posted on 2015-4-10 23:28:20
Well done, OP...

4
jerker posted on 2015-4-10 23:32:49
Quoting 冰蝶、 (2015-4-10 23:28): Well done, OP...
Many thanks

5
mike68097 posted on 2015-4-10 23:47:12

6
sqy posted on 2015-4-11 00:09:32
Bump!!!!!!!!!!!!!!

7
feders posted on 2015-4-11 00:24:57
Just taking a look

8
sunyiping posted on 2015-4-11 04:38:04
Here to learn.

9
jiangqing001 posted on 2015-4-11 06:46:18
Here to learn!!

10
jerker posted on 2015-4-11 08:44:28
Quoting sunyiping (2015-4-11 04:38): Here to learn.
Don't stay up so late, my friend. It's bad for your health.
