Original poster: Lay.Terry

[Other] A Data Scientist's Training Guide (repost)


Software engineer’s guide to getting started with data science

December 30, 2012
By prasoonsharma


    Many of my software engineer friends ask me about learning data science. There are many articles on this subject from renowned data scientists and sites (Dataspora, Gigaom, Quora, Hilary Mason). This post captures my journey, as a software engineer, in learning statistics and data visualization.
  


   
    I'm midway through a five-year journey to become proficient in data science. My learning program has included self-learning (books, blogs, toy problems), projects at work, classroom training (Stanford), teaching and presentations, and conferences (UseR, Strata). Here's what I've done so far, what worked, and what didn't.
     


     
________________________________________

1. GETTING STARTED

a) Self-learning (2 - 4 months)

    Explore whether data science is for you

    This is the key to getting started. Two years ago, some of us at work formed a study group to review the Stats 202 class material. This is what got me excited about data analytics and got me started. Only 2 of the 5 members of our study group chose to dive deeper into this field (data science is not for everyone).

    Learn basic statistics: The Stats 202 coursework is perfect for this.

    Learn a statistical tool: I spent 3 months heads-down learning R as a newbie and had the most fun doing so. Why learn R?

    Solve toy problems: Curiosity is key to data science. If you have questions about your country's economy, crime stats or sports performance, get the data and start answering them.

    Learn Unix tools: I chose O'Reilly's Data Analysis with Open Source Tools (a hands-on guide for programmers and data scientists) to read.

    Learn SQL and scripting languages: I know Java, Ruby and SQL. Python is on my list.

    There's a lot of training material available online:
  • Stats 202
  • Caltech Data Science course
  • Coursera: Introduction to Data Science, Machine Learning, Data Analysis, Computing for Data Analysis
  • University of California Berkeley - Introduction to Data Science
  • Knight Center for Journalism's course on Introduction to Infographics and Data Visualization
  • Stats 101: Udacity (Intro to Stats), Khan Academy, Carnegie Mellon's stats course
  • Learn R
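As an illustration of the "solve toy problems" advice above, here is a minimal sketch in Python (the language on the author's to-learn list, not the R he actually used). The two batsmen and their scores are made up for illustration; the question asked of the data is simply "who scores more on average, and who is more consistent?"

```python
import statistics

# Hypothetical toy data: runs scored by two made-up batsmen over eight innings.
innings = {
    "batsman_a": [45, 12, 78, 33, 0, 102, 56, 21],
    "batsman_b": [38, 41, 35, 50, 29, 44, 47, 40],
}


def summarize(scores):
    """Return the mean, median and sample standard deviation of the scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),
    }


summary = {name: summarize(scores) for name, scores in innings.items()}

for name, stats in summary.items():
    print(
        f"{name}: mean={stats['mean']:.1f}, "
        f"median={stats['median']:.1f}, stdev={stats['stdev']:.1f}"
    )
```

With real data the same three numbers are usually a reasonable first look: the mean and median say who scores more, and the standard deviation says how much the scores swing from innings to innings.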

b) Class-room training (9 - 12 months)

    If you're serious about learning, enroll in a formal program

    If you're serious about picking up this skill, opt for a course. The rigor of the class ensured that I didn't slack. Stanford offers great coursework to get started. Its courses are far superior to the many week-long training courses I've been to.


  • Data Mining and Analysis STATS202
  • Linear and Nonlinear Optimization MS&E211
  • Mining Massive Data Sets CS246
  • Modern Applied Statistics: Learning STATS315A
  • Statistical Methods in Finance STATS240P
  • Modern Applied Statistics: Data Mining STATS315B
  
________________________________________
  

2. GETTING FOCUSED

a) Spend 100% of my time on data science

    Once I was hooked on data science, spending only 20% of my time on it was not enough to build expertise. I needed to spend 100% of my time on it, so I sought out work problems related to data science (big data analysis, healthcare, marketing and sales, retail analytics, optimization problems).

b) Work on interesting problems

    I aligned my learning goals with my passion. I found it energizing and engaging to solve interesting problems while learning new techniques. I was interested in retail, healthcare and sports (cricket) data analysis.

c) Accelerate learning

    Teach: I taught introductory R and data mining classes to colleagues and friends. This helped me reinforce my learning and get others excited about the topic. It is also a great way for me to give back to the open source community. Blogging is another medium through which to contribute and learn.

    Follow the leaders in data science and network with data scientists: DJ Patil, Hilary Mason, Jeff Hammerbacher, Carla Gentry, Monica Rogati, Cathy O'Neil. These are the people I look up to. There are many others in this space; apologies for leaving them out.

    Follow interesting blogs: http://datascience101.wordpress.com, http://columbiadatascience.com/blog, http://www.r-bloggers.com, http://www.datawrangling.com, http://flowingdata.com (Quora's best data blog list)

    Attend conferences and meetups periodically: Local data science/R meetups and O'Reilly Strata are great! Given how rapidly this field is evolving, I go to Strata at least every other year. UseR is wonderful for seeing what's happening in the world of R.

    Learn Big Data techniques: MapReduce/Hadoop, cloud computing. I avoided picking any commercial, vendor-specific technology, and in retrospect it was a good decision.
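For readers new to MapReduce, the idea can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop's actual API; the function names and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; in a real cluster each document could live on a different machine.
documents = ["data science is fun", "R is fun", "data beats opinion"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)
```

The point of the split is that the map step looks at one document at a time and the reduce step looks at one key at a time, so a framework can spread both across many machines.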

d) Learn business domains

    I'm lucky to have access to internal and external experts in data science, and they've helped me understand their approach to data science problems (how they think, hypothesize, and test, assess or reject solutions). I've learned from them the importance of "hypothesis-driven data analysis" rather than "blind/brute-force data analysis". This highlighted the importance of understanding the business domain really well before trying to extract meaningful insights from the data. It led me to study operations research and marketing topics, as well as the retail, travel & logistics (revenue management) and healthcare industries. The NY Times recently published an article highlighting the need for intuition.
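One way to picture "hypothesis-driven" analysis is to state a single question first and then test only that. The sketch below is an assumption-laden illustration, not the method the experts mentioned above used: the hypothesis ("a promotion raises average basket size"), the store figures and the choice of a permutation test are all made up for this example.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a one-sided p-value for mean(group_b) > mean(group_a).

    Labels are shuffled repeatedly; the p-value is the share of shuffles whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_b) - statistics.mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[n_a:]) - statistics.mean(pooled[:n_a])
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations


# Hypothetical basket sizes (in dollars) for stores without and with the promotion.
control = [20, 22, 19, 24, 21, 23, 20, 22]
promotion = [25, 27, 24, 26, 28, 23, 27, 26]

observed, p_value = permutation_test(control, promotion)
print(f"observed difference in means: {observed:.3f}, p-value: {p_value:.4f}")
```

The brute-force alternative would be to compare every available column against every other and report whatever looks significant, which tends to surface coincidences rather than insight.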

________________________________________


3. DATA SCIENCE BOOKS I FOUND USEFUL

  • Introduction to Data Mining by Tan, Steinbach and Kumar. This is the textbook used in many introductory data science courses, including Stats 202 at Stanford. A great guide to keep handy.
  • R in a Nutshell
  • Data Analysis with Open Source Tools
  • Beautiful Visualization
  • See more books on data science: O'Reilly, Manning
   
________________________________________
  

4. WHAT DIDN'T WORK FOR ME

    Learning multiple statistical tools: A year ago, I started getting work requests for SAS programming, so I wanted to learn it. I tried for a month or so but could not do it. The main reasons were learning inertia and my love for the statistical tool I already knew: R. I really didn't need another statistical tool; I could solve most of my data science problems with R and the other software tools I knew. So my advice is: if you already know SAS, Stata, Matlab, SPSS or Statistica very well, stick with it. However, if you're learning a new statistical tool, pick R. R is open source, while most of the others are commercial software (expensive and complex).

    Auditing courses: I tried to follow self-paced coursework from Coursera and other MOOCs, but it wasn't effective for me. I needed the routine and pressure of a formal course with proper grading to get through the rigor.

    Increasing academic workload: Manage work-life balance and work commitments well. Earlier this year, I tried to take multiple difficult courses at the same time and quickly realized that I wasn't enjoying them or learning as I should.

    Sticking to the course textbook only: Many of the books in these classes are too "dense" for me (a software engineer), so I used other material to understand the concepts, e.g. the Carnegie Mellon notes on regression.
  
Author: prasoonsharma
Source: http://www.r-bloggers.com/softwa ... -with-data-science/


#2
xuruilong100, posted 2013-10-31 13:46:10
Awesome!!!

#3
420948492, posted 2013-10-31 13:48:07
Wherever there are people, there are rivalries.

#4
allg168, posted 2013-10-31 14:51:19
Here to learn.

#5
开始XZ (employment verified), posted 2013-11-1 12:33:02

#6
goodwell010, posted 2013-11-1 14:16:52
This suits me: digging out the secrets hidden in data, even using data to predict the future. Very interesting.
Thanks to the original poster for the effort.

#7
tsengxia, posted 2013-11-1 23:08:47
Studying this carefully. I bow in admiration.

#8
zxp5799873, posted 2013-11-2 00:18:28

#9
kai0261, posted 2013-11-13 10:29:29
(emoticon) (emoticon)

#10
goldbaodi, posted 2013-11-30 02:15:20
Nice!
