Enabling Exploratory Data Science with Spark and R
Speaker: Hossein Falaki
Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that he was a data scientist at Apple's personal assistant, Siri. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).
Academics
My Ph.D. research focused on making mobile phones smarter networked devices when used in health applications. My Ph.D. dissertation is available here. As a Master's student at the University of Waterloo, I was a member of the Tetherless Computing Lab, where I worked on the KioskNet Project with Prof. S. Keshav. I also studied scanning strategies for opportunistic communication over Wi-Fi on mobile devices.
R is a favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization.

However, handling large datasets with R is challenging, especially when data scientists use R with frameworks or tools written in other languages. In this mode, most of the friction is at the interface between R and the other systems. For example, when data is sampled by a big data platform, the results need to be transferred to and imported into R as native data structures.

In this talk we show how SparkR solves these problems to enable a much smoother experience. We will present an overview of the SparkR architecture, including how data and control are transferred between R and the JVM. This knowledge will help data scientists make better decisions when using SparkR. We will demo and explain some of the existing and supported use cases with real large datasets inside a notebook. The demo will highlight how a Spark cluster, R, and interactive notebook environments such as Jupyter or Databricks make exploratory analysis of big data easy.
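The workflow the abstract describes — keeping data distributed in the JVM and moving only small results across the R boundary — can be sketched with the SparkR API. This is a minimal illustration, not from the talk itself; it assumes a local or cluster Spark installation is available, and uses R's built-in `faithful` dataset as stand-in data.

```r
library(SparkR)

# Start a SparkR session; this launches/connects to the JVM backend
# that SparkR communicates with.
sparkR.session(appName = "exploratory-demo")

# Promote a native R data.frame to a distributed SparkDataFrame;
# the data now lives on the JVM/cluster side, not in the R process.
sdf <- createDataFrame(faithful)

# Transformations like filter() are executed by Spark; only control
# messages cross the R <-> JVM boundary at this point.
long_eruptions <- filter(sdf, sdf$eruptions > 3)

# collect() transfers the (hopefully small) result back into R as a
# native data.frame, ready for ggplot2 or any other R library.
local_df <- collect(long_eruptions)
str(local_df)

sparkR.session.stop()
```

The key point the talk makes is visible here: filtering happens on the Spark side, and only `collect()` pays the cost of serializing data into native R structures.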
Enabling Exploratory Data Science with Spark and R
Exploratory Data Science with Spark and R [video with Chinese subtitles]
Exploratory Data Science with Spark and R · PDF
Translated by the CDA Data Analysis Institute team
This talk is from Spark Summit Europe 2015.