ReneeBK posted on 2016-08-20 02:48:24


Why I choose Scala for Apache Spark project

Lan Jiang


Solutions Architect at Cloudera

Greater Chicago Area | Information Technology and Services
https://www.linkedin.com/in/lanjiang?trk=pulse-det-athr_prof-art_hdr

Apache Spark currently supports multiple programming languages, including Java, Scala, and Python. Word on the street is that Spark 1.4, expected in June, will add R language support too. Which language to choose for a Spark project is a common question asked on different forums and mailing lists.



The answer is quite subjective. Each team has to decide based on its own skill set, use cases, and ultimately personal taste. For me personally, Scala is the language of choice.

First of all, I eliminate Java from the list. Don't get me wrong, I love Java; I have been working with Java for more than 14 years. However, when it comes to a big data project like Spark, Java is just not suitable. Compared to Python and Scala, Java is too verbose: to achieve the same goal, you have to write many more lines of code. Java 8 improves matters by introducing lambda expressions, but it is still not as terse as Python or Scala. Most importantly, Java does not have a REPL (Read-Evaluate-Print Loop) interactive shell. That is a deal breaker for me. With an interactive shell, developers and data scientists can explore their datasets and prototype applications easily, without a full-blown development cycle. It is a must-have tool for a big data project.
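To make the REPL point concrete, here is a minimal spark-shell sketch (Spark 1.x, where `sc` is the preconfigured SparkContext); the log path is a hypothetical placeholder:

```scala
// Typed straight into the spark-shell REPL; no compile/package/submit cycle.
val lines  = sc.textFile("hdfs:///logs/app.log")   // hypothetical path
val errors = lines.filter(_.contains("ERROR"))     // lazy transformation
errors.take(5).foreach(println)                    // peek at a sample interactively
println(errors.count())                            // then count all matches
```

Each line gives immediate feedback, which is exactly the exploratory workflow Java cannot offer.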

Now it comes down to Python vs. Scala. Both have succinct syntax. Both are object-oriented as well as functional. Both have passionate support communities.


I ultimately chose Scala for the following reasons:
  1. Python is in general slower than Scala. If you have significant processing logic written in your own code, Scala will definitely offer better performance.
  2. Scala is statically typed. It looks like a dynamically typed language because it uses a sophisticated type inference mechanism, which means the compiler still catches compile-time errors for me (see the sketch after this list). Call me old school.
  3. Apache Spark is built on Scala, so being proficient in Scala helps you dig into the source code when something does not work as you expect. This is especially true for a young, fast-moving open source project like Spark.
  4. When the Python wrapper calls the underlying Spark code written in Scala running on a JVM, translation between the two different environments and languages can be a source of additional bugs and issues.
  5. Last but not least, because Spark is implemented in Scala, using Scala gives you access to the latest and greatest features. Most features are first available in Scala and then ported to Python.
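Here is a small Scala sketch of the type-inference point: no explicit annotations are needed, yet everything is checked at compile time (assumes the `sc` from spark-shell):

```scala
// Types are inferred, not declared, but still enforced by the compiler.
val nums    = sc.parallelize(1 to 100)   // inferred as RDD[Int]
val squares = nums.map(n => n * n)       // still RDD[Int]
val total: Long = squares.count()        // count() returns Long

// A type mistake fails at compile time rather than blowing up mid-job:
// squares.map(n => n.toUpperCase)       // error: toUpperCase is not a member of Int
```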


Spark Streaming
Stream processing probably has the weakest support in Python. The Python streaming API was first introduced in Spark 1.2, and it only supported basic sources such as a text file or text over a socket. Only in Spark 1.3 was a Python Kafka source introduced. Other sources, such as Flume and Kinesis, are still not available in the Python API. To make things worse, Python custom sources are not supported either. Streaming output operations such as saveAsObjectFile() and saveAsHadoopFile() are also not available as of today. In today's enterprise big data projects, we see more and more designs that combine batch processing and streaming processing to give end users a holistic view of the data (i.e., the Lambda Architecture). I see the lagging streaming support in the Python API as a major drawback.
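For contrast, here is a hedged sketch of what the Scala API already offers: a word count over the receiver-based Kafka source from Spark 1.x. The ZooKeeper quorum, consumer group, and topic name are hypothetical placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

    // Receiver-based Kafka stream of (key, message) pairs; names are placeholders.
    val messages = KafkaUtils.createStream(
      ssc, "zk-host:2181", "demo-group", Map("events" -> 1))

    val counts = messages
      .flatMap { case (_, msg) => msg.split("\\s+") }
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.print()   // emit per-batch counts to the driver log
    ssc.start()
    ssc.awaitTermination()
  }
}
```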

Spark MLlib for Machine Learning
It is true that Python has impressive machine learning libraries such as Pandas and scikit-learn. However, these libraries are mostly suitable for working with data that fits on a single machine. Spark MLlib is more suitable for big data machine learning, where data is stored across a cluster of nodes. Most MLlib algorithms are first implemented in Scala and then ported to Python. As of Spark 1.3.1, some ML algorithms, such as clustering (PIC, LDA, streaming k-means), dimensionality reduction (SVD, PCA), and frequent pattern mining (FP-growth), are still not available in Python. When the new Machine Learning Pipeline API was introduced back in Spark 1.2, it too was only available in Scala and Java.
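As a taste of the Scala-first MLlib API, here is a minimal k-means sketch on Spark 1.x; the input path is hypothetical, and `sc` is the spark-shell SparkContext:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature vectors from a hypothetical file.
val data   = sc.textFile("hdfs:///data/points.txt")
val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

// Train k-means with 3 clusters and at most 20 iterations.
val model = KMeans.train(parsed, 3, 20)
println(s"Within-set sum of squared errors: ${model.computeCost(parsed)}")
```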

Of course, Python still fits some use cases, especially in machine learning projects. MLlib only contains parallel ML algorithms that are suitable for running on a dataset distributed across a cluster; some classic ML algorithms are not implemented in MLlib at all. Equipped with Python knowledge, you can still use a single-node ML library such as scikit-learn together with Spark's core parallel processing framework to distribute the workload across the cluster. Another use case is when your dataset is small enough to fit on one machine, but you need to tune parameters to fit your model better. You can use Spark's parallelize() call to distribute the list of parameters across the cluster, run the single-node algorithm from scikit-learn on each node against the same dataset in parallel, and pick the best result for your model, as sketched below.
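A hedged Scala sketch of that parameter-sweep pattern (the article's version would use scikit-learn from Python; here `loadLocalDataset` and `trainAndScore` are hypothetical stand-ins for a single-node loader and learner):

```scala
// Hypothetical single-node loader and learner, stubbed so the sketch runs.
def loadLocalDataset(): Array[Double] = Array(1.0, 2.0, 3.0)
def trainAndScore(data: Array[Double], param: Double): Double =
  -math.abs(param - 1.0)   // dummy score that happens to peak at param = 1.0

// Ship the small dataset to every executor once, then fan out the candidates.
val dataset    = sc.broadcast(loadLocalDataset())
val candidates = sc.parallelize(Seq(0.01, 0.1, 1.0, 10.0))

val (bestParam, bestScore) = candidates
  .map(p => (p, trainAndScore(dataset.value, p)))   // each task fits one candidate
  .reduce((a, b) => if (a._2 >= b._2) a else b)     // keep the highest score

println(s"best parameter = $bestParam, score = $bestScore")
```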

In summary, Scala is my first choice of programming language for Spark projects, and I will keep Python in mind when the use case fits.
