This series discusses the design and implementation of Apache Spark, with a focus on its design principles, execution mechanisms, system architecture and performance optimization. It also includes some comparisons with Hadoop MapReduce in terms of design and implementation. I'm reluctant to call this document a "code walkthrough", because the goal is not to analyze each piece of code in the project, but to understand the whole system in a systematic way, by following the execution of a Spark job from its creation to its completion.
There are many ways to discuss a computer system. Here, I've chosen a problem-driven approach: first a concrete problem is introduced, then it is analyzed step by step. We'll start from a typical Spark example job and then discuss all the important system modules related to it. I believe this approach is better than diving into each module right from the beginning.
The target audience of this series is geeks who want a deeper understanding of Apache Spark and of other distributed computing frameworks.
I'll try my best to keep this documentation up to date with Spark, since it's a fast-evolving project with an active community. The documentation's main version number is kept in sync with Spark's version. The additional number at the end is the documentation's own update version.
For a more academically oriented discussion, please check out Matei's PhD thesis and other related papers. You can also have a look at my blog (in Chinese).
I haven't written such complete documentation for a while. The last time was about three years ago, when I was studying Andrew Ng's ML course. I was really motivated at that time! This time I've spent 20+ days on this document, from the summer break until now (August 2014). Most of the time was spent on debugging, drawing diagrams and thinking about how to express my ideas clearly. I hope you find this series helpful.
Contents
We start from the creation of a Spark job, and then discuss its execution. Finally, we dive into some related system modules and features.
Overview: Overview of Apache Spark
Job logical plan: Logical plan of a job (data dependency graph)
Job physical plan: Physical plan
Shuffle details: Shuffle process
Architecture: Coordination of system modules in job execution