Original poster: ReneeBK

[Jerry Lead] Spark Internals

Introduction

This series discusses the design and implementation of Apache Spark, with a focus on its design principles, execution mechanisms, system architecture and performance optimization. In addition, there are some comparisons with Hadoop MapReduce in terms of design and implementation. I'm reluctant to call this document a "code walkthrough", because the goal is not to analyze each piece of code in the project, but to understand the whole system in a systematic way (by analyzing the execution procedure of a Spark job, from its creation to completion).

There are many ways to discuss a computer system. Here, we've chosen a problem-driven approach: first one concrete problem is introduced, then it is analyzed step by step. We'll start from a typical Spark example job and then discuss all the related important system modules. I believe this approach is better than diving into each module right from the beginning.

The target audience of this series is geeks who want a deeper understanding of Apache Spark as well as other distributed computing frameworks.

I'll try my best to keep this documentation up to date with Spark, since it's a fast-evolving project with an active community. The documentation's main version number is kept in sync with Spark's version. The additional number at the end is the documentation's own update version.

For a more academically oriented discussion, please check out Matei Zaharia's PhD thesis and other related papers. You can also have a look at my blog (in Chinese).

I haven't written such complete documentation in a while. The last time was about three years ago, when I was studying Andrew Ng's ML course. I was really motivated at that time! This time I've spent 20+ days on this document, from the summer break until now (August 2014). Most of the time was spent on debugging, drawing diagrams and thinking about how to present my ideas in the right way. I hope you find this series helpful.
Contents

We start from the creation of a Spark job, and then discuss its execution. Finally, we dive into some related system modules and features.

  1. Overview: Overview of Apache Spark
  2. Job logical plan: Logical plan of a job (data dependency graph)
  3. Job physical plan: Physical plan
  4. Shuffle details: Shuffle process
  5. Architecture: Coordination of system modules in job execution
  6. Cache and Checkpoint: Cache and Checkpoint
  7. Broadcast: Broadcast feature
  8. Job Scheduling: TODO
  9. Fault-tolerance: TODO

Attachment: Spark Internals.pdf (23.91 MB, requires 5 forum coins)




Reply 1 (h2h2, posted 2017-2-15 11:03:02):
Thanks for sharing.

Reply 2 (fengyg, verified enterprise account, posted 2017-2-15 11:44:20):
Taking a look.
