OP: ReneeBK

[Apache Spark] Resilient Distributed Datasets


#1
ReneeBK posted on 2017-3-10 09:04:25

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
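To make the abstract's contrast between coarse-grained transformations and fine-grained updates concrete, here is a minimal sketch (not from the paper; the local SparkContext setup and the toy data are assumptions added here). Each step applies one function to the whole dataset and yields a new RDD, and it is this short lineage of transformations, rather than a log of per-element updates, that Spark can replay to rebuild a lost partition.

import org.apache.spark.{SparkConf, SparkContext}

object CoarseGrainedSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local setup, purely for illustration.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val base = sc.parallelize(1 to 1000000)

    // Coarse-grained: each step transforms the whole dataset and yields a new RDD.
    // There is no API for updating individual elements in place; the short chain of
    // transformations (the lineage) is what Spark replays to rebuild a lost partition.
    val doubled = base.map(_ * 2)
    val large   = doubled.filter(_ > 100)

    // Transformations are lazy; this action is what triggers the computation.
    println(large.count())

    sc.stop()
  }
}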

Hidden content of this post:

Resilient Distributed Datasets.pdf (865.54 KB)



Keywords: Apache Spark, distributed, resilient, datasets, dataset

#2
ReneeBK posted on 2017-3-10 09:07:23
Example: Console Log Mining
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure. At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:

errors.count()

The user can also perform further transformations on the RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
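For readers who want to run the log-mining example end to end, a self-contained version might look like the sketch below. The SparkContext construction and the local[*] master are assumptions added here (the paper simply refers to an already-created context named spark); the HDFS path is left elided exactly as in the paper.

import org.apache.spark.{SparkConf, SparkContext}

object ConsoleLogMining {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; the paper's examples use an existing context called `spark`.
    val sc = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

    val lines  = sc.textFile("hdfs://...")                // path elided as in the paper
    val errors = lines.filter(_.startsWith("ERROR"))
    errors.persist()                                       // keep the filtered RDD in memory

    println(errors.count())                                // first action triggers the computation

    // Further queries reuse the cached `errors` RDD.
    println(errors.filter(_.contains("MySQL")).count())
    val hdfsTimes = errors.filter(_.contains("HDFS"))
                          .map(_.split('\t')(3))           // time assumed to be field 3, tab-separated
                          .collect()
    hdfsTimes.take(10).foreach(println)

    sc.stop()
  }
}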

#3
auirzxp (student verified) posted on 2017-3-10 09:11:37
Notice: this author has been banned or deleted; the content has been automatically hidden.

#4
ReneeBK posted on 2017-3-10 09:11:50
Logistic Regression
Many machine learning algorithms are iterative in nature because they run iterative optimization procedures, such as gradient descent, to maximize a function. They can thus run much faster by keeping their data in memory. As an example, the following program implements logistic regression [14], a common classification algorithm that searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random value, and on each iteration, it sums a function of w over the data to move w in a direction that improves it.

val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map{ p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
  }.reduce((a,b) => a+b)
  w -= gradient
}

We start by defining a persistent RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w. Keeping points in memory across iterations can yield a 20× speedup.
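The paper's snippet leaves Point, parsePoint, the initial w, and the vector arithmetic unspecified. A self-contained sketch that fills those gaps, under the assumption that each input line holds a label followed by space-separated features, could look like this (the helpers and the input format are illustrative, not from the paper):

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp
import scala.util.Random

// A minimal, self-contained sketch of the logistic regression example above.
// Point, parsePoint, the input format, and the vector helpers are assumptions
// made here so that the snippet compiles; the paper leaves them unspecified.
object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  // Assumed input format: label first, then space-separated features.
  def parsePoint(line: String): Point = {
    val cols = line.split(' ').map(_.toDouble)
    Point(cols.tail, cols.head)
  }

  // Element-wise helpers standing in for the paper's vector operations (w dot p.x, etc.).
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (ai, bi) => ai * bi }.sum
  def scale(a: Array[Double], s: Double): Array[Double] = a.map(_ * s)
  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (ai, bi) => ai + bi }
  def sub(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (ai, bi) => ai - bi }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))
    val ITERATIONS = 10

    val points = sc.textFile("hdfs://...").map(parsePoint).persist() // path elided as in the paper

    val dims = points.first().x.length
    var w = Array.fill(dims)(Random.nextDouble())   // random initial vector

    for (_ <- 1 to ITERATIONS) {
      // Same update as the paper: sum a function of the current w over the cached points.
      val gradient = points.map { p =>
        scale(p.x, (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y)
      }.reduce(add)
      w = sub(w, gradient)
    }

    println(w.mkString(" "))
    sc.stop()
  }
}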

#5
ReneeBK posted on 2017-3-10 09:15:11
PageRank
A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to α/N + (1 − α)∑ci, where the sum is over the contributions it received and N is the total number of documents. We can write PageRank in Spark as follows:

// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
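The snippet above elides how links and ranks are built and how the pair inside the flatMap is matched. A runnable sketch under assumed inputs (one tab-separated source/destination edge per line), an assumed α of 0.15, and ranks initialized to 1/N might look like the following; those choices are made here and are not specified in the paper:

import org.apache.spark.{SparkConf, SparkContext}

// A self-contained sketch of the PageRank loop above. The edge-list input format,
// the value of alpha, and the initialization of ranks to 1/N are assumptions made
// here; the paper elides them.
object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))
    val ITERATIONS = 10
    val alpha = 0.15

    // Load graph as an RDD of (URL, outlinks) pairs; one "source<TAB>dest" edge per line is assumed.
    val links = sc.textFile("hdfs://...")                 // path elided as in the paper
                  .map { line => val f = line.split('\t'); (f(0), f(1)) }
                  .groupByKey()
                  .persist()

    val N = links.count()
    var ranks = links.mapValues(_ => 1.0 / N)             // start every page at rank 1/N

    for (_ <- 1 to ITERATIONS) {
      // Each page sends rank/n to each of its n outlinks.
      val contribs = links.join(ranks).flatMap {
        case (_, (outlinks, rank)) => outlinks.map(dest => (dest, rank / outlinks.size))
      }
      // Sum contributions per URL and apply the paper's update: alpha/N + (1 - alpha) * sum.
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(sum => alpha / N + (1 - alpha) * sum)
    }

    ranks.collect().foreach { case (url, rank) => println(s"$url\t$rank") }
    sc.stop()
  }
}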

#6
franky_sas posted on 2017-3-10 09:54:08
Thanks.

#7
钱学森64 posted on 2017-3-10 13:54:52
Thanks for sharing.

#8
zhouxinwj posted on 2017-3-20 16:15:21
Statistical Arbitrage

#9
gyrolee posted on 2017-3-21 14:39:00
thanks for sharing

#10
whlgh posted on 2017-3-23 16:47:26
Thanks for sharing.
