OP: ReneeBK

[Apache Spark] Resilient Distributed Datasets


#1
ReneeBK posted on 2017-3-10 09:04:25

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
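To make the abstract's contrast between coarse-grained transformations and fine-grained updates concrete, here is a minimal sketch (not from the paper; the local SparkContext setup and the toy data are assumptions added here). Each step applies one function to the whole dataset and yields a new RDD, and it is this short lineage of transformations, rather than a log of per-element updates, that Spark can replay to rebuild a lost partition.

import org.apache.spark.{SparkConf, SparkContext}

object CoarseGrainedSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local setup, purely for illustration.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val base = sc.parallelize(1 to 1000000)

    // Coarse-grained: each step transforms the whole dataset and yields a new RDD.
    // There is no API for updating individual elements in place; the short chain of
    // transformations (the lineage) is what Spark replays to rebuild a lost partition.
    val doubled = base.map(_ * 2)
    val large   = doubled.filter(_ > 100)

    // Transformations are lazy; this action is what triggers the computation.
    println(large.count())

    sc.stop()
  }
}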

Hidden content of this post:

Resilient Distributed Datasets.pdf (865.54 KB)



Keywords: Apache Spark, distributed, resilient, datasets, dataset

#2
ReneeBK posted on 2017-3-10 09:07:23
Example: Console Log Mining
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure. At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:

errors.count()

The user can also perform further transformations on the RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
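For readers who want to run the log-mining example end to end, a self-contained version might look like the sketch below. The SparkContext construction and the local[*] master are assumptions added here (the paper simply refers to an already-created context named spark); the HDFS path is left elided exactly as in the paper.

import org.apache.spark.{SparkConf, SparkContext}

object ConsoleLogMining {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; the paper's examples use an existing context called `spark`.
    val sc = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

    val lines  = sc.textFile("hdfs://...")                // path elided as in the paper
    val errors = lines.filter(_.startsWith("ERROR"))
    errors.persist()                                       // keep the filtered RDD in memory

    println(errors.count())                                // first action triggers the computation

    // Further queries reuse the cached `errors` RDD.
    println(errors.filter(_.contains("MySQL")).count())
    val hdfsTimes = errors.filter(_.contains("HDFS"))
                          .map(_.split('\t')(3))           // time assumed to be field 3, tab-separated
                          .collect()
    hdfsTimes.take(10).foreach(println)

    sc.stop()
  }
}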

#3
auirzxp (student verified) posted on 2017-3-10 09:11:37
Notice: this author has been banned or deleted; the content has been automatically hidden.

#4
ReneeBK posted on 2017-3-10 09:11:50
Logistic Regression
Many machine learning algorithms are iterative in nature because they run iterative optimization procedures, such as gradient descent, to maximize a function. They can thus run much faster by keeping their data in memory. As an example, the following program implements logistic regression [14], a common classification algorithm that searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random value, and on each iteration, it sums a function of w over the data to move w in a direction that improves it.

val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map{ p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
  }.reduce((a,b) => a+b)
  w -= gradient
}

We start by defining a persistent RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w. Keeping points in memory across iterations can yield a 20× speedup.
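The paper's snippet leaves Point, parsePoint, the initial w, and the vector arithmetic unspecified. A self-contained sketch that fills those gaps, under the assumption that each input line holds a label followed by space-separated features, could look like this (the helpers and the input format are illustrative, not from the paper):

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp
import scala.util.Random

// A minimal, self-contained sketch of the logistic regression example above.
// Point, parsePoint, the input format, and the vector helpers are assumptions
// made here so that the snippet compiles; the paper leaves them unspecified.
object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  // Assumed input format: label first, then space-separated features.
  def parsePoint(line: String): Point = {
    val cols = line.split(' ').map(_.toDouble)
    Point(cols.tail, cols.head)
  }

  // Element-wise helpers standing in for the paper's vector operations (w dot p.x, etc.).
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (ai, bi) => ai * bi }.sum
  def scale(a: Array[Double], s: Double): Array[Double] = a.map(_ * s)
  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (ai, bi) => ai + bi }
  def sub(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (ai, bi) => ai - bi }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))
    val ITERATIONS = 10

    val points = sc.textFile("hdfs://...").map(parsePoint).persist() // path elided as in the paper

    val dims = points.first().x.length
    var w = Array.fill(dims)(Random.nextDouble())   // random initial vector

    for (_ <- 1 to ITERATIONS) {
      // Same update as the paper: sum a function of the current w over the cached points.
      val gradient = points.map { p =>
        scale(p.x, (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y)
      }.reduce(add)
      w = sub(w, gradient)
    }

    println(w.mkString(" "))
    sc.stop()
  }
}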

#5
ReneeBK posted on 2017-3-10 09:15:11
PageRank
A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to α/N + (1 − α)∑ci, where the sum is over the contributions it received and N is the total number of documents. We can write PageRank in Spark as follows:

// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
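The snippet above elides how links and ranks are built and how the pair inside the flatMap is matched. A runnable sketch under assumed inputs (one tab-separated source/destination edge per line), an assumed α of 0.15, and ranks initialized to 1/N might look like the following; those choices are made here and are not specified in the paper:

import org.apache.spark.{SparkConf, SparkContext}

// A self-contained sketch of the PageRank loop above. The edge-list input format,
// the value of alpha, and the initialization of ranks to 1/N are assumptions made
// here; the paper elides them.
object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))
    val ITERATIONS = 10
    val alpha = 0.15

    // Load graph as an RDD of (URL, outlinks) pairs; one "source<TAB>dest" edge per line is assumed.
    val links = sc.textFile("hdfs://...")                 // path elided as in the paper
                  .map { line => val f = line.split('\t'); (f(0), f(1)) }
                  .groupByKey()
                  .persist()

    val N = links.count()
    var ranks = links.mapValues(_ => 1.0 / N)             // start every page at rank 1/N

    for (_ <- 1 to ITERATIONS) {
      // Each page sends rank/n to each of its n outlinks.
      val contribs = links.join(ranks).flatMap {
        case (_, (outlinks, rank)) => outlinks.map(dest => (dest, rank / outlinks.size))
      }
      // Sum contributions per URL and apply the paper's update: alpha/N + (1 - alpha) * sum.
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(sum => alpha / N + (1 - alpha) * sum)
    }

    ranks.collect().foreach { case (url, rank) => println(s"$url\t$rank") }
    sc.stop()
  }
}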

#6
franky_sas posted on 2017-3-10 09:54:08
Thanks.

#7
钱学森64 posted on 2017-3-10 13:54:52
Thanks for sharing.

#8
zhouxinwj posted on 2017-3-20 16:15:21
Statistical Arbitrage

#9
gyrolee posted on 2017-3-21 14:39:00
thanks for sharing

#10
whlgh posted on 2017-3-23 16:47:26
Thanks for sharing.
