In this example, we will work with two CSV files, people.csv and links.csv, which are located in the $SPARKHOME/data/ directory. Type the following commands to load these files into Spark:
scala> val people = sc.textFile("./data/people.csv")
people: org.apache.spark.rdd.RDD[String] = ./data/people.csv MappedRDD[81] at textFile at <console>:33
scala> val links = sc.textFile("./data/links.csv")
links: org.apache.spark.rdd.RDD[String] = ./data/links.csv MappedRDD[83] at textFile at <console>:33
Going back to our example, let's construct the graph in three steps, as follows:
1. We define a case class Person, which has name and age as class parameters. Case classes are very useful when we need to pattern match on a Person object later on:
case class Person(name: String, age: Int)
2. Next, we parse each line of the CSV text inside people and links into new objects of type Person and Edge respectively, and collect the results in an RDD[(VertexId, Person)] and an RDD[Edge[String]]:
scala> val peopleRDD: RDD[(VertexId, Person)] = people map { line =>
  val row = line split ','
  (row(0).toLong, Person(row(1), row(2).toInt))
}
scala> type Connection = String
scala> val linksRDD: RDD[Edge[Connection]] = links map { line =>
  val row = line split ','
  Edge(row(0).toLong, row(1).toLong, row(2))
}
3. Now, we can create our social graph and name it tinySocial, using the factory method Graph(…):
scala> val tinySocial: Graph[Person, Connection] = Graph(peopleRDD, linksRDD)
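As a quick check, we can print every relationship in tinySocial, pattern matching on the Person case class from step 1. This is only an illustrative sketch; the actual output depends on the contents of people.csv and links.csv:

```scala
scala> tinySocial.triplets.collect().foreach { triplet =>
  // Destructure the source vertex attribute with a case class pattern
  triplet.srcAttr match {
    case Person(name, age) =>
      println(s"$name ($age) --[${triplet.attr}]--> ${triplet.dstAttr.name}")
  }
}
```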
The first one is the Graph factory method that we have already seen in the previous chapter. It is defined as the apply method of the companion object Graph, as follows:
def apply[VD, ED](
    vertices: RDD[(VertexId, VD)],
    edges: RDD[Edge[ED]],
    defaultVertexAttr: VD = null.asInstanceOf[VD])
  : Graph[VD, ED]
As we have seen before, this function takes two RDD collections, RDD[(VertexId, VD)] and RDD[Edge[ED]], as parameters for the vertices and edges respectively, and constructs a Graph[VD, ED]. The defaultVertexAttr parameter assigns a default attribute to the vertices that are present in the edge RDD but not in the vertex RDD. The Graph factory method is convenient when the RDD collections of edges and vertices are readily available.
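For instance, if links.csv referred to a person missing from people.csv, we could supply a placeholder attribute for such vertices. The Person("Unknown", 0) default below is just an illustrative choice:

```scala
scala> val tinySocialWithDefaults: Graph[Person, Connection] =
  Graph(peopleRDD, linksRDD, defaultVertexAttr = Person("Unknown", 0))
```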
A more common situation is that your original dataset only represents the edges. In this case, GraphX provides the following GraphLoader.edgeListFile function that is defined in GraphLoader:
def edgeListFile(
    sc: SparkContext,
    path: String,
    canonicalOrientation: Boolean = false,
    minEdgePartitions: Int = 1)
  : Graph[Int, Int]
It takes as an argument the path to a file that contains a list of edges. Each line of the file represents an edge of the graph as a pair of integers in the form sourceID destinationID. While reading the list, it ignores any comment line starting with #. It then constructs a graph from the specified edges, along with the corresponding vertices.
The minEdgePartitions argument is the minimum number of edge partitions to generate. If the edge list file is stored in more blocks than minEdgePartitions, more partitions will be created.
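For example, we could request at least four edge partitions and a canonical orientation (each edge stored with srcId < dstId) when loading a file. The file path below is hypothetical:

```scala
scala> val partitionedGraph = GraphLoader.edgeListFile(sc, "./data/edges.txt",
  canonicalOrientation = true, minEdgePartitions = 4)
```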
Similar to GraphLoader.edgeListFile, a third function named Graph.fromEdges enables you to create a graph from an RDD[Edge[ED]] collection. Moreover, it automatically creates the vertices from the vertex IDs referenced by the edge RDD, assigning them the defaultValue argument as the default vertex attribute:
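Its signature (omitting the ClassTag context bounds) and a sketch of its use with the linksRDD from the earlier example follow; the Person("Unknown", 0) default is again an illustrative choice:

```scala
def fromEdges[VD, ED](
    edges: RDD[Edge[ED]],
    defaultValue: VD)
  : Graph[VD, ED]

// Every vertex ID mentioned in linksRDD gets the default attribute
scala> val fromEdgesGraph: Graph[Person, Connection] =
  Graph.fromEdges(linksRDD, Person("Unknown", 0))
```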
The last graph builder function is Graph.fromEdgeTuples, which creates a graph from only an RDD of edge tuples, that is, a collection of the RDD[(VertexId, VertexId)] type. It assigns the edges the attribute value 1 by default:
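A minimal sketch of Graph.fromEdgeTuples, using a small hand-made RDD of ID pairs (the pairs below are made up for illustration). Note that the resulting graph has type Graph[VD, Int], since every edge attribute is the integer 1:

```scala
scala> val rawEdges: RDD[(VertexId, VertexId)] =
  sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L)))
scala> val tupleGraph: Graph[Int, Int] =
  Graph.fromEdgeTuples(rawEdges, defaultValue = 1)
```

An optional uniqueEdges parameter of type Option[PartitionStrategy] can also be passed to merge duplicate edge tuples, counting each duplicate in the edge attribute.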
The first graph that we will build is the Enron email communication network. If you have restarted your Spark shell, you need to import the GraphX library again. First, create a new folder called data inside $SPARKHOME and copy the dataset into it. This file contains the adjacency list of the email communications between the employees. Assuming that the current directory is $SPARKHOME, we can pass the file path to the GraphLoader.edgeListFile method:
scala> import org.apache.spark.graphx._
scala> import org.apache.spark.rdd._
scala> val emailGraph = GraphLoader.edgeListFile(sc, "./data/emailEnron.txt")
scala> emailGraph.vertices.take(5)
scala> emailGraph.edges.take(5)
In GraphX, all edges must be directed. To express an undirected or bidirectional graph, we can link each connected pair in both directions. In our email network, we can verify, for instance, that node 19021 has both incoming and outgoing links. First, we collect the destination nodes that node 19021 communicates with: