Thread starter: Lisrelchen

[Code Collection] Spark using Scala, Java and Python


Lisrelchen posted on 2016-3-29 10:08:13

[code]
#
# Spark for Python Developers
# Chapter 1 - Code Word Count
#

# Word count on manuscript using PySpark

# import regex module
import re
# import add from operator module
from operator import add

# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py 20150315')

# count lines
print('number of lines in file: %s' % file_in.count())

# add up lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

# get words from the input file
words = file_in.flatMap(lambda line: re.split(r'\W+', line.lower().strip()))

# keep words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)

# set count 1 per word
words = words.map(lambda w: (w, 1))

# reduce phase - sum the counts for each word
words = words.reduceByKey(add)

# create tuple (count, word) and sort in descending order
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)

# take top 20 words by frequency
words.take(20)

# create function for histogram of most frequent words
%matplotlib inline
import matplotlib.pyplot as plt

def histogram(words):
    count = list(map(lambda x: x[1], words))
    word = list(map(lambda x: x[0], words))
    plt.barh(range(len(count)), count, color='grey')
    plt.yticks(range(len(count)), word)

# change order of tuple from (count, word) back to (word, count)
words = words.map(lambda x: (x[1], x[0]))
words.take(25)

# display histogram
histogram(words.take(25))

# word count in one summarised statement
words = (sc.textFile('/home/an/Documents/A00_Documents/Spark4Py 20150315')
           .flatMap(lambda line: re.split(r'\W+', line.lower().strip()))
           .filter(lambda x: len(x) > 3)
           .map(lambda w: (w, 1))
           .reduceByKey(add)
           .map(lambda x: (x[1], x[0])).sortByKey(False))
words.take(20)
[/code]
Spark for Python Developers
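
The listing above assumes an existing SparkContext `sc`, as provided by the PySpark shell or an IPython notebook launched with PySpark. If you instead run it as a standalone script, a minimal sketch for creating the context yourself (the app name here is arbitrary) would be:

[code]
from pyspark import SparkContext

# create the context explicitly when not running inside the PySpark shell
sc = SparkContext(appName="WordCount")
[/code]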

Keywords: Spark, Python, Scala, Java

Lisrelchen posted on 2016-3-29 10:15:38
[code]##
#
# Spark for Python - Chapter 5 - Code
#
##

#
# Spark Streaming Wordcount - netcat client
#
# Spark example code  from:
# https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
#
"""
Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
Usage: network_wordcount.py <hostname> <port>
   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
To run this on your local machine, you need to first run a Netcat server
    `$ nc -lk 9999`
and then run the example
    `$ bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999`
"""
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)

    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

#
#
# Twitter Streaming API Spark Streaming into an RDD-Queue to process tweets live
#
#

"""
Twitter Streaming API Spark Streaming into an RDD-Queue to process tweets live


Create a queue of RDDs that will be mapped/reduced one at a time in
1 second intervals.

To run this example use
    '$ bin/spark-submit examples/AN_Spark/AN_Spark_Code/twitterstreaming.py'

"""
#
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import twitter
import dateutil.parser
import json

#
#
# Connecting Streaming Twitter with Streaming Spark via Queue
#


class Tweet(dict):
    def __init__(self, tweet_in):
        super(Tweet, self).__init__(self)
        if tweet_in and 'delete' not in tweet_in:
            self['timestamp'] = dateutil.parser.parse(tweet_in[u'created_at']
                                ).replace(tzinfo=None).isoformat()
            self['text'] = tweet_in['text'].encode('utf-8')
            #self['text'] = tweet_in['text']
            self['hashtags'] = [x['text'].encode('utf-8') for x in tweet_in['entities']['hashtags']]
            #self['hashtags'] = [x['text'] for x in tweet_in['entities']['hashtags']]
            self['geo'] = tweet_in['geo']['coordinates'] if tweet_in['geo'] else None
            self['id'] = tweet_in['id']
            self['screen_name'] = tweet_in['user']['screen_name'].encode('utf-8')
            #self['screen_name'] = tweet_in['user']['screen_name']
            self['user_id'] = tweet_in['user']['id']

def connect_twitter():
    twitter_stream = twitter.TwitterStream(auth=twitter.OAuth(
        token = "get_your_own_credentials",
        token_secret = "get_your_own_credentials",
        consumer_key = "get_your_own_credentials",
        consumer_secret = "get_your_own_credentials"))
    return twitter_stream

def get_next_tweet(twitter_stream):
    stream = twitter_stream.statuses.sample(block=True)
    # testing = stream.next() # This is just to make sure the stream is emitting data.
    tweet_in = None
    while not tweet_in or 'delete' in tweet_in:
        tweet_in = stream.next()
        tweet_parsed = Tweet(tweet_in)
        # print(json.dumps(tweet_in, indent=2, sort_keys=True))
    # return json.dumps(tweet_in, indent=2, sort_keys=True)
    return json.dumps(tweet_parsed)

def process_rdd_queue(twitter_stream):
    # Create the queue through which RDDs can be pushed to
    # a QueueInputDStream
    rddQueue = []
    for i in range(3):
        rddQueue += [ssc.sparkContext.parallelize([get_next_tweet(twitter_stream)], 5)]

    lines = ssc.queueStream(rddQueue)
    lines.pprint()
   
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingQueueStream")
    ssc = StreamingContext(sc, 1)
   
    # Instantiate the twitter_stream
    twitter_stream = connect_twitter()
    # Get RDD queue of the streams json or parsed
    process_rdd_queue(twitter_stream)
   
    ssc.start()
    time.sleep(2)
    ssc.stop(stopSparkContext=True, stopGraceFully=True)

#
#
# Kafka and Spark Streaming
#
#

#
# kafka producer
#
#
import time
from kafka.common import LeaderNotAvailableError
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer
from datetime import datetime

def print_response(response=None):
    if response:
        print('Error: {0}'.format(response[0].error))
        print('Offset: {0}'.format(response[0].offset))

def main():
    kafka = KafkaClient("localhost:9092")
    producer = SimpleProducer(kafka)
    try:
        time.sleep(5)
        topic = 'test'
        for i in range(5):
            time.sleep(1)
            msg = 'This is a message sent from the kafka producer: ' \
                  + str(datetime.now().time()) + ' -- '\
                  + str(datetime.now().strftime("%A, %d %B %Y %I:%M%p"))
            print_response(producer.send_messages(topic, msg))
    except LeaderNotAvailableError:
        # https://github.com/mumrah/kafka-python/issues/249
        time.sleep(1)
        print_response(producer.send_messages(topic, msg))

    kafka.close()

if __name__ == "__main__":
    main()

#
# Kafka consumer
# consumes messages from a Kafka topic
[/code]
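
The consumer listing breaks off above. A minimal sketch of what such a consumer might look like, assuming the same legacy kafka-python API as the producer (KafkaClient and SimpleConsumer); the group name "spark-group" and topic "test" are placeholders:

[code]
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

def main():
    # connect to the same local broker as the producer
    kafka = KafkaClient("localhost:9092")
    # "spark-group" and "test" are placeholder group and topic names
    consumer = SimpleConsumer(kafka, "spark-group", "test")
    # iterate over incoming messages, printing offset and payload
    for message in consumer:
        print('Offset: {0}'.format(message.offset))
        print('Message: {0}'.format(message.message.value))
    kafka.close()

if __name__ == "__main__":
    main()
[/code]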


Lisrelchen posted on 2016-3-29 10:19:28
To Be Continued


Lisrelchen posted on 2016-3-29 10:24:07
[code]
package com.github.giocode.graphxbook

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object SimpleGraphApp {
  def main(args: Array[String]) {
    // Configure the program
    val conf = new SparkConf()
      .setAppName("Tiny Social")
      .setMaster("local")
      .set("spark.driver.memory", "2G")
    val sc = new SparkContext(conf)

    // Load some data into RDDs
    val people = sc.textFile("../../data/people.csv")
    val links = sc.textFile("../../data/links.csv")

    // Parse the csv files into new RDDs
    case class Person(name: String, age: Int)
    type Connexion = String
    val peopleRDD: RDD[(VertexId, Person)] = people map { line =>
      val row = line split ','
      (row(0).toInt, Person(row(1), row(2).toInt))
    }
    val linksRDD: RDD[Edge[Connexion]] = links map { line =>
      val row = line split ','
      Edge(row(0).toInt, row(1).toInt, row(2))
    }

    // Create the social graph of people
    val tinySocial: Graph[Person, Connexion] = Graph(peopleRDD, linksRDD)
    tinySocial.cache()
    tinySocial.vertices.collect()
    tinySocial.edges.collect()

    // Extract links between coworkers and print their professional relationship
    val profLinks: List[Connexion] = List("coworker", "boss", "employee", "client", "supplier")
    tinySocial.subgraph(t => profLinks contains t.attr).
      triplets.foreach(t => println(t.srcAttr.name + " is a " + t.attr + " of " + t.dstAttr.name))
  }
}
[/code]

Apache Spark Graph Processing


Lisrelchen posted on 2016-3-29 10:25:00
[code]
import org.apache.spark.graphx._
import org.apache.spark.rdd._
import breeze.linalg.SparseVector
import scala.io.Source
import scala.math.abs

type Feature = breeze.linalg.SparseVector[Int]

// This takes the ego.feat file and creates a Map of features
// The vertexId is created by hashing the string identifier to a Long
val featureMap: Map[Long, Feature] =
  Source.fromFile("./data/ego.feat").
    getLines().
    map { line =>
      val row = line split ' '
      val key = abs(row.head.hashCode.toLong)
      val feat: Feature = SparseVector(row.tail.map(_.toInt))
      (key, feat)
    }.toMap

val edges: RDD[Edge[Int]] =
  sc.textFile("./data/ego.edges").
    map { line =>
      val row = line split ' '
      val srcId = abs(row(0).hashCode.toLong)
      val dstId = abs(row(1).hashCode.toLong)
      val numCommonFeats = featureMap(srcId) dot featureMap(dstId)
      Edge(srcId, dstId, numCommonFeats)
    }

val vertices: RDD[(VertexId, Feature)] =
  sc.textFile("./data/ego.edges").
    map { line =>
      val key = abs(line.hashCode.toLong)
      (key, featureMap(key))
    }

val egoNetwork: Graph[Int, Int] = Graph.fromEdges(edges, 1)

egoNetwork.edges.filter(_.attr == 3).count()
egoNetwork.edges.filter(_.attr == 2).count()
egoNetwork.edges.filter(_.attr == 1).count()
[/code]

Apache Spark Graph Processing


Lisrelchen posted on 2016-3-29 10:26:41
[code]
package com.github.giocode.graphxbook

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

import org.graphstream.graph.{Graph => GraphStream}
import org.graphstream.graph.implementations._
import org.graphstream.ui.j2dviewer.J2DGraphRenderer

import org.jfree.chart.axis.ValueAxis
import breeze.linalg.{SparseVector, DenseVector}
import breeze.plot._
import scala.io.Source
import scala.math.abs

object SimpleGraphVizApp {
  def main(args: Array[String]) {
    // Configure the program
    val conf = new SparkConf()
      .setAppName("Tiny Social Viz")
      .setMaster("local")
      .set("spark.driver.memory", "2G")
    val sc = new SparkContext(conf)

    type Feature = breeze.linalg.SparseVector[Int]

    // This takes the ego.feat file and creates a Map of features
    // The vertexId is created by hashing the string identifier to a Long
    val featureMap: Map[Long, Feature] =
      Source.fromFile("../../../data/ego.feat").
        getLines().
        map { line =>
          val row = line split ' '
          val key = abs(row.head.hashCode.toLong)
          val feat: Feature = SparseVector(row.tail.map(_.toInt))
          (key, feat)
        }.toMap

    val edges: RDD[Edge[Int]] =
      sc.textFile("../../../data/ego.edges").
        map { line =>
          val row = line split ' '
          val srcId = abs(row(0).hashCode.toLong)
          val dstId = abs(row(1).hashCode.toLong)
          val numCommonFeats = featureMap(srcId) dot featureMap(dstId)
          Edge(srcId, dstId, numCommonFeats)
        }

    val vertices: RDD[(VertexId, Feature)] =
      sc.textFile("../../../data/ego.edges").
        map { line =>
          val key = abs(line.hashCode.toLong)
          (key, featureMap(key))
        }

    val egoNetwork: Graph[Int, Int] = Graph.fromEdges(edges, 1)

    // Setup GraphStream settings
    System.setProperty("org.graphstream.ui.renderer", "org.graphstream.ui.j2dviewer.J2DGraphRenderer")

    // Create a SingleGraph class for GraphStream visualization
    val graph: SingleGraph = new SingleGraph("EgoSocial")

    // Set up the visual attributes for graph visualization
    graph.addAttribute("ui.stylesheet", "url(file:../../..//style/stylesheet)")
    graph.addAttribute("ui.quality")
    graph.addAttribute("ui.antialias")

    // Load the GraphX vertices into GraphStream nodes
    for ((id, _) <- egoNetwork.vertices.collect()) {
      val node = graph.addNode(id.toString).asInstanceOf[SingleNode]
    }
    // Load the GraphX edges into GraphStream edges
    for (Edge(x, y, _) <- egoNetwork.edges.collect()) {
      val edge = graph.addEdge(x.toString ++ y.toString, x.toString, y.toString, true).asInstanceOf[AbstractEdge]
    }

    // Display the graph
    graph.display()

    // Function for computing the degree distribution
    def degreeHistogram(net: Graph[Int, Int]): Array[(Int, Int)] =
      net.degrees.map(t => (t._2, t._1)).
        groupByKey.map(t => (t._1, t._2.size)).
        sortBy(_._1).collect()

    val nn = egoNetwork.numVertices
    val egoDegreeDistribution = degreeHistogram(egoNetwork).map({ case (d, n) => (d, n.toDouble / nn) })

    // Plot degree distribution with breeze-viz
    val f = Figure()
    val p = f.subplot(0)
    val x = new DenseVector(egoDegreeDistribution map (_._1.toDouble))
    val y = new DenseVector(egoDegreeDistribution map (_._2))
    p.xlabel = "Degrees"
    p.ylabel = "Degree distribution"
    p += plot(x, y)
    // f.saveas("/output/degrees-ego.png")
  }
}
[/code]

Apache Spark Graph Processing


Lisrelchen posted on 2016-3-29 10:29:21
[code]
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.SparkConf

object MLlib01 {
  //
  def getCurrentDirectory = new java.io.File(".").getCanonicalPath
  //
  def parseCarData(inpLine: String): Array[Double] = {
    val values = inpLine.split(',')
    val mpg = values(0).toDouble
    val displacement = values(1).toDouble
    val hp = values(2).toInt
    val torque = values(3).toInt
    val CRatio = values(4).toDouble
    val RARatio = values(5).toDouble // Rear Axle Ratio
    val CarbBarrells = values(6).toInt
    val NoOfSpeed = values(7).toInt
    val length = values(8).toDouble
    val width = values(9).toDouble
    val weight = values(10).toDouble
    val automatic = values(11).toInt
    return Array(mpg, displacement, hp,
      torque, CRatio, RARatio, CarbBarrells,
      NoOfSpeed, length, width, weight, automatic)
  }
  //
  def carDataToLP(inpArray: Array[Double]): LabeledPoint = {
    return new LabeledPoint(inpArray(0), Vectors.dense(inpArray(1), inpArray(2), inpArray(3),
      inpArray(4), inpArray(5), inpArray(6), inpArray(7), inpArray(8), inpArray(9),
      inpArray(10), inpArray(11)))
  }
  //
  def main(args: Array[String]) {
    println(getCurrentDirectory)
    val conf = new SparkConf(false) // skip loading external settings
      .setMaster("local") // could be "local[4]" for 4 threads
      .setAppName("Chapter 9")
      .set("spark.logConf", "true")
    val sc = new SparkContext(conf) // ("local","Chapter 9") if using directly
    println(s"Running Spark Version ${sc.version}")
    //
    val dataFile = sc.textFile("/Users/ksankar/fdps-vii/data/car-milage-no-hdr.csv")
    val carRDD = dataFile.map(line => parseCarData(line))
    //
    // Let us find summary statistics
    //
    val vectors: RDD[Vector] = carRDD.map(v => Vectors.dense(v))
    val summary = Statistics.colStats(vectors)
    carRDD.foreach(ln => { ln.foreach(no => print("%6.2f | ".format(no))); println() })
    print("Count :"); println(summary.count)
    print("Max  :"); summary.max.toArray.foreach(m => print("%5.1f | ".format(m))); println
    print("Min  :"); summary.min.toArray.foreach(m => print("%5.1f | ".format(m))); println
    print("Mean :"); summary.mean.toArray.foreach(m => print("%5.1f | ".format(m))); println
    //
    // correlations
    //
    val hp = vectors.map(x => x(2))
    val weight = vectors.map(x => x(10))
    var corP = Statistics.corr(hp, weight, "pearson") // default
    println("hp to weight : Pearson Correlation = %2.4f".format(corP))
    var corS = Statistics.corr(hp, weight, "spearman") // Need to specify
    println("hp to weight : Spearman Correlation = %2.4f".format(corS))
    //
    val raRatio = vectors.map(x => x(5))
    val width = vectors.map(x => x(9))
    corP = Statistics.corr(raRatio, width, "pearson") // default
    println("Rear Axle Ratio to width : Pearson Correlation = %2.4f".format(corP))
    corS = Statistics.corr(raRatio, width, "spearman") // Need to specify
    println("Rear Axle Ratio to width : Spearman Correlation = %2.4f".format(corS))
    //
    // Linear Regression
    //
    val carRDDLP = carRDD.map(x => carDataToLP(x)) // create a labeled point RDD
    println(carRDDLP.count())
    println(carRDDLP.first().label)
    println(carRDDLP.first().features)
    //
    // Let us split the data set into training & test set using a very simple filter
    //
    val carRDDLPTrain = carRDDLP.filter(x => x.features(9) <= 4000)
    val carRDDLPTest = carRDDLP.filter(x => x.features(9) > 4000)
    println("Training Set : " + "%3d".format(carRDDLPTrain.count()))
    println("Test Set     : " + "%3d".format(carRDDLPTest.count()))
    //
    // Train a Linear Regression Model
    // numIterations = 100, stepsize = 0.000000001
    // without such a small step size the algorithm will diverge
    //
    val mdlLR = LinearRegressionWithSGD.train(carRDDLPTrain, 100, 0.000000001)
    println(mdlLR.intercept) // Intercept is turned off when using LinearRegressionWithSGD,
                             // so the intercept will always be 0 for this code
    println(mdlLR.weights)
    //
    // Now let us use the model to predict our test set
    //
    val valuesAndPreds = carRDDLPTest.map(p => (p.label, mdlLR.predict(p.features)))
    val mse = valuesAndPreds.map(vp => math.pow((vp._1 - vp._2), 2)).
      reduce(_ + _) / valuesAndPreds.count()
    println("Mean Squared Error      = " + "%6.3f".format(mse))
    println("Root Mean Squared Error = " + "%6.3f".format(math.sqrt(mse)))
    // Let us print what the model predicted
    valuesAndPreds.take(20).foreach(m => println("%5.1f | %5.1f |".format(m._1, m._2)))
  }
}
[/code]

Fast Data Processing with Spark - Second Edition


Lisrelchen posted on 2016-3-29 10:30:27
[code]
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.DecisionTree

object MLlib02 {
  //
  def getCurrentDirectory = new java.io.File(".").getCanonicalPath
  //
  //  0 pclass,1 survived,2 l.name,3 f.name,4 sex,5 age,6 sibsp,7 parch,8 ticket,9 fare,10 cabin,
  // 11 embarked,12 boat,13 body,14 home.dest
  //
  def str2Double(x: String): Double = {
    try {
      x.toDouble
    } catch {
      case e: Exception => 0.0
    }
  }
  //
  def parsePassengerDataToLP(inpLine: String): LabeledPoint = {
    val values = inpLine.split(',')
    //println(values)
    //println(values.length)
    //
    val pclass = str2Double(values(0))
    val survived = str2Double(values(1))
    // skip last name, first name
    var sex = 0
    if (values(4) == "male") {
      sex = 1
    }
    var age = 0.0 // a better choice would be the average of all ages
    age = str2Double(values(5))
    //
    var sibsp = 0.0
    sibsp = str2Double(values(6))
    //
    var parch = 0.0
    parch = str2Double(values(7))
    //
    var fare = 0.0
    fare = str2Double(values(9))
    return new LabeledPoint(survived, Vectors.dense(pclass, sex, age, sibsp, parch, fare))
  }
  //
  def main(args: Array[String]): Unit = {
    println(getCurrentDirectory)
    val sc = new SparkContext("local", "Chapter 8")
    println(s"Running Spark Version ${sc.version}")
    //
    val dataFile = sc.textFile("/Users/ksankar/fdps-vii/data/titanic3_01.csv")
    val titanicRDDLP = dataFile.map(_.trim).filter(_.length > 1).
      map(line => parsePassengerDataToLP(line))
    //
    println(titanicRDDLP.count())
    //titanicRDDLP.foreach(println)
    //
    println(titanicRDDLP.first().label)
    println(titanicRDDLP.first().features)
    //
    val categoricalFeaturesInfo = Map[Int, Int]()
    val mdlTree = DecisionTree.trainClassifier(titanicRDDLP, 2, // numClasses
      categoricalFeaturesInfo, // all features are continuous
      "gini", // impurity
      5, // maxDepth
      32) // maxBins
    //
    //println(mdlTree.depth)
    println(mdlTree)
    //
    // Let us predict on the data set and see how well it works
    // In the real world, we should split the data into train & test sets and then predict on the test set
    //
    val predictions = mdlTree.predict(titanicRDDLP.map(x => x.features))
    val labelsAndPreds = titanicRDDLP.map(x => x.label).zip(predictions)
    //
    val mse = labelsAndPreds.map(vp => math.pow((vp._1 - vp._2), 2)).
      reduce(_ + _) / labelsAndPreds.count()
    println("Mean Squared Error = " + "%6f".format(mse))
    //
    // labelsAndPreds.foreach(println)
    //
    val correctVals = labelsAndPreds.aggregate(0.0)((x, rec) => x + (rec._1 == rec._2).compare(false), _ + _)
    val accuracy = correctVals / labelsAndPreds.count()
    println("Accuracy = " + "%3.2f%%".format(accuracy * 100))
    //
    println("*** Done ***")
  }
}
[/code]

Fast Data Processing with Spark - Second Edition


Lisrelchen posted on 2016-3-29 10:32:35
[code]
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.KMeans

object MLlib03 {
  def parsePoints(inpLine: String): Vector = {
    val values = inpLine.split(',')
    val x = values(0).toInt
    val y = values(1).toInt
    return Vectors.dense(x, y)
  }
  //
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "Chapter 8")
    println(s"Running Spark Version ${sc.version}")
    //
    val dataFile = sc.textFile("/Users/ksankar/fdps-vii/data/cluster-points.csv")
    val points = dataFile.map(_.trim).filter(_.length > 1).map(line => parsePoints(line))
    //
    println(points.count())
    //
    var numClusters = 2
    val numIterations = 20
    var mdlKMeans = KMeans.train(points, numClusters, numIterations)
    //
    mdlKMeans.clusterCenters.foreach(println)
    //
    var clusterPred = points.map(x => mdlKMeans.predict(x))
    var clusterMap = points.zip(clusterPred)
    //
    clusterMap.foreach(println)
    //
    clusterMap.saveAsTextFile("/Users/ksankar/fdps-vii/data/3x-cluster.csv")
    //
    // Now let us try 4 centers
    //
    numClusters = 4
    mdlKMeans = KMeans.train(points, numClusters, numIterations)
    clusterPred = points.map(x => mdlKMeans.predict(x))
    clusterMap = points.zip(clusterPred)
    clusterMap.saveAsTextFile("/Users/ksankar/fdps-vii/data/5x-cluster.csv")
    clusterMap.foreach(println)
  }
}
[/code]

Fast Data Processing with Spark - Second Edition


Lisrelchen posted on 2016-3-29 10:34:12
[code]
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // for implicit conversions
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.mllib.recommendation.ALS

object MLlib04 {

  def parseRating1(line: String): (Int, Int, Double, Int) = {
    val x = line.split("::")
    val userId = x(0).toInt
    val movieId = x(1).toInt
    val rating = x(2).toDouble
    val timeStamp = x(3).toInt / 10
    return (userId, movieId, rating, timeStamp)
  }
  //
  def parseRating(x: (Int, Int, Double, Int)): Rating = {
    val userId = x._1
    val movieId = x._2
    val rating = x._3
    val timeStamp = x._4 // ignore
    return new Rating(userId, movieId, rating)
  }
  //
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "Chapter 8")
    println(s"Running Spark Version ${sc.version}")
    //
    val moviesFile = sc.textFile("/Users/ksankar/fdps-vii/data/medium/movies.dat")
    val moviesRDD = moviesFile.map(line => line.split("::"))
    println(moviesRDD.count())
    //
    val ratingsFile = sc.textFile("/Users/ksankar/fdps-vii/data/medium/ratings.dat")
    val ratingsRDD = ratingsFile.map(line => parseRating1(line))
    println(ratingsRDD.count())
    //
    ratingsRDD.take(5).foreach(println) // always check the RDD
    //
    val numRatings = ratingsRDD.count()
    val numUsers = ratingsRDD.map(r => r._1).distinct().count()
    val numMovies = ratingsRDD.map(r => r._2).distinct().count()
    println("Got %d ratings from %d users on %d movies.".
      format(numRatings, numUsers, numMovies))
    //
    // Split the dataset into training, validation & test sets
    // We could split randomly, but here we use the last digit of the time stamp
    //
    val trainSet = ratingsRDD.filter(x => (x._4 % 10) < 6).map(x => parseRating(x))
    val validationSet = ratingsRDD.filter(x => (x._4 % 10) >= 6 & (x._4 % 10) < 8).map(x => parseRating(x))
    val testSet = ratingsRDD.filter(x => (x._4 % 10) >= 8).map(x => parseRating(x))
    println("Training: " + "%d".format(trainSet.count()) +
      ", validation: " + "%d".format(validationSet.count()) + ", test: " +
      "%d".format(testSet.count()) + ".")
    //
    // Now train the model using the training set
    val rank = 10
    val numIterations = 20
    val mdlALS = ALS.train(trainSet, rank, numIterations)
    //
    // prepare the validation set for prediction
    //
    val userMovie = validationSet.map {
      case Rating(user, movie, rate) => (user, movie)
    }
    //
    // Predict and convert to a Key-Value PairRDD
    val predictions = mdlALS.predict(userMovie).map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }
    //
    println(predictions.count())
    predictions.take(5).foreach(println)
    //
    // Now convert the validation set to a PairRDD
    //
    val validationPairRDD = validationSet.map(r => ((r.user, r.product), r.rating))
    println(validationPairRDD.count())
    validationPairRDD.take(5).foreach(println)
    println(validationPairRDD.getClass())
    println(predictions.getClass())
    //
    // Now join the validation set with the predictions
    // Then we can figure out how good our recommendations are
    // Tip:
    //   Need to import org.apache.spark.SparkContext._
    //   Then MappedRDD is converted implicitly to PairRDD
    //
    val ratingsAndPreds = validationPairRDD.join(predictions)
    println(ratingsAndPreds.count())
    ratingsAndPreds.take(3).foreach(println)
    //
    val mse = ratingsAndPreds.map(r => {
      math.pow((r._2._1 - r._2._2), 2)
    }).reduce(_ + _) / ratingsAndPreds.count()
    val rmse = math.sqrt(mse)
    println("*** Model Performance Metrics ***")
    println("MSE = %2.5f".format(mse))
    println("RMSE = %2.5f".format(rmse))
    println("** Done **")
  }
}
[/code]

Fast Data Processing with Spark - Second Edition

