Thread starter: ReneeBK

【GitHub】Spark Scala Learning Notes

14
ReneeBK posted on 2017-2-21 09:20:44
Data From S3

AWS Configuration:
  1. Select the IAM service.
  2. Create a user and copy the Access Key ID and Secret Access Key.
  3. Create a group and attach the AmazonS3FullAccess policy.
  4. Create an S3 bucket.

Spark application in Scala:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("Recommender").setMaster("local[2]"))
  val hadoopConf = sc.hadoopConfiguration
  hadoopConf.set("fs.s3n.awsAccessKeyId", "***")
  hadoopConf.set("fs.s3n.awsSecretAccessKey", "***")
  val base = "s3n://lin-spark-sample/ch3/"
  val rawUserArtistData = sc.textFile(base + "user_artist_data.txt")
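A quick sanity check (a sketch, not part of the original post) to confirm the S3 read worked, using only the rawUserArtistData RDD defined above:

  // Preview a few records and count everything read from the S3 path (sketch)
  rawUserArtistData.take(5).foreach(println)
  println(s"Total records: ${rawUserArtistData.count()}")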

15
西门高 posted on 2017-2-21 09:21:12
Thanks for sharing.

16
ReneeBK posted on 2017-2-21 09:21:56
Reading CSV

Import Dependencies:

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext
  import com.databricks.spark.csv._

Spark App in Scala:

  val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')
  students.printSchema
  students.show(3)
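The same read can also be expressed through the generic DataFrameReader interface that the spark-csv package registers (Spark 1.4+); a sketch equivalent to the csvFile call above:

  // Equivalent read via the generic DataFrameReader API (spark-csv, Spark 1.4+)
  val studentsAlt = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    .load("StudentData.csv")
  studentsAlt.show(3)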

17
ReneeBK posted on 2017-2-21 09:22:43
Using Parquet

Import Dependencies:

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext
  import com.databricks.spark.csv._

Creating DataFrame from CSV:

  val conf = new SparkConf().setAppName("DataFrameToParquet").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  // Reading the CSV file StudentData.csv
  val studentsDF = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')
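The excerpt stops before the Parquet step itself; a minimal sketch of what the write and read-back could look like with the DataFrame API (Spark 1.4+; the output directory name is an assumption):

  // Write the DataFrame out as Parquet and read it back (directory name assumed)
  studentsDF.write.parquet("StudentData.parquet")
  val studentsParquet = sqlContext.read.parquet("StudentData.parquet")
  studentsParquet.show(3)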

18
ReneeBK posted on 2017-2-21 09:25:26
Creating LabeledPoint

First, take a glance at the sample data:

  7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
  7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
  7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
  11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
  7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
  7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
  7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5

Each training sample has the following format: the last field is the quality of the wine, a rating from 1 to 10, and the first 11 fields are the properties of the wine. So the quality of the wine is the y variable (the output) and the rest are the x variables (the inputs).

Loading Data into a Spark RDD

Imports:

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

Code:

  val conf = new SparkConf().setAppName("linearRegressionWine").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val rdd = sc.textFile("winequality-red.csv").map(line => line.split(";"))

Converting each instance into a LabeledPoint

A LabeledPoint is a simple wrapper around the input features (the x variables) and the value to be predicted (the y variable) for those inputs:

Imports:

  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors

Code:

  val dataPoints = rdd.map(row =>
    new LabeledPoint(
      row.last.toDouble,
      Vectors.dense(row.take(row.length - 1).map(str => str.toDouble))
    )
  ).cache()
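A quick inspection (a sketch, not in the original post) to verify the label/feature split, using only the standard LabeledPoint fields:

  // Sanity check: the label should be the wine quality, the vector the 11 properties (sketch)
  val firstPoint = dataPoints.first()
  println(s"label = ${firstPoint.label}, features = ${firstPoint.features}")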

19
ReneeBK posted on 2017-2-21 09:26:48
Preparing the Training and Test Data

  // dataPoints is the cached RDD of LabeledPoint built in the previous post
  val dataSplit = dataPoints.randomSplit(Array(0.8, 0.2))
  val trainingSet = dataSplit(0)
  val testSet = dataSplit(1)
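The post stops at the split; a sketch of how it might be used next with MLlib's LinearRegressionWithSGD, in keeping with the linearRegressionWine app name above (the iteration count and the MSE evaluation are assumptions, not from the original post):

  import org.apache.spark.mllib.regression.LinearRegressionWithSGD

  // Fit a linear regression model on the training split (numIterations = 100 is an assumed value)
  val model = LinearRegressionWithSGD.train(trainingSet, 100)

  // Evaluate on the held-out split with mean squared error
  val predictionsAndLabels = testSet.map(p => (model.predict(p.features), p.label))
  val testMSE = predictionsAndLabels.map { case (pred, label) => math.pow(pred - label, 2) }.mean()
  println(s"Test MSE = $testMSE")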

20
ReneeBK posted on 2017-2-21 09:27:33
Summary Statistics

  import org.apache.spark.mllib.stat.Statistics

  val featureVector = rdd.map(row => Vectors.dense(row.take(row.length - 1).map(str => str.toDouble)))
  val stats = Statistics.colStats(featureVector)
  print(s"Max : ${stats.max}, Min : ${stats.min}, Mean : ${stats.mean}, Variance : ${stats.variance}")
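The same org.apache.spark.mllib.stat.Statistics object can also give the column-wise correlations of the features; a small sketch, not in the original post:

  // Pearson correlation matrix across the 11 feature columns (sketch)
  val corrMatrix = Statistics.corr(featureVector, "pearson")
  println(corrMatrix)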
