Thread starter: ReneeBK

【GitHub】Spark Scala Learning Notes

14
ReneeBK posted on 2017-2-21 09:20:44
Data From S3

AWS Configuration:
  1. Select the IAM service.
  2. Create a user and copy the Access Key ID and Secret Access Key.
  3. Create a group and attach the AmazonS3FullAccess policy.
  4. Create an S3 bucket.

Spark application in Scala:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("Recommender").setMaster("local[2]"))
  val hadoopConf = sc.hadoopConfiguration
  hadoopConf.set("fs.s3n.awsAccessKeyId", "***")
  hadoopConf.set("fs.s3n.awsSecretAccessKey", "***")
  val base = "s3n://lin-spark-sample/ch3/"
  val rawUserArtistData = sc.textFile(base + "user_artist_data.txt")
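A quick sanity check (a sketch, not part of the original post) to confirm the S3 read worked, using only the rawUserArtistData RDD defined above:

  // Preview a few records and count everything read from the S3 path (sketch)
  rawUserArtistData.take(5).foreach(println)
  println(s"Total records: ${rawUserArtistData.count()}")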

15
西门高 posted on 2017-2-21 09:21:12
Thanks for sharing.

16
ReneeBK posted on 2017-2-21 09:21:56
Reading CSV

Import Dependencies:

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext
  import com.databricks.spark.csv._

Spark App in Scala:

  val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')
  students.printSchema
  students.show(3)
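The same read can also be expressed through the generic DataFrameReader interface that the spark-csv package registers (Spark 1.4+); a sketch equivalent to the csvFile call above:

  // Equivalent read via the generic DataFrameReader API (spark-csv, Spark 1.4+)
  val studentsAlt = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    .load("StudentData.csv")
  studentsAlt.show(3)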

17
ReneeBK posted on 2017-2-21 09:22:43
Using Parquet

Import Dependencies:

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext
  import com.databricks.spark.csv._

Creating DataFrame from CSV:

  val conf = new SparkConf().setAppName("DataFrameToParquet").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  // Reading the CSV file StudentData.csv
  val studentsDF = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')
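The excerpt stops before the Parquet step itself; a minimal sketch of what the write and read-back could look like with the DataFrame API (Spark 1.4+; the output directory name is an assumption):

  // Write the DataFrame out as Parquet and read it back (directory name assumed)
  studentsDF.write.parquet("StudentData.parquet")
  val studentsParquet = sqlContext.read.parquet("StudentData.parquet")
  studentsParquet.show(3)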

18
ReneeBK posted on 2017-2-21 09:25:26
Creating LabeledPoint

First, take a glance at the sample data:

  7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
  7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
  7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
  11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
  7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
  7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
  7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5

Each training sample has the following format: the last field is the quality of the wine, a rating from 1 to 10, and the first 11 fields are the properties of the wine. So the quality of the wine is the y variable (the output) and the rest are the x variables (the inputs).

Loading Data into a Spark RDD

Imports:

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

Code:

  val conf = new SparkConf().setAppName("linearRegressionWine").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val rdd = sc.textFile("winequality-red.csv").map(line => line.split(";"))

Converting each instance into a LabeledPoint

A LabeledPoint is a simple wrapper around the input features (the x variables) and the value to be predicted (the y variable) for those inputs:

Imports:

  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors

Code:

  val dataPoints = rdd.map(row =>
    new LabeledPoint(
      row.last.toDouble,
      Vectors.dense(row.take(row.length - 1).map(str => str.toDouble))
    )
  ).cache()
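A quick inspection (a sketch, not in the original post) to verify the label/feature split, using only the standard LabeledPoint fields:

  // Sanity check: the label should be the wine quality, the vector the 11 properties (sketch)
  val firstPoint = dataPoints.first()
  println(s"label = ${firstPoint.label}, features = ${firstPoint.features}")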

19
ReneeBK posted on 2017-2-21 09:26:48
Preparing the Training and Test Data

  // dataPoints is the cached RDD of LabeledPoint built in the previous post
  val dataSplit = dataPoints.randomSplit(Array(0.8, 0.2))
  val trainingSet = dataSplit(0)
  val testSet = dataSplit(1)
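The post stops at the split; a sketch of how it might be used next with MLlib's LinearRegressionWithSGD, in keeping with the linearRegressionWine app name above (the iteration count and the MSE evaluation are assumptions, not from the original post):

  import org.apache.spark.mllib.regression.LinearRegressionWithSGD

  // Fit a linear regression model on the training split (numIterations = 100 is an assumed value)
  val model = LinearRegressionWithSGD.train(trainingSet, 100)

  // Evaluate on the held-out split with mean squared error
  val predictionsAndLabels = testSet.map(p => (model.predict(p.features), p.label))
  val testMSE = predictionsAndLabels.map { case (pred, label) => math.pow(pred - label, 2) }.mean()
  println(s"Test MSE = $testMSE")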

20
ReneeBK posted on 2017-2-21 09:27:33
Summary Statistics

  import org.apache.spark.mllib.stat.Statistics

  val featureVector = rdd.map(row => Vectors.dense(row.take(row.length - 1).map(str => str.toDouble)))
  val stats = Statistics.colStats(featureVector)
  print(s"Max : ${stats.max}, Min : ${stats.min}, Mean : ${stats.mean}, Variance : ${stats.variance}")
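The same org.apache.spark.mllib.stat.Statistics object can also give the column-wise correlations of the features; a small sketch, not in the original post:

  // Pearson correlation matrix across the 11 feature columns (sketch)
  val corrMatrix = Statistics.corr(featureVector, "pearson")
  println(corrMatrix)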
