Big Data Analytics with Spark

Thread starter: Nicolle

Nicolle (student-verified) posted on 2016-1-23 09:17:19

Notice: the author has been banned or deleted; the content is automatically hidden.


ReneeBK posted on 2017-3-6 06:44:38

Machine Learning using Spark ML

1. Reserve a portion of the dataset for model testing.

// reviews is a DataFrame with columns text (String) and label (Double),
// prepared earlier in the thread.
val Array(trainingData, testData) = reviews.randomSplit(Array(0.8, 0.2))
trainingData.count
testData.count
2. Create a feature Vector for each sentence in the dataset.

import org.apache.spark.ml.feature.Tokenizer
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val tokenizedData = tokenizer.transform(trainingData)
tokenizedData: org.apache.spark.sql.DataFrame = [text: string, label: double, words: array<string>]

import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

3. Check the schema of the DataFrame that will be output by the transform method of hashingTF.

val hashedData = hashingTF.transform(tokenizedData)
hashedData: org.apache.spark.sql.DataFrame = [text: string, label: double, words: array<string>, features: vector]

4. Fit a model on the training dataset.

import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

val pipeLineModel = pipeline.fit(trainingData)
5. Evaluate how the generated model performs on both the training and test datasets.

val testPredictions = pipeLineModel.transform(testData)
val trainingPredictions = pipeLineModel.transform(trainingData)

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val evaluator = new BinaryClassificationEvaluator()

Let’s now evaluate our model using area under the ROC curve (AUC) as the metric.

import org.apache.spark.ml.param.ParamMap
val evaluatorParamMap = ParamMap(evaluator.metricName -> "areaUnderROC")

val aucTraining = evaluator.evaluate(trainingPredictions, evaluatorParamMap)
aucTraining: Double = 0.9999758519725919

val aucTest = evaluator.evaluate(testPredictions, evaluatorParamMap)
aucTest: Double = 0.6984384037015618
Our model’s AUC score is close to 1.0 on the training dataset but only about 0.70 on the test dataset. As mentioned earlier, a score close to 1.0 indicates a perfect model and a score close to 0.50 indicates a worthless one. Our model performs very well on the training data but much worse on the test data. A model almost always looks good on the data it was trained on; its true performance is indicated by how well it does on an unseen test dataset. That is why we reserved a portion of the dataset for testing.

One way to improve a model’s performance is to tune its hyperparameters. Spark ML provides a CrossValidator class that can help with this task. It requires a parameter grid over which it conducts a grid search to find the best hyperparameters using k-fold cross-validation.

Let’s build a parameter grid that we will use with an instance of the CrossValidator class.

import org.apache.spark.ml.tuning.ParamGridBuilder
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10000, 100000))
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.maxIter, Array(20, 30))
  .build()

paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
        logreg_0427de6fa5fc-maxIter: 20,
        hashingTF_24e660c4963c-numFeatures: 10000,
        logreg_0427de6fa5fc-regParam: 0.01
}, {
        logreg_0427de6fa5fc-maxIter: 20,
        hashingTF_24e660c4963c-numFeatures: 100000,
        logreg_0427de6fa5fc-regParam: 0.01
}, {
        logreg_0427de6fa5fc-maxIter: 20,
        hashingTF_24e660c4963c-numFeatures: 10000,
        logreg_0427de6fa5fc-regParam: 0.1
},...
This code creates a parameter grid consisting of two values for the number of features, three values for the regularization parameter, and two values for the maximum number of iterations, so the grid search covers 2 × 3 × 2 = 12 combinations of hyperparameter values. You can specify more options, but training will take longer, since grid search is a brute-force method that tries every combination in the parameter grid. As mentioned earlier, using a CrossValidator to do a grid search can be expensive in terms of CPU time.
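As a quick sanity check (my addition, not in the original post), the grid size can be confirmed directly in the shell:

paramGrid.length  // 12 — one entry per hyperparameter combination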

We now have all the parts required to tune the hyperparameters for the Transformers and Estimators in our machine learning pipeline.

import org.apache.spark.ml.tuning.CrossValidator
val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)
  .setEvaluator(evaluator)

val crossValidatorModel = crossValidator.fit(trainingData)
The fit method in the CrossValidator class returns an instance of the CrossValidatorModel class. Similar to other model classes, it can be used as a Transformer that predicts a label for a given feature Vector.

Let’s evaluate the performance of this model on the test dataset.

val newPredictions = crossValidatorModel.transform(testData)
newPredictions: org.apache.spark.sql.DataFrame = [text: string, label: double, words: array<string>, features: vector, rawPrediction: vector, probability: vector, prediction: double]

val newAucTest = evaluator.evaluate(newPredictions, evaluatorParamMap)
newAucTest: Double = 0.8182764603817234
As you can see, the new model’s test AUC (0.818) is about 0.12 higher than the previous model’s (0.698), roughly a 17% relative improvement.

Finally, let’s find the best model generated by crossValidator.

val bestModel = crossValidatorModel.bestModel
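To see which hyperparameter values the grid search actually selected, here is a minimal follow-up sketch (my addition, not part of the original post), assuming the pipeline built above (tokenizer, hashingTF, lr):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel

// bestModel is typed as Model[_]; cast it back to a PipelineModel
// so its individual stages are accessible.
val bestPipelineModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel]

// The logistic regression model is the last stage of the pipeline.
val bestLrModel = bestPipelineModel.stages.last.asInstanceOf[LogisticRegressionModel]

// Prints all parameters of the winning model, including regParam and maxIter.
println(bestLrModel.extractParamMap)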


ReneeBK posted on 2017-3-6 06:51:30
Machine Learning using Spark MLlib

1. Launch the Spark shell from a terminal.

path/to/spark/bin/spark-shell --master local[*]

2. Create an RDD from the dataset.

val lines = sc.textFile("/path/to/iris.data")
lines.persist()

3. Remove empty lines from the dataset.

val nonEmpty = lines.filter(_.nonEmpty)
4. Extract the features and labels by splitting each line on commas; the first four fields are the features and the fifth is the species name.

val parsed = nonEmpty map {_.split(",")}

5. Map each distinct species name to a numeric label, since LabeledPoint requires numeric labels. For the Iris dataset this maps the three species to 0, 1, and 2 (the exact assignment depends on the order in which distinct returns them).

val distinctSpecies = parsed.map{a => a(4)}.distinct.collect
val textToNumeric = distinctSpecies.zipWithIndex.toMap
6. Create an RDD of LabeledPoint from parsed.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val labeledPoints = parsed.map{a =>
  LabeledPoint(textToNumeric(a(4)),
    Vectors.dense(a(0).toDouble, a(1).toDouble, a(2).toDouble, a(3).toDouble))}

7. Split the dataset into training and test data.

val dataSplits = labeledPoints.randomSplit(Array(0.8, 0.2))
val trainingData = dataSplits(0)
val testData = dataSplits(1)
8. Train a model.

import org.apache.spark.mllib.classification.NaiveBayes
val model = NaiveBayes.train(trainingData)

9. Evaluate our model on the test dataset by pairing each prediction with the true label.

val predictionsAndLabels = testData.map{d => (model.predict(d.features), d.label)}
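The post stops short of computing a metric. A minimal sketch (my addition, not in the original post) turns predictionsAndLabels into an overall accuracy figure:

// Fraction of test examples whose predicted label matches the true label.
val accuracy = predictionsAndLabels.filter { case (p, l) => p == l }.count.toDouble /
               predictionsAndLabels.count

// Alternatively, MLlib's MulticlassMetrics offers this and other metrics;
// the accuracy method is available in Spark 2.0+ (on 1.x, precision
// returns the same overall value).
import org.apache.spark.mllib.evaluation.MulticlassMetrics
val metrics = new MulticlassMetrics(predictionsAndLabels)
println(metrics.accuracy)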

