Original poster: Lisrelchen

[Apache Spark] Apache Spark API By Example

#11 Lisrelchen posted on 2017-3-8 10:46:30
The ``where()`` clause is equivalent to ``filter()``.

val whereDF = explodeDF
  .where(($"firstName" === "xiangrui") || ($"firstName" === "michael"))
  .sort($"lastName".asc)
display(whereDF)
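For comparison, here is the same query written with filter() — a minimal sketch, assuming the same explodeDF and notebook context as above (the name filterDF is mine):

// Equivalent query using filter() in place of where()
val filterDF = explodeDF
  .filter(($"firstName" === "xiangrui") || ($"firstName" === "michael"))
  .sort($"lastName".asc)
display(filterDF)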

#12 Lisrelchen posted on 2017-3-8 10:47:04
Replace ``null`` values with ``--`` using the DataFrame na functions.

val naFunctions = explodeDF.na
val nonNullDF = naFunctions.fill("--")
display(nonNullDF)
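na.fill() can also be restricted to specific columns, and na.drop() removes rows containing nulls instead of filling them. A sketch under the same assumptions (variable names are mine):

// Fill only the name columns, leaving other columns untouched
val partialFillDF = explodeDF.na.fill("--", Seq("firstName", "lastName"))
// Or drop any row that still contains a null
val droppedDF = explodeDF.na.drop()
display(partialFillDF)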

#13 Lisrelchen posted on 2017-3-8 10:47:32
Retrieve only rows with an empty firstName or lastName.

val filterNonNullDF = nonNullDF
  .filter($"firstName" === "" || $"lastName" === "")
  .sort($"email".asc)
display(filterNonNullDF)
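Note that nonNullDF above already had its nulls replaced with "--", so this filter only catches values that were empty strings to begin with. To catch genuine nulls you would filter the pre-fill DataFrame with isNull — a sketch, assuming explodeDF still holds the original data:

// Rows where firstName or lastName is genuinely null (pre-fill data)
val nullRowsDF = explodeDF.filter($"firstName".isNull || $"lastName".isNull)
display(nullRowsDF)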

#14 Lisrelchen posted on 2017-3-8 10:48:19
Example aggregations using ``agg()`` and ``countDistinct()``.

import org.apache.spark.sql.functions._

// Find the distinct (firstName, lastName) combinations
val countDistinctDF = nonNullDF.select($"firstName", $"lastName")
  .groupBy($"firstName", $"lastName")
  .agg(countDistinct($"firstName") as "distinct_first_names")
display(countDistinctDF)
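Since each group here is a single (firstName, lastName) pair, distinct_first_names will always be 1 per row; an ungrouped aggregate is the more telling number. A minimal sketch (variable name is mine):

// Distinct first names across the whole DataFrame (single-row result)
val totalDistinctDF = nonNullDF.agg(countDistinct($"firstName") as "distinct_first_names")
display(totalDistinctDF)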

#15 Lisrelchen posted on 2017-3-8 10:49:01
Compare the DataFrame and SQL query physical plans (hint: they should be the same).

countDistinctDF.explain()

// Register the DataFrame as a temp table so that we can query it using SQL
nonNullDF.registerTempTable("databricks_df_example")

// Perform the same query as the DataFrame above and return ``explain``
sqlContext.sql("""
SELECT firstName, lastName, count(distinct firstName) as distinct_first_names
FROM databricks_df_example
GROUP BY firstName, lastName
""").explain

// Sum up all the salaries
val salarySumDF = nonNullDF.agg("salary" -> "sum")
display(salarySumDF)
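agg() also takes the functions from org.apache.spark.sql.functions, which lets you compute several aggregates in one pass. A sketch with assumed aggregate names (also note that in Spark 2.x, createOrReplaceTempView supersedes the deprecated registerTempTable):

// Several salary aggregates at once
val salaryStatsDF = nonNullDF.agg(
  sum($"salary") as "total_salary",
  avg($"salary") as "avg_salary",
  max($"salary") as "max_salary"
)
display(salaryStatsDF)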

#16 Lisrelchen posted on 2017-3-8 10:55:43
Creating Datasets

You can simply call .toDS() on a sequence to convert the sequence to a Dataset.

val dataset = Seq(1, 2, 3).toDS()
dataset.show()
If you have a sequence of case classes, calling .toDS() provides a Dataset with the case class fields as columns.

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()
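Because personDS is typed, plain Scala lambdas work against the case class fields, and the compiler checks them. A minimal sketch (variable name is mine):

// Typed transformations: .age and .name are checked at compile time
val namesDS = personDS.filter(_.age > 32).map(_.name)
namesDS.show()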
Creating Datasets from an RDD

You can call rdd.toDS() to convert an RDD into a Dataset.

val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks")))
val integerDS = rdd.toDS()
integerDS.show()
Creating Datasets from a DataFrame

You can call df.as[SomeCaseClass] to convert the DataFrame to a Dataset.

case class Company(name: String, foundingYear: Int, numEmployees: Int)
val inputSeq = Seq(Company("ABC", 1998, 310), Company("XYZ", 1983, 904), Company("NOP", 2005, 83))
val df = sc.parallelize(inputSeq).toDF()

val companyDS = df.as[Company]
companyDS.show()
You can also work with tuples when converting a DataFrame to a Dataset, without using a case class.

val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks"), (3, "Notebook")))
val df = rdd.toDF("Id", "Name")

val dataset = df.as[(Int, String)]
dataset.show()
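Going the other way — from a Dataset back to a DataFrame or an RDD — is also a single call. A short sketch reusing companyDS from above (variable names are mine):

// Dataset -> DataFrame (untyped) and Dataset -> RDD
val companyDF = companyDS.toDF()
val companyRDD = companyDS.rdd
companyDF.printSchema()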

#17 Lisrelchen posted on 2017-3-8 10:57:11
Word Count Example

val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val groupedDataset = wordsDataset.flatMap(_.toLowerCase.split(" "))
                                 .filter(_ != "")
                                 .groupBy("value")
val countsDataset = groupedDataset.count()
countsDataset.show()
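groupBy("value").count() returns an untyped result with columns value and count; to see the most frequent words first, order by the count column descending. A minimal sketch:

// Show the most common words first
countsDataset.orderBy($"count".desc).show()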

#18 Lisrelchen posted on 2017-3-8 11:00:36
Converting Datasets to DataFrames

The two examples above used the pure Dataset API. You can also easily move from Datasets to DataFrames and leverage the DataFrame API. The example below shows the word count example using both the Dataset and DataFrame APIs.

import org.apache.spark.sql.functions._

val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val result = wordsDataset
               .flatMap(_.split(" "))               // Split on whitespace
               .filter(_ != "")                     // Filter empty words
               .map(_.toLowerCase())
               .toDF()                              // Convert to DataFrame to perform aggregation / sorting
               .groupBy($"value")                   // Count number of occurrences of each word
               .agg(count("*") as "numOccurrences")
               .orderBy($"numOccurrences".desc)     // Show most common words first
result.show()
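You can also move back from the DataFrame result to a typed Dataset with as[...], as in post #16. A minimal sketch, assuming the two columns line up as (String, Long) (variable name is mine):

// DataFrame -> typed Dataset of (word, count) pairs
val typedResult = result.as[(String, Long)]
typedResult.show()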

#19 franky_sas posted on 2017-3-8 12:00:43

#20 zgs3721 posted on 2017-3-8 12:03:18
Thanks for sharing.
