Original poster: Lisrelchen

【Apache Spark】Apache Spark API By Example

11
Lisrelchen posted on 2017-3-8 10:46:30
The ``where()`` clause is equivalent to ``filter()``.

val whereDF = explodeDF.where(($"firstName" === "xiangrui") || ($"firstName" === "michael")).sort($"lastName".asc)
display(whereDF)
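For comparison, a minimal sketch of the same query written with ``filter()``; it should produce the same result as ``whereDF`` above:

val filterDF = explodeDF.filter(($"firstName" === "xiangrui") || ($"firstName" === "michael")).sort($"lastName".asc)
display(filterDF)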


12
Lisrelchen posted on 2017-3-8 10:47:04
Replace ``null`` values with ``--`` using the DataFrame Na functions.

val naFunctions = explodeDF.na
val nonNullDF = naFunctions.fill("--")
display(nonNullDF)
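``na.fill`` can also target specific columns rather than the whole DataFrame; a minimal sketch, assuming the ``firstName`` and ``lastName`` columns from this example's schema:

val filledNamesDF = explodeDF.na.fill("--", Seq("firstName", "lastName"))
display(filledNamesDF)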


13
Lisrelchen posted on 2017-3-8 10:47:32
Retrieve only rows with a missing firstName or lastName.

val filterNonNullDF = nonNullDF.filter($"firstName" === "" || $"lastName" === "").sort($"email".asc)
display(filterNonNullDF)
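Note that ``na.fill`` only replaces ``null`` values, so names stored as empty strings are still ``""`` here, which is why the filter tests against ``""``. If your missing values were nulls that ``fill("--")`` has already rewritten, you would test against ``"--"`` instead; a minimal sketch:

val filterFilledDF = nonNullDF.filter($"firstName" === "--" || $"lastName" === "--").sort($"email".asc)
display(filterFilledDF)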


14
Lisrelchen posted on 2017-3-8 10:48:19
Example aggregations using ``agg()`` and ``countDistinct()``.

import org.apache.spark.sql.functions._

// Find the distinct (firstName, lastName) combinations
val countDistinctDF = nonNullDF.select($"firstName", $"lastName")
  .groupBy($"firstName", $"lastName")
  .agg(countDistinct($"firstName") as "distinct_first_names")
display(countDistinctDF)
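``agg()`` also works on an ungrouped DataFrame; a minimal sketch counting the distinct first names across all rows:

val distinctFirstNamesDF = nonNullDF.agg(countDistinct($"firstName") as "distinct_first_names")
display(distinctFirstNamesDF)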


15
Lisrelchen posted on 2017-3-8 10:49:01
Compare the DataFrame and SQL query physical plans. (Hint: they should be the same.)

countDistinctDF.explain()

// Register the DataFrame as a temp table so that we can query it using SQL
// (in Spark 2.0+, createOrReplaceTempView is the non-deprecated equivalent)
nonNullDF.registerTempTable("databricks_df_example")

// Perform the same query as the DataFrame above and call ``explain``
sqlContext.sql("""
SELECT firstName, lastName, count(distinct firstName) as distinct_first_names
FROM databricks_df_example
GROUP BY firstName, lastName
""").explain()

// Sum up all the salaries
val salarySumDF = nonNullDF.agg("salary" -> "sum")
display(salarySumDF)
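The string-pair form ``agg("salary" -> "sum")`` has a column-function equivalent; a minimal sketch using ``sum()`` (``org.apache.spark.sql.functions._`` is already imported above; the alias name is just for illustration):

val salarySumDF2 = nonNullDF.agg(sum($"salary") as "salary_sum")
display(salarySumDF2)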


16
Lisrelchen posted on 2017-3-8 10:55:43
Creating Datasets

You can simply call .toDS() on a sequence to convert the sequence to a Dataset.

val dataset = Seq(1, 2, 3).toDS()
dataset.show()

If you have a sequence of case classes, calling .toDS() will provide a Dataset with all the necessary fields.

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()

Creating Datasets from an RDD

You can call rdd.toDS() to convert an RDD into a Dataset.

val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks")))
val integerDS = rdd.toDS()
integerDS.show()

Creating Datasets from a DataFrame

You can call df.as[SomeCaseClass] to convert the DataFrame to a Dataset.

case class Company(name: String, foundingYear: Int, numEmployees: Int)
val inputSeq = Seq(Company("ABC", 1998, 310), Company("XYZ", 1983, 904), Company("NOP", 2005, 83))
val df = sc.parallelize(inputSeq).toDF()

val companyDS = df.as[Company]
companyDS.show()

You can also deal with tuples when converting a DataFrame to a Dataset, without using a case class.

val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks"), (3, "Notebook")))
val df = rdd.toDF("Id", "Name")

val dataset = df.as[(Int, String)]
dataset.show()
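A typed Dataset buys you compile-time field access inside lambdas; a minimal sketch reusing ``companyDS`` from above (the 100-employee threshold is just for illustration):

// Typed filter and map over Company objects
val bigCompanyNames = companyDS.filter(_.numEmployees > 100).map(_.name)
bigCompanyNames.show()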


17
Lisrelchen posted on 2017-3-8 10:57:11
Word Count Example

val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val groupedDataset = wordsDataset.flatMap(_.toLowerCase.split(" "))
                                 .filter(_ != "")
                                 .groupBy("value")
val countsDataset = groupedDataset.count()
countsDataset.show()
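Note that ``groupBy("value")`` drops back to the untyped API. For comparison, a minimal sketch of a fully typed version using ``groupByKey`` (assuming Spark 2.0+, where it returns a KeyValueGroupedDataset):

val typedCounts = wordsDataset
  .flatMap(_.toLowerCase.split(" "))
  .filter(_ != "")
  .groupByKey(identity)
  .count()
typedCounts.show()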


18
Lisrelchen posted on 2017-3-8 11:00:36
Converting Datasets to DataFrames

The two examples above used the pure Dataset API. You can also easily move from Datasets to DataFrames and leverage the DataFrame API. The example below shows the word count example using both the Dataset and DataFrame APIs.

import org.apache.spark.sql.functions._

val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val result = wordsDataset
               .flatMap(_.split(" "))               // Split on whitespace
               .filter(_ != "")                     // Filter empty words
               .map(_.toLowerCase())
               .toDF()                              // Convert to DataFrame to perform aggregation / sorting
               .groupBy($"value")                   // Count the number of occurrences of each word
               .agg(count("*") as "numOccurrences")
               .orderBy($"numOccurrences".desc)     // Show most common words first
result.show()
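The conversion works in both directions; a minimal sketch mapping the aggregated DataFrame back to a typed Dataset of tuples, matching columns by position as in the tuple example above:

val resultDS = result.as[(String, Long)]
resultDS.show()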


19
franky_sas posted on 2017-3-8 12:00:43


20
zgs3721 posted on 2017-3-8 12:03:18
Thanks for sharing

