Original poster: Lisrelchen

[Apache Spark] Apache Spark API By Example

#11 Lisrelchen posted on 2017-3-8 10:46:30
The ``where()`` clause is equivalent to ``filter()``.

val whereDF = explodeDF
  .where(($"firstName" === "xiangrui") || ($"firstName" === "michael"))
  .sort($"lastName".asc)
display(whereDF)
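For comparison, here is the same query written with filter() — a minimal sketch, assuming the same explodeDF and notebook context as above (the name filterDF is mine):

// Equivalent query using filter() in place of where()
val filterDF = explodeDF
  .filter(($"firstName" === "xiangrui") || ($"firstName" === "michael"))
  .sort($"lastName".asc)
display(filterDF)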

#12 Lisrelchen posted on 2017-3-8 10:47:04
Replace ``null`` values with ``--`` using the DataFrame na functions.

val naFunctions = explodeDF.na
val nonNullDF = naFunctions.fill("--")
display(nonNullDF)
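na.fill() can also be restricted to specific columns, and na.drop() removes rows containing nulls instead of filling them. A sketch under the same assumptions (variable names are mine):

// Fill only the name columns, leaving other columns untouched
val partialFillDF = explodeDF.na.fill("--", Seq("firstName", "lastName"))
// Or drop any row that still contains a null
val droppedDF = explodeDF.na.drop()
display(partialFillDF)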

#13 Lisrelchen posted on 2017-3-8 10:47:32
Retrieve only rows with an empty firstName or lastName.

val filterNonNullDF = nonNullDF
  .filter($"firstName" === "" || $"lastName" === "")
  .sort($"email".asc)
display(filterNonNullDF)
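Note that nonNullDF above already had its nulls replaced with "--", so this filter only catches values that were empty strings to begin with. To catch genuine nulls you would filter the pre-fill DataFrame with isNull — a sketch, assuming explodeDF still holds the original data:

// Rows where firstName or lastName is genuinely null (pre-fill data)
val nullRowsDF = explodeDF.filter($"firstName".isNull || $"lastName".isNull)
display(nullRowsDF)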

#14 Lisrelchen posted on 2017-3-8 10:48:19
Example aggregations using ``agg()`` and ``countDistinct()``.

import org.apache.spark.sql.functions._

// Find the distinct (firstName, lastName) combinations
val countDistinctDF = nonNullDF.select($"firstName", $"lastName")
  .groupBy($"firstName", $"lastName")
  .agg(countDistinct($"firstName") as "distinct_first_names")
display(countDistinctDF)
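Since each group here is a single (firstName, lastName) pair, distinct_first_names will always be 1 per row; an ungrouped aggregate is the more telling number. A minimal sketch (variable name is mine):

// Distinct first names across the whole DataFrame (single-row result)
val totalDistinctDF = nonNullDF.agg(countDistinct($"firstName") as "distinct_first_names")
display(totalDistinctDF)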

#15 Lisrelchen posted on 2017-3-8 10:49:01
Compare the DataFrame and SQL query physical plans (hint: they should be the same).

countDistinctDF.explain()

// Register the DataFrame as a temp table so that we can query it using SQL
nonNullDF.registerTempTable("databricks_df_example")

// Perform the same query as the DataFrame above and return ``explain``
sqlContext.sql("""
SELECT firstName, lastName, count(distinct firstName) as distinct_first_names
FROM databricks_df_example
GROUP BY firstName, lastName
""").explain

// Sum up all the salaries
val salarySumDF = nonNullDF.agg("salary" -> "sum")
display(salarySumDF)
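agg() also takes the functions from org.apache.spark.sql.functions, which lets you compute several aggregates in one pass. A sketch with assumed aggregate names (also note that in Spark 2.x, createOrReplaceTempView supersedes the deprecated registerTempTable):

// Several salary aggregates at once
val salaryStatsDF = nonNullDF.agg(
  sum($"salary") as "total_salary",
  avg($"salary") as "avg_salary",
  max($"salary") as "max_salary"
)
display(salaryStatsDF)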

#16 Lisrelchen posted on 2017-3-8 10:55:43
Creating Datasets

You can simply call .toDS() on a sequence to convert the sequence to a Dataset.

val dataset = Seq(1, 2, 3).toDS()
dataset.show()
If you have a sequence of case classes, calling .toDS() provides a Dataset with the case class fields as columns.

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()
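Because personDS is typed, plain Scala lambdas work against the case class fields, and the compiler checks them. A minimal sketch (variable name is mine):

// Typed transformations: .age and .name are checked at compile time
val namesDS = personDS.filter(_.age > 32).map(_.name)
namesDS.show()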
Creating Datasets from an RDD

You can call rdd.toDS() to convert an RDD into a Dataset.

val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks")))
val integerDS = rdd.toDS()
integerDS.show()
Creating Datasets from a DataFrame

You can call df.as[SomeCaseClass] to convert the DataFrame to a Dataset.

case class Company(name: String, foundingYear: Int, numEmployees: Int)
val inputSeq = Seq(Company("ABC", 1998, 310), Company("XYZ", 1983, 904), Company("NOP", 2005, 83))
val df = sc.parallelize(inputSeq).toDF()

val companyDS = df.as[Company]
companyDS.show()
You can also work with tuples when converting a DataFrame to a Dataset, without using a case class.

val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks"), (3, "Notebook")))
val df = rdd.toDF("Id", "Name")

val dataset = df.as[(Int, String)]
dataset.show()
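Going the other way — from a Dataset back to a DataFrame or an RDD — is also a single call. A short sketch reusing companyDS from above (variable names are mine):

// Dataset -> DataFrame (untyped) and Dataset -> RDD
val companyDF = companyDS.toDF()
val companyRDD = companyDS.rdd
companyDF.printSchema()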

#17 Lisrelchen posted on 2017-3-8 10:57:11
Word Count Example

val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val groupedDataset = wordsDataset.flatMap(_.toLowerCase.split(" "))
                                 .filter(_ != "")
                                 .groupBy("value")
val countsDataset = groupedDataset.count()
countsDataset.show()
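groupBy("value").count() returns an untyped result with columns value and count; to see the most frequent words first, order by the count column descending. A minimal sketch:

// Show the most common words first
countsDataset.orderBy($"count".desc).show()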

#18 Lisrelchen posted on 2017-3-8 11:00:36
Converting Datasets to DataFrames

The two examples above used the pure Dataset API. You can also easily move from Datasets to DataFrames and leverage the DataFrame API. The example below shows the word count example using both the Dataset and DataFrame APIs.

import org.apache.spark.sql.functions._

val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val result = wordsDataset
               .flatMap(_.split(" "))               // Split on whitespace
               .filter(_ != "")                     // Filter empty words
               .map(_.toLowerCase())
               .toDF()                              // Convert to DataFrame to perform aggregation / sorting
               .groupBy($"value")                   // Count number of occurrences of each word
               .agg(count("*") as "numOccurrences")
               .orderBy($"numOccurrences".desc)     // Show most common words first
result.show()
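You can also move back from the DataFrame result to a typed Dataset with as[...], as in post #16. A minimal sketch, assuming the two columns line up as (String, Long) (variable name is mine):

// DataFrame -> typed Dataset of (word, count) pairs
val typedResult = result.as[(String, Long)]
typedResult.show()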

#19 franky_sas posted on 2017-3-8 12:00:43

#20 zgs3721 posted on 2017-3-8 12:03:18
Thanks for sharing.
