OP: Lisrelchen

[Apache Spark] Apache Spark API By Example


Apache Spark API By Example
A Command Reference for Beginners
Matthias Langer, Zhen He
Department of Computer Science and Computer Engineering
La Trobe University
Bundoora, VIC 3086, Australia
m.langer@latrobe.edu.au, z.he@latrobe.edu.au
May 31, 2014

Hidden content of this post

Apache Spark API By Example.rar (258.94 KB) — this attachment includes:
  • Apache Spark API By Example.pdf


Keywords: Apache Spark, example, 2014


#2
auirzxp (student verified) · posted 2017-3-8 10:28:01


#3
Lisrelchen · posted 2017-3-8 10:31:52
Creating DataFrames with Python

# import the pyspark Row class from the sql module
from pyspark.sql import Row

# display() and sqlContext are provided by the Databricks notebook environment

# Create example data: departments and employees

# Create the Departments
department1 = Row(id='123456', name='Computer Science')
department2 = Row(id='789012', name='Mechanical Engineering')
department3 = Row(id='345678', name='Theater and Drama')
department4 = Row(id='901234', name='Indoor Recreation')

# Create the Employees
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)
employee2 = Employee('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)
employee3 = Employee('matei', None, 'no-reply@waterloo.edu', 140000)
employee4 = Employee(None, 'wendell', 'no-reply@berkeley.edu', 160000)

# Create the DepartmentWithEmployees instances from Departments and Employees
departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])

print(department1)
print(employee2)
print(departmentWithEmployees1.employees[0].email)

Create the first DataFrame from a list of the rows.

departmentsWithEmployeesSeq1 = [departmentWithEmployees1, departmentWithEmployees2]
df1 = sqlContext.createDataFrame(departmentsWithEmployeesSeq1)
display(df1)

Create a second DataFrame from a list of rows.

departmentsWithEmployeesSeq2 = [departmentWithEmployees3, departmentWithEmployees4]
df2 = sqlContext.createDataFrame(departmentsWithEmployeesSeq2)
display(df2)


#4
Lisrelchen · posted 2017-3-8 10:33:01
An example using pandas & Matplotlib integration

import pandas as pd
import matplotlib.pyplot as plt

plt.clf()
# nonNullDF is assumed from an earlier step of the Databricks tutorial
# (a version of the exploded employee DataFrame with null names handled)
pdDF = nonNullDF.toPandas()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()  # in Databricks, renders the current matplotlib figure

Cleanup: remove the Parquet file.

dbutils.fs.rm("/tmp/databricks-df-example.parquet", True)


#5
Lisrelchen · posted 2017-3-8 10:35:24
Create DataFrames using Scala

// Create the case classes for our domain
case class Department(id: String, name: String)
case class Employee(firstName: String, lastName: String, email: String, salary: Int)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

// Create the Departments (case classes need no `new`)
val department1 = Department("123456", "Computer Science")
val department2 = Department("789012", "Mechanical Engineering")
val department3 = Department("345678", "Theater and Drama")
val department4 = Department("901234", "Indoor Recreation")

// Create the Employees
val employee1 = Employee("michael", "armbrust", "no-reply@berkeley.edu", 100000)
val employee2 = Employee("xiangrui", "meng", "no-reply@stanford.edu", 120000)
val employee3 = Employee("matei", null, "no-reply@waterloo.edu", 140000)
val employee4 = Employee(null, "wendell", "no-reply@princeton.edu", 160000)

// Create the DepartmentWithEmployees instances from Departments and Employees
val departmentWithEmployees1 = DepartmentWithEmployees(department1, Seq(employee1, employee2))
val departmentWithEmployees2 = DepartmentWithEmployees(department2, Seq(employee3, employee4))
val departmentWithEmployees3 = DepartmentWithEmployees(department3, Seq(employee1, employee4))
val departmentWithEmployees4 = DepartmentWithEmployees(department4, Seq(employee2, employee3))

Create the first DataFrame from a List of the case classes.

val departmentsWithEmployeesSeq1 = Seq(departmentWithEmployees1, departmentWithEmployees2)
val df1 = departmentsWithEmployeesSeq1.toDF()
display(df1)

Create a second DataFrame from a List of case classes.

val departmentsWithEmployeesSeq2 = Seq(departmentWithEmployees3, departmentWithEmployees4)
val df2 = departmentsWithEmployeesSeq2.toDF()
display(df2)
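Not part of the original post, but a quick sanity check on the result: inspect the inferred schema and a few rows. A minimal sketch, assuming the same notebook session in which toDF() is available:

// Verify the nested structure Spark inferred from the case classes
df1.printSchema() // department as a struct, employees as an array of structs
df1.show(false)   // print column contents without truncation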


#6
Lisrelchen · posted 2017-3-8 10:36:24
Flattening using Scala

If your data has several levels of nesting, here is a helper function that flattens your DataFrame to make it easier to work with.

val veryNestedDF = Seq(("1", (2, (3, 4)))).toDF()

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

implicit class DataFrameFlattener(df: DataFrame) {
  def flattenSchema: DataFrame = {
    df.select(flatten(Nil, df.schema): _*)
  }

  // Recursively walk the schema; structs recurse into their fields,
  // leaf fields become columns named by their dotted path
  protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
    case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
    case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
  }
}

display(veryNestedDF)
display(veryNestedDF.flattenSchema)

Cleanup: remove the Parquet file.

dbutils.fs.rm("/tmp/databricks-df-example.parquet", true)
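As a rough check of what the helper yields here (the exact names assume Spark's default tuple encoding _1, _2 at every nesting level):

// Column names after flattening veryNestedDF
veryNestedDF.flattenSchema.columns
// roughly: Array(_1, _2._1, _2._2._1, _2._2._2)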


#7
Lisrelchen · posted 2017-3-8 10:39:11
Union two DataFrames.

val unionDF = df1.unionAll(df2)
display(unionDF)
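One naming caveat worth adding: unionAll is the Spark 1.x spelling; Spark 2.0 renamed the operation to union and deprecated unionAll. A minimal sketch, assuming Spark 2.x:

// Spark 2.x spelling of the same row-level union
val unionDF2 = df1.union(df2)
display(unionDF2)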


#8
Lisrelchen · posted 2017-3-8 10:41:03
Write the unioned DataFrame to a Parquet file.

// Remove the file if it exists
dbutils.fs.rm("/tmp/databricks-df-example.parquet", true)
unionDF.write.parquet("/tmp/databricks-df-example.parquet")
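A variant not in the post: instead of removing the path first, the writer can be told to overwrite in place via a save mode, a standard DataFrameWriter option.

// Overwrite any existing data at the target path in one step
unionDF.write.mode("overwrite").parquet("/tmp/databricks-df-example.parquet")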


#9
Lisrelchen · posted 2017-3-8 10:41:37
Read a DataFrame from the Parquet file.

val parquetDF = sqlContext.read.parquet("/tmp/databricks-df-example.parquet")

Explode the employees column (Spark 1.x DataFrame.explode API):

val explodeDF = parquetDF.explode($"employees") {
  case Row(employee: Seq[Row]) => employee.map { employee =>
    val firstName = employee(0).asInstanceOf[String]
    val lastName = employee(1).asInstanceOf[String]
    val email = employee(2).asInstanceOf[String]
    val salary = employee(3).asInstanceOf[Int]
    Employee(firstName, lastName, email, salary)
  }
}.cache()
display(explodeDF)

explodeDF
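The Row-matching DataFrame.explode above is Spark 1.x API; Spark 2.0 deprecated it in favor of the explode function combined with a struct-expanding select. A minimal sketch of that replacement, assuming Spark 2.x and the same parquetDF:

import org.apache.spark.sql.functions.explode

// One output row per element of the employees array,
// with the struct's fields promoted to top-level columns
val explodeDF2 = parquetDF
  .select(explode($"employees").as("employee"))
  .select("employee.*")
display(explodeDF2)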


#10
Lisrelchen · posted 2017-3-8 10:43:43
Use filter() to return only the rows that match the given predicate.

val filterDF = explodeDF
  .filter($"firstName" === "xiangrui" || $"firstName" === "michael")
  .sort($"lastName".asc)
display(filterDF)
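The same predicate can also be written with Column.isin, which reads better as the list of accepted names grows; a sketch over the same explodeDF:

// Equivalent filter using isin instead of chained === comparisons
val filterDF2 = explodeDF
  .filter($"firstName".isin("xiangrui", "michael"))
  .sort($"lastName".asc)
display(filterDF2)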

