Thread starter: Lisrelchen

[Apache Spark] Apache Spark API By Example


Lisrelchen posted on 2017-3-8 10:25:38
Apache Spark API By Example
A Command Reference for Beginners
Matthias Langer, Zhen He
Department of Computer Science and Computer Engineering
La Trobe University
Bundoora, VIC 3086
Australia
m.langer@latrobe.edu.au, z.he@latrobe.edu.au
May 31, 2014

Hidden content in this post:

Apache Spark API By Example.rar (258.94 KB). This attachment includes:
  • Apache Spark API By Example.pdf




Lisrelchen posted on 2017-3-8 10:31:52

Creating DataFrames with Python

# Import the Row class from the pyspark.sql module
from pyspark.sql import *

# Create example data: departments and employees

# Create the departments
department1 = Row(id='123456', name='Computer Science')
department2 = Row(id='789012', name='Mechanical Engineering')
department3 = Row(id='345678', name='Theater and Drama')
department4 = Row(id='901234', name='Indoor Recreation')

# Create the employees
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)
employee2 = Employee('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)
employee3 = Employee('matei', None, 'no-reply@waterloo.edu', 140000)
employee4 = Employee(None, 'wendell', 'no-reply@berkeley.edu', 160000)

# Create the DepartmentWithEmployees instances from departments and employees
departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])

print(department1)
print(employee2)
print(departmentWithEmployees1.employees[0].email)

Create the first DataFrame from a list of the rows.

# sqlContext is pre-defined in PySpark shells and Databricks notebooks
departmentsWithEmployeesSeq1 = [departmentWithEmployees1, departmentWithEmployees2]
df1 = sqlContext.createDataFrame(departmentsWithEmployeesSeq1)

display(df1)

Create a second DataFrame from a list of rows.

departmentsWithEmployeesSeq2 = [departmentWithEmployees3, departmentWithEmployees4]
df2 = sqlContext.createDataFrame(departmentsWithEmployeesSeq2)

display(df2)


Lisrelchen posted on 2017-3-8 10:33:01

An example using the Pandas and Matplotlib integration

import pandas as pd
import matplotlib.pyplot as plt

plt.clf()
# nonNullDF is assumed to be the flattened employee DataFrame with nulls
# filled in; that step is not shown in this thread
pdDF = nonNullDF.toPandas()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()

Cleanup: remove the Parquet file.

dbutils.fs.rm("/tmp/databricks-df-example.parquet", True)


Lisrelchen posted on 2017-3-8 10:35:24

Create DataFrames using Scala

// Create the case classes for our domain
case class Department(id: String, name: String)
case class Employee(firstName: String, lastName: String, email: String, salary: Int)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

// Create the departments
val department1 = new Department("123456", "Computer Science")
val department2 = new Department("789012", "Mechanical Engineering")
val department3 = new Department("345678", "Theater and Drama")
val department4 = new Department("901234", "Indoor Recreation")

// Create the employees
val employee1 = new Employee("michael", "armbrust", "no-reply@berkeley.edu", 100000)
val employee2 = new Employee("xiangrui", "meng", "no-reply@stanford.edu", 120000)
val employee3 = new Employee("matei", null, "no-reply@waterloo.edu", 140000)
val employee4 = new Employee(null, "wendell", "no-reply@princeton.edu", 160000)

// Create the DepartmentWithEmployees instances from departments and employees
val departmentWithEmployees1 = new DepartmentWithEmployees(department1, Seq(employee1, employee2))
val departmentWithEmployees2 = new DepartmentWithEmployees(department2, Seq(employee3, employee4))
val departmentWithEmployees3 = new DepartmentWithEmployees(department3, Seq(employee1, employee4))
val departmentWithEmployees4 = new DepartmentWithEmployees(department4, Seq(employee2, employee3))

Create the first DataFrame from a list of the case classes.

// toDF() on a Seq needs the SQL implicits in scope (pre-imported in notebooks):
// import sqlContext.implicits._
val departmentsWithEmployeesSeq1 = Seq(departmentWithEmployees1, departmentWithEmployees2)
val df1 = departmentsWithEmployeesSeq1.toDF()
display(df1)

Create a second DataFrame from a list of the case classes.

val departmentsWithEmployeesSeq2 = Seq(departmentWithEmployees3, departmentWithEmployees4)
val df2 = departmentsWithEmployeesSeq2.toDF()
display(df2)
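
printSchema is a quick way to confirm the nested structure these case classes produce. A minimal sketch; the commented output is abbreviated (nullability flags omitted) and is what this schema should look like, not captured from the thread:

// Inspect the nested schema of the first DataFrame
df1.printSchema()
// root
//  |-- department: struct
//  |    |-- id: string
//  |    |-- name: string
//  |-- employees: array
//  |    |-- element: struct
//  |         |-- firstName: string
//  |         |-- lastName: string
//  |         |-- email: string
//  |         |-- salary: integer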


Lisrelchen posted on 2017-3-8 10:36:24

Flattening using Scala

If your data has several levels of nesting, here is a helper function that flattens your DataFrame to make it easier to work with.

val veryNestedDF = Seq(("1", (2, (3, 4)))).toDF()

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

implicit class DataFrameFlattener(df: DataFrame) {
  def flattenSchema: DataFrame = {
    df.select(flatten(Nil, df.schema): _*)
  }

  // Recurse into StructType fields, building a dotted column path for each leaf
  protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
    case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
    case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
  }
}

display(veryNestedDF)

display(veryNestedDF.flattenSchema)

Cleanup: remove the Parquet file.

dbutils.fs.rm("/tmp/databricks-df-example.parquet", true)
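
Because the nested tuples get default column names, the flattened DataFrame should expose one dotted column per leaf. A minimal check; the expected output is what the default tuple naming implies, not captured from the thread:

// List the flattened column names
veryNestedDF.flattenSchema.columns.foreach(println)
// _1
// _2._1
// _2._2._1
// _2._2._2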


Lisrelchen posted on 2017-3-8 10:39:11

Union two DataFrames.

val unionDF = df1.unionAll(df2)
display(unionDF)
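
Note that unionAll was deprecated in Spark 2.0 in favor of union, which performs the same positional, non-deduplicating concatenation:

// Spark 2.x spelling of the same operation
val unionDF = df1.union(df2)
display(unionDF)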


Lisrelchen posted on 2017-3-8 10:41:03

Write the unioned DataFrame to a Parquet file.

// Remove the file if it exists
dbutils.fs.rm("/tmp/databricks-df-example.parquet", true)
unionDF.write.parquet("/tmp/databricks-df-example.parquet")
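
Deleting the target first can also be avoided by asking the writer to overwrite; SaveMode is part of the standard Spark API:

// Overwrite any existing output instead of removing it manually
import org.apache.spark.sql.SaveMode

unionDF.write.mode(SaveMode.Overwrite).parquet("/tmp/databricks-df-example.parquet")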


Lisrelchen posted on 2017-3-8 10:41:37

Read a DataFrame from the Parquet file.

val parquetDF = sqlContext.read.parquet("/tmp/databricks-df-example.parquet")

Explode the employees column so that each employee gets its own row.

// DataFrame.explode maps each input row to zero or more output rows
val explodeDF = parquetDF.explode($"employees") {
  case Row(employees: Seq[Row]) => employees.map { employee =>
    val firstName = employee(0).asInstanceOf[String]
    val lastName = employee(1).asInstanceOf[String]
    val email = employee(2).asInstanceOf[String]
    val salary = employee(3).asInstanceOf[Int]
    Employee(firstName, lastName, email, salary)
  }
}.cache()
display(explodeDF)

// Evaluating the DataFrame by itself echoes its schema in a notebook cell
explodeDF
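
DataFrame.explode with a row-matching closure was later deprecated; the explode column function plus star expansion usually reads better and gives the same one-row-per-employee result. A minimal sketch under that assumption (explodeDF2 is just an illustrative name):

// Alternative: explode the array column, then flatten the resulting struct
import org.apache.spark.sql.functions.explode

val explodeDF2 = parquetDF
  .select(explode($"employees").as("employee"))
  .select("employee.*")
display(explodeDF2)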


Lisrelchen posted on 2017-3-8 10:43:43

Use filter() to return only the rows that match the given predicate.

val filterDF = explodeDF
  .filter($"firstName" === "xiangrui" || $"firstName" === "michael")
  .sort($"lastName".asc)
display(filterDF)
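
where() is an alias for filter(), and isin() condenses the repeated equality tests; both are standard Spark Column/DataFrame methods. A minimal equivalent sketch (filterDF2 is just an illustrative name):

// Same predicate written with where() and isin()
val filterDF2 = explodeDF
  .where($"firstName".isin("xiangrui", "michael"))
  .sort($"lastName".asc)
display(filterDF2)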
