OP: Lisrelchen

[GitHub] Frank Kane's Taming Big Data with Apache Spark and Python


OP
Lisrelchen, posted 2017-8-13 05:15:36


Frank Kane's Taming Big Data with Apache Spark and Python


Hidden content in this post

https://github.com/PacktPublishing/Frank-Kanes-Taming-Big-Data-with-Apache-Spark-and-Python


This is the code repository for Frank Kane's Taming Big Data with Apache Spark and Python, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

About the Book

Frank Kane’s Taming Big Data with Apache Spark and Python is your companion to learning Apache Spark in a hands-on manner. Frank will start you off by teaching you how to set up Spark on a single system or on a cluster, and you’ll soon move on to analyzing large data sets using Spark RDDs, and to developing and running effective Spark jobs quickly using Python.

Apache Spark has emerged as the next big thing in the Big Data domain, rising quickly from an emerging technology to an established superstar in just a matter of years. Spark allows you to quickly extract actionable insights from large amounts of data, on a real-time basis, making it an essential tool in many modern businesses.

Frank has packed this book with over 15 interactive, fun-filled examples relevant to the real world, and he will empower you to understand the Spark ecosystem and implement production-grade real-time Spark projects with ease.

Instructions and Navigation

All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02.

The code will look like the following:

from pyspark import SparkConf, SparkContext
import collections

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)

lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
ratings = lines.map(lambda x: x.split()[2])
result = ratings.countByValue()

sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))
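For context, each line of the MovieLens 100k u.data file is tab-separated, with the fields user ID, movie ID, rating, and timestamp, which is why x.split()[2] picks out the rating. A minimal sketch of that parsing on a single line (the sample values below are illustrative, not taken from the dataset):

# Each u.data line holds: user ID, movie ID, rating, timestamp (tab-separated).
# The sample values here are illustrative, not from the actual dataset.
line = "196\t242\t3\t881250949"
rating = line.split()[2]  # split() breaks on any whitespace, so index 2 is the rating
print(rating)  # prints: 3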

For this book you’ll need a Python development environment (Python 3.5 or newer), the Canopy installer, a Java Development Kit, and of course Spark itself (Spark 2.0 or newer).

We'll show you how to install this software in the first chapter of the book.

This book is based on the Windows operating system, so the installation instructions given are for Windows. If you are on Mac or Linux, you can follow this URL: http://media.sundog-soft.com/spark-python-install.pdf, which contains written instructions for getting everything set up on Mac OS and on Linux.

Related Products

Suggestions and Feedback

Click here if you have any feedback or suggestions.



Keywords: Apache Spark, Big Data, Python, GitHub


#2
Lisrelchen, posted 2017-8-13 05:16:10
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create a SparkSession (Note, the config section is only for Windows!)
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("SparkSQL").getOrCreate()

def mapper(line):
    fields = line.split(',')
    # fields[1] is already a str in Python 3; no encode() round-trip is needed.
    return Row(ID=int(fields[0]), name=fields[1], age=int(fields[2]), numFriends=int(fields[3]))

lines = spark.sparkContext.textFile("fakefriends.csv")
people = lines.map(mapper)

# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people).cache()
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql("SELECT * FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are DataFrames and support all the normal DataFrame operations.
for teen in teenagers.collect():
    print(teen)

# We can also use functions instead of SQL queries:
schemaPeople.groupBy("age").count().orderBy("age").show()

spark.stop()
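One note on inputs: the mapper above expects fakefriends.csv to be a headerless CSV whose columns are ID, name, age, and number of friends. A minimal sketch of that row layout, using a made-up sample row rather than data from the repository:

# Hypothetical sample row matching the mapper's expected layout:
# ID, name, age, numFriends (no header row).
sample = "0,Will,33,385"
fields = sample.split(',')
print(int(fields[0]), fields[1], int(fields[2]), int(fields[3]))  # prints: 0 Will 33 385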

#3
Lisrelchen, posted 2017-8-13 05:17:14
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("PopularMovies")
sc = SparkContext(conf = conf)

lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
movies = lines.map(lambda x: (int(x.split()[1]), 1))
movieCounts = movies.reduceByKey(lambda x, y: x + y)

# Tuple-unpacking lambdas like lambda (x, y): (y, x) are Python 2 only;
# index the pair explicitly for Python 3.
flipped = movieCounts.map(lambda xy: (xy[1], xy[0]))
sortedMovies = flipped.sortByKey()

results = sortedMovies.collect()

for result in results:
    print(result)
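As a side note, sortByKey() above sorts ascending, so the most-rated movies come last. RDDs also provide sortBy(), which takes a key function and makes the flip step unnecessary; here is my own sketch of a descending variant under the same file-layout assumptions, not code from the repository:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("PopularMoviesDesc")
sc = SparkContext(conf=conf)

lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
movies = lines.map(lambda x: (int(x.split()[1]), 1))
movieCounts = movies.reduceByKey(lambda x, y: x + y)

# sortBy takes a key function, so the (movieID, count) pairs can be
# sorted by count directly, descending, with no tuple flip.
for result in movieCounts.sortBy(lambda pair: pair[1], ascending=False).take(10):
    print(result)

sc.stop()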

#4
hjtoh, posted 2017-8-13 06:09:59 (from mobile)
Quoting Lisrelchen (2017-8-13 05:15):
Frank Kane's Taming Big Data with Apache Spark and Python
**** This content is hidden by the author ****
This is the c ...
Thanks for sharing.

#5
军旗飞扬 (employment verified), posted 2017-8-13 06:35:03
Thanks for sharing, OP!

#6
MouJack007, posted 2017-8-13 07:32:47
Thanks for sharing, OP!

#7
MouJack007, posted 2017-8-13 07:33:40

#8
albertwishedu, posted 2017-8-13 08:52:09

#9
franky_sas, posted 2017-8-13 13:00:52

#10
Rusty.Liao, posted 2017-8-14 23:16:02
Why does everything require a reply before it can be downloaded?

