OP: Lisrelchen

[Github]Apache Spark 2 for Beginners


#1 (OP)
Lisrelchen posted on 2017-8-22 08:25:48


Apache Spark 2 for Beginners

Hidden content of this post:

https://github.com/PacktPublishing/Apache-Spark-2-for-Beginners




This is the code repository for Apache Spark 2 for Beginners, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

## Instructions and Navigations

All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02.


## Detailed installation steps (software-wise)

The following steps prepare the system environment to run the book's code (a quick sanity-check sketch follows these steps).

### 1. Apache Spark

a. Download the Spark version mentioned in the table.
b. Build Spark from source, or use the binary download, following the detailed instructions given at http://spark.apache.org/docs/latest/building-spark.html
c. If building Spark from source, make sure that the R profile is also built; the instructions for that are given in the link in step b.

### 2. Apache Kafka

a. Download the Kafka version mentioned in the table.
b. The "quick start" section of the Kafka documentation gives the instructions to set up Kafka: http://kafka.apache.org/documentation.html#quickstart
c. Apart from the installation instructions, topic creation and the other Kafka setup prerequisites are covered in detail in the relevant chapter of the book.
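
As a quick sanity check of the Spark side of the setup, a minimal sketch like the following can be run (not from the book; it assumes the pyspark shell, which predefines the spark and sc objects):

# Sanity check, assuming the pyspark shell's built-in objects
# `spark` (SparkSession) and `sc` (SparkContext).
print(spark.version)   # should print 2.0.0 or above
print(sc.pythonVer)    # the Python version Spark is using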

The shell startup banner will look like the following:

Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin

Spark 2.0.0 or above needs to be installed, at least on a standalone machine, to run the code samples and to explore the subject further. For the Spark stream-processing examples, Kafka needs to be installed and configured as a message broker, with its command-line producer publishing messages and an application developed with Spark consuming them; a minimal sketch of this setup follows.
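
The following is a minimal sketch (not from the book) of that producer/consumer arrangement, using the DStream Kafka API shipped with Spark 2.0. The broker address localhost:9092, the topic name test, and the packaging details are assumptions for the example:

# Minimal Kafka word-count consumer sketch. Assumptions: a broker at
# localhost:9092, an existing topic named "test", and the
# spark-streaming-kafka-0-8 package on the classpath, e.g. via
#   spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 kafka_wordcount.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Each Kafka message arrives as a (key, value) pair; count the words in the values
messages = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})
counts = (messages.map(lambda kv: kv[1])
          .flatMap(lambda line: line.split(" "))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()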

## Related Products




#2
Lisrelchen posted on 2017-8-22 08:27:38
# coding: utf-8

# In[7]:

# Import all the required libraries
from pyspark.sql import Row
import matplotlib.pyplot as plt
import numpy as np
get_ipython().magic('matplotlib inline')
plt.rcdefaults()


# In[8]:

# TODO - The following location has to be changed to the appropriate data file location
dataDir = "/Users/RajT/Documents/Writing/SparkForBeginners/To-PACKTPUB/Contents/B05289-05-SparkDataAnalysisWithPython/Data/ml-100k/"


# In[9]:

# Create the DataFrame of the user data set
lines = sc.textFile(dataDir + "u.user")
splitLines = lines.map(lambda l: l.split("|"))
usersRDD = splitLines.map(lambda p: Row(id=p[0], age=int(p[1]), gender=p[2], occupation=p[3], zipcode=p[4]))
usersDF = spark.createDataFrame(usersRDD)
usersDF.createOrReplaceTempView("users")
usersDF.show()


# In[10]:

# Create a DataFrame of the user data set with only one column, age
ageDF = spark.sql("SELECT age FROM users")
ageList = ageDF.rdd.map(lambda p: p.age).collect()
ageDF.describe().show()


# In[11]:

# Age distribution of the users
plt.hist(ageList)
plt.title("Age distribution of the users\n")
plt.xlabel("Age")
plt.ylabel("Number of users")
plt.show(block=False)


# In[12]:

# Draw a probability density plot
from scipy.stats import gaussian_kde
density = gaussian_kde(ageList)
xAxisValues = np.linspace(0, 100, 1000)  # ages from 0 to 100, sampled at 1000 points
density.covariance_factor = lambda: .5   # narrow the KDE bandwidth
density._compute_covariance()
plt.title("Age density plot of the users\n")
plt.xlabel("Age")
plt.ylabel("Density")
plt.plot(xAxisValues, density(xAxisValues))
plt.show(block=False)


# In[13]:

# The following example demonstrates the creation of multiple diagrams in one figure
# There are two plots on one row
# The first one is the histogram of the distribution
# The second one is the boxplot containing the summary of the distribution
plt.subplot(121)
plt.hist(ageList)
plt.title("Age distribution of the users\n")
plt.xlabel("Age")
plt.ylabel("Number of users")
plt.subplot(122)
plt.title("Summary of distribution\n")
plt.xlabel("Age")
plt.boxplot(ageList, vert=False)
plt.show(block=False)


# In[14]:

occupationsTop10 = spark.sql("SELECT occupation, count(occupation) as usercount FROM users GROUP BY occupation ORDER BY usercount DESC LIMIT 10")
occupationsTop10.show()


# In[15]:

# Collect (occupation, count) pairs and unzip them into two parallel lists
occupationsTop10Tuple = occupationsTop10.rdd.map(lambda p: (p.occupation, p.usercount)).collect()
occupationsTop10List, countTop10List = zip(*occupationsTop10Tuple)
occupationsTop10Tuple


# In[16]:

# Top 10 occupations in terms of the number of users having that occupation who have rated movies
y_pos = np.arange(len(occupationsTop10List))
plt.barh(y_pos, countTop10List, align='center', alpha=0.4)
plt.yticks(y_pos, occupationsTop10List)
plt.xlabel('Number of users')
plt.title('Top 10 user types\n')
plt.gcf().subplots_adjust(left=0.15)
plt.show(block=False)


# In[17]:

occupationsGender = spark.sql("SELECT occupation, gender FROM users")
occupationsGender.show()


# In[18]:

# Cross-tabulation of occupation against gender
occCrossTab = occupationsGender.stat.crosstab("occupation", "gender")
occCrossTab.show()


# In[19]:

occCrossTab.describe('M', 'F').show()


# In[20]:

occCrossTab.stat.cov('M', 'F')


# In[21]:

occCrossTab.stat.corr('M', 'F')


# In[22]:

occupationsCrossTuple = occCrossTab.rdd.map(lambda p: (p.occupation_gender, p.M, p.F)).collect()
occList, mList, fList = zip(*occupationsCrossTuple)


# In[23]:

# Stacked bar chart of gender counts by occupation
N = len(occList)
ind = np.arange(N)    # the x locations for the groups
width = 0.75          # the width of the bars: can also be len(x) sequence
p1 = plt.bar(ind, mList, width, color='r')
p2 = plt.bar(ind, fList, width, color='y', bottom=mList)
plt.ylabel('Count')
plt.title('Gender distribution by occupation\n')
plt.xticks(ind + width/2., occList, rotation=90)
plt.legend((p1[0], p2[0]), ('Male', 'Female'))
plt.gcf().subplots_adjust(bottom=0.25)
plt.show(block=False)


# In[24]:

occupationsBottom10 = spark.sql("SELECT occupation, count(occupation) as usercount FROM users GROUP BY occupation ORDER BY usercount LIMIT 10")
occupationsBottom10.show()


# In[25]:

occupationsBottom10Tuple = occupationsBottom10.rdd.map(lambda p: (p.occupation, p.usercount)).collect()
occupationsBottom10List, countBottom10List = zip(*occupationsBottom10Tuple)


# In[26]:

# Bottom 10 occupations in terms of the number of users having that occupation who have rated movies
explode = (0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0.1)  # pull out two slices for emphasis
plt.pie(countBottom10List, explode=explode, labels=occupationsBottom10List, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Bottom 10 user types\n')
plt.show(block=False)


# In[27]:

zipTop10 = spark.sql("SELECT zipcode, count(zipcode) as usercount FROM users GROUP BY zipcode ORDER BY usercount DESC LIMIT 10")
zipTop10.show()


# In[28]:

zipTop10Tuple = zipTop10.rdd.map(lambda p: (p.zipcode, p.usercount)).collect()
zipTop10List, countTop10List = zip(*zipTop10Tuple)


# In[29]:

# Top 10 zipcodes in terms of the number of users living in that zipcode who have rated movies
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)  # explode a slice if required
plt.pie(countTop10List, explode=explode, labels=zipTop10List, autopct='%1.1f%%', shadow=True)
# Draw a circle at the center of the pie to make it look like a donut
centre_circle = plt.Circle((0, 0), 0.75, color='black', fc='white', linewidth=1.25)
plt.gcf().gca().add_artist(centre_circle)
# The aspect ratio is made equal so that the pie chart comes out as a perfect circle
plt.axis('equal')
plt.text(-0.25, 0, 'Top 10 zip codes')
plt.show(block=False)


# In[30]:

ages = spark.sql("SELECT occupation, age FROM users WHERE occupation = 'administrator' ORDER BY age")
adminAges = ages.rdd.map(lambda p: p.age).collect()
ages.describe().show()


# In[31]:

ages = spark.sql("SELECT occupation, age FROM users WHERE occupation = 'engineer' ORDER BY age")
engAges = ages.rdd.map(lambda p: p.age).collect()
ages.describe().show()


# In[32]:

ages = spark.sql("SELECT occupation, age FROM users WHERE occupation = 'programmer' ORDER BY age")
progAges = ages.rdd.map(lambda p: p.age).collect()
ages.describe().show()


# In[33]:

# Box plots of the ages by profession
boxPlotAges = [adminAges, engAges, progAges]
boxPlotLabels = ['administrator', 'engineer', 'programmer']
x = np.arange(len(boxPlotLabels))
plt.figure()
plt.boxplot(boxPlotAges)
plt.title('Age summary statistics\n')
plt.ylabel("Age")
plt.xticks(x + 1, boxPlotLabels, rotation=0)
plt.show(block=False)


# In[34]:

# Create the DataFrame of the movie data set; each genre is a 0/1 flag column
movieLines = sc.textFile(dataDir + "u.item")
splitMovieLines = movieLines.map(lambda l: l.split("|"))
moviesRDD = splitMovieLines.map(lambda p: Row(
    id=p[0], title=p[1], releaseDate=p[2], videoReleaseDate=p[3], url=p[4],
    unknown=int(p[5]), action=int(p[6]), adventure=int(p[7]), animation=int(p[8]),
    childrens=int(p[9]), comedy=int(p[10]), crime=int(p[11]), documentary=int(p[12]),
    drama=int(p[13]), fantasy=int(p[14]), filmNoir=int(p[15]), horror=int(p[16]),
    musical=int(p[17]), mystery=int(p[18]), romance=int(p[19]), sciFi=int(p[20]),
    thriller=int(p[21]), war=int(p[22]), western=int(p[23])))
moviesDF = spark.createDataFrame(moviesRDD)
moviesDF.createOrReplaceTempView("movies")


# In[35]:

# Sum each genre flag column to get the number of movies per genre
genreDF = spark.sql("SELECT sum(unknown) as unknown, sum(action) as action, sum(adventure) as adventure, sum(animation) as animation, sum(childrens) as childrens, sum(comedy) as comedy, sum(crime) as crime, sum(documentary) as documentary, sum(drama) as drama, sum(fantasy) as fantasy, sum(filmNoir) as filmNoir, sum(horror) as horror, sum(musical) as musical, sum(mystery) as mystery, sum(romance) as romance, sum(sciFi) as sciFi, sum(thriller) as thriller, sum(war) as war, sum(western) as western FROM movies")
genreList = genreDF.collect()
genreDict = genreList[0].asDict()
labelValues = list(genreDict.keys())
countList = list(genreDict.values())
genreDict


# In[36]:

# http://opentechschool.github.io/python-data-intro/core/charts.html
# Movie types and the counts
x = np.arange(len(labelValues))
plt.title('Movie types\n')
plt.ylabel("Count")
plt.bar(x, countList)
plt.xticks(x + 0.5, labelValues, rotation=90)
plt.gcf().subplots_adjust(bottom=0.20)
plt.show(block=False)


# In[37]:

yearDF = spark.sql("SELECT substring(releaseDate,8,4) as releaseYear, count(*) as movieCount FROM movies GROUP BY substring(releaseDate,8,4) ORDER BY movieCount DESC LIMIT 10")
yearDF.show()
yearMovieCountTuple = yearDF.rdd.map(lambda p: (int(p.releaseYear), p.movieCount)).collect()
yearList, movieCountList = zip(*yearMovieCountTuple)
countArea = yearDF.rdd.map(lambda p: np.pi * (p.movieCount/15)**2).collect()  # marker areas for the bubble chart


# In[38]:

# Top 10 years in which the most movies were released
plt.title('Top 10 movie release by year\n')
plt.xlabel("Year")
plt.ylabel("Number of movies released")
plt.ylim([0, max(movieCountList)])
colors = np.random.rand(10)
plt.scatter(yearList, movieCountList, c=colors)
plt.show(block=False)


# In[39]:

# The same plot, with marker areas proportional to the movie counts
plt.title('Top 10 movie release by year\n')
plt.xlabel("Year")
plt.ylabel("Number of movies released")
plt.ylim([0, max(movieCountList) + 100])
colors = np.random.rand(10)
plt.scatter(yearList, movieCountList, c=colors, s=countArea)
plt.show(block=False)


# In[40]:

yearActionDF = spark.sql("SELECT substring(releaseDate,8,4) as actionReleaseYear, count(*) as actionMovieCount FROM movies WHERE action = 1 GROUP BY substring(releaseDate,8,4) ORDER BY actionReleaseYear DESC LIMIT 10")
yearActionDF.show()
yearActionDF.createOrReplaceTempView("action")


# In[41]:

yearDramaDF = spark.sql("SELECT substring(releaseDate,8,4) as dramaReleaseYear, count(*) as dramaMovieCount FROM movies WHERE drama = 1 GROUP BY substring(releaseDate,8,4) ORDER BY dramaReleaseYear DESC LIMIT 10")
yearDramaDF.show()
yearDramaDF.createOrReplaceTempView("drama")


# In[42]:

# Join the two views on the release year
yearCombinedDF = spark.sql("SELECT a.actionReleaseYear as releaseYear, a.actionMovieCount, d.dramaMovieCount FROM action a, drama d WHERE a.actionReleaseYear = d.dramaReleaseYear ORDER BY a.actionReleaseYear DESC LIMIT 10")
yearCombinedDF.show()


# In[43]:

yearMovieCountTuple = yearCombinedDF.rdd.map(lambda p: (p.releaseYear, p.actionMovieCount, p.dramaMovieCount)).collect()
yearList, actionMovieCountList, dramaMovieCountList = zip(*yearMovieCountTuple)


# In[44]:

# Line chart comparing action and drama releases by year
plt.title("Movie release by year\n")
plt.xlabel("Year")
plt.ylabel("Movie count")
line_action, = plt.plot(yearList, actionMovieCountList)
line_drama, = plt.plot(yearList, dramaMovieCountList)
plt.legend([line_action, line_drama], ['Action Movies', 'Drama Movies'], loc='upper left')
plt.gca().get_xaxis().get_major_formatter().set_useOffset(False)
plt.show(block=False)


# In[ ]:

#3
西门高 posted on 2017-8-22 08:42:12
Thanks for sharing

#4
MouJack007 posted on 2017-8-22 09:08:38
Thanks to the OP for sharing!

#5
MouJack007 posted on 2017-8-22 09:09:22

#6
lianqu posted on 2017-8-22 09:17:43

#7
终结天狼 (employment verified) posted on 2017-8-22 10:28:32
[Github]Apache Spark 2 for Beginners

#8
钱学森64 posted on 2017-8-22 11:14:41
Thanks for sharing

#9
mayipai (employment verified) posted on 2017-9-8 11:12:04
thank you sir

