OP: Lisrelchen

[Github]Apache Spark 2 for Beginners


#1 (OP)
Lisrelchen posted on 2017-8-22 08:25:48


Apache Spark 2 for Beginners

Hidden content of this post:

https://github.com/PacktPublishing/Apache-Spark-2-for-Beginners




This is the code repository for Apache Spark 2 for Beginners, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

## Instructions and Navigations

All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02.


## Detailed installation steps (software-wise)

The following steps prepare the system environment to run the book's code (a quick sanity-check sketch follows these steps).

### 1. Apache Spark

a. Download the Spark version mentioned in the table.
b. Build Spark from source, or use the binary download, following the detailed instructions given at http://spark.apache.org/docs/latest/building-spark.html
c. If building Spark from source, make sure that the R profile is also built; the instructions for that are given in the link in step b.

### 2. Apache Kafka

a. Download the Kafka version mentioned in the table.
b. The "quick start" section of the Kafka documentation gives the instructions to set up Kafka: http://kafka.apache.org/documentation.html#quickstart
c. Apart from the installation instructions, topic creation and the other Kafka setup prerequisites are covered in detail in the relevant chapter of the book.
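
As a quick sanity check of the Spark side of the setup, a minimal sketch like the following can be run (not from the book; it assumes the pyspark shell, which predefines the spark and sc objects):

# Sanity check, assuming the pyspark shell's built-in objects
# `spark` (SparkSession) and `sc` (SparkContext).
print(spark.version)   # should print 2.0.0 or above
print(sc.pythonVer)    # the Python version Spark is using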

The shell startup banner will look like the following:

Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin

Spark 2.0.0 or above needs to be installed, at least on a standalone machine, to run the code samples and to explore the subject further. For the Spark stream-processing examples, Kafka needs to be installed and configured as a message broker, with its command-line producer publishing messages and an application developed with Spark consuming them; a minimal sketch of this setup follows.
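
The following is a minimal sketch (not from the book) of that producer/consumer arrangement, using the DStream Kafka API shipped with Spark 2.0. The broker address localhost:9092, the topic name test, and the packaging details are assumptions for the example:

# Minimal Kafka word-count consumer sketch. Assumptions: a broker at
# localhost:9092, an existing topic named "test", and the
# spark-streaming-kafka-0-8 package on the classpath, e.g. via
#   spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 kafka_wordcount.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Each Kafka message arrives as a (key, value) pair; count the words in the values
messages = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})
counts = (messages.map(lambda kv: kv[1])
          .flatMap(lambda line: line.split(" "))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()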

## Related Products




#2
Lisrelchen posted on 2017-8-22 08:27:38
# coding: utf-8

# In[7]:

# Import all the required libraries
from pyspark.sql import Row
import matplotlib.pyplot as plt
import numpy as np
get_ipython().magic('matplotlib inline')
plt.rcdefaults()


# In[8]:

# TODO - The following location has to be changed to the appropriate data file location
dataDir = "/Users/RajT/Documents/Writing/SparkForBeginners/To-PACKTPUB/Contents/B05289-05-SparkDataAnalysisWithPython/Data/ml-100k/"


# In[9]:

# Create the DataFrame of the user data set
lines = sc.textFile(dataDir + "u.user")
splitLines = lines.map(lambda l: l.split("|"))
usersRDD = splitLines.map(lambda p: Row(id=p[0], age=int(p[1]), gender=p[2], occupation=p[3], zipcode=p[4]))
usersDF = spark.createDataFrame(usersRDD)
usersDF.createOrReplaceTempView("users")
usersDF.show()


# In[10]:

# Create a DataFrame of the user data set with only one column, age
ageDF = spark.sql("SELECT age FROM users")
ageList = ageDF.rdd.map(lambda p: p.age).collect()
ageDF.describe().show()


# In[11]:

# Age distribution of the users
plt.hist(ageList)
plt.title("Age distribution of the users\n")
plt.xlabel("Age")
plt.ylabel("Number of users")
plt.show(block=False)


# In[12]:

# Draw a probability density plot
from scipy.stats import gaussian_kde
density = gaussian_kde(ageList)
xAxisValues = np.linspace(0, 100, 1000)  # ages from 0 to 100, sampled at 1000 points
density.covariance_factor = lambda: .5   # narrow the KDE bandwidth
density._compute_covariance()
plt.title("Age density plot of the users\n")
plt.xlabel("Age")
plt.ylabel("Density")
plt.plot(xAxisValues, density(xAxisValues))
plt.show(block=False)


# In[13]:

# The following example demonstrates the creation of multiple diagrams in one figure
# There are two plots on one row
# The first one is the histogram of the distribution
# The second one is the boxplot containing the summary of the distribution
plt.subplot(121)
plt.hist(ageList)
plt.title("Age distribution of the users\n")
plt.xlabel("Age")
plt.ylabel("Number of users")
plt.subplot(122)
plt.title("Summary of distribution\n")
plt.xlabel("Age")
plt.boxplot(ageList, vert=False)
plt.show(block=False)


# In[14]:

occupationsTop10 = spark.sql("SELECT occupation, count(occupation) as usercount FROM users GROUP BY occupation ORDER BY usercount DESC LIMIT 10")
occupationsTop10.show()


# In[15]:

# Collect (occupation, count) pairs and unzip them into two parallel lists
occupationsTop10Tuple = occupationsTop10.rdd.map(lambda p: (p.occupation, p.usercount)).collect()
occupationsTop10List, countTop10List = zip(*occupationsTop10Tuple)
occupationsTop10Tuple


# In[16]:

# Top 10 occupations in terms of the number of users having that occupation who have rated movies
y_pos = np.arange(len(occupationsTop10List))
plt.barh(y_pos, countTop10List, align='center', alpha=0.4)
plt.yticks(y_pos, occupationsTop10List)
plt.xlabel('Number of users')
plt.title('Top 10 user types\n')
plt.gcf().subplots_adjust(left=0.15)
plt.show(block=False)


# In[17]:

occupationsGender = spark.sql("SELECT occupation, gender FROM users")
occupationsGender.show()


# In[18]:

# Cross-tabulation of occupation against gender
occCrossTab = occupationsGender.stat.crosstab("occupation", "gender")
occCrossTab.show()


# In[19]:

occCrossTab.describe('M', 'F').show()


# In[20]:

occCrossTab.stat.cov('M', 'F')


# In[21]:

occCrossTab.stat.corr('M', 'F')


# In[22]:

occupationsCrossTuple = occCrossTab.rdd.map(lambda p: (p.occupation_gender, p.M, p.F)).collect()
occList, mList, fList = zip(*occupationsCrossTuple)


# In[23]:

# Stacked bar chart of gender counts by occupation
N = len(occList)
ind = np.arange(N)    # the x locations for the groups
width = 0.75          # the width of the bars: can also be len(x) sequence
p1 = plt.bar(ind, mList, width, color='r')
p2 = plt.bar(ind, fList, width, color='y', bottom=mList)
plt.ylabel('Count')
plt.title('Gender distribution by occupation\n')
plt.xticks(ind + width/2., occList, rotation=90)
plt.legend((p1[0], p2[0]), ('Male', 'Female'))
plt.gcf().subplots_adjust(bottom=0.25)
plt.show(block=False)


# In[24]:

occupationsBottom10 = spark.sql("SELECT occupation, count(occupation) as usercount FROM users GROUP BY occupation ORDER BY usercount LIMIT 10")
occupationsBottom10.show()


# In[25]:

occupationsBottom10Tuple = occupationsBottom10.rdd.map(lambda p: (p.occupation, p.usercount)).collect()
occupationsBottom10List, countBottom10List = zip(*occupationsBottom10Tuple)


# In[26]:

# Bottom 10 occupations in terms of the number of users having that occupation who have rated movies
explode = (0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0.1)  # pull out two slices for emphasis
plt.pie(countBottom10List, explode=explode, labels=occupationsBottom10List, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Bottom 10 user types\n')
plt.show(block=False)


# In[27]:

zipTop10 = spark.sql("SELECT zipcode, count(zipcode) as usercount FROM users GROUP BY zipcode ORDER BY usercount DESC LIMIT 10")
zipTop10.show()


# In[28]:

zipTop10Tuple = zipTop10.rdd.map(lambda p: (p.zipcode, p.usercount)).collect()
zipTop10List, countTop10List = zip(*zipTop10Tuple)


# In[29]:

# Top 10 zipcodes in terms of the number of users living in that zipcode who have rated movies
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)  # explode a slice if required
plt.pie(countTop10List, explode=explode, labels=zipTop10List, autopct='%1.1f%%', shadow=True)
# Draw a circle at the center of the pie to make it look like a donut
centre_circle = plt.Circle((0, 0), 0.75, color='black', fc='white', linewidth=1.25)
plt.gcf().gca().add_artist(centre_circle)
# The aspect ratio is made equal so that the pie chart comes out as a perfect circle
plt.axis('equal')
plt.text(-0.25, 0, 'Top 10 zip codes')
plt.show(block=False)


# In[30]:

ages = spark.sql("SELECT occupation, age FROM users WHERE occupation = 'administrator' ORDER BY age")
adminAges = ages.rdd.map(lambda p: p.age).collect()
ages.describe().show()


# In[31]:

ages = spark.sql("SELECT occupation, age FROM users WHERE occupation = 'engineer' ORDER BY age")
engAges = ages.rdd.map(lambda p: p.age).collect()
ages.describe().show()


# In[32]:

ages = spark.sql("SELECT occupation, age FROM users WHERE occupation = 'programmer' ORDER BY age")
progAges = ages.rdd.map(lambda p: p.age).collect()
ages.describe().show()


# In[33]:

# Box plots of the ages by profession
boxPlotAges = [adminAges, engAges, progAges]
boxPlotLabels = ['administrator', 'engineer', 'programmer']
x = np.arange(len(boxPlotLabels))
plt.figure()
plt.boxplot(boxPlotAges)
plt.title('Age summary statistics\n')
plt.ylabel("Age")
plt.xticks(x + 1, boxPlotLabels, rotation=0)
plt.show(block=False)


# In[34]:

# Create the DataFrame of the movie data set; each genre is a 0/1 flag column
movieLines = sc.textFile(dataDir + "u.item")
splitMovieLines = movieLines.map(lambda l: l.split("|"))
moviesRDD = splitMovieLines.map(lambda p: Row(
    id=p[0], title=p[1], releaseDate=p[2], videoReleaseDate=p[3], url=p[4],
    unknown=int(p[5]), action=int(p[6]), adventure=int(p[7]), animation=int(p[8]),
    childrens=int(p[9]), comedy=int(p[10]), crime=int(p[11]), documentary=int(p[12]),
    drama=int(p[13]), fantasy=int(p[14]), filmNoir=int(p[15]), horror=int(p[16]),
    musical=int(p[17]), mystery=int(p[18]), romance=int(p[19]), sciFi=int(p[20]),
    thriller=int(p[21]), war=int(p[22]), western=int(p[23])))
moviesDF = spark.createDataFrame(moviesRDD)
moviesDF.createOrReplaceTempView("movies")


# In[35]:

# Sum each genre flag column to get the number of movies per genre
genreDF = spark.sql("SELECT sum(unknown) as unknown, sum(action) as action, sum(adventure) as adventure, sum(animation) as animation, sum(childrens) as childrens, sum(comedy) as comedy, sum(crime) as crime, sum(documentary) as documentary, sum(drama) as drama, sum(fantasy) as fantasy, sum(filmNoir) as filmNoir, sum(horror) as horror, sum(musical) as musical, sum(mystery) as mystery, sum(romance) as romance, sum(sciFi) as sciFi, sum(thriller) as thriller, sum(war) as war, sum(western) as western FROM movies")
genreList = genreDF.collect()
genreDict = genreList[0].asDict()
labelValues = list(genreDict.keys())
countList = list(genreDict.values())
genreDict


# In[36]:

# http://opentechschool.github.io/python-data-intro/core/charts.html
# Movie types and the counts
x = np.arange(len(labelValues))
plt.title('Movie types\n')
plt.ylabel("Count")
plt.bar(x, countList)
plt.xticks(x + 0.5, labelValues, rotation=90)
plt.gcf().subplots_adjust(bottom=0.20)
plt.show(block=False)


# In[37]:

yearDF = spark.sql("SELECT substring(releaseDate,8,4) as releaseYear, count(*) as movieCount FROM movies GROUP BY substring(releaseDate,8,4) ORDER BY movieCount DESC LIMIT 10")
yearDF.show()
yearMovieCountTuple = yearDF.rdd.map(lambda p: (int(p.releaseYear), p.movieCount)).collect()
yearList, movieCountList = zip(*yearMovieCountTuple)
countArea = yearDF.rdd.map(lambda p: np.pi * (p.movieCount/15)**2).collect()  # marker areas for the bubble chart


# In[38]:

# Top 10 years in which the most movies were released
plt.title('Top 10 movie release by year\n')
plt.xlabel("Year")
plt.ylabel("Number of movies released")
plt.ylim([0, max(movieCountList)])
colors = np.random.rand(10)
plt.scatter(yearList, movieCountList, c=colors)
plt.show(block=False)


# In[39]:

# The same plot, with marker areas proportional to the movie counts
plt.title('Top 10 movie release by year\n')
plt.xlabel("Year")
plt.ylabel("Number of movies released")
plt.ylim([0, max(movieCountList) + 100])
colors = np.random.rand(10)
plt.scatter(yearList, movieCountList, c=colors, s=countArea)
plt.show(block=False)


# In[40]:

yearActionDF = spark.sql("SELECT substring(releaseDate,8,4) as actionReleaseYear, count(*) as actionMovieCount FROM movies WHERE action = 1 GROUP BY substring(releaseDate,8,4) ORDER BY actionReleaseYear DESC LIMIT 10")
yearActionDF.show()
yearActionDF.createOrReplaceTempView("action")


# In[41]:

yearDramaDF = spark.sql("SELECT substring(releaseDate,8,4) as dramaReleaseYear, count(*) as dramaMovieCount FROM movies WHERE drama = 1 GROUP BY substring(releaseDate,8,4) ORDER BY dramaReleaseYear DESC LIMIT 10")
yearDramaDF.show()
yearDramaDF.createOrReplaceTempView("drama")


# In[42]:

# Join the two views on the release year
yearCombinedDF = spark.sql("SELECT a.actionReleaseYear as releaseYear, a.actionMovieCount, d.dramaMovieCount FROM action a, drama d WHERE a.actionReleaseYear = d.dramaReleaseYear ORDER BY a.actionReleaseYear DESC LIMIT 10")
yearCombinedDF.show()


# In[43]:

yearMovieCountTuple = yearCombinedDF.rdd.map(lambda p: (p.releaseYear, p.actionMovieCount, p.dramaMovieCount)).collect()
yearList, actionMovieCountList, dramaMovieCountList = zip(*yearMovieCountTuple)


# In[44]:

# Line chart comparing action and drama releases by year
plt.title("Movie release by year\n")
plt.xlabel("Year")
plt.ylabel("Movie count")
line_action, = plt.plot(yearList, actionMovieCountList)
line_drama, = plt.plot(yearList, dramaMovieCountList)
plt.legend([line_action, line_drama], ['Action Movies', 'Drama Movies'], loc='upper left')
plt.gca().get_xaxis().get_major_formatter().set_useOffset(False)
plt.show(block=False)


# In[ ]:

#3
西门高 posted on 2017-8-22 08:42:12
Thanks for sharing

#4
MouJack007 posted on 2017-8-22 09:08:38
Thanks to the OP for sharing!

#5
MouJack007 posted on 2017-8-22 09:09:22

#6
lianqu posted on 2017-8-22 09:17:43

#7
终结天狼 (employment verified) posted on 2017-8-22 10:28:32
[Github]Apache Spark 2 for Beginners

#8
钱学森64 posted on 2017-8-22 11:14:41
Thanks for sharing

#9
mayipai (employment verified) posted on 2017-9-8 11:12:04
thank you sir

