OP: ReneeBK

【Apache Spark】The Data Scientist's Guide to Apache Spark


#1 (OP)
ReneeBK posted on 2017-5-1 03:59:39

The Data Scientist's Guide to Apache Spark

Hidden content of this post

data-scientists-guide-apache-spark-master.zip (11.24 MB)


This repo contains notebook exercises for a workshop teaching practicing data scientists the best practices of using Spark, framed around a data scientist's standard workflow. By leveraging Spark's APIs for Python and R to present practical applications, the workshop lowers the barrier to entry and makes the technology much more accessible.

Materials

For the workshop (and after), we will use a Gitter chatroom to keep the conversation going: https://gitter.im/Jay-Oh-eN/data-scientists-guide-apache-spark.

Please also do not hesitate to reach out to me directly via email at jonathan@galvanize.com or on Twitter @clearspandex.

The presentation can be found on Slideshare here.

Prerequisites

Prior experience with Python and the scientific Python stack is beneficial, and knowledge of data science models and applications is also preferred. This will not be an introduction to Machine Learning or Data Science, but rather a course for people already proficient in these methods on a small scale, showing how to apply that knowledge in a distributed setting with Spark.

Setup

SparkR with a Notebook

install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = c('http://irkernel.github.io/', getOption('repos')))
IRkernel::installspec()

# Example: Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/[username]/spark")
# This line loads SparkR from the installed directory.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# If these two lines work, you are all set
library(SparkR)
sc <- sparkR.init(master="local")

Data

link = 'http://hopelessoptimism.com/static/data/airline-data'

The notebooks use a few datasets. For the DonorsChoose data, you can read the documentation here and download a zip (~0.5 GB) from: http://hopelessoptimism.com/static/data/donors_choose.zip
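As a quick orientation before opening the notebooks, here is a minimal sketch of pulling the DonorsChoose essays into Spark once the zip has been extracted; the local path and the assumption that a SparkContext named sc already exists are mine, not the workshop's:

import json

# Hypothetical local path; point it at wherever you unpacked donors_choose.zip
essays_path = 'file:///path/to/donors_choose/essay_parse.json'

# essay_parse.json holds one JSON record per line (as the K-means cells below
# assume), so textFile + json.loads yields an RDD of Python dicts
essay_rdd = sc.textFile(essays_path)
rows = essay_rdd.map(json.loads)
print(rows.first().keys())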

IPython Console Help

Q: How can I find out all the methods that are available on a DataFrame?

  • In the IPython console, type sales.[TAB]

  • Autocomplete will show you all the methods that are available.

  • To find more information about a specific method, say .cov, type help(sales.cov).

  • This will display the API documentation for that method (see the short sketch below).
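The same introspection works outside IPython with plain Python built-ins; the sketch below assumes an existing SparkContext sc and builds a small stand-in for the sales DataFrame (the example data is made up):

from pyspark.sql import SQLContext

# Build a tiny stand-in DataFrame (hypothetical data) from an existing SparkContext
sqlContext = SQLContext(sc)
sales = sqlContext.createDataFrame([(1, 10.0), (2, 20.0)], ['id', 'amount'])

# Roughly what sales.[TAB] shows: the public methods and attributes
print([name for name in dir(sales) if not name.startswith('_')])

# help(sales.cov) prints the API documentation for that method
help(sales.cov)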


Spark Documentation

Q: How can I find out more about Spark's Python API, MLlib, GraphX, Spark Streaming, deploying Spark to EC2?

  • Go to https://spark.apache.org/docs/latest

  • Navigate using tabs to the following areas in particular.

  • Programming Guide > Quick Start, Spark Programming Guide, Spark Streaming, DataFrames and SQL, MLlib, GraphX, SparkR.

  • Deploying > Overview, Submitting Applications, Spark Standalone, YARN, Amazon EC2.

  • More > Configuration, Monitoring, Tuning Guide.


References



#2
ReneeBK posted on 2017-5-1 04:05:52

K-means using Apache Spark

All Together Now

In [12]:
# Imports needed by the cells below (json, numpy, and pyspark were presumably
# imported in earlier cells of the notebook; they are added here for completeness)
import json

import nltk, string
import numpy as np
import pyspark as ps

def tokenize(text):
    tokens = []

    # keep words that are not stopwords, punctuation, or stray backticks
    for word in nltk.word_tokenize(text):
        if word \
            not in nltk.corpus.stopwords.words('english') \
            and word not in string.punctuation \
            and word != '``':
                tokens.append(word)

    return tokens
In [13]:
sc.stop()
In [14]:
# run on all cores
sc = ps.SparkContext()
sc.master
Out[14]:
u'local[*]'
In [23]:
from collections import Counter
In [25]:
# parse input essay file
essay_rdd = sc.textFile('file:///Users/jonathandinu/spark-ds-applications/data/donors_choose/essay_parse.json')
row_rdd = essay_rdd.map(lambda x: json.loads(x))

# tokenize documents
tokenized_rdd = row_rdd.filter(lambda row: row['essay'] and row['essay'] != '') \
                       .map(lambda row: row['essay']) \
                       .map(lambda text: text.replace('\\n', '').replace('\r', '')) \
                       .map(lambda text: tokenize(text))

# compute term and document frequencies
term_frequency = tokenized_rdd.map(lambda terms: Counter(terms))

doc_frequency = term_frequency.flatMap(lambda counts: counts.keys()) \
                              .map(lambda keys: (keys, 1)) \
                              .reduceByKey(lambda a, b: a + b)
In [28]:
num_features = 10000

# keep the num_features terms that appear in the most documents
top_terms = doc_frequency.top(num_features, key=lambda a: a[1])
In [30]:
import math

# inverse document frequency for each of the retained terms
total_docs = essay_rdd.count()
idf = map(lambda tup: (tup[0], math.log(float(total_docs) / (1 + tup[1]))), top_terms)
In [31]:
# ship the idf table to every executor once
broadcast_idf = sc.broadcast(idf)

def vectorize(tokens):
    word_counts = Counter(tokens)
    doc_length = sum(word_counts.values())

    # tf-idf weight for each term in the broadcast vocabulary
    vector = [word_counts.get(word[0], 0) * word[1] / float(doc_length) for word in broadcast_idf.value]
    return np.array(vector)
In [32]:
bag_of_words = tokenized_rdd.filter(lambda x: len(x) > 0).map(vectorize)
In [33]:
# cache the document vectors since they are reused below
bag_of_words.persist()
Out[33]:
PythonRDD[30] at RDD at PythonRDD.scala:43
In [34]:
# for each document, list the top_n highest-weighted terms
top_n = 10
summary = bag_of_words.map(lambda x: map(lambda idx: broadcast_idf.value[idx][0], np.argsort(x)[::-1][:top_n]))
In [35]:
summary.take(15)
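The cells above stop at the tf-idf document vectors; to close the loop on the post's title, here is a minimal sketch of feeding those vectors into MLlib's K-means in a follow-on cell (the choices of k and maxIterations are illustrative, not from the workshop):

from pyspark.mllib.clustering import KMeans

# cluster the tf-idf vectors built above; k and maxIterations are arbitrary here
model = KMeans.train(bag_of_words, k=10, maxIterations=20)

# assign each document to its nearest cluster center and peek at a few labels
cluster_ids = bag_of_words.map(lambda vec: model.predict(vec))
print(cluster_ids.take(15))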

#3
neuroexplorer posted on 2017-5-1 07:00:58
thanks.........

#4
soccy posted on 2017-5-1 09:06:19

#5
日月生辉 (employment verified) posted on 2017-5-1 14:02:45
A good book.

#6
smartlife (employment verified) posted on 2017-5-1 19:43:59

#7
meretale posted on 2017-5-2 13:23:27
