OP: ReneeBK

【Apache Spark】The Data Scientist's Guide to Apache Spark


#1 (OP)
ReneeBK posted on 2017-5-1 03:59:39

The Data Scientist's Guide to Apache Spark

Hidden content of this post

data-scientists-guide-apache-spark-master.zip (11.24 MB)


This repo contains notebook exercises for a workshop teaching practicing data scientists the best practices of using Spark, framed around a data scientist's standard workflow. By leveraging Spark's APIs for Python and R to present practical applications, the workshop lowers the barrier to entry and makes the technology much more accessible.

Materials

For the workshop (and after), we will use a Gitter chatroom to keep the conversation going: https://gitter.im/Jay-Oh-eN/data-scientists-guide-apache-spark.

Please also do not hesitate to reach out to me directly via email at jonathan@galvanize.com or on Twitter @clearspandex.

The presentation can be found on Slideshare here.

Prerequisites

Prior experience with Python and the scientific Python stack is beneficial, and knowledge of data science models and applications is also preferred. This will not be an introduction to Machine Learning or Data Science, but rather a course for people already proficient in these methods on a small scale, showing how to apply that knowledge in a distributed setting with Spark.

Setup

SparkR with a Notebook

install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = c('http://irkernel.github.io/', getOption('repos')))
IRkernel::installspec()

# Example: Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/[username]/spark")
# This line loads SparkR from the installed directory.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# If these two lines work, you are all set
library(SparkR)
sc <- sparkR.init(master="local")

Data

link = 'http://hopelessoptimism.com/static/data/airline-data'

The notebooks use a few datasets. For the DonorsChoose data, you can read the documentation here and download a zip (~0.5 GB) from: http://hopelessoptimism.com/static/data/donors_choose.zip
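As a quick orientation before opening the notebooks, here is a minimal sketch of pulling the DonorsChoose essays into Spark once the zip has been extracted; the local path and the assumption that a SparkContext named sc already exists are mine, not the workshop's:

import json

# Hypothetical local path; point it at wherever you unpacked donors_choose.zip
essays_path = 'file:///path/to/donors_choose/essay_parse.json'

# essay_parse.json holds one JSON record per line (as the K-means cells below
# assume), so textFile + json.loads yields an RDD of Python dicts
essay_rdd = sc.textFile(essays_path)
rows = essay_rdd.map(json.loads)
print(rows.first().keys())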

IPython Console Help

Q: How can I find out all the methods that are available on a DataFrame?

  • In the IPython console, type sales.[TAB]

  • Autocomplete will show you all the methods that are available.

  • To find more information about a specific method, say .cov, type help(sales.cov).

  • This will display the API documentation for that method (see the short sketch below).
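The same introspection works outside IPython with plain Python built-ins; the sketch below assumes an existing SparkContext sc and builds a small stand-in for the sales DataFrame (the example data is made up):

from pyspark.sql import SQLContext

# Build a tiny stand-in DataFrame (hypothetical data) from an existing SparkContext
sqlContext = SQLContext(sc)
sales = sqlContext.createDataFrame([(1, 10.0), (2, 20.0)], ['id', 'amount'])

# Roughly what sales.[TAB] shows: the public methods and attributes
print([name for name in dir(sales) if not name.startswith('_')])

# help(sales.cov) prints the API documentation for that method
help(sales.cov)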


Spark Documentation

Q: How can I find out more about Spark's Python API, MLlib, GraphX, Spark Streaming, deploying Spark to EC2?

  • Go to https://spark.apache.org/docs/latest

  • Navigate using tabs to the following areas in particular.

  • Programming Guide > Quick Start, Spark Programming Guide, Spark Streaming, DataFrames and SQL, MLlib, GraphX, SparkR.

  • Deploying > Overview, Submitting Applications, Spark Standalone, YARN, Amazon EC2.

  • More > Configuration, Monitoring, Tuning Guide.


References



#2
ReneeBK posted on 2017-5-1 04:05:52

K-means using Apache Spark

All Together Now

In [12]:
# Imports needed by the cells below (json, numpy, and pyspark were presumably
# imported in earlier cells of the notebook; they are added here for completeness)
import json

import nltk, string
import numpy as np
import pyspark as ps

def tokenize(text):
    tokens = []

    # keep words that are not stopwords, punctuation, or stray backticks
    for word in nltk.word_tokenize(text):
        if word \
            not in nltk.corpus.stopwords.words('english') \
            and word not in string.punctuation \
            and word != '``':
                tokens.append(word)

    return tokens
In [13]:
sc.stop()
In [14]:
# run on all cores
sc = ps.SparkContext()
sc.master
Out[14]:
u'local[*]'
In [23]:
from collections import Counter
In [25]:
# parse input essay file
essay_rdd = sc.textFile('file:///Users/jonathandinu/spark-ds-applications/data/donors_choose/essay_parse.json')
row_rdd = essay_rdd.map(lambda x: json.loads(x))

# tokenize documents
tokenized_rdd = row_rdd.filter(lambda row: row['essay'] and row['essay'] != '') \
                       .map(lambda row: row['essay']) \
                       .map(lambda text: text.replace('\\n', '').replace('\r', '')) \
                       .map(lambda text: tokenize(text))

# compute term and document frequencies
term_frequency = tokenized_rdd.map(lambda terms: Counter(terms))

doc_frequency = term_frequency.flatMap(lambda counts: counts.keys()) \
                              .map(lambda keys: (keys, 1)) \
                              .reduceByKey(lambda a, b: a + b)
In [28]:
num_features = 10000

# keep the num_features terms that appear in the most documents
top_terms = doc_frequency.top(num_features, key=lambda a: a[1])
In [30]:
import math

# inverse document frequency for each of the retained terms
total_docs = essay_rdd.count()
idf = map(lambda tup: (tup[0], math.log(float(total_docs) / (1 + tup[1]))), top_terms)
In [31]:
# ship the idf table to every executor once
broadcast_idf = sc.broadcast(idf)

def vectorize(tokens):
    word_counts = Counter(tokens)
    doc_length = sum(word_counts.values())

    # tf-idf weight for each term in the broadcast vocabulary
    vector = [word_counts.get(word[0], 0) * word[1] / float(doc_length) for word in broadcast_idf.value]
    return np.array(vector)
In [32]:
bag_of_words = tokenized_rdd.filter(lambda x: len(x) > 0).map(vectorize)
In [33]:
# cache the document vectors since they are reused below
bag_of_words.persist()
Out[33]:
PythonRDD[30] at RDD at PythonRDD.scala:43
In [34]:
# for each document, list the top_n highest-weighted terms
top_n = 10
summary = bag_of_words.map(lambda x: map(lambda idx: broadcast_idf.value[idx][0], np.argsort(x)[::-1][:top_n]))
In [35]:
summary.take(15)
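The cells above stop at the tf-idf document vectors; to close the loop on the post's title, here is a minimal sketch of feeding those vectors into MLlib's K-means in a follow-on cell (the choices of k and maxIterations are illustrative, not from the workshop):

from pyspark.mllib.clustering import KMeans

# cluster the tf-idf vectors built above; k and maxIterations are arbitrary here
model = KMeans.train(bag_of_words, k=10, maxIterations=20)

# assign each document to its nearest cluster center and peek at a few labels
cluster_ids = bag_of_words.map(lambda vec: model.predict(vec))
print(cluster_ids.take(15))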

#3
neuroexplorer posted on 2017-5-1 07:00:58
thanks.........

#4
soccy posted on 2017-5-1 09:06:19

#5
日月生辉 (employment verified) posted on 2017-5-1 14:02:45
A good book.

#6
smartlife (employment verified) posted on 2017-5-1 19:43:59

#7
meretale posted on 2017-5-2 13:23:27
