OP: hooli

Learning Spark by Holden Karau (O'Reilly)


#1 (OP)
hooli (employment verified) · posted 2015-3-3 15:41:56
Table of Contents
Preface................................................................................................................................................ 5
Audience................................................................................................................................................5
How This Book is Organized.............................................................................................................. 6
Supporting Books.................................................................................................................................6
Code Examples..................................................................................................................................... 7
Early Release Status and Feedback................................................................................................... 7
Chapter 1. Introduction to Data Analysis with Spark......................................................8
What is Apache Spark?....................................................................................................................... 8
A Unified Stack.....................................................................................................................................8
Who Uses Spark, and For What?......................................................................................................11
A Brief History of Spark.................................................................................................................... 13
Spark Versions and Releases............................................................................................................ 13
Spark and Hadoop............................................................................................................................. 14
Chapter 2. Downloading and Getting Started...................................................................15
Downloading Spark............................................................................................................................15
Introduction to Spark’s Python and Scala Shells.......................................................................... 16
Introduction to Core Spark Concepts.............................................................................................20
Standalone Applications...................................................................................................................23
Conclusion.......................................................................................................................................... 25
Chapter 3. Programming with RDDs................................................................................... 26
RDD Basics......................................................................................................................................... 26
Creating RDDs................................................................................................................................... 28
RDD Operations................................................................................................................................ 28
Passing Functions to Spark.............................................................................................................. 32
Common Transformations and Actions......................................................................................... 36
Persistence (Caching)........................................................................................................................46
Conclusion.......................................................................................................................................... 48
Chapter 4. Working with Key-Value Pairs.........................................................................49
Motivation.......................................................................................................................................... 49
Creating Pair RDDs........................................................................................................................... 49
Transformations on Pair RDDs....................................................................................................... 50
Actions Available on Pair RDDs......................................................................................................60
Data Partitioning................................................................................................................................61
Conclusion.......................................................................................................................................... 70
Chapter 5. Loading and Saving Your Data.......................................................................... 71
Motivation........................................................................................................................................... 71
Choosing a Format............................................................................................................................. 71
Formats............................................................................................................................................... 72
File Systems........................................................................................................................................88
Compression.......................................................................................................................................89
Databases............................................................................................................................................ 91
Conclusion.......................................................................................................................................... 93
About the Authors.....................................................................................................................
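
For a quick taste of what Chapters 3 and 4 in the table of contents above cover, here is a minimal sketch (not code from the book, just an illustration) of creating an RDD, applying a transformation and an action, and using a pair RDD. It assumes a local Spark 1.x setup like the examples in the replies below; the object name TocSketch is made up for this post.

/**
 * Minimal sketch (not from the book): core RDD and pair-RDD operations
 * from Chapters 3 and 4, assuming a local Spark 1.x installation.
 */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object TocSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("TocSketch")
    val sc = new SparkContext(conf)

    // Chapter 3: create an RDD, transform it lazily, then run an action.
    val nums = sc.parallelize(List(1, 2, 3, 4))
    val squares = nums.map(x => x * x)          // transformation (lazy)
    println(squares.collect().mkString(", "))   // action (triggers the job)

    // Chapter 4: pair RDDs and reduceByKey.
    val pairs = sc.parallelize(List(("panda", 1), ("panda", 2), ("pink", 3)))
    val counts = pairs.reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}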

Hidden content in this post:

Learning Spark.pdf (1.19 MB, requires 80 forum coins)



Note: this is a pre-release version.




Keywords: Learning Spark, Holden Karau, O'Reilly Media


#2
Multivariate · posted 2015-3-17 10:05:55
/**
 * Illustrates a simple aggregate in Scala to compute the average of an RDD.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.apache.spark.rdd.RDD

object BasicAvg {
  def main(args: Array[String]) {
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    val sc = new SparkContext(master, "BasicAvg", System.getenv("SPARK_HOME"))
    val input = sc.parallelize(List(1, 2, 3, 4))
    // aggregate returns a (sum, count) pair; divide to get the average
    val result = computeAvg(input)
    val avg = result._1 / result._2.toFloat
    println(avg)
  }
  def computeAvg(input: RDD[Int]) = {
    input.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2))
  }
}

#3
fantuanxiaot · posted 2015-3-17 10:32:23

#4
giggleholy · posted 2015-3-17 10:35:17
Spark is trendy; this book is a must-read.
/**
 * Illustrates loading a simple text file.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._

object BasicAvgFromFile {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage: [sparkmaster] [inputfile]")
      sys.exit(1)
    }
    val master = args(0)
    val inputFile = args(1)
    val sc = new SparkContext(master, "BasicAvgFromFile", System.getenv("SPARK_HOME"))
    val input = sc.textFile(inputFile)
    // Parse each line as an Int, then aggregate into a (sum, count) pair
    val result = input.map(_.toInt).aggregate((0, 0))(
      (acc, value) => (acc._1 + value, acc._2 + 1),
      (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
    val avg = result._1 / result._2.toFloat
    println(avg)
  }
}

#5
uandi · posted 2015-3-17 10:49:33
/**
 * Illustrates loading a directory of files with wholeTextFiles.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.apache.spark.SparkContext._

object BasicAvgFromFiles {
  def main(args: Array[String]) {
    if (args.length < 3) {
      println("Usage: [sparkmaster] [inputdirectory] [outputdirectory]")
      sys.exit(1)
    }
    val master = args(0)
    val inputFile = args(1)
    val outputFile = args(2)
    val sc = new SparkContext(master, "BasicAvgFromFiles", System.getenv("SPARK_HOME"))
    // wholeTextFiles yields one (filename, contents) pair per file
    val input = sc.wholeTextFiles(inputFile)
    val result = input.mapValues{ y =>
      val nums = y.split(" ").map(_.toDouble)
      nums.sum / nums.size.toDouble
    }
    result.saveAsTextFile(outputFile)
  }
}

#6
oink-oink · posted 2015-3-17 11:35:18

#7
lhf8059 · posted 2015-3-17 12:09:37
/**
 * Illustrates mapPartitions in Scala.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._

object BasicAvgMapPartitions {
  case class AvgCount(var total: Int = 0, var num: Int = 0) {
    def merge(other: AvgCount): AvgCount = {
      total += other.total
      num += other.num
      this
    }
    def merge(input: Iterator[Int]): AvgCount = {
      input.foreach{ elem =>
        total += elem
        num += 1
      }
      this
    }
    def avg(): Float = {
      total / num.toFloat
    }
  }

  def main(args: Array[String]) {
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    val sc = new SparkContext(master, "BasicAvgMapPartitions", System.getenv("SPARK_HOME"))
    val input = sc.parallelize(List(1, 2, 3, 4))
    // We only want a single AvgCount per partition, but mapPartitions
    // requires that the return value be wrapped in an Iterator.
    val result = input.mapPartitions(partition =>
      Iterator(AvgCount(0, 0).merge(partition)))
      .reduce((x, y) => x.merge(y))
    println(result)
  }
}

#8
tonyme2 (employment verified) · posted 2015-3-17 12:09:57
/**
 * Illustrates configuring the Kryo serializer and computing an average with aggregate.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._

object BasicAvgWithKryo {
  def main(args: Array[String]) {
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    val conf = new SparkConf().setMaster(master).setAppName("basicAvgWithKryo")
    // Switch Spark's serializer to Kryo
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2))
    val avg = result._1 / result._2.toFloat
    println(avg)
  }
}

#9
fengyg (enterprise verified) · posted 2015-3-17 12:37:53
/**
 * Illustrates filtering and union to extract lines with "error" or "warning".
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.apache.spark.SparkContext._

object BasicFilterUnionCombo {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setMaster(args(0)).setAppName("BasicFilterUnionCombo")
    val sc = new SparkContext(conf)
    val inputRDD = sc.textFile(args(1))
    // Build two filtered RDDs and union them into a single RDD of "bad" lines
    val errorsRDD = inputRDD.filter(_.contains("error"))
    val warningsRDD = inputRDD.filter(_.contains("warn"))
    val badLinesRDD = errorsRDD.union(warningsRDD)
    println(badLinesRDD.collect().mkString("\n"))
  }
}

#10
chinesesunboy · posted 2015-3-17 12:50:52
/**
 * Illustrates intersection by key using cogroup.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

import scala.reflect.ClassTag

object BasicIntersectByKey {

  def intersectByKey[K: ClassTag, V: ClassTag](rdd1: RDD[(K, V)], rdd2: RDD[(K, V)]): RDD[(K, V)] = {
    // cogroup yields an (Iterable[V], Iterable[V]) per key; keep a key only if it
    // has values on both sides (matching on Nil would miss non-List Iterables)
    rdd1.cogroup(rdd2).flatMapValues {
      case (xs, ys) if xs.isEmpty || ys.isEmpty => Seq.empty[V]
      case (xs, ys) => xs ++ ys
    }
  }

  def main(args: Array[String]) {
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    val sc = new SparkContext(master, "BasicIntersectByKey", System.getenv("SPARK_HOME"))
    val rdd1 = sc.parallelize(List((1, "panda"), (2, "happy")))
    val rdd2 = sc.parallelize(List((2, "pandas")))
    val iRdd = intersectByKey(rdd1, rdd2)
    val panda: List[(Int, String)] = iRdd.collect().toList
    panda.foreach(println)
    sc.stop()
  }
}
