OP: Lisrelchen

【Jacek Laskowski】Spark Structured Streaming


1 (OP)
Lisrelchen posted on 2017-5-15 07:33:16

Spark Structured Streaming

Welcome to Spark Structured Streaming notebook!

I’m Jacek Laskowski, an independent consultant who is passionate about Apache Spark, Apache Kafka, Scala and sbt (with some flavour of Apache Mesos, Hadoop YARN, and quite recently DC/OS). I lead the Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw, Poland.

Contact me at jacek@japila.pl or @jaceklaskowski to discuss Apache Spark opportunities, e.g. courses, workshops, mentoring or application development services.

If you like the Apache Spark notes, you should seriously consider participating in my own, very hands-on Spark Workshops.

Tip: I’m also writing Mastering Apache Spark 2 Notebook, Apache Kafka Notebook and Spark Streaming Notebook.

Spark Structured Streaming Notebook serves as the ultimate place of mine to collect all the nuts and bolts of using Spark Structured Streaming. The notes aim to help me design and develop better products with Apache Spark. They are also a viable proof of my understanding of Apache Spark. I do eventually want to reach the highest level of mastery in Apache Spark (as do you!).

The collection of notes serves as the study material for my trainings, workshops, videos and courses about Apache Spark. Follow me on Twitter as @jaceklaskowski to hear about them early. You will also learn about upcoming Apache Spark events.

Expect text and code snippets from Spark’s mailing lists, the official documentation of Apache Spark, StackOverflow, blog posts, books from O’Reilly (and other publishers), press releases, conferences, YouTube or Vimeo videos, Quora, the source code of Apache Spark, etc. Attribution follows.

Hidden content in this post:

Spark Structured Streaming Jacek Laskowski.pdf (421.27 KB)





2
Lisrelchen posted on 2017-5-15 07:42:14
DataStreamReader

DataStreamReader is an interface for reading streaming data into a DataFrame from data sources with a specified format, schema and options.

DataStreamReader supports the built-in formats json, csv, parquet and text. parquet is the default data source, as configured by the spark.sql.sources.default setting.

DataStreamReader is available through the SparkSession.readStream method.

val spark: SparkSession = ...

// Infer the schema once from a static (batch) read of the same files
val schema = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("csv-logs/*.csv")
  .schema

// Reuse the inferred schema for the streaming read
val df = spark.readStream
  .format("csv")
  .schema(schema)
  .load("csv-logs/*.csv")
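The same pattern works for the other built-in file formats mentioned above. A minimal sketch for a json streaming read, assuming a hypothetical input directory json-logs and made-up field names; note that file-based streaming sources need the schema up front (automatic schema inference is off by default for streaming reads):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark: SparkSession = SparkSession.builder.getOrCreate()

// Streaming file sources require an explicit schema
val jsonSchema = new StructType()
  .add("id", LongType)
  .add("message", StringType)

val jsonStream = spark.readStream
  .format("json")
  .schema(jsonSchema)
  .load("json-logs")  // hypothetical input directory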

3
Lisrelchen posted on 2017-5-15 07:43:09
FileStreamSource

FileStreamSource is a Source that reads text files from the path directory as they appear. It uses LongOffset offsets.

Note: It is used by DataSource.createSource for FileFormat.

You can provide the schema of the data and dataFrameBuilder, the function that builds a DataFrame in getBatch, at instantiation time.

// NOTE The source directory must exist
// mkdir text-logs

val df = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", 1)
  .load("text-logs")

scala> df.printSchema
root
 |-- value: string (nullable = true)
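A minimal follow-up sketch, assuming the df above, that writes the stream to the console so you can watch files being picked up one per trigger (the file name and content below are made up):

val query = df.writeStream
  .format("console")
  .start

// In another terminal, drop a new file into the monitored directory:
// $ echo "hello structured streaming" > text-logs/1.txt
// With maxFilesPerTrigger = 1, each micro-batch picks up at most one new file.

query.awaitTermination()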

4
Lisrelchen posted on 2017-5-15 07:43:46
Creating KafkaSourceRDD Instance

KafkaSourceRDD takes the following when created:

- SparkContext
- Collection of key-value settings for executors reading records from Kafka topics
- Collection of KafkaSourceRDDOffsetRange offsets
- Timeout (in milliseconds) to poll data from Kafka, used when KafkaSourceRDD is requested for records (for given offsets) and in turn requests CachedKafkaConsumer to poll for Kafka’s ConsumerRecords
- Flag to…FIXME
- Flag to…FIXME

KafkaSourceRDD initializes the internal registries and counters.
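KafkaSourceRDD itself is internal; at the user level a Kafka source is created through DataStreamReader with the kafka format. A minimal sketch, assuming a hypothetical broker at localhost:9092 and a hypothetical topic named events (the spark-sql-kafka-0-10 artifact has to be on the classpath):

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
  .option("subscribe", "events")                         // hypothetical topic
  .load()

// Kafka records arrive with binary key/value columns; cast the value to a string
val values = kafkaStream.selectExpr("CAST(value AS STRING)").as[String]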

5
Lisrelchen posted on 2017-5-15 07:45:13
MemoryStream

MemoryStream is an in-memory streaming source, mostly useful for tests and demos.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{OutputMode, ProcessingTime}
import scala.concurrent.duration._

val spark: SparkSession = ???
import spark.implicits._

// MemoryStream uses two implicits: Encoder[Int] and SQLContext
implicit val ctx = spark.sqlContext
val intsIn = MemoryStream[Int]

val ints = intsIn.toDF
  .withColumn("t", current_timestamp())
  .withWatermark("t", "5 minutes")
  .groupBy(window($"t", "5 minutes") as "window")
  .agg(count("*") as "total")

val totalsOver5mins = ints.writeStream
  .format("memory")
  .queryName("totalsOver5mins")
  .outputMode(OutputMode.Append)
  .trigger(ProcessingTime(30.seconds))
  .start

scala> val zeroOffset = intsIn.addData(0, 1, 2)
zeroOffset: org.apache.spark.sql.execution.streaming.Offset = #0

totalsOver5mins.processAllAvailable()
spark.table("totalsOver5mins").show

6
Lisrelchen posted on 2017-5-15 07:46:32
TextSocketSource

TextSocketSource is a streaming source that reads lines from a socket at the host and port defined by its parameters.

It keeps every line ever read from the socket in an internal in-memory buffer called lines; nothing is ever dropped.
import org.apache.spark.sql.{Dataset, SparkSession}

val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._

// Connect to localhost:9999
// You can use "nc -lk 9999" for demos
val textSocket = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load

val lines: Dataset[String] = textSocket.as[String].map(_.toUpperCase)

val query = lines.writeStream.format("console").start

// Start typing lines in the nc session
// They will appear UPPERCASE in the console output
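The socket source also supports an includeTimestamp option; when enabled, each line comes with the time it arrived. A small sketch under that assumption:

// With includeTimestamp the schema becomes (value: string, timestamp: timestamp)
val timedSocket = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load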

7
Lisrelchen posted on 2017-5-15 07:46:56
DataStreamWriter

DataStreamWriter is the part of Structured Streaming that is responsible for writing the output of streaming queries to sinks and thereby starting their execution.

val people: Dataset[Person] = ...

import org.apache.spark.sql.streaming.ProcessingTime
import org.apache.spark.sql.streaming.OutputMode.Complete
import scala.concurrent.duration._

people.writeStream
  .queryName("textStream")
  .outputMode(Complete)
  .trigger(ProcessingTime(10.seconds))
  .format("console")
  .start
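start returns a StreamingQuery handle. A short sketch, reusing the people Dataset above, of how that handle is typically kept around to block the application and to stop the query:

val query = people.writeStream
  .queryName("textStream")
  .format("console")
  .start

query.awaitTermination()  // block until the query stops or fails
// query.stop()           // or stop it programmatically from another thread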

8
Lisrelchen posted on 2017-5-15 07:47:57
ConsoleSink

ConsoleSink is a streaming sink that is registered as the console format.

val spark: SparkSession = ...
import spark.implicits._

spark.readStream
  .format("text")
  .load("server-logs/*.out")
  .as[String]
  .writeStream
  .queryName("server-logs processor")
  .format("console")  // <-- uses ConsoleSink
  .start

scala> spark.streams.active.foreach(println)
Streaming Query - server-logs processor [state = ACTIVE]

// in another terminal
$ echo hello > server-logs/hello.out
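ConsoleSink also takes options that control how much of each micro-batch is printed; a small sketch, assuming streamingDF is any streaming DataFrame and the option values are arbitrary:

// Print up to 50 rows per micro-batch and do not truncate long values
streamingDF.writeStream
  .format("console")
  .option("numRows", 50)
  .option("truncate", false)
  .start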

9
Lisrelchen posted on 2017-5-15 07:48:35
FileStreamSink

FileStreamSink is the streaming sink for the parquet format.

import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, ProcessingTime}

// in: a streaming DataFrame/Dataset created with spark.readStream
val out = in.writeStream
  .format("parquet")
  .option("path", "parquet-output-dir")
  .option("checkpointLocation", "checkpoint-dir")
  .trigger(ProcessingTime(5.seconds))
  .outputMode(OutputMode.Append)
  .start()

FileStreamSink supports the Append output mode only.
It uses the spark.sql.streaming.fileSink.log.deletion setting (as isDeletingExpiredLog).
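For completeness, a minimal sketch of inspecting what the sink wrote, using an ordinary batch read of the same output directory:

// The sink keeps a metadata log under the output directory, so Spark's own
// batch reader sees a consistent view of the committed files
val parquetOut = spark.read.parquet("parquet-output-dir")
parquetOut.show()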

10
Lisrelchen posted on 2017-5-15 07:49:26
ForeachSink

ForeachSink is a typed Sink that passes records (of type T) to a ForeachWriter, one record at a time per partition.

It is used exclusively by the foreach operator.

import org.apache.spark.sql.ForeachWriter

val spark: SparkSession = ...
import spark.implicits._

val records = spark.readStream
  .format("text")
  .load("server-logs/*.out")
  .as[String]

val writer = new ForeachWriter[String] {
  override def open(partitionId: Long, version: Long) = true
  override def process(value: String) = println(value)
  override def close(errorOrNull: Throwable) = {}
}

records.writeStream
  .queryName("server-logs processor")
  .foreach(writer)
  .start
