
[Q&A] Apache Spark 2.2.0 Officially Released


ada89k posted on 2017-7-12 22:03:03


Apache Spark 2.2.0 Officially Released




Apache Spark 2.2.0, after half a year of development and six release candidates (RC1 through RC6), was officially released today. It is the third release on the 2.x line. In this release the experimental tag has been removed from Structured Streaming, which means that from 2.2.x onward it can be used in production with confidence. Beyond that, the release concentrates on usability, stability, and polish rather than other major new features. Some of the new features in this release (the first two SQL additions are sketched in Scala right after the list):

Support for LATERAL VIEW OUTER explode();
Support for adding columns to an existing table (ALTER TABLE table_name ADD COLUMNS);
Support for reading data from Hive metastore 2.0/2.1;
Support for parsing multi-line JSON and CSV files;
A Structured Streaming API for R;
Complete Catalog API support in R;
DataFrame checkpointing support in R.
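
The two SQL additions at the top of the list can be exercised directly through SparkSession.sql. Below is a minimal Scala sketch; the events table, its tags array column, and the added source column are made-up names used only for illustration, and Hive support is assumed to be enabled.

import org.apache.spark.sql.SparkSession

object NewSqlSyntaxSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-2.2-sql-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // LATERAL VIEW OUTER explode(): rows whose array is empty or NULL are kept
    // (with a NULL tag) instead of being dropped as with a plain LATERAL VIEW.
    spark.sql(
      """SELECT id, tag
        |FROM events
        |LATERAL VIEW OUTER explode(tags) t AS tag""".stripMargin).show()

    // ALTER TABLE ... ADD COLUMNS can now be issued from Spark SQL.
    spark.sql("ALTER TABLE events ADD COLUMNS (source STRING COMMENT 'origin of the record')")

    spark.stop()
  }
}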

In addition, this release drops support for Java 7 and for Hadoop 2.5 and earlier. The detailed changes are as follows:

Core and Spark SQL

·     API updates (the session timezone and generic hint are sketched in Scala after this sub-list)

·              SPARK-19107: Support creating hive table with DataFrameWriter and Catalog

·              SPARK-13721: Add support for LATERAL VIEW OUTER explode()

·              SPARK-18885: Unify CREATE TABLE syntax for data source and hive serde tables

·              SPARK-16475: Added Broadcast Hints BROADCAST, BROADCASTJOIN, and MAPJOIN, for SQL Queries

·              SPARK-18350: Support session local timezone

·              SPARK-19261: Support ALTER TABLE table_name ADD COLUMNS

·              SPARK-20420: Add events to the external catalog

·              SPARK-18127: Add hooks and extension points to Spark

·              SPARK-20576: Support generic hint function in Dataset/DataFrame

·              SPARK-17203: Data source options should always be case insensitive

·              SPARK-19139: AES-based authentication mechanism for Spark
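
Two of the API updates above are easy to show in a few lines of Scala: the session-local timezone (SPARK-18350) and the generic hint function (SPARK-20576), which subsumes the SQL broadcast hints of SPARK-16475. A minimal sketch; the tiny DataFrames are made up for illustration.

import org.apache.spark.sql.SparkSession

object ApiUpdatesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-2.2-api-sketch").getOrCreate()
    import spark.implicits._

    // SPARK-18350: session-local timezone used when parsing and formatting timestamps.
    spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")

    // SPARK-20576 / SPARK-16475: the generic hint function; "broadcast" marks the small
    // side of a join, same effect as the BROADCAST hint in SQL.
    val small = Seq((1, "a"), (2, "b")).toDF("id", "label")
    val large = spark.range(0, 1000000).toDF("id")
    val joined = large.join(small.hint("broadcast"), "id")
    joined.explain()   // the plan should show a broadcast join

    spark.stop()
  }
}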

·     Performance and stability

·              Cost-Based Optimizer (enabling it is sketched in Scala after this sub-list)

·                       SPARK-17075 SPARK-17076 SPARK-19020 SPARK-17077 SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample operators

·                       SPARK-17080: Cost-based join re-ordering

·                       SPARK-17626: TPC-DS performance improvements using star-schema heuristics

·              SPARK-17949: Introduce a JVM object based aggregate operator

·              SPARK-18186: Partial aggregation support of HiveUDAFFunction

·              SPARK-18362 SPARK-19918: File listing/IO improvements for CSV and JSON

·              SPARK-18775: Limit the max number of records written per file

·              SPARK-18761: Uncancellable / unkillable tasks shouldn’t starve jobs of resources

·              SPARK-15352: Topology aware block replication
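
The cost-based optimizer entries above rely on table and column statistics and are disabled by default. The sketch below shows, under the assumption that a sales table with customer_id and amount columns exists, how one might switch on CBO and join re-ordering and collect the statistics it uses; the spark.sql.cbo.* property names are the ones shipped with 2.2.

import org.apache.spark.sql.SparkSession

object CboSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-2.2-cbo-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // CBO and cost-based join re-ordering are opt-in in 2.2.
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

    // Collect the table- and column-level statistics the optimizer consumes
    // (table and column names are placeholders).
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

    spark.stop()
  }
}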

·     Other notable changes

·              SPARK-18352: Support for parsing multi-line JSON files

·              SPARK-19610: Support for parsing multi-line CSV files (both multi-line readers are sketched after this list)

·              SPARK-21079: Analyze Table Command on partitioned tables

·              SPARK-18703: Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables

·              SPARK-18209: More robust view canonicalization without full SQL expansion

·              SPARK-13446 (SPARK-18112): Support reading data from Hive metastore 2.0/2.1

·              SPARK-18191: Port RDD API to use commit protocol

·              SPARK-8425: Add blacklist mechanism for task scheduling

·              SPARK-19464: Remove support for Hadoop 2.5 and earlier

·              SPARK-19493: Remove Java 7 support
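
Among the other notable changes, the multi-line readers (SPARK-18352, SPARK-19610) are driven by a single reader option. A minimal sketch, with placeholder file paths:

import org.apache.spark.sql.SparkSession

object MultiLineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-2.2-multiline-sketch").getOrCreate()

    // SPARK-18352: a JSON file may now contain a record (or an array of records)
    // spread over several lines, instead of the one-record-per-line requirement.
    val people = spark.read.option("multiLine", "true").json("/path/to/people.json")

    // SPARK-19610: the same option for CSV records with quoted, embedded newlines.
    val notes = spark.read
      .option("multiLine", "true")
      .option("header", "true")
      .csv("/path/to/notes.csv")

    people.printSchema()
    notes.printSchema()
    spark.stop()
  }
}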

Programming guides: Spark Programming Guide and Spark SQL, DataFrames and Datasets Guide.

Structured Streaming

·     General Availability

·              SPARK-20844: The Structured Streaming APIs are now GA and no longer labeled experimental

·     Kafka Improvements

·              SPARK-19719: Support for reading and writing data in streaming or batch to/from Apache Kafka (a read/write sketch with a one-time trigger follows this list)

·              SPARK-19968: Cached producer for lower latency Kafka-to-Kafka streams

·     API updates

·              SPARK-19067: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState

·              SPARK-19876: Support for one-time triggers

·     Other notable changes

·              SPARK-20979: Rate source for testing and benchmarks
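
A short Scala sketch of the streaming additions: reading from and writing to Kafka (SPARK-19719) and running the query with a one-time trigger (SPARK-19876). The broker address, topic names, and checkpoint path are placeholders, and the spark-sql-kafka-0-10 connector is assumed to be on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-2.2-streaming-sketch").getOrCreate()

    // SPARK-19719: a Kafka topic as a streaming source.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // SPARK-19876: Trigger.Once() processes whatever is available and then stops,
    // turning the same streaming query into a batch-style run.
    val query = input
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "events-copy")
      .option("checkpointLocation", "/tmp/spark-2.2-streaming-checkpoint")
      .trigger(Trigger.Once())
      .start()

    query.awaitTermination()
    spark.stop()
  }
}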

Programming guide: Structured Streaming Programming Guide.

MLlib

·     New algorithms in DataFrame-based API

·              SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R) (a training sketch follows this section's lists)

·              SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)

·              SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)

·              SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)

·              SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)

·              SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)

·     Existing algorithms added to Python and R APIs

·              SPARK-18239: Gradient Boosted Trees (R)

·              SPARK-18821: Bisecting K-Means (R)

·              SPARK-18080: Locality Sensitive Hashing (LSH) (Python)

·              SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)

·     Major bug fixes

·              SPARK-19110: DistributedLDAModel.logPrior correctness fix

·              SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)

·              SPARK-18715: Fix wrong AIC calculation in Binomial GLM

·              SPARK-16473: BisectingKMeans failing during training with “java.util.NoSuchElementException: key not found” for certain inputs

·              SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use

·              SPARK-20047: Box-constrained Logistic Regression
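
As a taste of the new DataFrame-based algorithms, the sketch below trains the LinearSVC classifier (SPARK-14709) on a tiny, made-up dataset.

import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LinearSVCSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-2.2-ml-sketch").getOrCreate()
    import spark.implicits._

    // A tiny, made-up training set of (label, features) rows.
    val training = Seq(
      (0.0, Vectors.dense(0.0, 1.1)),
      (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.1, 1.3)),
      (1.0, Vectors.dense(1.9, 0.8))
    ).toDF("label", "features")

    // SPARK-14709: linear SVM classifier in the DataFrame-based spark.ml API.
    val svc = new LinearSVC().setMaxIter(20).setRegParam(0.1)
    val model = svc.fit(training)
    println(s"coefficients: ${model.coefficients}  intercept: ${model.intercept}")

    spark.stop()
  }
}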

Programming guide: Machine Learning Library (MLlib) Guide.

SparkR

The main focus of SparkR in the 2.2.0 release was adding extensive support for existing Spark SQL features:

·     Major features

·              SPARK-19654: Structured Streaming API for R

·              SPARK-20159: Support complete Catalog API in R

·              SPARK-19795: column functions to_json, from_json

·              SPARK-19399: Coalesce on DataFrame and coalesce on column

·              SPARK-20020: Support DataFrame checkpointing (the Scala equivalent is sketched after this list)

·              SPARK-18285: Multi-column approxQuantile in R
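
The Catalog API and DataFrame checkpointing now exposed to R wrap existing Spark SQL functionality; to stay consistent with the other sketches, the Scala side of those calls is shown below (the checkpoint directory is a placeholder).

import org.apache.spark.sql.SparkSession

object CatalogCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-2.2-catalog-sketch").getOrCreate()

    // Checkpointing needs a reliable checkpoint directory.
    spark.sparkContext.setCheckpointDir("/tmp/spark-2.2-catalog-checkpoint")

    // Catalog API: the same listing calls that SPARK-20159 exposes to R.
    spark.catalog.listDatabases().show()
    spark.catalog.listTables().show()

    // DataFrame checkpointing (SPARK-20020 adds the R binding): materialize the data
    // and return a new DataFrame with its lineage truncated.
    val df = spark.range(0, 1000).toDF("id")
    val checkpointed = df.checkpoint()
    println(checkpointed.count())

    spark.stop()
  }
}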

Programming guide: SparkR (R on Spark).

GraphX

·     Bug fixes

·              SPARK-18847: PageRank gives incorrect results for graphs with sinks

·              SPARK-14804: Graph vertexRDD/EdgeRDD checkpoint results ClassCastException

·     Optimizations

·              SPARK-18845: PageRank initial value improvement for faster convergence

·              SPARK-5484: Pregel should checkpoint periodically to avoid StackOverflowError (see the sketch after this list)
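
A small sketch of the GraphX changes: PageRank on an edge-list file (which benefits from the sink fix SPARK-18847 and the faster initialization of SPARK-18845), with periodic Pregel checkpointing enabled through the spark.graphx.pregel.checkpointInterval property added by SPARK-5484. The file path and checkpoint directory are placeholders.

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-2.2-graphx-sketch")
      // SPARK-5484: off by default; when set, Pregel-based algorithms checkpoint
      // every N iterations to keep lineage short on long-running jobs.
      .config("spark.graphx.pregel.checkpointInterval", "10")
      .getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-2.2-graphx-checkpoint")

    // Load a graph from an edge list and run PageRank until convergence.
    val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt")
    val ranks = graph.pageRank(tol = 0.0001).vertices
    ranks.take(5).foreach(println)

    spark.stop()
  }
}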

Programming guide: GraphX Programming Guide.

Deprecations

·     MLlib

·              SPARK-18613: spark.ml LDA classes should not expose spark.mllib in APIs. In spark.ml.LDAModel, deprecated oldLocalModel and getModel.

·     SparkR

·              SPARK-20195: deprecate createExternalTable

Changes of behavior

·     MLlib

·              SPARK-19787: DeveloperApi ALS.train() uses default regParam value 0.1 instead of 1.0, in order to match regular ALS API’s default regParam setting.

·     SparkR

·              SPARK-19291: This added log-likelihood for SparkR Gaussian Mixture Models, but doing so introduced a SparkR model persistence incompatibility: Gaussian Mixture Models saved from SparkR 2.1 may not be loaded into SparkR 2.2. We plan to put in place backwards compatibility guarantees for SparkR in the future.




marcus10 posted on 2017-7-13 02:26:38 from mobile, in reply to ada89k (2017-7-12 22:03, "Apache Spark 2.2.0 Officially Released"):

You're awesome, man.
