WHERE IS APACHE HIVE GOING? TO IN-MEMORY COMPUTING.

OP
oliyiyi posted on 2016-10-7 07:03:20


Apache Hive™ is the most complete SQL-on-Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions, and fine-grained dynamic security.

Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes, many interesting use cases demand more interactive performance on smaller datasets, requiring a shift to in-memory processing. Hive 2 marks the beginning of Hive's journey from a disk-centric architecture to a memory-centric architecture through Hive LLAP (Live Long and Process). Since memory costs about 100x as much as disk, memory-centric architectures demand a careful design that makes the most of available resources.

In this blog, we'll update benchmark results from our earlier blog, "Announcing Apache Hive 2.1: 25x Faster Queries and Much More".

NOT ALL MEMORY ARCHITECTURES ARE EQUAL

The in-memory shift is one of the most important trends of the last 10 years of computing. Still, RAM costs about 100x more than disk, so it's by far the most carefully budgeted and over-allocated resource in the datacenter. Applications can't expect to hold all data in memory all the time, and we know from experience that you won't see substantial benefits from an in-memory architecture unless it's carefully architected to make the most of memory.

This means:

  • You must cache only the specific data needed for processing queries.
  • The cache needs to be shared across all users who need the data.
  • The data needs to stay compressed in RAM to offset the 100x higher cost of RAM.
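The three requirements above can be sketched as a toy cache. This is a minimal illustration, not LLAP's actual implementation; the class and key layout are hypothetical:

```python
import zlib

class SharedColumnCache:
    """Toy sketch of the three requirements above: cache only the
    chunks queries ask for, share one store across all users, and
    keep everything compressed in RAM. Not LLAP's real design."""

    def __init__(self):
        # One shared map: (table, column, stripe) -> compressed bytes.
        self._store = {}

    def put(self, key, raw: bytes):
        # Keep data compressed in RAM to offset its ~100x cost vs. disk.
        self._store[key] = zlib.compress(raw)

    def get(self, key):
        # Every user/query hits the same cached, compressed copy.
        blob = self._store.get(key)
        return zlib.decompress(blob) if blob is not None else None

cache = SharedColumnCache()
cache.put(("store_sales", "ss_item_sk", 0), b"1,2,3,4" * 1000)
assert cache.get(("store_sales", "ss_item_sk", 0)) == b"1,2,3,4" * 1000
```

Note that the store holds the compressed bytes, so repetitive columnar data occupies far less RAM than its raw size.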

When we put this together we see a very sharp contrast between naive in-memory implementations that simply load raw data into RAM ("Type 1" systems) and architectures optimized from the ground up to make the most of memory ("Type 2" systems).

In the Hadoop ecosystem we have seen firsthand that Type 1 systems offer limited benefits, so the Hive community decided to skip directly to a Type 2 in-memory system. Hive LLAP delivers on all three attributes above, with a single shared, managed cache that stores data compressed in memory. In addition, work is actively underway to optimize this even further, using flash to supersize the cache and extending LLAP's ability to operate directly on compressed data without decompressing it.

BENCHMARKING LLAP AT 10 TB SCALE

Let’s put this architecture to the test with a realistic dataset size and workload. Our previous performance blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More”, discussed 4 reasons that LLAP delivers dramatically faster performance versus Hive on Tez. In that benchmark we saw 25+x performance boosts on ad-hoc queries with a dataset that fit entirely into the cluster’s memory.

In most cases, datasets will be far too large to fit in RAM so we need to understand if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets. To find out, we scaled the dataset up to 10 TB, 4x larger than aggregate cluster RAM, and we ran a number of far more complex queries.
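The "4x larger than aggregate cluster RAM" claim can be sanity-checked against the hardware listed below, assuming the EC2 spec of 244 GiB of RAM per d2.8xlarge node:

```python
# Rough check of the dataset-to-RAM ratio for the 10-node cluster
# described below (assumes 244 GiB RAM per d2.8xlarge node).
nodes = 10
ram_per_node_gib = 244
cluster_ram_tb = nodes * ram_per_node_gib * 2**30 / 1e12  # ~2.62 TB
dataset_tb = 10.0
print(round(dataset_tb / cluster_ram_tb, 1))  # roughly 4x
```

So the 10 TB dataset is close to four times the cluster's total memory, forcing LLAP to manage its cache rather than simply holding everything in RAM.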

Table 3 below shows how Hive LLAP is capable of running both at speed and at scale. The simplest query in the benchmark ran in 2.68 seconds on this 10 TB dataset, while the most complex query, Query 64, performed a total of 37 joins and ran for more than 20 minutes.

If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join type queries, but even extremely complex queries realized almost 2x benefit versus Hive 1.

If we focus just on smaller queries that run in 30 seconds or less we see about a 9x speedup. At this 10 TB scale a total of 24 queries run in 30 seconds or less with LLAP.

HDP TEST ENVIRONMENT

SOFTWARE:

  • Hive 1 (HDP 2.5): Hive 1.2.1, Tez 0.7.0, Hadoop 2.7.3, no LLAP
  • Hive 2 (HDP 2.5): Hive 2.1.0, Tez 0.8.4, Hadoop 2.7.3, all queries run through LLAP

OTHER HIVE / HADOOP SETTINGS:

All HDP software was deployed using Apache Ambari from the HDP 2.5 distribution, with installation defaults throughout. Even so, we call out the settings new to HDP 2.5 that matter for this benchmark. These optimizations are set by default on new HDP 2.5 installs; upgraded HDP clusters need to set them explicitly.

  • hive.vectorized.execution.mapjoin.native.enabled=true;
  • hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
  • hive.vectorized.execution.mapjoin.minmax.enabled=true;
  • hive.vectorized.execution.reduce.enabled=true;
  • hive.llap.client.consistent.splits=true;
  • hive.optimize.dynamic.partition.hashjoin=true;
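
On an upgraded cluster, these can be applied per-session from Beeline or the Hive CLI (or set globally in hive-site.xml). A sketch of the per-session form:

```sql
-- Per-session form of the HDP 2.5 defaults listed above;
-- upgraded clusters must set these explicitly.
SET hive.vectorized.execution.mapjoin.native.enabled=true;
SET hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
SET hive.vectorized.execution.mapjoin.minmax.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.llap.client.consistent.splits=true;
SET hive.optimize.dynamic.partition.hashjoin=true;
```
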
HARDWARE:
  • 10x d2.8xlarge EC2 nodes
OS CONFIGURATION:

For the most part, OS defaults were used, with one exception:

  • /proc/sys/net/core/somaxconn = 4000
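
On a standard Linux host this setting can be applied as follows (a sketch; run as root):

```shell
# Raise the listen-backlog limit for the current boot only:
sysctl -w net.core.somaxconn=4000

# To persist across reboots, append the setting to /etc/sysctl.conf:
echo "net.core.somaxconn = 4000" >> /etc/sysctl.conf
```
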
DATA:
  • TPC-DS Scale 10000 data (10 TB), partitioned by date_sk columns, stored in ORC format with Zlib compression.
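
As an illustration of that layout, a TPC-DS fact table might be declared as below. This is a hypothetical sketch (column list abbreviated, names following the TPC-DS schema), not the exact DDL used in the benchmark:

```sql
-- Sketch of the storage layout described above: ORC format,
-- Zlib compression, partitioned on the date surrogate key.
CREATE TABLE store_sales (
    ss_item_sk     BIGINT,
    ss_customer_sk BIGINT,
    ss_quantity    INT,
    ss_net_paid    DECIMAL(7,2)
    -- ... remaining TPC-DS columns omitted ...
)
PARTITIONED BY (ss_sold_date_sk BIGINT)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");
```
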
QUERIES:

Reply #1
Kamize (verified student) posted on 2016-10-7 10:04:54 from mobile
> oliyiyi posted on 2016-10-7 07:03: "Apache Hive(™) is the most complete SQL on Hadoop system, supporting comprehensive SQL, a soph ..."
Thanks for sharing, OP.
