Apache Hive(™) is the most complete SQL on Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions and fine-grained dynamic security.
Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes, many interesting use cases demand more interactive performance on smaller datasets, requiring a shift to in-memory processing. Hive 2 marks the beginning of Hive’s journey from a disk-centric architecture to a memory-centric architecture through Hive LLAP (Live Long and Process). Since memory costs about 100x as much as disk, memory-centric architectures demand a careful design that makes the most of available resources.
In this blog, we’ll update benchmark results from our earlier blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More.”
http://hortonworks.com/wp-content/uploads/2016/10/Screen-Shot-2016-10-05-at-2.17.23-PM.png
NOT ALL MEMORY ARCHITECTURES ARE EQUAL
The in-memory shift is one of the most important trends of the last 10 years of computing. Still, RAM costs about 100x more than disk, so it’s by far the most carefully budgeted and over-allocated resource in the datacenter. Applications can’t expect to hold all data in memory all the time, and we know from experience that you won’t see substantial benefits from an in-memory architecture unless it’s carefully architected to make the most of memory.
This means:
- You must cache only the specific data needed for processing queries.
- The cache needs to be shared across all users who need the data.
- The data needs to stay compressed in RAM to offset the 100x higher cost of RAM.
When we put this together we see a very sharp contrast between naive in-memory implementations and architectures optimized from the ground up to make the most of memory.
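To make the compressed-cache idea concrete, here is a minimal, hypothetical sketch in Python (not LLAP’s actual implementation, and the key names are invented): a single shared cache that keeps column chunks zlib-compressed in memory and decompresses only on access, so the RAM footprint stays well below the raw data size.

```python
import zlib

class CompressedCache:
    """Toy shared cache: values are stored zlib-compressed, decompressed on read."""

    def __init__(self):
        # One store shared by all readers -- analogous to LLAP's single shared cache.
        self._store = {}

    def put(self, key, data: bytes):
        self._store[key] = zlib.compress(data)

    def get(self, key) -> bytes:
        return zlib.decompress(self._store[key])

    def compressed_size(self, key) -> int:
        return len(self._store[key])

cache = CompressedCache()
# Repetitive columnar data (dates, dimension keys) compresses very well.
column = b"2016-10-05," * 10_000
cache.put("store_sales/date_chunk_0", column)  # hypothetical cache key

assert cache.get("store_sales/date_chunk_0") == column
assert cache.compressed_size("store_sales/date_chunk_0") < len(column)
```

The design choice this illustrates: because the 100x RAM/disk cost gap dominates, trading a little CPU on decompression for a several-fold larger effective cache is usually a win.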
http://hortonworks.com/wp-content/uploads/2016/10/Screen-Shot-2016-10-05-at-2.17.30-PM.png
In the Hadoop ecosystem we have seen firsthand that Type 1 systems offer limited benefits, so the Hive community decided to skip directly to a Type 2 In-Memory system. Hive LLAP delivers on all these attributes, with a single shared, managed cache that stores data compressed in memory. In addition, there is work actively underway to optimize this even further, using Flash to supersize the cache and extending LLAP’s ability to operate directly on compressed data without decompressing it.
http://hortonworks.com/wp-content/uploads/2016/10/Screen-Shot-2016-10-05-at-2.17.37-PM.png
BENCHMARKING LLAP AT 10 TB SCALE
Let’s put this architecture to the test with a realistic dataset size and workload. Our previous performance blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More”, discussed 4 reasons that LLAP delivers dramatically faster performance versus Hive on Tez. In that benchmark we saw 25+x performance boosts on ad-hoc queries with a dataset that fit entirely into the cluster’s memory.
In most cases, datasets will be far too large to fit in RAM so we need to understand if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets. To find out, we scaled the dataset up to 10 TB, 4x larger than aggregate cluster RAM, and we ran a number of far more complex queries.
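As a quick sanity check on the “4x larger than aggregate cluster RAM” figure: the test cluster (described in the environment section below) had 10 d2.8xlarge EC2 nodes, each with 244 GiB of RAM per AWS specs. The arithmetic, with GiB-to-decimal-TB conversion made explicit:

```python
# Back-of-envelope check of dataset size vs. aggregate cluster RAM.
nodes = 10
ram_per_node_gib = 244  # d2.8xlarge instance memory (AWS spec)

# Convert GiB (binary) to decimal terabytes to compare with "10 TB".
cluster_ram_tb = nodes * ram_per_node_gib * 2**30 / 1e12
dataset_tb = 10.0

ratio = dataset_tb / cluster_ram_tb
print(round(cluster_ram_tb, 2))  # 2.62 -- about 2.6 TB of aggregate RAM
print(round(ratio, 1))           # 3.8  -- roughly "4x larger than RAM"
```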
Table 3 below shows how Hive LLAP is capable of running both At Speed and At Scale. The simplest query in the benchmark ran in 2.68 seconds on this 10 TB dataset, while the most complex query, Query 64, performed a total of 37 joins and ran for more than 20 minutes.
http://hortonworks.com/wp-content/uploads/2016/10/Screen-Shot-2016-10-05-at-2.17.41-PM.png
If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join type queries, but even extremely complex queries realized almost 2x benefit versus Hive 1.
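The per-query speedups reported here are simple ratios of Hive 1 runtime to Hive 2 (LLAP) runtime, averaged across the query set. As an illustration with hypothetical timings (these are not the benchmark’s actual numbers):

```python
# Hypothetical (query, hive1_seconds, hive2_llap_seconds) triples --
# illustrative only, not the published benchmark results.
timings = [
    ("q_small_a", 90.0, 9.0),      # small ad-hoc / star-schema join: big win
    ("q_small_b", 60.0, 6.0),
    ("q_complex", 2400.0, 1300.0), # very complex query: smaller but real win
]

speedups = {name: t1 / t2 for name, t1, t2 in timings}
avg_speedup = sum(speedups.values()) / len(speedups)

print(speedups["q_small_a"])   # 10.0
print(round(avg_speedup, 1))   # 7.3
```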
http://hortonworks.com/wp-content/uploads/2016/10/Screen-Shot-2016-10-05-at-2.17.46-PM.png
If we focus just on smaller queries that run in 30 seconds or less we see about a 9x speedup. At this 10 TB scale a total of 24 queries run in 30 seconds or less with LLAP.
http://hortonworks.com/wp-content/uploads/2016/10/Screen-Shot-2016-10-05-at-8.35.20-PM.png
HDP TEST ENVIRONMENT
SOFTWARE:

| Hive 1 (HDP 2.5) | Hive 2 (HDP 2.5) |
| --- | --- |
| Hive 1.2.1, Tez 0.7.0, Hadoop 2.7.3, no LLAP | Hive 2.1.0, Tez 0.8.4, Hadoop 2.7.3, all queries run through LLAP |
All HDP software was deployed with Apache Ambari using HDP 2.5. All defaults were used in our installation; even so, we call out the settings new in HDP 2.5 that are important for this benchmark. These optimizations are set by default on new HDP 2.5 installs, but upgraded HDP clusters need to set them explicitly.
- hive.vectorized.execution.mapjoin.native.enabled=true;
- hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
- hive.vectorized.execution.mapjoin.minmax.enabled=true;
- hive.vectorized.execution.reduce.enabled=true;
- hive.llap.client.consistent.splits=true;
- hive.optimize.dynamic.partition.hashjoin=true;
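On an upgraded cluster, these can be applied per session (or placed in hive-site.xml); for example, in a beeline session:

```sql
-- Session-level overrides for an upgraded (non-fresh) HDP 2.5 cluster;
-- new HDP 2.5 installs already have these as defaults.
SET hive.vectorized.execution.mapjoin.native.enabled=true;
SET hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
SET hive.vectorized.execution.mapjoin.minmax.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.llap.client.consistent.splits=true;
SET hive.optimize.dynamic.partition.hashjoin=true;
```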
HARDWARE:
- 10x d2.8xlarge EC2 nodes
For the most part, OS defaults were used, with one exception:
- /proc/sys/net/core/somaxconn = 4000
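One way to apply this change (requires root; the exact persistence mechanism varies by distribution, so treat this as a sketch):

```shell
# Raise the TCP listen-backlog limit for the current boot (root required)
sysctl -w net.core.somaxconn=4000

# Persist across reboots, then reload
echo "net.core.somaxconn = 4000" >> /etc/sysctl.conf
sysctl -p
```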
- TPC-DS Scale 10000 data (10 TB), partitioned by date_sk columns, stored in ORC format with Zlib compression.
- The Hive test was driven by the Hive Testbench (https://github.com/hortonworks/hive-testbench/tree/hive14), which was used both for data generation and for the queries. The same query text was used for Hive 1 and for Hive 2.
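A fact table laid out this way would be declared roughly as follows in HiveQL (illustrative only; the column list is abbreviated and this is not the testbench’s exact DDL):

```sql
-- Illustrative DDL, not the exact hive-testbench schema
CREATE TABLE store_sales (
  ss_item_sk     BIGINT,
  ss_customer_sk BIGINT,
  ss_net_paid    DECIMAL(7,2)
)
PARTITIONED BY (ss_sold_date_sk BIGINT)   -- partitioned by the date_sk column
STORED AS ORC                             -- columnar ORC format
TBLPROPERTIES ("orc.compress"="ZLIB");    -- Zlib compression inside ORC
```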
Title: WHERE IS APACHE HIVE GOING? TO IN-MEMORY COMPUTING.
Link: https://bbs.pinggu.org/jg/huiji_huijiku_4863768_1.html