OP: oliyiyi

Hadoop and Big Data: The Top 6 Questions Answered


oliyiyi, posted 2016-01-24 10:26:23


By Bruce Gilroy, Avnet.

I recently had the opportunity to participate in a series of panel discussions as part of a Big Data and Hadoop roadshow conducted by Avnet, HPE, Hortonworks and SUSE.

In addition to great industry and technology presentations from Hortonworks and HPE, each event included an interactive panel discussion, featuring domain experts from HPE, Hortonworks and Avnet. The HPE and Hortonworks panelists varied by city, with a couple of repeat participants. My calendar lined up so I agreed to attend all three events for Avnet.

While I might not have been in love with the idea at the time (3 cities in 1.5 weeks), participating in all three events turned out to be very beneficial. In addition to gaining new and expanded insights from my fellow panelists, I got to engage with a broad set of individuals and organizations on big data and Hadoop. As the one constant on all three panels, I had the opportunity to experience the differences and similarities between the different audiences.

The audience dynamics varied from city to city, from local culture to the types of companies/industries present to their average level of big data/Hadoop experience. The attendees in one city were (on average) fairly new or early in their big data strategies, whereas the audience in another city was (on average) further along in their big data journey. Naturally, this resulted in some different and unique questions and discussions from city to city. What really struck me, however, were the similarities between the three diverse audiences when it came to their questions around big data and Hadoop. I was surprised at how often the same questions kept coming up and how similar many of the discussions were from event to event.

With that in mind, I thought it might be beneficial to our partner community to highlight some of the common questions that kept coming up around Hadoop and big data and summarize the panels’ responses. We actually agreed most of the time, making the task of summarizing the panel opinions not overly daunting.

Does Hadoop replace my existing Data Warehouse?

Panel says: No. Hadoop can be an extremely valuable extension to your data warehouse and can even off-load some services from it (such as ETL), but it does not replace it. Hadoop is not an RDBMS; it is not an ACID-compliant database; it is not even a database. It is a file system (the Hadoop Distributed File System, or HDFS) plus an analytic/calculation engine (MapReduce). Yes, we can add SQL services like Hive and other processing engines like Spark, but that still doesn't replace an enterprise data warehouse. Hive and other SQL-on-Hadoop tools do not implement the full ANSI SQL standard, only a subset of ANSI SQL-92 features, and using them as a warehouse substitute would have significant speed/performance implications. Hadoop is complementary to your data warehouse.
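To make the HDFS-plus-MapReduce distinction concrete, here is a minimal, self-contained sketch of the MapReduce programming model in plain Python. No Hadoop cluster is involved, and the function names are illustrative, not Hadoop APIs; the point is only the map → shuffle → reduce flow that the engine runs over files in HDFS.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs -- here, (word, 1) for a word count.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as Hadoop does between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values -- here, summing the counts.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop is a file system", "Spark is a processing engine"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["is"])  # 2
```

On a real cluster the map and reduce phases run in parallel across nodes, with intermediate results written to disk between stages, which is what tools like Hive compile SQL down to.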

Of course, if we really wanted to complicate things, we could dig deeper into what you consider a data warehouse to be, and we would get a variety of answers that run the spectrum. And if the answer was something like "our data warehouse is really just a repository of data from a handful of sources, without any complex schemas or modeling," then maybe you "could" actually move everything to Hadoop. But since that is fairly academic and probably of limited applicability to most enterprise customers, I'll stick with my original answer: no.

What about Spark, does it replace Hadoop?

Once again: No. Spark is an in-memory processing engine that can run on top of HDFS or stand-alone. As an in-memory engine, Spark is much faster than the traditional MapReduce approach. Spark can process data from HDFS, Hive, Flume and other data sources extremely fast, allowing Hadoop to serve as an effective streaming or real-time analytics platform. Spark can replace MapReduce as the right tool for many jobs, but it is just one part of the Hadoop ecosystem, which includes tools such as MapReduce, Spark, Storm, Hive, HBase, Flume, etc.
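The key difference is easy to picture in plain Python: MapReduce materializes intermediate results to disk between stages, while Spark builds a lazy chain of transformations that it evaluates in memory when an action is requested. A rough sketch of that lazy-pipeline idea using Python generators (this is an analogy, not the Spark API):

```python
# "Transformations": build a pipeline lazily; nothing is computed yet.
# This loosely mirrors Spark's transformation/action split.
data = range(1, 11)
squared = (x * x for x in data)            # like rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0) # like .filter(lambda x: x % 2 == 0)

# "Action": forces the whole in-memory pipeline to evaluate at once,
# with no intermediate results written out between the two steps.
total = sum(evens)
print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

In real Spark the equivalent would be an RDD or DataFrame lineage distributed across the cluster, with partitions cached in memory for reuse across iterations.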

Are dedicated programmers/developers needed to deploy/manage a Hadoop system? Do I need to hire a Data Scientist?

You will certainly need some folks with Hadoop skills, database/data-management skills, system-admin skills, programming skills and analytics skills. Currently, the market isn't oversaturated with Hadoop admins who possess all of these skills along with several deployments and a few years of management experience under their belts (I think we'll see more over the next few years). Experienced DBAs can usually become effective Hadoop admins, as can good system admins (i.e. folks who know more than just navigating the GUI).
As for the data scientist: they're great if you can find one (and afford them). You're talking about someone who understands statistics, algorithms, coding, data and database technologies, and the underlying business logic. In many cases, companies are leveraging the skills of multiple individuals already on staff rather than hiring a dedicated data scientist.

We hear about a lot of cool “science projects” but what are companies actually doing with Hadoop in production scenarios?

Over the last couple of years, we have seen more organizations using Hadoop in production environments. Some common examples include:

  • Consolidating data from multiple sources/methods into a "data lake"
  • Offloading ETL processes from an existing data warehouse
  • Predictive modeling/analytics (related to security, maintenance, marketing, supply chain, etc.)
  • Real-time or streaming analytics (when front-ended with an in-memory engine like Spark or SAP HANA)

Specific use-case examples are plentiful now across most verticals: healthcare, retail, financial services, manufacturing, etc.
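As an illustration of the ETL-offload pattern, here is a toy extract-transform-load step in Python. The record fields and cleansing rules are made up for the example; the point is the shape of the work that gets moved off the warehouse: parse raw landed data, normalize it, and quarantine bad rows before loading.

```python
import json

# Hypothetical raw events landed in the "data lake" from multiple sources.
raw_records = [
    '{"user": " Alice ", "amount": "19.99", "source": "web"}',
    '{"user": "bob", "amount": "5.00", "source": "mobile"}',
    '{"user": "  carol", "amount": "bad", "source": "web"}',  # malformed amount
]

def transform(raw):
    """Parse, cleanse and normalize one record; return None on bad data."""
    rec = json.loads(raw)
    try:
        amount = float(rec["amount"])
    except ValueError:
        return None  # quarantine malformed rows instead of failing the batch
    return {"user": rec["user"].strip().lower(),
            "amount": amount,
            "source": rec["source"]}

# The "load" step: only clean rows move on to the warehouse.
clean = [r for r in (transform(r) for r in raw_records) if r is not None]
print(len(clean))  # 2
```

At Hadoop scale the same transform would run as a distributed job (MapReduce, Hive or Spark) over files in HDFS, freeing warehouse cycles for query workloads.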

How do I start the Big Data journey? What use cases are low-hanging fruit to try out first?

We were all passionately unanimous in our response to this one – and the answer is: have a use case. Ok, have a well-defined, small in scope, manageable, measurable use case that has the support of the business. Work with the business stakeholders to identify an attainable use case that will return measurable business value and secure their buy-in. One of the most cited reasons for failed big data projects is the lack of a well-defined, business-relevant use case.

Now, here is where we diverged a little. Some of the panelists advocate starting with an operational IT use case (such as offloading ETL or log management) as a first Hadoop project, then using that success as a proof point to secure business buy-in for a more business-centric Hadoop project. While I don't disagree with that approach, I'd still prefer to start with a use case that directly impacts business objectives.

What infrastructure is most appropriate for Hadoop?

One of the key tenets of Hadoop is that it was designed to leverage "commodity" hardware. As our panels were part of an HPE-centric event, we focused on HPE solutions. HPE infrastructure is obviously a great platform for a Hadoop cluster. Additionally, the HPE folks had some very interesting testing and benchmark data showing significant performance gains for some Hadoop workloads using HPE Moonshot systems with 3PAR arrays. Yes, the approach is completely counter-intuitive, but the test results were compelling. Of course, there are many solid infrastructure options for Hadoop (Cisco, IBM and Lenovo, to name a few), many of which have validated reference architectures or frameworks for Hadoop, making design and deployment MUCH easier. There are even a few somewhat "turn-key" Hadoop infrastructure solutions that can be delivered pre-configured, pre-integrated and workload-optimized, either from the manufacturer or from Avnet.

There were plenty of other questions, but in the interest of brevity (I realize that ship may have already sailed at this point), I'll stop here. These seemed to be the most common questions across the three audiences. I won't subject you to stuff about the law of large numbers and statistical inference as related to Hadoop, unless that's your thing, in which case please feel free to reach out directly.

Bio: Bruce Gilroy has more than 14 years of experience in the technology channel, and has held various technical, sales and solution development roles. Bruce is currently responsible for Converged Infrastructure and Big Data & Analytics solutions at Avnet, where he is focused on solutions development and channel enablement.



Reply #1
hjtoh, posted 2016-01-24 10:28:40 (from mobile): Looks pretty good!

Reply #2
bailihongchen, posted 2016-01-24 14:44:26: thanks for sharing
