
Making Data Science Accessible – HDFS

Posted by oliyiyi on 2016-8-7 15:47:40


By Dan Kellett, Director of Data Science, Capital One.

Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.

What is HDFS?

When people talk about ‘Hadoop’ they are usually referring to either the efficient storage or the efficient processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm (see my previous blog here). The standard approach to reliable, scalable data storage in Hadoop is HDFS (Hadoop Distributed File System).
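A quick way to get a feel for the storage side is the HDFS command-line client. Below is a minimal sketch in Python (not from the original post) that shells out to the standard hdfs dfs commands to copy a local file of feeding records into the distributed file system and list it back; the file and directory names, and the assumption that a configured Hadoop client is on the PATH, are purely illustrative.

import subprocess

# Create a directory in HDFS and copy a local file of feeding records into it.
# Assumes a configured Hadoop client (the "hdfs" command) is on the PATH;
# the file and directory names are purely illustrative.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/animals"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "feeding_records.csv", "/data/animals/"], check=True)

# List the directory to confirm the file is now stored (and replicated)
# across the cluster's data nodes.
subprocess.run(["hdfs", "dfs", "-ls", "/data/animals"], check=True)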

Imagine you wanted to find out how much food each type of animal eats in a day. How would you do it?

The gigantic warehouse

One approach would be to buy or rent out a huge warehouse and store some of every type of animal in the world. Then you could study each type one at a time to get the information you needed – presumably starting with aardvarks and finishing with zebras.



The downside of this approach (other than the smell, and the risk that the lions would eat the antelopes) is that it would take a very long time and be very expensive (renting a huge building for many years adds up).

This is similar to the approach we take with our data at the moment: a huge amount of information about our customers is available, but we are restricted to analyzing it through a fairly narrow pipeline.

Mini-centers

An alternative approach would be to split up the animals and send them to lots of smaller centers, where each group could be assessed separately. Maybe the penguins go to Penzance, the lemurs to Leeds and the oxen to Oxford.



Each center could then study its animals and send the information back to a head office to be summarized. This would be much faster and a lot cheaper.

This is an example of parallel processing whereby you split up your data, analyze each part separately and bring it back together at the end.
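The same split/analyze/combine idea can be sketched locally with Python's multiprocessing module. This is only an illustration of the pattern (the records and function names are invented, not part of the original post): each worker process plays the role of a mini-center, totalling the food eaten by its share of the animals, and a final step at ‘head office’ combines the partial results.

from collections import Counter
from multiprocessing import Pool

# Illustrative records: (animal, kg of food eaten in one day).
records = [
    ("penguin", 2.0), ("lemur", 0.5), ("ox", 30.0),
    ("penguin", 1.8), ("rabbit", 0.3), ("ox", 28.0),
]

def summarise(chunk):
    """Each 'mini-center' totals the food eaten by its own animals."""
    totals = Counter()
    for animal, kg in chunk:
        totals[animal] += kg
    return totals

def split(data, n):
    """Split the records into n roughly equal parts."""
    return [data[i::n] for i in range(n)]

if __name__ == "__main__":
    # Analyze each part in a separate worker process...
    with Pool(processes=3) as pool:
        partials = pool.map(summarise, split(records, 3))
    # ...then 'head office' combines the partial summaries.
    overall = sum(partials, Counter())
    print(dict(overall))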

Disaster!

The key drawback is that you are highly susceptible to failure in one of these mini-centers. What happens if there’s a break-in at the center in Lincoln and all the chickens escape? What if the systems go down in Dundee and all the information on sparrows is lost?



The solution is still to use mini-centers, but to send each type of animal to several of them. For example, you might send some rabbits to Birmingham, some to Edinburgh and some to Cardiff. You are then protected from individual failures and can still carry out the survey quickly.

This is, at a very high level, what HDFS (Hadoop Distributed File System) does. Data are split up and replicated across nodes so that the overall task is not affected if an individual node fails. Each node carries out its designated task and then passes its results back to a central node to be aggregated.
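In HDFS the protection against a single failure comes from block replication: each block of a file is stored on several data nodes (three copies is a common default). As a small, hedged illustration, the snippet below uses the real hdfs dfs -setrep command to request three replicas of a file; the path is the same illustrative one used earlier and again assumes a configured Hadoop client.

import subprocess

# Ask HDFS to keep three replicas of the file (the common default), so that
# losing any single data node does not lose the data; -w waits until the
# requested replication level has actually been reached.
subprocess.run(
    ["hdfs", "dfs", "-setrep", "-w", "3", "/data/animals/feeding_records.csv"],
    check=True,
)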

When would I use HDFS?

As with any technique in data science, HDFS is one of many approaches you could take to solve a business problem involving large amounts of data. The key is being able to pick and choose when to take HDFS off the shelf. At a high level, HDFS may help you if your problem is huge but not hard: if you can parallelize your problem, then HDFS coupled with MapReduce should help.
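One common entry point for the ‘coupled with MapReduce’ part is Hadoop Streaming, which lets the map and reduce steps be plain scripts that read from stdin and write tab-separated key/value pairs. The sketch below is a hedged illustration rather than the author's code: the file name, the HDFS paths and the assumption that input lines look like "animal,kg" are all mine. It totals the food eaten per type of animal across the cluster.

#!/usr/bin/env python3
"""Hadoop Streaming sketch: total food eaten per type of animal.

The same script acts as mapper or reducer depending on its first argument.
A typical (illustrative) invocation, assuming the streaming jar shipped with
your Hadoop distribution:

  hadoop jar hadoop-streaming.jar \
      -input /data/animals -output /data/animal_totals \
      -mapper "python3 food_totals.py map" \
      -reducer "python3 food_totals.py reduce" \
      -file food_totals.py
"""
import sys
from collections import defaultdict

def mapper():
    # Each input line is assumed to look like "animal,kg_of_food".
    # Emit "animal<TAB>kg" so Hadoop can group the output by animal.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        animal, kg = line.split(",")
        print(f"{animal}\t{kg}")

def reducer():
    # Hadoop sorts the mapper output by key, so all lines for one animal
    # arrive at the same reducer; here we simply accumulate the totals.
    totals = defaultdict(float)
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        animal, kg = line.split("\t")
        totals[animal] += float(kg)
    for animal, total in totals.items():
        print(f"{animal}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()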

