
Making Data Science Accessible – HDFS

Posted by oliyiyi on 2016-8-7 15:47:40


By Dan Kellett, Director of Data Science, Capital One.

Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.

What is HDFS?

When people talk about ‘Hadoop’ they are usually referring to either the efficient storage or the efficient processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm (see my previous blog here). The standard approach to reliable, scalable data storage in Hadoop is HDFS (Hadoop Distributed File System).
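A quick way to get a feel for the storage side is the HDFS command-line client. Below is a minimal sketch in Python (not from the original post) that shells out to the standard hdfs dfs commands to copy a local file of feeding records into the distributed file system and list it back; the file and directory names, and the assumption that a configured Hadoop client is on the PATH, are purely illustrative.

import subprocess

# Create a directory in HDFS and copy a local file of feeding records into it.
# Assumes a configured Hadoop client (the "hdfs" command) is on the PATH;
# the file and directory names are purely illustrative.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/animals"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "feeding_records.csv", "/data/animals/"], check=True)

# List the directory to confirm the file is now stored (and replicated)
# across the cluster's data nodes.
subprocess.run(["hdfs", "dfs", "-ls", "/data/animals"], check=True)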

Imagine you wanted to find out how much food each type of animal eats in a day. How would you do it?

The gigantic warehouse

One approach would be to buy or rent out a huge warehouse and store some of every type of animal in the world. Then you could study each type one at a time to get the information you needed – presumably starting with aardvarks and finishing with zebras.



The downside of this approach (other than the smell, and the risk that the lions would eat the antelopes) is that it would take a very long time and be very expensive (renting a huge building for many years adds up).

This is similar to the approach we take with our data at the moment: a huge amount of information about our customers is available, but we are restricted to analyzing it through a fairly narrow pipeline.

Mini-centers

An alternative approach would be to split up the animals and send them to lots of smaller centers, where each group could be assessed separately. Maybe the penguins go to Penzance, the lemurs to Leeds and the oxen to Oxford.



Each center could then study its animals and send the information back to a head office to be summarized. This would be much faster and a lot cheaper.

This is an example of parallel processing whereby you split up your data, analyze each part separately and bring it back together at the end.
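The same split/analyze/combine idea can be sketched locally with Python's multiprocessing module. This is only an illustration of the pattern (the records and function names are invented, not part of the original post): each worker process plays the role of a mini-center, totalling the food eaten by its share of the animals, and a final step at ‘head office’ combines the partial results.

from collections import Counter
from multiprocessing import Pool

# Illustrative records: (animal, kg of food eaten in one day).
records = [
    ("penguin", 2.0), ("lemur", 0.5), ("ox", 30.0),
    ("penguin", 1.8), ("rabbit", 0.3), ("ox", 28.0),
]

def summarise(chunk):
    """Each 'mini-center' totals the food eaten by its own animals."""
    totals = Counter()
    for animal, kg in chunk:
        totals[animal] += kg
    return totals

def split(data, n):
    """Split the records into n roughly equal parts."""
    return [data[i::n] for i in range(n)]

if __name__ == "__main__":
    # Analyze each part in a separate worker process...
    with Pool(processes=3) as pool:
        partials = pool.map(summarise, split(records, 3))
    # ...then 'head office' combines the partial summaries.
    overall = sum(partials, Counter())
    print(dict(overall))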

Disaster!

The key drawback is that you are highly susceptible to failure in one of these mini-centers. What happens if there’s a break-in at the center in Lincoln and all the chickens escape? What if the systems go down in Dundee and all the information on sparrows is lost?



The solution is still to use mini-centers, but to send each type of animal to several of them. For example, you might send some rabbits to Birmingham, some to Edinburgh and some to Cardiff. You are then protected from individual failures and can still carry out the survey quickly.

This is, at a very high level, what HDFS (Hadoop Distributed File System) does. Data are split up and replicated across nodes so that the overall task is not affected if an individual node fails. Each node carries out its designated task and then passes its results back to a central node to be aggregated.
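In HDFS the protection against a single failure comes from block replication: each block of a file is stored on several data nodes (three copies is a common default). As a small, hedged illustration, the snippet below uses the real hdfs dfs -setrep command to request three replicas of a file; the path is the same illustrative one used earlier and again assumes a configured Hadoop client.

import subprocess

# Ask HDFS to keep three replicas of the file (the common default), so that
# losing any single data node does not lose the data; -w waits until the
# requested replication level has actually been reached.
subprocess.run(
    ["hdfs", "dfs", "-setrep", "-w", "3", "/data/animals/feeding_records.csv"],
    check=True,
)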

When would I use HDFS?

As with any technique in data science, HDFS is one of many approaches you could take to solve a business problem involving large amounts of data. The key is being able to pick and choose when to take HDFS off the shelf. At a high level, HDFS may help you if your problem is huge but not hard: if you can parallelize your problem, then HDFS coupled with MapReduce should help.
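One common entry point for the ‘coupled with MapReduce’ part is Hadoop Streaming, which lets the map and reduce steps be plain scripts that read from stdin and write tab-separated key/value pairs. The sketch below is a hedged illustration rather than the author's code: the file name, the HDFS paths and the assumption that input lines look like "animal,kg" are all mine. It totals the food eaten per type of animal across the cluster.

#!/usr/bin/env python3
"""Hadoop Streaming sketch: total food eaten per type of animal.

The same script acts as mapper or reducer depending on its first argument.
A typical (illustrative) invocation, assuming the streaming jar shipped with
your Hadoop distribution:

  hadoop jar hadoop-streaming.jar \
      -input /data/animals -output /data/animal_totals \
      -mapper "python3 food_totals.py map" \
      -reducer "python3 food_totals.py reduce" \
      -file food_totals.py
"""
import sys
from collections import defaultdict

def mapper():
    # Each input line is assumed to look like "animal,kg_of_food".
    # Emit "animal<TAB>kg" so Hadoop can group the output by animal.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        animal, kg = line.split(",")
        print(f"{animal}\t{kg}")

def reducer():
    # Hadoop sorts the mapper output by key, so all lines for one animal
    # arrive at the same reducer; here we simply accumulate the totals.
    totals = defaultdict(float)
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        animal, kg = line.split("\t")
        totals[animal] += float(kg)
    for animal, total in totals.items():
        print(f"{animal}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()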

