What is the “secondary sorting” problem? It is the problem of sorting the values associated with a key in the reduce phase. The most common solution is sometimes called “value-to-key conversion.” The secondary sorting technique enables us to sort the values (in ascending or descending order) passed to each reducer.
The goal of this chapter is to implement the “secondary sort” design pattern in MapReduce/Hadoop and Spark. In software design and programming, a design pattern is a reusable algorithmic solution to a commonly occurring problem; typically, a design pattern is not presented in a specific programming language, but can be implemented in many languages.
The MapReduce framework automatically sorts the keys generated by mappers. This means that, before the reducers start, all intermediate (key, value) pairs generated by the mappers must be sorted by key (not by value). The values passed to each reducer are not sorted at all; they can arrive in any order. What if we also want to sort a reducer’s values? MapReduce/Hadoop and Spark do not sort values for a reducer, yet for some applications (such as time series data) you need the reducer’s data to be sorted. The Secondary Sort design pattern enables us to sort a reducer’s values.
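The core idea behind value-to-key conversion can be sketched outside any cluster framework. In this minimal simulation (the stock symbols, dates, and prices are hypothetical sample data, not from a real dataset), the part of the value we want sorted is promoted into a composite key, so that an ordinary sort by key also orders the values — which is exactly the work the shuffle phase would do for us in MapReduce:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical time-series records emitted by mappers in arbitrary order:
# (stock_symbol, (date, price)) pairs.
pairs = [
    ("IBM", (3, 104.0)),
    ("IBM", (1, 100.0)),
    ("GOOG", (2, 212.0)),
    ("IBM", (2, 102.0)),
    ("GOOG", (1, 210.0)),
]

# Value-to-key conversion: promote the date from the value into a
# composite key (symbol, date). Sorting by the composite key orders
# records first by symbol, then by date.
composite = [((sym, date), price) for sym, (date, price) in pairs]
composite.sort(key=itemgetter(0))

# Group by the natural key (the symbol) only; within each group the
# prices now arrive in date order, already sorted for the "reducer".
result = {
    sym: [price for (_, _), price in grp]
    for sym, grp in groupby(composite, key=lambda kv: kv[0][0])
}
print(result)  # {'GOOG': [210.0, 212.0], 'IBM': [100.0, 102.0, 104.0]}
```

In real MapReduce/Hadoop, the same effect is achieved with a custom partitioner and grouping comparator rather than an in-memory sort, but the composite-key trick is identical.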
First we focus on the MapReduce/Hadoop solution. Let’s look at the MapReduce paradigm and then explain the concept of the secondary sort:
map(key1, value1) → list(key2, value2)
reduce(key2, list(value2)) → list(key3, value3)
First, the map() function receives a key-value pair input, (key1, value1), and outputs any number of key-value pairs, (key2, value2). Second, the reduce() function receives as input another key-value pair, (key2, list(value2)), and outputs any number of (key3, value3) pairs.
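This two-phase paradigm can be simulated in a few lines of ordinary code. The sketch below (a toy word count; the function names and driver loop are illustrative, not a framework API) shows map() emitting (key2, value2) pairs, a grouping step playing the role of the shuffle, and reduce() consuming (key2, list(value2)):

```python
from collections import defaultdict

def map_fn(key1, value1):
    # key1: line number, value1: line of text.
    # Emits a list of (key2, value2) pairs, one per word.
    return [(word, 1) for word in value1.split()]

def reduce_fn(key2, values2):
    # values2 is list(value2): every count observed for this word.
    return (key2, sum(values2))

# Drive the two phases: map, then group by key2 (the "shuffle"),
# then reduce each group.
lines = {1: "a b a", 2: "b a"}
groups = defaultdict(list)
for key1, value1 in lines.items():
    for key2, value2 in map_fn(key1, value1):
        groups[key2].append(value2)

results = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(results)  # {'a': 3, 'b': 2}
```

Note that nothing in the grouping step sorts the collected values — it only collects them per key. That is precisely the gap the secondary sort pattern fills.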
Now consider the following key-value pair, (key2, list(value2)), as input to a reducer:
list(value2) = (V1, V2, ..., Vn)