What is the “secondary sorting” problem? It is the problem of sorting the values associated with a key in the reduce phase. The most common solution is sometimes called “value-to-key conversion.” The secondary sorting technique enables us to sort the values (in ascending or descending order) passed to each reducer.
The goal of this chapter is to implement the “secondary sort” design pattern in MapReduce/Hadoop and Spark. In software design and programming, a design pattern is a reusable algorithmic solution to a commonly occurring problem; typically, a design pattern is not presented in a specific programming language, but can be implemented in many languages.
The MapReduce framework automatically sorts the keys generated by mappers. This means that, before the reducers start, all intermediate (key, value) pairs generated by the mappers must be sorted by key (not by value). The values passed to each reducer are not sorted at all; they can arrive in any order. What if we also want to sort a reducer’s values? MapReduce/Hadoop and Spark do not sort values for a reducer, yet for some applications (such as time series data) you need the reducer’s data to be sorted. The Secondary Sort design pattern enables us to sort a reducer’s values.
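The core idea behind value-to-key conversion can be sketched outside any cluster framework. In this minimal simulation (the stock symbols, dates, and prices are hypothetical sample data, not from a real dataset), the part of the value we want sorted is promoted into a composite key, so that an ordinary sort by key also orders the values — which is exactly the work the shuffle phase would do for us in MapReduce:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical time-series records emitted by mappers in arbitrary order:
# (stock_symbol, (date, price)) pairs.
pairs = [
    ("IBM", (3, 104.0)),
    ("IBM", (1, 100.0)),
    ("GOOG", (2, 212.0)),
    ("IBM", (2, 102.0)),
    ("GOOG", (1, 210.0)),
]

# Value-to-key conversion: promote the date from the value into a
# composite key (symbol, date). Sorting by the composite key orders
# records first by symbol, then by date.
composite = [((sym, date), price) for sym, (date, price) in pairs]
composite.sort(key=itemgetter(0))

# Group by the natural key (the symbol) only; within each group the
# prices now arrive in date order, already sorted for the "reducer".
result = {
    sym: [price for (_, _), price in grp]
    for sym, grp in groupby(composite, key=lambda kv: kv[0][0])
}
print(result)  # {'GOOG': [210.0, 212.0], 'IBM': [100.0, 102.0, 104.0]}
```

In real MapReduce/Hadoop, the same effect is achieved with a custom partitioner and grouping comparator rather than an in-memory sort, but the composite-key trick is identical.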
First we focus on the MapReduce/Hadoop solution. Let’s look at the MapReduce paradigm and then explain the concept of the secondary sort:
map(key1, value1) → list(key2, value2)
reduce(key2, list(value2)) → list(key3, value3)
First, the map() function receives a key-value pair input, (key1, value1), and outputs any number of key-value pairs, (key2, value2). Second, the reduce() function receives as input another key-value pair, (key2, list(value2)), and outputs any number of (key3, value3) pairs.
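This two-phase paradigm can be simulated in a few lines of ordinary code. The sketch below (a toy word count; the function names and driver loop are illustrative, not a framework API) shows map() emitting (key2, value2) pairs, a grouping step playing the role of the shuffle, and reduce() consuming (key2, list(value2)):

```python
from collections import defaultdict

def map_fn(key1, value1):
    # key1: line number, value1: line of text.
    # Emits a list of (key2, value2) pairs, one per word.
    return [(word, 1) for word in value1.split()]

def reduce_fn(key2, values2):
    # values2 is list(value2): every count observed for this word.
    return (key2, sum(values2))

# Drive the two phases: map, then group by key2 (the "shuffle"),
# then reduce each group.
lines = {1: "a b a", 2: "b a"}
groups = defaultdict(list)
for key1, value1 in lines.items():
    for key2, value2 in map_fn(key1, value1):
        groups[key2].append(value2)

results = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(results)  # {'a': 3, 'b': 2}
```

Note that nothing in the grouping step sorts the collected values — it only collects them per key. That is precisely the gap the secondary sort pattern fills.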
Now consider the following key-value pair, (key2, list(value2)), as input to a reducer:
list(value2) = (V1, V2, ..., Vn)