1 NameNode
The NameNode is the most important component of the Hadoop architecture. Hadoop uses a master/slave structure for both distributed computation and distributed storage; the distributed storage system is Hadoop's file system, HDFS. The NameNode runs on the HDFS master server. It directs the DataNode daemons on the slave servers in performing I/O tasks, records how each file is split into blocks and which nodes store those blocks, and monitors whether the whole Hadoop cluster is running properly.
2 DataNode
The DataNode daemons run on the slave servers of the Hadoop cluster: every slave node runs one DataNode daemon, which is responsible for reading and writing HDFS data blocks on the local file system. When a file is read, for example, the file has been split into multiple blocks, so the NameNode tells the client which DataNode stores each block, and the client then communicates directly with the DataNode daemons to perform the read.
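The read path described above can be sketched as a toy in-memory model (the class and method names here are illustrative, not the real HDFS client API): the NameNode answers only the metadata question of which block holds which part of a file, and the client fetches block contents directly from the DataNodes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the HDFS read path: metadata lookup goes to the "NameNode",
// bulk data transfer goes directly to the "DataNodes".
public class ReadPathSketch {
    // NameNode side: file name -> ordered list of block ids
    static final Map<String, List<String>> blockLocations = new HashMap<>();
    // DataNode side: block id -> block contents on the local file system
    static final Map<String, String> dataNodeStorage = new HashMap<>();

    static {
        dataNodeStorage.put("blk_1", "hello ");
        dataNodeStorage.put("blk_2", "world");
        blockLocations.put("/user/hadoop/demo.txt", Arrays.asList("blk_1", "blk_2"));
    }

    static List<String> askNameNode(String file) {     // metadata lookup only
        return blockLocations.get(file);
    }

    static String readFromDataNode(String blockId) {   // bulk data transfer
        return dataNodeStorage.get(blockId);
    }

    static String readFile(String file) {
        StringBuilder content = new StringBuilder();
        for (String blk : askNameNode(file)) {
            content.append(readFromDataNode(blk));     // client talks to DataNodes directly
        }
        return content.toString();
    }

    public static void main(String[] args) {
        System.out.println(readFile("/user/hadoop/demo.txt")); // prints "hello world"
    }
}
```

The point of the split is scalability: the NameNode never sits on the data path, so reading a 200 MB file costs it only one small metadata lookup.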
3 Secondary NameNode
The Secondary NameNode is an auxiliary daemon that monitors the state of the HDFS cluster. Like the NameNode, it usually occupies a server of its own, but it does not receive or record real-time changes to HDFS. Instead, it communicates only with the NameNode, taking snapshots of the HDFS metadata at intervals configured for the cluster. These snapshots help limit the damage from the NameNode being a single point of failure: if the NameNode fails, the Secondary NameNode's checkpoint can be used to bring up a replacement NameNode and minimize the loss. Note, however, that this is a manual recovery aid rather than automatic failover; the NameNode remains a single point of failure in this version of Hadoop.
4 JobTracker
The JobTracker is the link between applications and Hadoop, and it usually runs on the master server of the Hadoop cluster. When a job is submitted, the JobTracker determines the execution plan, including which files to process and which nodes to assign each task to. If a task fails, the JobTracker automatically restarts it.
5 TaskTracker
As the master node, the JobTracker monitors the overall execution of a MapReduce job, while the TaskTrackers manage the execution of the individual tasks on each slave node. Each TaskTracker is responsible for executing the tasks assigned to it by the JobTracker, and it can spawn multiple JVMs to process several map or reduce tasks in parallel. Another important duty of the TaskTracker is to communicate continuously with the JobTracker: the JobTracker receives a "heartbeat" from each TaskTracker at regular intervals, and if no heartbeat arrives within a certain time, the JobTracker assumes that TaskTracker has crashed and reassigns its tasks to other nodes.
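The failure-detection side of the heartbeat protocol can be sketched as follows (illustrative names only; the real logic lives inside org.apache.hadoop.mapred and is considerably more involved). The 10-minute expiry matches the Hadoop 1.x default for mapred.tasktracker.expiry.interval.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of heartbeat-based failure detection: the JobTracker records
// the time of each TaskTracker's last heartbeat and declares a tracker dead
// once that timestamp is older than the expiry interval, at which point the
// tracker's tasks would be rescheduled on other nodes.
public class HeartbeatMonitor {
    static final long EXPIRY_MS = 10 * 60 * 1000;    // 10-minute timeout (Hadoop 1.x default)
    final Map<String, Long> lastHeartbeat = new HashMap<>();

    void onHeartbeat(String tracker, long now) {     // called for every heartbeat received
        lastHeartbeat.put(tracker, now);
    }

    boolean isDead(String tracker, long now) {       // true -> reassign its tasks elsewhere
        Long last = lastHeartbeat.get(tracker);
        return last == null || now - last > EXPIRY_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitor jt = new HeartbeatMonitor();
        jt.onHeartbeat("tracker_node1:50060", 0L);
        System.out.println(jt.isDead("tracker_node1:50060", 60_000L));      // false: heard 1 min ago
        System.out.println(jt.isDead("tracker_node1:50060", 11 * 60_000L)); // true: silent for 11 min
    }
}
```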
Chapter 4 of Hadoop in Action contains a patent-data analysis program that uses the patent citation data set cite75_99.txt (the package downloaded from the web needs to be converted with other tools first; if you need the data set, leave your email address and I will share it). The data looks like this:
"CITING","CITED"  →  citing patent number, cited patent number
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
...
The exercise in Hadoop in Action is to read the patent data and invert it, so that each line of the inverted output is a cited patent number followed by the list of patent numbers that cite it: cited patent number → citing patent number 1, citing patent number 2, ..., citing patent number n.
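The inversion itself can first be sketched in plain Java, as an in-memory model of what the MapReduce job accomplishes: map emits each (cited, citing) pair, and reduce joins all citing numbers for one cited key with commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InvertSketch {
    // Groups citing patents by the patent they cite: the in-memory
    // equivalent of the map-side key/value swap plus reduce-side join.
    static Map<String, String> invert(String[][] citingCitedPairs) {
        Map<String, String> inverted = new LinkedHashMap<>();
        for (String[] pair : citingCitedPairs) {
            String citing = pair[0], cited = pair[1];
            // append with a comma if the cited patent was already seen
            inverted.merge(cited, citing, (a, b) -> a + "," + b);
        }
        return inverted;
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"3858241", "956203"},
            {"3858241", "1324234"},
            {"3858242", "1324234"},   // two patents cite 1324234
        };
        System.out.println(invert(pairs));
        // prints {956203=3858241, 1324234=3858241,3858242}
    }
}
```

The MapReduce version below does exactly this, except that the "grouping by cited key" is done for free by the framework's sort-and-shuffle phase, which is what lets it scale past what fits in one HashMap.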
The MapReduce program (using the old org.apache.hadoop.mapred API) is as follows:
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PatentSort extends Configured implements Tool {

    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
            output.collect(value, key); // swap key and value to invert the citation
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterator<Text> values, // each reduce call handles all map outputs sharing one key
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String value = "";
            while (values.hasNext()) { // join the citing patent numbers with commas
                if (value.length() > 0) {
                    value += ",";
                }
                value += values.next().toString();
            }
            output.collect(key, new Text(value));
        }
    }

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new PatentSort(), args);
        System.exit(result);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        JobConf job = new JobConf(conf, PatentSort.class);
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setJobName("PatentSort");
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ","); // separator between key and value in each input line
        JobClient.runJob(job);
        return 0;
    }
}
I am posting this experiment because the data it uses is much larger than the data used for the WordCount program: the data set is over 200 MB, and while the job runs you can watch the splitting, merging, and output phases in the MyEclipse console, which helps a lot in understanding how a MapReduce job executes. Part of the output is shown below:
13/09/15 00:45:10 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:10 INFO mapred.MapTask: bufstart = 46057106; bufend = 50245047; bufvoid = 99614720
13/09/15 00:45:10 INFO mapred.MapTask: kvstart = 262136; kvend = 196599; length = 327680
13/09/15 00:45:11 INFO mapred.MapTask: Finished spill 11
13/09/15 00:45:11 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:11 INFO mapred.MapTask: bufstart = 50245047; bufend = 54433427; bufvoid = 99614720
13/09/15 00:45:11 INFO mapred.MapTask: kvstart = 196599; kvend = 131062; length = 327680
13/09/15 00:45:11 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/hadoop/in/CITE75_9.txt:0+67108864
13/09/15 00:45:11 INFO mapred.MapTask: Finished spill 12
13/09/15 00:45:12 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:12 INFO mapred.MapTask: bufstart = 54433427; bufend = 58621727; bufvoid = 99614720
13/09/15 00:45:12 INFO mapred.MapTask: kvstart = 131062; kvend = 65525; length = 327680
13/09/15 00:45:12 INFO mapred.MapTask: Finished spill 13
13/09/15 00:45:12 INFO mapred.JobClient: map 87% reduce 0%
13/09/15 00:45:12 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:12 INFO mapred.MapTask: bufstart = 58621727; bufend = 62809851; bufvoid = 99614720
13/09/15 00:45:12 INFO mapred.MapTask: kvstart = 65525; kvend = 327669; length = 327680
13/09/15 00:45:12 INFO mapred.MapTask: Starting flush of map output
13/09/15 00:45:13 INFO mapred.MapTask: Finished spill 14
13/09/15 00:45:13 INFO mapred.MapTask: Finished spill 15
13/09/15 00:45:13 INFO mapred.Merger: Merging 16 sorted segments
13/09/15 00:45:13 INFO mapred.Merger: Merging 7 intermediate segments out of a total of 16
13/09/15 00:45:14 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/hadoop/in/CITE75_9.txt:0+67108864
13/09/15 00:45:14 INFO mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 71062652 bytes
13/09/15 00:45:15 INFO mapred.JobClient: map 100% reduce 0%
13/09/15 00:45:17 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/09/15 00:45:17 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/hadoop/in/CITE75_9.txt:0+67108864
13/09/15 00:45:17 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/hadoop/in/CITE75_9.txt:0+67108864
13/09/15 00:45:17 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/09/15 00:45:17 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@82751
13/09/15 00:45:17 INFO mapred.MapTask: numReduceTasks: 1
13/09/15 00:45:17 INFO mapred.MapTask: io.sort.mb = 100
13/09/15 00:45:17 INFO mapred.MapTask: data buffer = 79691776/99614720
13/09/15 00:45:17 INFO mapred.MapTask: record buffer = 262144/327680
13/09/15 00:45:18 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:18 INFO mapred.MapTask: bufstart = 0; bufend = 4188318; bufvoid = 99614720
13/09/15 00:45:18 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
13/09/15 00:45:19 INFO mapred.MapTask: Finished spill 0
13/09/15 00:45:19 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:19 INFO mapred.MapTask: bufstart = 4188318; bufend = 8376823; bufvoid = 99614720
13/09/15 00:45:19 INFO mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
13/09/15 00:45:19 INFO mapred.MapTask: Finished spill 1
13/09/15 00:45:19 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:19 INFO mapred.MapTask: bufstart = 8376823; bufend = 12565560; bufvoid = 99614720
13/09/15 00:45:19 INFO mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
13/09/15 00:45:20 INFO mapred.MapTask: Finished spill 2
13/09/15 00:45:20 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:20 INFO mapred.MapTask: bufstart = 12565560; bufend = 16754556; bufvoid = 99614720
13/09/15 00:45:20 INFO mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
13/09/15 00:45:20 INFO mapred.MapTask: Finished spill 3
13/09/15 00:45:21 INFO mapred.MapTask: Spilling map output: record full = true
13/09/15 00:45:21 INFO mapred.MapTask: bufstart = 16754556; bufend = 20943460; bufvoid = 99614720
13/09/15 00:45:21 INFO mapred.MapTask: kvstart = 65533; kvend = 327677; length = 327680
....
13/09/15 00:45:55 INFO mapred.MapTask: Finished spill 12
13/09/15 00:45:55 INFO mapred.MapTask: Starting flush of map output
13/09/15 00:45:55 INFO mapred.MapTask: Finished spill 13
13/09/15 00:45:55 INFO mapred.Merger: Merging 14 sorted segments
13/09/15 00:45:55 INFO mapred.Merger: Merging 5 intermediate segments out of a total of 14
13/09/15 00:45:56 INFO mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 65367310 bytes
13/09/15 00:45:56 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/hadoop/in/CITE75_9.txt:201326592+61733814
13/09/15 00:45:57 INFO mapred.JobClient: map 100% reduce 0%
13/09/15 00:45:59 INFO mapred.Task: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
13/09/15 00:45:59 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/hadoop/in/CITE75_9.txt:201326592+61733814
13/09/15 00:45:59 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/hadoop/in/CITE75_9.txt:201326592+61733814
13/09/15 00:45:59 INFO mapred.Task: Task 'attempt_local_0001_m_000003_0' done.
13/09/15 00:45:59 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5c3987
13/09/15 00:45:59 INFO mapred.LocalJobRunner:
13/09/15 00:45:59 INFO mapred.Merger: Merging 4 sorted segments
13/09/15 00:45:59 INFO mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 278550201 bytes
13/09/15 00:45:59 INFO mapred.LocalJobRunner:
13/09/15 00:46:05 INFO mapred.LocalJobRunner: reduce > reduce
13/09/15 00:46:06 INFO mapred.JobClient: map 100% reduce 78%
13/09/15 00:46:08 INFO mapred.LocalJobRunner: reduce > reduce
13/09/15 00:46:09 INFO mapred.JobClient: map 100% reduce 85%
13/09/15 00:46:11 INFO mapred.LocalJobRunner: reduce > reduce
13/09/15 00:46:12 INFO mapred.JobClient: map 100% reduce 92%
13/09/15 00:46:14 INFO mapred.LocalJobRunner: reduce > reduce
13/09/15 00:46:15 INFO mapred.JobClient: map 100% reduce 97%
13/09/15 00:46:15 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/09/15 00:46:15 INFO mapred.LocalJobRunner: reduce > reduce
13/09/15 00:46:15 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/09/15 00:46:15 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/user/hadoop/out/patentsort
13/09/15 00:46:17 INFO mapred.LocalJobRunner: reduce > reduce
13/09/15 00:46:17 INFO mapred.LocalJobRunner: reduce > reduce
13/09/15 00:46:17 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/09/15 00:46:18 INFO mapred.JobClient: map 100% reduce 100%
13/09/15 00:46:18 INFO mapred.JobClient: Job complete: job_local_0001
13/09/15 00:46:18 INFO mapred.JobClient: Counters: 23
13/09/15 00:46:18 INFO mapred.JobClient: File Input Format Counters
13/09/15 00:46:18 INFO mapred.JobClient: Bytes Read=263068601
13/09/15 00:46:18 INFO mapred.JobClient: File Output Format Counters
13/09/15 00:46:18 INFO mapred.JobClient: Bytes Written=149348170
13/09/15 00:46:18 INFO mapred.JobClient: FileSystemCounters
13/09/15 00:46:18 INFO mapred.JobClient: FILE_BYTES_READ=1705046007
13/09/15 00:46:18 INFO mapred.JobClient: HDFS_BYTES_READ=928810869
13/09/15 00:46:18 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2410163852
13/09/15 00:46:18 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=149348170
13/09/15 00:46:18 INFO mapred.JobClient: Map-Reduce Framework
13/09/15 00:46:18 INFO mapred.JobClient: Map output materialized bytes=278550217
13/09/15 00:46:18 INFO mapred.JobClient: Map input records=15489787
13/09/15 00:46:18 INFO mapred.JobClient: Reduce shuffle bytes=0
13/09/15 00:46:18 INFO mapred.JobClient: Spilled Records=53285087
13/09/15 00:46:18 INFO mapred.JobClient: Map output bytes=247570619
13/09/15 00:46:18 INFO mapred.JobClient: Total committed heap usage (bytes)=1427439616
13/09/15 00:46:18 INFO mapred.JobClient: CPU time spent (ms)=0
13/09/15 00:46:18 INFO mapred.JobClient: Map input bytes=263060406
13/09/15 00:46:18 INFO mapred.JobClient: SPLIT_RAW_BYTES=408
13/09/15 00:46:18 INFO mapred.JobClient: Combine input records=0
13/09/15 00:46:18 INFO mapred.JobClient: Reduce input records=15489787
13/09/15 00:46:18 INFO mapred.JobClient: Reduce input groups=3199417
13/09/15 00:46:18 INFO mapred.JobClient: Combine output records=0
13/09/15 00:46:18 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
13/09/15 00:46:18 INFO mapred.JobClient: Reduce output records=3199417
13/09/15 00:46:18 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
13/09/15 00:46:18 INFO mapred.JobClient: Map output records=15489787
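The buffer sizes in the log above follow directly from the old MapTask sort-buffer layout, assuming the Hadoop 1.x defaults io.sort.record.percent = 0.05, io.sort.spill.percent = 0.8, and 16 bytes of accounting information per record (these defaults are my assumption; only io.sort.mb = 100 appears explicitly in the log):

```java
// Derives the sort-buffer figures printed by MapTask from io.sort.mb = 100.
// Integer arithmetic is used so the percentages come out exact.
public class SortBufferMath {
    static final long IO_SORT_MB = 100;                          // io.sort.mb (from the log)
    static final long TOTAL = IO_SORT_MB * 1024 * 1024;          // 104857600 bytes
    static final long DATA_BUFFER = TOTAL / 100 * 95;            // 95% holds serialized records
    static final long RECORD_ENTRIES = TOTAL / 100 * 5 / 16;     // 5% holds 16-byte record entries
    static final long DATA_SPILL_AT = DATA_BUFFER / 10 * 8;      // spill at 80% full
    static final long RECORD_SPILL_AT = RECORD_ENTRIES / 10 * 8;

    public static void main(String[] args) {
        System.out.println("bufvoid       = " + DATA_BUFFER);     // 99614720, matches "bufvoid"
        System.out.println("kv entries    = " + RECORD_ENTRIES);  // 327680, matches "length"
        System.out.println("data spill    = " + DATA_SPILL_AT);   // 79691776, matches "data buffer = 79691776/99614720"
        System.out.println("record spill  = " + RECORD_SPILL_AT); // 262144, matches "record buffer = 262144/327680"
    }
}
```

So every "Spilling map output: record full = true" line means the job filled all 262144 record slots before the 95 MB data buffer got anywhere near full: with roughly 16-byte records, the record-accounting area is the bottleneck, which is why the log shows a spill about every 4 MB of map output.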
Source: http://leezk.com