Appendix B. Overview of Hadoop
This appendix gives a brief overview of Hadoop, focusing on elements that are of interest to Pig users. For a thorough discussion of Hadoop see [Tom White Hadoop, The Definitive Guide O'Reilly, http://oreilly.com/catalog/9781449389734/]. Hadoop's two main components are MapReduce and HDFS.
MapReduce
MapReduce is the framework for running jobs in Hadoop. It provides a simple and powerful paradigm for parallelizing data processing.
The JobTracker is the central coordinator of jobs in MapReduce. It controls which jobs are being run, which resources they are assigned, etc. On each node in the cluster there is a TaskTracker that is responsible for running the map or reduce tasks assigned to it by the JobTracker.
MapReduce views its input as a collection of records. When reading from HDFS a record is usually a single line of text. Each record has a key and a value. There is no requirement that data be sorted by key or that the keys are unique. Similarly MapReduce produces a set of records, each with a key and value.
MapReduce operates on data in jobs. Every job has one input and one output.[31] MapReduce breaks each job into a series of tasks. These tasks are of two primary types, map and reduce.
Map Phase
In the map phase, MapReduce gives the user an opportunity to operate on every record in the data set individually. This phase is commonly used to project out unwanted fields, transform fields, or to apply filters. Certain types of joins and grouping can also be done in the map (e.g. joins where the data is already sorted or hash based aggregation). There is no requirement that for every input record there should be one output record. Maps can choose to remove records or explode one record into multiple records.
Every MapReduce job specifies an
InputFormat. This class is responsible for
determining how data is split across map tasks and for providing a
RecordReader.
In order to specify how data is split across
tasks an InputFormat divides
the input data into a set of InputSplits. Each
InputSplit is given to an individual map. In
addition to information on what to read, the
InputSplit includes a list of nodes that should be
used to read the data. In this way, when the data resides on HDFS,
MapReduce is able to move the computation to the data.
The RecordReader provided
by an InputFormat reads input data and produces key
value pairs to be passed into the map. This class controls
how data is decompressed (if necessary), and how it is converted to Java
types that MapReduce can work with.
Combiner Phase
The combiner gives applications a chance to apply their reducer logic early. As the map phase writes output it is serialized and placed into an in memory buffer. When this buffer fills, MapReduce will sort the buffer and then run the combiner, if the application has provided an implementation for it. The resulting output is then written to local disk, to be picked up by the shuffle phase and sent to the reducers. MapReduce may choose not to run the combiner if it determines it will be more efficient not to.
After the shuffle each reducer will have one input for each map. The reducer needs to merge these inputs in order to begin processing. It is not efficient to merge too many inputs simultaneously. Thus, if the number of inputs exceeds a certain value, the data will be merged and rewritten to disk before being given to the reducer. During this merge the combiner will be applied in an attempt to reduce the size of the input data. See Hadoop's documentation for a discussion of how and when this pre-reduce merge is triggered.
Since the combine phase will be run zero, one, or multiple times, the input and output keys and values of the combiner must be of the same type.
Shuffle Phase
During the shuffle phase MapReduce partitions data among the various reduces.
MapReduce uses a class called
Partitioner to partition records to reducers
during the shuffle phase. An implementation of
Partitioner takes the key and value of the record
as well as the total number of reduce tasks, and returns the reduce task
number that the record should go to. By default, MapReduce uses
HashPartitioner, which calls
hashCode() on the key and returns the result modulo the
number of reduce tasks. MapReduce users can override this default to
use their own implementation of Partitioner. See
the Hadoop documentation for more details on
Partitioners.
Data arriving on the reducer has been partitioned and sorted by the map, combine, and shuffle phases. By default the data is sorted by the partition key. For example if a user has a data set partitioned on user id, then in the reducer it will be sorted by userid as well. Thus, MapReduce uses sorting to group like keys together. It is possible to specify additional sort keys beyond the partition key. So, for example, the user could choose to partition by user id and sort also by timestamp. This feature is useful as the user does not have to implement their own sorting on the reduce data.
Reduce Phase
The input to the reduce phase is each key from the shuffle plus all of the records associated with that key. Since all records with the same value for the key are now collected together, it is possible to do joins and aggregation operations such as counting. Parallelism in the reduce is explicitly controlled by the MapReduce user. MapReduce jobs that do not require a reduce phase can set the reduce count to zero. These are referred to as map only jobs.
Output Phase
The reducer, or map in a map only job, writes its
output via an OutputFormat.
OutputFormat is responsible for providing a
RecordWriter, which takes the key value pairs
produced by the task and stores them. This includes serializing,
possibly compressing, and writing them to HDFS, HBase, etc. The
OutputFormat is also responsible for providing
the OutputCommitter, which is used to do post
output operations such as cleaning up after failure and indicating to
the storage medium that data is available (e.g. a database
commit).
Distributed Cache
Sometimes all or many of the tasks in a MapReduce job will need to access a single file or a set of files. For example, when joining a large file with a small file one approach is to open the small file as a side file (that is, open it directly in your map task rather than specify it as an input to your MapReduce job), load it into memory, and do the join in the map phase. When thousands of map or reduce tasks attempt to open the same HDFS file simultaneously this puts a large strain on the NameNode and the DataNodes storing that file. To avoid this situation MapReduce provides the distributed cache. The distributed cache allows users to specify as part of their MapReduce job any HDFS files they want every task to have access to. These files are then copied onto the local disk of the task nodes as part of the task initiation. Map or reduce tasks can then read these as local files.
Handling Failure
Part of the power of MapReduce is that it handles failure and retry for the user. If you have a MapReduce job that involves 10,000 map tasks (not an uncommon situation), the odds that at least one machine will fail during that job are reasonably high. Rather than trying to remove failure from the system, MapReduce is designed with the assumption that failure is common and must be coped with. When a given map or reduce task fails, MapReduce handles spawning a replacement task to do the work. Sometimes, it does not even wait for tasks to fail. When a task is slow it may spawn a duplicate to see if it can get the task done sooner. This is referred to as speculative execution. After a task fails a certain number of times, four by default, MapReduce gives up and declares the task and the job a failure.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system that stores files across all of the nodes in a Hadoop cluster. It handles breaking the files into large blocks and distributing them across different machines. It also makes multiple copies of each block so that if any one machine fails no data is lost or unavailable. By default it makes three copies of each block, though this value is configurable. One copy is always written locally to the node where the write is executed. If your Hadoop cluster is spread across multiple racks, HDFS will write one copy of the block on the same rack as the machine the write is happening on, and one copy on a machine in a different rack. When a machine or disk dies, or blocks are corrupted, HDFS will handle making another copy of the lost blocks to ensure that the proper number of replicas are maintained.
HDFS is designed specifically to support MapReduce. The block sizes are large, 64M by default. Many users set them higher, 128M or even 256M. Storing data in large blocks works well for MapReduce's batch model, where it is assumed that every job will read all of the records in a file. Modern disks are much faster at sequential read then seek. Thus for large data sets, if you require more than a few records, sequentially reading the entire data set out-performs random reads. The three-way duplication of data, beyond obviously providing fault tolerance, also serves MapReduce because it gives the JobTracker more options for locating map tasks on the same machine as one of the blocks.
HDFS presents a POSIX like interface to users, and provides standard file system features, such as file ownership and permissions, security, and quotas.
The brain of HDFS is the NameNode. It is responsible for maintaining the master list of files in HDFS, and handles the mapping of file names to blocks, knowing where each block is stored, and making sure each block is replicated the appropriate number of times. DataNodes are machines that store HDFS data. They store each block in a separate file. Each DataNode is collocated with a TaskTracker to allow moving of the computation to data.
[31] It is possible to bend this rule, as Pig and many other applications do. For example, the one input can be a concatenation of multiple input files and files can be opened on the side in tasks and written to or read from. But conceptually each job has one primary input and one primary output.




Add a comment



Add a comment