
Friday, May 24, 2013

How to determine the number of mappers and reducers in a MapReduce job?

Many times I have seen people on StackOverflow and other forums asking how to set the number of mappers and reducers in a Hadoop MapReduce job, or how to determine or calculate those numbers. I will try to answer these questions in this post.

How to determine the number of mappers?

It's relatively easy to determine the number of mappers, but it is harder to control than the number of reducers.
The number of mappers can be determined as follows:
First, determine whether the input files are splittable. Gzipped files and files in some other compression formats are inherently not splittable by Hadoop; plain text files, JSON documents, etc. are splittable. The sketch below shows roughly how Hadoop makes this decision.
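This check mirrors what Hadoop's own TextInputFormat does: a file is splittable if it is uncompressed, or if its compression codec implements SplittableCompressionCodec (as bzip2 does; gzip does not). A minimal sketch using the org.apache.hadoop.io.compress API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    // Returns true if Hadoop can break the given file into multiple input splits.
    public static boolean isSplittable(Configuration conf, Path file) {
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec == null) {
            return true; // uncompressed files (plain text, JSON, ...) are splittable
        }
        // gzip does not implement SplittableCompressionCodec; bzip2 does
        return codec instanceof SplittableCompressionCodec;
    }
}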
If the files are splittable:
1. Calculate the total size of the input files.
2. Number of mappers = total input size / input split size (as defined in the Hadoop configuration).
For example, if the total input size is 1 GB and the input split size is set to 128 MB, then:
number of mappers = 1 x 1024 / 128 = 8 mappers.
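As an illustration, here is a hypothetical job setup (the class and job names are made up) that pins the split size at 128 MB using the new mapreduce API, which for splittable input fixes the number of mappers as calculated above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MapperCountDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "mapper-count-demo");
        // Pin both bounds of the split size to 128 MB: 1 GB of splittable
        // input then yields 1024 / 128 = 8 splits, i.e. 8 mappers.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}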

If the files are not splittable:
1. In this case the number of mappers is equal to the number of input files. If a file is very large, it can become a bottleneck for the performance of the whole MapReduce job. On Amazon EMR, we can use S3DistCp with --outputCodec none to copy the non-splittable files from S3 to HDFS and decompress them, so that they become splittable.

How to determine the number of reducers?

It's relatively tricky to determine the number of reducers. The number of reducers can be set explicitly in a MapReduce job by calling job.setNumReduceTasks() in the driver's run() method. If it is set explicitly, it overrides all related settings, and the number of reducers is always equal to the number passed to job.setNumReduceTasks().
If the number of reducers is not set explicitly, it depends on the Hadoop configuration of the cluster. If the number of reducers per task node is set to x, and the cluster has y such nodes (excluding the master), then the total number of reducers is x * y.
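A minimal sketch of such a driver (the class name, job name, and the reducer count of 10 are made up for illustration):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyJobDriver.class);
        // Explicitly pin the reducer count; this overrides any
        // cluster-level defaults for this job.
        job.setNumReduceTasks(10);
        // ... set mapper, reducer, input/output formats and paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJobDriver(), args));
    }
}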

A few important points to keep in mind:

The number of reducers does not depend on the key set of the mapper output. For instance, if the mapper output has 1 million unique keys, that doesn't mean there will be 1 million reducers. The concept here is: all the values related to a particular key go to a single reducer, but this in no way means that a reducer will receive values for only a single key.

For example, consider the following mapper outputs:
Mapper(k1,v1), Mapper(k1,v2), Mapper(k1,v3)
Mapper(k2,w1), Mapper(k2,w2)
Mapper(k3,u1), Mapper(k3,u2), Mapper(k3,u3), Mapper(k3,u4)
So, the values related to k1 (v1, v2 and v3) will go to a single reducer, say R1, and they won't get split across multiple reducers. But that doesn't mean R1 will have only the key k1 to process; it may receive values of k2 or k3 as well. For any key a reducer receives, however, all the values associated with that key come to that same reducer.
We can write our own custom Partitioner class if we need to override the default distribution of keys among the reducers (see the sketch below).
Setting a higher number of reducers does not guarantee that the output will be smaller and distributed evenly across several files. That also depends on the key set of the mapper output: the mapper output keys must be sufficiently numerous for the output data to be spread across several smaller files.
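For reference, the default HashPartitioner sends each record to reducer (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A minimal custom partitioner might look like the following sketch; the routing rule (reserving reducer 0 for keys starting with "k1", echoing the example above) is made up for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks <= 1) {
            return 0; // with a single reducer there is nothing to decide
        }
        // Hypothetical rule: reserve reducer 0 for keys starting with "k1"
        // and hash all other keys over the remaining reducers.
        if (key.toString().startsWith("k1")) {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
}

It would be registered in the driver with job.setPartitionerClass(MyPartitioner.class).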
