Friday, May 24, 2013

How to determine the number of mappers and reducers in a MapReduce job?

Many times I have seen people on StackOverflow and other forums asking how to set the number of mappers and reducers in a Hadoop-based MapReduce job, or how to determine or calculate these numbers. I will try to answer these questions in this post.

How to determine the number of mappers?

Compared to the number of reducers, the number of mappers is relatively easy to determine but harder to control.
The number of mappers can be determined as follows:
First, determine whether the input files are splittable. Gzipped files and files in some other compressed formats are inherently not splittable by Hadoop; plain text files, JSON documents etc. are splittable.
If the files are splittable:
1. Calculate the total size of the input files.
2. Number of mappers = total input size / input split size defined in the Hadoop configuration.
For example, if the total input size is 1 GB and the input split size is set to 128 MB, then:
number of mappers = (1 x 1024 MB) / 128 MB = 8 mappers.
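As a quick sanity check, here is the same arithmetic sketched in a few lines of Java. The 1 GB and 128 MB figures are just the example values from above, not anything read from an actual cluster configuration:

```java
// A minimal sketch of the calculation above, assuming splittable input.
// The sizes are the example values from this post, not values read from a cluster.
public class MapperCountEstimate {
    public static void main(String[] args) {
        long totalInputSize = 1L * 1024 * 1024 * 1024; // 1 GB of splittable input
        long splitSize = 128L * 1024 * 1024;           // 128 MB input split size

        // Each input split is processed by one mapper, so round up
        // to cover a partial last split.
        long numMappers = (totalInputSize + splitSize - 1) / splitSize;
        System.out.println("Estimated number of mappers: " + numMappers); // prints 8
    }
}
```

The rounding up matters in practice: 1.1 GB of input with 128 MB splits needs 9 mappers, not 8, because the leftover data still gets its own split.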

Sunday, May 19, 2013

What is Big Data?

These days the IT world is abuzz with the term “Big Data”. So, what really is Big Data? By definition:

“An amount of data that is so huge that it can’t be processed (stored, searched, analysed, shared etc.) through traditional computing.” Here, traditional computing refers to non-parallel, non-distributed systems such as Relational Database Management Systems (RDBMS).

The three V’s of Big Data are:

  1. Variety: Big Data can arrive from various sources and in various formats such as CSV, JSON, TSV etc.
  2. Velocity: Data is generated very quickly, and at speeds that vary from source to source.
  3. Volume: The volume of Big Data is very high, possibly even in the range of petabytes.

What is the source of Big Data?
The most common source of Big Data is the logs generated by systems. For example, whenever we access a website, it logs our IP address, browser info, OS, datetime of access and a few other details about us to a log file. A website with heavy traffic can generate log files of around 1-2 GB per day or even more. Every day, we create approximately 2.5 quintillion bytes of data, and it is estimated that 90% of the data in the world today has been created in the last two years alone. This is the velocity at which we are producing data! New sources of this kind of data are piling up with each passing day.
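To make this concrete, here is a rough sketch of what a single access-log line might look like and how a few of its fields could be pulled out in Java. The sample line and the regular expression assume an Apache-style combined log format, which is only one of many formats a real website might use:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extracting a few fields from a web server access-log line.
// The sample line and regex assume an Apache-style combined log format.
public class LogLineExample {
    public static void main(String[] args) {
        String line = "203.0.113.7 - - [19/May/2013:10:15:32 +0000] "
                + "\"GET /index.html HTTP/1.1\" 200 2326 "
                + "\"-\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36\"";

        // Capture the IP address, timestamp, request line, and User-Agent
        // (which carries the browser and OS info).
        Pattern p = Pattern.compile(
                "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" \\d+ \\S+ \"[^\"]*\" \"([^\"]*)\"$");
        Matcher m = p.matcher(line);
        if (m.matches()) {
            System.out.println("IP:         " + m.group(1));
            System.out.println("Timestamp:  " + m.group(2));
            System.out.println("Request:    " + m.group(3));
            System.out.println("User-Agent: " + m.group(4));
        }
    }
}
```

Multiply lines like this by millions of visits per day and it becomes clear how quickly such logs grow into Big Data.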

Why do we really need to process such huge amounts of data?
The answer is: to derive some meaningful information or insight out of it. Meaningful information most commonly refers to the business value that can be derived from Big Data. For example, if we are deploying a new version of a website, we can run analytics on the log files of both the older and the newer version to answer questions such as: was the new design widely accepted? Did it cause a negative impact on traffic to the website? We can even use the logs to predict the expected traffic during a particular period of a day or week. This is the era of collective intelligence.