
Input format using RecordReader in Hadoop

Hello folks, it's been a while since I updated my blog, so I thought of writing about RecordReaders in Hadoop. Most of my recent work has dealt with different types of input formats, be it logs, events, unstructured data sets, key-value (NoSQL) data stores, etc., so I have had to focus heavily on customizing the input format of my data.

Heads-up: I haven't gone into how the RecordReader methods work in depth. If anyone needs that, please let me know and I will write another post on it; the online documentation is good enough to understand them.

Before diving further, let's get a little background on why we need them in the first place and what sort of helpful features they bring. Hadoop is one of the best frameworks designed for data warehouses that needed better, faster analytics. It can deal with terabytes of data without any issues, and many big companies like Facebook and Twitter have scaled it to work on petabytes. Now the real challenge is getting unstructured data onto HDFS, CFS, or whatever filesystem you have decided to go with. It might sound like an easy task to just use -put or -copyFromLocal to HDFS. You're right, it is easy. But we have to make sure the data is moved in some job-friendly manner. For that we have to understand splits, the most used word in the MapReduce world.

What are splits?  There are two places in the Hadoop workflow where we hear about splits. First, before files are sent to mappers, they are divided into input splits and each split is handed to a mapper separately. For example, take a text file of log data: it is divided into multiple splits, and each split is in turn divided into records, which for a plain text log file are just the individual lines. That is one kind of split. The other is the one we hear about most of the time: the 64MB or 128MB value we configure in our XML configuration. That is really the HDFS block size, which controls how files are physically chunked across the data nodes when they are stored. The sketch below shows how the default record reader turns a split into records.
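To make the split/record distinction concrete, here is a minimal sketch, assuming the newer org.apache.hadoop.mapreduce API and the default TextInputFormat: each call to map() receives exactly one record of one split, keyed by that line's byte offset in the file. The class name LogLineMapper is just a placeholder of mine.

// Minimal sketch: with TextInputFormat, the default LineRecordReader feeds
// the mapper one line of the split per call to map().
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' is the byte position of this record within the file,
        // 'line' is the single record produced by the default record reader.
        context.write(new Text(line.toString().trim()), ONE);
    }
}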

Why is a custom input format required? Not every record you get from unstructured data is valid; that is a safe assumption for any file. Machine-generated data is highly unstructured and, most of the time, uncontrollable: exceptions in the middle of logs, empty records in key-value pairs, unidentifiable data formats, and so on. We have to prepare our input format and record reader to handle these kinds of cases while splitting, and that is what the RecordReader is for. While loading data we can customize the input format to our convenience and decide how we want the data to land in HDFS. Dumping all the data as-is is a really bad idea if you then run jobs or Hive/Pig queries on it; we have to try to make the data clean enough to be job-friendly. A bare-bones skeleton of a custom input format follows, and after that a few instances where we need one.
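A custom format usually just extends FileInputFormat and hands back its own RecordReader. This is only a sketch; CleanLogInputFormat and MultiLineRecordReader are hypothetical names of mine, and the reader itself is sketched after the use cases below.

// Bare-bones custom input format (new mapreduce API). The record reader it
// returns is the multi-line reader sketched later in this post.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CleanLogInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Return whatever RecordReader knows how to turn this split
        // into clean, job-friendly records.
        return new MultiLineRecordReader(3);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Return false here if a logical record may span split boundaries
        // and you don't want to handle the boundary logic yourself.
        return true;
    }
}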

A few use cases:

1. Strip all the exceptions out of the log file and load the data line by line. Exception text doesn't start at the beginning of a line but somewhere in the middle, which is bad while reading.

2. Read the log file in groups of three: three lines form one record, or a record ends whenever there is a timestamp, etc. (see the sketch after this list).

3. Strip everything matching certain regular expressions from the data and load only the clean text.

… 
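Here is a hedged sketch of use case 2, under the assumption that the stock LineRecordReader from the mapreduce API does the low-level reading and we simply glue three consecutive lines into one record. The class name and constructor are my own, and error handling is kept minimal.

// Wraps the built-in LineRecordReader so that N consecutive lines
// (3 in the input format above) become a single record.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MultiLineRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader lineReader = new LineRecordReader();
    private final int linesPerRecord;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    public MultiLineRecordReader(int linesPerRecord) {
        this.linesPerRecord = linesPerRecord;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder record = new StringBuilder();
        int linesRead = 0;
        while (linesRead < linesPerRecord && lineReader.nextKeyValue()) {
            if (linesRead == 0) {
                // Key the combined record by the offset of its first line.
                key.set(lineReader.getCurrentKey().get());
            } else {
                record.append('\n');
            }
            record.append(lineReader.getCurrentValue().toString());
            linesRead++;
        }
        if (linesRead == 0) {
            return false; // end of split
        }
        value.set(record.toString());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}

A real version would also need to deal with records that cross split boundaries; the simplest way around that is to return false from isSplitable() in the input format above, at the cost of losing per-block parallelism on large files.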

Resources: a couple of resources where you can see the code.

http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/RecordReader.html

http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/

I have been working on an API that can be loaded as a jar, so you can just call some standard methods like cleanRegex(), blankRemoval(), convertToBinary(), etc. Any other suggestions would be very helpful to me and to anyone else looking for more standard methods.
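Just to make the idea concrete, here is a rough guess at what a couple of those helpers could look like; only the method names come from above, while the signatures and behaviour are my own assumptions (convertToBinary() is left out because its intent isn't spelled out here).

// Hypothetical sketch of the cleaning helpers mentioned above.
import java.util.ArrayList;
import java.util.List;

public final class RecordCleaner {

    private RecordCleaner() {}

    // Strip every substring matching the given regex from a record.
    public static String cleanRegex(String record, String pattern) {
        return record.replaceAll(pattern, "");
    }

    // Drop empty or whitespace-only records from a batch.
    public static List<String> blankRemoval(List<String> records) {
        List<String> cleaned = new ArrayList<String>();
        for (String record : records) {
            if (record != null && !record.trim().isEmpty()) {
                cleaned.add(record);
            }
        }
        return cleaned;
    }
}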