Data Scientist

Take a bow, Little Master - From a Gangulian

The 1996 World Cup is the tournament in which I developed a great bond with this small 5'5" man, fondly called the Little Master. In that tournament we lost many matches simply because his fellow batsmen didn't support him enough on the field. I remember the day India lost a league match to Sri Lanka even though Sachin hit a memorable century. As usual, after he got out I stopped watching the rest of the match, and I turned on the TV in the evening only to see the Lankans needing 5 runs off 2 overs. At the post-match presentation, my father translated the English for me: Azhar was saying they could well have backed Sachin. Thanks to Azhar for at least realizing it, but sadly he couldn't find a solution, and so Sachin carried the burden alone for many more years.

I still carry those moments, not because I bunked school on a number of occasions to watch the games, but because of Sachin's expression at the presentations, saying he could have had a bit more support from the team. It's priceless. Imagine being in a team that isn't yet a team; that was the Master's life in the late nineties. We had to face many embarrassing situations: the semifinal against Sri Lanka, the match against Zimbabwe in the 1999 World Cup, the match against South Africa in 2001 where he hit 90-plus and even tried to save the game by doing a bowler's job, the Test against Pakistan that we lost by 12 runs just because his century was not complemented, the match against Australia at the age of 37 when he scored 175 and left the ground with a victory almost within reach. It's a never-ending saga.

Wake any Indian cricket fan from deep sleep and ask about the India vs Zimbabwe match in the 1999 World Cup: one of the most pathetic matches the Indian team has ever played. Many of us were disheartened, given that Sachin had to leave England to pay his last respects to his father. It is no exaggeration to say that India lost the match just because he was not there. The argument was strengthened when he returned to England on the very third day after losing his father and hit an astounding knock of 140 against Kenya. I never really thought about what his emotions were, or how he maintained that temperament throughout the day; I just enjoyed every boundary and every run. A sigh of relief that Sachin was back in the team and India could win the World Cup. It didn't happen, though.

Sachin tried his best to absorb all the stress in the toughest situations, but his devotees could not; perhaps that is the quality of GOD. Forgive me if I talk more about defeats: only when we lost did I realize Sachin's importance. On countless occasions I blamed him, saying he got out early and hence India lost. Our lives were easy and soothing. He just asked us to sit back and relax at home, eating Mom's home-made food, watching him rule the 65-yard field in the middle of an Indian summer, facing opponents like McGrath, Ambrose, Vaas, Muralitharan, Pollock, Walsh, Wasim Akram, Waqar Younis and Gough, to name a few. Each and every one of them got their share of memories from Sachin, just like us. I am left with memories that I can share with my kids, but can I get those moments back again in my life? Will I get the pleasure of watching a player like him again in my lifetime, waiting for him, dreaming of another century, wondering what to do with the rest of the day if he got out early?

When I was 10 or 12 years old, we had power cuts almost every day during the day-night matches between India, Sri Lanka and New Zealand. I remember calling my grandfather at his office every five minutes to check Sachin's score. It was perhaps more important to me than India's score; actually, it was. On the day of the 2003 World Cup final, no less, I couldn't skip school as it was my final exam day. My mother told me she would be praying for Sachin to play another big innings, like the one against Pakistan. A couple of hours later I went home, only to learn that Sachin had got out early and her prayers were not enough. This man has a deep connection to families, a touch on our emotions. I could see so many elderly people going to Wankhede to witness his last minutes on a cricket field. This man is bigger than the nation.

When he thanked each and every person who helped him in his growth, I am sure millions connected with it and felt proud to be part of his success. I am no different. As a poet said, it is indeed ironic that the nation which boasts the most religions in the world still believes in one God, and that is none other than Sachin Tendulkar.

Can I get my time back? Can Indian cricket ever be the same again? Can world cricket? Can anyone even dare to wear jersey number 10? Can someone do justice to batting at number 4 for India? I have no answers to these questions, and never will. Every word here is from the heart. I have taken a lot of inspiration from this man throughout my life: how he respected his parents, teachers, friends, well-wishers and, most importantly, his wife. It was a takeaway moment when he mentioned her contribution to his success. Once again it was proved, and this time GOD himself acknowledged, the role of a woman in a man's success, with all due respect to women.

Finally, I will never compare my golden years with life after Sachin. Cricket is never the same. Sachin is not just another cricketer; he is a brand. Many things have changed and will change, but not the adulation for Sachin. We'll remember you forever, Master. Your stature might be little, but not your impression on us. THANK YOU.. THANK YOU.. THANK YOU FOR EVERYTHING!!!!!!

A Small Story - Power of Big Data

I was talking to a Hadoop architect about how Big Data is changing the world. It is amazing to see the clear changes it is bringing to society; most of them are very helpful, and a few of them are funny. I want to share one small story of a father and his teenaged daughter.

His family has been a customer of a department store since the age before the internet. They have always received coupons and deals through the mail and made the best use of them. Lately, though, the father noticed they were getting all sorts of baby-related offers: diapers, baby food, baby body wash and so on. He was surprised, and after a while, frustrated. One day he decided to lash out and dialed customer care to ask why he was receiving all this baby-related material when it had been more than 16 years since his last kid was born.

His call was then transferred to a technical support person, who explained the scenario. It turned out it was his daughter who had been searching for baby items on discount; apparently she was pregnant. The store had picked up her search history through her cookies, and the highly efficient engine they had built ran analytics over the log history to infer insights about its customer. It was a shocker to the father and a blow to the kid.

This is the power of Hadoop. We are getting closer to what our customers need, day by day. There will soon be a day when we read customers exactly, even before they start thinking. Kudos to data scientists!

Prison Break - Zillionth time

I don't remember how many times I have watched Prison Break; I have never found it uninteresting, or had enough of it. It is all about one person: Michael Scofield. Hats off to Paul Scheuring for creating Scofield. There is no way someone could do it again, or could have done it already.

Michael Scofield is a personality who gets things done. It's not that he has a solution to every problem he faces; it's all about how he thinks when he hits one. No panic, no fear, just sticking to what can be done. That's all he is about. I admire him to the core; a true inspiration.

That's all for the day. A satisfying day with a great deal of work and a workout. Every Monday starts afresh with great energy, and it's intact this week too.

Food For Thought

I want to share something I have been thinking about for a while: how else could we track each and every day of our lives? Diaries, simply, but I am tired of starting and stopping them so many times. I don't know of a solution yet, but I want to look into it.

I want to know what I did on January 25th, 2007, and I am numb and dumb. As a firm believer that we learn something each and every day of our lives, I don't want to lose what I learned or what I did in my past.

For now, I am thinking of using blogging, like Tumblr, or micro-blogging, like Twitter, which to an extent is close to what I am looking for.

FAQ - Java Debugging and Memory management

I would like to answer a couple of questions that are frequently asked in interviews, and describe how I approach them in my own development. Please note that these are the steps I follow; they may not be ideal.

Q) What should a programmer be careful about in a memory-managed language?

A) As a programmer, I always try to be very conscious about the data structures I use in my programs. In my experience over the past three or four years, 90% of memory-leak problems come from badly chosen data structures; the remaining 10% I have seen came from bad JVM configuration, such as the allocated heap space, weak memory references and so on.

When I declare data structures, I need enough clarity on what to use when.

A few crucial steps, off the top of my mind:

a. Make sure all inactive objects get garbage collected. Inactive objects should only be weakly referenced; they aren't supposed to hang around for long.

b. After you are done with a variable or data structure, release the references to it. Keeping them in memory for a long time makes the JVM run slower and raises the likelihood of memory leaks.

c. It is never easy to find out which class is causing a memory leak, so it is always good to have a profiler running alongside your IDE. It helps greatly in finding out which classes create the most objects and how long they live.

d. Memory leaks are hard to detect until you see performance degrade, so always record how long your classes run.
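To make steps (a) and (b) concrete, here is a minimal sketch of the most common leak pattern I am describing: a long-lived (static) collection that keeps strong references to objects forever. The class and method names are illustrative, not from any real project.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the leak pattern from steps (a) and (b): a long-lived
// static map keeps every object strongly reachable, so the garbage
// collector can never reclaim them unless we release the references.
public class LeakSketch {
    // Strong references held forever: the classic leak.
    private static final Map<Integer, byte[]> cache = new HashMap<>();

    static void handleRequest(int id) {
        cache.put(id, new byte[1024]);   // cached but never evicted
    }

    // The fix from step (b): drop the reference once the data is no
    // longer needed, so the GC can collect the array.
    static void release(int id) {
        cache.remove(id);
    }

    static int liveEntries() {
        return cache.size();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) handleRequest(i);
        System.out.println("entries before release: " + liveEntries());
        for (int i = 0; i < 100; i++) release(i);
        System.out.println("entries after release: " + liveEntries());
    }
}
```

A WeakHashMap is one alternative here: it keeps only weak references to its keys, which is exactly the "inactive objects should be weakly referenced" idea in step (a).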

Q) If you had a bug in your logic, how did you debug it, and how would you approach it?

A) Debugging is one of the most challenging and interesting tasks in any developer's work. I usually dive into the code after reproducing the error in the application, because that is mostly where we find the issues, though this is not always the suggested approach. I follow some quick steps to debug:

a. Find the root cause of the problem and go to that class. A thorough understanding of the code helps you go directly to the appropriate class.

b. Stepping through the code in Eclipse's debugger is the most popular way to see whether your program behaves as expected.

c. There is no limit to how many unit tests you can write; add more to the existing suite.

d. While debugging or running the unit tests, you may well reproduce the error.

e. If the error is not visible in your code, other places to look are inappropriate dependencies such as jars, JVM memory settings, and other project-specific compatibilities.
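Step (c) deserves a tiny illustration. The idea is to pin down the expected behaviour with small, focused checks before stepping through a debugger, so a failure points straight at the offending input class and branch. The parsePort() method below is a hypothetical example invented for this sketch, not code from any real project.

```java
// A minimal sketch of step (c): small, focused checks that isolate
// one input class each. parsePort() is a hypothetical example that
// extracts a port number from a "host:port" string.
public class DebugSketch {
    static int parsePort(String hostAndPort) {
        int colon = hostAndPort.lastIndexOf(':');
        if (colon < 0) return 80;    // no port given: default to 80
        return Integer.parseInt(hostAndPort.substring(colon + 1));
    }

    static void check(boolean ok, String name) {
        if (!ok) throw new AssertionError("failed: " + name);
    }

    public static void main(String[] args) {
        // Each check covers one branch; a failure names the broken case.
        check(parsePort("example.com:8080") == 8080, "explicit port");
        check(parsePort("example.com") == 80, "default port");
        System.out.println("all checks passed");
    }
}
```

In a real project these checks would live in a JUnit suite rather than a main method; the point is only that each one isolates a single behaviour.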

Thanks, and suggestions are welcome. These are not exhaustive scenarios, but I believe they are a starting point for both questions.

Compression Techniques - Snappy vs LZO

Today I want to give a little more insight into the compression techniques Snappy and LZO. Even though at a glance they might look the same, there are major differences in when and where they compress files. Compression is very helpful in distributed systems for maximizing disk usage, and proper compression saves time in file or block transfers.

In Hadoop, the lowest level of compression is the block level, just as in existing Linux filesystems (in Linux, when you change one character in a file, the entire block is rewritten, since the block is the lowest level). Whatever codec you use, be it gzip, LZO or anything else, the compression itself happens outside HDFS, unless you explicitly set file boundaries for the level at which you want Hadoop to compress.

In general, the difference between LZO and Snappy shows up only when we compress out of the box, that is, before writing the data into HDFS. The key point is that Snappy is not splittable, while LZO is splittable.

Say we load a file. If we use Snappy, it takes the whole file, compresses it and loads it into blocks. Apart from the beautiful speed Snappy provides, the drawback is that when we try to read the blocks back, it can decompress only at the file level. It cannot decompress at the node level; instead, it has to bring all the blocks to one place and read from the file header to the end of the file. The reason for this behavior is that a Snappy file is not splittable before it is loaded into HDFS: it compresses the file as-is and proceeds from there.

With LZO, by contrast, the file is split and compressed into individual blocks, which are then written to HDFS. So on a read it doesn't need to merge all the blocks in one place: it can read blocks at the node level, decompress them and fetch the results.

In a nutshell, I would say I prefer Snappy for its speed, even though many people go with LZO since it compresses to a better ratio and is widely available across platforms.
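The split-then-compress versus compress-then-split difference can be sketched in plain Java. Snappy and LZO themselves are not in the JDK, so this sketch stands in the JDK's Deflater/Inflater as the codec; the point is only the structure: compressing each block independently (the LZO-style, splittable layout) lets any block be decompressed on its own, without fetching the others.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// The splittable layout sketched with the JDK's Deflater, standing in
// for a real codec. Split first, compress each block independently;
// then any single block is readable without the rest of the file.
public class SplitCompressSketch {
    static byte[] compress(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[data.length * 2 + 64];
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] data) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(data);
            byte[] buf = new byte[1 << 16];
            int n = inf.inflate(buf);
            inf.end();
            return Arrays.copyOf(buf, n);
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    // "LZO-style": split the file first, then compress each block.
    static List<byte[]> compressPerBlock(byte[] file, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < file.length; off += blockSize) {
            int len = Math.min(blockSize, file.length - off);
            blocks.add(compress(Arrays.copyOfRange(file, off, off + len)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        byte[] file = "record-1\nrecord-2\nrecord-3\nrecord-4\n".getBytes();
        List<byte[]> blocks = compressPerBlock(file, 18);
        // Any block is readable on its own, no merging required:
        System.out.println(new String(decompress(blocks.get(1))));
    }
}
```

Compressing the whole file as one stream, by contrast, would force a reader to start from the stream header every time, which is exactly the non-splittable behavior described above.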

Input-format using Record reader in Hadoop

Hello folks, it's been a while since I updated my blog. I thought of writing about RecordReaders in Hadoop. Most of my recent development has dealt with different types of input formats: logs, events, unstructured data sets, key-value (NoSQL) data stores and so on, so I have really had to focus on customizing the input format of my data.

Heads-up: I haven't written in depth about how the RecordReader methods work. If anyone needs it, let me know and I will write another post on it; the online documentation is good enough to understand.

Before diving further, let's get a little background on why we need them in the first place and what helpful features they bring to the environment. Hadoop is one of the best frameworks designed for data warehouses in need of better, faster analytics. It can deal with terabytes of data without any issues and with much lower latency, and big companies like Facebook and Twitter have scaled it to petabytes. The real challenge is dumping unstructured data onto HDFS or CFS or whatever filesystem you have decided to go with. It might sound like an easy task to just use -put or -copyFromLocal to HDFS, and you're right, it is easy. But we have to make sure the data is moved in some job-friendly manner, and for that we have to understand splits, the most used word in the Map-Reduce world.

What are splits? There are two phases in the Hadoop workflow where we hear about splits. First, before files are sent to the mappers, they are split, and each split is sent to a mapper separately. Take a text file of log data, for example: it is divided into multiple splits, which are then handed to the mappers. Each individual split is in turn divided into records, which here are simply the lines of the text log file. That is one kind of split. The other is the one we hear about most often: the 64 MB or 128 MB splits we configure in our XML files. That is the split applied when this data is stored in blocks on the data nodes.
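The second kind of split is just ceiling division over the file size. As a back-of-the-envelope sketch (sizes in megabytes for readability), a 300 MB file with a 128 MB split size is stored as three blocks of 128 + 128 + 44 MB:

```java
// Number of storage splits for a file: ceiling of fileSize / splitSize.
// Sizes are in MB here purely for readability.
public class SplitCount {
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;  // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(numSplits(300, 128));  // prints 3
    }
}
```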

Why is an input format required? Not every record you get from unstructured data is valid; that is the basic assumption for any sort of file. Machine-generated data is highly unstructured and mostly uncontrollable: exceptions in the middle of logs, empty records in key-value pairs, unidentifiable data formats and so on. We have to prepare our instances to handle these kinds of exceptions while splitting, and for that we use a RecordReader. While loading data, we customize the input format to our convenience. The best part is that we can customize the input format according to how we want to store our data in HDFS. Dumping all the data in as-is is a really bad idea if you plan to run jobs or Hive/Pig queries on it; we have to try to make the data clean enough to be job-friendly. Below are some instances of why we need input formatting.

A few use cases:

1. Strip all the exceptions out of the log file and load the data line by line. Exception lines don't start at the beginning of a line but somewhere in the middle, which is bad for reading.

2. Read the log file in groups of three, so that three lines form one record, or end a record whenever there is a timestamp, etc.

3. Strip everything matching certain regular expressions from the data and load only the plain text.
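Use case 2 above is easy to sketch in plain Java. This is only the grouping logic a custom RecordReader would implement, with no Hadoop classes involved; a real implementation would extend Hadoop's RecordReader and hand each grouped record to the mapper as a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of use case 2: every N lines of a log become one
// record, the way a custom RecordReader would present a multi-line
// record to a mapper. No Hadoop classes are used here.
public class ThreeLineRecords {
    static List<String> toRecords(List<String> lines, int linesPerRecord) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerRecord) {
            int end = Math.min(i + linesPerRecord, lines.size());
            records.add(String.join("\n", lines.subList(i, end)));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        // Two records: "a\nb\nc" and "d\ne" (the tail keeps what's left).
        System.out.println(toRecords(lines, 3));
    }
}
```

Ending a record at a timestamp instead would replace the fixed-size stride with a check on each line, but the shape of the loop stays the same.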


Resources: a couple of resources to see the code.

I have been working on an API that can be loaded as a jar, exposing standard methods like cleanRegex(), blankRemoval(), convertToBinary() and so on. Any other suggestions would be very helpful to me and to anyone seeking more standard methods.
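As a taste of what one of those methods might look like, here is a sketch of blankRemoval(). The API above is still a work in progress, so treat this signature and behaviour as a placeholder, not the final design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A sketch of the blankRemoval() method mentioned above: drop records
// that are null, empty, or whitespace-only before loading. The
// signature is a placeholder for the work-in-progress API.
public class CleanerSketch {
    static List<String> blankRemoval(List<String> records) {
        List<String> out = new ArrayList<>();
        for (String r : records) {
            if (r != null && !r.trim().isEmpty()) out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("k1=v1", "", "   ", "k2=v2");
        System.out.println(blankRemoval(raw));  // [k1=v1, k2=v2]
    }
}
```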

Programming Exercise - III: Tokenizer in Java

Hello friends, today I have a use case I was working on. Say we have a big text file and I want to do a word count on it, that is, count the number of times each word in the file occurs. Immediately I think of the classic Hadoop word-count example and a simple map-reduce program to do that for me. In a regular map-reduce flow, we tokenize all the words and then pass the key-value pairs to a mapper and reducer to calculate how many times each word appears in the file. For that, we need a proper understanding of tokenizing.

The StringTokenizer API in Java provides methods like hasMoreElements(), hasMoreTokens(), nextElement(), nextToken() and countTokens().

We get a tokenizer instance with the following declaration:

StringTokenizer st = new StringTokenizer(readFile, ". ");

The delimiters I have used are the period (.) and the space. Our strings are tokenized using these delimiters, and we can use any list or map to store the individual tokens.

Processing huge files has become significant in today's programming practice; in Java, we use either split() or a tokenizer for it. Below is a simple tokenizer program that takes two arguments: a filename and the word I want to count in the file.

Hope I am clear.

    package com.myCode.topCoder;

    import java.util.*;

    public class Tokenizer {
        public static void main(String[] args) throws IOException {
            int wordcount = 0;
            String readFile = "";
            if (args.length < 2) {
                System.out.println("Usage: Tokenizer <filename> <word>");
                System.exit(1);
            }
            String word = args[1].toLowerCase();
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(args[0])));
            // Read the file into a String (a List would suit bigger files).
            // A space is appended so words on adjacent lines don't merge.
            String line = br.readLine();
            while (line != null) {
                readFile += line + " ";
                line = br.readLine();
            }
            br.close();
            // Tokenize on period and space, matching the delimiters above.
            StringTokenizer st = new StringTokenizer(readFile, ". ");
            while (st.hasMoreTokens()) {
                String s = st.nextToken();
                if (word.equals(s.toLowerCase()))
                    wordcount++;
            }
            System.out.println(wordcount);
        }
    }

Inputs are most welcome.. 

Just one correction regarding 'the effective way to create strings'. When we write new String("Ravi"), the new string does not get created in the string pool; it is created on the regular heap (young gen), as opposed to the string pool in perm gen. If we want to move this string to the pool, we need to call the intern() method (I think…). Otherwise, a clear and precise article. Thank you.
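The correction above is easy to verify with reference comparisons: new String("Ravi") allocates a fresh heap object, while the literal "Ravi" lives in the string pool, and intern() hands back the pooled copy.

```java
// Quick check of the intern() correction: a new String() is a distinct
// heap object, while intern() returns the pooled instance that the
// literal already refers to.
public class InternCheck {
    public static void main(String[] args) {
        String pooled = "Ravi";
        String heap = new String("Ravi");
        System.out.println(heap == pooled);           // false: different objects
        System.out.println(heap.intern() == pooled);  // true: same pooled instance
        System.out.println(heap.equals(pooled));      // true: same characters
    }
}
```

(One note: since Java 7 the pool itself moved out of perm gen onto the main heap, but the intern() behaviour shown here is unchanged.)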