RavisRealm


Input-format using Record reader in Hadoop

Hello Folks, it's been a while since I updated my blog. I thought of writing about RecordReaders in Hadoop. Most of my recent development work dealt with different types of input: logs, events, unstructured data sets, key-value (NoSQL) data stores and so on. So I really have to stress the importance of customizing the input format of my data.

Heads-up : I haven't written much here about how the RecordReader methods work in depth. If anyone needs it, please let me know and I will write another blog on it. The online documentation is good enough to understand them.

Before diving further, let's get a little background on why we need them in the first place and what helpful features they bring to the environment. Hadoop is one of the best frameworks designed for data warehouses that need better, faster analytics. It can deal with terabytes of data at high throughput, and big companies like Facebook and Twitter have scaled it to work on petabytes. The real challenge is dumping unstructured data onto HDFS or CFS or whichever filesystem you have decided to go with. It might sound like an easy task to just use -put or -copyFromLocal to HDFS. You're right, it is easy. But we have to make sure the data is moved in a job-friendly manner. For that we have to understand splits, the most used word in the Map-Reduce world.

To understand Map-Reduce please follow Google's paper: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/mapreduce-osdi04.pdf

What are splits?  We hear about splits in two places in a Hadoop workflow. First, before files are sent to mappers they are divided into input splits, and each split is sent to a separate mapper. Take a text file of log data, for example: it is divided into multiple splits, and each split is again divided into records, which are nothing but the lines of the text log file. This is one kind of split. The other kind is the one we hear about most of the time: the 64MB or 128MB size we configure in our XML configuration. Strictly speaking this is the HDFS block size, which controls how a file is stored across data nodes; by default the input split size follows it.

Why is an input format required? Not every record you get from unstructured data is valid. This is the basic assumption for any sort of file. Machine-generated data is highly unstructured and uncontrollable most of the time: exceptions in the middle of logs, empty records in key-value pairs, unidentifiable data formats and so on. We have to prepare our code to handle these kinds of problems while splitting, and for that we use a RecordReader. While loading data we can customize the input format according to our needs and decide how we want our data stored in HDFS. Dumping all the data as it is makes for a really bad experience when running jobs or Hive/Pig queries on it. We have to make the data clean enough to be job-friendly. Below are some situations where we need a custom input format.

A few use cases :

1. Take out all the exceptions in the log file and load the data line by line. An exception does not necessarily start at the beginning of a line; it can start somewhere in the middle, which makes line-by-line reading unreliable.

2. I want to read my log file in groups of 3: three lines make one record, or a record ends whenever a timestamp appears, and so on.

3. Strip everything that matches given regular expressions and load only the text.

… 
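Use case 2 can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the grouping logic that a custom RecordReader would apply while reading a split; the class name ThreeLineRecords and the method readRecords are hypothetical names made up for this sketch, and skipping blank lines is an assumption about what "empty records" should mean.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ThreeLineRecords {

    // Groups consecutive non-blank lines into records of linesPerRecord lines each.
    static List<String> readRecords(BufferedReader in, int linesPerRecord) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int linesInCurrent = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // skip empty lines instead of emitting empty records
            }
            if (linesInCurrent > 0) {
                current.append(' ');
            }
            current.append(line);
            linesInCurrent++;
            if (linesInCurrent == linesPerRecord) {
                records.add(current.toString());
                current.setLength(0);
                linesInCurrent = 0;
            }
        }
        if (linesInCurrent > 0) {
            records.add(current.toString()); // trailing record with fewer lines
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String log = "a\nb\n\nc\nd\n";
        System.out.println(readRecords(new BufferedReader(new StringReader(log)), 3));
    }
}
```

In a real job the same loop would live inside the reader's next-record method and would also have to respect split boundaries, which this sketch ignores.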

Resources : A couple of resources to see the code.

http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/RecordReader.html

http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/

I have been working on an API that can be loaded as a jar so you can just call standard methods like cleanRegex(), blankRemoval(), convertToBinary() etc. Any other suggestions would be very helpful to me and to anyone who is looking for more standard methods.

    • #hadoop
    • #recordreader
    • #big data
    • #nosql
    • #inputformat
  • 5 days ago

Programming Exercise - III Tokenizer in Java.

Hello Friends, today I have a use case I was working on.  Let's say we have a big text file. I want to do a word count on that file, which counts the number of times each word occurs in it. Immediately I can think of the classic Hadoop wordcount example and write a simple map-reduce program to do that for me. In a regular map-reduce process, we tokenize all the words and then pass those key-value pairs to a mapper and reducer to calculate the number of times each word is in the file. For that, we need a proper understanding of tokenizing.

The StringTokenizer class in java.util provides methods such as hasMoreElements(), hasMoreTokens(), nextElement(), nextToken() and countTokens().

We get an instance of the tokenizer with the following declaration.

StringTokenizer st = new StringTokenizer(readFile, "., ");

The delimiters I have used are period (.), comma (,) and space. Our strings are tokenized using these delimiters. We can use any list or map to store the individual tokens.

Processing huge files has become significant in today's programming practice. In Java, we use either split or a tokenizer for it. Here is a simple tokenizer program that takes two arguments: the filename and the word I want to count in the file.

Hope I am clear.

    package com.myCode.topCoder;

    import java.io.*;
    import java.util.*;

    public class Tokenizer {
        public static void main(String[] args) throws IOException {
            // Two arguments are required: the file name and the word to count.
            if (args.length < 2) {
                System.out.println("Please specify a filename and a word to count");
                System.exit(1);
            }
            String word = args[1].toLowerCase();
            int wordcount = 0;
            // Read the file into one String; for a very large file, tokenize line by line instead.
            StringBuilder readFile = new StringBuilder();
            try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // Keep a space between lines so words at line boundaries do not merge.
                    readFile.append(line).append(' ');
                }
            }
            // Tokenize readFile, which now holds all words from the file.
            StringTokenizer st = new StringTokenizer(readFile.toString(), "., ");
            while (st.hasMoreTokens()) {
                String s = st.nextToken();
                if (word.equals(s.toLowerCase())) {
                    wordcount += 1;
                }
            }
            System.out.println(wordcount);
        }
    }
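The same tokenizing idea extends to counting every word at once by storing the tokens in a map. Below is a minimal sketch under a few assumptions: the class name WordFrequencies and the method count are hypothetical, the delimiters are period, comma and space, and words are compared case-insensitively.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordFrequencies {

    // Returns how many times each lower-cased token occurs in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer st = new StringTokenizer(text, "., ");
        while (st.hasMoreTokens()) {
            String token = st.nextToken().toLowerCase();
            Integer seen = counts.get(token);
            counts.put(token, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("The cat. The dog, the end"));
    }
}
```

A TreeMap is used only so the printed result is sorted by word; a HashMap would work equally well when order does not matter.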

Inputs are most welcome.. 

    • #Java
    • #API
    • #Tokenizers
    • #Wordcount
  • 2 months ago

Just one correction regarding 'Effective way to create strings'. When we write new String("Ravi"); the new string does not get created in the string pool. It gets created in the regular heap (young gen) as opposed to the string pool in perm gen. If we want to move this string to the pool, we need to call the intern method (I think…). Otherwise, a clear and precise article. Thank you.

  • 3 months ago

you seem to have done a lot of analysis on strings memory consumption.

Thanks

Web Security

  • 3 months ago

Programming Exercise - II ReplaceAll() method in Java.

Hello All, I got some more time to write about another small algorithm, with a little tweak. It makes a good use case for the replaceAll method.

Question :  I am taking a string as input and I need to find whether it is a palindrome or not. It is a simple question as long as the string doesn't have any special characters. Ex : the string K a,y"ak should return true, since it reads "kayak" once the spaces and special characters are removed and case is ignored. With our usual approach it won't.

Solution : Thanks to the Java libraries we have a method called replaceAll, which replaces every part of our string that matches a regular expression with whichever value we provide.

Syntax : String s;

              s = s.replaceAll("Regex", "value");

 

Let's code using the above syntax. Here I want to remove everything that is not a letter or a digit from my string s and then check for a palindrome.

    boolean isPalindrome(String s)
    {
        // Strip everything that is not a letter or digit, then ignore case.
        s = s.replaceAll("[^A-Za-z0-9]", "").toLowerCase();
        int n = s.length();
        // Compare characters from both ends; looping over half the string is enough.
        // For odd lengths the middle character needs no check.
        for (int i = 0; i < n / 2; i++)
        {
            if (s.charAt(i) != s.charAt(n - 1 - i))
                return false; // A mismatch means it is not a palindrome
        }
        return true; // Every pair matched
    }
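Since the first argument of replaceAll is a regular expression, removing several specific symbols with an "or" needs some care: an unescaped period matches any character. The sketch below contrasts an alternation with a character class; the class and method names are hypothetical and chosen only for this illustration.

```java
class RegexCleanup {

    // Removes only spaces, periods and percent signs.
    // The period is escaped because an unescaped '.' matches any character.
    static String stripListed(String s) {
        return s.replaceAll(" |\\.|%", "");
    }

    // Removes every character that is not an ASCII letter or digit.
    static String stripNonAlphanumeric(String s) {
        return s.replaceAll("[^A-Za-z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripListed("K a.y%ak"));
        System.out.println(stripNonAlphanumeric("K a,y\"ak"));
    }
}
```

The character class is usually the simpler choice for a palindrome check, because it does not require listing every symbol that might appear.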

Suggestions and inputs are most welcome. 

    • #Java
  • 3 months ago

Programming Exercise - I Reverse content in a file.

Hello Folks, I want to post an interview question I was asked by a big company recently. Here is the question.

Q) A file contains a huge amount of data. Assume the data is already there and need not be written now. This file is fed to a search engine. For now, ignore negative cases such as a different language or an empty file. Our job is to read the string in the file and print its words in reverse order, splitting on white space. Ignore any dots or special characters.

 For ex : "My name is Ravi and I am 24 years old" should output "old years 24 am I and Ravi is name My"

I proposed 3 solutions. 

1. Since it is a large file, I split it into chunks using a file splitter, let's say twenty 10MB files. I create 10 threads, do my operations in parallel and print the results.

2. A straightforward approach, which I copied below. Read the file into a string, reverse the words and print them.

3. Use seek(offset), which comes from the RandomAccessFile class. The advantage here is that we avoid steps like reading the data into a string and reversing it. We access the data directly in the file and print it to the console or write it to another file.

To keep it simple I tried with 5MB of data and followed methods 2 and 3. Method 2 completed in 14178ms, which is undeniably bad. Method 3 fetched the same result in 6375ms, which I think is still bad but better than the former solution.

    package com.myCode.topCoder;

    import java.io.*;
    import java.util.Scanner;

    public class RevLargeString {
        public static void main(String args[])
        {
            File f = new File("BigString"); // The Scanner reads the file directly
            long startTime = System.currentTimeMillis();
            try (Scanner scanner = new Scanner(f)) {
                // Accumulate every line of the file, printing each one as it is read.
                StringBuilder content = new StringBuilder();
                while (scanner.hasNextLine())
                {
                    String line = scanner.nextLine();
                    System.out.println(line);
                    content.append(line).append(' ');
                }
                // Split on runs of white space, then append the words in reverse order.
                String[] reverseWords = content.toString().trim().split("\\s+");
                // StringBuilder avoids the cost of repeated String concatenation.
                StringBuilder printReverse = new StringBuilder();
                for (int i = reverseWords.length - 1; i >= 0; i--)
                {
                    printReverse.append(reverseWords[i]).append(' ');
                }
                System.out.println(printReverse);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
            long endTime = System.currentTimeMillis();
            long totalTime = endTime - startTime;
            System.out.println(totalTime); // Elapsed time in milliseconds
        }
    }



Method 2 Output : 

My name is Ravi and I am 24 years old My name is….

old years 24 am I and Ravi is name My old …….

14178

Method 3 Output : 

My name is Ravi and I am 24 years old My name is….

old years 24 am I and Ravi is name My old …….

6375

I haven't written up classes for methods 1 and 3. But I can give some idea of how we could do it in method 3.

Method 3 : 

// Create a RandomAccessFile instance:          RandomAccessFile r = new RandomAccessFile("BigString", "r");

// Now start reading backwards through the file.                Loop from r.length() - 1 down to 0

// Get the current file pointer when needed with r.getFilePointer();

// Use r.seek(offset). For the offset we give each position in turn while looping back, and note where we see a space.

// Print from the character after that space up to the start of the previously printed word.
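The outline for method 3 can be turned into a small runnable sketch. It is not the code behind the timings above; it is one possible reading of the steps, assuming single-byte (ASCII) text and treating space, tab and line breaks as separators. The class name ReverseWordsBySeek and the method reverseWords are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReverseWordsBySeek {

    // Walks the file from its last byte to its first and collects the words in reverse order.
    static String reverseWords(String fileName) throws IOException {
        StringBuilder out = new StringBuilder();
        try (RandomAccessFile r = new RandomAccessFile(fileName, "r")) {
            long wordEnd = r.length(); // exclusive end of the word currently being scanned
            for (long pos = r.length() - 1; pos >= -1; pos--) {
                boolean atBoundary = (pos == -1); // start of file also ends a word
                if (!atBoundary) {
                    r.seek(pos);
                    int b = r.read();
                    atBoundary = (b == ' ' || b == '\n' || b == '\r' || b == '\t');
                }
                if (atBoundary) {
                    int len = (int) (wordEnd - pos - 1);
                    if (len > 0) {
                        byte[] word = new byte[len];
                        r.seek(pos + 1);
                        r.readFully(word);
                        if (out.length() > 0) {
                            out.append(' ');
                        }
                        out.append(new String(word, StandardCharsets.US_ASCII));
                    }
                    wordEnd = pos;
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("BigString", ".txt");
        Files.write(tmp, "My name is Ravi and I am 24 years old".getBytes(StandardCharsets.US_ASCII));
        System.out.println(reverseWords(tmp.toString()));
        Files.delete(tmp);
    }
}
```

Seeking and reading one byte at a time keeps the sketch short but is slow on real files; reading fixed-size blocks backwards into a buffer would be the natural next step. For a truly huge file the result should also be written out word by word instead of being collected in memory.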

Please give suggestions or any other  way we can do it in better time. 

  • 3 months ago

My take on Graphsearch

Pretty good. Yes, I would say Facebook is pretty much keeping its value for what users are investing their time in. Everyone has to agree that the newsfeed and the timeline are two basic and significant building blocks of Facebook. All these years Facebook has been trying hard to keep both of them dynamic enough to entice users. From the engineering team to the production support team, everyone is involved in making it a success. We're seeing the result in the form of 1 billion odd users signed up so far and many more to follow.

When Mark Zuckerberg was asked about their future plans at TechCrunch Disrupt, his answer had a very positive impact on their stock performance. Mark announced that they want to focus on search, which in turn attracted investors. His answer was immediately followed by another question: what else would they offer beyond what Google or Bing are offering? The Facebook CEO said they are putting their best efforts into providing Graph Search, which would help users customize their search in a better way. It is indeed very interesting when heard for the first time. Then I started my research on how different Graph Search is from the regular search we have. I will try to explain it in my way.

For instance, if you want to know the best Italian restaurant in San Francisco, you provide keywords to pull that information. This is the regular search we do, called keyword search. Pretty much makes sense, doesn't it? Now imagine you want to know if you have been to that restaurant before. With graph search I can do it. I will change my keyword search into any of the following sentences.

"Did I go to any top-rated Italian restaurants in San Francisco?" or

"Which locations of Olive Garden have I been to since 2006?"

What did I just do? I added more sense to my search. Isn't it beautiful? To me at least it is awesome. I can even customize my query further, like this.

"Please tell me what movie theaters I have gone to with my friend XYZ in 2010" - It will fetch me what I want.

Wow!! That's even better. Looking at this, I would say Facebook has added another beautiful feature to their ecosystem. Since last year we have seen timeline, gifts (I would call it a Facebook version of e-commerce) and now Graph Search. I am sure Graph Search is a very innovative concept and will have its niche in the coming years. If there is anything missing in graph search, it is sharing the search results, which is trivial for now.

If you ask me what Facebook is trying to do with all these features, I will say they're building an ecosystem where you can do almost everything: socialize, shop, search, graph search, job search, and what else is in the pipeline?

Good Job Mark and team.. 

  • 4 months ago

Efficient way to create strings in Java ..

Strings are the most common datatype we use in any programming language. In Java, there are many ways we can create them. Before we look at how to create them, we first need to understand how string memory is managed internally by the JVM. Whenever a string literal is used, the JVM gives that particular string a reference in the string pool. The string pool is nothing but a pool that stores the string literals being used by the program. This is a very simple architecture to understand.

    The catch is not where strings are stored but how they are stored. I will show you two common practices we follow.

A)  //Assigning a value to string variable

   String s1 = "Ravi";

B)  // Using new Operator

   String s4 = new String("Ravi");

We all know Strings in Java are immutable, which means that once a String object is created its contents cannot be changed. Out of the two methods above, method A is preferred. Why? I will explain with a sample program.

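    public class analyzeString {
        public static void main(String args[])
        {
            String s1 = "Ravi";
            String s2 = "Ravi";
            String s3 = "Sourav";
            String s4 = new String("Ravi");

            // Print values: == compares references, equals compares contents
            System.out.println(s1 == s2);          // true: both refer to the same pooled literal
            System.out.println(s1 == s4);          // false: new String creates a separate object
            System.out.println(s1.equals(s4));     // true: same characters
            System.out.println(s1 == s4.intern()); // true: intern() returns the pooled instance
        }
    }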

In the above program string s1 is initialized to the value 'Ravi'. When I assign 'Ravi' to s2 as well, no second object is created in the string pool; s2 points to the same memory location s1 is pointing to. In this way the string pool saves the extra bytes for s2. Now consider s4, which is initialized using the new operator with the same value 'Ravi'. The catch here is that it does not point to the same location as s1 and s2. The new operator always creates a separate String object on the regular heap, outside the string pool. So another instance is created with the same value, thus wasting memory.

Thanks for reading and suggestions are most welcome. 

  • 4 months ago

Must try mobile applications !!

Along with the built-in applications we get with our phones, we have the privilege of installing third-party applications which help us in a number of ways. Most of us are familiar with many popular apps. I want to give a quick overview of some that I use which are awesomely built and incredibly useful. These are all reviewed from my personal experience and not based on user ratings.

1. Aviary : It's a free photo editing application. Aviary has almost everything you want in a photo editor. You can add multiple effects, enhance the quality, add stickers and more. One awesome feature it offers to app developers is that Aviary is easy to integrate into their applications. You can get the API for free. I am waiting for an update to fit my 4 inch Retina display.

2. Path : Path is yet another path-breaking app. It is a very UI-friendly social networking platform. We can do most things we want to do on a social networking site and perhaps even more. What makes it so social is that, at least for now, it is an ad-free platform. I don't know how they make money, maybe from deals like the Nike+ integration. But from a user's standpoint, I enjoy every bit of it. I add photos, activities, music, books, check in, share, nudge. Most importantly, and unlike other social sites, it has a 150 user limit, which makes it intriguing. So we share only with those who are important to us, not a bad idea.

3. Flipboard : By now, the majority of people who use smartphones might know about it. Flipboard brings all kinds of news to one place in a very fascinating way. You flip the screen to read articles just like you do on a notepad. You can add different genres of blogs and websites you like and read them in one place. Flipboard is very fast and I also did not see a big consumption of data. The good thing is, you don't need to sign in to read.

4. Evernote : There are times when you want to take quick notes or write articles. Evernote is the best place to do that. We can have Evernote everywhere, from our phones to Chrome plugins to desktop applications with a single-click feature. So what makes it so special? Its syncing feature. When I create a note and write until I leave work, I can resume it on my phone while walking. Yes, it syncs instantly. Don't other applications function in a similar way? Yes, but I haven't seen such an easily customizable app. You can segregate notebooks and create notes in each notebook. What I am missing here is a notebook inside another notebook. It might be helpful for users like me who look for a hierarchy of notebooks.

5. Google plus : G+ is a must-try application. Many people who use other social apps don't use Google plus very often on their smartphones. Since Google updated G+ a couple of months ago, things have changed a lot. The killer feature in G+ is the way you see pictures and updates when you scroll down. In addition, it is very fast and dynamic. Despite two crashes since I installed it, I never felt uncomfortable with it. It is just amazing. Still, what I miss here is tagging friends. I am the kind of person who likes to tag friends when I add pictures or check in. I have not seen anything for it so far, or am I missing something?

6. Skitch : This is another gem of an application from Evernote. I have been using it since I started using a Mac and now on the iPhone 5 as well. Skitch is nothing but a white board with a piece of chalk. You write whatever you like, you draw whatever you want. Skitch provides tools to draw at our convenience, adapted to the screen size. How many of you like to post a status in your own handwriting rather than typed text, just like what a phablet offers? Then you have got to try Skitch. Evernote has recently announced they are integrating it with its core application.

7. Pulse : Pulse is another great news feed application which is very dynamic and easy to use like Flipboard. The new pulse has a great feature of distinguishing your categories of choice. Have an account, update your news feed with your choice and use it across multiple platforms. 

8. Quora : Who doesn't know about Quora? I am also very addicted to it like everyone else. Some people say it is nothing but a question and answer forum like any other group. But I haven't seen any other application which brings everything to one place. Quora is not a new idea, but they integrated different ideas in one UI. Follow the topics you like, post questions, get answers to many exotic questions from experts around the world. If you ask me what Quora is in one sentence, I will say it teaches me more than what I read in books. I feel I am personally interacting with different people around the globe.

9. Viddy : If you are like me, don't you want to capture some of the beautiful things that happen around you and share them with your friends? Viddy is for you. You can take short 15 second videos, add themes and change the background music. It is an easy to use app and has some good themes depending on your videos. It's another must-try application even if you are a big fan of Socialcam.

10. Cards : Last but not least, Cards. It takes a lot of physical effort off you. Apple developed Cards, which helps you send greeting cards to your friends at a provided address. We have access to different kinds of templates based on the occasion and can update them with our custom text and pictures. Don't you want to give it a try?

Hope it helps !!!

    • #iPhone5 mobileapps
  • 7 months ago

Constructors vs Static Factory Methods

While initializing variables in my programs, I used to use constructors a lot. Gradually I moved to static factory methods, as suggested by many on Stack Overflow, and then tried to find the proper reasons.  I have wanted to share my knowledge and experience on this for a long time. But I had to wait until I had the clarity and confidence to give some info on static factory methods and why I prefer to choose them over constructors. So I feel writing makes more sense than tweeting on this.  Let's dig into a little background on these two approaches.

First of all, why do we use constructors in Java? Constructors allow you to initialize variables. They give values to our variables as soon as we create objects. The basic constraint on a constructor is that it must have the same name as the class.

Let me give a very easy instance of creating constructors. 

    public class DigitalMusic
    {
         public String albumName;
         public float albumNumber;

         //Constructor
         public DigitalMusic(float number, String name)
         {
              albumName = name;
              albumNumber = number;
         }

         public void print()
         {
             //Print statements
         }
    }

As simple as that. Now you can initialize your variables by passing values to the constructor when creating instances. I want to create 2 objects and pass values to them.

    public static void main(String args[])
    {
         DigitalMusic d1 = new DigitalMusic(121897, "21");
         DigitalMusic d2 = new DigitalMusic(134560, "19"); // 21 and 19 are strings here.

         d1.print();
         d2.print();
    }

Fair enough, and as a Java developer I pretty much know the use of constructors and when to use them. But as a reader I get one obvious question. When we have easily readable and understandable constructors, why do we have to go with other options like static factory methods? Yes, it is a fairly reasonable question to ask. I will try to answer it in my way. Of course there might be a number of articles on this, but having everything in one place can make things faster while coding.

Before going further, I want to make it clear that a static factory method has nothing to do with the Factory pattern which we find in Design Patterns. I was also a little confused about this until I read "Effective Java" by Joshua Bloch, a famous Googler and blogger.

Let's talk about the advantages of static factory methods.

a) A static factory method can have a different name, unlike constructors. So we can name it according to our context, which makes it easier to work with from our client classes.

b) Instantiation is of course as easy as with constructors and we can do it on demand.

c) Unlike constructors we can represent the objects with the help of static factory methods.

d) We don't need to create a new object each time we invoke a static factory method. I personally like this aspect, because many times we see memory problems when constructors are invoked a number of times unnecessarily. Then we have to resort to other measures like nullifying objects, careful usage of caches or, in the worst case, finalizers, which I totally dislike. To avoid all these insidious consequences, let's avoid the situation beforehand.

e) "Another useful advantage while inheriting number of classes is it allows immutable classes to use preconstructed instances, or to cache instances as they’re constructed, and dispense them repeatedly to avoid creating unnecessary duplicate objects."

f) There are times when you want to return objects of a subclass type. Fortunately, a static factory method does that for you without any painful casting or creating another instance just for that.

These are some very handy benefits I found in my experience.

For reference, some popular names we use for these methods are : valueOf, of, getInstance, newInstance, getType, newType.

In the same class we created before, we can add the following static factory method, which I want to show alongside the constructor.  Suppose we had made the constructor private, so that neither client classes nor subclasses can call it.

    //Static factory which initializes variables
    public static DigitalMusic valueOf(float number, String name)
    {
         return new DigitalMusic(number, name);
    }

Now it doesn't matter whether you create a public or a private constructor. Any client class can obtain an instance through valueOf in the above case. You can explore the usage of the other method names in the same way.
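Advantages (d) and (e) can be sketched with a small immutable class that hides its constructor and hands out cached instances. This is only an illustration: the class Album, its cache and the choice to key the cache by name are assumptions made for the example, not part of the DigitalMusic class above.

```java
import java.util.HashMap;
import java.util.Map;

final class Album {
    // Cache of instances already handed out, keyed by album name.
    private static final Map<String, Album> CACHE = new HashMap<>();

    private final String name;

    // Private constructor: clients must go through valueOf.
    private Album(String name) {
        this.name = name;
    }

    // Static factory: returns the cached instance when one exists,
    // and creates a new object only the first time a name is seen.
    static synchronized Album valueOf(String name) {
        Album cached = CACHE.get(name);
        if (cached == null) {
            cached = new Album(name);
            CACHE.put(name, cached);
        }
        return cached;
    }

    String name() {
        return name;
    }

    public static void main(String[] args) {
        Album a = Album.valueOf("21");
        Album b = Album.valueOf("21");
        System.out.println(a == b); // same instance is reused
    }
}
```

A cache like this holds on to every instance it has created, so it suits classes with a small, bounded set of values; for an unbounded set it would itself become a source of memory growth.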

One might ask, if these do everything, why are constructors so widely popular? Because static factory methods have a couple of drawbacks as well, like every other approach. I googled the same question to find out for myself. Following are the two valid reasons I found.

  1. A disadvantage of providing only static factory methods is that classes without public or protected constructors cannot be subclassed.
  2. A second disadvantage of static factory methods is that they are not readily distinguishable from other static methods, which I think is not a major disadvantage that could stop us.

So now I see we have good scope for avoiding undesired memory overhead and for giving client classes a coding-friendly way to create objects.

Thanks for time and appreciate inputs. Discussions are always a better way to share and learn. 

    • #Java7 Constructors
  • 7 months ago

About

Software Engineer, Avid learner....
