Data Scientist @Salesforce.com

Programming Exercise - III Tokenizer in Java.

Hello friends, today I want to share a use case I was working on. Let's say we have a big text file and I want to do a word count on it, that is, count the number of times each word occurs in the file. The first thing that comes to mind is the classic Hadoop WordCount example: write a simple MapReduce program to do the job. In a regular MapReduce job, the mapper tokenizes the text and emits a key-value pair for each word, and the reducer adds up those pairs to get the number of times each word is in the file. For that, we need a proper understanding of tokenizing.
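To make the idea concrete before touching Hadoop, here is a minimal single-machine sketch of the same flow in plain Java. It is not Hadoop code; the class name WordCountSketch and the method countWords are names I made up for illustration, and the delimiters (period, comma, space) are an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class WordCountSketch {
    // Counts how often each token occurs, ignoring case.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed delimiters: period, comma and space.
        StringTokenizer st = new StringTokenizer(text, "., ");
        while (st.hasMoreTokens()) {
            // "Map" step: each token stands for the pair (word, 1).
            String word = st.nextToken().toLowerCase();
            // "Reduce" step: values with the same key are summed.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("The cat saw the dog. The dog ran.");
        System.out.println(counts.get("the")); // 3
        System.out.println(counts.get("dog")); // 2
    }
}
```

In a real MapReduce job the two steps run in separate mapper and reducer classes, possibly on different machines; here they are folded into one loop only to show the logic.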

The StringTokenizer class in java.util provides methods such as hasMoreElements(), hasMoreTokens(), nextElement(), nextToken() and countTokens().
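A small sketch of how these methods behave may help. The class name TokenizerMethodsDemo and the helper describe are hypothetical, and the helper assumes the input contains at least one token (otherwise nextToken() throws NoSuchElementException).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerMethodsDemo {
    // Records what each method reports while walking through the text.
    static List<String> describe(String text) {
        List<String> log = new ArrayList<>();
        // With no delimiter argument, whitespace is used as the delimiter.
        StringTokenizer st = new StringTokenizer(text);
        // countTokens() reports how many tokens are still left to read.
        log.add("count=" + st.countTokens());
        log.add("first=" + st.nextToken());
        log.add("count=" + st.countTokens());
        // hasMoreElements()/nextElement() do the same job as
        // hasMoreTokens()/nextToken(), but nextElement() returns Object.
        while (st.hasMoreElements()) {
            Object element = st.nextElement();
            log.add("element=" + element);
        }
        log.add("more=" + st.hasMoreTokens());
        return log;
    }

    public static void main(String[] args) {
        // [count=3, first=one, count=2, element=two, element=three, more=false]
        System.out.println(describe("one two three"));
    }
}
```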

We get an instance of the tokenizer with the following declaration:

StringTokenizer st = new StringTokenizer(readFile, "., ");

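The delimiters I have used are the period (.), the comma (,) and the space ( ). Every character in the delimiter string acts as a separate delimiter, and our string is split into tokens wherever one of them appears. We can use a list or a map to store the individual tokens.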
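As a rough illustration of storing tokens in a list, assuming the same three delimiters; the class name TokenListDemo and the method tokenize are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenListDemo {
    // Splits the text on period, comma and space and collects the tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, "., ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Runs of delimiters such as ", " or ". " do not produce empty tokens.
        System.out.println(tokenize("Hello, world. Hello again.")); // [Hello, world, Hello, again]
    }
}
```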

Processing huge files has become a significant part of today's programming practice. In Java, we use either String.split() or StringTokenizer for it. Here is a simple tokenizer program that takes two arguments: the filename and the word I want to count in the file.
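The two approaches can be compared side by side in a short sketch. The class and method names are hypothetical, and the regular expression "[., ]+" is my assumed equivalent of the delimiter string "., ".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class SplitVersusTokenizer {
    // String.split takes a regular expression; "[., ]+" matches one or
    // more periods, commas or spaces in a row.
    static List<String> withSplit(String text) {
        return Arrays.asList(text.split("[., ]+"));
    }

    // StringTokenizer takes a plain set of delimiter characters.
    static List<String> withTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, "., ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(withSplit("a, b. c"));     // [a, b, c]
        System.out.println(withTokenizer("a, b. c")); // [a, b, c]
        // When the text starts with a delimiter, split keeps an empty
        // first element, while the tokenizer skips it.
        System.out.println(withSplit(". a b"));       // [, a, b]
        System.out.println(withTokenizer(". a b"));   // [a, b]
    }
}
```

So with split we may need to filter out an empty first element, while the tokenizer never returns empty tokens.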

Hope I am clear.

package com.myCode.topCoder;

import java.io.*;
import java.util.*;

public class Tokenizer {
    public static void main(String[] args) throws IOException {
        // Both arguments are required: the file to read and the word to count.
        if (args.length < 2) {
            System.out.println("Please specify a filename and a word to count");
            System.exit(1);
        }
        // Lower-case the search word, because every token is lower-cased
        // before it is compared.
        String word = args[1].toLowerCase();

        // Read the file into a String or a list, depending on the size of
        // the file. Here I am taking a String (built with a StringBuilder).
        StringBuilder readFile = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0])))) {
            String line = br.readLine();
            while (line != null) {
                // Add a space so the last word of one line does not merge
                // with the first word of the next line.
                readFile.append(line).append(' ');
                line = br.readLine();
            }
        }

        // Tokenize readFile, which has all the words from the file,
        // on period, comma and space.
        StringTokenizer st = new StringTokenizer(readFile.toString(), "., ");
        System.out.println(word);
        int wordcount = 0;
        while (st.hasMoreTokens()) {
            String s = st.nextToken();
            // Echo each token; handy for small files, noisy for big ones.
            System.out.println(s);
            if (word.equals(s.toLowerCase())) {
                wordcount += 1;
            }
        }
        System.out.println(wordcount);
    }
}
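
For a really big file, holding the whole content in one String may not be practical. One possible variation, sketched here under my own assumptions, is to tokenize each line as it is read and keep only the running count. The class name StreamingWordCount and the method countWord are hypothetical, and the sample text stands in for a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.StringTokenizer;

class StreamingWordCount {
    // Counts case-insensitive occurrences of word, one line at a time,
    // so only the current line is held in memory.
    static int countWord(Reader source, String word) throws IOException {
        String target = word.toLowerCase();
        int count = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, "., ");
                while (st.hasMoreTokens()) {
                    if (target.equals(st.nextToken().toLowerCase())) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a file in this example.
        Reader sample = new StringReader("Java is fun.\nI like java, really.\n");
        System.out.println(countWord(sample, "java")); // 2
    }
}
```

To run it against a real file, pass a java.io.FileReader for the file path instead of the StringReader.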

Inputs are most welcome.