Category Archives: java

How to read CSV files in Java – A case study of Iterator and Decorator

In this post, I will talk about how to read CSV (Comma-separated values) files using Apache Common CSV. From this case study, we will learn how to use Iterator and Decorator in context of design pattern to improve the reusability in different situations. But before we get started, I guess I have to answer two questions first.

  1. Why do I need a third party library if there are more than enough DIY posts talking about how to read CSV files?
    It is true that when you google “java csv parser”, you will get several related posts. But even if you are a beginner, you won’t be satisfied with these shallow methods. Of course using BufferedReader and String.split() will successfully parse a typical CSV file, but you won’t learn ANYTHING from it except making redundant. On the other hand, like what I will show below, using and studying Apache Common CSV will teach you several topics in Design Pattern, for instance iterator and decorator.

  2. Why Apache Common CSV, not others?
    As far as I know, there are several other libraries on Sourceforge or Google code. However, if you look into details of their code, forgive my criticism, none of them are flexible and manageable: some are too simple to meet users various requirements; others are too complicated and painful to use. Furthermore, most of them I’ve come across don’t have commercial-friendly licenses. You know, sometimes, it really scares users off.

Apache Common CSV is still in sandbox, which means there are currently no official download and stable release. But nightly builds may be available.

Using Iterator to hide underlying representation

Let me begin with a sample CSV file, where each record is located on a separate line, delimited by a line break. The first line is the header containing two names COL1 and COL2 corresponding to the fields in the file. The rest of the file contains three records with fields separated by commas.


The code using Apache Common CSV to read this file is:

public void test() throws FileNotFoundException, IOException {
  CSVParser parser = new CSVParser(
      new FileReader("test.csv"), 
  for (CSVRecord record : parser) {

CSVParser is used to parse CSV files according to the specified format. Here I use the default CSVFormat together with setting withHeader() with no argument. This enables the parser to treat the first line of the CSV file as the header and to make the record.get("COL1") valid. CSVParser provides an iterative way of reading records. Here we meet the first design pattern Iterator. It provides a way to access the records of a CSV file sequentially without exposing its underlying representation, like how to skip over comment line and how to map the column name to the field value. For each record, we use CSVRecord.get(String name) to retrieve the field value by its name.

CSVRecord provides different ways to access the field value: by name or by index. If you are not sure the field has a value or is empty, CSVRecord.isSet(String name) can be called before. If you just want to check whether a name has been defined to the parser, call CSVRecord.isMapped(String name) instead.

Using Decorator to allow different behaviors

CSVFormat.DEFAULT or CSVFormat.RFC4180 follows the RFC4180 format. So fields enclosed in double quotes can be handled too, such as


In RFC4180, fields in a CSV file should be separated by commas. But in general, the library can handle arbitrary delimiter like TAB or space. To make the code reusable, the library provides a way to create your own CSVFormat,

CSVFormat format = CSVFormat.newFormat(',')

The above format is same as the CSVFormat.DEFAULT. Here we encounter another design pattern Decorator, which allows behavior to be added to an individual object, either statically or dynamically, without affecting the behavior of other objects from the same class. In the case of CSVFormat, every withXXX() method returns a new CSVFormat that is equal to the calling one but with one attribute modified. The question here might be why not just return the self-reference this? I think it is because the later way will fail the following code

CSVFormat format = CSVFormat.newFormat(','); CSVFormat format1 = format.withQuoteChar('"'); CSVFormat format2 = format.withHeader(); read more

Most efficient way to increment a Map value in Java — Only search the key once

This question may be considered too basic, but is frequently asked in the forums. In this post, I will discuss one way that only searches the key in Map ONCE.

Let’s first look at an example. Say I’m creating a string frequency list, using a Map, where each key is a String that is being counted and the value is an Integer that’s incremented each time a String is added. A straightforward way of achieving it is

int count = map.containsKey(string) ? map.get(string) : 0;
map.put(string, count + 1);

This piece of code runs quite slowly because it contains three potentially expensive operations on a map, namely containsKey(), get(), and put(). Each requires to search the key in the map. Now let’s refactor the code for better performance.

Integer vs MutableInteger vs AtomicInteger

One important reason that we have to invoke three expensive operations is using Integer for counting. In Java, Integer is immutable. It prevents us modifing the integer value after construction. Therefore, to increment a counter, we have to first get the integer from the map, then create another new integer by adding one, and put it back to the map.

To make the counter mutable, there are several ways. One is to simply create your own MutableInteger, like what I showed below.

public class MutableInteger {

  private int val;

  public MutableInteger(int val) {
    this.val = val;

  public int get() {
    return val;

  public void set(int val) {
    this.val = val;

Another way might be using AtomicInteger in Java, which is used in applications such as atomically incremented counters. But the main choice for AtomicInteger is if you want to achieve thread safety with the operations on the integer. Therefore it cannot be used as a replacement for an Integer. Based on this, if thread-safety is not a strong consideration of your project, I won’t recommend using AtomicInteger.

Search the key only once

After using the MutableInteger, we can change the above code to

if (map.containsKey(string)) {
  MutableInteger count = map.get(string);
  count.set(count.get() + 1);
} else {
  map.put(string, new MutableInteger(1));


MutableInteger count = map.get(string); if (count != null) { count.set(count.get() + 1); } else { map.put(string, new MutableInteger(1)); } read more

How to extend enum in Java

In Java feature, it is not possible to have one enum extend another. For most of cases you may encounter, it is reasonable: not only would it be confusing whether to enumerate over all of the elements of a superclass and its subclass, but also make difficult to interact with switch statements.

But occasionally, you may still want to make an enum class extensible. For example, you need an enum to represent basic fonts supported by your system and want third parties to be able to add new fonts. The solution we came up with was to use an enum and a name-based lookup. Both will be done by implementing a mixin interface.

In Java, a mixin is an interface which contains a combination of methods from other classes. It encourages code reuse and avoid multiple inheritance as well. This post will suggest one way to restore the subclassing variant using mixin. The solution has a few parts (copied from “Mixing-in an Enum”):

  • Define an interface with the needed functionality.
  • Declare enums implementing the interface.
  • Include a factory method mapping from names to objects implementing the interface.

This combination of a enum for known values and an interface for extensibility provides a good alternative to string-based provider lookups.

An example

Let’s begin with a simple example. There are a number of basic monospaced fonts supported for an Ubuntu system, such as “Courier”, “Consolas”, and “DejaVu”, so using an enum for the default installed fonts would be structured. For example, BasicMonoFont is created and has constants for the “Courier”, “Consolas”, and “DejaVu”.

public enum BasicMonoFont {
  Courier("Courier", new File("/usr/share/fonts")), //
  Consolas("Consolas", new File("/usr/share/fonts")), //
  DejaVu("DejaVu", new File("/usr/share/fonts"));

  private final String fontName;
  private final File   location;

  private BasicMonoFont(String fontName, File location) {
    this.fontName = fontName;
    this.location = location;

  public String getFontName() {
    return fontName;

  public File getLocation() {
    return location;

But we need to add other fonts too, such as “Lucida”, “Monaco”, and “Andale”. We can, of course, change the source code to add new fonts, but it absolutely lacks of extensibility. The solution we came up with was to use an enum and a name-based lookup. Both will be done by implementing a mixin interface.

1. Mixin interface

In Java, a mixin is an interface which contains a combination of methods from other classes. It encourages code reuse and avoid multiple inheritance as well. To solve the problem above, we first define a interface with needed operations. In this case, the MonoFont has two methods, getFontName() and getLocation(). The first is used to retrieve the key of the font and the second is used to store its installed location.

public interface MonoFont { public String getFontName(); public File getLocation(); } read more

A Java Implementation of Fibonacci Heap

This project provides a Java implementation of Fibonacci Heap. Fibonacci Heap is a heap data structure consisting a collection of min-heap-ordered trees. It is an advanced amortized data structure, whose amortized performance is better than common collections like Linked List. However, some of its operations can run as slowly as linear time in worst case, therefore make it inappropriate for general use. On the other hand, especially in terms of algorithm learning and time analysis, it is a very good data structure.

Introduction to Algorithms (3rd ed) Chap 21 and Wiki give a very detailed discussion of Fibonacci heap. For the sake of simplicity, I only re-print the running time comparison between Linked List and Fibonacci Heap

Operations  | Linked List | Fibonacci Heap
insert      |     O(1)    |   O(1)
minimum     |     O(n)    |   O(1)
extractMin  |     O(n)    |   O(log n)
union       |     O(1)    |   O(1)
decreaseKey |     O(1)    |   O(l)*
delete      |     O(1)    |   O(log n)*

* Amortized time

Here are explanations of operations, modified from Introduction to Algorithms (3rd ed)

  • insert(x) inserts a node x into the heap.
  • minimum() returns the node in the heap with minimum key.
  • extractMin() deletes the node with minimum key from the heap.
  • union(H) merge heap H and create a new one.
  • decreaseKey(x,k) assigns to node x within the heap the new key value k, which is assumed to be no greater than its current key value.
  • delete(x) deletes node x from the heap.


In FabonacciHeap, much of the code is based on the algorithms in the Introduction to Algorithms (3rd ed) Chap 21. The amortized cost of most of these methods is O(1), making it a very fast data structure. Several have an actual running time of O(1). extractMin() and delete() have O(log n) amortized running times because they do the heap consolidation().

The data structure of a Fibonacci heap is (copied from Warren Armstrong’s website)

For each node FibonacciHeap.Entry,

  • Two pointers left and right are used to form a circular double linked list.
  • One pointer parent is used to its parent.
  • One pointer child is used to one of its children.
  • degree stores the number of children.
  • mark “indicates whether node has lost a child since the last it was make the child of another node.” The boolean mark is very useful to amortize the running time of decrease-key.
  • An integer key is used to indicate the key value of the node. Because in delete(), the key needs to set to MIN_VALUE, Comparable cannot be used for a more general case.


FabonacciHeap provides an implementation of a Fibonacci heap in Java. Although the source code supports many operations on the heap, from the above comparison table, it can be seen that only insert(), minimum(), extractMin(), and union() are recommended to use.

After that, the use of FibonacciHeap is straightforward.

// construct a heap FibonacciHeap heap = new FibonacciHeap(); FibonacciHeap.Entry n1 = new FibonacciHeap.Entry(23, 0); heap.insert(n1); FibonacciHeap.Entry n2 = new FibonacciHeap.Entry(7, 0); heap.insert(n2); FibonacciHeap.Entry n3 = new FibonacciHeap.Entry(35, 0); heap.insert(n3); // minimum System.out.println(heap.minimum()); // extractMin heap.extractMin(); System.out.println(heap.minimum()); // decrease key heap.decreaseKey(n1, 1); System.out.println(heap.minimum()); read more

How to use Opennlp to do part-of-speech tagging

How to use Opennlp to do part-of-speech tagging


The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. One of the most popular machine learning models it supports is Maximum Entropy Model (MaxEnt) for Natural Language Processing task. Among others, part-os-speech tagging (POS tagging) is one of the most common NLP tasks.

In short, POS tagging is the process of marking up a word to a particular part-of-speech. It is usually required to build more advanced text processing services. Explaination of POS tagging and its techniques using MaxEnt is beyond the scope of this post. So I won’t discuss it further. For more details, readers are recommended to refer this report. Apache Opennlp Document provides are very brief introduction of how to use the POS tagger tool. However, researchers and developers may want to look in depth of its usage. This is what this post aims at.

Library and model

The latest version of opennlp package can be found here. The current version is 1.5.*. After downloaded, we can extract it to $OPENNLP_HOME.

Apache Opennlp project also provides models compatible with the above package. Among these models, we will use en-pos-maxent.bin which is a MaxEnt model with tag dictoinary and can be downloaded here. The tag set is the Penn Treebank tag set, however it is unknown which corpus was used to train this model. But this model is language dependent, therefore, I assume it will perform well for common English. After downloaded, we can put the model either in the opennlp folder $OPENNLP_HOME, or other places. For the sake of simplicity, I will just put it under $OPENNLP_HOME.

After got the opennlp package and the model (put the model in the opennlp folder), let’s have a look at how to use it for the POS tagging. As far as I know, there are at least three ways to invoke the tagger: command line, POSTaggerTool, and POSTaggerME. Each will be fully discussed later.

Command line

The easiest way to try out the POS tagger is the command line tool. The tool is called opennlp located in the bin folder. Suppose the opennlp folder is $OPENNLP_HOME, the command line is

$OPENNLP_HOME/bin/opennlp POSTagger $OPENNLP_HOME/en-pos-maxent.bin 
    < sentences

The POS tagger will read sentences from the file sentences. The format is: one sentence per line with words and punctuation tokenized. For example, the input sentence can be (Note, the stop at the end is tokenized too)

Here we have investigated the functional domains of the proteins .

The output is the sentences with POS tags in the format of “word_pos [space] word_pos [space] ...“. So the result of the above example will be,

Here_RB we_PRP have_VBP investigated_VBN the_DT functional_JJ domains_NNS of_IN 
the_DT proteins_NNS ._PERIOD


You can use POSTaggerTool to invoke the tagger. The tool takes only one parameter, the path of the model file. Actually, the command line above eventually invokes POSTaggerTool at the background. So it seems no advantage to use this snippet of code. But it makes no hurt to know more. :p

public static void main(String[] args) {
  POSTaggerTool tool = new POSTaggerTool();


If you want to embbed the POS tagger into the application, you have to use POSTaggerME. A complete code of using POSTaggerME is attached at the very end of this post. In this section, I will only use pieces of codes to discuss its usage.

To invoke POSTaggerME, you first need to load the model, then initialize POSTaggerME.

// load the model InputStream modelIn = new FileInputStream("en-pos-maxent.bin"); POSModel model = new POSModel(modelIn); // initialize POSTaggerME POSTaggerME tagger = new POSTaggerME(model); read more

Load Penn Treebank with offsets


When I am doing relation extraction based on parsing trees, it is always very helpful to map the leaves of parsing trees back the original text. So when I am searching a leave (or an inner node), I can find where it comes from. In order to do so, I wrote a script to add offsets into the leaves. The script prints parsing trees in “Penn Treebank” format. For example, together with the sentence and the tree

After 48 h, cells were harvested for luciferase assay.

(S1 (S (S (PP (IN After) 
              (NP (CD 48) 
                  (NN h))) 
          (, ,) 
          (NP (NNS cells)) 
          (VP (VBD were) 
              (VP (VBN harvested) 
                  (PP (IN for) 
                      (NP (NN luciferase) 
                          (NN assay)))))) 
       (. .)))

the parsing tree with offsets looks like

(S1 (S (S (PP (IN After_294_299) (NP (CD 48_300_302) (NN h_303_304))) (, ,_304_305) (NP (NNS cells_306_311)) (VP (VBD were_312_316) (VP (VBN harvested_317_326) (PP (IN for_327_330) (NP (NN luciferase_331_341) (NN assay_342_347)))))) (. ._347_348))) read more

Split a string into pairs of words

From stackoverflow:


Given a string such as “aa bb cc dd ee ff“, is there a regex that works with String.split() to extract two words at a time? The expected result is:

[aa bb, cc dd, ee ff]

Note: This question is about the split regex. It is not about “finding a work-around” or other “making it work another way” solutions.


The regex expression is (?<!Gw+)s. Here are some explanation of the regex: read more