Monthly Archives: September 2013

How to extend enum in Java

In Java feature, it is not possible to have one enum extend another. For most of cases you may encounter, it is reasonable: not only would it be confusing whether to enumerate over all of the elements of a superclass and its subclass, but also make difficult to interact with switch statements.

But occasionally, you may still want to make an enum class extensible. For example, you need an enum to represent basic fonts supported by your system and want third parties to be able to add new fonts. The solution we came up with was to use an enum and a name-based lookup. Both will be done by implementing a mixin interface.

In Java, a mixin is an interface which contains a combination of methods from other classes. It encourages code reuse and avoid multiple inheritance as well. This post will suggest one way to restore the subclassing variant using mixin. The solution has a few parts (copied from “Mixing-in an Enum”):

  • Define an interface with the needed functionality.
  • Declare enums implementing the interface.
  • Include a factory method mapping from names to objects implementing the interface.

This combination of a enum for known values and an interface for extensibility provides a good alternative to string-based provider lookups.

An example

Let’s begin with a simple example. There are a number of basic monospaced fonts supported for an Ubuntu system, such as “Courier”, “Consolas”, and “DejaVu”, so using an enum for the default installed fonts would be structured. For example, BasicMonoFont is created and has constants for the “Courier”, “Consolas”, and “DejaVu”.

public enum BasicMonoFont {
  Courier("Courier", new File("/usr/share/fonts")), //
  Consolas("Consolas", new File("/usr/share/fonts")), //
  DejaVu("DejaVu", new File("/usr/share/fonts"));

  private final String fontName;
  private final File   location;

  private BasicMonoFont(String fontName, File location) {
    this.fontName = fontName;
    this.location = location;
  }

  public String getFontName() {
    return fontName;
  }

  public File getLocation() {
    return location;
  }
}

But we need to add other fonts too, such as “Lucida”, “Monaco”, and “Andale”. We can, of course, change the source code to add new fonts, but it absolutely lacks of extensibility. The solution we came up with was to use an enum and a name-based lookup. Both will be done by implementing a mixin interface.

1. Mixin interface

In Java, a mixin is an interface which contains a combination of methods from other classes. It encourages code reuse and avoid multiple inheritance as well. To solve the problem above, we first define a interface with needed operations. In this case, the MonoFont has two methods, getFontName() and getLocation(). The first is used to retrieve the key of the font and the second is used to store its installed location.

public interface MonoFont {
  public String getFontName();
  public File getLocation();
}

read more

How to convert video using MEncoder

MEncoder is a free command line video decoding, encoding, and filtering tool. As a by-product of MPlayer, it can convert all formats that Mplayer can understand.

install mencoder

sudo apt-get install mencode

avi → mp4

mencoder 1.avi -o 1.mp4 -oac copy -ovc lavc -lavcopts vcodec=mpeg1video -of mpeg
  • o: output file name.
  • oac: audio codecs for encoding. copy does not reencode but just copy compressed frames.
  • ovc: video codecs for encoding. lavc means using one of libavcodec’s video codecs.
  • lavcopts: libavcodec’s video codecs. Here we choose MPEG-1 video.
  • of: output container formats. mpeg means MPEG-1 and MPEG-2 PS.

read more

When to post technology blogs?

Don’t mistake me: I’m not disagreeing with the importance of quality content in posting; on the contrary, I always believe that creating original content is the most essential part of a successful blog. But beyond that, probably we can do a bit better.

When to post is another important aspect for a successful blog. Are certain times better than others? The answer is absolutely Yes, but it depends on the industry and the nature of your group personality. This article only focuses on technology blogs. In this post, I attempt to combine different research resources and draw a few basic posting guidelines by time of hour and day. The timing in the post is relative to the time zone. My main resources include

read more

A Java Implementation of Fibonacci Heap

This project provides a Java implementation of Fibonacci Heap. Fibonacci Heap is a heap data structure consisting a collection of min-heap-ordered trees. It is an advanced amortized data structure, whose amortized performance is better than common collections like Linked List. However, some of its operations can run as slowly as linear time in worst case, therefore make it inappropriate for general use. On the other hand, especially in terms of algorithm learning and time analysis, it is a very good data structure.

Introduction to Algorithms (3rd ed) Chap 21 and Wiki give a very detailed discussion of Fibonacci heap. For the sake of simplicity, I only re-print the running time comparison between Linked List and Fibonacci Heap

Operations  | Linked List | Fibonacci Heap
-------------------------------------------
insert      |     O(1)    |   O(1)
minimum     |     O(n)    |   O(1)
extractMin  |     O(n)    |   O(log n)
union       |     O(1)    |   O(1)
-------------------------------------------
decreaseKey |     O(1)    |   O(l)*
delete      |     O(1)    |   O(log n)*

* Amortized time

Here are explanations of operations, modified from Introduction to Algorithms (3rd ed)

  • insert(x) inserts a node x into the heap.
  • minimum() returns the node in the heap with minimum key.
  • extractMin() deletes the node with minimum key from the heap.
  • union(H) merge heap H and create a new one.
  • decreaseKey(x,k) assigns to node x within the heap the new key value k, which is assumed to be no greater than its current key value.
  • delete(x) deletes node x from the heap.

Implementation

In FabonacciHeap, much of the code is based on the algorithms in the Introduction to Algorithms (3rd ed) Chap 21. The amortized cost of most of these methods is O(1), making it a very fast data structure. Several have an actual running time of O(1). extractMin() and delete() have O(log n) amortized running times because they do the heap consolidation().

The data structure of a Fibonacci heap is (copied from Warren Armstrong’s website)



For each node FibonacciHeap.Entry,

  • Two pointers left and right are used to form a circular double linked list.
  • One pointer parent is used to its parent.
  • One pointer child is used to one of its children.
  • degree stores the number of children.
  • mark “indicates whether node has lost a child since the last it was make the child of another node.” The boolean mark is very useful to amortize the running time of decrease-key.
  • An integer key is used to indicate the key value of the node. Because in delete(), the key needs to set to MIN_VALUE, Comparable cannot be used for a more general case.

Usage

FabonacciHeap provides an implementation of a Fibonacci heap in Java. Although the source code supports many operations on the heap, from the above comparison table, it can be seen that only insert(), minimum(), extractMin(), and union() are recommended to use.

After that, the use of FibonacciHeap is straightforward.

// construct a heap
FibonacciHeap heap = new FibonacciHeap();
FibonacciHeap.Entry n1 = new FibonacciHeap.Entry(23, 0);
heap.insert(n1);
FibonacciHeap.Entry n2 = new FibonacciHeap.Entry(7, 0);
heap.insert(n2);
FibonacciHeap.Entry n3 = new FibonacciHeap.Entry(35, 0);
heap.insert(n3);

// minimum
System.out.println(heap.minimum());

// extractMin
heap.extractMin();
System.out.println(heap.minimum());

// decrease key
heap.decreaseKey(n1, 1);
System.out.println(heap.minimum());

read more

How to delete all .svn in a folder

If you need to remove all .svn folders in a project, run

find . -name ".svn" -exec rm -rf {} ;

or more precisely, run

find . -type d -name '.svn' -print -exec rm -rf {} ;

The script starts from the current directory and searches recursively to the sub-folders. This script also works on other names by replacing “.svn”. But if the name is too general especially if you use the regular expression, you may end up by deleting a bunch of folders/files you don’t want to. A safer way is to run find first to verify all found folders/files are really what you intend to remove.

read more

How to use Opennlp to do part-of-speech tagging

How to use Opennlp to do part-of-speech tagging

Introduction

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. One of the most popular machine learning models it supports is Maximum Entropy Model (MaxEnt) for Natural Language Processing task. Among others, part-os-speech tagging (POS tagging) is one of the most common NLP tasks.

In short, POS tagging is the process of marking up a word to a particular part-of-speech. It is usually required to build more advanced text processing services. Explaination of POS tagging and its techniques using MaxEnt is beyond the scope of this post. So I won’t discuss it further. For more details, readers are recommended to refer this report. Apache Opennlp Document provides are very brief introduction of how to use the POS tagger tool. However, researchers and developers may want to look in depth of its usage. This is what this post aims at.

Library and model

The latest version of opennlp package can be found here. The current version is 1.5.*. After downloaded, we can extract it to $OPENNLP_HOME.

Apache Opennlp project also provides models compatible with the above package. Among these models, we will use en-pos-maxent.bin which is a MaxEnt model with tag dictoinary and can be downloaded here. The tag set is the Penn Treebank tag set, however it is unknown which corpus was used to train this model. But this model is language dependent, therefore, I assume it will perform well for common English. After downloaded, we can put the model either in the opennlp folder $OPENNLP_HOME, or other places. For the sake of simplicity, I will just put it under $OPENNLP_HOME.

After got the opennlp package and the model (put the model in the opennlp folder), let’s have a look at how to use it for the POS tagging. As far as I know, there are at least three ways to invoke the tagger: command line, POSTaggerTool, and POSTaggerME. Each will be fully discussed later.

Command line

The easiest way to try out the POS tagger is the command line tool. The tool is called opennlp located in the bin folder. Suppose the opennlp folder is $OPENNLP_HOME, the command line is

$OPENNLP_HOME/bin/opennlp POSTagger $OPENNLP_HOME/en-pos-maxent.bin 
    < sentences

The POS tagger will read sentences from the file sentences. The format is: one sentence per line with words and punctuation tokenized. For example, the input sentence can be (Note, the stop at the end is tokenized too)

Here we have investigated the functional domains of the proteins .

The output is the sentences with POS tags in the format of “word_pos [space] word_pos [space] ...“. So the result of the above example will be,

Here_RB we_PRP have_VBP investigated_VBN the_DT functional_JJ domains_NNS of_IN 
the_DT proteins_NNS ._PERIOD

POSTaggerTool

You can use POSTaggerTool to invoke the tagger. The tool takes only one parameter, the path of the model file. Actually, the command line above eventually invokes POSTaggerTool at the background. So it seems no advantage to use this snippet of code. But it makes no hurt to know more. :p

public static void main(String[] args) {
  POSTaggerTool tool = new POSTaggerTool();
  System.err.println(tool.getHelp());
  tool.run(args);
}

POSTaggerME

If you want to embbed the POS tagger into the application, you have to use POSTaggerME. A complete code of using POSTaggerME is attached at the very end of this post. In this section, I will only use pieces of codes to discuss its usage.

To invoke POSTaggerME, you first need to load the model, then initialize POSTaggerME.

// load the model
InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
POSModel model = new POSModel(modelIn);
// initialize POSTaggerME
POSTaggerME tagger = new POSTaggerME(model);

read more

Load Penn Treebank with offsets

Question

When I am doing relation extraction based on parsing trees, it is always very helpful to map the leaves of parsing trees back the original text. So when I am searching a leave (or an inner node), I can find where it comes from. In order to do so, I wrote a script to add offsets into the leaves. The script prints parsing trees in “Penn Treebank” format. For example, together with the sentence and the tree

After 48 h, cells were harvested for luciferase assay.

(S1 (S (S (PP (IN After) 
              (NP (CD 48) 
                  (NN h))) 
          (, ,) 
          (NP (NNS cells)) 
          (VP (VBD were) 
              (VP (VBN harvested) 
                  (PP (IN for) 
                      (NP (NN luciferase) 
                          (NN assay)))))) 
       (. .)))

the parsing tree with offsets looks like

(S1 (S (S (PP (IN After_294_299) 
              (NP (CD 48_300_302) 
                  (NN h_303_304))) 
          (, ,_304_305) 
          (NP (NNS cells_306_311)) 
          (VP (VBD were_312_316) 
              (VP (VBN harvested_317_326) 
                  (PP (IN for_327_330) 
                  (NP (NN luciferase_331_341) 
                      (NN assay_342_347)))))) 
       (. ._347_348)))

read more

Recommend a paper: Empirical Analysis of Programming Language Adoption

Leo A. Meyerovich and Ariel Rabkin. Empirical Analysis of Programming Language Adoption. OOPSLA ’13. 2013.

The following text is copied and slightly modified from the paper.

Some programming languages become widely popular while others fail to grow beyond their niche or disappear altogether. This paper uses survey methodology to identify the factors that lead to language adoption. Leo A. Meyerovich and Ariel Rabkin analyzed large datasets, including over 200,000 SourceForge projects, 590,000 projects tracked by Ohloh, and multiple surveys of 1,000-13,000 programmers. They report several prominent findings.

read more