Category Archives: toolkits

How to build PDFsam enhanced on Linux

PDFsam (PDF Split and Merge) is a free and open-source desktop utility for manipulating PDF documents (merging, splitting, page rotation, etc.). PDFsam Enhanced is the Pro version, with additional features including encryption/decryption of PDF files, extracting attached files, mixing two PDF files, and more. This post shows how to download the source code, compile it, and use it for free.

Recommend – LaTeX symbol classifier

When I am working with LaTeX, one difficulty is recalling the command for a symbol, and finding it on the Internet is not easy either. Detexify is quite useful: you simply draw the symbol and it recognizes the drawing and searches for matching commands.

MathJax with blogger.com

MathJax in blogger.com is useful when I need to input equations in my blog. This is an instruction and test of using MathJax in blogger.com.

  1. Set the blogger.com template to “Simple”. “Dynamic Views” does not seem to work.
  2. Edit the HTML by adding the following code between <head> and </head>: <script src='http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' type='text/javascript'/>
  3. Type a math equation in the blog post (see the test equation below).
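
As a quick test, a display equation can be typed directly into the post body using the standard TeX display delimiters. The formula below is just an arbitrary example chosen for this test, not something prescribed by the setup above:

\[ \int_{-\infty}^{\infty} e^{-x^{2}}\,dx = \sqrt{\pi} \]

If MathJax is loaded correctly, the line above renders as a typeset integral rather than as raw TeX source.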

How to fix Gedit LaTeX plugin error while saving BibTeX

Gedit (3.4.1) gives the error below while saving a BibTeX file (Ubuntu 12.04). The save itself works fine, apart from the error popping up.

'bibtex-error'

  Traceback (most recent call last):
    File "/usr/lib/gedit/plugins/latex/util.py", line 116, in decorated_function
      return function(*args, **kw)
    File "/usr/lib/gedit/plugins/latex/bibtex/editor.py", line 141, in __parse
      self.remove_markers("bibtex-error")
    File "/usr/lib/gedit/plugins/latex/editor.py", line 493, in remove_markers
      type_record = self._marker_types[marker_type]
  KeyError: 'bibtex-error'

How to fix it?

How to read CSV files in Java – A case study of Iterator and Decorator

In this post, I will talk about how to read CSV (comma-separated values) files using Apache Commons CSV. From this case study, we will learn how to use Iterator and Decorator, in the context of design patterns, to improve reusability in different situations. But before we get started, I guess I have to answer two questions first.

  1. Why do I need a third-party library if there are more than enough DIY posts talking about how to read CSV files?
    It is true that when you google “java csv parser”, you will get several related posts. But even if you are a beginner, you won’t be satisfied with these shallow methods. Of course, using BufferedReader and String.split() will successfully parse a typical CSV file, but you won’t learn ANYTHING from it except writing redundant code. On the other hand, as I will show below, using and studying Apache Commons CSV will teach you several topics in design patterns, for instance Iterator and Decorator.

  2. Why Apache Commons CSV, not others?
    As far as I know, there are several other libraries on SourceForge or Google Code. However, if you look into the details of their code, forgive my criticism, none of them are flexible and manageable: some are too simple to meet users’ various requirements; others are too complicated and painful to use. Furthermore, most of the ones I’ve come across don’t have commercial-friendly licenses, which, you know, can really scare users off.

Apache Commons CSV is still in the sandbox, which means there is currently no official download or stable release, but nightly builds may be available.

Using Iterator to hide underlying representation

Let me begin with a sample CSV file, where each record is located on a separate line, delimited by a line break. The first line is the header containing two names COL1 and COL2 corresponding to the fields in the file. The rest of the file contains three records with fields separated by commas.

COL1,COL2
a,b
c,d
e,f

The code using Apache Commons CSV to read this file is:

public void test() throws FileNotFoundException, IOException {
  // treat the first line of the file as the header, so fields can be read by name
  CSVParser parser = new CSVParser(
      new FileReader("test.csv"),
      CSVFormat.DEFAULT.withHeader());
  for (CSVRecord record : parser) {
    System.out.printf("%s\t%s\n",
      record.get("COL1"),
      record.get("COL2"));
  }
  parser.close();
}

CSVParser is used to parse CSV files according to the specified format. Here I use the default CSVFormat together with withHeader() with no argument. This makes the parser treat the first line of the CSV file as the header and makes record.get("COL1") valid. CSVParser provides an iterative way of reading records. Here we meet the first design pattern, Iterator. It provides a way to access the records of a CSV file sequentially without exposing the underlying representation, such as how comment lines are skipped or how a column name is mapped to a field value. For each record, we use CSVRecord.get(String name) to retrieve the field value by its name.

CSVRecord provides different ways to access a field value: by name or by index. If you are not sure whether the field has a value or is empty, CSVRecord.isSet(String name) can be called first. If you just want to check whether a name has been defined in the parser’s header, call CSVRecord.isMapped(String name) instead.
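
To make these accessors concrete, here is a small sketch that could replace the loop body above; COL1 and COL2 come from the sample file, while COL3 is a hypothetical extra column used only to illustrate isMapped() and isSet():

for (CSVRecord record : parser) {
  String byName = record.get("COL1");      // access by header name
  String byIndex = record.get(1);          // access by zero-based index (here, COL2)
  if (record.isMapped("COL3")              // was COL3 declared in the header?
      && record.isSet("COL3")) {           // and does this record have a value for it?
    System.out.println(record.get("COL3"));
  }
  System.out.println(byName + ", " + byIndex);
}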

Using Decorator to allow different behaviors

Both CSVFormat.DEFAULT and CSVFormat.RFC4180 follow the RFC 4180 format, so fields enclosed in double quotes can be handled too, such as

"COL1","COL2"
"a","b"
"c","d"
"e","f"

In RFC 4180, fields in a CSV file are separated by commas, but in general the library can handle an arbitrary delimiter such as a tab or a space. To make the code reusable, the library provides a way to create your own CSVFormat:

CSVFormat format = CSVFormat.newFormat(',')
    .withQuoteChar('"')
    .withHeader();

The above format is the same as CSVFormat.DEFAULT. Here we encounter another design pattern, Decorator, which allows behavior to be added to an individual object, either statically or dynamically, without affecting the behavior of other objects from the same class. In the case of CSVFormat, every withXXX() method returns a new CSVFormat that is equal to the calling one but with one attribute modified. The question here might be: why not just return the self-reference this? I think it is because the latter approach would break code like the following:

CSVFormat format = CSVFormat.newFormat(',');
CSVFormat format1 = format.withQuoteChar('"');
CSVFormat format2 = format.withHeader();

If withXXX() simply returned this, then format, format1, and format2 would all refer to the same object mutated twice, so format2 would silently pick up the quote character as well; returning a fresh instance keeps each derived format independent.

Tidy config for XML

The following configuration won’t wrap text, which is useful if you don’t want extra whitespace inserted into text content while reformatting XML files.

char-encoding: utf8
indent: auto
indent-spaces: 2
wrap: 0

Usage:

tidy -xml -i -config tidy.config -m XMLFILE

A Java Implementation of Fibonacci Heap

This project provides a Java implementation of a Fibonacci heap. A Fibonacci heap is a heap data structure consisting of a collection of min-heap-ordered trees. It is an advanced amortized data structure whose amortized performance is better than that of common collections like a linked list. However, some of its operations can run as slowly as linear time in the worst case, which makes it inappropriate for general use. On the other hand, especially for learning algorithms and time analysis, it is a very good data structure.

Introduction to Algorithms (3rd ed.), Chapter 19, and Wikipedia give a very detailed discussion of Fibonacci heaps. For the sake of simplicity, I only reprint the running-time comparison between a linked list and a Fibonacci heap.

Operations  | Linked List | Fibonacci Heap
-------------------------------------------
insert      |     O(1)    |   O(1)
minimum     |     O(n)    |   O(1)
extractMin  |     O(n)    |   O(log n)*
union       |     O(1)    |   O(1)
decreaseKey |     O(1)    |   O(1)*
delete      |     O(1)    |   O(log n)*

* Amortized time

Here are explanations of the operations, modified from Introduction to Algorithms (3rd ed.):

  • insert(x) inserts a node x into the heap.
  • minimum() returns the node in the heap with minimum key.
  • extractMin() deletes the node with minimum key from the heap.
  • union(H) merges this heap with heap H and creates a new heap.
  • decreaseKey(x,k) assigns to node x within the heap the new key value k, which is assumed to be no greater than its current key value.
  • delete(x) deletes node x from the heap.

Implementation

In FibonacciHeap, much of the code is based on the algorithms in Introduction to Algorithms (3rd ed.), Chapter 19. The amortized cost of most of these methods is O(1), making it a very fast data structure; several have an actual running time of O(1). extractMin() and delete() have O(log n) amortized running times because they perform heap consolidation.

The data structure of a Fibonacci heap is shown below (figure copied from Warren Armstrong’s website).

[Figure: the data structure of a Fibonacci heap]

For each node FibonacciHeap.Entry,

  • Two pointers, left and right, are used to form a circular doubly linked list.
  • One pointer, parent, points to the node’s parent.
  • One pointer, child, points to one of the node’s children.
  • degree stores the number of children.
  • mark “indicates whether the node has lost a child since the last time it was made the child of another node.” The boolean mark is very useful for amortizing the running time of decreaseKey.
  • An integer key is used to hold the key value of the node. Because delete() needs to set the key to MIN_VALUE, Comparable cannot be used to support a more general key type.
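
A minimal sketch of what such a node class might look like is given below. The field names mirror the list above, but this is only an illustration; the project’s actual FibonacciHeap.Entry (its constructor and accessors in particular) may differ:

// Hypothetical node sketch mirroring the fields described above
class Entry {
  Entry left, right;  // neighbors in the circular doubly linked list
  Entry parent;       // parent node, or null for a root
  Entry child;        // one of the children, or null if there are none
  int degree;         // number of children
  boolean mark;       // lost a child since it was last made a child of another node?
  int key;            // key value; delete() relies on setting this to Integer.MIN_VALUE

  Entry(int key) {
    this.key = key;
    left = this;      // a new node starts as a singleton circular list
    right = this;
  }
}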

Usage

FibonacciHeap provides an implementation of a Fibonacci heap in Java. Although the source code supports many operations on the heap, the comparison table above suggests that only insert(), minimum(), extractMin(), and union() are recommended for general use.

After that, the use of FibonacciHeap is straightforward.

// construct a heap
FibonacciHeap heap = new FibonacciHeap();
FibonacciHeap.Entry n1 = new FibonacciHeap.Entry(23, 0);
heap.insert(n1);
FibonacciHeap.Entry n2 = new FibonacciHeap.Entry(7, 0);
heap.insert(n2);
FibonacciHeap.Entry n3 = new FibonacciHeap.Entry(35, 0);
heap.insert(n3);
// minimum
System.out.println(heap.minimum());
// extractMin
heap.extractMin();
System.out.println(heap.minimum());
// decrease key
heap.decreaseKey(n1, 1);
System.out.println(heap.minimum());

How to use Opennlp to do part-of-speech tagging


Introduction

The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. One of the most popular machine learning models it supports is the Maximum Entropy model (MaxEnt) for natural language processing tasks. Among these tasks, part-of-speech tagging (POS tagging) is one of the most common.

In short, POS tagging is the process of marking up each word with a particular part of speech. It is usually required to build more advanced text processing services. An explanation of POS tagging and its MaxEnt-based techniques is beyond the scope of this post, so I won’t discuss it further; for more details, readers can refer to this report. The Apache OpenNLP documentation provides a very brief introduction to the POS tagger tool, but researchers and developers may want to look at its usage in more depth. That is what this post aims at.

Library and model

The latest version of the OpenNLP package can be found here. The current version is 1.5.*. After downloading it, we can extract it to $OPENNLP_HOME.

The Apache OpenNLP project also provides models compatible with the above package. Among these models, we will use en-pos-maxent.bin, a MaxEnt model with a tag dictionary, which can be downloaded here. The tag set is the Penn Treebank tag set; however, it is unknown which corpus was used to train this model. The model is language dependent, so I assume it will perform well on common English. After downloading it, we can put the model either in the OpenNLP folder $OPENNLP_HOME or elsewhere. For the sake of simplicity, I will just put it under $OPENNLP_HOME.

Once we have the OpenNLP package and the model (placed in the OpenNLP folder), let’s have a look at how to use it for POS tagging. As far as I know, there are at least three ways to invoke the tagger: the command line, POSTaggerTool, and POSTaggerME. Each is discussed below.

Command line

The easiest way to try out the POS tagger is the command line tool, called opennlp and located in the bin folder. Assuming the OpenNLP folder is $OPENNLP_HOME, the command line is

$OPENNLP_HOME/bin/opennlp POSTagger $OPENNLP_HOME/en-pos-maxent.bin 
    < sentences

The POS tagger will read sentences from the file sentences. The format is one sentence per line, with words and punctuation tokenized. For example, the input sentence can be (note that the period at the end is tokenized too)

Here we have investigated the functional domains of the proteins .

The output is the sentences with POS tags in the format “word_pos [space] word_pos [space] ...”. So the result of the above example is

Here_RB we_PRP have_VBP investigated_VBN the_DT functional_JJ domains_NNS of_IN 
the_DT proteins_NNS ._PERIOD

POSTaggerTool

You can also use POSTaggerTool to invoke the tagger programmatically. The tool takes only one parameter, the path of the model file. Actually, the command line above eventually invokes POSTaggerTool in the background, so there seems to be no advantage in using this snippet of code, but it does no harm to know more. :p

public static void main(String[] args) {
  // args should contain a single element: the path to the model file
  POSTaggerTool tool = new POSTaggerTool();
  System.err.println(tool.getHelp());   // print the tool's usage message
  tool.run(args);                       // read sentences from stdin and tag them
}

POSTaggerME

If you want to embed the POS tagger into an application, you have to use POSTaggerME. Complete code using POSTaggerME is attached at the very end of this post. In this section, I will only use pieces of code to discuss its usage.

To invoke POSTaggerME, you first need to load the model, then initialize POSTaggerME.

// load the model
InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
POSModel model = new POSModel(modelIn);

// initialize POSTaggerME
POSTaggerME tagger = new POSTaggerME(model);
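
To round out the snippet above, here is a minimal, hedged sketch of how the initialized tagger can then be applied to a pre-tokenized sentence (POSTaggerME.tag() takes an array of tokens and returns one tag per token; the sentence is the example from the command-line section, and stream closing is omitted for brevity):

// tag a whitespace-tokenized sentence: one tag is returned per input token
String[] tokens = "Here we have investigated the functional domains of the proteins .".split(" ");
String[] tags = tagger.tag(tokens);
for (int i = 0; i < tokens.length; i++) {
  System.out.print(tokens[i] + "_" + tags[i] + " ");
}
System.out.println();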

Load Penn Treebank with offsets

Question

When I am doing relation extraction based on parse trees, it is always very helpful to map the leaves of the parse trees back to the original text, so that when I am looking at a leaf (or an inner node) I can find where it comes from. In order to do so, I wrote a script to add character offsets to the leaves. The script prints parse trees in “Penn Treebank” format. For example, given the sentence and the tree

After 48 h, cells were harvested for luciferase assay.

(S1 (S (S (PP (IN After) 
              (NP (CD 48) 
                  (NN h))) 
          (, ,) 
          (NP (NNS cells)) 
          (VP (VBD were) 
              (VP (VBN harvested) 
                  (PP (IN for) 
                      (NP (NN luciferase) 
                          (NN assay)))))) 
       (. .)))

the parsing tree with offsets looks like

(S1 (S (S (PP (IN After_294_299) (NP (CD 48_300_302) (NN h_303_304))) (, ,_304_305) (NP (NNS cells_306_311)) (VP (VBD were_312_316) (VP (VBN harvested_317_326) (PP (IN for_327_330) (NP (NN luciferase_331_341) (NN assay_342_347)))))) (. ._347_348)))
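
The original script is not reproduced here, but the idea can be sketched as follows: walk the leaf tokens in order, find each one in the document text starting from the current search position, and append _start_end to the leaf. The Java sketch below is a hypothetical illustration of that idea, not the original script; note that the offsets in the tree above (294, 299, ...) are relative to a larger document, whereas this sketch computes offsets within the given string:

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: attach character offsets to parse-tree leaf tokens,
// assuming the tokens occur in the text in left-to-right order.
public class LeafOffsets {

  static String[] annotate(List<String> leaves, String text) {
    String[] result = new String[leaves.size()];
    int from = 0;  // current search position in the text
    for (int i = 0; i < leaves.size(); i++) {
      String token = leaves.get(i);
      int start = text.indexOf(token, from);
      int end = start + token.length();
      result[i] = token + "_" + start + "_" + end;
      from = end;  // continue searching after this token
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "After 48 h, cells were harvested for luciferase assay.";
    List<String> leaves = Arrays.asList("After", "48", "h", ",", "cells", "were",
        "harvested", "for", "luciferase", "assay", ".");
    System.out.println(String.join(" ", annotate(leaves, text)));
  }
}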