Category Archives: toolkits

How to convert video using MEncoder

MEncoder is a free command-line video decoding, encoding, and filtering tool. Built alongside MPlayer, it can convert every format that MPlayer understands.

install mencoder

sudo apt-get install mencoder

avi → mp4

mencoder 1.avi -o 1.mp4 -oac copy -ovc lavc -lavcopts vcodec=mpeg1video -of mpeg

  • -o: the output file name.
  • -oac: the audio codec for encoding. copy copies the compressed frames without re-encoding.
  • -ovc: the video codec for encoding. lavc means using one of libavcodec’s video codecs.
  • -lavcopts: options for libavcodec. Here we choose MPEG-1 video (vcodec=mpeg1video).
  • -of: the output container format. mpeg means MPEG-1/MPEG-2 PS.

    A Java Implementation of Fibonacci Heap

    This project provides a Java implementation of a Fibonacci heap. A Fibonacci heap is a heap data structure consisting of a collection of min-heap-ordered trees. It is an advanced amortized data structure whose amortized performance is better than that of common collections such as a linked list. However, some of its operations can run as slowly as linear time in the worst case, which makes it inappropriate for general use. On the other hand, it is a very good data structure for learning algorithms and amortized time analysis.

    Introduction to Algorithms (3rd ed) Chap 19 and Wikipedia give a very detailed discussion of Fibonacci heaps. For the sake of simplicity, I only reprint the running-time comparison between a linked list and a Fibonacci heap:

    Operations  | Linked List | Fibonacci Heap
    -------------------------------------------
    insert      |     O(1)    |   O(1)
    minimum     |     O(n)    |   O(1)
    extractMin  |     O(n)    |   O(log n)
    union       |     O(1)    |   O(1)
    -------------------------------------------
    decreaseKey |     O(1)    |   O(1)*
    delete      |     O(1)    |   O(log n)*
    
    * Amortized time
    

    Here are explanations of the operations, modified from Introduction to Algorithms (3rd ed):

    • insert(x) inserts a node x into the heap.
    • minimum() returns the node in the heap with minimum key.
    • extractMin() deletes the node with minimum key from the heap.
    • union(H) merges this heap with heap H, creating a new heap.
    • decreaseKey(x,k) assigns to node x within the heap the new key value k, which is assumed to be no greater than its current key value.
    • delete(x) deletes node x from the heap.
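As a point of comparison, the Linked List column of the table above corresponds to keeping keys in an unsorted list. A minimal sketch (a hypothetical class, not part of this project) shows why insert is O(1) while minimum and extractMin degrade to O(n):

```java
import java.util.LinkedList;

// Hypothetical baseline matching the Linked List column of the table:
// an unsorted list gives O(1) insert but O(n) minimum/extractMin.
class ListHeap {
    final LinkedList<Integer> keys = new LinkedList<>();

    void insert(int key) { keys.add(key); }   // O(1): append at the tail

    int minimum() {                           // O(n): must scan every key
        int min = Integer.MAX_VALUE;
        for (int k : keys) min = Math.min(min, k);
        return min;
    }

    int extractMin() {                        // O(n): find the minimum, then remove it
        int min = minimum();
        keys.remove(Integer.valueOf(min));
        return min;
    }

    void union(ListHeap other) {              // O(1) with a spliceable list; note that
        keys.addAll(other.keys);              // LinkedList.addAll actually copies, O(m)
    }
}
```

Note that Java’s LinkedList cannot splice in constant time, so a true O(1) union needs a hand-rolled list; the comment marks where the sketch diverges from the table.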

    Implementation

    In FibonacciHeap, much of the code is based on the algorithms in Introduction to Algorithms (3rd ed) Chap 19. The amortized cost of most of these methods is O(1), making it a very fast data structure. Several have an actual running time of O(1). extractMin() and delete() have O(log n) amortized running times because they perform heap consolidation in consolidate().

    The data structure of a Fibonacci heap is shown below (the figure is copied from Warren Armstrong’s website).

    [Figure: the node layout of a Fibonacci heap]

    For each node FibonacciHeap.Entry,

    • Two pointers, left and right, form a circular doubly linked list.
    • One pointer, parent, points to the node’s parent.
    • One pointer, child, points to one of the node’s children.
    • degree stores the number of children.
    • mark “indicates whether the node has lost a child since the last time it was made the child of another node.” This boolean is the key to amortizing the running time of decreaseKey().
    • An integer key holds the key value of the node. Because delete() needs to set the key to Integer.MIN_VALUE, a generic Comparable key cannot be used here.
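Based on the field list above, the node layout can be sketched as follows (a hypothetical Entry class, not the project’s actual source); the link() helper shows how nodes are spliced into the circular doubly linked list:

```java
// Hypothetical sketch of the fields of FibonacciHeap.Entry described above.
class Entry {
    Entry left = this, right = this; // circular doubly linked sibling list
    Entry parent;                    // pointer to the parent node
    Entry child;                     // pointer to one of the children
    int degree;                      // number of children
    boolean mark;                    // lost a child since becoming a child of another node?
    int key;                         // plain int, so delete() can use Integer.MIN_VALUE

    Entry(int key) { this.key = key; }

    // Splice e into this node's circular list, immediately to its right.
    void link(Entry e) {
        e.right = this.right;
        e.left = this;
        this.right.left = e;
        this.right = e;
    }
}
```

A single node is its own one-element circular list (left and right point to itself), which is why insert and union reduce to constant-time pointer splices.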

    Usage

    FibonacciHeap provides an implementation of a Fibonacci heap in Java. Although the source code supports many operations on the heap, the comparison table above shows that only insert(), minimum(), extractMin(), and union() are recommended for use.

    After that, the use of FibonacciHeap is straightforward.

    // construct a heap
    FibonacciHeap heap = new FibonacciHeap();
    FibonacciHeap.Entry n1 = new FibonacciHeap.Entry(23, 0);
    heap.insert(n1);
    FibonacciHeap.Entry n2 = new FibonacciHeap.Entry(7, 0);
    heap.insert(n2);
    FibonacciHeap.Entry n3 = new FibonacciHeap.Entry(35, 0);
    heap.insert(n3);
    // minimum
    System.out.println(heap.minimum());
    // extractMin
    heap.extractMin();
    System.out.println(heap.minimum());
    // decrease key
    heap.decreaseKey(n1, 1);
    System.out.println(heap.minimum());

    How to delete all .svn in a folder

    If you need to remove all .svn folders in a project, run

    find . -name ".svn" -exec rm -rf {} \;
    

    or more precisely, run

    find . -type d -name '.svn' -print -exec rm -rf {} \;
    

    The script starts from the current directory and searches recursively through subdirectories. It also works for other names: just replace “.svn”. But if the name is too general, especially if you use a regular expression, you may end up deleting a bunch of folders/files you don’t want to. A safer approach is to run find by itself first, to verify that everything it finds is really what you intend to remove.

    How to use Opennlp to do part-of-speech tagging


    Introduction

    The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. One of the most popular machine learning models it supports is the Maximum Entropy model (MaxEnt). Among the many NLP tasks, part-of-speech tagging (POS tagging) is one of the most common.

    In short, POS tagging is the process of marking up each word with a particular part of speech. It is usually required for building more advanced text-processing services. An explanation of POS tagging and its MaxEnt-based techniques is beyond the scope of this post, so I won’t discuss it further; for more details, readers are recommended to refer to this report. The Apache OpenNLP documentation provides a very brief introduction to the POS tagger tool, but researchers and developers may want to look at its usage in more depth. That is what this post aims to provide.

    Library and model

    The latest version of the OpenNLP package can be found here. The current version is 1.5.*. After downloading it, we can extract it to $OPENNLP_HOME.

    The Apache OpenNLP project also provides models compatible with the above package. Among these models, we will use en-pos-maxent.bin, a MaxEnt model with a tag dictionary, which can be downloaded here. The tag set is the Penn Treebank tag set; however, it is unknown which corpus was used to train this model. Since the model is trained for English, I assume it will perform well on common English text. After downloading it, we can put the model either in the OpenNLP folder $OPENNLP_HOME or anywhere else; for the sake of simplicity, I will just put it under $OPENNLP_HOME.

    Once we have the OpenNLP package and the model (with the model placed in the OpenNLP folder), let’s have a look at how to use it for POS tagging. As far as I know, there are at least three ways to invoke the tagger: the command line, POSTaggerTool, and POSTaggerME. Each will be fully discussed below.

    Command line

    The easiest way to try out the POS tagger is the command-line tool, called opennlp and located in the bin folder. With the OpenNLP folder at $OPENNLP_HOME, the command line is

    $OPENNLP_HOME/bin/opennlp POSTagger $OPENNLP_HOME/en-pos-maxent.bin \
        < sentences
    

    The POS tagger will read sentences from the file sentences. The expected format is one sentence per line, with words and punctuation tokenized. For example, the input sentence can be (note that the full stop at the end is tokenized too)

    Here we have investigated the functional domains of the proteins .
    

    The output is the sentences with POS tags in the format “word_pos [space] word_pos [space] ...“. So the result of the above example will be

    Here_RB we_PRP have_VBP investigated_VBN the_DT functional_JJ domains_NNS of_IN 
    the_DT proteins_NNS ._PERIOD
    
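If you need this output programmatically, the word_pos format is easy to split back into (word, tag) pairs. A small sketch (a hypothetical helper, not part of OpenNLP):

```java
import java.util.ArrayList;
import java.util.List;

// Splits a tagged line such as "Here_RB we_PRP" into (word, tag) pairs.
// The tag is taken after the LAST underscore, since a token itself
// may contain underscores.
class TaggedLine {
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String tok : line.trim().split("\\s+")) {
            int i = tok.lastIndexOf('_');
            pairs.add(new String[] { tok.substring(0, i), tok.substring(i + 1) });
        }
        return pairs;
    }
}
```

Returning a list rather than a map keeps duplicate words (like the two occurrences of “the” above) as separate pairs.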

    POSTaggerTool

    You can also use POSTaggerTool to invoke the tagger. The tool takes only one parameter: the path of the model file. Actually, the command line above eventually invokes POSTaggerTool behind the scenes, so there seems to be little advantage in using this snippet of code. But it doesn’t hurt to know more. :p

    // requires: import opennlp.tools.cmdline.postag.POSTaggerTool;
    public static void main(String[] args) {
      POSTaggerTool tool = new POSTaggerTool();
      System.err.println(tool.getHelp());
      tool.run(args); // args[0] is the path to the model file
    }
    

    POSTaggerME

    If you want to embed the POS tagger in an application, you have to use POSTaggerME. The complete code using POSTaggerME is attached at the very end of this post. In this section, I will only use snippets to discuss its usage.

    To invoke POSTaggerME, you first need to load the model, then initialize POSTaggerME.

    // load the model
    InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
    POSModel model = new POSModel(modelIn);
    // initialize POSTaggerME
    POSTaggerME tagger = new POSTaggerME(model);

    Load Penn Treebank with offsets

    Question

    When I am doing relation extraction based on parse trees, it is always very helpful to map the leaves of the parse trees back to the original text, so that when I search for a leaf (or an inner node), I can find where it came from. To do so, I wrote a script that adds offsets to the leaves. The script prints parse trees in “Penn Treebank” format. For example, given the sentence and the tree

    After 48 h, cells were harvested for luciferase assay.
    
    (S1 (S (S (PP (IN After) 
                  (NP (CD 48) 
                      (NN h))) 
              (, ,) 
              (NP (NNS cells)) 
              (VP (VBD were) 
                  (VP (VBN harvested) 
                      (PP (IN for) 
                          (NP (NN luciferase) 
                              (NN assay)))))) 
           (. .)))
    

    the parse tree with offsets looks like

    (S1 (S (S (PP (IN After_294_299) (NP (CD 48_300_302) (NN h_303_304))) (, ,_304_305) (NP (NNS cells_306_311)) (VP (VBD were_312_316) (VP (VBN harvested_317_326) (PP (IN for_327_330) (NP (NN luciferase_331_341) (NN assay_342_347)))))) (. ._347_348)))
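The offset step itself can be sketched as follows (a hypothetical helper, not the author’s actual script): scan the text left to right and rewrite each leaf as leaf_start_end. Passing the whole document as text yields document-level offsets like After_294_299 above; passing just the sentence yields sentence-local offsets.

```java
import java.util.ArrayList;
import java.util.List;

// For each leaf token, find its next occurrence in the text and
// annotate it with its [start, end) character offsets.
class LeafOffsets {
    static List<String> annotate(String text, List<String> leaves) {
        List<String> out = new ArrayList<>();
        int from = 0;
        for (String leaf : leaves) {
            int start = text.indexOf(leaf, from); // next occurrence of this leaf
            int end = start + leaf.length();
            out.add(leaf + "_" + start + "_" + end);
            from = end;                           // keep the scan monotonic
        }
        return out;
    }
}
```

This simple indexOf scan assumes the leaves occur in order and match the text exactly; a robust implementation would reuse the tokenizer’s character spans instead of re-searching.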

    Change gedit embedded terminal colors

    For Ubuntu only:

    1. install dconf-tools and gconf-editor
    2. in gconf-editor, navigate to apps → gnome-terminal → profiles → Default
    3. in dconf-tools, navigate to org → gnome → gedit → plugins → terminal
    4. uncheck “use-theme-colors”
    5. copy the values of “background-color”, “foreground-color”, and “palette” from gconf-editor to dconf-tools

    Install Gnuplot 4.6 with PDF on Ubuntu

    It is always hard to install gnuplot manually on Ubuntu, especially if you want to plot diagrams in PDF, JPEG, or PNG format. This short 101 article describes one way to install gnuplot with PDF support on Ubuntu.

    1. download pdflib-light and extract to $PDFLIB
    2. compile and install pdflib-light

      cd $PDFLIB
      ./configure  
      make  
      sudo make install
      
    3. refresh lib cache: sudo ldconfig

    4. download gnuplot and extract to $GNUPLOT

    5. compile and install gnuplot

      cd $GNUPLOT  
      ./configure --with-pdf  
      make  
      sudo make install  
      

    Note: the other required packages can be installed via apt-get.

    Best Markdown Editors for Windows, Linux, and the web

    Markdown is a lightweight markup language, allowing people “to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML)”. An excellent Markdown syntax guide is provided by Daring Fireball. Sites such as GitHub, reddit, Diaspora, Stack Overflow, OpenStreetMap, and SourceForge use Markdown to facilitate discussion between users. GitHub uses “GitHub Flavored Markdown” (GFM) for messages, issues, and comments. It differs from standard Markdown (SM) in a few significant ways and adds some additional functionality.