How to use OpenNLP to do part-of-speech tagging

By | September 19, 2013

Introduction

The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. One of the most popular machine learning models it supports is the Maximum Entropy model (MaxEnt). Part-of-speech tagging (POS tagging) is one of the most common NLP tasks it handles.

In short, POS tagging is the process of labeling each word in a text with its part of speech. It is usually a prerequisite for more advanced text processing services. An explanation of POS tagging and of MaxEnt-based techniques is beyond the scope of this post, so I won’t discuss it further; for more details, readers can refer to this report. The Apache OpenNLP documentation provides a very brief introduction to the POS tagger tool, but researchers and developers may want to look at its usage in more depth. That is the aim of this post.

Library and model

The latest version of the OpenNLP package can be found here. The current version is 1.5.*. After downloading it, extract it to $OPENNLP_HOME.

The Apache OpenNLP project also provides models compatible with the above package. Among these models, we will use en-pos-maxent.bin, which is a MaxEnt model with a tag dictionary and can be downloaded here. The tag set is the Penn Treebank tag set; however, it is unknown which corpus was used to train this model. Since the model is language dependent and built for English, I assume it will perform well on common English text. After downloading the model, we can put it either in the OpenNLP folder $OPENNLP_HOME or somewhere else. For the sake of simplicity, I will just put it under $OPENNLP_HOME.

With the OpenNLP package and the model in place (the model in the OpenNLP folder), let’s look at how to use them for POS tagging. As far as I know, there are at least three ways to invoke the tagger: the command line, POSTaggerTool, and POSTaggerME. Each is discussed below.

Command line

The easiest way to try out the POS tagger is the command line tool. The tool is called opennlp and is located in the bin folder. Assuming the OpenNLP folder is $OPENNLP_HOME, the command is

$OPENNLP_HOME/bin/opennlp POSTagger $OPENNLP_HOME/en-pos-maxent.bin \
    < sentences

The POS tagger will read sentences from the file sentences. The format is one sentence per line, with words and punctuation tokenized. For example, the input sentence can be the following (note that the period at the end is tokenized too):

Here we have investigated the functional domains of the proteins .
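
If your raw text has punctuation glued to the words, it has to be separated before tagging. Below is a minimal sketch of one naive way to do that in plain Java. It is not part of OpenNLP; the class name NaivePunctuationSplitter and its separate method are made up for illustration, and the regular expression is only an assumption about what "tokenized" should mean for simple sentences. It will mis-handle abbreviations such as "e.g.", so a proper tokenizer is the better choice for real data.

```java
/** Hypothetical helper (not an OpenNLP class): detaches punctuation from words. */
final class NaivePunctuationSplitter {

  /**
   * Inserts a space before a punctuation mark that directly follows a
   * non-space character and is itself followed by whitespace or end of line.
   * A dot inside a number such as 3.5 is left alone.
   */
  static String separate(String sentence) {
    return sentence
        .replaceAll("(?<=\\S)([.,;:!?])(?=\\s|$)", " $1")
        .trim();
  }

  public static void main(String[] args) {
    System.out.println(separate(
        "Here we have investigated the functional domains of the proteins."));
  }
}
```

Text that is already tokenized passes through unchanged, so running the helper over a mixed file should be harmless.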

The output is the sentences with POS tags in the format “word_pos [space] word_pos [space] ...”. So the result for the above example will be:

Here_RB we_PRP have_VBP investigated_VBN the_DT functional_JJ domains_NNS of_IN the_DT proteins_NNS ._.
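
If you need to read this output back into a program, each token can be split on its last underscore. The following is a small sketch in plain Java, assuming the “word_pos” format described above. TaggedLineParser is a hypothetical helper, not an OpenNLP class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns "word_TAG word_TAG ..." into {word, tag} pairs. */
final class TaggedLineParser {

  /** Returns one {word, tag} pair per whitespace-separated token. */
  static List<String[]> parse(String taggedLine) {
    List<String[]> pairs = new ArrayList<>();
    String trimmed = taggedLine.trim();
    if (trimmed.isEmpty()) {
      return pairs; // blank line: no tokens
    }
    for (String token : trimmed.split("\\s+")) {
      // Split on the LAST underscore, because a word may itself contain '_'.
      int sep = token.lastIndexOf('_');
      if (sep < 0) {
        throw new IllegalArgumentException("no tag in token: " + token);
      }
      pairs.add(new String[] {token.substring(0, sep), token.substring(sep + 1)});
    }
    return pairs;
  }

  public static void main(String[] args) {
    for (String[] pair : parse("Here_RB we_PRP have_VBP investigated_VBN")) {
      System.out.println(pair[0] + "\t" + pair[1]);
    }
  }
}
```

Splitting on the last underscore rather than the first is a design choice: Penn Treebank tags contain no underscores, while words occasionally do.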

POSTaggerTool

You can also use POSTaggerTool to invoke the tagger. The tool takes only one parameter, the path of the model file. In fact, the command line above eventually invokes POSTaggerTool in the background, so there seems to be no advantage to using this snippet of code. But it doesn’t hurt to know more. :p

public static void main(String[] args) {
  POSTaggerTool tool = new POSTaggerTool();
  System.err.println(tool.getHelp());
  tool.run(args);
}

POSTaggerME

If you want to embed the POS tagger in your application, you have to use POSTaggerME. The complete code for using POSTaggerME is attached at the very end of this post. In this section, I will only use code snippets to discuss its usage.

To invoke POSTaggerME, you first need to load the model, then initialize POSTaggerME.

// load the model
InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
POSModel model = new POSModel(modelIn);
// initialize POSTaggerME
POSTaggerME tagger = new POSTaggerME(model);

After that, you can call tagger.tag() to tag sentences. The tag method takes an array of strings, each element of which is a word. It returns an array of strings too, each of which is the POS tag of the corresponding word. For example, if we run

String[] sent = new String[]{"Here", "we", "have", "investigated", "the",
    "functional", "domains", "of", "the", "proteins", "."};
String[] tags = tagger.tag(sent);

The tags will be

["RB", "PRP", "VBP", "VBN", "DT", "JJ", "NNS", "IN", "DT", "NNS", "."]

However, it is tedious to always provide an array of words to POSTaggerME. Usually, we prefer to just give a string. One way to tokenize a string into words is to use String.split("\\s+"). Another way is to use WhitespaceTokenizer from the OpenNLP package. Personally, I prefer the second way, because it is more convenient and robust. The following code does the same thing as above.

String line = "Here we have investigated the"
    + " functional domains of the proteins .";
String[] sent = WhitespaceTokenizer.INSTANCE.tokenize(line);
String[] tags = tagger.tag(sent);
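
To make the robustness point concrete, here is a small sketch in plain Java of one pitfall of splitting by hand: when the line starts with whitespace, a bare split returns an empty first token, which would then be passed to the tagger as a word. SplitDemo and its two methods are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

/** Hypothetical demo of a pitfall when tokenizing with String.split. */
final class SplitDemo {

  /** Naive tokenization: split on runs of whitespace. */
  static String[] naiveSplit(String line) {
    return line.split("\\s+");
  }

  /** Trims first and maps a blank line to zero tokens. */
  static String[] safeSplit(String line) {
    String trimmed = line.trim();
    return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }

  public static void main(String[] args) {
    String line = "  Here we have  investigated the proteins .";
    // The naive version yields an empty string as its first element.
    System.out.println(Arrays.toString(naiveSplit(line)));
    System.out.println(Arrays.toString(safeSplit(line)));
  }
}
```

If you do go with String.split, trimming first and treating blank lines separately, as safeSplit does, avoids handing empty tokens to the tagger.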

The complete code

Below I provide the complete code for using POSTaggerME for part-of-speech tagging. It provides better I/O exception handling and performance monitoring, but the basic idea is the same as discussed above. BasicCmdLineTool is used to provide the basic functionality of generating help messages and descriptions. PerformanceMonitor is used to count the number of lines processed. Also, I don’t tag lines starting with “//”. This way, I can add comments to the text file, just like in Java/C++. I think this is very helpful if you want to add additional information between sentences, such as their location and version number.

package postagger;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;

import opennlp.tools.cmdline.BasicCmdLineTool;
import opennlp.tools.cmdline.CLI;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

/**
 * Part-of-speech tagger using MaxEnt model
 * 
 * @author Yifan Peng
 * @version 09/18/2013
 */
public class PosTagger extends BasicCmdLineTool {

  public static void main(String args[]) {
    PosTagger tagger = new PosTagger();
    tagger.run(args);
  }

  @Override
  public String getShortDescription() {
    return "Part-of-speech tagger using MaxEnt model";
  }

  @Override
  public String getHelp() {
    return "Usage: " + CLI.CMD + " " + getName() 
        + " [MODEL] [INPUT] [OUTPUT]";
  }

  /**
   * Part-of-speech tagger
   * 
   * @param args 0 - model name, absolute path; 
   *             1 - input filename, absolute path; 
   *             2 - output filename, absolute path
   */
  @Override
  public void run(String[] args) {

    if (args.length != 3) {
      System.out.println(getHelp());
      return;
    }

    String modelPath = args[0];
    String inputPath = args[1];
    String outputPath = args[2];

    // read model
    InputStream modelIn = null;
    POSModel model = null;
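    try {
      modelIn = new FileInputStream(modelPath);
      model = new POSModel(modelIn);
    } catch (IOException e) {
      // Model loading failed, handle the error
      e.printStackTrace();
      System.err.println("cannot read model: " + modelPath);
      System.err.println(getHelp());
      System.exit(1);
    } finally {
      // the model has been read into memory, so the stream is no longer needed
      if (modelIn != null) {
        try {
          modelIn.close();
        } catch (IOException e) {
          // ignore: nothing useful to do if closing fails
        }
      }
    }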

    // initialize POSTaggerME
    POSTaggerME tagger = new POSTaggerME(model);

    // read input file
    ObjectStream<String> lineStream = null;
    PrintStream printer = null;
    String line = null;
    try {
      lineStream = new PlainTextByLineStream(new InputStreamReader(
          new FileInputStream(inputPath)));
      printer = new PrintStream(new FileOutputStream(outputPath));

      PerformanceMonitor perfMon = new PerformanceMonitor(
          System.err, "sent");
      perfMon.start();

      while ((line = lineStream.read()) != null) {
        if (line.isEmpty()) {
          printer.println();
        } else if (line.startsWith("//")) {
          printer.println(line);
        } else {
          String[] sent = WhitespaceTokenizer.INSTANCE.tokenize(line);
          String[] tags = tagger.tag(sent);
          POSSample sample = new POSSample(sent, tags);
          printer.println(sample.toString());
        }
        perfMon.incrementCounter();
      }
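      // report the number of lines processed and the throughput;
      // the streams themselves are closed in the finally block below
      perfMon.stopAndPrintFinalResult();
    } catch (IOException e) {
      // reading the input or writing the output failed
      e.printStackTrace();
      System.err.println(getHelp());
      System.exit(1);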
    } finally {
      if (lineStream != null) {
        try {
          lineStream.close();
        } catch (IOException e) {
        }
      }
      if (printer != null) {
        printer.close();
      }
    }
  }
}
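
The per-line handling in the loop above (keep blank lines, copy “//” comment lines, tag everything else) can be tried out without a model. The sketch below is plain Java and mirrors that logic under a few assumptions: LineDispatchSketch is a hypothetical class, the tagger is passed in as a function, and the stand-in tagger in main labels every word "X" instead of calling POSTaggerME. Unlike the loop above, it also treats whitespace-only lines as blank.

```java
import java.util.Arrays;
import java.util.function.Function;

/** Hypothetical sketch of the per-line logic, with a pluggable tagger. */
final class LineDispatchSketch {

  /** Formats one input line as "word_tag word_tag ...", given any tagger. */
  static String process(String line, Function<String[], String[]> tagger) {
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return "";          // blank lines are preserved as blank lines
    }
    if (line.startsWith("//")) {
      return line;        // comment lines are copied verbatim, not tagged
    }
    String[] words = trimmed.split("\\s+");
    String[] tags = tagger.apply(words);
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < words.length; i++) {
      if (i > 0) {
        out.append(' ');
      }
      out.append(words[i]).append('_').append(tags[i]);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Stand-in tagger for illustration only; a real run would use POSTaggerME.
    Function<String[], String[]> stub = words -> {
      String[] tags = new String[words.length];
      Arrays.fill(tags, "X");
      return tags;
    };
    System.out.println(process("// document 42, version 1", stub));
    System.out.println(process("Here we have", stub));
  }
}
```

Passing the tagger as a function keeps the dispatch logic testable on its own; in the real tool the function body would simply delegate to tagger.tag(words).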
