How to use OpenNLP to do part-of-speech tagging
Introduction
The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. One of the most popular machine learning models it supports is the Maximum Entropy model (MaxEnt). Part-of-speech tagging (POS tagging) is one of the most common NLP tasks it handles.
In short, POS tagging is the process of marking each word in a text with its part of speech. It is usually a prerequisite for more advanced text processing services. An explanation of POS tagging and of the MaxEnt techniques behind it is beyond the scope of this post, so I won’t discuss it further; for more details, readers can refer to this report. The Apache OpenNLP documentation provides a very brief introduction to the POS tagger tool. However, researchers and developers may want to look at its usage in more depth, which is what this post aims to do.
Library and model
The latest version of the OpenNLP package can be found here. The current version is 1.5.*. After downloading it, we can extract it to $OPENNLP_HOME.
The Apache OpenNLP project also provides models compatible with the above package. Among these models, we will use en-pos-maxent.bin, which is a MaxEnt model with a tag dictionary and can be downloaded here. The tag set is the Penn Treebank tag set; however, it is unknown which corpus was used to train this model. But this model is language dependent, so I assume it will perform well on common English. After downloading the model, we can put it either in the OpenNLP folder $OPENNLP_HOME or somewhere else. For the sake of simplicity, I will just put it under $OPENNLP_HOME.
With the OpenNLP package and the model in place (the model in the OpenNLP folder), let’s have a look at how to use them for POS tagging. As far as I know, there are at least three ways to invoke the tagger: the command line, POSTaggerTool, and POSTaggerME. Each is discussed below.
Command line
The easiest way to try out the POS tagger is the command line tool. The tool is called opennlp and is located in the bin folder. Supposing the OpenNLP folder is $OPENNLP_HOME, the command line is
$OPENNLP_HOME/bin/opennlp POSTagger $OPENNLP_HOME/en-pos-maxent.bin < sentences
The POS tagger will read sentences from the file sentences. The format is one sentence per line, with words and punctuation tokenized. For example, the input sentence can be (note that the full stop at the end is tokenized too)
Here we have investigated the functional domains of the proteins .
The output is the sentences with POS tags in the format "word_pos [space] word_pos [space] ...". So the result of the above example will be
Here_RB we_PRP have_VBP investigated_VBN the_DT functional_JJ domains_NNS of_IN the_DT proteins_NNS ._PERIOD
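When this output is consumed by another program, each word_pos token has to be split back into a word and a tag. The following is a minimal sketch in plain Java that does not depend on OpenNLP. The names TaggedLineParser and parse are hypothetical. It splits each token on its last underscore, on the assumption that a tag never contains an underscore while a word might.

```java
import java.util.ArrayList;
import java.util.List;

final class TaggedLineParser {

    /** Splits "word_TAG word_TAG ..." into {word, tag} pairs. */
    static List<String[]> parse(String taggedLine) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = taggedLine.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        for (String token : trimmed.split("\\s+")) {
            // Assumption: the tag is everything after the last underscore.
            int cut = token.lastIndexOf('_');
            if (cut < 0) {
                // No tag attached; keep the word and leave the tag empty.
                pairs.add(new String[] {token, ""});
            } else {
                pairs.add(new String[] {token.substring(0, cut), token.substring(cut + 1)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("Here_RB we_PRP have_VBP investigated_VBN")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Running main prints one word and its tag per line, separated by a tab.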
POSTaggerTool
You can use POSTaggerTool to invoke the tagger. The tool takes only one parameter, the path of the model file. In fact, the command line above eventually invokes POSTaggerTool in the background, so there seems to be no advantage in using this snippet of code. But it doesn’t hurt to know more. :p
public static void main(String[] args) {
    POSTaggerTool tool = new POSTaggerTool();
    System.err.println(tool.getHelp());
    tool.run(args);
}
POSTaggerME
If you want to embed the POS tagger into an application, you have to use POSTaggerME. The complete code for using POSTaggerME is attached at the very end of this post. In this section, I will only use pieces of code to discuss its usage.
To invoke POSTaggerME, you first need to load the model, then initialize POSTaggerME.
// load the model
InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
POSModel model = new POSModel(modelIn);
// initialize POSTaggerME
POSTaggerME tagger = new POSTaggerME(model);
After that, you can call tagger.tag() to tag sentences. The method tag takes an array of strings. Each element of the array is a word. tag will return an array of strings too, each of which contains a POS tag corresponding to the word. For example, if we run
String sent[] = new String[]{"Here", "we", "have", "investigated", "the",
        "functional", "domains", "of", "the", "proteins", "."};
String tags[] = tagger.tag(sent);
The tags will be
["RB", "PRP", "VBP", "VBN", "DT", "JJ", "NNS", "IN", "DT", "NNS", "PERIOD"]
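Because the tag at index i belongs to the word at index i, the two arrays can be zipped back into the word_pos format used by the command line tool. The following is a minimal sketch in plain Java. TagJoiner and join are hypothetical names, and the tags are hard-coded from the example above so that the snippet runs without OpenNLP.

```java
final class TagJoiner {

    /** Joins parallel word and tag arrays into "word_tag word_tag ...". */
    static String join(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(words[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = {"Here", "we", "have"};
        String[] tags = {"RB", "PRP", "VBP"};
        System.out.println(join(words, tags)); // Here_RB we_PRP have_VBP
    }
}
```

The length check is there because a mismatch between the two arrays would otherwise silently misalign words and tags.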
However, it is tedious to always provide an array of words to POSTaggerME. Usually, we prefer to just give it a string. To tokenize a string into words, one way is to use String.split("\\s+"). Another way is to use WhitespaceTokenizer in the opennlp package. Personally, I prefer the second way, because it is more convenient and robust. The following code does the same thing as above.
String line = "Here we have investigated the"
        + " functional domains of the proteins .";
String[] sent = WhitespaceTokenizer.INSTANCE.tokenize(line);
String[] tags = tagger.tag(sent);
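One concrete reason to be careful with String.split: when a line begins with whitespace, split("\\s+") returns an empty string as its first element, and that empty token would then be passed to the tagger as a word. The sketch below, in plain Java with hypothetical names (SplitDemo, naiveSplit, safeSplit), shows the difference. It makes no claim about how WhitespaceTokenizer is implemented internally.

```java
import java.util.Arrays;

final class SplitDemo {

    /** Splits on runs of whitespace without any cleanup. */
    static String[] naiveSplit(String line) {
        return line.split("\\s+");
    }

    /** Trims first, so leading whitespace cannot produce an empty token. */
    static String[] safeSplit(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String line = "  Here we have";
        System.out.println(Arrays.toString(naiveSplit(line))); // [, Here, we, have]
        System.out.println(Arrays.toString(safeSplit(line)));  // [Here, we, have]
    }
}
```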
The complete code
Below I provide the complete code for using POSTaggerME for part-of-speech tagging. It provides better IO exception handling and performance monitoring, but the basic idea is the same as discussed above. BasicCmdLineTool is used to provide the basic functions of generating help messages and descriptions. PerformanceMonitor is used to count the number of lines processed. Also, I don’t tag lines starting with "//". By doing so, I can add comments in the text file, just like in Java/C++. I think this is very helpful if you want to add additional information between sentences, such as their location and version number.
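The line handling described above can be tried in isolation, without a model file. The following is a minimal sketch in plain Java: CommentAwareLines, LineTagger and process are hypothetical names, and LineTagger is only a stand-in for the real tagger so that the filtering logic can be run on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CommentAwareLines {

    /** Hypothetical stand-in for the real tagger; maps one sentence to its tagged form. */
    interface LineTagger {
        String tag(String sentence);
    }

    /** Copies blank lines and "//" comment lines unchanged and tags everything else. */
    static List<String> process(List<String> lines, LineTagger tagger) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("//")) {
                out.add(line);
            } else {
                out.add(tagger.tag(line));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Dummy tagger that marks every token with the placeholder tag X.
        LineTagger dummy = new LineTagger() {
            @Override
            public String tag(String sentence) {
                return sentence.trim().replaceAll("\\s+", "_X ") + "_X";
            }
        };
        List<String> input = Arrays.asList("// doc 1, version 2", "", "Here we are .");
        for (String line : process(input, dummy)) {
            System.out.println(line);
        }
    }
}
```

Keeping the comment and blank lines in the output means the tagged file stays aligned line by line with the input file.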
package postagger;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;

import opennlp.tools.cmdline.BasicCmdLineTool;
import opennlp.tools.cmdline.CLI;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

/**
 * Part-of-speech tagger using MaxEnt model
 *
 * @author Yifan Peng
 * @version 09/18/2013
 */
public class PosTagger extends BasicCmdLineTool {

    public static void main(String args[]) {
        PosTagger tagger = new PosTagger();
        tagger.run(args);
    }

    @Override
    public String getShortDescription() {
        return "Part-of-speech tagger using MaxEnt model";
    }

    @Override
    public String getHelp() {
        return "Usage: " + CLI.CMD + " " + getName() + " [MODEL] [INPUT] [OUTPUT]";
    }

    /**
     * Part-of-speech tagger
     *
     * @param args 0 - model name, absolute path;
     *             1 - input filename, absolute path;
     *             2 - output filename, absolute path
     */
    @Override
    public void run(String[] args) {
        if (args.length != 3) {
            System.out.println(getHelp());
            return;
        }
        String modelPath = args[0];
        String inputPath = args[1];
        String outputPath = args[2];

        // read model
        InputStream modelIn = null;
        POSModel model = null;
        try {
            modelIn = new FileInputStream(modelPath);
            model = new POSModel(modelIn);
        } catch (IOException e) {
            // Model loading failed, handle the error
            e.printStackTrace();
            System.err.println("cannot read model: " + modelPath);
            System.err.println(getHelp());
            System.exit(1);
        }

        // initialize POSTaggerME
        POSTaggerME tagger = new POSTaggerME(model);

        // read input file
        ObjectStream<String> lineStream = null;
        PrintStream printer = null;
        String line = null;
        try {
            lineStream = new PlainTextByLineStream(new InputStreamReader(
                    new FileInputStream(inputPath)));
            printer = new PrintStream(new FileOutputStream(outputPath));
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            perfMon.start();
            while ((line = lineStream.read()) != null) {
                if (line.isEmpty()) {
                    // keep blank lines as they are
                    printer.println();
                } else if (line.startsWith("//")) {
                    // copy comment lines without tagging them
                    printer.println(line);
                } else {
                    String[] sent = WhitespaceTokenizer.INSTANCE.tokenize(line);
                    String[] tags = tagger.tag(sent);
                    POSSample sample = new POSSample(sent, tags);
                    printer.println(sample.toString());
                }
                perfMon.incrementCounter();
            }
            lineStream.close();
            printer.close();
        } catch (IOException e) {
            // Reading the input or writing the output failed, handle the error
            e.printStackTrace();
            System.err.println(getHelp());
            System.exit(1);
        } finally {
            if (lineStream != null) {
                try {
                    lineStream.close();
                } catch (IOException e) {
                }
            }
            if (printer != null) {
                printer.close();
            }
        }
    }
}