Category Archives: research

Resolve coreference using Stanford CoreNLP

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. The Stanford CoreNLP coreference resolution system is a state-of-the-art system for this task. To use it, we usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, and parsing. Sometimes, however, we use other tools for preprocessing, particularly when working in a specific domain. In these cases, we need a stand-alone coreference resolution system. This post demonstrates how to create such a system using Stanford CoreNLP.
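
As a rough illustration of the usual pipeline setup, the preprocessing steps listed above can be written as a single annotators property. This is only a sketch: the class name CorefPipelineProps is made up for this example, and the annotator name dcoref is an assumption about the CoreNLP version in use. The code uses only java.util.Properties, so it runs without CoreNLP on the class path.

```java
import java.util.Properties;

class CorefPipelineProps {
  // Annotator names for the steps listed above.
  // Assumption: "dcoref" is the coreference annotator in the CoreNLP version used.
  static final String ANNOTATORS = "tokenize, ssplit, pos, lemma, ner, parse, dcoref";

  /** Builds the properties that describe the full pipeline. */
  static Properties build() {
    Properties props = new Properties();
    props.setProperty("annotators", ANNOTATORS);
    return props;
  }

  public static void main(String[] args) {
    Properties props = build();
    // With CoreNLP available, these properties would typically be passed
    // to the StanfordCoreNLP constructor to create the pipeline.
    System.out.println(props.getProperty("annotators"));
  }
}
```

A stand-alone coreference system replaces the first six annotators with the output of other tools and keeps only the last step.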

Load properties

In general, we can just create an empty Properties object, because Stanford CoreNLP automatically loads the default properties file from the model jar file, where it sits under edu.stanford.nlp.pipeline.

In other cases, we would like to use specific properties. The following code shows one example of loading the property file from the working directory.

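import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Properties;

private static final String PROPS_SUFFIX = ".properties";

  private Properties loadProperties(String name) {
    // Assumption: the original call is cut off here; the context class
    // loader is used as the default loader.
    return loadProperties(name, Thread.currentThread().getContextClassLoader());
  }

  private Properties loadProperties(String name, ClassLoader loader) {
    // Normalize "a.b.c" or "a.b.c.properties" to the resource path "a/b/c.properties"
    if (name.endsWith(PROPS_SUFFIX))
      name = name.substring(0, name.length() - PROPS_SUFFIX.length());
    name = name.replace('.', '/');
    name += PROPS_SUFFIX;
    Properties result = null;

    // getResourceAsStream returns null on lookup failures
    System.err.println("Searching for resource: " + name);
    InputStream in = loader.getResourceAsStream(name);
    try {
      if (in != null) {
        InputStreamReader reader = new InputStreamReader(in, "utf-8");
        result = new Properties();
        result.load(reader); // Can throw IOException
      }
    } catch (IOException e) {
      result = null;
    } finally {
      // Always release the stream, even if loading failed
      if (in != null) {
        try {
          in.close();
        } catch (IOException ignored) {
          // Nothing useful to do if closing fails
        }
      }
    }
    return result;
  }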

read more

Video recordings of CLSP JHU summer workshop 2014

The Summer Workshop brought together three recurring themes: improved recognition of conversational speech, probabilistic representations of linguistic meaning, and abstract meaning representations for machine translation.

Some highlights

  • Martha Palmer – Designing Abstract Meaning Representations for Machine Translation
  • Percy Liang – The State of the Art in Semantic Parsing
  • Shalom Lappin – A Rich Probabilistic Type Theory for the Semantics of Natural Language
  • David McAllester – The Problem of Reference
  • Stephan Oepen – Broad-Coverage Semantic Dependency Parsing
  • Giorgio Satta – Synchronous Rewriting for Natural Language Processing

read more

[FWD] Two stories from a research paper: Content Without Context is Meaningless

Two stories from a research paper: Content Without Context is Meaningless.

1.1 Machine Learning Hammer

Mark Twain once said: “To a man with a hammer, everything looks like a nail.” His observation is definitely very relevant to current trends in content analysis. We have a Machine Learning Hammer (ML Hammer) that we want to use for solving any problem that needs to be solved. The problem is neither with learning nor with the hammer; the problem is with people who fail to learn that not every problem is a new learning problem [1]. … If we can identify such a feature set, then we can easily model each object by its appropriate feature values. The challenges are

read more

My PhD Proposal Defense: A Study of Relation Extraction

There has been an increasing effort in recent years to extract relations between entities from natural language text. In this dissertation, I will focus on various aspects of recognizing biomedical relations between entities reported in scientific articles.

Approaches to the relation extraction task can be categorized into two major classes: (1) pattern-based approaches and (2) machine learning-based approaches. Pattern-based approaches often use manually designed rules to extract relations. They do not need annotated corpora for training, and it is generally possible to obtain patterns that yield good precision. But pattern-based approaches require domain experts to be closely involved in the design of the system, and it is impractical to manually encode all the patterns necessary for high recall. Machine learning-based approaches, on the other hand, are data-driven and can derive models for automated extraction from a set of annotated data. But their ability to generalize from examples can be hindered by a lack of domain and linguistic knowledge, and their performance is critically dependent on the availability of large amounts of annotated data, which might be expensive to create.

read more

Recommend: The Science of Scientific Writing

George Gopen and Judith Swan. The Science of Scientific Writing. American Scientist. 1990, 78: 550-558.

Our examples of scientific writing have ranged from the merely cloudy to the virtually opaque; yet all of them could be made significantly more comprehensible by observing the following structural principles:

  1. Follow a grammatical subject as soon as possible with its verb.
  2. Place in the stress position the “new information” you want the reader to emphasize.
  3. Place the person or thing whose “story” a sentence is telling at the beginning of the sentence, in the topic position.
  4. Place appropriate “old information” (material already stated in the discourse) in the topic position for linkage backward and contextualization forward.
  5. Articulate the action of every clause or sentence in its verb.
  6. In general, provide context for your reader before asking that reader to consider anything new.
  7. In general, try to ensure that the relative emphases of the substance coincide with the relative expectations for emphasis raised by the structure.

Recommend: How to become good at peer review: A guide for young scientists

Peer review is at the heart of the scientific method. While it’s by no means a perfect system, it is still the best system of scientific quality control that we have. Many graduate programs don’t explicitly teach courses on how to review papers. Instead, a young scientist may learn how to review a paper under the guidance of his or her mentor. How to become good at peer review: A guide for young scientists is a post that puts together a set of guidelines for young scientists.

Recommend: Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!

Publications of EMNLP 2013 are released:

On the list, I found a very interesting article, “Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!”. It discusses the disconnect between industry and academia: while rule-based IE dominates the commercial world, it is widely regarded as a dead-end technology by academia. The following table summarizes the pros and cons of machine learning and rule-based information extraction technologies (reproduced from the above paper).

When to post technology blogs?

Don’t mistake me: I’m not disagreeing with the importance of quality content in posting; on the contrary, I always believe that creating original content is the most essential part of a successful blog. But beyond that, probably we can do a bit better.

When to post is another important aspect of a successful blog. Are certain times better than others? The answer is absolutely yes, but it depends on the industry and the nature of your group personality. This article only focuses on technology blogs. In this post, I attempt to combine different research resources and draw a few basic posting guidelines by hour and day. The timing in the post is relative to the time zone. My main resources include

read more

How to use OpenNLP to do part-of-speech tagging


The Apache OpenNLP library is a machine learning-based toolkit for processing natural language text. One of the most popular machine learning models it supports is the Maximum Entropy Model (MaxEnt) for natural language processing tasks. Part-of-speech tagging (POS tagging) is one of the most common of these tasks.

In short, POS tagging is the process of marking up each word with a particular part of speech. It is usually required to build more advanced text processing services. An explanation of POS tagging and its techniques using MaxEnt is beyond the scope of this post, so I won’t discuss it further. For more details, readers are referred to this report. The Apache OpenNLP documentation provides a very brief introduction to using the POS tagger tool. However, researchers and developers may want to look at its usage in more depth, which is what this post aims at.

Library and model

The latest version of the OpenNLP package can be found here. The current version is 1.5.*. After downloading it, we can extract it to $OPENNLP_HOME.

The Apache OpenNLP project also provides models compatible with the above package. Among these models, we will use en-pos-maxent.bin, which is a MaxEnt model with a tag dictionary and can be downloaded here. The tag set is the Penn Treebank tag set; however, it is unknown which corpus was used to train this model. Since the model is language dependent, I assume it will perform well on common English. After downloading the model, we can put it either in the OpenNLP folder $OPENNLP_HOME or somewhere else. For the sake of simplicity, I will just put it under $OPENNLP_HOME.

With the OpenNLP package and the model in place (the model in the OpenNLP folder), let’s have a look at how to use them for POS tagging. As far as I know, there are at least three ways to invoke the tagger: the command line, POSTaggerTool, and POSTaggerME. Each is discussed below.

Command line

The easiest way to try out the POS tagger is the command line tool. The tool is called opennlp and is located in the bin folder. Assuming the OpenNLP folder is $OPENNLP_HOME, the command line is

$OPENNLP_HOME/bin/opennlp POSTagger $OPENNLP_HOME/en-pos-maxent.bin \
    < sentences

The POS tagger will read sentences from the file sentences. The format is one sentence per line, with words and punctuation tokenized. For example, the input sentence can be (note that the full stop at the end is tokenized too)

Here we have investigated the functional domains of the proteins .

The output is the sentences with POS tags in the format “word_pos [space] word_pos [space] ...”. So the result of the above example will be

Here_RB we_PRP have_VBP investigated_VBN the_DT functional_JJ domains_NNS of_IN 
the_DT proteins_NNS ._PERIOD
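
To work with this output in a program, each line has to be split back into words and tags. The following is a minimal sketch of one way to do that; the class TaggedLineParser and its method are hypothetical helpers written for this post, not part of OpenNLP. It splits each item at the last underscore, on the assumption that a word may contain underscores but a tag does not.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedLineParser {
  /** Splits "word_pos word_pos ..." into {word, tag} pairs. */
  static List<String[]> parse(String line) {
    List<String[]> pairs = new ArrayList<>();
    for (String item : line.trim().split("\\s+")) {
      if (item.isEmpty()) {
        continue; // blank input line
      }
      // Split at the last underscore so that words containing '_' survive
      int sep = item.lastIndexOf('_');
      if (sep < 0) {
        pairs.add(new String[] {item, ""}); // no tag attached
      } else {
        pairs.add(new String[] {item.substring(0, sep), item.substring(sep + 1)});
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    String line = "Here_RB we_PRP have_VBP investigated_VBN the_DT functional_JJ "
        + "domains_NNS of_IN the_DT proteins_NNS ._PERIOD";
    for (String[] pair : parse(line)) {
      System.out.println(pair[0] + "\t" + pair[1]);
    }
  }
}
```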

POSTaggerTool

You can also use POSTaggerTool to invoke the tagger. The tool takes only one parameter, the path of the model file. In fact, the command line above eventually invokes POSTaggerTool in the background, so there seems to be no advantage to using this snippet of code. But it does no harm to know more. :p

public static void main(String[] args) {
  POSTaggerTool tool = new POSTaggerTool();
  tool.run(args); // assumed completion of the cut-off snippet; args[0] is the model path
}

POSTaggerME

If you want to embed the POS tagger into an application, you have to use POSTaggerME. Complete code using POSTaggerME is attached at the very end of this post. In this section, I will only use pieces of code to discuss its usage.

To invoke POSTaggerME, you first need to load the model, then initialize POSTaggerME.

// load the model
InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
POSModel model = new POSModel(modelIn);
// initialize POSTaggerME
POSTaggerME tagger = new POSTaggerME(model);

read more

Load Penn Treebank with offsets


When I do relation extraction based on parse trees, it is always very helpful to map the leaves of the parse trees back to the original text, so that when I look up a leaf (or an inner node), I can find where it comes from. To do so, I wrote a script that adds offsets to the leaves. The script prints parse trees in “Penn Treebank” format. For example, given the sentence and the tree

After 48 h, cells were harvested for luciferase assay.

(S1 (S (S (PP (IN After) 
              (NP (CD 48) 
                  (NN h))) 
          (, ,) 
          (NP (NNS cells)) 
          (VP (VBD were) 
              (VP (VBN harvested) 
                  (PP (IN for) 
                      (NP (NN luciferase) 
                          (NN assay)))))) 
       (. .)))

the parse tree with offsets looks like

(S1 (S (S (PP (IN After_294_299) 
              (NP (CD 48_300_302) 
                  (NN h_303_304))) 
          (, ,_304_305) 
          (NP (NNS cells_306_311)) 
          (VP (VBD were_312_316) 
              (VP (VBN harvested_317_326) 
                  (PP (IN for_327_330) 
                      (NP (NN luciferase_331_341) 
                          (NN assay_342_347)))))) 
       (. ._347_348)))
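
The core of such a mapping can be sketched as follows. This is not the script itself, only an illustration under a few assumptions: the class LeafOffsets is a hypothetical helper, the leaves are given as a flat list of tokens in sentence order, each token appears literally in the text (parsers that escape brackets, for example as -LRB-, would need an extra unescaping step), and base is the offset of the sentence within the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LeafOffsets {
  /** Returns "token_begin_end" for each token, scanning the text left to right. */
  static List<String> annotate(String text, List<String> tokens, int base) {
    List<String> out = new ArrayList<>();
    int cursor = 0; // never search before the end of the previous token
    for (String token : tokens) {
      int begin = text.indexOf(token, cursor);
      if (begin < 0) {
        throw new IllegalArgumentException("Token not found in text: " + token);
      }
      int end = begin + token.length(); // end offset is exclusive
      out.add(token + "_" + (base + begin) + "_" + (base + end));
      cursor = end;
    }
    return out;
  }

  public static void main(String[] args) {
    String text = "After 48 h, cells were harvested for luciferase assay.";
    List<String> tokens = Arrays.asList("After", "48", "h", ",", "cells", "were",
        "harvested", "for", "luciferase", "assay", ".");
    // 294 is the sentence's start offset in the document, as in the example above
    for (String leaf : annotate(text, tokens, 294)) {
      System.out.println(leaf);
    }
  }
}
```

Moving the cursor past each matched token keeps repeated words from being mapped to the same position twice.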

read more