Load Penn Treebank with offsets

By | September 17, 2013

Question

When I do relation extraction based on parse trees, it is very helpful to map the leaves of a parse tree back to the original text, so that when I look up a leaf (or an inner node) I can find where it comes from. To do so, I wrote a script that adds offsets to the leaves. The script prints parse trees in “Penn Treebank” format. For example, given the sentence and the tree

After 48 h, cells were harvested for luciferase assay.

(S1 (S (S (PP (IN After) 
              (NP (CD 48) 
                  (NN h))) 
          (, ,) 
          (NP (NNS cells)) 
          (VP (VBD were) 
              (VP (VBN harvested) 
                  (PP (IN for) 
                      (NP (NN luciferase) 
                          (NN assay)))))) 
       (. .)))

the parse tree with offsets, where each leaf is written as word_begin_end, looks like

(S1 (S (S (PP (IN After_294_299) 
              (NP (CD 48_300_302) 
                  (NN h_303_304))) 
          (, ,_304_305) 
          (NP (NNS cells_306_311)) 
          (VP (VBD were_312_316) 
              (VP (VBN harvested_317_326) 
                  (PP (IN for_327_330) 
                      (NP (NN luciferase_331_341) 
                          (NN assay_342_347)))))) 
       (. ._347_348)))
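
For illustration, offsets like these could be produced by locating each token in the text from left to right. The following is only a sketch, not the original script: OffsetAnnotator and annotate are hypothetical names, base stands for the position of the sentence within the whole document (294 in this example), and the code assumes every token appears verbatim in the text, which may not hold when a tokenizer normalizes tokens such as brackets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: appends "_begin_end" character offsets to each token. */
final class OffsetAnnotator {

  /**
   * Locates each token in text, scanning left to right, and returns strings
   * of the form token_begin_end (begin inclusive, end exclusive).
   * base is the assumed offset of text within the whole document.
   */
  static List<String> annotate(String text, List<String> tokens, int base) {
    List<String> result = new ArrayList<>();
    int cursor = 0; // never search before the end of the previous token
    for (String token : tokens) {
      int begin = text.indexOf(token, cursor);
      if (begin < 0) {
        throw new IllegalArgumentException("Token not found: " + token);
      }
      int end = begin + token.length();
      result.add(token + "_" + (base + begin) + "_" + (base + end));
      cursor = end;
    }
    return result;
  }
}
```

With base set to 0, text.substring(begin, end) gives back the token itself, which makes a cheap sanity check for the offsets.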

The next question is how to load this tree and store the offsets.

Solution

I think there are at least two options:

  1. create a subclass of Tree, or
  2. create a subclass of Label.

Both should work. In this post, I discuss the second option. The Stanford parser package comes with a couple of predefined labels; here I create a subclass of StringLabel. I don’t use StringLabel directly because I also want to print the tree with offsets. To achieve this, I need to override the toString method as shown below. Note that StringLabel already has beginPosition and endPosition.

package tree;

import edu.stanford.nlp.ling.Label;
import edu.stanford.nlp.ling.LabelFactory;
import edu.stanford.nlp.ling.StringLabel;

@SuppressWarnings("serial")
public class MyLabel extends StringLabel {

  public MyLabel() {
    super();
  }

  public MyLabel(Label label) {
    super(label);
  }

  public MyLabel(String str, int beginPosition, int endPosition) {
    super(str, beginPosition, endPosition);
  }

  public MyLabel(String str) {
    super(str);
  }

  @Override
  public LabelFactory labelFactory() {
    return new MyLabelFactory();
  }

  @Override
  public String toString() {
    if (beginPosition() != -1) {
      return super.value() + "_" + beginPosition() + "_" + endPosition();
    } else {
      return super.value();
    }
  }
}

As specified in the javadoc of interface Label,

A subclass that extends another Label class should override the definition of labelFactory(), since the contract for this method is that it should return a factory for labels of the exact same object type.

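So I need a new LabelFactory that creates objects of class MyLabel. The only tricky part is the extra three-argument newLabel, which is not part of the LabelFactory interface and therefore has no @Override.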

package tree;

import edu.stanford.nlp.ling.Label;
import edu.stanford.nlp.ling.LabelFactory;

public class MyLabelFactory implements LabelFactory {

  /**
   * Create a new MyLabel with the given content and offsets.
   * @param labelStr The text of the label
   * @param beginPosition Start offset in original text (inclusive)
   * @param endPosition  End offset in original text (exclusive)
   * @return The new label
   */
  public Label newLabel(String labelStr, int beginPosition, 
      int endPosition) {
    return new MyLabel(labelStr, beginPosition, endPosition);
  }

  @Override
  public Label newLabel(String labelStr) {
    return new MyLabel(labelStr);
  }

  @Override
  public Label newLabel(String labelStr, int options) {
    return new MyLabel(labelStr);
  }

  @Override
  public Label newLabelFromString(String encodedLabelStr) {
    return new MyLabel(encodedLabelStr);
  }

  @Override
  public Label newLabel(Label oldLabel) {
    return new MyLabel(oldLabel);
  }
}

Now we can focus on how to load the “Penn Treebank” file. The Stanford parser package provides PennTreeReaderFactory, which lets you specify your own TreeFactory, the object that creates instances of class Tree. Here you can either use LabeledScoredTreeNode, which represents a tree composed of a root label and an array of daughter parse trees, or create your own tree structure. For the sake of simplicity, I extend LabeledScoredTreeNode by creating

package tree;

import java.util.List;

import edu.stanford.nlp.ling.Label;
import edu.stanford.nlp.ling.LabelFactory;
import edu.stanford.nlp.trees.LabeledScoredTreeNode;
import edu.stanford.nlp.trees.Tree;

@SuppressWarnings("serial")
public class MyTree extends LabeledScoredTreeNode {

  public MyTree(Label label) {
    super(label);
  }

  public MyTree(Label label, List<Tree> children) {
    super(label);
    setChildren(children);
  }

  @Override
  public LabelFactory labelFactory() {
    return new MyLabelFactory();
  }

  @Override
  public String toString() {
    return toStringBuilder(new StringBuilder(), false).toString();
  }
}

Here is another tricky part. I override the toString() method and pass false for the boolean parameter printOnlyLabelValue. This way, toStringBuilder() uses the Label’s toString() method to print the “Penn Treebank” string; otherwise, it would use the Label’s value() method. That’s why I overrode toString() in the MyLabel class earlier.

You may ask why I didn’t override value() in MyLabel instead. The reason is that it would make it harder to compare two tree nodes later. For example, to check whether an inner node is an NP, I may write t1.value().equals("NP"). If I had overridden value(), this would end up comparing “NP_from_to” to “NP”, which is wrong.
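
To see the difference in isolation, here is a minimal sketch that does not use the Stanford classes. OffsetLabel is a hypothetical stand-in for MyLabel: value() stays clean for comparisons, and only toString() carries the offsets.

```java
/** Hypothetical stand-in for MyLabel, written without the Stanford classes. */
final class OffsetLabel {
  private final String value;
  private final int begin; // -1 means "no offset"
  private final int end;

  OffsetLabel(String value, int begin, int end) {
    this.value = value;
    this.begin = begin;
    this.end = end;
  }

  OffsetLabel(String value) {
    this(value, -1, -1);
  }

  /** The bare label text, safe to compare against "NP", "VP", and so on. */
  String value() {
    return value;
  }

  /** The printable form, with offsets appended when they are known. */
  @Override
  public String toString() {
    return begin == -1 ? value : value + "_" + begin + "_" + end;
  }
}
```

A check such as label.value().equals("NP") then behaves the same whether or not the label has offsets.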

OK, with these classes defined, we can go back and create a new TreeFactory. In the method newLeaf(), I parse a string in the format “string_from_to” and store the offset information in MyLabel. Note that only leaves contain offsets in the previous example. (Of course, it is easy to add offsets to the inner nodes too.)

package tree;

import java.util.List;

import edu.stanford.nlp.ling.Label;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeFactory;

public class MyTreeFactory implements TreeFactory {

  private static final MyLabelFactory lf = new MyLabelFactory();

  @Override
  public Tree newLeaf(String word) {
    // Expected format: "string_from_to"; anything else is kept as a plain label.
    int lastUnderline = word.lastIndexOf('_');
    int secondLastUnderline = lastUnderline <= 0
        ? -1
        : word.lastIndexOf('_', lastUnderline - 1);
    if (secondLastUnderline == -1) {
      return new MyTree(lf.newLabel(word));
    }

    try {
      int from = Integer.parseInt(word.substring(
          secondLastUnderline + 1,
          lastUnderline));
      int to = Integer.parseInt(word.substring(lastUnderline + 1));
      return new MyTree(lf.newLabel(
          word.substring(0, secondLastUnderline),
          from,
          to));
    } catch (NumberFormatException e) {
      // The underscores belong to the token itself, not to an offset suffix.
      return new MyTree(lf.newLabel(word));
    }
  }

  @Override
  public Tree newLeaf(Label label) {
    return new MyTree(label);
  }

  @Override
  public Tree newTreeNode(String parent, List<Tree> children) {
    return new MyTree(lf.newLabel(parent), children);
  }

  @Override
  public Tree newTreeNode(Label parent, List<Tree> children) {
    return newTreeNode(parent.value(), children);
  }
}
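
As mentioned before the listing, inner nodes can carry offsets too. One way to do it, sketched here under my own assumptions, is to let every inner node span from the smallest begin to the largest end among its descendants. SpanNode is a hypothetical minimal tree class, not the Stanford Tree, and -1 again means that no offset is known.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical minimal tree node; not the Stanford Tree class. */
final class SpanNode {
  final String value;
  final List<SpanNode> children;
  int begin = -1; // -1 means "no offset known"
  int end = -1;

  /** Leaf with known offsets. */
  SpanNode(String value, int begin, int end) {
    this.value = value;
    this.children = new ArrayList<>();
    this.begin = begin;
    this.end = end;
  }

  /** Inner node; its span is filled in by computeSpans(). */
  SpanNode(String value, SpanNode... children) {
    this.value = value;
    this.children = Arrays.asList(children);
  }

  /** Bottom-up pass: each inner node covers the spans of all its children. */
  void computeSpans() {
    for (SpanNode child : children) {
      child.computeSpans();
      if (child.begin == -1) {
        continue; // this child has no offsets to contribute
      }
      begin = (begin == -1) ? child.begin : Math.min(begin, child.begin);
      end = Math.max(end, child.end);
    }
  }
}
```

For the NP over “luciferase assay” in the example tree, this pass gives the span 331 to 347.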

Finally, I create a unit test file that shows two ways to load a PTB tree: from a string and from the file “test.txt”.

package test;

import org.junit.Test;

import tree.MyTreeFactory;
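import edu.stanford.nlp.trees.MemoryTreebank;
import edu.stanford.nlp.trees.PennTreeReaderFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeFactory;
import edu.stanford.nlp.trees.TreeReaderFactory;
import edu.stanford.nlp.trees.Trees;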

public class PennReaderTest {

  @Test
  public void test1() {

    TreeFactory tf = new MyTreeFactory();

    Tree t = Trees
        .readTree(
            "(S1 (S (S (PP (IN After_294_299) (NP (CD 48_300_302) "
            + "(NN h_303_304))) (, ,_304_305) (NP (NNS cells_306_311)) "
            + "(VP (VBD were_312_316) (VP (VBN harvested_317_326) "
            + "(PP (IN for_327_330) (NP (NN luciferase_331_341) "
            + "(NN assay_342_347)))))) (. ._347_348)))",
            tf);
    System.out.println(t);
  }

  @Test
  public void test2() {

    TreeReaderFactory trf = new PennTreeReaderFactory(
        new MyTreeFactory());

    String filename = "test.txt";
    MemoryTreebank treebank = new MemoryTreebank(trf);
    treebank.loadPath(filename, null, true);

    for (Tree t : treebank) {
      System.out.println(t);
    }
    System.out.println();
  }
}
