Load Penn Treebank with offsets


When I do relation extraction based on parse trees, it is very helpful to map the leaves of a parse tree back to the original text. That way, when I search for a leaf (or an inner node), I can find where it comes from. To do this, I wrote a script that adds character offsets to the leaves. The script prints parse trees in “Penn Treebank” format. For example, given the sentence and the tree

After 48 h, cells were harvested for luciferase assay.

(S1 (S (S (PP (IN After) 
              (NP (CD 48) 
                  (NN h))) 
          (, ,) 
          (NP (NNS cells)) 
          (VP (VBD were) 
              (VP (VBN harvested) 
                  (PP (IN for) 
                      (NP (NN luciferase) 
                          (NN assay)))))) 
       (. .)))

the parse tree with offsets looks like

(S1 (S (S (PP (IN After_294_299) (NP (CD 48_300_302) (NN h_303_304))) (, ,_304_305) (NP (NNS cells_306_311)) (VP (VBD were_312_316) (VP (VBN harvested_317_326) (PP (IN for_327_330) (NP (NN luciferase_331_341) (NN assay_342_347)))))) (. ._347_348)))
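
The script itself is not reproduced here, so the following is only a minimal Python sketch of the idea, not the original implementation. It assumes that each leaf appears in the text exactly as it is spelled in the tree (so escaped tokens such as -LRB- would need extra handling), that leaves occur in the tree in the same order as in the text, and that the sentence starts at a known character position in the document. The names add_offsets and base are hypothetical and chosen for illustration.

```python
import re

# A preterminal has the form "(TAG token)" with no nested parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def add_offsets(tree, text, base=0):
    """Rewrite every leaf of `tree` as token_start_end.

    `text` is the sentence the tree was parsed from, and `base` is the
    position of that sentence in the full document (an assumption of
    this sketch; use 0 for sentence-relative offsets).
    """
    cursor = 0  # position in `text` just past the last matched token

    def annotate(match):
        nonlocal cursor
        tag, token = match.group(1), match.group(2)
        # Search only to the right of the previous token, so repeated
        # words are matched in order.
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        cursor = end
        return f"({tag} {token}_{base + start}_{base + end})"

    # Collapse line breaks and indentation so the tree prints on one line.
    one_line = " ".join(tree.split())
    return LEAF.sub(annotate, one_line)


sentence = "After 48 h, cells were harvested for luciferase assay."
tree = (
    "(S1 (S (S (PP (IN After) (NP (CD 48) (NN h))) (, ,) (NP (NNS cells)) "
    "(VP (VBD were) (VP (VBN harvested) (PP (IN for) "
    "(NP (NN luciferase) (NN assay)))))) (. .)))"
)
# 294 is where this sentence starts in the source document, as implied
# by the first offset in the example above.
annotated = add_offsets(tree, sentence, base=294)
print(annotated)
```

Scanning the text left to right with a moving cursor keeps the sketch short, and it fails loudly with a ValueError when a token cannot be located instead of silently emitting wrong offsets.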