Toolkits

  • bioc is a Python package that provides BioC data structures and an encoder/decoder for them. It exposes an API familiar to users of the standard library marshal and pickle modules. Development of bioc happens on GitHub.
  • The Common is a project focused on various aspects of reusable Java components. Its principal goal is to try new things! Therefore, if I find a good third-party package that fulfills the same function, I would rather reuse it, unless I am curious about how it works. The Common is written in Java and licensed under the BSD 3-clause license. Note that you can use it anywhere, except in your homework :P. The Common includes, but is not limited to, the following components

  • pengyifan-bioc is another Java implementation of the BioC format. The BioC XML format can be used to share text documents and annotations. The Java BioC IO API is designed to be independent of the particular XML parser used.

  • BioNLP2BioC is a simple converter from the BioNLP-ST 2011 and 2013 GE task corpora to the BioC format, a simple XML format for sharing text documents and annotations. In the converted data, text files (.txt) in the BioNLP corpora are split at newlines and stored as BioCPassages. Entities (in .a1) and event triggers (in .a2) are stored in separate passages according to their positions in the text files. Target annotations (in .a2), including events, relations, event modifications, and equivalences, are annotated at the document level. This converter was created to participate in BioCreative IV Track 1.

  • iSimp: A challenge in designing and applying NLP systems to biomedical text is the complexity of sentences. One possible approach to alleviate this is to simplify the sentences. We developed iSimp, which reduces the syntactic complexity of sentences and thus improves the performance of NLP systems (e.g., relation extraction systems). To make iSimp readily usable in NLP and text mining tools, we participated in the BioCreative IV BioC track and adopted the BioC format, a simple XML format for sharing text documents and annotations. The Java API developed as part of the iSimp project became part of the public release of the BioC package.

  • Beamer theme with UD logo contains definitions of the UD colors (yellow and blue) and the UD logo. It also contains some smart charts that I think will be useful in slides.

  • Resume LaTeX template provides a LaTeX template for a resume. I used version 1 four years ago, but I am currently using version 4, mainly because it follows basic design principles for fonts, alignment, and structure. It is neat and simple, but no simpler.

  • This part-of-speech model is trained on the GENIA corpus (1279685-9817603). The pre-trained model is for the OpenNLP 1.5 series. The model is language dependent and only performs well if the model language matches the language of the input text. The model file is zip compressed (like a jar file) and must not be uncompressed.

  • PennTreebankReader is a piece of Java code that reads parse trees in the Penn Treebank bracketing format. It is written without any third-party packages. DefaultMutableTreeNode and DefaultTreeModel in javax.swing.tree are used as the tree container.

  • Woogle Pinyin Input Method is a Java Pinyin input method. The most important difference between it and current input methods is that it integrates more syntactic information, combining n-gram models and dependency parsing. (1) The n-gram is the most widely used language model and is integrated into products such as Sogou IME, Google IME, and MS IME. (2) Dependency parsing can capture long-distance information in the sentence, which we expect to bring into the n-gram model. Note: the source code has not been updated since Feb. 2010.
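
The passage splitting described for BioNLP2BioC can be sketched roughly as follows. This is a minimal Python illustration, not the converter's actual code: the Passage class and both function names are hypothetical, empty lines are skipped by assumption, and "stored by position" is read here as attaching each span to the passage that contains it.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    """Hypothetical stand-in for a BioC passage: text plus its document offset."""
    offset: int
    text: str
    annotations: list = field(default_factory=list)


def split_into_passages(text):
    """Split text at newlines, recording each line's character offset."""
    passages = []
    offset = 0
    for line in text.split("\n"):
        if line:  # assumption: empty lines produce no passage
            passages.append(Passage(offset, line))
        offset += len(line) + 1  # +1 for the newline that split() removed
    return passages


def assign_annotations(passages, spans):
    """Attach each (id, start, end) span to the passage that contains it."""
    for ann_id, start, end in spans:
        for p in passages:
            if p.offset <= start and end <= p.offset + len(p.text):
                p.annotations.append(ann_id)
                break
    return passages


text = "IL-2 gene expression\nrequires NF-kappa B activation."
# Standoff-style spans: (id, start offset, end offset), offsets into the whole file
spans = [("T1", 0, 4), ("T2", 30, 40)]
passages = assign_annotations(split_into_passages(text), spans)
```

Keeping document-level offsets on each passage means the original .a1/.a2 positions can be reused unchanged, which is the main reason for tracking the offset while splitting.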
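
For readers who have not seen the Penn Treebank bracketing format that PennTreebankReader handles, here is a small stack-based reader. It is a Python sketch for illustration only: the real project is Java and stores trees in DefaultMutableTreeNode, whereas the Node class and parse_tree function below are hypothetical names.

```python
import re


class Node:
    """Hypothetical tree node: a label plus an ordered list of children."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def leaves(self):
        """Return the words at the leaves, left to right."""
        if not self.children:
            return [self.label]
        return [w for c in self.children for w in c.leaves()]


def parse_tree(s):
    """Parse one bracketed tree such as (S (NP (DT the) (NN cat)) ...)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    stack, root, prev = [], None, None
    for tok in tokens:
        if tok == "(":
            node = Node("")  # label is filled in by the next token, if any
            if stack:
                stack[-1].children.append(node)
            stack.append(node)
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            root = stack.pop()
        elif prev == "(":
            stack[-1].label = tok  # first token after '(' is the node label
        else:
            if not stack:
                raise ValueError("token outside brackets")
            stack[-1].children.append(Node(tok))  # a word under a POS tag
        prev = tok
    if stack or root is None:
        raise ValueError("unbalanced brackets")
    return root


tree = parse_tree("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
print(tree.label, tree.leaves())
```

An unlabeled outer bracket, as in "( (S ...))", simply yields a root node with an empty label in this sketch.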
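
To make the n-gram part of the Woogle description concrete, the sketch below ranks two candidate conversions of the same pinyin with a bigram model. It is not Woogle's code: the function names, the add-one smoothing, and the three-sentence toy corpus are all assumptions made for illustration, and the dependency-parsing part is not covered.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences, with <s> padding."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sent, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a tokenized sentence."""
    vocab = len(unigrams)
    tokens = ["<s>"] + sent
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(tokens, tokens[1:])
    )


# Made-up toy corpus; a real input method would train on a large one.
corpus = [["我们", "试试"], ["我们", "试试", "看"], ["这", "是", "事实"]]
unigrams, bigrams = train_bigram(corpus)

# Two readings of the pinyin "wo men shi shi"
candidates = [["我们", "事实"], ["我们", "试试"]]
best = max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))
print("".join(best))
```

A bigram only looks one word back, which is the limitation the long-distance information from dependency parsing is meant to address.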
