How to convert source code to HTML, RTF, SVG, etc.

I am working on a paper that needs a piece of XML to be syntax highlighted. I’ve found that Sublime with the “Copy as RTF” plugin is useful, but as a programmer I prefer something that can be done via the command line and, more importantly, is easily customizable.

So I did some searching and came across highlight. Installing it on Ubuntu is quite simple:

sudo apt-get install highlight

Then I can use highlight to convert the XML file to RTF and copy it into the paper I am working on.

A Java implementation of data structures and code to read/write Brat standoff format.

A Java implementation of data structures and code to read/write Brat standoff format. https://github.com/yfpeng/pengyifan-brat

Brat

(from brat standoff format)

Annotations created in brat are stored on disk in a standoff format: annotations are stored separately from the annotated document text, which is never modified by the tool.

For each text document in the system, there is a corresponding annotation file. The two are associated by the file naming convention that their base name (file name without suffix) is the same: for example, the file DOC-1000.ann contains annotations for the file DOC-1000.txt.
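The naming convention above can be sketched in a few lines of Java. This is only an illustration: the class BratFileNames and the method annotationFileFor are hypothetical names, not part of brat or pengyifan-brat.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of brat or pengyifan-brat.
final class BratFileNames {

  /** Returns the sibling path that shares the base name but ends in ".ann". */
  static Path annotationFileFor(Path textFile) {
    String fileName = textFile.getFileName().toString();
    int dot = fileName.lastIndexOf('.');
    // Base name = file name without its suffix (everything before the last dot).
    String baseName = dot < 0 ? fileName : fileName.substring(0, dot);
    return textFile.resolveSibling(baseName + ".ann");
  }

  public static void main(String[] args) {
    System.out.println(annotationFileFor(Paths.get("DOC-1000.txt"))); // DOC-1000.ann
  }
}
```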

Within the document, individual annotations are connected to specific spans of text through character offsets. For example, in a document beginning “Japan was today struck by …” the text “Japan” is identified by the offset range 0..5. (All offsets are indexed from 0 and include the character at the start offset but exclude the character at the end offset.)
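The offset convention happens to match Java's String.substring, which also includes the start index and excludes the end index. A minimal sketch, assuming the document text fits in plain Java chars (no surrogate pairs); BratOffsets and coveredText are hypothetical names used only for illustration.

```java
// Hypothetical helper for illustration; not part of brat or pengyifan-brat.
final class BratOffsets {

  /** Returns the text covered by the half-open offset range [start, end). */
  static String coveredText(String document, int start, int end) {
    // substring uses the same convention: start inclusive, end exclusive.
    return document.substring(start, end);
  }

  public static void main(String[] args) {
    String document = "Japan was today struck by";
    System.out.println(coveredText(document, 0, 5)); // Japan
  }
}
```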

Getting started

<dependency>
  <groupId>com.pengyifan.brat</groupId>
  <artifactId>pengyifan-brat</artifactId>
  <version>1.1.0</version>
</dependency>

or

<repositories>
  <repository>
    <id>oss-sonatype</id>
    <name>oss-sonatype</name>
    <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
...
<dependency>
  <groupId>com.pengyifan.brat</groupId>
  <artifactId>pengyifan-brat</artifactId>
  <version>1.2.0-SNAPSHOT</version>
</dependency>

Another Java implementation of BioC

Data structures and code to read/write BioC XML. https://github.com/yfpeng/pengyifan-bioc

BioC

BioC XML format can be used to share text documents and annotations.
The development of the Java BioC IO API is independent of the particular XML parser used.

Resolve coreference using Stanford CoreNLP

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. The Stanford CoreNLP coreference resolution system is a state-of-the-art system for resolving coreference in text. To use the system, we usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, and parsing. Sometimes, however, we use other tools for preprocessing, particularly when we are working on a specific domain. In these cases, we need a stand-alone coreference resolution system. This post demonstrates how to create such a system using Stanford CoreNLP.
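For comparison, the usual full pipeline is configured through an "annotators" property. The sketch below only builds that Properties object with the standard library; it assumes a CoreNLP release that ships the deterministic "dcoref" annotator, and the class name PipelineProperties is hypothetical. With the CoreNLP jars on the classpath, the result would normally be passed to the StanfordCoreNLP constructor, which is left out here so the example runs on its own.

```java
import java.util.Properties;

// Hypothetical class name; only java.util is used so the sketch runs without CoreNLP.
final class PipelineProperties {

  /** Builds properties for a full pipeline that ends in coreference resolution. */
  static Properties fullPipeline() {
    Properties props = new Properties();
    // One annotator per preprocessing step listed above, followed by "dcoref"
    // (assumption: a CoreNLP version where deterministic coref is named "dcoref").
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    return props;
  }

  public static void main(String[] args) {
    System.out.println(fullPipeline().getProperty("annotators"));
  }
}
```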

Load properties

In general, we can just create an empty Properties, because the Stanford CoreNLP tool can automatically load the default one in the model jar file, which is under edu.stanford.nlp.pipeline.

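In other cases, we would like to use specific properties. The following code shows one example of loading a property file by name through a class loader, so the file has to be on the classpath (for example, the working directory when it is part of the classpath).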

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Properties;

import edu.stanford.nlp.io.IOUtils;

private static final String PROPS_SUFFIX = ".properties";

private Properties loadProperties(String name) {
  return loadProperties(name, Thread.currentThread().getContextClassLoader());
}

private Properties loadProperties(String name, ClassLoader loader) {
  // Normalize "a.b.c" or "a.b.c.properties" to the resource path "a/b/c.properties".
  if (name.endsWith(PROPS_SUFFIX))
    name = name.substring(0, name.length() - PROPS_SUFFIX.length());
  name = name.replace('.', '/');
  name += PROPS_SUFFIX;
  Properties result = null; // Returns null on lookup failures
  System.err.println("Searching for resource: " + name);
  InputStream in = loader.getResourceAsStream(name);
  try {
    if (in != null) {
      InputStreamReader reader = new InputStreamReader(in, "utf-8");
      result = new Properties();
      result.load(reader); // Can throw IOException
    }
  } catch (IOException e) {
    result = null;
  } finally {
    IOUtils.closeIgnoringExceptions(in);
  }
  return result;
}
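The name handling at the start of loadProperties can be looked at in isolation. The sketch below repeats only that step with the standard library; ResourceNames and toResourcePath are hypothetical names introduced for illustration.

```java
// Hypothetical helper that isolates the name normalization used in loadProperties.
final class ResourceNames {
  private static final String PROPS_SUFFIX = ".properties";

  /** Turns "a.b.c" or "a.b.c.properties" into the resource path "a/b/c.properties". */
  static String toResourcePath(String name) {
    if (name.endsWith(PROPS_SUFFIX)) {
      // Drop the suffix first so its dot is not turned into a slash.
      name = name.substring(0, name.length() - PROPS_SUFFIX.length());
    }
    return name.replace('.', '/') + PROPS_SUFFIX;
  }

  public static void main(String[] args) {
    // Prints edu/stanford/nlp/pipeline/StanfordCoreNLP.properties
    System.out.println(toResourcePath("edu.stanford.nlp.pipeline.StanfordCoreNLP"));
    // Prints coref.properties
    System.out.println(toResourcePath("coref.properties"));
  }
}
```

One consequence of this design is that dots in the name are always treated as package separators, so a file such as my.coref.properties is looked up as my/coref.properties.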

How to access file resources in JUnit tests

In Maven, any file under src/test/resources is copied to target/test-classes. How can these resource files be accessed in JUnit? Use the class’s getResource method, which locates the file on the test classpath (target/test-classes).

URL url = this.getClass().getResource("/" + TEST_FILENAME);
File file = new File(url.getFile());
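One caveat: URL.getFile() returns the path still percent-encoded, so a directory or file name containing a space comes back with %20 in it. A minimal sketch of a variant that goes through a URI instead; ResourceFiles and toFile are hypothetical names, and the file: URL in main is made up for the demonstration.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper for illustration.
final class ResourceFiles {

  /** Converts a file: URL to a File, decoding escapes such as %20. */
  static File toFile(URL url) {
    try {
      return new File(url.toURI());
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Not a valid URI: " + url, e);
    }
  }

  public static void main(String[] args) throws Exception {
    // Made-up URL standing in for the result of getClass().getResource(...).
    URL url = URI.create("file:/tmp/my%20file.txt").toURL();
    System.out.println(url.getFile());         // /tmp/my%20file.txt
    System.out.println(toFile(url).getName()); // my file.txt
  }
}
```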

Referees’ quotes published by Environmental Microbiology

I recently read a series of referees’ quotes published by Environmental Microbiology. As stated in every article, the referees are busy, serious individuals who give selflessly of their precious time to improve manuscripts. But, once in a while, their humour (or admiration) gets the better of them. Here are some quotes that I like most.

2007

  • I recommend the authors to get in contact with, e.g. sanitary engineers or fermentation/process engineers and not to try and invent the wheel again.
  • I only am willing to read this again if it has less than 20 pages (Ed.: the original submission had 54)!

Video recordings of CLSP JHU summer workshop 2014

The Summer Workshop brought together three recurring themes: improved recognition of conversational speech, probabilistic representations of linguistic meaning, and abstract meaning representations for machine translation.

Some highlights

  • Martha Palmer – Designing Abstract Meaning Representations for Machine Translation
  • Percy Liang – The State of the Art in Semantic Parsing
  • Shalom Lappin – A Rich Probabilistic Type Theory for the Semantics of Natural Language
  • David McAllester – The Problem of Reference
  • Stephan Oepen – Broad-Coverage Semantic Dependency Parsing
  • Giorgio Satta – Synchronous Rewriting for Natural Language Processing