Category Archives: java

How to use Whatizit Web Services in Java

Whatizit is a text processing system that allows you to perform text-mining tasks. It is also available as a Web Service, whose underlying idea is to ensure that software from various sources works well together. Whatizit is built on the open standards Simple Object Access Protocol (SOAP) and Web Services Description Language (WSDL). For the transport layer, Web Services use most of the commonly available network protocols, especially Hypertext Transfer Protocol (HTTP). For more information on WSDL, please refer to the W3C WSDL v1.1 document.


How to create a Web Application Project with Java/Maven/Jetty

How to create a Web Application Project with Java/Maven/Jetty or Tomcat

In this article, we create a simple web application with the Maven Archetype plugin. We’ll run this web application in a Servlet container named Jetty, add some dependencies, write simple Servlets, and generate a WAR file. At the end of this article, you will also be able to deploy the service in Tomcat.

System requirements

Creating the Web Service Step by Step

This section explains how to create this simple web project from an EMPTY folder.

Creating the Simple Web Project

To create your web application, run the following command:

$ mvn archetype:generate -DgroupId=com.pengyifan.simpleweb \
      -DartifactId=simple-webapp \
      -Dpackage=com.pengyifan.simpleweb \
      -DarchetypeArtifactId=maven-archetype-webapp \
      -Dversion=1.0-SNAPSHOT \
      -DinteractiveMode=false

...
[INFO] BUILD SUCCESS

Once the Maven Archetype plugin creates the project, change into the simple-webapp directory and take a look at the pom.xml. You should see the following:

<project xmlns="http://maven.apache.org/POM/4.0.0" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
  http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.pengyifan.simpleweb</groupId>
  <artifactId>simple-webapp</artifactId>
  <packaging>war</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>simple-webapp Maven Webapp</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <finalName>simple-webapp</finalName>
  </build>
</project>


How to collect Immutable Collection in Java

To begin this story, let’s first have a look at how to create a List from a Stream in Java:

List<String> sublist = list
  .stream()
  .filter(...)
  .collect(Collectors.toList());

This works perfectly fine, but what if we want the list to be immutable? We could do this:

List<String> immutableSubList = Collections.unmodifiableList(sublist);

or if we would like to use Guava ImmutableList, we could do

ImmutableList<String> immutableSubList = ImmutableList.copyOf(sublist);

However, this is a bit awkward: it takes an extra step after collecting, and ImmutableList.copyOf copies the list one more time. If we want to do this in many places throughout the code base, it is not fluent. Instead, what we want is

ImmutableList<String> sublist = list
  .stream()
  .filter(...)
  .collect(ImmutableCollectors.toList());

This post will discuss how to create the Collector of ImmutableList.

Collector

To create a Collector, we will use the static method Collector.of.

public static <T, A, R> Collector<T, A, R> of(
  Supplier<A> supplier,
  BiConsumer<A, T> accumulator,
  BinaryOperator<A> combiner,
  Function<A, R> finisher,
  Characteristics... characteristics);
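As a minimal sketch of how the four functions fit together, the following collector accumulates into an ArrayList and wraps the result with Collections.unmodifiableList in the finisher. The class name ImmutableCollectors mirrors the usage earlier in this post. To stay self-contained it returns a JDK unmodifiable List rather than a Guava ImmutableList, so treat it as an illustration of Collector.of rather than the final implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collector;

final class ImmutableCollectors {

  private ImmutableCollectors() {}

  // Collects stream elements into an unmodifiable List (JDK only, no Guava).
  static <T> Collector<T, List<T>, List<T>> toList() {
    return Collector.<T, List<T>, List<T>>of(
        ArrayList::new,                  // supplier: creates the mutable container
        List::add,                       // accumulator: adds one element
        (left, right) -> {               // combiner: merges two partial results
          left.addAll(right);
          return left;
        },
        Collections::unmodifiableList);  // finisher: wraps the result read-only
  }

  public static void main(String[] args) {
    List<String> sublist = Arrays.asList("a", "bb", "ccc").stream()
        .filter(s -> s.length() > 1)
        .collect(ImmutableCollectors.toList());
    System.out.println(sublist); // prints [bb, ccc]
  }
}
```

Calling sublist.add(...) on the result throws UnsupportedOperationException, which is the behavior we want from an immutable result.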


A Java implementation of data structures and code to read/write Brat standoff format.

A Java implementation of data structures and code to read/write Brat standoff format. https://github.com/yfpeng/pengyifan-brat

Brat

(from brat standoff format)

Annotations created in brat are stored on disk in a standoff format: annotations are stored separately from the annotated document text, which is never modified by the tool.

For each text document in the system, there is a corresponding annotation file. The two are associated by the file naming convention that their base name (file name without suffix) is the same: for example, the file DOC-1000.ann contains annotations for the file DOC-1000.txt.

Within the document, individual annotations are connected to specific spans of text through character offsets. For example, in a document beginning “Japan was today struck by …” the text “Japan” is identified by the offset range 0..5. (All offsets are indexed from 0 and include the character at the start offset but exclude the character at the end offset.)
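This offset convention matches Java's String.substring, which also includes the start index and excludes the end index. The sketch below parses one contiguous text-bound annotation line and recovers the covered text. It does not use the pengyifan-brat API; the class and method names and the entity type Location are made up for illustration.

```java
final class BratSpanExample {

  // A contiguous text-bound annotation: ID, type, and [start, end) offsets.
  static final class Span {
    final String id;
    final String type;
    final int start;
    final int end;

    Span(String id, String type, int start, int end) {
      this.id = id;
      this.type = type;
      this.start = start;
      this.end = end;
    }

    // substring uses the same convention: start inclusive, end exclusive.
    String coveredText(String document) {
      return document.substring(start, end);
    }
  }

  // Parses a tab-separated line such as "T1<TAB>Location 0 5<TAB>Japan".
  static Span parse(String line) {
    String[] fields = line.split("\t");
    String[] typeAndOffsets = fields[1].split(" ");
    return new Span(
        fields[0],
        typeAndOffsets[0],
        Integer.parseInt(typeAndOffsets[1]),
        Integer.parseInt(typeAndOffsets[2]));
  }

  public static void main(String[] args) {
    String text = "Japan was today struck by ...";
    Span span = parse("T1\tLocation 0 5\tJapan");
    System.out.println(span.coveredText(text)); // prints Japan
  }
}
```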

Getting started

<dependency>
  <groupId>com.pengyifan.brat</groupId>
  <artifactId>pengyifan-brat</artifactId>
  <version>1.1.0</version>
</dependency>

or

<repositories>
    <repository>
        <id>oss-sonatype</id>
        <name>oss-sonatype</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>
...
<dependency>
  <groupId>com.pengyifan.brat</groupId>
  <artifactId>pengyifan-brat</artifactId>
  <version>1.2.0-SNAPSHOT</version>
</dependency>


Another Java implementation of BioC

Data structures and code to read/write BioC XML. https://github.com/yfpeng/pengyifan-bioc

BioC

BioC XML format can be used to share text documents and annotations.
The development of Java BioC IO API is independent of the particular XML parser used.


Java data structure to use C implementation of word2vec

Data structure to use C implementation of word2vec. https://github.com/yfpeng/pengyifan-word2vec

Getting started

<dependency>
  <groupId>com.pengyifan.word2vec</groupId>
  <artifactId>pengyifan-word2vec</artifactId>
  <version>0.0.1</version>
</dependency>

or


    
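<repositories>
    <repository>
        <id>oss-sonatype</id>
        <name>oss-sonatype</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>
...
<dependency>
  <groupId>com.pengyifan.word2vec</groupId>
  <artifactId>pengyifan-word2vec</artifactId>
  <version>0.0.1-SNAPSHOT</version>
</dependency>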

Webpage

The official word2vec webpage is available with all up-to-date instructions and code.

Resolve coreference using Stanford CoreNLP

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. The Stanford CoreNLP coreference resolution system is a state-of-the-art system for resolving coreference in text. To use the system, we usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, and parsing. Sometimes, however, we use other tools for preprocessing, particularly when we are working on a specific domain. In these cases, we need a stand-alone coreference resolution system. This post demonstrates how to create such a system using Stanford CoreNLP.

Load properties

In general, we can just create an empty Properties, because the Stanford CoreNLP tool can automatically load the default one in the model jar file, which is under edu.stanford.nlp.pipeline.

In other cases, we would like to use specific properties. The following code shows one example of loading a properties file from the classpath through a class loader.

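import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import edu.stanford.nlp.io.IOUtils;

  private static final String PROPS_SUFFIX = ".properties";

  // Loads the named properties using the current thread's context class loader.
  private Properties loadProperties(String name) {
    return loadProperties(name,
        Thread.currentThread().getContextClassLoader());
  }

  // Returns null if the resource cannot be found or read.
  private Properties loadProperties(String name, ClassLoader loader) {
    // Normalize "a.b.c" or "a.b.c.properties" to the path "a/b/c.properties".
    if (name.endsWith(PROPS_SUFFIX)) {
      name = name.substring(0, name.length() - PROPS_SUFFIX.length());
    }
    name = name.replace('.', '/');
    name += PROPS_SUFFIX;
    Properties result = null;

    System.err.println("Searching for resource: " + name);
    // getResourceAsStream returns null on lookup failures
    InputStream in = loader.getResourceAsStream(name);
    try {
      if (in != null) {
        InputStreamReader reader =
            new InputStreamReader(in, StandardCharsets.UTF_8);
        result = new Properties();
        result.load(reader); // Can throw IOException
      }
    } catch (IOException e) {
      result = null;
    } finally {
      IOUtils.closeIgnoringExceptions(in);
    }

    return result;
  }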
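The name handling above means that a dotted name such as a.b.config and the same name with a .properties suffix resolve to the same classpath resource. A JDK-only sketch of the same idea follows, using try-with-resources in place of IOUtils. The class name ClasspathProperties and the resource names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class ClasspathProperties {

  private static final String PROPS_SUFFIX = ".properties";

  private ClasspathProperties() {}

  // Converts "a.b.config" or "a.b.config.properties" to "a/b/config.properties".
  static String toResourcePath(String name) {
    if (name.endsWith(PROPS_SUFFIX)) {
      name = name.substring(0, name.length() - PROPS_SUFFIX.length());
    }
    return name.replace('.', '/') + PROPS_SUFFIX;
  }

  // Returns the loaded properties, or null if the resource is missing or unreadable.
  static Properties load(String name, ClassLoader loader) {
    InputStream in = loader.getResourceAsStream(toResourcePath(name));
    if (in == null) {
      return null;
    }
    // Closing the reader also closes the underlying stream.
    try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      Properties result = new Properties();
      result.load(reader);
      return result;
    } catch (IOException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(toResourcePath("a.b.config")); // prints a/b/config.properties
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    System.out.println(load("no.such.resource", loader));
  }
}
```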


How to access file resources in JUnit tests

In Maven, any file under src/test/resources is copied to target/test-classes. To access these resource files in a JUnit test, use the class’s getResource method, which locates the file on the test classpath (target/test-classes).

URL url = this.getClass().getResource("/" + TEST_FILENAME);
File file = new File(url.getFile());
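One caveat: URL.getFile() returns the path in URL-encoded form, so the snippet above can fail when the project path contains spaces or other escaped characters. A sketch of a more robust conversion via URL.toURI() is shown below. The class name and the temporary directory are assumptions that exist only to make the example self-contained.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

final class ResourceFileExample {

  private ResourceFileExample() {}

  // Converts a file: URL (e.g. from getResource) to a File, decoding escapes such as %20.
  static File toFile(URL url) throws URISyntaxException {
    return new File(url.toURI());
  }

  public static void main(String[] args) throws Exception {
    // Simulate a resource whose path contains a space.
    Path dir = Files.createTempDirectory("test classes");
    Path resource = Files.createFile(dir.resolve("sample.txt"));
    URL url = resource.toUri().toURL();

    // The raw URL path still contains %20, so this File does not exist.
    System.out.println(new File(url.getFile()).exists()); // prints false
    // Going through the URI decodes the path correctly.
    System.out.println(toFile(url).exists());             // prints true

    Files.delete(resource);
    Files.delete(dir);
  }
}
```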