Monthly Archives: January 2014

Using regex to hanging indent a paragraph in Java

This post shows how to apply a hanging indent to a long paragraph using a regular expression. The method respects word boundaries, so it never breaks a word to fit the indentation. To illustrate the problem, consider the following paragraph:

There has been an increasing effort in recent years to extract relations between entities from natural language text. In this dissertation, I will focus on various aspects of recognizing biomedical relations between entities reported in scientific articles.

The output should be

There has been an increasing effort in recent years to extract relations between
  entities from natural language text. In this dissertation, I will focus on
  various aspects of recognizing biomedical relations between entities reported
  in scientific articles.

My method

We need a regular expression that breaks the paragraph into a sequence of lines of bounded length. Suppose the text width is 80 and the indent is 3: the first line can then hold up to 80 characters, and every subsequent line up to 77.

The main steps of the algorithm are as follows:

  1. Fill the first line with up to 80 characters
  2. For the remaining text, replace each splitting point with a line break followed by three spaces

To find the splitting points, we use the regular expression (.{1,77})\s+. It matches a run of at most 77 characters that is followed by one or more whitespace characters; because the quantifier is greedy, each match is as long as possible. We then replace every match with three spaces, the captured group ($1), and a line break. Therefore, the Java code should look like this:

String regex = "(.{1,77})\\s+";
String replacement = "   $1\n";
// Strings are immutable, so keep the returned value
text = text.replaceAll(regex, replacement);
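
As a quick check, the sketch below applies the same regex to a short string, with a line width of 7 instead of 77. The class name RegexWrapDemo, the wrap helper, and the sample text are made up for illustration. The sample text deliberately ends with a space so that every chunk is followed by whitespace.

```java
class RegexWrapDemo {

    // Breaks text at whitespace into chunks of at most lineWidth characters
    // and prefixes every chunk with three spaces, as in the snippet above.
    static String wrap(String text, int lineWidth) {
        String regex = "(.{1," + lineWidth + "})\\s+";
        String replacement = "   $1\n";
        return text.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        // Hypothetical sample input; note the trailing space.
        String text = "aaa bbb ccc ddd eee ";
        System.out.print(wrap(text, 7));
        // prints:
        //    aaa bbb
        //    ccc ddd
        //    eee
    }
}
```

Note that every line, including the first, gets the three-space prefix here. The first line has to be treated separately to obtain a hanging indent.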

This regex works well except for the last line. If the given text doesn’t end with whitespace, such as \n, the last line will not be handled correctly. Suppose the last line is

in scientific articles.

In the last search, the regex cannot find any whitespace at the end of the line, so it backtracks to the space between “scientific” and “articles”. As a result, we get

...
   in scientific
articles.

To overcome this problem, I append a fake “\n” to the end of the paragraph and remove it again after formatting.
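
Here is a minimal sketch of this fix, with the indent left out to keep the output easy to read. The names SentinelDemo, wrap, and wrapWithSentinel are only for illustration.

```java
class SentinelDemo {

    // Plain regex wrapping: the last chunk is only handled correctly
    // if the text ends with whitespace.
    static String wrap(String text, int lineWidth) {
        return text.replaceAll("(.{1," + lineWidth + "})\\s+", "$1\n");
    }

    // Appends a fake "\n" before wrapping and removes the final
    // line break from the result afterwards.
    static String wrapWithSentinel(String text, int lineWidth) {
        String wrapped = wrap(text + "\n", lineWidth);
        return wrapped.substring(0, wrapped.length() - 1);
    }

    public static void main(String[] args) {
        String lastLine = "in scientific articles.";

        // Without the sentinel the line is split although it fits in 30 characters.
        System.out.println(wrap(lastLine, 30));
        // prints:
        // in scientific
        // articles.

        // With the sentinel the line stays in one piece.
        System.out.println(wrapWithSentinel(lastLine, 30));
        // prints:
        // in scientific articles.
    }
}
```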

The rest of the code is trivial, and my source code is attached below. I use Apache Commons Lang to generate the indent spaces and to validate the arguments. For a more recent version, you can check my GitHub.

  import org.apache.commons.lang3.StringUtils;
  import org.apache.commons.lang3.Validate;

  /**
   * Formats a paragraph so that all lines but the first are indented.
   *
   * @param text text to be formatted
   * @param hangIndent hanging indentation. hangIndent >= 0
   * @param width the width of the formatted paragraph. width > hangIndent
   * @param considerSpace true to split only at white spaces
   * @return the formatted paragraph
   */
  public static String hangIndent(String text, int hangIndent, int width,
      boolean considerSpace) {
    Validate.isTrue(
        hangIndent >= 0,
        "hangIndent should not be negative: %d",
        hangIndent);
    Validate.isTrue(width >= 0, "text width should not be negative: %d",
        width);
    Validate.isTrue(
        hangIndent < width,
        "hangIndent should be less than width: "
        + "hangIndent=%d, width=%d",
        hangIndent,
        width);
    // Nothing to wrap; this also avoids an out-of-range substring below.
    if (text.length() <= hangIndent) {
      return text;
    }

    // Set aside the first hangIndent characters so that the first line,
    // which carries no indent, can still fill the whole width.
    StringBuilder sb = new StringBuilder(text.substring(0, hangIndent));
    // The fake "\n" is needed to handle the last line correctly.
    // It is trimmed at the end.
    text = text.substring(hangIndent) + "\n";
    // hang indent
    String spaces = StringUtils.repeat(' ', hangIndent);
    String replacement = spaces + "$1\n";
    String regex = "(.{1," + (width - hangIndent) + "})";
    if (considerSpace) {
      regex += "\\s+";
    }
    text = text.replaceAll(regex, replacement);
    // remove the indent of the first line and the last "\n"
    text = text.substring(hangIndent, text.length() - 1);
    return sb.append(text).toString();
  }
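
For readers who want to try the idea without the Apache Commons dependency, here is a rough, self-contained sketch that uses only the JDK. It is a simplified variant and not the code above: it always splits at whitespace, the class name HangIndentSketch and the sample input are made up, and it assumes that no single word is longer than width - indent.

```java
import java.util.Arrays;

class HangIndentSketch {

    // Simplified hanging indent: wraps at whitespace only.
    // Assumes no word is longer than (width - indent) characters.
    static String hangIndent(String text, int indent, int width) {
        if (indent < 0 || indent >= width) {
            throw new IllegalArgumentException("need 0 <= indent < width");
        }
        if (text.length() <= width) {
            return text; // fits on the first line, nothing to wrap
        }
        char[] pad = new char[indent];
        Arrays.fill(pad, ' ');
        String spaces = new String(pad);

        // Set aside the first `indent` characters so that the unindented
        // first line can use the full width.
        String head = text.substring(0, indent);
        // The trailing "\n" is the sentinel that makes the last line match.
        String rest = text.substring(indent) + "\n";
        String regex = "(.{1," + (width - indent) + "})\\s+";
        String wrapped = rest.replaceAll(regex, spaces + "$1\n");
        // Drop the indent of the first line and the sentinel's line break.
        return head + wrapped.substring(indent, wrapped.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(hangIndent("aaa bbb ccc ddd eee fff", 2, 9));
        // prints:
        // aaa bbb
        //   ccc ddd
        //   eee fff
    }
}
```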


My PhD Proposal Defense: A Study of Relation Extraction

There has been an increasing effort in recent years to extract relations between entities from natural language text. In this dissertation, I will focus on various aspects of recognizing biomedical relations between entities reported in scientific articles.

Approaches to the relation extraction task can be categorized into two major classes: (1) pattern-based approaches and (2) machine learning-based approaches. Pattern-based approaches often use manually designed rules to extract relations. They do not need annotated corpora for training, and it is generally possible to obtain patterns that yield good precision. But pattern-based approaches require domain experts to be closely involved in the design of the system, and it is impractical to manually encode all the patterns necessary for high recall. Machine learning-based approaches, on the other hand, are data-driven and can derive models for automated extraction from a set of annotated data. But their ability to generalize from examples can be hindered by a lack of domain and linguistic knowledge, and their performance depends critically on the availability of large amounts of annotated data, which can be expensive to create.


dpkg: error processing tex-common

I ran into the following problem while installing latex-cjk-chinese:

fmtutil-sys failed. Output has been stored in
/tmp/fmtutil.t6EnBlWW
Please include this file if you report a bug.

dpkg: error processing tex-common (--configure):
 subprocess installed post-installation script returned error exit status 1
Errors were encountered while processing:
 tex-common
E: Sub-process /usr/bin/dpkg returned an error code (1)

After checking the file fmtutil.t6EnBlWW, I found the problem:

! I can't find file `loadhyph-zh-latn.tex'.

This is due to an inconsistency between TeX Live 2011 and TeX Live 2012: in TeX Live 2012, the file was renamed to loadhyph-zh-latn-pinyin.tex. Therefore, the solution is

  1. go to /etc/texmf/hyphen.d
  2. change loadhyph-zh-latn.tex to loadhyph-zh-latn-pinyin.tex
  3. reinstall tex-common


Recommend: The Science of Scientific Writing

George Gopen and Judith Swan. The Science of Scientific Writing. American Scientist. 1990, 78: 550-558.

Our examples of scientific writing have ranged from the merely cloudy to the virtually opaque; yet all of them could be made significantly more comprehensible by observing the following structural principles:

  1. Follow a grammatical subject as soon as possible with its verb.
  2. Place in the stress position the “new information” you want the reader to emphasize.
  3. Place the person or thing whose “story” a sentence is telling at the beginning of the sentence, in the topic position.
  4. Place appropriate “old information” (material already stated in the discourse) in the topic position for linkage backward and contextualization forward.
  5. Articulate the action of every clause or sentence in its verb.
  6. In general, provide context for your reader before asking that reader to consider anything new.
  7. In general, try to ensure that the relative emphases of the substance coincide with the relative expectations for emphasis raised by the structure.


Recommend: Becoming a Data Scientist – Curriculum via Metromap

Becoming a Data Scientist – Curriculum via Metromap, by Swami Chandrasekaran.

Data Science, Machine Learning, Big Data Analytics, Cognitive Computing … One thing is for sure; you cannot become a data scientist overnight. … But how do you go about becoming one? Where to start? When do you start seeing light at the end of the tunnel? What is the learning roadmap? What tools and techniques do I need to know? How will you know when you have achieved your goal?