Monthly Archives: October 2013

C++11 reading list

C++11 (formerly known as C++0x) is the most recent version of the C++ language standard. Since ISO approved it in 2011, many books have been published to cover the updates. Here are three core books that I recommend: one language tutorial, one library tutorial, and one bible.

C++ Primer (5th Edition)



Although it is called a “primer”, this book is written for both beginners and experienced C++ programmers. The 5th Edition has been fully updated and recast for the C++11 standard. As a true tutorial on the C++ programming language, it provides an authoritative and comprehensive introduction to C++11. Another highlight is its large number of examples, which help readers learn and understand the language quickly.


How to set terminal title dynamically to the current working directory?

It is sometimes helpful to set a terminal window title from a script, so that you can put a couple of reminders there about how to do things. The Xfce4-terminal preferences say this can be done by setting the “Dynamically-set title” position, but they never explain how to put text in the title bar dynamically.

Using Zsh

As I’m gradually switching to Zsh, this short post explains how to set the Xfce terminal title dynamically to the pwd, or current working directory, in Zsh. It takes only two steps:


An Alternative Guide to Functional Programming (5)

The following part is no longer maintained. Please go to 函数式程序设计的另类指南 (An Alternative Guide to Functional Programming) for the whole translation.

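Original article: Functional Programming For The Rest of Us
Original author: Vyacheslav Akhmechet

Benefits of functional programming

You may think there is no way I can give a reasonable justification for the monstrosity of a function above. When I started learning functional programming, I thought so too. It turned out I was wrong. There are many good reasons for writing code this way, though some of them are subjective. For example, some people claim that functional programs are easier to understand. I won’t argue from those reasons, because every kid knows that beauty is in the eye of the beholder. Fortunately, I can still find plenty of objective reasons.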


Tikz example – Kernel trick

Support Vector Machines are often described as learning algorithms that can only solve linearly separable problems. However, this isn’t strictly true. Since the feature vectors occur only in dot products k(xi, xj) = xi·xj, the “kernel trick” can be applied by replacing the dot product with another kernel (Boser et al., 1992). A more formal statement of the kernel trick is:

Given an algorithm which is formulated in terms of a positive definite kernel k, one can construct an alternative algorithm by replacing k by another positive definite kernel k∗ (Schölkopf and Smola, 2002).
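To make the statement concrete, here is a minimal Java sketch, not taken from the original post. It assumes 2-D inputs and picks the homogeneous quadratic kernel k∗(x, y) = (x·y)^2 as the replacement kernel. All names (Kernel, KernelTrickSketch, phi) are hypothetical. The sketch checks that evaluating k∗ directly gives the same number as an ordinary dot product taken after an explicit feature map, which is why an algorithm written only in terms of k never needs to compute that map.

```java
// Hypothetical names throughout; a sketch under the assumptions stated above.
interface Kernel {
  double apply(double[] x, double[] y);
}

class KernelTrickSketch {
  // Plain dot product: k(x, y) = x · y.
  static final Kernel LINEAR = (x, y) -> {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
      sum += x[i] * y[i];
    }
    return sum;
  };

  // Replacement kernel: k*(x, y) = (x · y)^2.
  static final Kernel QUADRATIC = (x, y) -> {
    double d = LINEAR.apply(x, y);
    return d * d;
  };

  // Explicit feature map for 2-D inputs whose dot product equals QUADRATIC:
  // phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2).
  static double[] phi(double[] x) {
    return new double[] {x[0] * x[0], Math.sqrt(2) * x[0] * x[1], x[1] * x[1]};
  }

  public static void main(String[] args) {
    double[] a = {1.0, 2.0};
    double[] b = {3.0, 4.0};
    // Kernel evaluated directly in the input space.
    System.out.println(QUADRATIC.apply(a, b));
    // Same value (up to rounding) via the explicit 3-D feature space.
    System.out.println(LINEAR.apply(phi(a), phi(b)));
  }
}
```

Swapping LINEAR for QUADRATIC in an algorithm that only ever calls Kernel.apply changes the decision boundary from linear to quadratic without touching the rest of the code.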


Recommend: Wall Street Bosses Are Calling This ‘The Best Cover Letter Ever’ – But Not Everyone Agrees

The following letter is reposted from Wall Street Bosses Are Calling This ‘The Best Cover Letter Ever’ – But Not Everyone Agrees

From: BLOCKED
Sent: Monday, January 14, 2013 1:14PM
To: BLOCKED
Subject: Summer Internship

Dear BLOCKED

My name is (BLOCKED) and I am an undergraduate finance student at (BLOCKED). I met you the summer before last at Smith & Wollensky’s in New York when I was touring the east coast with my uncle, (BLOCKED). I just wanted to thank you for taking the time to talk with me that night.


How to read CSV files in Java – A case study of Iterator and Decorator

In this post, I will talk about how to read CSV (comma-separated values) files using Apache Commons CSV. From this case study, we will learn how the Iterator and Decorator design patterns improve reusability in different situations. But before we get started, I have to answer two questions.

  1. Why do I need a third party library if there are more than enough DIY posts talking about how to read CSV files?
    It is true that when you google “java csv parser”, you get several related posts. But even as a beginner, you won’t be satisfied with these shallow methods. Of course, using BufferedReader and String.split() will parse a typical CSV file, but you won’t learn anything from it except how to write redundant code. On the other hand, as I will show below, using and studying Apache Commons CSV will teach you several design pattern topics, such as Iterator and Decorator.

  2. Why Apache Commons CSV, not others?
    As far as I know, there are several other libraries on SourceForge or Google Code. However, if you look into the details of their code, forgive my criticism, none of them is both flexible and manageable: some are too simple to meet users’ various requirements; others are too complicated and painful to use. Furthermore, most of those I’ve come across don’t have commercial-friendly licenses, which can really scare users off.

Apache Commons CSV is still in the sandbox, which means there is currently no official download or stable release, although nightly builds may be available.

Using Iterator to hide underlying representation

Let me begin with a sample CSV file, where each record is located on a separate line, delimited by a line break. The first line is the header containing two names COL1 and COL2 corresponding to the fields in the file. The rest of the file contains three records with fields separated by commas.

COL1,COL2
a,b
c,d
e,f

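The code using Apache Commons CSV to read this file is:

import java.io.FileReader;
import java.io.IOException;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public void test() throws IOException {
  // withHeader() with no arguments reads the column names from the first line.
  CSVParser parser = new CSVParser(
      new FileReader("test.csv"),
      CSVFormat.DEFAULT.withHeader());
  // CSVParser is Iterable, so records can be read one at a time.
  for (CSVRecord record : parser) {
    System.out.printf("%s\t%s\n",
      record.get("COL1"),
      record.get("COL2"));
  }
  parser.close();
}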

CSVParser parses CSV files according to the specified format. Here I use the default CSVFormat together with withHeader() called with no arguments. This tells the parser to treat the first line of the CSV file as the header, which makes record.get("COL1") valid. CSVParser provides an iterative way of reading records. Here we meet the first design pattern, Iterator. It provides a way to access the records of a CSV file sequentially without exposing the underlying representation, such as how comment lines are skipped and how column names are mapped to field values. For each record, we use CSVRecord.get(String name) to retrieve the field value by its name.

CSVRecord provides different ways to access a field value: by name or by index. If you are not sure whether a field has a value, call CSVRecord.isSet(String name) first. If you just want to check whether a name has been defined to the parser, call CSVRecord.isMapped(String name) instead.
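To show what “hiding the underlying representation” can look like, here is a small self-contained sketch of the Iterator idea. It is not the Apache Commons CSV implementation: TinyCsv and TinyCsvDemo are hypothetical names, the input is an in-memory list of lines, comment lines are assumed to start with '#', and quoting is ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a CSV parser, for illustration only.
class TinyCsv implements Iterable<Map<String, String>> {
  private final String[] header;
  private final List<String> dataLines = new ArrayList<>();

  TinyCsv(List<String> lines) {
    String[] names = null;
    for (String line : lines) {
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // skip blank lines and comments; callers never see them
      }
      if (names == null) {
        names = line.split(",", -1); // first real line is the header
      } else {
        dataLines.add(line);
      }
    }
    header = (names == null) ? new String[0] : names;
  }

  @Override
  public Iterator<Map<String, String>> iterator() {
    final Iterator<String> it = dataLines.iterator();
    return new Iterator<Map<String, String>>() {
      @Override
      public boolean hasNext() {
        return it.hasNext();
      }

      @Override
      public Map<String, String> next() {
        // Map each column name to the field at the same position.
        String[] fields = it.next().split(",", -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < header.length && i < fields.length; i++) {
          record.put(header[i], fields[i]);
        }
        return record;
      }
    };
  }
}

class TinyCsvDemo {
  public static void main(String[] args) {
    List<String> lines = Arrays.asList("# a comment", "COL1,COL2", "a,b", "c,d", "e,f");
    // The loop sees only records; header handling and comment skipping stay hidden.
    for (Map<String, String> record : new TinyCsv(lines)) {
      System.out.println(record.get("COL1") + "\t" + record.get("COL2"));
    }
  }
}
```

The calling loop has the same shape as the CSVParser loop above, which is the point of the pattern: the way records are stored and decoded can change without changing the caller.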

Using Decorator to allow different behaviors

Both CSVFormat.DEFAULT and CSVFormat.RFC4180 follow the RFC 4180 format, so fields enclosed in double quotes can be handled too, such as

"COL1","COL2"
"a","b"
"c","d"
"e","f"

In RFC 4180, fields in a CSV file are separated by commas. In general, though, the library can handle an arbitrary delimiter such as TAB or space. To make the code reusable, the library provides a way to create your own CSVFormat:

CSVFormat format = CSVFormat.newFormat(',')
    .withQuoteChar('"')
    .withHeader();

The above format behaves much like the CSVFormat.DEFAULT.withHeader() used earlier. Here we encounter another design pattern, Decorator, which allows behavior to be added to an individual object, either statically or dynamically, without affecting the behavior of other objects of the same class. In the case of CSVFormat, every withXXX() method returns a new CSVFormat that is equal to the calling one but with one attribute modified. The question might be: why not just modify the object and return the self-reference this? I think it is because the latter approach would break the following code, where format1 and format2 are meant to be independent of each other:

CSVFormat format = CSVFormat.newFormat(',');
CSVFormat format1 = format.withQuoteChar('"');
CSVFormat format2 = format.withHeader();
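The difference can be sketched with a tiny immutable class. TinyFormat below is hypothetical and is not the CSVFormat source; it only assumes the behavior described above, namely that each withXXX() call returns a fresh copy with one attribute changed.

```java
// Hypothetical class for illustration; not the Apache Commons CSV code.
final class TinyFormat {
  private final char delimiter;
  private final Character quoteChar; // null means "no quoting"
  private final boolean header;

  private TinyFormat(char delimiter, Character quoteChar, boolean header) {
    this.delimiter = delimiter;
    this.quoteChar = quoteChar;
    this.header = header;
  }

  static TinyFormat newFormat(char delimiter) {
    return new TinyFormat(delimiter, null, false);
  }

  // Each withXXX() copies every attribute and changes exactly one.
  TinyFormat withQuoteChar(char quote) {
    return new TinyFormat(delimiter, quote, header);
  }

  TinyFormat withHeader() {
    return new TinyFormat(delimiter, quoteChar, true);
  }

  char getDelimiter() {
    return delimiter;
  }

  Character getQuoteChar() {
    return quoteChar;
  }

  boolean hasHeader() {
    return header;
  }
}

class TinyFormatDemo {
  public static void main(String[] args) {
    TinyFormat format = TinyFormat.newFormat(',');
    TinyFormat format1 = format.withQuoteChar('"');
    TinyFormat format2 = format.withHeader();
    System.out.println(format1.hasHeader());    // false: format1 did not pick up the header
    System.out.println(format2.getQuoteChar()); // null: format2 did not pick up the quote char
  }
}
```

If withQuoteChar() and withHeader() mutated the object and returned this, then format, format1, and format2 would all be the same object, and each would end up with both the quote character and the header setting.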


Undefined reference to Sqrt — A quick note to compile SVM-multiclass

When I compiled SVM-multiclass using make, I got an error message saying “undefined reference to sqrt”. I checked the makefile and found that -lm was already included in the gcc flags.

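The trick here is to put the library AFTER the module you are compiling. The problem is the order of references. The linker resolves references in the order its inputs appear on the command line, so when the library comes BEFORE the module, the linker has not yet seen any unresolved reference to the library's functions and concludes that none of them are needed. When the library comes AFTER the module, the module's references to the library are already pending and the linker can resolve them. For example, with a hypothetical object file prog.o, gcc prog.o -lm links, whereas gcc -lm prog.o can fail with this error.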


Most efficient way to increment a Map value in Java — Only search the key once

This question may seem too basic, but it is frequently asked in forums. In this post, I will discuss one way that searches for the key in the Map only ONCE.

Let’s first look at an example. Say I’m creating a string frequency list, using a Map, where each key is a String that is being counted and the value is an Integer that’s incremented each time a String is added. A straightforward way of achieving it is

int count = map.containsKey(string) ? map.get(string) : 0;
map.put(string, count + 1);

This piece of code is slower than necessary because it contains three potentially expensive operations on the map, namely containsKey(), get(), and put(). Each of them has to search for the key in the map. Now let’s refactor the code for better performance.

Integer vs MutableInteger vs AtomicInteger

One important reason we have to invoke three expensive operations is that we use Integer for counting. In Java, Integer is immutable, which prevents us from modifying the integer value after construction. Therefore, to increment a counter, we first have to get the integer from the map, then create a new integer by adding one, and finally put it back into the map.

There are several ways to make the counter mutable. One is to simply create your own MutableInteger, as shown below.

public class MutableInteger {

  private int val;

  public MutableInteger(int val) {
    this.val = val;
  }

  public int get() {
    return val;
  }

  public void set(int val) {
    this.val = val;
  }
}

Another way is to use Java's AtomicInteger, which is designed for applications such as atomically incremented counters. However, the main reason to choose AtomicInteger is to make operations on the integer thread-safe; it is not meant as a general-purpose replacement for Integer. So if thread safety is not a strong consideration in your project, I don’t recommend using AtomicInteger.
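For completeness, here is a minimal sketch of the case where AtomicInteger is warranted: several threads incrementing the same counter. It is not from the original post. It assumes Java 8 or later (for lambdas and computeIfAbsent), and AtomicCountSketch and countConcurrently are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only.
class AtomicCountSketch {
  // Starts the given number of threads, each incrementing the counter for `key`
  // the given number of times, and returns the final count.
  static int countConcurrently(String key, int threads, int incrementsPerThread) {
    ConcurrentMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();
    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      workers[t] = new Thread(() -> {
        for (int i = 0; i < incrementsPerThread; i++) {
          // Create the counter on first use, then increment it atomically.
          counts.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
        }
      });
      workers[t].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    AtomicInteger total = counts.get(key);
    return total == null ? 0 : total.get();
  }

  public static void main(String[] args) {
    System.out.println(countConcurrently("word", 4, 1000));
  }
}
```

In single-threaded code the atomic operations buy nothing, which is why a plain mutable holder is the simpler choice there.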

Search the key only once

With MutableInteger, we can change the above code to

if (map.containsKey(string)) {
  MutableInteger count = map.get(string);
  count.set(count.get() + 1);
} else {
  map.put(string, new MutableInteger(1));
}

or, to avoid the extra containsKey() lookup,

MutableInteger count = map.get(string);
if (count != null) {
  count.set(count.get() + 1);
} else {
  map.put(string, new MutableInteger(1));
}
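As a runnable illustration of counting with a mutable holder, here is a self-contained sketch. It is not necessarily the approach the full post arrives at. It assumes Java 8 or later and uses Map.computeIfAbsent so that each word needs a single map call; MutableCounter and FrequencySketch are hypothetical names, with MutableCounter playing the role of MutableInteger.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mutable holder, equivalent in spirit to MutableInteger.
class MutableCounter {
  private int val;

  int get() {
    return val;
  }

  void increment() {
    val++;
  }
}

class FrequencySketch {
  // Builds a word -> count map with one map operation per word.
  static Map<String, MutableCounter> count(String[] words) {
    Map<String, MutableCounter> map = new HashMap<>();
    for (String word : words) {
      // Returns the existing counter, or inserts and returns a new one.
      map.computeIfAbsent(word, k -> new MutableCounter()).increment();
    }
    return map;
  }

  public static void main(String[] args) {
    String[] words = {"a", "b", "a", "c", "a"};
    Map<String, MutableCounter> freq = count(words);
    System.out.println(freq.get("a").get()); // 3
    System.out.println(freq.get("b").get()); // 1
  }
}
```

With plain Integer values, map.merge(word, 1, Integer::sum) from Java 8 is a compact alternative, although it still allocates or boxes a new Integer on each update.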


Tidy config for XML

The following configuration disables line wrapping, which is useful if you don’t want Tidy to insert extra whitespace into text while reformatting XML files.

char-encoding: utf8
indent: auto
indent-spaces: 2
wrap: 0

Usage:

tidy -xml -i -config tidy.config -m XMLFILE

Recommend: Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!

The EMNLP 2013 publications have been released: http://aclweb.org/anthology/D/D13/

On the list, I found a very interesting article, “Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!”. It discusses the disconnect between industry and academia: while rule-based IE dominates the commercial world, it is widely regarded as a dead-end technology in academia. The following table summarizes the pros and cons of machine learning and rule-based information extraction technologies (reproduced from the above paper).
