
How to read CSV files in Java – A case study of Iterator and Decorator

In this post, I will show how to read CSV (comma-separated values) files using Apache Commons CSV. From this case study, we will learn how the Iterator and Decorator design patterns improve reusability in different situations. But before we get started, I should answer two questions.

  1. Why do I need a third-party library when there are more than enough DIY posts about reading CSV files?
    It is true that when you google “java csv parser”, you will get several related posts. But even as a beginner, you won’t be satisfied with these shallow methods. Of course, using BufferedReader and String.split() will successfully parse a typical CSV file, but you won’t learn anything from it except how to write redundant code. On the other hand, as I will show below, using and studying Apache Commons CSV will teach you several design patterns, for instance Iterator and Decorator.

  2. Why Apache Commons CSV, not others?
    As far as I know, there are several other libraries on SourceForge or Google Code. However, if you look into the details of their code, forgive my criticism, none of them is both flexible and manageable: some are too simple to meet users’ varied requirements; others are too complicated and painful to use. Furthermore, most of those I’ve come across don’t have commercial-friendly licenses, which can really scare users off.

Apache Commons CSV is still in the sandbox, which means there is currently no official download or stable release. Nightly builds may be available, though.

Using Iterator to hide underlying representation

Let me begin with a sample CSV file, where each record is located on a separate line, delimited by a line break. The first line is the header containing two names COL1 and COL2 corresponding to the fields in the file. The rest of the file contains three records with fields separated by commas.

COL1,COL2
a,b
c,d
e,f

The code that reads this file with Apache Commons CSV is:

public void test() throws FileNotFoundException, IOException {
  CSVParser parser = new CSVParser(
      new FileReader("test.csv"), 
      CSVFormat.DEFAULT.withHeader());
  for (CSVRecord record : parser) {
    System.out.printf("%s\t%s\n",
      record.get("COL1"), 
      record.get("COL2"));
  }
  parser.close();
}

CSVParser parses CSV files according to the specified format. Here I use the default CSVFormat and call withHeader() with no arguments. This makes the parser treat the first line of the CSV file as the header, which is what makes record.get("COL1") valid. CSVParser provides an iterative way of reading records, and this is where we meet the first design pattern, Iterator. It gives sequential access to the records of a CSV file without exposing the underlying representation, such as how comment lines are skipped or how column names are mapped to field values. For each record, we use CSVRecord.get(String name) to retrieve a field value by its name.
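To make the Iterator idea concrete, here is a minimal sketch that uses only the JDK. SimpleCsvReader is a hypothetical class invented for this illustration and is not how Commons CSV is implemented; it assumes '#' marks a comment line and it does not handle quoted fields. The point is only that the for-each loop in the caller never sees how lines are skipped or how names are mapped to values.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical, simplified reader for illustration; NOT the Commons CSV code.
// Assumptions: ',' is the delimiter, '#' starts a comment line, no quoting.
class SimpleCsvReader implements Iterable<Map<String, String>> {
    private final BufferedReader in;
    private final String[] header;

    SimpleCsvReader(Reader reader) {
        this.in = new BufferedReader(reader);
        String first = nextDataLine();
        this.header = (first == null) ? new String[0] : first.split(",", -1);
    }

    // Skips blank lines and comment lines: a detail the caller never sees.
    private String nextDataLine() {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty() && !line.startsWith("#")) {
                    return line;
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Single-pass iterator: every iterator shares the same underlying reader.
    @Override
    public Iterator<Map<String, String>> iterator() {
        return new Iterator<Map<String, String>>() {
            private String pending; // next data line, read ahead lazily

            @Override
            public boolean hasNext() {
                if (pending == null) {
                    pending = nextDataLine();
                }
                return pending != null;
            }

            @Override
            public Map<String, String> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String[] values = pending.split(",", -1);
                pending = null;
                // Map each column name to the value at the same position.
                Map<String, String> record = new HashMap<>();
                for (int i = 0; i < header.length && i < values.length; i++) {
                    record.put(header[i], values[i]);
                }
                return record;
            }
        };
    }
}

class IteratorSketch {
    public static void main(String[] args) {
        String csv = "COL1,COL2\n# a comment line\na,b\nc,d\ne,f\n";
        for (Map<String, String> record : new SimpleCsvReader(new StringReader(csv))) {
            System.out.println(record.get("COL1") + "\t" + record.get("COL2"));
        }
    }
}
```

The calling loop has the same shape as the Commons CSV loop above, even though the storage behind it is completely different. That interchangeability is what the pattern buys us.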

CSVRecord provides two ways to access a field value: by name or by index. If you are not sure whether a field has a value in the current record, call CSVRecord.isSet(String name) first. If you just want to check whether a name has been defined in the parser's header, call CSVRecord.isMapped(String name) instead.
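The difference between the two checks is easiest to see on a record that is shorter than the header. The sketch below uses a hypothetical SimpleRecord class written for this post, not the library's CSVRecord, and it assumes that "set" means "the name is in the header and this record has a value at that position".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record type for illustration; a sketch of the semantics
// described above, not the Commons CSV implementation.
class SimpleRecord {
    private final Map<String, Integer> headerIndex = new LinkedHashMap<>();
    private final String[] values;

    SimpleRecord(String[] header, String[] values) {
        for (int i = 0; i < header.length; i++) {
            headerIndex.put(header[i], i); // column name -> position
        }
        this.values = values.clone();
    }

    // True if the name was declared in the header at all.
    boolean isMapped(String name) {
        return headerIndex.containsKey(name);
    }

    // True if the name is declared AND this record has a value at that position.
    boolean isSet(String name) {
        return isMapped(name) && headerIndex.get(name) < values.length;
    }

    // Access by index.
    String get(int index) {
        return values[index];
    }

    // Access by name.
    String get(String name) {
        if (!isSet(name)) {
            throw new IllegalArgumentException("No value for column: " + name);
        }
        return values[headerIndex.get(name)];
    }
}

class RecordSketch {
    public static void main(String[] args) {
        String[] header = {"COL1", "COL2"};
        // A short row: it has a value for COL1 only.
        SimpleRecord shortRow = new SimpleRecord(header, new String[] {"a"});

        System.out.println(shortRow.isMapped("COL2")); // true: COL2 is in the header
        System.out.println(shortRow.isSet("COL2"));    // false: this row has no second value
        System.out.println(shortRow.isMapped("COL3")); // false: COL3 was never declared
        System.out.println(shortRow.get(0));           // a
        System.out.println(shortRow.get("COL1"));      // a
    }
}
```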

Using Decorator to allow different behaviors

Both CSVFormat.DEFAULT and CSVFormat.RFC4180 follow the RFC 4180 format, so fields enclosed in double quotes are handled too, such as:

"COL1","COL2"
"a","b"
"c","d"
"e","f"

RFC 4180 specifies that fields in a CSV file are separated by commas, but the library can handle an arbitrary delimiter such as a tab or a space. To make the code reusable, the library lets you build your own CSVFormat:

CSVFormat format = CSVFormat.newFormat(',')
    .withQuoteChar('"')
    .withHeader();

The format above is essentially CSVFormat.DEFAULT with header parsing enabled. Here we encounter another design pattern, Decorator, which allows behavior to be added to an individual object, either statically or dynamically, without affecting the behavior of other objects of the same class. In the case of CSVFormat, every withXXX() method returns a new CSVFormat that is equal to the calling one except for one modified attribute. The natural question is: why not just return the self-reference this? I think it is because the latter approach would break the following code, where format1 and format2 are each expected to differ from format in exactly one attribute:

CSVFormat format = CSVFormat.newFormat(',');
CSVFormat format1 = format.withQuoteChar('"');
CSVFormat format2 = format.withHeader();
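To see the failure concretely, compare two hypothetical classes written only for this illustration (neither is part of Commons CSV): ImmutableFormat returns a fresh copy from every withXXX() call, while MutableFormat modifies itself and returns this. The field names and getters are assumptions made for the sketch.

```java
// Hypothetical classes for illustration only; not the Commons CSV source.

// Every withXXX() call returns a NEW object, leaving the receiver untouched.
final class ImmutableFormat {
    private final char delimiter;
    private final Character quoteChar; // null means "no quoting"
    private final boolean header;

    private ImmutableFormat(char delimiter, Character quoteChar, boolean header) {
        this.delimiter = delimiter;
        this.quoteChar = quoteChar;
        this.header = header;
    }

    static ImmutableFormat newFormat(char delimiter) {
        return new ImmutableFormat(delimiter, null, false);
    }

    ImmutableFormat withQuoteChar(char quote) {
        return new ImmutableFormat(delimiter, quote, header);
    }

    ImmutableFormat withHeader() {
        return new ImmutableFormat(delimiter, quoteChar, true);
    }

    Character getQuoteChar() { return quoteChar; }
    boolean hasHeader() { return header; }
}

// Every withXXX() call mutates the receiver and returns this.
final class MutableFormat {
    private Character quoteChar;
    private boolean header;

    MutableFormat withQuoteChar(char quote) { this.quoteChar = quote; return this; }
    MutableFormat withHeader() { this.header = true; return this; }

    Character getQuoteChar() { return quoteChar; }
    boolean hasHeader() { return header; }
}

class FormatSketch {
    public static void main(String[] args) {
        ImmutableFormat base = ImmutableFormat.newFormat(',');
        ImmutableFormat f1 = base.withQuoteChar('"');
        ImmutableFormat f2 = base.withHeader();
        // f1 and f2 are independent variations of base.
        System.out.println(f1.hasHeader());    // false
        System.out.println(f2.getQuoteChar()); // null

        MutableFormat mBase = new MutableFormat();
        MutableFormat m1 = mBase.withQuoteChar('"');
        MutableFormat m2 = mBase.withHeader();
        // m1, m2 and mBase are the same object, so both changes leak everywhere.
        System.out.println(m1 == m2);       // true
        System.out.println(m1.hasHeader()); // true
    }
}
```

With the copying version, format stays a safe starting point that any number of callers can specialize. With the self-returning version, the second call silently changes the format the first caller thought it owned.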
