Split a string into pairs of words

By | September 16, 2013

From Stack Overflow:

Question

Given a string such as “aa bb cc dd ee ff”, is there a regex that works with String.split() to extract two words at a time? The expected result is:

[aa bb, cc dd, ee ff]

Note: This question is about the split regex. It is not about “finding a work-around” or other “making it work another way” solutions.

Solution

The regular expression is (?<!\G\w+)\s. Here is an explanation of its parts:

  • (?<!regex1)regex2: zero-width negative lookbehind. regex2 matches at a position only if regex1 can NOT be matched ending at that position. For example, (?<!t)s matches the first s in “students” but not the last one, because the last s is preceded by t.

  • \G: matches at the position where the previous match ended, or at the start of the input when there is no previous match. For example, \Ga matches the first two a's in “aa_aa” and then fails, because the _ breaks the chain: the third a does not start where the previous match ended.
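
The two building blocks above can be tried in isolation. The following is a minimal sketch, not part of the original answer: the class name RegexBuildingBlocks and the helper matchStarts are made up for illustration, and the helper simply collects the start index of every match that Matcher.find() reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBuildingBlocks {
    // Hypothetical helper: start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Negative lookbehind: only the leading s (index 0) is not preceded by t.
        System.out.println(matchStarts("(?<!t)s", "students")); // [0]

        // \G: each match must start where the previous one ended,
        // so the chain stops at the underscore.
        System.out.println(matchStarts("\\Ga", "aa_aa")); // [0, 1]
    }
}
```

Note that the backslash in \G is doubled because it sits inside a Java string literal.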

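Therefore, String.split() looks for a whitespace character (\s) and, at each one, checks that it is NOT preceded by exactly one word that starts where the previous match ended ((?<!\G\w+)). Before the first split, \G refers to the start of the string; after each split, it refers to the position just after the space that was matched. In “aa bb cc dd ee ff”, the first, third, and fifth spaces are each preceded by a single word starting at \G, so the lookbehind rejects them. The second and fourth spaces are preceded by two words, so \G\w+ cannot reach them and String.split() splits there. In a Java string literal every backslash has to be doubled. The Java code is as follows: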

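import java.util.Arrays;

String input = "aa bb cc dd ee ff";
// Backslashes are doubled inside the Java string literal.
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs)); // [aa bb, cc dd, ee ff]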
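
To see which spaces the regex actually accepts, the matches can be listed one by one. This is a rough sketch under the assumption that the pattern compiles on your JDK as it does in the answer above; the class name PairSplitTrace and the helper splitPoints are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairSplitTrace {
    // Same regex as in the post, escaped for a Java string literal.
    static final Pattern PAIR_SEPARATOR = Pattern.compile("(?<!\\G\\w+)\\s");

    // Hypothetical helper: indexes of the spaces the regex accepts as split points.
    static List<Integer> splitPoints(String input) {
        List<Integer> points = new ArrayList<>();
        Matcher m = PAIR_SEPARATOR.matcher(input);
        while (m.find()) {
            points.add(m.start());
        }
        return points;
    }

    public static void main(String[] args) {
        String input = "aa bb cc dd ee ff";
        // Spaces sit at indexes 2, 5, 8, 11 and 14; only every second one matches.
        System.out.println(splitPoints(input)); // [5, 11]
        System.out.println(Arrays.toString(PAIR_SEPARATOR.split(input))); // [aa bb, cc dd, ee ff]
    }
}
```

The spaces at indexes 2, 8 and 14 are skipped because each is preceded by a single word that begins at \G.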
