From Stack Overflow:
Question
Given a string such as “aa bb cc dd ee ff”, is there a regex that works with String.split() to extract two words at a time? The expected result is:
[aa bb, cc dd, ee ff]
Note: This question is about the split regex. It is not about “finding a work-around” or other “making it work another way” solutions.
Solution
The regex is (?<!\G\w+)\s. Here is an explanation of its parts:
- (?<!regex1)regex2: zero-width negative lookbehind. The pattern regex2 matches at a position only if regex1 can NOT be matched ending at that position. For example, (?<!t)s matches the first s in “students” but not the last one, because the last s is preceded by t.
- \G: matches at the position where the previous match ended, or at the start of the input when there is no previous match. For example, \G[a] matches the first two a's in “aa_aa” and then fails, because the third a is preceded by _ and so does not begin where the previous match ended.
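The two building blocks above can be checked in isolation. The following is a minimal sketch, not part of the original answer; the class LookbehindAndGDemo and the helper matchStarts are hypothetical names chosen for illustration. The helper collects the start index of every match that Matcher.find() reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindAndGDemo {
    // Hypothetical helper: returns the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Negative lookbehind: only the leading s qualifies,
        // because the final s is preceded by t.
        System.out.println(matchStarts("(?<!t)s", "students")); // [0]

        // \G: the two leading a's chain from the start of the input;
        // the _ breaks the chain, so the later a's are never matched.
        System.out.println(matchStarts("\\G[a]", "aa_aa"));     // [0, 1]
    }
}
```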
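Therefore, String.split() finds each whitespace character (\s) and checks that it is not preceded by a single word that starts at \G (the (?<!\G\w+) part). At the first space, \G is still at the start of the string, so \G\w+ matches “aa” right before the space and the negative lookbehind rejects it. At the second space, the word before it is “bb”, which does not start at \G, so the lookbehind lets the match through and String.split() splits there. \G then moves to just after that space and the pattern repeats: the third and fifth spaces are rejected and the fourth is matched. In the string “aa bb cc dd ee ff”, the split therefore happens at the second and fourth spaces. The Java snippet follows: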
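    import java.util.Arrays;

    String input = "aa bb cc dd ee ff";
    // In a Java string literal every regex backslash must be doubled.
    String[] pairs = input.split("(?<!\\G\\w+)\\s");
    System.out.println(Arrays.toString(pairs)); // [aa bb, cc dd, ee ff]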
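To see which spaces the regex actually consumes, the same pattern can be run through a Matcher instead of String.split(). This is a sketch under the assumption that split() splits at exactly the positions Matcher.find() reports; SplitTraceDemo and matchedSpaceIndexes are hypothetical names, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitTraceDemo {
    static final String PAIR_SPLIT = "(?<!\\G\\w+)\\s";

    // Hypothetical helper: index of every whitespace character the regex matches.
    static List<Integer> matchedSpaceIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = Pattern.compile(PAIR_SPLIT).matcher(input);
        while (m.find()) {
            indexes.add(m.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        String input = "aa bb cc dd ee ff";
        // Spaces sit at indexes 2, 5, 8, 11 and 14; only the second and
        // fourth (indexes 5 and 11) pass the negative lookbehind.
        System.out.println(matchedSpaceIndexes(input));              // [5, 11]
        System.out.println(Arrays.toString(input.split(PAIR_SPLIT))); // [aa bb, cc dd, ee ff]
    }
}
```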