Ahh, my first blog entry...
The Regex.Split() method offers a clean way to break a delimited string of text into substrings. Its advantage over the standard String.Split() method is that you can use a delimiter that is variable or not precisely known. A simple example suited for regex is a comma-delimited list where the delimiter may or may not be followed by a space, as:
string[] fields = Regex.Split(delimitedList, ", ?");
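As a quick check of the call above, here is a minimal, self-contained sketch. The class name SplitDemo, the helper SplitList, and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Splits on a comma followed by an optional single space.
    public static string[] SplitList(string delimitedList)
    {
        return Regex.Split(delimitedList, ", ?");
    }

    static void Main()
    {
        // Hypothetical sample: one comma is followed by a space, one is not.
        string[] fields = SplitList("red, green,blue");
        for (int i = 0; i < fields.Length; i++)
            Console.WriteLine("[{0}] => {1}", i, fields[i]);
    }
}
```

Both delimiters are consumed by the same pattern, so the three fields come back without leading spaces.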
There are times, however, when the Split method is not the best choice for splitting a string. Consider a comma-delimited string in which each field may be surrounded by quotes and, when quoted, may contain embedded commas. Example:
Mister King,"123rd St, Redmond",425-555-1234
and the split should produce:
[0] => Mister King
[1] => "123rd St, Redmond"
[2] => 425-555-1234
To use the Regex.Split() method, you would need to figure out some way to match a comma, except when it falls between an opening quote and a closing quote. While it may be possible to devise such a pattern, it would be pretty tricky and certainly more complex than an alternative approach. That alternative approach is to use the Regex.Matches() method with a pattern that explicitly matches each field instead of each delimiter.
Regex rex = new Regex(
    // match quoted text if possible;
    // otherwise match until a comma is found
    @" ""[^""]*"" | [^,]+ ",
    RegexOptions.IgnorePatternWhitespace);
int i = 0;
foreach (Match m in rex.Matches(delimitedList))
    Console.WriteLine("[{0}] => {1}", i++, m.Value);
If more than one delimiter is possible (e.g., commas or semicolons), or if more than one kind of quote is possible (e.g., double quotes or single quotes), it is easy enough to enhance the pattern:
@" ""[^""]*"" | '[^']*' | [^,;]+ "
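To try the enhanced pattern, a minimal sketch might look like the following. The class name FieldMatchDemo, the helper Fields, and the sample line that mixes both delimiters and both quote styles are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class FieldMatchDemo
{
    // Double-quoted text, single-quoted text, or a run of characters
    // that contains neither a comma nor a semicolon.
    static readonly Regex Rex = new Regex(
        @" ""[^""]*"" | '[^']*' | [^,;]+ ",
        RegexOptions.IgnorePatternWhitespace);

    public static List<string> Fields(string delimitedList)
    {
        var fields = new List<string>();
        foreach (Match m in Rex.Matches(delimitedList))
            fields.Add(m.Value);
        return fields;
    }

    static void Main()
    {
        // Hypothetical input: semicolon and comma delimiters, both quote styles.
        string input = "Mister King;'123rd St, Redmond',\"425;555;1234\"";
        int i = 0;
        foreach (string field in Fields(input))
            Console.WriteLine("[{0}] => {1}", i++, field);
    }
}
```

The quoted alternatives come first in the pattern, so a comma or semicolon inside quotes stays part of its field.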
It gets a little trickier, though, if you need to allow for empty fields, as in:
Mister King,,425-555-1234
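The patterns so far require at least one character per unquoted field ([^,]+), so the empty field is skipped entirely. One way around that: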
Regex rex = new Regex(
    @" (?: ,|^ ) ( ""[^""]*"" | [^,]* ) ",
    RegexOptions.IgnorePatternWhitespace);
int i = 0;
foreach (Match m in rex.Matches(delimitedList))
    Console.WriteLine("[{0}] => {1}", i++, m.Groups[1].Value);
We now match both the delimiter and the field that immediately follows it. To accomplish that, the first group in the pattern matches either a comma (the delimiter) or the beginning of input (the first field has no delimiter preceding it, so we have to match the beginning of input instead). When we extract the fields from the collection of matches, we just need to ignore the delimiters we've matched. The code above does that by making the delimiter match a non-capturing group, (?: ), and by reading the field text from the capturing group, Groups[1], instead of the entire text of each match.
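The above pattern works for all cases except when the first field is empty. In that case the leading comma is consumed as the delimiter of the second field, so the empty first field never produces a match. I'll leave dealing with that as a future exercise. Hint: match the entire input with a single match, and capture all the fields into a capturing group.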
The above pattern matches a field and the delimiter that immediately precedes it. A similar approach is to match a field and the delimiter that immediately follows it. Note how similar the pattern and code are:
Regex rex = new Regex(
    @" ( ""[^""]*"" | [^,]* ) (?: ,|$ ) ",
    RegexOptions.IgnorePatternWhitespace);
int i = 0;
foreach (Match m in rex.Matches(delimitedList))
    Console.WriteLine("[{0}] => {1}", i++, m.Groups[1].Value);
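To see what the trailing-delimiter pattern returns in practice, here is a small sketch that runs it on the empty-field sample from earlier. The class name TrailingDelimiterDemo and the helper Fields are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TrailingDelimiterDemo
{
    // A field (quoted or not, possibly empty) followed by a comma or end of input.
    static readonly Regex Rex = new Regex(
        @" ( ""[^""]*"" | [^,]* ) (?: ,|$ ) ",
        RegexOptions.IgnorePatternWhitespace);

    public static List<string> Fields(string delimitedList)
    {
        var fields = new List<string>();
        foreach (Match m in Rex.Matches(delimitedList))
            fields.Add(m.Groups[1].Value);
        return fields;
    }

    static void Main()
    {
        int i = 0;
        foreach (string field in Fields("Mister King,,425-555-1234"))
            Console.WriteLine("[{0}] => {1}", i++, field);
    }
}
```

Running it on .NET should print four entries rather than three: the three fields, then one more empty entry at the end of the input.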
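Note that this approach doesn't eliminate the empty-field issue; it only moves it to the other end. Because the pattern can also match the empty string at the very end of the input, Matches() returns an extra empty field after the last real one, and a genuinely empty last field can't be told apart from it by the field text alone.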
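The hint given earlier for the empty-first-field case (match the entire input once and collect all the fields in one capturing group) could be sketched roughly as follows. This is one possible reading of the hint, not the only one. It relies on .NET keeping every capture of a repeated group in Group.Captures and on .NET allowing two groups to share a name. The group name "field", the class name WholeLineDemo, and the helper Fields are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WholeLineDemo
{
    // One match spans the whole input: a first field, then any number of
    // (comma, field) pairs. Both groups are named "field", so .NET collects
    // all of their captures into a single list.
    static readonly Regex Rex = new Regex(
        @" ^ (?<field> ""[^""]*"" | [^,]* )
           (?: , (?<field> ""[^""]*"" | [^,]* ) )* $ ",
        RegexOptions.IgnorePatternWhitespace);

    public static List<string> Fields(string delimitedList)
    {
        var fields = new List<string>();
        Match m = Rex.Match(delimitedList);
        if (m.Success)
        {
            foreach (Capture c in m.Groups["field"].Captures)
                fields.Add(c.Value);
        }
        return fields;
    }

    static void Main()
    {
        // Hypothetical input whose first and second fields are both empty.
        int i = 0;
        foreach (string field in Fields(",,425-555-1234"))
            Console.WriteLine("[{0}] => {1}", i++, field);
    }
}
```

Since the delimiters are matched inside a single anchored pattern, an empty field at either end is captured like any other, and no extra match appears at the end of the input.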
-Wayne