Jeff Perrin needed a function to return the first N words in a string (to
create a small summary or a snippet thingy). He did it by parsing the string
manually, which is awkward: that approach is more error-prone and usually
makes for less readable code. Fortunately, you can use regular expressions
here quite nicely. Here's a test that makes sure that we get the first 4 words
in a string, and the function "FindFirstWords" that does this very easily
using a simple regular expression.
What I'm doing here is using the expression to find the first 4 occurrences
of text that is composed of word characters (letters, digits or underscores)
followed by one or more whitespace characters. Then I simply iterate over the
captures in the match I found. The capturing group in that match should hold
4 captures, one for each "word" that was found.
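
To see those captures one by one, here is a minimal sketch. The class name CaptureDemo and the helper CaptureWords are made up for illustration and are not part of the original function. It assumes the captures are read from group 1, the parenthesised part of the pattern, which .NET fills with one capture per repetition of the quantifier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Hypothetical helper: returns every capture recorded by group 1.
    public static string[] CaptureWords(string input, int howManyToFind)
    {
        Match match = Regex.Match(input, @"([\w]+\s+){" + howManyToFind + "}");

        // Group 0 is the whole match; group 1 is the parenthesised "word" group.
        CaptureCollection captures = match.Groups[1].Captures;

        string[] words = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            words[i] = captures[i].Value;
        }
        return words;
    }

    public static void Main()
    {
        // Brackets make the trailing whitespace of each capture visible.
        foreach (string word in CaptureWords("this is word four five six", 4))
        {
            Console.WriteLine("[" + word + "]");
        }
    }
}
```

Each capture keeps the whitespace that followed the word, which is why the reassembled string ends with a space.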
It's not fully tested, as you can see. I only wrote one test to see that it
works on this sort of sentence. More tests could and should be added to cover
other cases. In fact, if this were real TDD, I would have started with a test
of an empty string, and continued on to test getting only one word, then two,
and so on.
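using System.Text;
using System.Text.RegularExpressions;
using NUnit.Framework; // assumption: [Test] and Assert.AreEqual come from NUnit

// The fixture class name is a placeholder; the original snippet shows only the two methods.
[TestFixture]
public class FindFirstWordsTests
{
    [Test]
    public void TestRegexFindFirstNWords()
    {
        const string INPUT =
            "this is word four five six seven eight nine ten eleven twelve thirteen!";
        const int NUM_WORDS_TO_RETURN = 4;

        string output = FindFirstWords(INPUT, NUM_WORDS_TO_RETURN);

        // Each word keeps its trailing whitespace, hence the space at the end.
        string expectedOutput = "this is word four ";
        Assert.AreEqual(expectedOutput, output);
    }

    private string FindFirstWords(string input, int howManyToFind)
    {
        // One or more word characters followed by whitespace, repeated howManyToFind times.
        string REGEX = @"([\w]+\s+){" + howManyToFind + "}";
        StringBuilder output = new StringBuilder();

        // Group 1 records one capture per repetition, i.e. one per word.
        foreach (Capture capture in Regex.Match(input, REGEX).Groups[1].Captures)
        {
            output.Append(capture.Value);
        }
        return output.ToString();
    }
}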
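
The extra cases mentioned above (an empty string, one word, two words) could be checked with a sketch like the following. It is a standalone console program rather than part of the test fixture. The class name FirstWordsChecks is hypothetical, and the function is repeated here as a public static method, reading its captures from group 1, so that the block runs on its own.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class FirstWordsChecks
{
    // Standalone version of FindFirstWords, made public and static for this sketch.
    public static string FindFirstWords(string input, int howManyToFind)
    {
        string pattern = @"([\w]+\s+){" + howManyToFind + "}";
        StringBuilder output = new StringBuilder();

        // A failed match yields an empty capture collection, so the loop is skipped.
        foreach (Capture capture in Regex.Match(input, pattern).Groups[1].Captures)
        {
            output.Append(capture.Value);
        }
        return output.ToString();
    }

    public static void Main()
    {
        // Brackets make empty results and trailing spaces visible.
        Console.WriteLine("[" + FindFirstWords("", 4) + "]");              // []
        Console.WriteLine("[" + FindFirstWords("hello world", 1) + "]");   // [hello ]
        Console.WriteLine("[" + FindFirstWords("one two three", 2) + "]"); // [one two ]
        Console.WriteLine("[" + FindFirstWords("one two", 2) + "]");       // []
    }
}
```

The last case is the interesting one: "two" has no whitespace after it, so the pattern cannot match two words and the function returns an empty string. The same happens whenever the input has fewer words than requested. Whether that is the desired behaviour is exactly the kind of question the additional tests would force.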