
IRegex

Roy Osherove's (?<thoughts>.*)

Using Regex to return the first N words in a string

Jeff Perrin needed a function to return the first N words in a string (to create a small summary or snippet). He did it by parsing the string by hand, which is more error prone and usually makes for less readable code. Fortunately, regular expressions fit this problem quite nicely. Here's a test that makes sure we get the first 4 words in a string, and the function "FindFirstWords" that does this with a simple regular expression.

What I'm doing here is using the expression to find the first run of 4 consecutive "words", where each word is one or more word characters (\w: letters, digits or underscore) followed by one or more whitespace characters. Then I simply iterate over the captures in the match I found. The repeated group should hold 4 captures, one for each "word" that was found.
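To make the capture behaviour concrete, here is a minimal sketch of how .NET records captures for a repeated group. The class and method names (`CaptureDemo`, `CapturedWords`, `WholeMatchCaptureCount`) are made up for illustration. The whole match has a single capture of its own (the full matched text), while group 1 keeps one capture per repetition, which is where the individual words live.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CaptureDemo
{
    // Builds the same kind of pattern as in this post: N repetitions of word + whitespace.
    private static Match MatchWords(string input, int howMany)
    {
        return Regex.Match(input, @"(\w+\s+){" + howMany + "}");
    }

    // Returns every value recorded by group 1, one per repetition,
    // or an empty list when the pattern does not match.
    public static List<string> CapturedWords(string input, int howMany)
    {
        List<string> words = new List<string>();
        Match match = MatchWords(input, howMany);
        if (!match.Success) return words;

        foreach (Capture capture in match.Groups[1].Captures)
        {
            words.Add(capture.Value);
        }
        return words;
    }

    // Number of captures held by the match itself (group 0).
    public static int WholeMatchCaptureCount(string input, int howMany)
    {
        return MatchWords(input, howMany).Captures.Count;
    }
}
```

For the input "this is word four five" and a count of 4, `CapturedWords` yields "this ", "is ", "word " and "four ", while `WholeMatchCaptureCount` reports a single capture covering the whole matched text.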

It's not fully tested, as you can see. I only wrote one test to see that it works on this sort of sentence. More tests could and should be added to cover other cases. In fact, if this were real TDD, I would have started with a test of an empty string, then continued on to test getting only one word, then two, and so on.

[Test]
public void TestRegexFindFirstNWords()
{
      const string INPUT =
            "this is word four five six seven eight nine ten eleven twelve thirteen!";
      const int NUM_WORDS_TO_RETURN = 4;

      string output = FindFirstWords(INPUT, NUM_WORDS_TO_RETURN);

      string expectedOutput = "this is word four ";
      Assert.AreEqual(expectedOutput, output);
}

 

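// Requires: using System.Text; using System.Text.RegularExpressions;
private string FindFirstWords(string input, int howManyToFind)
{
      // N repetitions of: one or more word characters, then one or more whitespace characters
      string REGEX = @"([\w]+\s+){" + howManyToFind + "}";
      StringBuilder output = new StringBuilder();

      // Group 1 records one capture per repetition, i.e. one per word;
      // if nothing matches there are no captures and the result is empty
      foreach (Capture capture in Regex.Match(input, REGEX).Groups[1].Captures)
      {
            output.Append(capture.Value);
      }
      return output.ToString();
}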
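The extra cases mentioned earlier (empty string, one word, two words) might look something like the sketch below. It is only an illustration: it uses plain comparisons instead of NUnit so that it stands alone, and the names `FirstWordsChecks`, `Check` and `RunAll` are hypothetical. The function is a stand-alone copy that builds the same pattern as FindFirstWords.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical stand-alone checks, not part of the original test fixture.
public static class FirstWordsChecks
{
    // Same pattern as in the post: N repetitions of word + whitespace.
    public static string FindFirstWords(string input, int howManyToFind)
    {
        string pattern = @"([\w]+\s+){" + howManyToFind + "}";
        Match match = Regex.Match(input, pattern);
        if (!match.Success) return string.Empty;

        StringBuilder output = new StringBuilder();
        foreach (Capture capture in match.Groups[1].Captures)
        {
            output.Append(capture.Value);
        }
        return output.ToString();
    }

    // Throws if FindFirstWords does not return the expected text.
    public static void Check(string input, int count, string expected)
    {
        string actual = FindFirstWords(input, count);
        if (actual != expected)
        {
            throw new Exception("Expected \"" + expected + "\" but got \"" + actual + "\"");
        }
    }

    public static void RunAll()
    {
        Check("", 4, "");                               // empty input: nothing to match
        Check("hello world", 1, "hello ");              // one word
        Check("hello world again", 2, "hello world ");  // two words
        Check("hello", 1, "");                          // no trailing whitespace, so no match
        Check("one two", 4, "");                        // fewer words than requested
    }
}
```

The last two cases are worth noticing: because every word must be followed by whitespace, a word at the very end of the string does not count, and asking for more words than the input contains returns an empty string rather than a shorter snippet.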

Published Friday, January 07, 2005 4:51 AM by royo

Comments

 

royo said:

A problem with the regex you are using is that you'll hit a snag if your string has words that aren't purely alphanumeric or if it contains punctuation. For example, your function wouldn't work for the first four words of the string
"I am. Are you ready?"
The regex won't match. You'll also have problems with contractions. The previous sentence would return an incorrect value.
Although it doesn't return a specific number of words, I recently posted a regex on regexlib, http://www.regexlib.com/REDetails.aspx?regexp_id=961, that I used for the same purpose as in Jeff's post: getting a summary of a long string of text without breaking in the middle of a word. As written it is limited to ASCII, and instead of a number of words it returns a number of characters between a minimum and a maximum. You could wrap it in a function like you've done here to alter the min and max.
January 7, 2005 2:12 PM
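As a rough illustration of the min/max idea in the comment above, here is a sketch of a character-window summary that avoids cutting a word in half. This is not the pattern posted on regexlib; the pattern, the class name `SnippetSketch`, the method name `Summarize` and the hard-cut fallback are all assumptions made for the example, and it is not restricted to ASCII.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch, not the regexlib pattern referred to in the comment.
public static class SnippetSketch
{
    // Returns between min and max leading characters of input,
    // ending just before whitespace so that no word is split.
    public static string Summarize(string input, int min, int max)
    {
        if (min < 0 || max < min)
        {
            throw new ArgumentOutOfRangeException("max", "Expected 0 <= min <= max.");
        }
        if (input.Length <= max) return input;

        // Greedily take up to max characters, then back off until the next
        // character is whitespace (or the end of the text).
        string pattern = @"^[\s\S]{" + min + "," + max + @"}(?=\s|$)";
        Match match = Regex.Match(input, pattern);

        // Assumed fallback: cut at max when no word break falls inside the window.
        return match.Success ? match.Value : input.Substring(0, max);
    }
}
```

With a window of 10 to 20 characters, "The quick brown fox jumps over the lazy dog" comes back as "The quick brown fox", because the 20th character would land in the middle of the next word.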
 

royo said:

Would it not make sense to use \b to identify a word boundary?
Then get all characters up to boundary n+1
February 7, 2005 7:59 AM
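The word-boundary suggestion could be sketched roughly as follows. This is one possible reading of it, not a tested replacement: it finds words with \b, then returns everything up to the end of the Nth word. The class name `BoundarySketch`, the method name `FirstWords`, the choice to allow apostrophes inside a word, and the choice to return the whole available text when there are fewer than N words are all assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch of the \b idea from the comment above.
public static class BoundarySketch
{
    // Returns the text from the start of input up to the end of the Nth word.
    // Punctuation between words is kept; apostrophes count as part of a word.
    public static string FirstWords(string input, int howMany)
    {
        if (howMany <= 0) return string.Empty;

        MatchCollection words = Regex.Matches(input, @"\b[\w']+\b");
        if (words.Count == 0) return string.Empty;

        // Assumed behaviour: with fewer than howMany words, stop at the last one found.
        Match last = words[Math.Min(howMany, words.Count) - 1];
        return input.Substring(0, last.Index + last.Length);
    }
}
```

Under these assumptions, "I am. Are you ready?" with N = 4 gives "I am. Are you", and "You'll also have problems" with N = 2 gives "You'll also", so both the punctuation and the contraction cases raised in the earlier comment are handled.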