
Match all specified chars/strings between a start string/tag and an end string/tag.

Last post 03-17-2010, 11:38 AM by Jam. 10 replies.
  •  03-09-2010, 4:48 AM 60524

    Match all specified chars/strings between a start string/tag and an end string/tag.

    Hello.

    I did not manage to find an appropriate script for my need in this fantastic forum, so I hope someone can help me with this case.
    (I believe this is quite trivial for anyone who really knows regex; I have only used some well-known scripts in my code and am not able to create this one quickly.)

     Here is information of what I have got so far and what I wish to have:

    Search text:
    <member><name>Some Nice Name</name><string>Here & here is some special characters & etc.</string></member>

    Regex I have so far:
    <member><name>Some Nice Name<\/name>\s*(.*)\s*<\/member>

    This matches with:
    <string>Here & here is some special characters & etc.</string>

    It is already a start, I think, but now I need it to match only the specified characters/strings,
    so that the matches here would be "&" and "&".

    The idea is that I could change the start strings, the end strings and also all of the specified strings found between them. This way I could find and replace values in one "configuration" setting quite flexibly.

    The code that executes the given regex is C# 2.0.

    Btw, my usage is through a configuration file and therefore I am not able to loop or do other perhaps more useful ways to get where I want.

     

    BR,
    Jam

  •  03-14-2010, 3:05 PM 60901 in reply to 60524

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    Hello again.

     

    I had locked my question at first, so now I hope someone still finds this, and perhaps could give me some ideas how to make the right regex.

    Any help appreciated.

     

    BR,

    Jam

  •  03-14-2010, 5:26 PM 60911 in reply to 60901

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    look at this: (?<=<member><name>[\w\W]+</name><string>)[\w\W]+(?=</string></member>)
  •  03-14-2010, 8:28 PM 60921 in reply to 60524

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    You are running into a very common error (and BrandJoker's solution suffers from exactly the same problem): the '+' and '*' quantifiers are greedy and will match as many characters as possible. In cases such as this, where you are locating a starting tag and then trying to match the corresponding ending tag, this greediness can be disastrous.

    Given the text

    <start>words<end>more words<start>still more<end>last words

    and you want to match from "<start>" to "<end>" the approach you have (both) taken would look like

    <start>.*<end>

    and would match everything from the first "<start>" to the last "<end>" (the part that was underlined in the sample text). This is because the match locates the "<start>" OK but then the '.*' will grab all characters to the end of the string/line (depending on the setting of the "singleline" option). When it tries to match the "<end>" characters, it can't as it is at the end of the string. Therefore the regex engine starts to backtrack by releasing one character at a time from those it has already matched until it either runs out of characters or can make a match. In effect this is searching from the END for the matching tag.
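
    As a minimal sketch of the greedy behaviour described above: the snippet is TypeScript rather than the C# used in this thread, on the assumption that both engines treat this simple pattern the same way, and the variable names are purely illustrative.

```typescript
// Sample text from the post above.
const greedySample: string =
  "<start>words<end>more words<start>still more<end>last words";

// Greedy ".*" first runs to the end of the line, then backtracks
// until "<end>" can match, so it stops at the LAST "<end>".
const greedyPattern: RegExp = new RegExp("<start>.*<end>");
const greedyMatch: RegExpExecArray | null = greedyPattern.exec(greedySample);

console.log(greedyMatch ? greedyMatch[0] : "no match");
// prints: <start>words<end>more words<start>still more<end>
```

    Only the trailing "last words" is left out of the match; the two tag pairs are swallowed as one.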

    (BrandJoker uses the '[\W\w]' and similar ways to create a character set that contains all characters - in effect the same as '.' with the 'singleline' mode on. The impact of the quantifier is the same in all such cases).

    There are several ways to stop this. One of the simpler is to use the '*?' (or '+?') form of the quantifier, which makes it non-greedy: it will match as few characters as possible rather than as many. This form is not always available (but it is in the .NET regex you are using), and it can still run away on you if you are not careful. Some people (me included) like to be specific about how far we search ahead and so prefer forms such as '((?!<end>).)*', which only goes as far as the specified character sequence. You can also extend this to multiple termination sequences, which is useful in many cases.
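
    A short sketch of both alternatives, again in TypeScript as a stand-in for the .NET engine (an assumption; these constructs are expected to behave the same in both). The "<stop>" tag and all variable names are made up for illustration.

```typescript
const pairSample: string =
  "<start>words<end>more words<start>still more<end>last words";

// Lazy quantifier: ".*?" takes as few characters as possible.
const lazyPattern: RegExp = new RegExp("<start>.*?<end>", "g");

// Explicit form: a character is consumed only if "<end>" does not begin there.
const temperedPattern: RegExp = new RegExp("<start>((?!<end>).)*<end>", "g");

const lazyMatches: string[] = pairSample.match(lazyPattern) || [];
const temperedMatches: string[] = pairSample.match(temperedPattern) || [];

console.log(lazyMatches);     // ["<start>words<end>", "<start>still more<end>"]
console.log(temperedMatches); // ["<start>words<end>", "<start>still more<end>"]

// Several termination sequences: stop before "<end>" OR a hypothetical "<stop>".
const multiStopPattern: RegExp = new RegExp("<start>((?!<end>|<stop>).)*", "g");
const multiStopMatches: string[] =
  "<start>abc<stop>def<end>".match(multiStopPattern) || [];

console.log(multiStopMatches); // ["<start>abc"]
```

    On this sample the lazy and the explicit form give the same two matches; the explicit form simply states the stopping condition inside the pattern itself.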

    Susan

  •  03-15-2010, 9:34 AM 60948 in reply to 60921

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    Hello.

     

    Thanks for your answers!

    I agree with being specific (when possible) about how far to search ahead.

    Unfortunately I still do not see how to build a regex that finds the specific chars/words inside the start and end match.

    I tried this:

    (?<=<member><name>Some Nice Name<\/name>)[\&]*((?!<\/member>).)*

     But it did not return correct matches.

    The explanation looks quite correct to me in the Expresso analyzer...

    - "Match a prefix, but exclude it from the capture"

    - "Any character in this class [\&], any number repetition"

    - A numbered capture group for the </member> end tag, done as I understood from you,  Susan.

     

    Could you please help me further with this?

    Thank you! :-)

  •  03-15-2010, 6:33 PM 61050 in reply to 60921

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    Aussie Susan:

    (BrandJoker uses the '[\W\w]' and similar ways to create a character set that contains all characters - in effect the same as '.' with the 'singleline' mode on. The impact of the quantifier is the same in all such cases).

    There are several ways to stop this. One of the simpler is to use the '*?' (or '+?') form of the quantifier, which makes it non-greedy: it will match as few characters as possible rather than as many. This form is not always available (but it is in the .NET regex you are using), and it can still run away on you if you are not careful. Some people (me included) like to be specific about how far we search ahead and so prefer forms such as '((?!<end>).)*', which only goes as far as the specified character sequence. You can also extend this to multiple termination sequences, which is useful in many cases.

    Hi Susan,

    you are right.

    I have to feed my .NET test environment with more than one line ;-)

     

  •  03-15-2010, 11:31 PM 61066 in reply to 60948

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    Let's go back to the beginning and make sure that I understand you correctly.

    You have a single example string you want to process:

    <member><name>Some Nice Name</name><string>Here & here is some special characters & etc.</string></member>

    You are wanting to locate the text between some starting character sequence and some other ending character sequence. I think I'm OK so far.

    Now, what is the "starting character sequence" you are wanting to use? Looking at your original pattern, it looks like "<member><name>" but elsewhere you seem to want to involve the "&" character in a way I don't understand.

    Similarly, what is the "ending character sequence"? It looks like it should be "</member>" but, again, I'm not sure.

    Can you perhaps provide us with several sample strings and identify the parts that you are wanting to locate, as well as answering the above questions?

    Susan

  •  03-16-2010, 3:40 AM 61082 in reply to 61066

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    Hi.

     

    Ok, sorry...I try again. :-)

    You have understood the first part correctly.

    The second part is: I want to find specific chars/strings inside the result of the first part.

    Search string:

    <member><name>Some Nice Name</name><string>Here & here is some special characters & etc.</string></member>

    Matches I want: "&" and "&"

    So if, instead of "&", I wanted to find the string "re", then the matches I want are: "re" and "re" (from the words "Here" and "here").

    One more example, looking for the character "ä":

    Search string:

    <member><name>NameWith_ä_and_ö</name><string>Here is the ä and ä you wanted, still one ä and maybe one ö also</string></member>

    Matches I want are "ä" and "ä" and "ä", i.e. the three inside <string>...</string> (they were marked bold in the search string).

     

    Thanks for your help so far!

    BR,

    Jam

  •  03-16-2010, 6:39 PM 61242 in reply to 61082

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    OK, let me replay what I understand and then give you a pattern that does that.

    You want to find a specific character sequence (in your examples you have used "&", "re" and "ä") as long as that sequence is between "<string>" and "</string>" tags that are themselves between "<member><name>" and "</member>" tags.

    I think you want to perform some form of string replacement on the located character sequences and so you only want to locate the target strings themselves and not the tags on either side.

    If that is so, then try:

    (?<=<member><name>.*?<string>[^<>]*?)ä(?=[^<>]*?</string>.*?</member>)

    This locates the "ä" character and checks that it is preceded by the "<member><name>" character sequence, some random characters (including none) and the "<string>" tag, and then any other characters except "<" and ">", which would indicate another tag between the "<string>" and the target character(s). On the other side of the target character(s) it performs a similar check to make sure that there follow some random characters that do not form a tag, the "</string>" tag, more characters and then the "</member>" tag.

    Note that the '.*?' sequences work in the example you have provided but I can imagine strings that may confuse this pattern into identifying the target characters incorrectly. Also, this pattern checks both the "before" and "after" conditions, just to be sure.

    You say that you want to make this flexible in terms of being able to change the various character sequences that are used within various parts of the pattern. Using this as the basis, I hope that you can see where you can make the necessary substitutions to build the pattern string in the way you need.
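
    One possible way to make those substitutions is sketched below. It is written in TypeScript rather than C#, on the assumption that the variable-length lookbehind behaves the same as in .NET for this pattern. The helpers escapeRegex and buildTargetPattern are hypothetical names, not part of any library.

```typescript
// Hypothetical helper: escape regex metacharacters so that a configured
// target such as "." or "(" is matched literally.
function escapeRegex(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap a target in the lookbehind/lookahead checks
// from the pattern above. Needs an engine with variable-length lookbehind.
function buildTargetPattern(target: string): RegExp {
  const source: string =
    "(?<=<member><name>.*?<string>[^<>]*?)" +
    escapeRegex(target) +
    "(?=[^<>]*?</string>.*?</member>)";
  return new RegExp(source, "g");
}

const umlautSample: string =
  "<member><name>NameWith_ä_and_ö</name><string>Here is the ä and ä you wanted, " +
  "still one ä and maybe one ö also</string></member>";
const ampSample: string =
  "<member><name>Some Nice Name</name><string>Here & here is some special " +
  "characters & etc.</string></member>";

// The "ä" inside <name> is skipped; only those inside <string> are found.
const umlautCount: number = (umlautSample.match(buildTargetPattern("ä")) || []).length;
const reCount: number = (ampSample.match(buildTargetPattern("re")) || []).length;
console.log(umlautCount, reCount); // 3 2

// Replacing only the located targets (the replacement value is illustrative).
const escapedAmp: string = ampSample.replace(buildTargetPattern("&"), "&amp;");
console.log(escapedAmp);
```

    The tag names are hard-coded here to stay close to the pattern above; they could be passed in as parameters in the same way as the target.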

    Susan

  •  03-17-2010, 4:40 AM 61267 in reply to 61242

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    Fantastic Susan!

    That was really helpful.

    Now I have this

    (?<=<member><name>NameWith_ä_and_ö</name><string>[^<>]*?)ä(?=[^<>]*?</string></member>)

    which gives me the result I want WHEN the whole search string is on one line. When there is a line feed (or CRLF), for example between "</name>" and "<string>", it does not work.

    I will continue searching how to allow CRLF and perhaps also tabs in the search string...

    If I haven't replied here before you see this, then I really would appreciate if you still could help me with that.

     

    Jam

  •  03-17-2010, 11:38 AM 61302 in reply to 61267

    Re: Match all specified chars/strings between a start string/tag and an end string/tag.

    Hi.

     

    Now I tried this one:

    (?<=<member>\s*<name>\s*NameWith_ä_and_ö\s*</name>\s*<string>\s*[^<>]*?)ä(?=[^<>]*?\s*</string>\s*</member>)

    and it returns the desired result also when the search string is on multiple lines.

    BUT, is that the correct/best way to do it?
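
    A small check of the two patterns against a multi-line input, sketched in TypeScript as a stand-in for the .NET engine (an assumption). The sample layout with CRLF and tabs and the variable names are invented for illustration.

```typescript
// Search string spread over several lines with CRLF and tabs.
const multiLineSample: string =
  "<member>\r\n\t<name>NameWith_ä_and_ö</name>\r\n\t<string>Here is the ä and ä " +
  "you wanted, still one ä and maybe one ö also</string>\r\n</member>";

// Earlier pattern: the tags must be directly adjacent.
const strictPattern: RegExp = new RegExp(
  "(?<=<member><name>NameWith_ä_and_ö</name><string>[^<>]*?)ä" +
    "(?=[^<>]*?</string></member>)",
  "g"
);

// Pattern from this post: "\s*" allows whitespace between the tags.
const tolerantPattern: RegExp = new RegExp(
  "(?<=<member>\\s*<name>\\s*NameWith_ä_and_ö\\s*</name>\\s*<string>\\s*[^<>]*?)ä" +
    "(?=[^<>]*?\\s*</string>\\s*</member>)",
  "g"
);

// Same pattern without the "\s*" that sits next to "[^<>]*?".
const trimmedPattern: RegExp = new RegExp(
  "(?<=<member>\\s*<name>\\s*NameWith_ä_and_ö\\s*</name>\\s*<string>[^<>]*?)ä" +
    "(?=[^<>]*?</string>\\s*</member>)",
  "g"
);

const strictCount: number = (multiLineSample.match(strictPattern) || []).length;
const tolerantCount: number = (multiLineSample.match(tolerantPattern) || []).length;
const trimmedCount: number = (multiLineSample.match(trimmedPattern) || []).length;

console.log(strictCount, tolerantCount, trimmedCount); // 0 3 3
```

    Because "[^<>]" already accepts whitespace, including line breaks, the "\s*" placed directly beside "[^<>]*?" should be redundant; the "\s*" between the tags is what makes the difference.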

     

    BR,

    Jam
