I was asked how to extract the value inside an XML element that carries an attribute of some kind. Here's a simple way:
Say you have this as an input:
<Customer VIP="false">Danny Glover</Customer>
<Customer VIP="true">John Roberts</Customer>
<Customer VIP="false">Eva Gardner</Customer>
The value we eventually want is the text between the opening and closing tags (the customer's name).
It's fairly easy to set up an expression for this:
Fire up The Regulator and put the input text in the "input" field.
Then start out by writing an expression to match the XML element:
In Regex, each letter usually matches itself, so if we wanted to match the text "roy" in an input, we could simply write "roy" in the regex and it would match. But some characters are "special": they are part of the regex syntax, so, as in any good language, we need to "escape" them with a backslash when we mean them literally. Characters such as ".", "*", "(" and ")" are examples. The "<" and ">" characters are not strictly special in .NET regular expressions and would match themselves, but writing them as "\<" and "\>" also matches them literally here, and that is the form used in this post. So, our expression could start out like this:
- \<Customer\sVIP=".*"\>
Some explanation:
- \s = a single whitespace character (a space, a tab, and so on)
- .* = actually two things: "." means "any character" (except a newline, by default), and "*" means "zero or more of the preceding item"
- So, what I'm actually writing here is: match anything that starts with <Customer VIP="[anything of any length]">
Pressing F5 and running this would match the starting node of each XML element (you can see the results in the match tree of The Regulator).
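The same check can be sketched in code instead of The Regulator. This is a minimal console program, assuming the three sample lines are held in one string separated by newlines; the class name OpeningTagDemo is only illustrative. The sketch writes "<" and ">" without backslashes, which .NET also accepts as literal characters.

```csharp
using System;
using System.Text.RegularExpressions;

class OpeningTagDemo
{
    static void Main()
    {
        // The three sample lines from the input, joined with newlines.
        string input =
            "<Customer VIP=\"false\">Danny Glover</Customer>\n" +
            "<Customer VIP=\"true\">John Roberts</Customer>\n" +
            "<Customer VIP=\"false\">Eva Gardner</Customer>";

        // Opening tag only. In a verbatim string, "" stands for one double quote.
        string pattern = @"<Customer\sVIP="".*"">";

        // "." does not cross newlines by default, so each line yields one match.
        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.WriteLine(m.Value);
        }
    }
}
```

Each printed line should be one opening tag, such as <Customer VIP="false">. Note that ".*" is greedy, so this only behaves well while each element sits on its own line.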
Now we come to the important part: whatever comes after this match and before the closing node is the actual text we want. We don't really care what it is as long as we can access this information easily. Since we want to match anything of any length inside these tags, we can use ".*" in there. But if we want to get back the value in those tags easily in code, we need to put it in a "group". A group is a logical construct that we wrap around parts of our matched text, so that we can later refer to it in code using the Regex object model exposed by the .NET Framework.
To wrap something in a group we simply put parentheses around it:
- \<Customer\sVIP=".*"\>(.*)\</Customer\>
Explanation:
- The part in parentheses, (.*), is the group I've just created.
- The expression after the group simply matches the closing tag of the Customer node.
- To get to the "meat", the group we created, our code might look like this (if we were parsing one line at a time):
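using System.Text.RegularExpressions;

// Verbatim string (@"..."): backslashes are literal, and "" stands for one double quote.
string pattern = @"\<Customer\sVIP="".*""\>(.*)\</Customer\>";
string input = "<Customer VIP=\"true\">John Roberts</Customer>";
Match match = Regex.Match(input, pattern);
if (match.Success)
{
    // Groups[0] is the whole match; Groups[1] is the first parenthesized group.
    // valueInside should now contain the value "John Roberts"
    string valueInside = match.Groups[1].Value;
}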
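For the full three-line input, the same idea might look like the sketch below. It assumes the whole input is in one string and uses Regex.Matches to visit every element; the class name GroupIndexDemo is illustrative, and "<" and ">" are written unescaped, which .NET treats as literals. It also prints group 0 next to group 1, since group 0 always holds the entire match and numbered groups start at 1.

```csharp
using System;
using System.Text.RegularExpressions;

class GroupIndexDemo
{
    static void Main()
    {
        string input =
            "<Customer VIP=\"false\">Danny Glover</Customer>\n" +
            "<Customer VIP=\"true\">John Roberts</Customer>\n" +
            "<Customer VIP=\"false\">Eva Gardner</Customer>";

        // One unnamed group around the text between the tags.
        string pattern = @"<Customer\sVIP="".*"">(.*)</Customer>";

        foreach (Match m in Regex.Matches(input, pattern))
        {
            // Groups[0]: the entire matched element. Groups[1]: the first parenthesized group.
            Console.WriteLine("{0} -> {1}", m.Groups[0].Value, m.Groups[1].Value);
        }
    }
}
```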
It's that easy. If we really wanted to, we could get the group by name. To give the group a name, put this text inside the parentheses, before the expression in them: "?<MyGroupName>":
- \<Customer\sVIP=".*"\>(?<InnerText>.*)\</Customer\>
Now we can access the group like so:
string valueInside = match.Groups["InnerText"].Value;
Can you see the power in this? I can now divide my text into as many groups as I want and simply refer to each part by name, without having to parse a single string using the conventional methods. Groups can even nest inside other groups, but you still get them all by name in just the same way.
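As a rough illustration of several named groups, including a nested one, here is a minimal sketch. Only "InnerText" comes from the expression above; the group names "Element" and "Vip" and the class name NamedGroupsDemo are made up for this example. The "<" and ">" characters are left unescaped, which .NET accepts as literals.

```csharp
using System;
using System.Text.RegularExpressions;

class NamedGroupsDemo
{
    static void Main()
    {
        string input = "<Customer VIP=\"true\">John Roberts</Customer>";

        // "Element" wraps the whole tag and contains the nested groups
        // "Vip" and "InnerText". "Element" and "Vip" are illustrative names.
        string pattern =
            @"(?<Element><Customer\sVIP=""(?<Vip>.*)"">(?<InnerText>.*)</Customer>)";

        Match match = Regex.Match(input, pattern);
        if (match.Success)
        {
            Console.WriteLine(match.Groups["InnerText"].Value); // text between the tags
            Console.WriteLine(match.Groups["Vip"].Value);       // value of the VIP attribute
            Console.WriteLine(match.Groups["Element"].Value);   // the whole element
        }
    }
}
```

Nesting changes nothing about access: every group, inner or outer, is read through match.Groups by its name.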