Hello boys and girls. Wow, it's been a
while since I've done this. I want to touch on a very useful but
often overlooked feature of regex: grouping. While I haven't been
blogging, I have been active on a message board here and there. A
question I see quite often is “I want to find a match in a string
but I don't want part of the match” or “I need the value of this
portion of the string.” I often see solutions to these types of
questions that involve look-arounds. And while they certainly work,
they aren't the only way to achieve the desired results. New regex
users seem to believe they can only access the full match. Most
regex engines support groups, which let you access a certain
portion of the full match. Groups are identified by
parentheses. Every pair of plain parentheses is a group. For
example, take the regex pattern
/(Hello)\x20(world)/
There are two groups in this regex. The
regex itself matches the string “Hello world”; group 1 contains
the string “Hello” and group 2 contains the string “world”.
Note that neither group contains the space between the two words, but it is
part of the full match. Most implementations of regular expressions
give you a way to access these groups. They are usually contained in a
collection inside the match object. You'll need to consult your
regex documentation to know the exact layout, but in most of these
implementations there is a zero-based collection where item 0 is the
full match and items 1..n are the groups your regex
contains, if any.
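To make the zero-based layout concrete, here is a minimal sketch using Python's re module. Python is just my pick for illustration, and the sample sentence and variable names are made up; your engine may expose the groups differently.

```python
import re

# Same pattern as above; \x20 is the hex escape for a space character.
pattern = re.compile(r"(Hello)\x20(world)")

# The sample text is invented for this example.
match = pattern.search("Say Hello world to everyone")

full_match = match.group(0)    # item 0: the full match
first_group = match.group(1)   # item 1: first pair of parentheses
second_group = match.group(2)  # item 2: second pair of parentheses

print(full_match)      # Hello world
print(first_group)     # Hello
print(second_group)    # world
print(match.groups())  # ('Hello', 'world')
```

Note that groups() in Python returns only items 1..n; the full match is reached through group(0).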
You may have noticed I said every pair of
plain parentheses is a
group. I stress plain because there are other group
constructs, the aforementioned look-arounds being some of them. Support
for the other constructs varies between implementations, so again consult
your regex documentation to see which ones you have. The other
grouping constructs consist of an open parenthesis immediately
followed by a question mark, which is then followed by the characters
that define that particular grouping construct. Again, consult your
documentation to see which characters define what. I'm not going to
go into all of them here, but they basically fall into two categories:
capturing and non-capturing. The plain parentheses I've mentioned
above are a capturing group. However, capturing requires extra
resources. In some cases you need the extra speed but still need to
group a certain part of the pattern together, for either necessity or
readability or both. This is where you'll want to use a
non-capturing group, (?:pattern): an open parenthesis followed
immediately by a question mark followed immediately by a colon. The
difference here is that the data matched in the group is not added to
the collection of sub-matches in the match object.
Take our
previous example and make the first group non-capturing:
/(?:Hello)\x20(world)/
Where before we
had two groups, here we only have one. Group 1 contains the
string “world”. Now this example is not a very practical use of a
non-capturing group. Typically you'd use them in more complex
regexes that need a grouping but where you really don't care about the
sub-matches.
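If you want to see the difference side by side, a quick sketch in Python (again just my choice of engine, with made-up variable names) compares the two patterns against the same string:

```python
import re

capturing = re.compile(r"(Hello)\x20(world)")
non_capturing = re.compile(r"(?:Hello)\x20(world)")

text = "Hello world"

capturing_groups = capturing.search(text).groups()
non_capturing_groups = non_capturing.search(text).groups()
non_capturing_full = non_capturing.search(text).group(0)

print(capturing_groups)      # ('Hello', 'world')
print(non_capturing_groups)  # ('world',)
print(non_capturing_full)    # Hello world

# The compiled pattern also reports how many capturing groups it has.
print(capturing.groups, non_capturing.groups)  # 2 1
```

The full match is identical in both cases; only the list of sub-matches changes.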
One more quick
thing about capturing groups: the position of each left parenthesis
determines the index of its sub-match in the groups collection. So if you have
nested parentheses, count every (plain) left parenthesis from left to right to know which
index to use to reference a group. Some of the advanced grouping constructs
and regex options can affect the ordering, but if you are using them
hopefully you've read up on their effects, so I won't go over that here.
/((Hello)|(Goodbye Cruel))\x20(world)/
The above regex
has 4 capturing groups (not counting group 0). Can you find them?
It should match either the string “Hello world” or “Goodbye
Cruel world”. I want to point out that not all the groups will
participate in the match, but they are still part of the groups
collection. There will always be 4 groups; one of them will always be
empty. Which one depends on which string was matched.
If “Hello world” was matched the groups are:
- Group 1: Hello
- Group 2: Hello
- Group 3: (empty)
- Group 4: world
If “Goodbye Cruel world” was matched the groups are:
- Group 1: Goodbye Cruel
- Group 2: (empty)
- Group 3: Goodbye Cruel
- Group 4: world
In both cases, group 0 is the full match.
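You can check those listings yourself. Here is a small sketch in Python, which is only one engine among many. One thing to be aware of: Python reports a group that didn't participate as None rather than as an empty string, so what “empty” looks like depends on your implementation.

```python
import re

pattern = re.compile(r"((Hello)|(Goodbye Cruel))\x20(world)")

hello_groups = pattern.search("Hello world").groups()
goodbye_groups = pattern.search("Goodbye Cruel world").groups()

print(hello_groups)    # ('Hello', 'Hello', None, 'world')
print(goodbye_groups)  # ('Goodbye Cruel', None, 'Goodbye Cruel', 'world')

# Passing a default swaps None for an empty string, which is closer
# to how some other engines report a non-participating group.
hello_groups_empty = pattern.search("Hello world").groups(default="")
print(hello_groups_empty)  # ('Hello', 'Hello', '', 'world')
```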
Notice that in
both cases two groups contain the same value. Even if you need to know
whether “Hello” or “Goodbye Cruel” was matched, you certainly
don't need to know it twice. Plus, the inner parentheses have
different indexes, so you'd have to check each one if you wanted to use those. This
is where you'd use non-capturing groups to simplify your groups
collection.
/((?:Hello)|(?:Goodbye Cruel))\x20(world)/
Now we are back
down to two groups. Group 1 contains either “Hello” or “Goodbye
Cruel”, depending on which string was matched. Group 2 always
contains “world”.
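Running the simplified pattern through Python's re module (my choice for illustration; the variable names are invented) shows the tidier collection:

```python
import re

pattern = re.compile(r"((?:Hello)|(?:Goodbye Cruel))\x20(world)")

hello_groups = pattern.search("Hello world").groups()
goodbye_groups = pattern.search("Goodbye Cruel world").groups()

print(hello_groups)    # ('Hello', 'world')
print(goodbye_groups)  # ('Goodbye Cruel', 'world')
```

Whichever alternative matches, the greeting always lands in group 1, so there is only one index to check.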
However, keep in
mind that in some cases you'll want to use the inner indexes to determine
which alternative was matched. So using non-capturing groups isn't
necessarily better; it just depends on whether you need to access
those groups or not. But if you are not doing anything with them,
don't capture them.
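As a rough sketch of when the inner groups earn their keep, here is one way to branch on which alternative matched. This is Python again, and which_greeting is a hypothetical helper I made up for the example, not part of any library.

```python
import re

# The fully capturing pattern from earlier, inner groups included.
pattern = re.compile(r"((Hello)|(Goodbye Cruel))\x20(world)")

def which_greeting(text):
    """Return a label for the alternative that matched, or None if nothing matched."""
    match = pattern.search(text)
    if match is None:
        return None
    # Group 2 only participates when the "Hello" alternative matched;
    # otherwise group 3 (the "Goodbye Cruel" alternative) did.
    if match.group(2) is not None:
        return "hello"
    return "goodbye cruel"

print(which_greeting("Hello world"))          # hello
print(which_greeting("Goodbye Cruel world"))  # goodbye cruel
print(which_greeting("Hi world"))             # None
```

You could get the same answer by comparing the text of group 1, but testing the inner group avoids repeating the literal strings in your code.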
These are just
two of the basic grouping constructs. They are generally supported
across implementations of regex, but not always. Where they are, you
can use them to easily dissect larger matches.