One thing I've noticed among rookie
regex users is that they focus way too much on what they are trying
to match and not on where they are trying to match it from. There is
a tendency to drastically underestimate the importance of the
source data they are trying to apply their regex against.
I see this all the time on forums where
posters asking for help almost never post any sample data, and most
of the few that do post a sample they made up on the spot. Over on
the RegexAdvice construction forum we've added guidelines for
posting a question. The whole point of the guidelines is to
expedite getting questions answered. One of the things we ask is
that posters provide a sample of the actual data they are trying to
match. We explicitly ask that they don't make up a sample. Unfortunately,
most people don't follow this request, or others in the
guidelines, which generally results in those questions taking longer
to get answered. On this particular point, not supplying any sample is the
worst offense because anyone trying to help you can't see what you
are working with and can only give an untested guess at a solution. It's
like calling a mechanic on the phone, saying “My car isn't
running right,” leaving your number, and waiting for him to call back
to tell you how to fix it. He can call back and tell you to change
your oil, but without more detail he can't know whether that will fix your
problem. However, making up a sample doesn't help much either and is
almost as bad. I know people think they are being helpful
and not boring everyone with the bloody details by providing a simple
example to work with, but generally it's not helpful since it doesn't
represent their real problem. In regex construction, details matter.
Often the made-up sample is too simple, so what they end up getting
is a solution that works great for their sample but doesn't handle
their real problem well, if at all. So now they are back where they
started. This is like calling your mechanic on the phone and saying
“My car is making a funny noise.” While it's less vague than “isn't
running right,” it's still vague. You may know exactly what kind of
noise it is, how loud it is, and where it's coming from, but that sentence
doesn't communicate any of that. Made-up samples are about the same
when asking for regex help.
The majority of regular expression
problems are context sensitive, meaning the approach to matching
the target string may depend on where it is in the source string.
Now sometimes the target string is so distinct from the surrounding text
that the context seems irrelevant. Say you are trying to find a
string of digits somewhere in your source data, and there just
happens to be only one occurrence of digits in your source. Well,
in that case a generic regex that finds digits will be good enough.
Source : “abc345def” or “a125bcd”
Regex: \d+
Now I said the context seems irrelevant
in this case, but it is as relevant as ever. The context is that digits
only occur once in the string.
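Here is a minimal Python sketch of that point (Python is just my pick for
illustration, and find_digits is a made-up helper name). It runs \d+ against
the two sources above, then against a hypothetical third source where digits
occur twice, to show that the "context" was doing the work all along.

```python
import re

def find_digits(source):
    # Generic "one or more digits" regex from the example above.
    return re.findall(r"\d+", source)

print(find_digits("abc345def"))     # ['345']
print(find_digits("a125bcd"))       # ['125']

# Hypothetical source (not from the samples above) where digits occur twice:
# the same regex now returns two candidates and can't tell which one you want.
print(find_digits("abc345def678"))  # ['345', '678']
```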
But the relevance becomes clearer if you
have a source file where there are two or more occurrences of text
that mimic your target data but only one is actually the data you
want. For example, let's say you want to find the area code of a phone
number. Well, if you agree that an area code is a 3 digit number, the
regex \d{3} might be all you need. It depends. On what, you ask? Come
on, you know, say it with me... “Your source data!”
If your source data is like the case I
mentioned first, where the only digits in the file are area codes, or
if any other digits in the file that are not area codes come in
runs of fewer than three consecutive digits, then the simple
regex should work.
Sample
1) AL 205, 256, 334
2) AK 907
3) AR 479
4) etc.
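As a quick check, here is a small Python sketch that runs \d{3} over the
three sample lines. I'm assuming the "1)", "2)" prefixes are just list
numbering and not part of the data, so they are left out; the variable names
are mine.

```python
import re

# The sample lines above, without the list numbering.
sample = """AL 205, 256, 334
AK 907
AR 479"""

# Every run of exactly three digits here is an area code,
# so the simple regex is good enough for this source.
area_codes = re.findall(r"\d{3}", sample)
print(area_codes)  # ['205', '256', '334', '907', '479']
```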
But if your source is a bunch of phone
numbers:
1) Mom (555) 123-4567
2) Sis 1-555-234-4567
3) Friend 5553456789
Well, now you have a problem. While your
general task is still the same, you have to approach it differently
because of the source. You can't simply look for 3 consecutive
digits since that would return 3 matches for each line. But if we
know how the data is structured, we can focus on how the data we want
sticks out from the rest. In the above sample text, the area code meets one
of three conditions:
1) 3 consecutive digits inside parentheses
2) the first group of 3 consecutive digits in a string of 1-xxx-xxx-xxxx format
3) the first 3 consecutive digits of a ten consecutive digit string
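To see why the simple regex falls over on this source, here is a short
Python sketch that runs \d{3} against each of the three phone number lines
(again minus the list numbering, which I'm assuming is not part of the data).

```python
import re

lines = [
    "Mom (555) 123-4567",
    "Sis 1-555-234-4567",
    "Friend 5553456789",
]

# Map each line to everything the naive three-digit regex finds in it.
naive = {line: re.findall(r"\d{3}", line) for line in lines}

for line, matches in naive.items():
    print(line, "->", matches)
# Mom (555) 123-4567 -> ['555', '123', '456']
# Sis 1-555-234-4567 -> ['555', '234', '456']
# Friend 5553456789 -> ['555', '345', '678']
```

Three matches per line, and only the first of each happens to be the area
code.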
Now you can still match what you need
in one regex, but how you choose to approach it changes. Some regex
authors will use look-around to examine the surrounding text; others
will match the surrounding text and use groups to pull out the desired
data. Either way, you have to expand your field of vision beyond what
you are trying to match to address the problem.
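Both styles can be sketched in Python. Treat this as one possible way to
write them, built only from the three sample lines and the three conditions
listed above; the pattern and function names are mine, and real data would
almost certainly need more cases.

```python
import re

# Style 1: match the surrounding text and capture the area code in a group.
GROUPED = re.compile(
    r"\((\d{3})\)"                 # (xxx)
    r"|\b1-(\d{3})-\d{3}-\d{4}\b"  # 1-xxx-yyy-yyyy
    r"|\b(\d{3})\d{7}\b"           # xxxyyyyyyy
)

def area_code_grouped(text):
    m = GROUPED.search(text)
    if m is None:
        return None
    # Only one of the three alternatives (and so one group) can have matched.
    return next(g for g in m.groups() if g is not None)

# Style 2: match only the three digits and inspect the neighbors with
# look-around. Python's re module needs fixed-width look-behind, which
# "(" and "1-" both satisfy.
LOOKAROUND = re.compile(
    r"(?<=\()\d{3}(?=\))"
    r"|(?<=1-)\d{3}(?=-\d{3}-\d{4}\b)"
    r"|\b\d{3}(?=\d{7}\b)"
)

def area_code_lookaround(text):
    m = LOOKAROUND.search(text)
    return m.group(0) if m else None

for line in ["Mom (555) 123-4567", "Sis 1-555-234-4567", "Friend 5553456789"]:
    print(line, "->", area_code_grouped(line), area_code_lookaround(line))
```

Each line now yields a single area code instead of three candidates. Which
style you prefer is mostly taste; the point is that both had to look past the
three digits themselves.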
But even in this simple example, you can
see how knowing what the source data looks like is still
important. Imagine someone asked you for help and you were trying to
solve this problem without seeing the actual source above. Suppose
someone posts (or asks) the question “I need a regex to match a US
area code.”
With No Sample: Any helpful regex
author has to guess how your data is formatted. Some may assume it's
a string of “(xxx) yyy-yyyy”, others may assume it's
“xxx-yyy-yyyy”, some may even think it's “State: xxx, xxx, xxx”,
and none of them may even consider “1-xxx-yyy-yyyy” or
“xxxyyyyyyy”.
With Made-Up Sample: “I have a string
(555) 555-5555 and I need a regex to match a US area code.”
Now a regex author may assume all your
phone numbers are in that nice format and that there are no
variations. You'll most likely get a regex that matches that format
perfectly but fails on the other formats you didn't think were important
enough to mention.
With Real Data: Regex authors don't
have to guess. They can look at your data, analyze it themselves, and
even test edge cases and address issues you may have overlooked.
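To make the made-up sample case concrete, here is a Python sketch of the kind
of answer it tends to get. The pattern is my guess at what a helper might
write after seeing only “(555) 555-5555”; it is not anyone's actual answer.

```python
import re

# Written to fit the made-up sample "(555) 555-5555" and nothing else.
MADE_UP_FIT = re.compile(r"\((\d{3})\) \d{3}-\d{4}")

def area_code(text):
    m = MADE_UP_FIT.search(text)
    return m.group(1) if m else None

print(area_code("(555) 555-5555"))      # 555   (the made-up sample)
print(area_code("Mom (555) 123-4567"))  # 555
print(area_code("Sis 1-555-234-4567"))  # None
print(area_code("Friend 5553456789"))   # None
```

It handles the made-up sample perfectly and misses two of the three real
lines.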
I hope this simple example shows how
the same problem can have different approaches all because of the
source. Data in the wild is going to be much more complex than
this, which will only magnify the issue.
Now sometimes you can't provide real
data because it's either too large or private. With a large chunk of
data, simply clip a portion containing what you want to match.
With private data, making up a sample is almost excusable, but don't make
up simple ones. Garble the real data but preserve the format: change
names to literary or TV characters, and change addresses to the address
of some public office (city hall, post office) or landmark. Yes,
this is essentially a made-up sample, but if you substitute values
your samples won't be as bland and will be closer to your real data and edge
cases.
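If you have a lot of lines to disguise, part of the garbling can be
scripted. The Python sketch below is only one way to do it and covers only
digits: it swaps every digit for a random one and leaves punctuation,
spacing, and letters alone, so the format survives. The garble_digits name
and the fixed seed are my own choices; names and addresses would still need
substituting by hand as described above.

```python
import random
import string

def garble_digits(text, rng):
    # Replace each ASCII digit with a random digit; keep everything else.
    return "".join(
        rng.choice(string.digits) if ch in string.digits else ch
        for ch in text
    )

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
real = "Mom (555) 123-4567"
fake = garble_digits(real, rng)
print(fake)  # same layout as the real line, different digits
```

One caveat: this scrambles every digit, including ones that are part of the
format rather than the data, such as the leading 1 in 1-xxx-xxx-xxxx. Check
the garbled output by eye before posting it.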
Understanding your source data is
crucial to your regex construction whether you are writing it
yourself or asking for help.