Today's regular expression due for a tidy-up is one that parses default
Apache-style log entries, though it only matches particular types of request.
Before:
/^([a-z0-9\.\-]+)\s-\s-\s\[[^\]]+\]\s\"GET\s[0-9a-z\.\-_\/]+_
([a-z0-9]+)\.(?:u)?deb\sHTTP\/1\.[0-1]\"\s20(?:0|6)\s([0-9]+)
\s\"[^\"]+\"+\s\"([^\"]+)\"$/
After:
m!^
([a-z\d.-]+) # Remote IP/host
\s-\s-\s # User/group
\[[^]]+\]
\s
"GET \s [\da-z._/-]+ _ ([a-z\d]+) \.u?deb \s HTTP/1\.[01]" # GET request
\s
20[06] # OK, Partial Content
\s
(\d+) # Response size
\s
"[^"]+" # Referrer
\s
"([^"]+)" # User Agent
$
!ix
Bit of a difference, isn't it?
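The key thing to note is the use of /x mode and comments. Under /x the regex
engine ignores unescaped whitespace in the pattern and treats # as the start
of a comment, which is why every literal space is written as \s. At
very little cost they give great returns on the readability and maintainability
of the expression. We can actually see which bits correspond to what and could
happily change it if we decided to change LogFormat.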
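To see the tidied expression earn its keep, here is a minimal sketch that runs it against a single log line. The log entry and the variable names are made up for illustration; only the pattern comes from the "After" version above, with one extra comment on the timestamp line.

```perl
use strict;
use warnings;

# Hypothetical log entry, invented for this example.
my $line = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
         . '"GET /pool/main/f/foo/foo_1.0-1_amd64.deb HTTP/1.1" 200 51234 '
         . '"-" "Debian APT-HTTP/1.3 (2.6.1)"';

# In list context the match returns the three captured groups.
my ($host, $arch, $size, $agent) = $line =~ m!^
    ([a-z\d.-]+)    # Remote IP/host
    \s-\s-\s        # User/group
    \[[^]]+\]       # Timestamp
    \s
    "GET \s [\da-z._/-]+ _ ([a-z\d]+) \.u?deb \s HTTP/1\.[01]"  # GET request
    \s
    20[06]          # OK, Partial Content
    \s
    (\d+)           # Response size
    \s
    "[^"]+"         # Referrer
    \s
    "([^"]+)"       # User Agent
    $
!ix;

print "$host $arch $size $agent\n";
```

Because the match is in list context, the captures land straight in named variables, which keeps the calling code as readable as the pattern.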
Also, we minimize escaping. Escaping characters is a
terrible thing to have to do and increases the general level of
noise in a regex. So, first of all, we change the
delimiters. The /.../ are merely traditional. All
good regex implementations let you use whatever you like.
Even some bad ones let you. With ! as the delimiter, the
slashes in the pattern no longer need backslashes. Second, we
remove the backslash from every character that doesn't need
it. Double quotes don't need them. Characters inside a
character class, except for ], \, ^ and -, don't need them
either.
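A small sketch of the two points about delimiters and backslashes. The request string and both patterns are invented for this example; they are not taken from the log regex itself.

```perl
use strict;
use warnings;

# Hypothetical request line, invented for this example.
my $request = 'GET /debian/pool/foo.deb HTTP/1.1';

# Traditional delimiters: every slash needs a backslash, and the
# character class is escaped out of habit.
my $noisy = $request =~ /^GET\s\/debian\/pool\/[a-z\.\-]+\.deb\sHTTP\/1\.1$/;

# Alternative delimiters: slashes are ordinary characters, and
# . and - need no backslash inside the character class.
my $quiet = $request =~ m!^GET\s/debian/pool/[a-z.-]+\.deb\sHTTP/1\.1$!;

print "noisy: ", ($noisy ? "match" : "no match"), "\n";
print "quiet: ", ($quiet ? "match" : "no match"), "\n";
```

Both patterns describe exactly the same strings; the second is simply easier on the eyes.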
What many people don't know is that it's possible to use
], [ and - in a class without
escaping them. The trick is to put them at the back (in the
case of -) or at the front (for all of them). You
can't have a range with no start or no end, thus having
- at the start or end meaning exactly -
makes sense. And an empty character class makes little
sense, so [[] actually matches [, just as
[]] matches ]. Similarly, [^]]
matches any character except ]. To match ] or [,
you can use [][], which looks odd but works.
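The bracket and hyphen placement rules are easy to check for yourself. The following is a minimal sketch with made-up input strings:

```perl
use strict;
use warnings;

# ] first and - last are both literal, so no backslashes are needed.
my @brackets = '[a-b]' =~ /[][]/g;      # every "[" or "]"
my @dashes   = 'a-b'   =~ /[a-z-]/g;    # every lowercase letter or "-"

# [^]]+ matches a run of anything except "]".
my ($stamp) = '[10/Oct/2023:13:55:36 +0000]' =~ /\[([^]]+)\]/;

print "@brackets\n";
print "@dashes\n";
print "$stamp\n";
```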
(?:u)? is a strange construct, which is
equivalent to u?. (?:0|6) similarly maps
to [06].
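If you'd rather not take the equivalence on trust, a quick sketch like this compares the old and new forms side by side. The file names and status codes are arbitrary test values chosen for the example.

```perl
use strict;
use warnings;

my @results;

# (?:u)? versus u?: a group around a single character adds nothing.
for my $name (qw(foo.deb foo.udeb foo.uudeb foo.xdeb)) {
    my $grouped = $name =~ /\.(?:u)?deb$/ ? 1 : 0;
    my $plain   = $name =~ /\.u?deb$/     ? 1 : 0;
    push @results, [$name, $grouped, $plain];
}

# (?:0|6) versus [06]: alternation between single characters is a class.
for my $status (200, 206, 204) {
    my $grouped = $status =~ /^20(?:0|6)$/ ? 1 : 0;
    my $plain   = $status =~ /^20[06]$/    ? 1 : 0;
    push @results, [$status, $grouped, $plain];
}

printf "%-10s grouped=%d plain=%d\n", @$_ for @results;
```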
With those and some other minor tweaks, the regex is
readable, maintainable and reusable. In Perl, one can
use the qr// operator to store a regex in a
variable for later use, which makes it easy to build a
library of common regular expressions, such as the
Regexp-Common distribution. Isn't that great?
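As a rough sketch of the qr// idea, the fragments below are stored in variables and then composed into a larger /x pattern. The variable names and the sample line are hypothetical, and this hand-rolled approach is not how Regexp-Common itself is organised.

```perl
use strict;
use warnings;

# Hypothetical reusable fragments, each compiled once with qr//.
my $status_re = qr/20[06]/;
my $quoted_re = qr/"[^"]+"/;

# Fragments interpolate into a bigger pattern like any other variable.
my $tail_re = qr!
    \s ($status_re)   # Status code
    \s (\d+)          # Response size
    \s $quoted_re     # Referrer
    \s ($quoted_re)   # User Agent, captured with its quotes
    $
!x;

# Made-up tail of a log entry.
my $line = '"GET /x HTTP/1.1" 206 1024 "-" "curl/8.0"';
my ($status, $size, $agent) = $line =~ $tail_re;

print "$status $size $agent\n";
```

Each fragment keeps its own flags when interpolated, so a case-insensitive piece can sit inside a case-sensitive whole without surprises.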