
qr{ ^ \QIain Truskett's Regex Blog\E $ }x

Regularly expressing

  • Making regex dynamically

    Many people only ever use regex as static input: you write the regex into your code exactly as you intend to use it. Sometimes, though, it's useful to build one on the fly.

    Most languages treat a regex as just a string. Some also let you compile it into an object, which saves recompiling it for repeated matching. In Perl, for example:

    my $date_RE = qr/$days_RE, \s+ $months_RE \s+ \d+(?:th|st|nd|rd)/x;

    This neatly makes a regex that can be passed around like any other variable, and also means that any regex that uses this one is more legible. Notice the use of $days_RE and $months_RE inside it. (Note on variable names: I don't normally use Hungarian Notation, but for some reason I tend to label my regex objects with a _RE suffix.)
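
    As a rough illustration of passing a compiled regex around, here is a minimal sketch. The two short alternations and the has_date helper are hypothetical stand-ins, not taken from the original program; only the shape of $date_RE follows the line above.

```perl
use strict;
use warnings;

# Hypothetical building blocks, much shorter than the real lists.
my $days_RE   = qr/(?:Monday|Tuesday)/;
my $months_RE = qr/(?:January|February)/;

# Same shape as the date regex above; literal whitespace is ignored under /x.
my $date_RE = qr/$days_RE, \s+ $months_RE \s+ \d+(?:th|st|nd|rd)/x;

# A compiled regex is an ordinary scalar, so it can be a function argument.
sub has_date {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

print has_date('Posted on Monday, January 5th', $date_RE), "\n";   # prints 1
print has_date('Posted on Friday, March 1st',   $date_RE), "\n";   # prints 0
```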

    That's not really very dynamic though. But where did $days_RE and $months_RE come from?

        my $days_RE = make_re( qw(
            Monday Tuesday Wednesday Thursday Friday Saturday Sunday
        ));
        my $months_RE = make_re( qw(
            January February March April May June July
            August September October November December
        ));
    

    What's make_re? A simple function that makes a regex that matches any of the given elements of the array it is passed.

    sub make_re { qr/(?:@{[ join( '|', @_ ) ]})/ }

    At which point you think "Damn, Perl looks ugly!". Mind you, I'd like to see the code for that in other languages. The most important thing is that it's hidden away in a (badly named) subroutine so the actual code you'd be concentrating on is more legible. And, in turn, regex that use these variables are more legible and involve less repetition. This can only be a good thing.

    make_re doesn't create particularly efficient regex but, in this case, that doesn't really matter. If you want more efficient ones, Jarkko Hietaniemi wrote Regex::PreSuf: a nifty module that creates optimised regex to do this sort of matching. It would just be overkill for this particular program. If we suddenly get a swag of new months or days where there is more overlap in the leading characters, then it would make sense to switch to using this module.
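
    One thing make_re does not do is escape its arguments, so an element containing a metacharacter such as . would be read as regex syntax. A minimal sketch of a more defensive variant follows. The name make_safe_re and the sample suffixes are hypothetical, and this is not what Regex::PreSuf does: it only escapes each alternative and tries longer ones first, with no prefix factoring.

```perl
use strict;
use warnings;

# Hypothetical variant of make_re: escape each alternative and put the
# longest ones first so a shorter alternative cannot shadow a longer one.
sub make_safe_re {
    my @alternatives =
        map  { quotemeta }
        sort { length($b) <=> length($a) || $a cmp $b } @_;
    my $pattern = join '|', @alternatives;
    return qr/(?:$pattern)/;
}

my $tlds_RE = make_safe_re(qw( .co .com .co.uk ));

# The dots are literal, so "exampleXcom" is rejected.
print "example.com" =~ /\Aexample$tlds_RE\z/ ? "match\n" : "no match\n";
print "exampleXcom" =~ /\Aexample$tlds_RE\z/ ? "match\n" : "no match\n";

# Longest-first ordering means .co.uk is tried before .co.
my ($suffix) = "example.co.uk" =~ /\Aexample($tlds_RE)/;
print "$suffix\n";   # prints .co.uk
```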

    For those wondering, this is the program I was thinking of when making those regex. They're reasonably strict regex so that if the page changes, I find out rather than have the program start guessing its way around.

  • Apache log file parsing

    Today's regular expression that's due for a tidy-up is one that parses default Apache-style log entries, though it only matches particular types of request.

    Before:

    /^([a-z0-9\.\-]+)\s-\s-\s\[[^\]]+\]\s\"GET\s[0-9a-z\.\-_\/]+_
    ([a-z0-9]+)\.(?:u)?deb\sHTTP\/1\.[0-1]\"\s20(?:0|6)\s([0-9]+)
    \s\"[^\"]+\"+\s\"([^\"]+)\"$/

    After:

    m!^
        ([a-z\d.-]+)   # Remote IP/host
        \s-\s-\s       # identd logname and remote user (both "-")
        \[[^]]+\]      # Timestamp
        \s
        "GET \s [\da-z._/-]+ _ ([a-z\d]+) \.u?deb \s HTTP/1\.[01]"   # GET request
        \s
        20[06]  # OK, Partial Content
        \s
        (\d+)   # Response size
        \s
        "[^"]+"  # Referrer
        \s
        "([^"]+)" # User Agent
        $
        !ix
    

    Bit of a difference, isn't it?

    The key thing to note is the use of /x mode and comments. At very little cost they give great returns on the readability and maintainability of the expression. We can actually see which bits correspond to what and could happily change it if we decided to change LogFormat.

    Also, we minimize escaping. Escaping characters is a terrible thing to have to do, and it increases the general level of noise in a regex. So, first of all, we change the delimiters. The /.../ are merely traditional. All good regex implementations let you use whatever you like. Even some bad ones let you. Second, we remove the backslash from every character that doesn't need it. Double quotes don't need them. Characters inside a character class, except for ] and -, don't need them either.

    What many people don't know is that it's possible to use ], [ and - in a class without escaping them. The trick is position: ] has to come first (straight after the ^ in a negated class), - can come first or last, and [ is generally safe wherever it sits. You can't have a range with no start or no end, so a - at the start or end can only mean a literal -. And an empty character class makes little sense, so []] matches ], just as [[] matches [. Similarly, [^]] matches any character except ]. To match ] or [, you can use [][], which looks odd but works.
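
    These placement rules are easy to check. The following is a small sketch that counts matches for each class against a made-up sample string; the string and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample text: 15 characters, two of each bracket, one hyphen.
my $text = 'list[0]-list[1]';

# The "= () =" idiom counts how many times a /g match succeeds.
my $open      = () = $text =~ /[[]/g;      # literal [
my $close     = () = $text =~ /[]]/g;      # literal ], placed first in the class
my $either    = () = $text =~ /[][]/g;     # ] or [
my $not_close = () = $text =~ /[^]]/g;     # anything except ]
my $lower_dash = () = $text =~ /[a-z-]/g;  # lowercase letter or literal -

print "open=$open close=$close either=$either ",
      "not_close=$not_close lower_dash=$lower_dash\n";
# prints: open=2 close=2 either=4 not_close=13 lower_dash=9
```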

    (?:u)? is a strange construct, which is equivalent to u?. (?:0|6) similarly maps to [06].

    With those and some other minor tweaks, the regex is readable, maintainable and reusable. In Perl, one can happily use the qr// operator to store a regex in a variable for later use, which means it's perfectly possible to build a library of common regular expressions, such as the Regexp-Common distribution. Isn't that great?
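
    To make the reuse concrete, here is a sketch that stores the tidied pattern with qr and applies it to one log entry. The sample log line, the variable name $deb_request_RE and the parse_deb_request helper are all hypothetical; only the pattern itself follows the "After" version above.

```perl
use strict;
use warnings;

# The tidied pattern, compiled once and kept in a variable.
my $deb_request_RE = qr!^
    ([a-z\d.-]+)        # Remote host
    \s-\s-\s            # identd logname and remote user, both "-"
    \[[^]]+\]           # Timestamp
    \s
    "GET \s [\da-z._/-]+ _ ([a-z\d]+) \.u?deb \s HTTP/1\.[01]"
    \s
    20[06]              # OK, Partial Content
    \s
    (\d+)               # Response size
    \s
    "[^"]+"             # Referrer
    \s
    "([^"]+)"           # User agent
    $
!ix;

# Hypothetical helper: returns a hash ref of the captures, or nothing.
sub parse_deb_request {
    my ($entry) = @_;
    my ($host, $arch, $bytes, $agent) = $entry =~ $deb_request_RE
        or return;
    return { host => $host, arch => $arch, bytes => $bytes, agent => $agent };
}

# Made-up log entry in the default Apache combined format.
my $line = '192.0.2.10 - - [10/Oct/2004:13:55:36 +1000] '
         . '"GET /debian/pool/main/f/foo/foo_1.0-1_i386.deb HTTP/1.1" '
         . '200 12345 "-" "APT-HTTP/1.3"';

if (my $hit = parse_deb_request($line)) {
    print "$hit->{host} fetched an $hit->{arch} package ($hit->{bytes} bytes)\n";
}
```

    Because the pattern is strict, an entry in any other shape simply fails to match and the helper returns nothing.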
