
Michael Ash's Regex Blog

Regex Musings

One Regex to rule them all?

 

I don't think so.


Regular expressions have many uses. However, finding one regex that is good for every possible scenario for the type of data it matches is daunting.

Recently Justin Rogers used one of my creations, my datetime regex, as a negative example to try to prove a point. While I used it a couple of times as one of several examples for a different discussion, he chose to focus on this one. In defense of my expression, I really can't say I understand the points he's trying to make in regard to the regex or why he’s making the assumptions he's basing them on. He argues that because of its size and complexity its usefulness comes into question. Now you could say the same thing about Microsoft Word. Most people don't use features of Word that exceed what WordPad offers. MS Word has features most of its users will never use or even learn about. Does that diminish its usefulness? Unlike Justin, I've made no assumptions about how or why someone else will want to use this regex. I have several scenarios where I might have a use for it, and I wrote it to include the criteria I know I want or think I might want later, but I wouldn't in my wildest dreams think everyone would want to use it exactly the same way I would. If that happened it would be eerie. How useful something is depends on the situation it will be used in. A hammer is a useful tool if you are driving nails, not so much if you are sawing wood. If you are working with a language that has strong date handling, or at least enough to satisfy your needs, you may have no need for this regex. If that isn't the case, this regex may be just what the doctor ordered, or at least a very good jumping-off point. Furthermore, I didn't write this regex, take over the world, and declare all must use my regex. I wrote it just as much to be a modifiable example of what you could do with a regex and how certain syntax could be applied as I wrote it for actual application. You may learn something just from looking at the construction of this regex even if you never actually use it. Whether or not you use this expression should be based on what your needs are.
If it exceeds what you need, you may wish to simplify it or choose something else. It was never intended to be everything for everybody. The regex’s range is intentionally broad. You may never need its full scope, in fact you probably never will, but that's not the point. It’s there to use if you need it, and it allows the maximum flexibility. You can restrict its range and loosen its strictness if that’s what you need from it. Y2K should have shown that starting out with the minimum and then expanding is not always a simple task.


Another point he tried to make was about user feedback when using this regex. I don't know what this has to do with the price of tea in China. First off, who is “the user”? The person using the regex or the person using the application that uses the regex? If he's referring to the former, the most I can do as the author of a regex or code snippet or application is to supply documentation of how I intended it to work. I've provided a description, examples and in several cases links to more information on what’s being checked. Beyond that it's up to the person using it to make use of that information. I can try my best to communicate what the purpose of a regex is, but no matter how much information I provide I can’t make anyone look at it or ensure they will interpret it correctly. I've seen several cases of people complaining about a regex not working for a case that doesn't match its described use, such as an IP-matching regex not matching an email mailbox. Now if the user didn’t read the description, didn’t know what an IP address was or didn’t know the difference between the two, I don’t see how that falls on the author. If you are ignorant of what something does you are better off not using it unless you are willing to learn about it. That doesn't just apply to programming but to life in general. I would no more suggest using a regex when you don't know what it is checking than I would suggest you go down to the nuclear power plant and press all the buttons just to see what happens. You should know what your tools do if you plan to use them. Now if “user” refers to the end user, let me just say that I've never seen any sentient code that knows what the programmer/developer/designer meant for it to do and on its own wrote error messages for every possible bad input. Just as I have never seen a regex testing program explain to me why my input didn't match a regex I or someone else wrote. Feedback is solely at the discretion of the creator(s) of the application or function.
Whether you use a regex, a function or tea leaves to validate the data or match a pattern, feedback is a separate issue. Even if you just use the regex to return true or false, which isn't all you could use it for, there is nothing stopping you from giving a simple error message like “bad date” or a complex message consisting of a three-paragraph discussion on the history of calendar reform. Other than testing, I have no idea why you would have a regex sit on an island. If the user is testing this regex and thinks a value should pass and it doesn't, they may think your code is broken. That may be true and maybe your code is wrong. But that is something the user will need to determine. If they think you are wrong they need to verify they were right. If someone thinks they are always right, well, they’re just delusional. If they merely assume they are right and seek or create something that agrees with what they think, then they will be accepting bad data. I had several people incorrectly tell me what qualified as a leap year and that my regex failed their assumptions. The fact they thought they were right didn't in any way redefine the actual leap year rule. If a user needs something that accurately validates leap years they need to know what that entails.
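
For reference, the Gregorian leap-year rule that these disputes usually turn on fits in a few lines. This is a minimal Python sketch of the Gregorian rule only; the function name `is_gregorian_leap_year` is hypothetical, and the datetime regex may apply different criteria to earlier, pre-reform dates.

```python
def is_gregorian_leap_year(year: int) -> bool:
    """Return True if `year` is a leap year under the Gregorian rule."""
    # Divisible by 4, except century years, unless also divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


if __name__ == "__main__":
    for year in (1900, 2000, 2004, 2005):
        print(year, is_gregorian_leap_year(year))
```

The common mistake is to stop at "divisible by 4", which wrongly accepts century years such as 1900.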


Finally, the assertion that you could write a coded equivalent in fewer characters than the regex uses is presumptuous and relative. Whether or not that is true would depend on at least two factors: the language being used and the skill level of the person using it. Anybody can write inefficient code. Not everyone can write good, tight code. The complexity of the language would also be a factor in how many or how few characters you need. Why he is even counting characters is beyond me, since he's been arguing for writing more verbose regexes with simpler syntax rather than shorter expressions with advanced syntax. Now we are character counting? You can write code for anything you would write a regex for. In some cases code is better, sometimes it's not, but either way the choice of what is best is that of the person using it.


Published Saturday, August 14, 2004 12:09 AM by mash

Comments

 

mash said:

Just to explain the character counting. In code, counting characters can often be a great way to examine the number of operations that will be required by the code in hand. Things like name lengths can damage this metric, but consistent use of similarly sized names can reduce the damage.

With that in mind, regular expressions don't suffer the same beast. Each separate construct turns into more string parsing operations. Just to note, your datetime expression saturates the underlying string with access after access, constantly comparing the same characters over and over again, often against the same condition. This is what you call backtracking, and Darren has done an excellent job of examining the problems with backtracking.

So yes, my knowledge of expressions extends all the way into the very string processing routines that underlie the expression code you write. I know every forward scan, backward scan, and comparison that happens under the hood. I know how many nodes the parse tree ends up being. I know how many functions are called by running your expressions. At the end, that allows me to do things like write large expressions using simpler syntax that beat the pants off of more complex expressions that might be a bit smaller but wind up as more string processing on the other side.

I still stick by the user feedback goal as the largest of goals. If your expression fails you've just done a bunch of work to get 0 information except a boolean yes/no as to whether or not something was matched. You obviously have a bunch of conditions that need to be met, and if you wrote a parser/validator each of those conditions could be processed in turn. At the moment one of them fails you could provide meaningful feedback to the user. What happens if the user types:

07/32/2004

What do you do? You fail, does the user know why? Not really. A functional validator would tell them the day was out of range. What about:

02/29/2005

Again a functional validator would alert them that the 29th day of 2005 doesn't exist and might even explain leap years.

If you are going to take the time to have the user hand input a string, telling them valid/invalid just isn't going to cut it.
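
A minimal Python sketch of the kind of field-by-field validator described above, assuming MM/DD/YYYY input. The function name `explain_date` and the message wording are hypothetical, and the day check relies on `calendar.monthrange`, which applies the Gregorian leap-year rule only.

```python
import calendar
import re

# Shape check only: 1-2 digits, slash, 1-2 digits, slash, 4 digits.
_SHAPE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")


def explain_date(text: str) -> str:
    """Return "ok", or a message naming the first check that failed."""
    match = _SHAPE.fullmatch(text)
    if match is None:
        return "expected MM/DD/YYYY"
    month, day, year = (int(part) for part in match.groups())
    if not 1 <= month <= 12:
        return f"month {month} is out of range (1-12)"
    # Number of days in this month of this year (Gregorian rule).
    days_in_month = calendar.monthrange(year, month)[1]
    if not 1 <= day <= days_in_month:
        return (
            f"day {day} is out of range for "
            f"{calendar.month_name[month]} {year} (1-{days_in_month})"
        )
    return "ok"


if __name__ == "__main__":
    for sample in ("07/32/2004", "02/29/2005", "02/29/2004"):
        print(sample, "->", explain_date(sample))
```

Each check runs in turn and the first failure decides the message, which is what lets the two inputs above get different explanations.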

To note, I can guarantee in the time you've written your complex expression a relatively novice JScript programmer could have written date parsing logic with more feedback.

Date.parse() does most of the work for you. Extending the functionality to remove invalid dates can be done by checking the parsed year/month/day fields against a preset list of invalid values.

What about .NET? DateTime.Parse and DateTime.ParseExact.

What about Perl? Well the expression masters themselves seem to opt for functional parsers though they use a number of separate expressions in parsing portions of the string to provide actual user feedback about the failure. HTTP::Date->parse_date
August 14, 2004 3:32 AM
 

mash said:

I do have to point out a most excellent optimization on your part. The assertion you make in the beginning to force a digit is pretty smart. It saves you a lot of processing in the case that the first character isn't a digit at all. Some knowledge of character classes might help here though. Each character class is resolved using a binary search of the characters in the underlying string. This means a bunch of math and several comparisons. For small character classes the amount of work is always maximized, while larger character classes get a bigger performance reward (\d is conceptually slower than \w).

How much processing does it save you? Well, by looking at your expression for <month> you can see that something like:

0?[13578]

Actually does quite a bit of work.
Op 1: Match 0
Op 2: Fail or Succeed Match 13578
Op 3: Fail then Backtrack Else Done
Op 4: Match 13578

You can have up to 4 operations there for that single expression because of the backtracking involved. Now, you add that into your alternation groups and you get 12 ops total just checking the first characters for the month. If we assume the first character is not a digit then we can get rid of two of those checks because the second op for matching 1[02] and 11 will never occur (aka [02] won't be attempted nor will the second 1 in 11). That puts you down to 10 ops. Of those ops 2 are character classes. So you already have at least one more character class checking the month than you do in your digit assertion.

Now, running some performance numbers you can find out how badly the (?=\d) is hurting your perf in the success case (about 3%):

Date Expression Compiled 00:00:03.4048960, True
Date Expression Compiled No Check 00:00:03.2847232, True

And how much it saves you in the failure case (about 94% nearly doubling your throughput):

Date Expression Compiled 00:00:00.5407776, False
Date Expression Compiled No Check 00:00:01.0515120, False

Actually a nice case of where assertions can increase your performance for certain scenarios.
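
One way to check this kind of effect on your own engine is to time the same pattern with and without the leading assertion. The sketch below uses Python's `re` and `timeit` with a small stand-in month/day/year pattern (hypothetical, not the datetime regex under discussion), so the numbers will differ from the .NET figures above and may not show the same gain.

```python
import re
import timeit

# Stand-in date pattern for illustration; NOT the datetime regex from the post.
BODY = r"(?:0?[13578]|1[02]|0?[469]|11|0?2)/(?:0?[1-9]|[12]\d|3[01])/\d{4}$"
WITH_CHECK = re.compile(r"(?=\d)" + BODY)  # fails early on a non-digit
NO_CHECK = re.compile(BODY)


def time_match(pattern: re.Pattern, text: str, runs: int = 200_000) -> float:
    """Return the seconds taken to call pattern.match(text) `runs` times."""
    return timeit.timeit(lambda: pattern.match(text), number=runs)


if __name__ == "__main__":
    for sample in ("07/31/2004", "not a date"):
        print(
            repr(sample),
            "with check:", time_match(WITH_CHECK, sample),
            "no check:", time_match(NO_CHECK, sample),
        )
```

Both patterns accept and reject the same inputs; only the amount of work done before a failure is expected to differ.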
August 14, 2004 5:23 AM
 

mash said:

PS > I missed the 5th wheel or 0?2 which would enforce another 4 operations. Note that finding a way to remove the multiple incantations of 0? would be glorious for performance because it would remove a number of backtracking problems.

Placing least costly expressions like 1[02] and 11 at the beginning of the alternation group will greatly increase your speed as well. Literal checks kill character classes. In fact it may be faster to do 1(?:0|2) than 1[02] (in fact it is by about 18-20%)
August 14, 2004 5:33 AM
 

mash said:

I still see the feedback issue as an apples-and-oranges argument. If you look at most of the existing canned date-validating functions, whether built into a language or created in code, the vast majority of them simply return a true/false result. You don't run across many, if any, that provide the level of feedback you are suggesting. This regex is the rule, not the exception, in that regard. Feedback to that degree on why a value failed is a secondary check that most simply don't provide. Can you? Of course, but no matter how you validate the true case, the extent to which the false case is detailed, if at all, is an optional second step.

Even if you just look at the .NET methods or functions dealing with dates, such as isDate(), it's not going to tell you why "9/31/2004" isn't valid, but the documentation does, or at least suggests why. The same is true for System.DateTime.Parse or even the System.DateTime constructor. Pass either an invalid date and they throw exceptions basically telling you your input isn't a date. Not why. The documentation provides those details.

And BTW, DateTime.Parse is not historically accurate, unlike the regex, which with one noted exception is. And you won't get any feedback, even in the documentation, telling you why.
August 14, 2004 5:30 PM
 

mash said:

The regex was actually designed to fail faster, not optimized for matching. And actually the initial assertion's primary purpose is to prevent a leading space with a time-only input. A leading space is required for a datetime input, but not allowed for just a time.
August 14, 2004 5:47 PM
 

mash said:

You truly don't understand parser/validator logic. They are two separate pieces of a puzzle. Your expression handles BOTH.

DateTime.Parse and Date.parse are BOTH parsers only, they are not true validators. Just look at the name of the method and they give you a really good CLUE. However, after running a DateTime.Parse, 10 additional lines of code can validate out all historically inaccurate dates. You said yourself, add the parts you need, make it more simple or more complex. Adding 10 lines of VALIDATION code with feedback is my point when working with a PARSER.
August 14, 2004 6:09 PM
 

mash said:

I understand the logic fine and I know exactly what my expression does. What you don't understand is the criteria the regex is based on. The parse method was just ONE example of many, and it does BTW perform some validation logic. Try parsing 2/29/2004 and 2/29/2005. One will work, the other will result in an error. Fine, throw out that example and plug in validating functions or methods like isDate or IsLeap; the same inaccuracy is there.
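
The same behavior is easy to see in other parsers. As a hedged illustration in Python rather than .NET, `datetime.strptime` accepts the first date and rejects the second; the helper name `try_parse` is hypothetical.

```python
from datetime import datetime
from typing import Optional


def try_parse(text: str) -> Optional[datetime]:
    """Parse M/D/YYYY text, returning None when strptime rejects it."""
    try:
        return datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        # Raised for malformed text and for impossible calendar dates alike.
        return None


if __name__ == "__main__":
    print(try_parse("2/29/2004"))  # a datetime for 29 February 2004
    print(try_parse("2/29/2005"))  # None
```

So even a function named as a parser is doing some calendar validation along the way, using whatever leap-year rule its library implements.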
August 14, 2004 11:43 PM
 

mash said:

Last comment here since my recent blog postings are holding their own weight in dissecting these ideas.

DateTime was never meant to handle historic oddities. However, System.Globalization.Calendar and the few pre-Whidbey and multiple post-Whidbey overrides were meant to handle these things. You can always ask the GregorianCalendar or the JulianCalendar if dates are of note.

I already pointed out that your method is a validator. Hence me saying "Your expression handles BOTH". Anyway, the data flow diagrams explain my point using much more accuracy so I'll retain any further argument in response to comments there.
August 15, 2004 3:06 AM
 

mash said:

Fine then I'll close with these points.
First, I never said DateTime was meant to handle historical oddities. I simply pointed out a difference between it and my regex. You were the one who introduced that class into the discussion, not I.
Second, I'll simply say my regex is different from those calendar classes as they are now. As I have no access to Whidbey I cannot prove or disprove whether that has changed, though I would doubt it since algorithms of that nature typically aren't based on the same criteria I was using.
Finally, as I stated before, the primary reason for constructing this version of the regex was never application but example, and certainly not your assumptions of its application. The level of accuracy it provides is there simply because I wish it to be. It far exceeds what most applications care about or will ever need. However, that does not prevent someone now or in the future, for whatever reason they deem necessary, from using it, because it is a working example. If you've learned anything even just from the examination of the regex then it has served its purpose.
August 15, 2004 11:44 AM