
ShowUsYour<Regex>

Irregular Expressions Regularly

Is RegExLib full of "it"?

Today I read a comment from Randal L. Schwartz on Jeffrey Schoolcraft's regex blog that I felt I needed to respond to. As I started writing my reply I realized that this is probably news that needs to be publicly visible, so I'm posting it to my blog and cross-referencing the original comment. First, here is Randal's comment:


Yup. I continue to downvote and negative-comment nearly every entry at "regex lib".

Do not validate email addresses with a regex (unless it's the full regex, as you point out).

Do not parse HTML with a regex. HTML is surprisingly complex.

Do not validate a date with a regex. All these regex I see that try to compute the number of days of february based on the year number just have me going "WTF!".

These are NOT regex tasks. These are dedicated tool tasks.

And yet, "regexlib" is full of them. And full of "it", if you know what I mean.


Randal...

I hear your pain. As the lead developer of RegExLib I also see the problems that you are mentioning and, at present, we haven't really provided a good enough toolset for newbies to help themselves properly. Should newbies be randomly using regexes that they find on the site? Dunno. That's for another argument.

We implemented the rating and comment system in the middle of last year to try to give some indication of the value of individual patterns, so I'm extremely grateful that diligent members of the community such as yourself are helping out by casting your votes. We also implemented an RSS feed for the comments so that comments such as yours are given public visibility - http://www.regexlib.com/RssComments.aspx

It's a hard battle to win as RegExLib continues to grow and, as of today, contains nearly 1000 expressions. There's good news though. Over the past couple of months a lot of effort has gone into solving these problems and, as a result, users of the site will see a vastly improved set of tools to help deal with some of the problems that you've mentioned.

To give you a quick example, one of the new features will provide users with a shortcut way of finding useful AND ACCURATE expressions by offering a box which says: "Enter N examples of what you want to match and N that you don't, and we'll provide you with a list of patterns that match your requirements". This will help to remove the hit-and-miss element of a NOOB scanning through 1000 patterns to find the veritable needle in the haystack.
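As a rough illustration of how that kind of example-driven lookup could work (this is only a sketch, not the code behind the upcoming feature; the `LIBRARY` entries and the `find_patterns` name are made up for the example):

```python
import re

# Hypothetical stand-in for the pattern library. These entries are
# illustrative only and are not taken from RegExLib.
LIBRARY = {
    "us-zip": r"\d{5}(-\d{4})?",
    "five-or-more-digits": r"\d{5,}",
    "any-digits": r"\d+",
}


def find_patterns(library, should_match, should_not_match):
    """Return the names of patterns that fully match every positive
    example and fully match none of the negative examples."""
    hits = []
    for name, source in library.items():
        regex = re.compile(source)
        accepts_all = all(regex.fullmatch(s) for s in should_match)
        rejects_all = not any(regex.fullmatch(s) for s in should_not_match)
        if accepts_all and rejects_all:
            hits.append(name)
    return hits


# Two strings the user wants matched, two that must be rejected.
print(find_patterns(LIBRARY, ["60601", "60601-1234"], ["6060", "606011"]))
```

The more examples the user supplies on both sides, the fewer patterns survive the filter, which is the point: the negative examples do as much work as the positive ones.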

The tools that allow users to manage their expressions are also getting an improvement, so hopefully pattern authors will be more responsive in adjusting their patterns based on feedback received.

I hope that, once you see the new features for yourself, you will agree with me that RegExLib is a much more valuable resource than it is today.

Published Friday, April 01, 2005 5:46 PM by digory

Comments

 

digory said:

Maybe you could put something like what I just said in your FAQ, or somewhere else that someone must read before they post.

WE WILL NOT ACCEPT REGEX FOR THE FOLLOWING:

- email validation (see RFC 2822/822 for why this is hard)
- date validation (especially if you don't consider M/D/Y vs D/M/Y)
- HTML parsing of ANY kind

And then just keep updating the list as other "hard" problems come up.

And actively *delete* anything in these "hard" categories. Don't make me downrate them, just to have some other clueless fool come along and upvote it because it "works" for them.

*That* would greatly increase the quality of your pattern database. Really.

If you give me rights to delete (make it a big button that just makes the regex go away and email the original submitter with the FAQ), I'll be happy to help you police the site.

I'm not a guy who just complains without being willing to help. Here's my offer to help.
April 2, 2005 1:42 AM
 

digory said:

While I agree that regular expressions are not appropriate for solving every validation problem, I'm not on board with the "don't ever use a regex for this type of problem" attitude. I think there are other factors that influence whether a regex approach is a good or bad thing. I have written regexes of each type mentioned above, and their usefulness, or lack of it, is not always as straightforward as people seem to want it to be. I agree that email validation with a regex is not a good idea in general, because it is impossible to write a regex that is 100% compliant with RFC 2822, but if you are only working with a subset of mailboxes it may be a viable option. And as bad as email validation with regexes is, it is a better candidate than phone numbers. As difficult as the regex may be, at least email is an international standard. Phone number formats vary from country to country. Yet phone numbers are probably one of the most common examples of uses for regexes.
I've written many date regexes. How useful or not they are is something for the people actually using them to decide. However, I have personally learned more about regexes in writing them than I could have ever learned from writing "simple" regexes. This includes how certain features actually work, as well as various bugs in an implementation of regular expressions. I've also learned about the good and bad of some non-regex date validators which might be considered better alternatives. Several people have written stuff based on my work; hopefully they've learned something other than how to cut and paste.
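For comparison, here is a minimal sketch of what a non-regex date validator can look like, using Python's standard `datetime` module. The `is_valid_date` helper and its default format are assumptions chosen for the example, not something from RegExLib or from any of the validators mentioned above:

```python
from datetime import datetime


def is_valid_date(text, fmt="%m/%d/%Y"):
    """Return True if text is a real calendar date in the given format.

    The default format (month/day/year) is an assumption for this
    example; pass "%d/%m/%Y" for day/month/year input.
    """
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for wrong formats and for impossible dates alike.
        return False


print(is_valid_date("02/29/2004"))              # leap year
print(is_valid_date("02/29/2005"))              # not a leap year
print(is_valid_date("13/01/2005"))              # month 13 under M/D/Y
print(is_valid_date("13/01/2005", "%d/%m/%Y"))  # 13 January under D/M/Y
```

The leap-year arithmetic lives in the library rather than in the pattern, and the M/D/Y versus D/M/Y question becomes an explicit parameter instead of something baked into the expression.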
But I have to say I disagree with the HTML parsing point the most. I think this is really a problem-specific case, and this is true of other types of markup. A lot of common things people want to do with HTML, such as document navigation, shouldn't be done with a regex, but HTML editors with regex find-and-replace support can be a godsend.

I think the rating system on RegExLib does fall a little short because there is too little thought put into some of the positive ratings and too little evaluation and feedback put into some of the negative ones. To me, a negative rating with no example and/or feedback is pointless.
I do think a FAQ is a good thing, but I don't think banning certain problem sets is a good idea. You may learn more from writing "hard" regexes that you almost never use than from the simple ones you use daily. Maybe give a strong warning of the common pitfalls of those problem sets and some general alternatives. I often see people suggest a .NET alternative to regex problems, but that only helps if you are working with .NET. If you are not, then the regex option starts to look better again. If you have a code alternative and don't know what language the person is using, just offer a pseudocode solution.
April 2, 2005 4:18 AM
 

digory said:

I wrote a big, long comment, but decided it was too long so I just blogged it instead:
http://regexadvice.com/blogs/ssmith/archive/2005/04/01/3258.aspx
April 2, 2005 8:55 AM
 

digory said:

Well, I'm tired of web forms invalidly rejecting my very valid email of "fred&barney@stonehenge.com". There's no f'ing reason to reject that. It's valid RFC822. If you're gonna play on the net, play nicely.
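To make the complaint concrete, here is a small sketch of how an overly strict pattern ends up rejecting that address. Both patterns are hypothetical examples written for this illustration; neither is taken from RegExLib, and the looser one is only a sanity check, not RFC 822 validation:

```python
import re

# Hypothetical example of a strict pattern of the kind often pasted
# into web forms: the local part allows only letters, digits, ".",
# "_" and "-".
NAIVE_EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Looser sanity check: exactly one "@", a non-empty local part, and a
# domain containing a dot. It says nothing about deliverability.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

address = "fred&barney@stonehenge.com"

print(bool(NAIVE_EMAIL.fullmatch(address)))  # False: "&" is not in the class
print(bool(LOOSE_EMAIL.fullmatch(address)))  # True
```

The strict pattern fails only because "&" was left out of its character class, which is exactly the kind of false rejection described above.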

I've made comments in the other blogs in this thread as well.
April 2, 2005 11:19 AM
 

digory said:

Some problems do not require bulletproof solutions. A regex for parsing internet addresses may be a little bit naughty, but it can be implemented quickly and cheaply. I can imagine dozens of uses for such a regex. In many applications only a small range of dates would be accepted. Of course the date regex might not be accurate in a universal sense, but that does not mean it could never have application.

As Michael Ashe mentioned, validating telephone numbers is even more difficult than email because each nation adopts its own "standard". Yet, many business applications would need to parse only numbers included in the North American Numbering Plan. Or in metropolitan Chicago.
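A deliberately narrow pattern of that kind might look like the sketch below. It assumes the usual NANP shape of a three-digit area code and a three-digit exchange that each start with 2-9, followed by a four-digit line number; the accepted separators and the `normalize_nanp` helper are choices made for this example, not a complete treatment of the numbering plan:

```python
import re

NANP = re.compile(
    r"""
    (?:\+?1[ .-]?)?   # optional country code 1
    (\()?             # optional opening parenthesis (group 1)
    ([2-9]\d{2})      # area code, first digit 2-9
    (?(1)\))          # closing parenthesis only if one was opened
    [ .-]?            # optional separator
    ([2-9]\d{2})      # exchange, first digit 2-9
    [ .-]?            # optional separator
    (\d{4})           # line number
    """,
    re.VERBOSE,
)


def normalize_nanp(text):
    """Return the ten digits of a NANP-style number, or None if the
    text does not fit the pattern above."""
    match = NANP.fullmatch(text.strip())
    if match is None:
        return None
    _, area, exchange, line = match.groups()
    return area + exchange + line


print(normalize_nanp("(312) 555-1212"))
print(normalize_nanp("+1 312.555.1212"))
print(normalize_nanp("+44 20 7946 0958"))
```

A pattern like this is useless for international numbers, but inside an application that only ever handles North American numbers that limitation is a stated scope rather than a bug.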

Rather than deleting all date/HTML/email/phone number regexes, perhaps it would be more useful to document the shortcomings of these patterns and explain why they should only be used where appropriate.

Randal Schwartz is justifiably annoyed that web forms reject his email address, but that is not so much a regex problem as a business logic problem: instead of rejecting the suspect email address, the site could send a confirmation request to it. BTW, just how would a person "properly" validate an email address?

Perhaps it would make sense to create a category of hard or not universally applicable regexes, with specific categories for email, phone and dates. If the heading of each gives a brief explanation of the caveats to their use and links to a page with more detailed explanations, then those who might have use for these patterns will have them. Caveat emptor.

April 8, 2005 2:51 AM
 

digory said:

I agree with Alan Lindsay: even the existing date regexps have utility. Maybe they just need better organization and/or categorization.
April 15, 2005 7:16 AM