|
|
Regex Musings
-
Several years ago I decided to learn
Regular Expressions. I had been skimming an article about using
regex for validation on web pages. It was at least a couple of weeks
before I sat down and read the article in full. After I did, I wanted
to try to write a regex on my own but couldn't think of anything
original. Eventually I wrote a date regex that validated leap years
as well, something none of the other regexes I had seen at the time
did. In the spirit of the article that introduced regular
expressions to me, I thought this was something that could be used with
web page validation. But at the time I wasn't doing a whole lot of
web page development, and what little web development I was doing
didn't require me to use or validate dates.
I shared my regex on the web. I got a
lot of requests to do other date formats. At first I was reluctant to
do that because my original pattern was troublesome to write, mostly
because the text editor I was using didn't help me match parentheses
and I had a bunch of typos to work out. Plus I always thought it
would be easier to reformat the date into the format the regex was
checking.
Eventually someone else reverse
engineered my regex into another format. Later I wrote patterns for
other formats as I saw uses beyond web page validation. I expanded
the patterns as I learned more about regex. For various reasons I
eventually moved away from working on these types of patterns. I had
done about all I could with them. Other than fixing a bug, generally
caused by a typo, I didn't do much with these patterns.
Recently I got another request for an
alternate format for one of my old patterns. I referred them to my
central depot of date regexes. But something else in the request
reminded me of my original idea of handling other formats. So I
decided to actually do it. So here it is in JavaScript
function isValidDate(ds, format) {
  // Validates month-day-year dates (separator ., - or /), including leap years.
  var reDate = /^(?!(?:10([-./])(?:0?[5-9]|1[0-4])\1(?:1582)))(?:(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\2|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\3))(?:(?!0000)\d{4})$|^(?:0?2(\/|-|\.)29\4(?:(?:(?:(?!000[04]|(?:(?:1[^0-6]|[2468][^048]|[3579][^26])00))(?:(?:(?:\d\d)(?:[02468][048]|[13579][26]))))|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\5(?:(?!0000)\d{4}))$/;
  switch (format) {
    case "dmy":
      // Swap day and month so the string reads month-day-year.
      return reDate.test(ds.toString().replace(/(\d?\d)(\D)(\d?\d)(\D)(\d{4})/, "$3$2$1$4$5"));
    case "ymd":
      // Move the year to the end so the string reads month-day-year.
      return reDate.test(ds.toString().replace(/(\d{4})(\D)(\d?\d)(\D)(\d?\d)/, "$3$2$5$4$1"));
    default:
      // No format given: the string is tested as is.
      return reDate.test(ds);
  }
}
That's it. The function returns true if the string passed is a valid date and false if it is not.
isValidDate("11/1/2011") // true
isValidDate("31/10/2011") // false
isValidDate("31/10/2011","dmy") // true
isValidDate("2031/10/20","ymd") // true
The big regex is the date
validator. I've talked about that enough in the past, so I'm not
going to go into detail about how it works; just trust me that it
does. The basics are four-digit years and one- or two-digit months and days.
You have a choice of three date separators: a period (.), a hyphen (-)
or a forward slash (/). Whichever one you choose, you have to be
consistent. There have been a few modifications from my original date
regex to expand the time span it checks against and to exclude the missing
days from the switch from the Julian to the Gregorian calendar in 1582, but most apps
won't push up against the old boundaries.
The default format checked is
mm/dd/yyyy.
The function takes an optional second
parameter that lets you specify the other formats, yyyy-mm-dd or
dd-mm-yyyy, by passing “ymd” or “dmy” respectively.
If the second argument is passed, the
given alternate format is converted to the default format for testing
by one of two simple regex replace operations. Nothing
really fancy. The month, day and year are moved around to get to
mm-dd-yyyy format.
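To see the reordering on its own, here is a small sketch that pulls the two replace operations out of isValidDate. The helper name toMonthDayYear is made up for illustration; the two patterns and replacement strings are the ones used in the function above.

```javascript
// Hypothetical helper (not part of the original function): rewrites "dmy" or
// "ymd" input into month-day-year order, keeping the original separators.
function toMonthDayYear(ds, format) {
  var s = String(ds);
  switch (format) {
    case "dmy":
      // day, sep, month, sep, year -> month, sep, day, sep, year
      return s.replace(/(\d?\d)(\D)(\d?\d)(\D)(\d{4})/, "$3$2$1$4$5");
    case "ymd":
      // year, sep, month, sep, day -> month, sep, day, sep, year
      return s.replace(/(\d{4})(\D)(\d?\d)(\D)(\d?\d)/, "$3$2$5$4$1");
    default:
      return s;
  }
}

console.log(toMonthDayYear("31/10/2011", "dmy")); // "10/31/2011"
console.log(toMonthDayYear("2031-10-20", "ymd")); // "10-20-2031"
```

The replace only moves pieces around; it does no validation of its own, so a string that doesn't fit the expected shape is passed through unchanged and left for the big regex to reject.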
For this I just kept it simple and
handled the most common numerical date formats. For general web page
validation this function is perfectly viable. You could have taken my
date regexes for all the various formats and used each one based on the
format. The advantage of using just one date regex is that maintenance is
easier. Not that the pattern needs any maintenance, but if it did
there is only one complex regex to deal with instead of three. Also,
with one regex you can play around with other checks by using the
match. I did that while playing around with this but am not including
that here. And while writing this I thought of a couple of other things
I could build off of this.
Anyway, there you have it. Three regexes, one scary and two simple,
and a few lines of code to validate dates in 3 formats.
|
-
By
far the most common request I see from people wanting regex help is
someone wanting to use a regex to parse HTML. Generally I ignore those
questions. If I do respond, my response is “Don’t use a regex to parse
HTML. Use the HTML DOM.” The same is true for XML, with the XML DOM, but doing it for HTML is even worse. Not
that my advice stops people from trying to do it anyway, but having given them
proper warning I feel no remorse for the self-inflicted pain they incur by
ignoring my advice.

You
may ask why I advise people against using a regex for this task. First,
full disclosure: I have used, do use, and will use regexes to parse HTML. But
this is a case of “Do as I say, not as I do.” Why is it OK for me but
not for you, you may ask? Simple: I am very knowledgeable in both
regex and HTML. If I do embark on such a perilous task, I know exactly
what I’m getting into, what the risks are and what pitfalls to expect.
In most cases I am the author of the HTML, so I know how the markup will
be constructed. Most people asking for a regex to parse HTML are expert
in neither and have no idea of what to prepare for. Which is reason one
I tell people not to use regex for HTML: it is not as simple as you
want it to be. Most people knowledgeable in regex will tell you to
never use a regex for HTML. I won’t go that far, but if someone needs
help to construct such a regex then they shouldn’t be using a regex for
this task. If you can write it yourself, you still shouldn’t do it, but
you are the only one who has to deal with the pitfalls, so you are the
only one who has to suffer.

Now
I’ve seen other articles advising not to use regex against HTML, so I
never really considered writing about it. I thought it had been covered
enough. But since people keep asking for these types of patterns, I
thought I should address it as well. I also have insight that most
of those other articles don’t address.

The
biggest issue by far is that the people asking for these patterns, as well
as most of those who try to write patterns for them,
don’t know HTML very well. HTML is one of the languages a
lot of people think they know better than they actually do. And that’s
where the problems start. For the last few years I've been learning about Web Standards, which in a nutshell includes writing valid HTML and following best practices of web development. One of the things I learned in doing this is that most HTML in the real world is poorly written. This compounds the problem of using regex against HTML, since regexes are all about finding patterns. So even if you know the rules of valid HTML, you have no guarantee the HTML you are working with follows those guidelines.

One
of the caveats of the Posting Guidelines is to show the original text
and not a made-up sample. Although you shouldn’t be trying to use a
regex on HTML, it points out why this guideline is in place. The
majority of people asking this type of question will either make up the
sample HTML or show a simplified version of the markup, highlighting the beginning and the end of what they want to match and writing something
like “blah blah blah” for the content in between. The problem here is a
lack of understanding of how regex works and how HTML can be written.

Commonly
the request is to match a certain div or table. The problem here is
they assume that these elements won’t be nested with child elements of
the same type. A table can contain other tables, which is very common in
pages that use tables for layout. And just about any page that has div
elements will contain nested divs.

Often the request is “I want to match a particular div” and they have some phony HTML like

<div id="findMe"> blahblah blah </div>

And
some kind soul may write them a regex pattern that matches that sample.
The problem is real HTML could break any pattern they write. The
person asking for help incorrectly assumes that a regex will understand
HTML and know the </div> is the closing tag of <div
id="findMe">. The person answering may think that all that is needed
after finding the opening div is to search until the closing div occurs.
This shows how made-up data and regex construction don’t go together.
The pattern to find a block with no nested elements is simple. The
pattern to find one that has nested elements gets significantly harder
the deeper the nesting. The “blah blah blah” could easily be more
markup, including any number of divs. There could easily be eight
</div> tags before the one that closes <div id="findMe">. If the actual source is

<div id="findMe"> <div>content <div> More content</div> continued </div><div>content</div> <div>content More content</div> </div>

most regexes will choke trying to correctly find findMe’s closing tag, especially if the nesting depth is unknown. While a powerful regex engine may be able to balance the tags, that doesn’t mean you are in the clear.

Of
course the other assumption that leads to trouble is that the HTML is
either well written or well formed. Most HTML out in the wild is not
well written, in that it wouldn’t pass validation. Run most pages
through the W3C’s validator, http://validator.w3.org/.
Don’t be shocked if the error count is in triple digits. Unlike a
web browser, a regex isn’t an HTML parser, so it isn’t going to catch any
HTML errors, any of which may cause your pattern to fail. A web browser
is very fault tolerant, so it can handle errors and adjust. A regex
won’t.

<div id="findMe"> blahblah blah <div> Oops forgot to close the previous Div </div>

So
even if you managed to write a pattern that could handle nested tags of
the same type, which itself is tough enough, now you have to account for
unclosed tags as well, which is one of many very likely errors you’d
need to account for.

Even
if the HTML is well written and passes validation, that doesn’t mean it is
well formed. Unlike XML or XHTML, HTML doesn’t have to be well formed
to be valid. The assumption by many is that every opening tag has a
matching closing tag. Sorry, that simply isn’t the case. Several tags have
optional closing tags. While the above snippet isn’t valid, the below
snippet is:

<p id="findMe"> blahblah <p> More stuff</p>

Some
may think they can just test for either the first occurrence of
</p> or <p> after <p id="findMe"> to find the
complete <p id="findMe"> element, but what if the markup is like so

<p id="findMe"> <span>blah</span>blah <ul> <li> <li><p> More Stuff</p></ul><p> More stuff</p>

or for a nested table

<table id="findMe"> <tbody> <tr><td><td><table> <tr><td><td><tr><td><td></tr> </table> <tr><td><td> <tr><td><td></table>

The above samples represent source that could exist when attempting to find the element with the id of “findMe”.

Also, formatting has to be considered. What is the difference between these two snippets of markup?

<p class=demo id=test> Some Text</p>

and

<p id="test" class="demo">Some Text</p>

As far as a browser is concerned they are the same. A regex would see them as two different strings.

Another
common thing I see is that when people want to use a regex against HTML they
want to try to find an element based on an attribute. Several incorrect
assumptions generally are made:
- The attribute they are looking for will always be the first
- The order of the attributes is fixed
- The attributes are always quoted
- All attributes are assigned a value
None of the above has to be true for valid HTML.

Take the following and assume one wanted to find all the elements with findMe in the class attribute:

<p class="findMe"> blahblah <p> More stuff</p><p class="hello findMe world"> blah <span class='findMe also'>blah</span><p> More stuff</p><p class=findMe> blahblah </p> <p class="can you findMe"> More stuff</p> <p> More stuff</p>

Trying to find each element with findMe in the class with a regex is way more trouble than it’s worth.
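A rough sketch of how quickly this gets out of hand, using the sample markup above with plain ASCII quotes. The second pattern, classAttr, is made up for illustration and is still far from safe: it only finds class attributes, not whole elements, and it would happily match text that merely looks like an attribute.

```javascript
// Sample markup from the paragraph above, with plain ASCII quotes.
const html =
  '<p class="findMe"> blahblah <p> More stuff</p>' +
  '<p class="hello findMe world"> blah <span class=\'findMe also\'>blah</span>' +
  '<p> More stuff</p><p class=findMe> blahblah </p> ' +
  '<p class="can you findMe"> More stuff</p> <p> More stuff</p>';

// Naive pattern: assumes double quotes and exactly one class name.
const naiveCount = (html.match(/class="findMe"/g) || []).length;

// Hypothetical, more tolerant pattern: double-quoted, single-quoted or
// unquoted values, with the class list checked afterwards in code.
const classAttr = /\bclass\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/g;
let tolerantCount = 0;
for (const m of html.matchAll(classAttr)) {
  const value = m[1] || m[2] || m[3] || "";
  if (value.split(/\s+/).includes("findMe")) tolerantCount++;
}

console.log(naiveCount);    // 1
console.log(tolerantCount); // 5
```

The obvious pattern finds one of the five tags carrying the class, and the pattern that finds all five already needs three alternatives plus extra code, before nesting or unclosed tags even enter the picture.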
Now,
while these examples are contrived and you most likely won’t get such a
mix of styles, you could get one that you weren’t expecting. And if
your HTML was created by a team of developers, you may get that mix after
all. And remember, you can’t even assume the HTML is correctly written.
So the moment you notice you’ve missed a style you didn’t account for,
you have to adjust your pattern, which could be a huge nightmare.
Another
thing to consider, especially when searching by attributes, is that a
regex can’t reuse consumed text. Consider the following:

<div class="findMe"> <p>Blah</p> <p class="findMe"> Some text <b class="findMe">to find</b></p> <p>Blah</p> </div> <div> Blah Blah Blah</div>
Again,
suppose the task is to find all elements with a findMe class. Let’s assume a
regex can match the div element. Once this is done, the regex engine
will proceed to look for the next match after the div element’s closing
tag. It will not go back inside the div to find the class on the
elements within the div.
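The effect is easy to reproduce. The sketch below uses the sample above (with ASCII quotes) and a made-up element pattern, elementPattern, that runs from an opening tag with class="findMe" to the first closing tag of the same name. That pattern is itself too naive for nested tags of the same type; it is only here to show what happens to the text a match consumes.

```javascript
const html =
  '<div class="findMe"> <p>Blah</p> <p class="findMe"> Some text ' +
  '<b class="findMe">to find</b></p> <p>Blah</p> </div> <div> Blah Blah Blah</div>';

// Hypothetical pattern: opening tag carrying class="findMe", then everything
// up to the first closing tag with the same tag name.
const elementPattern = /<(\w+)[^>]*class="findMe"[^>]*>[\s\S]*?<\/\1>/g;

const elementMatches = html.match(elementPattern) || [];
const classOccurrences = (html.match(/class="findMe"/g) || []).length;

console.log(elementMatches.length); // 1: only the outer div
console.log(classOccurrences);      // 3: the div, the p and the b all carry the class
```

The p and b elements sit inside the text already consumed by the div match, so the global search never looks at them.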
Instead
of trying to hit a moving target with a regex, the HTML DOM would parse
all the above samples, even the incorrect one, because that's its job. Of course, if there are
errors the document may not be parsed exactly as you intended, but it will
be correctly parsed. From there it is fairly easy to find the element
you need. Reformatting the markup is very unlikely to break code using the DOM. For example, using the popular JavaScript library jQuery, the following snippet would find all the elements with a findMe class:

var foundThem = $(".findMe");

Or in the versions of JavaScript supporting getElementsByClassName:

var foundThem = document.getElementsByClassName('findMe');

And
quickly, XML, which has stricter syntax rules (well formed, all
attributes quoted), may seem more regex friendly, but there are other ways
to parse XML. XPath and XQuery are both more maintainable than a regex
would be for this task, and easier to use and understand in most cases.
You can use a regex, which may work some of the time, or the DOM, which should work all the time. In the end the choice is yours.
|
-
Over at the Construction forum there is
a sticky post which lists the suggested posting guidelines for asking
questions on that forum. First let me say the guidelines were my idea;
however, I didn't come up with them on my own. I had help. The idea
sprang from the fact that I was getting tired of repeatedly having to ask
each poster for the same bits of information. So the idea was to put
these standard questions in a central, highly visible location
everyone could see, where new posters could see what would be asked of
them and add that information to their first post. The standard
formula was
P1: Ask vague question
Me: Ask for more info
P1: Provide more info (maybe)
Me: Provide answer (maybe)
This was the best case, but it
often didn't go this smoothly. Variations would be
Ask for two pieces of additional
information, get one (or none)
Answer question, get response of “Oh
I forgot about this” or “but what about this unmentioned case”
where steps two and three could repeat
multiple times.
The point of the posting guidelines was
to eliminate the extra back and forth so you get down to
P1: Ask detailed question
Me: Answer
P1: Thanks (maybe)
Now if I were to grade how successful
the posting guidelines have been, I'd have to give two grades.
Grade 1 (Posters asking questions): D-
Unsatisfactory passing. For the most
part people asking the questions won't follow the posting guidelines.
So in those cases nothing really changes. You still end up going
back and forth until you get the information you need to properly
answer the question. This is a waste of everyone's time, mostly for
the person asking the question. For those asking the question, it
just delays you getting an answer, especially if the person responding
is on a different sleep cycle than you. You ask a question; hours
later someone on the other side of the planet, starting their day,
responds “Need more information”. However, you've signed off for
the day. So the next day you respond, but now the other person has
signed off for the day (or weekend). So now instead of having an
answer waiting for you, you end up waiting several more hours or days for an
answer.
In some cases getting the asked-for
information is like pulling teeth. I've always thought it was strange
how some people are so unwilling to help you help them. Maybe it's
because misery loves company and they want you to struggle with an
answer in the same way they have been struggling with the problem.
Sometimes even after being referred to the guidelines people will
respond but still not provide any of the information asked for.
Sometimes they just rephrase their original query but add no new
information, or at least nothing satisfying any of the guideline
requests.
Some even get hostile about it. One
person stated he couldn't answer any of the guideline questions
because he was new to regex and didn't know anything about it.
Now that wasn't really a valid excuse, since the questions don't
require any regex knowledge. The only one that even implies you have
any regex knowledge is “Show what you have tried”, which implies
you have attempted to write a pattern. But if you haven't tried
anything, you can reply “I haven't tried anything”.
A lot of people will post questions
with the mindset that you will answer while sitting in their lap. One
of the reasons we are asking for this info is because we cannot see
what you are doing. We are not physically in the room with you. So
statements like “I have a text file”, or “I'm using a text
editor ...”
aren't helpful. We are not there, we
can't see your file, we can't see your editor, which we may have
never even heard of.
People will often post that they are new to
regex. Sometimes I think they feel this excuses them from needing to
follow the guidelines. In fact it's just the opposite. They are the
ones who need to be following the guidelines the closest. We've all
been in situations where we understood something so little that we
didn't even know what questions to ask. Or in this case, what
information is important and what is too much information. The
guidelines cover the most crucial information needed. The more of
this info provided, the easier answering your question becomes. The
less provided, the more difficult answering your question becomes,
regardless of whether you are a regex noob or a vet.
The other thing that I find annoying
is people asking for help saying it's not important for the people
answering to know what language they are using, or saying it doesn't
matter what language or platform is used to supply an answer. Why do
you think we would ask for unimportant information? The same is true
for all the guideline questions. Do those asking the questions think
these things were asked just because someone was nosy? Everything
asked is so that you can get the best answer as quickly as possible.
If you don't understand why the question matters, ask! But don't
tell the people trying to help you which parts of the information they
are asking for are important.
I said the guidelines got two
grades. The second is for people actually answering the questions.
And for them the guidelines get an A- (exceptional). The guidelines have been
around a few years now, since before some of the active members providing
the answers joined, but in a lot of cases I see them referring people
to them with no prompting from me. Not everyone answering
references the guidelines. Despite the things I just mentioned, it is
possible to provide an answer following none of the guidelines. It
involves guessing and luck. Even if one asks a vague question,
someone might be able to guess what they actually wanted and answer
on that assumption, and in some cases will be right. But that's mostly
luck. Luck that they are familiar with the data you are working
with. Just like the “sitting in your lap” mindset above, people tend to assume
that everyone is as familiar with the data or tools they are using as
they themselves are. You are lucky if that is true, but nothing
guarantees that. Plus a host of other assumptions have to be made
and guessed correctly. Do you really want to rely on dumb luck every
time?
A simple example
Q: I need to make sure a string contains
only letters, what is the regex for that?
A: ^[a-zA-Z]*$
Assumptions made:
-
By letters the question referred
to letters of the English alphabet only
-
Spaces are not allowed
Now that pattern may be exactly what
the person asking wanted. But what if it wasn't? People asking these
questions come from all over the world; English may not be their
first language. The regex may not be intended to be used against
English-only input.
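To make the gap between the two readings concrete, here is a small sketch in JavaScript. The first pattern is the answer given above; the second is just one possible alternative reading (any Unicode letter), picked for illustration, and it needs an engine that supports the u flag and \p{L} (ES2018 or later).

```javascript
// The answer as given: ASCII letters only (it also accepts the empty string).
const asciiLetters = /^[a-zA-Z]*$/;
// One alternative reading: any Unicode letter.
const unicodeLetters = /^\p{L}*$/u;

console.log(asciiLetters.test("Jose"));              // true
console.log(asciiLetters.test("Jos\u00e9"));         // false, the accented e is not in a-z
console.log(unicodeLetters.test("Jos\u00e9"));       // true
console.log(unicodeLetters.test("Jos\u00e9 Luis"));  // false, a space is not a letter
console.log(asciiLetters.test(""));                  // true, * allows zero letters
```

Neither pattern is wrong; they answer different questions, and only the person asking knows which one they meant.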
The more common and actual request I
see is when people say they need their regex to match “Special
Characters”. Now maybe I was sick the day that they announced
exactly which characters this term applies to and why they are
special, but when people ask for this there doesn't seem to be a
consensus on which characters this includes. Now I could assume they
meant any non-alphanumeric character, or even just non-alphanumeric
characters on a keyboard. Of course this assumes we have the same
keyboard layout. But often people only want a few of these
characters. Which ones? Well, that's part of the problem: it varies.
Even if just limited to the keyboard, which characters are special and
how many characters they want differs by person.
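A quick sketch of how far apart two reasonable guesses can be. Both patterns are assumptions made up for illustration: the first treats anything that isn't an ASCII letter or digit as special, the second uses a short explicit list that the poster would have to supply.

```javascript
// Reading 1: anything that is not an ASCII letter or digit.
const anyNonAlphanumeric = /[^a-zA-Z0-9]/;
// Reading 2: a short explicit list (hypothetical; the real list must come from the poster).
const explicitList = /[!@#$%]/;

const samples = ["pass_word", "pass word", "pass#word", "password1"];
const readings = samples.map(function (s) {
  return { text: s, reading1: anyNonAlphanumeric.test(s), reading2: explicitList.test(s) };
});

readings.forEach(function (r) {
  console.log(r.text, r.reading1, r.reading2);
});
// pass_word true false
// pass word true false
// pass#word true true
// password1 false false
```

Two of the four samples get different answers depending on the guess, which is why the question can't be answered without asking which characters are meant.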
One of the things that really tells me that
those answering questions are good with the guidelines is when you see
a person who was answering, or trying to answer, questions that provided
incomplete information get frustrated by the back and forth of such
questions and start referring newer such posts to the posting
guidelines.
As new regex pros joined the board and
started answering questions, on a couple of occasions I have asked if
they had any changes, additions or deletions they thought should
be made to the guidelines. The common answer is that the guidelines
are good as is. The only real addition suggested was adding a
Frequently Asked Questions (FAQ) list, which, while a good idea since some
questions get asked repeatedly, doesn't provide useful information for
answering questions. Instead it would be more for preventing
questions from being asked again. Only thing is the questions do
vary slightly, like the aforementioned “special characters”. So
the questions would likely still get asked, especially considering the
existing guidelines get ignored.
The only actual change from the first
posting of the guidelines is the addition of an example of a badly
asked question and a template of a well asked question.
The actual guidelines can be found here http://regexadvice.com/forums/thread/60465.aspx but a quick summary of the 4 guidelines is
-
What programming
language/Application are you using?
-
What are you trying to do?
-
What have you tried to do?
-
Show the actual text you are
working with.
Now those who say they can't answer any of
these questions because they don't know regex probably are about to
get fired from their job.
-
What programming language/Application
are you using? - If you are a programmer and you don't know the name of the
language you are doing your work in, you are definitely in the wrong
job. If you are not a programmer or are just using an application, how
can you not know the name of the application you are using or
planning to use? How did/will you launch it, or even find it on your
computer? How do you know it will even allow you to use a regex?
-
What are you trying to do? - If you
can't answer this, well... You are not being asked about your process, you are being asked for your
goal. If you don't know your own goal, why are you even asking others for help?
-
What have you tried to do? - Again,
either you tried something or you didn't try anything. If you have tried something, please show it. If you haven't tried anything, just say so.
-
Show the actual text you are working
with - Aside from security concerns, which can be addressed, this
is basically copy and paste.
Why do we want to know this
information?
-
Regular Expressions are not
uniform in all cases. There are differences in syntax, features
offered and even how features work that vary by host environment.
And we aren't talking slight variations. In some cases the syntax
of one regex engine is over 50% different from another. Not all patterns
are portable. Knowing the hosting environment makes it possible
to know what is available regex-wise.
-
Knowing the goal makes giving an
answer easier, if there is an answer. In some cases the goal may
not be achievable with a regex, or at all. Some solutions require multiple patterns, some require programming code. Some goals are better
suited to a non-regex solution.
-
Samples of what has been tried
provide insight into the host environment, the level of regex knowledge or
knowledge of the host environment. Code samples would show the language
being used. Some attempts show a misunderstanding of regex. Some
attempts show syntax or logic errors in the host language code.
Some people were close to getting the pattern right themselves.
Others have had a perfectly fine regex pattern but their host language
code was wrong, so there was nothing in the regex that needed fixing;
their code needed to be corrected. Or the pattern itself is
fine but needs to be adjusted for the host language.
-
-
This is very important but I have
already covered that in another post.
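As one illustration of a pattern that is fine but has to be adjusted for the host language, take a digits-only pattern in JavaScript. The variable names are made up for the example. Written as a regex literal it works as is, but passed to the RegExp constructor as a string, every backslash has to be doubled, because the string literal processes the backslash before the regex engine ever sees it.

```javascript
const literal = /^\d+$/;                         // regex literal: \d reaches the engine
const unescapedString = new RegExp("^\d+$");     // "\d" in a string is just "d"
const escapedString = new RegExp("^\\d+$");      // "\\d" in a string is \d

console.log(literal.test("123"));          // true
console.log(unescapedString.test("123"));  // false
console.log(unescapedString.source);       // ^d+$
console.log(unescapedString.test("ddd"));  // true, it now matches the letter d
console.log(escapedString.test("123"));    // true
```

Nothing is wrong with the pattern itself here; without seeing the host code there is no way to tell which of the three forms the poster actually used.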
Generally people incorrectly assume
that if they make their question as minimal as possible it helps to
make it easier to answer. As far as regex goes the opposite is true.
The more details, the easier regex questions are to answer. Minimal
regex questions often lead to more questions. Detailed regex
questions often lead to quick answers.
|
-
A few years ago I wrote about a
bug in Internet Explorer's regex engine that affected patterns
with lookaheads. Well, the bug came back in the form of a question on
RegexAdvice.com. It too was a password regex, though not as complex
as the previous pattern that introduced me to this bug.
The first pattern had three conditions
that were being tested for with lookaheads.
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$
With the current pattern only one
lookahead was being used.
^(?=.*?\d)[a-z][a-z0-9]{5,7}$
In both patterns the pattern had a min
and max length. In the original attempts of both, the length was being
checked after the lookahead test(s). While this is perfectly fine in
a non-VBScript/JScript world, this is where the bug kicks in for those
regex engines. Actually it's probably the same regex engine for both
languages, which is probably why it only affects IE: it's the only
browser that uses VBScript and JScript natively. I don't recall
whether in my original testing for this bug I tested it server-side; given my
previous blog comments, most likely I only tested client-side.
However, in the recent question it was failing server-side, so it's not
actually in the browser but more likely in the DLL for those languages.
Anyway, the previous blog article covered the behavior that was
happening.
Steve Levithan looked much closer at
the problem in general and discussed
it on his blog. He came to the conclusion that the quantifiers
with a minimum boundary of zero within the lookahead were the
culprit. He provides a couple of simple examples. I think he's
partially right, but I don't think the 1+ quantifiers are excluded from
the problem. I think his provided examples were a little too simple
for them to be affected.
OK, let's look at the regex pattern in the
recent question
^(?=.*?\d)[a-z][a-z0-9]{5,7}$
The requirements were a 6 to 8 character
alphanumeric (English) string that started with an alpha character,
with at least one digit. Let's use “abc123” as the text string.
Now the above pattern was supplied by
one of the resident pros who frequent the message boards. Let's get
past the fact that the pattern itself is correct and satisfies the
requirements. I'm going to rewrite the pattern to
^(?=\D*\d)[a-z][a-z0-9]{5,7}$
Now this is functionally the same; it's
just that within the lookahead the pattern is greedy instead of lazy, and
it will make the point I'm trying to make easier to see (I hope)
without having to deal with backtracking. Now this pattern suffers
from the bug in VBScript/JScript. Now Steven suggested the one or
more quantifier (+) doesn't suffer from the problem. But change the *
to a +, which doesn't affect the match because the first test after
the lookahead is for an alpha, so the lookahead placed as it is will
always match at least one character with the \D* pattern. So now we
have
^(?=\D+\d)[a-z][a-z0-9]{5,7}$
which is also bitten by the bug.
I first tried the same approach as in my
original attempt at dealing with the bug, but no luck. I tried using
the + quantifier; it still didn't work. Now the person posting the
question stated that without the lookahead the remaining pattern worked
fine, excluding the at-least-one-digit test. So I began testing
parts of the pattern, such as
^[a-z][a-z0-9]{5,7}$
^(?=\D*\d)[a-z]
^(?=\D+\d)[a-z]
These are just a few attempts; all work as
expected. It was only when I put the whole pattern back together
that it began failing. But after some more trial and error I think
I've found a pattern in how the bug behaves. Going back to my original
examination of the problem, I discovered that this pattern, ^(?=\D*\d)[a-z][a-z0-9]{2,7}$, matches
the test string “abc123”. Now this doesn't seem to quite fit with
my original assessment of what was happening, because in that pattern
after the lookahead test it's just testing for a certain number of
any characters, but in this pattern it's looking for a specific range
of characters. But if you up the min boundary on the last quantifier by one, to
^(?=\D*\d)[a-z][a-z0-9]{3,7}$, the pattern fails again. Now if you
change the zero-to-infinity quantifier in the lookahead to a one-to-infinity quantifier in
the modified pattern, so you get ^(?=\D+\d)[a-z][a-z0-9]{3,7}$, the pattern
matches again. Bump up the min boundary of this new pattern to
^(?=\D+\d)[a-z][a-z0-9]{4,7}$. Bugaboo! The pattern fails again.
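For anyone who wants to repeat the experiment, here is a small harness sketch. The patterns and the test string are the ones discussed above; the reportedInOldJScript flags simply record the results described in this post for the affected VBScript/JScript engine, and the harness itself (names included) is made up for illustration. On an engine without the bug, every one of these patterns should match “abc123”.

```javascript
const text = "abc123";

// reportedInOldJScript: result described in this post for the buggy engine.
const cases = [
  { re: /^(?=\D*\d)[a-z][a-z0-9]{5,7}$/, reportedInOldJScript: false },
  { re: /^(?=\D+\d)[a-z][a-z0-9]{5,7}$/, reportedInOldJScript: false },
  { re: /^(?=\D*\d)[a-z][a-z0-9]{2,7}$/, reportedInOldJScript: true },
  { re: /^(?=\D*\d)[a-z][a-z0-9]{3,7}$/, reportedInOldJScript: false },
  { re: /^(?=\D+\d)[a-z][a-z0-9]{3,7}$/, reportedInOldJScript: true },
  { re: /^(?=\D+\d)[a-z][a-z0-9]{4,7}$/, reportedInOldJScript: false }
];

const results = cases.map(function (c) {
  return { pattern: c.re.source, here: c.re.test(text), reported: c.reportedInOldJScript };
});

results.forEach(function (r) {
  console.log(r.pattern, "this engine:", r.here, "reported for the buggy engine:", r.reported);
});
```

Any row where the two columns differ is a case where the old engine disagreed with what the pattern is supposed to do.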
Now I don't have access to the source
code, so this is just supposition, but here's what I believe is
happening. I think the data examined by the lookahead is being
stored in a stack structure. Just looking at the patterns that
are working and failing, it looks like once the lookahead is satisfied,
when a quantifier is encountered in the consuming portion of
the pattern, the lookahead's match is reconsidered, with the minimum
boundary of the quantifier in the lookahead popped off the stack of
the lookahead's match. Let's look at the first pattern that worked
^(?=\D*\d)[a-z][a-z0-9]{2,7}$
OK, we'll start after the lookahead
is matched. Now the lookahead is supposed to be non-consuming, so the
pointer should still be at the beginning of the string.
^(?=\D*\d) matches “abc1”. Now the
rest of the pattern matches normally until we get to the quantifier in
the consuming portion of the pattern, where we are looking for at least two
alphanumeric characters. At that point the lookahead match is
reconsidered; the lower bound being zero, nothing is removed from the
stack, but the current character pointer now points to the character
after the lookahead's match value of “abc1”. The quantified part of
the pattern, [a-z0-9]{2,7}$, can be satisfied by “23”, so we get a
match.
Now if we do the same thing with
^(?=\D*\d)[a-z][a-z0-9]{3,7}$ and apply the same logic the regex
fails because the qualifier in the consuming portion is look for at
least 3 characters and there aren't that many if it tries to satisfy
that part of the pattern after “abc1” in the test string.
Now let's look at
^(?=\D+\d)[a-z][a-z0-9]{3,7}$ with the same logic. The only
difference from the previous pattern is the lower boundary of the
lookahead quantifier. It's now 1. So if we pop 1 character off the
stack of the lookahead's match of “abc1” we get “abc”, leaving
us with “123” to be matched by the consuming quantifier, which is
just enough.
Now take ^(?=\D+\d)[a-z][a-z0-9]{4,7}$ and
apply the same logic. Now the pattern fails, because the consuming
quantifier is looking for at least 4 characters after the stack popping
of the lookahead's match.
If you changed the lookahead quantifier
to {2,}, the pattern would match. You can continue raising the minimum of the quantifier in the consuming portion by one to make the pattern fail, then raising the one in the non-consuming part to get the match back, so the behavior seems pretty
consistent with my theory, which is that the consuming quantifier starts
at the end of the lookahead's match and moves back by the number of
characters in the lookahead quantifier's minimum boundary. It also seems to explain
the effect of my original encounter with the bug. It may also explain
why the workaround of testing the length first with another lookahead avoids
the bug: the consuming quantifier in that case is usually just
*. Using + seems to work too, which doesn't quite fit, but consuming
quantifiers with minimum boundaries of zero or one don't seem to be
affected in any case. In all of the above test cases those values
were below the minimal threshold of every successful test. However,
the test cases above have only one lookahead, which doesn't backtrack.
Who knows how backtracking, additional lookaheads or
additional quantifiers in the consuming parts of the pattern would be
affected. So while the test values support my theory, I can't say for
sure that things are happening exactly the way I've laid out. The
actual mechanics may be different, but whatever is happening under the
hood, clearly pointers are being corrupted and the regex engine is
losing its place.
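The theory can be written down as a tiny model. This is only a sketch of the supposition above, not the real JScript/VBScript internals. The function name `theoryPredictsMatch`, its parameters, and the "abc123" test string are all assumptions made for illustration.

```javascript
// Hypothetical model of the theory, NOT the actual engine behavior.
// lookaheadMatch: the text the lookahead matched (e.g. "abc1")
// lookaheadMin:   minimum of the quantifier inside the lookahead (0 for *, 1 for +)
// consumingMin:   minimum of the quantifier in the consuming portion
function theoryPredictsMatch(input, lookaheadMatch, lookaheadMin, consumingMin) {
  // The consuming quantifier resumes at the end of the lookahead's match,
  // moved back by the lookahead quantifier's minimum boundary.
  const resumeAt = lookaheadMatch.length - lookaheadMin;
  const remaining = input.length - resumeAt;
  return remaining >= consumingMin;
}

const input = "abc123";        // assumed test string
const lookaheadMatch = "abc1"; // what the lookahead matches in each case

const cases = [
  { lookaheadMin: 0, consumingMin: 2 }, // ^(?=\D*\d)[a-z][a-z0-9]{2,7}$
  { lookaheadMin: 0, consumingMin: 3 }, // ^(?=\D*\d)[a-z][a-z0-9]{3,7}$
  { lookaheadMin: 1, consumingMin: 3 }, // ^(?=\D+\d)[a-z][a-z0-9]{3,7}$
  { lookaheadMin: 1, consumingMin: 4 }, // ^(?=\D+\d)[a-z][a-z0-9]{4,7}$
  { lookaheadMin: 2, consumingMin: 4 }, // ^(?=\D{2,}\d)[a-z][a-z0-9]{4,7}$
];

const predictions = cases.map((c) =>
  theoryPredictsMatch(input, lookaheadMatch, c.lookaheadMin, c.consumingMin)
);
console.log(predictions); // [ true, false, true, false, true ]
```

The predictions line up with the match, fail, match, fail, match sequence described above, which is all the model is meant to show. It says nothing about backtracking or multiple lookaheads.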
The thing that was so confusing about
this was that it only kicks in when a quantifier in the consuming portion
is encountered, but if there was something between the lookahead and
the quantified portion it matched normally. So this is something it's
really hard to make test cases for, because you get these ghost values
popping up later on in the test than I expected. Not to mention the
pattern itself is correct, so even with a tool that will let you step
through the matching you don't get this behavior unless that tool is
using VBScript's regex engine, and I haven't seen such a tool for that
engine.
If you are using lookaheads with
JavaScript client-side then you are going to be susceptible to the
bug in IE, because it will use the JScript engine. And while you
should always validate server-side, if VBScript or JScript is your
server-side language you are still at risk. So a platform like
classic ASP, which uses both of those languages by default, is at risk
client and server side, but a platform like PHP, while it will still suffer
the bug client-side for IE, should work correctly server-side, where it
is using a different regex engine. The same goes for non-web clients
using the JScript/VBScript DLL.
The workaround for the strong password
type of regex, when using lookaheads, is to include the upper boundary
test(s) before the tests with no upper boundary, then use .* to consume
characters. The bounded test should keep the pattern from running
forever. However, depending on the complexity of your criteria, this
may not always be an option, but try it first anyway.
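Here is one way to read that workaround, sketched in JavaScript. It assumes the same small criteria as the test patterns above: a leading letter, 4 to 8 letters or digits in total, and at least one digit. The name `boundedFirst` and the sample inputs are mine, and I can't verify the pattern against the IE engine, so treat it as an illustration of the shape rather than a confirmed fix.

```javascript
// The bounded test goes in the first lookahead, the unbounded "has a digit"
// test goes in the second, and the consuming part is just .* so no consuming
// quantifier has a minimum above zero.
const boundedFirst = /^(?=[a-z][a-z0-9]{3,7}$)(?=\D*\d).*$/;

const samples = ["abc123", "abcdefg", "ab1", "abc1234567"];
const verdicts = samples.map((s) => boundedFirst.test(s));
console.log(verdicts); // [ true, false, false, false ]
```

"abcdefg" fails for lack of a digit, "ab1" is too short, and "abc1234567" is too long, so the bounded lookahead is doing the length work that the consuming quantifier did before.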
|
-
First off, let me say I'm in a bit over my
head here. Not the regex part, but the host language of the regex engine.
Many moons ago I posted a blog article
stating why you could not write a regex that validated an e-mail
address 100%. Well, this is still true. However, in that
post I also stated that the pattern was so massive that it wasn't
worth using. This is also still true; however, I was made aware of a
flavor-specific syntax that reduces the regex from massive to very
large.
This regex is for the PCRE engine.
http://www.myregextester.com/?r=337
Though from what I've read this will work for PHP too. Now, I don't know Perl or PHP, or what
minimum version of PCRE supports this syntax. That being the case, I
also don't know how well it performs. I wrote the original version using
the .NET syntax, and not only was the regex massive, which is one reason I never posted it, but the
performance was terrible. Given that most people want to use this
type of regex to validate a data entry field, the pattern was overkill.
In fact I recommend that you don't use this, except to learn from.
The PCRE version may perform better, but I don't have the means or
time to test, so use at your own risk. For simple field validation
even this is still overkill. For a large text file performance
may suffer horribly. Most likely you aren't going to want to use
this pattern, as it is too large for simple text and performs poorly
for large text.
When I see people asking for an email
regex, I point out that perfect validation is not possible. And when
I see so-called email-validating regexes that are only about 50
characters long, it makes me chuckle. This pattern is probably the
most compact version of an RFC 2822 address regex you'll find, and it
is still huge. Ports to other regex engines that don't support the
recursive syntax will easily be 4x as large as my .NET version was.
The above pattern covers the RFC spec up to
the addr-spec, which is pretty much what people are thinking about
when they say email address.
It's not too hard to take it up a few
more levels in the spec using this syntax.
RFC 2822 mailbox:
http://www.myregextester.com/?r=338
But like I said, it likely won't perform
well enough to be useful. I've wrapped the two patterns I've linked to
in anchors, so they only match against the whole
string. Searching for an address within a larger body of text, without anchors, will probably
degrade performance very fast. But if any of you PHP or Perl gurus want to stress test this beast, have fun. Maybe it's not as bad as I think it may be.
|
-
This is a C# 2.0 enhancement of a C# port of YUI Compressor's CSS minification code
I got a little carried away with ideas
for this. They were all regex based, which is really what motivated me
to work on it. However, after I thought I was done I learned that not
everything worked. It did what I wanted it to do, but what I wanted
wasn't the correct thing. I really should have just stopped with my
original ideas.
The last idea among my original changes was to take two or more
individual subset properties and write them in the shorthand notation of
the main property they were a subset of. Well, I got that working.
But upon testing I learned something about CSS that I didn't
know: what I was doing could alter the behavior of the
presentation, because the subset properties that are not declared get
reset to their defaults by the shorthand. That was disappointing, because I put a lot of energy
into getting the results I was after.
So it looked as if all of that code was
going to go to waste. But there was one scenario where what I was
trying to do was all right, so the code wasn't completely wasted. That
scenario is when all the subset properties are declared; then
combining them is fine. I didn't bother changing the regexes I wrote
for this, but I cleaned up some of the code. Though it would have
worked as is, some of the things being checked were now unnecessary.
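To illustrate the "only when every subset property is declared" rule on its own, here is a small JavaScript sketch for the margin properties. It is a simplified stand-in, not a port of the C# code: the function name `combineMargins` and the value pattern are assumptions, and it ignores cases such as `!important` or duplicate declarations.

```javascript
// Collapse margin-top/right/bottom/left into one margin declaration,
// but only when all four sides are declared inside the block.
function combineMargins(block) {
  const re = /margin-(top|right|bottom|left)\s*:\s*([\w.%-]+);?/gi;
  const sides = {};
  for (const m of block.matchAll(re)) {
    sides[m[1].toLowerCase()] = m[2];
  }
  const order = ["top", "right", "bottom", "left"];
  if (!order.every((side) => side in sides)) {
    return block; // a side is missing, so the shorthand could change behavior
  }
  const shorthand = "margin:" + order.map((side) => sides[side]).join(" ") + ";";
  // Drop the four longhand declarations and put the shorthand right after "{".
  const stripped = block.replace(re, "");
  return stripped.slice(0, 1) + shorthand + stripped.slice(1);
}

console.log(combineMargins("{margin-top:1px;margin-right:2px;margin-bottom:3px;margin-left:4px;color:red;}"));
// {margin:1px 2px 3px 4px;color:red;}
console.log(combineMargins("{margin-top:1px;margin-left:4px;}"));
// {margin-top:1px;margin-left:4px;}
```

The second call leaves the block alone, because writing `margin:1px 0 0 4px` there would set the right and bottom margins instead of leaving them to inheritance or other rules.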
using System;
using System.Collections;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;
namespace CSSMinify
{
class CSSMinify
{
public static Hashtable shortColorNames = new Hashtable();
public static Hashtable shortHexColors = new Hashtable();
public static string Minify(string css)
{
return Minify(css, 0);
}
public static string Minify(string css, int columnWidth)
{
// BSD License http://developer.yahoo.net/yui/license.txt
// New css tests and regexes by Michael Ash
createHashTable();
MatchEvaluator rgbDelegate = new MatchEvaluator(RGBMatchHandler);
MatchEvaluator shortColorNameDelegate = new MatchEvaluator(ShortColorNameMatchHandler);
MatchEvaluator shortColorHexDelegate = new MatchEvaluator(ShortColorHexMatchHandler);
css = RemoveCommentBlocks(css);
css = Regex.Replace(css, @"\s+", " "); //Normalize whitespace
css = Regex.Replace(css, @"\x22\x5C\x22}\x5C\x22\x22", "___PSEUDOCLASSBMH___"); //hide Box model hack
/* Remove the spaces before the things that should not have spaces before them.
But, be careful not to turn "p :link {...}" into "p:link{...}"
*/
css = Regex.Replace(css, @"(?#no preceding space needed)\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
css = Regex.Replace(css, @"([!{}:;>+([,])\s+", "$1"); // Remove the spaces after the things that should not have spaces after them.
css = Regex.Replace(css, @"([^;}])}", "$1;}"); // Add the semicolon where it's missing.
css = Regex.Replace(css, @"(\d+)\.0+(p(?:[xct])|(?:[cem])m|%|in|ex)\b", "$1$2"); // Remove .0 from size units x.0em becomes xem
css = Regex.Replace(css, @"([\s:])(0)(px|em|%|in|cm|mm|pc|pt|ex)\b", "$1$2"); // Remove unit from zero
//New test
//Font weights
css = Regex.Replace(css, @"(?<=font-weight:)normal\b", "400");
css = Regex.Replace(css, @"(?<=font-weight:)bold\b", "700");
// Thought this was a good idea, but the properties of a set that are not defined get the element defaults, so this was resetting them.
// css = ShortHandProperty(css);
css = ShortHandAllProperties(css);
//css = Regex.Replace(css, @":(\s*0){2,4}\s*;", ":0;"); // if all parameters zero just use 1 parameter
// if all 4 parameters the same unit make 1 parameter
css = Regex.Replace(css, @"(?<!background-position\s*):\s*(inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};", ":$1;", RegexOptions.IgnoreCase);
// if has 4 parameters and top unit = bottom unit and right unit = left unit make 2 parameters
css = Regex.Replace(css, @":\s*((inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(inherit|auto|0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;", ":$1;", RegexOptions.IgnoreCase);
// if has 4 parameters and top unit != bottom unit and right unit = left unit make 3 parameters
css = Regex.Replace(css, @":\s*((?:(?:inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(inherit|auto|0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
//// if has 3 parameters and top unit = bottom unit make 2 parameters
//css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, "background-position:0;", "background-position:0 0;");
css = Regex.Replace(css, @"(:|\s)0+\.(\d+)", "$1.$2");
// Outline-style and Border-style parameter reduction
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset ))(?:\s+\3)(?:\s+\4);", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset )\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset ))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset ))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
// Outline-color and Border-color parameter reduction
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};", "$1-color:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);", "$1-color:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);", "$1-color:$2;", RegexOptions.IgnoreCase);
// Shorten colors from rgb(51,102,153) to #336699
// This makes it more likely that it'll get further compressed in the next step.
css = Regex.Replace(css, @"rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29", rgbDelegate);
css = Regex.Replace(css, @"(?<![\x22\x27=]\s*)\#(?:([0-9A-F])\1)(?:([0-9A-F])\2)(?:([0-9A-F])\3)", "#$1$2$3", RegexOptions.IgnoreCase);
// Replace a hex color code with the named value if the name is shorter
css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>f00)\b", "red", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>[0-9a-f]{6})", shortColorNameDelegate, RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(?<=color\s*:\s*)\b(Black|Fuchsia|LightSlateGr[ae]y|Magenta|White|Yellow)\b", shortColorHexDelegate, RegexOptions.IgnoreCase);
// Remove empty rules.
css = Regex.Replace(css, @"[^}]+{;}", "");
//Remove semicolon of last property
css = Regex.Replace(css, ";(})", "$1");
if (columnWidth > 0)
{
css = BreakLines(css, columnWidth);
}
return css;
}
private static string RemoveCommentBlocks(string input)
{
int startIndex = 0;
int endIndex = 0;
bool iemac = false;
startIndex = input.IndexOf(@"/*", startIndex);
while (startIndex >= 0)
{
endIndex = input.IndexOf(@"*/", startIndex + 2);
if (endIndex >= startIndex + 2)
{
if (input[endIndex - 1] == '\\')
{
startIndex = endIndex + 2;
iemac = true;
}
else if (iemac)
{
startIndex = endIndex + 2;
iemac = false;
}
else
{
input = input.Remove(startIndex, endIndex + 2 - startIndex);
}
}
startIndex = input.IndexOf(@"/*", startIndex);
}
return input;
}
private static String RGBMatchHandler(Match m)
{
int val = 0;
StringBuilder hexcolor = new StringBuilder("#");
for (int index = 1; index <= 3; index += 1)
{
val = Int32.Parse(m.Groups[index].Value);
hexcolor.Append(val.ToString("x2"));
}
return hexcolor.ToString();
}
private static string BreakLines(string css, int columnWidth)
{
int i = 0;
int start = 0;
StringBuilder sb = new StringBuilder(css);
while (i < sb.Length)
{
char c = sb[i++];
if (c == '}' && i - start > columnWidth)
{
sb.Insert(i, '\n');
start = i;
}
}
return sb.ToString();
}
private static string ReplaceNonEmpty(string inputText, string replacementText)
{
if (replacementText.Trim() != string.Empty)
{
inputText = string.Format(" {0}", replacementText);
}
return inputText;
}
private static string ShortColorNameMatchHandler(Match m)
{
// This function replaces hex color values with named colors if the name is shorter than the hex code
string returnValue = m.Value;
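// The table keys are lowercase but the regex is case-insensitive, so normalize before the lookup
string hex = m.Groups["hex"].Value.ToLower();
if (shortColorNames.ContainsKey(hex))
{
returnValue = shortColorNames[hex].ToString();
}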
return returnValue;
}
private static string ShortColorHexMatchHandler(Match m)
{
//This function replaces named values with their shorter hex equivalent
return shortHexColors[m.Value.ToString().ToLower()].ToString();
}
private static void createHashTable()
{
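// The tables are static, so fill them only once. Adding a duplicate key on a second Minify call would throw.
if (shortColorNames.Count > 0)
{
return;
}
//Color names shorter than hex notation. Except for red.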
shortColorNames.Add("F0FFFF".ToLower(), "Azure".ToLower());
shortColorNames.Add("F5F5DC".ToLower(), "Beige".ToLower());
shortColorNames.Add("FFE4C4".ToLower(), "Bisque".ToLower());
shortColorNames.Add("A52A2A".ToLower(), "Brown".ToLower());
shortColorNames.Add("FF7F50".ToLower(), "Coral".ToLower());
shortColorNames.Add("FFD700".ToLower(), "Gold".ToLower());
shortColorNames.Add("808080".ToLower(), "Grey".ToLower());
shortColorNames.Add("008000".ToLower(), "Green".ToLower());
shortColorNames.Add("4B0082".ToLower(), "Indigo".ToLower());
shortColorNames.Add("FFFFF0".ToLower(), "Ivory".ToLower());
shortColorNames.Add("F0E68C".ToLower(), "Khaki".ToLower());
shortColorNames.Add("FAF0E6".ToLower(), "Linen".ToLower());
shortColorNames.Add("800000".ToLower(), "Maroon".ToLower());
shortColorNames.Add("000080".ToLower(), "Navy".ToLower());
shortColorNames.Add("808000".ToLower(), "Olive".ToLower());
shortColorNames.Add("FFA500".ToLower(), "Orange".ToLower());
shortColorNames.Add("DA70D6".ToLower(), "Orchid".ToLower());
shortColorNames.Add("CD853F".ToLower(), "Peru".ToLower());
shortColorNames.Add("FFC0CB".ToLower(), "Pink".ToLower());
shortColorNames.Add("DDA0DD".ToLower(), "Plum".ToLower());
shortColorNames.Add("800080".ToLower(), "Purple".ToLower());
shortColorNames.Add("FA8072".ToLower(), "Salmon".ToLower());
shortColorNames.Add("A0522D".ToLower(), "Sienna".ToLower());
shortColorNames.Add("C0C0C0".ToLower(), "Silver".ToLower());
shortColorNames.Add("FFFAFA".ToLower(), "Snow".ToLower());
shortColorNames.Add("D2B48C".ToLower(), "Tan".ToLower());
shortColorNames.Add("008080".ToLower(), "Teal".ToLower());
shortColorNames.Add("FF6347".ToLower(), "Tomato".ToLower());
shortColorNames.Add("EE82EE".ToLower(), "Violet".ToLower());
shortColorNames.Add("F5DEB3".ToLower(), "Wheat".ToLower());
// Hex notation shorter than named value
shortHexColors.Add("black", "#000");
shortHexColors.Add("fuchsia", "#f0f");
// Keys must be all lowercase because the lookup lowercases the matched name
shortHexColors.Add("lightslategray", "#789");
shortHexColors.Add("lightslategrey", "#789");
shortHexColors.Add("magenta", "#f0f");
shortHexColors.Add("white", "#fff");
shortHexColors.Add("yellow", "#ff0");
}
private static string ShortHandAllProperties(string css)
{
/*
* This function searches for blocks that specify all the individual properties of a property type
* and reduces them to a single property using shorthand notation
*/
Regex reCSSBlock = new Regex("{[^{}]*}");
Regex reTRBL1 = new Regex(@"(?<fullProperty>(?:(?<property>padding)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL2 = new Regex(@"(?<fullProperty>(?:(?<property>margin)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL3 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:color)))\s*:\s*(?<unit>[#\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL4 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:style)))\s*:\s*(?<unit>none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset);?", RegexOptions.IgnoreCase);
Regex reTRBL5 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:width)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
Regex reListStyle = new Regex(@"list-style-(?<style>type|image|position)\s*:\s*(?<unit>[^};]+);?", RegexOptions.IgnoreCase);
Regex reFont = new Regex(@"font-(?:(?:(?<fontProperty>family\b)\s*:\s*(?<fontPropertyValue>(?:\b[a-zA-Z]+(-[a-zA-Z]+)?\b|\x22[^\x22]+\x22)(?:\s*,\s*(?:\b[a-zA-Z]+(-[a-zA-Z]+)?\b|\x22[^\x22]+\x22))*)\b)|
(?:(?<fontProperty>style\b)\s*:\s*(?<fontPropertyValue>normal|italic|oblique|inherit))|
(?:(?<fontProperty>variant\b)\s*:\s*(?<fontPropertyValue>normal|small-caps|inherit))|
(?:(?<fontProperty>weight\b)\s*:\s*(?<fontPropertyValue>normal|bold|(?:bold|light)er|[1-9]00|inherit))|
(?:(?<fontProperty>size\b)\s*:\s*(?<fontPropertyValue>(?:(?:xx?-)?(?:small|large))|medium|(?:\d*\.?\d+(?:%|(p(?:[xct])|(?:[cem])m|in|ex))\b)|inherit|\b0\b)))\s*;?", (RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace));
Regex reBackGround = new Regex(@"background-(?:
(?:(?<property>color)\s*:\s*(?<unit>transparent|inherit|(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)))|
(?:(?<property>image)\s*:\s*(?<unit>none|inherit|(?:url\s*\([^()]+\))))|
(?:(?<property>repeat)\s*:\s*(?<unit>no-repeat|inherit|repeat(?:-[xy])?))|
(?:(?<property>attachment)\s*:\s*(?<unit>scroll|inherit|fixed))|
(?:(?<property>position)\s*:\s*(?<unit>((?<horizontal>left | center | right|(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+(?<vertical>top | center | bottom |(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))))|
((?<vertical>top | center | bottom )\s+(?<horizontal>left | center | right ))|
((?<horizontal>left | center | right )|(?<vertical>top | center | bottom ))))
);?", (RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture));
MatchCollection mcBlocks = reCSSBlock.Matches(css);
foreach (Match mBlock in mcBlocks)
{
string strBlock = mBlock.Value;
HasAllPositions(reTRBL1, ref strBlock);
HasAllPositions(reTRBL2, ref strBlock);
HasAllPositions(reTRBL3, ref strBlock);
HasAllPositions(reTRBL4, ref strBlock);
HasAllPositions(reTRBL5, ref strBlock);
HasAllListStyle(reListStyle, ref strBlock);
HasAllFontProperties(reFont, ref strBlock);
HasAllBackGroundProperties(reBackGround, ref strBlock);
css = css.Replace(mBlock.Value, strBlock);
}
return css;
}
private static void HasAllBackGroundProperties(Regex re, ref string CSSText)
{
{
MatchCollection mcProperySet = re.Matches(CSSText);
int z = 5;
if (mcProperySet.Count == z)
{
int y = 0;
for (int x = 0; x < z; x = x + 1)
{
switch (mcProperySet[x].Groups["property"].Value)
{
case "color":
y = y + 1;
break;
case "image":
y = y + 2;
break;
case "repeat":
y = y + 4;
break;
case "attachment":
y = y + 8;
break;
case "position":
y = y + 16;
break;
}
}
if (y == 31)
{
CSSText = ShortHandBackGroundReplaceV2(mcProperySet, re, CSSText);
}
}
}
}
private static void HasAllFontProperties(Regex re, ref string CSSText)
{
{
MatchCollection mcProperySet = re.Matches(CSSText);
int z = 5;
if (mcProperySet.Count == z)
{
int y = 0;
for (int x = 0; x < z; x = x + 1)
{
switch (mcProperySet[x].Groups["fontProperty"].Value)
{
case "style":
y = y + 1;
break;
case "variant":
y = y + 2;
break;
case "weight":
y = y + 4;
break;
case "size":
y = y + 8;
break;
case "family":
y = y + 16;
break;
}
}
if (y == 31)
{
CSSText = ShortHandFontReplaceV2(mcProperySet, re, CSSText);
}
}
}
}
private static void HasAllListStyle(Regex re, ref string CSSText)
{
{
int z = 3;
MatchCollection mcProperySet = re.Matches(CSSText);
if (mcProperySet.Count == z)
{
int y = 0;
for (int x = 0; x < z; x = x + 1)
{
switch (mcProperySet[x].Groups["style"].Value)
{
case "type":
y = y + 1;
break;
case "image":
y = y + 2;
break;
case "position":
y = y + 4;
break;
}
}
if (y == 7)
{
CSSText = ShortHandListReplaceV2(mcProperySet, re, CSSText);
}
}
}
}
private static void HasAllPositions(Regex re, ref string CSSText)
{
{
MatchCollection mcProperySet = re.Matches(CSSText);
if (mcProperySet.Count == 4)
{
int y = 0;
for (int x = 0; x < 4; x = x + 1)
{
switch (mcProperySet[x].Groups["position"].Value)
{
case "top":
y = y + 1;
break;
case "right":
y = y + 2;
break;
case "bottom":
y = y + 4;
break;
case "left":
y = y + 8;
break;
}
}
if (y == 15)
{
CSSText = ShortHandReplaceV2(mcProperySet, re, CSSText);
}
}
}
}
private static string ShortHandFontReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
{
/*
* This Function replaces the individual font properties with a single entry
* */
string strFamily, strStyle, strVariant, strWeight, strSize;
Regex reLineHeight = new Regex(@"line-height\s*:\s*((?:\d*\.?\d+(?:%|(p(?:[xct])|(?:[cem])m|in|ex)\b)?)|normal|inherit);?", RegexOptions.IgnoreCase);
strFamily = string.Empty;
strStyle = string.Empty;
strVariant = string.Empty;
strWeight = string.Empty;
strSize = string.Empty;
string strStyle_Variant_Weight = string.Empty;
foreach (Match mProperty in mcProperySet)
{
switch (mProperty.Groups["fontProperty"].Value)
{
case "family":
strFamily = string.Format(" {0}", mProperty.Groups["fontPropertyValue"].Value);
break;
case "size":
if (reLineHeight.IsMatch(InputText))
{
Match m = reLineHeight.Match(InputText);
if (m.Groups[1].Value != "normal")
{
strSize = String.Format("/{0}", m.Groups[1].Value);
}
InputText = reLineHeight.Replace(InputText, string.Empty);
}
strSize = string.Format(" {0}{1}", mProperty.Groups["fontPropertyValue"].Value, strSize);
if (strSize == "medium")
{
strSize = string.Empty;
}
break;
case "style":
case "variant":
case "weight":
if (mProperty.Groups["fontPropertyValue"].Value != "normal")
{
strStyle_Variant_Weight += string.Format(" {0}", mProperty.Groups["fontPropertyValue"].Value);
}
break;
}
}
string strShortcut;
string strProperties = string.Format("{0}{1}{2}{3}{4};", strStyle_Variant_Weight, strVariant, strWeight, strSize, strFamily);
strShortcut = string.Format("font:{0}", strProperties.Trim());
string strNewBlock = re.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
private static string ShortHandBackGroundReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
{
/*
* This Function replaces the individual background properties with a single entry
* */
string strColor, strImage, strRepeat, strAttachment, strPosition;
strColor = string.Empty;
strImage = string.Empty;
strRepeat = string.Empty;
strAttachment = string.Empty;
strPosition = string.Empty;
foreach (Match mProperty in mcProperySet)
{
switch (mProperty.Groups["property"].Value)
{
case "color":
if (mProperty.Groups["unit"].Value != "transparent")
{
strColor = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "image":
if (mProperty.Groups["unit"].Value != "none")
{
strImage = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "repeat":
if (mProperty.Groups["unit"].Value != "repeat")
{
strRepeat = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "attachment":
if (mProperty.Groups["unit"].Value != "scroll")
{
strAttachment = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "position":
if (mProperty.Groups["unit"].Value != "0% 0%")
{
strPosition = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
}
}
string strShortcut;
string strProperties = string.Format("{0}{1}{2}{3}{4};", strColor, strImage, strRepeat, strAttachment, strPosition);
strShortcut = string.Format("background:{0}", strProperties.Trim());
string strNewBlock = re.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
private static string ShortHandReplaceV2(MatchCollection mcProperySet, Regex reTRBL1, string InputText)
{
// Replace method for regexes used in ShortHand property method for properties with top, right, bottom and left sub properties.
string strTop, strRight, strBottom, strLeft;
strTop = string.Empty;
strRight = string.Empty;
strBottom = string.Empty;
strLeft = string.Empty;
string strProperty;
strProperty = string.Format("{0}{1}", mcProperySet[0].Groups["property"].Value, mcProperySet[0].Groups["property2"].Value);
foreach (Match mProperty in mcProperySet)
{
switch (mProperty.Groups["position"].Value)
{
case "top":
strTop = mProperty.Groups["unit"].Value;
break;
case "right":
strRight = mProperty.Groups["unit"].Value;
break;
case "bottom":
strBottom = mProperty.Groups["unit"].Value;
break;
case "left":
strLeft = mProperty.Groups["unit"].Value;
break;
}
}
string strShortcut = string.Format("{0}:{1} {2} {3} {4};", strProperty, strTop, strRight, strBottom, strLeft);
string strNewBlock = reTRBL1.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
private static string ShortHandListReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
{
/*
* This Function replaces the individual list properties with a single entry
* */
string strType, strPosition, strImage;
strType = string.Empty;
strPosition = string.Empty;
strImage = string.Empty;
foreach (Match mProperty in mcProperySet)
{
switch (mProperty.Groups["style"].Value)
{
case "type":
if (mProperty.Groups["unit"].Value != "disc")
{
strType = mProperty.Groups["unit"].Value;
}
break;
case "position":
if (mProperty.Groups["unit"].Value != "outside")
{
strPosition = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "image":
if (mProperty.Groups["unit"].Value != "none")
{
strImage = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
}
}
string strShortcut = string.Format("list-style:{0}{1}{2};", strType, strPosition, strImage);
string strNewBlock = re.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
}
}
|
-
OK, these regexes were discussed in the previous post; this is mostly just their application.
This is a C# 2.0 enhancement of a C# port of YUI Compressor's CSS minification code
Since I was doing this in C# I took full advantage of its regex engine, namely using lookbehinds and delegates for some replaces.
Almost all the regexes after the "New Test" comment are new or modified regexes relative to the ported version. There is also one new and two modified expressions before that comment. One of those modifications is just a change in writing style; the others replace some code with a regex replace but (hopefully) keep the functionality. The new regex replacements, of course, are the new compression enhancements.
There are also a couple of new regexes not mentioned in the previous post that match some of the color values and replace them with an equivalent but more concisely written value. The replacement for the color "red" is a straight replace, but the other colors require some code evaluation and use delegates.
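As a rough illustration of the delegate idea, here is the same kind of color shortening sketched in JavaScript, where a replacement callback plays the role of a .NET MatchEvaluator. The function names `rgbToHex` and `shortenHex` and the simplified patterns are assumptions for this sketch. Unlike the C# expressions, it does not guard against hex values inside quoted strings or filter expressions.

```javascript
// rgb(51,102,153) -> #336699, computed in a replacement callback.
function rgbToHex(css) {
  return css.replace(
    /rgb\s*\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)/gi,
    (whole, r, g, b) => {
      const parts = [r, g, b].map(Number);
      if (parts.some((n) => n > 255)) return whole; // out of range: leave as-is
      return "#" + parts.map((n) => n.toString(16).padStart(2, "0")).join("");
    }
  );
}

// #336699 -> #369 when each pair of hex digits repeats.
function shortenHex(css) {
  return css.replace(/#([0-9a-f])\1([0-9a-f])\2([0-9a-f])\3\b/gi, "#$1$2$3");
}

console.log(shortenHex(rgbToHex("a{color:rgb(51,102,153);}"))); // a{color:#369;}
```

Running the rgb conversion first matters, because it produces the six-digit form that the second replace can then shorten.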
I've done some very limited testing, but as I mentioned in the previous post, most of the CSS I've written doesn't have some of the new things I was searching for. I could add them for a test (which I did), but that won't catch any problems they may cause in the actual application of the CSS, since I wasn't really using the test values. So the source code is now available for beta testing. Test early and often before committing to use it. I'm willing to fix any minor bugs for things I may have overlooked, but if a particular replace is problematic it's easy enough to comment out the offender and use the rest.
And as was mentioned in the comments of the previous post, any generated content that looks like CSS may get stepped on, so be aware of that.
Also, all licenses for previous versions still apply.
UPDATE 2008-04-27
After a little more testing I discovered that one of the replaces I was doing can alter how the CSS is processed. So I have crossed out the functions and the function call. I've come up with a replacement that is safer, though less likely to apply.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;
namespace CSSMinify
{
class CSSMinify
{
public static Hashtable shortColorNames = new Hashtable();
public static Hashtable shortHexColors = new Hashtable();
public static string Minify(string css)
{
return Minify(css, 0);
}
public static string Minify(string css, int columnWidth)
{
// BSD License http://developer.yahoo.net/yui/license.txt
// New css tests and regexes by Michael Ash
createHashTable();
MatchEvaluator rgbDelegate = new MatchEvaluator(RGBMatchHandler);
MatchEvaluator shortColorNameDelegate = new MatchEvaluator(ShortColorNameMatchHandler);
MatchEvaluator shortColorHexDelegate = new MatchEvaluator(ShortColorHexMatchHandler);
css = RemoveCommentBlocks(css);
css = Regex.Replace(css, @"\s+", " "); //Normalize whitespace
css = Regex.Replace(css, @"\x22\x5C\x22}\x5C\x22\x22", "___PSEUDOCLASSBMH___"); //hide Box model hack
/* Remove the spaces before the things that should not have spaces before them.
But, be careful not to turn "p :link {...}" into "p:link{...}"
*/
css = Regex.Replace(css, @"(?#no preceding space needed)\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
css = Regex.Replace(css, @"([!{}:;>+([,])\s+", "$1"); // Remove the spaces after the things that should not have spaces after them.
css = Regex.Replace(css, @"([^;}])}", "$1;}"); // Add the semicolon where it's missing.
css = Regex.Replace(css, @"(\d+)\.0+(p(?:[xct])|(?:[cem])m|%|in|ex)\b", "$1$2"); // Remove .0 from size units x.0em becomes xem
css = Regex.Replace(css, @"([\s:])(0)(px|em|%|in|cm|mm|pc|pt|ex)\b", "$1$2"); // Remove unit from zero
//New test
css = ShortHandProperty(css);
//css = Regex.Replace(css, @":(\s*0){2,4}\s*;", ":0;"); // if all parameters zero just use 1 parameter
// if all 4 parameters the same unit make 1 parameter
css = Regex.Replace(css, @":\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};", ":$1;", RegexOptions.IgnoreCase);
// if has 4 parameters and top unit = bottom unit and right unit = left unit make 2 parameters
css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;", ":$1;", RegexOptions.IgnoreCase);
// if has 4 parameters and top unit != bottom unit and right unit = left unit make 3 parameters
css = Regex.Replace(css, @":\s*((?:(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
//// if has 3 parameters and top unit = bottom unit make 2 parameters
//css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])| (?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
css = Regex.Replace(css,"background-position:0;", "background-position:0 0;");
css = Regex.Replace(css,@"(:|\s)0+\.(\d+)", "$1.$2");
// Outline-style and border-style parameter reduction
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
// Outline-color and Border-color parameter reduction
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};", "$1-color:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);", "$1-color:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);", "$1-color:$2;", RegexOptions.IgnoreCase);
// Shorten colors from rgb(51,102,153) to #336699
// This makes it more likely that it'll get further compressed in the next step.
css = Regex.Replace(css,@"rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29", rgbDelegate);
css = Regex.Replace(css, @"(?<![\x22\x27=]\s*)\#(?:([0-9A-F])\1)(?:([0-9A-F])\2)(?:([0-9A-F])\3)", "#$1$2$3", RegexOptions.IgnoreCase);
// Replace hex color code with the named value if the name is shorter
css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>f00)\b", "red",RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>[0-9a-f]{6})", shortColorNameDelegate, RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(?<=color\s*:\s*)\b(Black|Fuchsia|LightSlateGr[ae]y|Magenta|White|Yellow)\b", shortColorHexDelegate,RegexOptions.IgnoreCase);
// Remove empty rules.
css = Regex.Replace(css,@"[^}]+{;}", "");
//Remove semicolon of last property
css = Regex.Replace(css, ";(})", "$1");
if (columnWidth > 0)
{
css = BreakLines(css, columnWidth);
}
return css;
}
private static string RemoveCommentBlocks(string input)
{
int startIndex = 0;
int endIndex = 0;
bool iemac = false;
startIndex = input.IndexOf(@"/*", startIndex);
while (startIndex >= 0)
{
endIndex = input.IndexOf(@"*/", startIndex + 2);
if (endIndex >= startIndex + 2)
{
if (input[endIndex - 1] == '\\')
{
startIndex = endIndex + 2;
iemac = true;
}
else if (iemac)
{
startIndex = endIndex + 2;
iemac = false;
}
else
{
input = input.Remove(startIndex, endIndex + 2 - startIndex);
}
}
startIndex = input.IndexOf(@"/*", startIndex);
}
return input;
}
private static String RGBMatchHandler(Match m)
{
int val = 0;
StringBuilder hexcolor = new StringBuilder("#");
for(int index=1; index <= 3; index += 1)
{
val = Int32.Parse(m.Groups[index].Value);
hexcolor.Append(val.ToString("x2"));
}
return hexcolor.ToString();
}
private static string BreakLines(string css, int columnWidth)
{
int i = 0;
int start = 0;
StringBuilder sb = new StringBuilder(css);
while (i < sb.Length)
{
char c = sb[i++];
if (c == '}' && i - start > columnWidth)
{
sb.Insert(i, '\n');
start = i;
}
}
return sb.ToString();
}
private static string ShortHandProperty(string css)
{
/*
* This function searches for properties specifying at least 2 of the top, right, bottom or left box model
* positions and reduces them to a single property using shorthand notation
*/
Regex reCSSBlock = new Regex("{[^{}]*}");
Regex reTRBL1 = new Regex(@"(?<fullProperty>(?:(?<property>padding)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL2 = new Regex(@"(?<fullProperty>(?:(?<property>margin)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL3 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:color)))\s*:\s*(?<unit>[#\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL4 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:style)))\s*:\s*(?<unit>none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset);?", RegexOptions.IgnoreCase);
Regex reTRBL5 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:width)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
MatchCollection mcBlocks = reCSSBlock.Matches(css);
foreach (Match mBlock in mcBlocks)
{
string strBlock= mBlock.Value;
MatchCollection mcProperySet = reTRBL1.Matches(strBlock);
if (mcProperySet.Count > 1)
{
strBlock = ShortHandReplace(mcProperySet, reTRBL1, strBlock);
}
mcProperySet = reTRBL2.Matches(strBlock);
if (mcProperySet.Count > 1)
{
strBlock = ShortHandReplace(mcProperySet, reTRBL2, strBlock);
}
mcProperySet = reTRBL3.Matches(strBlock);
if (mcProperySet.Count > 1)
{
strBlock = ShortHandReplace(mcProperySet, reTRBL3, strBlock);
}
mcProperySet = reTRBL4.Matches(strBlock);
if (mcProperySet.Count > 1)
{
strBlock = ShortHandReplace(mcProperySet, reTRBL4, strBlock);
}
mcProperySet = reTRBL5.Matches(strBlock);
if (mcProperySet.Count > 1)
{
strBlock = ShortHandReplace(mcProperySet, reTRBL5, strBlock);
}
css = css.Replace(mBlock.Value, strBlock);
}
return css;
}
private static string ShortHandReplace(MatchCollection mcProperySet, Regex reTRBL1, string InputText)
{
// Replace method for regexes used in ShortHand property method.
string strTop, strRight, strBottom, strLeft;
strTop = string.Empty;
strRight = string.Empty;
strBottom = string.Empty;
strLeft = string.Empty;
string strProperty;
string strDefaultValue;
strProperty = string.Format("{0}{1}", mcProperySet[0].Groups["property"].Value, mcProperySet[0].Groups["property2"].Value);
switch (strProperty){
case "border-color":
strDefaultValue = "inherit";
break;
case "border-style":
strDefaultValue = "none";
break;
default:
strDefaultValue = "0";
break;
}
foreach (Match mProperty in mcProperySet)
{
if (mProperty.Groups["position"].Value == "top")
{
if (strTop == string.Empty)
{
strTop = mProperty.Groups["unit"].Value;
}
else
{
break;
}
}
if (mProperty.Groups["position"].Value == "right")
{
if (strRight == string.Empty)
{
strRight = mProperty.Groups["unit"].Value;
}
else
{
break;
}
}
if (mProperty.Groups["position"].Value == "bottom")
{
if (strBottom == string.Empty)
{
strBottom = mProperty.Groups["unit"].Value;
}
else
{
break;
}
}
if (mProperty.Groups["position"].Value == "left")
{
if (strLeft == string.Empty)
{
strLeft = mProperty.Groups["unit"].Value;
}
else
{
break;
}
}
}
if (strTop == string.Empty)
{
strTop = strDefaultValue;
}
if (strRight == string.Empty)
{
strRight = strDefaultValue;
}
if (strBottom == string.Empty)
{
strBottom = strDefaultValue;
}
if (strLeft == string.Empty)
{
strLeft = strDefaultValue;
}
string strShortcut = string.Format("{0}:{1} {2} {3} {4};", strProperty, strTop, strRight, strBottom, strLeft);
string strNewBlock = reTRBL1.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
private static string ShortColorNameMatchHandler(Match m)
{
// This function replaces hex color values with named colors if the name is shorter than the hex code
string returnValue = m.Value;
if (shortColorNames.ContainsKey(m.Groups["hex"].Value))
{
returnValue = shortColorNames[m.Groups["hex"].Value].ToString();
}
return returnValue;
}
private static string ShortColorHexMatchHandler(Match m)
{
return shortHexColors[m.Value.ToString().ToLower()].ToString();
}
private static void createHashTable()
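{
// Minify calls this on every run and Hashtable.Add throws on a duplicate key, so only fill the tables once.
if (shortColorNames.Count > 0) return;
//Color names shorter than hex notation. Except for red.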
shortColorNames.Add("F0FFFF".ToLower(), "Azure".ToLower());
shortColorNames.Add("F5F5DC".ToLower(), "Beige".ToLower());
shortColorNames.Add("FFE4C4".ToLower(), "Bisque".ToLower());
shortColorNames.Add("A52A2A".ToLower(), "Brown".ToLower());
shortColorNames.Add("FF7F50".ToLower(), "Coral".ToLower());
shortColorNames.Add("FFD700".ToLower(), "Gold".ToLower());
shortColorNames.Add("808080".ToLower(), "Grey".ToLower());
shortColorNames.Add("008000".ToLower(), "Green".ToLower());
shortColorNames.Add("4B0082".ToLower(), "Indigo".ToLower());
shortColorNames.Add("FFFFF0".ToLower(), "Ivory".ToLower());
shortColorNames.Add("F0E68C".ToLower(), "Khaki".ToLower());
shortColorNames.Add("FAF0E6".ToLower(), "Linen".ToLower());
shortColorNames.Add("800000".ToLower(), "Maroon".ToLower());
shortColorNames.Add("000080".ToLower(), "Navy".ToLower());
shortColorNames.Add("808000".ToLower(), "Olive".ToLower());
shortColorNames.Add("FFA500".ToLower(), "Orange".ToLower());
shortColorNames.Add("DA70D6".ToLower(), "Orchid".ToLower());
shortColorNames.Add("CD853F".ToLower(), "Peru".ToLower());
shortColorNames.Add("FFC0CB".ToLower(), "Pink".ToLower());
shortColorNames.Add("DDA0DD".ToLower(), "Plum".ToLower());
shortColorNames.Add("800080".ToLower(), "Purple".ToLower());
shortColorNames.Add("FA8072".ToLower(), "Salmon".ToLower());
shortColorNames.Add("A0522D".ToLower(), "Sienna".ToLower());
shortColorNames.Add("C0C0C0".ToLower(), "Silver".ToLower());
shortColorNames.Add("FFFAFA".ToLower(), "Snow".ToLower());
shortColorNames.Add("D2B48C".ToLower(), "Tan".ToLower());
shortColorNames.Add("008080".ToLower(), "Teal".ToLower());
shortColorNames.Add("FF6347".ToLower(), "Tomato".ToLower());
shortColorNames.Add("EE82EE".ToLower(), "Violet".ToLower());
shortColorNames.Add("F5DEB3".ToLower(), "Wheat".ToLower());
// Hex notation shorter than named value
shortHexColors.Add("black", "#000");
shortHexColors.Add("fuchsia", "#f0f");
shortHexColors.Add("lightslategray", "#789"); // keys must be lower case: the lookup calls ToLower()
shortHexColors.Add("lightslategrey", "#789");
shortHexColors.Add("magenta", "#f0f");
shortHexColors.Add("white", "#fff");
shortHexColors.Add("yellow", "#ff0");
}
}
}
|
-
NOTE: All the regexes referenced on this page that were written by me assume IgnoreCase = true
I was looking at the regexes used in
the YUI Compressor to minify CSS and came up with a couple more
that I think could help the process. The code and port I was looking
at were already trimming unneeded zeros used for the top, right,
bottom, left values with a simple string replace. But there were
three separate replaces being done. It was pretty simple to come up
with a regex to handle all the cases.
(Pseudo code)
string.Replace(":0 0 0 0;",":0;")
string.Replace(":0 0 0;",":0;")
string.Replace(":0 0;",":0;")
becomes
Regex.Replace(input,":(\s*0)(\s+0){0,3}\s*;",":0;")
Pretty simple, but of course I thought,
why stop there? So I came up with a regex for all numbers
:\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};
The replacement string is simply ":$1;"
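As a rough sketch of how this pattern behaves, here it is tried out in JavaScript rather than .NET. The helper name collapseSameValues is made up for illustration, and the g and i flags are my stand-ins for Regex.Replace with IgnoreCase.

```javascript
// Illustrative sketch only: collapseSameValues is a hypothetical helper name.
// The pattern is the one shown above; "g" replaces every occurrence, "i" mirrors IgnoreCase.
const sameValues = /:\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};/gi;

function collapseSameValues(css) {
  // Group 1 holds the first value; the back-reference \1 demands 1 to 3 exact repeats of it.
  return css.replace(sameValues, ":$1;");
}

console.log(collapseSameValues("p{margin:5px 5px 5px 5px;}")); // p{margin:5px;}
console.log(collapseSameValues("p{padding:0 0;}"));            // p{padding:0;}
console.log(collapseSameValues("p{margin:1em 2em;}"));         // unchanged, the values differ
```

The back-reference is what keeps this safe: a declaration is only shortened when every value is character for character the same as the first.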
Once this was done, next on the list was
to handle cases where the numbers are not all the same. For those who
don't know CSS, this part of the syntax basically says of the 4
possible values:
1) if only one value is specified the other 3 are implied to be the same value (X = X X X X)
2) if two values are specified the 1st implies the 3rd and the 2nd implies the 4th (X Y = X Y X Y)
3) if three values are specified the 2nd implies the 4th (X Y Z = X Y Z Y)
Of course when minifying you want to use the
shorter syntax, so the following regexes convert the longer to the
shorter. The replacement string for all is the
same as above, ":$1;"
# 4 parameters to 2 (x y x y to x y)
:\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;
# 4 to 3 (x y z y to x y z) or 3 to 2 (x y x to x y)
:\s*((?:(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;
Though they may look unwieldy, the
longer ones just repeat one of the sub-patterns. The only real
difference is what is and isn't captured.
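To see the two reductions side by side, here is a small JavaScript sketch. The patterns are copied from above; the helper name shortenBoxValues, the g and i flags and the sample declarations are my own assumptions for illustration.

```javascript
// Illustrative sketch only: shortenBoxValues is a hypothetical helper name.
// x y x y  ->  x y
const fourToTwo = /:\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;/gi;
// x y z y  ->  x y z   and   x y x  ->  x y
const dropRepeatedLast = /:\s*((?:(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;/gi;

function shortenBoxValues(css) {
  // Same order as the listing: the x y x y case first, then the trailing-repeat cases.
  return css.replace(fourToTwo, ":$1;").replace(dropRepeatedLast, ":$1;");
}

console.log(shortenBoxValues("p{margin:1px 2px 1px 2px;}")); // p{margin:1px 2px;}
console.log(shortenBoxValues("p{margin:1px 2px 3px 2px;}")); // p{margin:1px 2px 3px;}
console.log(shortenBoxValues("p{padding:1em 0 1em;}"));      // p{padding:1em 0;}
```

Note that \d?\.?\d only covers numbers of one or two digits, so a value such as 100px is left alone by these two patterns as written.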
Along those same lines I came up with
similar patterns for border-style, outline-style, border-color and
outline-color.
border-style/outline-style
The replacement string for all is
“$1-style:$2;”
(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};
(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);
(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);
border-color/outline-color
The replacement string for all is
“$1-color:$2;”
(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};
(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);
(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);
I also came up with a couple more
regexes to replace some code, but as a couple of these use look-behinds
they are not as portable.
These work in .NET, which supports
variable-length look-behinds.
This pattern
\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))
is used to find characters preceded by unneeded whitespace, taking care
not to find colons that are used as pseudo-selectors or
pseudo-classes. The replacement string would be $1
This pattern
(?<![\x22\x27=]\s*)\#([0-9A-F])\1([0-9A-F])\2([0-9A-F])\3
is used for reducing hexadecimal values from #AABBCC to #ABC. Since the
pattern consumes the # as well, the replacement string would be #$1$2$3.
The pattern I created was to find RGB values of the format rgb(x, y,
z) where x, y and z are integers in the range of 0-255.
rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29
The pattern I used was just stricter than the existing one and put each
value in a group so I could work with them in further
processing. The match itself is passed to a MatchEvaluator to do the
decimal to hex conversion.
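The two colour patterns can be sketched together in JavaScript, where a replacer callback plays the role of the MatchEvaluator. The helper names rgbToHex and shortenHex are made up for this example, and it assumes an engine with variable-length look-behind (current Node.js and browsers have it; many other regex flavours do not).

```javascript
// Illustrative sketch only: rgbToHex and shortenHex are hypothetical helper names.
const rgbPattern = /rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29/gi;
// Same hex pattern as above; the # needs no escape in a JavaScript regex.
const hexPairs = /(?<![\x22\x27=]\s*)#([0-9A-F])\1([0-9A-F])\2([0-9A-F])\3/gi;

function rgbToHex(css) {
  // Each captured decimal value becomes a two-digit lower-case hex pair.
  return css.replace(rgbPattern, (match, r, g, b) =>
    "#" + [r, g, b].map(v => Number(v).toString(16).padStart(2, "0")).join(""));
}

function shortenHex(css) {
  // The look-behind skips values that follow a quote or an equals sign.
  return css.replace(hexPairs, "#$1$2$3");
}

const step1 = rgbToHex("a{color:rgb(51, 102, 153)}");
console.log(step1);             // a{color:#336699}
console.log(shortenHex(step1)); // a{color:#369}
console.log(shortenHex('a{filter:chroma(color="#FFFFFF")}')); // unchanged
console.log(rgbToHex("a{color:rgb(300,0,0)}"));               // unchanged, 300 is out of range
```

Running the rgb conversion first is what lets the second step shorten the result further, as the comment in the listing says.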
I did a little alpha testing, but most of the CSS I've written doesn't
have the stuff I'm testing for. So beta testing is definitely in
order. Also, I don't really know CSS hacks, so I don't know if any of
this will have an adverse effect on hacks. Other than the
top/left/bottom/right parameters I'm mostly trying to duplicate the
existing effects. I'm going to pass this info on to the YUI Compressor
maintainer to see if they can make use of any of this.
|
-
The square brackets character class is
one of the more misunderstood of the basic regex features. This
feature is supported in virtually all regex implementations. In fact
off the top of my head I don't know an implementation that doesn't
support it. Maybe it's not well documented in most tutorials or
maybe the samples are not clear enough or maybe users are just
skimming over the details for this one but I see this feature misused
quite often.
This type of character class simply
matches one and only one of the characters between the square
brackets. That's pretty much it. Now there are a few caveats to
what can go between the brackets, but the output is the same. One, and
only one, character will be returned if there is a match.
Now a common error rookie regex users
tend to make is trying to write patterns inside the character class
to make it match a particular sequence of characters: either a
particular fixed string of characters or another, more generic regex
pattern. One of the obvious tells of this approach is when the same
character appears in the character class multiple times, something
like [RADAR] or [HELLO]. Listen up newbies, you can't write any sort
of pattern within a character class, so don't waste your time trying.
If you have a fixed string like “ABC” don't try using a character
class [ABC] to match that string as a whole. As I stated in the
previous paragraph, this feature is for matching a character
(singular) from a group of characters, not an ordering of characters
(plural), so you don't get a match of “ABC”. What you get is 3
matches, since each character in the string is one of the 3
characters. There is nothing you can do inside of the brackets to
make it return “ABC” as a single match. You can't group these
characters together since you can't write any patterns within the
character class. All the characters in the class are tested against
just one position in your input. Now you can add a quantifier, like +
or *, after the character class to make “ABC” one match, but remember
what the character class matches: one character. The quantifier
allows the character class, the whole class, to be applied multiple
times. None of the possible choices are eliminated in following
applications of the class. Every application of [ABC] tests for 1 of
3 possible characters; [ABC]+ tests for 1 of 3 possible characters
one or more times. So while it will match “ABC” it doesn't only match
“ABC”, which brings me to my second topic, order.
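A quick way to convince yourself of this is to run the patterns. The following is a small JavaScript sketch; the variable names and the sample strings are mine.

```javascript
// Illustrative sketch: a character class consumes exactly one character per application.
const input = "ABC";

const oneAtATime = input.match(/[ABC]/g);      // three separate one-character matches
const literal = input.match(/ABC/g);           // the fixed string, matched as a whole
const repeated = "CAB BACA".match(/[ABC]+/g);  // the class applied one or more times
const notOnlyAbc = /^[ABC]+$/.test("CCC");     // the quantified class also accepts this

console.log(oneAtATime); // [ 'A', 'B', 'C' ]
console.log(literal);    // [ 'ABC' ]
console.log(repeated);   // [ 'CAB', 'BACA' ]
console.log(notOnlyAbc); // true
```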
Along with the incorrect belief that
you can include a pattern within a character class comes the belief
that the characters within the class have to be in the same order as
the characters being tested against. This isn't true. Everything I
said about the regex pattern [ABC] is true for the patterns [CBA] or
[CAB] or [BCA]. They all match exactly the same thing. With a few
exceptions, a couple of which have to do with ranges, one of which
I've discussed before, the order of the characters doesn't matter.
If you are using a range, the value on the left of the hyphen cannot
have a higher ASCII value (or code point for Unicode) than the value
on the right. Of the previous equivalent
character class patterns the only way to write an equivalent range is
[A-C]. [C-A] won't work and in some implementations will throw a
compile error.
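Here is a short JavaScript sketch of both points. The sample word is my own, and JavaScript happens to be one of the implementations that rejects a reversed range when the pattern is compiled.

```javascript
// Illustrative sketch: order inside the brackets does not change what is matched.
const sample = "CABBAGE";
const variants = [/[ABC]/g, /[CBA]/g, /[CAB]/g, /[BCA]/g, /[A-C]/g];
const results = variants.map(re => sample.match(re).join(""));
console.log(results); // every entry is 'CABBA'

// A reversed range is an error in JavaScript; build it at runtime so we can catch it.
let reversedRangeError = null;
try {
  new RegExp("[C-A]");
} catch (err) {
  reversedRangeError = err;
}
console.log(reversedRangeError instanceof SyntaxError); // true
```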
Now these first couple of errors
may be because new users think that alternation and character
classes are interchangeable. They are not. Some tutorials may
present that as the case, especially if the examples for each are too
simple. Any character class pattern can be written using
alternation; however, the reverse isn't always true. Alternation can
include single characters, literal strings or general regex patterns.
The character class pattern [ABC] can be written using alternation as
A|B|C. The alternation pattern of single characters C|D|E can be
written [CDE] or [DCE], but the alternation pattern of
multiple-character patterns (ABC)|(DEF)|(GHI) cannot be written as an
equivalent pattern using just a character class. Though you could
write a pattern that would include matches of the same 3 values, you
couldn't limit it to matching just those 3 values. Even a negated
character class can be written using alternation, but that would
generally lead to enormous patterns, making it impractical to write
patterns in that fashion. Those trying to write patterns inside of a
character class need to focus their attention on alternation and/or
one or more grouping constructs.
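The difference shows up as soon as you test a few strings. In this JavaScript sketch the anchors, the non-capturing groups and the word list are my additions, there only to make the comparison explicit.

```javascript
// Illustrative sketch: alternation can be limited to three strings, a character class cannot.
const words = ["ABC", "DEF", "GHI", "ADG", "AAA", "IHG"];

const alternation = /^(?:ABC|DEF|GHI)$/;  // exactly these three strings
const classAttempt = /^[ABCDEFGHI]{3}$/;  // any three characters drawn from the set

const byAlternation = words.filter(w => alternation.test(w));
const byClass = words.filter(w => classAttempt.test(w));

console.log(byAlternation); // [ 'ABC', 'DEF', 'GHI' ]
console.log(byClass);       // all six words, including 'ADG', 'AAA' and 'IHG'

// For single characters the two forms do agree.
const singleCharsAgree = ["A", "B", "C", "D"].every(
  c => /^[ABC]$/.test(c) === /^(?:A|B|C)$/.test(c));
console.log(singleCharsAgree); // true
```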
The final error I see a lot is one that
gets regex rookies and vets alike: regex metacharacters. With rookies
this mostly falls within trying to use a pattern within the
character class by trying to use quantifiers. Another is using
the dot character, which in this situation matches only itself.
Within a character class in most implementations the only
metacharacter that still retains its special regex meaning is \ ,
which means it can be used as it normally is. The shorthand character
class notations will work the same, except for \b which alters its
meaning, which makes sense if you think about it. You can't have any
patterns within a character class, and \b in its use outside of a
character class doesn't match a character and would only be used as
part of a pattern.
However one metacharacter, ^, alters its
behavior, but only if it is the first character inside the brackets.
This is the other position-dependent case I was referring to earlier.
All the other regular metacharacters that have a special meaning
outside a character class lose their special meaning within one and are
interpreted as literals, so there is no need to escape these characters,
though doing so isn't wrong, just unnecessary. I talked about the
hyphen, which gains a feature, a few years ago.
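These rules are easy to check by experiment. The sketch below uses JavaScript, so the details (such as \b meaning backspace inside the brackets) are what that engine does; the sample strings are mine and other implementations may differ slightly.

```javascript
// Illustrative sketch: how common metacharacters behave inside a character class.
const literals = "a.b+c*d".match(/[.+*]/g);  // dot, plus and star are plain characters here
const shorthand = "a1b22".match(/[\d]/g);    // shorthand classes still work
const negated = "abcd".match(/[^abc]/g);     // a leading ^ negates the class
const literalCaret = "a^b".match(/[a^]/g);   // anywhere else ^ is just a caret
const backspace = /[\b]/.test("\b");         // inside the brackets \b is the backspace character
const noBoundary = /[\b]/.test("a b");       // so it no longer finds word boundaries

console.log(literals);     // [ '.', '+', '*' ]
console.log(shorthand);    // [ '1', '2', '2' ]
console.log(negated);      // [ 'd' ]
console.log(literalCaret); // [ 'a', '^' ]
console.log(backspace);    // true
console.log(noBoundary);   // false
```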
Finally, one last thing I see rookie
regex users doing isn't really an error but just makes it look like
they aren't sure of when to use a character class, and that's when
they have a positive character class with only one character in it,
like [k]. There is no reason to do this. Just type the character. If
you were negating a single character then that makes sense, but
otherwise it's just extra typing.
The character class is great when you
are trying to match (or negate) a single character at one position in
your input and you have lots of possible characters to choose from,
but anything else is outside of its domain.
|
-
One thing I've noticed among rookie
regex users is that they focus way too much on what they are trying
to match and not on where they are trying to match it from. There is
a tendency to drastically underestimate the importance of the
source data they are trying to apply their regex against.
I see this all the time on forums, where
posters asking for help almost never post any sample data, and most
of the few that do post a sample they made up on the spot. Over on
the RegexAdvice construction forum we've added guidelines for
posting a question. The whole point of the guidelines is to
expedite getting questions answered. One of the things we ask is
that posters post a sample of the actual data they are trying to
match. We explicitly ask that they don't make up a sample.
Unfortunately most people don't follow this request, along with
others in the guidelines, which generally results in those questions
taking longer to get answered. On this particular point, not
supplying any sample is the worst offense, because anyone trying to
help you can't see what you are working with and can only give an
untested guess at a solution. It's like calling a mechanic on the
phone and saying “My car isn't running right”, leaving your number
and waiting for him to call back to tell you how to fix it. He can
call back and tell you to change your oil, but without more detail he
can't know that will fix your problem. However, making up a sample
doesn't help much either and is almost as bad. I know people think
they are being helpful and not boring people with the bloody details
by providing a simple example to work with, but generally it's not
helpful since it doesn't represent their real problem. In regex
construction, details matter. Often the made-up sample is too simple.
So what they end up getting is a solution that works great for their
sample but doesn't handle their real problem well, if at all. So now
you are back where you started. This is like calling your mechanic on
the phone and saying “My car is making a funny noise.” While it's
less vague than “isn't running right”, it's still vague. You may know
exactly what kind of noise, how loud it is and where it's coming
from, but that sentence doesn't communicate any of that. Made-up
samples are about the same when asking for regex help.
The majority of regular expression
problems are context sensitive, meaning the approach to matching
the target string may depend on where it is in the source string.
Now sometimes the target string is so distinct from the surrounding
text that the context seems irrelevant. Say you are trying to find a
string of digits somewhere in your source data, and there just
happens to be only one occurrence of digits in your source. Well,
in that case a generic regex that finds digits will be good enough.
Source: “abc345def” or “a125bcd”
Regex: \d+
Now I said the context seems irrelevant
in this case, but it is as relevant as ever. The context is that
digits only occur once in the string.
But relevance becomes clearer if you
have a source file where there are two or more occurrences of text
that mimic your target data but only one is actually the data you
want. For example, let's say you want to find the area code of a
phone number. Well, if you agree that an area code is a 3-digit
number, the regex \d{3} might be all you need. It depends. On what,
you ask? Come on, you know, say it with me... “Your source data!”
If your source data is like the case I
mentioned first, where the only digits in the file are area codes, or
if the other digits in the file that are not area codes come in runs
of fewer than three consecutive digits, then the simple
regex should work.
Sample
1) AL 205, 256, 334
2) AK 907
3) AR 479
4) etc.
But if your source is a bunch of phone
numbers
1) Mom (555) 123-4567
2) Sis 1-555-234-4567
3) Friend 5553456789
Well, now you have a problem. While your
general task is still the same, you have to approach it differently
because of the source. You can't simply look for 3 consecutive
digits since that would return 3 matches for each line. But if we
know how the data is structured we can focus on how the data we want
sticks out from the rest. In the above sample text the area code
meets one of three conditions:
1) 3 consecutive digits inside of parentheses
2) the first group of 3 consecutive digits in a string of 1-xxx-xxx-xxxx format
3) the first 3 consecutive digits of a ten consecutive digit string
Now you can still match what you need
in one regex, but how you choose to approach it changes. Some regex
authors will use look-around to examine the surrounding text, others
will match the surrounding text and use groups to pull out the desired
data. Either way you have to expand your field of vision beyond what
you are trying to match to address the problem.
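As one possible sketch of the "match the surrounding text and use groups" approach, here is a JavaScript version built around the three conditions listed above. The pattern, the helper name findAreaCode and the variable names are my own; it is tuned to these three sample lines and nothing more, which is rather the point.

```javascript
// Illustrative sketch only: findAreaCode is a hypothetical helper written for the samples above.
const stateList = "AL 205, 256, 334";
const phoneList = ["Mom (555) 123-4567", "Sis 1-555-234-4567", "Friend 5553456789"];

// The naive pattern is fine for the state list but finds three hits on every phone line.
const naive = /\d{3}/g;
const stateCodes = stateList.match(naive);
const naiveCounts = phoneList.map(line => line.match(naive).length);

// One alternative per condition; each captures only the area code.
const areaCode = /\((\d{3})\)|\b1-(\d{3})-\d{3}-\d{4}\b|\b(\d{3})\d{7}\b/;

function findAreaCode(line) {
  const m = areaCode.exec(line);
  // Only one of the three groups can be set for a given match.
  return m ? (m[1] || m[2] || m[3]) : null;
}

console.log(stateCodes);                   // [ '205', '256', '334' ]
console.log(naiveCounts);                  // [ 3, 3, 3 ]
console.log(phoneList.map(findAreaCode));  // [ '555', '555', '555' ]
console.log(findAreaCode(stateList));      // null
```

The last line shows the flip side: the pattern written for the phone list finds nothing in the state list, because it was built around a different source.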
But even in this simple example,
you can see how knowing what the source data looks like is still
important. Imagine someone asked you for help and you are trying to
solve this problem without seeing the actual source above. Suppose
someone posts (or asks) the question “I need a regex to match a US
area code”.
With No Sample: Any helpful regex
author has to guess how your data is formatted. Some may assume it's
a string of “(xxx) yyy-yyyy”, others may assume it's
“xxx-yyy-yyyy”, some may even think it's “State: xxx, xxx, xxx”,
and all of them might not even consider “1-xxx-yyy-yyyy” or
“xxxyyyyyyy”.
With Made up sample: “I have a string
(555) 555-5555 and I need a regex to match a US area code”
Now a regex author may assume all your
phone numbers are in that nice format and that there are no
variations. You'll most likely get a regex that matches that format
perfectly, but fails on the other formats you didn't think were
important enough to mention.
With Real Data: Regex authors don't
have to guess. They can look at your data, analyze it themselves and
even test edge cases and address issues you may have overlooked.
I hope this simple example shows how
the same problem can have different approaches all because of the
source. Data in the wild is going to be much more complex than
this, which will only magnify the issue.
Now sometimes you can't provide real
data because it's either too much or private. With a large chunk of
data, simply clip a portion containing what you want to match.
With private data, making up a sample is almost excusable, but don't
make up simple ones. Garble the real data but preserve the format:
change names to literary or TV characters, change addresses to the
address of some public office (city hall, post office) or landmark.
This is essentially a made-up sample, but if you substitute values
your samples won't be as bland and will be closer to your real data
and edge cases.
Understanding your source data is
crucial to your regex construction whether you are writing it
yourself or asking for help.
|
-
Who should be using regular
expressions?
This has been on my mind for a while:
there are some people who shouldn't be using regular expressions. People
who don't know what regular expressions are for.[:^)] Now you can use
regexes in a variety of ways, and you can debate either side on
whether certain applications are good uses or bad uses, but that's
not what I'm talking about. I'm not even talking about people who
don't know how to write regexes, well or at all. I'm talking about
people trying to use regular expressions who have no idea what
regular expressions do. And I don't mean that you can't decipher a
regex pattern by looking at it. I mean you don't understand the
general concept that regular expressions match
patterns in data.
I've seen several posts over on the
regexadvice.com forums in the past few months which go something like
“I have this problem and someone told me I should use a regular
expression”, but unless that same someone bothered to explain what a
regular expression does or why it is suited for your task, don't be
in such a hurry to plug in a regex.
I'm not saying you need to master
regexes before using them, just get your head around the basics
(matching). At the very least you should associate regexes with
wildcard searches, just more powerful. If you are at least that far,
proceed with trying to implement your regex. But if you are thinking
something completely different, slow down there. Unless you have a
basic understanding of what regular expressions do, even simple tasks
become way more difficult than they should be. Even if you get
someone to help you write a regex, if you still don't know what it's
doing you are likely to continue trying to use them incorrectly.
A common result of this lack of
understanding I've been seeing is trying to perform a task solely by
writing a regex pattern when regexes themselves have no capacity to
perform it. Not that a regex couldn't be a part of the solution, but
in the end it's not what will do what the person wants done. Usually
it's a coding task, though the regex can at least do some of the
prep work of the task. Even if some implementations of the regex
engine allow code to be called, it is still the code doing the heavy
lifting. The regex is only finding the data. People could save
themselves hours of wasted time if they just understood the basics of
what they are getting themselves into.
Another thing I've seen, related to
not having a basic understanding, is a task that is not only
perfectly suited to a regex but is a very common application of
one, and the person asking the question asks “Is this
something that a regex can do?” Of the issues I've mentioned this one is
both the most and the least understandable. The most because regular
expressions are not simple for everyone; to some they come easily, to
others they never come completely. And some of the documentation isn't
the greatest, though there is plenty of good documentation out there these days. So I can understand why some can't get their head
around exactly how to write a regex. But it's the least understandable
when they are asking how to write a regex so common, and usually
simple, that dozens of tutorials and articles on regexes use the
very regex they are asking for as an example.
I think the main problems of people
who fall into this category are:
Lack of research. There are more
than enough tutorials out on the web and more than a few books that
have simple samples to give you an idea of what a regex does. I find
it very hard to believe when someone says they couldn't find a regex
for basic (US) zip codes. Did you even look? Or are you only using a
regex for this task because someone told you that you should, and you ran out to find someone to write it for you? Or are you just cheating on your homework?
Misunderstanding what regular
expressions understand. There seem to be more than a few people
who think a regex understands the context of the data it matches.
That it not only knows how to match the data but it knows what it
means, either in the context of their application or to the world at
large. Those are the people who think a regex for matching zip codes
knows what zip codes are used for and where they would be used.
Sorry, but that's not the case. Regexes understand nothing of what
your data means to you. As far as the regex engine is concerned
it's just a string of characters. It's up to the regex author to know
the context of the data they want to match and to shape the regex
accordingly to return only the relevant data. This may make it seem
like the regex understands the data it's matching but that's not what's
happening. What happens is the person who wrote the regex understood
the data and the problem set so well they were able to construct a
regex that only matches the relevant values.
Thinking that regex is a full-blown
programming language. It's not. I've seen questions about wanting
a regex to compare numbers, tell time or do some other function
completely outside its realm, something most programming
languages have a feature to deal with or let you write code that
can. I've never seen any regex documentation promoting such
features so I can't imagine why someone thinks a regex can perform
these tasks, other than my previous point, where they saw the results
of a well-written regex and speculated on what and how much work the
regex did. Like I mentioned, some implementations allow you to perform
function calls but that is more an add-on of the programming
environment you are using than a generic regex feature. Realize
that regular expressions are one of many features of your
programming language, not the other way around.
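To make the last two points concrete, here is a small JavaScript sketch. The names zipShape and isQuantityOver, and the "qty=" text, are made up for illustration. The first pattern only checks that a string looks like a US zip code; it has no idea whether anyone actually lives there. The second function lets the regex find the digits and leaves the comparison to ordinary code.

```javascript
// Shape check only: five digits, optionally followed by a hyphen and four more.
// The pattern knows nothing about what a zip code is or whether it is in use.
const zipShape = /^\d{5}(?:-\d{4})?$/;

// Hypothetical helper: the regex finds the number, the code compares it.
function isQuantityOver(text, limit) {
  const found = /\bqty=(\d+)\b/.exec(text);
  if (found === null) {
    return false; // nothing to compare
  }
  // found[1] is a string of digits; convert it before comparing.
  return Number(found[1]) > limit;
}
```

Note that comparing the captured text directly would compare strings, not numbers, which is exactly the kind of job the surrounding code has to do.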
If you've read this far and you didn't
know what regular expressions did or didn't do before, hopefully by
now you have some idea. And if you already knew what they did, just
stop to consider, the next time you advise someone to use a regex, whether
you are getting across the high-level point that you are “matching
something (a character pattern) in a string” before you get into
the more complex aspects of what a regex can do.
|
-
Hello boys and girls. Wow, it's been a
while since I've done this. I want to touch on a very useful but
often overlooked feature of regex: grouping. While I haven't been
blogging I have been active on a message board here or there. A
question I see quite often is “I want to find a match in a string
but I don't want part of the match” or “I need the value of this
portion of the string.” Now often I see solutions to these types of
questions that involve look-arounds. And while they certainly work,
they aren't the only way to achieve the desired results. New regex
users seem to believe they can only access the full match. Most
regex engines support groups, which let you access a
certain portion of the full match. Groups are identified by
parentheses. Every pair of plain parentheses is a group. For
example, take the regex pattern
/(Hello)\x20(world)/
There are two groups in the regex. The
regex itself matches the string “Hello world”, group 1 contains
the string “Hello”, and group 2 contains the string “world”.
Note that neither group contains the space between the two words, but it is
part of the full match. Most implementations of regular expressions
give you a way to access these groups. They are contained in a
collection inside the match object. You'll need to consult your
regex documentation to know the exact layout, but in most of these
implementations there is a zero-based collection where item 0 is the
full match and items 1..n are the groups your regex
contains, if any.
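In JavaScript, for instance, the collection is the array returned by exec. This is only a quick sketch and the variable names are my own:

```javascript
const helloWorld = /(Hello)\x20(world)/.exec("Hello world");

// Index 0 is the full match; indexes 1 and 2 are the two groups.
const fullMatch = helloWorld[0];   // "Hello world"
const firstGroup = helloWorld[1];  // "Hello"
const secondGroup = helloWorld[2]; // "world"

console.log(fullMatch, "|", firstGroup, "|", secondGroup);
```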
Now if you noticed, I said every pair of
plain parentheses is a
group. The reason I stress plain is because there are other group
constructs, the aforementioned look-arounds being some. Support
for the other constructs varies between implementations, so again consult
your regex documentation to see which ones you have. The other
grouping constructs consist of an open parenthesis immediately
followed by a question mark, which is then followed by the characters
that define that particular grouping construct. Again, consult your
documentation to see which characters define what. I'm not going to
go into all of them here but they basically fall into two categories:
capturing and non-capturing. The plain parentheses I've mentioned
above are a capturing group. However, capturing requires extra
resources, and in some cases you need the extra speed but you still need to
group a certain part of the pattern together for either necessity or
readability or both. This is where you'll want to use a
non-capturing group, (?:pattern): an open parenthesis followed
immediately by a question mark followed immediately by a colon. The
difference here is that the data matched in the group is not added to
the collection of sub-matches in the match object.
Taking our
previous example and making the first group non-capturing
/(?:Hello)\x20(world)/
Where before we
had two groups, here we only have one. Group 1 contains the
string “world”. Now this example is not a very practical use of a
non-capturing group. Typically you'd use them in more complex
regexes that need grouping but where you really don't care about the
sub-matches.
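The same check in JavaScript, again just as a sketch with a made-up variable name, shows the collection shrinking by one:

```javascript
const oneGroup = /(?:Hello)\x20(world)/.exec("Hello world");

// Only index 0 (the full match) and index 1 (the single captured group) exist.
console.log(oneGroup.length); // 2
console.log(oneGroup[0]);     // "Hello world"
console.log(oneGroup[1]);     // "world"
```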
One more quick
thing about capturing groups: basically, the position of each left parenthesis gives the
index of the sub-match in the groups collection. So if you have
nested parentheses, count every (plain) left parenthesis to know which
index to use to reference it. Some of the advanced grouping constructs
and regex options can affect the ordering, but if you are using them
hopefully you've read up on their effects, so I won't go over that here.
/((Hello)|(Goodbye Cruel))\x20(world)/
The above regex
has 4 capturing groups (not counting group 0). Can you find them?
It should match either the string “Hello world” or “Goodbye
Cruel world”. Now I want to point out that not all the groups will
participate in the match, but they are still part of the groups
collection. There will always be 4 groups, just one will always be
empty. Which one depends on which string was matched.
If “Hello world” was matched the groups are
- group 1: Hello
- group 2: Hello
- group 3: (empty)
- group 4: world
If “Goodbye Cruel world” was matched the groups are
- group 1: Goodbye Cruel
- group 2: (empty)
- group 3: Goodbye Cruel
- group 4: world
In both cases group 0 would be the full match.
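Here is how that looks in JavaScript, as a sketch (groupsOf is a name I made up). One thing to be aware of: JavaScript reports a group that did not participate as undefined rather than as an empty string, so check your own engine's documentation for what "empty" means there.

```javascript
const nested = /((Hello)|(Goodbye Cruel))\x20(world)/;

// Hypothetical helper: returns the full match and all groups as a plain array,
// or null when the text does not match.
function groupsOf(text) {
  const found = nested.exec(text);
  return found === null ? null : Array.from(found);
}

console.log(groupsOf("Hello world"));
console.log(groupsOf("Goodbye Cruel world"));
```

The array always has five entries (group 0 plus the four groups), whichever string matched.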
If you note, in
both cases two groups contain the same value. Even if you need to know
whether “Hello” or “Goodbye Cruel” was matched, you certainly
don't need to know it twice. Plus the inner parentheses have
different indexes you'd have to check if you wanted to use those. This
is where you'd use the non-capturing group to simplify your groups
collection.
/((?:Hello)|(?:Goodbye Cruel))\x20(world)/
Now we are back
down to two groups. Group 1 contains either “Hello” or “Goodbye
Cruel” depending on which string was matched. Group 2 always
contains “world”.
However, keep in
mind that in some cases you'll want to use the inner index to determine
which group was matched. So using non-capturing groups isn't
necessarily better; it just depends on whether you need to access
those groups or not. But if you are not doing anything with them,
don't capture them.
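Both choices can be sketched in JavaScript. The function names are hypothetical; the first keeps the inner groups so their indexes tell you which alternative matched, and the second drops them because it only wants the two values.

```javascript
// Inner groups captured: whichever inner index is defined tells you the branch.
function greetingKind(text) {
  const found = /((Hello)|(Goodbye Cruel))\x20(world)/.exec(text);
  if (found === null) {
    return null;
  }
  return found[2] !== undefined ? "hello" : "goodbye";
}

// Inner groups non-capturing: only group 1 and group 2 are left.
function greetingParts(text) {
  const found = /((?:Hello)|(?:Goodbye Cruel))\x20(world)/.exec(text);
  return found === null ? null : [found[1], found[2]];
}
```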
These are just
two of the basic grouping constructs and they are generally supported
across implementations of regex, but not always. But if they are, you
can use them to easily dissect larger matches.
|
-
I was asked to modify some text that had been built
incorrectly. Basically, insert some text at a certain point. First I use a regex
to find the text, then insert the new value within that match. Since the inserted value goes inside the
matched text, I simply wanted to use backreferences and the replace method. Simple, right?
Well, not so much.
The text is in a field of various rows of a database table, and the
text to be inserted comes from another of the fields in the same row and is an
alphanumeric value. So the inserted text
value is dynamic; I can’t simply hard-code the replacement text. The
replacement text is built dynamically for each row. The text to be modified is a certain
attribute somewhere in the text. For this
example let's say it’s “id=xyz”, which is constant for all records. The new text will be inserted right after
the equals sign.
So for
source = “{some stuff} id=xyz {more stuff}”
newText = “ab1”
you get
“{some stuff} id=ab1xyz {more stuff}”
Simple enough. You
use this regex \bid=xyz\b to match the
text. Then split it into groups so you can use backreferences in the
replace. So your final regex looks like
this:
\b(id=)(xyz)\b
Now group 1 contains the text up to your insertion point
(id=)
and group 2 contains the text after your insertion point (xyz).
So your replacement string for your regex is “$1(new data
goes here)$2”, where (new data goes here) = some alphanumeric value pulled
from a second field in the row.
Doing this in .Net, my code looked something like this pseudo-code:
Regex regexFind = new Regex(@"\b(id=)(xyz)\b");
Get records
For each row
    fieldA = rowFieldA (source text)
    fieldB = rowFieldB (insert value)
    fieldA = regexFind.Replace(fieldA, String.Format("$1{0}$2", fieldB))
Next
The Format method of the String class creates a replacement string
for each row.
Looks good? It works fine
for our example, but there is a problem.
For our example value of “ab1” the string format produces “$1ab1$2”, which is
exactly what we want, but as this field is alphanumeric it could begin with a
number, which causes a problem. Say
for the next record the value of the text to be inserted is “12a”. The format
method produces a replacement string of “$112a$2”, which is not good. Syntactically it’s fine but it’s not what we
want, because instead of inserting some text between group 1 and group 2,
which is what we want to do, it is trying to insert text between group 112 and
group 2. As there is no group 112 it
assumes $112 is literal text, so the match is replaced with “$112axyz” and the “id=” is lost.
OK, this is where named groups become handy (necessary?). If you use named groups in your regex and
replacement string you can avoid this problem.
Change the regex to \b(?<att>id=)(?<val>xyz)\b
and your replacement string to “${att}(new data goes here)${val}”.
Now if you are using the String.Format method there is one
more hoop you have to jump through. Because the
regex engine and the Format method both use curly braces, there is a
conflict and the Format method will complain, so you have to write it like this:
String.Format("${0}att{1}{2}${0}val{1}","{","}",newValue)
to get the desired replacement string.
When I started writing this I thought this was the only way
to get this to work, which would mean you could only solve this with a regex engine, like .Net,
that supports named groups, but I’ve thought of a second way.
But whenever a group in your replacement string
can be followed by a digit, you may want to consider using named groups
to avoid unexpected surprises.
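For comparison, here is a rough JavaScript sketch of the same idea. The function names are hypothetical, and it assumes an engine with named group support (ES2018 or later). JavaScript spells the named reference as $<att> rather than ${att}, so there is no clash with a string formatter. The second function sidesteps replacement syntax altogether by building the result in a callback; that is simply one option in JavaScript, not necessarily the second way I mentioned above.

```javascript
// Hypothetical helper: inserts newValue right after "id=" in "id=xyz".
// Assumes newValue is alphanumeric, so it contains no "$" of its own.
function insertAfterId(source, newValue) {
  const findId = /\b(?<att>id=)(?<val>xyz)\b/;
  // $<att> and $<val> are delimited, so a leading digit in newValue
  // cannot be read as part of a group number.
  return source.replace(findId, "$<att>" + newValue + "$<val>");
}

// Alternative sketch: a replacer function receives the groups as arguments,
// so no "$" replacement syntax is involved at all.
function insertAfterIdWithCallback(source, newValue) {
  return source.replace(/\b(id=)(xyz)\b/, (match, att, val) => att + newValue + val);
}
```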
|
-
There are times when a regular expression you’ve written or
someone has written for you needs a little tweaking before you add it to your code,
and the tweaking is required because the syntax of the language conflicts with
your regex. For example, when part of
your regex pattern contains a double quote and the language you are using uses
double quotes as string delimiters. If
you just cut and paste the pattern into your code, the pattern’s quotation mark will
terminate your string prematurely. Now
the usual way to fix it is to escape the quotation mark in the pattern. This solution
requires altering the regex, and how the character is escaped depends on the
language being used. The regex itself
allows you to escape a character with the \ character. The language being used may or may not
recognize that as an escape character for its syntax. And it may be confusing later when you look
at the regex and can’t remember why you escaped a character that the pattern
itself doesn’t need escaped. But there is another way: hex values.
Most regex implementations support a hex syntax \x##, where # is a hex digit.
So if you use \x22 instead of a double quote and \x27 for
a single quote, the regexes become more cookie-cutter ready.
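A small JavaScript sketch of the idea follows; the variable names are mine, and String.raw is used only so the backslashes don't have to be doubled inside a string. Neither pattern contains a quote character, so they can be dropped into quoted strings without upsetting the language.

```javascript
// \x22 stands in for the double quote, so the pattern itself contains none.
const doubleQuoted = /\x22([^\x22]*)\x22/;

// The same idea with \x27 for the single quote, built from pattern text.
const singleQuotedText = String.raw`\x27([^\x27]*)\x27`;
const singleQuoted = new RegExp(singleQuotedText);

console.log(doubleQuoted.exec('say "hi" now')[1]);  // hi
console.log(singleQuoted.exec("say 'bye' now")[1]); // bye
```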
Another useful hex value is \x20, which is a space. This is especially useful in .Net where
there is an option on a regex to ignore whitespace in the pattern (IgnorePatternWhitespace). Turning this option on allows end-of-line comments
in the regex but, with the exception of inside a character class, ignores typed-in
spaces within the pattern, which would be problematic if a space was part
of the pattern to match. So you could
break a working regex if you later decide to add this option. This happened on
the Regexlib when the option was first turned on. A lot of patterns that were written before
the switch was flipped suddenly stopped working.
Speaking of .Net, when it comes to named groups you can’t use
the hex notation to define the group name using the single quote syntax. However, you can avoid any issue with single
quotes by using the alternate angle-bracket syntax, (?<name>pattern).
|
-
The list-detail design is very commonly used with web pages, where you have a list of links that lead to more detailed information on each entry. Sometimes the text of the list is simply a snippet of a much longer string of text in the detail. A common way to handle this is to use a string function to return the first n characters of the string and display that in the list. The problem with this is that it tends to make the break right in the middle of a word. That isn’t a major problem, but it can be aesthetically displeasing or may accidentally form another word you didn’t mean to put on your site.
When facing this issue I came up with a simple regex to allow me to break on whole words.
^(?:[ -~]{n,m}(?:$|(?:[\w!?.])\s))
Where n = the minimum number of characters to match
And m = the maximum number of characters to allow in the match.
Now in this instance I’m considering a word to be one or more ASCII non-whitespace characters. The way it works is that after matching n ASCII characters it tries to match either the end of the string or a letter or sentence-ending punctuation followed by a white space. So it will accept as many characters, including white spaces, as it can up to m and still satisfy the rest of the match. Otherwise it backtracks until the regex is satisfied. So if you wanted a minimum of 2 characters and a maximum of 75 the regex would be
^(?:[ -~]{2,75}(?:$|(?:[\w!?.])\s))
and if you applied it to the Gettysburg Address
“Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal.
…” (only the 1st paragraph shown for the example but you could apply the full text)
Taking the first match you get
“Four score and seven years ago, our fathers brought forth upon this ”
There are a few problems with the regex that can be improved. First off, it only accepts basic ASCII displayable characters, decimal 32 to 126, which means the text must be in that range. I did it this way because it gives you the US alphabet, digits and commonly used symbols and punctuation, which was all I needed at the time. Other characters would need to be added. Also, if the first word's character count exceeds your maximum length, no match will be found.
You can make this regex a little more dynamic by putting it inside a function that takes your string and the max and min values as input.
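A minimal sketch of such a function in JavaScript might look like this. The name leadingSnippet and the choice to return null when nothing matches are my own assumptions, not part of the original pattern.

```javascript
// Hypothetical helper: returns the leading snippet of text broken on a whole
// word, or null when the pattern finds no match.
function leadingSnippet(text, min, max) {
  if (!Number.isInteger(min) || !Number.isInteger(max) || min < 0 || max < min) {
    throw new RangeError("min and max must be integers with 0 <= min <= max");
  }
  // Same pattern as above, with n and m filled in from the arguments.
  const pattern = "^(?:[ -~]{" + min + "," + max + "}(?:$|(?:[\\w!?.])\\s))";
  const found = new RegExp(pattern).exec(text);
  return found === null ? null : found[0];
}

const gettysburg = "Four score and seven years ago, our fathers brought forth " +
  "upon this continent a new nation: conceived in liberty, and dedicated to " +
  "the proposition that all men are created equal.";

console.log(leadingSnippet(gettysburg, 2, 75));
```

One thing worth knowing when you pick the maximum: the closing character and the white space after it are matched outside the counted run, so the returned snippet can be up to two characters longer than max.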
|
|
|
|