Michael Ash's Regex Blog

Regex Musings

  • Looking again at the Lookahead bug

    A few years ago I wrote about a bug in Internet Explorer's regex engine that affected patterns with lookaheads. Well, the bug came back in the form of a question on RegexAdvice.com. It too was a password regex, though not as complex as the pattern that first introduced me to this bug.

    The first pattern had three conditions that were being tested for with lookaheads.

    ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$

    With the current pattern only one lookahead was being used.

    ^(?=.*?\d)[a-z][a-z0-9]{5,7}$

    Both patterns had a min and max length. In the original attempts at both, the length was being checked after the lookahead test(s). While this is perfectly fine in a non-VBScript/JScript world, this is where the bug kicks in with those regex engines. Actually it's probably the same regex engine for both languages, which is probably why it only affects IE: it's the only browser that uses VBScript and JScript natively. I don't recall whether I tested server-side in my original testing for this bug; given my previous blog comments, most likely I only tested client-side. However, in the recent question it was failing server-side, so the bug is not actually in the browser but more likely in the DLL for those languages. Anyway, the previous blog article covered the behavior that was happening.

    Steven Levithan looked much closer at the problem in general and discussed it on his blog. He came to the conclusion that quantifiers with a minimum bound of zero within the lookahead were the culprit, and he provides a couple of simple examples. I think he's partially right, but I don't think the one-or-more quantifiers are excluded from the problem. I think his examples were a little too simple for them to be affected.

    OK, let's look at the regex pattern in the recent question.

    ^(?=.*?\d)[a-z][a-z0-9]{5,7}$

    The requirements were a 6 to 8 character alphanumeric (English) string that starts with an alpha character and contains at least one digit. Let's use “abc123” as the test string.

    Now the above pattern was supplied by one of the resident pros who frequent the message boards. Let's get past the fact that the pattern itself is correct and satisfies the requirements. I'm going to rewrite the pattern to

    ^(?=\D*\d)[a-z][a-z0-9]{5,7}$

    Now this is functionally the same; within the lookahead the pattern is greedy instead of lazy, which will make the point I'm trying to make easier to see (I hope) without having to deal with backtracking. This pattern suffers from the bug in VBScript/JScript. Steven suggested that the one-or-more quantifier (+) doesn't suffer from the problem, so change the * to a +. That doesn't affect the match, because the first test after the lookahead is for an alpha character, so the lookahead placed as it is will always match at least one character with the \D* pattern. So now we have

    ^(?=\D+\d)[a-z][a-z0-9]{5,7}$

    which is also bitten by the bug.

    I first tried the same approach on my original attempt at dealing with the bug, but no luck. I tried using the + quantifier; it still didn't work. Now the person posting the question stated that without the lookahead the remaining pattern worked fine, apart from the at-least-one-digit test. So I began testing parts of the pattern, such as

    ^[a-z][a-z0-9]{5,7}$

    ^(?=\D*\d)[a-z]

    ^(?=\D+\d)[a-z]

    These are just a few attempts; all work as expected. It was only when I put the whole pattern back together that it began failing. But after some more trial and error I think I've come across a pattern in how the bug behaves. Going back to my original examination of the problem, I discovered that the pattern ^(?=\D*\d)[a-z][a-z0-9]{2,7}$ matches the test string “abc123”. This doesn't seem to quite fit my original assessment of what was happening, because in that earlier pattern the part after the lookahead just tests for a certain number of any characters, whereas this pattern is looking for a specific range of characters. But if you up the min bound on the last quantifier by one, to ^(?=\D*\d)[a-z][a-z0-9]{3,7}$, the pattern fails again. Now if you change the zero-or-more quantifier in the lookahead of that modified pattern to a one-or-more quantifier, so you get ^(?=\D+\d)[a-z][a-z0-9]{3,7}$, the pattern matches again. Bump up the min bound of this new pattern to ^(?=\D+\d)[a-z][a-z0-9]{4,7}$. Bugaboo! The pattern fails again.
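
    For comparison, here is a small JavaScript sketch that runs the same four variants in a standards-conforming engine (a current browser or Node.js, which is an assumption about where you run it). The "reported" column just restates the IE results described above; it is not re-measured by this code. In a conforming engine the lookahead consumes nothing, so all four should match.

```javascript
// Patterns from the walkthrough, paired with the result the post reports for
// the VBScript/JScript engine (copied from the text, not re-measured here).
const testString = "abc123";
const cases = [
  { pattern: /^(?=\D*\d)[a-z][a-z0-9]{2,7}$/, reported: "matches" },
  { pattern: /^(?=\D*\d)[a-z][a-z0-9]{3,7}$/, reported: "fails" },
  { pattern: /^(?=\D+\d)[a-z][a-z0-9]{3,7}$/, reported: "matches" },
  { pattern: /^(?=\D+\d)[a-z][a-z0-9]{4,7}$/, reported: "fails" },
];

// A conforming engine starts the consuming part at index 0, so each variant
// sees "a" followed by the five characters "bc123".
const results = cases.map(({ pattern, reported }) => ({
  source: pattern.source,
  reported,
  conforming: pattern.test(testString),
}));

for (const r of results) {
  console.log(`${r.source}  reported in IE: ${r.reported}, here: ${r.conforming}`);
}
```

    Any row where the two columns disagree is a case where the bug, not the pattern, decides the outcome.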

    Now I don't have access to the source code, so this is just supposition, but here's what I believe is happening. I think the data examined by the lookahead is being stored in a stack structure. Just looking at the patterns that work and fail, it looks like once the lookahead is satisfied, when a quantifier is encountered in the consuming portion of the pattern the lookahead's match is reconsidered, with the minimum bound of the lookahead's quantifier popped off the stack of the lookahead's match. Let's look at the first pattern that worked

    ^(?=\D*\d)[a-z][a-z0-9]{2,7}$

    OK, we'll start after the lookahead is matched. The lookahead is supposed to be non-consuming, so the pointer should still be at the beginning of the string.

    ^(?=\D*\d) matches “abc1”. The rest of the pattern matches normally until we get to the quantifier in the consuming portion, where we are looking for at least two alphanumeric characters. At that point the lookahead match is reconsidered; the lower bound being zero, nothing is removed from the stack, but the current character pointer now points to the character after the lookahead's match value of “abc1”. The quantified part of the pattern, [a-z0-9]{2,7}$, can be satisfied by “23”, so we get a match.

    Now if we do the same thing with ^(?=\D*\d)[a-z][a-z0-9]{3,7}$ and apply the same logic, the regex fails because the quantifier in the consuming portion is looking for at least 3 characters, and there aren't that many if it tries to satisfy that part of the pattern after “abc1” in the test string.

    Now let's look at ^(?=\D+\d)[a-z][a-z0-9]{3,7}$ with the same logic. The only difference from the previous pattern is the lower bound of the lookahead quantifier. It's now 1. So if we pop 1 character off the stack of the lookahead's match of “abc1” we get “abc”, leaving us with “123” to be matched by the consuming quantifier, which is just enough.

    Now take ^(?=\D+\d)[a-z][a-z0-9]{4,7}$ and apply the same logic. Now the pattern fails because the consuming quantifier is looking for at least 4 characters after the stack popping of the lookahead's match.

    If you changed the lookahead quantifier to {2,} the pattern would match. You can continue upping the quantifier by one in the consuming portion to make the pattern not match, then in the non-consuming part of the pattern to get the match back, so the behavior seems pretty consistent with my theory, which is that the consuming quantifier starts from the end of the lookahead match and moves back by the number of characters in the lookahead's minimum bound. It also seems to explain the effect in my original encounter with the bug. It may also explain why the workaround of testing the length first with another lookahead avoids the bug, because the consuming quantifier in that case is usually just *. + seems to work too, which doesn't quite fit, but consuming quantifiers with minimum bounds of zero or one don't seem to be affected in any case. In all of the above test cases those values were below the minimum threshold of every successful test.

    However, the test cases above have only one lookahead, and it doesn't backtrack; who knows what influence backtracking, additional lookaheads, or additional quantifiers in the consuming parts of the pattern would have. While the test values support my theory, I can't say for sure that things are happening exactly the way I've laid out. The actual mechanics may be different, but whatever is happening under the hood, clearly pointers are being corrupted and the regex engine is losing its place.
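
    The theory can be written down as a tiny model. The sketch below is hypothetical: predictBuggyMatch and its parameters are names made up for illustration, it only covers patterns shaped like ^(?=\D{k,}\d)[a-z][a-z0-9]{m,n}$, and it encodes the suspected behavior rather than anything known about the real JScript/VBScript engine.

```javascript
// Hypothetical model of the theory above. It is NOT the real JScript/VBScript
// engine; it only encodes the suspected pointer corruption.
// Patterns are assumed to have the shape ^(?=\D{k,}\d)[a-z][a-z0-9]{m,n}$.
function predictBuggyMatch(text, lookaheadMin, consumeMin, consumeMax) {
  // What the lookahead \D{k,}\d matches from the start of the string.
  const look = new RegExp(`^\\D{${lookaheadMin},}\\d`).exec(text);
  if (look === null || !/^[a-z]/.test(text)) return false;

  // Suspected bug: the consuming quantifier starts at the end of the
  // lookahead's match, moved back by the lookahead quantifier's minimum.
  const start = look[0].length - lookaheadMin;
  const tail = new RegExp(`^[a-z0-9]{${consumeMin},${consumeMax}}$`);
  return tail.test(text.slice(start));
}

console.log(predictBuggyMatch("abc123", 0, 2, 7)); // true  (\D* with {2,7})
console.log(predictBuggyMatch("abc123", 0, 3, 7)); // false (\D* with {3,7})
console.log(predictBuggyMatch("abc123", 1, 3, 7)); // true  (\D+ with {3,7})
console.log(predictBuggyMatch("abc123", 1, 4, 7)); // false (\D+ with {4,7})
console.log(predictBuggyMatch("abc123", 2, 4, 7)); // true  (\D{2,} with {4,7})
```

    The model reproduces the five outcomes described in the text, which is all it was built to do; it says nothing about patterns with backtracking or more than one lookahead.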

    The thing that was so confusing about this is that it only kicks in when a quantifier in the consuming portion is encountered; if there is something between the lookahead and the quantified portion, that part matches normally. So this is something it's really hard to make test cases for, because you get these ghost values popping up later in the test than expected. Not to mention the pattern itself is correct, so even with a tool that lets you step through the matching you don't get this behavior unless that tool is using VBScript's regex engine, and I haven't seen such a tool for that engine.

    If you are using lookaheads with JavaScript client-side then you are going to be susceptible to the bug in IE, because it will use the JScript engine. And while you should always validate server-side, if VBScript or JScript is your server-side language you are still at risk. So a platform like classic ASP, which uses both of those languages by default, is at risk client- and server-side, but a platform like PHP, while it still suffers the bug client-side in IE, should work correctly server-side, where it uses a different regex engine. The same goes for non-web clients using the JScript/VBScript DLL.

    The workaround for the strong-password type of regex is, when using lookaheads, to include the upper bound test(s) before the tests with no upper bound, then use .* to consume characters. The bounded test should keep the pattern from running forever. However, depending on the complexity of your criteria this may not always be an option, but try it first anyway.
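
    As a rough illustration, here is one way the two patterns from this post could be rearranged along those lines. The exact rewrites are my reading of the workaround, not patterns taken from the original threads, and the checks below only confirm that they behave correctly in a conforming engine; they have not been run against IE here.

```javascript
// Length (bounded) test first, as a lookahead anchored with $, then the
// unbounded tests, then .* to consume the string.
const strongPassword = /^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$/;
const alnumWithDigit = /^(?=[a-z][a-z0-9]{5,7}$)(?=\D*\d).*$/;

const checks = [
  [strongPassword, "Passw0rd", true],           // 8 chars, digit, lower, upper
  [strongPassword, "Pass0rd", false],           // only 7 characters
  [strongPassword, "password1", false],         // no upper-case letter
  [strongPassword, "Passw0rdPassw0rd", false],  // 16 characters, too long
  [alnumWithDigit, "abc123", true],             // the test string from above
  [alnumWithDigit, "abcdef", false],            // no digit
  [alnumWithDigit, "1abcde", false],            // starts with a digit
  [alnumWithDigit, "abc123456", false],         // 9 characters, too long
];

const outcomes = checks.map(([re, input, expected]) => ({
  input,
  expected,
  actual: re.test(input),
}));
console.log(outcomes.every((o) => o.actual === o.expected)); // true
```

    The only quantifier left in the consuming portion is the *, whose minimum bound is zero, which is the case that the observations above suggest is unaffected.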

  • Validating Email Revisited

    First off, let me say I'm a bit over my head here. Not the regex part, but the host language of the regex engine.

    Many moons ago I posted a blog article stating why you could not write a regex that validated an e-mail address 100%. Well, this is still true. In that post I also stated that the pattern was so massive that it wasn't worth using. This is also still true; however, I was made aware of a flavor-specific syntax that reduces the regex from massive to very large.

    This regex is for the PCRE engine. http://www.myregextester.com/?r=337

    Though from what I've read this will work for PHP too. Now I don't know Perl or PHP, or what minimum version of PCRE supports this syntax. That being the case, I also don't know how well it performs. I wrote the original version using the .NET syntax, and not only was the regex massive, which is one reason I never posted it, but the performance was terrible. Given that most people want to use this type of regex to validate a data entry field, the pattern was overkill. In fact I recommend that you don't use this, except to learn from. The PCRE version may perform better, but I don't have the means or time to test, so use at your own risk. For simple field validation even this is still overkill. For a large text file performance may suffer horribly. Most likely you aren't going to want to use this pattern, as it is too large for simple tests and performs poorly for large tests.

    When I see people asking for an email regex, I point out that perfect validation is not possible. And when I see so-called email validating regexes that are only about 50 characters long, it makes me chuckle. This pattern is probably the most compact version of an RFC 2822 address regex you'll find, and it is still huge. Ports to other regex engines not supporting the recursive syntax will easily be 4x as large as my .NET version was.

    The above pattern does the RFC spec up to the addr-spec, which is pretty much what people are thinking about when they say email address.

    It's not too hard to take it up a few more levels in the spec using this syntax
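
    To make the point about short patterns concrete, here is a small JavaScript sketch. The pattern is a hypothetical example of the short kind, written for this illustration and not taken from any particular site or library. It accepts an everyday address but rejects two forms that the RFC 2822 grammar allows: a quoted-string local-part and a domain-literal.

```javascript
// A hypothetical "short" email pattern (not from the post or any library).
const shortEmailPattern = /^[\w.+-]+@[\w-]+(\.[\w-]+)+$/;

const samples = [
  "user@example.com",          // ordinary address
  '"john smith"@example.com',  // quoted-string local-part, allowed by RFC 2822
  "user@[192.168.0.1]",        // domain-literal, allowed by RFC 2822
];

const accepted = samples.map((s) => shortEmailPattern.test(s));
console.log(accepted); // [ true, false, false ]
```

    Whether rejecting those forms matters depends on the application, but it shows why a short pattern can't claim to validate the full spec.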

    RFC 2822 mailbox : http://www.myregextester.com/?r=338

    but like I said, it likely won't perform well enough to be useful. The two patterns I've linked to are wrapped in anchors, so they are just matching against the whole string. Searching for a string within a larger body, without anchors, will probably degrade performance very fast. But if any of you PHP or Perl gurus want to stress test this beast, have fun. Maybe it's not as bad as I think it may be.


  • Update to CSS Minification

    This is a C# 2.0 enhancement of a C# port of YUI Compressor's CSS minification code.

    I got a little carried away with ideas for this; they were all regex based, which really is what motivated me to work on it. However, after I thought I was done I learned not everything worked. It did what I wanted it to do, but what I wanted wasn't the correct thing. I really should have just stopped with my original ideas.

    The last idea for my original changes was to take 2 or more individual subset properties and write them in the shorthand notation of the main property they were a subset of. Well, I got that working. But upon testing I learned something new about CSS that I didn't know: what I was doing could alter the behavior of the presentation. That was disappointing, because I put a lot of energy into getting the results I was after.

    So it looked as if all of that code was going to go to waste. But there was one scenario where what I was trying to do was alright, so the code wasn't completely wasted: if all the subset properties are declared, then combining them is fine. I didn't bother changing the regexes I wrote for this, but I cleaned up some of the code. Though it would have worked as is, some of the things being checked were now unnecessary.
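
    The "only when everything is declared" rule can be shown in a few lines. This is a simplified JavaScript sketch, not a translation of the C# below: collapseMargin is a hypothetical helper that handles only the four margin sides and only simple values (no !important, no calc()), and it ignores declaration order.

```javascript
// Hypothetical helper, not part of the C# port: collapse margin-top/right/
// bottom/left into one `margin` declaration only when all four are declared.
function collapseMargin(block) {
  const sideRe = /margin-(top|right|bottom|left)\s*:\s*([\w.%-]+)\s*;?/gi;
  const sides = {};
  for (const m of block.matchAll(sideRe)) {
    sides[m[1].toLowerCase()] = m[2];
  }
  const order = ["top", "right", "bottom", "left"];
  // A missing side would have to be invented, which would override whatever
  // value the cascade would otherwise supply, so leave the block alone.
  if (!order.every((side) => side in sides)) return block;

  const shorthand = "margin:" + order.map((side) => sides[side]).join(" ") + ";";
  let inserted = false;
  // Put the shorthand where the first longhand was and drop the rest.
  return block.replace(sideRe, () => {
    if (inserted) return "";
    inserted = true;
    return shorthand;
  });
}

const full = "{margin-top:1px;margin-right:2px;margin-bottom:3px;margin-left:4px;color:red;}";
const partial = "{margin-top:1px;margin-left:4px;}";
console.log(collapseMargin(full));    // {margin:1px 2px 3px 4px;color:red;}
console.log(collapseMargin(partial)); // {margin-top:1px;margin-left:4px;}
```

    The partial block is left untouched because writing it as a shorthand would force values onto the right and bottom sides that the stylesheet never set.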

     

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace CSSMinify
    {
        class CSSMinify
        {
            public static Hashtable shortColorNames = new Hashtable();
            public static Hashtable shortHexColors = new Hashtable();
            public static string Minify(string css)
            {
                return Minify(css, 0);
            }
            public static string Minify(string css, int columnWidth)
            {
                // BSD License http://developer.yahoo.net/yui/license.txt
                // New css tests and regexes by Michael Ash

                createHashTable();
                MatchEvaluator rgbDelegate = new MatchEvaluator(RGBMatchHandler);
                MatchEvaluator shortColorNameDelegate = new MatchEvaluator(ShortColorNameMatchHandler);
                MatchEvaluator shortColorHexDelegate = new MatchEvaluator(ShortColorHexMatchHandler);
                css = RemoveCommentBlocks(css);
                css = Regex.Replace(css, @"\s+", " "); //Normalize whitespace
                css = Regex.Replace(css, @"\x22\x5C\x22}\x5C\\x22\x22", "___PSEUDOCLASSBMH___"); //hide Box model hack
                /* Remove the spaces before the things that should not have spaces before them.
                   But, be careful not to turn "p :link {...}" into "p:link{...}"
                */
                css = Regex.Replace(css, @"(?#no preceding space needed)\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
                css = Regex.Replace(css, @"([!{}:;>+([,])\s+", "$1");  // Remove the spaces after the things that should not have spaces after them.
                css = Regex.Replace(css, @"([^;}])}", "$1;}");    // Add the semicolon where it's missing.
                css = Regex.Replace(css, @"(\d+)\.0+(p(?:[xct])|(?:[cem])m|%|in|ex)\b", "$1$2"); // Remove .0 from size units x.0em becomes xem
                css = Regex.Replace(css, @"([\s:])(0)(px|em|%|in|cm|mm|pc|pt|ex)\b", "$1$2"); // Remove unit from zero
                //New test
                //Font weights
                css = Regex.Replace(css, @"(?<=font-weight:)normal\b", "400");
                css = Regex.Replace(css, @"(?<=font-weight:)bold\b", "700");
                //Thought this was a good idea, but properties of a set that are not defined get element defaults, and this was resetting them. css = ShortHandProperty(css);
                css = ShortHandAllProperties(css);
                //css = Regex.Replace(css, @":(\s*0){2,4}\s*;", ":0;"); // if all parameters zero just use 1 parameter
                // if all 4 parameters the same unit make 1 parameter
                css = Regex.Replace(css, @"(?<!background-position\s*):\s*(inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};", ":$1;", RegexOptions.IgnoreCase);
                // if has 4 parameters and top unit = bottom unit and right unit = left unit make 2 parameters
                css = Regex.Replace(css, @":\s*((inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(inherit|auto|0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;", ":$1;", RegexOptions.IgnoreCase);
                // if has 4 parameters and top unit != bottom unit and right unit = left unit make 3 parameters
                css = Regex.Replace(css, @":\s*((?:(?:inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(inherit|auto|0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
                //// if has 3 parameters and top unit = bottom unit make 2 parameters
                //css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
                css = Regex.Replace(css, "background-position:0;", "background-position:0 0;");
                css = Regex.Replace(css, @"(:|\s)0+\.(\d+)", "$1.$2");
                //  Outline-style and border-style parameter reduction
                css = Regex.Replace(css, @"(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};", "$1-style:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);", "$1-style:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);

                //  Outline-color and Border-color parameter reduction
                css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};", "$1-color:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);", "$1-color:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);", "$1-color:$2;", RegexOptions.IgnoreCase);

                // Shorten colors from rgb(51,102,153) to #336699
                // This makes it more likely that it'll get further compressed in the next step.
                css = Regex.Replace(css, @"rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29", rgbDelegate);
                css = Regex.Replace(css, @"(?<![\x22\x27=]\s*)\#(?:([0-9A-F])\1)(?:([0-9A-F])\2)(?:([0-9A-F])\3)", "#$1$2$3", RegexOptions.IgnoreCase);
                // Replace hex color code with the named value if the name is shorter
                css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>f00)\b", "red", RegexOptions.IgnoreCase);
                css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>[0-9a-f]{6})", shortColorNameDelegate, RegexOptions.IgnoreCase);
                css = Regex.Replace(css, @"(?<=color\s*:\s*)\b(Black|Fuchsia|LightSlateGr[ae]y|Magenta|White|Yellow)\b", shortColorHexDelegate, RegexOptions.IgnoreCase);

                // Remove empty rules.
                css = Regex.Replace(css, @"[^}]+{;}", "");
                //Remove semicolon of last property
                css = Regex.Replace(css, ";(})", "$1");
                if (columnWidth > 0)
                {
                    css = BreakLines(css, columnWidth);
                }
                return css;
            }
            private static string RemoveCommentBlocks(string input)
            {
                int startIndex = 0;
                int endIndex = 0;
                bool iemac = false;
                startIndex = input.IndexOf(@"/*", startIndex);
                while (startIndex >= 0)
                {
                    endIndex = input.IndexOf(@"*/", startIndex + 2);
                    if (endIndex >= startIndex + 2)
                    {
                        if (input[endIndex - 1] == '\\')
                        {
                            startIndex = endIndex + 2;
                            iemac = true;
                        }
                        else if (iemac)
                        {
                            startIndex = endIndex + 2;
                            iemac = false;
                        }
                        else
                        {
                            input = input.Remove(startIndex, endIndex + 2 - startIndex);
                        }
                    }
                    startIndex = input.IndexOf(@"/*", startIndex);
                }
                return input;
            }
            private static String RGBMatchHandler(Match m)
            {
                int val = 0;
                StringBuilder hexcolor = new StringBuilder("#");
                for (int index = 1; index <= 3; index += 1)
                {
                    val = Int32.Parse(m.Groups[index].Value);
                    hexcolor.Append(val.ToString("x2"));
                }
                return hexcolor.ToString();
            }
            private static string BreakLines(string css, int columnWidth)
            {
                int i = 0;
                int start = 0;
                StringBuilder sb = new StringBuilder(css);
                while (i < sb.Length)
                {
                    char c = sb[i++];
                    if (c == '}' && i - start > columnWidth)
                    {
                        sb.Insert(i, '\n');
                        start = i;
                    }
                }
                return sb.ToString();

            }
            private static string ReplaceNonEmpty(string inputText, string replacementText)
            {
                if (replacementText.Trim() != string.Empty)
                {
                    inputText = string.Format(" {0}", replacementText);
                }
                return inputText;
            }
            private static string ShortColorNameMatchHandler(Match m)
            {
                // This function replaces hex color values with named colors if the name is shorter than the hex code
                string returnValue = m.Value;
                // Keys are stored lowercase, but the match is case-insensitive, so normalize before the lookup.
                string hex = m.Groups["hex"].Value.ToLower();
                if (shortColorNames.ContainsKey(hex))
                {
                    returnValue = shortColorNames[hex].ToString();
                }
                return returnValue;
            }
            private static string ShortColorHexMatchHandler(Match m)
            {
                //This function replaces named values with their shorter hex equivalent
                return shortHexColors[m.Value.ToString().ToLower()].ToString();
            }
            private static void createHashTable()
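            {
                // Populate the lookup tables only once; Hashtable.Add throws on a duplicate key.
                if (shortColorNames.Count > 0) return;
                //Color names shorter than hex notation. Except for red.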
                shortColorNames.Add("F0FFFF".ToLower(), "Azure".ToLower());
                shortColorNames.Add("F5F5DC".ToLower(), "Beige".ToLower());
                shortColorNames.Add("FFE4C4".ToLower(), "Bisque".ToLower());
                shortColorNames.Add("A52A2A".ToLower(), "Brown".ToLower());
                shortColorNames.Add("FF7F50".ToLower(), "Coral".ToLower());
                shortColorNames.Add("FFD700".ToLower(), "Gold".ToLower());
                shortColorNames.Add("808080".ToLower(), "Grey".ToLower());
                shortColorNames.Add("008000".ToLower(), "Green".ToLower());
                shortColorNames.Add("4B0082".ToLower(), "Indigo".ToLower());
                shortColorNames.Add("FFFFF0".ToLower(), "Ivory".ToLower());
                shortColorNames.Add("F0E68C".ToLower(), "Khaki".ToLower());
                shortColorNames.Add("FAF0E6".ToLower(), "Linen".ToLower());
                shortColorNames.Add("800000".ToLower(), "Maroon".ToLower());
                shortColorNames.Add("000080".ToLower(), "Navy".ToLower());
                shortColorNames.Add("808000".ToLower(), "Olive".ToLower());
                shortColorNames.Add("FFA500".ToLower(), "Orange".ToLower());
                shortColorNames.Add("DA70D6".ToLower(), "Orchid".ToLower());
                shortColorNames.Add("CD853F".ToLower(), "Peru".ToLower());
                shortColorNames.Add("FFC0CB".ToLower(), "Pink".ToLower());
                shortColorNames.Add("DDA0DD".ToLower(), "Plum".ToLower());
                shortColorNames.Add("800080".ToLower(), "Purple".ToLower());
                shortColorNames.Add("FA8072".ToLower(), "Salmon".ToLower());
                shortColorNames.Add("A0522D".ToLower(), "Sienna".ToLower());
                shortColorNames.Add("C0C0C0".ToLower(), "Silver".ToLower());
                shortColorNames.Add("FFFAFA".ToLower(), "Snow".ToLower());
                shortColorNames.Add("D2B48C".ToLower(), "Tan".ToLower());
                shortColorNames.Add("008080".ToLower(), "Teal".ToLower());
                shortColorNames.Add("FF6347".ToLower(), "Tomato".ToLower());
                shortColorNames.Add("EE82EE".ToLower(), "Violet".ToLower());
                shortColorNames.Add("F5DEB3".ToLower(), "Wheat".ToLower());

                // Hex notation shorter than named value
                shortHexColors.Add("black", "#000");
                shortHexColors.Add("fuchsia", "#f0f");
                shortHexColors.Add("lightslategray", "#789");
                shortHexColors.Add("lightslategrey", "#789");
                shortHexColors.Add("magenta", "#f0f");
                shortHexColors.Add("white", "#fff");
                shortHexColors.Add("yellow", "#ff0");
            }
            private static string ShortHandAllProperties(string css)
            {
                /*
                 * This function searches for blocks that specify all the individual properties of a property type
                 * and reduces them to a single property using shorthand notation
                 */
                Regex reCSSBlock = new Regex("{[^{}]*}");
                Regex reTRBL1 = new Regex(@"(?<fullProperty>(?:(?<property>padding)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
                Regex reTRBL2 = new Regex(@"(?<fullProperty>(?:(?<property>margin)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
                Regex reTRBL3 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:color)))\s*:\s*(?<unit>[#\w.]+);?", RegexOptions.IgnoreCase);
                Regex reTRBL4 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:style)))\s*:\s*(?<unit>none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset);?", RegexOptions.IgnoreCase);
                Regex reTRBL5 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:width)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
                Regex reListStyle = new Regex(@"list-style-(?<style>type|image|position)\s*:\s*(?<unit>[^};]+);?", RegexOptions.IgnoreCase);
                Regex reFont = new Regex(@"font-(?:(?:(?<fontProperty>family\b)\s*:\s*(?<fontPropertyValue>(?:\b[a-zA-Z]+(-[a-zA-Z]+)?\b|\x22[^\x22]+\x22)(?:\s*,\s*(?:\b[a-zA-Z]+(-[a-zA-Z]+)?\b|\x22[^\x22]+\x22))*)\b)|
    (?:(?<fontProperty>style\b)\s*:\s*(?<fontPropertyValue>normal|italic|oblique|inherit))|
    (?:(?<fontProperty>variant\b)\s*:\s*(?<fontPropertyValue>normal|small-caps|inherit))|
    (?:(?<fontProperty>weight\b)\s*:\s*(?<fontPropertyValue>normal|bold|(?:bold|light)er|[1-9]00|inherit))|
    (?:(?<fontProperty>size\b)\s*:\s*(?<fontPropertyValue>(?:(?:xx?-)?(?:small|large))|medium|(?:\d*\.?\d+(?:%|(p(?:[xct])|(?:[cem])m|in|ex))\b)|inherit|\b0\b)))\s*;?", (RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace));
                Regex reBackGround = new Regex(@"background-(?:
    (?:(?<property>color)\s*:\s*(?<unit>transparent|inherit|(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)))|
    (?:(?<property>image)\s*:\s*(?<unit>none|inherit|(?:url\s*\([^()]+\))))|
    (?:(?<property>repeat)\s*:\s*(?<unit>no-repeat|inherit|repeat(?:-[xy])))|
    (?:(?<property>attachment)\s*:\s*(?<unit>scroll|inherit|fixed))|
    (?:(?<property>position)\s*:\s*(?<unit>((?<horizontal>left | center | right|(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+(?<vertical>top | center | bottom |(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))))|
        ((?<vertical>top | center | bottom )\s+(?<horizontal>left | center | right ))|
        ((?<horizontal>left | center | right )|(?<vertical>top | center | bottom ))))
    );?", (RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture));
                MatchCollection mcBlocks = reCSSBlock.Matches(css);
                foreach (Match mBlock in mcBlocks)
                {
                    string strBlock = mBlock.Value;
                    HasAllPositions(reTRBL1, ref strBlock);
                    HasAllPositions(reTRBL2, ref strBlock);
                    HasAllPositions(reTRBL3, ref strBlock);
                    HasAllPositions(reTRBL4, ref strBlock);
                    HasAllPositions(reTRBL5, ref strBlock);
                    HasAllListStyle(reListStyle, ref strBlock);
                    HasAllFontProperties(reFont, ref strBlock);
                    HasAllBackGroundProperties(reBackGround, ref strBlock);
                    css = css.Replace(mBlock.Value, strBlock);
                }
                return css;
            }
            private static void HasAllBackGroundProperties(Regex re, ref string CSSText)
            {
                {
                    MatchCollection mcProperySet = re.Matches(CSSText);
                    int z = 5;
                    if (mcProperySet.Count == z)
                    {

                        int y = 0;
                        for (int x = 0; x < z; x = x + 1)
                        {
                            switch (mcProperySet[x].Groups["property"].Value)
                            {
                                case "color":
                                    y = y + 1;
                                    break;
                                case "image":
                                    y = y + 2;
                                    break;
                                case "repeat":
                                    y = y + 4;
                                    break;
                                case "attachment":
                                    y = y + 8;
                                    break;
                                case "position":
                                    y = y + 16;
                                    break;
                            }
                        }
                        if (y == 31)
                        {
                            CSSText = ShortHandBackGroundReplaceV2(mcProperySet, re, CSSText);
                        }
                    }
                }
            }
            private static void HasAllFontProperties(Regex re, ref string CSSText)
            {
                {
                    MatchCollection mcProperySet = re.Matches(CSSText);
                    int z = 5;
                    if (mcProperySet.Count == z)
                    {

                        int y = 0;
                        for (int x = 0; x < z; x = x + 1)
                        {
                            switch (mcProperySet[x].Groups["fontProperty"].Value)
                            {
                                case "style":
                                    y = y + 1;
                                    break;
                                case "variant":
                                    y = y + 2;
                                    break;
                                case "weight":
                                    y = y + 4;
                                    break;
                                case "size":
                                    y = y + 8;
                                    break;
                                case "family":
                                    y = y + 16;
                                    break;
                            }
                        }
                        if (y == 31)
                        {
                            CSSText = ShortHandFontReplaceV2(mcProperySet, re, CSSText);
                        }
                    }
                }
            }
            private static void HasAllListStyle(Regex re, ref string CSSText)
            {
                {
                    int z = 3;
                    MatchCollection mcProperySet = re.Matches(CSSText);
                    if (mcProperySet.Count == z)
                    {

                        int y = 0;
                        for (int x = 0; x < z; x = x + 1)
                        {
                            switch (mcProperySet[x].Groups["style"].Value)
                            {
                                case "type":
                                    y = y + 1;
                                    break;
                                case "image":
                                    y = y + 2;
                                    break;
                                case "position":
                                    y = y + 4;
                                    break;

                            }
                        }
                        if (y == 7)
                        {
                            CSSText = ShortHandListReplaceV2(mcProperySet, re, CSSText);
                        }
                    }
                }
            }
            private static void HasAllPositions(Regex re, ref string CSSText)
            {
                {
                    MatchCollection mcProperySet = re.Matches(CSSText);
                    if (mcProperySet.Count == 4)
                    {

                        int y = 0;
                        for (int x = 0; x < 4; x = x + 1)
                        {
                            switch (mcProperySet[x].Groups["position"].Value)
                            {
                                case "top":
                                    y = y + 1;
                                    break;
                                case "right":
                                    y = y + 2;
                                    break;
                                case "bottom":
                                    y = y + 4;
                                    break;
                                case "left":
                                    y = y + 8;
                                    break;
                            }
                        }
                        if (y == 15)
                        {
                            CSSText = ShortHandReplaceV2(mcProperySet, re, CSSText);
                        }
                    }
                }
            }
            private static string ShortHandFontReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
            {
                /*
                 * This Function replaces the individual font properties with a single entry
                 * */
                string strFamily, strStyle, strVariant, strWeight, strSize;
                Regex reLineHeight = new Regex(@"line-height\s*:\s*((?:\d*\.?\d+(?:%|(p(?:[xct])|(?:[cem])m|in|ex)\b)?)|normal|inherit);?", RegexOptions.IgnoreCase);
                strFamily = string.Empty;
                strStyle = string.Empty;
                strVariant = string.Empty;
                strWeight = string.Empty;
                strSize = string.Empty;
                string strStyle_Variant_Weight = string.Empty;
                foreach (Match mProperty in mcProperySet)
                {
                    switch (mProperty.Groups["fontProperty"].Value)
                    {
                        case "family":
                            strFamily = string.Format(" {0}", mProperty.Groups["fontPropertyValue"].Value);
                            break;
                        case "size":
                            if (reLineHeight.IsMatch(InputText))
                            {
                                Match m = reLineHeight.Match(InputText);
                                if (m.Groups[1].Value != "normal")
                                {
                                    strSize = String.Format("/{0}", m.Groups[1].Value);
                                }
                                InputText = reLineHeight.Replace(InputText, string.Empty);
                            }
                            strSize = string.Format(" {0}{1}", mProperty.Groups["fontPropertyValue"].Value, strSize);
                            if (strSize == "medium")
                            {
                                strSize = string.Empty;
                            }
                            break;
                        case "style":
                        case "variant":
                        case "weight":
                            if (mProperty.Groups["fontPropertyValue"].Value != "normal")
                            {
                                strStyle_Variant_Weight += string.Format(" {0}", mProperty.Groups["fontPropertyValue"].Value);
                            }
                            break;

                    }
                }

                string strShortcut;
                string strProperties = string.Format("{0}{1}{2}{3}{4};", strStyle_Variant_Weight, strVariant, strWeight, strSize, strFamily);
                strShortcut = string.Format("font:{0}", strProperties.Trim());
                string strNewBlock = re.Replace(InputText, "");
                strNewBlock = strNewBlock.Insert(1, strShortcut);
                return strNewBlock;
            }
            private static string ShortHandBackGroundReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
            {
                /*
                 * This Function replaces the individual background properties with a single entry
                 * */
                string strColor, strImage, strRepeat, strAttachment, strPosition;
                strColor = string.Empty;
                strImage = string.Empty;
                strRepeat = string.Empty;
                strAttachment = string.Empty;
                strPosition = string.Empty;
                foreach (Match mProperty in mcProperySet)
                {
                    switch (mProperty.Groups["property"].Value)
                    {
                        case "color":
                            if (mProperty.Groups["unit"].Value != "transparent")
                            {
                                strColor = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "image":
                            if (mProperty.Groups["unit"].Value != "none")
                            {
                                strImage = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "repeat":
                            if (mProperty.Groups["unit"].Value != "repeat")
                            {
                                strRepeat = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "attachment":
                            if (mProperty.Groups["unit"].Value != "scroll")
                            {
                                strAttachment = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "position":
                            if (mProperty.Groups["unit"].Value != "0% 0%")
                            {
                                strPosition = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                    }
                }

                string strShortcut;
                string strProperties = string.Format("{0}{1}{2}{3}{4};", strColor, strImage, strRepeat, strAttachment, strPosition);
                strShortcut = string.Format("background:{0}", strProperties.Trim());
                string strNewBlock = re.Replace(InputText, "");
                strNewBlock = strNewBlock.Insert(1, strShortcut);
                return strNewBlock;
            }
            private static string ShortHandReplaceV2(MatchCollection mcProperySet, Regex reTRBL1, string InputText)
            {
                // Replace method for regexes used in ShortHand property method for properties with top, right, bottom and left sub properties.
                string strTop, strRight, strBottom, strLeft;
                strTop = string.Empty;
                strRight = string.Empty;
                strBottom = string.Empty;
                strLeft = string.Empty;
                string strProperty;
                strProperty = string.Format("{0}{1}", mcProperySet[0].Groups["property"].Value, mcProperySet[0].Groups["property2"].Value);
                foreach (Match mProperty in mcProperySet)
                {
                    switch (mProperty.Groups["position"].Value)
                    {
                        case "top":
                            strTop = mProperty.Groups["unit"].Value;
                            break;
                        case "right":
                            strRight = mProperty.Groups["unit"].Value;
                            break;
                        case "bottom":
                            strBottom = mProperty.Groups["unit"].Value;
                            break;
                        case "left":
                            strLeft = mProperty.Groups["unit"].Value;
                            break;
                    }

                }

                string strShortcut = string.Format("{0}:{1} {2} {3} {4};", strProperty, strTop, strRight, strBottom, strLeft);
                string strNewBlock = reTRBL1.Replace(InputText, "");
                strNewBlock = strNewBlock.Insert(1, strShortcut);
                return strNewBlock;
            }
            private static string ShortHandListReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
            {
                /*
                 * This Function replaces the individual list properties with a single entry
                 * */
                string strType, strPosition, strImage;
                strType = string.Empty;
                strPosition = string.Empty;
                strImage = string.Empty;
                foreach (Match mProperty in mcProperySet)
                {
                    switch (mProperty.Groups["style"].Value)
                    {
                        case "type":
                            if (mProperty.Groups["unit"].Value != "disc")
                            {
                                strType = mProperty.Groups["unit"].Value;
                            }
                            break;
                        case "position":
                            if (mProperty.Groups["unit"].Value != "outside")
                            {
                                strPosition = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "image":
                            if (mProperty.Groups["unit"].Value != "none")
                            {
                                strImage = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                    }

                }

                string strShortcut = string.Format("list-style:{0}{1}{2};", strType, strPosition, strImage);
                string strNewBlock = re.Replace(InputText, "");
                strNewBlock = strNewBlock.Insert(1, strShortcut);
                return strNewBlock;
            }
        }
    }
     

  • Follow up to Additional CSS minifying regex patterns

    OK, these regexes were discussed in the previous post; this is mostly just their application.

    This is a C# 2.0 enhancement of a C# port of YUI Compressor's CSS minification code.

    Since I was doing this in C#, I took full advantage of its regex engine, namely using lookbehinds and delegates for some replaces.

    Almost all the regexes after the "New test" comment are new or modified relative to the ported version. There are also one new and two modified expressions before that comment. One of those modifications is just a change in writing style; the other replaces some code, but (hopefully) not its functionality, with a regex replace. The new regex replacements are, of course, the new compression enhancements.

    There are also a couple of new regexes, not mentioned in the previous post, that match some of the color values and replace them with an equivalent but more concisely written value. The replacement for the color "red" is a straight replace, but the other colors require some code evaluation and use delegates.

    I've done some very limited testing, but as I mentioned in the previous post, most of the CSS I've written doesn't have some of the new things I was searching for. I could add them for a test (which I did), but that won't catch any problems they may cause in the actual CSS application, since I wasn't really using the test values. So the source code is now available for beta testing. Test early and often before committing to use it. I'm willing to fix any minor bugs for things I may have overlooked, but if a particular replace is problematic, it's easy enough to comment out the offender and use the rest.

    And as was mentioned in the comments of the previous post, any generated content that looks like CSS may get stepped on, so be aware of that.

    Also, all licenses for previous versions still apply.

    UPDATE 2008-04-27

    After a little more testing I discovered that one of the replaces I was doing can alter how the CSS is processed, so I have crossed out the functions and the function call.
    I've come up with a safer replacement, though one that is less likely to apply.

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;
     namespace CSSMinify
    {
     class CSSMinify
     {
       public static Hashtable shortColorNames = new Hashtable();
       public static Hashtable shortHexColors = new Hashtable();
       public static string Minify(string css)
       {
         return Minify(css, 0);
       }
       public static string Minify(string css, int columnWidth)
       {
       // BSD License http://developer.yahoo.net/yui/license.txt
       // New css tests and regexes by Michael Ash
         createHashTable();
         MatchEvaluator rgbDelegate = new MatchEvaluator(RGBMatchHandler);
         MatchEvaluator shortColorNameDelegate = new MatchEvaluator(ShortColorNameMatchHandler);
         MatchEvaluator shortColorHexDelegate = new MatchEvaluator(ShortColorHexMatchHandler);
         css = RemoveCommentBlocks(css);
         css = Regex.Replace(css, @"\s+", " "); //Normalize whitespace
         css = Regex.Replace(css, @"\x22\x5C\x22}\x5C\x22\x22", "___PSEUDOCLASSBMH___"); //hide Box model hack
         /* Remove the spaces before the things that should not have spaces before them.
             But, be careful not to turn "p :link {...}" into "p:link{...}"
         */
         css = Regex.Replace(css, @"(?#no preceding space needed)\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
         css = Regex.Replace(css, @"([!{}:;>+([,])\s+", "$1"); // Remove the spaces after the things that should not have spaces after them.
         css = Regex.Replace(css, @"([^;}])}", "$1;}"); // Add the semicolon where it's missing.
         css = Regex.Replace(css, @"(\d+)\.0+(p(?:[xct])|(?:[cem])m|%|in|ex)\b", "$1$2"); // Remove .0 from size units x.0em becomes xem
         css = Regex.Replace(css, @"([\s:])(0)(px|em|%|in|cm|mm|pc|pt|ex)\b", "$1$2"); // Remove unit from zero
         //New test
         css = ShortHandProperty(css);
         //css = Regex.Replace(css, @":(\s*0){2,4}\s*;", ":0;"); // if all parameters zero just use 1 parameter
         // if all 4 parameters the same unit make 1 parameter
         css = Regex.Replace(css, @":\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};", ":$1;", RegexOptions.IgnoreCase);
         // if has 4 parameters and top unit = bottom unit and right unit = left unit make 2 parameters
         css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;", ":$1;", RegexOptions.IgnoreCase);
         // if has 4 parameters and top unit != bottom unit and right unit = left unit make 3 parameters
         css = Regex.Replace(css, @":\s*((?:(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
         //// if has 3 parameters and top unit = bottom unit make 2 parameters
         //css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|    (?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css,"background-position:0;", "background-position:0 0;");
         css = Regex.Replace(css,@"(:|\s)0+\.(\d+)", "$1.$2");
       // Outline-style and border-style parameter reduction
         css = Regex.Replace(css, @"(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};", "$1-style:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);", "$1-style:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
         // Outline-color and Border-color parameter reduction
         css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};", "$1-color:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);", "$1-color:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);", "$1-color:$2;", RegexOptions.IgnoreCase);
     // Shorten colors from rgb(51,102,153) to #336699
     // This makes it more likely that it'll get further compressed in the next step.
         css = Regex.Replace(css,@"rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29", rgbDelegate);
         css = Regex.Replace(css, @"(?<![\x22\x27=]\s*)\#(?:([0-9A-F])\1)(?:([0-9A-F])\2)(?:([0-9A-F])\3)", "#$1$2$3", RegexOptions.IgnoreCase);
     // Replace hex color code with the named value if the name is shorter
         css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>f00)\b", "red",RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>[0-9a-f]{6})", shortColorNameDelegate, RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(?<=color\s*:\s*)\b(Black|Fuchsia|LightSlateGr[ae]y|Magenta|White|Yellow)\b", shortColorHexDelegate,RegexOptions.IgnoreCase);
         // Remove empty rules.
         css = Regex.Replace(css,@"[^}]+{;}", "");
         //Remove semicolon of last property
         css = Regex.Replace(css, ";(})", "$1");
         if (columnWidth > 0)
         {
           css = BreakLines(css, columnWidth);
         }
         return css;
     }
     private static string RemoveCommentBlocks(string input)
     {
       int startIndex = 0;
       int endIndex = 0;
       bool iemac = false;
       startIndex = input.IndexOf(@"/*", startIndex);
       while (startIndex >= 0)
       {
         endIndex = input.IndexOf(@"*/", startIndex + 2);
         if (endIndex >= startIndex + 2)
         {
           if (input[endIndex - 1] == '\\')
           {
             startIndex = endIndex + 2;
             iemac = true;
           }
           else if (iemac)
           {
             startIndex = endIndex + 2;
             iemac = false;
            }
           else
           {
            input = input.Remove(startIndex, endIndex + 2 - startIndex);
            }
         }
         else
         {
           break; // unterminated comment: stop rather than loop forever
         }
         startIndex = input.IndexOf(@"/*", startIndex);
       }
     return input;
    }
     private static String RGBMatchHandler(Match m)
     {
       int val = 0;
       StringBuilder hexcolor = new StringBuilder("#");
       for(int index=1; index <= 3; index += 1)
       {
         val = Int32.Parse(m.Groups[index].Value);
         hexcolor.Append(val.ToString("x2"));
       }
       return hexcolor.ToString();
     }
     private static string BreakLines(string css, int columnWidth)
    {
       int i = 0;
       int start = 0;
       StringBuilder sb = new StringBuilder(css);
       while (i < sb.Length)
       {
         char c = sb[i++];
         if (c == '}' && i - start > columnWidth)
         {
           sb.Insert(i, '\n');
           start = i;
         }
       }
     return sb.ToString();
     }
     private static string ShortHandProperty(string css)
     {
     /*
      * This function searches for properties specifying at least 2 of the top, right, bottom or left box model
      * positions and reduces them to a single property using shorthand notation
      */
       Regex reCSSBlock = new Regex("{[^{}]*}");
       Regex reTRBL1 = new Regex(@"(?<fullProperty>(?:(?<property>padding)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
       Regex reTRBL2 = new Regex(@"(?<fullProperty>(?:(?<property>margin)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
       Regex reTRBL3 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:color)))\s*:\s*(?<unit>[#\w.]+);?", RegexOptions.IgnoreCase);
       Regex reTRBL4 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:style)))\s*:\s*(?<unit>none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset);?",  RegexOptions.IgnoreCase);
       Regex reTRBL5 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:width)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
       MatchCollection mcBlocks = reCSSBlock.Matches(css);
       foreach (Match mBlock in mcBlocks)
       {
         string strBlock= mBlock.Value;
         MatchCollection mcProperySet = reTRBL1.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL1, strBlock);
     
         }
         mcProperySet = reTRBL2.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL2, strBlock);
         }
         mcProperySet = reTRBL3.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL3, strBlock);
         }
         mcProperySet = reTRBL4.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL4, strBlock);
         }
         mcProperySet = reTRBL5.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL5, strBlock);
         }
         css = css.Replace(mBlock.Value, strBlock);
       }
       return css;
     }
     private static string ShortHandReplace(MatchCollection mcProperySet, Regex reTRBL1, string InputText)
     {
     // Replace method for regexes used in ShortHand property method.
       string strTop, strRight, strBottom, strLeft;
       strTop = string.Empty;
       strRight = string.Empty;
       strBottom = string.Empty;
       strLeft = string.Empty;
       string strProperty;
       string strDefaultValue;
     
       strProperty = string.Format("{0}{1}", mcProperySet[0].Groups["property"].Value, mcProperySet[0].Groups["property2"].Value);
       switch (strProperty){
         case "border-color":
           strDefaultValue = "inherit";
           break;
         case "border-style":
           strDefaultValue = "none";
           break;
         default:
           strDefaultValue = "0";
           break;
       }
       foreach (Match mProperty in mcProperySet)
       {
         if (mProperty.Groups["position"].Value == "top")
         {
           if (strTop == string.Empty)
           {
             strTop = mProperty.Groups["unit"].Value;
           }
           else
           {
             break;
           }
         }
         if (mProperty.Groups["position"].Value == "right")
         {
           if (strRight == string.Empty)
           {
             strRight = mProperty.Groups["unit"].Value;
           }
           else
           {
             break;
           }
          }
         if (mProperty.Groups["position"].Value == "bottom")
         {
           if (strBottom == string.Empty)
           {
             strBottom = mProperty.Groups["unit"].Value;
           }
           else
           {
             break;
           }
         }
         if (mProperty.Groups["position"].Value == "left")
         {
           if (strLeft == string.Empty)
           {
             strLeft = mProperty.Groups["unit"].Value;
           }
           else
           {
             break;
           }
         }
       }
       if (strTop == string.Empty)
       {
         strTop = strDefaultValue;
       }
       if (strRight == string.Empty)
       {
         strRight = strDefaultValue;
       }
       if (strBottom == string.Empty)
       {
         strBottom = strDefaultValue;
       }
       if (strLeft == string.Empty)
       {
         strLeft = strDefaultValue;
       }
       string strShortcut = string.Format("{0}:{1} {2} {3} {4};", strProperty, strTop, strRight, strBottom, strLeft);
       string strNewBlock = reTRBL1.Replace(InputText, "");
       strNewBlock = strNewBlock.Insert(1, strShortcut);
       return strNewBlock;
     }

     private static string ShortColorNameMatchHandler(Match m)
     {
     // This function replaces hex color values with named colors if the name is shorter than the hex code
       string returnValue = m.Value;
       string hex = m.Groups["hex"].Value.ToLower(); // table keys are stored in lower case
       if (shortColorNames.ContainsKey(hex))
       {
         returnValue = shortColorNames[hex].ToString();
       }
       return returnValue;
     }
     private static string ShortColorHexMatchHandler(Match m)
     {
       return shortHexColors[m.Value.ToString().ToLower()].ToString();
     }
     private static void createHashTable()
     {
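       // Populate the tables only once; Hashtable.Add throws on a duplicate key if Minify is called again.
       if (shortColorNames.Count > 0)
       {
         return;
       }
     //Color names shorter than hex notation. Except for red.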
       shortColorNames.Add("F0FFFF".ToLower(), "Azure".ToLower());
       shortColorNames.Add("F5F5DC".ToLower(), "Beige".ToLower());
       shortColorNames.Add("FFE4C4".ToLower(), "Bisque".ToLower());
       shortColorNames.Add("A52A2A".ToLower(), "Brown".ToLower());
       shortColorNames.Add("FF7F50".ToLower(), "Coral".ToLower());
       shortColorNames.Add("FFD700".ToLower(), "Gold".ToLower());
       shortColorNames.Add("808080".ToLower(), "Grey".ToLower());
       shortColorNames.Add("008000".ToLower(), "Green".ToLower());
       shortColorNames.Add("4B0082".ToLower(), "Indigo".ToLower());
       shortColorNames.Add("FFFFF0".ToLower(), "Ivory".ToLower());
       shortColorNames.Add("F0E68C".ToLower(), "Khaki".ToLower());
       shortColorNames.Add("FAF0E6".ToLower(), "Linen".ToLower());
       shortColorNames.Add("800000".ToLower(), "Maroon".ToLower());
       shortColorNames.Add("000080".ToLower(), "Navy".ToLower());
       shortColorNames.Add("808000".ToLower(), "Olive".ToLower());
       shortColorNames.Add("FFA500".ToLower(), "Orange".ToLower());
       shortColorNames.Add("DA70D6".ToLower(), "Orchid".ToLower());
       shortColorNames.Add("CD853F".ToLower(), "Peru".ToLower());
       shortColorNames.Add("FFC0CB".ToLower(), "Pink".ToLower());
       shortColorNames.Add("DDA0DD".ToLower(), "Plum".ToLower());
       shortColorNames.Add("800080".ToLower(), "Purple".ToLower());
       shortColorNames.Add("FA8072".ToLower(), "Salmon".ToLower());
       shortColorNames.Add("A0522D".ToLower(), "Sienna".ToLower());
       shortColorNames.Add("C0C0C0".ToLower(), "Silver".ToLower());
       shortColorNames.Add("FFFAFA".ToLower(), "Snow".ToLower());
       shortColorNames.Add("D2B48C".ToLower(), "Tan".ToLower());
       shortColorNames.Add("008080".ToLower(), "Teal".ToLower());
       shortColorNames.Add("FF6347".ToLower(), "Tomato".ToLower());
       shortColorNames.Add("EE82EE".ToLower(), "Violet".ToLower());
       shortColorNames.Add("F5DEB3".ToLower(), "Wheat".ToLower());
       // Hex notation shorter than named value
       shortHexColors.Add("black", "#000");
       shortHexColors.Add("fuchsia", "#f0f");
       shortHexColors.Add("lightslategray", "#789");
       shortHexColors.Add("lightslategrey", "#789");
       shortHexColors.Add("magenta", "#f0f");
       shortHexColors.Add("white", "#fff");
       shortHexColors.Add("yellow", "#ff0");
       }
     }
    }
     

  • Additional CSS minifying regex patterns

    NOTE: All the regexes on this page that I wrote assume IgnoreCase = true.

    I was looking at the regexes used in the YUI Compressor to minify CSS and came up with a few more that I think could help the process. The code and port I was looking at were already trimming unneeded zeros from the top, right, bottom and left values with a simple string replace, but it took three separate replaces. It was pretty simple to come up with a single regex to handle all the cases.

    (Pseudo code)


    string.Replace(":0 0 0 0;", ":0;")
    string.Replace(":0 0 0;", ":0;")
    string.Replace(":0 0;", ":0;")

    becomes

    Regex.Replace(input, @":(\s*0)(\s+0){0,3}\s*;", ":0;")

    Pretty simple, but of course I thought, why stop there? So I came up with a regex for all numbers, not just zeros.

    :\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};

    The replacement string is simply ":$1;"
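
    As a rough sketch, here is how the zero pattern and the all-numbers pattern behave when applied one after the other. I wrote these for .NET; Python's re module is used here purely as a stand-in, and the function name collapse_repeats is made up for the illustration.

```python
import re

# Collapses runs of zeros such as ":0 0 0 0;" into ":0;".
ZEROS = re.compile(r":(\s*0)(\s+0){0,3}\s*;")

# Collapses two to four identical values, e.g. ":5px 5px;" into ":5px;".
SAME_VALUES = re.compile(
    r":\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};",
    re.IGNORECASE,
)

def collapse_repeats(css):
    """Apply the zero pattern first, then the general same-value pattern."""
    css = ZEROS.sub(":0;", css)
    # Python writes the first group as \1 where .NET writes $1.
    return SAME_VALUES.sub(r":\1;", css)

print(collapse_repeats("margin:0 0 0 0;padding:5px 5px;"))  # margin:0;padding:5px;
```

    Values that differ, like "margin:1em 2em;", are left alone because the backreference \1 only matches a repeat of the first value.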

    Once this was done, next on the list was handling the cases where the numbers are not all the same.


    For those who don't know CSS, this part of the syntax says that of the 4 possible values:

    1) if only one value is specified, the other 3 are implied to be the same value (X = X X X X)
    2) if two values are specified, the 1st implies the 3rd and the 2nd implies the 4th (X Y = X Y X Y)
    3) if three values are specified, the 2nd implies the 4th (X Y Z = X Y Z Y)
     
    Of course when minifying you want the shorter syntax, so the following regexes convert the longer forms to the shorter ones.
    The replacement string for all of them is the same as above, ":$1;"


    # 4 parameters to 2 (x y x y to x y)

    :\s*((0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;


    # 4 to 3 (x y z y to x y z) or 3 to 2 (x y x to x y)

    :\s*((?:(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;

    Though they may look unwieldy, the longer ones just repeat one sub-pattern. The only real difference is what is and isn't captured.
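
    To make the repetition easier to see, here is a sketch that builds both shortening patterns from a single number sub-pattern. It is written in Python as a stand-in for .NET. The names NUM, FOUR_TO_TWO, DROP_REPEAT and shorten_box_values are mine, and NUM uses the \d*\.?\d+ number form from the all-numbers regex with the redundant grouping stripped out.

```python
import re

# One value: a bare 0, or a number followed by a CSS unit.
NUM = r"(?:0|\d*\.?\d+(?:p[xct]|[cem]m|%|in|ex))"

# x y x y -> x y  (group 1 is kept, groups 2 and 3 are x and y)
FOUR_TO_TWO = re.compile(rf":\s*(({NUM})\s+({NUM}))\s+\2\s+\3;", re.IGNORECASE)

# x y z y -> x y z  and  x y x -> x y  (group 1 is kept, group 2 is the repeated value)
DROP_REPEAT = re.compile(rf":\s*((?:{NUM}\s+)?({NUM})\s+{NUM})\s+\2;", re.IGNORECASE)

def shorten_box_values(css):
    """Apply the 4-to-2 rule first, then the 4-to-3 / 3-to-2 rule."""
    css = FOUR_TO_TWO.sub(r":\1;", css)
    return DROP_REPEAT.sub(r":\1;", css)

print(shorten_box_values("margin:1px 2px 1px 2px;"))   # margin:1px 2px;
print(shorten_box_values("padding:1px 2px 3px 2px;"))  # padding:1px 2px 3px;
print(shorten_box_values("margin:1em 0 1em;"))         # margin:1em 0;
```

    Only the parentheses around NUM change between the two patterns, which is the "what is and isn't captured" difference.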

    Along those same lines I came up with similar patterns for border-style, outline-style, border-color and outline-color

    border-style/outline-style

    The replacement string for all is “$1-style:$2;”


    (outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};


    (outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);


    (outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);
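
    Here is a small sketch of the first two style patterns in action, again with Python standing in for .NET. STYLE, ALL_SAME, PAIRS and shorten_style are names I made up for the example, and the keyword list is written without any trailing space inside the group.

```python
import re

# The style keywords, factored out so the patterns stay readable.
STYLE = r"(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)"

# a a, a a a, a a a a -> a
ALL_SAME = re.compile(
    rf"(outline|border)-style\s*:\s*({STYLE})(?:\s+\2){{1,3}};", re.IGNORECASE
)

# a b a b -> a b
PAIRS = re.compile(
    rf"(outline|border)-style\s*:\s*(({STYLE})\s+({STYLE}))\s+\3\s+\4;", re.IGNORECASE
)

def shorten_style(css):
    # \1 is "outline" or "border", \2 is the part we keep.
    css = ALL_SAME.sub(r"\1-style:\2;", css)
    return PAIRS.sub(r"\1-style:\2;", css)

print(shorten_style("border-style: solid solid solid solid;"))       # border-style:solid;
print(shorten_style("outline-style:dotted dashed dotted dashed;"))   # outline-style:dotted dashed;
```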


    border-color/outline-color

    The replacement string for all is “$1-color:$2;”


    (outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};


    (outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);


    (outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);
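
    The color patterns work the same way. A quick sketch of the first one, using Python in place of .NET (COLOR_ALL_SAME and shorten_color are hypothetical names):

```python
import re

# Either a 3 or 6 digit hex color, or any run of non-whitespace (a color name).
COLOR_ALL_SAME = re.compile(
    r"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};",
    re.IGNORECASE,
)

def shorten_color(css):
    return COLOR_ALL_SAME.sub(r"\1-color:\2;", css)

print(shorten_color("border-color:#fff #fff #fff #fff;"))  # border-color:#fff;
print(shorten_color("outline-color:red red;"))             # outline-color:red;
```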


    I also came up with a couple more regexes to replace some code, but as a couple of these use look-behinds they are not as portable.

    These work in .NET, which supports variable-length look-behinds.

    This pattern

    \s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))

    is used to find characters preceded by unneeded whitespace, taking care not to find colons that are used in pseudo-selectors or pseudo-classes. The replacement string would be $1.

    This pattern

    (?<![\x22\x27=]\s*)\#([0-9A-F])\1([0-9A-F])\2([0-9A-F])\3

    is used for reducing hexadecimal values from #AABBCC to #ABC. Since the match includes the leading #, the replacement string would be #$1$2$3.

    The last pattern I created finds RGB values of the format rgb(x, y, z), where x, y and z are integers in the range 0-255.

    rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29

    The pattern I used is just stricter than the existing one and puts each value in a group so I can work with them in further processing. The match itself is passed to a MatchEvaluator to do the decimal to hex conversion.
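
    As a sketch of that last step: in Python, passing a function to re.sub plays the role that a MatchEvaluator plays in .NET. The names OCTET, RGB, rgb_to_hex and convert_rgb are made up for this example, and the conversion shown is only my guess at the simplest version, not the actual compressor code.

```python
import re

# One 0-255 value, captured.
OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d?\d)"
RGB = re.compile(rf"rgb\s*\({OCTET}\s*,\s*{OCTET}\s*,\s*{OCTET}\s*\)", re.IGNORECASE)

def rgb_to_hex(match):
    """Called once per match; turns the three captured decimals into #rrggbb."""
    return "#" + "".join(f"{int(value):02x}" for value in match.groups())

def convert_rgb(css):
    return RGB.sub(rgb_to_hex, css)

print(convert_rgb("color: rgb(255, 0, 128);"))  # color: #ff0080;
```

    Out-of-range input such as rgb(300, 0, 0) is simply not matched, so it passes through untouched.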

    I did a little alpha testing, but most of the CSS I've written doesn't have the stuff I'm testing for, so beta testing is definitely in order. Also I don't really know CSS hacks, so I don't know if any of this will have an adverse effect on them. Other than the top/right/bottom/left parameters I'm mostly trying to duplicate the existing effects. I'm going to pass this info on to the YUI Compressor maintainer to see if they can make use of any of it.

  • A touch of Character Class

    The square brackets character class is one of the more misunderstood of the basic regex features. This feature is supported in virtually all regex implementations. In fact, off the top of my head I don't know of an implementation that doesn't support it. Maybe it's not well documented in most tutorials, or maybe the samples are not clear enough, or maybe users are just skimming over the details for this one, but I see this feature misused quite often.

    This type of character class simply matches one and only one of the characters between the square brackets. That's pretty much it. Now there are a few caveats to what can go between the brackets but the output is the same: one, and only one, character will be returned if there is a match.

    Now a common error rookie regex users tend to make is trying to write patterns inside the character class to make it match a particular sequence of characters, either a particular fixed string or a more generic regex pattern. One of the obvious tells of this approach is when the same character appears in the character class multiple times, something like [RADAR] or [HELLO]. Listen up newbies, you can't write any sort of pattern within a character class, so don't waste your time trying. If you have a fixed string like “ABC”, don't try using the character class [ABC] to match that string as a whole. As I stated in the previous paragraph, this feature is for matching a character (singular) from a group of characters, not an ordering of characters (plural). So what you get isn't one match of “ABC”; what you get is 3 matches, since each character in the string is one of the 3 characters. There is nothing you can do inside the brackets to make it return “ABC” as a single match. You can't group these characters together, since you can't write any patterns within the character class. All the characters in the class are tested against just one position in your input.

    Now you can add a quantifier, like + or *, after the character class to make “ABC” one match, but remember what the character class matches: one character. The quantifier allows the whole character class to be applied multiple times. None of the possible choices are eliminated in following applications of the class. Every application of [ABC] tests for 1 of 3 possible characters; [ABC]+ tests for 1 of 3 possible characters, one or more times. So while it will match “ABC”, it doesn't only match “ABC”, which brings me to my second topic, order.
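
    A quick way to see the difference for yourself, sketched in Python (any engine should behave the same way for these patterns; the variable names are just for the example):

```python
import re

text = "ABC"

# The class matches one character per match, so the string yields three matches.
single = re.findall(r"[ABC]", text)       # ['A', 'B', 'C']

# The literal pattern matches the whole string once.
literal = re.findall(r"ABC", text)        # ['ABC']

# A quantifier repeats the whole class, so any mix of A, B and C matches,
# not just the sequence "ABC".
candidates = ("ABC", "CAB", "AAAA", "ABD")
repeated = [s for s in candidates if re.fullmatch(r"[ABC]+", s)]  # ['ABC', 'CAB', 'AAAA']

print(single, literal, repeated)
```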

    Along with the incorrect belief that you can include a pattern within a character class comes the belief that the characters within the class have to be in the same order as the characters being tested against. This isn't true. Everything I said about the regex pattern [ABC] is true for the patterns [CBA], [CAB] and [BCA]. They all match exactly the same thing. With a few exceptions, a couple of which have to do with ranges (one of which I've discussed before), the order of the characters doesn't matter. If you are using a range, the value on the left of the hyphen has to have a lower ASCII value (or code point for Unicode) than the value on the right. Of the previous equivalent character class patterns, the only way to write an equivalent range is [A-C]. [C-A] won't work and in some implementations will throw a compile error.
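
    Here is a small check of both points. Python happens to be one of the implementations that rejects a reversed range when the pattern is compiled; class_matches and compiles are helper names I made up.

```python
import re

def class_matches(pattern, chars="ABCD"):
    """Return the characters from chars that the one-character pattern accepts."""
    return [c for c in chars if re.fullmatch(pattern, c)]

# Every ordering, and the equivalent range, accepts exactly A, B and C.
orderings = [class_matches(p) for p in ("[ABC]", "[CBA]", "[CAB]", "[BCA]", "[A-C]")]

def compiles(pattern):
    """True if the engine accepts the pattern, False if it raises a compile error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(orderings[0])        # ['A', 'B', 'C']
print(compiles("[A-C]"))   # True
print(compiles("[C-A]"))   # False
```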

    Now these first couple of errors may come from new users thinking that alternation and character classes are interchangeable. They are not. Some tutorials may present that as the case, especially if the examples for each are too simple. Any character class pattern can be written using alternation; however, the reverse isn't always true. Alternation can include single characters, literal strings or general regex patterns. The character class pattern [ABC] can be written using alternation as A|B|C. The alternation pattern of single characters C|D|E can be written [CDE] or [DCE], but the alternation pattern of multiple-character patterns (ABC)|(DEF)|(GHI) cannot be written as an equivalent pattern using just a character class. Though you could write a pattern that would include matches of the same 3 values, you couldn't limit it to matching just those 3 values. Even a negated character class can be written using alternation, but that would generally lead to enormous patterns, making it impractical to write patterns in that fashion. Those trying to write patterns inside of a character class need to turn their attention to alternation and/or one or more grouping constructs.
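
    To illustrate, here is a sketch comparing the two in Python. The word list is made up, and [A-I]{3} is just one possible character class attempt at covering ABC, DEF and GHI.

```python
import re

words = ["ABC", "DEF", "GHI", "ADG", "IHG"]

# Alternation accepts exactly the three strings.
alternation = [w for w in words if re.fullmatch(r"ABC|DEF|GHI", w)]

# A character class version accepts them too, but can't be limited to them.
char_class = [w for w in words if re.fullmatch(r"[A-I]{3}", w)]

# For single characters the two really are interchangeable.
as_class = [c for c in "ABCD" if re.fullmatch(r"[ABC]", c)]
as_alternation = [c for c in "ABCD" if re.fullmatch(r"A|B|C", c)]

print(alternation)  # ['ABC', 'DEF', 'GHI']
print(char_class)   # ['ABC', 'DEF', 'GHI', 'ADG', 'IHG']
```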

    The final error I see a lot is one that gets regex rookies and vets alike: regex metacharacters. With rookies this mostly falls under trying to use a pattern within the character class, by trying to use quantifiers. Another is using the dot character, which in this situation matches only itself. Within a character class in most implementations, the only metacharacter that still retains its special regex meaning is \ , which means it can be used as it normally is. The shorthand character class notations will work the same, except for \b, which alters its meaning (in most flavors it becomes a backspace), which makes sense if you think about it. You can't have any patterns within a character class, and \b in its use outside of a character class doesn't match a character and would only be used as part of a pattern.

    However one metacharacter, ^, alters its behavior, but only if it is the first character inside the brackets, where it negates the class. This is the other position-dependent case I was referring to earlier. All the other regular metacharacters that have a special meaning outside a character class lose their special meaning within one and are interpreted as literals, so there is no need to escape these characters, though doing so isn't wrong, just unnecessary. I talked about the hyphen, which gains a feature, a few years ago.
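
    A few one-liners showing these rules, sketched in Python. Other flavors may differ on the details, so check your own documentation; the variable names are just for the example.

```python
import re

# Dot, plus, star and parentheses are plain literals inside a class.
literals = re.findall(r"[.+*()]", "a.b+c*(d)")       # ['.', '+', '*', '(', ')']

# The backslash keeps its meaning, so shorthand classes still work.
digits_and_dot = re.findall(r"[\d.]", "v1.25")       # ['1', '.', '2', '5']

# \b stops being a word boundary; in Python it matches a backspace character.
backspace = re.fullmatch(r"[\b]", "\x08") is not None  # True

# A leading ^ negates the class; anywhere else it is just a caret.
negated = re.findall(r"[^k]", "kick")                # ['i', 'c']
caret_literal = re.findall(r"[k^]", "k^i")           # ['k', '^']

print(literals, digits_and_dot, backspace, negated, caret_literal)
```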

    Finally, one last thing I see rookie regex users doing isn't really an error, but it makes it look like they aren't sure when to use a character class: a positive character class with only one character in it, like [k]. There is no reason to do this. Just type the character. If you were negating a single character then that makes sense, but otherwise it's just extra typing.


    The character class is great when you are trying to match (or negate) a single character at one position in your input and you have lots of possible characters to choose from, but anything else is outside of its domain.






  • Remember where you come from

    One thing I've noticed among rookie regex users is that they focus way too much on what they are trying to match and not on where they are trying to match it from. There is a tendency to drastically underestimate the importance of the source data they are trying to apply their regex against.

    I see this all the time on forums, where posters asking for help almost never post any sample data, and most of the few that do post a sample they made up on the spot. Over on the RegexAdvice construction forum we've added guidelines for posting a question. The whole point of the guidelines is to expedite getting questions answered. One of the things we ask is that posters post a sample of the actual data they are trying to match. We explicitly ask that they don't make up a sample. Unfortunately most people don't follow this request, along with others in the guidelines, which generally results in those questions taking longer to get answered.

    On this particular point, not supplying any sample is the worst offense, because anyone trying to help you can't see what you are working with and can only give an untested guess at a solution. It's like calling a mechanic on the phone and saying “My car isn't running right”, leaving your number and waiting for him to call back to tell you how to fix it. He can call back and tell you to change your oil, but without more detail he can't know that will fix your problem.

    However, making up a sample doesn't help much either and is almost as bad. I know people think they are being helpful and not boring people with the bloody details by providing a simple example to work with, but generally it's not helpful since it doesn't represent their real problem. In regex construction details matter. Often the made up sample is too simple. So what they end up getting is a solution that works great for their sample but doesn't handle their real problem well, if at all. So now you are back where you started. This is like calling your mechanic on the phone and saying “My car is making a funny noise.” While it's less vague than “isn't running right”, it's still vague. You may know exactly what kind of noise it is, how loud it is and where it's coming from, but that sentence doesn't communicate any of that. Made up samples are about the same when asking for regex help.

    The majority of regular expression problems are context sensitive, meaning the approach to matching the target string may depend on where it is in the source string. Now sometimes the target string is so distinct from the surrounding text that the context seems irrelevant. Say you are trying to find a string of digits somewhere in your source data, and there just happens to be only one occurrence of digits in your source. Well, in that case a generic regex that finds digits will be good enough.


    Source : “abc345def” or “a125bcd”


    Regex: \d+


    Now I said the context seems irrelevant in this case, but it is as relevant as ever. The context is that digits only occur once in the string.


    But relevance becomes clearer if you have a source file where there are two or more occurrences of text that mimic your target data but only one is actually the data you want. For example, let's say you want to find the area codes of phone numbers. Well, if you agree that an area code is a 3 digit number, the regex \d{3} might be all you need. It depends. On what, you ask? Come on, you know, say it with me... “Your source data!”


    If your source data is like the case I mentioned first, where the only digits in the file are area codes, or where any other digits in the file come in runs of fewer than three consecutive digits, then the simple regex should work.


    Sample


    1) AL 205, 256, 334

    2) AK 907

    3) AR 479

    4) etc.


    But if your source is a bunch of phone numbers


    1) Mom (555) 123-4567

    2) Sis 1-555-234-4567

    3) Friend 5553456789


    Well, now you have a problem. While your general task is still the same, you have to approach it differently because of the source. You can't simply look for 3 consecutive digits, since that would return 3 matches for each line. But if we know how the data is structured, we can focus on how the data we want sticks out from the rest. In the above sample text the area code meets one of three conditions


    1. 3 consecutive digits inside parentheses

    2. first group of 3 consecutive digits in a string of 1-xxx-xxx-xxxx format

    3. first 3 consecutive digits of a ten consecutive digit string


    Now you can still match what you need in one regex, but how you choose to approach it changes. Some regex authors will use look-arounds to examine the surrounding text; others will match the surrounding text and use groups to pull out the desired data. Either way you have to expand your field of vision beyond what you are trying to match to address the problem.
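
    As one possible sketch of the "match the surroundings and use groups" approach, here is a Python version covering the three conditions above. The pattern and the names AREA_CODE and area_codes are my own illustration, built only from the three sample lines, so treat it as a starting point rather than a finished phone number regex.

```python
import re

AREA_CODE = re.compile(
    r"""
      \((\d{3})\)                   # 1. three digits inside parentheses
    | \b1-(\d{3})-\d{3}-\d{4}\b     # 2. first group of three in 1-xxx-xxx-xxxx
    | \b(\d{3})\d{7}\b              # 3. first three of ten consecutive digits
    """,
    re.VERBOSE,
)

def area_codes(text):
    """Return the area codes found in text, whichever of the three formats they use."""
    # Exactly one of the three groups participates in each match.
    return [next(g for g in m.groups() if g) for m in AREA_CODE.finditer(text)]

for line in ("Mom (555) 123-4567", "Sis 1-555-234-4567", "Friend 5553456789"):
    print(re.findall(r"\d{3}", line), area_codes(line))
# The naive \d{3} returns three matches per line; area_codes returns ['555'] each time.
```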

    But even in this simple example, notice how important it is to know what the source data looks like. Imagine someone asked you for help and you are trying to solve this problem without seeing the actual source above. Someone posts (or asks) the question “I need a regex to match a US area code”.

    With No Sample: Any helpful regex author has to guess how your data is formatted. Some may assume it's a string of “(xxx) yyy-yyyy”, others may assume it's “xxx-yyy-yyyy”, some may even think it's “State: xxx, xxx, xxx”, and none of them might even consider “1-xxx-yyy-yyyy” or “xxxyyyyyyy”.

     

    With Made Up Sample: “I have a string (555) 555-5555 and I need a regex to match a US area code”

    Now regex authors may assume all your phone numbers are in that nice format and that there are no variations. You'll most likely get a regex that matches that format perfectly but fails on the other formats you didn't think were important enough to mention.


    With Real Data: Regex authors don't have to guess. They can look at your data, analyze it themselves, and even test edge cases and address issues you may have overlooked.


    I hope this simple example shows how the same problem can have different approaches, all because of the source. Data in the wild is going to be much more complex than this, which will only magnify the issue.


    Now sometimes you can't provide real data because it's either too much or private. With a large chunk of data, simply clip a portion containing what you want to match. With private data, making up a sample is almost excusable, but don't make up simple ones. Garble the real data but preserve the format: change names to literary or TV characters, change addresses to the address of some public office (city hall, post office) or landmark. This is essentially a made up sample, but if you substitute values your samples won't be as bland and will be closer to your real data and edge cases.


    Understanding your source data is crucial to your regex construction whether you are writing it yourself or asking for help.

  • Are you ready for regex?

    Who should be using regular expressions?

    This has been on my mind for a while: there are some people who shouldn't be using regular expressions. People who don't know what regular expressions are for. Huh? Now you can use regexes in a variety of ways, and you can debate either side on whether certain applications are good uses or bad uses, but that's not what I'm talking about. I'm not even talking about people who don't know how to write regexes, well or at all. I'm talking about people trying to use regular expressions who have no idea what regular expressions do. And I don't mean that you can't decipher a regex pattern by looking at it. I mean you don't understand the general concept that regular expressions match patterns in data.

    I've seen several posts over on the regexadvice.com forums in the past few months which go something like “I have this problem and someone told me I should use a regular expression”. But unless that same someone bothered to explain what a regular expression does or why it is suited for your task, don't be in such a hurry to plug in a regex.

    I'm not saying you need to master regexes before using them, just get your head around the basics (matching). At the very least you should associate regex with wildcard searches, just more powerful. If you are at least that far, proceed with trying to implement your regex. But if you are thinking something completely different, slow down there. Unless you have a basic understanding of what regular expressions do, even simple tasks become way more difficult than they should be. Even if you get someone to help you write a regex, if you still don't know what it's doing you are likely to keep trying to use them incorrectly.


    A common result of this lack of understanding I've been seeing is trying to perform a task solely by writing a regex pattern when regexes themselves have no capacity to perform it. Not that a regex couldn't be a part of the solution, but in the end it's not what will do what the person wants done. Usually it's a coding task, but the regex can at least do some of the prep work. Even if some implementations of the regex engine allow code to be called, it is still the code doing the heavy lifting. The regex is only finding the data. People could save themselves hours of wasted time if they just understood the basics of what they are getting themselves into.


    Another thing I've seen, related to not having a basic understanding, is a task that is not only perfectly suited for a regex but a very common application of one, with the person asking “Is this something that a regex can do?” Of the issues I've mentioned this one is the most and the least understandable. The most because regular expressions are not simple to everyone; to some it comes easy, to others it never comes completely. And some of the documentation isn't the greatest, though there is plenty of good documentation out there these days. So I can understand why some can't get their head around exactly how to write a regex. But it's the least understandable when they are asking how to write a regex so common, and usually simple, that dozens of tutorials and/or articles on regexes use the very regex they are asking for as an example.


    I think the main problems of people who fall into this category are


    1. Lack of research. There are more than enough tutorials out on the web and more than a few books that have simple samples to give you an idea of what a regex does. I find it very hard to believe when someone says they couldn't find a regex for basic (US) zip codes. Did you even look? Or are you only using a regex for this task because someone told you that you should, and you ran out to find someone to write it for you? Or are you just cheating on your homework?

    2. Misunderstanding what regular expressions understand. There seem to be more than a few people who think a regex understands the context of the data it matches. That it not only knows how to match the data but knows what it means, either in the context of their application or to the world at large. Those are the people who think a regex for matching zip codes knows what zip codes are used for and where they would be used. Sorry, but that's not the case. Regexes understand nothing of what your data means to you. As far as the regex engine is concerned it's just a string of characters. It's up to the regex author to know the context of the data they want to match and to shape the regex accordingly to return only the relevant data. This may make it seem like the regex understands the data it's matching, but that's not what's happening. What happened is the person who wrote the regex understood the data and the problem set so well they were able to construct a regex that only matches the relevant values.

    3. Believing that regex is a full-blown programming language. It's not. I've seen questions about wanting a regex to compare numbers, tell time or do some other function completely outside its realm, something most programming languages have a feature to deal with or let you write code for. I've never seen any regex documentation promoting such features, so I can't imagine why someone thinks a regex can perform these tasks, other than my previous point, where they saw the results of a well written regex and speculated on what and how much work the regex did. Like I mentioned, some implementations allow you to perform function calls, but that is more an add-on of the programming environment you are using than a generic regex feature. Realize that regular expressions are one of many features of your programming language, not the other way around.
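
    To make the division of labor concrete, here is a made-up example in Python: the regex only finds the numbers, and ordinary code does the comparing. The dollar-amount format and the function name prices_over are hypothetical, chosen just for the illustration.

```python
import re

def prices_over(text, limit):
    """Return the dollar amounts in text that are greater than limit."""
    # The regex's whole job: find things that look like $4 or $12.99.
    found = re.findall(r"\$(\d+(?:\.\d{2})?)", text)
    # The comparison is plain code; the regex has no idea what "greater than" means.
    return [float(amount) for amount in found if float(amount) > limit]

print(prices_over("apple $1.50, melon $4, cake $12.99", 3))  # [4.0, 12.99]
```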


    If you've read this far and you didn't know what regular expressions did or didn't do before, hopefully by now you have some idea. And if you already knew what they did, just stop to consider this the next time you advise someone to use a regex: make sure you get across the high level point that you are “matching something (a character pattern) in a string” before you get into the more complex aspects of what a regex can do.

  • You've got your sub-matches in my matches

    Hello boys and girls. Wow, it's been a while since I've done this. I want to touch on a very useful but often overlooked feature of regex, grouping. While I haven't been blogging I have been active on a message board here or there. A question I see quite often is “I want to find a match in a string but I don't want part of the match” or “I need the value of this portion of the string”. Now often I see solutions to these types of questions that involve look-arounds. And while they certainly work, they aren't the only way to achieve the desired results. New regex users seem to believe they can only access the full match. Most regex engines support groups, where in a match you can access a certain portion of the full match. Groups are identified by parentheses. Every pair of plain parentheses is a group. For example, take the regex pattern


    /(Hello)\x20(world)/


    There are two groups in the regex. The regex itself matches the string “Hello world”, group 1 contains the string “Hello”, and group 2 contains the string “world”. Note neither group contains the space between the two words, but it is part of the full match. Most implementations of regular expressions give you a way to access these groups. They are contained in a collection inside the match object. Now you'll need to consult your regex documentation to know the exact layout, but in most of these implementations there is a zero-based collection where item 0 is the full match and items 1..n are whichever groups your regex contains, if any.
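
    In Python, for instance, the layout looks like this (other implementations name things differently, so this is just one concrete case):

```python
import re

match = re.search(r"(Hello)\x20(world)", "Hello world")

full = match.group(0)    # the full match, space included
first = match.group(1)   # what the first pair of parentheses matched
second = match.group(2)  # what the second pair matched

print(repr(full), repr(first), repr(second))  # 'Hello world' 'Hello' 'world'
```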


    Now if you noticed, I said every pair of plain parentheses is a group. The reason I stress plain is because there are other grouping constructs, the aforementioned look-arounds being some. Now support for the other constructs varies between implementations, so again consult your regex documentation to see which ones you have. The other grouping constructs consist of an open parenthesis immediately followed by a question mark, which is then followed by the characters that define that particular grouping construct. Again consult your documentation to see which characters define what. I'm not going to go into all of them here, but they basically fall into two categories: capturing and non-capturing. The plain parentheses I've mentioned above are a capturing group. However, capturing requires extra resources, and in some cases you need the extra speed but still need to group a certain part of the pattern together for necessity, readability or both. This is where you'll want to use a non-capturing group (?:pattern): an open parenthesis followed immediately by a question mark followed immediately by a colon. The difference here is that the data matched in the group is not added to the collection of sub-matches in the match object.

    Taking our previous example and making the first group non-capturing

    /(?:Hello)\x20(world)/


    Where before we had two groups, here we only have one. Group 1 contains the string “world”. Now this example is not a very practical use of a non-capturing group. Typically you'd use them in more complex regexes that need grouping but where you really don't care about the sub-matches.


    One more quick thing about capturing groups: basically the position of each left parenthesis gives the index of the sub-match in the groups collection. So if you have nested parentheses, count every (plain) left parenthesis to know which index to use to reference it. Some of the advanced grouping constructs and regex options can affect the ordering, but if you are using them hopefully you've read up on their effects, so I won't go over that here.


    /((Hello)|(Goodbye Cruel))\x20(world)/


    The above regex has 4 capturing groups (not counting group 0). Can you find them? Now it should match either the string “Hello world” or “Goodbye Cruel world”. I want to point out that not all the groups will participate in the match, but they are still part of the groups collection. There will always be 4 groups; one will just always be empty. Which one depends on which string was matched.

    If “Hello world” was matched the groups are

    1. Hello

    2. Hello

    3. (empty)

    4. world


    If “Goodbye Cruel world” was matched the groups are

    1. Goodbye Cruel

    2. (empty)

    3. Goodbye Cruel

    4. world


    In both cases group 0 would be the full match.
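
    You can check the two listings with a few lines of Python. One difference to be aware of: Python reports a group that didn't participate as None rather than as an empty string. NESTED and groups_for are names made up for the sketch.

```python
import re

NESTED = re.compile(r"((Hello)|(Goodbye Cruel))\x20(world)")

def groups_for(text):
    """Return groups 1 to 4 for a full match of text, or None if it doesn't match."""
    match = NESTED.fullmatch(text)
    return match.groups() if match else None

print(groups_for("Hello world"))          # ('Hello', 'Hello', None, 'world')
print(groups_for("Goodbye Cruel world"))  # ('Goodbye Cruel', None, 'Goodbye Cruel', 'world')
```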


    If you noticed, in both cases two groups contain the same value. Even if you need to know whether “Hello” or “Goodbye Cruel” was matched, you certainly don't need to know it twice. Plus the inner parentheses have different indexes you'd have to check if you wanted to use those. This is where you'd use non-capturing groups to simplify your groups collection.


    /((?:Hello)|(?:Goodbye Cruel))\x20(world)/


    Now we are back down to two groups. Group 1 contains either “Hello” or “Goodbye Cruel” depending on which string was matched. Group 2 always contains “world”
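
    The same check for the simplified pattern, again sketched in Python with made-up names:

```python
import re

SIMPLE = re.compile(r"((?:Hello)|(?:Goodbye Cruel))\x20(world)")

def simple_groups(text):
    """Return the captured groups for a full match of text, or None if no match."""
    match = SIMPLE.fullmatch(text)
    return match.groups() if match else None

print(simple_groups("Hello world"))          # ('Hello', 'world')
print(simple_groups("Goodbye Cruel world"))  # ('Goodbye Cruel', 'world')
```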


    However, keep in mind that in some cases you'll want to use the inner index to determine which group was matched. So using non-capturing groups isn't necessarily better; it just depends on whether you need to access those groups or not. But if you are not doing anything with them, don't capture them.



    These are just two of the basic grouping constructs, and they are generally supported across implementations of regex, but not always. Where they are, you can use them to easily dissect larger matches.



  • Named Groups to the Rescue

    I was asked to modify some text that had been built incorrectly. Basically, insert some text at a certain point. First I use a regex to find the text, then insert the new value within that match.  Now since the inserted value goes inside the matched text I simply wanted to use backreferences and the replace method.   Simple, right?  Well, not so much.

    Now the text is in a field of various rows of a database table, and the text to be inserted comes from another field in the same row and is an alphanumeric value.  The inserted text is dynamic, so I can’t simply hard code the replacement text; it has to be built for each row.  The text to be modified is a certain attribute somewhere in the text.  For this example let’s say it’s “id=xyz”, which is constant for all records.  The new text will be inserted right after the equals sign.

    So for              

    source  =“ {some stuff} id=xyz {more stuff}”
    newText = “ab1”

     
    you get

     
    “{some stuff} id=ab1xyz {more stuff}”

     

    Simple enough.  You use this regex  \bid=xyz\b to match the text. Then split it into groups so you can use backreferences in the replace.  So your final regex looks like this:

     \b(id=)(xyz)\b

     
    Now group 1 contains the text up to your insertion point (id=)

    And group 2 contains the text after your insertion point (xyz)

    So the replacement string for your regex is “$1(new data goes here)$2”, where (new data goes here) = some alphanumeric value pulled from a second field in the row.

     
    Doing this in .Net, my code looked something like this pseudo-code:

    Regex regexFind = new Regex(@"\b(id=)(xyz)\b");

    Get Records

    For each row

                fieldA = rowFieldA   (source text)

                fieldB = rowFieldB   (insert value)

                fieldA = regexFind.Replace(fieldA, String.Format("$1{0}$2", fieldB))

    Next

    The Format method of the String class creates a replacement string for each row.

    Looks good? It works fine for our example, but there is a problem.

    For our example value of “ab1”, String.Format produces “$1ab1$2”, which is exactly what we want. But this field is alphanumeric, so it could begin with a digit, and that causes a problem. Say the value to be inserted for the next record is “12a”: the Format method produces a replacement string of “$112a$2”. Syntactically it’s fine, but it’s not what we want. Instead of inserting text between group 1 and group 2, it references group 112, followed by “a” and group 2. As there is no group 112, the engine treats $112 as literal text, so “id=xyz” ends up replaced with “$112axyz”.
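    The same kind of collision can be reproduced outside .Net. The sketch below uses Python's re module purely as an illustration. Python writes backreferences as \1 rather than $1, and it reacts to a bad reference differently from .Net (it may raise an error or read the digits as something else entirely), so only the general hazard carries over. The naive_insert helper is hypothetical.

```python
import re

PATTERN = re.compile(r"\b(id=)(xyz)\b")
SOURCE = "{some stuff} id=xyz {more stuff}"
WANTED_12A = "{some stuff} id=12axyz {more stuff}"


def naive_insert(source, new_text):
    # Build the template by plain concatenation: group 1, value, group 2.
    template = "\\1" + new_text + "\\2"
    try:
        return PATTERN.sub(template, source)
    except re.error:
        # The digits merged into a reference to a group that doesn't exist.
        return None


print(naive_insert(SOURCE, "ab1"))                # the intended result
print(naive_insert(SOURCE, "12a") == WANTED_12A)  # False: "\112" is not "\1" + "12"
print(naive_insert(SOURCE, "9zz"))                # None: "\19" names a missing group
```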

    OK, this is where named groups become handy (necessary?). If you use named groups in your regex and replacement string you can avoid this problem.

    Change the regex to \b(?<att>id=)(?<val>xyz)\b

    and your replacement string to “${att}(new data goes here)${val}”.

    Now if you are using the String.Format method there is one more hoop to jump through. Because the regex engine and the Format method both use curly braces, there is a conflict and the Format method will complain, so you have to write it like this:

    String.Format("${0}att{1}{2}${0}val{1}","{","}",newValue)

    To get the desired replacement string.
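    For comparison, here is a rough Python equivalent of the named-group fix. Python spells the group (?P<att>...) and the reference \g<att> instead of (?<att>...) and ${att}, and its templates don't clash with format braces, so this only mirrors the idea rather than the .Net code. The insert_value helper and its sample data are assumptions for the example.

```python
import re

PATTERN = re.compile(r"\b(?P<att>id=)(?P<val>xyz)\b")


def insert_value(source, new_text):
    # \g<att> is delimited by angle brackets, so digits that follow it
    # cannot be read as part of the group reference.
    # Doubling backslashes keeps new_text literal inside the template.
    safe_text = new_text.replace("\\", "\\\\")
    template = r"\g<att>" + safe_text + r"\g<val>"
    return PATTERN.sub(template, source)


source = "{some stuff} id=xyz {more stuff}"
print(insert_value(source, "ab1"))
print(insert_value(source, "12a"))
```

    In Python, passing a function as the replacement would avoid template parsing altogether; the template form is kept here because it is closest to the replacement string used in this post.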
     

    When I started writing this I thought this was the only way to get it to work, which would mean you could only solve this with a regex engine, like .Net, that supports named groups. But I’ve since thought of a second way.

    Still, whenever a group in your replacement string can be followed by a digit, you may want to consider using named groups to avoid surprises.

     


  • Making your regex code ready.

    There are times when a regular expression you’ve written, or someone has written for you, needs a little tweaking before you add it to your code, because the syntax of the language conflicts with your regex. For example, part of your regex pattern contains a double quote and the language you are using uses double quotes as string delimiters. If you just cut and paste the pattern into your code, the pattern’s quotation mark will terminate your string prematurely. The usual way to fix it is to escape the quotation mark in the pattern. This solution requires altering the regex, and how the character is escaped depends on the language being used. The regex itself allows you to escape a character with the \ character, but the language being used may or may not recognize that as an escape character in its own syntax. It may also be confusing later, when you look at the regex and can’t remember why you escaped a character that the pattern itself doesn’t need escaped. But there is another way: hex values.

                                                                                                                                 

    Most regex implementations support a hex syntax \x##,  where # is a hex digit.

    So if you use \x22 instead of a double quote and \x27 instead of a single quote, your regexes become more cut-and-paste ready.

     

    Another useful hex value is \x20, which is a space. This is especially useful in .Net, where there is a regex option to ignore whitespace in the pattern (RegexOptions.IgnorePatternWhitespace). Turning this option on allows end-of-line comments in the regex but, except inside a character class, ignores spaces typed within the pattern, which is problematic if a space is part of the pattern to match. So you could break a working regex if you later decide to add this option. This happened on the Regexlib when the option was first turned on: a lot of patterns that were written before the switch was flipped suddenly stopped working.
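    Here is a small sketch of both points using Python's re module, chosen only for illustration. Its re.VERBOSE flag plays roughly the role that the ignore-whitespace option plays in .Net. The patterns and sample strings are made up.

```python
import re

# \x22 stands in for the double quote, so the pattern can sit inside
# any kind of string literal without extra escaping.
QUOTED = re.compile(r"\x22[^\x22]*\x22")
print(QUOTED.findall('say "hello" and "bye"'))

# With whitespace ignored, a typed space silently drops out of the pattern...
typed_space = re.compile(r"New York", re.VERBOSE)
# ...while \x20 still means "one space".
hex_space = re.compile(r"New\x20York", re.VERBOSE)

print(typed_space.search("New York"))            # no match: the space was ignored
print(hex_space.search("New York") is not None)  # the hex space still matches
```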

     

    Speaking of .Net, when it comes to named groups you can’t use the hex notation for the quotes when defining the group name with the single quote syntax, (?'name'...). However, you can avoid any issue with single quotes by using the alternate syntax, (?<name>...).

  • Word Break.

    The list-detail design is very commonly used with web pages, where you have a list of links that lead to more detailed information on each entry. Sometimes the text in the list is simply a snippet of a much longer string of text in the detail. A common way to handle this is to use a string function to return the first n characters of the string and display that in the list. The problem is that this tends to make the break right in the middle of a word, which isn’t a major problem but can be aesthetically displeasing, or may accidentally form another word you didn’t mean to put on your site. When facing this issue I came up with a simple regex that lets me break on whole words.

    ^(?:[ -~]{n,m}(?:$|(?:[\w!?.])\s))

    where n = the minimum number of characters to match and m = the maximum number of characters to allow in the match.

    In this instance I’m considering a word to be one or more ASCII non-whitespace characters. The way it works is that after matching between n and m ASCII characters it tries to match either the end of the string, or a word character or sentence-ending punctuation followed by a whitespace. So it will accept as many characters, including whitespace, as it can up to m and still satisfy the rest of the match. Otherwise it backtracks until the regex is satisfied. So if you wanted a minimum of 2 characters and a maximum of 75, the regex would be

    ^(?:[ -~]{2,75}(?:$|(?:[\w!?.])\s))

    and if you applied it to the Gettysburg Address

    “Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal. …”

    (only the 1st paragraph is shown for the example, but you could apply it to the full text), taking the first match you get

    “Four score and seven years ago, our fathers brought forth upon this ”

    There are a few problems with the regex that can be improved. First off, it only accepts basic ASCII displayable characters, decimal 32 to 126, which means the text must be in that range. I did it this way because it gives you the US alphabet, digits and commonly used symbols and punctuation, which was all I needed at the time. Other characters would need to be added. Also, if the character count of the first word exceeds your maximum length, no match will be found. You can make this regex a little more dynamic by putting it inside a function that takes your string and the max and min values as input.
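    A function along those lines might look like the following sketch. It is written in Python for illustration rather than in the language I used at the time, and the break_on_word name and its None return for "no match" are assumptions. Note that in Python \w also accepts non-ASCII letters, which makes no difference here because the leading class is still limited to the ASCII range.

```python
import re


def break_on_word(text, min_len, max_len):
    """Return a leading snippet of text that ends on a whole word, or None."""
    # {{ and }} are literal braces in an f-string, giving {min_len,max_len}.
    pattern = rf"^(?:[ -~]{{{min_len},{max_len}}}(?:$|(?:[\w!?.])\s))"
    match = re.match(pattern, text)
    return match.group(0) if match else None


address = (
    "Four score and seven years ago, our fathers brought forth upon this "
    "continent a new nation: conceived in liberty, and dedicated to the "
    "proposition that all men are created equal."
)
print(repr(break_on_word(address, 2, 75)))
```

    A caller that always needs something to display could fall back to a plain cut at max_len when the function returns None, which covers the over-long first word case.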
  • Making Dynamic XHTML pages valid with a Regex

    I have a website that I use primarily to expand and sharpen my web development skills. The latest effort in that regard has been writing valid markup. Although the site has some usable function, I use it mainly for practice. Consequently I don’t work on it full time. Instead I work on it in spurts, with different editors, applying new scripts, new ideas and such at different times, so the output is far from consistent. So I went to a lot of effort to first make my HTML valid, then later convert that HTML to XHTML and make that valid. However, since the content of most of the pages is dynamically generated, it didn’t occur to me right away when validating the XHTML that something in the content could make my page fail validation in certain situations.

    Case in point: ampersands. According to the W3C (http://www.w3.org/TR/xhtml1/#C_12) you should not have solitary ampersand characters in your markup. Ampersands declare the beginning of an entity reference. If you want to display the ampersand character you should use the &amp; entity reference instead of (&). Keep in mind most browsers let you get away with incorrect usage, so I didn’t have to do anything to get my page to render. But seeing as the whole point of writing valid markup is not to depend on the browser fixing my sloppy code, I didn’t want to just let this go.

    The first time I encountered this problem on a page the fix was pretty simple. I was working with ASP, so I simply wrapped ASP’s Server.HTMLEncode method around the variables containing the database-derived content. Simple. Problem solved. Ummm, not quite. The problem occurred on another page, inside a dropdown list. My first thought, to use the HTMLEncode method again, was a horrible failure. Unlike my previous fix, where I was pulling one row’s column (let’s say First Name) out of a database and sticking it in a single variable for later use, all the markup for the dropdown was being created in a separate function using ADO’s GetString method. So the variable containing the dropdown contained the full XHTML markup (the select and option tags). HTMLEncode turns the less-than and greater-than signs of the tags into entities, which in turn basically displays your intended markup source code on the page.

     

    Again, the intent is to turn the solitary ampersands (&) into the &amp; entity. So I decided to use a regex to fix this problem. I modified this regex, http://www.regexlib.com/REDetails.aspx?regexp_id=626, which matches entities, into this:

                    

     &(?!(?i:\#((x([\dA-F]){1,5})|(104857[0-5]|10485[0-6]\d|1048[0-4]\d\d|104[0-7]\d{3}|10[0-3]\d{4}|0?\d{1,6}))|([A-Za-z\d.]{2,31}));)

     

    This matches ampersands that are not part of an entity. So using the replace pattern “&amp;” I can replace just the solitary ampersands.

     

    Now some of you might think this is overkill, since I could use VBScript’s Replace function to get the result I want. And in this very specific case that is true; there were only a handful of ampersands to replace. However, this issue came up in other places where a simple replace could not be easily implemented. The truth of the matter is that I didn’t come up with this regex for this problem. It was originally designed for dealing with XML files that contained solitary ampersands AND entity references. The VB or VBScript Replace won’t work in that case. For example, take the following line:

     

    <STATEMENT>(1 &lt; 2) & (1 &lt; 4) are both true </STATEMENT>

     

     

    Using HTMLEncode you get

     &lt;STATEMENT&gt;(1 &amp;lt; 2) &amp; (1 &amp;lt; 4) are both true &lt;/STATEMENT&gt;

     

    Using the VB replace function  you get

    <STATEMENT>(1 &amp;lt; 2) &amp; (1 &amp;lt; 4) are both true </STATEMENT>

     

    Neither being what you want.

     

    So the regex replace allows your content to contain entities without messing them up, and you get the desired output:

     

    <STATEMENT>(1 &lt; 2) &amp; (1 &lt; 4) are both true </STATEMENT>
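    The three results can be reproduced side by side with a short sketch. It is in Python rather than ASP/VBScript, so html.escape and str.replace are only stand-ins for Server.HTMLEncode and the VB Replace function; the regex itself is the one given earlier in this post, split across two string literals for readability.

```python
import html
import re

# The pattern from this post: an ampersand that is not followed by a
# numeric or named entity body and a semicolon.
LONE_AMPERSAND = re.compile(
    r"&(?!(?i:\#((x([\dA-F]){1,5})|(104857[0-5]|10485[0-6]\d|1048[0-4]\d\d"
    r"|104[0-7]\d{3}|10[0-3]\d{4}|0?\d{1,6}))|([A-Za-z\d.]{2,31}));)"
)

line = "<STATEMENT>(1 &lt; 2) & (1 &lt; 4) are both true </STATEMENT>"

encode_everything = html.escape(line, quote=False)  # stand-in for HTMLEncode
replace_all = line.replace("&", "&amp;")            # stand-in for VB Replace
regex_fix = LONE_AMPERSAND.sub("&amp;", line)       # only the lone ampersand

print(encode_everything)
print(replace_all)
print(regex_fix)
```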

  • Higher plane solution

    I’ve come up with a workaround to the Unicode plane bug I talked about earlier.

     

    Originally I was trying to identify Unicode values outside of the ASCII range (code points higher than 127) for a replace. However, the problem was that characters in any plane other than plane zero returned two matches per character, meaning I could get two incorrect replacements for those characters. As it turned out, the replacement would just not take place, which was still a problem.

     

    The following two regexes allowed me to get around this behavior.

                Plane 0 excluding standard ASCII pattern  (?![\uD800-\uDBFF])(?![\uDC00-\uDFFF])[\u0080-\uFFFF]

                Non-Plane 0 pattern [\uD800-\uDBFF][\uDC00-\uDFFF]

     

    The ranges D800-DBFF and DC00-DFFF are the high and low surrogate values, respectively. In UTF-16 encoding, surrogate code points are the values of the two 16-bit code units that make up a supplementary character.
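    To make the surrogate ranges concrete, the sketch below splits a supplementary code point into its two UTF-16 code units using the standard UTF-16 arithmetic. It is written in Python only for illustration (Python strings are not UTF-16, so the two patterns above would not be needed there), and the surrogate_pair helper is a made-up name.

```python
def surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary (non-plane 0) code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits, lands in D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits, lands in DC00-DFFF
    return high, low


# U+10308 is a plane 1 code point, used here as an example.
high, low = surrogate_pair(0x10308)
print(f"{high:04X} {low:04X}")  # prints "D800 DF08"
```

    An engine that works on 16-bit code units sees those two values as two separate characters, which is why the non-plane 0 pattern has to match a high surrogate followed by a low surrogate.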

     

    You may notice that in the plane 0 pattern the two surrogate ranges are in separate lookaheads. You might think that they could be combined into one lookahead, but this won’t work. I’m not exactly sure why, but doing so will get you a match on the low surrogate of a non-plane 0 character.

     

    I've posted these patterns to the Regexlib. Unfortunately most of the characters used in the examples can't be displayed on the site. The sample characters I used are

    Plane 0 : ☻

    Plane 1:  𐌈, 𐍈 and 𐐏

  • Flavor of the month

    It seems most regex engines based their implementation on what had been done in Perl, with some implementing a little more, others a little less. This is what generally keeps regexes from being portable across different implementations. It often leads to confusion and syntax errors when you take a canned regex that was designed or tested in a different environment than the one it will be used in. I’ve talked before about knowing the syntax of the implementation the regex will be used in, to save yourself the trouble of trying to fix regexes that aren’t incorrect but were just written for a different engine.

     

     

    There is a new testing tool on the Regexlib. It’s pretty neat and a great improvement over the previous version. However, it is pretty much targeted at Microsoft regex engines. Realizing that not everyone is working in that environment, and more importantly to give myself (and others) a chance to play around with regular expressions using syntax not currently supported correctly, or at all, by those engines, I’ve found several online regex testers for a variety of languages and platforms.

     

    Environment/Language  Site

     

    Please refer to each site’s documentation (if there is any) on which features and advanced syntax are supported. But here is a quick highlight of something each site supports that the others may or may not.

     

    .Net = Named Groups and Look-Behinds

    ORO = POSIX syntax

    java.util.regex = Character Class Subtraction, Possessive quantifiers

     
