Got more questions? Find advice on: ASP | SQL | XML | Windows
Welcome to RegexAdvice Sign in | Join | Help

Re: Expression for SQL parsing

  •  04-21-2007, 1:44 PM

    Re: Expression for SQL parsing

    I'd rather have a regex to split SQL statements with, but this generates problems with not splitting quoted strings as well. You would need a variable length lookbehind assertion. I'll try to match non whitespace characters instead.

    If you also need to have operators like "<=" and ">=" and numbers in any format (not in your example) as one token, you can use

    (?:'(?:[^']|'')*'|-?\d+(?:\.\d+)?(?:[eE]-?\d+)?|\w+|[<>=]{2}|\S) 

    This regex will recognize SQL escaped quotes (two single quotes) within a single quoted string and will create one token. As I don't know the SQL spec by heart some multichar non alphabetic operators with characters other than "<", ">" and "=" might be a problem. I can't think of any off the top of my head right now though. The minus character as sign for negative numbers might be a problem if you want to use the minus character as an operator. In this case I would advise separating minus sign and number with whitespace.

    A little explanation. There are five alternatives in this (non capturing) regex:

    • match a quoted string
    • match a number with optional decimal fraction and optional scientific notation
    • match any word with word characters (you might want to use a character class like [a-zA-Z0-9_] instead, the regex does not validate SQL, only tokenize it)
    • match any two character combination of  <,> and = (<< and >> are bit operations, == is not valid SQL, again: no validation)
    • match any single non space character not matched until now (these will be non alphabetic operators)

    Edit:

    Had a little bit of fun looking at the SQL 92 specs. This pattern:

    (?:(['"])(?:\\\\|\\\1|(?!\1).|\1\1)*\1|(?:(?<!\d)-)?\d+(?:\.\d+)?(?:[eE]-?\d+)?|\w+|[<>=|.]{2}|\S) 

    should also work for:

    • quoted identifiers ("P.NAME" in double quotes will not be split!),
    • || and .. operators and
    • backslash escaped quotes (those are not part of the spec but some databases allow them).

    And finally this pattern: 

    (?:(['"])(?:\\\\|\\\1|(?!\1).|\1\1)*\1|(?:(?<!\d)-)?\d+(?:\.\d+)?(?:[eE]-?\d+)?|\.\.|(?:\w+\.)*\w+|[<>=|]{2}|\S)
    

    should keep identifiers as one token regardless whether they are quoted or not (P.NAME and "P.NAME" will not be split).

View Complete Thread