I'd rather have a regex to split SQL statements with, but this generates problems with not splitting quoted strings as well. You would need a variable length lookbehind assertion. I'll try to match non whitespace characters instead.
If you also need to have operators like "<=" and ">=" and numbers in any format (not in your example) as one token, you can use
(?:'(?:[^']|'')*'|-?\d+(?:\.\d+)?(?:[eE]-?\d+)?|\w+|[<>=]{2}|\S)
This regex will recognize SQL escaped quotes (two single quotes) within a single quoted string and will create one token. As I don't know the SQL spec by heart some multichar non alphabetic operators with characters other than "<", ">" and "=" might be a problem. I can't think of any off the top of my head right now though. The minus character as sign for negative numbers might be a problem if you want to use the minus character as an operator. In this case I would advise separating minus sign and number with whitespace.
A little explanation. There are five alternatives in this (non capturing) regex:
- match a quoted string
- match a number with optional decimal fraction and optional scientific notation
- match any word with word characters (you might want to use a character class like [a-zA-Z0-9_] instead, the regex does not validate SQL, only tokenize it)
- match any two character combination of <,> and = (<< and >> are bit operations, == is not valid SQL, again: no validation)
- match any single non space character not matched until now (these will be non alphabetic operators)
Edit:
Had a little bit of fun looking at the SQL 92 specs. This pattern:
(?:(['"])(?:\\\\|\\\1|(?!\1).|\1\1)*\1|(?:(?<!\d)-)?\d+(?:\.\d+)?(?:[eE]-?\d+)?|\w+|[<>=|.]{2}|\S)
should also work for:
- quoted identifiers ("P.NAME" in double quotes will not be split!),
- || and .. operators and
- backslash escaped quotes (those are not part of the spec but some databases allow them).
And finally this pattern:
(?:(['"])(?:\\\\|\\\1|(?!\1).|\1\1)*\1|(?:(?<!\d)-)?\d+(?:\.\d+)?(?:[eE]-?\d+)?|\.\.|(?:\w+\.)*\w+|[<>=|]{2}|\S)
should keep identifiers as one token regardless whether they are quoted or not (P.NAME and "P.NAME" will not be split).