Meta-regex for variable-length matching regexes (XSLT)

  •  10-02-2007, 9:37 AM

    In the context of the regex dialect of XSLT 2.0, I am looking for two
    "meta-regexes" that analyse other regexes. Each is discussed in a part
    of its own.

    (Part 2 posted as: Meta-regex to find fixed length of regex match (XSLT))

    Part 1

    Aim: A regex that would identify all those regexes that might yield
    matches of varying length, as opposed to regexes whose matches will
    always be of fixed length. (Considering only the case that the regex
    matches at all.)

    The idea, of course, is to look for metacharacters in constructs that
    allow repetition, alternation or other length variations, while
    excluding uses of metacharacters that merely serve escaping and do
    not change the length of the matches.

    The syntax of XSLT 2.0 regexes is described here:
    http://www.w3.org/TR/xpath-functions/#regex-syntax , which is based on
    the XML Schema Datatypes Spec at
    http://www.w3.org/TR/xmlschema-2/#regexs .

    XSLT 2.0 permits these metacharacters in regexes:

    . \ ? * + { } [ ] ( ) | ^ $

    (And "-" within Character Ranges and with Character Class
    Subtractions.)

    Among those, these metacharacters never influence the length of the
    match (parentheses are only used for grouping in XSLT; pattern anchors
    or character ranges by themselves cannot lead to variable lengths of
    the matches):

    . ( ) [ ] ^ $

    The following characters are used to denote quantifiers (and
    alternation) and thus most easily lead to matches of varying length:

    ? * + { } |

    They make up these quantifiers (first greedy ones, then reluctant ones):

    ?
    *
    +
    {n}
    {n,}
    {n,m}

    ??
    *?
    +?
    {n}?
    {n,}?
    {n,m}?

    However, non-escaped curly brackets may also be used in a Category
    Escape like "\p{L}", which designates a set of characters and always
    matches a one-character string.

    The backslash requires special thought. In most of its uses as an
    escaping metacharacter, it is just a notational device for
    single-character matches. However, XSLT 2.0 allows back-references
    like "\1", which can definitely influence the matched string. So if
    followed by at least one digit, the backslash indicates potential
    variable-length matches, too.

    Conclusion: An XSLT 2.0 regex is guaranteed to have only matches of
    fixed length if it does not contain:

    1) a non-escaped occurrence of a character from this set of characters:

    ? * + |

    2) curly brackets that are neither a) escaped nor b) part of a
       Category Escape

    3) an occurrence of a backslash followed by a digit

    The rationale is that only the non-escaped presence of one of those
    constructs enables matches with varying length in the XSLT 2.0 regex
    dialect.
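
    The three conditions can also be checked with a plain left-to-right
    scan instead of a regex. The following is only a sketch in Python,
    under assumptions: the function name may_vary_in_length is made up for
    illustration, and characters inside character classes such as "[?*]"
    are not treated specially, just as the conditions above do not treat
    them specially.

```python
DIGITS = set("0123456789")


def may_vary_in_length(regex: str) -> bool:
    """Return True if the regex contains a construct from conditions 1-3.

    Conservative: False means none of the constructs is present.
    """
    i = 0
    n = len(regex)
    while i < n:
        ch = regex[i]
        if ch == "\\":
            nxt = regex[i + 1] if i + 1 < n else ""
            if nxt in DIGITS:
                return True  # condition 3: back-reference such as \1
            if nxt in ("p", "P") and regex[i + 2:i + 3] == "{":
                close = regex.find("}", i + 3)
                if close != -1:
                    i = close + 1  # skip the whole category escape \p{...}
                    continue
            i += 2  # any other escape, e.g. \? or \{ or \\
            continue
        if ch in "?*+|":
            return True  # condition 1: unescaped quantifier or alternation
        if ch in "{}":
            return True  # condition 2: unescaped brace outside \p{...}
        i += 1
    return False


if __name__ == "__main__":
    for sample in ["ab?", r"ab\?", r"\p{L}", "a{2,3}", r"(.)\1"]:
        print(sample, may_vary_in_length(sample))
```

    Because escapes are consumed two characters at a time, an escaped
    backslash followed by a quantifier, as in "a\\?", is still recognised
    as a quantified atom.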

    So this regex should find all the regexes possibly matching strings of
    variable length (reluctant quantifiers are covered implicitly, since
    each of them starts with its greedy counterpart):

    [^\\][?*+{}|]|\\\d
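
    To see what this regex does and does not flag, it can be tried on a
    few sample regexes. The sketch below uses Python's re module as a
    stand-in for an XSLT 2.0 processor, on the assumption that both
    interpret this particular pattern the same way, since it uses only
    character classes, alternation and \d. The helper name flagged and the
    samples are illustrative.

```python
import re

# The meta-regex exactly as posted above.
META = re.compile(r"[^\\][?*+{}|]|\\\d")


def flagged(regex: str) -> bool:
    """True if the meta-regex finds a match anywhere in the given regex."""
    return META.search(regex) is not None


if __name__ == "__main__":
    samples = [
        "ab?",      # quantifier                   -> True
        r"ab\?",    # escaped question mark        -> False
        "a{2,3}",   # counted quantifier           -> True
        r"(.)\1",   # back-reference               -> True
        r"\p{L}",   # category escape              -> True
        "[?]",      # "?" inside a character class -> True
        r"a\\?",    # escaped backslash, then "?"  -> False
    ]
    for sample in samples:
        print(sample, flagged(sample))
```

    The last three samples are the interesting ones: "\p{L}" and "[?]" are
    flagged although their matches are always one character long, and
    "a\\?" is not flagged because its "?" directly follows a backslash,
    even though that backslash is itself escaped.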

    Is my reasoning correct? To what extent does the regex fulfill the
    requirements stated?

    Unfortunately, condition 2 b) is still unsatisfied. This is because I
    have difficulty specifying that the opening curly bracket should not
    match when it is preceded by the string "\p" or "\P". Is there any way
    to test for both preceding characters at once, given that neither
    lookbehind assertions nor conditional subpatterns are available in
    XSLT?
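
    One possible way around the missing lookbehind is to make the
    permitted preceding context part of the match itself: a "{" counts as
    a quantifier bracket if the character before it is neither a backslash
    nor "p"/"P", or if it is a "p"/"P" that is itself not preceded by a
    backslash. A sketch, tested here with Python's re module as a stand-in
    for an XSLT 2.0 processor (an assumption); the names QUANT_BRACE and
    has_quantifier_brace are illustrative.

```python
import re

# "{" preceded by: start of string, or a character other than \ p P,
# or p/P that is itself at the start or preceded by a non-backslash.
QUANT_BRACE = re.compile(r"(^|[^\\pP]|(^|[^\\])[pP])\{")


def has_quantifier_brace(regex: str) -> bool:
    """True if an opening brace is found outside \\{ and \\p{ / \\P{."""
    return QUANT_BRACE.search(regex) is not None


if __name__ == "__main__":
    for sample in ["a{2}", "ap{2}", r"\p{L}", r"\P{Lu}", r"a\{"]:
        print(sample, has_quantifier_brace(sample))
```

    This still looks at only two characters of context, so it misses a
    case like "\\p{2}", where the backslash in front of "p" is itself
    escaped and "{2}" really is a quantifier.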

    Requiring a preceding character other than a backslash should be OK,
    because all real quantifiers must have at least one character in
    front of them. ("|" need not, but it would make no sense to start a
    regex with it, so even if that is allowed in XSLT 2.0, I won't care
    about it.)

    BTW, if someone should come up with the complementary "meta-regex"
    that identifies only those regexes with fixed-length matches, I could
    do equally well with that one, by simply inverting the logic of my
    control flow.
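
    A candidate for such a complementary regex can be built by describing
    the whole input as a sequence of "harmless" tokens: a complete Category
    Escape, a backslash followed by a non-digit, or any single character
    that is neither a backslash nor one of ? * + { } |. This is a sketch
    under assumptions, not a verified XSLT solution: it is tried out with
    Python's re module, the names FIXED_ONLY and is_fixed_length are
    illustrative, and in XSLT the pattern would presumably be wrapped in
    "^" and "$" for use with matches().

```python
import re

# One token per iteration; the category escape is listed first so that
# its braces are consumed together with the leading \p or \P.
FIXED_ONLY = re.compile(r"(\\[pP]\{[^}]*\}|\\[^0-9]|[^\\?*+{}|])*")


def is_fixed_length(regex: str) -> bool:
    """True if the entire regex consists of harmless tokens only."""
    return FIXED_ONLY.fullmatch(regex) is not None


if __name__ == "__main__":
    for sample in ["abc", r"[a-z]\.", r"\p{L}\d", "a?", "a{2}", r"(.)\1"]:
        print(sample, is_fixed_length(sample))
```

    Since every backslash is consumed together with the character that
    follows it, an escaped backslash in front of "p{2}" or "?" no longer
    hides the quantifier. Like the conditions themselves, the pattern
    rejects any regex with an unescaped "?", "*", "+" or "|" inside a
    character class.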

      Regards,

       Yves