
invalid characters in xml tags

Last post 02-24-2008, 8:53 PM by ddrudik. 1 replies.
  •  02-24-2008, 6:51 PM 39835

    invalid characters in xml tags

    I'm looking for a way to strip out invalid characters in XML tags. Ideally, they shouldn't be put in there in the first place; however, the program that outputs the XML is dumb and basically just dumps text in between < and > characters. So, for example, I need to change

    <elem/en/t>Elements / are / good</elem/en/t>

    into

    <element>Elements / are / good</element>

    In other words, strip out every / inside a tag except the leading / of an end tag (and the end tag doesn't necessarily have to be on the same line as the start tag).

    I'm reading the file line by line in Perl, so I need to assume that a tag name never starts with a /; otherwise there would be no way to tell a closing tag from an invalid one.

    I've been playing around with things like

     s#\(<[^/]*)/([^/]*)>#$1$2#g;

  •  02-24-2008, 8:53 PM 39836 in reply to 39835

    Re: invalid characters in xml tags

    I can't speak to Perl syntax, but PHP allows for a function callback such as:

    <?php
    $string="<elem/en/t>Elements / are / good</elem/en/t>";
    function doreplace($matches) {
      return '<'.$matches[1].preg_replace('/[^A-Za-z]/','',$matches[2]).'>';
    }
    $string=preg_replace_callback('~<(/?)([^>]*)>~','doreplace',$string);
    echo htmlentities($string);
    ?>

    The above assumes the "invalid" characters are anything other than [A-Za-z]. If it's just / characters that need to be removed:

    <?php
    $string="<elem/en/t>Elements / are / good</elem/en/t>";
    function doreplace($matches) {
      return '<'.$matches[1].preg_replace('~/~','',$matches[2]).'>';
    }
    $string=preg_replace_callback('~<(/?)([^>]*)>~','doreplace',$string);
    echo htmlentities($string);
    ?>

    A method not using additional capture groups:

    <?php
    $string="<elem/en/t>Elements / are / good</elem/en/t>";
    function doreplace($matches) {
      return preg_replace('~(?<!<)/~','',$matches[0]);
    }
    $string=preg_replace_callback('~<[^>]*>~','doreplace',$string);
    echo htmlentities($string);
    ?>
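
    For the Perl side of the question, the same callback idea can be sketched with the /e modifier of s///, which evaluates the replacement as code. This is only a rough sketch: the helper names clean_tag and strip_tag_slashes are made up for illustration, and it assumes each tag sits entirely on one line and that self-closing tags such as <br/> do not occur (their trailing / would be stripped as well).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): rebuild one tag, keeping an
# optional leading "/" and deleting every other "/" from the tag body.
sub clean_tag {
    my ($close, $body) = @_;
    $body =~ tr{/}{}d;          # delete all "/" characters in the body
    return "<$close$body>";
}

# Hypothetical helper: apply clean_tag to every <...> span in a line.
sub strip_tag_slashes {
    my ($line) = @_;
    # $1 = optional leading "/", $2 = everything else up to the next ">"
    $line =~ s{<(/?)([^>]*)>}{clean_tag($1, $2)}ge;
    return $line;
}

print strip_tag_slashes('<elem/en/t>Elements / are / good</elem/en/t>'), "\n";
```

    When reading a file per line, the same function could be called on each line inside the read loop before printing it. Slashes in the text between tags are left alone because the pattern only matches from < to the next >.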

