
Replacing blog post urls

Last post 08-19-2008, 7:43 PM by Aussie Susan. 3 replies.
  •  08-18-2008, 4:27 PM 45416

    Replacing blog post urls

    I migrated my old blog to WordPress. Unfortunately, all the old posts have URLs of the form:

    http://geoffjones.com/2006/05/ladbrokes-or-sadblokes.html/

    whereas on my new blog they take the form:

    http://geoffjones.com/2006/05/ladbrokes-or-sadblokes

    So I need to strip the endings (.html/ etc.) from all URLs that begin with http://geoffjones.com/2....

     

  •  08-18-2008, 7:43 PM 45418 in reply to 45416

    Re: Replacing blog post urls

    In the 'Construction Advice' companion forum to this one, there is normally a sticky note with the 'Posting Guidelines'; however, it seems to have disappeared for the moment. In there, we normally ask for the regex and programming language (if appropriate) because different versions have different capabilities - sadly, not all regex engines work in the same way and understand the same syntax.

    One of the other things we normally ask for is a reasonable sample of the real text you are dealing with. In this case, what matters is what immediately follows the URL, so that we can reliably find its ending. Handling just ".html/" could be straightforward (but what about the URL "http://geoffJones/2006/05/freds.html/realfile.png/"? That is perfectly legal), but when we have to handle ANY file type the problem is harder. Therefore I've assumed that:

    1) you are using a regex that allows for lookaheads
    2) the end of all URLs can be found by looking for a '/' that is NOT followed by an alphanumeric character

    Therefore, try:

    \b(http://geoffjones.com/2.*?)\.\w+/(?!\w)

    as the pattern and

    $1

    (or whatever the equivalent is in your language) as the replacement string. I've used the 'ignore case' option as well.
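As a minimal sketch of how this pattern and replacement behave, here is the same pair in Python. Python is an assumption (the thread does not name a language), and the function name `strip_ending` is purely illustrative. Python's `re` module writes the `$1` replacement as `\1`, and `re.IGNORECASE` stands in for the 'ignore case' option.

```python
import re

# Pattern copied verbatim from the post; IGNORECASE mirrors the 'ignore case' option.
PATTERN = re.compile(r"\b(http://geoffjones.com/2.*?)\.\w+/(?!\w)", re.IGNORECASE)


def strip_ending(text: str) -> str:
    """Replace each matched URL with capture group 1 (the URL minus its ending)."""
    # Python spells the "$1" replacement string as r"\1".
    return PATTERN.sub(r"\1", text)


# A bare URL, as in the original question.
old_url = "http://geoffjones.com/2006/05/ladbrokes-or-sadblokes.html/"
print(strip_ending(old_url))

# The same URL embedded in markup: the '/' is followed by a quote, not a word character.
link = '<a href="http://geoffjones.com/2006/05/ladbrokes-or-sadblokes.html/">post</a>'
print(strip_ending(link))
```

Other engines may need different escaping or replacement syntax, so this only shows that the pattern does what is described in an engine that supports lookaheads.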

    Hope this helps.

    Susan

    PS: if lookaheads are not allowed, you could try a pattern of "\b(http://geoffjones.com/2.*?)\.\w+/(\W)" and a replacement string of "$1$2" - will have problems if the URL ends at the end of the text and/or the end of a line, depending on the line/file terminators.
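The caveat in the PS can be seen in a short Python sketch (again, Python and the name `strip_ending_no_lookahead` are assumptions for illustration; `$1$2` becomes `\1\2`). Because `(\W)` must consume a real character, a URL that is the very last thing in the text is left untouched.

```python
import re

# Lookahead-free pattern from the PS: the character after the '/' is captured
# in group 2 and written back by the replacement.
FALLBACK = re.compile(r"\b(http://geoffjones.com/2.*?)\.\w+/(\W)", re.IGNORECASE)


def strip_ending_no_lookahead(text: str) -> str:
    # "$1$2" in the post corresponds to r"\1\2" in Python.
    return FALLBACK.sub(r"\1\2", text)


# URL followed by a quote: the quote is captured and restored.
quoted = 'href="http://geoffjones.com/2006/05/ladbrokes-or-sadblokes.html/" rel="x"'
print(strip_ending_no_lookahead(quoted))

# URL at the very end of the text: nothing follows the '/', so (\W) cannot match.
bare = "http://geoffjones.com/2006/05/ladbrokes-or-sadblokes.html/"
print(strip_ending_no_lookahead(bare))
```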

     

  •  08-19-2008, 3:40 AM 45420 in reply to 45418

    Re: Replacing blog post urls

    Susan

    Many thanks for your interest and for the expression. I am using this plugin: http://urbangiraffe.com/plugins/redirection/ which says "Supports both WordPress-based and Apache-based redirections"

    I tried your first expression but it didn't seem to work. Am I interpreting it correctly in that the replacement pattern is 'just' $1 ?

    regards

    Geoff

  •  08-19-2008, 7:43 PM 45454 in reply to 45420

    Re: Replacing blog post urls

    Geoff,

    Last question first: yes. The pattern is designed to capture all of the text up to but not including the file type into match group #1. The pattern itself will also match the file type. When a replacement is made, ALL of the matched text is removed and whatever is constructed from the replacement text is put in its place. By making the replacement text just whatever is captured in match group #1, what you are effectively doing is deleting the entire URL and replacing it with the entire URL minus the file type.
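To make the difference between the whole match and match group #1 concrete, here is a small sketch in Python (an assumed language; the variable names are illustrative only). It inspects a single match rather than performing the replacement.

```python
import re

pattern = re.compile(r"\b(http://geoffjones.com/2.*?)\.\w+/(?!\w)", re.IGNORECASE)
url = "http://geoffjones.com/2006/05/ladbrokes-or-sadblokes.html/"

match = pattern.search(url)
assert match is not None  # the pattern is expected to match this sample URL

whole_match = match.group(0)  # everything the pattern matched (this text gets removed)
group_one = match.group(1)    # what "$1" puts back in its place
removed = whole_match[len(group_one):]  # the part that effectively disappears

print("matched :", whole_match)
print("group 1 :", group_one)
print("dropped :", removed)
```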

    As for the first part, I've not got any real experience with URL replacement software and the syntax that is used. I've seen others replying in this and the companion fora who seem to know all about this web stuff (.htaccess and all that) and they may be able to help you in this instance. Also, without downloading the software and installing it myself, I couldn't quickly see any mention of the underlying regex engine that is being used by this software on the page that you referenced. However, looking at the examples on that page, provided you are applying all of the other syntax that may be needed, it should be close. One thing I can think of is that the underlying regex engine does not support lookaheads and may be quietly failing. Have you tried the second alternative I suggested? Also, you could try taking the '?' off the '.*?' part of the pattern in case the regex engine doesn't like non-greedy quantifiers - in this situation (where you are only examining a single URL at a time as opposed to a large block of text with URLs embedded in it) the pattern should still work with the greedy action of the remaining '.*' subpattern.
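The greedy versus non-greedy point can be checked with a short Python sketch (Python is an assumption, and the sample text with two URLs is invented for illustration). On a single URL both variants give the same result; on a line containing two URLs the greedy '.*' runs through to the last ending, so only the final URL is trimmed.

```python
import re

LAZY = re.compile(r"\b(http://geoffjones.com/2.*?)\.\w+/(?!\w)", re.IGNORECASE)
# Same pattern with the '?' removed from '.*?', as suggested above.
GREEDY = re.compile(r"\b(http://geoffjones.com/2.*)\.\w+/(?!\w)", re.IGNORECASE)

single = "http://geoffjones.com/2006/05/one.html/"
# Hypothetical block of text with two URLs on one line.
block = ("Old: http://geoffjones.com/2006/05/one.html/ "
         "and http://geoffjones.com/2007/01/two.html/ end")

lazy_single = LAZY.sub(r"\1", single)
greedy_single = GREEDY.sub(r"\1", single)
lazy_block = LAZY.sub(r"\1", block)
greedy_block = GREEDY.sub(r"\1", block)

print(lazy_single, greedy_single, sep="\n")
print(lazy_block, greedy_block, sep="\n")
```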

     Finally, it looks as though there is a support mechanism for the plugin. I can tell you that the pattern works in the 'Expresso' regex tester (which is .NET based) and so what you want to do is certainly possible. Perhaps you may need to use the plugin support mechanism to find out a bit more about the regex syntax and capabilities and then either come back here or use one of the regex testers on the web that are appropriate for the variety of regex in question.

    Susan
