Multiple Line Search Criteria for splitting large files

Last post 02-19-2010, 5:22 PM by KathleenA. 6 replies.
  •  02-17-2010, 5:52 PM 59790

    Multiple Line Search Criteria for splitting large files

    I am using the AutoSplit™ / AutoSplit Pro™ by EverMap to split PDF documents, i.e., a large report into individual reports.

    It uses Regex, or regular expressions for more advanced splits.

    The program DOESN'T MATTER, it's the Regular Expression line that I need.

    I need to split a large report by VERSION ID (report version) and CASE ID (the client). The Version ID is on one line, left justified (time stamp, unneeded on right), and the Case ID is on the second line, left justified.

    VERSION ID: ASB                                                TIME : 06:33:38
    CASE ID     : C12345              TITLE

    A version can have many cases, and / or a case can have several versions. I need a Regex Expression to split the reports out and so far what I have attempted has not worked.

    By themselves, VERSION ID: \w{3} and

    CASE ID :\s+\w{1}\d{5} works,

    but I need to define something that uses both criteria.   It has to include multiple lines, find the IDs at the left and ignore the time stamp and titles at the right.

    Thank you, Kathleen

  •  02-17-2010, 7:27 PM 59791 in reply to 59790

    Re: Multiple Line Search Criteria for splitting large files

    Kathleen,

    Actually it DOES matter which program you use, because there are many different regex variants around, each with differing capabilities and (in some cases) syntax. I've managed to find the following on the 'net: http://www.evermap.com/UserGuides/AutoSplit.htm#SplitByContent but it is not really clear on some items. In particular it talks about the '^' and '$' anchors matching the beginning and ending (respectively) of a string but doesn't really say what a "string" is; I'm assuming that it may well be a complete page of text, as that seems to be one of the main units for splitting the PDF file.

    Following on from this, I can't see whether the usual line-terminator characters ('\r', more commonly '\n', and sometimes both) are available.

    Therefore this may or may not work:

    VERSION ID: \w{3}.*?\r?\nCASE ID :\s+\w\d{5}

    By the way, I've taken out the '{1}' quantifier after the '\w{1}' part as this is the default quantifier value - specifying it (in my opinion) just takes away from the overall readability of the pattern.
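
    To make the pattern above concrete, here is a minimal sketch in Python's `re` module. This is an assumption for illustration only: AutoSplit's regex engine may behave differently, and the sample page text (including the single space before the colon in "CASE ID :") is made up to resemble the report header quoted earlier in the thread.

```python
import re

# Hypothetical page text modeled on the header quoted in the thread.
# Assumed: Windows-style "\r\n" line endings and "CASE ID :" with one space.
page = (
    "VERSION ID: ASB                    TIME : 06:33:38\r\n"
    "CASE ID : C12345              TITLE\r\n"
)

# ".*?" lazily skips the rest of the first line (the time stamp);
# "\r?\n" then crosses the line break whether or not a "\r" is present.
pattern = re.compile(r"VERSION ID: \w{3}.*?\r?\nCASE ID :\s+\w\d{5}")

match = pattern.search(page)
matched_text = match.group(0) if match else None
print(repr(matched_text))

# "\w{1}" and "\w" are equivalent: both match exactly one word character.
with_quantifier = re.fullmatch(r"\w{1}\d{5}", "C12345")
without_quantifier = re.fullmatch(r"\w\d{5}", "C12345")
print(bool(with_quantifier), bool(without_quantifier))
```

    In Python the dot does not match "\n" by default, so the lazy ".*?" stays on the first line and the explicit "\r?\n" is what spans the two lines. Whether the same holds inside AutoSplit depends on its engine.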

    Susan

  •  02-18-2010, 2:58 PM 59846 in reply to 59791

    Re: Multiple Line Search Criteria for splitting large files

     

    Susan,

    Ah, that's good to know.  When I presented the issue previously, some folks got all fixated on the Adobe plug-in and disregarded the Regex part.

    It works!!!  Thankyouthankyouthankyou!   How can I say thanks?  chocolates, gift card,  I was really sweating over this!!  I really appreciate your research here.  I hacked out part of this, but regex is SO specific you really have to have a good familiarity with it to get all the little bits just so.

    Since we're on a roll here, you'll see from the user guide that you can then add in a second field, NAME SUFFIX [Add text from search].   Ideally we will end up with the CASE ID (underscore) VERSION ID.  Or, since this seems to be order-sensitive, second-best choice would be VERSION ID (underscore) CASE ID in the file name output.

    As in above, separately,  (?<=CASE ID :)\s+\w{1}\d{5}     works, and   (?<=VERSION ID:) \w{3}    works.

     Thank you again so much for your help.  I was getting a crash course in regex to fill a business need here !  =o)  Kathleen

  •  02-18-2010, 3:38 PM 59853 in reply to 59846

    Re: Multiple Line Search Criteria for splitting large files

    P.S.

    I pasted your search code into the Name Suffix field just to see the result.  It comes out like this:

    VERSION ID_ ASB TIME _ 06_33_42 _CASE ID _ C12345.pdf

    Thought that would be useful ...  I'd like the name to be C12345_ASB.pdf or ASB_C12345.pdf.

  •  02-18-2010, 5:39 PM 59861 in reply to 59853

    Re: Multiple Line Search Criteria for splitting large files

    P.P.S.

    I am wondering if there is a way to exclude the junk pages that are sometimes between the reports.  Most likely that will not be a function of the regex, I'm thinking.  When the indicator is simple, taken from a fixed field, they are left out, because the field is blank on the junk separator pages.  Still, I thought I'd put it out there ...
  •  02-18-2010, 5:44 PM 59862 in reply to 59853

    Re: Multiple Line Search Criteria for splitting large files

    I'm now back in an area where I have no real understanding of the product you are using, but within a "normal" regex pattern you can add capture groups that give you access to various parts of the string matched by the pattern. Therefore something like:

    VERSION ID: (\w{3}).*?\r?\nCASE ID :\s+(\w\d{5})

    (Note the parentheses around the '\w{3}' and '\w\d{5}' parts) will return the characters of the version ID and the case id respectively.

    Now, how you can access the captured text and use it to build up the string you are after is not something I can tell at the moment. In "regex speak" we use terms such as "match group number" or "back reference" to refer to this captured text, with such symbols as "\3" or "$3" to refer to the text captured by match group 3. You may find such terminology within the application documentation.

    As a related aside, regex functions typically return either the text of the complete match or a data structure that indicates where the captured text of each match group starts and ends (or how long it is). When regexes are buried within layers of software, it is really up to the creators of those layers what additional information is made available, and how. When we number the match groups, we start at 1 and work our way up the number sequence. However, there is also a "match group 0" that represents the text matched by the entire pattern. This looks like the text that you are seeing from the "Name Suffix" field.
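
    As an illustration of match groups 0, 1 and 2, here is a small sketch using Python's `re` module. It is an assumption for demonstration only: it says nothing about how AutoSplit exposes captured text, and the sample page text and the `file_name` format are hypothetical, chosen to produce the C12345_ASB.pdf name asked for earlier in the thread.

```python
import re

# Hypothetical page text; spacing is assumed to suit the pattern below.
page = (
    "VERSION ID: ASB                    TIME : 06:33:38\n"
    "CASE ID : C12345              TITLE\n"
)

# Same pattern as above, with parentheses around the two IDs.
pattern = re.compile(r"VERSION ID: (\w{3}).*?\r?\nCASE ID :\s+(\w\d{5})")

m = pattern.search(page)
if m is None:
    raise SystemExit("pattern did not match the sample page")

whole_match = m.group(0)   # group 0: everything the pattern matched
version_id = m.group(1)    # group 1: first pair of parentheses
case_id = m.group(2)       # group 2: second pair of parentheses

# Build the desired name from the captured pieces only.
file_name = f"{case_id}_{version_id}.pdf"
print(file_name)
```

    Group 0 still contains the time stamp and the labels, which is why using the whole match as a file name gives the long, cluttered result, while groups 1 and 2 hold only the IDs.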

    Susan

    PS - I make my contributions to this forum freely and I certainly do not want or expect any reward - but thank you for the offer.

     

  •  02-19-2010, 5:22 PM 59888 in reply to 59862

    Re: Multiple Line Search Criteria for splitting large files

    Many thanks, Susan

    Your input was very valuable and got me around this obstacle I was facing.   Very much appreciated.  We are now outputting reports digitally via secure email from our system.

    I hadn't dealt with Regex before.  Very interesting logic, and a good skill to have in one's toolbox. 

    Along with your regex, I was able to resolve my file naming issue using the Adobe Batch Processing feature and processing (splitting) the files step by step, from high level to detail level.  Another solution we still may end up using for some files, is outputting the entire capture name and putting it through a BAT DOS rename process, which would trim out the unneeded elements.  From there, the automated email process picks up the client attachments and sends it on its way.  It WORKS.  Yea!!!!
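
    For readers who want to try the rename idea, here is a rough sketch of the trimming step in Python rather than a BAT file. This is an assumption, not the process described above: the function name `trim_name` is hypothetical, and the input format is taken from the sample file name posted earlier in the thread.

```python
import re
from typing import Optional

# Matches names like the one quoted in the thread:
#   "VERSION ID_ ASB TIME _ 06_33_42 _CASE ID _ C12345.pdf"
# Group 1 is the version ID, group 2 is the case ID.
RAW_NAME = re.compile(
    r"VERSION ID_\s*(\w{3})\b.*?CASE ID\s*_\s*(\w\d{5})\.pdf$"
)

def trim_name(raw_name: str) -> Optional[str]:
    """Return 'CASEID_VERSIONID.pdf', or None if the name does not fit."""
    m = RAW_NAME.search(raw_name)
    if m is None:
        return None  # leave unexpected names for manual review
    version_id, case_id = m.group(1), m.group(2)
    return f"{case_id}_{version_id}.pdf"

trimmed = trim_name("VERSION ID_ ASB TIME _ 06_33_42 _CASE ID _ C12345.pdf")
print(trimmed)
```

    Returning None for names that do not fit the expected shape is a deliberate choice here, so that files from junk separator pages are not renamed by accident.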

    Take care,

    Kathleen
