Parse a text file in VB.NET 2005

Last post 01-19-2007, 12:08 PM by OmegaMan. 12 replies.
  •  01-09-2007, 2:56 PM 26073

    Parse a text file in VB.NET 2005

    I am working with large text files and standard parsing won't work due to the layout.  A sample of one line of the data is:

    |645-50-9246AKA:PEDRO, RAFAEL GARCIA|,|19671129|,000000000,,|M|,|H|,|SAN LUIS DELA PAZ, GUANAJUATO, MEXICO|,|THF14|,|THEFT BY POSSESSION|,|19901025|,|19901025|,|PORT ISABEL POLICE DEPARTMENT|,|19910123|,|91CR00000097|,|D|,|5G|,|DISMISSAL - OTHER DISMISSALS|,|19910318|

    As you can see, it is comma delimited with pipes wrapped around text.  I could handle it if pipes wrapped every field but they don't.  I am using Visual Studio 2005, VB.NET.  I don't know much at all about Regular Expressions.  I've been looking for ways to parse this data in code and all signs lead to RegEx.  Any help would be appreciated.

  •  01-09-2007, 3:54 PM 26076 in reply to 26073

    Re: Parse a text file in VB.NET 2005

    You can't split because the fields aren't simply separated; in some cases they're delimited. This:

    \|[^|]*\||[^,]*(?:,)

    ...will work somewhat, but you'll have to strip off the '|' at the beginning and end if there is one. Using non-capturing matches to save you this work is a little tricky, and I don't have time to work it out.  


    "Some day, and that day may never come, I will call upon you to do a service for me." — Don Vito Corleone
  •  01-09-2007, 6:22 PM 26085 in reply to 26076

    Re: Parse a text file in VB.NET 2005

    Thanks for your help.  You actually inspired me to learn more about RegEx and this works:

    ,(?=[A-Z]|[0-9]|[\|]|[,])

    I still have to strip the pipes from the results, but I can live with that.

    UPDATE: I have made one more modification. Here is the final result that works:

    ,(?=[A-Z0-9]|[\|]|[,]|[\040]{2})
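
    For anyone following along, here is a minimal sketch of the split-then-strip approach described in this post. It is written in Python rather than VB.NET purely so it is easy to run. The sample line is shortened, made-up data, split_fields is a hypothetical helper name, and [ ]{2} stands in for [\040]{2}.

```python
import re

# Split on a comma only when it is followed by an uppercase letter or digit,
# a pipe, another comma, or two spaces (same idea as the pattern above).
SEPARATOR = re.compile(r",(?=[A-Z0-9]|[|]|[,]|[ ]{2})")

def split_fields(line):
    """Split one line, then strip one wrapping pipe from each end of a field."""
    fields = []
    for raw in SEPARATOR.split(line):
        if len(raw) >= 2 and raw.startswith("|") and raw.endswith("|"):
            raw = raw[1:-1]
        fields.append(raw)
    return fields

sample = "|SMITH, JOHN|,|19671129|,000000000,,|M|"  # made-up data
fields = split_fields(sample)
print(fields)
```

    The comma inside the first field survives because it is followed by a single space, which none of the lookahead alternatives accept. A piped value written without a space after its comma (for example "SMITH,JOHN") would be split by the [A-Z0-9] alternative.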

  •  01-10-2007, 1:44 PM 26119 in reply to 26076

    Re: Parse a text file in VB.NET 2005

    Well, I thought I had it working, but I was wrong.  I have found one exception: if the last field is blank, it breaks.  Below is the code that works for everything but that.  Any ideas would be greatly appreciated.  Thanks.

    Dim re As New System.Text.RegularExpressions.Regex(",(?=[0-9]|[\|]|[,]|[\040]{2}|\n)")

    Dim Records() As String

    Dim i, j As Integer

    sTxt = sr.ReadLine

    While Not sTxt Is Nothing

    Records = re.Split(sTxt)
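
    A possible explanation, sketched in Python so it can be run quickly: the lookahead needs some character after the comma, and ReadLine returns the line without its newline, so the \n alternative has nothing to match on a trailing comma. Adding an end-of-line alternative ($) is one assumed fix; the sample line is made up, and I have only checked the Python form of the pattern.

```python
import re

# The lookahead requires a character after the comma, so a trailing comma
# (blank last field) is never treated as a separator.
WITHOUT_EOL = re.compile(r",(?=[0-9]|[|]|[,]|[ ]{2})")
# Assumed fix: also accept end-of-line ($) after the comma.
WITH_EOL = re.compile(r",(?=[0-9]|[|]|[,]|[ ]{2}|$)")

sample = "|M|,000000000,"  # made-up line whose last field is blank

before = WITHOUT_EOL.split(sample)
after = WITH_EOL.split(sample)
print(before)
print(after)
```

    With the extra alternative the blank last field comes back as an empty string instead of leaving a stray comma on the previous field.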

  •  01-12-2007, 1:11 PM 26197 in reply to 26119

    Re: Parse a text file in VB.NET 2005

    I don't think you should use Split(), precisely because you cannot depend on a pipe character being on both sides of a comma.

    "Some day, and that day may never come, I will call upon you to do a service for me." — Don Vito Corleone
  •  01-13-2007, 3:20 PM 26222 in reply to 26073

    Re: Parse a text file in VB.NET 2005

    Two things. One, I had to tweak this from what I currently use, which handles data in quotes instead of pipes. Two, you didn't just post someone's arrest record, did you? If the data is not clean, I would ask the moderator to clean it up, or Pedro might be mad. <g>

     

    '  Imports System.Text.RegularExpressions

    '  Regular expression built for Visual Basic on: Sat, Jan 13, 2007, 01:15:45 PM
    '  Using Expresso Version: 3.0.2559, http://www.ultrapico.com

    '  Will put columns into named capture groups. Data found in pipes, but pipes not included in capture.

    '  A description of the regular expression:

    '  Match a prefix but exclude it from the capture. [,|^]
    '      Select from 2 alternatives
    '          ,
    '          Beginning of line or string
    '  Match expression but don't capture it. [\|], zero or one repetitions
    '      Literal |
    '  [Column]: A named capture group. [[^\|]*]
    '      Any character that is NOT in this class: [\|], any number of repetitions
    '  Match expression but don't capture it. [\|], zero or one repetitions
    '      Literal |
    '  Match a suffix but exclude it from the capture. [,|$]
    '      Select from 2 alternatives
    '          ,
    '          End of line or string

    Public Dim CSV as Regex = New Regex( _
        "(?<=,|^)(?:\|)?(?<Column>[^\|]*)(?:\|)?(?=,|$)", _
        RegexOptions.Singleline _
        Or RegexOptions.ExplicitCapture _
        Or RegexOptions.CultureInvariant _
        Or RegexOptions.Compiled _
        )


  •  01-13-2007, 3:23 PM 26223 in reply to 26222

    Re: Parse a text file in VB.NET 2005

    Oh I forgot, the regex places the matches into the named group Column for each item.
  •  01-13-2007, 4:47 PM 26225 in reply to 26223

    Re: Parse a text file in VB.NET 2005

    That won't work either. If a line of data does not begin with a pipe, this regex will skip the first column!

    I worked on this for 30 minutes; it's not that easy. Several of my attempts did not work correctly on edge cases, i.e., the first and last columns. Here's what I've got now. Note that it isn't perfect; those annoying Windows newlines (\r\n) get matched by [^|]*, for example. If you read in individual lines in a way that strips off newlines, whether DOS or Unix, this won't be an issue.

    # Disallow 0 characters at the end of a line.
    # This is necessary because any other column can have 0 width!
    (?!$)
    # If you match a pipe, capture it.
    (?<pipe>\|)?
    # Capture as "column"...
    (?<column>
      (?(pipe) # If there was a pipe...
        [^|]*      # ...keep going until just before the next one
        |           # or
        [^,]*      # continue until just before a comma.
      )
    )
    # If there was a pipe to start with, then there must be another to finish.
    (?<endPipe>(?(pipe)\|))
    # Match the next comma, or EOL.
    (?<cleanUp>,|\s*$)
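
    To see the conditional in action, here is a rough Python translation of the pattern above. It is a sketch only: Python spells named groups (?P<name>...), the sample line is made-up data, and FIELD and columns are illustrative names.

```python
import re

# Same structure as the pattern above, written on one line:
#   (?!$)                  no zero-width field at the very end of the line
#   (?P<pipe>\|)?          optional opening pipe
#   (?(pipe)[^|]*|[^,]*)   piped: up to the next pipe; otherwise: up to a comma
#   (?(pipe)\|)            closing pipe is required only if one was opened
#   ,|\s*$                 separator comma, or end of line
FIELD = re.compile(
    r"(?!$)(?P<pipe>\|)?(?P<column>(?(pipe)[^|]*|[^,]*))"
    r"(?P<endPipe>(?(pipe)\|))(?P<cleanUp>,|\s*$)"
)

sample = "|SMITH, JOHN|,|19671129|,000000000,,|M|"  # made-up data
columns = [m.group("column") for m in FIELD.finditer(sample)]
print(columns)

# A blank last field is not reported, because of the (?!$) guard.
trailing = [m.group("column") for m in FIELD.finditer("|M|,")]
print(trailing)
```

    The pipes never end up in the "column" group, so no stripping is needed afterwards. The price is the guard at the start: a blank field at the very end of a line is dropped.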

     


    "Some day, and that day may never come, I will call upon you to do a service for me." — Don Vito Corleone
  •  01-13-2007, 5:31 PM 26231 in reply to 26225

    Re: Parse a text file in VB.NET 2005

    >That won't work either. If a line of data does not begin with a pipe, this regex will skip the first column.

    I disagree, and here is why. The pipes' purpose is to allow data to contain commas. If the pipes are gone, that designates an empty field, as in the user's example above, where empty columns such as "|abc|,,|def|" did occur. My regex handles that situation.

    Let us examine what happens if field one is empty. The original as posted looked like this:

    |645-50-9246AKA:PEDRO, RAFAEL GARCIA|,|19671129|,000000000,,|M|,| ...

    With an empty field for column 1, it should look like this:

    ,|19671129|,000000000,,|M|,|H

    My regex faithfully reports a blank column if that occurs.

    If the pipes were to go missing, as suggested in your post, then instead of one record there are now two, and it would look like this:

    645-50-9246AKA:PEDRO
    RAFAEL GARCIA

    Breaking up that record makes no sense.

    My regex as posted is currently in use in a financial setting on a daily automated basis. If the user is getting data back without pipes in non-null columns, then the fault is on the data provider, and they need to fix it!  Data in CSV files has to have symmetry in data placement. If that symmetry/contract is broken, so is the data.

  •  01-13-2007, 11:43 PM 26233 in reply to 26073

    Re: Parse a text file in VB.NET 2005

    Sometimes simplicity is best.

    my @output = split /\|/, $test;

    my $counter = 1;
    foreach my $field (@output)
    {
       print "$counter. $field\n";
       $counter++;
    }


    This works fine, is fast, and will allow you to process each field according to your needs. Regular expressions should be only as complex as needed, and it's better to stay as simple as possible to make code maintenance straightforward.

     


    ---
    Gil Vidals
    Truepath - your Christian Web Host
    http://www.truepath.com
  •  01-14-2007, 1:28 PM 26242 in reply to 26233

    Re: Parse a text file in VB.NET 2005

    The original statement of the problem was, "I could handle it if pipes wrapped every field but they don't."
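
    A quick illustration of why that matters, using Python's str.split on a shortened, made-up line (a sketch only; Perl's split treats trailing empty pieces differently):

```python
sample = "|SMITH, JOHN|,|19671129|,000000000,,|M|"  # made-up data

# Splitting on the pipe alone leaves the separators and the unpiped
# fields glued together in the same pieces.
pieces = sample.split("|")
print(pieces)
```

    The unpiped value and the empty field after it come back as one piece, ',000000000,,', so a second parsing pass would still be needed.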
    "Some day, and that day may never come, I will call upon you to do a service for me." — Don Vito Corleone
  •  01-15-2007, 8:51 AM 26271 in reply to 26231

    Re: Parse a text file in VB.NET 2005

    OmegaMan, I've tried your regex against a line of data beginning with 1 empty column (indicated only by the comma separator), and it works. However, if the next column is also empty, it won't work. In fact, if I cut the modified sample data from your posting and paste it into Regulator, it fails, precisely because two consecutive columns are blank. I'm not trying to be a jerk; I just wanted to make sure that I wasn't just blathering before I posted. But:

    |645-50-9246AKA:PEDRO, RAFAEL GARCIA|,|19671129|,000000000,,|M|,|

    ...gives me 000000000, as a column!
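
    This can be reproduced outside Regulator with a short Python sketch. Two assumptions: Python's re module rejects the variable-width lookbehind (?<=,|^), so it is rewritten as (?:(?<=,)|^), and the sample line is shortened, made-up data with the same shape.

```python
import re

# The posted pattern, with the lookbehind rewritten for Python and the
# non-capturing (?:\|)? groups written as plain \|? (same meaning).
CSV = re.compile(r"(?:(?<=,)|^)\|?(?P<Column>[^|]*)\|?(?=,|$)")

sample = "|SMITH, JOHN|,|19671129|,000000000,,|M|"  # made-up data
columns = [m.group("Column") for m in CSV.finditer(sample)]
print(columns)
```

    The third column comes back with its trailing comma attached. [^|]* is allowed to run across commas, so on an unpiped field it only backs off as far as the lookahead forces it to.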

    I'm not talking about leaving off pipe characters around column values that include comma characters; I'm talking about data that I understood to be correctly formatted, based on the initial posting. I agree with your point that if the data is bad, the onus is on the provider; moreover, we're not tasked with validating the data anyway. Believe me, it drives me insane in the membrane to have data suppliers hand me poorly-conceived or even invalid XML schemas and expect me to write XSLT around it. I don't consider that my job. On the other hand, sometimes the people paying me do. Anyway, it's not up to us to evaluate this tradeoff.

    The proposal to use a bit of Perl (or Python, or whatever) to solve this problem is not a bad one, esp. if you're plowing through large amounts of data. A performance comparison would be in order.
     


    "Some day, and that day may never come, I will call upon you to do a service for me." — Don Vito Corleone
  •  01-19-2007, 12:08 PM 26482 in reply to 26271

    Re: Parse a text file in VB.NET 2005

    No offense taken, slide! These forums are here to help, and after reading your post I now understand the problem better. I respect the fact that you took time out to reply and to evaluate what was said.

    With that said, I concur with the reasoning that my regex will fail. There are actually two problems with the CSV file:

    1) Null values are not delimited by |, i.e., they do not appear as ||,
    2) Some actual values are not delimited by |, as with 1234 in 1234,|abc,def|.

    The advice I would give the user, besides the real advice of throwing it back at the data provider and crying foul, is to create a program to normalize the data to expected norms. At that point it falls outside of regex.
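
    As a rough idea of what such a normalizer could look like, here is a small character-by-character sketch in Python. It assumes pipes only ever appear as field wrappers, never inside a value; split_pipe_csv and normalize are hypothetical names, and the sample line is made up.

```python
def split_pipe_csv(line):
    """Split on commas that are outside pipes; the pipes themselves are dropped."""
    fields, current, in_pipes = [], [], False
    for ch in line.rstrip("\r\n"):
        if ch == "|":
            in_pipes = not in_pipes      # entering or leaving a piped field
        elif ch == "," and not in_pipes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))      # last field, possibly empty
    return fields

def normalize(line):
    """Rewrite a line so that every field, empty or not, is wrapped in pipes."""
    return ",".join("|" + field + "|" for field in split_pipe_csv(line))

sample = "|SMITH, JOHN|,|19671129|,000000000,,|M|,"  # made-up data, blank last field
print(split_pipe_csv(sample))
print(normalize(sample))
```

    Once every field is wrapped, blank ones included, the "pipes on both sides of every comma" assumption holds, and a simple parser or regex can take over from there.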

    Sorry for the late reply...work/life got in my way.

    Again thanks for looking into it!
