Parse a text file in VB.NET 2005
Last post 01-19-2007, 12:08 PM by OmegaMan. 12 replies.

01-09-2007, 2:56 PM
dbinthecountry
Joined on 01-09-2007
Posts 3
Parse a text file in VB.NET 2005
I am working with large text files, and standard parsing won't work because of the layout. A sample of one line of the data is:
|645-50-9246AKA:PEDRO, RAFAEL GARCIA|,|19671129|,000000000,,|M|,|H|,|SAN LUIS DELA PAZ, GUANAJUATO, MEXICO|,|THF14|,|THEFT BY POSSESSION|,|19901025|,|19901025|,|PORT ISABEL POLICE DEPARTMENT|,|19910123|,|91CR00000097|,|D|,|5G|,|DISMISSAL - OTHER DISMISSALS|,|19910318|
As you can see, it is comma-delimited, with pipes wrapped around text fields. I could handle it if pipes wrapped every field, but they don't. I am using Visual Studio 2005 with VB.NET. I don't know much at all about regular expressions. I've been looking for ways to parse this data in code, and all signs lead to RegEx. Any help would be appreciated.

01-09-2007, 3:54 PM
SlideGuitar
Joined on 09-02-2005
Reston, VA
Posts 528
Re: Parse a text file in VB.NET 2005
You can't split because the fields aren't simply separated; in some cases they're delimited. This:
\|[^|]*\||[^,]*(?:,)
...will work somewhat, but you'll have to strip off the '|' at the beginning and end if there is one. Using non-capturing matches to save you this work is a little tricky, and I don't have time to work it out.
"Some day, and that day may never come, I will call upon you to do a service for me." — Don Vito Corleone

01-09-2007, 6:22 PM
dbinthecountry
Joined on 01-09-2007
Posts 3
Re: Parse a text file in VB.NET 2005
Thanks for your help. You actually inspired me to learn more about RegEx and this works:
,(?=[A-Z]|[0-9]|[\|]|[,])
I still have to strip the pipes from the results, but I can live with that.
UPDATE: I have made one more modification. Here is the final result that works:
,(?=[A-Z0-9]|[\|]|[,]|[\040]{2})

01-10-2007, 1:44 PM
dbinthecountry
Joined on 01-09-2007
Posts 3
Re: Parse a text file in VB.NET 2005
Well, I thought I had it working, but I was wrong. I have found one exception: if the last field is blank, it breaks. Below is the code that works for everything but that. Any ideas would be greatly appreciated. Thanks.
Dim re As New System.Text.RegularExpressions.Regex(",(?=[0-9]|[\|]|[,]|[\040]{2}|\n)")
Dim Records() As String
Dim i, j As Integer
sTxt = sr.ReadLine
While Not sTxt Is Nothing
    Records = re.Split(sTxt)
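A possible cause, sketched in Python rather than VB.NET so it is easy to run: ReadLine returns the line without its terminator, so the \n alternative in the lookahead never gets a chance to match, and a comma at the very end of the line is not treated as a separator. Adding an end-of-string anchor ($) to the lookahead is one way around that. The sample record and the names ORIGINAL, PATCHED, and split_fields are made up for illustration; the same pattern text should carry over to System.Text.RegularExpressions, but that is not tested here.

```python
import re

# Pattern from the post above: the final comma of a line is never split on,
# because nothing follows it once the line terminator has been stripped.
ORIGINAL = re.compile(r",(?=[0-9]|[\|]|[,]|[\040]{2}|\n)")

# Assumed fix: also accept end-of-string ($) right after the comma.
PATCHED = re.compile(r",(?=[0-9]|[\|]|[,]|[\040]{2}|$)")


def split_fields(line, pattern=PATCHED):
    """Split one record and strip the wrapping pipes from each field."""
    return [field.strip("|") for field in pattern.split(line)]


# Hypothetical record whose last field is blank.
sample = "|DOE, JOHN|,000000000,,|D|,"

print(ORIGINAL.split(sample))  # last chunk still carries the trailing comma
print(split_fields(sample))    # trailing blank field becomes its own entry
```

Either pattern still splits on a comma inside a piped field when that comma is followed directly by a digit, a pipe, another comma, or two spaces (for example |1,2|), so this only patches the trailing-field case.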

01-12-2007, 1:11 PM
SlideGuitar
Joined on 09-02-2005
Reston, VA
Posts 528
Re: Parse a text file in VB.NET 2005
I don't think you should use Split(), precisely because you cannot depend on a pipe character being on both sides of a comma.

01-13-2007, 3:20 PM
OmegaMan
Joined on 09-10-2006
Posts 18
Re: Parse a text file in VB.NET 2005
Two things: One, I had to tweak this from what I currently use, which has data in quotes instead of pipes. Also, you didn't just post someone's arrest record, did you? If the data is not clean, I would ask the moderator to clean it up or Pedro might be mad. <g>
' Imports System.Text.RegularExpressions

' Regular expression built for Visual Basic on: Sat, Jan 13, 2007, 01:15:45 PM
' Using Expresso Version: 3.0.2559, http://www.ultrapico.com
'
' Will put columns into named capture groups. Data found in pipes, but pipes not included in capture.
'
' A description of the regular expression:
'
' Match a prefix but exclude it from the capture. [,|^]
'     Select from 2 alternatives
'         ,
'         Beginning of line or string
' Match expression but don't capture it. [\|], zero or one repetitions
'     Literal |
' [Column]: A named capture group. [[^\|]*]
'     Any character that is NOT in this class: [\|], any number of repetitions
' Match expression but don't capture it. [\|], zero or one repetitions
'     Literal |
' Match a suffix but exclude it from the capture. [,|$]
'     Select from 2 alternatives
'         ,
'         End of line or string
'
Public Dim CSV as Regex = New Regex( _
    "(?<=,|^)(?:\|)?(?<Column>[^\|]*)(?:\|)?(?=,|$)", _
    RegexOptions.Singleline _
    Or RegexOptions.ExplicitCapture _
    Or RegexOptions.CultureInvariant _
    Or RegexOptions.Compiled _
    )

01-13-2007, 3:23 PM
OmegaMan
Joined on 09-10-2006
Posts 18
Re: Parse a text file in VB.NET 2005
Oh I forgot, the regex places the matches into the named group Column for each item.

01-13-2007, 4:47 PM
SlideGuitar
Joined on 09-02-2005
Reston, VA
Posts 528
Re: Parse a text file in VB.NET 2005
That won't work either. If a line of data does not begin with a pipe, this regex will skip the first column! I worked on this for 30 minutes; it's not that easy. Several of my attempts did not work correctly on literal edge cases, i.e., first and last columns. Here's what I've got now. Note that it isn't perfect; those annoying Windows newlines (\r\n) get matched by [^|]*, for example. If you read in individual lines in a way that strips off newlines, whether DOS or Unix, this won't be an issue.
# Disallow 0 characters at the end of a line.
# This is necessary because any other column can have 0 width!
(?!$)
# If you match a pipe, capture it.
(?<pipe>\|)?
# Match lazily, and capture as "column"...
(?<column>
  (?(pipe)   # If there was a pipe...
    [^|]*    # ...keep going until just before the next one
  |          # or
    [^,]*    # continue until just before a comma.
  )
)
# If there was a pipe to start with, then there must be another to finish.
(?<endPipe>(?(pipe)\|))
# Match the next comma, or EOL.
(?<cleanUp>,|\s*$)
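For anyone who wants to try the pattern without a .NET regex tester, here is a rough Python port. Python's re module also supports conditional groups, but named groups are written (?P<name>...), and the endPipe and cleanUp captures are folded into plain groups here. The sample records, the FIELD and parse_line names, the up-front newline stripping, and the handling of a trailing comma are assumptions of this sketch, not part of the pattern above.

```python
import re

FIELD = re.compile(r"""
    (?!$)                        # never match the empty string at end of line
    (?P<pipe>\|)?                # optional opening pipe
    (?P<column>
        (?(pipe) [^|]* | [^,]* ) # piped: up to the next pipe; else: up to a comma
    )
    (?(pipe)\|)                  # an opening pipe requires a closing pipe
    (?:,|\s*$)                   # consume the separator, or reach end of line
""", re.VERBOSE)


def parse_line(line):
    """Return the column values of one record, without the wrapping pipes."""
    line = line.rstrip("\r\n")   # keep \r\n out of the last column
    fields = [m.group("column") for m in FIELD.finditer(line)]
    if line.endswith(","):
        # The (?!$) guard skips a zero-width last column, so add it back.
        fields.append("")
    return fields


# Hypothetical records in the same layout as the original sample.
print(parse_line("|DOE, JOHN|,|19671129|,000000000,,|M|\r\n"))
print(parse_line("000000000,,|M|,"))
```

Appending the empty string for a trailing comma is a workaround outside the regex; whether a blank last column should count as a field depends on how the file is meant to be read.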
"Some day, and that day may never come, I will call upon you to do a service for me." — Don Vito Corleone

01-13-2007, 5:31 PM
OmegaMan
Joined on 09-10-2006
Posts 18
Re: Parse a text file in VB.NET 2005
>That won't work either. If a line of data does not begin with a pipe, this regex will skip the first column.
I disagree, and here is why. The pipes' purpose is to allow data to contain commas. If the pipes are gone, that designates an empty field, as in the user's example above, where empty columns such as "|abc|,,|def|" did occur. My regex handles that situation. Suppose field one were an empty field. The original as posted looked like this:
|645-50-9246AKA:PEDRO, RAFAEL GARCIA|,|19671129|,000000000,,|M|,| ...
With an empty field for column 1, it should look like this:
,|19671129|,000000000,,|M|,|H
My regex faithfully reports a blank column if that occurs. If the pipes were to go missing, as suggested in your post, then instead of having one record there are two now, and it would look like this:
645-50-9246AKA:PEDRO RAFAEL GARCIA
It makes no sense to break up that record. My regex as posted is currently in use in a financial setting on a daily automated basis. If the user is getting data back without pipes in non-null columns, then the fault is on the data provider and they need to fix it! Data in CSV files has to have symmetry in data placement. If that symmetry/contract is broken, so is the data.

01-13-2007, 11:43 PM
vidals
Joined on 01-13-2007
Escondido
Posts 2
Re: Parse a text file in VB.NET 2005
Sometimes simplicity is best.
my @output = split /\|/, $test;
my $counter = 1;
foreach my $field (@output) {
    print "$counter. $field\n";
    $counter++;
}
This works fine, is fast, and will allow you to process each field according to your needs. Regular expressions should be only as complex as needed, and it's better to stay as simple as possible to make code maintenance straightforward.
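For comparison, here is the same pipe-only split sketched in Python on a made-up record, to show what each chunk looks like when some fields are not wrapped in pipes. The sample record and the name split_on_pipes are illustrative only.

```python
def split_on_pipes(line):
    """Split a record on the pipe character only, as in the Perl snippet."""
    return line.split("|")


# Hypothetical record: two piped fields, an unpiped number,
# an empty field, and a piped flag.
sample = "|DOE, JOHN|,|19671129|,000000000,,|M|"

for counter, chunk in enumerate(split_on_pipes(sample), start=1):
    print(f"{counter}. {chunk!r}")
```

The chunk ',000000000,,' holds an unpiped field, an empty field, and their separating commas, so a second pass over the comma-only chunks would still be needed for data laid out like this.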
--- Gil Vidals Truepath - your Christian Web Host http://www.truepath.com

01-14-2007, 1:28 PM
SlideGuitar
Joined on 09-02-2005
Reston, VA
Posts 528
Re: Parse a text file in VB.NET 2005
The original statement of the problem was, "I could handle it if pipes wrapped every field but they don't."

01-15-2007, 8:51 AM
SlideGuitar
Joined on 09-02-2005
Reston, VA
Posts 528
Re: Parse a text file in VB.NET 2005
OmegaMan, I've tried your regex against a line of data beginning with one empty column (indicated only by the comma separator), and it works. However, if the next column is also empty, it won't work. In fact, if I cut the modified sample data from your posting and paste it into Regulator, it fails, precisely because two consecutive columns are blank. I'm not trying to be a jerk; I just wanted to make sure that I wasn't just blathering before I posted. But:
|645-50-9246AKA:PEDRO, RAFAEL GARCIA|,|19671129|,000000000,,|M|,|
...gives me 000000000, as a column!
I'm not talking about leaving off pipe characters around column values that include comma characters; I'm talking about data that I understood to be correctly formatted, based on the initial posting. I agree with your point that if the data is bad, the onus is on the provider; moreover, we're not tasked with validating the data anyway. Believe me, it drives me insane in the membrane to have data suppliers hand me poorly-conceived or even invalid XML schemas and expect me to write XSLT around it. I don't consider that my job. On the other hand, sometimes the people paying me do. Anyway, it's not up to us to evaluate this tradeoff. The proposal to use a bit of Perl (or Python, or whatever) to solve this problem is not a bad one, esp. if you're plowing through large amounts of data. A performance comparison would be in order.

01-19-2007, 12:08 PM
OmegaMan
Joined on 09-10-2006
Posts 18
Re: Parse a text file in VB.NET 2005
No offense taken, Slide! These forums are here to help, and after reading your post I now understand the problem better. I respect the fact that you took time out to reply and to evaluate what was said. With that said, I concur with the reasoning that my regex will fail. There are actually two problems with the CSV file: 1) null values are not delimited by |, such as ||, and 2) actual values are not delimited by |, such as 1234,|abc,def|. The advice I would give the user, besides the real advice of throwing it back at the data provider and crying foul, is to create a program to normalize the data to expected norms. At that point it falls outside of Regex. Sorry for the late reply; work/life got in my way. Again, thanks for looking into it!
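A normalizing pass along those lines could be a plain character scan with no regex at all. The Python sketch below is one hedged way to do it, assuming a pipe never appears inside field data and that pipes are always balanced; parse_fields and normalize are illustrative names, and the sample record is made up.

```python
def parse_fields(line):
    """Split one record into fields.

    Commas separate fields unless they sit between a pair of pipes.
    Assumes pipes are balanced and never occur inside the data itself.
    """
    fields, buf, in_pipes = [], [], False
    for ch in line.rstrip("\r\n"):
        if ch == "|":
            in_pipes = not in_pipes        # toggle; the pipe itself is dropped
        elif ch == "," and not in_pipes:
            fields.append("".join(buf))    # a separator ends the current field
            buf = []
        else:
            buf.append(ch)
    fields.append("".join(buf))            # last field (may be empty)
    return fields


def normalize(line):
    """Re-emit the record with every field wrapped in pipes, nulls as ||."""
    return ",".join("|" + field + "|" for field in parse_fields(line))


# Hypothetical record mixing piped, unpiped, and empty fields.
sample = "|DOE, JOHN|,|19671129|,000000000,,|M|,"
print(parse_fields(sample))
print(normalize(sample))
```

Once every field is wrapped, including the null ones, a simple pattern or even a split on the three-character sequence |,| becomes workable again. An unbalanced pipe would make the scan treat the rest of the line as one field, so real data may need a sanity check on the field count.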