How does the regex think?
If you want to write good, efficient regular expressions, then you need to understand how the regex object thinks. What happens when you execute a regex? I'll try to shed some light on that, using conclusions I have drawn after working with regex for almost 4 years.
Disclaimer: I have never seen the inside of a regex object, nor designed one. All my experience comes from working with regexes, trying to optimize them, and working out why one regex runs better than another.
So, we all know the DOS days, the dir command, and *.txt. That is a very simple pattern, a close cousin of a regex: we accept anything that is followed by a dot, a t, an x and a t. But can we understand what the computer was doing?
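To make the *.txt idea concrete, here is a small sketch in Python. The file names are made up for illustration, and fnmatch is Python's wildcard matcher, not what DOS itself used (DOS also ignored letter case, which this sketch does not). It only shows that the wildcard and a hand-written regex with the same meaning pick the same names.

```python
import fnmatch
import re

# Hypothetical directory listing, invented for this example.
names = ["notes.txt", "report.doc", "todo.txt", "txt"]

# The wildcard *.txt: anything, followed by a dot, a t, an x and a t.
by_wildcard = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# A regex with the same meaning: ".*" is "anything", "\." is a literal dot.
by_regex = [n for n in names if re.fullmatch(r".*\.txt", n)]

print(by_wildcard)
print(by_regex)
```

Note that in a regex the star alone does not mean "anything"; it means "repeat the previous element", which is why the regex version needs the dot in front of it.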
Let's say I give you the task of finding, in a list of students, all the students whose name starts with a C. Think about what you will be doing. You will look at each line and acknowledge it as a student's name entry, then you will count in your head every time you find an entry that starts with the letter C. Now let's say we want the number of students whose first name starts with C and whose last name starts with A. Again you will look at each line and acknowledge it as a student's name entry, and count every entry that starts with the letter C, but now there is a new criterion: the second name must start with A. You will recognize the second name by the first space after the first name. Well, you are human: you are clever, but you are slow, and you get tired.
A program cannot acknowledge a new line for you; it cannot see where a first name ends and the second begins. That is why regex has so much syntax: we need to specify milestones in the text (the beginning and the end of the line, in our case), and we need to define the pattern that describes a first name and a last name. In this case the start of a line is the beginning of a student name entry, and the end of the line is the end of the entry. The space is what separates the first name from the second name. A student's name can only contain the letters a to z or A to Z, in regex [a-zA-Z] (the shorter a-Z does not work: Z comes before a in the character table, so most engines reject that range), and a name must have at least 2 chars, so we can say the regex is ^[a-zA-Z]{2,}\s[a-zA-Z]{2,}$ (in most engines you also need the multiline option, so that ^ and $ match at every line and not only at the start and end of the whole text). Simple, isn't it?
Now that we have built this regex, let's say we are the regex object, and let's see step by step what the regex is doing.
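Here is a minimal sketch of the student regex in Python, with a made-up list of names. Two things in it are my assumptions and not part of the text above: Python's re module rejects the range a-Z, so the sketch spells it a-zA-Z, and it passes re.MULTILINE so that ^ and $ match at every line.

```python
import re

# Python's re refuses the range a-Z, because "Z" sorts before "a".
try:
    re.compile(r"[a-Z]")
    range_accepted = True
except re.error:
    range_accepted = False

# Hypothetical student list: one entry per line.
students = "Carl Adams\nAnna Berg\nC Adams\nCleo  Ash\nBo Li"

# Start of line, 2+ letters, one whitespace char, 2+ letters, end of line.
name_entry = re.compile(r"^[a-zA-Z]{2,}\s[a-zA-Z]{2,}$", re.MULTILINE)
matches = name_entry.findall(students)

print(range_accepted)
print(matches)
```

"C Adams" is skipped because the first name has only 1 char, and "Cleo  Ash" is skipped because of the double space.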
The first thing the regex object will do is compile the expression. After that it will go to the first position in the source and ask whether it is the start of a line (^ is a position, not a char, so there is nothing to store for it). If it is, it will open a temp container and look at the next element in the expression: a letter, at least 2 times. It will ask whether the first char of the line is a letter (a to z or A to Z), and if it is, it will put it in the temp container; if not, it will clear the temp container and go back to asking whether it is at a ^. Let's say the first char is a letter. Then the regex object will look at the 2nd char, and if it is a letter it will add it to the temp container; if not, it will empty the container and check for ^ again.
Now that we have 2 letters, there are 2 ways to continue matching: either we find a space, or we find more letters. At any point where it fails to match either of these 2, the regex will empty the temp container and will try to begin again from one char after the point where it started the match that failed (this is very important to understand in order to debug a slow regex).
At some point it will come to a space. Then it will again look for a letter at least 2 times, and when it has found 2 letters, it will look for either more letters or the end of the line, $. When it finds the end of the line, it will add the value from the temp container to the matches array, and it will start looking for the next match from one char after the match. That is why slow regexes are usually regexes that do not find a match: a match lets the regex jump past everything it matched, while a failure sends it back to retry from every single position.
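The steps above can be sketched as a tiny hand-written scanner. This is not how a real regex object is implemented (real engines are far more general and optimized); it is only my walk-through of the steps for this one expression, with hypothetical function names, a made-up source text, and the simplification that the separator must be a plain space.

```python
def is_letter(ch):
    # Letters only: a to z or A to Z.
    return "a" <= ch <= "z" or "A" <= ch <= "Z"

def take_letters(source, pos):
    # Consume letters from pos; fail (None) if there are fewer than 2.
    end = pos
    while end < len(source) and is_letter(source[end]):
        end += 1
    return end if end - pos >= 2 else None

def try_match_at(source, start):
    # ^ : we must be at the start of the text or right after a newline.
    if start != 0 and source[start - 1] != "\n":
        return None
    pos = take_letters(source, start)            # first name
    if pos is None:
        return None
    if pos >= len(source) or source[pos] != " ":  # the separating space
        return None
    pos = take_letters(source, pos + 1)          # last name
    if pos is None:
        return None
    # $ : we must be at the end of the text or right before a newline.
    if pos != len(source) and source[pos] != "\n":
        return None
    return pos

def scan(source):
    matches, attempts, start = [], 0, 0
    while start < len(source):
        attempts += 1
        end = try_match_at(source, start)
        if end is None:
            start += 1      # failed: retry one char after where we started
        else:
            matches.append(source[start:end])  # the "temp container"
            start = end     # matched: continue right after the match
    return matches, attempts

matches, attempts = scan("Carl Adams\nAnna 4Berg\nBo Li")
print(matches, attempts)
```

The two good lines are matched in one attempt each, while the broken line in the middle makes the scanner retry from every one of its chars. That is the cost of a failed match in miniature.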
I guess you are just as confused as before, if not more, after reading this part, but have no fear, here is a task. Take a piece of paper and write 3 names on it: 1 name on each line, with one space between the first and the last name. Now take another piece of paper and draw a table with a container column and a caret column. Then scan the list of names using the regex and the steps I wrote, and add a new row to the table for each caret movement.
That is it, it's a simple process. The complication lies in the many different definitions we can give the regex to look for. But don't let that scare you! These complications are a tool for you; they are there to help you, to help us. When you want to write a regex, ask yourself: what do I need to match? A tlf (telephone) number? That is a number with, let's say, 10 digits, and no other chars are allowed. But all tlf numbers start with tlf:, so we can make sure we don't save a social security number as a phone number. So just as you would look in your head for a 10 digit pattern that comes after the milestone tlf:, write a regex that looks for the milestone tlf:\s followed by the 10 digit pattern \d{10}, which put together gives tlf:\s\d{10}
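As a quick sketch of the tlf regex in Python: the sample text is invented, and the parentheses around \d{10} are my addition so that only the digits are returned, without the tlf: milestone.

```python
import re

# Hypothetical input: a social security number, a valid tlf, a too-short tlf.
text = "ssn: 1234567890\ntlf: 5551234567\ntlf: 12345"

# Milestone "tlf:", one whitespace char, then exactly 10 digits (captured).
phone = re.compile(r"tlf:\s(\d{10})")
numbers = phone.findall(text)

print(numbers)
```

The ssn line is ignored because it lacks the milestone, and the last line is ignored because it has fewer than 10 digits. As written, a run of 11 digits would still match on its first 10; adding (?!\d) after \d{10} is one way to rule that out.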
That's all for today,
Be good