Protecting your regex from future changes in the source code
Hi guys,
It has been a long time since I last wrote here; let's just say things have been busy at work and on the home front!
Today I had to fix a web crawler because of a regex that failed to extract a tax amount. The regex was not written by me, but it reminded me of something I learned a long time ago: "Remember to protect your regex from future changes in the source code!!!" It gave me the idea to share the steps I take to follow this rule.
When we use an HTML tag in a regex, we usually write <td>name or <td align="left" >name, but who knows which attributes the website we are extracting data from will use? We also know that those attributes can change from time to time. So we might try to solve it by writing <td.*?>name, but this is dangerous: the dot also matches >, so the regex can perform poorly because of backtracking, and it risks matching more than you are really looking for.
The best way to solve this issue is to use a negated definition. An HTML tag may contain basically any character in the world except >, so by using the regex <table[^>]*> we can be sure we will never match more than a single table tag. Notice that we do not need the lazy *? here, as this is already taken care of by the [^>] definition.
On top of that, we accept any kind of text within the tag: attributes, JavaScript calls, styling, you name it, we support it!
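To see the difference, here is a minimal Python sketch using the standard re module. The sample HTML string is made up for illustration and is not taken from any real site.

```python
import re

# Made-up sample: the cell we want is the second one.
html = '<td class="x">other</td><td align="left" >name'

# Lazy dot: "." also matches ">", so the match starts at the FIRST <td
# and runs across the tag boundaries until it finally finds ">name".
lazy = re.search(r'<td.*?>name', html)

# Negated class: [^>]* can never cross a ">", so the match is limited
# to the one opening tag that is directly followed by "name".
negated = re.search(r'<td[^>]*>name', html)

print(lazy.group(0))     # <td class="x">other</td><td align="left" >name
print(negated.group(0))  # <td align="left" >name
```

The lazy version still "works" in the sense that it finds something, which is exactly why this kind of bug tends to go unnoticed until the page layout changes.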
Now what about white space? Can we know for sure how many spaces, tabs, or newlines the server-side code will produce? No, we cannot! That is why it's always a good idea to use \s* ALL THE TIME!!! Use it like there is no tomorrow! (Actually, use it because you want the regex to work tomorrow as well.)
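As a rough illustration, the Python sketch below compares a rigid pattern with one that allows \s* around the text. The three markup variants are hypothetical examples of how different server-side code might emit the same cell.

```python
import re

# Hypothetical variants of the same cell with different white space.
variants = [
    '<td>name</td>',
    '<td> name </td>',
    '<td>\n\t name\n</td>',
]

rigid = re.compile(r'<td>name</td>')
tolerant = re.compile(r'<td[^>]*>\s*name\s*</td>')

# \s matches spaces, tabs, and newlines, so one pattern covers all variants.
rigid_hits = [bool(rigid.search(v)) for v in variants]
tolerant_hits = [bool(tolerant.search(v)) for v in variants]

print(rigid_hits)     # [True, False, False]
print(tolerant_hits)  # [True, True, True]
```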
Let's say the code we need to match is the following:
<table>
<tr>
<td class="bla" >
name</td>
</tr></table>
Because of the way browsers read HTML, extra white space is collapsed or ignored when the page is rendered, so webmasters tend to be careless with it, especially when the code is generated automatically by server-side code. A smart regex that takes this fact into consideration is this one:
<table[^>]*>\s*<tr[^>]*>\s*<td[^>]*>\s*([a-zA-Z]+)\s*</td[^>]*>\s*</tr[^>]*>\s*</table[^>]*>
Notice the use of \s* between any two elements that may be separated by an unknown amount of white space!
Another example is JavaScript code. In JavaScript, spaces around operators are not a factor: there is no difference between x=0, x = 0, and x  =  0. We need to remember that when we write regexes that extract data from JavaScript code. The javascript: prefix itself can be used both with and without a following space (e.g. javascript:myFunc(); or javascript: myFunc();). On the other hand, you cannot break a command or a variable name with a space in JavaScript. So you need to know a bit about the syntax of the code you are extracting before you can be sure where you may and may not meet white space.
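Both ideas above can be tried out with a short Python sketch. The table pattern is the one from this post; the JavaScript part is only an illustration, and the variable name total is a hypothetical stand-in for whatever you need to extract.

```python
import re

# The sample markup from this post, with its irregular line breaks.
html = """<table>
<tr>
<td class="bla" >
name</td>
</tr></table>"""

table_re = re.compile(
    r'<table[^>]*>\s*<tr[^>]*>\s*<td[^>]*>\s*([a-zA-Z]+)\s*'
    r'</td[^>]*>\s*</tr[^>]*>\s*</table[^>]*>'
)

match = table_re.search(html)
cell = match.group(1) if match else None
print(cell)  # name

# JavaScript: white space is allowed around "=", but not inside the
# variable name, so \s* goes only around the operator.
assign_re = re.compile(r'\btotal\s*=\s*(\d+)')
samples = ['total=0', 'total = 0', 'total  =  0']
values = [assign_re.search(js).group(1) for js in samples]
print(values)  # ['0', '0', '0']

# A name split by a space is a different piece of code: no match.
broken = assign_re.search('to tal = 0')
print(broken)  # None
```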
That is all for now.
I hope you will find this information useful.
Be good