
Regular Expressions (RegEx), cheat sheets and examples

Are you like me and tend to forget regex syntax between projects?

I just had to revisit regex, and came across these cheat sheets on the web. They came in pretty handy for me, so I thought I should post them here for future reference.

I have also posted a few examples that might come in handy for you. I even posted a snippet that will find any HTML tag and its closing tag - a pretty cool regex!

I have added the following to this text:

Cheat-sheets - documents containing an overview of regex quantifiers
Examples - how to find HTML tags with explanations
A note on greedy quantifiers (+, *)

Cheat-sheets
The place to look if you want additional information about regex is RegExLib.com. Click on the link below to go to the cheat-sheet they have posted:
http://regexlib.com/CheatSheet.aspx

Here are some code examples that might come in pretty handy:
http://www.dijksterhuis.org/regular-expressions-csharp-practical-use/#more-808

Examples

Remove HTML tags like <span>:
Regex objHTMLPattern = new Regex(@"(<\/?span[^>]*>)", RegexOptions.IgnoreCase);
cleanedFileContent = objHTMLPattern.Replace(cleanedFileContent, String.Empty);

This regex searches for either a <span> or a </span>. The / has been escaped by a backslash, forming what looks like a capital "V".
The <span or </span can be followed by 0 or more characters that are not a >, and then a closing >. This closing > indicates the end of the tag. The whole <span> or </span> tag is then removed (replaced by String.Empty). The idea here is also to remove span tags that carry attributes.
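
To try the span-removal pattern on its own, here is a minimal sketch. The class name SpanStripper, the method StripSpans and the input string are my own assumptions for illustration; only the pattern comes from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanStripper
{
    // Same pattern as above: an opening or closing span tag, with or without attributes.
    private static readonly Regex SpanTag =
        new Regex(@"(<\/?span[^>]*>)", RegexOptions.IgnoreCase);

    // Removes every span tag but keeps the text between the tags.
    public static string StripSpans(string html)
    {
        return SpanTag.Replace(html, String.Empty);
    }

    public static void Main()
    {
        // Hypothetical input, not taken from the original post.
        string input = "<p>Hello <SPAN class=\"x\">big</SPAN> world</p>";
        Console.WriteLine(StripSpans(input)); // prints <p>Hello big world</p>
    }
}
```

Because of RegexOptions.IgnoreCase, the upper-case SPAN tags are removed as well, and the surrounding p tags are left alone.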

Match any HTML tag and its closing tag:
Here is an expression that will find an HTML tag and the matching end tag:

Regex objHTMLPattern = new Regex(@"<(.+?)>(.+)<\/\1>", RegexOptions.IgnoreCase);

Anything enclosed in parentheses is stored in a numbered group (a "variable") and can be referenced later. The first pair of parentheses is group 1, the next is group 2, and so on. In a replacement string we can reference these groups by using $1, $2, ..., like we do at the bottom of this snippet. In the first line of this snippet, we reference the first group inside the regex itself by putting a backslash in front of the group number - \1
< Look for the starting bracket of the opening tag.
.+? A dot represents any character, and a plus means one or more matches of the previous token (i.e., any character). The .+? means match any character one or more times, but as few as possible (the ? indicates as few as possible). Remember that + is a greedy quantifier, and to get the correct result we need to use +? in this case. More on greedy quantifiers later.
> Look for the closing bracket of the opening tag.
.+ The opening tag is followed by any character one or more times. This is the second group.
<\/\1> The closing tag starts with a <, followed by a / (escaped by a backslash, giving the sign that looks like a capital V - \/ ). We then reference the first group by \1 - i.e., the group number with a backslash in front.
> The closing tag ends with a closing bracket.

The string we use in this example is listed below. The boolean variable match will in this case be true, and the output is stored back in the string "cleanedFileContent", which will now contain the value Underlined text. <b>Underlined and bold text.</b>

Regex objHTMLPattern = new Regex(@"<(.+?)>(.+)<\/\1>", RegexOptions.IgnoreCase);
string cleanedFileContent = "<u>Underlined text. <b>Underlined and bold text.</b></u>";
bool match = objHTMLPattern.Match(cleanedFileContent).Success;
if (match)
{
    // $2 is the second group, i.e. everything between the opening and closing tag
    cleanedFileContent = objHTMLPattern.Replace(cleanedFileContent, "$2");
}
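
To see what ends up in each group, here is a small sketch that prints group 1 and group 2 before doing the replace. The class name TagPairDemo, the method Inspect and the sample string with u and b tags are assumptions of mine for illustration; the pattern is the one from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPairDemo
{
    private static readonly Regex TagPair =
        new Regex(@"<(.+?)>(.+)<\/\1>", RegexOptions.IgnoreCase);

    // Returns { group 1, group 2 } for the first match, or an empty array if nothing matches.
    public static string[] Inspect(string html)
    {
        Match m = TagPair.Match(html);
        if (!m.Success)
        {
            return new string[0];
        }
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }

    public static void Main()
    {
        // Assumed sample string: an underlined sentence with a bold part inside.
        string sample = "<u>Underlined text. <b>Underlined and bold text.</b></u>";
        string[] parts = Inspect(sample);
        Console.WriteLine("Group 1: " + parts[0]); // the tag name
        Console.WriteLine("Group 2: " + parts[1]); // everything between the tags
        Console.WriteLine(TagPair.Replace(sample, "$2"));
    }
}
```

With this sample, group 1 is u and group 2 is the text between the u tags, including the inner b tags, so the replace with $2 strips only the outer tag pair.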

Disallow hyphen, slash and dot (only at end of string):
Here is another example from something I was working on the other day. This example will not allow the characters "-" (hyphen) and "/" (slash) anywhere in the text, and will not allow hyphen, slash or dot at the end of the text.

The ^ at the very start of the expression anchors the match to the beginning of the string. Without it, the expression could match just the tail of a text that contains a hyphen or slash.
The ^ inside a square bracket means negation. Characters inside square brackets are simply listed one after another, so no | is needed (inside brackets a | is a literal pipe character, not "or"). The first square bracket then reads "not hyphen or slash".
The * means repeat the previous 0 or more times, i.e., repeating any character that is "not hyphen or slash".
The last square bracket reads "not hyphen or slash or dot" (inside square brackets the dot is literal and needs no backslash). Note the $ after this square bracket. The dollar sign means that the previous should be at the end of the string. So the last square bracket will not allow hyphen, slash or dot at the end of the string.

^[^-/]*[^-/.]$
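
To check how such a pattern behaves, here is a sketch that runs an unanchored form, [^-|/]*[^-|/|\.]$, next to an anchored form, ^[^-/]*[^-/.]$, on a few strings. The class name EndingValidator, the method names and the sample strings are my own assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndingValidator
{
    // Unanchored form: nothing forces the match to begin at the start of the string.
    private static readonly Regex Unanchored = new Regex(@"[^-|/]*[^-|/|\.]$");

    // Anchored form (assumed variant): ^ pins the match to the start of the string,
    // and the | characters are dropped because | is a literal inside square brackets.
    private static readonly Regex Anchored = new Regex(@"^[^-/]*[^-/.]$");

    public static bool MatchesUnanchored(string text)
    {
        return Unanchored.IsMatch(text);
    }

    public static bool MatchesAnchored(string text)
    {
        return Anchored.IsMatch(text);
    }

    public static void Main()
    {
        // Hypothetical sample inputs.
        string[] samples = { "abc", "abc.", "a-b", "a.b" };
        foreach (string s in samples)
        {
            Console.WriteLine("{0,-5} unanchored: {1,-5} anchored: {2}",
                s, MatchesUnanchored(s), MatchesAnchored(s));
        }
    }
}
```

The interesting case is "a-b": the unanchored form accepts it, because it can match just the trailing "b", while the anchored form rejects it. If the pattern is used somewhere that already requires the whole text to match, the two forms should behave the same for hyphen and slash.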

Greedy quantifiers
Watch out for greedy quantifiers! It is important that you understand how the Regex engine works, and especially how greedy quantifiers work in order to avoid mistakes.
In the example above, I used the expression <.+?>. If I omit the ? from this expression, I get <.+>. You would expect this to match the first <, then one or more characters, and then a closing >. Consider the example below:
<u>Underlined text.</u>

If you apply an expression like:

Regex objHTMLPattern = new Regex("<.+>");

you would expect the result to be <u>. However, since the + is a greedy quantifier, the regex will match the whole string (i.e., <u>Underlined text.</u>). This is obviously not what we wanted! You might have expected the regex to match <u> and after that </u>. The greedy + causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail will the regex engine backtrack. The regex engine will thus find the first <, then match as many characters as possible (remember that < and > are matched by the dot like any other character) before matching the closing >. The regex engine therefore matches everything from the starting < to the very last closing >. Not what we wanted, but the engine still did exactly what it was told to do.

If we, however, add the ? to make the + a lazy quantifier as follows:

Regex objHTMLPattern = new Regex("<.+?>");

we tell the regex engine to match any character, but as few as possible. This then yields a result of <u>, which is what we wanted.
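
The difference is easy to check by listing every match for both expressions. This is just a sketch; the class name GreedyLazyDemo, the helper FindAll and the sample string with u tags are assumptions of mine.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Collects every match of the pattern in the input, in order.
    public static List<string> FindAll(string pattern, string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Assumed sample string: one underlined sentence.
        string sample = "<u>Underlined text.</u>";

        // Greedy: one match that spans the whole string.
        Console.WriteLine("Greedy: " + string.Join(" | ", FindAll("<.+>", sample)));

        // Lazy: two matches, the opening tag and the closing tag.
        Console.WriteLine("Lazy:   " + string.Join(" | ", FindAll("<.+?>", sample)));
    }
}
```

The greedy version returns a single match covering the whole string, while the lazy version returns the opening tag and the closing tag as two separate matches.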

Happy regex'ing :)

André Vold

30.01.2010


André Vold

Here is a quick little BLOGGER BIO to tell you a bit about my background - and why I chose programming as my field.

I grew up in Norway, but went to university in the USA. I am a graduate engineer ("sivilingeniør") with a degree in Computer Engineering from Arizona State University. After my studies I moved back to Norway, started working at Norsk Data and later IBM, and took further education at BI in the Master of Management program. After several years as a division director at IBM Norge, I chose to start the e-learning company Apropos Internett AS. Later I also started ViroSafe Norge AS, which imports and distributes anti-malware and anti-virus software and hardware.

It has always been essential for me to stay up to date on technological changes and trends - not only to keep my programming skills sharp, but also for my success as an entrepreneur.

The Lumino blogs cover a range of topics that have been essential to my business, and that may therefore be relevant to yours.