I have added the following to this text:

- Cheat-sheets: documents containing an overview of regex quantifiers
- Examples: how to find HTML tags, with explanations
- A note on greedy quantifiers (+, *)

Cheat-sheets

The place to look if you want additional information about regex is RegExLib.com. Follow the link below to go to the cheat-sheet they have posted:

http://regexlib.com/CheatSheet.aspx

Here are some code examples that might come in pretty handy:

http://www.dijksterhuis.org/regular-expressions-csharp-practical-use/#more-808

Examples

Remove HTML tags like <span style=...>:

    objHTMLPattern = new Regex(@"(<\/?span[^>]*>)", RegexOptions.IgnoreCase);
    cleanedFileContent = objHTMLPattern.Replace(cleanedFileContent, String.Empty);

This regex searches for either a <span> or a </span>. The / has been escaped by a backslash, forming what looks like a capital "V" (the escape is harmless, but .NET does not require it), and the ? after it makes the slash optional. The tag name can be followed by zero or more characters that are not a >, and then a closing >. This closing > marks the end of the tag. The whole <span> or </span> tag is then removed (replaced by String.Empty). Because of the [^>]* part, tags with attributes such as <span style=...> are removed as well.

Match any HTML tag and its closing tag:

Here is an expression that will find an HTML tag and the matching end tag:

    Regex objHTMLPattern = new Regex(@"<(.+?)>(.+)<\/\1>", RegexOptions.IgnoreCase);
The string used in this example is shown in the code below. In this case the boolean variable match is true, and the output is stored back in the string cleanedFileContent, which will then contain the value:

    Underlined text. <b>Underlined and bold text.</b>

The \1 in the pattern refers back to whatever the first group captured (the tag name), and $2 in the replacement stands for the second group, i.e. the text between the opening and closing tag.

    Regex objHTMLPattern = new Regex(@"<(.+?)>(.+)<\/\1>", RegexOptions.IgnoreCase);
    string cleanedFileContent = "<u>Underlined text. <b>Underlined and bold text.</b></u>";
    bool match = objHTMLPattern.Match(cleanedFileContent).Success;
    if (match)
    {
        cleanedFileContent = objHTMLPattern.Replace(cleanedFileContent, "$2");
    }

Disallow hyphen, slash and dot (only at end of string):

Here is another example from something I was working on the other day. It is meant to disallow the characters "-" (hyphen) and "/" (slash) in the text, and to disallow a hyphen, slash or dot at the end of the text. The ^ inside the square brackets means negation. The | was intended as "or", but inside square brackets it is just a literal character, so the brackets as written also exclude |; listing the characters one after another is enough. The first square bracket then reads "not hyphen or slash". The * means repeat the previous token zero or more times, i.e. any number of characters that are "not hyphen or slash". The last square bracket reads "not hyphen or slash or dot" (the dot is escaped by a backslash here, although inside square brackets that is optional). Note the $ after this square bracket. The dollar sign means that the preceding token must be at the end of the string, so the last square bracket will not allow a hyphen, slash or dot at the end of the string. Note also that the expression has no ^ anchor at the start, so when it is used to search a string it can match just the tail of it; put a ^ in front if the whole text must be checked.

    [^-|/]*[^-|/|\.]$

Greedy quantifiers

Watch out for greedy quantifiers! It is important that you understand how the regex engine works, and especially how greedy quantifiers work, in order to avoid mistakes. In the example above, I used the expression <.+?>. If I omit the ? from this expression, I get <.+>. You would expect this to match the first <, then one or more characters, and then a closing >. Consider the example below:

    <u>Underlined text.</u>

If you apply an expression like:

    Regex objHTMLPattern = new Regex("<.+>");

you would expect the result to be <u>.
However, since the + is a greedy quantifier, the regex will match the whole string (i.e., <u>Underlined text.</u>). This is obviously not what we wanted! You might have expected the regex to match <u> and after that </u>. The greedy + causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail will the regex engine backtrack. The regex engine will thus find the first <, then match as many characters as possible (remember that to the dot, < and > are ordinary characters), and only then give characters back until it can match a closing >. The regex engine therefore matches everything from the starting < to the very last closing >. Not what we wanted, but the engine still did exactly what it was told to do.

If we, however, add the ? to make the + a lazy quantifier, as follows:

    Regex objHTMLPattern = new Regex("<.+?>");

we tell the regex engine to match any character, but as few as possible. This then yields a result of <u>, which is what we wanted.

Happy regex'ing :)

André Vold
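P.S. If you want to see the greedy/lazy difference for yourself, here is a minimal sketch of a console program that runs both expressions against the sample string from the greedy quantifier section. The class name GreedyVersusLazy and the helpers FirstMatch and AllMatches are just names I made up for this illustration; only Regex.Match and Regex.Matches come from System.Text.RegularExpressions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative, not part of .NET.
public static class GreedyVersusLazy
{
    // The sample string used in the greedy quantifier discussion.
    public const string Sample = "<u>Underlined text.</u>";

    // Returns the text of the first match, or an empty string if nothing matches.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    // Returns the text of every match, in the order the engine finds them.
    public static string[] AllMatches(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Greedy: + grabs as much as possible, so the match runs to the last >.
        Console.WriteLine(FirstMatch("<.+>", Sample));   // <u>Underlined text.</u>

        // Lazy: +? grabs as little as possible, so the match stops at the first >.
        Console.WriteLine(FirstMatch("<.+?>", Sample));  // <u>

        // Applied repeatedly, the lazy version finds each tag on its own.
        Console.WriteLine(string.Join(" ", AllMatches("<.+?>", Sample)));  // <u> </u>
    }
}
```

The only difference between the two patterns is the ? after the +, which is easy to overlook when reading a regex, so it is worth running a quick check like this whenever a match turns out longer than expected.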