Regex with IgnoreCase option may match more characters than expected
This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
- String comparisons are harder than it seems
- How to correctly count the number of characters of a string
- Correctly converting a character to lower/upper case
- How not to read a string from an UTF-8 stream
- Regex with IgnoreCase option may match more characters than expected (this post)
- How to remove diacritics from a string in .NET
In a previous post, I explained why \d is different from [0-9]. In this post, I'll explain why the regex [a-zA-Z] is different from the regex [a-z] with the IgnoreCase option.
var regex1 = new Regex("^[a-zA-Z]+$");
var regex2 = new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
Console.WriteLine(regex1.IsMatch("Test")); // true
Console.WriteLine(regex2.IsMatch("Test")); // true
However, you get different results if you use the Kelvin sign:
// Kelvin Sign (U+212A)
Console.WriteLine(regex1.IsMatch("\u212A")); // false
Console.WriteLine(regex2.IsMatch("\u212A")); // true
When a regular expression specifies the RegexOptions.IgnoreCase option, comparisons between the input and the pattern are case-insensitive. To support this, Regex needs to define which case mappings are used for the comparisons. A case mapping exists between two characters 'A' and 'B' when either 'A' is the ToLower() representation of 'B', or 'A' and 'B' both lowercase to the same character.
In this case, char.ToLowerInvariant('\u212A') (Kelvin Sign) is 'k' (Latin Small Letter K). So, when using the IgnoreCase regex option, [a-z] matches the Kelvin Sign. However, the Kelvin Sign is not part of the [a-zA-Z] set, which is why the regex [a-zA-Z] does not match it.
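The case-mapping rule described above can be approximated in a few lines. This is only a sketch: HaveCaseMapping is a hypothetical helper, not part of .NET, and it assumes invariant-culture lowercasing rather than the casing table Regex actually uses internally.

```csharp
using System;

// Hypothetical helper (not a .NET API): two characters have a case mapping
// when one is the lowercase form of the other, or when both lowercase to
// the same character. Assumes invariant-culture lowercasing.
static bool HaveCaseMapping(char a, char b) =>
    a == char.ToLowerInvariant(b)
    || b == char.ToLowerInvariant(a)
    || char.ToLowerInvariant(a) == char.ToLowerInvariant(b);

const char KelvinSign = '\u212A';

Console.WriteLine(HaveCaseMapping(KelvinSign, 'k')); // True
Console.WriteLine(HaveCaseMapping(KelvinSign, 'K')); // True: both lowercase to 'k'
Console.WriteLine(HaveCaseMapping('a', 'b'));        // False

// The mapping is one-way: uppercasing 'k' gives the ASCII 'K', not the Kelvin Sign.
Console.WriteLine(char.ToUpperInvariant('k') == KelvinSign); // False
```

The last line shows why the Kelvin Sign is easy to overlook: nothing in [a-z] uppercases to it, yet it lowercases into that range.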
To conclude, the following regular expressions are equivalent:
new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
new Regex("^[A-Za-z\u212A]+$");
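If you want to check this equivalence on your own runtime, one possible approach is to run both patterns against every UTF-16 code unit and report any disagreement. The variable names below are mine, and the result may depend on the .NET version you run it on, so treat this as a quick experiment rather than a proof.

```csharp
using System;
using System.Text.RegularExpressions;

var ignoreCase = new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
var expanded = new Regex("^[A-Za-z\u212A]+$");

// Compare both patterns on every single UTF-16 code unit (U+0000 to U+FFFF).
var mismatches = 0;
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
    var input = ((char)i).ToString();
    if (ignoreCase.IsMatch(input) != expanded.IsMatch(input))
    {
        mismatches++;
        Console.WriteLine($"U+{i:X4} is matched by only one of the two patterns");
    }
}

Console.WriteLine($"{mismatches} mismatch(es) found");
```

This only tests single-character inputs, which is enough here because both patterns are a single character class repeated one or more times.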
If you are curious about how .NET knows which case mappings to use, you can read the code of the GenerateRegexCasingTable tool on GitHub.