String comparisons are harder than it seems
This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
- String comparisons are harder than it seems (this post)
- How to correctly count the number of characters of a string
- Correctly converting a character to lower/upper case
- How not to read a string from an UTF-8 stream
- Regex with IgnoreCase option may match more characters than expected
- How to remove diacritics from a string in .NET
Comparing strings is different from comparing numbers. 2 numbers are equal if their values are identical. For instance, 1 is equal to 1, and 1 is not equal to 2. That's trivial. When it comes to strings, things are different. For instance, do you want a case-sensitive comparison? What about the different ways to write the same letter? For instance, the letter ß
is common in German, but it's also possible to write it ss
as it is easier on lots of keyboards.
#Comparison modes
In .NET there are 6 ways to compare strings:
Ordinal:
Performs a simple byte comparison that is independent of language. This is most appropriate when comparing strings that are generated programmatically or when comparing case-sensitive resources such as passwords.
OrdinalIgnoreCase:
Treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language. This is most appropriate when comparing strings that are generated programmatically or when comparing case-insensitive resources such as paths and filenames.
InvariantCulture:
Compares strings in a linguistically relevant manner, but it is not suitable for display in any particular culture. Its major application is to order strings in a way that will be identical across cultures.
InvariantCultureIgnoreCase:
Compares strings in a linguistically relevant manner that ignores case, but it is not suitable for display in any particular culture. Its major application is to order strings in a way that will be identical across cultures.
CurrentCulture:
Can be used when strings are linguistically relevant. For example, if strings are displayed to the user, or if strings are the result of user interaction, culture-sensitive string comparison should be used to order the string data.
CurrentCultureIgnoreCase:
Can be used when strings are linguistically relevant but their case is not. For example, if strings are displayed to the user but case is unimportant, culture-sensitive, case-insensitive string comparison should be used to order the string data.
It's important to explicitly specify the comparison mode to avoid unexpected behaviors, or even worse, security issues. For instance, if you compare 2 passwords using the current culture or the invariant culture, the passwords may be equal whereas they may be different!
#Examples
It's easy to understand the difference between Ordinal
and OrdinalIgnoreCase
. However, it may be harder to understand what a culture-specific comparison means. So, here are some examples that show some differences depending on the comparison used.
Let's start with some basic comparisons using Ordinal
and OrdinalIgnoreCase
:
// Basic comparisons
string.Equals("a", "a", StringComparison.Ordinal); // true
string.Equals("a", "A", StringComparison.Ordinal); // false
string.Equals("a", "a", StringComparison.OrdinalIgnoreCase); // true
string.Equals("a", "A", StringComparison.OrdinalIgnoreCase); // true
Now we can try with a non-obvious character. The German character ß
(Eszett) can also be written ss
, so they can be considered as equivalent as you can see in the following examples:
string.Equals("ss", "ß", StringComparison.OrdinalIgnoreCase); // false
string.Equals("ss", "ß", StringComparison.InvariantCulture); // true on Windows / false on Linux (WSL)
string.Equals("ss", "ß", StringComparison.InvariantCultureIgnoreCase); // true on Windows / false on Linux (WSL)
Similar differences may apply with ligature:
string.Contains("encyclopædia", "ae", StringComparison.Ordinal); // false
string.Contains("encyclopædia", "ae", StringComparison.InvariantCulture); // true
Another tricky character is the character i
in the Turkish language. In Latin languages, there is only one i
. In Turkish, there are 2: the dotless ı
and the dotted i
. There are also 2 different uppercase characters: I
and İ
. You'll find more details in the 2 following posts:
- Internationalization for Turkish: Dotted and Dotless Letter "I"
- What's Wrong With Turkey?
- Dotted and dotless I
Console.WriteLine(string.Equals("ı", "I", StringComparison.OrdinalIgnoreCase)); // false
Console.WriteLine(string.Equals("ı", "İ", StringComparison.OrdinalIgnoreCase)); // false
Console.WriteLine(string.Equals("i", "I", StringComparison.OrdinalIgnoreCase)); // true
Console.WriteLine(string.Equals("i", "İ", StringComparison.OrdinalIgnoreCase)); // false
CultureInfo.CurrentCulture = new CultureInfo("en-US");
string.Equals("i", "I", StringComparison.CurrentCultureIgnoreCase); // true
string.Equals("i", "İ", StringComparison.CurrentCultureIgnoreCase); // false
CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
string.Equals("i", "I", StringComparison.CurrentCultureIgnoreCase); // false
string.Equals("i", "İ", StringComparison.CurrentCultureIgnoreCase); // true
To my mind, the strangest culture is Thai culture. It doesn't contain the symbol dot (.
), so comparisons are very funky!
// The thai culture doesn't contain '.', so comparisons can be a little bit strange (tested on Ubuntu 18.04):
CultureInfo.CurrentCulture = new CultureInfo("th_TH.UTF8");
"Test".StartsWith(".", StringComparison.CurrentCulture); // true!!!
"12.4".IndexOf(".", StringComparison.CurrentCulture); // 0
Ok, the dot was nice, but you can continue with numbers. When comparing numbers, Thai digits are converted to Arabic digits (0 to 9 in Thai: ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙) as explained in the Unicode documentation
CultureInfo.CurrentCulture = new CultureInfo("th_TH.UTF8");
string.Equals("1", "๑", StringComparison.CurrentCulture); // true
string.Equals("1", "๑", StringComparison.InvariantCultureIgnoreCase); // false
\0
and a few characters are ignored in linguistic comparisons:
string.Equals("a\0b", "ab", StringComparison.Ordinal); // false
string.Equals("a\0b", "ab", StringComparison.InvariantCulture); // true
string.Equals("a\0b", "ab", StringComparison.CurrentCulture); // true
string.Equals("A꙰B", "AB", StringComparison.Ordinal); // false
string.Equals("A꙰B", "AB", StringComparison.InvariantCulture); // true
The normalization of the strings can also change the results between the ordinal and non-ordinal comparison modes. In the following example, the first string contains only 1 character: é
. The second one contains 2 characters: e
and \u0301
(Combining Acute Accent). The second string is the normalization in form D of the first string.
"é".Normalize(System.Text.NormalizationForm.FormD); // e\u0301
string.Equals("é", "e\u0301", StringComparison.Ordinal); // false
string.Equals("é", "e\u0301", StringComparison.InvariantCulture); // true
string.Equals("é", "e\u0301", StringComparison.CurrentCulture); // true
The culture-sensitive comparison compares grapheme clusters. For instance, the string A\r\nB
is split into A
, \r\n
, and B
. This means that \n
is not part of this string. But, \r
or \r\n
are part of the string.
"A\r\nB".Contains("\n", StringComparison.Ordinal); // True
"A\r\nB".Contains("\n", StringComparison.InvariantCulture); // False
"A\r\nB".Contains("\r\n", StringComparison.InvariantCulture); // True
"A\r\nB".Contains("\r", StringComparison.InvariantCulture); // True
// Similar issue
CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("en-US");
"\n\r\nTest".IndexOf("\nTest", StringComparison.CurrentCulture); // -1
"\n\r\nTest".Contains("\nTest", StringComparison.CurrentCulture); // False
The same may apply to other cultures. For instance, in Hungarian dz
is considered a single letter:
CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("hu-HU");
"endz".Contains("z", StringComparison.CurrentCulture); // False
"endz".Contains("d", StringComparison.CurrentCulture); // False
"endz".Contains("dz", StringComparison.CurrentCulture); // True
"d".Contains("d", StringComparison.CurrentCulture); // True
"z".Contains("z", StringComparison.CurrentCulture); // True
I hope you now understand how important it is to specify the way you want to compare strings.
##string.Equals with CurrentCulture vs ToUpper/ToLower and ==
Comparing strings using ToUpper
or ToLower
and ==
is not the same as using StringComparison.CurrentCultureIgnoreCase
. The ToUpper
and ToLower
methods use the current culture to change the case, then ==
performs an ordinal comparison. Using StringComparison.CurrentCultureIgnoreCase
applies linguistic-comparison rules. Here's an example that shows the difference:
var a = "a\0b";
var b = "ab";
Console.WriteLine(a.ToUpper() == b.ToUpper()); // False
Console.WriteLine(string.Equals(a, b, StringComparison.CurrentCultureIgnoreCase)); // True
#Methods have different default comparison modes
You should always specify explicitly the comparer as the default values are not consistent. For instance, string.IndexOf
uses the current culture whereas string.Equals
use Ordinal. So, the basic rule is to always use an overload of a method that has a StringComparison
, IEqualityComparer<string>
, or IComparer<string>
when possible.
// Inconsistent result
"encyclopædia".Contains("ae"); // False because it uses Ordinal by default
"encyclopædie".IndexOf("ae"); // '8' because it uses the Current culture
// To get a consistent result, you must set the StringComparison argument
var stringComparison = StringComparison.Ordinal;
"encyclopædia".Contains("ae", stringComparison); // False
"encyclopædie".IndexOf("ae", stringComparison); // -1
This also applies to Equals
and CompareTo
(and all the operators ==
, <
, >
, etc.):
"encyclopædia".Equals("encyclopaedia"); // False
"encyclopædia".CompareTo("encyclopaedia"); // 0 (Equals)
This previous change also implies that Comparer<string>.Default
is culture-sensitive, whereas the default EqualityComparer<string>.Default
uses the ordinal comparison. You can see the difference when you use Dictionary<string, T>
and SortedList<string, T>
// The dictionary contains 2 items
var dict = new Dictionary<string, int>()
{
{ "encyclopædia", 0 },
{ "encyclopaedia", 0 },
};
// System.ArgumentException: An entry with the same key already exists.
var list = new SortedList<string, int>()
{
{ "encyclopædia", 0 },
{ "encyclopaedia", 0 }, // The key is duplicated
};
#Implication of the Globalization Invariant mode in comparisons
The globalization invariant mode enables you to remove application dependencies on globalization data and globalization behavior. This mode is an opt-in feature that provides more flexibility if you care more about reducing dependencies and the size of distribution than globalization functionality or globalization-correctness.
Invariant globalization modifies the way OrdinalIgnoreCase
works by limiting the string casing operation to the ASCII range only.
// "𐓸" U+104F8, "𐓐" U+104D0 (both are non-ascii character)
// InvariantGlobalization = false
string.Equals("𐓸", "𐓐", StringComparison.OrdinalIgnoreCase); // true
string.Equals("𐓸", "𐓐", StringComparison.InvariantCultureIgnoreCase); // true
// InvariantGlobalization = true
string.Equals("𐓸", "𐓐", StringComparison.OrdinalIgnoreCase); // false
string.Equals("𐓸", "𐓐", StringComparison.InvariantCultureIgnoreCase); // false
You can detect if the application run in invariant mode by using the AppContext.TryGetSwitch("System.Globalization.Invariant", out bool isEnabled)
. This is a good approximation, but if you want to be more precise, you can use the code from this post: How to detect Globalization-Invariant mode in .NET?.
#NLS vs ICU
In the past, the .NET globalization APIs used different underlying libraries on different platforms. On Unix, the APIs used International Components for Unicode (ICU), and on Windows, they used National Language Support (NLS). This resulted in some behavioral differences in a handful of globalization APIs when running applications on different platforms. .NET 5.0 introduces a runtime behavioral change where globalization APIs now use ICU by default across all supported platforms. This enables applications to avoid differences across platforms.
Note that ICU is promoted by Windows too. ICU dlls are shipped with Windows since Windows 10 version 1703. Many applications already started using ICU on Windows. So the migration to use ICU is happening everywhere and not just in .NET.
Windows | Linux | |
---|---|---|
.NET 1.0 - 4.8 | NLS | - |
NET Core 1.0 - 3.1 | NLS | ICU |
.NET 5.0 | ICU1 | ICU |
1 It's possible to force usage of NLS using a flag. If the version of Windows doesn't have ICU available, it automatically fallbacks to NLS.
NLS and ICU have different behavior in some edge cases. Also, a new version of the library may introduce behavior changes. Starting with .NET 5.0, you can fix the version of ICU used by your application to avoid undesired changes.
"a\r\nb".Contains("\n", StringComparison.InvariantCulture); // NLS: True, ICU (.NET 5): False, ICU (.NET 6): True
string.Equals("AE", "ae", StringComparison.InvariantCultureIgnoreCase); // NLS: True, ICU: False
CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("mi-NZ");
"/*".StartsWith("--", StringComparison.CurrentCulture); // NLS: True, ICU: False
Note that changing from NLS to ICU is not a breaking change. The globalization data changes happen even with NLS and these data should not be considered stable. Also, Windows is converging to CLDR which means there may be changes in the future in NLS.
Starting with .NET 5.0, you can ensure consistency across all deployments by using an app-local ICU. In this case, you can pin the ICU version used by your application instead of using the version provided by .NET.
#Where to specify the comparison mode
Almost all methods that work with strings have a parameter to specify the comparison mode. There are 2 ways, using StringComparison
or an IEqualityComparer<string>
.
// String methods
"" == ""; // Should be string.Equals("", "", StringComparison.Ordinal);
string.Equals("", "", StringComparison.Ordinal);
string.Compare("", "", StringComparison.Ordinal);
"".Contains("", StringComparison.CurrentCulture);
"".EndsWith("", StringComparison.CurrentCulture);
"".Equals("", StringComparison.Ordinal);
"".IndexOf("", StringComparison.CurrentCulture);
"".Replace("", "", StringComparison.CurrentCulture);
"".StartsWith("", StringComparison.CurrentCulture);
"".ToLower(CultureInfo.CurrentCulture);
"".ToUpper(CultureInfo.CurrentCulture);
// Enumerable extensions
new [] { "" }.Contains("", StringComparer.Ordinal);
new [] { "" }.Distinct(StringComparer.Ordinal);
new [] { "" }.GroupBy(x => x, StringComparer.Ordinal);
new [] { "" }.OrderBy(x => x, StringComparer.Ordinal);
// ...
// HashSet and Dictionary constructors
new HashSet<string>(StringComparer.Ordinal);
new Dictionary<string, object>(StringComparer.Ordinal);
new ConcurrentDictionary<string, object>(StringComparer.Ordinal);
new SortedList<string, object>(StringComparer.CurrentCulture);
// GetHashCode
"".GetHashCode(StringComparison.Ordinal); // .NET Core 2.0+
StringComparer.Ordinal.GetHashCode("");
#Getting warnings in the IDE using a Roslyn Analyzer
You can check the usage of these methods in your applications using a Roslyn analyzer. The good news is the free analyzer I've made already contains rules for that: https://github.com/meziantou/Meziantou.Analyzer. It was the first rule I've created because it's so easy to forget them, especially for junior developers.
- MA0001 - Use StringComparison parameter when possible
- MA0002 - Use an overload that has a
IEqualityComparer<string>
parameter - MA0006 - Use string.Equals instead of
==
or!=
- MA0020 - Use
StringComparer.GetHashCode
instead ofstring.GetHashCode
- MA0024 - Do not use
EqualityComparer<string>.Default
You can install the Visual Studio extension or the NuGet package to analyze your code:
#Additional resources
- How to compare strings in C#
- Best practices for comparing strings in .NET
- .NET globalization and ICU
- Behavior changes when comparing strings on .NET 5+
- .NET Core Globalization Invariant Mode
- Breaking change with string.IndexOf(string) from .NET Core 3.0 → .NET 5.0
List of issues that report problems with comparisons:
- C# “anyString”.Contains('\0', StringComparison.InvariantCulture) returns true in .NET5 but false in older versions
- Why indexof search text not correct?
- .net 5.0 seems to be wrong in detecting zero-terminzted string
- .NET 5.0 - Non-empty string is equal to empty string
- Strange Contains and IndexOf handling of "\0" in .NET 5.0
- The result of full-width string comparison with InvariantCultureIgnoreCase has changed in .NET 5
- StringComparison.InvariantCulture behavior change between .NET Core 3.1 and .NET 5
- CompareInfo.LastIndexOf does not handle zero-weight sequences correctly on ICU
- SortedDictionary Behaves Differently Across .NET Core Platforms
- Breaking change with string.IndexOf(string) from .NET Core 3.0 → .NET 5.0
- \u007F is ignored for non-Ordinal comparisons in StartsWith, EndsWith and Contains on Linux and Mac but not Windows
- String.CompareTo Behaviour different between WIN10 & Linux
- InvariantCulture string comparison is inconsistent between Windows and macOS
- CompareInfo.IndexOf() returns different values on Linux and Windows
- Same culture behaves inconsistently across Windows and Linux
- String.IndexOf and String.LastIndexOf do not behave consistently across OS platforms
- string.StartsWith("--") returns true for "/*" in case of mi and mi-NZ culture
- .NET 5 RC1 breaks string CompareTo,
- IndexOf result
- String.IndexOf("\0") returns always 0 in .NET 5.0
- NET5.0 vs NETCore3.1 - Strings comparison between space and non-breaking-space characters
- String.IndexOf("\0") Zero
- Fix the string search behavior when using ICU
- string.IndexOf() incorrect return value for full stop for Thai culture
- String.IndexOf("'") returns incorrect value in Thai culture
- String.EndsWith("\0") behavior difference between .NET7 and .NET Framework
- Unexpected result for string.IndexOf((char)0, StringComparison.InvariantCultureIgnoreCase)
- string.IndexOf() on Thai Culture , Windows11, 64 bits, .net5,6,7,8, return incorrect index
- String.IndexOf has incorrectly behavier in Thai(Thaiand) region
Do you have a question or a suggestion about this post? Contact me!