How to remove diacritics from a string in .NET
This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
- String comparisons are harder than it seems
- How to correctly count the number of characters of a string
- Correctly converting a character to lower/upper case
- How not to read a string from an UTF-8 stream
- Regex with IgnoreCase option may match more characters than expected
- How to remove diacritics from a string in .NET (this post)
Diacritics are a way to add additional information to a character. They are used in many languages, such as French, Spanish, and German. A common usage of diacritics is to add an accent to a letter (e.g. é). In this post, I describe how to remove diacritics from a string in .NET.
In the post How to correctly count the number of characters of a string, I already wrote about diacritics, but let me give a quick reminder. A diacritic is a glyph added to a letter. For instance, the letter é is composed of the letter e and the acute accent ´. In Unicode, a diacritic can be encoded as a separate character from the base character. This means that the letter é can be written as 2 characters: e (U+0065 Latin Small Letter E) followed by the combining accent (U+0301 Combining Acute Accent). Note that the letter é can also be represented by the single character é (U+00E9 Latin Small Letter E with Acute).
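To make the two representations concrete, here is a minimal sketch that builds both forms of é with explicit escape sequences and compares them. The variable names are only illustrative.

```csharp
using System;
using System.Text;

// Two encodings of the same visible letter "é".
string precomposed = "\u00E9";  // single character: Latin Small Letter E with Acute
string decomposed = "e\u0301";  // base letter e + Combining Acute Accent

Console.WriteLine(precomposed.Length); // 1
Console.WriteLine(decomposed.Length);  // 2

// An ordinal comparison looks at the UTF-16 code units, so the strings differ.
Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // False

// Once both strings are normalized to the same form, they are identical.
bool sameAfterNormalization = string.Equals(
    precomposed.Normalize(NormalizationForm.FormC),
    decomposed.Normalize(NormalizationForm.FormC),
    StringComparison.Ordinal);
Console.WriteLine(sameAfterNormalization); // True
```

Both strings render the same way on screen, which is why this kind of mismatch is easy to miss.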
In .NET, you can convert a string from the composed form to the decomposed form using the Normalize method. The composed form (NFC) is the form where the diacritics are combined with the base character whenever a precomposed character exists. The decomposed form (NFD) is the form where the diacritics are separated from the base character. For instance, the composed form of the character é is U+00E9, and the decomposed form is U+0065 followed by U+0301. You can quickly see the difference by using the following code:
using System;
using System.Text;

EnumerateRune("é");
// é (00E9 LowercaseLetter)

EnumerateRune("é".Normalize(NormalizationForm.FormD));
// e (0065 LowercaseLetter)
// ' (0301 NonSpacingMark)

// Prints each Unicode scalar value of the string with its code point and category.
void EnumerateRune(string str)
{
    foreach (var rune in str.EnumerateRunes())
    {
        Console.WriteLine($"{rune} ({rune.Value:X4} {Rune.GetUnicodeCategory(rune)})");
    }
}
Now that you know how to convert a string to a decomposed form, you can remove the diacritics. The common algorithm to do it is the following:
- Normalize the string to Unicode Normalization Form D (NFD).
- Iterate over each character and keep only the characters that are not non-spacing marks (that is, keep the base characters and drop the combining marks)
- Concatenate the characters to get the final string
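using System.Globalization;
using System.Text;

public static string RemoveDiacritics(string input)
{
    // Decompose each accented character into its base character followed by combining marks.
    string normalized = input.Normalize(NormalizationForm.FormD);
    StringBuilder builder = new StringBuilder(normalized.Length);
    foreach (char c in normalized)
    {
        // Keep everything except the combining marks (the diacritics).
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }

    return builder.ToString();
}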
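The following sketch shows the algorithm on a few sample strings. The method is repeated as a local function so the snippet is self-contained, and the sample inputs are my own. It also illustrates a limitation: letters such as ø or Ł have no canonical decomposition in Unicode, so normalization does not split them and they pass through unchanged.

```csharp
using System;
using System.Globalization;
using System.Text;

Console.WriteLine(RemoveDiacritics("Crème brûlée")); // Creme brulee
Console.WriteLine(RemoveDiacritics("señor"));        // senor

// "ø" and "Ł" are not decomposed by FormD, so they are left as-is.
Console.WriteLine(RemoveDiacritics("Søren"));        // Søren
Console.WriteLine(RemoveDiacritics("Łódź"));         // Łodz

// Same algorithm as described above: decompose, then drop the non-spacing marks.
static string RemoveDiacritics(string input)
{
    string normalized = input.Normalize(NormalizationForm.FormD);
    var builder = new StringBuilder(normalized.Length);
    foreach (char c in normalized)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }

    return builder.ToString();
}
```

If you need those letters mapped to ASCII as well, you would have to add your own replacement table, which is outside what normalization provides.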
While removing diacritics is a common question on the internet, the main use case is comparing strings in an accent-insensitive way. In this case, you can use the CompareOptions.IgnoreNonSpace option of the String.Compare method instead of the previous method. This is faster and avoids errors due to the complexity of the Unicode standard.
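using System.Globalization;

public static bool AreEqualIgnoringAccents(string s1, string s2)
{
    // IgnoreNonSpace makes the comparison skip combining marks such as accents.
    return string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace) == 0;
}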
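Here is a short usage sketch of the same option. It uses CultureInfo.InvariantCulture so that the result does not depend on the machine's culture. It also assumes the application runs with the default globalization settings (in invariant globalization mode, culture-aware options such as IgnoreNonSpace may not be honored). The sample strings are my own.

```csharp
using System;
using System.Globalization;

CultureInfo culture = CultureInfo.InvariantCulture;

// Accent-insensitive equality without building a new string.
bool sameIgnoringAccents =
    string.Compare("résumé", "resume", culture, CompareOptions.IgnoreNonSpace) == 0;
Console.WriteLine(sameIgnoringAccents); // True (assuming default globalization settings)

// The default comparison still treats the accents as significant.
bool sameByDefault =
    string.Compare("résumé", "resume", culture, CompareOptions.None) == 0;
Console.WriteLine(sameByDefault); // False

// Options are flags, so they can be combined, for example with IgnoreCase.
CompareOptions options = CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;
bool sameIgnoringAccentsAndCase =
    string.Compare("RÉSUMÉ", "resume", culture, options) == 0;
Console.WriteLine(sameIgnoringAccentsAndCase); // True (same assumption)
```

The same CompareOptions value is accepted by other culture-aware APIs such as CompareInfo.IndexOf, which is useful when you want to search rather than compare.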