Correctly converting a character to lower/upper case
This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
- String comparisons are harder than it seems
- How to correctly count the number of characters of a string
- Correctly converting a character to lower/upper case (this post)
- How not to read a string from an UTF-8 stream
- Regex with IgnoreCase option may match more characters than expected
- How to remove diacritics from a string in .NET
Strings are complicated! One thing I often see is people using char.IsUpper
or char.ToUpper
wrongly. For instance, they want to convert the first character of a string to the upper-case for display. The easy way to do it, which is the wrong way, is the following:
static string FirstCharacterToUpperCaseBad(string str)
{
if(string.IsNullOrEmpty(str) || char.IsUpper(str[0]))
return str;
return char.ToUpperInvariant(str[0]) + str[1..];
}
This method works great for many strings. For instance, "abc"
will correctly be changed to "Abc"
. However, Latin is not the only alphabet that exists. What if people want to use the Osage's alphabet? The character 𐓸
should become 𐓐
when converted to uppercase. However, FirstCharacterToUpperCaseBad("𐓸")
returns the same string.
In .NET a string is a sequential read-only collection of char
. A char
represents a UTF-16 code unit. UTF-16 is one way to encode a Unicode character to bytes. The UTF-16 encoding is variable-length, as code points are encoded with one or two 16-bit code units.
The string "𐓸"
is composed of 2 char
s as it needs 2 UTF-16 code units to represent the character. This means that "𐓸".Length
returns 2. You can see in the following screenshot how the characters a
and 𐓸
are encoded in UTF-16:
source: https://tools.meziantou.net/string-info
When you use "𐓸"[0]
to get the first character, you only get the first UTF-16 code unit of "𐓸". This means you only have half of the character… The machine cannot guess the other half of the character, so you cannot know if this is an uppercase character, nor how to change its casing. This is why char.ToUpperInvariant("𐓸"[0])
returns the same character.
The correct way to handle this case is by checking if the first character is a surrogate-pair (composed of 2 char
s) and using these 2 char
s to convert to uppercase. Instead of doing the hard-work manually using char.IsSurrogate
, you can rely on the type Rune
to handle the complexity for you:
static string FirstCharacterToUpperCase(string str)
{
if(string.IsNullOrEmpty(str))
return str;
// Get the first Rune of the string
var result = Rune.DecodeFromUtf16(str, out var rune, out var charsConsumed);
// Check if the rune is uppercase
if (result != OperationStatus.Done || Rune.IsUpper(rune))
return str;
// Convert the first rune to uppercase and concatenate it to the rest of the string
return Rune.ToUpperInvariant(rune) + str[charsConsumed..];
}
You can now test this method with many different strings:
FirstCharacterToUpperCase("abc def"); // Abd def (Latin)
FirstCharacterToUpperCase("𐓷𐓘𐓻𐓘𐓻𐓟 𐒻𐓟"); // 𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟 (Osage)
FirstCharacterToUpperCase("𐐿𐐱𐐻"); // 𐐗𐐱𐐻 (Deseret)
// etc. (U+10C80, U+118A0, U+16E40)
More generally, when working with unknown characters, you should consider using Rune
instead of using char
s.
Do you have a question or a suggestion about this post? Contact me!