Correctly converting a character to lower/upper case

02/08/2021

Gérald Barré

.NET

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

String comparisons are harder than it seems
How to correctly count the number of characters of a string
Correctly converting a character to lower/upper case (this post)
How not to read a string from an UTF-8 stream
Regex with IgnoreCase option may match more characters than expected
How to remove diacritics from a string in .NET

Strings are complicated! One thing I often see is people using char.IsUpper or char.ToUpper wrongly. For instance, they want to convert the first character of a string to the upper-case for display. The easy way to do it, which is the wrong way, is the following:

static string FirstCharacterToUpperCaseBad(string str)
{
    if(string.IsNullOrEmpty(str) || char.IsUpper(str[0]))
        return str;
    return char.ToUpperInvariant(str[0]) + str[1..];
}

This method works great for many strings. For instance, "abc" will correctly be changed to "Abc". However, Latin is not the only alphabet that exists. What if people want to use the Osage's alphabet? The character 𐓸 should become 𐓐 when converted to uppercase. However, FirstCharacterToUpperCaseBad("𐓸") returns the same string.

In .NET a string is a sequential read-only collection of char. A char represents a UTF-16 code unit. UTF-16 is one way to encode a Unicode character to bytes. The UTF-16 encoding is variable-length, as code points are encoded with one or two 16-bit code units.

The string "𐓸" is composed of 2 chars as it needs 2 UTF-16 code units to represent the character. This means that "𐓸".Length returns 2. You can see in the following screenshot how the characters a and 𐓸 are encoded in UTF-16:

source: https://tools.meziantou.net/string-info

When you use "𐓸"[0] to get the first character, you only get the first UTF-16 code unit of "𐓸". This means you only have half of the character… The machine cannot guess the other half of the character, so you cannot know if this is an uppercase character, nor how to change its casing. This is why char.ToUpperInvariant("𐓸"[0]) returns the same character.

The correct way to handle this case is by checking if the first character is a surrogate-pair (composed of 2 chars) and using these 2 chars to convert to uppercase. Instead of doing the hard-work manually using char.IsSurrogate, you can rely on the type Rune to handle the complexity for you:

static string FirstCharacterToUpperCase(string str)
{
    if(string.IsNullOrEmpty(str))
        return str;

    // Get the first Rune of the string
    var result = Rune.DecodeFromUtf16(str, out var rune, out var charsConsumed);

    // Check if the rune is uppercase
    if (result != OperationStatus.Done || Rune.IsUpper(rune))
        return str;

    // Convert the first rune to uppercase and concatenate it to the rest of the string
    return Rune.ToUpperInvariant(rune) + str[charsConsumed..];
}

You can now test this method with many different strings:

FirstCharacterToUpperCase("abc def");   // Abd def   (Latin)
FirstCharacterToUpperCase("𐓷𐓘𐓻𐓘𐓻𐓟 𐒻𐓟"); // 𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟 (Osage)
FirstCharacterToUpperCase("𐐿𐐱𐐻");       // 𐐗𐐱𐐻       (Deseret)
// etc. (U+10C80, U+118A0, U+16E40)

More generally, when working with unknown characters, you should consider using Rune instead of using chars.

Do you have a question or a suggestion about this post? Contact me!

Follow me:

Enjoy this blog?

💖 Sponsor on GitHub