How not to read a string from a UTF-8 stream

09/06/2021

Gérald Barré

.NET

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

String comparisons are harder than it seems
How to correctly count the number of characters of a string
Correctly converting a character to lower/upper case
How not to read a string from a UTF-8 stream (this post)
Regex with IgnoreCase option may match more characters than expected
How to remove diacritics from a string in .NET

This post is the result of a code review. The code I'll show is a simplified version of the original code, so the bug is easier to spot (or not).

The goal of the code is to read a UTF-8 encoded string from a stream. In the actual context, the stream is a named pipe and there are a few more things to do with the stream.

string ReadString(Stream stream)
{
    var sb = new StringBuilder();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        var s = Encoding.UTF8.GetString(buffer, 0, readCount);
        sb.Append(s);
    }

    return sb.ToString();
}

The problem is that in some cases the returned string is different from the encoded string. For instance, a smiley is sometimes decoded as 4 unknown characters:

Encoded string: 😊
Decoded string: ????

UTF-8 uses 1 to 4 bytes to represent a Unicode code point (more info about string encoding), but while data remains, Stream.Read can return anywhere from 1 to buffer.Length bytes per call. The buffer may therefore end in the middle of a multi-byte sequence, with the remaining bytes only arriving in the next read. Encoding.UTF8.GetString is stateless and cannot guess the missing bytes, so it replaces the incomplete sequence with the replacement character U+FFFD (often displayed as ?), and the leftover bytes at the start of the next buffer are invalid on their own too. Let's demo this behavior using the following code:

var bytes = Encoding.UTF8.GetBytes("😊");
// bytes = new byte[4] { 240, 159, 152, 138 }

var sb = new StringBuilder();
// Simulate reading the stream byte by byte
for (var i = 0; i < bytes.Length; i++)
{
    sb.Append(Encoding.UTF8.GetString(bytes, i, 1));
}

Console.WriteLine(sb.ToString());
// "????" instead of "😊"

Encoding.UTF8.GetBytes(sb.ToString());
// new byte[12] { 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189 }
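
Because the default Encoding.UTF8 instance replaces invalid bytes silently, this kind of bug can go unnoticed for a long time. One way to surface it during testing is to decode with a strict UTF8Encoding instance that throws on invalid input. The following is a minimal sketch; the StrictUtf8Demo class and TryDecode method are hypothetical names used only for illustration:

```csharp
using System;
using System.Text;

static class StrictUtf8Demo
{
    // throwOnInvalidBytes: true makes GetString throw instead of emitting U+FFFD
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when the given byte range is well-formed UTF-8
    public static bool TryDecode(byte[] bytes, int index, int count, out string result)
    {
        try
        {
            result = StrictUtf8.GetString(bytes, index, count);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The range contains an invalid or incomplete sequence
            result = string.Empty;
            return false;
        }
    }
}
```

With the 4 bytes of "😊", decoding only the first byte should fail, while decoding the whole array should succeed. This only detects the problem; it does not fix the chunked read itself.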

#How to fix the code?

There are multiple ways to fix the code. One way is to convert the byte array to a string only when you have all the data:

string ReadString(Stream stream)
{
    using var ms = new MemoryStream();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        ms.Write(buffer, 0, readCount);
    }

    return Encoding.UTF8.GetString(ms.ToArray());
}

You could also wrap the stream into a StreamReader with the right encoding:

string ReadString(Stream stream)
{
    using var sr = new StreamReader(stream, Encoding.UTF8);
    return sr.ReadToEnd();
}
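
Note that disposing the StreamReader also disposes the underlying stream. If the stream is still needed afterwards (as with the named pipe in the original context), the constructor overload with a leaveOpen argument may be a better fit. A minimal sketch, where the PipeTextReader class name and the 4096 buffer size are arbitrary choices for this example:

```csharp
using System.IO;
using System.Text;

static class PipeTextReader
{
    public static string ReadString(Stream stream)
    {
        // leaveOpen: true keeps the underlying stream usable after the reader is disposed
        using var sr = new StreamReader(
            stream,
            Encoding.UTF8,
            detectEncodingFromByteOrderMarks: true,
            bufferSize: 4096,
            leaveOpen: true);
        return sr.ReadToEnd();
    }
}
```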

You can also use the System.Text.Decoder class to correctly decode the characters from the buffer: a Decoder is stateful and keeps the trailing bytes of an incomplete character until the next call. When performance matters, you may prefer using the PipeReader and Rune types to read the data in a memory-optimized way.
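
The System.Text.Decoder approach can be sketched as follows. This is a minimal example, not the code from the review: the Utf8StreamReading class name and the bufferSize parameter are assumptions added so that small reads are easy to simulate.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8StreamReading
{
    public static string ReadString(Stream stream, int bufferSize = 4096)
    {
        // The decoder is stateful: it remembers the bytes of an incomplete character
        var decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        var bytes = new byte[bufferSize];
        // Worst-case number of chars produced by one full byte buffer
        var chars = new char[Encoding.UTF8.GetMaxCharCount(bytes.Length)];

        int readCount;
        while ((readCount = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            // flush: false tells the decoder that more bytes may follow
            var charCount = decoder.GetChars(bytes, 0, readCount, chars, 0, flush: false);
            sb.Append(chars, 0, charCount);
        }

        // End of stream: flush so any leftover incomplete sequence goes through the decoder's fallback
        var tailCount = decoder.GetChars(Array.Empty<byte>(), 0, 0, chars, 0, flush: true);
        sb.Append(chars, 0, tailCount);
        return sb.ToString();
    }
}
```

Calling Utf8StreamReading.ReadString(new MemoryStream(Encoding.UTF8.GetBytes("😊")), bufferSize: 1) reproduces the byte-by-byte scenario from the demo, and should return the original string because the decoder only emits the character once all 4 bytes have arrived.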
