How not to read a string from a UTF-8 stream

 
 
Gérald Barré

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

This post is the result of a code review. The code below is a simplified version of the original, making the bug easier to spot.

The goal is to read a UTF-8 encoded string from a stream. In the actual context, the stream is a named pipe, and there are additional operations performed on the stream.

C#
string ReadString(Stream stream)
{
    var sb = new StringBuilder();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        var s = Encoding.UTF8.GetString(buffer, 0, readCount);
        sb.Append(s);
    }

    return sb.ToString();
}

The problem is that the returned string may differ from the string that was originally encoded. For instance, a smiley is sometimes decoded as 4 replacement characters:

Encoded string: 😊
Decoded string: ????

UTF-8 uses from 1 to 4 bytes to represent a Unicode code point, but the Stream.Read method can return anywhere from 1 to buffer.Length bytes. Consequently, a read can stop in the middle of a multi-byte character, leaving the buffer with an incomplete UTF-8 sequence at its end (and the next read with the leftover bytes at its start). Encoding.UTF8.GetString decodes each buffer independently, so it sees those fragments as invalid UTF-8. Because it cannot determine the missing bytes, it replaces each invalid fragment with the replacement character U+FFFD, which many consoles display as "?". Let's demonstrate this behavior using the following code:

C#
var bytes = Encoding.UTF8.GetBytes("😊");
// bytes = new byte[4] { 240, 159, 152, 138 }

var sb = new StringBuilder();
// Simulate reading the stream byte by byte
for (var i = 0; i < bytes.Length; i++)
{
    sb.Append(Encoding.UTF8.GetString(bytes, i, 1));
}

Console.WriteLine(sb.ToString());
// "????" instead of "😊"

Encoding.UTF8.GetBytes(sb.ToString());
// new byte[12] { 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189 }
// That is 4 times U+FFFD (EF BF BD in UTF-8): the original bytes are lost

#How to fix the code?

There are multiple ways to fix this. One approach is to convert the byte array to a string only after receiving all the data:

C#
string ReadString(Stream stream)
{
    using var ms = new MemoryStream();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        ms.Write(buffer, 0, readCount);
    }

    return Encoding.UTF8.GetString(ms.ToArray());
}

Alternatively, you can wrap the stream in a StreamReader with the correct encoding:

C#
string ReadString(Stream stream)
{
    using var sr = new StreamReader(stream, Encoding.UTF8);
    return sr.ReadToEnd();
}
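
One thing to keep in mind with this approach: disposing a StreamReader also closes the underlying stream by default. If the stream is still needed afterwards, as may be the case with the named pipe mentioned earlier, the constructor overload with a leaveOpen parameter can help. The following is only a sketch; the class and method names are hypothetical, and disabling BOM detection is an assumption about what the caller wants:

```csharp
using System.IO;
using System.Text;

static class StreamReaderSketch
{
    // Hypothetical helper: same idea as ReadString above,
    // but the caller keeps ownership of the stream.
    public static string ReadStringLeaveOpen(Stream stream)
    {
        using var sr = new StreamReader(
            stream,
            Encoding.UTF8,
            detectEncodingFromByteOrderMarks: false, // assumption: the data is always UTF-8
            bufferSize: 4096,
            leaveOpen: true); // do not close the stream when the reader is disposed
        return sr.ReadToEnd();
    }
}
```

Note that ReadToEnd still consumes the stream until its end, so this only matters when the stream must stay open once the read has completed.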

You can also use the System.Text.Decoder class to correctly decode characters from the buffer. If performance is a concern, consider using PipeReader or Rune to read the data in a memory-optimized way.
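
As a minimal sketch of the Decoder approach, assuming the same ReadString shape as the earlier examples: a Decoder obtained from Encoding.UTF8.GetDecoder() is stateful, so it keeps the trailing bytes of an incomplete sequence and completes the character on the next call. The class name Utf8DecoderSketch and the bufferSize parameter are hypothetical and only there for illustration:

```csharp
using System.IO;
using System.Text;

static class Utf8DecoderSketch
{
    // bufferSize is a hypothetical parameter; a small value makes it easy
    // to force reads that split a multi-byte character.
    public static string ReadString(Stream stream, int bufferSize = 4096)
    {
        var sb = new StringBuilder();
        var decoder = Encoding.UTF8.GetDecoder();
        var buffer = new byte[bufferSize];
        // Worst-case number of chars one call can produce for bufferSize bytes
        var chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];

        int readCount;
        while ((readCount = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // flush: false keeps incomplete trailing bytes inside the decoder
            int charCount = decoder.GetChars(buffer, 0, readCount, chars, 0, flush: false);
            sb.Append(chars, 0, charCount);
        }

        // End of stream: flush so that a truly truncated sequence is reported
        // (as U+FFFD) instead of being silently dropped
        int tailCount = decoder.GetChars(buffer, 0, 0, chars, 0, flush: true);
        sb.Append(chars, 0, tailCount);

        return sb.ToString();
    }
}
```

Calling Utf8DecoderSketch.ReadString(stream, bufferSize: 1) on a stream containing the bytes of "😊" reproduces the byte-by-byte scenario from the demonstration above, but the smiley should come back intact because the decoder waits for the fourth byte before emitting any character.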
