This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
This post is the result of a code review. The code below is a simplified version of the original, making the bug easier to spot.
The goal is to read a UTF-8 encoded string from a stream. In the actual context, the stream is a named pipe, and there are additional operations performed on the stream.
C#
string ReadString(Stream stream)
{
    var sb = new StringBuilder();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        var s = Encoding.UTF8.GetString(buffer, 0, readCount);
        sb.Append(s);
    }
    return sb.ToString();
}
The problem is that the returned string may differ from the encoded string. For instance, a smiley is sometimes decoded as 4 unknown characters:
Encoded string: 😊
Decoded string: ????
UTF-8 can use from 1 to 4 bytes to represent a Unicode character (more info about string encoding), but the Stream.Read method can return from 1 to messageBuffer.Length bytes. Consequently, the buffer may contain an incomplete UTF-8 character. This means that Encoding.UTF8.GetString may attempt to convert an invalid UTF-8 sequence if the last character in the buffer is incomplete. In this case, the method returns a string with replacement characters () because it cannot determine the missing bytes. Let's demonstrate this behavior using the following code:
C#
var bytes = Encoding.UTF8.GetBytes("😊");
// bytes = new byte[4] { 240, 159, 152, 138 }

var sb = new StringBuilder();
// Simulate reading the stream byte by byte
for (var i = 0; i < bytes.Length; i++)
{
    sb.Append(Encoding.UTF8.GetString(bytes, i, 1));
}

Console.WriteLine(sb.ToString());
// "????" instead of "😊"

Encoding.UTF8.GetBytes(sb.ToString());
// new byte[12] { 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189 }
#How to fix the code?
There are multiple ways to fix this. One approach is to convert the byte array to a string only after receiving all the data:
C#
string ReadString(Stream stream)
{
    using var ms = new MemoryStream();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        ms.Write(buffer, 0, readCount);
    }
    return Encoding.UTF8.GetString(ms.ToArray());
}
Alternatively, you can wrap the stream in a StreamReader with the correct encoding:
C#
string ReadString(Stream stream)
{
    using var sr = new StreamReader(stream, Encoding.UTF8);
    return sr.ReadToEnd();
}
You can also use the System.Text.Decoder class to correctly decode characters from the buffer. If performance is a concern, consider using PipeReader or Rune to read the data in a memory-optimized way.
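To illustrate the Decoder option, here is a minimal sketch rather than a definitive implementation. It relies on the fact that a Decoder returned by Encoding.UTF8.GetDecoder() remembers incomplete trailing bytes between calls, so a character split across two reads is reassembled. The name ReadStringWithDecoder and the bufferSize parameter are hypothetical and only exist for this example; the buffer size of 1 simulates a stream that returns a single byte per read.

```csharp
using System;
using System.IO;
using System.Text;

// Simulate the worst case: every Read call returns a single byte
var input = new MemoryStream(Encoding.UTF8.GetBytes("a😊é"));
var decoded = ReadStringWithDecoder(input, bufferSize: 1);
Console.WriteLine(decoded == "a😊é"); // True

// Hypothetical helper: decodes the stream chunk by chunk with a stateful Decoder
static string ReadStringWithDecoder(Stream stream, int bufferSize = 4096)
{
    var sb = new StringBuilder();
    var decoder = Encoding.UTF8.GetDecoder();
    var buffer = new byte[bufferSize];
    // Worst-case number of chars produced when decoding bufferSize bytes
    var chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];

    int readCount;
    while ((readCount = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // flush: false keeps an incomplete trailing sequence for the next call
        var charCount = decoder.GetChars(buffer, 0, readCount, chars, 0, flush: false);
        sb.Append(chars, 0, charCount);
    }

    // End of stream: flush so that leftover incomplete bytes are not silently dropped
    // (they are decoded as replacement characters)
    var remaining = decoder.GetChars(Array.Empty<byte>(), 0, 0, chars, 0, flush: true);
    sb.Append(chars, 0, remaining);

    return sb.ToString();
}
```

Compared to the MemoryStream approach, this does not need to hold the whole encoded payload in memory before decoding, which may matter for large messages.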