How not to read a string from an UTF-8 stream
This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
- String comparisons are harder than it seems
- How to correctly count the number of characters of a string
- Correctly converting a character to lower/upper case
- How not to read a string from an UTF-8 stream (this post)
- Regex with IgnoreCase option may match more characters than expected
- How to remove diacritics from a string in .NET
This post is the result of a code review. The code I'll show is a simplified version of the original code, so the bug is easier to spot (or not).
The goal of the code is to read a UTF-8 encoded string from a stream. In the actual context, the stream is a named pipe and there are a few more things to do with the stream.
string ReadString(Stream stream)
{
    var sb = new StringBuilder();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        var s = Encoding.UTF8.GetString(buffer, 0, readCount);
        sb.Append(s);
    }
    return sb.ToString();
}
The problem is that, in some cases, the returned string differs from the string that was encoded. For instance, a smiley is sometimes decoded as 4 unknown characters:
Encoded string: 😊
Decoded string: ????
UTF-8 uses from 1 to 4 bytes to represent a Unicode character (more info about string encoding), but the Stream.Read method can return any number of bytes from 1 to buffer.Length, regardless of character boundaries. This means that the buffer may end with an incomplete UTF-8 character, in which case Encoding.UTF8.GetString receives an invalid UTF-8 sequence. As the method cannot guess the missing bytes, it replaces the bytes it cannot decode with the replacement character U+FFFD. The rest of the character arrives with the next read, and those bytes are invalid on their own too. Let's demo this behavior using the following code:
var bytes = Encoding.UTF8.GetBytes("😊");
// bytes = new byte[4] { 240, 159, 152, 138 }

var sb = new StringBuilder();

// Simulate reading the stream byte by byte
for (var i = 0; i < bytes.Length; i++)
{
    sb.Append(Encoding.UTF8.GetString(bytes, i, 1));
}

Console.WriteLine(sb.ToString());
// "????" (four U+FFFD replacement characters) instead of "😊"

Encoding.UTF8.GetBytes(sb.ToString());
// new byte[12] { 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189 }
// 239, 191, 189 is the UTF-8 encoding of U+FFFD: the original bytes are lost
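To make the "incomplete character" case more concrete, here is a small sketch that checks whether a chunk of bytes ends in the middle of a UTF-8 sequence. It uses Rune.DecodeFromUtf8 (available since .NET Core 3.0), which reports NeedMoreData for a truncated sequence. Slicing the array to its first 2 bytes is just a stand-in I chose to simulate a short read; it is not part of the reviewed code.

```csharp
using System;
using System.Buffers;
using System.Text;

var bytes = Encoding.UTF8.GetBytes("😊"); // 4 bytes: 240, 159, 152, 138

// Only the first 2 bytes are available, as could happen after a short Stream.Read
OperationStatus partial = Rune.DecodeFromUtf8(bytes.AsSpan(0, 2), out _, out _);
Console.WriteLine(partial); // NeedMoreData

// All 4 bytes are available: the character decodes correctly
OperationStatus complete = Rune.DecodeFromUtf8(bytes, out Rune rune, out int bytesConsumed);
Console.WriteLine($"{complete}, {bytesConsumed} bytes, U+{rune.Value:X}"); // Done, 4 bytes, U+1F60A
```

Encoding.UTF8.GetString has no way to report "need more data", which is why it falls back to replacement characters.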
# How to fix the code?
There are multiple ways to fix the code. One way is to convert the byte array to a string only when you have all the data:
string ReadString(Stream stream)
{
    using var ms = new MemoryStream();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        ms.Write(buffer, 0, readCount);
    }
    return Encoding.UTF8.GetString(ms.ToArray());
}
You could also wrap the stream in a StreamReader with the right encoding:
string ReadString(Stream stream)
{
    using var sr = new StreamReader(stream, Encoding.UTF8);
    return sr.ReadToEnd();
}
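One detail to keep in mind with the StreamReader version: disposing the reader also closes the underlying stream by default. If you still need the stream afterwards, a variant along these lines may help. This is only a sketch: the ReadStringLeaveOpen name and the 4096 buffer size are my own choices, and the MemoryStream stands in for the real named pipe.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical variant of ReadString that does not close the stream
static string ReadStringLeaveOpen(Stream stream)
{
    using var sr = new StreamReader(
        stream,
        Encoding.UTF8,
        detectEncodingFromByteOrderMarks: false,
        bufferSize: 4096,
        leaveOpen: true); // keep the underlying stream usable after the reader is disposed
    return sr.ReadToEnd();
}

using var ms = new MemoryStream(Encoding.UTF8.GetBytes("😊"));
var text = ReadStringLeaveOpen(ms);
Console.WriteLine(text == "😊"); // True
Console.WriteLine(ms.CanRead);   // True: the stream was not disposed
```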
You can also use the System.Text.Decoder class to correctly decode the characters from the buffer: a decoder keeps the trailing bytes of an incomplete character between calls. When performance matters, you may prefer using the PipeReader and Rune types to read the data in a memory-optimized way.
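As an illustration of the Decoder approach, here is a minimal sketch that keeps the original read loop but decodes each chunk with a stateful decoder. The bufferSize parameter is something I added so the demo can force 1-byte reads; the original code uses a fixed 4096-byte buffer.

```csharp
using System;
using System.IO;
using System.Text;

// Sketch of ReadString using a stateful decoder; bufferSize is a hypothetical parameter for the demo
static string ReadString(Stream stream, int bufferSize = 4096)
{
    var decoder = Encoding.UTF8.GetDecoder();
    var buffer = new byte[bufferSize];
    // Large enough for any chunk of bufferSize bytes
    var chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];
    var sb = new StringBuilder();

    int readCount;
    while ((readCount = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // flush: false => an incomplete trailing character is kept for the next call
        var charCount = decoder.GetChars(buffer, 0, readCount, chars, 0, flush: false);
        sb.Append(chars, 0, charCount);
    }

    // End of stream: flush so that a truly truncated character is not silently dropped
    var remaining = decoder.GetChars(buffer, 0, 0, chars, 0, flush: true);
    sb.Append(chars, 0, remaining);

    return sb.ToString();
}

// Force 1-byte reads to split the 4-byte smiley across several Read calls
using var ms = new MemoryStream(Encoding.UTF8.GetBytes("a😊b"));
Console.WriteLine(ReadString(ms, bufferSize: 1) == "a😊b"); // True
```

Compared to the MemoryStream fix, this version does not need to hold all the bytes in memory before decoding.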
# Additional resources
- Stream.Read documentation
- How to correctly count the number of characters of a string
- Example of bad code - Use ReadAsync in NamedPipeServerStream
- Example of bad code - Use async/await to read from the named pipe