Split a string into lines without any allocation
It's very common to split a string into lines. You can write something like that:
var str = "Nickname: meziantou\r\nName: Gérald Barré";
var separators = new [] { '\r', '\n' };
var lines = str.Split(separators, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
{
// - Allocate 1 string per line + 1 array
// - The Split method may allocate
}
This code create one string per line and allocate one array. So, there are lots of allocations. Also, it splits the whole string even if you only need a few lines. You can also use a StringReader
:
var reader = new StringReader(str);
string line;
while ((line = reader.ReadLine()) != null)
{
// - Allocate 1 string per line
// - The ReadLine method may allocate
}
But this still allocates one StringReader
and one string per line.
In the recent .NET release, you can take advantage of ReadOnlySpan<char>
to avoid some allocations when working with strings. My goal is to create a method SplitLines
that doesn't allocate and can be used this way:
foreach (ReadOnlySpan<char> line in str.SplitLines())
{
// whatever, but without any allocation 😊
}
Before showing the code, here're some important C# / .NET notions to understand.
##Making code compatible with the foreach operator
The foreach
statement is not limited to IEnumerable<T>
and can be applied to an instance of any type that satisfies the following conditions:
- has the public parameterless
GetEnumerator
method whose return type is either class, struct, or interface type, - the return type of the
GetEnumerator
method has the publicCurrent
property and the public parameterlessMoveNext
method whose return type isBoolean
.
var enumerable = new List<int> { 1, 2, 3 };
foreach(var item in enumerable)
{
Console.WriteLine(item);
}
// The foreach operator is rewritten by the compiler as the following code
var enumerator = enumerable.GetEnumerator();
while (enumerator.MoveNext())
{
var item = enumerator.Current;
Console.WriteLine(item);
}
That means that the GetEnumerator
method can return a struct that has a bool MoveNext()
method and a Current
property. This way you can avoid one allocation. Indeed, the result doesn't need to be casted to IEnumerator<T>
which would allocate.
##Span<T> and ReadOnlySpan<T>
These types provide a type-safe and memory-safe representation of a contiguous region of arbitrary memory. This type can represent any array-like (array, string, ArraySegment<T>) type and has the same optimizations as an array, so it is fast. Spans are very useful to create a slice of an array-like object. In the case of a string this is similar to Substring
but without any allocation as Span<T>
is a value type (struct).
ReadOnlySpan<int> array = new int[]{1,2,3,4,5,6};
ReadOnlySpan<int> slice = array[1..^1]; // [2,3,4,5]
ReadOnlySpan<char> str = "meziantou";
ReadOnlySpan<char> substring = str[0..3]; // mez
In our case, ReadOnlySpan<char>
will allow us to substring the string line by line by using the Slice method. Unless Substring
, it won't create a new string on the heap, so this will prevent an allocation.
##ref struct
Span<T>
and ReadOnlySpan<T>
are ref struct
. Instances of a ref struct type are allocated on the stack and can't escape to the managed heap. The compiler ensures that by reporting an error when you try to do an operation that may allocate the struct on the heap. For instance:
A ref struct can't be a declared type of a field of a class or a non-ref struct
👉🏼 This means the enumerator that we'll implement needs to be a ref struct to contains the
ReadOnlySpan<char>
A ref struct can't implement interfaces
👉🏼 Hopefully, you don't need to implement an interface to be compatible with the foreach operator
A ref struct can't be boxed to
System.ValueType
orSystem.Object
👉🏼 The foreach operator doesn't cast the value of the iterator, so this is ok
Note that there are other limitations explained in the documentation but they don't apply to the code of this post.
#SplitLines implementation
Now that you understand the foreach operator and ref structs, let's see the code:
using System;
public static class StringExtensions
{
public static LineSplitEnumerator SplitLines(this string str)
{
// LineSplitEnumerator is a struct so there is no allocation here
return new LineSplitEnumerator(str.AsSpan());
}
// Must be a ref struct as it contains a ReadOnlySpan<char>
public ref struct LineSplitEnumerator
{
private ReadOnlySpan<char> _str;
public LineSplitEnumerator(ReadOnlySpan<char> str)
{
_str = str;
Current = default;
}
// Needed to be compatible with the foreach operator
public LineSplitEnumerator GetEnumerator() => this;
public bool MoveNext()
{
var span = _str;
if (span.Length == 0) // Reach the end of the string
return false;
var index = span.IndexOfAny('\r', '\n');
if (index == -1) // The string is composed of only one line
{
_str = ReadOnlySpan<char>.Empty; // The remaining string is an empty string
Current = new LineSplitEntry(span, ReadOnlySpan<char>.Empty);
return true;
}
if (index < span.Length - 1 && span[index] == '\r')
{
// Try to consume the '\n' associated to the '\r'
var next = span[index + 1];
if (next == '\n')
{
Current = new LineSplitEntry(span.Slice(0, index), span.Slice(index, 2));
_str = span.Slice(index + 2);
return true;
}
}
Current = new LineSplitEntry(span.Slice(0, index), span.Slice(index, 1));
_str = span.Slice(index + 1);
return true;
}
public LineSplitEntry Current { get; private set; }
}
public readonly ref struct LineSplitEntry
{
public LineSplitEntry(ReadOnlySpan<char> line, ReadOnlySpan<char> separator)
{
Line = line;
Separator = separator;
}
public ReadOnlySpan<char> Line { get; }
public ReadOnlySpan<char> Separator { get; }
// This method allow to deconstruct the type, so you can write any of the following code
// foreach (var entry in str.SplitLines()) { _ = entry.Line; }
// foreach (var (line, endOfLine) in str.SplitLines()) { _ = line; }
// https://learn.microsoft.com/en-us/dotnet/csharp/fundamentals/functional/deconstruct?WT.mc_id=DT-MVP-5003978#deconstructing-user-defined-types
public void Deconstruct(out ReadOnlySpan<char> line, out ReadOnlySpan<char> separator)
{
line = Line;
separator = Separator;
}
// This method allow to implicitly cast the type into a ReadOnlySpan<char>, so you can write the following code
// foreach (ReadOnlySpan<char> entry in str.SplitLines())
public static implicit operator ReadOnlySpan<char>(LineSplitEntry entry) => entry.Line;
}
}
Here're some usage examples of the SplitLines()
method:
var str = "Nickname: meziantou\r\nName: Gérald Barré";
foreach (ReadOnlySpan<char> line in str.SplitLines())
{
Console.WriteLine(line.ToString());
}
foreach (var (line, endOfLine) in str.SplitLines())
{
Console.Write(line.ToString());
Console.Write(endOfLine.ToString());
}
foreach (var lineEntry in str.SplitLines())
{
Console.Write(lineEntry.Line.ToString());
Console.Write(lineEntry.Sparator.ToString());
}
#Benchmark
Let's create a quick benchmark with BenchmarkDotNet:
[MemoryDiagnoser]
public class SplitLinesBenchmark
{
private const string Data = "Nickname: meziantou\r\nFirstName: Gérald\nLastName: Barré";
private readonly char[] Separators = new[] { '\r', '\n' };
[Benchmark]
public void StringReader()
{
var reader = new StringReader(Data);
string line;
while ((line = reader.ReadLine()) != null)
{
}
}
[Benchmark]
public void Split()
{
foreach (var line in Data.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
{
}
}
[Benchmark]
public void Split_Preallocated()
{
foreach (var line in Data.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
{
}
}
[Benchmark]
public void Span()
{
foreach (ReadOnlySpan<char> item in Data.SplitLines())
{
}
}
}
public class Program
{
public static void Main(string[] args) => BenchmarkRunner.Run<SplitLinesBenchmark>();
}
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22621
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.5.22307.18
[Host] : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
Method | Mean | Error | StdDev | Gen 0 | Allocated |
---|---|---|---|---|---|
StringReader | 37.51 ns | 0.639 ns | 0.598 ns | 0.0124 | 208 B |
Split | 73.75 ns | 0.854 ns | 0.799 ns | 0.0186 | 312 B |
Split_Preallocated | 69.16 ns | 1.231 ns | 1.152 ns | 0.0167 | 280 B |
Span | 35.87 ns | 0.734 ns | 0.785 ns | - | - |
The code is faster and doesn't allocate at all 😃
#Additional references
- foreach, in (C# reference)
- Roslyn - LocalRewriter_ForEachStatement
- ReadOnlySpan<T> Struct
- Structure types (C# reference)
- Ref struct (byref-like type) and ByReference (byref-like instance field)
Do you have a question or a suggestion about this post? Contact me!