This post is part of the series 'SIMD'. Be sure to check out the rest of the blog posts of the series!
Comparing GUIDs in .NET is a very common operation. While it is simple, there is room to optimize it using SIMD instructions. I discovered this technique while reading this PR and used it as an introduction to SIMD and Vector in .NET. This post explains how the optimization works.
#How Guid.Equals() works
A Guid is a 128-bit value. If you look at the fields, you can see a mix of int, short, and bytes (source):
Guid.cs (C#)
public readonly partial struct Guid
{
private readonly int _a;
private readonly short _b;
private readonly short _c;
private readonly byte _d;
private readonly byte _e;
private readonly byte _f;
private readonly byte _g;
private readonly byte _h;
private readonly byte _i;
private readonly byte _j;
private readonly byte _k;
}
Here's the current Guid.Equals implementation from GitHub (comments are mine):
Guid.cs (C#)
public readonly partial struct Guid
{
public bool Equals(Guid g) => EqualsCore(this, g);
private static bool EqualsCore(in Guid left, in Guid right)
{
// Unsafe.AsRef returns a int pointer to the first field of the Guid structure
ref int rA = ref Unsafe.AsRef(in left._a);
ref int rB = ref Unsafe.AsRef(in right._a);
// Compare each element
// Unsafe.Add() returns a int pointer to the field at the specified offset
return rA == rB // Compare [0:31] bits
&& Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1) // Compare [32:63] bits
&& Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2) // Compare [64:95] bits
&& Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3); // Compare [96:127] int
}
}
Comparing each field is slow. So, the current implementation treats the Guid as 4 ints (4*32bits = 128bits) and compares them one by one. This approach uses unsafe memory operations, but since the structure layout is known and guaranteed not to change, it is logically safe. You can confirm that at runtime the fields in the struct are sequential with no padding, so the comparison is equivalent to comparing two binary blobs.
Guid Layout
Type layout for 'Guid'
Size: 16 bytes. Paddings: 0 bytes (%0 of empty space)
|===========================|
| 0-3: Int32 _a (4 bytes) |
|---------------------------|
| 4-5: Int16 _b (2 bytes) |
|---------------------------|
| 6-7: Int16 _c (2 bytes) |
|---------------------------|
| 8: Byte _d (1 byte) |
|---------------------------|
| 9: Byte _e (1 byte) |
|---------------------------|
| 10: Byte _f (1 byte) |
|---------------------------|
| 11: Byte _g (1 byte) |
|---------------------------|
| 12: Byte _h (1 byte) |
|---------------------------|
| 13: Byte _i (1 byte) |
|---------------------------|
| 14: Byte _j (1 byte) |
|---------------------------|
| 15: Byte _k (1 byte) |
|===========================|
You can see the performance difference between comparing each field one by one and comparing the Guid as 4 ints:
| Method | Mean | Error | StdDev | Code Size |
|---|
| Equals | 1.279 ns | 0.0446 ns | 0.0417 ns | 49 B |
| Equals_field_by_field | 3.497 ns | 0.0394 ns | 0.0349 ns | 119 B |
#Optimization using Vector128
Single Instruction, Multiple Data (SIMD) is a class of CPU instructions that enables data parallelism on a single core. In our case, we can apply a single "equal" instruction across all 16 bytes of both GUIDs simultaneously. SIMD is widely used in the .NET runtime for performance. For example, it accelerates string.IndexOf, Base64.EncodeToUtf8, HexConvert.EncodeToUtf16, query string manipulation, the BitArray class, and a sorting algorithm in the GC. You can also find SIMD usage in libraries such as ImageSharp.
In .NET, SIMD instructions are accessible through the Vector<T>, Vector64<T>, Vector128<T>, and Vector256<T> classes. These classes require hardware support for performance gains. You can check whether your hardware supports SIMD by calling IsHardwareAccelerated. When hardware acceleration is unavailable, the Vector<T> classes fall back to software implementations, so you can use them safely. However, checking hardware support lets you choose the most efficient implementation for your platform. Note that only a few primitive types (such as byte, short, int, and long) are supported by Vector<T>. Starting with .NET 7, you can verify whether a type is supported by calling Vector<T>.IsSupported.
You can also use SIMD instructions through raw methods. These methods are exposed in the System.Runtime.Intrinsics namespace. For instance, you can use System.Runtime.Intrinsics.X86.Sse2.CompareEqual. Before calling these methods, you must verify that the CPU supports them using IsSupported, because there is no software fallback. If the hardware does not support them, the call will throw. For instance, check System.Runtime.Intrinsics.X86.Sse2.IsSupported before using SSE2 instructions.
When the JIT detects these methods, it replaces the calls with equivalent assembly instructions. If you look at the C# code for methods in System.Runtime.Intrinsics, you won't see any useful logic. The magic is in the JIT, which knows these methods and handles them specially!
C#
// The JIT replaces the condition with a constant based on the current CPU capabilities.
// So, it keeps only the supported branch when generating the assembly code.
if (Vector256.IsHardwareAccelerated)
{
// todo implementation using Vector256<T>
}
else if (Vector128.IsHardwareAccelerated)
{
// todo implementation using Vector128<T>
}
else if (System.Runtime.Intrinsics.X86.Sse2.IsSupported)
{
// todo implementation using System.Runtime.Intrinsics.X86.Sse2
}
else
{
// todo fallback implementation
}
In our case, we want to compare 128 bits of data. So, we use Vector128<T> to load the 128 bits of both Guids and compare them. The code is more complex as you have to convert the Guid to a byte*. This is possible using Unsafe.AsRef and Unsafe.As.
C#
static class GuidExtensions
{
public static bool OptimizedGuidEquals(in Guid left, in Guid right)
{
if (Vector128.IsHardwareAccelerated)
{
// - Vector.LoadUnsafe loads 128 bits of data from the pointer
// - Unsafe.As casts the Guid pointer to byte pointer
// - Unsafe.AsRef(in left) returns a pointer to the value
// => This looks complicated and maybe slow, but
// the JIT converts this line to a single asm instruction
Vector128<byte> leftVector =
Vector128.LoadUnsafe(
ref Unsafe.As<Guid, byte>(
ref Unsafe.AsRef(in left)));
Vector128<byte> rightVector =
Vector128.LoadUnsafe(
ref Unsafe.As<Guid, byte>(
ref Unsafe.AsRef(in right)));
// Compare both vectors
return leftVector == rightVector;
}
else
{
// Fallback for non-hardware accelerated platforms
return left == right;
}
}
}
The JIT converts the previous code to the following:
Assembly
; GuidEqualsBenchmark.Equals_Opt(System.Guid, System.Guid)
; Clear the xmm registers
vzeroupper
; Vector128<byte> leftVector = Vector128.LoadUnsafe(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left)));
vmovdqu xmm0,xmmword ptr [rdx] ; Move Unaligned Packed Integer Values to xmm0
; Vector128<byte> rightVector = Vector128.LoadUnsafe(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right)));
vmovdqu xmm1,xmmword ptr [r8] ; Move Unaligned Packed Integer Values to xmm1
; Compare xmm0 and xmm1 and store the result in xmm0.
; Compare 128 bits (16 bytes) byte by byte:
; For each byte, if xmm0[i] == xmm1[i] then xmm0[i]=0xFF (8 bits to "1"), else xmm0[i]=0x00 (8 bits to "0")
vpcmpeqb xmm0,xmm0,xmm1
; Move the result of the comparison to eax.
; xmm0 is 128 bits, eax is 32 bits. eax is filled using the following algorithm:
; eax[0] <- xmm0[7];
; eax[1] <- xmm0[15];
; (* Repeat operation for bytes 2 through 14 *)
; eax[15] <- xmm0[127];
; eax[31:16] <- ZERO_FILL;
vpmovmskb eax,xmm0
; Check if the result is all 1s
; 0FFFF is 16 bits set to 1
cmp eax,0FFFF
; Set al to 0 or 1 according to the result of the preceeding cmp instruction
sete al
; Return the result
movzx eax,al
ret
; Total bytes of code 32
As you can see, the JIT removes the if/else block and only preserves the branch where Vector128.IsHardwareAccelerated is true, since my system supports it. It also replaces three method calls with a single assembly instruction (vmovdqu). The shorter assembly output increases the chance of the method being inlined, which slightly improves performance. Let's compare the original and optimized methods:
Benchmark.cs (C#)
BenchmarkRunner.Run<GuidEqualsBenchmark>();
[DisassemblyDiagnoser(printSource: true, exportDiff: true)]
public class GuidEqualsBenchmark
{
[Benchmark]
[ArgumentsSource(nameof(GetGuids))]
public bool Equals(Guid a, Guid b) => a == b;
[Benchmark]
[ArgumentsSource(nameof(GetGuids))]
public bool Equals_Opt(Guid a, Guid b) => GuidExtensions.OptimizedGuidEquals(in a, in b);
public IEnumerable<object[]> GetGuids()
{
yield return new object[] { Guid.Empty, Guid.Empty };
yield return new object[] { Guid.Empty, Guid.NewGuid() };
yield return new object[] { Guid.NewGuid(), Guid.NewGuid() };
var guid = Guid.NewGuid();
yield return new object[] { guid, guid };
}
}
Machine configuration
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22153.17
[Host] : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
| Method | Mean | Error | StdDev | Code Size |
|---|
| Equals | 2.2564 ns | 0.0326 ns | 0.0305 ns | 87 B |
| Equals_Opt | 0.3842 ns | 0.0186 ns | 0.0174 ns | 32 B |
| Equals | 1.0837 ns | 0.0287 ns | 0.0269 ns | 87 B |
| Equals_Opt | 0.2317 ns | 0.0172 ns | 0.0161 ns | 32 B |
| Equals | 1.0880 ns | 0.0234 ns | 0.0207 ns | 87 B |
| Equals_Opt | 0.2787 ns | 0.0224 ns | 0.0209 ns | 32 B |
| Equals | 1.1011 ns | 0.0251 ns | 0.0222 ns | 87 B |
| Equals_Opt | 0.2724 ns | 0.0238 ns | 0.0211 ns | 32 B |
#Support .NET 6
The previous code targets .NET 7, which was in preview at the time of writing. .NET 7 adds more methods on Vector to support generic vector operations. To target .NET 6 instead, you can use direct intrinsic methods. Note that the following code is less optimal than the generic implementation, as it only provides optimizations for the SSE2 instruction set. A more complete implementation would check additional instruction sets and provide optimized paths for each.
C#
// Works with .NET 6
// For .NET 7, use the previous implementation
static class GuidExtensions
{
public static bool OptimizedGuidEquals(in Guid left, in Guid right)
{
if (Sse2.IsSupported)
{
Vector128<byte> leftVector = Unsafe.ReadUnaligned<Vector128<byte>>(
ref Unsafe.As<Guid, byte>(
ref Unsafe.AsRef(in left)));
Vector128<byte> rightVector = Unsafe.ReadUnaligned<Vector128<byte>>(
ref Unsafe.As<Guid, byte>(
ref Unsafe.AsRef(in right)));
var equals = Sse2.CompareEqual(leftVector, rightVector);
var result = Sse2.MoveMask(equals);
return (result & 0xFFFF) == 0xFFFF;
}
return left == right;
}
}
#Additional resources
Do you have a question or a suggestion about this post? Contact me!