Faster Guid comparisons using Vectors (SIMD) in .NET

 
 
  • Gérald Barré

This post is part of the series 'SIMD'. Be sure to check out the rest of the blog posts in the series!

Comparing GUIDs in .NET is a very common operation. While it is simple, there is room to optimize it using SIMD instructions. I discovered this technique while reading this PR and used it as an introduction to SIMD and Vector in .NET. This post explains how the optimization works.

#How Guid.Equals() works

A Guid is a 128-bit value. If you look at the fields, you can see a mix of int, short, and bytes (source):

Guid.cs (C#)
public readonly partial struct Guid
{
    private readonly int _a;
    private readonly short _b;
    private readonly short _c;
    private readonly byte _d;
    private readonly byte _e;
    private readonly byte _f;
    private readonly byte _g;
    private readonly byte _h;
    private readonly byte _i;
    private readonly byte _j;
    private readonly byte _k;
}

Here's the current Guid.Equals implementation from GitHub (comments are mine):

Guid.cs (C#)
public readonly partial struct Guid
{
    public bool Equals(Guid g) => EqualsCore(this, g);

    private static bool EqualsCore(in Guid left, in Guid right)
    {
        // Unsafe.AsRef returns a mutable reference to the first int-sized field of the Guid structure
        ref int rA = ref Unsafe.AsRef(in left._a);
        ref int rB = ref Unsafe.AsRef(in right._a);

        // Compare each element
        // Unsafe.Add() returns a reference to the int at the specified offset
        return rA == rB                                        // Compare bits [0:31]
            && Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)  // Compare bits [32:63]
            && Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)  // Compare bits [64:95]
            && Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3); // Compare bits [96:127]
    }
}

Comparing the fields one by one is slow. So, the current implementation treats the Guid as 4 ints (4 × 32 bits = 128 bits) and compares those instead. This approach uses unsafe memory operations, but since the structure layout is known and guaranteed not to change, it is logically safe. You can confirm at runtime that the fields in the struct are sequential with no padding, so the comparison is equivalent to comparing two binary blobs.

Guid Layout
Type layout for 'Guid'
Size: 16 bytes. Paddings: 0 bytes (0% of empty space)
|===========================|
|   0-3: Int32 _a (4 bytes) |
|---------------------------|
|   4-5: Int16 _b (2 bytes) |
|---------------------------|
|   6-7: Int16 _c (2 bytes) |
|---------------------------|
|     8: Byte _d (1 byte)   |
|---------------------------|
|     9: Byte _e (1 byte)   |
|---------------------------|
|    10: Byte _f (1 byte)   |
|---------------------------|
|    11: Byte _g (1 byte)   |
|---------------------------|
|    12: Byte _h (1 byte)   |
|---------------------------|
|    13: Byte _i (1 byte)   |
|---------------------------|
|    14: Byte _j (1 byte)   |
|---------------------------|
|    15: Byte _k (1 byte)   |
|===========================|
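The layout above can also be sanity-checked programmatically. Here is a minimal sketch of my own (not from the runtime sources), using Unsafe.SizeOf and MemoryMarshal:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// A Guid is a contiguous 16-byte blob with no padding,
// so it can be reinterpreted as raw bytes.
var guid = Guid.NewGuid();
Console.WriteLine(Unsafe.SizeOf<Guid>()); // 16

// Reinterpret the single Guid as a span of 16 bytes, without copying
Span<byte> bytes = MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref guid, 1));
Console.WriteLine(bytes.Length); // 16
```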

You can see the performance difference between comparing each field one by one and comparing the Guid as 4 ints:

| Method                | Mean     | Error     | StdDev    | Code Size |
|-----------------------|----------|-----------|-----------|-----------|
| Equals                | 1.279 ns | 0.0446 ns | 0.0417 ns | 49 B      |
| Equals_field_by_field | 3.497 ns | 0.0394 ns | 0.0349 ns | 119 B     |
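To make the slow path concrete, here is roughly what a field-by-field comparison looks like. GuidFields and EqualsFieldByField are hypothetical names of mine; the struct only mirrors Guid's private fields for illustration:

```csharp
using System;
using System.Runtime.InteropServices;

// Two default instances: every field is zero, so they compare equal
var x = default(GuidFields);
var y = default(GuidFields);
Console.WriteLine(NaiveCompare.EqualsFieldByField(in x, in y)); // True

// Hypothetical struct mirroring Guid's private fields (names are mine)
[StructLayout(LayoutKind.Sequential)]
readonly struct GuidFields
{
    public readonly int A;
    public readonly short B, C;
    public readonly byte D, E, F, G, H, I, J, K;
}

static class NaiveCompare
{
    // Eleven comparisons, one per field — the slow path from the benchmark
    public static bool EqualsFieldByField(in GuidFields left, in GuidFields right)
        => left.A == right.A && left.B == right.B && left.C == right.C
        && left.D == right.D && left.E == right.E && left.F == right.F
        && left.G == right.G && left.H == right.H && left.I == right.I
        && left.J == right.J && left.K == right.K;
}
```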

#Optimization using Vector128

Single Instruction, Multiple Data (SIMD) is a class of CPU instructions that enables data parallelism on a single core. In our case, we can apply a single "equal" instruction across all 16 bytes of both GUIDs simultaneously. SIMD is widely used in the .NET runtime for performance. For example, it accelerates string.IndexOf, Base64.EncodeToUtf8, HexConverter.EncodeToUtf16, query string manipulation, the BitArray class, and a sorting algorithm in the GC. You can also find SIMD usage in libraries such as ImageSharp.

In .NET, SIMD instructions are accessible through the Vector<T>, Vector64<T>, Vector128<T>, and Vector256<T> classes. These classes require hardware support for performance gains. You can check whether your hardware supports SIMD by calling IsHardwareAccelerated. When hardware acceleration is unavailable, the Vector<T> classes fall back to software implementations, so you can use them safely. However, checking hardware support lets you choose the most efficient implementation for your platform. Note that only a few primitive types (such as byte, short, int, and long) are supported by Vector<T>. Starting with .NET 7, you can verify whether a type is supported by calling Vector<T>.IsSupported.
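These checks are easy to probe on your own machine. A small sketch (mine, not from the post); the acceleration results depend on the CPU, so no particular output is assumed for them:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hardware acceleration depends on the current CPU
Console.WriteLine(Vector128.IsHardwareAccelerated);
Console.WriteLine(Vector256.IsHardwareAccelerated);

// .NET 7+: check whether an element type can be used with Vector<T>
Console.WriteLine(Vector<int>.IsSupported);  // True — int is a supported primitive
Console.WriteLine(Vector<bool>.IsSupported); // False — bool is not supported
```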

You can also use SIMD instructions through raw methods. These methods are exposed in the System.Runtime.Intrinsics namespace. For instance, you can use System.Runtime.Intrinsics.X86.Sse2.CompareEqual. Before calling these methods, you must verify that the CPU supports them using IsSupported, because there is no software fallback. If the hardware does not support them, the call will throw. For instance, check System.Runtime.Intrinsics.X86.Sse2.IsSupported before using SSE2 instructions.
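The raw-intrinsics pattern looks like this. A minimal, self-contained sketch of mine: guard with IsSupported, then compare two 128-bit values byte by byte:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

if (Sse2.IsSupported)
{
    // Two vectors with all 16 bytes set to 42
    Vector128<byte> a = Vector128.Create((byte)42);
    Vector128<byte> b = Vector128.Create((byte)42);

    // 0xFF where the bytes are equal, 0x00 where they differ
    Vector128<byte> eq = Sse2.CompareEqual(a, b);

    // One bit per byte; 0xFFFF means all 16 bytes matched
    int mask = Sse2.MoveMask(eq);
    Console.WriteLine(mask == 0xFFFF); // True
}
else
{
    // No software fallback exists for raw intrinsics; provide one yourself
    Console.WriteLine("SSE2 not available; use a fallback");
}
```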

When the JIT detects these methods, it replaces the calls with equivalent assembly instructions. If you look at the C# code for methods in System.Runtime.Intrinsics, you won't see any useful logic. The magic is in the JIT, which knows these methods and handles them specially!

C#
// The JIT replaces the condition with a constant based on the current CPU capabilities.
// So, it keeps only the supported branch when generating the assembly code.
if (Vector256.IsHardwareAccelerated)
{
    // todo implementation using Vector256<T>
}
else if (Vector128.IsHardwareAccelerated)
{
    // todo implementation using Vector128<T>
}
else if (System.Runtime.Intrinsics.X86.Sse2.IsSupported)
{
    // todo implementation using System.Runtime.Intrinsics.X86.Sse2
}
else
{
    // todo fallback implementation
}

In our case, we want to compare 128 bits of data. So, we use Vector128<T> to load the 128 bits of both Guids and compare them. The code is more complex because you have to reinterpret the Guid as a ref byte. This is possible using Unsafe.AsRef and Unsafe.As.

C#
static class GuidExtensions
{
    public static bool OptimizedGuidEquals(in Guid left, in Guid right)
    {
        if (Vector128.IsHardwareAccelerated)
        {
            // - Vector128.LoadUnsafe loads 128 bits of data from the given reference
            // - Unsafe.As reinterprets the Guid reference as a byte reference
            // - Unsafe.AsRef(in left) converts the readonly reference to a mutable one
            // => This looks complicated and maybe slow, but
            //    the JIT converts this line to a single asm instruction
            Vector128<byte> leftVector =
                Vector128.LoadUnsafe(
                    ref Unsafe.As<Guid, byte>(
                        ref Unsafe.AsRef(in left)));

            Vector128<byte> rightVector =
                Vector128.LoadUnsafe(
                    ref Unsafe.As<Guid, byte>(
                        ref Unsafe.AsRef(in right)));

            // Compare both vectors
            return leftVector == rightVector;
        }
        else
        {
            // Fallback for non-hardware accelerated platforms
            return left == right;
        }
    }
}

The JIT converts the previous code to the following:

Assembly
; GuidEqualsBenchmark.Equals_Opt(System.Guid, System.Guid)
       ; Zero the upper bits of the ymm registers (avoids AVX/SSE transition penalties)
       vzeroupper

       ; Vector128<byte> leftVector = Vector128.LoadUnsafe(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left)));
       vmovdqu   xmm0,xmmword ptr [rdx] ; Move Unaligned Packed Integer Values to xmm0

       ; Vector128<byte> rightVector = Vector128.LoadUnsafe(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right)));
       vmovdqu   xmm1,xmmword ptr [r8]  ; Move Unaligned Packed Integer Values to xmm1

       ; Compare xmm0 and xmm1 and store the result in xmm0.
       ; Compare 128 bits (16 bytes) byte by byte:
       ; For each byte, if xmm0[i] == xmm1[i] then xmm0[i]=0xFF (8 bits to "1"), else xmm0[i]=0x00 (8 bits to "0")
       vpcmpeqb  xmm0,xmm0,xmm1

       ; Move the result of the comparison to eax.
       ; xmm0 is 128 bits, eax is 32 bits. eax is filled using the following algorithm:
       ; eax[0] <- xmm0[7];
       ; eax[1] <- xmm0[15];
       ; (* Repeat operation for bytes 2 through 14 *)
       ; eax[15] <- xmm0[127];
       ; eax[31:16] <- ZERO_FILL;
       vpmovmskb eax,xmm0

       ; Check if the result is all 1s
       ; 0FFFF is 16 bits set to 1
       cmp       eax,0FFFF

       ; Set al to 0 or 1 according to the result of the preceding cmp instruction
       sete      al

       ; Return the result
       movzx     eax,al
       ret
; Total bytes of code 32

As you can see, the JIT removes the if/else block and only preserves the branch where Vector128.IsHardwareAccelerated is true, since my system supports it. It also replaces three method calls with a single assembly instruction (vmovdqu). The shorter assembly output increases the chance of the method being inlined, which slightly improves performance. Let's compare the original and optimized methods:

Benchmark.cs (C#)
BenchmarkRunner.Run<GuidEqualsBenchmark>();

[DisassemblyDiagnoser(printSource: true, exportDiff: true)]
public class GuidEqualsBenchmark
{
    [Benchmark]
    [ArgumentsSource(nameof(GetGuids))]
    public bool Equals(Guid a, Guid b) => a == b;

    [Benchmark]
    [ArgumentsSource(nameof(GetGuids))]
    public bool Equals_Opt(Guid a, Guid b) => GuidExtensions.OptimizedGuidEquals(in a, in b);

    public IEnumerable<object[]> GetGuids()
    {
        yield return new object[] { Guid.Empty, Guid.Empty };
        yield return new object[] { Guid.Empty, Guid.NewGuid() };
        yield return new object[] { Guid.NewGuid(), Guid.NewGuid() };

        var guid = Guid.NewGuid();
        yield return new object[] { guid, guid };
    }
}
Machine configuration
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22153.17
  [Host]     : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
| Method     | Mean      | Error     | StdDev    | Code Size |
|------------|-----------|-----------|-----------|-----------|
| Equals     | 2.2564 ns | 0.0326 ns | 0.0305 ns | 87 B      |
| Equals_Opt | 0.3842 ns | 0.0186 ns | 0.0174 ns | 32 B      |
| Equals     | 1.0837 ns | 0.0287 ns | 0.0269 ns | 87 B      |
| Equals_Opt | 0.2317 ns | 0.0172 ns | 0.0161 ns | 32 B      |
| Equals     | 1.0880 ns | 0.0234 ns | 0.0207 ns | 87 B      |
| Equals_Opt | 0.2787 ns | 0.0224 ns | 0.0209 ns | 32 B      |
| Equals     | 1.1011 ns | 0.0251 ns | 0.0222 ns | 87 B      |
| Equals_Opt | 0.2724 ns | 0.0238 ns | 0.0211 ns | 32 B      |

Each pair of rows corresponds to one of the argument sets returned by GetGuids.

#Support .NET 6

The previous code targets .NET 7, which was in preview at the time of writing. .NET 7 adds more methods on Vector to support generic vector operations. To target .NET 6 instead, you can use direct intrinsic methods. Note that the following code is less optimal than the generic implementation, as it only provides optimizations for the SSE2 instruction set. A more complete implementation would check additional instruction sets and provide optimized paths for each.

C#
// Works with .NET 6
// For .NET 7, use the previous implementation
static class GuidExtensions
{
    public static bool OptimizedGuidEquals(in Guid left, in Guid right)
    {
        if (Sse2.IsSupported)
        {
            Vector128<byte> leftVector = Unsafe.ReadUnaligned<Vector128<byte>>(
                ref Unsafe.As<Guid, byte>(
                    ref Unsafe.AsRef(in left)));

            Vector128<byte> rightVector = Unsafe.ReadUnaligned<Vector128<byte>>(
                ref Unsafe.As<Guid, byte>(
                    ref Unsafe.AsRef(in right)));

            var equals = Sse2.CompareEqual(leftVector, rightVector);
            var result = Sse2.MoveMask(equals);
            return (result & 0xFFFF) == 0xFFFF;
        }

        return left == right;
    }
}
