Faster Guid comparisons using Vectors (SIMD) in .NET

 
 
  • Gérald Barré

This post is part of the series 'SIMD'. Be sure to check out the rest of the blog posts in the series!

Comparing GUIDs in .NET is a very common operation. While it is simple, there is room to optimize it using SIMD instructions. I discovered this technique while reading this PR and used it as an introduction to SIMD and Vector in .NET. This post explains how the optimization works.

#How Guid.Equals() works

A Guid is a 128-bit value. If you look at the fields, you can see a mix of int, short, and bytes (source):

Guid.cs (C#)
public readonly partial struct Guid
{
    private readonly int _a;
    private readonly short _b;
    private readonly short _c;
    private readonly byte _d;
    private readonly byte _e;
    private readonly byte _f;
    private readonly byte _g;
    private readonly byte _h;
    private readonly byte _i;
    private readonly byte _j;
    private readonly byte _k;
}

Here's the current Guid.Equals implementation from GitHub (comments are mine):

Guid.cs (C#)
public readonly partial struct Guid
{
    public bool Equals(Guid g) => EqualsCore(this, g);

    private static bool EqualsCore(in Guid left, in Guid right)
    {
        // Unsafe.AsRef returns a mutable reference to the first int-sized field of the Guid structure
        ref int rA = ref Unsafe.AsRef(in left._a);
        ref int rB = ref Unsafe.AsRef(in right._a);

        // Compare each element
        // Unsafe.Add() returns a reference to the int at the specified offset
        return rA == rB                                        // Compare bits [0:31]
            && Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)  // Compare bits [32:63]
            && Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)  // Compare bits [64:95]
            && Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3); // Compare bits [96:127]
    }
}

Comparing the fields one by one is slow. So, the current implementation treats the Guid as 4 ints (4 × 32 bits = 128 bits) and compares those instead. This approach uses unsafe memory operations, but since the structure layout is known and guaranteed not to change, it is logically safe. You can confirm at runtime that the fields in the struct are sequential with no padding, so the comparison is equivalent to comparing two binary blobs.

Guid Layout
Type layout for 'Guid'
Size: 16 bytes. Paddings: 0 bytes (0% of empty space)
|===========================|
|   0-3: Int32 _a (4 bytes) |
|---------------------------|
|   4-5: Int16 _b (2 bytes) |
|---------------------------|
|   6-7: Int16 _c (2 bytes) |
|---------------------------|
|     8: Byte _d (1 byte)   |
|---------------------------|
|     9: Byte _e (1 byte)   |
|---------------------------|
|    10: Byte _f (1 byte)   |
|---------------------------|
|    11: Byte _g (1 byte)   |
|---------------------------|
|    12: Byte _h (1 byte)   |
|---------------------------|
|    13: Byte _i (1 byte)   |
|---------------------------|
|    14: Byte _j (1 byte)   |
|---------------------------|
|    15: Byte _k (1 byte)   |
|===========================|
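The layout above can also be sanity-checked programmatically. Here is a minimal sketch of my own (not from the runtime sources), using Unsafe.SizeOf and MemoryMarshal:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// A Guid is a contiguous 16-byte blob with no padding,
// so it can be reinterpreted as raw bytes.
var guid = Guid.NewGuid();
Console.WriteLine(Unsafe.SizeOf<Guid>()); // 16

// Reinterpret the single Guid as a span of 16 bytes, without copying
Span<byte> bytes = MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref guid, 1));
Console.WriteLine(bytes.Length); // 16
```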

You can see the performance difference between comparing each field one by one and comparing the Guid as 4 ints:

| Method                | Mean     | Error     | StdDev    | Code Size |
|-----------------------|----------|-----------|-----------|-----------|
| Equals                | 1.279 ns | 0.0446 ns | 0.0417 ns | 49 B      |
| Equals_field_by_field | 3.497 ns | 0.0394 ns | 0.0349 ns | 119 B     |
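To make the slow path concrete, here is roughly what a field-by-field comparison looks like. GuidFields and EqualsFieldByField are hypothetical names of mine; the struct only mirrors Guid's private fields for illustration:

```csharp
using System;
using System.Runtime.InteropServices;

// Two default instances: every field is zero, so they compare equal
var x = default(GuidFields);
var y = default(GuidFields);
Console.WriteLine(NaiveCompare.EqualsFieldByField(in x, in y)); // True

// Hypothetical struct mirroring Guid's private fields (names are mine)
[StructLayout(LayoutKind.Sequential)]
readonly struct GuidFields
{
    public readonly int A;
    public readonly short B, C;
    public readonly byte D, E, F, G, H, I, J, K;
}

static class NaiveCompare
{
    // Eleven comparisons, one per field — the slow path from the benchmark
    public static bool EqualsFieldByField(in GuidFields left, in GuidFields right)
        => left.A == right.A && left.B == right.B && left.C == right.C
        && left.D == right.D && left.E == right.E && left.F == right.F
        && left.G == right.G && left.H == right.H && left.I == right.I
        && left.J == right.J && left.K == right.K;
}
```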

#Optimization using Vector128

Single Instruction, Multiple Data (SIMD) is a class of CPU instructions that enables data parallelism on a single core. In our case, we can apply a single "equal" instruction across all 16 bytes of both GUIDs simultaneously. SIMD is widely used in the .NET runtime for performance. For example, it accelerates string.IndexOf, Base64.EncodeToUtf8, HexConverter.EncodeToUtf16, query string manipulation, the BitArray class, and a sorting algorithm in the GC. You can also find SIMD usage in libraries such as ImageSharp.

In .NET, SIMD instructions are accessible through the Vector<T>, Vector64<T>, Vector128<T>, and Vector256<T> classes. These classes require hardware support for performance gains. You can check whether your hardware supports SIMD by calling IsHardwareAccelerated. When hardware acceleration is unavailable, the Vector<T> classes fall back to software implementations, so you can use them safely. However, checking hardware support lets you choose the most efficient implementation for your platform. Note that only a few primitive types (such as byte, short, int, and long) are supported by Vector<T>. Starting with .NET 7, you can verify whether a type is supported by calling Vector<T>.IsSupported.
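These checks are easy to probe on your own machine. A small sketch (mine, not from the post); the acceleration results depend on the CPU, so no particular output is assumed for them:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hardware acceleration depends on the current CPU
Console.WriteLine(Vector128.IsHardwareAccelerated);
Console.WriteLine(Vector256.IsHardwareAccelerated);

// .NET 7+: check whether an element type can be used with Vector<T>
Console.WriteLine(Vector<int>.IsSupported);  // True — int is a supported primitive
Console.WriteLine(Vector<bool>.IsSupported); // False — bool is not supported
```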

You can also use SIMD instructions through raw methods. These methods are exposed in the System.Runtime.Intrinsics namespace. For instance, you can use System.Runtime.Intrinsics.X86.Sse2.CompareEqual. Before calling these methods, you must verify that the CPU supports them using IsSupported, because there is no software fallback. If the hardware does not support them, the call will throw. For instance, check System.Runtime.Intrinsics.X86.Sse2.IsSupported before using SSE2 instructions.
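The raw-intrinsics pattern looks like this. A minimal, self-contained sketch of mine: guard with IsSupported, then compare two 128-bit values byte by byte:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

if (Sse2.IsSupported)
{
    // Two vectors with all 16 bytes set to 42
    Vector128<byte> a = Vector128.Create((byte)42);
    Vector128<byte> b = Vector128.Create((byte)42);

    // 0xFF where the bytes are equal, 0x00 where they differ
    Vector128<byte> eq = Sse2.CompareEqual(a, b);

    // One bit per byte; 0xFFFF means all 16 bytes matched
    int mask = Sse2.MoveMask(eq);
    Console.WriteLine(mask == 0xFFFF); // True
}
else
{
    // No software fallback exists for raw intrinsics; provide one yourself
    Console.WriteLine("SSE2 not available; use a fallback");
}
```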

When the JIT detects these methods, it replaces the calls with equivalent assembly instructions. If you look at the C# code for methods in System.Runtime.Intrinsics, you won't see any useful logic. The magic is in the JIT, which knows these methods and handles them specially!

C#
// The JIT replaces the condition with a constant based on the current CPU capabilities.
// So, it keeps only the supported branch when generating the assembly code.
if (Vector256.IsHardwareAccelerated)
{
    // todo implementation using Vector256<T>
}
else if (Vector128.IsHardwareAccelerated)
{
    // todo implementation using Vector128<T>
}
else if (System.Runtime.Intrinsics.X86.Sse2.IsSupported)
{
    // todo implementation using System.Runtime.Intrinsics.X86.Sse2
}
else
{
    // todo fallback implementation
}

In our case, we want to compare 128 bits of data. So, we use Vector128<T> to load the 128 bits of both Guids and compare them. The code is more complex because you have to reinterpret the Guid as a ref byte. This is possible using Unsafe.AsRef and Unsafe.As.

C#
static class GuidExtensions
{
    public static bool OptimizedGuidEquals(in Guid left, in Guid right)
    {
        if (Vector128.IsHardwareAccelerated)
        {
            // - Vector128.LoadUnsafe loads 128 bits of data from the given reference
            // - Unsafe.As reinterprets the Guid reference as a byte reference
            // - Unsafe.AsRef(in left) converts the readonly reference to a mutable one
            // => This looks complicated and maybe slow, but
            //    the JIT converts this line to a single asm instruction
            Vector128<byte> leftVector =
                Vector128.LoadUnsafe(
                    ref Unsafe.As<Guid, byte>(
                        ref Unsafe.AsRef(in left)));

            Vector128<byte> rightVector =
                Vector128.LoadUnsafe(
                    ref Unsafe.As<Guid, byte>(
                        ref Unsafe.AsRef(in right)));

            // Compare both vectors
            return leftVector == rightVector;
        }
        else
        {
            // Fallback for non-hardware accelerated platforms
            return left == right;
        }
    }
}

The JIT converts the previous code to the following:

Assembly
; GuidEqualsBenchmark.Equals_Opt(System.Guid, System.Guid)
       ; Zero the upper bits of the ymm registers (avoids AVX/SSE transition penalties)
       vzeroupper

       ; Vector128<byte> leftVector = Vector128.LoadUnsafe(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left)));
       vmovdqu   xmm0,xmmword ptr [rdx] ; Move Unaligned Packed Integer Values to xmm0

       ; Vector128<byte> rightVector = Vector128.LoadUnsafe(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right)));
       vmovdqu   xmm1,xmmword ptr [r8]  ; Move Unaligned Packed Integer Values to xmm1

       ; Compare xmm0 and xmm1 and store the result in xmm0.
       ; Compare 128 bits (16 bytes) byte by byte:
       ; For each byte, if xmm0[i] == xmm1[i] then xmm0[i]=0xFF (8 bits to "1"), else xmm0[i]=0x00 (8 bits to "0")
       vpcmpeqb  xmm0,xmm0,xmm1

       ; Move the result of the comparison to eax.
       ; xmm0 is 128 bits, eax is 32 bits. eax is filled using the following algorithm:
       ; eax[0] <- xmm0[7];
       ; eax[1] <- xmm0[15];
       ; (* Repeat operation for bytes 2 through 14 *)
       ; eax[15] <- xmm0[127];
       ; eax[31:16] <- ZERO_FILL;
       vpmovmskb eax,xmm0

       ; Check if the result is all 1s
       ; 0FFFF is 16 bits set to 1
       cmp       eax,0FFFF

       ; Set al to 0 or 1 according to the result of the preceding cmp instruction
       sete      al

       ; Return the result
       movzx     eax,al
       ret
; Total bytes of code 32

As you can see, the JIT removes the if/else block and only preserves the branch where Vector128.IsHardwareAccelerated is true, since my system supports it. It also replaces three method calls with a single assembly instruction (vmovdqu). The shorter assembly output increases the chance of the method being inlined, which slightly improves performance. Let's compare the original and optimized methods:

Benchmark.cs (C#)
BenchmarkRunner.Run<GuidEqualsBenchmark>();

[DisassemblyDiagnoser(printSource: true, exportDiff: true)]
public class GuidEqualsBenchmark
{
    [Benchmark]
    [ArgumentsSource(nameof(GetGuids))]
    public bool Equals(Guid a, Guid b) => a == b;

    [Benchmark]
    [ArgumentsSource(nameof(GetGuids))]
    public bool Equals_Opt(Guid a, Guid b) => GuidExtensions.OptimizedGuidEquals(in a, in b);

    public IEnumerable<object[]> GetGuids()
    {
        yield return new object[] { Guid.Empty, Guid.Empty };
        yield return new object[] { Guid.Empty, Guid.NewGuid() };
        yield return new object[] { Guid.NewGuid(), Guid.NewGuid() };

        var guid = Guid.NewGuid();
        yield return new object[] { guid, guid };
    }
}
Machine configuration
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22153.17
  [Host]     : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
| Method     | Mean      | Error     | StdDev    | Code Size |
|------------|-----------|-----------|-----------|-----------|
| Equals     | 2.2564 ns | 0.0326 ns | 0.0305 ns | 87 B      |
| Equals_Opt | 0.3842 ns | 0.0186 ns | 0.0174 ns | 32 B      |
| Equals     | 1.0837 ns | 0.0287 ns | 0.0269 ns | 87 B      |
| Equals_Opt | 0.2317 ns | 0.0172 ns | 0.0161 ns | 32 B      |
| Equals     | 1.0880 ns | 0.0234 ns | 0.0207 ns | 87 B      |
| Equals_Opt | 0.2787 ns | 0.0224 ns | 0.0209 ns | 32 B      |
| Equals     | 1.1011 ns | 0.0251 ns | 0.0222 ns | 87 B      |
| Equals_Opt | 0.2724 ns | 0.0238 ns | 0.0211 ns | 32 B      |

Each pair of rows corresponds to one of the argument sets returned by GetGuids.

#Support .NET 6

The previous code targets .NET 7, which was in preview at the time of writing. .NET 7 adds more methods on Vector to support generic vector operations. To target .NET 6 instead, you can use direct intrinsic methods. Note that the following code is less optimal than the generic implementation, as it only provides optimizations for the SSE2 instruction set. A more complete implementation would check additional instruction sets and provide optimized paths for each.

C#
// Works with .NET 6
// For .NET 7, use the previous implementation
static class GuidExtensions
{
    public static bool OptimizedGuidEquals(in Guid left, in Guid right)
    {
        if (Sse2.IsSupported)
        {
            Vector128<byte> leftVector = Unsafe.ReadUnaligned<Vector128<byte>>(
                ref Unsafe.As<Guid, byte>(
                    ref Unsafe.AsRef(in left)));

            Vector128<byte> rightVector = Unsafe.ReadUnaligned<Vector128<byte>>(
                ref Unsafe.As<Guid, byte>(
                    ref Unsafe.AsRef(in right)));

            var equals = Sse2.CompareEqual(leftVector, rightVector);
            var result = Sse2.MoveMask(equals);
            return (result & 0xFFFF) == 0xFFFF;
        }

        return left == right;
    }
}
