Post

Skipping hardware instruction dependent tests in C#

When writing tests for a project that uses hardware intrinsics, we face the problem of ensuring that we only run tests that the host hardware supports. I recently ran into this issue for a personal project in C#, and wanted to write a post about it. There is nothing hugely complicated here, but it does allow us to explore hardware intrinsics, standard library API design, and practical testing advice in a single post.

If you are already familiar with hardware intrinsics in C# or only care about how to implement the test skipping functionality, it may be worth skipping over the preamble and directly to the “Skipping instruction dependent tests” section.

Hardware Intrinsics in C#

What are they?

My introduction to hardware intrinsics in C# came all the way back in 2019 through a Microsoft Dev Blogs post ‘Hardware Intrinsics in .NET Core’ that a former colleague shared with me. However, the use of hardware intrinsics in the .NET runtime actually goes all the way back over a decade to 2014 as described in the, rather humourously titled, post ‘The JIT finally proposed. JIT and SIMD are getting married’.

The term ‘intrinsic’ is used in compilers to refer to a small piece of reusable logic that is built into the compiler and exposed to programmers. This could be something like direct access to a hardware feature or a special version of a fundamental algorithm that is really well understood by the compiler, so it is much easier to optimize later. The MSVC documentation contains a good section describing compiler intrinsics in more detail, if you are interested in reading more.

The term ‘hardware intrinsic’ tends to refer to intrinsics that exist for the purpose of using hardware features in a higher level language than machine assembly. Often, they are used to dispatch a single instruction without leaving the semantics of the source language i.e, inline assembly. One ubiquitous example of this is the immintrin.h header, that contains definitions for x86 hardware intrinsic functions for C/C++.

One of the most common use cases for hardware intrinsics on x86 are SIMD (Single Instruction Multiple Data) instructions. As the name implies, these instructions use wider registers to allow multiple values to be processed in a single instruction, sometimes called “instruction level parallelism”. This is useful for increasing throughput - in theory up to a factor of the number of values we can process per instruction. This post is not aiming to be a comprehensive guide of the considerations when using SIMD (which can get extremely complicated), so I’d recommend doing some further reading if you’re interested in the fundamentals and idioms.

Instead, let’s dive straight into the deep end completely unprepared and look at an example of adding four double values with a single SIMD instruction in C#. Whilst I haven’t explained any of this, I hope that the code sample demonstrates that the requirements to use these low-level features is more-or-less ‘normal’ C# - nothing super scary here. The key take away is that those Avx method calls are generating only a single CPU instruction each!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// All families of hardware intrinsics allow us to check if they are supported
// on the current machine.
if (!Avx.IsSupported)
{
  throw new NotSupportedException("Need AVX instructions for this example!");
}

// sizeof(double) is 64, so we can fit 4 values in a 256-bit vector.
// We will add `arrayOne` and `arrayTwo` and store the result in `result`.
var arrayOne = new[] { 1.0, 2.0, 3.0, 4.0 };
var arrayTwo = new[] { 5.0, 6.0, 7.0, 8.0 };
var result = new[] { 0.0, 0.0, 0.0, 0.0 };

// The SIMD APIs require pointers for loading and storing data but as C#
// is a garbage collected language, so we need to tell the GC to leave
// these arrays alone whilst we get pointers to them safely.
fixed (double* a1 = arrayOne, a2 = arrayTwo, r = result)
{
  var v1 = Avx.LoadVector256(a1);
  var v2 = Avx.LoadVector256(a2);
  Avx.Store(r, Avx.Add(v1, v2));
}

// Now the GC resumes normal control of our array variables.

Console.WriteLine(result); // [ 6.0, 8.0, 10.0, 12.0 ]

If you are interested in the generated assembly, the C# API documentation for Avx.Add directly references the corresponding C intrinsic and assembly. This reference is provided for every hardware intrinsic supported by the standard library, making it easy to cross-reference the more friendly-named C# APIs with low level reference material (such as the Intel Intrinsics Guide).

If you didn’t want to click through those links, here is the table of instruction/function names for adding a 256-bit vector of doubles:

LanguageName
Assemblyvaddpd
C_mm256_add_pd
C#Avx.Add<double>

To clarify some of this nomenclature: v means ‘vector’, _mm256 is a 256-bit vector, and pd is shorthand for ‘packed double’ (meaning the vector is full of 64-bit double values).

Now that we have hastily immersed ourselves in a basic code example, let’s look at some of the design features of the standard library intrinsics namespaces - at least enough to implement our test skipping attribute.

API Design

The .NET intrinsics APIs can be found in the System.Runtime.Intrinsics namespace. This contains generic and type-erased fixed-width vector types, respectively:

  • Vector64<T>, Vector128<T>, Vector256<T>, Vector512<T>
  • Vector64, Vector128, Vector256, Vector512

It also contains namespaces for architecture-specific intrinsics. At the time of writing, only x86 and Arm architectures are supported.

Note that not all of these intrinsics classes are for SIMD and so some don’t make use of vector types e.g., Lzcnt (leading zero count).

Instruction Set Organisation

The intrinsics namespaces contain abstract classes that represent the instruction set extensions. These contain mostly static methods and properties that allow us to call the hardware intrinsic function or query information about an instruction set.

We have already seen an example of an instruction set architecture (ISA) extension - Avx - in the code sample above. AVX (Advanced Vector Extensions) was released in the early 2010s and provides 256-bit wide registers and instructions. These days it features in most consumer hardware, so it’s a great place to get started.

As we can see from the previous code sample, the Avx class encapsulates AVX instructions as static methods for easy consumption. These methods are also propagated through a type hierarchy of instruction sets (more on that later) which means that we get common, developer-friendly names (Xor, CompareEqual, etc.) for operations regardless of the instructions we use to compute them.

This clearly categorised grouping of instructions is a huge readability win for the C# standard library - especially seeing as there are tens of thousands of X86 intrinsics! Comparing the nomenclatures in the table of AVX add APIs, the C# naming is a night and day improvement over the lower-level languages. Of course this comes at the cost of having to learn the assembly mapping if you’re interested in the individual instructions, but for general readability purposes it is much better in my opinion.

Dependent Instruction Sets

You might have noticed that I have skipped over 128-bit operations so far. This is because it will allow us to highlight one of my favourite design features of the intrinsics namespaces - the modelling of dependent ISAs through inheritance. To explain this properly, we are going to have to briefly look at x86 register naming conventions…

Registers in x86 can have multiple names depending on the width of the register you are addressing. For example, ecx and rcx are actually the same register but addressed as 32-bits and 64-bits. Here are the different names and sizes for that register, with each ASCII block representing a byte.

rcx: ████████ (64-bits)
ecx: ████     (lower 32-bits)
cx:  ██       (lower 16-bits)
cl:  █        (lower 8-bits)

Sidenote: I chose rcx deliberately as is canonically used for the address of this in instance methods. Also, I’m not entirely sure what the naming convention of these registers is. I think that e means “extended”, but I’m not entirely sure about r - perhaps “really extended”?

Anyway, similar addressing conventions apply to our x86 vector registers that we can use for SIMD instructions, so let’s introduce those registers by name: xmm (128-bit), ymm (256-bit), and zmm (512-bit). Much like general-purpose registers can be addressed in smaller chunks, we can see that we can actually build wider vector registers by combining smaller ones:

████████████████ ████████████████ ████████████████ ████████████████ (512-bits)
|--------------| |--------------| |--------------| |--------------|
     xmm0              xmm1             xmm2             xmm3
|-------------------------------| |-------------------------------|
               ymm0                              ymm1
|-----------------------------------------------------------------|
                                zmm0

So these widths of vector registers sets actually depend on each other as you unlock wider vector intrinsics. If you support the 256-bit ymm registers, then you necessarily support 128-bit xmm registers. Where do we see this kind of pattern emerge in C# code? Inheritance - where we take functionality from a base type and extend it with our own logic. For instance if we support Avx.Add on a 256-bit vector, then we must necessarily support adding 128-bit vectors - even though that will actually generate SSE instructions instead of AVX ones. In reality, we would be calling a static method on the parent ISA class through the Avx class here.

The key takeaway is that instruction sets that are dependent in this way modelled as base types of an instruction set. For instance, the inheritance hierarchy for Avx is Object -> X86Base -> Sse -> Sse2 -> Sse3 -> Ssse3 -> Sse41 -> Sse42 -> Avx. We can very clearly see the history of SIMD ISAs that predate Avx in the inheritance hierarchy.

There are a few handy real world benefits of this:

  • Checking the IsSupported property will implicitly check all dependent ISAs.
  • Generics over instruction sets can expressively constrain type parameters against the minimum instruction set that they require.
  • Most static analysers will warn us about calling parent static methods through derived types, so we can usually find the most portable ISA for the job with tooling.

I think this is a great design feature, because it models the hardware dependence neatly with standard C# semantics.

Skipping instruction dependent tests

Now we know a little bit about hardware intrinsics in C#, we can clearly see why we will need to skip tests on platforms that don’t have hardware support for the intrinsics contained within those test cases. Ideally, our mechanism for this will leverage the strongly-typed ISAs we are given by the System.Runtime.Intrinsics namespace. This will enhance the readability of our tests by making it obvious exactly which ISA we need to have available to execute the test.

I am using Xunit in this personal project, which will dictate most of the design decisions. The core logic is really straightforward though, so it should be easy enough to port to your testing framework of choice.

The code samples in this section require the following using statements, which are omitted for brevity throughout.

1
2
3
4
using System.Reflection;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;
using Xunit;

Desired Interface

Xunit uses a Skip property on test attributes to determine whether or not to skip a test. The docs state:

Gets the skip reason for the test. When null is returned, the test is not skipped.

For example:

1
2
3
4
5
6
7
8
9
[Fact]
public void Test()
{
}

[Fact(Skip = "Don't run this")]
public void SkippedTest()
{
}

Xunit allows us to derive our own version of these attributes to control skipping behaviour. Seeing as we have a strongly typed model of ISAs provided by the standard library, it would be great to have an attribute that was generic in the type of the ISA we must support in order to be able to run the test. Here is an outline of that idea at the point of consumption:

1
2
3
4
[RequiresIsaFact<Avx>]
public void AvxInstructionDependentTest()
{
}

Let’s start from this usage, and work out how to implement the logic of the attribute.

Checking whether an ISA is supported

We’ve already seen how every ISA has an IsSupported property, so our check is as simple as that. One small inconvenience of C# is going to get in our way here though - we cannot directly call a static method on a generic type parameter. There are some good reasons for this which are explained over a three part Eric Lippert blog post. If we had language support for this, our implementation would be as simple as:

1
2
3
4
if (T.IsSupported)
{
  Skip = $"{typeof(T).Name} is not supported on this platform.";
}

Instead, we can use reflection to get the name of the static property and invoke it on a null instance. This causes some extra pain because even though we know that the IsSupported property will always be there for a valid ISA, we could technically fail when we attempt to get the property instance, extract the value, or cast it to bool. Practically this should never cause us any issues in real world usage but I’ve added defensive checks for these cases anyway because we have no type constraints on T, so it is possible to mess this up. We’ll address this later but for now, the code we have so far is:

1
2
3
4
5
6
7
8
9
10
11
var isSupported = typeof(T)
  .GetProperty(nameof(X86Base.IsSupported))?
  .GetValue(null);

if (isSupported is null)
{
    throw new InvalidOperationException(
      $"The type {typeof(T).Name} does not have a property named IsSupported.");
}

(bool)isSupported; // the result

Implementing the attributes

To allow us to consume this logic into our tests, let’s create the Xunit attributes we described earlier. We’ll shift the logic outlined above into the constructor attribute and initialize the Skip property accordingly:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
abstract class RequiresIsaFactAttribute<T> : FactAttribute
{
    protected RequiresIsaFactAttribute()
    {
        var isSupported = typeof(T)
            .GetProperty(nameof(X86Base.IsSupported))?
            .GetValue(null);

        if (isSupported is null)
        {
            throw new InvalidOperationException(
                $"The type {typeof(T).Name} does not have a property named IsSupported.");
        }

        if (!(bool)isSupported)
        {
            Skip = $"{typeof(T).Name} is not supported on this platform.";
        }
    }
}

The reason for making this attribute abstract is to extract this logic for checking IsSupported to a common base that can be supported for attributes for each architecture. As there is no common base type between ISAs on different architectures, we must employ multiple generic type constraints for each architecture we wish to support in our tests by using separate derived types. By having multiple derived types for this (and making the base RequiresIsaFactAttribute impossible to instantiate), we can ensure that these attributes will be used in a correct manner.

We can constrain T to only X86 ISAs by requiring it to be derived from X86Base, the type from which all X86 ISAs are derived:

1
2
3
4
[AttributeUsage(AttributeTargets.Method)]
sealed class RequiresX86IsaFactAttribute<T>
    : RequiresIsaFactAttribute<T>
    where T : X86Base;

No prizes for guessing the counterpart for ARM:

1
2
3
4
[AttributeUsage(AttributeTargets.Method)]
sealed class RequiresArmIsaFactAttribute<T>
    : RequiresIsaFactAttribute<T>
    where T : ArmBase;

The Xunit attribute aficionados amongst you would have noticed that the AttributeUsage is technically inherited all the way through from FactAttribute, but I have added it to each of the attributes the appear in my test project for readability.

Usage in tests

Let’s have a look at the attribute in action. This very simple test checks for equality when loading two 128-bit vectors through the same address with the standard Sse2.LoadVector128 method and an extension method for the library-under-test’s vocabulary type for a native pointer.

1
2
3
4
5
6
7
8
9
10
11
12
[RequiresX86IsaFact<Sse2>]
public unsafe void LoadVector128LoadsExpectedValues()
{
    var values = stackalloc int[] { 1, 2, 3, 4 };
    var vector = Sse2.LoadVector128(values);
    var pointer = new Ptr<int>(values);

    var result = pointer.LoadVector128();

    // Assert.Equals does not support Vector128 (yet?)
    Assert.True(vector.Equals(result));
}

I think that placing the platform name alongside the ISA type parameter gives us a slight readability buff. This explicit link between architecture and ISA could improve test navigation for new contributors who are perhaps not so familiar with the many ISAs that exist on every platform and the instructions contained within each. My experience with SIMD programming has only ever been on relatively modern x64 CPUs, so if I were to look at a more Arm-oriented codebase for example, I would certainly be grateful for any information I could get from the tests!

Conclusion

This small tangent from my personal project was certainly an enjoyable one. At the time these hardware intrinsics were released to .NET Core 3.1, I was only three months into my first graduate role. Luckily for me, that team was aggressive with framework updates and I got the opportunity to investigate these APIs at work which was a lot of fun. In fact, looking back at some of these resources whilst writing this post was verging on nostalgic!

More relevantly, I think the .NET team did a really great job designing the APIs for SIMD intrinsics. The API surface is huge - depending on which method you use, SIMD instructions make up about ~70-~85% of the x86 instruction set. Strongly typed ISAs with common properties across platforms combined with the use of inheritance for dependent ISAs is a really slick design in my opinion. In general, I’m a big fan of using type systems for enforcing correctness and these intrinsics types really lend themselves to this, even if this unit testing example is only a small demonstration of that power.

This post is licensed under CC BY 4.0 by the author.