Prototype `TYP_HALF` changes #15

anthonycanino · 2025-10-21T12:57:54Z

I've hacked a few changes together to check whether I am working on the ABI in the right places. So far, I have the following code

[MethodImpl(MethodImplOptions.NoInlining)]
public static Half ProduceHalf(float val, float val2)
{
	Half h1 = (Half)val;
	Half h2 = (Half)val2;
	return ConsumeHalf(h1, h2);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static Half ConsumeHalf(Half h1, Half h2)
{
	Half h3 = h1 + h2;
	Half h4 = h1 + h3;
	return h4;
}

which will get lowered into the following asm

; Assembly listing for method Program:ProduceHalf(float,float):System.Half (FullOpts)
; Emitting BLENDED_CODE for generic X64 + VEX + EVEX on Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->  mm0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )   float  ->  mm1         single-def
;* V02 loc0         [V02    ] (  0,  0   )    half  ->  zero-ref    <System.Half>
;  V03 OutArgs      [V03    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <UNNAMED>
;* V04 tmp1         [V04    ] (  0,  0   )  ushort  ->  zero-ref    "field V02._value (fldOffset=0x0)" P-INDEP
;  V05 tmp2         [V05    ] (  2,  4   )    half  ->  [rsp+0x30]  "by-value struct argument" <System.Half>
;  V06 tmp3         [V06    ] (  2,  4   )    half  ->  [rsp+0x28]  "by-value struct argument" <System.Half>
;
; Lcl frame size = 56

G_M43922_IG01:  ;; offset=0x0000
       4883EC38             sub      rsp, 56
						;; size=4 bbWeight=1 PerfScore 0.25
G_M43922_IG02:  ;; offset=0x0004
       62F57C081DC0         vcvtss2sh xmm0, xmm0
       62F57E0811442418     vmovsh   word  ptr [rsp+0x30], xmm0
       62F574081DC1         vcvtss2sh xmm0, xmm1
       62F57E0811442414     vmovsh   word  ptr [rsp+0x28], xmm0
       0FB74C2430           movzx    rcx, word  ptr [rsp+0x30]
       0FB7542428           movzx    rdx, word  ptr [rsp+0x28]
       FF15E0C08700         call     [Program:ConsumeHalf(System.Half,System.Half):System.Half]
       90                   nop      
						;; size=45 bbWeight=1 PerfScore 9.25
G_M43922_IG03:  ;; offset=0x0031
       4883C438             add      rsp, 56
       C3                   ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 54, prolog size 4, PerfScore 10.75, instruction count 11, allocated bytes for code 54 (MethodHash=f2a4546d) for method Program:ProduceHalf(float,float):System.Half (FullOpts)
; ============================================================

; Assembly listing for method Program:ConsumeHalf(System.Half,System.Half):System.Half (FullOpts)
; Emitting BLENDED_CODE for generic X64 + VEX + EVEX on Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  4,  4   )    half  ->  [rsp+0x10]  single-def <System.Half>
;  V01 arg1         [V01    ] (  3,  3   )    half  ->  [rsp+0x18]  single-def <System.Half>
;  V02 loc0         [V02    ] (  2,  2   )    half  ->  [rsp+0x00]  single-def <System.Half>
;# V03 OutArgs      [V03    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;
; Lcl frame size = 8

G_M44932_IG01:  ;; offset=0x0000
       50                   push     rax
       894C2410             mov      dword ptr [rsp+0x10], ecx
       89542418             mov      dword ptr [rsp+0x18], edx
						;; size=9 bbWeight=1 PerfScore 3.00
G_M44932_IG02:  ;; offset=0x0009
       62F57E081044240C     vmovsh   xmm0, word  ptr [rsp+0x18]
       62F57E0858442408     vaddsh   xmm0, xmm0, word  ptr [rsp+0x10]
       62F57E08110424       vmovsh   word  ptr [rsp], xmm0
       62F57E0858442408     vaddsh   xmm0, xmm0, word  ptr [rsp+0x10]
       C5F97EC0             vmovd    eax, xmm0
						;; size=35 bbWeight=1 PerfScore 8.00
G_M44932_IG03:  ;; offset=0x002C
       4883C408             add      rsp, 8
       C3                   ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 49, prolog size 1, PerfScore 12.25, instruction count 10, allocated bytes for code 49 (MethodHash=5608507b) for method Program:ConsumeHalf(System.Half,System.Half):System.Half (FullOpts)
; ============================================================

Mostly, I am looking for confirmation that the changes are on the right track, after which I will go back and rework this PR properly (I know, for example, that I have commented out an assert and hacked in some if-statement conditions).

tannergooding · 2025-10-21T16:11:49Z

src/coreclr/inc/corinfoinstructionset.h

            return "AVXVNNIINT";
        case InstructionSet_AVXVNNIINT_V512 :
            return "AVXVNNIINT_V512";
+        case InstructionSet_Half :


What CPUID bit or other support is InstructionSet_Half tracking?

I'd expect us to just use InstructionSet_AVX2 for F16C and InstructionSet_AVX10v1 for FP16

Good to know. I was kind of mirroring the Vector128 etc. ISAs. Mostly hacking. I will adjust this for the real PR.

tannergooding · 2025-10-21T16:12:52Z

src/coreclr/jit/codegencommon.cpp


        var_types storeType = genParamStackType(paramVarDsc, segment);
-        if (!varDsc->TypeIs(TYP_STRUCT) && (genTypeSize(genActualType(varDsc)) < genTypeSize(storeType)))
+        if (!varDsc->TypeIs(TYP_STRUCT) && (!varDsc->TypeIs(TYP_HALF)) && (genTypeSize(genActualType(varDsc)) < genTypeSize(storeType)))


Not sure this is needed. I'd expect genActualType to just keep TYP_HALF as TYP_HALF.

I may have placed this because the storeType was getting saved as TYP_HALF when I wanted to keep the handling similar to that of TYP_STRUCT.

Now, I may have fixed the issue in another spot and this is no longer relevant. I will check.
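For reference, the widening behaviour under discussion can be modelled roughly as below. This is a standalone sketch, not JIT code: `VarType` and `actualType` are hypothetical stand-ins for `var_types` and `genActualType`, and the only behaviour assumed is that small integer types widen to int while `TYP_HALF`, as the review comment expects, is returned unchanged.

```cpp
#include <cassert>

// Hypothetical, trimmed-down stand-in for the JIT's var_types enum.
enum class VarType { UByte, Short, UShort, Int, Long, Half, Float, Double };

// Toy model of genActualType: small integer types widen to Int,
// everything else (including Half) is returned unchanged.
inline VarType actualType(VarType type)
{
    switch (type)
    {
        case VarType::UByte:
        case VarType::Short:
        case VarType::UShort:
            return VarType::Int;

        default:
            return type;
    }
}
```

Under this model a ushort widens to int but a half does not, so a dedicated `TYP_HALF` exclusion in the condition would only matter if some other part of the pipeline rewrote the type first.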

tannergooding · 2025-10-21T16:13:28Z

src/coreclr/jit/codegencommon.cpp

+    // todo-xarch-half: understand why I needed to comment this out
+    //assert(treeNode->TypeGet() == genActualType(treeNode));


I'm guessing the handling of TYP_HALF in genActualType is wrong here.

I believe the issue I was facing was genActualType was returning TYP_INT while treeNode->TypeGet() was returning a TYP_USHORT. It wasn't TYP_HALF directly that was the issue.

Edit: As in, the bitcast node treeNode is casting to a TYP_USHORT, but it looks like the assumption is that, for a bitcast, the node type and the VM type representing the node type must be the same (which is not true for TYP_USHORT).

I wouldn't expect us to have TYP_USHORT here at all for that case

I wonder if it has to do with you having the simd type as TYP_HALF and then the base type ended up as TYP_USHORT, when we rather want simd type as TYP_SIMD16/32/64 and the base type as TYP_HALF

I'm not sure... here is what I get in tree...

It's that I am trying to preserve the ABI to pass a Half struct, so I am getting to where a store to the USHORT field from a TYP_HALF has to happen. Don't we want to represent the Half struct as a TYP_HALF, not a TYP_SIMD16 etc?

I think the problem is it being modeled as STORE_LCL_VAR half with a ushort field. I think that should've stayed as TYP_STRUCT instead for that scenario and had a GT_BITCAST node inserted

It might actually be that trying to preserve the ABI when we are producing TYP_HALF is more, rather than less, complicated and that it's fine to just let it naturally fall out if there isn't "extra" work required here.

That is, rather than the above, I'd expect this to be more of an ABI classification for TYP_HALF, and for us to mark that it is passed/returned the same as a TYP_STRUCT containing a single TYP_USHORT field rather than as floating-point.

It'd still be exposed as TYP_HALF, much as TYP_SIMD is still always TYP_SIMD while Unix passes it in register but Windows passes it via shadow copy.
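The classification idea above can be sketched roughly as follows. This is a standalone illustration rather than JIT code: `VarType`, `PassingKind`, and `classifyArg` are hypothetical names, and the real ABI classifier handles far more cases (sizes, targets, return buffers).

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for JIT var_types; not the real enum.
enum class VarType { Int, UShort, Float, Double, Half, Struct };

// Where a by-value argument travels in this toy model.
enum class PassingKind { IntegerRegister, FloatRegister };

// Sketch: the value stays typed as Half everywhere else, but for ABI
// purposes it is classified like a struct wrapping a single ushort,
// i.e. as an integer rather than as a floating-point value.
inline PassingKind classifyArg(VarType type)
{
    switch (type)
    {
        case VarType::Float:
        case VarType::Double:
            return PassingKind::FloatRegister;

        case VarType::Half:   // treated the same as struct { ushort _value; }
        case VarType::UShort:
        case VarType::Int:
        case VarType::Struct: // assumption: a small struct that fits one register
            return PassingKind::IntegerRegister;
    }

    assert(false && "unknown VarType");
    return PassingKind::IntegerRegister;
}
```

This matches the prototype's disassembly, where the two `Half` arguments reach `ConsumeHalf` in `rcx` and `rdx` rather than in `xmm` registers.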

Coming back to this now after the refactoring PR --- I understand what you are saying. I will work on this behavior now.

So I'm going back over this now...

From what I see, a C# struct with a single ushort field is handled internally a bit differently than a struct with a single float field. The latter I think is a bit closer to GenTree behavior we would need because we would be using a float register for the half type.

When a struct has a single ushort field, a bitcast isn't needed in lowering, from what I can see. Instead a PUTREG is inserted, and since genActualType(ushort) == int it works fine.

I wonder if I should program the half type as genActualType(half) == float, which would mean its bitcast would produce something closer to the struct with a float field, though I think other issues may arise.

tannergooding · 2025-10-21T16:14:29Z

src/coreclr/jit/compiler.h

+        else if (size == 2)
+        {
+            // todo-xarch-half: I think this is the wrong approach, come back to refactor
+            simdType = TYP_HALF;
+        }


Comment is correct. We only want TYP_SIMD8/12/16/32/64 to be "SIMD types".

TYP_HALF is more like TYP_FLOAT and is a potential "base type" for the SIMD type.
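A minimal sketch of that split, assuming a trimmed-down type enum: `VarType`, `isSimdType`, `isFloatingBaseType`, and `makeVectorShape` are hypothetical names used only for illustration and do not correspond to real JIT helpers.

```cpp
#include <cassert>

// Hypothetical, trimmed-down stand-in for the JIT's var_types enum.
enum class VarType { Half, Float, Double, Simd8, Simd12, Simd16, Simd32, Simd64 };

// Only the TYP_SIMD8/12/16/32/64 equivalents count as SIMD types.
inline bool isSimdType(VarType type)
{
    switch (type)
    {
        case VarType::Simd8:
        case VarType::Simd12:
        case VarType::Simd16:
        case VarType::Simd32:
        case VarType::Simd64:
            return true;

        default:
            return false;
    }
}

// Half sits next to Float and Double as a possible element ("base") type.
inline bool isFloatingBaseType(VarType type)
{
    return (type == VarType::Half) || (type == VarType::Float) || (type == VarType::Double);
}

struct VectorShape
{
    VarType simdType; // the container type, e.g. Simd16
    VarType baseType; // the element type, e.g. Half
};

inline VectorShape makeVectorShape(VarType simdType, VarType baseType)
{
    assert(isSimdType(simdType));
    assert(!isSimdType(baseType));
    return VectorShape{simdType, baseType};
}
```

In this model a 128-bit vector of halves would be `makeVectorShape(VarType::Simd16, VarType::Half)`, and a size-2 struct never produces a SIMD type at all.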

tannergooding · 2025-10-21T16:15:17Z

src/coreclr/jit/emitxarch.cpp

+        assert((leadingBytes == 0x0F) || ((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) ||
+                                            emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2) ||


nit: AVX10v2 implies AVX10v1, so checking AVX10v1 already covers both and we can just change the first line.

Suggested change

-        assert((leadingBytes == 0x0F) || ((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) ||
-                                            emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2) ||
+        assert((leadingBytes == 0x0F) || ((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) ||

tannergooding · 2025-10-21T16:15:35Z

src/coreclr/jit/emitxarch.cpp

        leadingBytes = (code >> 16) & 0xFF;
        assert(leadingBytes == 0x0F ||
-               (emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2) && leadingBytes >= 0x00 &&
+               ((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) || emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2)) && leadingBytes >= 0x00 &&


Similar here and elsewhere. AVX10v2 implies AVX10v1, so the AVX10v1 check alone is enough.
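The redundancy can be sketched with a toy feature mask in which enabling AVX10v2 also enables AVX10v1, so that testing the older set alone covers both. The `Isa` enum and the `enableIsa` / `isSupported` helpers are hypothetical and only model the implication; they are not the JIT's instruction-set tracking.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit flags; the real JIT tracks instruction sets differently.
enum Isa : uint32_t
{
    AVX10v1 = 1u << 0,
    AVX10v2 = 1u << 1,
};

// Enabling an ISA also enables everything it implies (v2 implies v1 here).
inline uint32_t enableIsa(uint32_t supported, Isa isa)
{
    supported |= isa;
    if (isa == AVX10v2)
    {
        supported |= AVX10v1;
    }
    // Invariant of the model: v2 is never set without v1.
    assert(((supported & AVX10v2) == 0) || ((supported & AVX10v1) != 0));
    return supported;
}

inline bool isSupported(uint32_t supported, Isa isa)
{
    return (supported & isa) != 0;
}
```

With that invariant, `isSupported(s, AVX10v1) || isSupported(s, AVX10v2)` always equals `isSupported(s, AVX10v1)`, which is why the second check in the assert can be dropped.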

tannergooding · 2025-10-21T16:17:27Z

src/coreclr/jit/emitxarch.cpp


 #if defined(TARGET_AMD64)
        case INS_movsxd:
+        case INS_vmovsh:


nit: This isn't 64-bit exclusive, it works fine in 32-bit compat mode

tannergooding · 2025-10-21T16:19:02Z

src/coreclr/jit/emitxarch.cpp


+        case INS_vmovsh:
+        {
+            hasSideEffect = false;


I'd expect this to be true, much like movss since it clears the upper bits of the register (i.e. it overwrites bits 0-15, preserves bits 16-127, and clears bits 128-MAXVL)
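To make the bit ranges concrete, here is a small software model of the behaviour as described in the comment above. It is a sketch only: `VecReg` and `modelVmovsh` are hypothetical names, MAXVL is assumed to be 512 bits, and the model covers just the case where source and destination are the same register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model of a 512-bit vector register as 32 16-bit lanes.
using VecReg = std::array<uint16_t, 32>;

// Lane 0 (bits 0-15) is overwritten, lanes 1-7 (bits 16-127) are
// preserved, and lanes 8-31 (bits 128-MAXVL) are cleared.
inline VecReg modelVmovsh(VecReg dst, uint16_t value)
{
    dst[0] = value;
    for (std::size_t lane = 8; lane < dst.size(); ++lane)
    {
        dst[lane] = 0;
    }
    assert(dst[0] == value);
    return dst;
}
```

Because the upper lanes can change even when lane 0 already holds the same value, a move of a register onto itself is not a no-op in this model, which is the reason for treating it as having a side effect.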

tannergooding · 2025-10-21T16:20:23Z

src/coreclr/jit/emitxarch.cpp

+            if (IsXMMReg(reg))
+            {
+                return emitXMMregName(reg);
+            }
+            else if (reg > REG_RDI)


This shouldn't be 64-bit exclusive and probably should follow the pattern of the other paths

if (IsXMMReg(reg))
{
    return emitXMMregName(reg);
}
assert(isGeneralRegister(reg));
#if defined(TARGET_AMD64)

tannergooding · 2025-10-21T16:22:40Z

src/coreclr/jit/emitxarch.cpp

    else if (code & 0xFF000000)
    {
-        if (size == EA_2BYTE)
+        if (size == EA_2BYTE && ins != INS_vmovsh)


I wonder if this and other paths should rather just check for INS_movbe since that's the only path that should hit it.

We would then instead add assert((size != EA_2BYTE) || IsFp16Instruction(ins)) (or something along those lines since this likely also expands to bfloat16 in the future).

I can check this as I rework the PR.
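One possible shape for that check, as a hedged sketch: `IsFp16Instruction` is the helper name floated in the comment above and does not exist yet, the instruction list is a guess based on the instructions in the prototype's disassembly, and `needsTwoByteSpecialCase` is a hypothetical wrapper standing in for the real emitter path.

```cpp
#include <cassert>

// Hypothetical subset of the instruction enum, for illustration only.
enum Instruction { INS_mov, INS_movbe, INS_vmovsh, INS_vaddsh, INS_vcvtss2sh };

// Hypothetical subset of the operand-size enum.
enum OperandSize { EA_1BYTE = 1, EA_2BYTE = 2, EA_4BYTE = 4, EA_8BYTE = 8 };

// Proposed helper: true for scalar FP16 instructions (bfloat16 could be
// added later under a broader name).
inline bool IsFp16Instruction(Instruction ins)
{
    switch (ins)
    {
        case INS_vmovsh:
        case INS_vaddsh:
        case INS_vcvtss2sh:
            return true;

        default:
            return false;
    }
}

// Returns true when the 2-byte special case applies. Only movbe is
// expected to take it; any other 2-byte instruction must be FP16.
inline bool needsTwoByteSpecialCase(Instruction ins, OperandSize size)
{
    if (ins == INS_movbe)
    {
        return size == EA_2BYTE;
    }

    assert((size != EA_2BYTE) || IsFp16Instruction(ins));
    return false;
}
```

Compared with excluding each new instruction by name, this keeps the condition stable as more FP16 instructions are added and turns an unexpected 2-byte instruction into an assert failure instead of a silent mis-encoding.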

tannergooding · 2025-10-21T16:22:52Z

src/coreclr/jit/emitxarch.cpp

    else if (code & 0xFF000000)
    {
-        if (size == EA_2BYTE)
+        if (size == EA_2BYTE && (ins != INS_vmovsh && ins != INS_vaddsh))


Similar comment here

tannergooding · 2025-10-21T16:25:35Z

src/coreclr/jit/hwintrinsic.h

    HWIntrinsicFlag     flags;    // 4-bytes
    NamedIntrinsic      id;       // 2-bytes
+#if defined(TARGET_XARCH)
+    uint16_t            ins[11];  // 10 * 2-bytes


nit: The comment needs fixing here, as does the size listing above.

We really should also fix this for all platforms (and can be mostly handled on others by just fixing their #define to always specify INS_invalid), but that can be done separately if needed.

Ok. Was keeping it pay for play but I will make this change.

All the platforms will end up getting support here; it just hasn't bubbled up to getting done yet. So I think it's fine in this case to just go ahead and set up that cost, which will ideally help motivate other platforms to adopt the changes sooner as well.
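For the size comments, the arithmetic can be checked with a trimmed-down stand-in for the table entry. `IntrinsicEntrySketch` is hypothetical and contains only the three fields shown in the diff; the real struct has more members, so its total size will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, trimmed-down version of the table entry from the diff.
struct IntrinsicEntrySketch
{
    uint32_t flags;   // 4 bytes
    uint16_t id;      // 2 bytes
    uint16_t ins[11]; // 11 * 2 bytes = 22 bytes
};

// Size of the instruction array alone.
constexpr std::size_t kInsBytes = sizeof(IntrinsicEntrySketch::ins);

static_assert(kInsBytes == 11 * sizeof(uint16_t), "ins[] is 11 two-byte entries");
static_assert(sizeof(IntrinsicEntrySketch) == 28, "4 + 2 + 22 bytes, already 4-byte aligned");
```

Keeping a `static_assert` like this next to the field comments is one way to stop the comments and the array length from drifting apart again.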

tannergooding · 2025-10-21T16:29:53Z

src/coreclr/jit/hwintrinsicxarch.cpp

+        case NI_Half_op_Explicit:
+        {
+            assert(sig->numArgs == 1);
+
+            if (compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
+            {
+                StackEntry se   = impPopStack();
+                op1     = se.val;
+                retNode = gtNewSimdHWIntrinsicNode(retType, op1, NI_AVX10v1_ConvertFloatToHalf, CORINFO_TYPE_FLOAT, 2);
+            }
+
+            break;
+        }
+
+        case NI_Half_op_Addition:
+        {
+            assert(sig->numArgs == 2);
+
+            if (compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
+            {
+                StackEntry se1   = impPopStack();
+                StackEntry se2   = impPopStack();
+                op1     = se1.val;
+                op2     = se2.val;
+                retNode = gtNewSimdHWIntrinsicNode(retType, op1, op2, NI_AVX10v1_HalfAdd, CORINFO_TYPE_HALF, 2);
+            }
+
+            break;
+        }


Rather than these being here, I'd expect we have handling similar to what we have for System.Single here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/importercalls.cpp#L10456-L10459

That is, under case 'H': we'd check for "Half" and then call lookupPrimitiveFloatNamedIntrinsic

We'd end up having that operate similarly to say NI_System_Math_FusedMultiplyAdd where we check for the ISA support (AVX2 for F16C or AVX10v1 for FP16) and then introduce the relevant gtNewSimdCreateScalarUnsafeNode and gtNewSimdHWIntrinsicNode for the conversion or addition, prior to calling gtNewSimdToScalarNode: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/importercalls.cpp#L4303-L4368
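The name-based dispatch described above could look roughly like this. It is a standalone sketch: `PrimitiveFloatClass` and `classifyPrimitiveFloatClass` are hypothetical names, and the code only models the "switch on the first character, then compare the full name" step, not the subsequent intrinsic lookup or node creation.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical result of recognising a primitive floating-point class name.
enum class PrimitiveFloatClass { None, Double, Half, Single };

// Switch on the first character, then confirm the full class name.
inline PrimitiveFloatClass classifyPrimitiveFloatClass(const char* className)
{
    assert(className != nullptr);

    switch (className[0])
    {
        case 'D':
            return (std::strcmp(className, "Double") == 0) ? PrimitiveFloatClass::Double
                                                           : PrimitiveFloatClass::None;
        case 'H':
            return (std::strcmp(className, "Half") == 0) ? PrimitiveFloatClass::Half
                                                         : PrimitiveFloatClass::None;
        case 'S':
            return (std::strcmp(className, "Single") == 0) ? PrimitiveFloatClass::Single
                                                           : PrimitiveFloatClass::None;
        default:
            return PrimitiveFloatClass::None;
    }
}
```

A class recognised as `Half` would then go through the same per-method lookup and ISA check as the other primitive floating-point types, instead of through dedicated `NI_Half_*` cases in the hardware-intrinsic importer.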

tannergooding · 2025-10-21T16:34:37Z

src/coreclr/jit/importer.cpp

+        if (structSizeMightRepresentSIMDType(originalSize) 
+#if defined(TARGET_XARCH)
+        || (originalSize == 2)
+#endif
+           )


Now that we're including more typical sizes, I imagine we want to change the check on L1187 to include CORINFO_FLG_INTRINSIC_TYPE and to mark System.Half as [Intrinsic]. We then likely want to tweak this a bit to handle the sizes a bit more efficiently, since we have 8/12/16/32/64 as hitting the SIMD path and 2 hitting the HALF path, but that may expand to include BFLOAT16 in the future or even INT128/UINT128 (16)

Doing this should make it a bit more "pay for play".
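A rough sketch of that gating, with the intrinsic-type flag modelled as a plain `bool`: `TypePath` and `classifyIntrinsicStruct` are hypothetical names, and the size list is taken from the comment above.

```cpp
#include <cassert>

// Which special handling a struct gets in this toy model.
enum class TypePath { None, Simd, Half };

// Only structs flagged as intrinsic types are considered at all, so
// ordinary 2-byte structs pay nothing for the new Half path.
inline TypePath classifyIntrinsicStruct(bool isIntrinsicType, unsigned sizeBytes)
{
    if (!isIntrinsicType)
    {
        return TypePath::None;
    }

    switch (sizeBytes)
    {
        case 2:
            return TypePath::Half; // may later be shared with bfloat16

        case 8:
        case 12:
        case 16:
        case 32:
        case 64:
            return TypePath::Simd;

        default:
            return TypePath::None;
    }
}
```

Checking the flag first is what keeps this pay for play: the size switch only runs for types the runtime has already marked as intrinsic.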

tannergooding · 2025-10-21T16:35:38Z

src/coreclr/jit/importercalls.cpp

    if (simdReturnType != call->TypeGet())
    {
-        assert(varTypeIsSIMD(simdReturnType));
+        assert(varTypeIsSIMD(simdReturnType) || simdReturnType == TYP_HALF);


I think we need to tweak the names here, maybe intrinsicReturnType or abiReturnType or something is "better". -- Don't have a preference, but others on jitcontrib might.

anthonycanino added 5 commits October 2, 2025 14:01

Adding the ISAs.

3f55c6e

Hacking in a Half/AVX512FP16 ISA and intrinsic.

00e2226

Merge fixes.

33aa89a

Getting the ISAs and import to a iterable state.

2668f46

(WIP) Hacking around with TYP_HALF.

28b9f5a

tannergooding reviewed Oct 21, 2025


3 participants
