Skip to content

Conversation

@anthonycanino
Copy link
Owner

I've hacked a bit of changes together to see if I am working on the ABI in the right places. So far, the following code will

[MethodImpl(MethodImplOptions.NoInlining)]
public static Half ProduceHalf(float val, float val2)
{
	Half h1 = (Half)val;
	Half h2 = (Half)val2;
	return ConsumeHalf(h1, h2);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static Half ConsumeHalf(Half h1, Half h2)
{
	Half h3 = h1 + h2;
	Half h4 = h1 + h3;
	return h4;
}

which will get lowered into the following asm

; Assembly listing for method Program:ProduceHalf(float,float):System.Half (FullOpts)
; Emitting BLENDED_CODE for generic X64 + VEX + EVEX on Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->  mm0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )   float  ->  mm1         single-def
;* V02 loc0         [V02    ] (  0,  0   )    half  ->  zero-ref    <System.Half>
;  V03 OutArgs      [V03    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <UNNAMED>
;* V04 tmp1         [V04    ] (  0,  0   )  ushort  ->  zero-ref    "field V02._value (fldOffset=0x0)" P-INDEP
;  V05 tmp2         [V05    ] (  2,  4   )    half  ->  [rsp+0x30]  "by-value struct argument" <System.Half>
;  V06 tmp3         [V06    ] (  2,  4   )    half  ->  [rsp+0x28]  "by-value struct argument" <System.Half>
;
; Lcl frame size = 56

G_M43922_IG01:  ;; offset=0x0000
       4883EC38             sub      rsp, 56
						;; size=4 bbWeight=1 PerfScore 0.25
G_M43922_IG02:  ;; offset=0x0004
       62F57C081DC0         vcvtss2sh xmm0, xmm0
       62F57E0811442418     vmovsh   word  ptr [rsp+0x30], xmm0
       62F574081DC1         vcvtss2sh xmm0, xmm1
       62F57E0811442414     vmovsh   word  ptr [rsp+0x28], xmm0
       0FB74C2430           movzx    rcx, word  ptr [rsp+0x30]
       0FB7542428           movzx    rdx, word  ptr [rsp+0x28]
       FF15E0C08700         call     [Program:ConsumeHalf(System.Half,System.Half):System.Half]
       90                   nop      
						;; size=45 bbWeight=1 PerfScore 9.25
G_M43922_IG03:  ;; offset=0x0031
       4883C438             add      rsp, 56
       C3                   ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 54, prolog size 4, PerfScore 10.75, instruction count 11, allocated bytes for code 54 (MethodHash=f2a4546d) for method Program:ProduceHalf(float,float):System.Half (FullOpts)
; ============================================================

; Assembly listing for method Program:ConsumeHalf(System.Half,System.Half):System.Half (FullOpts)
; Emitting BLENDED_CODE for generic X64 + VEX + EVEX on Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  4,  4   )    half  ->  [rsp+0x10]  single-def <System.Half>
;  V01 arg1         [V01    ] (  3,  3   )    half  ->  [rsp+0x18]  single-def <System.Half>
;  V02 loc0         [V02    ] (  2,  2   )    half  ->  [rsp+0x00]  single-def <System.Half>
;# V03 OutArgs      [V03    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;
; Lcl frame size = 8

G_M44932_IG01:  ;; offset=0x0000
       50                   push     rax
       894C2410             mov      dword ptr [rsp+0x10], ecx
       89542418             mov      dword ptr [rsp+0x18], edx
						;; size=9 bbWeight=1 PerfScore 3.00
G_M44932_IG02:  ;; offset=0x0009
       62F57E081044240C     vmovsh   xmm0, word  ptr [rsp+0x18]
       62F57E0858442408     vaddsh   xmm0, xmm0, word  ptr [rsp+0x10]
       62F57E08110424       vmovsh   word  ptr [rsp], xmm0
       62F57E0858442408     vaddsh   xmm0, xmm0, word  ptr [rsp+0x10]
       C5F97EC0             vmovd    eax, xmm0
						;; size=35 bbWeight=1 PerfScore 8.00
G_M44932_IG03:  ;; offset=0x002C
       4883C408             add      rsp, 8
       C3                   ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 49, prolog size 1, PerfScore 12.25, instruction count 10, allocated bytes for code 49 (MethodHash=5608507b) for method Program:ConsumeHalf(System.Half,System.Half):System.Half (FullOpts)
; ============================================================

Mostly, I am looking for some confirmation that the changes are on the right track, after which I will go back and work on this PR without such a hacking mindset (I know that for example, I have commented out an assert, and hacked in some if statement conditions).

return "AVXVNNIINT";
case InstructionSet_AVXVNNIINT_V512 :
return "AVXVNNIINT_V512";
case InstructionSet_Half :

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What CPUID bit or other support is InstructionSet_Half tracking?

I'd expect us to just use InstructionSet_AVX2 for F16C and InstructionSet_AVX10v1 for FP16

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. I was kind of mirroring the Vector128 etc. ISAs. Mostly hacking. I will adjust this for the real PR.


var_types storeType = genParamStackType(paramVarDsc, segment);
if (!varDsc->TypeIs(TYP_STRUCT) && (genTypeSize(genActualType(varDsc)) < genTypeSize(storeType)))
if (!varDsc->TypeIs(TYP_STRUCT) && (!varDsc->TypeIs(TYP_HALF)) && (genTypeSize(genActualType(varDsc)) < genTypeSize(storeType)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is needed. I'd expect genActualType to just keep TYP_HALF as TYP_HALF.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have placed this because the storeType was getting saved as TYP_HALF when I wanted to keep the handling similar to that of TYP_STRUCT.

Now, I may have fixed the issue in another spot and this is no longer relevant. I will check.

Comment on lines +7958 to +7959
// todo-xarch-half: understand why I needed to comment this out
//assert(treeNode->TypeGet() == genActualType(treeNode));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing the handling of TYP_HALF in genActualType is wrong here.

Copy link
Owner Author

@anthonycanino anthonycanino Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the issue I was facing was genActualType was returning TYP_INT while treeNode->TypeGet() was returning a TYP_USHORT. It wasn't TYP_HALF directly that was the issue.

Edit: As in, the bitcast node treeNode is casting to a TYP_USHORT but looks like the assumption is that for a bitcast, both the node type and the VM type representing the node type be the same (which is not true for TYP_USHORT).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't expect us to have TYP_USHORT here at all for that case

I wonder if it has to do with you having the simd type as TYP_HALF and then the base type ended up as TYP_USHORT, when we rather want simd type as TYP_SIMD16/32/64 and the base type as TYP_HALF

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure... here is what I get in tree...

image

It's that I am trying to preserve the ABI to pass a Half struct, so I am getting to where a store to the USHORT field from a TYP_HALF has to happen. Don't we want to represent the Half struct as a TYP_HALF, not a TYP_SIMD16 etc?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is it being modeled as STORE_LCL_VAR half with a ushort field. I think that should've stayed as TYP_STRUCT instead for that scenario and had a GT_BITCAST node inserted

It might actually be that trying to preserve the ABI when we are producing TYP_HALF is more, rather than less, complicated and that it's fine to just let it naturally fall out if there isn't "extra" work required here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, rather than the above, I'd expect this to be more an ABI classification for TYP_HALF and for us to mark that it is it passed/returned the same as TYP_STRUCT containing a single TYP_USHORT field rather than as floating-point.

It'd still be exposed as TYP_HALF, much as TYP_SIMD is still always TYP_SIMD while Unix passes it in register but Windows passes it via shadow copy.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this now after the refactoring PR --- I understand what you are saying. I will work on this behavior now.

Copy link
Owner Author

@anthonycanino anthonycanino Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm going back over this now...

From what I see, a C# struct with a single ushort field is handled internally a bit differently than a struct with a single float field. The latter I think is a bit closer to GenTree behavior we would need because we would be using a float register for the half type.

When a struct has a single ushort field, a bitcast isn't needed in the lowering from what I can see and instead a PUTREG is inserted and since genActualType(ushort) == int it works fine.

I wonder if I should program the half type as genActualType(half) == float, which would mean its bitcast would produce something closer to the struct with a float field, though I think other issues may arise.

Comment on lines +9318 to +9322
else if (size == 2)
{
// todo-xarch-half: I think this is the wrong approach, come back to refactor
simdType = TYP_HALF;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is correct. We only want TYP_SIMD8/12/16/32/64 to be "SIMD types".

TYP_HALF is more like TYP_FLOAT and is a potential "base type" for the SIMD type.

Comment on lines +3080 to +3081
assert((leadingBytes == 0x0F) || ((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) ||
emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2) ||

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, AVX10v1 implies AVX10v2, so we can just change the first line.

Suggested change
assert((leadingBytes == 0x0F) || ((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) ||
emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2) ||
assert((leadingBytes == 0x0F) || ((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) ||

leadingBytes = (code >> 16) & 0xFF;
assert(leadingBytes == 0x0F ||
(emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2) && leadingBytes >= 0x00 &&
((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) || emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2)) && leadingBytes >= 0x00 &&

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here and elsewhere. AVX10v1 implies AVX10v2


#if defined(TARGET_AMD64)
case INS_movsxd:
case INS_vmovsh:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This isn't 64-bit exclusive, it works fine in 32-bit compat mode


case INS_vmovsh:
{
hasSideEffect = false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect this to be true, much like movss since it clears the upper bits of the register (i.e. it overwrites bits 0-15, preserves bits 16-127, and clears bits 128-MAXVL)

Comment on lines +11813 to +11817
if (IsXMMReg(reg))
{
return emitXMMregName(reg);
}
else if (reg > REG_RDI)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be 64-bit exclusive and probably should follow the pattern of the other paths

            if (IsXMMReg(reg))
            {
                return emitXMMregName(reg);
            }
            assert(isGeneralRegister(reg));
#if defined(TARGET_AMD64)

else if (code & 0xFF000000)
{
if (size == EA_2BYTE)
if (size == EA_2BYTE && ins != INS_vmovsh)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this and other paths should rather just check for INS_movbe since that's the only path that should hit it.

We would then instead add assert((size != EA_2BYTE) || IsFp16Instruction(ins)) (or something along those lines since this likely also expands to bfloat16 in the future).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check this as I rework the PR.

else if (code & 0xFF000000)
{
if (size == EA_2BYTE)
if (size == EA_2BYTE && (ins != INS_vmovsh && ins != INS_vaddsh))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here

HWIntrinsicFlag flags; // 4-bytes
NamedIntrinsic id; // 2-bytes
#if defined(TARGET_XARCH)
uint16_t ins[11]; // 10 * 2-bytes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The comment needs fixing here, as does the size listing above.

We really should also fix this for all platforms (and can be mostly handled on others by just fixing their #define to always specify INS_invalid), but that can be done separately if needed.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Was keeping it pay for play but I will make this change.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the platforms will end up getting support here, it just hadn't bubbled up to getting done yet. So I think its fine in this case to just go ahead and setup that cost, which will ideally help motivate other platforms to adopt the changes sooner as well.

Comment on lines +1257 to +1285
case NI_Half_op_Explicit:
{
assert(sig->numArgs == 1);

if (compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
{
StackEntry se = impPopStack();
op1 = se.val;
retNode = gtNewSimdHWIntrinsicNode(retType, op1, NI_AVX10v1_ConvertFloatToHalf, CORINFO_TYPE_FLOAT, 2);
}

break;
}

case NI_Half_op_Addition:
{
assert(sig->numArgs == 2);

if (compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
{
StackEntry se1 = impPopStack();
StackEntry se2 = impPopStack();
op1 = se1.val;
op2 = se2.val;
retNode = gtNewSimdHWIntrinsicNode(retType, op1, op2, NI_AVX10v1_HalfAdd, CORINFO_TYPE_HALF, 2);
}

break;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than these being here, I'd expect we have handling similar to what we have for System.Single here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/importercalls.cpp#L10456-L10459

That is, under case 'H': we'd check for "Half" and then call lookupPrimitiveFloatNamedIntrinsic

We'd end up having that operate similarly to say NI_System_Math_FusedMultiplyAdd where we check for the ISA support (AVX2 for F16C or AVX10v1 for FP16) and then introduce the relevant gtNewSimdCreateScalarUnsafeNode and gtNewSimdHWIntrinsicNode for the conversion or addition, prior to calling gtNewSimdToScalarNode: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/importercalls.cpp#L4303-L4368

Comment on lines +1192 to +1196
if (structSizeMightRepresentSIMDType(originalSize)
#if defined(TARGET_XARCH)
|| (originalSize == 2)
#endif
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're including more typical sizes, I imagine we want to change the check on L1187 to include CORINFO_FLG_INTRINSIC_TYPE and to mark System.Half as [Intrinsic]. We then likely want to tweak this a bit to handle the sizes a bit more efficiently, since we have 8/12/16/32/64 as hitting the SIMD path and 2 hitting the HALF path, but that may expand to include BFLOAT16 in the future or even INT128/UINT128 (16)

Doing this should make it a bit more "pay for play".

if (simdReturnType != call->TypeGet())
{
assert(varTypeIsSIMD(simdReturnType));
assert(varTypeIsSIMD(simdReturnType) || simdReturnType == TYP_HALF);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to tweak the names here, maybe intrinsicReturnType or abiReturnType or something is "better". -- Don't have a preference, but others on jitcontrib might.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants