-
Notifications
You must be signed in to change notification settings - Fork 0
Prototype changes for TYP_HALF with refactored vartypes. #16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: simd-vartypes
Are you sure you want to change the base?
Conversation
|
For the unoptimized after lowering look like this... For the optimized after lowering look like this... We can see the ABI respected when the half type needs to passed, but when it does not, it is treated as a float register. I like how the trees look a bit more now, but I have disabled some asserts, particular, we are allowing a For reference, here are the full jit dumps are attached. |
| instruction store_ins = ins_Store(storeType); | ||
| if (storeType == TYP_HALF) | ||
| { | ||
| // We cannot use `vmovsh` with an integer register, which is what the ABI would pass the | ||
| // 16-bit float value in. Switch to `vmov` which supports integer registers. | ||
| #if defined(TARGET_XARCH) | ||
| store_ins = INS_mov; | ||
| #else | ||
| store_ins = INS_invalid; | ||
| assert(!"TYP_HALF parameter passing not supported on this platform"); | ||
| #endif | ||
| } | ||
|
|
||
| GetEmitter()->emitIns_S_R(store_ins, emitActualTypeSize(storeType), segment.GetRegister(), lclNum, | ||
| offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get this, but it also seems like "extra work" that might pessimize the typical path.
I wonder if it'd be better to just keep TYP_STRUCT, much as it is for the struct S { float value; } case. In which case the classifier would state it lives in a GPR and we'd just get INS_mov already. I wouldn't expect us to track it as TYP_HALF unless we're
So we'd only retype/normalize to TYP_HALF in the case the ABI is being followed and its passed in SIMD register (at which point we can use vpextrw and vpinsrw/vmovd (F16C) or vmovw (AVX512-FP16)
This might even push us in the direction that following the ABI is trivial and expected. We only produce TYP_HALF if its accelerated and the handling in that case is correct by convention of it being floating-point and following the other floating-point conventions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I am not sure I completely follow.
Would we then perform a normalization/type conversion as a separate compiler pass? Meaning, we do not normalize on a struct import like it is currently done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if the idea to preserve the existing behavior is actually making this more complex, rather than simpler. I'm thinking it's actually less work to just do this right in the first place, ABI wise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I've seen, I think it is.
One thing I can try is prototyping the ABI where half is passed and returned in a float register. We can see if it is much simpler. That shouldn't take too long to try out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding I put the changes on a separate branch #17
I have not addressed all nits below, but I wanted to get the major changes up and see what the code looked like. I think this is much easier I am only changing a few places. You'll see in the code that some of the signature checking might need to be updated. It's much easier to not treat TYP_HALF as a struct in any way, and instead treat it wholly as a primitive type (otherwise the jit picks it up as a struct in so many places where different handling is needed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it intensionally
When TYP_HALF was treated as a struct type, it caused a lot handling to apply to the type that didn't make sense for a primitive type.
One confusion that is going on, and perhaps we should move the conversation, is that we are talking about the PR at #17, where I redid the handling of the half type to more closely match that of a primitive float type in terms of the ABI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a difference between treating it as TYP_STRUCT and having it classified as a struct such that varTypeIsStruct(...) report true (being marked with VTF_S).
We don't expect the former, because we're retyping it to TYP_HALF (much as we retype to TYP_SIMD8/12/16/32/64). But we do still expect the latter.
Looking at the new type entry, it may be that the combination of VTF_S and VTF_FLT is breaking things here, as that may get into ordering or other nuance that hasn't been accounted for due to it being a "new combination". So I'd suggest removing VTF_FLT for now and just having it be specially handled via the intrinsic code paths. We can always look at also handling VTF_FLT in the future if that provides benefit
-- This basically means varTypeIsFloating(...) will report false, but that's probably okay if we're importing everything as GT_HWINTRINSIC nodes anyways. We'd really only need the other if we end up making scalar support more "first class" such that GT_ADD and such works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had tried that route previously, and there are a lot of asserts that get hit. For example, in the gtNewSimdToScalarNode which we use to build up the Half operations, this will trigger...
likewise, I believe there are a lot of places where when generating the code or hwintrisic, varTypeIsFloating is checked, for example...
I don't think these are isolated to the one Vector128_ToScalar case, but I can try adding an or case there to see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I expect there will be several fixups regardless of including it or not.
I expect that its inclusion is what’s making the “invalid IR” right now.
You could (if keeping the flag) try to find and handle those cases, such as by ensuring varTypeIsStruct comes first, so it takes precedence over varTypeIsFloating, or (if you remove the flag) you could update the hardware intrinsic paths to also allow TYP_HALF in may of the paths that assert it must be floating-point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s really just an area where we’ve not quite done “this” before and so we don’t have things in place to make it easy.
We’d have the same general issues with adding some TYP_INT128. Anything added after this pr should be much easier, however
| void CodeGen::genCodeForBitCast(GenTreeOp* treeNode) | ||
| { | ||
| assert(treeNode->TypeGet() == genActualType(treeNode)); | ||
| //assert(treeNode->TypeGet() == genActualType(treeNode)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need commented out? I'd expect in this to just be TYP_HALF == TYP_HALF since we aren't implicitly extending the type up to a larger type size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see this in the trees above, but what happens is Lowering::InsertBitCastIfNecessary will insert the bitcast to TYP_USHORT not TYP_HALF, as the ABI defines the TYP_HALF to be passed in an integer register. The code will grab the appropriate integer register for a 2 byte value, hence, TYP_USHORT.
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lower.cpp#L1934-L1944
It's this line in particular that will produce the bitcast...
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lower.cpp#L1923
TYP_HALF has to be defined to use float registers, but its ABI register segement info defines it to require integer register for passing.
| assert((leadingBytes == 0x0F) || ((emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v1) || | ||
| emitComp->compIsaSupportedDebugOnly(InstructionSet_AVX10v2) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just check 10v1 as it implies 10v2
| else if (code & 0xFF000000) | ||
| { | ||
| if (size == EA_2BYTE) | ||
| if (size == EA_2BYTE && ins != INS_vmovsh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking if its a vex/evex/simd instruction might be more appropriate, since those all shouldn't be EA_2BYTE or should have this prefix byte as part of their better encoding.
| else if (code & 0xFF000000) | ||
| { | ||
| if (size == EA_2BYTE) | ||
| if (size == EA_2BYTE && (ins != INS_vmovsh && ins != INS_vaddsh)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here and elsewhere
| break; | ||
| } | ||
|
|
||
| case NI_AVX10v1_ConvertFloatToHalf: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ConvertSingleToHalf
|
|
||
| // Don't bother if the struct contains GC references of byrefs, it can't be a SIMD type. | ||
| if ((structFlags & (CORINFO_FLG_CONTAINS_GC_PTR | CORINFO_FLG_BYREF_LIKE)) == 0) | ||
| if ((structFlags & (CORINFO_FLG_CONTAINS_GC_PTR | CORINFO_FLG_BYREF_LIKE)) == 0 && (structFlags & CORINFO_FLG_INTRINSIC_TYPE) != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified to (structFlags & (CORINFO_FLG_CONTAINS_GC_PTR | CORINFO_FLG_BYREF_LIKE | CORINFO_FLG_INTRINSIC_TYPE)) == CORINFO_FLG_INTRINSIC_TYPE
| } | ||
| } | ||
| } | ||
| else if (strcmp(namespaceName, "System") == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this is the "right" place for this, particularly given the current method name.
Enable X64's optimization where we clear LCLHEAP via STORE_BLK inserted
in Lower on arm64.
```cs
static void Test128() => Consume(stackalloc char[128]);
```
was:
```asm
stp xzr, xzr, [sp, #-0x10]!
stp xzr, xzr, [sp, #-0xF0]!
stp xzr, xzr, [sp, #0x10]
stp xzr, xzr, [sp, #0x20]
stp xzr, xzr, [sp, #0x30]
stp xzr, xzr, [sp, #0x40]
stp xzr, xzr, [sp, #0x50]
stp xzr, xzr, [sp, #0x60]
stp xzr, xzr, [sp, #0x70]
stp xzr, xzr, [sp, #0x80]
stp xzr, xzr, [sp, #0x90]
stp xzr, xzr, [sp, #0xA0]
stp xzr, xzr, [sp, #0xB0]
stp xzr, xzr, [sp, #0xC0]
stp xzr, xzr, [sp, #0xD0]
stp xzr, xzr, [sp, #0xE0]
```
now:
```asm
movi v16.16b, #0
stp q16, q16, [x0]
stp q16, q16, [x0, #0x20]
stp q16, q16, [x0, #0x40]
stp q16, q16, [x0, #0x60]
stp q16, q16, [x0, #0x80]
stp q16, q16, [x0, #0xA0]
stp q16, q16, [x0, #0xC0]
stp q16, q16, [x0, #0xE0]
```
Also, for larger sizes the previous logic used to emit a slow loop (e.g.
1024 bytes):
```asm
mov w0, #0x400
G_M30953_IG03:
stp xzr, xzr, [sp, #-0x10]!
subs x0, x0, #16
bne G_M30953_IG03
```
Now it will emit a call to `CORINFO_HELP_MEMZERO`
[Benchmarks.](EgorBot/runtime-utils#553)
```cs
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
public class Benchmarks
{
[Benchmark] public void Stackalloc64() => Consume(stackalloc byte[64]);
[Benchmark] public void Stackalloc128() => Consume(stackalloc byte[128]);
[Benchmark] public void Stackalloc256() => Consume(stackalloc byte[256]);
[Benchmark] public void Stackalloc512() => Consume(stackalloc byte[512]);
[Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[1024]);
[Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[16384]);
[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(Span<byte> x){}
}
```
| Method | Toolchain | Mean | Error | Ratio |
|---------------- |------------------------
|-----------:|----------:|------:|
| Stackalloc64 | Main | 3.425 ns | 0.0004 ns | 1.00 |
| Stackalloc64 | PR | 2.559 ns | 0.0008 ns | 0.75 |
| | | | | |
| Stackalloc128 | Main | 3.999 ns | 0.0002 ns | 1.00 |
| Stackalloc128 | PR | 2.404 ns | 0.0003 ns | 0.60 |
| | | | | |
| Stackalloc256 | Main | 5.431 ns | 0.0005 ns | 1.00 |
| Stackalloc256 | PR | 2.754 ns | 0.0003 ns | 0.51 |
| | | | | |
| Stackalloc512 | Main | 12.661 ns | 0.2744 ns | 1.00 |
| Stackalloc512 | PR | 7.423 ns | 0.0008 ns | 0.59 |
| | | | | |
| Stackalloc1024 | Main | 24.958 ns | 0.5326 ns | 1.00 |
| Stackalloc1024 | PR | 14.031 ns | 0.0040 ns | 0.56 |
| | | | | |
| Stackalloc16384 | Main | 374.899 ns | 0.0130 ns | 1.00 |
| Stackalloc16384 | PR | 111.029 ns | 1.2123 ns | 0.30 |
---------
Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
Consider the following C# snippet
My changes produce the following Tier0 code...
and the following Tier1 code