-
Notifications
You must be signed in to change notification settings - Fork 52
Use aliasing in DirectToLDS to avoid unnecessary waits inserted by the backend #2109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9ebdee8 to
46db3f0
Compare
7034c14 to
58fe7f0
Compare
04c0719 to
72b02a1
Compare
| } | ||
| } | ||
| } | ||
| else if (auto storeOp = dyn_cast<LLVM::StoreOp>(aliasOp)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do LLVM::StoreOp and LLVM::LoadOp share an interface? If yes, we can clean up some duplicate code that is almost the exact same for the if and else if branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I have refactored it to use the interface and also a bit the LoadToLDSOp/RawPtrBufferLoadLdsOp code, thanks
| if (existingAliasScopes && newAliasScopes) { | ||
| SmallVector<Attribute> mergedAliasScopes; | ||
| mergedAliasScopes.append(existingAliasScopes.begin(), existingAliasScopes.end()); | ||
| mergedAliasScopes.append(newAliasScopes.begin(), newAliasScopes.end()); | ||
| aliasIface.setAliasScopes(b.getArrayAttr(mergedAliasScopes)); | ||
| } else if (existingAliasScopes) { | ||
| aliasIface.setAliasScopes(existingAliasScopes); | ||
| } else if (newAliasScopes) { | ||
| aliasIface.setAliasScopes(newAliasScopes); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be simplified by just doing something like:
if (existingAliasScopes)
mergedAliasScopes.append(existingAliasScopes.begin(), existingAliasScopes.end());
if (newAliasScopes)
mergedAliasScopes.append(newAliasScopes.begin(), newAliasScopes.end());
if (!mergedAliasScopes.empty())
aliasIface.setAliasScopes(b.getArrayAttr(mergedAliasScopes));There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done
| if (existingNoAliasScopes && newNoAliasScopes) { | ||
| SmallVector<Attribute> mergedNoAliasScopes; | ||
| mergedNoAliasScopes.append(existingNoAliasScopes.begin(), existingNoAliasScopes.end()); | ||
| mergedNoAliasScopes.append(newNoAliasScopes.begin(), newNoAliasScopes.end()); | ||
| aliasIface.setNoAliasScopes(b.getArrayAttr(mergedNoAliasScopes)); | ||
| } else if (existingNoAliasScopes) { | ||
| aliasIface.setNoAliasScopes(existingNoAliasScopes); | ||
| } else if (newNoAliasScopes) { | ||
| aliasIface.setNoAliasScopes(newNoAliasScopes); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, actually they both do the same with different data so I refactored both into a single helper to increase code reuse
justinrosner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good now. I think this just needs to update rocmlir-driver/pipelines.mlir and it should be good to go!
| hackForDirectToLDS(hackForDirectToLDS) {} | ||
|
|
||
| Chipset chipset; | ||
| bool hackForDirectToLDS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this variable? it seems we don't use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, that shouldn't been committed in the first place, removed
| // on the MMRA to relax it to the semantics we want. | ||
| StringRef scope = "workgroup"; | ||
|
|
||
| auto relFence = LLVM::FenceOp::create(rewriter, loc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we remove the fence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we have to use the old upstream version of the fence (using SBarrierOp). Otherwise the backend still inserts unwanted waits.
I think that the backend somehow searches for the rocdl ops to generate the assembly. When they updated the AMDGPUToROCDL implementation with the new fence mechanism they didn't change the backend so it is still expecing the same rocdl ops; because they are now different, the output is not what we want.
We should probably check this again after the next upstream merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here? also maybe chat about this with @krzysz00. I think he made that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment, will talk with Krzysztof if he does not reply to this thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably check this again after the next upstream merge.
Did you check with november upstream merge ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably check this again after the next upstream merge.
Did you check with november upstream merge ?
yes, it was merged yesterday, is this still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just checked and yes, we still need this unfortunately
| namespace rock { | ||
| /// Add the direct-to-LDS load alias scope to the given operation. | ||
| /// This marks the operation as being part of the direct-to-LDS load scope. | ||
| /// It also marks the operation as not aliasing with local loads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the "local" very confusing, why not call it LDS as we do in the rest of the codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, fixed
| } | ||
| } | ||
| } else if (isa<ROCDL::LoadToLDSOp, ROCDL::RawPtrBufferLoadLdsOp>( | ||
| aliasOp)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to have more generic way of detecting the ops. Can we use the MemoryEffects? on gfx1250 we'll have new ops that would need to be added here otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think MemoryEffects is too broad right? But yes it would be good to have a more generic way. Maybe we should add something like LDSLoadOpInterface and add LoadToLDSOp, RawPtrBufferLoadLdsOp and future LDS load ops to it. Maybe try to upstream it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think MemoryEffects is too broad right?
Why is it too broad? if an op is loading from LDS, it would need an alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But implementing MemoryEffectsOpInterface does not imply that it's writing to LDS. What I am thinking now is that we could check if it's MemoryEffects write, and in that case check if it's writing to LDS. That should work, I'll try that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, it seems to work (LIT test pass), will confirm performance later
| aliasIface.setAliasScopes(aliasScopes[argNo]); | ||
| aliasIface.setNoAliasScopes(noaliasScopes[argNo]); | ||
|
|
||
| // Merge existing alias scopes (if any) with the new scopes we just created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: explain that the existing scopes can come from the AddAliasInfo pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| let dependentDialects = ["rock::RockDialect"]; | ||
| } | ||
|
|
||
| def RockAddAliasInfoPass : Pass<"rock-add-alias-info", "::mlir::gpu::GPUModuleOp"> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: AddAliasInfo sounds very generic. This only adds alias info for direct to lds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would like to extend this in the near future to support other cases not only DirectToLDS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, but that's still not what the pass does. I'd prefer to change the name when the pass does something else rather than assuming we are going to do something in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of this change, but done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is going to be useful outside of the new pass. I'd keep them inside the pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, done
| MLIRRockAnalysis | ||
| MLIRMHAL | ||
| MLIRLLVMDialect | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new line
e1f046b to
424f57f
Compare
b913d87 to
120b7e4
Compare
41b5b5e to
ebbd908
Compare
82aa2b3 to
14ea1ec
Compare
…he backend still does weird things
6e53f39 to
0add71c
Compare
Motivation
In #2087 I discovered that disabling the DirectToLDS hack has an unexpected outcome:
One would expect that not inserting the
waitcnt 0would result in wrong results and significant speedup, but this was not the case.This is because, as I later figured out, the backend is inserting conservative
waitcnt 0for us (basically ignoring the value that we set for waitcnt and setting it to 0 instead). This happens in thesi-insert-waitcntspass, where the backend thinks that a wait is neccesary between the DirectToLDS loads and theds_reads.A solution to overcome this problem in the backend is to use LLVM aliasing to indicate the backend that this is not a problem. An initial idea is to add all the DirectToLDS loads to an an
aliasgroup and then adding that group asnoAliasto theds_reads.NOTE: This idea can be potentially applied to non DirectToLDS kernels and it would likely improve performance. However coming up with a general solution does not seem to be trivial, so that is left for future work.
Technical Details
Changes:
We define 2 new alias scopes in
AliasUtils.cpp:amdgpu.DirectToLDSLoads: Holds all the DirectToLDS loads (i.e., therocdl.load.to.ldsops)amdgpu.LocalLoads: Holds the local loads (i.e., thellvm.loadops which source operand is LDS)This PR adds the
add-alias-infopass, which transverses the IR and adds alias scope information to operations that perform direct-to-LDS loads or stores and local loads or stores. This includes:RockPrepareLLVM.cpp, which previously was overriding the alias info on all ops. Instead, we now merge existing alias info with the alias info added by this pass.Test Plan
Added LIT test with 5 testcases to ensure that alias information is inserted correctly.
Test Result
Submission Checklist