Use aliasing in DirectToLDS to avoid unnecessary waits inserted by the backend #2109

pabloantoniom · 2025-11-17T11:51:47Z

Motivation

In #2087 I discovered that disabling the DirectToLDS hack has an unexpected outcome:

Some E2E tests still work (correct result)
Performance does not improve (see this).

One would expect that not inserting the waitcnt 0 would produce wrong results and a significant speedup, but this was not the case.

This is because, as I later figured out, the backend is inserting a conservative waitcnt 0 for us (basically ignoring the value that we set for waitcnt and setting it to 0 instead). This happens in the si-insert-waitcnts pass, where the backend thinks that a wait is necessary between the DirectToLDS loads and the ds_reads.

A way to work around this in the backend is to use LLVM aliasing to tell the backend that these accesses do not conflict. An initial idea is to add all the DirectToLDS loads to an alias group and then mark that group as noAlias on the ds_reads.

NOTE: This idea could potentially be applied to non-DirectToLDS kernels, and it would likely improve performance. However, coming up with a general solution does not seem to be trivial, so that is left for future work.

Technical Details

Changes:

We define 2 new alias scopes in AliasUtils.cpp:
- amdgpu.DirectToLDSLoads: Holds all the DirectToLDS loads (i.e., the rocdl.load.to.lds ops)
- amdgpu.LocalLoads: Holds the local loads (i.e., the llvm.load ops whose source operand is LDS)

This PR adds the add-alias-info pass, which traverses the IR and adds alias scope information to operations that perform direct-to-LDS loads or stores and local loads or stores. This includes:

rocdl.load.to.lds operations (direct loads to LDS)
llvm.load operations that read from LDS (local loads)
llvm.store operations that write to LDS (local stores)
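
The effect of the two scopes can be illustrated with a small standalone model. This is only a sketch: `MemOp`, `isCoveredBy`, `provablyNoAlias` and the two factory functions are hypothetical names, not code from this PR, and the rule is a simplified single-domain version of LLVM's scoped-noalias check (two accesses are treated as independent when the alias scopes of one are all listed in the noalias scopes of the other).

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for a memory operation carrying scoped-noalias
// metadata. Only a single alias domain is modelled.
struct MemOp {
  std::set<std::string> aliasScopes;   // scopes this access belongs to
  std::set<std::string> noAliasScopes; // scopes this access never aliases
};

// True when `scopes` is non-empty and every entry also appears in `noAlias`.
static bool isCoveredBy(const std::set<std::string> &scopes,
                        const std::set<std::string> &noAlias) {
  if (scopes.empty())
    return false;
  for (const std::string &scope : scopes)
    if (noAlias.count(scope) == 0)
      return false;
  return true;
}

// Two accesses are provably independent if the scopes of one are fully
// covered by the noalias list of the other.
bool provablyNoAlias(const MemOp &a, const MemOp &b) {
  return isCoveredBy(a.aliasScopes, b.noAliasScopes) ||
         isCoveredBy(b.aliasScopes, a.noAliasScopes);
}

// A DirectToLDS load: member of the DirectToLDS scope, noalias with LDS reads.
MemOp makeDirectToLDSLoad() {
  MemOp op;
  op.aliasScopes.insert("amdgpu.DirectToLDSLoads");
  op.noAliasScopes.insert("amdgpu.LocalLoads");
  return op;
}

// A local (LDS) load: member of the local-load scope, noalias with the
// DirectToLDS loads.
MemOp makeLocalLoad() {
  MemOp op;
  op.aliasScopes.insert("amdgpu.LocalLoads");
  op.noAliasScopes.insert("amdgpu.DirectToLDSLoads");
  return op;
}
```

With this tagging, a DirectToLDS load and a local load are reported as independent, while an untagged access still has to be treated conservatively.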

Fix a bug in RockPrepareLLVM.cpp, which previously overwrote the alias info on all ops. Instead, we now merge existing alias info with the alias info added by this pass.

Test Plan

Added a LIT test with 5 test cases to ensure that alias information is inserted correctly.

Test Result

LIT_FILTER=DirectToLDS ninja check-rocmlir
...
Testing Time: 186.15s

Total Discovered Tests: 1180
  Excluded: 1141 (96.69%)
  Passed  :   39 (3.31%)

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

justinrosner · 2025-11-20T19:20:31Z

In #2062 I discovered that disabling the DirectToLDS hack has an unexpected outcome:

Did you mean #2087 instead?

justinrosner · 2025-11-20T20:17:37Z

mlir/lib/Dialect/Rock/Transforms/AddAliasInfo.cpp

+            }
+          }
+        }
+        else if (auto storeOp = dyn_cast<LLVM::StoreOp>(aliasOp)) {


Do LLVM::StoreOp and LLVM::LoadOp share an interface? If yes, we can clean up some duplicate code that is almost the exact same for the if and else if branches.

Indeed, I have refactored it to use the interface, and also tidied up the LoadToLDSOp/RawPtrBufferLoadLdsOp code a bit. Thanks!

justinrosner · 2025-11-20T20:29:21Z

mlir/lib/Dialect/Rock/Transforms/RockPrepareLLVM.cpp

+    if (existingAliasScopes && newAliasScopes) {
+      SmallVector<Attribute> mergedAliasScopes;
+      mergedAliasScopes.append(existingAliasScopes.begin(), existingAliasScopes.end());
+      mergedAliasScopes.append(newAliasScopes.begin(), newAliasScopes.end());
+      aliasIface.setAliasScopes(b.getArrayAttr(mergedAliasScopes));
+    } else if (existingAliasScopes) {
+      aliasIface.setAliasScopes(existingAliasScopes);
+    } else if (newAliasScopes) {
+      aliasIface.setAliasScopes(newAliasScopes);
+    }


Could this be simplified by just doing something like:

if (existingAliasScopes)
  mergedAliasScopes.append(existingAliasScopes.begin(), existingAliasScopes.end());
if (newAliasScopes)
  mergedAliasScopes.append(newAliasScopes.begin(), newAliasScopes.end());
if (!mergedAliasScopes.empty())
  aliasIface.setAliasScopes(b.getArrayAttr(mergedAliasScopes));

justinrosner · 2025-11-20T20:29:56Z

mlir/lib/Dialect/Rock/Transforms/RockPrepareLLVM.cpp

+    if (existingNoAliasScopes && newNoAliasScopes) {
+      SmallVector<Attribute> mergedNoAliasScopes;
+      mergedNoAliasScopes.append(existingNoAliasScopes.begin(), existingNoAliasScopes.end());
+      mergedNoAliasScopes.append(newNoAliasScopes.begin(), newNoAliasScopes.end());
+      aliasIface.setNoAliasScopes(b.getArrayAttr(mergedNoAliasScopes));
+    } else if (existingNoAliasScopes) {
+      aliasIface.setNoAliasScopes(existingNoAliasScopes);
+    } else if (newNoAliasScopes) {
+      aliasIface.setNoAliasScopes(newNoAliasScopes);
+    }


Same as above.

Yep, actually they both do the same thing with different data, so I refactored both into a single helper to increase code reuse.
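
For readers following the thread, a shared helper of this kind might look roughly like the sketch below. It is not the helper from the PR: `ScopeList` is a stand-in for an MLIR array attribute, `mergeScopes` is a hypothetical name, and a null pointer plays the role of a missing attribute.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for an MLIR array attribute holding scope attributes.
using ScopeList = std::vector<std::string>;

// Concatenate the existing scopes (if any) followed by the new scopes (if any).
// A null pointer models "attribute not present". An empty result means the
// caller should leave the attribute unset.
ScopeList mergeScopes(const ScopeList *existing, const ScopeList *added) {
  ScopeList merged;
  if (existing)
    merged.insert(merged.end(), existing->begin(), existing->end());
  if (added)
    merged.insert(merged.end(), added->begin(), added->end());
  return merged;
}
```

The same function can then serve both the alias-scope and the noalias-scope call sites, since they differ only in which lists are passed in and which setter is called with the result.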

pabloantoniom · 2025-11-21T08:10:03Z

In #2062 I discovered that disabling the DirectToLDS hack has an unexpected outcome:

Did you mean #2087 instead?

Yes, my bad!

justinrosner

Change looks good now. I think this just needs to update rocmlir-driver/pipelines.mlir and it should be good to go!

dhernandez0 · 2025-11-24T12:12:26Z

external/llvm-project/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

+        hackForDirectToLDS(hackForDirectToLDS) {}

  Chipset chipset;
+  bool hackForDirectToLDS;


can we remove this variable? it seems we don't use it.

My bad, that shouldn't have been committed in the first place, removed

dhernandez0 · 2025-11-24T12:13:41Z

external/llvm-project/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

-    // on the MMRA to relax it to the semantics we want.
-    StringRef scope = "workgroup";
-
-    auto relFence = LLVM::FenceOp::create(rewriter, loc,


why do we remove the fence?

Because we have to use the old upstream version of the fence (using SBarrierOp). Otherwise the backend still inserts unwanted waits.

I think the backend somehow looks for specific rocdl ops when generating the assembly. When upstream updated the AMDGPUToROCDL implementation to the new fence mechanism, the backend was not changed, so it is still expecting the same rocdl ops; because they are now different, the output is not what we want.

We should probably check this again after the next upstream merge.

can you add a comment here? also maybe chat about this with @krzysz00. I think he made that change.

Added a comment, will talk with Krzysztof if he does not reply to this thread

We should probably check this again after the next upstream merge.

Did you check with the November upstream merge?

yes, it was merged yesterday, is this still needed?

I just checked and yes, we still need this unfortunately

dhernandez0 · 2025-11-24T12:14:57Z

mlir/include/mlir/Dialect/Rock/utility/AliasUtils.h

+namespace rock {
+/// Add the direct-to-LDS load alias scope to the given operation.
+/// This marks the operation as being part of the direct-to-LDS load scope.
+/// It also marks the operation as not aliasing with local loads.


I find the "local" very confusing, why not call it LDS as we do in the rest of the codebase?

Agreed, fixed

dhernandez0 · 2025-11-24T12:20:06Z

mlir/lib/Dialect/Rock/Transforms/AddDirectToLDSAliasInfo.cpp

+            }
+          }
+        } else if (isa<ROCDL::LoadToLDSOp, ROCDL::RawPtrBufferLoadLdsOp>(
+                       aliasOp)) {


I think it would be better to have more generic way of detecting the ops. Can we use the MemoryEffects? on gfx1250 we'll have new ops that would need to be added here otherwise.

I think MemoryEffects is too broad right? But yes it would be good to have a more generic way. Maybe we should add something like LDSLoadOpInterface and add LoadToLDSOp, RawPtrBufferLoadLdsOp and future LDS load ops to it. Maybe try to upstream it.

I think MemoryEffects is too broad right?

Why is it too broad? if an op is loading from LDS, it would need an alias.

But implementing MemoryEffectsOpInterface does not imply that it's writing to LDS. What I am thinking now is that we could check if it's MemoryEffects write, and in that case check if it's writing to LDS. That should work, I'll try that.

Done, it seems to work (LIT tests pass), will confirm performance later
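
The check described above (first ask whether the op has a write effect, then ask whether that write targets LDS) can be modelled in isolation as follows. This is an illustrative sketch only: `MemoryEffect`, `Op` and `writesToLDS` are hypothetical stand-ins rather than the MLIR MemoryEffectOpInterface API, and the only hardware fact assumed is that AMDGPU uses address space 3 for LDS.

```cpp
#include <cassert>
#include <vector>

// AMDGPU places LDS (workgroup-local memory) in address space 3.
constexpr unsigned kLDSAddressSpace = 3;

enum class EffectKind { Read, Write };

// Hypothetical stand-in for one memory effect reported by an op: the kind of
// access and the address space of the pointer it touches.
struct MemoryEffect {
  EffectKind kind;
  unsigned addressSpace;
};

// Hypothetical stand-in for an operation exposing its memory effects.
struct Op {
  std::vector<MemoryEffect> effects;
};

// An op qualifies as a "load to LDS" if any of its effects is a write whose
// target lives in the LDS address space. Having memory effects alone is not
// enough, which is why the address space is checked as well.
bool writesToLDS(const Op &op) {
  for (const MemoryEffect &effect : op.effects)
    if (effect.kind == EffectKind::Write &&
        effect.addressSpace == kLDSAddressSpace)
      return true;
  return false;
}
```

Under this model a direct-to-LDS load (global read plus LDS write) is detected without naming the op, which is the property wanted for future ops such as those on gfx1250.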

dhernandez0 · 2025-11-24T12:21:15Z

mlir/lib/Dialect/Rock/Transforms/RockPrepareLLVM.cpp

-    aliasIface.setAliasScopes(aliasScopes[argNo]);
-    aliasIface.setNoAliasScopes(noaliasScopes[argNo]);
+
+    // Merge existing alias scopes (if any) with the new scopes we just created.


nit: explain that the existing scopes can come from the AddAliasInfo pass

dhernandez0 · 2025-11-24T12:21:41Z

mlir/include/mlir/Dialect/Rock/Passes.td

  let dependentDialects = ["rock::RockDialect"];
 }

+def RockAddAliasInfoPass : Pass<"rock-add-alias-info", "::mlir::gpu::GPUModuleOp"> {


nit: AddAliasInfo sounds very generic. This only adds alias info for direct to lds.

Ideally we would like to extend this in the near future to support other cases not only DirectToLDS

sure, but that's still not what the pass does. I'd prefer to change the name when the pass does something else rather than assuming we are going to do something in the future.

I'm not a fan of this change, but done

dhernandez0 · 2025-11-24T12:23:24Z

mlir/lib/Dialect/Rock/utility/AliasUtils.cpp

I don't think this is going to be useful outside of the new pass. I'd keep them inside the pass.

Agreed, done

dhernandez0 · 2025-11-24T12:23:32Z

mlir/lib/Dialect/Rock/utility/CMakeLists.txt

  MLIRRockAnalysis
  MLIRMHAL
+  MLIRLLVMDialect
 )


pabloantoniom mentioned this pull request Nov 17, 2025

Remove DirectToLDS hack by adding rock.async_wait op #2087

Merged

1 task

pabloantoniom force-pushed the pablo-2062-alias3 branch 3 times, most recently from 9ebdee8 to 46db3f0 Compare November 18, 2025 10:59

pabloantoniom changed the title ~~[WIP] Use aliasing in DirectToLDS to avoid unnecessary waits inserted by the backend~~ Use aliasing in DirectToLDS to avoid unnecessary waits inserted by the backend Nov 18, 2025

pabloantoniom marked this pull request as ready for review November 18, 2025 14:18

pabloantoniom requested a review from causten as a code owner November 18, 2025 14:18

pabloantoniom requested review from dhernandez0, djramic, justinrosner and umangyadav November 18, 2025 14:18

pabloantoniom force-pushed the pablo-2062 branch from 7034c14 to 58fe7f0 Compare November 20, 2025 07:13

pabloantoniom force-pushed the pablo-2062-alias3 branch from 04c0719 to 72b02a1 Compare November 20, 2025 16:42

justinrosner reviewed Nov 20, 2025

pabloantoniom requested a review from justinrosner November 21, 2025 09:54

justinrosner approved these changes Nov 21, 2025

dhernandez0 reviewed Nov 24, 2025

pabloantoniom force-pushed the pablo-2062 branch from e1f046b to 424f57f Compare November 27, 2025 10:56

pabloantoniom force-pushed the pablo-2062-alias3 branch 3 times, most recently from b913d87 to 120b7e4 Compare November 27, 2025 11:57