Skip to content

Conversation

@pabloantoniom
Copy link
Contributor

@pabloantoniom pabloantoniom commented Nov 17, 2025

Motivation

In #2087 I discovered that disabling the DirectToLDS hack has an unexpected outcome:

  • Some E2E tests still work (correct result)
  • Performance does not improve (see this).

One would expect that not inserting the waitcnt 0 would result in wrong results and significant speedup, but this was not the case.

This is because, as I later figured out, the backend is inserting conservative waitcnt 0 for us (basically ignoring the value that we set for waitcnt and setting it to 0 instead). This happens in the si-insert-waitcnts pass, where the backend thinks that a wait is neccesary between the DirectToLDS loads and the ds_reads.

A solution to overcome this problem in the backend is to use LLVM aliasing to indicate the backend that this is not a problem. An initial idea is to add all the DirectToLDS loads to an an alias group and then adding that group as noAlias to the ds_reads.

NOTE: This idea can be potentially applied to non DirectToLDS kernels and it would likely improve performance. However coming up with a general solution does not seem to be trivial, so that is left for future work.

Technical Details

Changes:

  1. We define 2 new alias scopes in AliasUtils.cpp:

    • amdgpu.DirectToLDSLoads: Holds all the DirectToLDS loads (i.e., the rocdl.load.to.lds ops)
    • amdgpu.LocalLoads: Holds the local loads (i.e., the llvm.load ops which source operand is LDS)
  2. This PR adds the add-alias-info pass, which transverses the IR and adds alias scope information to operations that perform direct-to-LDS loads or stores and local loads or stores. This includes:

  • rocdl.load_to_lds operations (direct loads to LDS)
  • llvm.load operations from global memory to LDS (local loads)
  • llvm.store operations to LDS from global memory (local stores)
  1. Fix a bug in RockPrepareLLVM.cpp, which previously was overriding the alias info on all ops. Instead, we now merge existing alias info with the alias info added by this pass.

Test Plan

Added LIT test with 5 testcases to ensure that alias information is inserted correctly.

Test Result

LIT_FILTER=DirectToLDS ninja check-rocmlir
...
Testing Time: 186.15s

Total Discovered Tests: 1180
  Excluded: 1141 (96.69%)
  Passed  :   39 (3.31%)

Submission Checklist

@pabloantoniom pabloantoniom force-pushed the pablo-2062-alias3 branch 3 times, most recently from 9ebdee8 to 46db3f0 Compare November 18, 2025 10:59
@pabloantoniom pabloantoniom changed the title [WIP] Use aliasing in DirectToLDS to avoid unnecessary waits inserted by the backend Use aliasing in DirectToLDS to avoid unnecessary waits inserted by the backend Nov 18, 2025
@pabloantoniom pabloantoniom marked this pull request as ready for review November 18, 2025 14:18
@justinrosner
Copy link
Contributor

In #2062 I discovered that disabling the DirectToLDS hack has an unexpected outcome:

Did you mean #2087 instead?

}
}
}
else if (auto storeOp = dyn_cast<LLVM::StoreOp>(aliasOp)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do LLVM::StoreOp and LLVM::LoadOp share an interface? If yes, we can clean up some duplicate code that is almost the exact same for the if and else if branches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I have refactored it to use the interface and also a bit the LoadToLDSOp/RawPtrBufferLoadLdsOp code, thanks

Comment on lines 214 to 223
if (existingAliasScopes && newAliasScopes) {
SmallVector<Attribute> mergedAliasScopes;
mergedAliasScopes.append(existingAliasScopes.begin(), existingAliasScopes.end());
mergedAliasScopes.append(newAliasScopes.begin(), newAliasScopes.end());
aliasIface.setAliasScopes(b.getArrayAttr(mergedAliasScopes));
} else if (existingAliasScopes) {
aliasIface.setAliasScopes(existingAliasScopes);
} else if (newAliasScopes) {
aliasIface.setAliasScopes(newAliasScopes);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be simplified by just doing something like:

    if (existingAliasScopes)
      mergedAliasScopes.append(existingAliasScopes.begin(), existingAliasScopes.end());
    if (newAliasScopes)
      mergedAliasScopes.append(newAliasScopes.begin(), newAliasScopes.end());

    if (!mergedAliasScopes.empty())
      aliasIface.setAliasScopes(b.getArrayAttr(mergedAliasScopes));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done

Comment on lines 228 to 237
if (existingNoAliasScopes && newNoAliasScopes) {
SmallVector<Attribute> mergedNoAliasScopes;
mergedNoAliasScopes.append(existingNoAliasScopes.begin(), existingNoAliasScopes.end());
mergedNoAliasScopes.append(newNoAliasScopes.begin(), newNoAliasScopes.end());
aliasIface.setNoAliasScopes(b.getArrayAttr(mergedNoAliasScopes));
} else if (existingNoAliasScopes) {
aliasIface.setNoAliasScopes(existingNoAliasScopes);
} else if (newNoAliasScopes) {
aliasIface.setNoAliasScopes(newNoAliasScopes);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, actually they both do the same with different data so I refactored both into a single helper to increase code reuse

@pabloantoniom
Copy link
Contributor Author

In #2062 I discovered that disabling the DirectToLDS hack has an unexpected outcome:

Did you mean #2087 instead?

Yes, my bad!

Copy link
Contributor

@justinrosner justinrosner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good now. I think this just needs to update rocmlir-driver/pipelines.mlir and it should be good to go!

hackForDirectToLDS(hackForDirectToLDS) {}

Chipset chipset;
bool hackForDirectToLDS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this variable? it seems we don't use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, that shouldn't been committed in the first place, removed

// on the MMRA to relax it to the semantics we want.
StringRef scope = "workgroup";

auto relFence = LLVM::FenceOp::create(rewriter, loc,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we remove the fence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we have to use the old upstream version of the fence (using SBarrierOp). Otherwise the backend still inserts unwanted waits.

I think that the backend somehow searches for the rocdl ops to generate the assembly. When they updated the AMDGPUToROCDL implementation with the new fence mechanism they didn't change the backend so it is still expecing the same rocdl ops; because they are now different, the output is not what we want.

We should probably check this again after the next upstream merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here? also maybe chat about this with @krzysz00. I think he made that change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment, will talk with Krzysztof if he does not reply to this thread

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably check this again after the next upstream merge.

Did you check with november upstream merge ?

Copy link
Contributor

@dhernandez0 dhernandez0 Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably check this again after the next upstream merge.

Did you check with november upstream merge ?

yes, it was merged yesterday, is this still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked and yes, we still need this unfortunately

namespace rock {
/// Add the direct-to-LDS load alias scope to the given operation.
/// This marks the operation as being part of the direct-to-LDS load scope.
/// It also marks the operation as not aliasing with local loads.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the "local" very confusing, why not call it LDS as we do in the rest of the codebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, fixed

}
}
} else if (isa<ROCDL::LoadToLDSOp, ROCDL::RawPtrBufferLoadLdsOp>(
aliasOp)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to have more generic way of detecting the ops. Can we use the MemoryEffects? on gfx1250 we'll have new ops that would need to be added here otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think MemoryEffects is too broad right? But yes it would be good to have a more generic way. Maybe we should add something like LDSLoadOpInterface and add LoadToLDSOp, RawPtrBufferLoadLdsOp and future LDS load ops to it. Maybe try to upstream it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think MemoryEffects is too broad right?

Why is it too broad? if an op is loading from LDS, it would need an alias.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But implementing MemoryEffectsOpInterface does not imply that it's writing to LDS. What I am thinking now is that we could check if it's MemoryEffects write, and in that case check if it's writing to LDS. That should work, I'll try that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, it seems to work (LIT test pass), will confirm performance later

aliasIface.setAliasScopes(aliasScopes[argNo]);
aliasIface.setNoAliasScopes(noaliasScopes[argNo]);

// Merge existing alias scopes (if any) with the new scopes we just created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: explain that the existing scopes can come from the AddAliasInfo pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

let dependentDialects = ["rock::RockDialect"];
}

def RockAddAliasInfoPass : Pass<"rock-add-alias-info", "::mlir::gpu::GPUModuleOp"> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: AddAliasInfo sounds very generic. This only adds alias info for direct to lds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would like to extend this in the near future to support other cases not only DirectToLDS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but that's still not what the pass does. I'd prefer to change the name when the pass does something else rather than assuming we are going to do something in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this change, but done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is going to be useful outside of the new pass. I'd keep them inside the pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, done

MLIRRockAnalysis
MLIRMHAL
MLIRLLVMDialect
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line

@pabloantoniom pabloantoniom force-pushed the pablo-2062-alias3 branch 3 times, most recently from b913d87 to 120b7e4 Compare November 27, 2025 11:57
@pabloantoniom pabloantoniom merged commit a63f1dd into develop Dec 4, 2025
18 checks passed
@pabloantoniom pabloantoniom deleted the pablo-2062-alias3 branch December 4, 2025 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants