2 changes: 1 addition & 1 deletion external/llvm-project/mlir/lib/IR/MLIRContext.cpp
Contributor:

Can you have this change in a separate commit with "[external]..."?

@@ -76,7 +76,7 @@ struct MLIRContextOptions {
 static llvm::ManagedStatic<MLIRContextOptions> clOptions;

 static bool isThreadingGloballyDisabled() {
-#if MLIR_ENABLE_THREADS != 0
+#if LLVM_ENABLE_THREADS != 0
Contributor:

Can you create an upstream PR to fix this?

Member Author:

The bug is in rocm/llvm. I'll post a PR there.

   return clOptions.isConstructed() && clOptions->disableThreading;
 #else
   return true;
60 changes: 41 additions & 19 deletions mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp
@@ -47,6 +47,7 @@
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/InitLLVM.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/ThreadPool.h"

#include <atomic>
#include <cassert>
@@ -88,6 +89,24 @@ void pArgs(const std::tuple<Ts...> &formals, void **_vargs) {
 using namespace mlir;
 using namespace rocmlir::tuningdriver;

+//===----------------------------------------------------------------------===//
+// Shared Resources for Multi-threaded Compilation
+//===----------------------------------------------------------------------===//
+
+/// Returns a shared dialect registry, initialized exactly once.
+static DialectRegistry &getSharedDialectRegistry() {
+  static std::once_flag initFlag;
+  static DialectRegistry registry;
+  std::call_once(initFlag, []() { registerRocMLIRDialects(registry); });
+  return registry;
+}
+
+/// Returns a shared LLVM ThreadPool for all MLIR contexts.
+static llvm::DefaultThreadPool &getSharedThreadPool() {
+  static llvm::DefaultThreadPool pool;
+  return pool;
+}
+
 static llvm::cl::opt<std::string> inputFilename{
     llvm::cl::Positional, llvm::cl::desc("<input file>"), llvm::cl::init("-")};

@@ -316,9 +335,13 @@ struct ThreadResources {
                   const rock::KernelOptions &applicabilityOpts,
                   const rock::KernelOptions &compilationKernOpts,
                   const rock::BackendOptions &backendOpts) {
-    DialectRegistry registry;
-    registerRocMLIRDialects(registry);
-    ctx = std::make_unique<MLIRContext>(registry);
+    // Use the shared dialect registry (initialized exactly once)
+    DialectRegistry &registry = getSharedDialectRegistry();
+    // Create context with threading disabled internally, attach shared pool
+    ctx = std::make_unique<MLIRContext>(registry,
+                                        MLIRContext::Threading::DISABLED);
+    ctx->setThreadPool(getSharedThreadPool());
Contributor:

Does this mean that the compilation is parallelized internally as well? We could be oversubscribing threads because we are already parallelizing at a higher level. Can we control the number of threads in the pool?

Member Author:

Yes, it looks like it would use threading internally as well, but I'm not sure exactly how it works. I copied the logic from MIGraphX's compilation:
https://github.com/ROCm/AMDMIGraphX/blob/4d968f79f02de4de5aa3c36f12a179183c12c04e/src/targets/gpu/mlir.cpp#L286

Member Author:

Threading is enabled by default inside MLIRContext unless it is disabled explicitly:

MLIRContextImpl(bool threadingIsEnabled)

Passing the thread pool explicitly is meant to reduce the overhead of creating a separate thread pool in every parallel worker thread. It doesn't seem to actually affect runtime, but it is good practice.

Contributor (@mirza-halilcevic), Jan 8, 2026:

This is a good optimization.

In that case we are for sure oversubscribing. I think we can optimize further by manipulating the thread count: maybe leave 50% of the compile threads for the compilation workers and the other 50% for the thread pool. There's probably a way to tell the thread pool how many threads to use. 50% is arbitrary here; a different distribution may work better. That could also be why you don't see it affect runtime: the CPU is already oversaturated with threads.
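A minimal sketch of one way to cap the shared pool, assuming LLVM's llvm::hardware_concurrency(unsigned) strategy helper from llvm/Support/Threading.h; the 50/50 split is the illustrative suggestion above, not a measured optimum:

// Sketch only, not the PR's code: request half the hardware threads for the
// shared pool so the outer compilation workers can use the rest.
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"
#include <algorithm>
#include <thread>

static llvm::DefaultThreadPool &getSharedThreadPool() {
  // hardware_concurrency(N) returns a ThreadPoolStrategy requesting N threads.
  unsigned half = std::max(1u, std::thread::hardware_concurrency() / 2);
  static llvm::DefaultThreadPool pool(llvm::hardware_concurrency(half));
  return pool;
}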

     ctx->loadAllAvailableDialects();
     ctx->getDiagEngine().registerHandler([](Diagnostic &) {});

     // Pre-build pipelines once per thread
@@ -423,12 +446,12 @@ measureLargeKernel(unsigned iterations, hipStream_t stream,
 }

 // In order to match rocprof, returns time in nanoseconds
-static FailureOr<double>
-benchmarkKernels(ArrayRef<std::string> binaries,
-                 ArrayRef<std::string> funcNames, ArrayRef<uint32_t> blockSizes,
-                 ArrayRef<uint32_t> gridSizes, ArrayRef<void *> hostBuffers,
-                 MutableArrayRef<void *> gpuBuffers,
-                 ArrayRef<size_t> bufferSizes, const BenchmarkParams &params) {
+static FailureOr<double> benchmarkKernels(ArrayRef<std::string> binaries,
+                                          ArrayRef<std::string> funcNames,
+                                          ArrayRef<uint32_t> blockSizes,
+                                          ArrayRef<uint32_t> gridSizes,
+                                          MutableArrayRef<void *> gpuBuffers,
+                                          const BenchmarkParams &params) {
   bool benchmarkMode = !params.benchmarkConfig.empty();
   hipStream_t stream;
   HIPCHECK(hipStreamCreate(&stream));
@@ -440,12 +463,6 @@ benchmarkKernels(ArrayRef<std::string> binaries,
     }
   });

-  // Initialize device buffers
-  for (size_t i = 0; i < bufferSizes.size(); i++) {
-    HIPCHECK(hipMemcpyAsync(gpuBuffers[i], hostBuffers[i], bufferSizes[i],
-                            hipMemcpyHostToDevice, stream));
-  }
-
   // HIP wants an array of pointers to each argument
   std::vector<void *> argPointers;
   for (void *&item : gpuBuffers) {
@@ -731,6 +748,12 @@ static LogicalResult runTuningLoop(ModuleOp source) {
     gpuBuffers.push_back(gpuBuffer);
   }

+  // Copy host buffers to GPU once (reused across all config benchmarks)
Contributor:

What if we are using atomics? The results would be different in every iteration because we don't initialize the output tensor with the same values. I think that shouldn't affect run-time, but asking just in case.

Member Author:

We don't care about the results here, just the benchmarking.

Contributor:

Can we add a comment just in case, so the reader is aware of this?

+  for (size_t i = 0; i < bufferLengths.size(); i++) {
+    HIPCHECK(hipMemcpy(gpuBuffers[i], hostBuffers[i], bufferLengths[i],
Contributor:

Why change from hipMemcpyAsync to hipMemcpy?

Contributor:

We should be careful with this. I believe that host-to-device copies are still async unless the host memory is allocated page-locked (with hipHostMalloc): the runtime just stages the data for DMA transfer and does not wait for the copy to finish.

CUDA behaves like this, and I would suppose HIP does as well: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html

Contributor:

We should probably be using hipHostMalloc anyway to speed up the memory transfer (see the sketch below).
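For reference, a minimal standalone sketch of the pinned-memory variant being suggested here, assuming the standard HIP runtime entry points (hipHostMalloc, hipMemcpyAsync, hipHostFree); the buffer size and the local HIPCHECK macro are illustrative:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

#define HIPCHECK(expr)                                                         \
  do {                                                                         \
    hipError_t e = (expr);                                                     \
    if (e != hipSuccess) {                                                     \
      fprintf(stderr, "HIP error: %s\n", hipGetErrorString(e));                \
      exit(1);                                                                 \
    }                                                                          \
  } while (0)

int main() {
  constexpr size_t bytes = 1 << 20; // illustrative buffer size
  hipStream_t stream;
  HIPCHECK(hipStreamCreate(&stream));

  // Page-locked (pinned) host allocation: lets the copy run as true async DMA.
  void *hostBuf = nullptr;
  HIPCHECK(hipHostMalloc(&hostBuf, bytes, hipHostMallocDefault));
  void *gpuBuf = nullptr;
  HIPCHECK(hipMalloc(&gpuBuf, bytes));

  // With pinned memory this returns immediately and overlaps with host work;
  // synchronize the stream before any kernel consumes gpuBuf.
  HIPCHECK(
      hipMemcpyAsync(gpuBuf, hostBuf, bytes, hipMemcpyHostToDevice, stream));
  HIPCHECK(hipStreamSynchronize(stream));

  HIPCHECK(hipFree(gpuBuf));
  HIPCHECK(hipHostFree(hostBuf));
  HIPCHECK(hipStreamDestroy(stream));
  return 0;
}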

Member Author:

It was changed from hipMemcpyAsync to hipMemcpy because it doesn't require a stream, i.e. it uses the default stream.

Contributor:

Ah I see, so at this point you don't have a stream?

+                       hipMemcpyHostToDevice));
+  }
+
   // 4. Multi-iteration tuning loop
   SmallString<64> bestConfigOverall;
   float bestTimeOverall = std::numeric_limits<float>::max();
@@ -975,10 +998,9 @@ static LogicalResult runTuningLoop(ModuleOp source) {
     assert(result.status == CompilationStatus::Success &&
            "Unexpected compilation status in benchmarking phase");

-    FailureOr<double> timing =
-        benchmarkKernels(result.hipModules, kernelFuncNames,
-                         result.blockSizes, result.gridSizes, hostBuffers,
-                         gpuBuffers, bufferLengths, benchmarkParams);
+    FailureOr<double> timing = benchmarkKernels(
+        result.hipModules, kernelFuncNames, result.blockSizes,
+        result.gridSizes, gpuBuffers, benchmarkParams);

     if (failed(timing)) {
       llvm::errs() << "Kernel execution failed\n";