More tuning improvements #2201
base: develop
Changes from all commits
```diff
@@ -76,7 +76,7 @@ struct MLIRContextOptions {
 static llvm::ManagedStatic<MLIRContextOptions> clOptions;
 
 static bool isThreadingGloballyDisabled() {
-#if MLIR_ENABLE_THREADS != 0
+#if LLVM_ENABLE_THREADS != 0
```
|
Contributor: Can you create an upstream PR to fix this?

Member (Author): The bug is in rocm/llvm. I'll post a PR there.
```diff
   return clOptions.isConstructed() && clOptions->disableThreading;
 #else
   return true;
```
```diff
@@ -47,6 +47,7 @@
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/InitLLVM.h"
 #include "llvm/Support/SourceMgr.h"
+#include "llvm/Support/ThreadPool.h"
 
 #include <atomic>
 #include <cassert>
```

```diff
@@ -88,6 +89,24 @@ void pArgs(const std::tuple<Ts...> &formals, void **_vargs) {
 using namespace mlir;
 using namespace rocmlir::tuningdriver;
 
+//===----------------------------------------------------------------------===//
+// Shared Resources for Multi-threaded Compilation
+//===----------------------------------------------------------------------===//
+
+/// Returns a shared dialect registry, initialized exactly once.
+static DialectRegistry &getSharedDialectRegistry() {
+  static std::once_flag initFlag;
+  static DialectRegistry registry;
+  std::call_once(initFlag, []() { registerRocMLIRDialects(registry); });
+  return registry;
+}
+
+/// Returns a shared LLVM ThreadPool for all MLIR contexts.
+static llvm::DefaultThreadPool &getSharedThreadPool() {
+  static llvm::DefaultThreadPool pool;
+  return pool;
+}
+
 static llvm::cl::opt<std::string> inputFilename{
     llvm::cl::Positional, llvm::cl::desc("<input file>"), llvm::cl::init("-")};
```
```diff
@@ -316,9 +335,13 @@ struct ThreadResources {
                    const rock::KernelOptions &applicabilityOpts,
                    const rock::KernelOptions &compilationKernOpts,
                    const rock::BackendOptions &backendOpts) {
-    DialectRegistry registry;
-    registerRocMLIRDialects(registry);
-    ctx = std::make_unique<MLIRContext>(registry);
+    // Use the shared dialect registry (initialized exactly once)
+    DialectRegistry &registry = getSharedDialectRegistry();
+    // Create context with threading disabled internally, attach shared pool
+    ctx = std::make_unique<MLIRContext>(registry,
+                                        MLIRContext::Threading::DISABLED);
+    ctx->setThreadPool(getSharedThreadPool());
```
Contributor: Does this mean that the compilation is parallelized internally as well? We could be oversubscribing threads because we are already parallelizing at a higher level. Can we control the number of threads in the pool?

Member (Author): Yes, it looks like it would use threading internally as well.

Member (Author): Threading is enabled by default inside MLIRContext unless explicitly disabled. Passing the thread pool explicitly is meant to reduce the overhead of creating a separate thread pool in each parallel thread. It does not appear to affect runtime measurably, but it is good practice.

Contributor: This is a good optimization. In that case we are for sure oversubscribing. I think we can optimize further by manipulating the thread count: for example, leave 50% of the compile threads for the compilation workers and the other 50% for the thread pool. There's probably a way to tell the thread pool how many threads to use. 50% is arbitrary here; a different distribution may work better. That could also be why you don't see it affect runtime: the CPU is already oversaturated with threads.
```diff
     ctx->loadAllAvailableDialects();
     ctx->getDiagEngine().registerHandler([](Diagnostic &) {});
 
     // Pre-build pipelines once per thread
```
```diff
@@ -423,12 +446,12 @@ measureLargeKernel(unsigned iterations, hipStream_t stream,
 }
 
 // In order to match rocprof, returns time in nanoseconds
-static FailureOr<double>
-benchmarkKernels(ArrayRef<std::string> binaries,
-                 ArrayRef<std::string> funcNames, ArrayRef<uint32_t> blockSizes,
-                 ArrayRef<uint32_t> gridSizes, ArrayRef<void *> hostBuffers,
-                 MutableArrayRef<void *> gpuBuffers,
-                 ArrayRef<size_t> bufferSizes, const BenchmarkParams &params) {
+static FailureOr<double> benchmarkKernels(ArrayRef<std::string> binaries,
+                                          ArrayRef<std::string> funcNames,
+                                          ArrayRef<uint32_t> blockSizes,
+                                          ArrayRef<uint32_t> gridSizes,
+                                          MutableArrayRef<void *> gpuBuffers,
+                                          const BenchmarkParams &params) {
   bool benchmarkMode = !params.benchmarkConfig.empty();
   hipStream_t stream;
   HIPCHECK(hipStreamCreate(&stream));
```
```diff
@@ -440,12 +463,6 @@ benchmarkKernels(ArrayRef<std::string> binaries,
     }
   });
 
-  // Initialize device buffers
-  for (size_t i = 0; i < bufferSizes.size(); i++) {
-    HIPCHECK(hipMemcpyAsync(gpuBuffers[i], hostBuffers[i], bufferSizes[i],
-                            hipMemcpyHostToDevice, stream));
-  }
-
   // HIP wants an array of pointers to each argument
   std::vector<void *> argPointers;
   for (void *&item : gpuBuffers) {
```
```diff
@@ -731,6 +748,12 @@ static LogicalResult runTuningLoop(ModuleOp source) {
     gpuBuffers.push_back(gpuBuffer);
   }
 
+  // Copy host buffers to GPU once (reused across all config benchmarks)
```
Contributor: What if we are using atomics? The results would be different in every iteration because we don't initialize the output tensor with the same values. I think that shouldn't affect run time, but asking just in case.

Member (Author): We don't care about results here, just the benchmarking.

Contributor: Can we add a comment, just in case, so the reader is aware of this?
```diff
+  for (size_t i = 0; i < bufferLengths.size(); i++) {
+    HIPCHECK(hipMemcpy(gpuBuffers[i], hostBuffers[i], bufferLengths[i],
```
Contributor: Why change from `hipMemcpyAsync` to `hipMemcpy`?

Contributor: We should be careful with this. I believe that host-to-device copies are still asynchronous unless the host memory is allocated page-locked (allocated with hipHostMalloc): the call just stages the data for DMA transfer and does not wait for the copy to finish. CUDA behaves like this, and I would suppose that HIP does as well: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html

Contributor: We probably should be using hipHostMalloc anyway to speed up the memory transfer.

Member (Author): It was changed from `hipMemcpyAsync` to `hipMemcpy`.

Contributor: Ah, I see, at this point you don't have a stream?
```diff
+                       hipMemcpyHostToDevice));
+  }
+
   // 4. Multi-iteration tuning loop
   SmallString<64> bestConfigOverall;
   float bestTimeOverall = std::numeric_limits<float>::max();
```
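For the pinned-memory suggestion in the thread above, a hedged sketch of what using `hipHostMalloc` could look like (requires a ROCm toolchain and a GPU, so it is not runnable here; the buffer size and the `HIPCHECK` macro are illustrative, not taken from this PR):

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Illustrative error-check macro; the PR's own HIPCHECK may differ.
#define HIPCHECK(expr)                                                  \
  do {                                                                  \
    hipError_t err = (expr);                                            \
    if (err != hipSuccess) {                                            \
      std::fprintf(stderr, "HIP error: %s\n", hipGetErrorString(err));  \
      std::abort();                                                     \
    }                                                                   \
  } while (0)

int main() {
  constexpr size_t N = 1 << 20; // illustrative buffer length
  float *hostBuf = nullptr;
  float *gpuBuf = nullptr;

  // Page-locked (pinned) host allocation: enables true async DMA transfers
  // and generally faster copies than pageable memory.
  HIPCHECK(hipHostMalloc(reinterpret_cast<void **>(&hostBuf),
                         N * sizeof(float)));
  HIPCHECK(hipMalloc(reinterpret_cast<void **>(&gpuBuf), N * sizeof(float)));

  hipStream_t stream;
  HIPCHECK(hipStreamCreate(&stream));

  // With pinned memory this copy is genuinely asynchronous; with pageable
  // memory the runtime stages it through a driver-owned buffer instead.
  HIPCHECK(hipMemcpyAsync(gpuBuf, hostBuf, N * sizeof(float),
                          hipMemcpyHostToDevice, stream));
  HIPCHECK(hipStreamSynchronize(stream));

  HIPCHECK(hipStreamDestroy(stream));
  HIPCHECK(hipFree(gpuBuf));
  HIPCHECK(hipHostFree(hostBuf));
  return 0;
}
```

Since the one-time copy in this PR happens before any stream exists, the synchronous `hipMemcpy` is the simpler fit there; pinning would mainly pay off if the copies were moved back onto a stream.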
```diff
@@ -975,10 +998,9 @@ static LogicalResult runTuningLoop(ModuleOp source) {
       assert(result.status == CompilationStatus::Success &&
              "Unexpected compilation status in benchmarking phase");
 
-      FailureOr<double> timing =
-          benchmarkKernels(result.hipModules, kernelFuncNames,
-                           result.blockSizes, result.gridSizes, hostBuffers,
-                           gpuBuffers, bufferLengths, benchmarkParams);
+      FailureOr<double> timing = benchmarkKernels(
+          result.hipModules, kernelFuncNames, result.blockSizes,
+          result.gridSizes, gpuBuffers, benchmarkParams);
 
       if (failed(timing)) {
         llvm::errs() << "Kernel execution failed\n";
```
Can you have this change in a separate commit with "[external]..."?