
Conversation

@RichardChamberlain1

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist


# To build rocSHMEM with MPI disabled, please add this flag -DUSE_EXTERNAL_MPI=OFF
MPI_ROOT=$BUILD_DIR/ompi ../rocSHMEM/scripts/build_configs/gda_mlx5 --fresh \
-DUSE_IPC=ON \

Why are you reverting the README?

parser.add_argument("--verbose", action="store_true", help="Verbose build")
parser.add_argument("--enable_timer", action="store_true", help="Enable timer to debug time out in internode")
parser.add_argument("--rocm-disable-ctx", action="store_true", help="Disable workgroup context optimization in internode")
parser.add_argument("--disable-mpi", action="store_true", help="Disable MPI detection and configuration")

disable-mpi should be kept.

for (int j = 0; j < kNumElemsPerRead; j += 2) {
float2 fp32x2 = {fp32_values[j] * scale, fp32_values[j + 1] * scale};
#ifdef USE_ROCM
#if defined(__gfx942__)

These changes need to be reverted; they break the build for MI350.
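
For context, a minimal sketch of the guard shape that keeps both targets building; the helper names below are hypothetical placeholders, not functions from this PR. The point is that a gfx942-only branch needs either a matching gfx950 (MI350-class) branch or a portable fallback:

// Hedged sketch, not the PR's code: guard the MI300-specific path and keep a
// portable fallback so MI350 (gfx950) and other targets still compile.
for (int j = 0; j < kNumElemsPerRead; j += 2) {
    float2 fp32x2 = {fp32_values[j] * scale, fp32_values[j + 1] * scale};
#if defined(USE_ROCM) && defined(__gfx942__)
    fp8x2_values[j / 2] = convert_fp32x2_to_fp8x2_gfx942(fp32x2);   // hypothetical gfx942-specific helper
#else
    fp8x2_values[j / 2] = convert_fp32x2_to_fp8x2_generic(fp32x2);  // hypothetical portable fallback
#endif
}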

internode::shmem_ctx_schar_put_nbi_warp(ctx,
#endif
reinterpret_cast<signed char*>(dst_ptr), reinterpret_cast<signed char*>(src_ptr), num_bytes_per_msg, dst_rank);
#if defined(ROCM_DISABLE_CTX)

These changes also need to be reverted.

Comment on lines 750 to 761
// Assign bias pointers
/*auto bias_opts = std::vector<std::optional<torch::Tensor>>({bias_0, bias_1});
void* bias_ptrs[2] = {nullptr, nullptr};
for (int i = 0; i < 2; ++i)
if (bias_opts[i].has_value()) {
auto bias = bias_opts[i].value();
EP_HOST_ASSERT(bias.dim() == 2 and bias.is_contiguous());
EP_HOST_ASSERT(bias.scalar_type() == x.scalar_type());
EP_HOST_ASSERT(bias.size(0) == num_recv_tokens and bias.size(1) == hidden);
bias_ptrs[i] = bias.data_ptr();
}
*/

Let's remove it, or add a comment noting that it might be needed for future work.


I've added a comment to say it's not supported at this time.
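
A minimal sketch of what that guard might look like (an assumption about the added comment, not the PR's exact code; bias_0, bias_1, and EP_HOST_ASSERT are taken from the surrounding excerpt):

// Sketch: bias tensors are not supported in this port at this time, so fail
// loudly instead of silently ignoring them.
EP_HOST_ASSERT(not bias_0.has_value() and not bias_1.has_value());
void* bias_ptrs[2] = {nullptr, nullptr};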

Comment on lines +800 to +804
/*for (auto& to : {topk_weights, recv_topk_weights, bias_0, bias_1}) {
to.has_value() ? to->record_stream(comm_stream) : void();
if (allocate_on_comm_stream)
to.has_value() ? to->record_stream(compute_stream) : void();
}*/

Let's remove it, or add a comment noting that it might be needed for future work.


Added a comment.

csrc/deep_ep.hpp (outdated)
Comment on lines 166 to 167
//const std::optional<torch::Tensor>& bias_0,
//const std::optional<torch::Tensor>& bias_1,

Let's remove it

namespace intranode {

void barrier(int **task_fifo_ptrs, int head, int rank, int num_ranks, cudaStream_t stream);
//void barrier(int **task_fifo_ptrs, int head, int rank, int num_ranks, cudaStream_t stream);

Let's remove it


Done

if (not (cond)) { \
printf("Assertion failed: %s:%d, condition: %s\n", __FILE__, __LINE__, #cond); \
trap(); \
abort();\

Why was that changed? As far as I remember, the abort() function was unavailable on the device side.


trap() was not recognized during compilation.
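
One way to reconcile the two concerns is a small per-platform trap macro. This is a hedged sketch of how an EP_DEVICE_ASSERT-style macro could dispatch, not the repository's actual definition; EP_DEVICE_TRAP is a name invented here, and it assumes __builtin_trap() is accepted by the HIP device compiler where the bare trap() identifier was not:

// Sketch only: CUDA kernels usually abort via the trap instruction, while
// under HIP/ROCm __builtin_trap() serves the same purpose.
#ifdef USE_ROCM
#define EP_DEVICE_TRAP() __builtin_trap()
#else
#define EP_DEVICE_TRAP() asm("trap;")
#endif

#define EP_DEVICE_ASSERT(cond) \
do { \
    if (not (cond)) { \
        printf("Assertion failed: %s:%d, condition: %s\n", __FILE__, __LINE__, #cond); \
        EP_DEVICE_TRAP(); \
    } \
} while (0)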

#if !defined(ROCM_DISABLE_CTX)
__shared__ internode::shmem_ctx_t ctx;
internode::shmem_wg_ctx_create(&ctx);
EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);

Maybe there's something like INVALID_CTX to compare against, rather than zero?
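
A sketch of that suggestion; SHMEM_CTX_INVALID is a hypothetical sentinel name here, and whether rocSHMEM actually exposes such a constant (and under what name) would need to be checked:

// Hedged sketch: validate the created context against an explicit
// invalid-context sentinel instead of assuming 0 means success.
__shared__ internode::shmem_ctx_t ctx;
internode::shmem_wg_ctx_create(&ctx);
EP_DEVICE_ASSERT(ctx != internode::SHMEM_CTX_INVALID);  // hypothetical sentinel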

Comment on lines 40 to 42
//#pragma unroll
//for (int i = 0; i < kNumRanks; ++ i)
// per_rank_buffer[rank * kNumRanks + i] = num_tokens_per_rank[i];

Let's clean this up.

}

void barrier(int** task_fifo_ptrs, int head, int rank, int num_ranks, cudaStream_t stream) {
/*void barrier(int** task_fifo_ptrs, int head, int rank, int num_ranks, cudaStream_t stream) {

Let's remove the old version.

#include "exception.cuh"

#ifdef USE_ROCM
#define syncthreads() __syncthreads()

Why can't we just use __syncthreads() everywhere? There's no custom functionality added behind this function, and the __ prefix will explicitly mark that we're using the runtime one.


I wondered about this too, but I was just following how it's always been done and assumed there was some good reason for it. Probably just some debugging at some point?


It seems like there's no point in wrapping that particular function. It is (was) necessary for some other calls, __shfl_sync for example, because there we have a different number of arguments compared to the CUDA runtime, so a decorator is required. Let's revert to __syncthreads().
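
For contrast, a minimal sketch of where a wrapper does earn its keep; the shfl_sync macro name and definitions below are assumptions used to illustrate the argument-count difference, not the project's actual code:

// __syncthreads() has the same signature on CUDA and HIP, so it can be called
// directly. The warp shuffle differs: HIP's __shfl() takes no mask argument,
// which is why a thin portability macro is justified there.
#ifdef USE_ROCM
#define shfl_sync(mask, var, src_lane) __shfl((var), (src_lane))
#else
#define shfl_sync(mask, var, src_lane) __shfl_sync((mask), (var), (src_lane))
#endif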
