
Conversation

@justinrosner (Contributor) commented Jan 6, 2026

Motivation

This PR adds support for non-contiguous (long stride) tensors in the MIGraphX -> Rock pipeline. Previously, ops would fail with the error "writing to tensors with long strides or broadcasts is unsupported" when the output tensor had padded strides.

This implements: #2195

Technical Details

MIGraphXToTosa Changes

Instead of rejecting non-contiguous output strides, we now handle them by creating a tosa.custom op for expand_strides.
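For illustration, a sketch of the op this emits, with shapes borrowed from the LIT tests discussed below (SSA names are hypothetical):

  %expanded = tosa.custom %result {domain_name = "rocmlir", implementation_attrs = "", operator_name = "expand_strides"} : (tensor<4x24x24xf16>) -> tensor<4x48x24xf16>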

TosaToRock (GPU lowering path)

Lowers the custom op to a rock.expand_strides op.

RocmlirCustomTosaDecompose (CPU lowering path)

Lowers to an empty larger tensor + tensor.insert_slice to put the values in the correct spots in the expanded memory layout. Note: if we had gone with this approach for the GPU path as well, it would create memref.subview ops that would need to be handled in AlignTiling (a previous iteration of this PR took that approach).
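A minimal sketch of that decomposition, assuming the 4x24x24 -> 4x48x24 shapes from the tests (SSA names are hypothetical):

  // Empty tensor at the expanded size; the gaps stay uninitialized.
  %empty = tensor.empty() : tensor<4x48x24xf16>
  // Place the 24 valid rows at the start of each 48-row slab.
  %expanded = tensor.insert_slice %result into %empty[0, 0, 0] [4, 24, 24] [1, 1, 1] : tensor<4x24x24xf16> into tensor<4x48x24xf16>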

LowerExpandStrides

Created a new pass that runs after bufferization and lowers a rock.expand_strides to a rock.transform + memref.copy (the test snippet in the review thread below shows the resulting IR).

Test Plan

  • Add new LIT tests
  • Nightly CI

Test Result

  • Nightly CI

Submission Checklist

Copilot AI (Contributor) left a comment


Pull request overview

This PR adds support for non-contiguous (long stride) tensors in the MIGraphX to Rock compilation pipeline, addressing issue #2195. Previously, operations would fail with the error "writing to tensors with long strides or broadcasts is unsupported" when output tensors had padded strides.

Key changes:

  • MIGraphXToTosa conversion now handles long strides by creating padded tensors and using tensor.insert_slice to place computed results
  • AlignTiling pass enhanced to convert memref.subview operations into equivalent Pad transforms for proper handling of non-contiguous memory layouts
  • Error messaging improved to distinguish between unsupported broadcasts vs. now-supported long strides

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • mlir/test/fusion/pr-e2e/mixr-non-contiguous-strides.mlir — E2E test verifying dot+sigmoid fusion with non-contiguous output strides (50% initialized memory)
  • mlir/test/Dialect/Rock/align-tiling-non-contiguous-strides.mlir — Test ensuring Pad transforms are created for subviews and memref.copy operations are eliminated
  • mlir/test/Conversion/MIGraphXToTosa/migraphx-to-tosa-non-contiguous-strides.mlir — Test verifying tensor.insert_slice and reshape generation for long-stride outputs
  • mlir/lib/Dialect/Rock/Transforms/AlignTiling.cpp — Implements subviewToPadTransform and buildViewChainWithSubviews to handle non-contiguous strides through pad transforms
  • mlir/lib/Conversion/MIGraphXToTosa/MIGraphXToTosa.cpp — Modifies AsUnderlyingShapeConverter to create padded tensors with insert_slice for long-stride outputs
  • mlir/lib/Conversion/MIGraphXToTosa/MIGraphXToTosaPass.cpp — Adds TensorDialect to legal dialects to support tensor.insert_slice operations



// Only half of the results will be correct since the non-contiguous strides
// in this example means that half of the memory is uninitialized.
// CHECK: relDiff = 0 : 2304/4608 (50.000000%)
Contributor

nit: you can use prefill to set all values to 0?

Contributor Author

What do you mean by use prefill here? Are you talking about making test changes? Or changes to MIGraphXToTosa?

Contributor

Test changes. Could we use a prefill setting in this test so that the other values are initialized to zero?

@justinrosner (Contributor Author) commented Jan 12, 2026

Does rocmlir-gen currently have a flag that lets us do such a thing? I was looking around but couldn't see anything, so I'm not sure if you're aware of something that I'm not. If we don't have this feature yet, I can create a ticket for it, as I think it would be quite useful in some cases (this one included).

@justinrosner justinrosner force-pushed the 2195-non-packed-strides branch from 65c6e1f to 7351030 on January 13, 2026 at 15:03
@pabloantoniom (Contributor) left a comment

Code looks good to me. Only one comment and some nitpicks. Also, I find the term "non-packed strides" a bit weird. Isn't "non-contiguous tensors" more precise?

// CHECK: tosa.custom %[[SIGMOID]] {domain_name = "rocmlir", implementation_attrs = "", operator_name = "expand_strides"}
// CHECK-SAME: (tensor<4x24x24xf16>) -> tensor<4x48x24xf16>
// CHECK: tosa.reshape
// CHECK-SAME: -> tensor<4608xf16>
Contributor

This is a bit confusing to me.

Originally the function returns !migraphx.shaped<4x24x24xf16, 1152x24x1>, but after migraphx-to-tosa your approach makes it return tensor<4608xf16>. Is that valid? We are changing not only !migraphx.shaped to tensor but, more importantly, the shape information. The former has 2304 elements while the latter has twice that. And the key problem I see is that the tensor does not explicitly say that there is a non-unit stride.

That could break in some cases, right?

  1. Is migraphx safe if we change the memory layout of the output tensor like this?
  2. Can you check what would happen if we have something else after the migraphx.sigmoid? Would it still work?

Contributor Author

To answer your questions:

  1. I talked with Paul, and MIGraphX is expecting the memory layout of the output to be in this state. Specifically, they are looking for that strided pattern where, within each slice of the outermost dimension, the first 576 elements are the actual values and the next 576 are uninitialized padding. The larger tensor with 2304 extra elements is expected.
  2. I created some sample code to test this out:
...
%1 = migraphx.sigmoid %0 : <4x24x24xf16, 576x24x1> -> <4x24x24xf16, 1152x24x1>
%cst = migraphx.literal(dense<1.0> : tensor<1xf16>) : <1xf16, 1>
%2 = migraphx.multibroadcast %cst {out_dyn_dims = [], out_lens = [4, 24, 24]} : <1xf16, 1> -> <4x24x24xf16, 1152x24x1>
%3 = migraphx.add %1, %2 : <4x24x24xf16, 1152x24x1>, <4x24x24xf16, 1152x24x1> -> <4x24x24xf16, 1152x24x1>
return %3 : !migraphx.shaped<4x24x24xf16, 1152x24x1>

It looks like this generates the following IR after MIGraphXToTosa:

  %8 = tosa.sigmoid %7 : (tensor<4x24x24xf16>) -> tensor<4x24x24xf16>
  %9 = "tosa.const"() <{values = dense<1.000000e+00> : tensor<4x24x24xf16>}> : () -> tensor<4x24x24xf16>
  %10 = tosa.add %8, %9 : (tensor<4x24x24xf16>, tensor<4x24x24xf16>) -> tensor<4x24x24xf16>
  %11 = tosa.custom %10 {domain_name = "rocmlir", implementation_attrs = "", operator_name = "expand_strides"} : (tensor<4x24x24xf16>) -> tensor<4x48x24xf16>
  %12 = tosa.const_shape  {values = dense<4608> : tensor<1xindex>} : () -> !tosa.shape<1>
  %13 = tosa.reshape %11, %12 : (tensor<4x48x24xf16>, !tosa.shape<1>) -> tensor<4608xf16>
  return %13 : tensor<4608xf16>

The expand_strides gets pushed to after the add, which seems correct? I'm trying to think of other cases where this may break...

Contributor

Okay, if that is what MIGraphX wants, I think it makes sense. This might break in some case, but if it ever does, fixing it is definitely not for this PR.

@justinrosner (Contributor Author)

Code looks good to me. Only one comment and some nitpicks. Also, I find the term "non-packed strides" a bit weird. Isn't "non-contiguous tensors" more precise?

I'll update the title of this PR and the description!

@justinrosner justinrosner changed the title from "Add support for non-packed strides from MIGraphX" to "Add support for non-contiguous tensors from MIGraphX" on Jan 15, 2026
auto outputType = cast<RankedTensorType>(op.getResult(0).getType());

// Allocate the destination tensor with the larger (padded) size
Value dest =
Contributor

So the CPU and GPU paths will be different here? The CPU path will lower it to tensor ops (mlir/lib/Conversion/RocmlirCustomTosaDecompose/RocmlirCustomTosaDecompose.cpp) and this will be the GPU path? Or am I getting confused?

Contributor Author

Correct, TosaToRock will generate rock.expand_strides ops for the GPU path, and the CPU path will generate tensor ops in RocmlirCustomTosaDecompose.

// Verify element types match
if (inputType.getElementType() != outputType.getElementType())
return emitOpError("input and output must have the same element type");

Contributor

Is there any other restriction? What happens if more than one dimension is different?

Contributor Author

More than one dimension being different is allowed. We also check that the output is >= the input in all dimensions and that each output dimension is a multiple of the corresponding input dimension.
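For example, under those checks (shapes are illustrative), the first of these would verify and the second would be rejected:

  // OK: 4 -> 4 (x1), 24 -> 48 (x2), 24 -> 24 (x1).
  rock.expand_strides %src into %dst : memref<4x24x24xf16> into memref<4x48x24xf16>
  // Rejected: 40 >= 24 but is not a multiple of 24.
  rock.expand_strides %src into %bad : memref<4x24x24xf16> into memref<4x40x24xf16>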

%alloc_1 = memref.alloc() {alignment = 64 : i64} : memref<4x48x24xf16>
rock.expand_strides %alloc_0 into %alloc_1 : memref<4x24x24xf16> into memref<4x48x24xf16>
// CHECK: %[[TRANSFORM:.*]] = rock.transform %alloc_1 {{.*}} memref<4x48x24xf16> to memref<4x24x24xf16>
// CHECK: memref.copy %alloc_0, %[[TRANSFORM]] : memref<4x24x24xf16> to memref<4x24x24xf16>
Contributor

So the final IR ends up with two memref.copy ops, one from alloc_0 to alloc_1, then alloc_1 to arg2. Does the code handle that correctly? What happens if there are more memref.copy ops?

Contributor

Would it work if the first arg to rock.expand_strides is a transform map as well?

@justinrosner (Contributor Author) commented Jan 15, 2026

The two copies are independent. The rock.expand_strides lowering to a memref.copy will always just populate the valid portion of the expanded buffer. The second copy just moves the entire buffer, with the uninitialized gaps, to the output (this copy was already present in the pre-transform IR). If there are more memref.copy ops in the IR, that should not have an impact on the expand_strides lowering, since each one is independently lowered to transform + copy. Other memref.copy ops in the IR are unaffected; they're just separate memory operations that happen to operate on the same or related buffers.

As for the question about the input to expand_strides being a rock.transform, I don't think that should make a difference. expand_strides doesn't care where the input comes from, as long as that input is a valid memref value. The memref.copy would copy from the transformed input (from the rock.transform) to the transformed/expanded output.
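To make the two independent copies concrete, here is a sketch of the IR after LowerExpandStrides, following the test snippet above (#expand_map stands in for the generated transform map):

  // Copy 1: the lowered expand_strides fills only the valid region, through
  // a strided view of the padded buffer.
  %view = rock.transform %alloc_1 by #expand_map : memref<4x48x24xf16> to memref<4x24x24xf16>
  memref.copy %alloc_0, %view : memref<4x24x24xf16> to memref<4x24x24xf16>
  // Copy 2: the pre-existing copy of the entire padded buffer, gaps included.
  memref.copy %alloc_1, %arg2 : memref<4x48x24xf16> to memref<4x48x24xf16>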


@justinrosner justinrosner merged commit 687737e into develop Jan 15, 2026
15 checks passed
@justinrosner justinrosner deleted the 2195-non-packed-strides branch January 15, 2026 18:54


Development

Successfully merging this pull request may close these issues.

"Error: 'migraphx.mlir.as.underlying.shape' op writing to tensors with long strides or broadcasts is unsupported" when using non-packed output buffers
