Add support for non-contiguous tensors from MIGraphX #2198
Conversation
Pull request overview
This PR adds support for non-contiguous (long stride) tensors in the MIGraphX to Rock compilation pipeline, addressing issue #2195. Previously, operations would fail with the error "writing to tensors with long strides or broadcasts is unsupported" when output tensors had padded strides.
Key changes:
- MIGraphXToTosa conversion now handles long strides by creating padded tensors and using tensor.insert_slice to place computed results (a sketch of the resulting IR follows this list)
- AlignTiling pass enhanced to convert memref.subview operations into equivalent Pad transforms for proper handling of non-contiguous memory layouts
- Error messaging improved to distinguish between unsupported broadcasts and now-supported long strides
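For concreteness, here is a minimal sketch of what the conversion emits for a long-stride output, adapted from the IR quoted later in this conversation (value names are illustrative; the shapes come from the tests in this PR):

// Input: a migraphx op whose result type carries long (padded) strides.
%1 = migraphx.sigmoid %0 : <4x24x24xf16, 576x24x1> -> <4x24x24xf16, 1152x24x1>
// After MIGraphXToTosa: the padding becomes an explicit expansion, followed
// by a flattening reshape to the underlying 4608-element buffer.
%s = tosa.sigmoid %t : (tensor<4x24x24xf16>) -> tensor<4x24x24xf16>
%e = tosa.custom %s {domain_name = "rocmlir", implementation_attrs = "", operator_name = "expand_strides"} : (tensor<4x24x24xf16>) -> tensor<4x48x24xf16>
%sh = tosa.const_shape {values = dense<4608> : tensor<1xindex>} : () -> !tosa.shape<1>
%r = tosa.reshape %e, %sh : (tensor<4x48x24xf16>, !tosa.shape<1>) -> tensor<4608xf16>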
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| mlir/test/fusion/pr-e2e/mixr-non-contiguous-strides.mlir | E2E test verifying dot+sigmoid fusion with non-contiguous output strides (50% initialized memory) |
| mlir/test/Dialect/Rock/align-tiling-non-contiguous-strides.mlir | Test ensuring Pad transforms are created for subviews and memref.copy operations are eliminated |
| mlir/test/Conversion/MIGraphXToTosa/migraphx-to-tosa-non-contiguous-strides.mlir | Test verifying tensor.insert_slice and reshape generation for long-stride outputs |
| mlir/lib/Dialect/Rock/Transforms/AlignTiling.cpp | Implements subviewToPadTransform and buildViewChainWithSubviews to handle non-contiguous strides through pad transforms |
| mlir/lib/Conversion/MIGraphXToTosa/MIGraphXToTosa.cpp | Modified AsUnderlyingShapeConverter to create padded tensors with insert_slice for long-stride outputs |
| mlir/lib/Conversion/MIGraphXToTosa/MIGraphXToTosaPass.cpp | Adds TensorDialect to legal dialects to support tensor.insert_slice operations |
// Only half of the results will be correct since the non-contiguous strides
// in this example means that half of the memory is uninitialized.
// CHECK: relDiff = 0 : 2304/4608 (50.000000%)
nit: you can use prefill to set all values to 0?
What do you mean by use prefill here? Are you talking about making test changes? Or changes to MIGraphXToTosa?
Test changes. Could we use a prefill setting in this test so that the other values are initialized to zero?
Does rocmlir-gen currently have a flag that lets us do such a thing? I was looking around but couldn't see anything, so I'm not sure if you're aware of something that I'm not. If we don't have this feature yet, I can create a ticket for us to do this, as I think it would be quite useful in some cases (this one included).
Force-pushed from 65c6e1f to 7351030.
Code looks good to me. Only one comment and some nitpicks. Also, I find the term "non-packed strides" a bit weird. Isn't "non-contiguous tensors" more precise?
// CHECK: tosa.custom %[[SIGMOID]] {domain_name = "rocmlir", implementation_attrs = "", operator_name = "expand_strides"}
// CHECK-SAME: (tensor<4x24x24xf16>) -> tensor<4x48x24xf16>
// CHECK: tosa.reshape
// CHECK-SAME: -> tensor<4608xf16>
This is a bit confusing to me.
Originally the function returns !migraphx.shaped<4x24x24xf16, 1152x24x1>, but after migraphx-to-tosa your approach makes it return tensor<4608xf16>. Is that valid? We are changing not only !migraphx.shaped to tensor but, more importantly, the shape information. The former has 2304 elements while the latter has 2x. And the key problem I see is that the tensor does not explicitly say that there is a non-unit stride.
That could break in some cases, right?
- Is migraphx safe if we change the memory layout of the output tensor like this?
- Can you check what would happen if we have something else after the migraphx.sigmoid? Would it still work?
To answer your questions:
- I talked with Paul and MIGraphX is expecting the memory layout of the output to be in this state. Specifically they are looking for that strided pattern where the first 576 elements are the actual values and the second 576 are uninitialized padding. The larger tensor with 2304 extra elements is expected.
- I created some sample code to test this out:
...
%1 = migraphx.sigmoid %0 : <4x24x24xf16, 576x24x1> -> <4x24x24xf16, 1152x24x1>
%cst = migraphx.literal(dense<1.0> : tensor<1xf16>) : <1xf16, 1>
%2 = migraphx.multibroadcast %cst {out_dyn_dims = [], out_lens = [4, 24, 24]} : <1xf16, 1> -> <4x24x24xf16, 1152x24x1>
%3 = migraphx.add %1, %2 : <4x24x24xf16, 1152x24x1>, <4x24x24xf16, 1152x24x1> -> <4x24x24xf16, 1152x24x1>
return %3 : !migraphx.shaped<4x24x24xf16, 1152x24x1>

It looks like this generates the following IR after MIGraphXToTosa:
%8 = tosa.sigmoid %7 : (tensor<4x24x24xf16>) -> tensor<4x24x24xf16>
%9 = "tosa.const"() <{values = dense<1.000000e+00> : tensor<4x24x24xf16>}> : () -> tensor<4x24x24xf16>
%10 = tosa.add %8, %9 : (tensor<4x24x24xf16>, tensor<4x24x24xf16>) -> tensor<4x24x24xf16>
%11 = tosa.custom %10 {domain_name = "rocmlir", implementation_attrs = "", operator_name = "expand_strides"} : (tensor<4x24x24xf16>) -> tensor<4x48x24xf16>
%12 = tosa.const_shape {values = dense<4608> : tensor<1xindex>} : () -> !tosa.shape<1>
%13 = tosa.reshape %11, %12 : (tensor<4x48x24xf16>, !tosa.shape<1>) -> tensor<4608xf16>
return %13 : tensor<4608xf16>

The expand_strides gets pushed to after the add, which seems correct? I'm trying to think of other cases where this may break...
Okay, if that is what MIGraphX wants I think it makes sense. This might break in some cases, but if it ever does, fixing it is definitely not for this PR.
I'll update the title of this PR and the description!
auto outputType = cast<RankedTensorType>(op.getResult(0).getType());
// Allocate the destination tensor with the larger (padded) size
Value dest =
So CPU and GPU paths will be different here? CPU will lower it to tensor ops (mlir/lib/Conversion/RocmlirCustomTosaDecompose/RocmlirCustomTosaDecompose.cpp) and this will be the GPU path? Or am I getting confused?
Correct, TosaToRock will generate rock.expand_strides ops for the GPU path, and the CPU path will generate tensor ops in RocmlirCustomTosaDecompose. A sketch of the two paths follows.
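To make the divergence concrete, here is a rough sketch of both lowerings, assuming the shapes from the tests; the value names (%src, %dst, %vals, %empty) are illustrative and the exact op spellings in the passes may differ:

// GPU path (TosaToRock, shown here after bufferization): a dedicated rock op
// records the expansion on memrefs.
rock.expand_strides %src into %dst : memref<4x24x24xf16> into memref<4x48x24xf16>
// CPU path (RocmlirCustomTosaDecompose): plain tensor ops place the valid
// values into a larger uninitialized tensor.
%empty = tensor.empty() : tensor<4x48x24xf16>
%full = tensor.insert_slice %vals into %empty[0, 0, 0] [4, 24, 24] [1, 1, 1] : tensor<4x24x24xf16> into tensor<4x48x24xf16>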
// Verify element types match
if (inputType.getElementType() != outputType.getElementType())
  return emitOpError("input and output must have the same element type");
Is there any other restriction? What happens if more than one dimension is different?
More than one dimension being different is allowed. We also check that the output is >= the input in all dimensions and that each output dimension is a multiple of the corresponding input dimension (see the examples below).
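For instance, with hypothetical shapes that follow the restrictions described above:

// Accepted: two dimensions expanded at once, 4 -> 8 and 24 -> 48, and each
// output dimension is an exact multiple of the corresponding input dimension.
rock.expand_strides %in into %out : memref<4x24x24xf16> into memref<8x48x24xf16>
// Rejected: 24 -> 36 is larger but not a multiple of 24.
// rock.expand_strides %in into %bad : memref<4x24x24xf16> into memref<4x36x24xf16>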
%alloc_1 = memref.alloc() {alignment = 64 : i64} : memref<4x48x24xf16>
rock.expand_strides %alloc_0 into %alloc_1 : memref<4x24x24xf16> into memref<4x48x24xf16>
// CHECK: %[[TRANSFORM:.*]] = rock.transform %alloc_1 {{.*}} memref<4x48x24xf16> to memref<4x24x24xf16>
// CHECK: memref.copy %alloc_0, %[[TRANSFORM]] : memref<4x24x24xf16> to memref<4x24x24xf16>
So the final IR ends up with two memref.copy ops, one from alloc_0 to alloc_1, then alloc_1 to arg2. Does the code handle that correctly? What happens if there are more memref.copy ops?
Would it work if the first arg to rock.expand_strides is a transform map as well?
The two copies are independent. The rock.expand_strides lowering to a memref.copy will always just populate the valid portion of the expanded buffer. The second copy is just moving the entire buffer, with the uninitialized gaps, to the output (this was already present in the pre-transform IR). If there are more memref.copy ops in the IR, that should not have an impact on the expand_strides lowering, since each one is independently lowered to transform + copy. Other memref.copy ops in the IR are unaffected; they're just separate memory operations that happen to operate on the same or related buffers.
As for the question about the input to expand_strides being a rock.transform, I don't think that should make a difference. expand_strides doesn't care where the input is coming from, as long as that input is a valid memref value. The memref.copy would copy from the transformed input (from the rock.transform) to the transformed/expanded output, as in the sketch below.
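A rough sketch of that scenario under this lowering; #map_in and #map_out stand in for the actual transform-map attributes (elided in the test's CHECK lines), and the buffer names are illustrative:

// Hypothetical: the expand_strides input is itself a transformed view.
%in_view = rock.transform %alloc by #map_in : memref<2304xf16> to memref<4x24x24xf16>
rock.expand_strides %in_view into %alloc_1 : memref<4x24x24xf16> into memref<4x48x24xf16>
// After LowerExpandStrides, the copy simply reads through the input view:
%out_view = rock.transform %alloc_1 by #map_out : memref<4x48x24xf16> to memref<4x24x24xf16>
memref.copy %in_view, %out_view : memref<4x24x24xf16> to memref<4x24x24xf16>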
Motivation
This PR adds support for non-contiguous (long stride) tensors in the MIGraphX -> Rock pipeline. Previously, ops would fail with the error "writing to tensors with long strides or broadcasts is unsupported" when the output tensor had padded strides.
This implements: #2195
Technical Details
MIGraphXToTosa Changes
Instead of rejecting non-contiguous output strides, we now handle them by creating a tosa.CustomOp for expand_strides.
TosaToRock (GPU lowering path)
Lower to a rock.expand_strides op.
RocmlirCustomTosaDecompose (CPU lowering path)
Lower to an empty larger tensor +
tensor.insert_sliceto put the values in the correct spots in the expanded memory layout. Note, if we went with this approach for the GPU path as well then it would create memref.sub_view ops that would need to be handled in AlignTiling (had this approach in a previous iteration of this PR).LowerExpandStrides
Created a new pass that runs after bufferization and lowers a rock.expand_strides to a rock.transform + memref.copy (the transform-plus-copy pattern shown in the sketch above).
Test Plan
Test Result
Submission Checklist