Remove redundant cast chains #1944
Conversation
Pull Request Overview
This PR introduces a new optimization pass that removes redundant cast chains in the IR. The pass identifies patterns where values are converted from f32 to smaller precision types (e.g., f16) and then immediately extended back to f32, eliminating these unnecessary conversions to simplify the IR and improve performance.
Key changes include:
- Implementation of a new `RockRemoveRedundantCastsPass` that handles MFMA operations and generic cast chains
- Integration of the pass into the compilation pipeline at appropriate stages
- Comprehensive test coverage for different cast chain scenarios
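For illustration, a minimal sketch of the pattern the pass targets (the SSA names, shapes, and `#id` indexing map are hypothetical, not taken from this PR):

```mlir
// %gemm_out : tensor<5x3xf32> is truncated to f16 and immediately extended back.
%trunc = linalg.generic {indexing_maps = [#id, #id], iterator_types = ["parallel", "parallel"]}
    ins(%gemm_out : tensor<5x3xf32>) outs(%t : tensor<5x3xf16>) {
  ^bb0(%in: f32, %out: f16):
    %0 = arith.truncf %in : f32 to f16
    linalg.yield %0 : f16
} -> tensor<5x3xf16>
%ext = linalg.generic {indexing_maps = [#id, #id], iterator_types = ["parallel", "parallel"]}
    ins(%trunc : tensor<5x3xf16>) outs(%e : tensor<5x3xf32>) {
  ^bb0(%in: f16, %out: f32):
    %1 = arith.extf %in : f16 to f32
    linalg.yield %1 : f32
} -> tensor<5x3xf32>
// After the pass, consumers of %ext can use %gemm_out directly.
```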
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/RemoveRedundantCasts.cpp | Core implementation of the redundant cast removal pass |
| mlir/test/Dialect/Rock/remove_redundant_casts*.mlir | Test files covering various cast chain removal scenarios |
| mlir/test/rocmlir-driver/pipelines.mlir | Pipeline test update to include the new pass |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Integration of the pass into the compilation pipelines |
| mlir/include/mlir/Dialect/Rock/Passes.* | Pass declaration and registration |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Build system integration |
| mlir/lib/Dialect/Rock/Transforms/GemmToGridwise.cpp | Debug output addition |
```tablegen
def RockRemoveRedundantCastsPass : Pass<"rock-remove-redundant-casts", "::mlir::func::FuncOp"> {
  let summary = "Remove redundant casts between ops";
  let dependentDialects = ["rock::RockDialect", "linalg::LinalgDialect"];
```
Q: It uses the arith dialect as well. It is not explicitly listed here, but it seems to be working fine. Do you know why that is?
I believe this is because `dependentDialects` just serves as a hint to the compiler about which dialects it needs to load for a given pass. Regardless, I've added Arith here, as it makes sense to do so.
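For reference, a hedged sketch of what the updated declaration might look like (assuming the upstream dialect's C++ class is `arith::ArithDialect`, as registered in upstream MLIR):

```tablegen
let dependentDialects = ["rock::RockDialect", "linalg::LinalgDialect",
                         "arith::ArithDialect"];
```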
```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Bufferization/IR/Bufferization.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
```
Why do you require the GPU dialect?
```cpp
// with either a single trunc or ext op. e.g.,:
// %1 = migraphx.convert %0 : <1x5x3xf16, 15x3x1> to <1x5x3xf32, 15x3x1>
template <typename OpType>
bool isGenericWithSingleOp(linalg::GenericOp generic) const {
```
We are running linalg-elementwise-op-fusion in the high-level pipeline, which would fuse multiple linalg generic ops into a single one. Therefore, I don't think it is always guaranteed that the linalg.generic will only contain a single operation.
I think extf/truncf are a special case because they require tensors/memrefs of different types on inputs and outputs; therefore it is probably true that the linalg.generic will only contain extf/truncf and won't be fused with other linalg ops when we run linalg-elementwise-fusion. Do you think that's the case?
You can probably write a small test and run it through linalg-elementwise-fusion.
Yes, I agree. I think it would be better to just check its uses and walk back until you find the pattern dtype -> dtype2, dtype2 -> dtype, instead of assuming the linalg.generic will have one op.
```cpp
// We don't need to investigate BlockArguments any further
Value input = generic.getInputs()[0];
if (isa<BlockArgument>(input))
  return nullptr;
```
Should it be returning `input`?
```cpp
// If this op uses mfma, it will accumulate in higher precision (F32 or I32)
auto features = rock::getFeatures(rockOp);
bool isMfma = bitEnumContainsAll(features, GemmFeatures::mfma);
```
This should also work on wmma; it also accumulates in higher precision.
Also check whether the non-accel path does its accumulation in higher precision or not.
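A hedged fragment (not standalone; it assumes rocMLIR's `GemmFeatures` bit enum and the MLIR tablegen-generated `bitEnumContainsAny` helper) of what checking both accelerated paths might look like:

```cpp
// Both mfma and wmma accumulate in higher precision (f32/i32),
// so treat either feature as an accumulating accelerator.
auto features = rock::getFeatures(rockOp);
bool accumulatesHigh =
    bitEnumContainsAny(features, GemmFeatures::mfma | GemmFeatures::wmma);
```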
```cpp
auto inputType = cast<RankedTensorType>(input->getResult(0).getType());
auto outputType = cast<RankedTensorType>(output->getResult(0).getType());
```
Use `ShapedType`.
```cpp
Value outputArg = args[1];
Type oType = outputArg.getType();
Value truncResult =
    builder.create<arith::TruncFOp>(loc, oType, blockArg);
```
`arith::TruncFOp` wouldn't work on integer types.
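For reference, upstream arith has separate float and integer truncation ops; dispatching on the element type is the fix being suggested (the operand names and bit widths here are illustrative):

```mlir
%f = arith.truncf %a : f32 to f16   // floating-point truncation only
%i = arith.trunci %b : i32 to i8    // integer truncation only
```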
```mlir
%5 = rock.transform %4 by #transform_map2 : tensor<1x5x3xf32> to tensor<5x3xf32>

%temp_alloc = bufferization.alloc_tensor() : tensor<5x3xf16>
// CHECK-NOT: %downcast
```
Do not use variable names inside checks; variable names can change. For example, if I comment out the logic for remove-redundant-cast and just run an empty pass, the variable name changes from %downcast to %6.
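A hedged sketch of a name-agnostic check: capture the SSA value with a FileCheck substitution block and match on the op itself rather than on the variable name (the `CAST` capture name and the checked ops are illustrative, not from this PR's tests):

```mlir
// CHECK: %[[CAST:.+]] = linalg.generic
// CHECK-NOT: arith.truncf
```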
```cpp
} else if (isa<linalg::GenericOp>(user) &&
           (isGenericWithSingleOp<arith::ExtFOp>(
                cast<linalg::GenericOp>(user)) ||
            isGenericWithSingleOp<arith::ExtSIOp>(
                cast<linalg::GenericOp>(user)) ||
            isGenericWithSingleOp<arith::ExtUIOp>(
                cast<linalg::GenericOp>(user)))) {
```
Why does the formatting look incorrect here?
More tests to add:
```mlir
    %1 = migraphx.convert %0 : <1x5x3xf16, 15x3x1> to <1x5x3xf32, 15x3x1>
    return %1 : !migraphx.shaped<1x5x3xf32, 15x3x1>
  }
}
```
Add a newline at the end of the file.
Can we add a test for rock.attention? Especially the one in the ticket of this PR.
```cpp
auto &funcPm3 = pm.nest<func::FuncOp>();
funcPm3.addPass(bufferization::createEmptyTensorToAllocTensorPass());
funcPm3.addPass(createLinalgFoldUnitExtentDimsPass());
funcPm3.addPass(rock::createRockRemoveRedundantCastsPass());
```
The attention pass will create its own conversion linalgs well after this. So I think for this to work on attention we need it to happen at least after ToBlockwise. Note that the case highlighted in the ticket is attention, so it's the main goal of this PR I think: https://github.com/ROCm/rocMLIR-internal/issues/1932
Unfortunately, this is going to be a major change to the code in this PR, because all of this happens after bufferization.
I spoke with @umangyadav about this last week as I was trying to avoid this case. Tracing uses becomes tricky because we need to come up with some additional memory analysis passes that can trace uses of a memref to find out if any reads/writes happen between two ops. I had an initial version of this in my first commits, but then changed it to this approach. Maybe we can discuss further after standup tomorrow.
IMO it only makes sense to do it after ToBlockwise, otherwise we won't be solving the issue described in the ticket.
I think doing it after ToBlockwise is simpler: you don't need to have a special case for rock.fusionop (gemm/conv etc.); it's only linalg.generics. You only need to get the linalg.generic and keep tracing back (through allocs with BufferDependencyAnalysis) to find the pattern you are looking for.
Yes, let's discuss in the meeting.
```tablegen
}

def RockRemoveRedundantCastsPass : Pass<"rock-remove-redundant-casts", "::mlir::func::FuncOp"> {
  let summary = "Remove redundant casts between ops";
```
nit: be more explicit about what we consider redundant.
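One possible wording, for illustration only:

```tablegen
let summary = "Remove redundant trunc/ext cast chains (e.g. f32 -> f16 -> f32) between ops";
```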
```cpp
// IR, specifically patterns where a value is converted from f32 to a smaller
// type (e.g., f16) and then immediately extended back to f32. By eliminating
// these unnecessary conversions, the pass simplifies the IR and can improve
// performance by reducing superfluous operations and memory traffic.
```
nit: I don't think memory traffic is relevant here; everything is in registers in the assembly in most cases, right?
```cpp
// the return type isn't F32 or I32 (highest level of precision), then it
// means that there will be a trunc op inserted in RockGemmToGridwise
// that will potentially be redundant.
changed |= handleRockGemmWrapper(input, generic, rewriter);
```
As discussed earlier, we need this pass to happen later on to work properly for attention. So there's no need to handle this special case; there will be a linalg.generic for this after ToBlockwise (I think).
```cpp
Value getExtInput(linalg::GenericOp generic) const {
  // Check that there is only one input to the generic operation
  if (generic.getInputs().size() != 1) {
```
I don't think this is a requirement.
```cpp
// Helper function to create a new output value for a
// RockGemmWrapperInterface/RockGemmGemmwrapperInterface Op or LinalgGeneric
// TruncOp, and any corresponding rock.transformOps
Value createNewOutput(Value prevValue,
```
Use rock::transform.
```cpp
// If there is only a single ExtOp use, then we can go ahead and remove
// all of the TransformOps and the original truncf
auto singleExtOp = std::get<1>(tup);
```
If some ops are not used, they will get removed by the canonicalization stage; I think there's no need to do it explicitly.
Went with a different approach (at the LLVMIR dialect level) to avoid a lot of the issues in dealing with linalg.generics. That new approach is in draft here: #2202
Motivation
This pass identifies and removes redundant cast chains in the IR, specifically patterns where a value is converted from f32 to a smaller type (e.g., f16) and then immediately extended back to f32. By eliminating these unnecessary conversions, the pass simplifies the IR and can improve performance by reducing superfluous operations and memory traffic.
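As a side note, the f32 -> f16 -> f32 round-trip that the pass removes can be reproduced with stdlib Python's half-precision `struct` format code `'e'`; a minimal illustration (independent of the rocMLIR code, purely to show the conversion semantics):

```python
import struct

def roundtrip_f16(x: float) -> float:
    """Truncate a float to IEEE half precision, then extend it back."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Values exactly representable in f16 survive the round-trip unchanged;
# others pick up a rounding step that removal of the chain would skip.
print(roundtrip_f16(1.0))   # 1.0
print(roundtrip_f16(0.1))   # 0.0999755859375
```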
Technical Details
This pass can currently handle the following situations:
Test Plan
Test Result
I've manually examined the final generated assembly in the attached design and confirmed that we no longer have the redundant cast chains.
Submission Checklist