Introduce quick-tune lists for Attention #2203
base: develop
Conversation
I expect a huge speed-up for quick tuning, as the attention list was hardcoded and arbitrary. Can you show the perf of develop vs. this branch for the tier1 attention (and g+g) list?
return std::nullopt;
// Parse "vN:" - if not present, assume version 1
int version = 1;
if (rest.consume_front("v")) {
why is this needed?
I wasn't really sure if v1 had the version string or not, so I played it safe. I suppose this is not that important since I doubt v1 is being used anywhere.
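(For reference, a minimal sketch of the defaulting behavior discussed here, assuming an llvm::StringRef-based parser; the function name and exact error handling below are illustrative, not the code in this PR.)

#include "llvm/ADT/StringRef.h"
#include <optional>

// Sketch only: parse an optional "vN:" prefix from a perf-config string.
// If the prefix is absent, assume version 1 (the "play it safe" fallback
// discussed above).
static std::optional<int> parseVersionPrefix(llvm::StringRef &rest) {
  int version = 1; // no "vN:" prefix -> treat as v1
  if (rest.consume_front("v")) {
    // consumeInteger() returns true when it fails to parse a number.
    if (rest.consumeInteger(/*Radix=*/10, version))
      return std::nullopt;
    if (!rest.consume_front(":"))
      return std::nullopt;
  }
  return version;
}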
raise ValueError(f"Unknown operation: {op}")


def parse_perfconfig(perfconfig):
nit: could we use python bindings to parse this?
};
// END_ATTENTION_GemmGemm_f32_gfx942_DEFS

// BEGIN_ATTENTION_GemmGemm_f16_gfx942_DEFS
I wasn't aware we had separate bf16 and f16 perfConfigs; I think they should be the same?
For the other operations we haven't tuned for bf16 so far, and currently we use the f16 list anyway:
// We use "f16" for bf16 and f16 generically
But now that we have a bf16 list, we should change this behavior so that it only falls back to f16 if no bf16 list is present.
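(To illustrate the fallback proposed in the comment above, a minimal sketch; the helper name and signature are hypothetical, not the actual PopulateParams interface.)

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"

// Sketch: pick the quick-tune list for a datatype, falling back from bf16
// to the f16 list only when no dedicated bf16 list exists.
llvm::ArrayRef<llvm::StringRef>
pickQuickTuneList(llvm::StringRef dataType,
                  llvm::ArrayRef<llvm::StringRef> f16List,
                  llvm::ArrayRef<llvm::StringRef> bf16List) {
  if (dataType == "bf16" && !bf16List.empty())
    return bf16List; // use the dedicated bf16 list when we have one
  if (dataType == "bf16" || dataType == "f16")
    return f16List;  // previous behavior: bf16 reuses the f16 list
  return {};         // other datatypes handled elsewhere in this sketch
}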
But my point is that I don't think we should make a distinction between f16 and bf16. I don't see a reason why the perfConfig should be different for those two. The reason one type gets a different perfConfig is that mfma/wmma behaves differently, or that it loads from memory/LDS with different efficiency, etc. That should be the same for f16/bf16, right?
So the same conv/attention/gemm configs that happen to have bf16 entries in the list could be run with f16 as well, or the other way around.
Oh, I see what you mean. That makes sense.
In that case, I can just treat bf16 and f16 as the same during generation and it will consolidate everything into one list. I can also skip tuning the same test vector for bf16 if it has already been tuned for f16, and vice versa.
If you have already tuned bf16/f16 for the same problem config, do you see very different TFLOPS?
However, if at some point we enable a split-K quick-tune list, we would probably want to keep separate lists for bf16/f16 (for some archs), because some archs have atomic instructions for f16 but not for bf16.
// END_ATTENTION_GemmGemm_f16_gfx1000_DEFS

// BEGIN_ATTENTION_GemmGemm_f32_gfx942_DEFS
const StringRef PopulateParamsGemmGemm::initParametersF32AttentionGfx942[] = {
How come the gfx942 list is small compared to, for example, gfx12?
Given how the selection works, it means there is less variation across the best-performing configs on gfx942. I can look into the data some more to see why that is.
Also of note: gfx942 was tuned on 4 GPUs, while gfx12 was tuned on 2 GPUs. That may or may not be relevant.
Motivation
Technical Details
Related to https://github.com/ROCm/rocMLIR-internal/issues/2136
Test Plan
Confirm if the new quick-tune lists perform better than the old ones.
Test Result
On gfx942, the new quick-tune lists perform ~30% better.
Submission Checklist