Conversation

@5000user5000 (Owner)

Based on the investigation in issue #30, LUT GEMM was found to suffer from high L1 cache miss rates, making memory latency, rather than computation, the performance bottleneck. Several optimizations were applied: prefetching the LUT into the L1 cache, and tiling the matrices so that no chunk is large enough to thrash the cache. Multithreading was also introduced to further improve performance, and the original naive implementation was updated to support it as well, making it easier to compare the two versions under a fixed thread count.
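A minimal sketch of what the prefetch-plus-tiling combination looks like in a LUT-based GEMM inner loop. All names here (`lut_gemm_tiled`, `TILE`, the index layout) are illustrative, not the PR's actual code; the point is only the shape of the optimization: prefetch the table before each tile, then keep the working set tile-sized.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch (not the PR's code): accumulate c[n] += LUT[w[k][n]]
// over k, where w holds small integer weight indices and the LUT holds
// precomputed partial products. Table lookups replace multiplications.

constexpr int TILE = 64;  // tile edge chosen so a tile's working set fits in L1

void lut_gemm_tiled(const float* lut, int lut_len,
                    const uint8_t* w, float* c, int K, int N) {
    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Prefetch the LUT so the inner-loop gathers hit L1 instead of DRAM.
        for (int i = 0; i < lut_len; i += 16)
            __builtin_prefetch(lut + i, /*rw=*/0, /*locality=*/3);
        for (int n0 = 0; n0 < N; n0 += TILE) {
            // Tiled loops: each (k,n) tile touches a bounded slice of w and c,
            // avoiding the overly large chunks that caused cache misses.
            const int kmax = std::min(k0 + TILE, K);
            const int nmax = std::min(n0 + TILE, N);
            for (int k = k0; k < kmax; ++k)
                for (int n = n0; n < nmax; ++n)
                    c[n] += lut[w[k * N + n]];  // table lookup, no multiply
        }
    }
}
```

For multithreading, the tile structure also gives a natural parallel split: different threads can own disjoint ranges of `n0` tiles, since each writes a disjoint slice of `c`.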
Lastly, tests/test_matrix_ops.cpp was added to help dump the assembly of both the naive (float) and LUT GEMM versions for comparison.

@5000user5000 (Owner, Author)

@yungyuc
I've already looked into Monday's request — identifying why the LUT implementation performs similarly to, or even slightly worse than, the naive version. In this PR, I've also made efforts to mitigate some of LUT's drawbacks, such as excessive L1 cache misses.

@yungyuc commented May 7, 2025

Good findings. When adding them to the presentation, make sure you address all the assessment points for grading.

Where in the code do you want my review? Please add inline annotations.

Don't hesitate to merge the PR. Review and discussions still work after merging.

@5000user5000 5000user5000 merged commit f97df84 into main May 8, 2025
2 checks passed
