Regarding the access pattern of matrix A, I have a question:

- When data is transferred from global memory to shared memory, swizzling is applied.
- During the ldmatrix operation, each thread retrieves the corresponding address required by ldmatrix, ensuring that no bank conflicts occur in any transaction.
For example, thread 0 reads from position 0 in shared memory, thread 1 reads from position 9, thread 2 reads from position 18, and so on. Threads 0 to 7 therefore access different banks, which avoids bank conflicts. The data these 8 threads access corresponds to 8 vertically stacked rows in global memory (highlighted in light blue).
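To make the positions above concrete, here is a minimal sketch of the swizzle as I understand it. It assumes positions are counted in 16-byte chunks, that one shared-memory row holds 8 such chunks (128 bytes), and that the swizzle is an XOR of the chunk index with the row index. The names `swizzled_chunk` and `banks_for_chunk` are hypothetical helpers for illustration, not functions from the library.

```python
# Assumed layout: 8 chunks of 16 bytes per 128-byte shared-memory row,
# 32 banks of 4 bytes each. These constants are assumptions for the sketch.
CHUNKS_PER_ROW = 8
CHUNK_BYTES = 16
BANK_BYTES = 4
NUM_BANKS = 32


def swizzled_chunk(row, chunk):
    """Physical chunk position of logical (row, chunk) under an XOR swizzle."""
    return row * CHUNKS_PER_ROW + (chunk ^ (row % CHUNKS_PER_ROW))


def banks_for_chunk(pos):
    """Set of banks touched by the 16-byte chunk at physical position pos."""
    first = (pos * CHUNK_BYTES // BANK_BYTES) % NUM_BANKS
    return set(range(first, first + CHUNK_BYTES // BANK_BYTES))


# Threads 0 to 7 each read logical chunk 0 of rows 0 to 7.
positions = [swizzled_chunk(lane, 0) for lane in range(8)]
bank_sets = [banks_for_chunk(p) for p in positions]

# Conflict-free if the 8 chunks together cover all 32 banks exactly once.
conflict_free = len(set().union(*bank_sets)) == NUM_BANKS

print(positions)      # [0, 9, 18, 27, 36, 45, 54, 63]
print(conflict_free)  # True
```

Under these assumptions the positions match what I see for threads 0 to 7 (0, 9, 18, ...), and each thread lands on a distinct group of four banks.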
Up to this point, I understand the process and appreciate the cleverness of the swizzling technique. The problem starts with thread 8: it accesses the content 257, which is not in the same column as the blocks accessed by threads 0 to 7.
According to the diagram in the documentation, threads 0 to 7 should access the top-left block, threads 8 to 15 should access the top-right block, threads 16 to 23 the bottom-left, and threads 24 to 31 the bottom-right.
However, based on the logs I printed, threads 0 to 7 are accessing the top-left block, threads 8 to 15 the bottom-right block, threads 16 to 23 the top-right block, and threads 24 to 31 the bottom-left block. With such access patterns, the data retrieved by ldmatrix is likely incorrect.
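The mismatch can be tabulated with a small sketch. It relies on the rule that, for a four-matrix ldmatrix, each group of 8 consecutive threads supplies the row addresses of one 8x8 matrix (thread index divided by 8 gives the matrix index). The `EXPECTED` order is my reading of the documentation diagram and the `OBSERVED` order is taken from my logs; both lists and the helper names are just for illustration.

```python
# Block order per 8-thread group, as I read the documentation diagram.
EXPECTED = ["top-left", "top-right", "bottom-left", "bottom-right"]
# Block order per 8-thread group, as seen in my printed logs.
OBSERVED = ["top-left", "bottom-right", "top-right", "bottom-left"]


def matrix_index(lane):
    """Index of the 8x8 matrix whose row address this lane supplies."""
    return lane // 8


def block_for_lane(lane, order):
    """Block that a lane points into, given a per-matrix block order."""
    return order[matrix_index(lane)]


# Collect the thread groups whose observed block differs from the expected one.
mismatches = [
    (f"threads {m * 8}-{m * 8 + 7}", EXPECTED[m], OBSERVED[m])
    for m in range(4)
    if EXPECTED[m] != OBSERVED[m]
]

for group, expected, observed in mismatches:
    print(f"{group}: expected {expected}, observed {observed}")
```

Only threads 0 to 7 agree; the other three groups each point at a different block than the diagram suggests.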
Where might the issue lie? Is there something wrong with my understanding? I'm very confused.
