fix(transformer): use correct stride in Transpose_Kernel shared memory indexing to eliminate bank conflicts by flutist · Pull Request #8055 · deepspeedai/DeepSpeed

flutist · 2026-06-09T09:26:39Z

PR Title

fix(transformer): use correct stride in Transpose_Kernel shared memory indexing to eliminate bank conflicts

PR Description

Summary

Fix a shared memory bank conflict bug in Transpose_Kernel where the +1
padding declared in the shared memory array was not used in the indexing.

Problem

The shared memory is declared with padding to avoid bank conflicts:

__shared__ T data_block[rows_trans * (cols_trans + 1)];  // 16 × 17

But both write and read indices use cols_trans (16) as stride instead of
cols_trans + 1 (17), making the padding ineffective and causing bank
conflicts on column-wise reads during transpose.

Fix

Change indexing stride from cols_trans to (cols_trans + 1) in both
the write and read loops. 2 lines changed.

Benchmark (NVIDIA L20, 1024×1024 float32)

nsys profile (1101 kernel calls):

Transpose_Kernel_Original: avg 5,287 ns/call  (51.1%)
Transpose_Kernel_Fixed:    avg 5,061 ns/call  (48.9%)
Speedup: ~4.3%

CUDA Event timing (1000 iterations):

Original: 0.0054 ms/iter  (1553.9 GB/s)
Fixed:    0.0051 ms/iter  (1643.8 GB/s)
Speedup: 5.8%

The kernel is already near DRAM bandwidth peak on L20 (~80% utilization),
partially masking the bank conflict overhead. Larger gains are expected on
GPUs with lower memory bandwidth.

Testing

Correctness was verified with a standalone CUDA A/B test (see below).
Both original and fixed kernels produce identical results (PASS, 0 errors).

No pytest unit test is added because Transpose_Kernel is not exposed to
Python via pybind11 — it is called internally by other C++ transformer
functions in ds_transformer_cuda.cpp. Adding a Python-level test would
require introducing a new pybind binding solely for this kernel, which is
disproportionate for a 2-line bugfix.

Standalone CUDA test script

#include <cstdio>
#include <cuda_fp16.h>

#define rows_trans 16
#define cols_trans 16
#define THREADS 256

template <typename T>
__global__ void Transpose_Kernel_Fixed(const T* inp, T* out, int row_width, int col_width) {
    __shared__ T data_block[rows_trans * (cols_trans + 1)];
    int r = threadIdx.x / cols_trans;
    int c = threadIdx.x % cols_trans;
    int m = row_width / cols_trans;
    int i = blockIdx.x / m * rows_trans + r;
    int j = blockIdx.x % m * cols_trans + c;
    int row_stride = rows_trans / ((rows_trans * cols_trans + THREADS - 1) / THREADS);
    for (int k = 0; k < rows_trans; k += row_stride)
        data_block[(k + r) * (cols_trans + 1) + c] = inp[(i + k) * row_width + j];
    __syncthreads();
    i = blockIdx.x % m * rows_trans + r;
    j = blockIdx.x / m * cols_trans + c;
    for (int k = 0; k < rows_trans; k += row_stride)
        out[(i + k) * col_width + j] = data_block[c * (cols_trans + 1) + r + k];
}

template <typename T>
__global__ void Transpose_Kernel_Original(const T* inp, T* out, int row_width, int col_width) {
    __shared__ T data_block[rows_trans * (cols_trans + 1)];
    int r = threadIdx.x / cols_trans;
    int c = threadIdx.x % cols_trans;
    int m = row_width / cols_trans;
    int i = blockIdx.x / m * rows_trans + r;
    int j = blockIdx.x % m * cols_trans + c;
    int row_stride = rows_trans / ((rows_trans * cols_trans + THREADS - 1) / THREADS);
    for (int k = 0; k < rows_trans; k += row_stride)
        data_block[(k + r) * cols_trans + c] = inp[(i + k) * row_width + j];
    __syncthreads();
    i = blockIdx.x % m * rows_trans + r;
    j = blockIdx.x / m * cols_trans + c;
    for (int k = 0; k < rows_trans; k += row_stride)
        out[(i + k) * col_width + j] = data_block[c * cols_trans + r + k];
}

int main() {
    int rows = 1024, cols = 1024;
    size_t count = rows * cols;
    size_t sz = count * sizeof(float);
    float *h_in = (float*)malloc(sz);
    float *h_out = (float*)malloc(sz);
    for (size_t i = 0; i < count; i++) h_in[i] = (float)i;
    float *d_in, *d_out;
    cudaMalloc(&d_in, sz); cudaMalloc(&d_out, sz);
    cudaMemcpy(d_in, h_in, sz, cudaMemcpyHostToDevice);
    int threads = THREADS;
    int blocks = (rows * cols + threads - 1) / threads;
    int iters = 1000, warmup = 100;

    cudaMemset(d_out, 0, sz);
    Transpose_Kernel_Fixed<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaDeviceSynchronize();
    cudaMemcpy(h_out, d_out, sz, cudaMemcpyDeviceToHost);
    int err_f = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            if (h_out[j * rows + i] != h_in[i * cols + j]) err_f++;

    cudaMemset(d_out, 0, sz);
    Transpose_Kernel_Original<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaDeviceSynchronize();
    cudaMemcpy(h_out, d_out, sz, cudaMemcpyDeviceToHost);
    int err_o = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            if (h_out[j * rows + i] != h_in[i * cols + j]) err_o++;

    printf("=== Correctness ===\n");
    printf("  Original: %s (%d errors)\n", err_o == 0 ? "PASS" : "FAIL", err_o);
    printf("  Fixed:    %s (%d errors)\n", err_f == 0 ? "PASS" : "FAIL", err_f);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    for (int i = 0; i < warmup; i++)
        Transpose_Kernel_Original<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaDeviceSynchronize();
    cudaEventRecord(start);
    for (int i = 0; i < iters; i++)
        Transpose_Kernel_Original<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaEventRecord(stop); cudaDeviceSynchronize();
    float ms_o; cudaEventElapsedTime(&ms_o, start, stop);

    for (int i = 0; i < warmup; i++)
        Transpose_Kernel_Fixed<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaDeviceSynchronize();
    cudaEventRecord(start);
    for (int i = 0; i < iters; i++)
        Transpose_Kernel_Fixed<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaEventRecord(stop); cudaDeviceSynchronize();
    float ms_f; cudaEventElapsedTime(&ms_f, start, stop);

    printf("\n=== Performance (1024x1024 float32, %d iters) ===\n", iters);
    printf("  Original: %.4f ms/iter  (%.1f GB/s)\n", ms_o/iters, 2.0*sz/1e9/(ms_o/iters/1000));
    printf("  Fixed:    %.4f ms/iter  (%.1f GB/s)\n", ms_f/iters, 2.0*sz/1e9/(ms_f/iters/1000));
    printf("  Speedup:  %.1f%%\n", (ms_o/ms_f - 1.0) * 100.0);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    free(h_in); free(h_out); cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Build and run:

nvcc -O2 -o /tmp/test_transpose /tmp/test_transpose_kernel.cu && /tmp/test_transpose

Signed-off-by: xjx <493337577@qq.com>

flutist · 2026-06-09T12:20:18Z

Sorry to bother you, but could you please help me merge this PR file? This solved the problem. If there's anything else I can do, I'll continue. I'm very happy to hear your response.

@tjruwase

flutist · 2026-06-09T12:20:21Z

Sorry to bother you, but could you please help me merge this PR file? This solved the problem. If there's anything else I can do, I'll continue. I'm very happy to hear your response.

@tjruwase

tohtana

This looks good to me. Thank you for your contribution, @flutist!

flutist requested a review from tjruwase as a code owner June 9, 2026 09:26

Fix shared memory

91c41f5

Signed-off-by: xjx <493337577@qq.com>

flutist force-pushed the fix_shared_memory branch from 8501668 to 91c41f5 Compare June 9, 2026 09:27

tohtana approved these changes Jun 10, 2026

View reviewed changes

Merge branch 'master' into fix_shared_memory

59eef67

tohtana enabled auto-merge (squash) June 10, 2026 04:18

tohtana merged commit 65e5f6f into deepspeedai:master Jun 10, 2026
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(transformer): use correct stride in Transpose_Kernel shared memory indexing to eliminate bank conflicts#8055

fix(transformer): use correct stride in Transpose_Kernel shared memory indexing to eliminate bank conflicts#8055
tohtana merged 2 commits into
deepspeedai:masterfrom
flutist:fix_shared_memory

flutist commented Jun 9, 2026

Uh oh!

flutist commented Jun 9, 2026

Uh oh!

flutist commented Jun 9, 2026

Uh oh!

tohtana left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

flutist commented Jun 9, 2026

PR Title

PR Description

Summary

Problem

Fix

Benchmark (NVIDIA L20, 1024×1024 float32)

Testing

Uh oh!

flutist commented Jun 9, 2026

Uh oh!

flutist commented Jun 9, 2026

Uh oh!

tohtana left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants