Lokasi ngalangkungan proxy:   [ UP ]  
[Ngawartoskeun bug]   [Panyetelan cookie]                
Skip to content

fix(transformer): use correct stride in Transpose_Kernel shared memory indexing to eliminate bank conflicts#8055

Merged
tohtana merged 2 commits into
deepspeedai:masterfrom
flutist:fix_shared_memory
Jun 10, 2026
Merged

fix(transformer): use correct stride in Transpose_Kernel shared memory indexing to eliminate bank conflicts#8055
tohtana merged 2 commits into
deepspeedai:masterfrom
flutist:fix_shared_memory

Conversation

@flutist

@flutist flutist commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

PR Title

fix(transformer): use correct stride in Transpose_Kernel shared memory indexing to eliminate bank conflicts

PR Description

Summary

Fix a shared memory bank conflict bug in Transpose_Kernel where the +1
padding declared in the shared memory array was not used in the indexing.

Problem

The shared memory is declared with padding to avoid bank conflicts:

__shared__ T data_block[rows_trans * (cols_trans + 1)];  // 16 × 17

But both write and read indices use cols_trans (16) as stride instead of
cols_trans + 1 (17), making the padding ineffective and causing bank
conflicts on column-wise reads during transpose.

Fix

Change indexing stride from cols_trans to (cols_trans + 1) in both
the write and read loops. 2 lines changed.

Benchmark (NVIDIA L20, 1024×1024 float32)

nsys profile (1101 kernel calls):

Transpose_Kernel_Original: avg 5,287 ns/call  (51.1%)
Transpose_Kernel_Fixed:    avg 5,061 ns/call  (48.9%)
Speedup: ~4.3%

CUDA Event timing (1000 iterations):

Original: 0.0054 ms/iter  (1553.9 GB/s)
Fixed:    0.0051 ms/iter  (1643.8 GB/s)
Speedup: 5.8%

The kernel is already near DRAM bandwidth peak on L20 (~80% utilization),
partially masking the bank conflict overhead. Larger gains are expected on
GPUs with lower memory bandwidth.

Testing

Correctness was verified with a standalone CUDA A/B test (see below).
Both original and fixed kernels produce identical results (PASS, 0 errors).

No pytest unit test is added because Transpose_Kernel is not exposed to
Python via pybind11 — it is called internally by other C++ transformer
functions in ds_transformer_cuda.cpp. Adding a Python-level test would
require introducing a new pybind binding solely for this kernel, which is
disproportionate for a 2-line bugfix.

Standalone CUDA test script
#include <cstdio>
#include <cuda_fp16.h>

#define rows_trans 16
#define cols_trans 16
#define THREADS 256

template <typename T>
__global__ void Transpose_Kernel_Fixed(const T* inp, T* out, int row_width, int col_width) {
    __shared__ T data_block[rows_trans * (cols_trans + 1)];
    int r = threadIdx.x / cols_trans;
    int c = threadIdx.x % cols_trans;
    int m = row_width / cols_trans;
    int i = blockIdx.x / m * rows_trans + r;
    int j = blockIdx.x % m * cols_trans + c;
    int row_stride = rows_trans / ((rows_trans * cols_trans + THREADS - 1) / THREADS);
    for (int k = 0; k < rows_trans; k += row_stride)
        data_block[(k + r) * (cols_trans + 1) + c] = inp[(i + k) * row_width + j];
    __syncthreads();
    i = blockIdx.x % m * rows_trans + r;
    j = blockIdx.x / m * cols_trans + c;
    for (int k = 0; k < rows_trans; k += row_stride)
        out[(i + k) * col_width + j] = data_block[c * (cols_trans + 1) + r + k];
}

template <typename T>
__global__ void Transpose_Kernel_Original(const T* inp, T* out, int row_width, int col_width) {
    __shared__ T data_block[rows_trans * (cols_trans + 1)];
    int r = threadIdx.x / cols_trans;
    int c = threadIdx.x % cols_trans;
    int m = row_width / cols_trans;
    int i = blockIdx.x / m * rows_trans + r;
    int j = blockIdx.x % m * cols_trans + c;
    int row_stride = rows_trans / ((rows_trans * cols_trans + THREADS - 1) / THREADS);
    for (int k = 0; k < rows_trans; k += row_stride)
        data_block[(k + r) * cols_trans + c] = inp[(i + k) * row_width + j];
    __syncthreads();
    i = blockIdx.x % m * rows_trans + r;
    j = blockIdx.x / m * cols_trans + c;
    for (int k = 0; k < rows_trans; k += row_stride)
        out[(i + k) * col_width + j] = data_block[c * cols_trans + r + k];
}

int main() {
    int rows = 1024, cols = 1024;
    size_t count = rows * cols;
    size_t sz = count * sizeof(float);
    float *h_in = (float*)malloc(sz);
    float *h_out = (float*)malloc(sz);
    for (size_t i = 0; i < count; i++) h_in[i] = (float)i;
    float *d_in, *d_out;
    cudaMalloc(&d_in, sz); cudaMalloc(&d_out, sz);
    cudaMemcpy(d_in, h_in, sz, cudaMemcpyHostToDevice);
    int threads = THREADS;
    int blocks = (rows * cols + threads - 1) / threads;
    int iters = 1000, warmup = 100;

    cudaMemset(d_out, 0, sz);
    Transpose_Kernel_Fixed<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaDeviceSynchronize();
    cudaMemcpy(h_out, d_out, sz, cudaMemcpyDeviceToHost);
    int err_f = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            if (h_out[j * rows + i] != h_in[i * cols + j]) err_f++;

    cudaMemset(d_out, 0, sz);
    Transpose_Kernel_Original<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaDeviceSynchronize();
    cudaMemcpy(h_out, d_out, sz, cudaMemcpyDeviceToHost);
    int err_o = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            if (h_out[j * rows + i] != h_in[i * cols + j]) err_o++;

    printf("=== Correctness ===\n");
    printf("  Original: %s (%d errors)\n", err_o == 0 ? "PASS" : "FAIL", err_o);
    printf("  Fixed:    %s (%d errors)\n", err_f == 0 ? "PASS" : "FAIL", err_f);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    for (int i = 0; i < warmup; i++)
        Transpose_Kernel_Original<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaDeviceSynchronize();
    cudaEventRecord(start);
    for (int i = 0; i < iters; i++)
        Transpose_Kernel_Original<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaEventRecord(stop); cudaDeviceSynchronize();
    float ms_o; cudaEventElapsedTime(&ms_o, start, stop);

    for (int i = 0; i < warmup; i++)
        Transpose_Kernel_Fixed<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaDeviceSynchronize();
    cudaEventRecord(start);
    for (int i = 0; i < iters; i++)
        Transpose_Kernel_Fixed<float><<<blocks, threads>>>(d_in, d_out, cols, rows);
    cudaEventRecord(stop); cudaDeviceSynchronize();
    float ms_f; cudaEventElapsedTime(&ms_f, start, stop);

    printf("\n=== Performance (1024x1024 float32, %d iters) ===\n", iters);
    printf("  Original: %.4f ms/iter  (%.1f GB/s)\n", ms_o/iters, 2.0*sz/1e9/(ms_o/iters/1000));
    printf("  Fixed:    %.4f ms/iter  (%.1f GB/s)\n", ms_f/iters, 2.0*sz/1e9/(ms_f/iters/1000));
    printf("  Speedup:  %.1f%%\n", (ms_o/ms_f - 1.0) * 100.0);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    free(h_in); free(h_out); cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Build and run:

nvcc -O2 -o /tmp/test_transpose /tmp/test_transpose_kernel.cu && /tmp/test_transpose

@flutist flutist requested a review from tjruwase as a code owner June 9, 2026 09:26
Signed-off-by: xjx <493337577@qq.com>
@flutist flutist force-pushed the fix_shared_memory branch from 8501668 to 91c41f5 Compare June 9, 2026 09:27
@flutist

flutist commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Sorry to bother you, but could you please help me merge this PR file? This solved the problem. If there's anything else I can do, I'll continue. I'm very happy to hear your response.

@tjruwase

1 similar comment
@flutist

flutist commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Sorry to bother you, but could you please help me merge this PR file? This solved the problem. If there's anything else I can do, I'll continue. I'm very happy to hear your response.

@tjruwase

@tohtana tohtana left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thank you for your contribution, @flutist!

@tohtana tohtana enabled auto-merge (squash) June 10, 2026 04:18
@tohtana tohtana merged commit 65e5f6f into deepspeedai:master Jun 10, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants