Agentic AI Atlas

II.

Page reference

page:library-gpu-programming

Reading · 11 min

GPU Programming and Parallel Computing (Library) reference

GPU Programming and Parallel Computing is a specialized domain focused on leveraging the massive parallelism of Graphics Processing Units (GPUs) to solve computationally intensive problems. Modern GPUs contain thousands of cores designed for executing thousands of threads simultaneously, making them ideal for data-parallel workloads that would be prohibitively slow on traditional CPUs.

Pagewiki/library/gpu-programming.mdOutgoing · 1Incoming · 0

GPU Programming and Parallel Computing

Overview

This specialization encompasses the entire ecosystem of GPU computing, from low-level hardware understanding to high-level programming abstractions, optimization techniques, and real-world application development. It bridges the gap between theoretical parallel computing concepts and practical implementation using industry-standard frameworks like CUDA, OpenCL, and compute shaders.

Key Roles and Responsibilities

GPU Software Engineer

Design and implement GPU-accelerated algorithms and applications
Write efficient CUDA, OpenCL, or compute shader code
Profile and optimize GPU kernel performance
Manage GPU memory hierarchies and data transfers
Integrate GPU computations with existing software systems

High-Performance Computing (HPC) Specialist

Architect large-scale parallel computing solutions
Optimize workload distribution across multiple GPUs and nodes
Implement efficient inter-GPU communication strategies
Tune applications for specific GPU architectures
Evaluate and benchmark GPU hardware for computational workloads

Graphics Programmer

Develop rendering pipelines and graphics systems
Implement compute shaders for visual effects and post-processing
Optimize real-time graphics performance
Create hybrid rendering-compute workflows
Design efficient GPU resource management systems

ML/AI Infrastructure Engineer

Build and optimize deep learning training infrastructure
Implement custom CUDA kernels for neural network operations
Optimize tensor operations and memory access patterns
Scale training across multiple GPUs and machines
Profile and improve model training performance

Goals and Objectives

Primary Goals

1. **Maximize Computational Throughput**: Achieve optimal utilization of GPU computational resources by designing algorithms that exploit data parallelism effectively 2. **Minimize Latency**: Reduce end-to-end execution time through efficient memory management, kernel optimization, and overlap of computation with data transfers 3. **Ensure Scalability**: Create solutions that scale efficiently from single GPU to multi-GPU and multi-node configurations 4. **Maintain Code Quality**: Write maintainable, portable, and well-documented GPU code that follows best practices

Learning Objectives

Understand GPU architecture and its implications for parallel algorithm design
Master CUDA and OpenCL programming models and APIs
Learn memory hierarchy optimization techniques
Develop proficiency in GPU debugging and profiling tools
Apply parallel design patterns to real-world problems

GPU Architecture Understanding

Hardware Fundamentals

Streaming Multiprocessors (SMs)

GPUs are organized into multiple Streaming Multiprocessors, each containing:

**CUDA Cores / Stream Processors**: Execute arithmetic operations in parallel
**Tensor Cores**: Specialized units for matrix multiply-accumulate operations (modern NVIDIA GPUs)
**RT Cores**: Ray tracing acceleration units (NVIDIA RTX series)
**Shared Memory**: Fast, low-latency memory shared among threads in a block
**L1 Cache**: Per-SM cache for reducing global memory access latency
**Warp Schedulers**: Hardware units that manage thread execution

Memory Hierarchy

Code

Registers (fastest, per-thread)
    |
Shared Memory / L1 Cache (per-SM, ~100 cycles latency)
    |
L2 Cache (shared across SMs, ~200 cycles latency)
    |
Global Memory / VRAM (highest capacity, ~400-800 cycles latency)
    |
System Memory / Host RAM (slowest, requires PCIe transfer)

Execution Model

**Warps/Wavefronts**: Groups of 32 (NVIDIA) or 64 (AMD) threads that execute in lockstep
**Thread Blocks**: Logical groupings of threads that share resources and can synchronize
**Grids**: Collections of thread blocks that execute a kernel
**Occupancy**: Ratio of active warps to maximum warps per SM

Architecture Generations

NVIDIA Architectures

**Volta/Turing**: Tensor cores, independent thread scheduling
**Ampere**: Third-generation tensor cores, improved sparsity support
**Hopper**: Fourth-generation tensor cores, transformer engine, DPX instructions
**Ada Lovelace**: Consumer architecture with advanced ray tracing and DLSS

AMD Architectures

**RDNA**: Gaming-focused architecture with improved power efficiency
**CDNA**: Compute-focused architecture for data centers (MI series)
**RDNA 3**: Chiplet design, AI accelerators

CUDA Programming Concepts

Kernel Development

Basic Kernel Structure

cuda

__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

Thread Indexing

**threadIdx**: Thread index within a block (x, y, z dimensions)
**blockIdx**: Block index within the grid
**blockDim**: Number of threads per block
**gridDim**: Number of blocks in the grid

Memory Management

Memory Types

cuda

// Global memory allocation
float* d_array;
cudaMalloc(&d_array, size);
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);

// Shared memory declaration
__shared__ float sharedData[256];

// Constant memory
__constant__ float constData[64];

// Texture memory (for spatial locality)
cudaTextureObject_t tex;

Unified Memory

cuda

float* data;
cudaMallocManaged(&data, size);
// Accessible from both host and device
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();

Synchronization

Thread Synchronization

cuda

__syncthreads();  // Block-level barrier
__syncwarp();     // Warp-level synchronization

Stream-Based Concurrency

cuda

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Concurrent kernel execution
kernel1<<<grid, block, 0, stream1>>>(data1);
kernel2<<<grid, block, 0, stream2>>>(data2);

// Asynchronous memory transfers
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream1);

Advanced CUDA Features

Dynamic Parallelism

cuda

__global__ void parentKernel() {
    // Launch child kernels from device code
    childKernel<<<childGrid, childBlock>>>(data);
}

Cooperative Groups

cuda

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    cg::thread_block block = cg::this_thread_block();
    cg::grid_group grid = cg::this_grid();

    // Flexible synchronization
    block.sync();
    grid.sync();
}

OpenCL Fundamentals

Platform Model

Key Concepts

**Platform**: Implementation of OpenCL (e.g., NVIDIA, AMD, Intel)
**Device**: Computational unit (GPU, CPU, FPGA, etc.)
**Context**: Environment for managing devices, memory, and command queues
**Command Queue**: Sequence of commands for execution on a device

Kernel Programming

OpenCL Kernel Example

opencl

__kernel void vectorAdd(__global const float* a,
                        __global const float* b,
                        __global float* c,
                        const int n) {
    int gid = get_global_id(0);
    if (gid < n) {
        c[gid] = a[gid] + b[gid];
    }
}

Memory Model

Address Spaces

**__global**: Main device memory
**__local**: Shared memory within work-group
**__constant**: Read-only constant memory
**__private**: Per-work-item private memory

Host API

Typical Workflow

cpp

// Get platform and device
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Create context and command queue
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

// Create and build program
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);

// Create kernel and set arguments
cl_kernel kernel = clCreateKernel(program, "vectorAdd", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferA);

// Execute kernel
size_t globalSize = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

Parallel Computing Patterns

Data Parallel Patterns

Map

Apply a function independently to each element:

cuda

__global__ void mapKernel(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = transform(input[idx]);
    }
}

Reduce

Combine elements using an associative operator:

cuda

__global__ void reduceSum(float* input, float* output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (idx < n) ? input[idx] : 0;
    __syncthreads();

    // Tree-based reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}

Scan (Prefix Sum)

Compute running totals with parallel efficiency:

Inclusive scan: Each output includes the current element
Exclusive scan: Each output excludes the current element
Applications: Stream compaction, sorting, histograms

Scatter/Gather

**Scatter**: Write to arbitrary locations based on computed indices
**Gather**: Read from arbitrary locations based on computed indices

Algorithmic Patterns

Stencil Computations

Process elements based on neighboring values:

cuda

__global__ void stencil2D(float* input, float* output, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x > 0 && x < width-1 && y > 0 && y < height-1) {
        int idx = y * width + x;
        output[idx] = 0.25f * (input[idx-1] + input[idx+1] +
                               input[idx-width] + input[idx+width]);
    }
}

Histogram

Count occurrences in parallel with atomic operations:

cuda

__global__ void histogram(int* data, int* bins, int n, int numBins) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = data[idx] % numBins;
        atomicAdd(&bins[bin], 1);
    }
}

Sort

Parallel sorting algorithms:

**Bitonic Sort**: Network-based, O(n log^2 n) with high parallelism
**Radix Sort**: Digit-by-digit sorting, excellent for integers
**Merge Sort**: Divide-and-conquer with parallel merge phases

Performance Optimization Techniques

Memory Optimization

Coalesced Memory Access

Ensure threads in a warp access contiguous memory:

cuda

// Good: Coalesced access
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = data[idx];

// Bad: Strided access
float val = data[threadIdx.x * stride];

Shared Memory Bank Conflicts

Avoid multiple threads accessing the same bank:

cuda

// Pad shared memory to avoid conflicts
__shared__ float sdata[32][33];  // Extra column padding

Memory Transfer Optimization

Use pinned (page-locked) host memory for faster transfers
Overlap computation with memory transfers using streams
Minimize host-device data movement

Execution Optimization

Occupancy Optimization

Balance resources to maximize active warps:

Register usage per thread
Shared memory per block
Thread block size

Warp Divergence Minimization

Reduce branch divergence within warps:

cuda

// Bad: Divergent branches
if (threadIdx.x % 2 == 0) {
    // Path A
} else {
    // Path B
}

// Better: Process in warp-sized chunks
if (threadIdx.x < 16) {
    // All threads in half-warp take same path
}

Instruction-Level Optimization

Use fast math intrinsics (__fmaf_rn, __expf)
Leverage tensor cores for matrix operations
Unroll loops where beneficial

Profiling and Analysis

Tools

**NVIDIA Nsight Systems**: System-wide performance analysis
**NVIDIA Nsight Compute**: Kernel-level profiling
**AMD ROCm Profiler**: AMD GPU profiling
**Intel VTune**: Cross-platform performance analysis

Key Metrics

**SM Efficiency**: Percentage of time SMs are active
**Memory Bandwidth Utilization**: Actual vs. theoretical bandwidth
**Occupancy**: Active warps vs. maximum warps
**Instruction Throughput**: IPC and instruction mix

Compute Shader Programming

Graphics API Integration

DirectX Compute Shaders (HLSL)

hlsl

RWStructuredBuffer<float> output : register(u0);
StructuredBuffer<float> inputA : register(t0);
StructuredBuffer<float> inputB : register(t1);

[numthreads(256, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID) {
    output[id.x] = inputA[id.x] + inputB[id.x];
}

Vulkan Compute Shaders (GLSL)

glsl

#version 450
layout(local_size_x = 256) in;

layout(binding = 0) buffer OutputBuffer { float output[]; };
layout(binding = 1) buffer InputA { float inputA[]; };
layout(binding = 2) buffer InputB { float inputB[]; };

void main() {
    uint idx = gl_GlobalInvocationID.x;
    output[idx] = inputA[idx] + inputB[idx];
}

Metal Compute Shaders

metal

kernel void vectorAdd(device float* output [[buffer(0)]],
                      device const float* inputA [[buffer(1)]],
                      device const float* inputB [[buffer(2)]],
                      uint idx [[thread_position_in_grid]]) {
    output[idx] = inputA[idx] + inputB[idx];
}

Hybrid Rendering-Compute Workflows

Post-processing effects using compute shaders
GPU-driven rendering with compute-based culling
Particle systems and physics simulations
Texture generation and processing

Common Use Cases

Machine Learning and Deep Learning

Training Acceleration

Matrix multiplication for forward/backward passes
Convolution operations with cuDNN/MIOpen
Custom CUDA kernels for specialized layers
Multi-GPU training with NCCL/RCCL

Inference Optimization

Quantization and precision reduction
Kernel fusion for reduced memory bandwidth
Batching strategies for throughput
TensorRT and ONNX Runtime optimization

Scientific Computing

Computational Physics

Molecular dynamics simulations
Fluid dynamics (CFD) using lattice methods
N-body gravitational simulations
Finite element analysis

Linear Algebra

Dense matrix operations (cuBLAS, rocBLAS)
Sparse matrix computations (cuSPARSE)
Eigenvalue and SVD decompositions
Large-scale linear system solvers

Graphics and Visualization

Real-Time Rendering

Ray tracing acceleration
Global illumination techniques
Screen-space effects
Procedural generation

Image and Video Processing

Convolution filters and transforms
Video encoding/decoding acceleration
Computer vision algorithms
Real-time image enhancement

Cryptography and Blockchain

Mining and Hashing

Parallel hash computation
Memory-hard algorithm optimization
Proof-of-work calculations

Security Applications

Password cracking and auditing
Cryptographic key generation
Encryption/decryption acceleration

Financial Computing

Quantitative Finance

Monte Carlo simulations
Option pricing models
Risk analysis calculations
High-frequency trading algorithms

Best Practices

Code Organization

1. Separate host and device code logically 2. Use wrapper classes for resource management (RAII) 3. Implement error checking macros for all API calls 4. Document kernel assumptions and constraints

Performance Guidelines

1. Profile before optimizing 2. Focus on memory bottlenecks first 3. Maximize arithmetic intensity 4. Design for the target architecture 5. Test across different hardware generations

Portability Considerations

1. Abstract hardware-specific code 2. Use portable libraries where possible 3. Implement fallback CPU paths 4. Test on multiple platforms and vendors

Debugging Strategies

1. Use compute sanitizers (cuda-memcheck, rocm-smi) 2. Implement validation against CPU reference 3. Start with single-thread correctness 4. Use printf debugging sparingly in kernels

Conclusion

GPU Programming and Parallel Computing represents a critical skill set in modern software development. As computational demands continue to grow across domains from AI to scientific simulation, the ability to effectively harness GPU parallelism becomes increasingly valuable. This specialization provides the foundation for designing, implementing, and optimizing GPU-accelerated applications that push the boundaries of computational performance.

GPU Programming and Parallel Computing (Library) reference

Pagewiki/library/gpu-programming.mdOutgoing · 1Incoming · 0

GPU Programming and Parallel Computing

Overview

Key Roles and Responsibilities

GPU Software Engineer

Design and implement GPU-accelerated algorithms and applications
Write efficient CUDA, OpenCL, or compute shader code
Profile and optimize GPU kernel performance
Manage GPU memory hierarchies and data transfers
Integrate GPU computations with existing software systems

High-Performance Computing (HPC) Specialist

Architect large-scale parallel computing solutions
Optimize workload distribution across multiple GPUs and nodes
Implement efficient inter-GPU communication strategies
Tune applications for specific GPU architectures
Evaluate and benchmark GPU hardware for computational workloads

Graphics Programmer

Develop rendering pipelines and graphics systems
Implement compute shaders for visual effects and post-processing
Optimize real-time graphics performance
Create hybrid rendering-compute workflows
Design efficient GPU resource management systems

ML/AI Infrastructure Engineer

Build and optimize deep learning training infrastructure
Implement custom CUDA kernels for neural network operations
Optimize tensor operations and memory access patterns
Scale training across multiple GPUs and machines
Profile and improve model training performance

Goals and Objectives

Primary Goals

Learning Objectives

Understand GPU architecture and its implications for parallel algorithm design
Master CUDA and OpenCL programming models and APIs
Learn memory hierarchy optimization techniques
Develop proficiency in GPU debugging and profiling tools
Apply parallel design patterns to real-world problems

GPU Architecture Understanding

Hardware Fundamentals

Streaming Multiprocessors (SMs)

GPUs are organized into multiple Streaming Multiprocessors, each containing:

**CUDA Cores / Stream Processors**: Execute arithmetic operations in parallel
**Tensor Cores**: Specialized units for matrix multiply-accumulate operations (modern NVIDIA GPUs)
**RT Cores**: Ray tracing acceleration units (NVIDIA RTX series)
**Shared Memory**: Fast, low-latency memory shared among threads in a block
**L1 Cache**: Per-SM cache for reducing global memory access latency
**Warp Schedulers**: Hardware units that manage thread execution

Memory Hierarchy

Code

Registers (fastest, per-thread)
    |
Shared Memory / L1 Cache (per-SM, ~100 cycles latency)
    |
L2 Cache (shared across SMs, ~200 cycles latency)
    |
Global Memory / VRAM (highest capacity, ~400-800 cycles latency)
    |
System Memory / Host RAM (slowest, requires PCIe transfer)

Execution Model

**Warps/Wavefronts**: Groups of 32 (NVIDIA) or 64 (AMD) threads that execute in lockstep
**Thread Blocks**: Logical groupings of threads that share resources and can synchronize
**Grids**: Collections of thread blocks that execute a kernel
**Occupancy**: Ratio of active warps to maximum warps per SM

Architecture Generations

NVIDIA Architectures

**Volta/Turing**: Tensor cores, independent thread scheduling
**Ampere**: Third-generation tensor cores, improved sparsity support
**Hopper**: Fourth-generation tensor cores, transformer engine, DPX instructions
**Ada Lovelace**: Consumer architecture with advanced ray tracing and DLSS

AMD Architectures

**RDNA**: Gaming-focused architecture with improved power efficiency
**CDNA**: Compute-focused architecture for data centers (MI series)
**RDNA 3**: Chiplet design, AI accelerators

CUDA Programming Concepts

Kernel Development

Basic Kernel Structure

cuda

__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

Thread Indexing

**threadIdx**: Thread index within a block (x, y, z dimensions)
**blockIdx**: Block index within the grid
**blockDim**: Number of threads per block
**gridDim**: Number of blocks in the grid

Memory Management

Memory Types

cuda

// Global memory allocation
float* d_array;
cudaMalloc(&d_array, size);
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);

// Shared memory declaration
__shared__ float sharedData[256];

// Constant memory
__constant__ float constData[64];

// Texture memory (for spatial locality)
cudaTextureObject_t tex;

Unified Memory

cuda

float* data;
cudaMallocManaged(&data, size);
// Accessible from both host and device
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();

Synchronization

Thread Synchronization

cuda

__syncthreads();  // Block-level barrier
__syncwarp();     // Warp-level synchronization

Stream-Based Concurrency

cuda

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Concurrent kernel execution
kernel1<<<grid, block, 0, stream1>>>(data1);
kernel2<<<grid, block, 0, stream2>>>(data2);

// Asynchronous memory transfers
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream1);

Advanced CUDA Features

Dynamic Parallelism

cuda

__global__ void parentKernel() {
    // Launch child kernels from device code
    childKernel<<<childGrid, childBlock>>>(data);
}

Cooperative Groups

cuda

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    cg::thread_block block = cg::this_thread_block();
    cg::grid_group grid = cg::this_grid();

    // Flexible synchronization
    block.sync();
    grid.sync();
}

OpenCL Fundamentals

Platform Model

Key Concepts

**Platform**: Implementation of OpenCL (e.g., NVIDIA, AMD, Intel)
**Device**: Computational unit (GPU, CPU, FPGA, etc.)
**Context**: Environment for managing devices, memory, and command queues
**Command Queue**: Sequence of commands for execution on a device

Kernel Programming

OpenCL Kernel Example

opencl

__kernel void vectorAdd(__global const float* a,
                        __global const float* b,
                        __global float* c,
                        const int n) {
    int gid = get_global_id(0);
    if (gid < n) {
        c[gid] = a[gid] + b[gid];
    }
}

Memory Model

Address Spaces

**__global**: Main device memory
**__local**: Shared memory within work-group
**__constant**: Read-only constant memory
**__private**: Per-work-item private memory

Host API

Typical Workflow

cpp

// Get platform and device
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Create context and command queue
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

// Create and build program
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);

// Create kernel and set arguments
cl_kernel kernel = clCreateKernel(program, "vectorAdd", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferA);

// Execute kernel
size_t globalSize = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

Parallel Computing Patterns

Data Parallel Patterns

Map

Apply a function independently to each element:

cuda

__global__ void mapKernel(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = transform(input[idx]);
    }
}

Reduce

Combine elements using an associative operator:

cuda

__global__ void reduceSum(float* input, float* output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (idx < n) ? input[idx] : 0;
    __syncthreads();

    // Tree-based reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}

Scan (Prefix Sum)

Compute running totals with parallel efficiency:

Inclusive scan: Each output includes the current element
Exclusive scan: Each output excludes the current element
Applications: Stream compaction, sorting, histograms

Scatter/Gather

**Scatter**: Write to arbitrary locations based on computed indices
**Gather**: Read from arbitrary locations based on computed indices

Algorithmic Patterns

Stencil Computations

Process elements based on neighboring values:

cuda

__global__ void stencil2D(float* input, float* output, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x > 0 && x < width-1 && y > 0 && y < height-1) {
        int idx = y * width + x;
        output[idx] = 0.25f * (input[idx-1] + input[idx+1] +
                               input[idx-width] + input[idx+width]);
    }
}

Histogram

Count occurrences in parallel with atomic operations:

cuda

__global__ void histogram(int* data, int* bins, int n, int numBins) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = data[idx] % numBins;
        atomicAdd(&bins[bin], 1);
    }
}

Sort

Parallel sorting algorithms:

**Bitonic Sort**: Network-based, O(n log^2 n) with high parallelism
**Radix Sort**: Digit-by-digit sorting, excellent for integers
**Merge Sort**: Divide-and-conquer with parallel merge phases

Performance Optimization Techniques

Memory Optimization

Coalesced Memory Access

Ensure threads in a warp access contiguous memory:

cuda

// Good: Coalesced access
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = data[idx];

// Bad: Strided access
float val = data[threadIdx.x * stride];

Shared Memory Bank Conflicts

Avoid multiple threads accessing the same bank:

cuda

// Pad shared memory to avoid conflicts
__shared__ float sdata[32][33];  // Extra column padding

Memory Transfer Optimization

Use pinned (page-locked) host memory for faster transfers
Overlap computation with memory transfers using streams
Minimize host-device data movement

Execution Optimization

Occupancy Optimization

Balance resources to maximize active warps:

Register usage per thread
Shared memory per block
Thread block size

Warp Divergence Minimization

Reduce branch divergence within warps:

cuda

// Bad: Divergent branches
if (threadIdx.x % 2 == 0) {
    // Path A
} else {
    // Path B
}

// Better: Process in warp-sized chunks
if (threadIdx.x < 16) {
    // All threads in half-warp take same path
}

Instruction-Level Optimization

Use fast math intrinsics (__fmaf_rn, __expf)
Leverage tensor cores for matrix operations
Unroll loops where beneficial

Profiling and Analysis

Tools

**NVIDIA Nsight Systems**: System-wide performance analysis
**NVIDIA Nsight Compute**: Kernel-level profiling
**AMD ROCm Profiler**: AMD GPU profiling
**Intel VTune**: Cross-platform performance analysis

Key Metrics

**SM Efficiency**: Percentage of time SMs are active
**Memory Bandwidth Utilization**: Actual vs. theoretical bandwidth
**Occupancy**: Active warps vs. maximum warps
**Instruction Throughput**: IPC and instruction mix

Compute Shader Programming

Graphics API Integration

DirectX Compute Shaders (HLSL)

hlsl

RWStructuredBuffer<float> output : register(u0);
StructuredBuffer<float> inputA : register(t0);
StructuredBuffer<float> inputB : register(t1);

[numthreads(256, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID) {
    output[id.x] = inputA[id.x] + inputB[id.x];
}

Vulkan Compute Shaders (GLSL)

glsl

#version 450
layout(local_size_x = 256) in;

layout(binding = 0) buffer OutputBuffer { float output[]; };
layout(binding = 1) buffer InputA { float inputA[]; };
layout(binding = 2) buffer InputB { float inputB[]; };

void main() {
    uint idx = gl_GlobalInvocationID.x;
    output[idx] = inputA[idx] + inputB[idx];
}

Metal Compute Shaders

metal

kernel void vectorAdd(device float* output [[buffer(0)]],
                      device const float* inputA [[buffer(1)]],
                      device const float* inputB [[buffer(2)]],
                      uint idx [[thread_position_in_grid]]) {
    output[idx] = inputA[idx] + inputB[idx];
}

Hybrid Rendering-Compute Workflows

Post-processing effects using compute shaders
GPU-driven rendering with compute-based culling
Particle systems and physics simulations
Texture generation and processing

Common Use Cases

Machine Learning and Deep Learning

Training Acceleration

Matrix multiplication for forward/backward passes
Convolution operations with cuDNN/MIOpen
Custom CUDA kernels for specialized layers
Multi-GPU training with NCCL/RCCL

Inference Optimization

Quantization and precision reduction
Kernel fusion for reduced memory bandwidth
Batching strategies for throughput
TensorRT and ONNX Runtime optimization

Scientific Computing

Computational Physics

Molecular dynamics simulations
Fluid dynamics (CFD) using lattice methods
N-body gravitational simulations
Finite element analysis

Linear Algebra

Dense matrix operations (cuBLAS, rocBLAS)
Sparse matrix computations (cuSPARSE)
Eigenvalue and SVD decompositions
Large-scale linear system solvers

Graphics and Visualization

Real-Time Rendering

Ray tracing acceleration
Global illumination techniques
Screen-space effects
Procedural generation

Image and Video Processing

Convolution filters and transforms
Video encoding/decoding acceleration
Computer vision algorithms
Real-time image enhancement

Cryptography and Blockchain

Mining and Hashing

Parallel hash computation
Memory-hard algorithm optimization
Proof-of-work calculations

Security Applications

Password cracking and auditing
Cryptographic key generation
Encryption/decryption acceleration

Financial Computing

Quantitative Finance

Monte Carlo simulations
Option pricing models
Risk analysis calculations
High-frequency trading algorithms

Best Practices

Code Organization

1. Separate host and device code logically 2. Use wrapper classes for resource management (RAII) 3. Implement error checking macros for all API calls 4. Document kernel assumptions and constraints

Performance Guidelines

1. Profile before optimizing 2. Focus on memory bottlenecks first 3. Maximize arithmetic intensity 4. Design for the target architecture 5. Test across different hardware generations

Portability Considerations

1. Abstract hardware-specific code 2. Use portable libraries where possible 3. Implement fallback CPU paths 4. Test on multiple platforms and vendors

Debugging Strategies

1. Use compute sanitizers (cuda-memcheck, rocm-smi) 2. Implement validation against CPU reference 3. Start with single-thread correctness 4. Use printf debugging sparingly in kernels