library/gpu-programming

GPU Programming and Parallel Computing

Overview

GPU Programming and Parallel Computing is a specialized domain focused on leveraging the massive parallelism of Graphics Processing Units (GPUs) to solve computationally intensive problems. Modern GPUs contain thousands of cores designed for executing thousands of threads simultaneously, making them ideal for data-parallel workloads that would be prohibitively slow on traditional CPUs.

This specialization encompasses the entire ecosystem of GPU computing, from low-level hardware understanding to high-level programming abstractions, optimization techniques, and real-world application development. It bridges the gap between theoretical parallel computing concepts and practical implementation using industry-standard frameworks like CUDA, OpenCL, and compute shaders.

Key Roles and Responsibilities

GPU Software Engineer

  • Design and implement GPU-accelerated algorithms and applications
  • Write efficient CUDA, OpenCL, or compute shader code
  • Profile and optimize GPU kernel performance
  • Manage GPU memory hierarchies and data transfers
  • Integrate GPU computations with existing software systems

High-Performance Computing (HPC) Specialist

  • Architect large-scale parallel computing solutions
  • Optimize workload distribution across multiple GPUs and nodes
  • Implement efficient inter-GPU communication strategies
  • Tune applications for specific GPU architectures
  • Evaluate and benchmark GPU hardware for computational workloads

Graphics Programmer

  • Develop rendering pipelines and graphics systems
  • Implement compute shaders for visual effects and post-processing
  • Optimize real-time graphics performance
  • Create hybrid rendering-compute workflows
  • Design efficient GPU resource management systems

ML/AI Infrastructure Engineer

  • Build and optimize deep learning training infrastructure
  • Implement custom CUDA kernels for neural network operations
  • Optimize tensor operations and memory access patterns
  • Scale training across multiple GPUs and machines
  • Profile and improve model training performance

Goals and Objectives

Primary Goals

1. **Maximize Computational Throughput**: Achieve optimal utilization of GPU computational resources by designing algorithms that exploit data parallelism effectively
2. **Minimize Latency**: Reduce end-to-end execution time through efficient memory management, kernel optimization, and overlap of computation with data transfers
3. **Ensure Scalability**: Create solutions that scale efficiently from single GPU to multi-GPU and multi-node configurations
4. **Maintain Code Quality**: Write maintainable, portable, and well-documented GPU code that follows best practices

Learning Objectives

  • Understand GPU architecture and its implications for parallel algorithm design
  • Master CUDA and OpenCL programming models and APIs
  • Learn memory hierarchy optimization techniques
  • Develop proficiency in GPU debugging and profiling tools
  • Apply parallel design patterns to real-world problems

GPU Architecture Understanding

Hardware Fundamentals

Streaming Multiprocessors (SMs)

GPUs are organized into multiple Streaming Multiprocessors, each containing:

  • **CUDA Cores / Stream Processors**: Execute arithmetic operations in parallel
  • **Tensor Cores**: Specialized units for matrix multiply-accumulate operations (modern NVIDIA GPUs)
  • **RT Cores**: Ray tracing acceleration units (NVIDIA RTX series)
  • **Shared Memory**: Fast, low-latency memory shared among threads in a block
  • **L1 Cache**: Per-SM cache for reducing global memory access latency
  • **Warp Schedulers**: Hardware units that manage thread execution

Memory Hierarchy

Code
Registers (fastest, per-thread)
    |
Shared Memory / L1 Cache (per-SM, ~20-30 cycles latency)
    |
L2 Cache (shared across SMs, ~200 cycles latency)
    |
Global Memory / VRAM (highest capacity, ~400-800 cycles latency)
    |
System Memory / Host RAM (slowest, requires PCIe transfer)

Execution Model

  • **Warps/Wavefronts**: Groups of 32 (NVIDIA) or 64 (AMD GCN; RDNA also supports 32-wide waves) threads that execute in lockstep
  • **Thread Blocks**: Logical groupings of threads that share resources and can synchronize
  • **Grids**: Collections of thread blocks that execute a kernel
  • **Occupancy**: Ratio of active warps to maximum warps per SM

Architecture Generations

NVIDIA Architectures

  • **Volta/Turing**: Tensor cores, independent thread scheduling
  • **Ampere**: Third-generation tensor cores, improved sparsity support
  • **Hopper**: Fourth-generation tensor cores, transformer engine, DPX instructions
  • **Ada Lovelace**: Consumer architecture with advanced ray tracing and DLSS

AMD Architectures

  • **RDNA**: Gaming-focused architecture with improved power efficiency
  • **CDNA**: Compute-focused architecture for data centers (MI series)
  • **RDNA 3**: Chiplet design, AI accelerators

CUDA Programming Concepts

Kernel Development

Basic Kernel Structure

cuda
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
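
On the host side, launching this kernel might look like the sketch below (h_a, h_b, and h_c are assumed host arrays; error checking is omitted). The grid size is rounded up so every element is covered, which is why the kernel guards with if (idx < n):

cuda
// Hypothetical host-side launch for vectorAdd (error checking omitted)
int n = 1 << 20;
size_t size = n * sizeof(float);
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);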

Thread Indexing

  • **threadIdx**: Thread index within a block (x, y, z dimensions)
  • **blockIdx**: Block index within the grid
  • **blockDim**: Number of threads per block
  • **gridDim**: Number of blocks in the grid

Memory Management

Memory Types

cuda
// Global memory allocation
float* d_array;
cudaMalloc(&d_array, size);
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);

// Shared memory declaration
__shared__ float sharedData[256];

// Constant memory
__constant__ float constData[64];

// Texture memory (for spatial locality)
cudaTextureObject_t tex;

Unified Memory

cuda
float* data;
cudaMallocManaged(&data, size);
// Accessible from both host and device
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();
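
On devices that support concurrent managed access, managed pages can also be prefetched to the GPU before the launch to avoid on-demand page faults. A minimal sketch, assuming device 0 and the default stream:

cuda
// Optional: prefetch managed memory to device 0 before the launch
cudaMemPrefetchAsync(data, size, 0, 0);
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();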

Synchronization

Thread Synchronization

cuda
__syncthreads();  // Block-level barrier
__syncwarp();     // Warp-level synchronization

Stream-Based Concurrency

cuda
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Concurrent kernel execution
kernel1<<<grid, block, 0, stream1>>>(data1);
kernel2<<<grid, block, 0, stream2>>>(data2);

// Asynchronous memory transfers
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream1);
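
Note that cudaMemcpyAsync overlaps with computation only when the host buffer is pinned; with pageable memory the runtime falls back to a staged, effectively synchronous copy. A minimal sketch of using pinned host memory:

cuda
// Pinned (page-locked) host allocation enables truly asynchronous copies
float* h_data;
cudaMallocHost(&h_data, size);
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream1);
// ... launch work in stream1 ...
cudaFreeHost(h_data);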

Advanced CUDA Features

Dynamic Parallelism

cuda
__global__ void parentKernel() {
    // Launch child kernels from device code
    // (requires compute capability 3.5+ and compiling with -rdc=true)
    childKernel<<<childGrid, childBlock>>>(data);
}

Cooperative Groups

cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    cg::thread_block block = cg::this_thread_block();
    cg::grid_group grid = cg::this_grid();

    // Flexible synchronization at block and grid scope
    block.sync();
    // Grid-wide sync requires launch via cudaLaunchCooperativeKernel
    grid.sync();
}

OpenCL Fundamentals

Platform Model

Key Concepts

  • **Platform**: Implementation of OpenCL (e.g., NVIDIA, AMD, Intel)
  • **Device**: Computational unit (GPU, CPU, FPGA, etc.)
  • **Context**: Environment for managing devices, memory, and command queues
  • **Command Queue**: Sequence of commands for execution on a device

Kernel Programming

OpenCL Kernel Example

opencl
__kernel void vectorAdd(__global const float* a,
                        __global const float* b,
                        __global float* c,
                        const int n) {
    int gid = get_global_id(0);
    if (gid < n) {
        c[gid] = a[gid] + b[gid];
    }
}

Memory Model

Address Spaces

  • **__global**: Main device memory
  • **__local**: Shared memory within work-group
  • **__constant**: Read-only constant memory
  • **__private**: Per-work-item private memory

Host API

Typical Workflow

cpp
// Get platform and device
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Create context and command queue
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

// Create and build program
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);

// Create device buffers and copy the input data
cl_mem bufferA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), h_a, NULL);
cl_mem bufferB = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), h_b, NULL);
cl_mem bufferC = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                n * sizeof(float), NULL, NULL);

// Create kernel and set all four arguments
cl_kernel kernel = clCreateKernel(program, "vectorAdd", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufferC);
clSetKernelArg(kernel, 3, sizeof(int), &n);

// Execute kernel and read the result back to the host
size_t globalSize = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, bufferC, CL_TRUE, 0, n * sizeof(float), h_c, 0, NULL, NULL);

Parallel Computing Patterns

Data Parallel Patterns

Map

Apply a function independently to each element:

cuda
__global__ void mapKernel(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // transform() stands in for any user-defined __device__ function
        output[idx] = transform(input[idx]);
    }
}

Reduce

Combine elements using an associative operator:

cuda
__global__ void reduceSum(float* input, float* output, int n) {
    __shared__ float sdata[256];  // assumes blockDim.x <= 256 and a power of two
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (idx < n) ? input[idx] : 0;
    __syncthreads();

    // Tree-based reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}

Scan (Prefix Sum)

Compute running totals with parallel efficiency (a single-block sketch follows the list below):

  • Inclusive scan: Each output includes the current element
  • Exclusive scan: Each output excludes the current element
  • Applications: Stream compaction, sorting, histograms
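
As a single-block illustration, an inclusive scan in the Hillis-Steele style can be sketched as follows (assuming n fits in one block of at most 256 threads and a power-of-two block size; real workloads would use a work-efficient multi-block version or a library such as Thrust or CUB):

cuda
__global__ void inclusiveScanBlock(float* data, int n) {
    __shared__ float temp[256];
    int tid = threadIdx.x;
    temp[tid] = (tid < n) ? data[tid] : 0.0f;
    __syncthreads();

    // Hillis-Steele: each step adds the value 'offset' positions back
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        float val = (tid >= offset) ? temp[tid - offset] : 0.0f;
        __syncthreads();
        temp[tid] += val;
        __syncthreads();
    }

    if (tid < n) data[tid] = temp[tid];
}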

Scatter/Gather

  • **Scatter**: Write to arbitrary locations based on computed indices
  • **Gather**: Read from arbitrary locations based on computed indices

Algorithmic Patterns

Stencil Computations

Process elements based on neighboring values:

cuda
__global__ void stencil2D(float* input, float* output, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x > 0 && x < width-1 && y > 0 && y < height-1) {
        int idx = y * width + x;
        output[idx] = 0.25f * (input[idx-1] + input[idx+1] +
                               input[idx-width] + input[idx+width]);
    }
}

Histogram

Count occurrences in parallel with atomic operations:

cuda
__global__ void histogram(int* data, int* bins, int n, int numBins) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = data[idx] % numBins;
        atomicAdd(&bins[bin], 1);
    }
}
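
When many inputs map to the same bin, contention on global atomics becomes the bottleneck. A common refinement, sketched below, accumulates a per-block histogram in shared memory and merges it into the global bins at the end (launched with numBins * sizeof(int) bytes of dynamic shared memory):

cuda
__global__ void histogramShared(int* data, int* bins, int n, int numBins) {
    extern __shared__ int localBins[];

    // Zero the per-block histogram cooperatively
    for (int i = threadIdx.x; i < numBins; i += blockDim.x)
        localBins[i] = 0;
    __syncthreads();

    // Accumulate into fast shared-memory bins
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&localBins[data[idx] % numBins], 1);
    __syncthreads();

    // Merge the per-block histogram into the global result
    for (int i = threadIdx.x; i < numBins; i += blockDim.x)
        atomicAdd(&bins[i], localBins[i]);
}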

Sort

Parallel sorting algorithms:

  • **Bitonic Sort**: Network-based, O(n log^2 n) with high parallelism
  • **Radix Sort**: Digit-by-digit sorting, excellent for integers
  • **Merge Sort**: Divide-and-conquer with parallel merge phases

Performance Optimization Techniques

Memory Optimization

Coalesced Memory Access

Ensure threads in a warp access contiguous memory:

cuda
// Good: Coalesced access
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = data[idx];

// Bad: Strided access
float val = data[threadIdx.x * stride];

Shared Memory Bank Conflicts

Avoid multiple threads accessing the same bank:

cuda
// Pad shared memory: with 32 banks, the extra column shifts each row
// by one bank so that column-wise accesses no longer collide
__shared__ float sdata[32][33];

Memory Transfer Optimization

  • Use pinned (page-locked) host memory for faster transfers
  • Overlap computation with memory transfers using streams
  • Minimize host-device data movement

Execution Optimization

Occupancy Optimization

Balance resources to maximize active warps (see the sketch after this list):

  • Register usage per thread
  • Shared memory per block
  • Thread block size
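
The CUDA runtime can suggest a block size for a given kernel via the occupancy API; a minimal sketch, reusing the vectorAdd kernel and pointers from the earlier launch example:

cuda
int minGridSize = 0, blockSize = 0;
// Ask the runtime for a block size that maximizes occupancy for this kernel
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vectorAdd, 0, 0);
int gridSize = (n + blockSize - 1) / blockSize;
vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);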

Warp Divergence Minimization

Reduce branch divergence within warps:

cuda
// Bad: Divergent branches
if (threadIdx.x % 2 == 0) {
    // Path A
} else {
    // Path B
}

// Better: Process in warp-sized chunks
if (threadIdx.x < 16) {
    // All threads in half-warp take same path
}

Instruction-Level Optimization

  • Use fast math intrinsics (__fmaf_rn, __expf)
  • Leverage tensor cores for matrix operations
  • Unroll loops where beneficial

Profiling and Analysis

Tools

  • **NVIDIA Nsight Systems**: System-wide performance analysis
  • **NVIDIA Nsight Compute**: Kernel-level profiling
  • **AMD ROCm Profiler**: AMD GPU profiling
  • **Intel VTune**: Cross-platform performance analysis

Key Metrics

  • **SM Efficiency**: Percentage of time SMs are active
  • **Memory Bandwidth Utilization**: Actual vs. theoretical bandwidth
  • **Occupancy**: Active warps vs. maximum warps
  • **Instruction Throughput**: IPC and instruction mix

Compute Shader Programming

Graphics API Integration

DirectX Compute Shaders (HLSL)

hlsl
RWStructuredBuffer<float> output : register(u0);
StructuredBuffer<float> inputA : register(t0);
StructuredBuffer<float> inputB : register(t1);

[numthreads(256, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID) {
    output[id.x] = inputA[id.x] + inputB[id.x];
}
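
The [numthreads(256, 1, 1)] attribute fixes the thread-group size; the host covers n elements by dispatching (n + 255) / 256 groups.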

Vulkan Compute Shaders (GLSL)

glsl
#version 450
layout(local_size_x = 256) in;

// "output" is a reserved word in GLSL, so the result buffer gets another name
layout(binding = 0) buffer Result { float result[]; };
layout(binding = 1) buffer InputA { float inputA[]; };
layout(binding = 2) buffer InputB { float inputB[]; };

void main() {
    uint idx = gl_GlobalInvocationID.x;
    result[idx] = inputA[idx] + inputB[idx];
}

Metal Compute Shaders

metal
#include <metal_stdlib>
using namespace metal;

kernel void vectorAdd(device float* output [[buffer(0)]],
                      device const float* inputA [[buffer(1)]],
                      device const float* inputB [[buffer(2)]],
                      uint idx [[thread_position_in_grid]]) {
    output[idx] = inputA[idx] + inputB[idx];
}

Hybrid Rendering-Compute Workflows

  • Post-processing effects using compute shaders
  • GPU-driven rendering with compute-based culling
  • Particle systems and physics simulations
  • Texture generation and processing

Common Use Cases

Machine Learning and Deep Learning

Training Acceleration

  • Matrix multiplication for forward/backward passes
  • Convolution operations with cuDNN/MIOpen
  • Custom CUDA kernels for specialized layers
  • Multi-GPU training with NCCL/RCCL

Inference Optimization

  • Quantization and precision reduction
  • Kernel fusion for reduced memory bandwidth
  • Batching strategies for throughput
  • TensorRT and ONNX Runtime optimization

Scientific Computing

Computational Physics

  • Molecular dynamics simulations
  • Fluid dynamics (CFD) using lattice methods
  • N-body gravitational simulations
  • Finite element analysis

Linear Algebra

  • Dense matrix operations (cuBLAS, rocBLAS)
  • Sparse matrix computations (cuSPARSE)
  • Eigenvalue and SVD decompositions
  • Large-scale linear system solvers

Graphics and Visualization

Real-Time Rendering

  • Ray tracing acceleration
  • Global illumination techniques
  • Screen-space effects
  • Procedural generation

Image and Video Processing

  • Convolution filters and transforms
  • Video encoding/decoding acceleration
  • Computer vision algorithms
  • Real-time image enhancement

Cryptography and Blockchain

Mining and Hashing

  • Parallel hash computation
  • Memory-hard algorithm optimization
  • Proof-of-work calculations

Security Applications

  • Password cracking and auditing
  • Cryptographic key generation
  • Encryption/decryption acceleration

Financial Computing

Quantitative Finance

  • Monte Carlo simulations
  • Option pricing models
  • Risk analysis calculations
  • High-frequency trading algorithms

Best Practices

Code Organization

1. Separate host and device code logically
2. Use wrapper classes for resource management (RAII)
3. Implement error-checking macros for all API calls (a sketch follows this list)
4. Document kernel assumptions and constraints
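
A common shape for such an error-checking macro is sketched below (projects differ in how they report and recover; this version simply aborts):

cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_array, size));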

Performance Guidelines

1. Profile before optimizing
2. Focus on memory bottlenecks first
3. Maximize arithmetic intensity
4. Design for the target architecture
5. Test across different hardware generations

Portability Considerations

1. Abstract hardware-specific code
2. Use portable libraries where possible
3. Implement fallback CPU paths
4. Test on multiple platforms and vendors

Debugging Strategies

1. Use compute sanitizers (e.g., NVIDIA's compute-sanitizer, the successor to cuda-memcheck)
2. Validate results against a CPU reference implementation
3. Start with single-thread correctness
4. Use printf debugging sparingly in kernels

Conclusion

GPU Programming and Parallel Computing represents a critical skill set in modern software development. As computational demands continue to grow across domains from AI to scientific simulation, the ability to effectively harness GPU parallelism becomes increasingly valuable. This specialization provides the foundation for designing, implementing, and optimizing GPU-accelerated applications that push the boundaries of computational performance.
