Hardware potential is truly realized when it is tightly integrated with the software layer. Modern AI chips rely on specialized software stacks that maximize efficiency for LLM inference and overall AI acceleration. Because each hardware provider develops proprietary drivers to unlock their chip’s specific architecture, understanding these foundational layers is essential. Below, we explore the software stacks optimized for deep learning that bridge the gap between AI hardware and high-level frameworks like PyTorch and TensorFlow.
TL;DR
- Hardware-Software Coupling: AI chips require specialized software stacks to optimize efficiency and accelerate deep learning workloads.
- Proprietary vs. Open Source: While NVIDIA’s CUDA remains the industry standard, AMD (ROCm) and Intel (oneAPI) are pushing for open-source, interoperable alternatives.
- The Rise of Hardware Agnosticism: Tools like OpenAI Triton and Modular’s MAX platform aim to let developers write code once and run it on any chip.

[Figure: LLM inference and its hierarchical workflow]
Popular AI Software Stacks
We provide a comprehensive overview of the AI software architectures deployed by prominent industry leaders such as NVIDIA, AMD, Intel, Amazon, and Google.
NVIDIA CUDA: The Industry Standard
NVIDIA’s Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model designed for general GPU processing. It enables developers to write software that leverages GPU power and acts as a critical interface between the hardware and the operating system.
The CUDA Toolkit provides a comprehensive suite for development, including the nvcc compiler, the cuda-gdb debugger, and performance profilers like Nsight Systems. While high-level frameworks (PyTorch/TensorFlow) primarily rely on Runtime APIs, building high-performance inference backends like vLLM or TensorRT requires the lower-level Driver APIs. For deep learning specifically, cuDNN sits atop CUDA, providing highly tuned optimizations for operations like convolutions and activations, specifically engineered to exploit Tensor Cores and efficient memory layouts.
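The layering described above can be sketched with a toy Python analogy (all class and method names here are hypothetical illustrations, not real CUDA or cuDNN bindings): a driver layer manages contexts and raw allocations, a runtime layer wraps it in simpler calls, and a cuDNN-like library builds tuned deep learning primitives on top.

```python
# Toy illustration of the CUDA software layering described above.
# All names are hypothetical analogies, not real CUDA/cuDNN APIs.

class DriverLayer:
    """Low-level layer: raw memory handles and explicit management
    (analogous to CUDA Driver API calls such as cuMemAlloc)."""
    def __init__(self):
        self.allocations = {}
        self.next_handle = 0

    def mem_alloc(self, nbytes):
        handle = self.next_handle
        self.allocations[handle] = bytearray(nbytes)
        self.next_handle += 1
        return handle

class RuntimeLayer:
    """Convenience layer most frameworks target (analogous to the
    CUDA Runtime API): hides low-level detail behind simpler calls."""
    def __init__(self):
        self._driver = DriverLayer()  # runtime is built on the driver layer

    def malloc(self, nbytes):
        return self._driver.mem_alloc(nbytes)

class DNNLibrary:
    """Tuned-op layer (analogous to cuDNN): deep learning primitives
    built on top of the runtime."""
    def __init__(self, runtime):
        self.runtime = runtime

    def relu(self, values):
        # Stand-in for a tuned kernel: elementwise max(0, x).
        return [max(0.0, v) for v in values]

runtime = RuntimeLayer()
dnn = DNNLibrary(runtime)
print(dnn.relu([-1.0, 2.0, -3.0, 4.0]))  # [0.0, 2.0, 0.0, 4.0]
```

The point of the analogy: frameworks like PyTorch sit at the `DNNLibrary`/`RuntimeLayer` level, while inference backends that need fine-grained control drop down to the driver level.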
AMD ROCm: The Open Alternative
ROCm (Radeon Open Compute) is AMD’s open-source software stack for AI chip programming, ranging from low-level kernels to high-level applications. Its core interface, HIP (Heterogeneous-compute Interface for Portability), serves as an open-source alternative to NVIDIA’s proprietary CUDA. To streamline migration, the HIPify tool can automatically convert CUDA code to HIP. The ecosystem includes the ROCm LLVM compiler infrastructure, with HIP-Clang acting as the counterpart to NVIDIA’s NVCC. For deep learning acceleration, ROCm offers MIOpen—similar to NVIDIA’s cuDNN—which supports frameworks like PyTorch and TensorFlow. However, cuDNN is generally considered more mature, offering highly optimized kernels for specialized architectures.
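The migration path HIPify enables can be conveyed with a toy sketch. This is illustrative only: real HIPify is built on clang tooling and handles far more than renaming; the mapping below shows a few well-known CUDA-to-HIP API pairs.

```python
# Toy sketch of HIPify-style source translation. Real HIPify uses
# clang-based tooling; this only illustrates the API-renaming idea.
import re

CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(cuda_source: str) -> str:
    """Rename known CUDA runtime calls to their HIP equivalents."""
    pattern = re.compile("|".join(re.escape(k) for k in CUDA_TO_HIP))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], cuda_source)

src = "cudaMalloc(&d_x, n); cudaMemcpy(d_x, h_x, n, cudaMemcpyHostToDevice);"
print(toy_hipify(src))
# hipMalloc(&d_x, n); hipMemcpy(d_x, h_x, n, hipMemcpyHostToDevice);
```

Because HIP deliberately mirrors the CUDA API surface one-to-one, this kind of mechanical translation covers a large share of a typical port.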
Intel oneAPI: Multi-Architecture Unity
The oneAPI initiative aims to establish an open, unified software platform that spans multiple architectures and vendors. Much like AMD’s ROCm, oneAPI is open-source and provides consistent functionality across CPUs, GPUs, and FPGAs. Its core programming model, SYCL, serves as a portable alternative to CUDA C++, enabling developers to run standard C++ across various accelerators. Based on Intel’s Data Parallel C++, SYCL also benefits from SYCLomatic, a migration tool designed to help developers port existing CUDA codebases to Intel hardware. Additionally, oneAPI includes oneDNN, a deep neural network library comparable to cuDNN or MIOpen, which optimizes operations like activations, normalizations, and tensor manipulations.
Amazon Neuron SDK: Specialized Stack
The Neuron SDK offers a comprehensive suite of tools, including a compiler, runtime, and various libraries for training and inference, alongside utilities for monitoring and profiling. Central to the SDK are the Neuron Kernel Interface (NKI) and its associated library (NKLib); NKI leverages an open-source, MLIR-based compiler to grant developers low-level control over memory allocation and chip scheduling. Much like NVIDIA’s ecosystem, AWS Neuron is specifically engineered for proprietary hardware—namely AWS Trainium and Inferentia—resulting in limited adoption outside of the AWS infrastructure.
Google OpenXLA: Vertically Integrated Stack
While CUDA offers a versatile, tool-rich platform for general-purpose computing and custom kernel design, the Google TPU stack provides a more vertically integrated ecosystem tailored for specific AI workloads. Google’s counterpart to CUDA is Open Accelerated Linear Algebra (OpenXLA), a compiler and runtime designed to optimize and execute ML graphs on TPU hardware. While AMD and Intel develop tools to port CUDA code to their respective chips, OpenXLA is broadening its reach beyond TPUs to support NVIDIA and AMD GPUs, fostering cross-platform performance portability.
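What a compiler like XLA does to an ML graph can be hinted at with a toy example of operator fusion, one of its core optimizations. This is a hypothetical, highly simplified sketch: a real compiler rewrites HLO graphs, but the payoff is the same, consecutive elementwise operations are fused into a single kernel so intermediates never touch memory.

```python
# Toy sketch of operator fusion, the kind of graph rewrite XLA performs.
# Hypothetical and highly simplified: real XLA operates on HLO graphs.

def multiply(xs, scale):
    return [x * scale for x in xs]

def add(xs, bias):
    return [x + bias for x in xs]

def unfused(xs):
    # Two passes over the data; the intermediate list is materialized.
    return add(multiply(xs, 2.0), 1.0)

def fused(xs):
    # One pass, no intermediate buffer: what fusing multiply+add buys.
    return [x * 2.0 + 1.0 for x in xs]

data = [1.0, 2.0, 3.0]
assert unfused(data) == fused(data)
print(fused(data))  # [3.0, 5.0, 7.0]
```

On real accelerators, avoiding the materialized intermediate reduces memory traffic, which is often the bottleneck for elementwise-heavy workloads.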
Others
While major industry leaders dominate the market, a diverse range of alternative AI software stacks is emerging. For instance, Huawei’s Compute Architecture for Neural Networks (CANN) provides specialized libraries and compilers for Ascend NPUs, functioning much like NVIDIA’s CUDA. Other notable entries include the ARM Compute Library, the Apache TVM compiler framework, and the Multi-Level IR (MLIR) compiler infrastructure.
Comparison of AI Software Stacks
| Feature | NVIDIA CUDA | AMD ROCm | Intel oneAPI | Google XLA |
| --- | --- | --- | --- | --- |
| Compiler | NVCC (Proprietary) | ROCmCC/HIP (Open) | DPC++ (Open) | XLA Compiler (Open) |
| DL Libraries | cuDNN / CUDA-X | MIOpen | oneDNN | PyTorch/XLA, JAX |
| Hardware | NVIDIA GPUs | AMD Instinct / Radeon | Intel CPU/GPU/FPGA | TPUs, GPUs, CPUs |
| Maturity | Industry Leader | Rapidly Growing | Open Standards Focus | Framework Integrated |
The Move Toward Hardware Agnosticism
The primary objective is to develop AI software at the compiler level, bypassing the limitations of proprietary APIs. By generating optimized kernels tailored to available hardware at runtime, this approach ensures seamless hardware portability. It empowers researchers and engineers to maintain a single codebase for LLM inference—regardless of the underlying framework (TensorFlow, PyTorch) or hardware stack (CUDA, ROCm)—without requiring any manual code adjustments.
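A minimal sketch of the dispatch idea behind such portability (all backend names and probing logic here are hypothetical illustrations): register one kernel per backend, probe what is available at runtime, and keep application code entirely backend-free.

```python
# Toy sketch of runtime backend dispatch for a single-codebase workflow.
# Backend names and the probing logic are hypothetical illustrations.

KERNELS = {}

def register(backend):
    def wrap(fn):
        KERNELS[backend] = fn
        return fn
    return wrap

@register("cuda")
def matmul_cuda(a, b):
    raise RuntimeError("no CUDA device in this sketch")

@register("cpu")
def matmul_cpu(a, b):
    # Reference implementation: plain triple loop.
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def available_backends():
    # Real stacks probe installed drivers (CUDA, ROCm, ...);
    # in this sketch only "cpu" exists.
    return ["cpu"]

def matmul(a, b):
    # Application code never names a backend; dispatch picks one.
    for backend in ["cuda", "rocm", "cpu"]:
        if backend in available_backends():
            return KERNELS[backend](a, b)
    raise RuntimeError("no backend available")

print(matmul([[1, 2]], [[3], [4]]))  # [[11]]
```

Compiler-level approaches go further than this table lookup: instead of shipping one prebuilt kernel per backend, they generate the kernel for the detected hardware at runtime.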
Modular AI MAX
The Modular MAX platform simplifies hardware management by supporting both AMD and NVIDIA AI chips. It is built with the Mojo programming language, which bridges the gap between Python’s ease of use and systems-level performance. As a unified CPU+GPU language, Mojo combines Pythonic syntax with the speed of C/C++ and the memory safety of Rust. Every kernel within MAX is written in Mojo, and the platform supports seamless porting of existing models from frameworks like PyTorch.
OpenAI Triton
Triton is a Python-based programming language and LLVM-powered compiler engineered for developing high-performance deep learning kernels. By providing a higher level of abstraction than NVIDIA CUDA, it simplifies kernel development through automatic block-level (tile) parallelization. Unlike the manual, low-level C++ control required by CUDA, Triton offers a portable solution that supports NVIDIA, AMD, and Intel hardware.
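Triton’s block-level model can be conveyed without a GPU by a pure-Python analogue. This is illustrative only: a real Triton kernel is written with `triton.language` primitives (`tl.program_id`, masked `tl.load`/`tl.store`) and JIT-compiled. Each “program instance” handles one tile of the data, with a mask guarding the ragged final tile.

```python
# Pure-Python analogue of Triton's tile-based programming model.
# Illustrative only: real Triton kernels use triton.language and run
# JIT-compiled on the GPU, with tiles executing in parallel.

BLOCK = 4  # tile size; Triton launches one program instance per tile

def add_kernel(x, y, out, n, pid):
    """One 'program instance': processes tile number `pid`."""
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    mask = [off < n for off in offsets]      # guard the ragged tail
    for off, ok in zip(offsets, mask):
        if ok:                               # masked load/store
            out[off] = x[off] + y[off]

def vector_add(x, y):
    n = len(x)
    out = [0.0] * n
    grid = -(-n // BLOCK)                    # ceil(n / BLOCK) tiles
    for pid in range(grid):                  # the GPU runs these in parallel
        add_kernel(x, y, out, n, pid)
    return out

print(vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# [11, 22, 33, 44, 55]
```

The abstraction Triton sells is visible even here: the developer reasons about one tile plus a mask, and the compiler handles the scheduling, vectorization, and memory coalescing that CUDA C++ would require by hand.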
Summary and Future Outlook
While NVIDIA’s CUDA remains a proprietary standard, a shift toward open-source ecosystems is emerging through the efforts of AMD and Intel. For instance, AMD’s ROCm platform promotes interoperability by allowing developers to build libraries compatible with both AMD and NVIDIA hardware. In contrast, CUDA-based software often requires extensive rewriting to run on competing chips. The industry’s future depends on modular software stacks that decouple AI applications from specific hardware, as seen with initiatives like Modular AI and PyTorch’s cross-GPU compilation. In Chapter 3, we will explore inference optimization techniques—such as KV caching, continuous batching, and multi-query attention—designed to minimize latency and maximize throughput.
References
- https://www.geeksforgeeks.org/deep-learning/cuda-deep-neural-network-cudnn/
- https://illuri-sandeep5454.medium.com/demystifying-cuda-cudnn-and-the-gpu-stack-for-machine-learning-engineers-5944a90749ed
- https://docs.nvidia.com/cuda/cuda-runtime-api/driver-vs-runtime-api.html
- https://www.thundercompute.com/blog/rocm-vs-cuda-gpu-computing
- https://newsletter.semianalysis.com/p/amd-vs-nvidia-inference-benchmark-who-wins-performance-cost-per-million-tokens
- https://lmsys.org/ [LMSYS]
- https://rocm.docs.amd.com/en/latest/what-is-rocm.html [ROCm]
- https://rocm.docs.amd.com/projects/MIOpen/en/latest/index.html [MIOpen]
- https://oneapi-spec.uxlfoundation.org/specifications/oneapi/latest/introduction [oneAPI]
- https://builtin.com/articles/nvidias-cuda-future-ai-infrastructure [Hardware Agnostic]
- https://www.amd.com/en/blogs/2025/rocm7-supercharging-ai-and-hpc-infrastructure.html [AMD benchmark]
- https://aws.amazon.com/ai/machine-learning/neuron/ [AWS Neuron]
- https://introl.com/blog/aws-trainium-inferentia-silicon-ecosystem-guide-2025 [Trainium and Inferentia]
- https://onnxruntime.ai/docs/execution-providers/community-maintained/CANN-ExecutionProvider.html [Huawei CANN]
- https://github.com/ARM-software/ComputeLibrary/ [ARM Compute library]
- https://github.com/modular/modular [Modular]
- https://github.com/triton-lang/triton [Triton]
- https://tvm.apache.org/ [Apache TVM]