LLM Inference – Chapter-1 – Hardware

Scaling Large Language Model (LLM) inference is essential for serving a large and growing user base. While only a small number of organizations focus on delivering inference for frontier models, many more concentrate on fine-tuning smaller or tiny open-source LLMs for domain- or task-specific applications. These teams still face the challenge of scaling inference to meet customer demand and support their AI workloads.

LLM inference refers to generating outputs from a trained LLM when it is given new input data. The process remains computationally intensive, so it must be optimized for efficiency to handle high data volumes effectively.
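To make "computationally intensive" concrete, here is a back-of-envelope sketch of the per-token cost of a dense decoder-only model, using the common rule of thumb of roughly 2 FLOPs per parameter per generated token. The 70B-parameter, FP16 figures are illustrative assumptions, not numbers from this post.

```python
# Back-of-envelope cost of generating one token with a dense decoder-only LLM.
# Rule of thumb: ~2 FLOPs per parameter per token (one multiply and one add
# per weight). Model size and precision below are illustrative assumptions.

def per_token_cost(n_params: float, bytes_per_weight: int = 2):
    flops = 2 * n_params                        # forward-pass FLOPs per token
    weight_bytes = n_params * bytes_per_weight  # weight bytes read per token
    return flops, weight_bytes

# Example: a 70B-parameter model in FP16 (2 bytes per weight)
flops, weight_bytes = per_token_cost(70e9)
print(f"~{flops / 1e9:.0f} GFLOPs and ~{weight_bytes / 1e9:.0f} GB "
      f"of weight reads per generated token")
```

Because every generated token must touch all of the weights, sustained decoding stresses memory bandwidth as much as raw compute, which is why the hardware parameters discussed below matter.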

In this post, we discuss the types of hardware typically favored by organizations serving either frontier models or fine-tuned open-source models. The chosen hardware should be cost-effective, high-performance, and energy efficient. After procurement, organizations deploy this hardware in AI-ready data centers that provide adequate power and cooling.

Today, only a limited number of companies design and manufacture hardware capable of supporting LLM inference or serving with these characteristics. We will first examine the available hardware from different vendors, the technologies that underpin them, and the key parameters that distinguish one option from another.

Figure: Data center visualization (generated with Nano Banana)
| Company | Hardware Chips |
| --- | --- |
| NVIDIA | Rubin / B200 / H100 / H20 / A100 / L40 / RTX 5090 / T4 / A10G |
| AMD | Instinct MI300X / MI300A / Radeon Pro W7900 / Radeon RX 7900 XTX |
| INTEL | Gaudi 2 and 3 / Max 1550 / Flex 170 / Xeon CPU / Arc A770 / Arc B580 / Pro B50 |
| AMAZON | Inferentia2 |
| GOOGLE | TPU7x (Ironwood) / v6e / v5p / v5e / v4 / v3 / v2 |
| GRAPHCORE | Intelligence Processing Unit (IPU) |
| CEREBRAS | Wafer-Scale Engine (WSE): WSE-3 / WSE-2 |
| SAMBANOVA | Reconfigurable Dataflow Unit (RDU): SN40L |
| GROQ | Language Processing Unit (LPU), a custom ASIC |
| TENSTORRENT | Blackhole (16 big RISC-V cores) / Wormhole (n150/n300) / Grayskull (e150/e300) |
| FURIOSA AI | RNGD "Renegade" (tensor contraction processor) |
| POSITRON AI | Atlas |
| d-MATRIX | Corsair |
| HUAWEI | Ascend 910C / 910B (Neural Processing Units, NPUs) |
| MICROSOFT | Maia 200 |
| QUALCOMM | Cloud AI 100 Ultra / Cloud AI 080 Ultra |

Comparison between Different Hardware

Below we compare common parameters of the hardware listed above, as typically procured for data centers. Note that some designs use on-chip SRAM, which stores model weights directly on the chip and reduces access latency compared with the off-chip HBM used by GPUs.
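The flip side of on-chip SRAM is capacity: it is far smaller than HBM, so large models must be sharded across many devices. The sketch below estimates the device count from the SRAM sizes in the comparison table; the 70 GB model size (roughly a 70B-parameter model at INT8) is an illustrative assumption.

```python
# Why SRAM-based designs shard models across many chips: on-chip SRAM is
# fast but small, so the weights of a large model must be partitioned
# across a pipeline of devices. SRAM sizes come from the comparison table;
# the model size is an illustrative assumption.
import math

def chips_needed(model_gb: float, sram_mb: float) -> int:
    """Minimum number of devices whose SRAM can jointly hold the weights."""
    return math.ceil(model_gb * 1e9 / (sram_mb * 1e6))

# ~70 GB of weights (e.g. a 70B-parameter model at INT8)
print("Groq LPU (230 MB SRAM):", chips_needed(70, 230))        # hundreds of chips
print("Cerebras WSE-3 (44 GB SRAM):", chips_needed(70, 44_000))  # a couple of wafers
```

This is why SRAM-first vendors build large interconnected racks rather than single-card products.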

| Company | Chip | Architecture | Memory | Memory Bandwidth | TBP (or TDP) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA | B200 | Blackwell | 192 GB HBM3e | 8 TB/s | 1000 W (SXM6) |
| NVIDIA | H100 | Hopper | 80 GB HBM3 | 3 TB/s | 700 W (SXM5) |
| NVIDIA | A100 | Ampere | 80 GB HBM2e | 2 TB/s | 400 W (SXM) |
| NVIDIA | H20 | Hopper | 96 GB HBM3 | 4 TB/s | 400 W (SXM5) |
| NVIDIA | L20 | Ada Lovelace | 48 GB GDDR6 | 864 GB/s | 275 W |
| AMD | MI300X | CDNA 3 | 192 GB HBM3 | 5.3 TB/s | 750 W |
| AMD | MI300A | 4th-Gen Infinity Architecture | 128 GB HBM3 | 5.3 TB/s | 750 W |
| INTEL | Gaudi 3 | Tensor Processor Cores (TPCs) and dual Matrix Multiplication Engines (MMEs) | 128 GB HBM2e | 3.7 TB/s | 600 W |
| INTEL | Gaudi 2 | Tensor Processor Cores (TPCs) and dual Matrix Multiplication Engines (MMEs) | 96 GB HBM2e | 2.45 TB/s | 600 W |
| INTEL | Flex 170 | Xe-HPG (Arc ACM-G10) | 16 GB GDDR6 | 576 GB/s | 150 W |
| AMAZON | Inferentia2 | NeuronCore-v2 | 32 GB HBM | 820 GB/s | 100 W per chip |
| GOOGLE | TPU7x (Ironwood) | TPU | 192 GB HBM3e | 7.2 TB/s | 1000 W |
| GOOGLE | TPU v6e | TPU | 32 GB HBM | 1640 GB/s | 300 W |
| GRAPHCORE | Colossus MK2 GC200 | IPU | 900 MB In-Processor-Memory | 47.5 TB/s | 192 W |
| GRAPHCORE | Colossus MK1 GC2 | IPU | 300 MB In-Processor-Memory | 45 TB/s | 192 W |
| CEREBRAS | WSE-3 | Wafer-Scale Engine (WSE) | 44 GB on-chip SRAM | 21 PB/s | 46 kW per rack |
| SAMBANOVA | SN40L | RDU | 64 GB HBM3 | 2 TB/s | 10 kW per rack |
| GROQ | LPU | Tensor Streaming Processor | 230 MB on-chip SRAM | 80 TB/s | 375 W |
| TENSTORRENT | Blackhole P150a | Tensix Processor | 180 MB on-chip SRAM | 512 GB/s | 300 W |
| TENSTORRENT | Wormhole n300d | Tensix Processor | 192 MB on-chip SRAM | 576 GB/s | 300 W |
| MICROSOFT | Maia 200 | Tile Tensor Unit | 216 GB HBM3 | 7 TB/s | 750 W |
| QUALCOMM | Cloud AI 100 Ultra | AI Core | 576 MB on-chip SRAM | 548 GB/s | 150 W |
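The memory-bandwidth column above sets a hard ceiling on single-stream decoding speed: every generated token must stream all of the weights through memory, so tokens/s can never exceed bandwidth divided by model size. The sketch below applies this bound to a few chips from the table; the 140 GB model size (roughly a 70B-parameter model at FP16) is an illustrative assumption.

```python
# Upper bound on single-stream decode throughput when weights live off-chip:
#   tokens/s <= memory_bandwidth / model_size_in_bytes
# Bandwidth figures are taken from the comparison table above; the model
# size used in the example is an illustrative assumption.

CHIP_BW_TBS = {          # memory bandwidth in TB/s
    "NVIDIA H100": 3.0,
    "NVIDIA B200": 8.0,
    "AMD MI300X": 5.3,
    "Google TPU7x (Ironwood)": 7.2,
}

def max_tokens_per_s(bw_tb_s: float, model_gb: float) -> float:
    """Bandwidth-bound ceiling on tokens/s for a single decode stream."""
    return bw_tb_s * 1e12 / (model_gb * 1e9)

for chip, bw in CHIP_BW_TBS.items():
    # ~140 GB of weights, e.g. a 70B-parameter model at FP16
    print(f"{chip}: <= {max_tokens_per_s(bw, 140):.1f} tokens/s")
```

Real deployments raise aggregate throughput by batching requests (the weights are read once per step and reused across the batch), but this per-stream ceiling is why bandwidth, not peak FLOPs, often dominates inference hardware choices.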

New Hardware Technologies driving Innovation

In addition to GPU architectures, there are several evolving hardware innovations, including:

Photonic AI Chips

Photonic chips use light (photons) instead of electricity to process information. They deliver higher bandwidth with lower energy demands.

RISC-V

RISC-V is seen as a power-efficient alternative to GPUs and is well suited to edge devices running AI workloads. It uses an open ISA (Instruction Set Architecture).

Specialized Field-Programmable Gate Arrays (FPGAs)

Examples include the Xilinx Alveo U280 and Achronix Speedster7t FPGAs, which are positioned as GPU alternatives thanks to their combination of computational power, memory bandwidth, and energy efficiency.

Summary

In this post, we took a tour through the emerging landscape of data center chips built for LLM inference. Instead of stopping at familiar GPU architectures, these new hardware designs branch out, chasing higher efficiency and squeezing more tokens out of every watt. Along the way, we saw how some approaches pull memory directly onto the chip, cutting out the latency of fetching model weights from off‑chip HBM and making inference feel far more immediate. We also looked at companies like CEREBRAS, which have pushed scale to extremes, packing enough compute into their racks to host 24‑trillion‑parameter models and giving frontier model builders a powerful new canvas to work with.

Future Post and Chapter-2

In this post, we looked at emerging hardware chips and alternative vendors in the market beyond NVIDIA. In the next post, we will focus on the software stacks that support these chips, including the compilers, tools, and accelerators used to run Deep Learning and LLM applications on them.



Glossary

HBM – High Bandwidth Memory

TBP – Total Board Power

TDP – Thermal Design Power

GPU – Graphical Processing Unit

RISC – Reduced Instruction Set Computing

ASIC – Application-Specific Integrated Circuit
