LLM Inference – Chapter-1 – Hardware

Scaling Large Language Model (LLM) inference is essential for serving a large and growing user base. While only a small number of organizations focus on delivering inference for frontier models, many more concentrate on fine-tuning smaller or tiny open-source LLMs for domain- or task-specific applications. These teams still face the challenge of scaling inference to meet customer demand and support their AI workloads.

LLM inference refers to generating outputs from a trained LLM when it is given new input data. The process remains computationally intensive, so it must be optimized for efficiency to handle high data volumes effectively.
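To make "computationally intensive" concrete, here is a back-of-envelope sketch of the per-token cost of a dense decoder-only model, using the common rule of thumb of roughly 2 FLOPs per parameter per generated token. The 70B-parameter, FP16 figures are illustrative assumptions, not numbers from this post.

```python
# Back-of-envelope cost of generating one token with a dense decoder-only LLM.
# Rule of thumb: ~2 FLOPs per parameter per token (one multiply and one add
# per weight). Model size and precision below are illustrative assumptions.

def per_token_cost(n_params: float, bytes_per_weight: int = 2):
    flops = 2 * n_params                        # forward-pass FLOPs per token
    weight_bytes = n_params * bytes_per_weight  # weight bytes read per token
    return flops, weight_bytes

# Example: a 70B-parameter model in FP16 (2 bytes per weight)
flops, weight_bytes = per_token_cost(70e9)
print(f"~{flops / 1e9:.0f} GFLOPs and ~{weight_bytes / 1e9:.0f} GB "
      f"of weight reads per generated token")
```

Because every generated token must touch all of the weights, sustained decoding stresses memory bandwidth as much as raw compute, which is why the hardware parameters discussed below matter.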

In this post, we discuss the types of hardware typically favored by organizations serving either frontier models or fine-tuned open-source models. The chosen hardware should be cost-effective, high-performance, and energy efficient. After procurement, organizations deploy this hardware in AI-ready data centers that provide adequate power and cooling.

Today, only a limited number of companies design and manufacture hardware capable of supporting LLM inference or serving with these characteristics. We will first examine the available hardware from different vendors, the technologies that underpin them, and the key parameters that distinguish one option from another.

Figure: Data center visualization (generated with Nano Banana)
| Company | Hardware Chips |
| --- | --- |
| NVIDIA | Rubin / B200 / H100 / H20 / A100 / L40 / RTX 5090 / T4 / A10G |
| AMD | Instinct MI300X / MI300A / Radeon Pro W7900 / Radeon RX 7900 XTX |
| INTEL | Gaudi 2 and 3 / Max 1550 / Flex 170 / Xeon CPU / Arc A770 / Arc B580 / Pro B50 |
| AMAZON | Inferentia2 |
| GOOGLE | TPU7x (Ironwood) / v6e / v5p / v5e / v4 / v3 / v2 |
| GRAPHCORE | Intelligence Processing Unit (IPU) |
| CEREBRAS | Wafer-Scale Engine (WSE): WSE-3 / WSE-2 |
| SAMBANOVA | Reconfigurable Dataflow Unit (RDU): SN40L |
| GROQ | Language Processing Unit (LPU), a custom ASIC |
| TENSTORRENT | Blackhole (16 big RISC-V cores) / Wormhole (n150/n300) / Grayskull (e150/e300) |
| FURIOSA AI | RNGD "Renegade" (tensor contraction processor) |
| POSITRON AI | Atlas |
| d-MATRIX | Corsair |
| HUAWEI | Ascend 910C / 910B (Neural Processing Units, NPUs) |
| MICROSOFT | Maia 200 |
| QUALCOMM | Cloud AI 100 Ultra / Cloud AI 080 Ultra |

Comparison between Different Hardware

Below we compare common parameters of the hardware listed above, as typically procured for data centers. Note that some designs use on-chip SRAM, which stores model weights directly on the chip and reduces access latency compared with the off-chip HBM used by GPUs.
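The flip side of on-chip SRAM is capacity: it is far smaller than HBM, so large models must be sharded across many devices. The sketch below estimates the device count from the SRAM sizes in the comparison table; the 70 GB model size (roughly a 70B-parameter model at INT8) is an illustrative assumption.

```python
# Why SRAM-based designs shard models across many chips: on-chip SRAM is
# fast but small, so the weights of a large model must be partitioned
# across a pipeline of devices. SRAM sizes come from the comparison table;
# the model size is an illustrative assumption.
import math

def chips_needed(model_gb: float, sram_mb: float) -> int:
    """Minimum number of devices whose SRAM can jointly hold the weights."""
    return math.ceil(model_gb * 1e9 / (sram_mb * 1e6))

# ~70 GB of weights (e.g. a 70B-parameter model at INT8)
print("Groq LPU (230 MB SRAM):", chips_needed(70, 230))        # hundreds of chips
print("Cerebras WSE-3 (44 GB SRAM):", chips_needed(70, 44_000))  # a couple of wafers
```

This is why SRAM-first vendors build large interconnected racks rather than single-card products.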

| Company | Chip | Architecture | Memory | Memory Bandwidth | TBP (or TDP) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA | B200 | Blackwell | 192 GB HBM3e | 8 TB/s | 1000 W (SXM6) |
| NVIDIA | H100 | Hopper | 80 GB HBM3 | 3 TB/s | 700 W (SXM5) |
| NVIDIA | A100 | Ampere | 80 GB HBM2e | 2 TB/s | 400 W (SXM) |
| NVIDIA | H20 | Hopper | 96 GB HBM3 | 4 TB/s | 400 W (SXM5) |
| NVIDIA | L20 | Ada Lovelace | 48 GB GDDR6 | 864 GB/s | 275 W |
| AMD | MI300X | CDNA 3 | 192 GB HBM3 | 5.3 TB/s | 750 W |
| AMD | MI300A | 4th-Gen Infinity Architecture | 128 GB HBM3 | 5.3 TB/s | 750 W |
| INTEL | Gaudi 3 | Tensor Processor Cores (TPCs) and dual Matrix Multiplication Engines (MMEs) | 128 GB HBM2e | 3.7 TB/s | 600 W |
| INTEL | Gaudi 2 | Tensor Processor Cores (TPCs) and dual Matrix Multiplication Engines (MMEs) | 96 GB HBM2e | 2.45 TB/s | 600 W |
| INTEL | Flex 170 | Xe-HPG (Arc ACM-G10) | 16 GB GDDR6 | 576 GB/s | 150 W |
| AMAZON | Inferentia2 | NeuronCore-v2 | 32 GB HBM | 820 GB/s | 100 W per chip |
| GOOGLE | TPU7x (Ironwood) | TPU | 192 GB HBM3e | 7.2 TB/s | 1000 W |
| GOOGLE | TPU v6e | TPU | 32 GB HBM | 1640 GB/s | 300 W |
| GRAPHCORE | Colossus MK2 GC200 | IPU | 900 MB In-Processor-Memory | 47.5 TB/s | 192 W |
| GRAPHCORE | Colossus MK1 GC2 | IPU | 300 MB In-Processor-Memory | 45 TB/s | 192 W |
| CEREBRAS | WSE-3 | Wafer-Scale Engine (WSE) | 44 GB on-chip SRAM | 21 PB/s | 46 kW per rack |
| SAMBANOVA | SN40L | RDU | 64 GB HBM3 | 2 TB/s | 10 kW per rack |
| GROQ | LPU | Tensor Streaming Processor | 230 MB on-chip SRAM | 80 TB/s | 375 W |
| TENSTORRENT | Blackhole P150a | Tensix Processor | 180 MB on-chip SRAM | 512 GB/s | 300 W |
| TENSTORRENT | Wormhole n300d | Tensix Processor | 192 MB on-chip SRAM | 576 GB/s | 300 W |
| MICROSOFT | Maia 200 | Tile Tensor Unit | 216 GB HBM3 | 7 TB/s | 750 W |
| QUALCOMM | Cloud AI 100 Ultra | AI Core | 576 MB on-chip SRAM | 548 GB/s | 150 W |
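The memory-bandwidth column above sets a hard ceiling on single-stream decoding speed: every generated token must stream all of the weights through memory, so tokens/s can never exceed bandwidth divided by model size. The sketch below applies this bound to a few chips from the table; the 140 GB model size (roughly a 70B-parameter model at FP16) is an illustrative assumption.

```python
# Upper bound on single-stream decode throughput when weights live off-chip:
#   tokens/s <= memory_bandwidth / model_size_in_bytes
# Bandwidth figures are taken from the comparison table above; the model
# size used in the example is an illustrative assumption.

CHIP_BW_TBS = {          # memory bandwidth in TB/s
    "NVIDIA H100": 3.0,
    "NVIDIA B200": 8.0,
    "AMD MI300X": 5.3,
    "Google TPU7x (Ironwood)": 7.2,
}

def max_tokens_per_s(bw_tb_s: float, model_gb: float) -> float:
    """Bandwidth-bound ceiling on tokens/s for a single decode stream."""
    return bw_tb_s * 1e12 / (model_gb * 1e9)

for chip, bw in CHIP_BW_TBS.items():
    # ~140 GB of weights, e.g. a 70B-parameter model at FP16
    print(f"{chip}: <= {max_tokens_per_s(bw, 140):.1f} tokens/s")
```

Real deployments raise aggregate throughput by batching requests (the weights are read once per step and reused across the batch), but this per-stream ceiling is why bandwidth, not peak FLOPs, often dominates inference hardware choices.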

New Hardware Technologies driving Innovation

In addition to GPU architectures, there are several evolving hardware innovations, including:

Photonic AI Chips

Photonic chips use light (photons) instead of electricity to process information. They deliver higher bandwidth with lower energy demands.

RISC-V

RISC-V is seen as a power-efficient alternative to GPUs and is well suited to edge devices running AI workloads. It uses an open ISA (Instruction Set Architecture).

Specialized Field-Programmable Gate Arrays (FPGAs)

Examples include the Xilinx Alveo U280 and Achronix Speedster7t FPGAs, which are positioned as GPU alternatives thanks to their combination of computational power, memory bandwidth, and energy efficiency.

Summary

In this post, we took a tour through the emerging landscape of data center chips built for LLM inference. Instead of stopping at familiar GPU architectures, these new hardware designs branch out, chasing higher efficiency and squeezing more tokens out of every watt. Along the way, we saw how some approaches pull memory directly onto the chip, cutting out the latency of fetching model weights from off‑chip HBM and making inference feel far more immediate. We also looked at companies like CEREBRAS, which have pushed scale to extremes, packing enough compute into their racks to host 24‑trillion‑parameter models and giving frontier model builders a powerful new canvas to work with.

Future Post and Chapter-2

In this post, we looked at emerging hardware chips and alternative vendors in the market beyond NVIDIA. In the next post, we will focus on the software stacks that support these chips, including the compilers, tools, and accelerators used to run Deep Learning and LLM applications on them.



Glossary

HBM – High Bandwidth Memory

TBP – Total Board Power

TDP – Thermal Design Power

GPU – Graphical Processing Unit

RISC – Reduced Instruction Set Computing

ASIC – Application-Specific Integrated Circuit
