Scaling Large Language Model (LLM) inference is essential for serving a large and growing user base. While only a small number of organizations focus on delivering inference for frontier models, many more concentrate on fine-tuning smaller or tiny open-source LLMs for domain- or task-specific applications. These teams still face the challenge of scaling inference to meet customer demand and support their AI workloads.
LLM inference refers to generating outputs from a trained LLM when it is given new input data. The process remains computationally intensive, so it must be optimized for efficiency to handle high data volumes effectively.
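Concretely, inference for a decoder-only LLM is an autoregressive loop: each step runs the model forward to score the next token, appends the chosen token, and repeats. Below is a minimal sketch of that loop; the `forward` function is a stand-in toy scorer, not a real model, and `VOCAB_SIZE` is an illustrative assumption.

```python
import numpy as np

VOCAB_SIZE = 50_000  # illustrative vocabulary size (assumed)

def forward(token_ids):
    """Stand-in for the model's forward pass: returns next-token logits
    for the current sequence. A real LLM would run its transformer
    layers here; we use a deterministic toy scorer instead."""
    rng = np.random.default_rng(seed=sum(token_ids))
    return rng.standard_normal(VOCAB_SIZE)

def generate(prompt_ids, max_new_tokens=8):
    """Greedy autoregressive decoding: one forward pass per new token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)               # the compute-heavy step
        ids.append(int(np.argmax(logits)))  # greedy: pick the top token
    return ids

out = generate([1, 2, 3], max_new_tokens=5)
print(len(out))  # 8 -> 3 prompt tokens + 5 generated tokens
```

Because every generated token requires a full pass over the model weights, this loop is exactly what the hardware below is built to accelerate.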
In this post, we discuss the types of hardware typically favored by organizations serving either frontier models or fine-tuned open-source models. The chosen hardware should be cost-effective, high-performance, and energy efficient. After procurement, organizations deploy this hardware in AI-ready data centers that provide adequate power and cooling.
Today, only a limited number of companies design and manufacture hardware capable of supporting LLM inference or serving with these characteristics. We will first examine the available hardware from different vendors, the technologies that underpin them, and the key parameters that distinguish one option from another.

| Company | Hardware Chips |
| --- | --- |
| NVIDIA | Rubin/B200/H100/H20/A100/L40/RTX 5090/T4/A10G |
| AMD | Instinct MI300X/MI300A/Radeon Pro W7900/Radeon RX 7900 XTX |
| INTEL | Gaudi 2 and 3/Max 1550/Flex 170/Xeon CPU/Arc A770/Arc B580/Pro B50 |
| AMAZON | Inferentia2 |
| GOOGLE | TPU7x (Ironwood)/v6e/v5p/v5e/v4/v3/v2 |
| GRAPHCORE | Intelligence Processing Unit (IPU) |
| CEREBRAS | Wafer-Scale Engine (WSE) – WSE-3/WSE-2 |
| SAMBANOVA | Reconfigurable Dataflow Unit (RDU) (SN40L) |
| GROQ | Language Processing Unit (LPU) custom ASIC |
| TENSTORRENT | Blackhole [16 big RISC-V cores], Wormhole (n150/n300), Grayskull (e150/e300) |
| FURIOSA AI | RNGD "Renegade" [Tensor contraction processor] |
| POSITRON AI | Atlas |
| d-MATRIX | Corsair |
| HUAWEI | Ascend 910C/910B (GPUs/Neural Processing Units (NPUs)) |
| MICROSOFT | Maia 200 |
| QUALCOMM | Cloud AI 100 Ultra/Cloud AI 080 Ultra |
Comparison between Different Hardware
Below, we compare the hardware above on parameters that are commonly weighed during data-center procurement. Note that some chips use on-chip SRAM, which stores model weights directly on the die and reduces access latency compared to the off-chip HBM used by GPUs.
| Company | Chips | Architecture | Memory | Memory Bandwidth | TBP (or TDP) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA | B200 | Blackwell | 192 GB HBM3e | 8 TB/s | 1000W (SXM6) |
| NVIDIA | H100 | Hopper | 80 GB HBM3 | 3 TB/s | 700W (SXM5) |
| NVIDIA | A100 | Ampere | 80 GB HBM2e | 2 TB/s | 400W (SXM) |
| NVIDIA | H20 | Hopper | 96 GB HBM3 | 4 TB/s | 400W (SXM5) |
| NVIDIA | L20 | Ada Lovelace | 48 GB GDDR6 | 864 GB/s | 275W |
| AMD | MI300X | CDNA 3 | 192 GB HBM3 | 5.3 TB/s | 750W |
| AMD | MI300A | 4th-Gen Infinity Architecture | 128 GB HBM3 | 5.3 TB/s | 750W |
| INTEL | Gaudi 3 | Tensor Processor Cores (TPCs) and dual Matrix Multiplication Engines (MMEs) | 128 GB HBM2e | 3.7 TB/s | 600W |
| INTEL | Gaudi 2 | Tensor Processor Cores (TPCs) and dual Matrix Multiplication Engines (MMEs) | 96 GB HBM2e | 2.45 TB/s | 600W |
| INTEL | Flex 170 | Xe-HPG (Arc ACM-G10) | 16 GB GDDR6 | 576 GB/s | 150W |
| AMAZON | Inferentia2 | NeuronCore-v2 | 32 GB HBM | 820 GB/s | 100W per chip |
| GOOGLE | TPU7x (Ironwood) | TPU | 192 GB HBM3e | 7.2 TB/s | 1000W |
| GOOGLE | TPU v6e | TPU | 32 GB HBM | 1640 GB/s | 300W |
| GRAPHCORE | Colossus MK2 GC200 | IPU | 900 MB In-Processor Memory | 47.5 TB/s | 192W |
| GRAPHCORE | Colossus MK1 GC2 | IPU | 300 MB In-Processor Memory | 45 TB/s | 192W |
| CEREBRAS | WSE-3 | Wafer-Scale Engine (WSE) | 44 GB on-chip SRAM | 21 PB/s | 46kW per rack |
| SAMBANOVA | SN40L | RDU | 64 GB HBM3 | 2 TB/s | 10kW per rack |
| GROQ | LPU | Tensor Streaming Processor | 230 MB on-chip SRAM | 80 TB/s | 375W |
| TENSTORRENT | Blackhole P150a | Tensix Processor | 180 MB on-chip SRAM | 512 GB/s | 300W |
| TENSTORRENT | Wormhole n300d | Tensix Processor | 192 MB on-chip SRAM | 576 GB/s | 300W |
| MICROSOFT | Maia 200 | Tile Tensor Unit | 216 GB HBM3 | 7 TB/s | 750W |
| QUALCOMM | Cloud AI 100 Ultra | AI Core | 576 MB on-chip SRAM | 548 GB/s | 150W |
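The memory-bandwidth column is often the binding constraint for single-stream decoding: generating each token requires reading roughly all of the model weights once, so an upper bound on tokens per second is bandwidth divided by weight bytes. The sketch below is a back-of-the-envelope estimate using bandwidth figures from the table; the 70B-parameter FP16 model is an illustrative assumption, and we ignore KV-cache traffic, compute limits, and the fact that 140 GB of weights must be sharded across chips with smaller memories.

```python
# Rough decode-throughput ceiling: tokens/s <= bandwidth / weight_bytes.
# Bandwidth figures come from the table above; the model size is an
# assumption for illustration. For chips whose memory is smaller than
# the weights (e.g. H100 at 80 GB), weights would be sharded across
# several chips, and aggregate bandwidth scales accordingly.
PARAMS = 70e9          # 70B-parameter model (assumed)
BYTES_PER_PARAM = 2    # FP16/BF16 weights
weight_bytes = PARAMS * BYTES_PER_PARAM  # 140 GB

chips = {
    "NVIDIA H100": 3.0e12,   # 3 TB/s HBM3
    "AMD MI300X":  5.3e12,   # 5.3 TB/s HBM3
    "NVIDIA B200": 8.0e12,   # 8 TB/s HBM3e
}

for name, bw in chips.items():
    ceiling = bw / weight_bytes
    print(f"{name}: <= {ceiling:.0f} tokens/s per stream")
# NVIDIA H100: <= 21 tokens/s per stream
# AMD MI300X:  <= 38 tokens/s per stream
# NVIDIA B200: <= 57 tokens/s per stream
```

This simple ratio explains why on-chip SRAM designs (Groq, Cerebras) quote bandwidths in the tens of TB/s to PB/s: the same ceiling scales directly with memory bandwidth.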
New Hardware Technologies driving Innovation
In addition to GPU architectures, there are several evolving hardware innovations, including:
Photonic AI Chips
Photonic chips process information with light (photons) instead of electricity. They offer higher bandwidth with lower energy demands.
RISC-V
RISC-V is seen as a power-efficient alternative to GPUs and is well suited to edge devices running AI workloads. Its open ISA (Instruction Set Architecture) allows vendors to add custom extensions without licensing fees.
Specialized Field-Programmable Gate Arrays (FPGAs)
Examples include the Xilinx Alveo U280 and Achronix Speedster7t FPGAs, which are positioned as GPU alternatives thanks to their combination of computational power, memory bandwidth, and energy efficiency.
Summary
In this post, we took a tour through the emerging landscape of data center chips built for LLM inference. Instead of stopping at familiar GPU architectures, these new hardware designs branch out, chasing higher efficiency and squeezing more tokens out of every watt. Along the way, we saw how some approaches pull memory directly onto the chip, cutting out the latency of fetching model weights from off‑chip HBM and making inference feel far more immediate. We also looked at companies like CEREBRAS, which have pushed scale to extremes, packing enough compute into their racks to host 24‑trillion‑parameter models and giving frontier model builders a powerful new canvas to work with.
Future Post: Chapter 2
In this post, we looked at emerging hardware chips and alternative vendors in the market beyond NVIDIA. In the next post, we will focus on the software stacks that support these chips, including the compilers, tools, and accelerators used to run Deep Learning and LLM applications on them.
References
https://medium.com/@fenjiro/hardware-guide-for-large-language-models-and-deep-learning-b619af574cca
https://intuitionlabs.ai/articles/llm-inference-hardware-enterprise-guide
https://www.arxiv.org/pdf/2601.05047
https://bentoml.com/llm/getting-started/choosing-the-right-gpu
https://medium.com/@bijit211987/top-nvidia-gpus-for-llm-inference-8a5316184a10
https://pytorch.org/blog/high-performance-quantized-llm-inference-on-intel-cpus-with-native-pytorch
https://github.com/intel/ipex-llm [IPEX-LLM]
https://docs.aws.amazon.com/sagemaker/latest/dg/model-optimize.html [AWS Inferentia / AWS Neuron]
https://github.com/AI-Hypercomputer/JetStream [for TPU]
https://github.com/AI-Hypercomputer/maxtext [for TPU]
https://arxiv.org/abs/2503.22937 [RDU]
https://groq.com/blog/inside-the-lpu-deconstructing-groq-speed [LPU]
https://furiosa.ai/blog/tensor-contraction-processor-ai-chip-architecture [RNGD]
https://www.positron.ai/atlas [ATLAS]
https://news.mit.edu/2024/photonic-processor-could-enable-ultrafast-ai-computations-1202 [Photonic AI chips]
https://www.achronix.com/blog/accelerating-llm-inferencing-fpgas [FPGA Achronix]
https://www.linkedin.com/pulse/nvidia-h20-vs-h100-h200-data-center-gpu-battle-thats-ai-mujeeb–ytd7f [NVIDIA GPU comparison]
https://tenstorrent.com/en/hardware/ [Tenstorrent]
https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference
https://arxiv.org/html/2507.00418v1
Glossary
HBM – High Bandwidth Memory
TBP – Total Board Power
TDP – Thermal Design Power
GPU – Graphics Processing Unit
RISC – Reduced Instruction Set Computing
ASIC – Application-Specific Integrated Circuit