
3 posts tagged with "Releases"

llm-d release announcements


llm-d 0.4: Achieve SOTA Performance Across Accelerators

10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

llm-d’s mission is to provide the fastest time to SOTA inference performance across any accelerator and cloud. In our 0.3 release we enabled wide expert parallelism for large mixture-of-experts models to provide extremely high output token throughput - a key enabler for reinforcement learning - and we added preliminary support for multiple non-GPU accelerator families.

This release brings the complement to expert parallelism throughput: improving end-to-end request latency for production serving. We reduce DeepSeek per-token latency by up to 50% with speculative decoding and vLLM optimizations for latency-critical workloads. We add dynamic disaggregated serving support for Google TPU and Intel XPU to further reduce time-to-first-token latency when traffic is unpredictable, while our new well-lit path for prefix cache offloading helps you leverage CPU memory and high-performance remote storage to increase hit rates and reduce tail latency. For users with multiple model deployments, our workload autoscaler preview takes real-time server capacity and traffic into account to reduce the time a model deployment spends queuing requests - lessening the operational toil of running multiple models over constrained accelerator capacity.
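
To make the prefix cache offloading idea concrete, here is a minimal sketch of a two-tier cache in which blocks evicted from the hot (GPU) tier spill to a larger CPU tier instead of being discarded. The class, tier sizes, and block representation are hypothetical illustrations of the technique, not llm-d's or vLLM's actual implementation:

```python
from collections import OrderedDict

class TieredPrefixCache:
    """Toy two-tier prefix cache: a small 'GPU' tier backed by a larger 'CPU' tier."""

    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.gpu = OrderedDict()  # prefix hash -> KV blocks (hot tier)
        self.cpu = OrderedDict()  # prefix hash -> KV blocks (offload tier)
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    def get(self, prefix_hash: str):
        """Return cached KV blocks for a prefix, promoting CPU hits back to the GPU tier."""
        if prefix_hash in self.gpu:
            self.gpu.move_to_end(prefix_hash)
            return self.gpu[prefix_hash]
        if prefix_hash in self.cpu:
            blocks = self.cpu.pop(prefix_hash)
            self.put(prefix_hash, blocks)  # promote back to the hot tier
            return blocks
        return None  # miss: caller must recompute the prefill

    def put(self, prefix_hash: str, kv_blocks):
        self.gpu[prefix_hash] = kv_blocks
        self.gpu.move_to_end(prefix_hash)
        if len(self.gpu) > self.gpu_capacity:
            victim, victim_blocks = self.gpu.popitem(last=False)
            self.cpu[victim] = victim_blocks  # offload instead of discarding
            if len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)  # finally evict the coldest entry

cache = TieredPrefixCache(gpu_capacity=2, cpu_capacity=8)
cache.put("system-prompt-v1", ["kv-block-0", "kv-block-1"])
cache.put("chat-abc", ["kv-block-2"])
cache.put("chat-def", ["kv-block-3"])       # evicts system-prompt-v1 to the CPU tier
print(cache.get("system-prompt-v1"))        # hit served from the offload tier, no recompute
```

A remote-storage tier would slot in behind the CPU tier in the same way, trading transfer latency for an even higher hit rate.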

These OSS inference stack optimizations, surfaced through our well-lit paths, ensure you reach SOTA latency on frontier OSS models in real-world scenarios.

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

In our 0.2 release, we introduced the first well-lit paths, tested blueprints for scaling inference on Kubernetes. With our 0.3 release, we double down on that mission: to provide a fast path to deploying high-performance, hardware-agnostic, easy-to-operationalize inference at scale.

This release delivers:

  • Expanded hardware support, now including Google TPU and Intel accelerators
  • TCP and RDMA over RoCE validated for disaggregation
  • A predicted-latency-based balancing preview that improves P90 latency by up to 3x on long-prefill workloads (sketched below)
  • Wide expert parallel (EP) scaling to 2.2k tokens per second per H200 GPU
  • The GA release of the Inference Gateway (IGW v1.0)
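
As a rough illustration of the predicted-latency-based balancing idea (the endpoint fields and latency model below are simplifying assumptions for this sketch, not llm-d's actual scorer): each replica's queue depth and observed prefill throughput feed a time-to-first-token estimate, and the request is routed to the replica with the lowest prediction rather than round-robin.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queued_tokens: int            # tokens already waiting in this replica's queue
    prefill_tokens_per_s: float   # observed prefill throughput of this replica

def predicted_ttft(ep: Endpoint, prompt_tokens: int) -> float:
    """Rough time-to-first-token estimate: drain the queue, then prefill this request."""
    return (ep.queued_tokens + prompt_tokens) / ep.prefill_tokens_per_s

def pick_endpoint(endpoints: list[Endpoint], prompt_tokens: int) -> Endpoint:
    """Route to the replica with the lowest predicted latency instead of round-robin."""
    return min(endpoints, key=lambda ep: predicted_ttft(ep, prompt_tokens))

replicas = [
    Endpoint("pool-a", queued_tokens=30_000, prefill_tokens_per_s=20_000),
    Endpoint("pool-b", queued_tokens=5_000, prefill_tokens_per_s=15_000),
]
# A long-prefill request is steered away from the replica with the deep queue.
print(pick_endpoint(replicas, prompt_tokens=8_000).name)  # -> pool-b
```

This is the intuition behind why the gains concentrate in long-prefill workloads: a single large prompt stuck behind a deep queue dominates P90 latency unless the balancer can see and avoid it.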

Taken together, these results redefine the operating envelope for inference. llm-d enables clusters to run hotter before scaling out, extracting more value from each GPU while still meeting strict latency objectives. The result is a control plane built not just for speed, but for predictable, cost-efficient scale.

llm-d 0.2: Our first well-lit paths (mind the tree roots!)

11 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

Our 0.2 release delivers progress against our three well-lit paths to accelerate deploying large-scale inference on Kubernetes - better load balancing, lower latency with disaggregation, and native vLLM support for very large Mixture of Experts models like DeepSeek-R1.
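
For readers new to disaggregation, the sketch below shows the shape of the idea (hypothetical function names, not the llm-d API): the compute-heavy prefill and the latency-sensitive decode run in separate worker pools, and the KV cache produced by prefill is what gets handed over between them.

```python
def prefill_worker(prompt_tokens: list[int]) -> dict:
    """Run the full-prompt forward pass once and return a handle to the KV cache."""
    # Stand-in for real attention tensors; in practice this is what gets transferred.
    return {"layers": f"kv for {len(prompt_tokens)} prompt tokens"}

def decode_worker(kv_cache: dict, max_new_tokens: int) -> list[int]:
    """Generate tokens one at a time, reusing the transferred KV cache."""
    generated = []
    for step in range(max_new_tokens):
        generated.append(step)  # stand-in for a real sampled token
        # a real decode step would also append its own KV entries to the cache
    return generated

kv = prefill_worker(prompt_tokens=list(range(1024)))   # runs on the prefill pool
tokens = decode_worker(kv, max_new_tokens=4)           # runs on the decode pool
print(len(tokens))
```

Separating the two phases keeps long prefills from stalling in-flight decodes, which is where the latency benefit of disaggregation comes from.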

We’ve also enhanced our deployment and benchmarking tooling, incorporating lessons from real-world infrastructure deployments and addressing key antipatterns. This release gives llm-d users, contributors, researchers, and operators clearer guides for efficient use in tested, reproducible scenarios.