Well-lit Path: Wide Expert Parallelism (EP/DP) with LeaderWorkerSet

Overview

This guide demonstrates how to deploy DeepSeek-R1-0528 using vLLM's P/D disaggregation support with NIXL in a wide expert parallel pattern with LeaderWorkerSets. This guide has been validated on:

a 32xH200 cluster with InfiniBand networking
a 32xH200 cluster on GKE with RoCE networking
a 32xB200 cluster on GKE with RoCE networking

WARNING: We are still investigating and optimizing performance for other hardware and networking configurations

In this example, we will demonstrate a deployment of DeepSeek-R1-0528 with:

1 DP=16 Prefill Worker
1 DP=16 Decode Worker

Hardware Requirements

This guide requires 32 Nvidia H200 or B200 GPUs and InfiniBand or RoCE RDMA networking. Check modelserver/base/decode.yaml and modelserver/base/prefill.yaml for detailed resource requirements.

Prerequisites

Have the proper client tools installed on your local system to use this guide.
Ensure your cluster infrastructure is sufficient to deploy high scale inference
- You must have high speed inter-accelerator networking
- The pods leveraging inter-node EP must be deployed within the same networking domain
- You have deployed the LeaderWorkerSet optional controller
Configure and deploy your Gateway control plane.
Create the llm-d-hf-token secret in your target namespace with the key HF_TOKEN matching a valid HuggingFace token to pull models.
Have the Monitoring stack installed on your system.

Installation

Use the helmfile to compose and install the stack. The Namespace in which the stack will be deployed will be derived from the ${NAMESPACE} environment variable. If you have not set this, it will default to llm-d-wide-ep in this example.

# Clone the repo and switch to the latest release tag 
tag=$(curl -s https://api.github.com/repos/llm-d/llm-d/releases/latest | jq -r '.tag_name')
git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout "$tag"

export NAMESPACE=llm-d-wide-ep # or any other namespace
cd guides/wide-ep-lws/
kubectl create namespace ${NAMESPACE}

Deploy Model Servers

GKE and CoreWeave are tested Kubernetes providers for this well-lit path. You can customize the manifests if you run on other Kubernetes providers.

GKE (H200)
GKE (B200)
CoreWeave

GKE (H200)

kubectl apply -k ./manifests/modelserver/gke -n ${NAMESPACE}

GKE (B200)

# Deploy on GKE for B200 on the a4 instance type to work around a known vLLM memory issue
kubectl apply -k ./manifests/modelserver/gke-a4 -n ${NAMESPACE}

CoreWeave

kubectl apply -k ./manifests/modelserver/coreweave  -n ${NAMESPACE}

Deploy InferencePool

Select the provider-specific Helm command using the tabs below.

GKE
Istio
Kgateway

GKE

helm install deepseek-r1 \
  -n ${NAMESPACE} \
  -f inferencepool.values.yaml \
  --set "provider.name=gke" \
  --set "inferencePool.apiVersion=inference.networking.k8s.io/v1" \
  --set "inferenceExtension.monitoring.gke.enable=true" \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool \
  --version v1.2.0-rc.1

Istio

helm install deepseek-r1 \
  -n ${NAMESPACE} \
  -f inferencepool.values.yaml \
  --set "provider.name=istio" \
  --set "inferenceExtension.monitoring.prometheus.enable=true" \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool \
  --version v1.2.0-rc.1

Kgateway

helm install deepseek-r1 \
  -n ${NAMESPACE} \
  -f inferencepool.values.yaml \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool \
  --version v1.2.0-rc.1

Deploy Gateway and HTTPRoute

Choose the gateway manifest that matches your environment.

GKE (Regional External)
Istio
Kgateway
Kgateway on OCP

GKE (Regional External)

kubectl apply -k ./manifests/gateway/gke-l7-regional-external-managed -n ${NAMESPACE}

Istio

kubectl apply -k ./manifests/gateway/istio -n ${NAMESPACE}

Kgateway

kubectl apply -k ./manifests/gateway/kgateway -n ${NAMESPACE}

Kgateway on OCP

kubectl apply -k ./manifests/gateway/kgateway-openshift -n ${NAMESPACE}

Gateway options

To see what gateway options are supported refer to our gateway provider prereq doc. Gateway configurations per provider are tracked in the gateway-configurations directory.

You can also customize your gateway, for more information on how to do that see our gateway customization docs.

Tuning Selective PD

As with PD, the wide-ep-lws guide supports selective PD. For information on this refer to this section of the PD docs.

Verifying the installation

Firstly, you should be able to list all helm releases installed into your chosen namespace:

helm list -n ${NAMESPACE}
NAME            NAMESPACE       REVISION    UPDATED                                 STATUS      CHART                       APP VERSION
deepseek-r1     llm-d-wide-ep   1           2025-08-24 13:14:53.355639 -0700 PDT    deployed    inferencepool-v1.0          v0.3.0

Out of the box with this example you should have the following resources (if using Istio):

kubectl get all -n ${NAMESPACE}
NAME                                                         READY   STATUS    RESTARTS   AGE
pod/infra-wide-ep-inference-gateway-istio-74d5c66c86-h5mfn   1/1     Running   0          2m22s
pod/wide-ep-llm-d-decode-0                   2/2     Running   0          2m13s
pod/wide-ep-llm-d-decode-0-1                 2/2     Running   0          2m13s
pod/deepseek-r1-epp-84dd98f75b-r6lvh         1/1     Running   0          2m14s
pod/wide-ep-llm-d-prefill-0                  1/1     Running   0          2m13s
pod/wide-ep-llm-d-prefill-0-1                1/1     Running   0          2m13s


NAME                                            TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                        AGE
service/infra-wide-ep-inference-gateway-istio   ClusterIP      10.16.1.34    10.16.4.2     15021:30312/TCP,80:33662/TCP   2m22s
service/wide-ep-ip-1e480070                     ClusterIP      None          <none>        54321/TCP                      2d4h
service/wide-ep-llm-d-decode                    ClusterIP      None          <none>        <none>                         2m13s
service/deepseek-r1-epp                         ClusterIP      10.16.1.137   <none>        9002/TCP                       2d4h
service/wide-ep-llm-d-prefill                   ClusterIP      None          <none>        <none>                         2m13s

NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/infra-wide-ep-inference-gateway-istio   1/1     1            1           2m22s
deployment.apps/deepseek-r1-epp       1/1     1            1           2m14s

NAME                                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/infra-wide-ep-inference-gateway-istio-74d5c66c86   1         1         1       2m22s
replicaset.apps/deepseek-r1-epp-55bb9857cf       1         1         1       2m14s

NAME                                                      READY   AGE
statefulset.apps/wide-ep-llm-d-decode     1/1     2m13s
statefulset.apps/wide-ep-llm-d-decode-0   1/1     2m13s
statefulset.apps/wide-ep-llm-d-prefill    1/1     2m13s
statefulset.apps/wide-ep-llm-d-prefill-1  1/1     2m13s

NOTE: This assumes no other guide deployments in your given ${NAMESPACE} and you have not changed the default release names via the ${RELEASE_NAME} environment variable.

Using the stack

For instructions on getting started making inference requests see our docs

NOTE: This example particularly benefits from utilizing stern as described in the getting-started-inferencing docs, because while we only have 3 inferencing pods, it has 16 vllm servers or ranks.

NOTE: Compared to the other examples, this one takes anywhere between 7-10 minutes for the vllm API servers to startup so this might take longer before you can interact with this example.

Cleanup

To remove the deployment:

# From examples/wide-ep-lws
helm uninstall deepseek-r1 -n ${NAMESPACE}
kubectl delete -k ./manifests/modelserver/<gke|coreweave> -n ${NAMESPACE}
kubectl delete -k ./manifests/gateway/<gke-l7-regional-external-managed|istio|kgateway|kgateway-openshift> -n ${NAMESPACE}

Customization

For information on customizing a guide and tips to build your own, see our docs

Documentation Version

This documentation corresponds to llm-d v0.4.0, the latest public release. For the most current development changes, see this file on main.

📝 To suggest changes or report issues, please create an issue.

Source: guides/wide-ep-lws/README.md

Overview​

Hardware Requirements​

Prerequisites​

Installation​

Deploy Model Servers​

GKE (H200)​

GKE (B200)​

CoreWeave​

Deploy InferencePool​

GKE​

Istio​

Kgateway​

Deploy Gateway and HTTPRoute​

GKE (Regional External)​

Istio​

Kgateway​

Kgateway on OCP​

Gateway options​

Tuning Selective PD​

Verifying the installation​

Using the stack​

Cleanup​

Customization​

Overview

Hardware Requirements

Prerequisites

Installation

Deploy Model Servers

GKE (H200)

GKE (B200)

CoreWeave

Deploy InferencePool

GKE

Istio

Kgateway

Deploy Gateway and HTTPRoute

GKE (Regional External)

Istio

Kgateway

Kgateway on OCP

Gateway options

Tuning Selective PD

Verifying the installation

Using the stack

Cleanup

Customization