Offloading Prefix Cache to CPU Memory

Overview

This guide provides recipes for offloading the prefix cache to CPU RAM using either the vLLM native offloading connector or the LMCache connector.

Prerequisites

Installation

First, set up a namespace for the deployment and create the HuggingFace token secret.

# Clone the repo and switch to the latest release tag 
tag=$(curl -s https://api.github.com/repos/llm-d/llm-d/releases/latest | jq -r '.tag_name')
git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout "$tag"

export NAMESPACE=llm-d-pfc-cpu # or any other namespace
kubectl create namespace ${NAMESPACE}

# NOTE: You must have your HuggingFace token stored in the HF_TOKEN environment variable.
export HF_TOKEN="<your-hugging-face-token>"
kubectl create secret generic llm-d-hf-token --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE}
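
As a quick sanity check, confirm that the namespace and the secret exist before proceeding:

kubectl get namespace ${NAMESPACE}
kubectl get secret llm-d-hf-token -n ${NAMESPACE}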

1. Deploy Gateway and HTTPRoute

Deploy the Gateway and HTTPRoute using the gateway recipe.
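
For example, on GKE the gateway recipe referenced by this guide's Cleanup section can be applied with kustomize. The relative path below is the one used in Cleanup; adjust it to wherever the recipes directory sits relative to your working directory:

kubectl apply -k ../../../../recipes/gateway/gke-l7-regional-external-managed -n ${NAMESPACE}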

2. Deploy vLLM Model Server

LMCache Connector

Deploy the vLLM model server with the LMCache connector enabled.

kubectl apply -k ./manifests/vllm/lmcache-connector -n ${NAMESPACE}
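
Before continuing, wait for the model server pods to become ready. The Deployment name below is inferred from the pod names shown later in this guide; adjust it if your manifest names the Deployment differently:

kubectl rollout status deployment/llm-d-model-server -n ${NAMESPACE} --timeout=15m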

3. Deploy InferencePool

To deploy the InferencePool, select your provider below.

GKE

This command deploys the InferencePool on GKE with GKE-specific monitoring enabled.

helm install llm-d-infpool \
-n ${NAMESPACE} \
-f ./manifests/inferencepool/values.yaml \
--set "provider.name=gke" \
--set "inferenceExtension.monitoring.gke.enable=true" \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool \
--version v1.2.0-rc.1

To enable tiered prefix caching, we customize the InferencePool configuration (see manifests/inferencepool/values.yaml). We configure two prefix cache scorers: one for the GPU cache and another for the CPU cache.

For the CPU cache, we must manually configure the lruCapacityPerServer because vLLM currently does not emit CPU block metrics.

The current weight configuration is 2:2:1:1 (Queue Scorer : KV Cache Utilization Scorer : GPU Prefix Cache Scorer : CPU Prefix Cache Scorer). The current CPU offloading implementation copies GPU cache entries to CPU, essentially making the CPU cache a superset of the GPU cache. This weight configuration ensures that the combined weight of the GPU and CPU prefix cache scorers equals 2.

You can tune these values, particularly the ratio between the GPU and CPU scorers, to suit your specific requirements. The current configuration demonstrated improved performance in the benchmark tests described in the Appendix.
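
For example, to experiment with different weights you can copy the provided values file, adjust the scorer weights in your copy (the exact keys live in the EPP plugin configuration inside values.yaml), and upgrade the release in place. This is a sketch of the workflow, not a prescribed configuration:

cp ./manifests/inferencepool/values.yaml ./my-values.yaml
# Edit the GPU/CPU prefix cache scorer weights in my-values.yaml, then:
helm upgrade llm-d-infpool \
-n ${NAMESPACE} \
-f ./my-values.yaml \
--set "provider.name=gke" \
--set "inferenceExtension.monitoring.gke.enable=true" \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool \
--version v1.2.0-rc.1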

Verifying the installation

You can verify the installation by checking the status of the created resources.

Check the Gateway

kubectl get gateway -n ${NAMESPACE}

You should see output similar to the following, with the PROGRAMMED status as True.

NAME                      CLASS                              ADDRESS      PROGRAMMED   AGE
llm-d-inference-gateway   gke-l7-regional-external-managed   <redacted>   True         16m

Check the HTTPRoute

kubectl get httproute -n ${NAMESPACE}
NAME          HOSTNAMES   AGE
llm-d-route               17m

Check the InferencePool

kubectl get inferencepool -n ${NAMESPACE}
NAME            AGE
llm-d-infpool   16m

Check the Pods

kubectl get pods -n ${NAMESPACE}

You should see the InferencePool's endpoint pod and the model server pods in a Running state.

NAME                                READY   STATUS    RESTARTS   AGE
llm-d-infpool-epp-xxxxxxxx-xxxxx    1/1     Running   0          16m
llm-d-model-server-xxxxxxxx-xxxxx   1/1     Running   0          11m
llm-d-model-server-xxxxxxxx-xxxxx   1/1     Running   0          11m
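
As an optional end-to-end smoke test, you can send a request through the gateway. The sketch below assumes the Gateway has been assigned an external address (see the Gateway check above) and that the model server is serving Qwen/Qwen3-32B, the model used in the benchmark appendix; substitute your own model name if it differs:

GATEWAY_IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')
curl -s http://${GATEWAY_IP}/v1/completions \
-H 'Content-Type: application/json' \
-d '{"model": "Qwen/Qwen3-32B", "prompt": "Hello, world", "max_tokens": 16}'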

Cleanup

To remove the deployment:

helm uninstall llm-d-infpool -n ${NAMESPACE}
kubectl delete -k ./manifests/vllm/lmcache-connector -n ${NAMESPACE} # or ./manifests/vllm/offloading-connector if you deployed the native offloading connector
kubectl delete -k ../../../../recipes/gateway/gke-l7-regional-external-managed -n ${NAMESPACE}
kubectl delete namespace ${NAMESPACE}

Appendix

Benchmark

The following benchmark results demonstrate the performance improvements of using vLLM's native CPU offloading and LMCache CPU offloading.

Benchmark Setup

  • Hardware:

    • A total of 16 H100 GPUs, each with 80GB of HBM, were used.
    • The GPUs were distributed across 4 a3-highgpu-4g instances, with 4 GPUs per instance.
  • vLLM Configuration:

    • gpu_memory_utilization was set to 0.65 to reduce the pressure on the benchmark tool. In production configurations this is typically set to a higher value, such as 0.9.
    • CPU offloading was enabled with num_cpu_blocks set to 41000, which provides approximately 100 GB of CPU cache (see the command sketch after this list).
  • LMCache Configuration:

    • For the LMCache setup, LMCACHE_MAX_LOCAL_CPU_SIZE was set to 100 GB.
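
The bullet points above translate roughly into the vLLM invocation sketched below. This is illustrative only: the tensor-parallel size of 2 is inferred from 16 GPUs driving 8 engines, and the CPU offloading / LMCache connector settings themselves are applied through the connector manifests in this guide rather than on the command line:

# Sketch of the benchmark's core vLLM settings (connector-specific offloading
# options are configured via the manifests and omitted here):
vllm serve Qwen/Qwen3-32B \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.65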

The benchmark was conducted using the inference-perf tool with the following workload and memory configurations:

  • Workload:

    • Two different workloads were tested with a constant concurrency of 45 requests.
    • High Cache:
      • num_groups: 45
      • system_prompt_len: 30,000
      • question_len: 256
      • output_len: 1024
      • num_prompts_per_group: 10
    • Low Cache:
      • num_groups: 45
      • system_prompt_len: 8000
      • question_len: 256
      • output_len: 1024
      • num_prompts_per_group: 10
  • Memory Calculation:

    • The KVCache size for the Qwen/Qwen3-32B model is approximately 0.0002 GB per token.
    • With gpu_memory_utilization at 0.65, there are 9271 GPU blocks available per engine.
    • The available HBM for the KVCache per engine is approximately 24.3 GB (9271 blocks * 2.62 MB/block).
    • The total available HBM for the KVCache across the entire system is approximately 194.3 GB (8 engines * 24.3 GB/engine). These figures are reproduced in the snippet after this list.
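
As a quick cross-check, the per-engine, system-wide, and CPU-cache figures can be reproduced from the stated ~2.62 MB block size (a rough sanity check only, not part of the deployment):

# Reproduce the memory figures above and the ~100 GB CPU cache estimate
# implied by "num_cpu_blocks set to 41000" in the benchmark setup.
awk 'BEGIN {
per_engine_gb = 9271 * 2.62 / 1000        # GPU blocks per engine -> GB
printf "HBM for KVCache per engine:  %.1f GB\n", per_engine_gb
printf "Total HBM for KVCache (8x):  %.1f GB\n", 8 * per_engine_gb
printf "CPU cache from 41000 blocks: %.1f GB\n", 41000 * 2.62 / 1000
}'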

Key Findings

  • In High cache scenarios, where the KVCache size exceeds the available HBM, both the vLLM native CPU offloading connector and LMCache connector significantly enhance performance.
  • In Low cache scenarios, where the KVCache fits entirely within the GPU's HBM, all offloading configurations perform similarly to the baseline. However, consistent slight decreases in performance across metrics indicate a small overhead associated with enabling CPU offloading, even when it is not actively utilized.

High Cache Performance

The following table compares the performance of the baseline vLLM with the vLLM using the CPU offloading connector when the KVCache size is larger than the available HBM.

| HBM < KVCache < HBM + CPU RAM | Mean TTFT (seconds) | P90 TTFT (seconds) | Mean E2E Latency (seconds) | P90 E2E Latency (seconds) | Overall Throughput (tokens per second) |
| --- | --- | --- | --- | --- | --- |
| Baseline vLLM | 9.0 | 20.9 | 37.8 | 49.7 | 38534.8 |
| vLLM + CPU offloading 100GB | 6.7 (-25.6%) | 20.2 (-3.3%) | 30.9 (-18.3%) | 44.2 (-11.1%) | 46751.0 (+21.3%) |
| vLLM + LMCache CPU offloading 100GB | 6.5 (-27.8%) | 18.8 (-10.0%) | 30.8 (-18.5%) | 43.0 (-13.5%) | 46910.6 (+21.7%) |

Low Cache Performance

The following table shows that when the KVCache fits within the HBM, the performance of all configurations is similar, indicating minimal but measurable overhead from the CPU offloading mechanism.

| KVCache < HBM | Mean TTFT (seconds) | P90 TTFT (seconds) | Mean E2E Latency (seconds) | P90 E2E Latency (seconds) | Overall Throughput (tokens per second) |
| --- | --- | --- | --- | --- | --- |
| Baseline vLLM | 0.12 | 0.09 | 18.4 | 19.6 | 23389.6 |
| vLLM + CPU offloading 100GB | 0.13 | 0.11 | 18.6 | 20.6 | 23032.6 |
| vLLM + LMCache CPU offloading 100GB | 0.15 | 0.10 | 18.9 | 19.6 | 22772.5 |

Documentation Version

This documentation corresponds to llm-d v0.4.0, the latest public release. For the most current development changes, see this file on main.

📝 To suggest changes or report issues, please create an issue.

Source: guides/tiered-prefix-cache/cpu/README.md