Context caching

Context caching (KV cache) automatically caches prompt prefixes across multiple requests. When subsequent requests share the same prefix, the cached context is reused, reducing latency and cost.

How it works

  • Default: on — no code changes required.
  • Automatic — the cache is managed by the gateway. You do not need to enable or configure it.
  • Prefix matching — if two requests share a common prompt prefix, the shared portion is served from cache.

Pricing

Cache hits are billed at a lower rate than cache misses:

ItemRate
Cache hit$0.028 / 1M tokens (10x cheaper)
Cache miss$0.28 / 1M tokens (standard input rate)

The exact rates for each model are listed in the Model catalog.

Monitoring cache effectiveness

The response usage object includes cache hit/miss statistics:

{
  "usage": {
    "prompt_tokens": 1000,
    "completion_tokens": 200,
    "total_tokens": 1200,
    "prompt_cache_hit_tokens": 800,
    "prompt_cache_miss_tokens": 200
  }
}
FieldDescription
prompt_cache_hit_tokensNumber of prompt tokens served from cache (billed at lower rate).
prompt_cache_miss_tokensNumber of prompt tokens not served from cache (billed at standard rate).

Cache hit rate

You can calculate the cache hit rate for each request:

hit_rate = prompt_cache_hit_tokens / prompt_tokens

A high hit rate means effective caching and lower costs.

Best practices for maximizing cache hits

  1. Use consistent prompt prefixes: If you have a system message or few-shot examples that stay the same across requests, keep them at the beginning of the prompt.
  2. Batch similar requests: Requests with similar prefixes within a short time window are more likely to hit the cache.
  3. Avoid randomizing the prompt: Random IDs, timestamps, or varying system messages reduce cache effectiveness.

Cache behavior

BehaviorDescription
Minimum cache unit64 tokens
Cache build timeSeconds
Cache expirationHours to days (automatic)
Cache invalidationAutomatic when the cache is stale

Cost example

A 10,000 token prompt with 8,000 cache hits and 2,000 cache misses:

ItemTokensRateCost
Cache hit8,000$0.028/M$0.000224
Cache miss2,000$0.28/M$0.000560
Total10,000$0.000784

Without caching, the same prompt would cost $0.002800 — a 72% savings with an 80% cache hit rate.


See also