Context caching
Context caching (KV cache) automatically caches prompt prefixes across multiple requests. When subsequent requests share the same prefix, the cached context is reused, reducing latency and cost.
How it works
- Default: on — no code changes required.
- Automatic — the cache is managed by the gateway. You do not need to enable or configure it.
- Prefix matching — if two requests share a common prompt prefix, the shared portion is served from cache.
Pricing
Cache hits are billed at a lower rate than cache misses:
| Item | Rate |
|---|---|
| Cache hit | $0.028 / 1M tokens (10x cheaper) |
| Cache miss | $0.28 / 1M tokens (standard input rate) |
The exact rates for each model are listed in the Model catalog.
Monitoring cache effectiveness
The response usage object includes cache hit/miss statistics:
{
"usage": {
"prompt_tokens": 1000,
"completion_tokens": 200,
"total_tokens": 1200,
"prompt_cache_hit_tokens": 800,
"prompt_cache_miss_tokens": 200
}
}
| Field | Description |
|---|---|
prompt_cache_hit_tokens | Number of prompt tokens served from cache (billed at lower rate). |
prompt_cache_miss_tokens | Number of prompt tokens not served from cache (billed at standard rate). |
Cache hit rate
You can calculate the cache hit rate for each request:
hit_rate = prompt_cache_hit_tokens / prompt_tokens
A high hit rate means effective caching and lower costs.
Best practices for maximizing cache hits
- Use consistent prompt prefixes: If you have a system message or few-shot examples that stay the same across requests, keep them at the beginning of the prompt.
- Batch similar requests: Requests with similar prefixes within a short time window are more likely to hit the cache.
- Avoid randomizing the prompt: Random IDs, timestamps, or varying system messages reduce cache effectiveness.
Cache behavior
| Behavior | Description |
|---|---|
| Minimum cache unit | 64 tokens |
| Cache build time | Seconds |
| Cache expiration | Hours to days (automatic) |
| Cache invalidation | Automatic when the cache is stale |
Cost example
A 10,000 token prompt with 8,000 cache hits and 2,000 cache misses:
| Item | Tokens | Rate | Cost |
|---|---|---|---|
| Cache hit | 8,000 | $0.028/M | $0.000224 |
| Cache miss | 2,000 | $0.28/M | $0.000560 |
| Total | 10,000 | — | $0.000784 |
Without caching, the same prompt would cost $0.002800 — a 72% savings with an 80% cache hit rate.