databasePrompt Caching

FastRouter supports prompt caching across all major providers — automatically where providers allow it, and with explicit controls for Anthropic Claude.

Overview

Prompt caching reduces the cost of repeated context — long system prompts, RAG chunks, documents — by charging a fraction of the normal input price on cache hits.

Zero-Config Providers

The following providers cache automatically. No changes to your requests needed.

Provider
Cache write
Cache read

OpenAI

Free

0.25x – 0.50x input

DeepSeek

Same as input

~0.10x input

Google AI Studio

Free

0.10x input

Google Vertex AI

Free

0.10x input

Grok

Free

See provider pricing

Moonshot AI

Free

See provider pricing

Baseten

Free

See provider pricing

OpenAI requires a minimum of 1024 tokens.

Google AI Studio and Vertex AI both support implicit caching on Gemini 2.5 and newer models — no configuration needed. FastRouter keeps your prompt prefixes stable to maximize cache hits. The 0.10x cache-read rate (90% discount) applies to all Gemini 2.5+ models; legacy Gemini 2.0 Flash is discounted at 0.25x. Implicit caches are managed entirely by Google's serving infrastructure with no storage cost to you. TTL is typically 3–5 minutes. To maximize cache hits, keep large static content (system instructions, RAG context, few-shot examples) at the beginning of your prompt and push dynamic content to the end.

Minimum token thresholds before caching applies:

Model
Min tokens

Gemini 2.5 Pro

4,096

Gemini 2.5 Flash

1,024

Gemini 2.5 Flash-Lite

1,024


Anthropic Claude

Anthropic requires you to explicitly mark what should be cached using cache_control. FastRouter supports two approaches.

Add cache_control once at the request root. FastRouter automatically places the cache breakpoint at the last cacheable block and advances it as the conversation grows.

Only works when routed to Anthropic directly.

Option B — Per-block (for precise control)

Place cache_control on individual content blocks. Useful when you have a large stable payload (a document, RAG chunks, a character card) and want to cache exactly that. Maximum 4 breakpoints per request.

Per-block caching works across Anthropic and Vertex.

TTL

TTL
Syntax
Write cost
Read cost

5 min (default)

{ "type": "ephemeral" }

1.25x input

0.10x input

Model minimums

Min tokens
Models

4096

Opus 4.5, 4.6, 4.7 · Haiku 4.5

2048

Sonnet 4.6 · Haiku 3.5

1024

Sonnet 4, 4.5 · Opus 4, 4.1 · Sonnet 3.7


Checking Cache Savings

Every API response includes a prompt_tokens_details object:

cached_tokens > 0 means you're hitting the cache.

You can also check per-request cache usage on the Activity Logs page flyout on the FastRouter dashboard.

Last updated