Cost engineering

Caching is where the money is.

For high-volume Anthropic workloads, prompt caching is the single biggest cost lever you have. Done right, it cuts your repeat-input bill by ~90%. Done wrong, it does nothing or costs more. Here is the actual math.

Widgets in this dispatch

Inline Token Cost. Plug in your real input tokens, output tokens, and request volume to see what each Anthropic tier costs per month with and without caching.
Inline ROI Calc. Caching also kills the engineer-hours your team spends shrinking system prompts. Use the calc near section 4 to translate hours saved into dollars.
Glossary hover. Hover any underlined term for a plain-English definition.

What caching actually is

Anthropic prompt caching lets you mark a chunk of the prompt as cacheable, typically the system prompt or any large static context. The first call writes that block at 1.25x the normal input rate. Every subsequent call within the cache TTL reads the same block at 0.10x the normal rate.

Phrased plainly: if your system prompt costs $1 normally, the first call costs $1.25 to write the cache. Each later call reads the same prompt for $0.10. After your second hit, you are already net positive. After ten hits, you have spent $2.15 instead of $10.

The default TTL is 5 minutes, refreshed on every read. If your traffic clusters tightly into bursts (which most production traffic does), most of your input is reading the cache, not paying full price.

The math, with your numbers

Below is a live calculator. Punch in your typical request shape and toggle caching on or off. The savings show up as a green pill on each model row.

A few patterns worth noting as you play with the inputs:

The bigger the system prompt, the bigger the win. Caching only discounts what you mark as cacheable. Tiny prompts have nothing to discount.
Output cost does not change. Caching only touches input. Output tokens are billed at full price either way.
The relative win is the same across tiers. Whether you run Opus, Sonnet, or Haiku, caching cuts the cached portion to ~10%. Pick the model on capability, not on whether to cache.

When it does not help

Three failure modes show up repeatedly:

Low hit rate. If your traffic is sparse, calls land outside the 5-minute TTL and every request pays the 1.25x write penalty. Below ~20% hit rate, caching makes you slower and more expensive. Measure before you ship.
Prompt churn. If you change the system prompt on every request, you cannot cache it. Yes, that includes injecting fresh user data into the system prompt. Move dynamic content to the user message and keep the system prompt static.
Trivial system prompts. A 200-token system prompt is not worth caching. The 1.25x write fee on something that small is rounding error either way. Caching pays off when you have multi-thousand-token system prompts or large attached context.

The hidden engineer-hours win

The cost story is half of why caching matters. The other half is that it removes pressure on your engineers to keep shrinking the system prompt to save money.

Without caching, every token in your system prompt costs you on every call. That creates real incentive to compress, abbreviate, drop examples, drop guardrails. Half of those compressions hurt quality and you do not notice until production. With caching, that pressure goes away. You can write the system prompt the way the model actually performs best with, including the long examples and the careful instructions.

Translate the saved engineer time into annual dollars:

For most teams the engineer-hours number is comparable in size to the token-cost savings. The point is not which one is bigger. The point is that they stack.

Want to see the full picture, including embedding cost and forecasting over time, in one place? Open the Cost Lab.