Buyer literacy

We evaluate every vendor with four prompts.

Generic benchmarks tell you how a model does on someone else's eval. They do not tell you how it does on yours. We use the same four prompts on every vendor before we recommend one. They cover four different failure modes, and they are short.

Why four prompts, and which four

Four is enough to surface the failure modes that matter without becoming a meeting. More than four and we are just running benchmarks again. Fewer and we miss whole categories. The four cover:

Strict structured output. Can the model emit the exact JSON shape we need, every time, without commentary?
Refusal under reasonable pretext. Does the model refuse a benign task because of an over-eager safety prior? Over-refusal costs you whole categories of automation.
Long-context recall. When we hand it 30 KB of context, does it actually find the answer or hallucinate one?
Tone and brevity. Asked for two sentences, does it return two? Asked to sound like an operator, does it sound like a sales deck?
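The strict-structured-output check is the easiest of the four to automate. A minimal sketch, assuming a hypothetical three-field target shape (the field names and the `is_strict_json` helper are illustrative, not our actual schema): the response passes only if the raw text parses as a single JSON object with exactly the expected keys and types, so any preamble, code fence, or trailing commentary fails.

```python
import json

# Hypothetical target shape (an assumption for illustration): field -> required type.
EXPECTED_SHAPE = {"vendor": str, "score": int, "notes": str}


def is_strict_json(raw: str, shape: dict = EXPECTED_SHAPE) -> bool:
    """Return True only if raw is one JSON object with exactly the expected keys and types."""
    try:
        parsed = json.loads(raw)  # fails on preambles, code fences, trailing commentary
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    if set(parsed) != set(shape):  # no missing keys, no extra keys
        return False
    # Exact type match, so a JSON boolean does not pass as an int.
    return all(type(parsed[key]) is expected for key, expected in shape.items())
```

Running the same check over every response at each temperature gives a pass rate per vendor, which is closer to "every time" than a single sample.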

Pick four prompts that cover the four failure modes your workload exposes. Ours are below. Yours will differ.

See the responses

These are real-shape responses recorded from each model on the prompt class. Switch between the four to see how each vendor handles each kind of ask.

How we score

Each prompt gets a 1 to 5 score per vendor on a single criterion: did it produce something we could ship? Not 'is the answer correct'. Not 'is the writing nice'. Did the output land where production expects it?

5: Drop-in usable. The pipeline downstream of this would not need to do extra work.
4: Usable with one trivial transform (strip a code fence, trim a preamble).
3: Usable with non-trivial fixup (manual review or a second LLM pass).
2: Refused, hedged, or wandered off.
1: Hallucinated something it should have said it did not know.
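The rubric can be written down as a small lookup so that every reviewer applies it the same way. This is a sketch under stated assumptions: the three inputs are judgments a human reviewer supplies, the names `ShipScore` and `score_response` are hypothetical, and checking hallucination before refusal is our reading of the scale's ordering rather than a rule stated above.

```python
from enum import IntEnum


class ShipScore(IntEnum):
    HALLUCINATED = 1         # invented something it should have said it did not know
    REFUSED_OR_WANDERED = 2  # refused, hedged, or wandered off
    NONTRIVIAL_FIXUP = 3     # needs manual review or a second LLM pass
    TRIVIAL_TRANSFORM = 4    # strip a code fence, trim a preamble
    DROP_IN = 5              # downstream pipeline needs no extra work


def score_response(hallucinated: bool, refused_or_wandered: bool, fixup: str) -> ShipScore:
    """Map a reviewer's judgments to the 1-5 scale.

    fixup is one of "none", "trivial", "nontrivial" (labels assumed for this sketch).
    """
    if hallucinated:  # assumed precedence: the worst outcome wins
        return ShipScore.HALLUCINATED
    if refused_or_wandered:
        return ShipScore.REFUSED_OR_WANDERED
    by_fixup = {
        "none": ShipScore.DROP_IN,
        "trivial": ShipScore.TRIVIAL_TRANSFORM,
        "nontrivial": ShipScore.NONTRIVIAL_FIXUP,
    }
    return by_fixup[fixup]
```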

We run each prompt at temperature 0 and at temperature 0.7. The 0 run shows the model's default. The 0.7 run shows the variance we will see across users in production. A model with a good score at 0 and a wide spread of scores at 0.7 is a flag; a model that is tight at both is a buy.
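The flag described here can be computed once the scores are recorded. A minimal sketch, assuming several scored samples at temperature 0.7 and two thresholds that are our own placeholders (a "good" score at 0 means 4 or higher, a "wide" spread means more than one point); the function name `summarize_runs` is hypothetical.

```python
from statistics import mean


def summarize_runs(t0_score, t07_scores, good_at_zero=4, max_spread=1):
    """Summarize one prompt for one vendor across the two temperature settings.

    t0_score:   the 1-5 score of the temperature-0 run.
    t07_scores: the 1-5 scores of repeated temperature-0.7 runs.
    Thresholds are assumptions for this sketch, not values from the rubric.
    """
    spread = max(t07_scores) - min(t07_scores)
    return {
        "t0": t0_score,
        "t07_mean": mean(t07_scores),
        "t07_spread": spread,
        # Good default behavior but unstable under sampling: worth a closer look.
        "flag": t0_score >= good_at_zero and spread > max_spread,
    }
```

For example, `summarize_runs(5, [5, 2, 4])` reports a spread of 3 and sets the flag, while `summarize_runs(5, [5, 5, 4])` does not.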

The cost side

Quality is half the picture. The other half is what running this in production will cost. The four prompts above land at roughly 1500 input tokens and 200 output tokens each. With caching on, here is the monthly bill across the three Anthropic tiers for 1000 production runs per day:

Two patterns: Sonnet 4.6 is the right default for this kind of mixed workload (structured output + tone + light reasoning). Haiku 4.5 handles the structured-output prompt at ~1/3 the cost; we route to it on the high-volume path and fall back to Sonnet on edge cases.
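The token volume behind a monthly bill is simple arithmetic, sketched below. The assumptions are ours: one production run is one call of about 1500 input and 200 output tokens, a month is 30 days, and caching discounts are ignored. The per-million-token prices are parameters you fill in from the vendor's current price list; the sketch deliberately does not hard-code any.

```python
def monthly_tokens(runs_per_day=1000, input_tokens=1500, output_tokens=200, days=30):
    """Return (input, output) tokens per month, assuming one call per run."""
    calls = runs_per_day * days
    return calls * input_tokens, calls * output_tokens


def monthly_cost(input_price_per_mtok, output_price_per_mtok, **volume):
    """Monthly cost in the currency of the prices given, with no caching discount.

    Prices are per million tokens and must come from the vendor's price list.
    """
    tokens_in, tokens_out = monthly_tokens(**volume)
    return (tokens_in / 1_000_000 * input_price_per_mtok
            + tokens_out / 1_000_000 * output_price_per_mtok)


# Volume at the defaults: 45 million input and 6 million output tokens per month.
print(monthly_tokens())  # (45000000, 6000000)
```

Calling `monthly_cost` once per tier with that tier's prices gives the comparison; with caching on, the input side will come in lower than this estimate.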

Our current recommendation

For most production workloads we ship today, Claude Sonnet 4.6 wins on this eval. GPT-5 is competitive and the right pick for some shapes (anything where 400K context is not enough, or where its pretty-printed JSON is preferred). Gemini 2.5 Pro is improving but currently lands in the soft-refusal zone often enough that we would not put it on a customer-facing path without serious system-prompt steering.

Run your own four prompts. Do not run ours. The whole point is that 'best model' is workload-specific, and the only way to know is to put your prompts in front of each vendor and look at the output.