If you've shipped anything on top of an LLM API in the last twelve months, you already know that model selection is the highest-leverage cost decision available to you. What's less obvious is how much of that leverage gets left on the table by teams using outdated comparison tables, defaulting to the model they tried first, or skipping the prompt-caching configuration that would cut their bill by 60% in an afternoon.

We update this comparison quarterly because the underlying pricing changes faster than industry blogs can keep up with. Six months ago, prompt caching was an experimental feature on one provider. Today it is standard pricing across the three majors, and it reshapes the cost equation enough that any comparison omitting it produces misleading numbers. The approach we call the 4-Tier Selection Framework, presented at the end of this article, is our attempt to put the decision-making on a defensible footing: choose by tier, layer in discounts, switch when the economics demand it.

The Pricing Matrix: 23 Models, 7 Providers, May 2026

The tables below show list prices in USD per million tokens. Three reading notes before you scan them: input and output are priced separately, and the output-to-input ratio matters enormously for total cost (output is typically 3-5× the input rate, and about 8× on Gemini 2.5 Pro and Flash); the cache column shows the price of read tokens once a cached prompt prefix is established; and context-window size, while mostly irrelevant to per-token cost (Gemini 2.5 Pro's surcharge above 200K tokens is the exception here), determines whether a workload can run on a given model at all.

OpenAI

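| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| GPT-5 | $10.00 | $2.50 (75% off) | $40.00 | 200K |
| GPT-5-mini | $0.60 | $0.15 | $2.40 | 128K |
| GPT-5-nano | $0.15 | $0.04 | $0.60 | 128K |
| GPT-4.1 (legacy) | $2.50 | $1.25 | $10.00 | 1M |
| o3-pro (reasoning) | $20.00 | $5.00 | $80.00 | 200K |
| o4-mini (reasoning) | $1.10 | $0.28 | $4.40 | 200K |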

Anthropic

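| Model | Input | Cached Input (read) | Output | Context |
|---|---|---|---|---|
| Claude Opus 4.7 | $15.00 | $1.50 (90% off) | $75.00 | 200K |
| Claude Opus 4.6 | $15.00 | $1.50 | $75.00 | 200K |
| Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 | 200K |
| Claude Haiku 4.5 | $0.80 | $0.08 | $4.00 | 200K |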

Google

| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 (<200K) / $2.50 (>200K) | $0.31 | $10.00 / $15.00 | 2M |
| Gemini 2.5 Flash | $0.30 | $0.075 | $2.50 | 1M |
| Gemini 2.5 Flash-Lite | $0.10 | $0.025 | $0.40 | 1M |

Meta (via hosted providers; Llama models)

| Model | Input (Together) | Input (DeepInfra) | Output | Context |
|---|---|---|---|---|
| Llama 3.3 405B | $3.50 | $2.80 | $3.50 | 128K |
| Llama 3.3 70B | $0.88 | $0.34 | $0.88 | 128K |
| Llama 3.3 8B | $0.18 | $0.05 | $0.18 | 128K |

xAI

| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| Grok 4 | $3.00 | $0.75 | $15.00 | 256K |
| Grok 4 mini | $0.30 | $0.07 | $1.50 | 131K |

DeepSeek (China-based, often dramatically cheaper)

| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| DeepSeek V3.5 | $0.27 | $0.07 | $1.10 | 128K |
| DeepSeek-R1 (reasoning) | $0.55 | $0.14 | $2.19 | 128K |

Mistral

| Model | Input | Output | Context |
|---|---|---|---|
| Mistral Large 2 | $2.00 | $6.00 | 128K |
| Mistral Small 3 | $0.20 | $0.60 | 128K |
| Codestral 2 | $0.30 | $0.90 | 256K |

Where the 211× Spread Actually Lives

Sorting models by blended list-price cost for a typical 70/30 input-to-output workload (1M tokens total, no caching or batch discounts) makes the spread visceral:

| Model | Cost per 1M tokens (70/30) | vs cheapest |
|---|---|---|
| Llama 3.3 8B (Together) | $0.18 | 1.0× |
| Gemini 2.5 Flash-Lite | $0.19 | 1.1× |
| GPT-5-nano | $0.29 | 1.6× |
| Mistral Small 3 | $0.32 | 1.8× |
| DeepSeek V3.5 | $0.52 | 2.9× |
| Gemini 2.5 Flash | $0.96 | 5.3× |
| GPT-5-mini | $1.14 | 6.3× |
| Claude Haiku 4.5 | $1.76 | 9.8× |
| Gemini 2.5 Pro | $3.88 | 21.6× |
| Claude Sonnet 4.6 | $6.60 | 36.7× |
| Grok 4 | $6.60 | 36.7× |
| GPT-5 | $19.00 | 105.6× |
| Claude Opus 4.7 | $33.00 | 183.3× |
| o3-pro (reasoning) | $38.00 | 211.1× |

The cheapest call in this table (Llama 3.3 8B on Together at $0.18 per million tokens) is roughly 211× cheaper than the most expensive (o3-pro at $38.00). That is the headline gap, and it should make any engineer who hasn't recently audited their model selection at least mildly uneasy. The honest follow-up question is whether the expensive end of the spectrum is delivering 211× the value, and the answer for most production workloads is unambiguously no.
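
As a rough check on the ranking, the blended figure can be recomputed from the list prices. The sketch below assumes a flat 70/30 input/output split with no caching or batch discount; `blended_cost` and the `PRICES` dictionary are illustrative names, and only four of the models are included.

```python
# List prices in USD per million tokens (input, output), copied from the
# provider tables above. The Llama entry uses the Together input price.
PRICES = {
    "Llama 3.3 8B (Together)": (0.18, 0.18),
    "GPT-5-nano": (0.15, 0.60),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "o3-pro": (20.00, 80.00),
}


def blended_cost(input_price, output_price, input_share=0.70):
    """USD for 1M tokens split input_share / (1 - input_share)."""
    return input_share * input_price + (1.0 - input_share) * output_price


# Sort cheapest first and print a small ranking.
ranked = sorted((blended_cost(i, o), name) for name, (i, o) in PRICES.items())
for cost, name in ranked:
    print(f"{name:26s} ${cost:6.2f}")
```

Changing `input_share` shows how sensitive the ranking is to output-heavy workloads, since output rates are several times the input rates for most models.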

The Hidden Discount Multipliers

Prompt Caching: 60-90% off Input Tokens

OpenAI, Anthropic, and Google all now offer prompt caching — when you send the same prompt prefix multiple times (system prompts, RAG context, conversation history), the cached portion costs dramatically less. The discount varies:

  • Anthropic: 90% off cached input (cache read), but cache writes cost 25% extra. The net saving on a cached prefix approaches 90% with heavy reuse (about 88% at roughly 60 uses per write).
  • OpenAI: 75% off cached input. No write penalty.
  • Google: 75% off cached input, similar mechanics.

For a chatbot with a 5,000-token system prompt and RAG-style document context, where 80% of input tokens are reusable, prompt caching reduces total input cost by roughly 60-70%. For a simple Q&A bot with minimal context reuse, the benefit is closer to 10-20%.
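
The arithmetic behind these percentages can be sketched in a few lines. This is a simplified model, not any provider's billing logic: `input_savings` ignores cache-write charges, and `avg_multiplier_with_writes` assumes a prefix is written once at a 25% premium and then read at 10% of list price, as in the Anthropic figures above. Both function names are illustrative.

```python
def input_savings(cached_fraction, read_discount):
    """Fraction of input spend saved when cached_fraction of input tokens
    is billed at (1 - read_discount) of list price. Ignores cache writes."""
    return cached_fraction * read_discount


def avg_multiplier_with_writes(uses, write_multiplier=1.25, read_multiplier=0.10):
    """Average price multiplier for a prefix that is written once and then
    read on each of the remaining (uses - 1) requests."""
    return (write_multiplier + read_multiplier * (uses - 1)) / uses


print(input_savings(0.80, 0.75))            # 80% reusable, 75% read discount
print(input_savings(0.80, 0.90))            # 80% reusable, 90% read discount
print(1 - avg_multiplier_with_writes(2))    # saving when a prefix is used twice
print(1 - avg_multiplier_with_writes(60))   # saving when a prefix is used 60 times
```

With 80% of input tokens cached, the saving on input spend is 60% at a 75% read discount and 72% at a 90% discount. Under the write-premium assumption, a prefix used only twice saves about 33%, and the saving passes 88% at around 60 uses.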

Batch API: 50% off Both Input and Output

OpenAI, Anthropic, and Google all offer batch processing — submit requests in bulk, accept 24-hour turnaround, pay half the normal rate. Use cases: bulk classification, data labeling, content moderation, scheduled report generation. Not usable for real-time anything, but for offline processing the discount is automatic and substantial.

Combined cache + batch can produce 85-95% cost reduction for the right workload — a 5,000-token system prompt processed against 100,000 documents in batch mode with caching costs perhaps 5% of what naive real-time processing would cost.
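
How far the combined discount goes depends on how much of each request is the shared prefix. The sketch below works one hypothetical case: the 5,000-token prefix from the text, plus assumed values of 500 tokens per document and 50 output tokens, at the Claude Sonnet 4.6 list prices. It assumes the batch and cache discounts stack and ignores cache-write charges; check your provider's documentation before relying on either assumption.

```python
def batch_cached_fraction(prefix_tokens, doc_tokens, output_tokens,
                          input_price, output_price,
                          read_multiplier, batch_multiplier=0.5):
    """Optimized cost as a fraction of naive real-time cost for one request.

    The shared prefix is billed at the cache-read rate, everything is
    billed at the batch rate, and cache writes are ignored.
    """
    naive = (prefix_tokens + doc_tokens) * input_price + output_tokens * output_price
    optimized = batch_multiplier * (
        prefix_tokens * read_multiplier * input_price
        + doc_tokens * input_price
        + output_tokens * output_price
    )
    return optimized / naive


# Hypothetical workload: 500-token documents, 50-token structured outputs.
fraction = batch_cached_fraction(5_000, 500, 50, 3.00, 15.00, read_multiplier=0.10)
print(f"optimized cost is {fraction:.1%} of naive cost")
```

Under these assumptions the optimized run costs about 11% of the naive one. Shorter documents push the figure toward the 5% floor (batch rate times cache-read rate); longer documents or longer outputs push it up.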

Self-Hosted Open Models: 100% off API Pricing, Real Infrastructure Cost

Llama 3.3, Mixtral, DeepSeek, and many other open-weight models can run on your own hardware. The "free" framing is misleading — you pay for GPU compute, but for high-volume workloads the math often favors self-hosting:

  • 4×A100 GPU server (suitable for 70B model serving): ~$15K/month on cloud, ~$3K/month on dedicated bare metal.
  • Throughput: typically 50-150 requests/second with 1,000 token average response, depending on model and optimization.
  • Cost per million tokens at full utilization: roughly $0.04-0.10 for 70B models, vs $0.88 on hosted Llama via Together.

Self-hosting wins on cost when throughput is consistent (utilization above 30-40%). API hosting wins for variable, spiky, or low-volume workloads where idle GPUs are pure waste.
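
The per-token figure for a fixed-cost server follows directly from throughput and utilization. The sketch below takes the throughput numbers in the bullets above at face value, counts only generated tokens, and assumes a 30-day month; `self_hosted_cost_per_million` is an illustrative helper, not a benchmark.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month


def self_hosted_cost_per_million(monthly_cost_usd, requests_per_second,
                                 tokens_per_request, utilization=1.0):
    """USD per million generated tokens for a server with a fixed monthly cost."""
    tokens_per_month = (requests_per_second * tokens_per_request
                        * SECONDS_PER_MONTH * utilization)
    return monthly_cost_usd / (tokens_per_month / 1_000_000)


# Cloud server at $15K/month, 150 requests/second, 1,000 tokens per response.
cloud_full = self_hosted_cost_per_million(15_000, 150, 1_000)
cloud_35 = self_hosted_cost_per_million(15_000, 150, 1_000, utilization=0.35)
print(f"full utilization: ${cloud_full:.3f}/M, 35% utilization: ${cloud_35:.3f}/M")
```

Because the server cost is fixed, the per-token price scales inversely with utilization, which is why idle capacity erodes the advantage so quickly.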

Five Production Cost Scenarios

Headline pricing is meaningless without context. Here are five representative production scenarios with estimated monthly costs across major models.

Scenario 1: Customer Support Chatbot

Usage profile: 50,000 conversations per month, average 8 turns per conversation, 200 tokens per user message, 300 tokens per assistant response, 1,500-token system prompt.

Per conversation: 8 turns × (200 user + 300 assistant) = 4,000 tokens, plus the system prompt loaded on each turn = 1,500 × 8 = 12,000 tokens. Total ~16,000 tokens per conversation, about 75% of which is the repeated system prompt. With caching, the system prompt is served mostly from cache after turn 1.

Monthly total: 800M tokens (~600M input, 200M output).

| Model | Monthly cost | Notes |
|---|---|---|
| GPT-5 | $14,000 | Without caching |
| GPT-5 with caching | $5,200 | System prompt cached |
| Claude Sonnet 4.6 with caching | $2,250 | 90% cache discount kicks in heavy |
| GPT-5-mini with caching | $315 | Excellent for support tasks |
| Claude Haiku 4.5 with caching | $420 | Strong support model |
| DeepSeek V3.5 | $345 | Quality varies; test for your domain |
| Llama 3.3 70B self-hosted | $3,000 | Includes infrastructure |

For this workload, GPT-5-mini with caching at $315/month is the sweet spot for most quality requirements. Self-hosted 70B becomes cost-competitive only at 5× this scale.
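
The token volumes for this scenario can be rebuilt from the usage profile. The sketch below follows the same simplification as the text (the system prompt is counted once per turn, and resent chat history is ignored) and then prices the uncached GPT-5 case using the rounded 600M/200M split stated above. The names are illustrative, and the cached rows depend on a cache-hit assumption that is not modeled here.

```python
CONVERSATIONS = 50_000
TURNS = 8
USER_TOKENS, ASSISTANT_TOKENS, SYSTEM_TOKENS = 200, 300, 1_500

# Tokens per conversation: each turn carries one user message, one
# assistant response, and one copy of the system prompt.
per_conversation = TURNS * (USER_TOKENS + ASSISTANT_TOKENS + SYSTEM_TOKENS)
monthly_tokens_m = CONVERSATIONS * per_conversation / 1_000_000


def monthly_cost(input_m, output_m, input_price, output_price):
    """USD per month; volumes in millions of tokens, prices per million."""
    return input_m * input_price + output_m * output_price


# Rounded split from the text: ~600M input, ~200M output, GPT-5 list prices.
gpt5_uncached = monthly_cost(600, 200, 10.00, 40.00)
print(per_conversation, monthly_tokens_m, gpt5_uncached)
```

Swapping in your own message lengths and turn counts is usually more informative than the generic profile, since the system-prompt share drives how much caching can save.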

Scenario 2: Code Generation Tool

Usage profile: 10,000 developers, 30 completions per developer per day, 500 token average input (file context), 200 token average output (completion).

Daily tokens: 10,000 × 30 × 700 = 210M tokens (150M input, 60M output; 15,000 input and 6,000 output per developer). Monthly, at 30 days: 6.3B tokens.

| Model | Monthly cost | Notes |
|---|---|---|
| GPT-5 | $67,500 | Premium quality |
| Claude Sonnet 4.6 | $22,500 | Strong for code |
| GPT-5-mini | $3,420 | Good for completions |
| Codestral 2 | $1,485 | Code-specialized model |
| DeepSeek V3.5 | $2,030 | Strong code performance |

Scenario 3: RAG-Based Knowledge Application

Usage profile: 100,000 queries per month, average 4,000 tokens retrieved context, 200 token question, 500 token answer. Knowledge base mostly stable.

Per query: 4,200 input + 500 output. With aggressive caching of frequently-retrieved chunks, ~60% of input tokens cached on average.

| Model | Monthly cost (with caching) |
|---|---|
| Claude Sonnet 4.6 | $1,920 |
| GPT-5 | $3,840 |
| Gemini 2.5 Pro | $1,250 |
| GPT-5-mini | $280 |
| Gemini 2.5 Flash | $220 |

RAG workloads benefit dramatically from caching because retrieved chunks repeat across queries. Gemini 2.5 Flash with caching is the cost leader for medium-quality requirements.
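
One way to reason about a partial cache-hit rate is to fold it into a single effective input price. The sketch below applies the 60% cached share from this scenario to the Gemini 2.5 Flash list prices; `effective_input_price` is an illustrative helper, and cache-write or storage charges are not modeled.

```python
def effective_input_price(input_price, cached_price, cached_share):
    """Average USD per million input tokens when cached_share of them
    is billed at the cached rate and the rest at the list rate."""
    return (1 - cached_share) * input_price + cached_share * cached_price


QUERIES = 100_000
INPUT_TOKENS_PER_QUERY = 4_000 + 200  # retrieved context plus question
CACHED_SHARE = 0.60

input_m = QUERIES * INPUT_TOKENS_PER_QUERY / 1_000_000  # millions of tokens
cached_m = input_m * CACHED_SHARE
flash_effective = effective_input_price(0.30, 0.075, CACHED_SHARE)
print(input_m, cached_m, round(flash_effective, 4))
```

The effective price makes it easy to compare models at different hit rates: a model with a steeper cache discount gains more as `CACHED_SHARE` rises.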

Scenario 4: Bulk Document Classification

Usage profile: 1M documents per month, average 2,000 tokens per document, structured output ~100 tokens. Can use batch API (24h turnaround acceptable).

Total: 2.1B tokens. Batch API gets 50% off.

| Model | Monthly cost (batch) |
|---|---|
| GPT-5 | $10,750 |
| GPT-5-mini | $645 |
| Claude Haiku 4.5 | $870 |
| Gemini 2.5 Flash-Lite | $120 |
| DeepSeek V3.5 | $330 |

Classification tasks rarely need frontier reasoning. Gemini 2.5 Flash-Lite in batch mode is hard to beat for cost.
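
The batch calculation is a straightforward halving of the list-price total. The sketch below reproduces the Gemini 2.5 Flash-Lite figure from the usage profile, assuming a flat 50% batch discount on both input and output; `batch_monthly_cost` is an illustrative name.

```python
def batch_monthly_cost(docs, input_tokens_per_doc, output_tokens_per_doc,
                       input_price, output_price, batch_multiplier=0.5):
    """USD per month for a batch job; prices are per million tokens."""
    input_m = docs * input_tokens_per_doc / 1_000_000
    output_m = docs * output_tokens_per_doc / 1_000_000
    return batch_multiplier * (input_m * input_price + output_m * output_price)


# 1M documents, 2,000 input and 100 output tokens each, Flash-Lite list prices.
flash_lite = batch_monthly_cost(1_000_000, 2_000, 100, 0.10, 0.40)
print(f"${flash_lite:,.0f} per month")
```

Setting `batch_multiplier=1.0` gives the real-time cost for the same job, which is a quick way to see what the 24-hour turnaround is worth.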

Scenario 5: AI Coding Agent (Heavy Output)

Usage profile: 100,000 agentic task executions per month, average 50,000 tokens input (large codebase context), 10,000 tokens output (multi-step solution).

Total: 6B tokens (5B input, 1B output). Input dominates by volume, but once the input is cached, the 10,000 output tokens per task account for most of the bill.

| Model | Monthly cost (with caching) | Notes |
|---|---|---|
| Claude Opus 4.7 | $82,500 | Premium coding |
| GPT-5 | $67,500 | Strong agent |
| o3-pro (reasoning) | $130,000 | Deep reasoning |
| Claude Sonnet 4.6 | $16,500 | Sweet spot |
| DeepSeek-R1 | $2,915 | Open reasoning model |
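
To see where the money goes in an agent workload, it helps to split the bill into input and output. The sketch below assumes every input token is billed at the cache-read rate, an optimistic assumption under which the Claude Opus 4.7 and Claude Sonnet 4.6 rows above are reproduced; `cost_split` is an illustrative name.

```python
def cost_split(input_m, output_m, cached_input_price, output_price):
    """Return (total USD, output share of total) for volumes in millions
    of tokens, with all input billed at the cached rate."""
    input_cost = input_m * cached_input_price
    output_cost = output_m * output_price
    total = input_cost + output_cost
    return total, output_cost / total


# 5B input tokens and 1B output tokens per month.
opus_total, opus_output_share = cost_split(5_000, 1_000, 1.50, 75.00)
sonnet_total, sonnet_output_share = cost_split(5_000, 1_000, 0.30, 15.00)
print(f"Opus 4.7: ${opus_total:,.0f}, output share {opus_output_share:.0%}")
print(f"Sonnet 4.6: ${sonnet_total:,.0f}, output share {sonnet_output_share:.0%}")
```

With the input fully cached, output accounts for roughly 91% of the bill on both models, so trimming output verbosity matters more here than further input optimization.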

Pricing Trends: What's Changed in 2025-2026

Three structural shifts have hit AI pricing since mid-2024:

1. The Mini/Nano Tier Has Caught Up

GPT-5-mini, Claude Haiku 4.5, and Gemini 2.5 Flash are now capable of tasks that required frontier models 18 months ago, which means cost reductions of 90-95% for similar quality on most business tasks. The mini tier is now the default starting point; the frontier tier is for genuinely hard tasks.

2. Prompt Caching Has Moved From Experimental to Universal

All three major providers now offer prompt caching as a standard pricing tier. For any workload with reusable context (RAG, long system prompts, conversation history), caching is no longer optional optimization — it's baseline cost management.

3. Open Models Are Production-Ready

Llama 3.3 405B and DeepSeek V3.5 deliver frontier-adjacent quality at 5-10× lower cost than proprietary frontier models. For cost-sensitive applications with adequate engineering, self-hosted open models are now the rational choice. The barrier is operational, not capability.

The 4-Tier Selection Framework

Most teams don't optimize model selection because the time investment doesn't feel commensurate with the savings. On annual API spend below $20K, they're often right. Above that threshold, the math inverts — a 50% cost reduction on a $200K annual bill funds an engineer for half a year, which makes optimization one of the highest-ROI activities engineering can do. The framework we use to decide what to run, in order:

  1. Start mini-tier. GPT-5-mini, Claude Haiku 4.5, or Gemini 2.5 Flash. Test against the actual production workload, not a curated benchmark. Be honest about where output quality is genuinely insufficient versus where it merely feels insufficient because frontier model output is familiar.
  2. If quality is insufficient, step up one tier, not three. Sonnet 4.6 or Gemini 2.5 Pro is the natural intermediate. Skip directly to Opus 4.7 or GPT-5 only when there's clear measurable justification, not because "the better model probably does it better."
  3. Enable caching before you do anything else. For any workload with reusable prompt prefixes — RAG systems, long system prompts, conversation history — caching is no longer optional. It is the baseline.
  4. Move offline workloads to batch. Bulk classification, scheduled reports, content moderation, data labeling. Any workload that doesn't need synchronous response should be on batch API for the automatic 50% discount.
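
The four steps above can be sketched as a small decision helper. This is a toy illustration of the ordering, not a tool we ship: the tier labels, the `Workload` fields, and both function names are hypothetical, and the quality verdict is assumed to come from your own evaluation on production data.

```python
from dataclasses import dataclass

# Hypothetical tier labels, ordered cheapest first.
TIERS = ["mini", "mid", "frontier"]


@dataclass
class Workload:
    reusable_prefix: bool  # RAG context, long system prompt, chat history
    needs_realtime: bool   # False when a 24-hour turnaround is acceptable


def next_tier(current, quality_ok):
    """Stay put if quality passed; otherwise step up exactly one tier."""
    if quality_ok:
        return current
    idx = TIERS.index(current)
    return TIERS[min(idx + 1, len(TIERS) - 1)]


def discounts(workload):
    """List the discount mechanisms that apply to this workload."""
    chosen = []
    if workload.reusable_prefix:
        chosen.append("prompt caching")
    if not workload.needs_realtime:
        chosen.append("batch API")
    return chosen


job = Workload(reusable_prefix=True, needs_realtime=False)
print(next_tier("mini", quality_ok=False), discounts(job))
```

The point of `next_tier` moving one step at a time is to force a fresh evaluation at each tier rather than jumping straight to the most expensive option.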

For high-volume, consistent workloads where engineering capacity allows it, self-hosting open models (Llama 3.3, DeepSeek V3.5, Mixtral) becomes economically rational above roughly $10K/month in API spend. Below that threshold, the operational complexity exceeds the savings. The break-even depends heavily on workload variance — spiky traffic kills self-hosting economics; steady traffic rewards them.

A naively chosen frontier model and a thoughtfully chosen mini-tier model with caching and batch can differ in cost by 50-100× for the same task. On annual budgets of $50K-500K, this optimization is the kind of work that quietly pays for senior engineering hires.

Run Your Own Numbers

The estimates above use generalized workload profiles. Your actual workload — token distributions, cache hit ratios, output verbosity — will deviate. Three tools below let you plug in your own numbers:

  • AI Inference Cost Calculator — model API spend across providers with custom input/output ratios. The calculator also includes a self-hosted scenario mode for break-even analysis.
  • ChatGPT Token Counter — estimate token counts from prompt text before committing to a model.
  • AI Training Cost Calculator — for fine-tuning or full training cost projections, which run under different economics than inference.

This piece reflects pricing verified May 9-14, 2026. We rebuild it quarterly because the underlying numbers move faster than the blog ecosystem keeps up with. If a provider has shipped a major price change since the date stamp, treat their official documentation as canonical.

Sources

  • OpenAI Platform Pricing — platform.openai.com/docs/pricing
  • Anthropic API Pricing — docs.anthropic.com/en/docs/about-claude/models
  • Google AI Studio Pricing — ai.google.dev/pricing
  • Together AI Pricing — together.ai/pricing
  • DeepInfra Pricing — deepinfra.com/pricing
  • DeepSeek API Pricing — api-docs.deepseek.com/quick_start/pricing
  • Mistral La Plateforme Pricing — docs.mistral.ai/platform/pricing
  • xAI Developer Console — docs.x.ai

All pricing data verified May 9-14, 2026. Self-hosting cost estimates based on typical AWS p4d/p5 instance pricing and Hyperstack/CoreWeave dedicated GPU pricing. Throughput estimates based on vLLM, TensorRT-LLM, and SGLang benchmarks for the relevant model families.

Found a pricing error or want to add a provider? Email hello.goledigitalstudio@gmail.com. We update this article quarterly.