If you've shipped anything on top of an LLM API in the last twelve months, you already know that model selection is the highest-leverage cost decision available to you. What's less obvious is how much of that leverage gets left on the table by teams using outdated comparison tables, defaulting to the model they tried first, or skipping the prompt-caching configuration that would cut their bill by 60% in an afternoon.
We update this comparison quarterly because the underlying pricing changes faster than industry blogs can keep up with. Six months ago, prompt caching was an experimental feature on one provider. Today it's standard pricing across the three majors and reshapes the cost equation enough that any comparison that omits it is producing misleading numbers. The model we call the 4-Tier Selection Framework at the end of this article is our attempt to put the decision-making on a defensible footing — choose by tier, layer in discounts, switch when economics demand it.
The Pricing Matrix: 23 Models, 7 Providers, May 2026
The tables below show list prices in USD per million tokens. Three reading notes before you scan them: input and output are priced separately, and the output-to-input ratio matters enormously for total cost (in these tables output runs 3-5× the input rate for most models and about 8× for Gemini 2.5 Pro and Flash); the cache column shows the price of read tokens once a cached prompt prefix is established; and context-window size, while irrelevant to per-token cost, determines whether a workload can run on a given model at all.
OpenAI
| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| GPT-5 | $10.00 | $2.50 (75% off) | $40.00 | 200K |
| GPT-5-mini | $0.60 | $0.15 | $2.40 | 128K |
| GPT-5-nano | $0.15 | $0.04 | $0.60 | 128K |
| GPT-4.1 (legacy) | $2.50 | $1.25 | $10.00 | 1M |
| o3-pro (reasoning) | $20.00 | $5.00 | $80.00 | 200K |
| o4-mini (reasoning) | $1.10 | $0.28 | $4.40 | 200K |
Anthropic
| Model | Input | Cached Input (read) | Output | Context |
|---|---|---|---|---|
| Claude Opus 4.7 | $15.00 | $1.50 (90% off) | $75.00 | 200K |
| Claude Opus 4.6 | $15.00 | $1.50 | $75.00 | 200K |
| Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 | 200K |
| Claude Haiku 4.5 | $0.80 | $0.08 | $4.00 | 200K |
Google
| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 (<200K) / $2.50 (>200K) | $0.31 | $10.00 / $15.00 | 2M |
| Gemini 2.5 Flash | $0.30 | $0.075 | $2.50 | 1M |
| Gemini 2.5 Flash-Lite | $0.10 | $0.025 | $0.40 | 1M |
Meta (via hosted providers; Llama models)
| Model | Input (Together) | Input (DeepInfra) | Output | Context |
|---|---|---|---|---|
| Llama 3.3 405B | $3.50 | $2.80 | $3.50 | 128K |
| Llama 3.3 70B | $0.88 | $0.34 | $0.88 | 128K |
| Llama 3.3 8B | $0.18 | $0.05 | $0.18 | 128K |
xAI
| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| Grok 4 | $3.00 | $0.75 | $15.00 | 256K |
| Grok 4 mini | $0.30 | $0.07 | $1.50 | 131K |
DeepSeek (China-based, often dramatically cheaper)
| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| DeepSeek V3.5 | $0.27 | $0.07 | $1.10 | 128K |
| DeepSeek-R1 (reasoning) | $0.55 | $0.14 | $2.19 | 128K |
Mistral
| Model | Input | Output | Context |
|---|---|---|---|
| Mistral Large 2 | $2.00 | $6.00 | 128K |
| Mistral Small 3 | $0.20 | $0.60 | 128K |
| Codestral 2 | $0.30 | $0.90 | 256K |
Where the 211× Spread Actually Lives
Sorting models by total cost for a typical 70/30 input-to-output workload (1M tokens total, list prices, no caching) makes the spread visceral:
| Model | Cost per 1M tokens (70/30) | vs cheapest |
|---|---|---|
| Llama 3.3 8B (Together) | $0.18 | 1.0× |
| Gemini 2.5 Flash-Lite | $0.19 | 1.1× |
| GPT-5-nano | $0.29 | 1.6× |
| Mistral Small 3 | $0.32 | 1.8× |
| DeepSeek V3.5 | $0.52 | 2.9× |
| Gemini 2.5 Flash | $0.96 | 5.3× |
| GPT-5-mini | $1.14 | 6.3× |
| Claude Haiku 4.5 | $1.76 | 9.8× |
| Gemini 2.5 Pro | $3.88 | 21.6× |
| Claude Sonnet 4.6 | $6.60 | 36.7× |
| Grok 4 | $6.60 | 36.7× |
| GPT-5 | $19.00 | 105.6× |
| Claude Opus 4.7 | $33.00 | 183.3× |
| o3-pro (reasoning) | $38.00 | 211.1× |
The cheapest call in the table (Llama 3.3 8B on Together at $0.18 per million tokens) is roughly 211× cheaper than the most expensive (o3-pro at $38). That is the headline gap, and it should make any engineer who hasn't recently audited their model selection at least mildly uneasy. The honest follow-up question is whether the expensive end of the spectrum is delivering 211× the value, and the answer for most production workloads is unambiguously no.
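The blended figures are a weighted average of the input and output list prices. A minimal sketch of that arithmetic follows; the function names and the small `PRICES` sample are illustrative choices, with the prices copied from the provider tables above.

```python
# List prices in USD per million tokens: (input, output).
# A small sample copied from the pricing tables above.
PRICES = {
    "GPT-5": (10.00, 40.00),
    "GPT-5-mini": (0.60, 2.40),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "o3-pro": (20.00, 80.00),
}


def blended_cost(input_price, output_price, input_share=0.70):
    """USD cost of 1M tokens split input_share / (1 - input_share)."""
    return input_share * input_price + (1.0 - input_share) * output_price


def rank_by_blended_cost(prices, input_share=0.70):
    """Return (model, cost, multiple_of_cheapest) tuples, cheapest first."""
    costs = {m: blended_cost(i, o, input_share) for m, (i, o) in prices.items()}
    cheapest = min(costs.values())
    ordered = sorted(costs.items(), key=lambda kv: kv[1])
    return [(model, cost, cost / cheapest) for model, cost in ordered]


if __name__ == "__main__":
    for model, cost, multiple in rank_by_blended_cost(PRICES):
        print(f"{model:<20} ${cost:>6.2f}  {multiple:>5.1f}x")
```

Changing `input_share` shows how sensitive the ranking is to the workload mix: output-heavy workloads widen the gap between models with high output multipliers and those without.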
The Hidden Discount Multipliers
Prompt Caching: 60-90% off Input Tokens
OpenAI, Anthropic, and Google all now offer prompt caching — when you send the same prompt prefix multiple times (system prompts, RAG context, conversation history), the cached portion costs dramatically less. The discount varies:
- Anthropic: 90% off cached input (cache reads), but cache writes cost 25% extra. The net benefit depends on reuse: about 88% at roughly 55 reads per write, and less when a prefix is reused only a few times.
- OpenAI: 75% off cached input. No write penalty.
- Google: 75% off cached input, similar mechanics.
For a chatbot with a 5,000-token system prompt and RAG-style document context, where 80% of input tokens are reusable, prompt caching reduces total input cost by roughly 60-70%. For a simple Q&A bot with minimal context reuse, the benefit is closer to 10-20%.
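The caching arithmetic can be sketched in a few lines. This is a simplified model, not any provider's billing logic: it assumes a prefix is written to the cache once and then read a fixed number of times, and it uses the 90% read discount and 25% write premium quoted above as defaults. Both function names are illustrative.

```python
def net_cache_saving(reads_per_write, read_discount=0.90, write_premium=0.25):
    """Fraction saved on a prefix that is cached once and then read N times.

    Baseline: the prefix is billed at list price on all N + 1 requests.
    """
    cached = (1.0 + write_premium) + reads_per_write * (1.0 - read_discount)
    uncached = 1.0 + reads_per_write
    return 1.0 - cached / uncached


def input_cost_reduction(reusable_share, read_discount):
    """Upper bound on input savings when reusable_share of input hits the cache."""
    return reusable_share * read_discount


if __name__ == "__main__":
    for reads in (1, 7, 55):
        print(f"{reads:>3} reads per write: {net_cache_saving(reads):.1%} saved")
    print(f"80% reusable at 75% off: {input_cost_reduction(0.80, 0.75):.0%}")
    print(f"80% reusable at 90% off: {input_cost_reduction(0.80, 0.90):.0%}")
```

With a write premium, a prefix that is cached but never read again costs more than not caching at all, which is why low-reuse workloads see little benefit.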
Batch API: 50% off Both Input and Output
OpenAI, Anthropic, and Google all offer batch processing — submit requests in bulk, accept 24-hour turnaround, pay half the normal rate. Use cases: bulk classification, data labeling, content moderation, scheduled report generation. Not usable for real-time anything, but for offline processing the discount is automatic and substantial.
Combined cache + batch can produce 85-95% cost reduction for the right workload — a 5,000-token system prompt processed against 100,000 documents in batch mode with caching costs perhaps 5% of what naive real-time processing would cost.
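To see where a figure in that range can come from, here is a rough sketch of the combined discount per document. It rests on several assumptions that are not in the text: the batch and cache discounts stack (check your provider's terms), every request after the first hits the cache, and each document adds 500 input tokens and 50 output tokens. The function name is hypothetical.

```python
def relative_cost(prefix_tokens, doc_tokens, output_tokens,
                  price_in, price_cached, price_out, batch_discount=0.50):
    """Optimized cost as a fraction of naive real-time cost, per document.

    Prices are USD per million tokens; the ratio itself is unitless.
    Assumes the shared prefix is always a cache hit and discounts stack.
    """
    naive = (prefix_tokens + doc_tokens) * price_in + output_tokens * price_out
    optimized = (1.0 - batch_discount) * (
        prefix_tokens * price_cached
        + doc_tokens * price_in
        + output_tokens * price_out
    )
    return optimized / naive


if __name__ == "__main__":
    # 5,000-token shared prompt; document and output sizes are assumed values.
    # Prices are the Claude Sonnet 4.6 row from the Anthropic table.
    ratio = relative_cost(5000, 500, 50, 3.00, 0.30, 15.00)
    print(f"optimized cost is {ratio:.1%} of naive cost")
```

The larger the shared prefix is relative to the per-document tokens, the closer the result gets to the 5% end of the range; long documents or long outputs pull it back toward 50%.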
Self-Hosted Open Models: 100% off API Pricing, Real Infrastructure Cost
Llama 3.3, Mixtral, DeepSeek, and many other open-weight models can run on your own hardware. The "free" framing is misleading — you pay for GPU compute, but for high-volume workloads the math often favors self-hosting:
- 4×A100 GPU server (suitable for 70B model serving): ~$15K/month on cloud, ~$3K/month on dedicated bare metal.
- Throughput: typically 50-150 requests/second with 1,000 token average response, depending on model and optimization.
- Cost per million tokens at full utilization: roughly $0.04-0.10 for 70B models, vs $0.88 on hosted Llama via Together.
Self-hosting wins on cost when throughput is consistent (utilization above 30-40%). API hosting wins for variable, spiky, or low-volume workloads where idle GPUs are pure waste.
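The utilization effect is easy to check with a back-of-the-envelope calculation. The sketch below assumes a 30-day month and a flat monthly server cost; the function name is illustrative, and throughput in tokens per second is requests per second times average tokens per response.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an assumption


def self_hosted_cost_per_million(monthly_cost_usd, tokens_per_second, utilization=1.0):
    """USD per million generated tokens for a flat-rate GPU server."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH * utilization
    return monthly_cost_usd / (tokens_per_month / 1_000_000)


if __name__ == "__main__":
    # 50 requests/second x 1,000 tokens each, on the ~$15K/month cloud server.
    peak_tps = 50 * 1000
    for utilization in (1.0, 0.35, 0.10):
        cost = self_hosted_cost_per_million(15_000, peak_tps, utilization)
        print(f"utilization {utilization:.0%}: ${cost:.3f} per 1M tokens")
```

Because the server bill is fixed, the per-token cost scales with the inverse of utilization: at 35% utilization each token costs roughly 2.9× what it does at full load.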
Five Production Cost Scenarios
Headline pricing is meaningless without context. Here are five real production scenarios with calculated monthly costs across major models.
Scenario 1: Customer Support Chatbot
Usage profile: 50,000 conversations per month, average 8 turns per conversation, 200 tokens per user message, 300 tokens per assistant response, 1,500-token system prompt.
Per conversation: 8 turns × (200 user + 300 assistant) = 4,000 tokens, plus the system prompt loaded on each turn = 1,500 × 8 = 12,000 tokens. Total ~16,000 tokens per conversation: 13,600 input (85%) and 2,400 output. This simplified count ignores conversation history resent on later turns, which would push the input share higher. With caching, the system prompt is served from cache on 7 of the 8 turns.
Monthly total: 800M tokens (~680M input, 120M output).
| Model | Monthly cost | Notes |
|---|---|---|
| GPT-5 | $11,600 | Without caching |
| GPT-5 with caching | $7,660 | System prompt cached |
| Claude Sonnet 4.6 with caching | $2,420 | 90% cache discount on the system prompt; cache-write surcharge ignored |
| GPT-5-mini with caching | $460 | Excellent for support tasks |
| Claude Haiku 4.5 with caching | $645 | Strong support model |
| DeepSeek V3.5 | $315 | Without caching; quality varies, test for your domain |
| Llama 3.3 70B self-hosted | $3,000 | Includes infrastructure |
For this workload, GPT-5-mini with caching at roughly $460/month is the sweet spot for most quality requirements. Self-hosted 70B becomes cost-competitive only at roughly 6-7× this scale.
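The token accounting for a chat workload like this one can be sketched as follows. The `ChatProfile` class and helper names are hypothetical, and the model is deliberately simple: it counts the system prompt once per turn and does not count conversation history resent on later turns.

```python
from dataclasses import dataclass


@dataclass
class ChatProfile:
    conversations_per_month: int
    turns: int
    user_tokens: int           # per user message
    assistant_tokens: int      # per assistant response
    system_prompt_tokens: int  # sent with every turn


def tokens_per_conversation(p):
    """Tokens per conversation, broken down by role."""
    return {
        "system": p.system_prompt_tokens * p.turns,
        "user": p.user_tokens * p.turns,
        "assistant": p.assistant_tokens * p.turns,
    }


def cached_system_share(turns):
    """Share of system-prompt loads served from cache (all but the first turn)."""
    return (turns - 1) / turns


if __name__ == "__main__":
    profile = ChatProfile(50_000, 8, 200, 300, 1_500)
    per_conv = tokens_per_conversation(profile)
    total = sum(per_conv.values())
    print(per_conv)  # system 12,000, user 1,600, assistant 2,400
    print(f"{total:,} tokens per conversation")
    print(f"{total * profile.conversations_per_month / 1e6:,.0f}M tokens per month")
    print(f"{cached_system_share(profile.turns):.1%} of system-prompt loads cached")
```

Only the assistant tokens are billed at the output rate; the system and user tokens are input, which is why the system prompt's cache hit rate drives most of the savings.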
Scenario 2: Code Generation Tool
Usage profile: 10,000 developers, 30 completions per developer per day, 500 token average input (file context), 200 token average output (completion).
Daily tokens: 10,000 × 30 × 700 = 210M tokens (150M input, 60M output). Monthly, over 30 days: 6.3B tokens (4.5B input, 1.8B output). Costs below are at list prices, without caching or batch discounts.
| Model | Monthly cost | Notes |
|---|---|---|
| GPT-5 | $117,000 | Premium quality |
| Claude Sonnet 4.6 | $40,500 | Strong for code |
| GPT-5-mini | $7,020 | Good for completions |
| Codestral 2 | $2,970 | Code-specialized model |
| DeepSeek V3.5 | $3,195 | Strong code performance |
Scenario 3: RAG-Based Knowledge Application
Usage profile: 100,000 queries per month, average 4,000 tokens retrieved context, 200 token question, 500 token answer. Knowledge base mostly stable.
Per query: 4,200 input + 500 output, or 420M input and 50M output tokens per month. With aggressive caching of frequently-retrieved chunks, ~60% of input tokens are cached on average.
| Model | Monthly cost (with caching) |
|---|---|
| Claude Sonnet 4.6 | $1,330 |
| GPT-5 | $4,310 |
| Gemini 2.5 Pro | $790 |
| GPT-5-mini | $260 |
| Gemini 2.5 Flash | $195 |
RAG workloads benefit dramatically from caching because retrieved chunks repeat across queries. Gemini 2.5 Flash with caching is the cost leader for medium-quality requirements.
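A rough way to rerun this scenario with your own cache hit rate is sketched below. The function name and signature are illustrative; prices are USD per million tokens taken from the provider tables, and the model ignores any cache-write or cache-storage charges.

```python
def monthly_cost_with_cache(queries, input_tokens, output_tokens, cached_share,
                            price_in, price_cached, price_out):
    """Monthly USD cost when cached_share of input tokens is billed at the cached rate."""
    total_in = queries * input_tokens
    cached = total_in * cached_share
    uncached = total_in - cached
    total_out = queries * output_tokens
    return (uncached * price_in + cached * price_cached + total_out * price_out) / 1e6


if __name__ == "__main__":
    # Scenario 3 profile with the Gemini 2.5 Flash prices from the Google table.
    for share in (0.0, 0.6, 0.9):
        cost = monthly_cost_with_cache(100_000, 4_200, 500, share, 0.30, 0.075, 2.50)
        print(f"cache share {share:.0%}: ${cost:,.2f} per month")
```

Sweeping `cached_share` makes the point that output cost sets a floor: once most input is cached, further gains have to come from shorter answers or a cheaper model.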
Scenario 4: Bulk Document Classification
Usage profile: 1M documents per month, average 2,000 tokens per document, structured output ~100 tokens. Can use batch API (24h turnaround acceptable).
Total: 2.1B tokens (2B input, 100M output). Batch API gets 50% off.
| Model | Monthly cost (batch) |
|---|---|
| GPT-5 | $12,000 |
| GPT-5-mini | $720 |
| Claude Haiku 4.5 | $1,000 |
| Gemini 2.5 Flash-Lite | $120 |
| DeepSeek V3.5 | $325 |
Classification tasks rarely need frontier reasoning. Gemini 2.5 Flash-Lite in batch mode is hard to beat for cost.
Scenario 5: AI Coding Agent (Heavy Output)
Usage profile: 100,000 agentic task executions per month, average 50,000 tokens input (large codebase context), 10,000 tokens output (multi-step solution).
Total: 6B tokens (5B input, 1B output). Input dominates the token count, but once the codebase context is cached, output dominates the bill. The figures below assume the best case, in which essentially all input is billed at the cached rate.
| Model | Monthly cost (with caching) | Notes |
|---|---|---|
| Claude Opus 4.7 | $82,500 | Premium coding |
| GPT-5 | $52,500 | Strong agent |
| o3-pro (reasoning) | $105,000 | Deep reasoning |
| Claude Sonnet 4.6 | $16,500 | Sweet spot |
| DeepSeek-R1 | $2,890 | Open reasoning model |
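For agent workloads it helps to split the bill into its input and output parts. The sketch below does that under a best-case assumption that is not guaranteed in practice: every input token is billed at the cached rate. The function name is hypothetical, and the prices are the Claude Sonnet 4.6 row from the Anthropic table.

```python
def bill_breakdown(input_tokens, output_tokens, price_cached_in, price_out):
    """Split a monthly bill (USD) into cached-input and output components."""
    input_cost = input_tokens * price_cached_in / 1e6
    output_cost = output_tokens * price_out / 1e6
    total = input_cost + output_cost
    return {
        "input": input_cost,
        "output": output_cost,
        "total": total,
        "output_share": output_cost / total,
    }


if __name__ == "__main__":
    # 100,000 tasks x 50,000 input and 10,000 output tokens each.
    tasks = 100_000
    bill = bill_breakdown(tasks * 50_000, tasks * 10_000, 0.30, 15.00)
    print(f"input ${bill['input']:,.0f}, output ${bill['output']:,.0f}, "
          f"total ${bill['total']:,.0f}")
    print(f"output is {bill['output_share']:.0%} of the bill")
```

Even with five input tokens for every output token, output accounts for about 91% of this bill, so trimming agent verbosity matters more here than shaving context.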
Pricing Trends: What's Changed in 2025-2026
Three structural shifts have hit AI pricing since mid-2024:
1. The Mini/Nano Tier Has Caught Up
GPT-5-mini, Claude Haiku 4.5, and Gemini 2.5 Flash are now capable of tasks that required frontier models 18 months ago, which translates into cost reductions of 90-95% for similar quality on most business tasks. The mini tier is now the default starting point; the frontier tier is for genuinely hard tasks.
2. Prompt Caching Has Moved From Experimental to Universal
All three major providers now offer prompt caching as a standard pricing tier. For any workload with reusable context (RAG, long system prompts, conversation history), caching is no longer an optional optimization; it is baseline cost management.
3. Open Models Are Production-Ready
Llama 3.3 405B and DeepSeek V3.5 deliver frontier-adjacent quality at 5-10× lower cost than proprietary frontier models. For cost-sensitive applications with adequate engineering, self-hosted open models are now the rational choice. The barrier is operational, not capability.
The 4-Tier Selection Framework
Most teams don't optimize model selection because the time investment doesn't feel commensurate with the savings. On annual API spend below $20K, they're often right. Above that threshold, the math inverts — a 50% cost reduction on a $200K annual bill funds an engineer for half a year, which makes optimization one of the highest-ROI activities engineering can do. The framework we use to decide what to run, in order:
- Start mini-tier. GPT-5-mini, Claude Haiku 4.5, or Gemini 2.5 Flash. Test against the actual production workload, not a curated benchmark. Be honest about where output quality is genuinely insufficient versus where it merely feels insufficient because frontier model output is familiar.
- If quality is insufficient, step up one tier, not three. Sonnet 4.6 is the natural intermediate. Skip directly to Opus 4.7 or GPT-5 only when there's clear measurable justification, not because "the better model probably does it better."
- Enable caching before you do anything else. For any workload with reusable prompt prefixes — RAG systems, long system prompts, conversation history — caching is no longer optional. It is the baseline.
- Move offline workloads to batch. Bulk classification, scheduled reports, content moderation, data labeling. Any workload that doesn't need synchronous response should be on batch API for the automatic 50% discount.
For high-volume, consistent workloads where engineering capacity allows it, self-hosting open models (Llama 3.3, DeepSeek V3.5, Mixtral) becomes economically rational above roughly $10K/month in API spend. Below that threshold, the operational complexity exceeds the savings. The break-even depends heavily on workload variance — spiky traffic kills self-hosting economics; steady traffic rewards them.
A naively chosen frontier model and a thoughtfully chosen mini-tier model with caching and batch can differ in cost by 50-100× for the same task. On annual budgets of $50K-500K, the optimization is the kind of work that quietly pays for senior engineering hires.
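The "start mini, step up one tier at a time" rule can be expressed as a short loop. This is a sketch only: `evaluate` stands in for whatever quality measurement you run against your own production workload, and the scores in the usage example are made-up placeholders, not benchmark results.

```python
from typing import Callable, Optional, Sequence


def pick_tier(tiers: Sequence[str],
              evaluate: Callable[[str], float],
              threshold: float) -> Optional[str]:
    """Return the first (cheapest) model whose score meets the threshold.

    tiers must be ordered cheapest first, so each failed check
    escalates by exactly one tier.
    """
    for model in tiers:
        if evaluate(model) >= threshold:
            return model
    return None  # no tier is good enough; revisit the task or the threshold


if __name__ == "__main__":
    tiers = ["GPT-5-mini", "Claude Sonnet 4.6", "Claude Opus 4.7"]
    # Placeholder pass rates standing in for a real evaluation run.
    fake_scores = {"GPT-5-mini": 0.81, "Claude Sonnet 4.6": 0.93, "Claude Opus 4.7": 0.95}
    print(pick_tier(tiers, fake_scores.get, threshold=0.90))
```

The loop stops at the first tier that clears the bar, which encodes the point that a more expensive model scoring slightly higher is not by itself a reason to pay for it.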
Run Your Own Numbers
The estimates above use generalized workload profiles. Your actual workload — token distributions, cache hit ratios, output verbosity — will deviate. Three tools below let you plug in your own numbers:
- AI Inference Cost Calculator — model API spend across providers with custom input/output ratios. The calculator also includes a self-hosted scenario mode for break-even analysis.
- ChatGPT Token Counter — estimate token counts from prompt text before committing to a model.
- AI Training Cost Calculator — for fine-tuning or full training cost projections, which run under different economics than inference.
This piece reflects pricing verified May 9-14, 2026. We rebuild it quarterly because the underlying numbers move faster than the blog ecosystem keeps up with. If a provider has shipped a major price change since the date stamp, treat their official documentation as canonical.
Sources
- OpenAI Platform Pricing: platform.openai.com/docs/pricing
- Anthropic API Pricing: docs.anthropic.com/en/docs/about-claude/models
- Google AI Studio Pricing: ai.google.dev/pricing
- Together AI Pricing: together.ai/pricing
- DeepInfra Pricing: deepinfra.com/pricing
- DeepSeek API Pricing: api-docs.deepseek.com/quick_start/pricing
- Mistral La Plateforme Pricing: docs.mistral.ai/platform/pricing
- xAI Developer Console: docs.x.ai
All pricing data verified May 9-14, 2026. Self-hosting cost estimates based on typical AWS p4d/p5 instance pricing and Hyperstack/CoreWeave dedicated GPU pricing. Throughput estimates based on vLLM, TensorRT-LLM, and SGLang benchmarks for the relevant model families.
Found a pricing error or want to add a provider? Email hello.goledigitalstudio@gmail.com. We update this article quarterly.