How to optimize token consumption during prompt orchestrations
Token consumption is the hidden cost driver in complex AI workflows.
While the previous article explored how information flows between prompts, agents, and tools, this article focuses on minimizing the tokens consumed during that flow—directly reducing costs, latency, and context window pressure.
You’ll learn nine distinct optimization strategies organized into three categories that can be combined for maximum effect:
Input Optimization:
- Context reduction — Minimizing what enters the context window
- Provider prompt caching — Leveraging built-in provider caching mechanisms
- Semantic caching — Reusing results for semantically similar queries
Processing Optimization:
- Model selection — Using smaller models for simpler tasks
- Batch processing — Async processing with 50% cost discount
- Request consolidation — Combining sequential steps into single requests
Output Optimization:
- Output token reduction — Generating fewer tokens
- Deterministic tools — Bypassing AI entirely for predictable operations
- Streaming and parallelization — Reducing perceived latency and enabling speculative execution
By applying these strategies, you can achieve 50-90% cost reduction while maintaining or improving response quality.
Table of contents
- 🎯 Why token optimization matters
- 📊 Token consumption anatomy
- ✂️ Strategy 1: Context reduction
- 💾 Strategy 2: Provider prompt caching
- 🧠 Strategy 3: Semantic caching
- 🎚️ Strategy 4: Model selection
- 📦 Strategy 5: Batch processing
- 🔗 Strategy 6: Request consolidation
- 📉 Strategy 7: Output token reduction
- ⚙️ Strategy 8: Deterministic tools
- ⚡ Strategy 9: Streaming and parallelization
- 📈 Strategy comparison matrix
- 🔧 Implementation patterns
- ⚠️ Common pitfalls
- 🎯 Conclusion
- 📚 References
🎯 Why token optimization matters
The cost equation
Every interaction with an AI model consumes tokens from three sources:
| Source | Description | Cost Impact |
|---|---|---|
| Input tokens | Context window content (prompts, instructions, context) | Paid per request |
| Output tokens | Model-generated response | Paid per request (typically 3-5× input cost) |
| Reasoning tokens | Internal reasoning (o3, o4-mini, extended thinking) | Hidden cost, often 10-50× visible output |
The multiplication problem
In multi-agent orchestrations, token consumption isn’t additive—it’s multiplicative:
Single request: 1,000 input + 500 output = 1,500 tokens
5-phase workflow with full context transfer:
├── Phase 1: 1,000 + 500 = 1,500 tokens
├── Phase 2: 1,500 + 800 = 2,300 tokens (inherited context)
├── Phase 3: 2,300 + 600 = 2,900 tokens
├── Phase 4: 2,900 + 700 = 3,600 tokens
├── Phase 5: 3,600 + 400 = 4,000 tokens
│
└── TOTAL: 14,300 tokens (9.5× single request)
Without optimization, a 5-phase workflow consumes nearly 10× the tokens of a single interaction.
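The compounding can be checked with a few lines of arithmetic; the per-phase output figures are the illustrative ones from the diagram above, not measurements:

```python
def workflow_tokens(phase_outputs, base_input):
    """Total tokens when each phase inherits the previous phase's
    full context, so input grows cumulatively."""
    total = 0
    context = base_input
    for output in phase_outputs:
        total += context + output   # this phase's input + output
        context += output           # successor inherits everything
    return total

total = workflow_tokens([500, 800, 600, 700, 400], base_input=1_000)
single = 1_000 + 500
```

The fix is not to shrink the work but to stop each phase from inheriting the full transcript, which is what the context-reduction and summarization strategies below do.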
Real-world cost impact
| Scenario | Unoptimized | Optimized | Savings |
|---|---|---|---|
| Simple 3-phase workflow | ~8,000 tokens | ~3,000 tokens | 62% |
| Complex 6-phase orchestration | ~45,000 tokens | ~12,000 tokens | 73% |
| Validation pipeline (10 articles) | ~200,000 tokens | ~40,000 tokens | 80% |
| Daily development workflow | ~500,000 tokens | ~100,000 tokens | 80% |
At typical API pricing, 80% savings translates to substantial cost reduction over time.
Beyond cost: accuracy degradation
Token optimization isn’t only about money—it’s also about maintaining model accuracy. As the context window grows, models experience context rot: a progressive loss of accuracy on earlier instructions. Research shows that at 32,000 tokens, accuracy can drop from 88% to 30%, even for capable models. Every optimization strategy below doesn’t just save cost—it also keeps the model more accurate by keeping the context window smaller. For the full definition and benchmarks, see Context rot: why context management is urgent in the information flow article.
📊 Token consumption anatomy
Before optimizing, understand where tokens are consumed in a typical prompt orchestration:
┌─────────────────────────────────────────────────────────────────────────┐
│ CONTEXT WINDOW BREAKDOWN │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ SYSTEM CONTEXT (~2,000-5,000 tokens) │ │
│ │ ├── Agent definition (.agent.md) ~800-1,500 tokens │ │
│ │ ├── Instructions (.instructions.md) ~500-2,000 tokens │ │
│ │ └── Global instructions (copilot-instructions.md) ~500-1,000 │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ USER CONTEXT (~500-10,000 tokens) │ │
│ │ ├── Prompt file content ~300-2,000 tokens │ │
│ │ ├── Included snippets (#file:...) ~200-2,000 tokens │ │
│ │ ├── User message ~50-500 tokens │ │
│ │ └── Handoff context (from previous agent) ~0-5,000 tokens │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ TOOL RESULTS (~0-50,000 tokens) │ │
│ │ ├── File reads (read_file) ~500-5,000 each │ │
│ │ ├── Search results (semantic_search) ~1,000-3,000 │ │
│ │ ├── Web fetches (fetch_webpage) ~3,000-15,000 │ │
│ │ └── MCP tool results ~500-10,000 │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ CONVERSATION HISTORY (~0-100,000 tokens) │ │
│ │ └── Prior turns in multi-turn conversation │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Optimization opportunities by category
| Category | Typical % of Context | Optimization Strategy |
|---|---|---|
| System context | 5-15% | Instruction file pruning, agent streamlining |
| User context | 5-20% | Prompt compression, snippet elimination |
| Tool results | 20-60% | Targeted reads, result limits, deterministic tools |
| Conversation history | 10-50% | Progressive summarization, file-based isolation |
Key insight: Tool results and conversation history are the biggest optimization targets. Focus there first.
✂️ Strategy 1: Context reduction
Context reduction minimizes what enters the context window in the first place—the most direct path to token savings.
Technique 1.1: Targeted file reads
❌ Wasteful pattern:
Read the entire file for context.

This often results in reading 500+ lines when only 20 lines are relevant.
✅ Efficient pattern:
Read lines 45-65 of the configuration file where the validation logic is defined.

Implementation in prompt files:
---
name: targeted-review
description: "Review specific section with minimal context"
tools: ['read_file']
---
## Process
1. Read ONLY the specific section:
- For class definitions: read class header + target method (20-50 lines)
- For configuration: read only relevant section
- Use grep_search first to identify exact line ranges
2. **NEVER read entire files unless explicitly required**

Token savings example
| Approach | Tokens Consumed | Savings |
|---|---|---|
| Read entire 500-line file | ~5,000 tokens | — |
| Read targeted 50 lines | ~500 tokens | 90% |
| Use grep to find, then read 20 lines | ~200 tokens | 96% |
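The grep-then-read row can be sketched as a small helper that returns only a window of lines around the first match; the function name and default window size are illustrative:

```python
import re

def targeted_window(text, pattern, context=10):
    """Return only a small window of lines around the first match
    of `pattern`, instead of shipping the whole file into context."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if re.search(pattern, line):
            start = max(0, i - context)
            return "\n".join(lines[start:i + context + 1])
    return ""  # no match: send nothing rather than everything
```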
Technique 1.2: Search result limiting
❌ Wasteful pattern:
Search the codebase for validation patterns.

This can return 20+ results, each consuming hundreds of tokens.
✅ Efficient pattern:
Search for validation patterns, limit to 3 most relevant results.

In prompts:
---
tools: ['semantic_search', 'grep_search']
---
## Tool Usage Guidelines
When searching:
1. **Use grep_search first** for known patterns (cheaper, faster)
2. **Limit semantic_search results** to 3-5 maximum
3. **Be specific in queries** — "authentication middleware" not "middleware"

Technique 1.3: Progressive summarization
Instead of passing full conversation history between phases, compress it to essential data:
## Phase Completion Template
When completing this phase, produce a **PHASE SUMMARY** (max 200 tokens):
### Phase {N} Summary
**Decisions Made**:
- [Decision 1]: [Rationale]
- [Decision 2]: [Rationale]
**Critical Outputs**:
- [Output file/artifact]: [1-line description]
**For Next Phase**:
- [Specific instruction for successor]
---
<!-- Full details are in: [file reference] -->

Token savings from summarization
| Handoff Method | Tokens per Handoff | 5-Phase Total |
|---|---|---|
| Full context (`send: true`) | ~3,000 tokens | ~15,000 tokens |
| Progressive summary | ~300 tokens | ~1,500 tokens |
| **Savings** | 90% | 90% |
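A minimal sketch of a handoff builder that enforces the 200-token budget; the 4-characters-per-token heuristic is a rough assumption, not a real tokenizer, and the field names follow the template above:

```python
def phase_summary(n, decisions, outputs, next_steps, budget_tokens=200):
    """Compact handoff: keep only decisions, artifacts, and
    instructions for the successor; enforce a rough token budget."""
    body = "\n".join(
        [f"### Phase {n} Summary", "**Decisions Made**:"]
        + [f"- {d}" for d in decisions]
        + ["**Critical Outputs**:"]
        + [f"- {o}" for o in outputs]
        + ["**For Next Phase**:"]
        + [f"- {s}" for s in next_steps]
    )
    approx_tokens = len(body) // 4  # rough heuristic: ~4 chars/token
    if approx_tokens > budget_tokens:
        raise ValueError(f"Summary too long: ~{approx_tokens} tokens")
    return body
```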
Technique 1.4: Instruction file pruning
Instructions files apply automatically based on applyTo patterns. Overly broad patterns waste tokens:
❌ Wasteful:
---
applyTo: "**/*" # Applies to EVERY file
---
# 200 lines of React component guidelines...

✅ Targeted:
---
applyTo: "**/*.tsx,**/*.jsx" # Only React files
---

Audit your instruction files:
# Find instruction files and their patterns
Get-ChildItem -Path ".github/instructions" -Filter "*.instructions.md" |
ForEach-Object {
$content = Get-Content $_.FullName -Raw
if ($content -match 'applyTo:\s*"([^"]+)"') {
[PSCustomObject]@{
File = $_.Name
Pattern = $matches[1]
}
}
}

💾 Strategy 2: Provider prompt caching
Provider prompt caching is automatic caching offered by LLM providers for repeated prompt prefixes. Unlike semantic caching, it’s exact-match based and built into the API.
How provider caching works
┌─────────────────────────────────────────────────────────────────────────┐
│ PROVIDER PROMPT CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Request 1: │
│ ┌─────────────────────────────────────────────────┬──────────────────┐ │
│ │ Static Prefix (cached after 1024+ tokens) │ Dynamic Content │ │
│ │ • System instructions │ • User query │ │
│ │ • Few-shot examples │ • Context │ │
│ │ • Tool definitions │ │ │
│ └─────────────────────────────────────────────────┴──────────────────┘ │
│ ▲ │
│ │ Cache write (1.25× base price for Anthropic) │
│ │
│ Request 2 (same prefix): │
│ ┌─────────────────────────────────────────────────┬──────────────────┐ │
│ │ Static Prefix (CACHE HIT - 0.1× base price) │ New Dynamic │ │
│ │ │ Content │ │
│ └─────────────────────────────────────────────────┴──────────────────┘ │
│ │
│ SAVINGS: 50-90% on cached prefix tokens │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Provider comparison
| Provider | Minimum Tokens | Cache Duration | Cache Read Cost | Cache Write Cost |
|---|---|---|---|---|
| OpenAI | 1,024 tokens | 5-60 min (in-memory) or 24h (extended) | 50% discount | No extra cost |
| Anthropic | 1,024-4,096 tokens (model-dependent) | 5 min default, 1h optional | 90% discount | 25% premium |
| Other providers | Varies | Context caching available | Reduced cost | Initial write cost |
📝 Anthropic model-specific minimums:
- Claude Sonnet 4/4.5, Opus 4/4.1, Opus 4.6: 1,024 tokens minimum
- Claude Haiku 3.5, Opus 4.5: 4,096 tokens minimum
- Claude Haiku 3: 2,048 tokens minimum
1-hour cache TTL: For agent workflows or conversations where follow-up prompts may exceed 5 minutes, use `"ttl": "1h"` in the `cache_control` block (2× base write price instead of 1.25×).
⚠️ Cache invalidation: The following changes invalidate cached content:
- Tool definitions: Modifying names, descriptions, or parameters invalidates the entire cache
- Enabling/disabling features: Web search, citations, or thinking toggles modify system prompts
- Images: Adding or removing images anywhere in the prompt
- Content edits: Any changes to cached prefix content require re-caching
Optimization: Structure prompts for caching
The key principle: Static content first, dynamic content last.
❌ Cache-unfriendly structure:
## User Request
{{dynamic_user_input}}
## Instructions
[Static rules that could be cached but won't be]

✅ Cache-friendly structure:
## Identity
You are a senior code reviewer specializing in security analysis.
## Instructions
1. Check for injection vulnerabilities
2. Validate input sanitization
[... 500+ tokens of stable instructions ...]
## Examples
[... few-shot examples that rarely change ...]
## User Request
{{dynamic_content}} <!-- Only this part isn't cached -->

Token savings from provider caching
| Scenario | Without Caching | With Caching | Savings |
|---|---|---|---|
| First request (1,500 token prefix) | 1,500 tokens at full price | 1,500 tokens at 1.25× (Anthropic) | -25% (write cost) |
| Subsequent requests (same prefix) | 1,500 tokens at full price | 1,500 tokens at 0.1× (Anthropic) | 90% |
| 10 requests with same prefix | 15,000 tokens | 1,875 + 1,350 = 3,225 tokens | 78% |
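The cache-friendly structure maps directly onto Anthropic's Messages API, where a `cache_control` marker on the last static block caches the whole prefix. This is a sketch of the request shape only; the model name and content are placeholders:

```python
def cached_request(static_instructions, examples, user_query):
    """Build a Messages API payload with the static prefix first.
    The cache_control marker on the last static block tells the
    provider to cache everything up to and including that block."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": static_instructions},
            {
                "type": "text",
                "text": examples,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Only this user turn varies between requests
        "messages": [{"role": "user", "content": user_query}],
    }
```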
🧠 Strategy 3: Semantic caching
Semantic caching stores and retrieves results for semantically similar queries—not just exact matches. This is powerful for:
- Repeated questions with slight variations
- Common patterns across projects
- Expensive operations (web fetches, large analyses)
How semantic caching works
┌─────────────────────────────────────────────────────────────────────────┐
│ SEMANTIC CACHE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ User Query: "How do I authenticate users in Azure AD?" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. EMBEDDING GENERATION │ │
│ │ Convert query → vector embedding [0.23, -0.15, 0.87, ...] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 2. SIMILARITY SEARCH │ │
│ │ Find cached queries with embedding distance < threshold │ │
│ │ • "Azure AD user authentication" (0.92 similarity) ✅ │ │
│ │ • "Set up OAuth in Azure" (0.78 similarity) ⚠️ │ │
│ │ • "Install Azure CLI" (0.31 similarity) ❌ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ CACHE HIT │ │ CACHE MISS │ │
│ │ Return cached │ │ Call LLM API │ │
│ │ response │ │ Store result │ │
│ │ (0 tokens) │ │ in cache │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Implementation: GPTCache
GPTCache is an open-source semantic caching library:
from gptcache import cache
from gptcache.adapter import openai
# Initialize semantic cache
cache.init()
cache.set_openai_key()
# Queries with similar embeddings hit cache
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[{"role": "user", "content": "How to authenticate Azure AD?"}]
)
# Second query with similar meaning → cache hit
response2 = openai.ChatCompletion.create(
model="gpt-4",
messages=[{"role": "user", "content": "Azure AD authentication setup"}]
)

When semantic caching helps vs. hurts
| ✅ Good Use Cases | ❌ Poor Use Cases |
|---|---|
| Repeated documentation queries | Unique, context-specific questions |
| Common coding patterns | Code with dynamic requirements |
| FAQ-style interactions | Creative/generative tasks |
| Reference lookups | Tasks requiring current state |
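To make the hit/miss mechanics concrete without standing up embedding infrastructure, here is a toy cache that substitutes word-overlap (Jaccard) similarity for real vector embeddings; a production cache should use an embedding model and a vector index instead:

```python
class ToySemanticCache:
    """Illustrative only: real systems use vector embeddings and an
    approximate-nearest-neighbor index, not word overlap."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.entries = []  # list of (token_set, response)

    @staticmethod
    def _sim(a, b):
        # Jaccard similarity between two word sets
        return len(a & b) / len(a | b) if a | b else 0.0

    def lookup(self, query):
        q = set(query.lower().split())
        for tokens, response in self.entries:
            if self._sim(q, tokens) >= self.threshold:
                return response  # cache hit: zero API tokens spent
        return None

    def store(self, query, response):
        self.entries.append((set(query.lower().split()), response))
```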
🎚️ Strategy 4: Model selection
Model selection uses smaller, faster models for simpler tasks—matching model capability to task complexity.
The model selection principle
From OpenAI’s latency optimization guide:
“Smaller models usually run faster (and cheaper), and when used correctly can even outperform larger models.”
When to use which model size
| Task Complexity | Recommended Model | Token Cost |
|---|---|---|
| Simple classification | GPT-3.5, Claude Haiku | 10-20× cheaper |
| Structured extraction | GPT-4o-mini, Haiku | 5-10× cheaper |
| Code completion | Fine-tuned smaller model | Variable |
| Complex reasoning | GPT-4, Claude Sonnet/Opus | Full price |
| Creative/nuanced writing | GPT-4, Claude Opus | Full price |
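The routing table collapses into a small dispatcher; the task labels and model names here are illustrative placeholders, not a fixed taxonomy:

```python
# Task types considered simple enough for the cheaper model
SMALL_MODEL_TASKS = {"classification", "extraction", "reformatting", "simple_qa"}

def pick_model(task_type):
    """Send simple, well-defined tasks to the cheaper model and
    everything else to the full-capability model."""
    return "gpt-4o-mini" if task_type in SMALL_MODEL_TASKS else "gpt-4o"
```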
Implementation: Task routing
# orchestrator.agent.md
---
name: task-router
description: "Route tasks to appropriate model size"
---
## Task Routing Rules
### Use SMALLER model (GPT-3.5 / Claude Haiku) for:
- Simple classification (sentiment, category)
- Structured data extraction
- Reformatting/transformation
- Simple Q&A with clear answers
### Use LARGER model (GPT-4 / Claude Sonnet) for:
- Complex reasoning chains
- Nuanced analysis
- Creative content generation
- Multi-step problem solving

Practical example: Split prompt by complexity
From OpenAI’s latency guide—split a single GPT-4 prompt into two:
┌─────────────────────────────────────────────────────────────────┐
│ BEFORE: Single GPT-4 request │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ GPT-4: Reasoning + Classification + Response Generation │ │
│ │ Cost: $$$ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ AFTER: Split by complexity │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ GPT-3.5: Classification │ → │ GPT-4: Response Generation │ │
│ │ (cheap, fast) │ │ (only when needed) │ │
│ │ Cost: $ │ │ Cost: $$ │ │
│ └─────────────────────────┘ └─────────────────────────────┘ │
│ TOTAL: $$ (vs $$$ before) │
└─────────────────────────────────────────────────────────────────┘
Token savings from model selection
| Approach | Cost per 1M tokens | Savings |
|---|---|---|
| All GPT-4 | ~$30 | — |
| GPT-3.5 for 70% of tasks | ~$10 | 67% |
| Fine-tuned GPT-3.5 | ~$8 | 73% |
📦 Strategy 5: Batch processing
Batch processing submits multiple requests together for asynchronous processing, receiving a 50% cost discount from both OpenAI and Anthropic.
When to use batch processing
| ✅ Good Use Cases | ❌ Poor Use Cases |
|---|---|
| Bulk content analysis | Real-time chat interactions |
| Large-scale evaluations | Interactive coding assistance |
| Dataset classification | Time-sensitive operations |
| Content moderation pipelines | User-facing latency-sensitive features |
Provider batch APIs
| Provider | Discount | Max Batch Size | Completion Time |
|---|---|---|---|
| OpenAI | 50% | 50,000 requests or 200MB | Within 24 hours |
| Anthropic | 50% | 100,000 requests or 256MB | Within 24 hours (usually <1h) |
Implementation example (Anthropic)
import time
import anthropic
client = anthropic.Anthropic()
# Create batch with multiple requests
batch = client.messages.batches.create(
requests=[
{
"custom_id": "article-1",
"params": {
"model": "claude-sonnet-4-5",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "Analyze this article..."}
]
}
},
{
"custom_id": "article-2",
"params": {
"model": "claude-sonnet-4-5",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "Analyze this article..."}
]
}
}
# ... up to 100,000 requests
]
)
# Poll for completion
while batch.processing_status != "ended":
batch = client.messages.batches.retrieve(batch.id)
time.sleep(60)
# Retrieve results
results = client.messages.batches.results(batch.id)

Combining batch processing with prompt caching
Batch processing and prompt caching discounts stack:
⚠️ Important: Cache hits in batch processing are best-effort due to asynchronous, concurrent processing. Anthropic reports typical cache hit rates of 30-98% depending on traffic patterns. To maximize hits: include identical `cache_control` blocks in every request, maintain steady request flow, and consider the 1-hour cache TTL for batch workloads.
| Optimization | Discount | Combined |
|---|---|---|
| Batch only | 50% | 50% |
| Cache read only | 90% | 90% |
| Batch + Cache read | 50% + 90% | 95% |
Token savings from batch processing
| Scenario | Standard API | Batch API | Savings |
|---|---|---|---|
| 1,000 article validations | $30 | $15 | 50% |
| With prompt caching | $15 | $7.50 | 75% (combined) |
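The discount stacking can be sanity-checked in a few lines. The rates are the Anthropic figures from the tables above, and the model assumes the cache-read discount applies only to the cached fraction of input:

```python
def effective_cost(base_cost, batch=False, cache_hit_fraction=0.0):
    """Apply the 50% batch discount and the 90% cache-read discount
    to a base cost. Simplification: cache-read pricing applies only
    to the cached fraction; batch applies to everything."""
    cached = base_cost * cache_hit_fraction * 0.10   # 90% off cached reads
    uncached = base_cost * (1 - cache_hit_fraction)
    cost = cached + uncached
    if batch:
        cost *= 0.5                                  # 50% batch discount
    return cost

# Fully cached prefix + batch: 0.10 * 0.5 = 5% of base, i.e. 95% savings
```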
🔗 Strategy 6: Request consolidation
Request consolidation combines multiple sequential LLM calls into a single request, eliminating round-trip latency and reducing total tokens.
The consolidation principle
From OpenAI’s latency guide:
“Each time you make a request you incur some round-trip latency. If you have sequential steps for the LLM to perform, instead of firing off one request per step consider putting them in a single prompt.”
Before vs. After consolidation
❌ Before: Sequential requests
Request 1: "Classify this text" → 200 tokens
Request 2: "Extract entities" → 300 tokens
Request 3: "Summarize findings" → 400 tokens
─────────────
TOTAL: 3 requests, ~900 tokens + 3× round-trip latency
✅ After: Consolidated request
Single Request:
"Perform the following in sequence:
1. Classify this text
2. Extract entities
3. Summarize findings
Return results as JSON with keys: classification, entities, summary"
TOTAL: 1 request, ~600 tokens + 1× round-trip latency
Implementation pattern
---
name: consolidated-analysis
description: "Multiple analysis steps in one request"
---
## Task
Perform ALL of the following analyses on the provided content:
1. **Classification**: Categorize the content type
2. **Entity Extraction**: Identify key entities (people, places, concepts)
3. **Sentiment Analysis**: Determine overall sentiment
4. **Summary**: Create a 2-sentence summary
## Output Format
Return a single JSON object:
```json
{
"classification": "...",
"entities": [...],
"sentiment": "positive|neutral|negative",
"summary": "..."
}
```
### Token savings from consolidation
| Approach | Requests | Tokens | Savings |
|----------|----------|--------|---------|
| Sequential (3 requests) | 3 | ~900 | — |
| Consolidated (1 request) | 1 | ~600 | **33% tokens + 66% latency** |
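A consolidation helper along these lines folds the step prompts into one request; the wording and key names are illustrative:

```python
def consolidated_prompt(steps, output_keys):
    """One numbered prompt plus a JSON-keys instruction, replacing
    N sequential requests with a single round trip."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return (
        "Perform the following in sequence:\n"
        f"{numbered}\n"
        f"Return results as a single JSON object with keys: {', '.join(output_keys)}"
    )
```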
---
## 📉 Strategy 7: Output token reduction
**Output token reduction** trims the number of tokens the model generates; generation is typically the highest-latency step in LLM processing.
### The output token principle
From OpenAI's latency guide:
> "Generating tokens is almost always the highest latency step when using an LLM: cutting 50% of your output tokens may cut ~50% your latency."
### Techniques for reducing output tokens
### Technique 7.1: Explicit brevity instructions
```markdown
## Response Guidelines
- Keep responses under 100 words
- Use bullet points, not paragraphs
- Omit pleasantries and preamble
- Answer directly, then stop
```
Technique 7.2: Shortened JSON field names
❌ Verbose output (~120 tokens):
{
"message_is_conversation_continuation": "True",
"number_of_messages_in_conversation_so_far": "5",
"user_sentiment": "Frustrated",
"query_type": "Technical Support",
"response_requirements": "Provide step-by-step troubleshooting"
}

✅ Compact output (~50 tokens):
{
"cont": true,
"n_msg": 5,
"tone": "frustrated",
"type": "tech_support",
"reqs": "troubleshoot_steps"
}

Technique 7.3: Use max_tokens and stop_tokens
response = client.chat.completions.create(
model="gpt-4",
messages=[...],
max_tokens=500, # Hard limit on output
stop=["\n\n", "---"] # Stop at section breaks
)

Token savings from output reduction
| Technique | Before | After | Savings |
|---|---|---|---|
| Brevity instructions | ~500 tokens | ~150 tokens | 70% |
| Shortened field names | ~120 tokens | ~50 tokens | 58% |
| max_tokens limit | Variable | Capped | Predictable |
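Short field names need not hurt downstream readability: instruct the model to emit the aliases, then restore verbose names client-side. The alias map mirrors the example above:

```python
# Short aliases the model is instructed to emit, mapped back to
# the verbose names downstream code expects
FIELD_ALIASES = {
    "cont": "message_is_conversation_continuation",
    "n_msg": "number_of_messages_in_conversation_so_far",
    "tone": "user_sentiment",
    "type": "query_type",
    "reqs": "response_requirements",
}

def expand(compact_response):
    """Rename compact fields client-side, so the token savings
    never cost readability in the rest of the pipeline."""
    return {FIELD_ALIASES.get(k, k): v for k, v in compact_response.items()}
```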
⚙️ Strategy 8: Deterministic tools
Deterministic tools bypass AI entirely for operations that don’t require intelligence: they run orders of magnitude faster and consume zero tokens.
When to use deterministic tools
| Operation Type | Use AI? | Use Deterministic Tool? |
|---|---|---|
| Parse YAML frontmatter | ❌ | ✅ Regex/parser |
| Check if file exists | ❌ | ✅ File system check |
| Validate JSON schema | ❌ | ✅ JSON Schema validator |
| Count lines matching pattern | ❌ | ✅ grep + wc |
| Calculate hash/checksum | ❌ | ✅ Hash function |
| Compare timestamps | ❌ | ✅ Date comparison |
| Analyze code semantics | ✅ | ❌ |
| Generate creative content | ✅ | ❌ |
| Make judgment calls | ✅ | ❌ |
Implementation: MCP server with deterministic tools
The key insight: wrap deterministic operations in MCP tools so agents can invoke them without AI processing.
Example: Validation cache check (deterministic)
// IQPilot MCP Server - CheckValidationCache tool
[McpTool("check_validation_cache")]
public async Task<CacheCheckResult> CheckValidationCache(
string filePath,
string validationType,
int cacheDays = 7)
{
// Pure file parsing - no AI involved
var metadata = await ParseBottomMetadata(filePath);
if (metadata.Validations.TryGetValue(validationType, out var validation))
{
var lastRun = validation.LastRun;
var daysSinceRun = (DateTime.UtcNow - lastRun).TotalDays;
if (daysSinceRun < cacheDays)
{
return new CacheCheckResult
{
IsCached = true,
CachedResult = validation.Outcome,
DaysSinceRun = daysSinceRun,
Message = $"Cache valid. Last run: {lastRun:yyyy-MM-dd}"
};
}
}
return new CacheCheckResult
{
IsCached = false,
Message = "No valid cache found. Run validation."
};
}

Agent prompt using deterministic tool:
---
name: grammar-review-cached
tools: ['check_validation_cache', 'read_file', 'replace_string_in_file']
---
## Process
### Phase 1: Cache Check (DETERMINISTIC - Zero tokens)
1. Call `check_validation_cache`:
- filePath: target file
- validationType: "grammar"
- cacheDays: 7
2. **IF cache is valid** → Report cached result → EXIT
3. **IF no cache** → Proceed to Phase 2
### Phase 2: Grammar Validation (AI - Consumes tokens)
Only reached if cache miss. Perform full validation...

Token savings from deterministic cache checks
| Scenario | Without Deterministic Check | With Deterministic Check |
|---|---|---|
| Cache hit (70% of cases) | 3,000 tokens | 0 tokens |
| Cache miss (30% of cases) | 3,000 tokens | 3,000 tokens |
| Expected tokens per run | 3,000 tokens | 900 tokens |
| Savings | — | 70% |
Common deterministic operations to implement
| Operation | Implementation | Token Savings |
|---|---|---|
| Metadata parsing | YAML parser, regex | Avoid AI parsing docs |
| File existence check | `File.Exists()` | Avoid AI file searches |
| Link validation | HTTP HEAD request | Avoid AI fetching pages |
| Pattern matching | Regex engine | Avoid AI text analysis |
| Schema validation | JSON Schema validator | Avoid AI structure checks |
| Diff computation | Git diff | Avoid AI comparison |
| Timestamp comparison | Date arithmetic | Avoid AI date reasoning |
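As an example of the first row, frontmatter parsing needs only a regex and a few string splits, zero model tokens. This sketch handles simple `key: value` pairs and is not a full YAML parser:

```python
import re

def parse_frontmatter(text):
    """Deterministic YAML-frontmatter extraction: grab the block
    between the opening --- markers and split each line on the
    first colon. No AI involved."""
    m = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not m:
        return {}
    fields = {}
    for line in m.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip('"')
    return fields
```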
⚡ Strategy 9: Streaming and parallelization
Streaming and parallelization don’t reduce actual token costs but dramatically improve perceived latency and throughput. These techniques are essential for production-quality user experiences.
Streaming: Progressive output delivery
Streaming delivers tokens as they’re generated instead of waiting for the complete response:
┌─────────────────────────────────────────────────────────────────────────┐
│ STREAMING vs BATCH RESPONSE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WITHOUT STREAMING: │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Request ─────────────────────────────────────────────▶ Response │ │
│ │ [User waits 3-5 seconds seeing nothing...] │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ WITH STREAMING: │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Request ──▶ Here ──▶ is ──▶ the ──▶ response ──▶ ... │ │
│ │ [~100ms] [~100ms] [~100ms] [~100ms] │ │
│ │ [User sees progress immediately] │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ PERCEIVED LATENCY: 100ms vs 3-5 seconds │
│ │
└─────────────────────────────────────────────────────────────────────────┘
When to use streaming
| ✅ Good Use Cases | ❌ Poor Use Cases |
|---|---|
| User-facing chat interfaces | Background batch processing |
| Code generation with preview | API integrations requiring complete response |
| Documentation generation | Validation pipelines |
| Interactive debugging | Structured JSON output parsing |
Parallelization: Concurrent independent tasks
When tasks don’t depend on each other, process them concurrently rather than sequentially:
┌─────────────────────────────────────────────────────────────────────────┐
│ SEQUENTIAL vs PARALLEL EXECUTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SEQUENTIAL (Bad): │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Task 1 │──▶│ Task 2 │──▶│ Task 3 │──▶│ Task 4 │ │
│ │ 2 sec │ │ 2 sec │ │ 2 sec │ │ 2 sec │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ Total time: 8 seconds │
│ │
│ PARALLEL (Good - when tasks are independent): │
│ ┌─────────┐ │
│ │ Task 1 │ │
│ │ 2 sec │ │
│ ├─────────┤ │
│ │ Task 2 │ All complete at ~2 seconds │
│ │ 2 sec │ │
│ ├─────────┤ │
│ │ Task 3 │ │
│ │ 2 sec │ │
│ ├─────────┤ │
│ │ Task 4 │ │
│ │ 2 sec │ │
│ └─────────┘ │
│ Total time: ~2 seconds (4× faster) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
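In Python the same fan-out is one `asyncio.gather` call; the sleep stands in for an independent validation request:

```python
import asyncio
import time

async def validate(name, seconds=0.05):
    """Stand-in for one independent validation call."""
    await asyncio.sleep(seconds)
    return f"{name}: ok"

async def run_parallel():
    # Independent checks launched concurrently: wall time is roughly
    # the slowest single task, not the sum of all four.
    return await asyncio.gather(
        validate("grammar"),
        validate("readability"),
        validate("structure"),
        validate("links"),
    )

start = time.perf_counter()
results = asyncio.run(run_parallel())
elapsed = time.perf_counter() - start
```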
Implementation: Parallel validation
// Validate multiple articles in parallel
public async Task<ValidationReport[]> ValidateArticles(string[] articlePaths)
{
var validationTasks = articlePaths.Select(async path =>
{
var content = await File.ReadAllTextAsync(path);
// These validations can run in parallel
var grammarTask = ValidateGrammarAsync(content);
var readabilityTask = ValidateReadabilityAsync(content);
var structureTask = ValidateStructureAsync(content);
var linksTask = ValidateLinksAsync(content);
await Task.WhenAll(grammarTask, readabilityTask, structureTask, linksTask);
return new ValidationReport
{
Path = path,
Grammar = await grammarTask,
Readability = await readabilityTask,
Structure = await structureTask,
Links = await linksTask
};
});
return await Task.WhenAll(validationTasks);
}

Speculative execution: Predicted outputs
OpenAI’s Predicted Outputs feature reduces latency when you know most of the expected output:
from openai import OpenAI
client = OpenAI()
# When editing code, most of the file stays the same
code = """
def calculate_total(items):
total = 0
for item in items:
total += item.price
return total
"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"Add tax calculation to this function:\n{code}"
}],
prediction={
"type": "content",
"content": code # Most of this content will be in output
}
)
# Latency reduced by ~2-3× because unchanged tokens
# are confirmed rather than regenerated

When speculative execution helps
| Scenario | Latency Reduction | Best Model |
|---|---|---|
| Code refactoring (small changes) | 2-3× | gpt-4o, gpt-4o-mini |
| Document editing | 2× | gpt-4o |
| Template completion | 3-4× | gpt-4o-mini |
| Translation with preserved formatting | 2× | gpt-4o |
Combined latency optimization
A well-optimized user-facing workflow combines these techniques:
┌─────────────────────────────────────────────────────────────────────────┐
│ OPTIMAL LATENCY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ User Request │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PARALLEL SETUP (concurrent) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Check cache │ │ Fetch context│ │ Load templates│ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ STREAMING RESPONSE │ │
│ │ User sees tokens appear immediately (~100ms first token) │ │
│ │ │ │
│ │ If editing existing content: use PREDICTED OUTPUTS │ │
│ │ for 2-3× additional latency reduction │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PARALLEL POST-PROCESSING (while user sees response) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Update cache │ │ Log metrics │ │ Pre-warm next│ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ RESULT: Sub-second perceived latency │
│ │
└─────────────────────────────────────────────────────────────────────────┘
📈 Strategy comparison matrix
| Strategy | Token Savings | Implementation Effort | Best For | Limitations |
|---|---|---|---|---|
| 1. Context Reduction | 30-90% | Low | All workflows | Requires discipline |
| 2. Provider Caching | 50-90% (stable prefixes) | Low (structure prompts correctly) | High-volume, consistent prompts | Minimum token requirements |
| 3. Semantic Caching | 50-80% (high hit scenarios) | High (embedding infrastructure) | Repeated queries, documentation | Stale data risk, false positives |
| 4. Model Selection | 60-80% | Low | Simple tasks, high volume | May reduce quality for complex tasks |
| 5. Batch Processing | 50% cost discount | Low | Non-urgent, high volume | 24h latency |
| 6. Request Consolidation | 40-60% | Medium | Multi-step pipelines | Increased prompt complexity |
| 7. Output Reduction | 30-50% | Low | Verbose outputs | May lose useful detail |
| 8. Deterministic Tools | 70-100% | Medium (MCP development) | Cache checks, validation, file ops | Limited to predictable operations |
| 9. Streaming/Parallelization | Latency only | Low-Medium | User-facing, independent tasks | No token savings |
Strategy selection flowchart
┌─────────────────────────────────────────────────────────────────────────┐
│ OPTIMIZATION STRATEGY SELECTION │
└─────────────────────────────────────────────────────────────────────────┘
Is the operation predictable/deterministic?
├── YES → Use DETERMINISTIC TOOL (zero tokens)
│ Examples: cache checks, file existence, schema validation
│
└── NO → Is it a repeated query pattern?
├── YES → Is the prefix stable?
│ ├── YES → Use PROVIDER CACHING (90% savings)
│ │ Structure: static first, dynamic last
│ │
│ └── NO → Consider SEMANTIC CACHING (50-80% savings)
│ If query variations are semantically similar
│
└── NO → Is it high volume, non-urgent?
├── YES → Use BATCH PROCESSING (50% discount)
│
└── NO → Is output verbose?
├── YES → Use OUTPUT REDUCTION (30-50% savings)
│
└── NO → Use CONTEXT REDUCTION (30-90% savings)
• Targeted file reads
• Search result limits
• Progressive summarization
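The decision tree above translates directly into a small selector function. This is a minimal sketch; the `Workload` fields are illustrative names for the questions in the flowchart, not a real API:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    deterministic: bool            # is the operation predictable?
    repeated: bool                 # is it a repeated query pattern?
    stable_prefix: bool            # does the prompt share a stable prefix?
    high_volume_non_urgent: bool   # can it wait for async batch processing?
    verbose_output: bool           # is the output unnecessarily long?

def select_strategy(w: Workload) -> str:
    """Walk the flowchart top to bottom and return the first match."""
    if w.deterministic:
        return "deterministic tool"
    if w.repeated:
        return "provider caching" if w.stable_prefix else "semantic caching"
    if w.high_volume_non_urgent:
        return "batch processing"
    if w.verbose_output:
        return "output reduction"
    return "context reduction"

# Example: a nightly re-validation job — repeated, with a stable prompt prefix
job = Workload(deterministic=False, repeated=True, stable_prefix=True,
               high_volume_non_urgent=False, verbose_output=False)
print(select_strategy(job))  # → provider caching
```

In practice the strategies stack, so the function is best read as "which optimization to reach for first," not an exclusive choice.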
Combined optimization example
A well-optimized 5-phase workflow uses multiple strategies in combination:
Phase 1: Research
├── MODEL SELECTION: Use GPT-4o-mini for initial search
├── CONTEXT REDUCTION: Limit search to 5 results
├── PROVIDER CACHING: Stable research prompt prefix
└── Expected: 70% savings
Phase 2: Cache Check
├── DETERMINISTIC TOOL: Check existing validation cache
├── If cache hit: Skip remaining phases
└── Expected: 70% of runs skip AI entirely
Phase 3: Analysis (cache miss only)
├── CONTEXT REDUCTION: Read only relevant sections
├── PROVIDER CACHING: Stable analysis prompt
├── REQUEST CONSOLIDATION: Combine grammar + readability checks
└── Expected: 60% savings
Phase 4: Generation
├── CONTEXT REDUCTION: Progressive summary from Phase 3
├── SEMANTIC CACHING: Cache common generation patterns
├── OUTPUT REDUCTION: Structured JSON output only
└── Expected: 50% savings
Phase 5: Validation
├── DETERMINISTIC TOOL: Schema validation, link checks
├── BATCH PROCESSING: Queue non-urgent validations
├── Only AI for: Grammar, readability, semantic checks
└── Expected: 70% of checks bypass AI
CUMULATIVE SAVINGS: 75-90%
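Phase 2's early exit is what drives most of the cumulative savings: a deterministic cache check runs before any model call, so a hit skips the expensive phases entirely. A minimal sketch, with every model call stubbed out as a plain dictionary and all names purely illustrative:

```python
def run_workflow(topic: str, validation_cache: dict) -> dict:
    # Phase 2 comes first in code: deterministic cache check, zero tokens
    if topic in validation_cache:
        return {"source": "cache", **validation_cache[topic]}  # skip Phases 3-5

    # Phases 1, 3, and 4 would call models here; stubbed as static results
    output = {"summary": f"validated {topic}", "grammar": "ok"}

    validation_cache[topic] = output  # Phase 5: store result for future runs
    return {"source": "fresh", **output}

cache: dict = {}
first = run_workflow("prompt caching", cache)   # cache miss: runs all phases
second = run_workflow("prompt caching", cache)  # cache hit: AI phases skipped
print(first["source"], second["source"])  # → fresh cache
```

If roughly 70% of runs hit the cache, roughly 70% of runs consume zero AI tokens, which compounds with the per-phase savings listed above.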
🔧 Implementation patterns
Pattern 1: Validation pipeline with caching
```markdown
# validation-pipeline.prompt.md
---
name: validation-pipeline
description: "Multi-validation with deterministic cache checks"
tools: ['check_validation_cache', 'run_grammar_check', 'run_readability_check']
---

## Process

### Step 1: Batch Cache Check (DETERMINISTIC)

For each validation type (grammar, readability, structure, fact-check):

1. Call `check_validation_cache(file, type, days=7)`
2. Record which validations need running

### Step 2: Run Only Missing Validations (AI)

For each validation NOT in cache:

1. Run appropriate validation prompt
2. Store result in metadata cache

### Step 3: Aggregate Results

Combine cached + fresh results into unified report.
```
Pattern 2: Research with semantic caching
```python
# Research pattern with semantic cache
async def research_topic(topic: str, cache: SemanticCache):
    # Check semantic cache first
    cached = await cache.get_similar(topic)
    if cached and cached.similarity > 0.90:
        return cached.response

    # Cache miss - perform research
    results = await perform_research(topic)

    # Store for future similar queries
    await cache.store(topic, results)
    return results
```
Pattern 3: Progressive summarization handoff
```markdown
# builder.agent.md
---
name: builder
handoffs:
  - label: "Validate Result"
    agent: validator
    send: false  # Don't send full context
    prompt: |
      **Summary from Builder:**
      {{PHASE_SUMMARY}}

      **Artifact location:** {{OUTPUT_FILE}}

      Validate the created artifact.
---

## Phase Completion Instructions

Before any handoff, produce a PHASE_SUMMARY (max 200 tokens):

1. Decisions made (bullet list)
2. Artifacts created (file paths)
3. Key constraints applied
4. Specific validation needs

Store full details in output file for reference if needed.
```
⚠️ Common pitfalls
Pitfall 1: Over-caching dynamic content
❌ Wrong: Caching responses that depend on current file state
```python
# DON'T cache file-dependent analyses
cache.store(
    "analyze security of auth.py",  # Query seems cacheable...
    analysis_result                 # But result depends on file content!
)
```
✅ Right: Include content hash in cache key
```python
content_hash = hashlib.md5(file_content.encode()).hexdigest()
cache.store(
    f"analyze security of auth.py:{content_hash}",
    analysis_result
)
```
Pitfall 2: Cache key collisions
❌ Wrong: Overly broad cache keys
```python
cache.store("validate article", result)  # Which article?
```
✅ Right: Include all relevant context in key
```python
cache.store(f"validate:{file_path}:{validation_type}:{content_hash}", result)
```
Pitfall 3: Ignoring cache write costs
For Anthropic, cache writes cost 25% more than regular input tokens.
❌ Wrong: Caching tiny prefixes that are rarely reused
✅ Right: Only cache prefixes that will actually be reused
Break-even calculation (Anthropic):
- Cache write: 1.25× base cost
- Cache read: 0.1× base cost
A single cache hit already pays for the write premium:
1.25 (write) + 0.1 (read) = 1.35 vs. 1.0 + 1.0 = 2.0 without caching
With two hits the gap widens:
1.25 + 0.1 + 0.1 = 1.45 vs. 1.0 + 1.0 + 1.0 = 3.0 without caching
Savings start at the 2nd use (first cache hit); the extra 0.25× write cost is only wasted if the prefix is never reused within the cache TTL.
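Plugging the two multipliers into a couple of lines makes the arithmetic easy to check:

```python
# Cost multipliers for cached prompt prefixes (Anthropic pricing)
WRITE = 1.25  # first use writes the cache (25% premium)
READ = 0.10   # each later use reads it at a 90% discount

def cached_cost(uses: int) -> float:
    """Total prefix cost for `uses` requests with caching enabled."""
    return WRITE + READ * (uses - 1)

def uncached_cost(uses: int) -> float:
    """Total prefix cost paying full price every time."""
    return 1.0 * uses

for n in (1, 2, 3):
    print(n, round(cached_cost(n), 2), uncached_cost(n))
# 1 1.25 1.0
# 2 1.35 2.0
# 3 1.45 3.0
```

A one-off request loses 25%; every reuse after that saves 0.9× the prefix cost.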
Pitfall 4: Placing dynamic content before static
❌ Wrong: User input first
```markdown
## User Request: {{input}}

## Instructions (static)
[These won't be cached because they come after dynamic content]
```
✅ Right: Static first, dynamic last
```markdown
## Instructions (static - cached)
[1,000+ tokens of stable content]

## User Request: {{input}}
```
🎯 Conclusion
Token optimization isn’t optional for production AI workflows—it’s the difference between sustainable and unsustainable costs.
Key takeaways
- Context reduction is the foundation: targeted reads, limited searches, progressive summarization
- Provider caching offers up to 90% savings—structure prompts with static content first
- Semantic caching captures similar queries—powerful for documentation and reference lookups
- Model selection and batch processing provide substantial cost reduction for high-volume workflows
- Request consolidation and output reduction minimize token counts per interaction
- Deterministic tools bypass AI entirely for predictable operations—cache checks, validation, file operations
- Streaming/parallelization improve perceived latency without changing token costs
Implementation priority
| Priority | Strategy | Expected Savings | Effort |
|---|---|---|---|
| 1 | Context reduction (targeted reads) | 30-50% | Low |
| 2 | Provider caching (prompt structure) | 50-90% | Low |
| 3 | Model selection (right-size tasks) | 60-80% | Low |
| 4 | Batch processing (async high-volume) | 50% cost discount | Low |
| 5 | Output reduction (structured output) | 30-50% | Low |
| 6 | Request consolidation (combine steps) | 40-60% | Medium |
| 7 | Deterministic tools (cache checks) | 70%+ for cached ops | Medium |
| 8 | Semantic caching | 50-80% | High |
| 9 | Streaming/parallelization | Latency only | Low-Medium |
Next steps
- Audit current prompts for context reduction opportunities
- Restructure prompts for provider caching (static first)
- Evaluate model selection for different task complexities
- Identify batch opportunities for non-urgent high-volume tasks
- Identify deterministic operations to move to MCP tools
- Monitor token usage to validate savings
For information flow patterns between phases, see: How to Manage Information Flow During Prompt Orchestrations.
📚 References
Official Documentation
OpenAI Latency Optimization Guide 📘 [Official]
Seven principles for optimizing latency: process tokens faster, generate fewer tokens, use fewer input tokens, make fewer requests, parallelize, make users wait less, and don't default to an LLM.
OpenAI Prompt Caching Guide 📘 [Official]
Comprehensive guide to OpenAI’s automatic prompt caching, including requirements (1024+ tokens), cache duration, and best practices for structuring prompts to maximize cache hits.
Anthropic Prompt Caching Documentation 📘 [Official]
Comprehensive guide to Claude’s prompt caching with cache_control breakpoints, 90% read discount, 25% write premium, 5-minute default or 1-hour extended TTL, up to 4 cache breakpoints, and model-specific minimum token requirements (1,024-4,096 tokens).
OpenAI Batch API Guide 📘 [Official]
Documentation for OpenAI’s Batch API offering 50% cost discount for async processing with 24-hour completion window and up to 50,000 requests.
Anthropic Message Batches API 📘 [Official]
Documentation for Anthropic’s batch processing with 50% cost discount, 100,000 request limit (or 256 MB size limit), typically under 1-hour completion, and 29-day result retention.
OpenAI Predicted Outputs 📘 [Official]
Guide to using predicted outputs for speculative execution, reducing latency by 2-3× when output is largely known in advance.
Community Resources
GPTCache Documentation 📗 [Verified Community]
Open-source semantic caching library by Zilliz. Provides embedding-based similarity matching to cache LLM responses for similar queries, with support for multiple vector stores and embedding providers.
Internal References
Validation Caching Pattern 📘 [Internal]
Repository-specific implementation of the 7-day validation caching pattern using bottom metadata blocks.
Tool Composition Guide 📘 [Internal]
Optimization patterns for tool usage including “narrow before wide” and lazy loading strategies.
Context Engineering Principles 📘 [Internal]
Token budget guidelines and context window management principles.