How to optimize token consumption during prompt orchestrations

Master token optimization strategies for multi-agent workflows: context reduction, deterministic tools, semantic caching, and provider caching to reduce costs by up to 90%
Author: Dario Airoldi
Published: January 25, 2026


Token consumption is the hidden cost driver in complex AI workflows.
While the previous article explored how information flows between prompts, agents, and tools, this article focuses on minimizing the tokens consumed during that flow—directly reducing costs, latency, and context window pressure.

You’ll learn nine distinct optimization strategies organized into three categories that can be combined for maximum effect:

Input Optimization:

  1. Context reduction — Minimizing what enters the context window
  2. Provider prompt caching — Leveraging built-in provider caching mechanisms
  3. Semantic caching — Reusing results for semantically similar queries

Processing Optimization:

  4. Model selection — Using smaller models for simpler tasks
  5. Batch processing — Asynchronous processing with a 50% cost discount
  6. Request consolidation — Combining sequential steps into single requests

Output Optimization:

  7. Output token reduction — Generating fewer tokens
  8. Deterministic tools — Bypassing AI entirely for predictable operations
  9. Streaming and parallelization — Reducing perceived latency and enabling speculative execution

By applying these strategies, you can achieve 50-90% cost reduction while maintaining or improving response quality.

🎯 Why token optimization matters

The cost equation

Every interaction with an AI model consumes tokens from three sources:

| Source | Description | Cost Impact |
|--------|-------------|-------------|
| Input tokens | Context window content (prompts, instructions, context) | Paid per request |
| Output tokens | Model-generated response | Paid per request (typically 3-5× input cost) |
| Reasoning tokens | Internal reasoning (o3, o4-mini, extended thinking) | Hidden cost, often 10-50× visible output |

The multiplication problem

In multi-agent orchestrations, token consumption isn’t additive—it’s multiplicative:

Single request:     1,000 input + 500 output = 1,500 tokens

5-phase workflow with full context transfer:
├── Phase 1:  1,000 + 500 = 1,500 tokens
├── Phase 2:  1,500 + 800 = 2,300 tokens (inherited context)
├── Phase 3:  2,300 + 600 = 2,900 tokens
├── Phase 4:  2,900 + 700 = 3,600 tokens
├── Phase 5:  3,600 + 400 = 4,000 tokens
│
└── TOTAL: 14,300 tokens (9.5× single request)

Without optimization, a 5-phase workflow consumes nearly 10× the tokens of a single interaction.

Real-world cost impact

| Scenario | Unoptimized | Optimized | Savings |
|----------|-------------|-----------|---------|
| Simple 3-phase workflow | ~8,000 tokens | ~3,000 tokens | 62% |
| Complex 6-phase orchestration | ~45,000 tokens | ~12,000 tokens | 73% |
| Validation pipeline (10 articles) | ~200,000 tokens | ~40,000 tokens | 80% |
| Daily development workflow | ~500,000 tokens | ~100,000 tokens | 80% |

At typical API pricing, 80% savings translates to substantial cost reduction over time.

Beyond cost: accuracy degradation

Token optimization isn’t only about money—it’s also about maintaining model accuracy. As the context window grows, models experience context rot: a progressive loss of accuracy on earlier instructions. Research shows that at 32,000 tokens, accuracy can drop from 88% to 30%, even for capable models. Every optimization strategy below doesn’t just save cost—it also keeps the model more accurate by keeping the context window smaller. For the full definition and benchmarks, see Context rot: why context management is urgent in the information flow article.


📊 Token consumption anatomy

Before optimizing, understand where tokens are consumed in a typical prompt orchestration:

┌─────────────────────────────────────────────────────────────────────────┐
│                        CONTEXT WINDOW BREAKDOWN                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  SYSTEM CONTEXT (~2,000-5,000 tokens)                            │  │
│  │  ├── Agent definition (.agent.md)          ~800-1,500 tokens     │  │
│  │  ├── Instructions (.instructions.md)       ~500-2,000 tokens     │  │
│  │  └── Global instructions (copilot-instructions.md) ~500-1,000    │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  USER CONTEXT (~500-10,000 tokens)                               │  │
│  │  ├── Prompt file content                   ~300-2,000 tokens     │  │
│  │  ├── Included snippets (#file:...)         ~200-2,000 tokens     │  │
│  │  ├── User message                          ~50-500 tokens        │  │
│  │  └── Handoff context (from previous agent) ~0-5,000 tokens       │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  TOOL RESULTS (~0-50,000 tokens)                                 │  │
│  │  ├── File reads (read_file)                ~500-5,000 each       │  │
│  │  ├── Search results (semantic_search)      ~1,000-3,000          │  │
│  │  ├── Web fetches (fetch_webpage)           ~3,000-15,000         │  │
│  │  └── MCP tool results                      ~500-10,000           │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  CONVERSATION HISTORY (~0-100,000 tokens)                        │  │
│  │  └── Prior turns in multi-turn conversation                      │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Optimization opportunities by category

| Category | Typical % of Context | Optimization Strategy |
|----------|----------------------|-----------------------|
| System context | 5-15% | Instruction file pruning, agent streamlining |
| User context | 5-20% | Prompt compression, snippet elimination |
| Tool results | 20-60% | Targeted reads, result limits, deterministic tools |
| Conversation history | 10-50% | Progressive summarization, file-based isolation |

Key insight: Tool results and conversation history are the biggest optimization targets. Focus there first.
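Before applying any strategy, it helps to measure your own breakdown. Below is a minimal sketch using the tiktoken tokenizer; the file names and component labels are placeholders, and other providers' tokenizers will count slightly differently:

import tiktoken

# Approximate token accounting for the pieces that end up in one request.
# File names below are illustrative placeholders.
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family encoding

components = {
    "system context": open("agent.md", encoding="utf-8").read(),
    "user context": open("prompt.md", encoding="utf-8").read(),
    "tool results": open("tool_output.txt", encoding="utf-8").read(),
}

total = 0
for name, text in components.items():
    count = len(enc.encode(text))
    total += count
    print(f"{name:>20}: {count:,} tokens")

print(f"{'TOTAL':>20}: {total:,} tokens")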


✂️ Strategy 1: Context reduction

Context reduction minimizes what enters the context window in the first place—the most direct path to token savings.

Technique 1.1: Targeted file reads

Wasteful pattern:

Read the entire file for context.

This often results in reading 500+ lines when only 20 lines are relevant.

Efficient pattern:

Read lines 45-65 of the configuration file where the validation logic is defined.

Implementation in prompt files:

---
name: targeted-review
description: "Review specific section with minimal context"
tools: ['read_file']
---

## Process

1. Read ONLY the specific section:
   - For class definitions: read class header + target method (20-50 lines)
   - For configuration: read only relevant section
   - Use grep_search first to identify exact line ranges

2. **NEVER read entire files unless explicitly required**

Token savings example

| Approach | Tokens Consumed | Savings |
|----------|-----------------|---------|
| Read entire 500-line file | ~5,000 tokens | — |
| Read targeted 50 lines | ~500 tokens | 90% |
| Use grep to find, then read 20 lines | ~200 tokens | 96% |
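The same pattern works as a small helper: locate the match first, then read only the surrounding lines. A sketch using the standard library; the ±10-line window is an arbitrary choice:

from pathlib import Path

def read_targeted(path: str, pattern: str, context: int = 10) -> str:
    """Return only the lines around the first match instead of the whole file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for i, line in enumerate(lines):
        if pattern in line:
            start, end = max(0, i - context), min(len(lines), i + context + 1)
            return "\n".join(lines[start:end])  # ~20 lines instead of 500+
    return ""  # no match: nothing enters the context window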

Technique 1.2: Search result limiting

Wasteful pattern:

Search the codebase for validation patterns.

This can return 20+ results, each consuming hundreds of tokens.

Efficient pattern:

Search for validation patterns, limit to 3 most relevant results.

In prompts:

---
tools: ['semantic_search', 'grep_search']
---

## Tool Usage Guidelines

When searching:
1. **Use grep_search first** for known patterns (cheaper, faster)
2. **Limit semantic_search results** to 3-5 maximum
3. **Be specific in queries** — "authentication middleware" not "middleware"

Technique 1.3: Progressive summarization

Instead of passing full conversation history between phases, compress it to essential data:

## Phase Completion Template

When completing this phase, produce a **PHASE SUMMARY** (max 200 tokens):

### Phase {N} Summary

**Decisions Made**:
- [Decision 1]: [Rationale]
- [Decision 2]: [Rationale]

**Critical Outputs**:
- [Output file/artifact]: [1-line description]

**For Next Phase**:
- [Specific instruction for successor]

---
<!-- Full details are in: [file reference] -->

Token savings from summarization

| Handoff Method | Tokens per Handoff | 5-Phase Total |
|----------------|--------------------|---------------|
| Full context (send: true) | ~3,000 tokens | ~15,000 tokens |
| Progressive summary | ~300 tokens | ~1,500 tokens |
| Savings | 90% | 90% |
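A handoff helper can enforce this compression automatically. The sketch below assumes the OpenAI Python SDK and a small model; the 200-token budget mirrors the template above, and phase_transcript is a placeholder for the full phase output:

from openai import OpenAI

client = OpenAI()

def summarize_phase(phase_transcript: str, phase_number: int) -> str:
    """Compress a full phase transcript into a ~200-token handoff summary."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # compression does not need the large model
        max_tokens=250,       # hard cap keeps the handoff small
        messages=[
            {"role": "system", "content": (
                "Summarize the phase transcript in under 200 tokens using the "
                "sections: Decisions Made, Critical Outputs, For Next Phase."
            )},
            {"role": "user", "content": f"Phase {phase_number} transcript:\n{phase_transcript}"},
        ],
    )
    return response.choices[0].message.content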

Technique 1.4: Instruction file pruning

Instruction files apply automatically based on applyTo patterns. Overly broad patterns waste tokens:

Wasteful:

---
applyTo: "**/*"   # Applies to EVERY file
---
# 200 lines of React component guidelines...

Targeted:

---
applyTo: "**/*.tsx,**/*.jsx"   # Only React files
---

Audit your instruction files:

# Find instruction files and their patterns
Get-ChildItem -Path ".github/instructions" -Filter "*.instructions.md" | 
    ForEach-Object { 
        $content = Get-Content $_.FullName -Raw
        if ($content -match 'applyTo:\s*"([^"]+)"') {
            [PSCustomObject]@{
                File = $_.Name
                Pattern = $matches[1]
            }
        }
    }

💾 Strategy 2: Provider prompt caching

Provider prompt caching is automatic caching offered by LLM providers for repeated prompt prefixes. Unlike semantic caching, it’s exact-match based and built into the API.

How provider caching works

┌─────────────────────────────────────────────────────────────────────────┐
│                    PROVIDER PROMPT CACHING                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Request 1:                                                             │
│  ┌─────────────────────────────────────────────────┬──────────────────┐ │
│  │ Static Prefix (cached after 1024+ tokens)       │ Dynamic Content  │ │
│  │ • System instructions                           │ • User query     │ │
│  │ • Few-shot examples                             │ • Context        │ │
│  │ • Tool definitions                              │                  │ │
│  └─────────────────────────────────────────────────┴──────────────────┘ │
│            ▲                                                            │
│            │ Cache write (1.25× base price for Anthropic)               │
│                                                                         │
│  Request 2 (same prefix):                                               │
│  ┌─────────────────────────────────────────────────┬──────────────────┐ │
│  │ Static Prefix (CACHE HIT - 0.1× base price)     │ New Dynamic      │ │
│  │                                                 │ Content          │ │
│  └─────────────────────────────────────────────────┴──────────────────┘ │
│                                                                         │
│  SAVINGS: 50-90% on cached prefix tokens                                │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Provider comparison

| Provider | Minimum Tokens | Cache Duration | Cache Read Cost | Cache Write Cost |
|----------|----------------|----------------|-----------------|------------------|
| OpenAI | 1,024 tokens | 5-60 min (in-memory) or 24h (extended) | 50% discount | No extra cost |
| Anthropic | 1,024-4,096 tokens (model-dependent) | 5 min default, 1h optional | 90% discount | 25% premium |
| Google | Varies | Context caching available | Reduced cost | Initial write cost |

📝 Anthropic model-specific minimums:

  • Claude Sonnet 4/4.5, Opus 4/4.1, Opus 4.6: 1,024 tokens minimum
  • Claude Haiku 3.5, Opus 4.5: 4,096 tokens minimum
  • Claude Haiku 3: 2,048 tokens minimum

1-hour cache TTL: For agent workflows or conversations where follow-up prompts may exceed 5 minutes, use "ttl": "1h" in the cache_control block (2× base write price instead of 1.25×).
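In the Anthropic SDK this is expressed by attaching cache_control to the static system block. A minimal sketch; LONG_STABLE_INSTRUCTIONS is a placeholder for 1,024+ tokens of stable content, and depending on SDK version the 1-hour TTL may require the extended-TTL beta header:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,  # static prefix, written to the cache once
            "cache_control": {"type": "ephemeral", "ttl": "1h"},  # 1-hour TTL for slow workflows
        }
    ],
    messages=[{"role": "user", "content": "Review this pull request..."}],
)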

⚠️ Cache invalidation: The following changes invalidate cached content:

  • Tool definitions: Modifying names, descriptions, or parameters invalidates the entire cache
  • Enabling/disabling features: Web search, citations, or thinking toggles modify system prompts
  • Images: Adding or removing images anywhere in the prompt
  • Content edits: Any changes to cached prefix content require re-caching

Optimization: Structure prompts for caching

The key principle: Static content first, dynamic content last.

Cache-unfriendly structure:

## User Request
{{dynamic_user_input}}

## Instructions
[Static rules that could be cached but won't be]

Cache-friendly structure:

## Identity
You are a senior code reviewer specializing in security analysis.

## Instructions
1. Check for injection vulnerabilities
2. Validate input sanitization
[... 500+ tokens of stable instructions ...]

## Examples
[... few-shot examples that rarely change ...]

## User Request
{{dynamic_content}} <!-- Only this part isn't cached -->

Token savings from provider caching

| Scenario | Without Caching | With Caching | Savings |
|----------|-----------------|--------------|---------|
| First request (1,500 token prefix) | 1,500 tokens at full price | 1,500 tokens at 1.25× (Anthropic) | -25% (write cost) |
| Subsequent requests (same prefix) | 1,500 tokens at full price | 1,500 tokens at 0.1× (Anthropic) | 90% |
| 10 requests with same prefix | 15,000 tokens | 1,875 + 1,350 = 3,225 tokens | 78% |

🧠 Strategy 3: Semantic caching

Semantic caching stores and retrieves results for semantically similar queries—not just exact matches. This is powerful for:

  • Repeated questions with slight variations
  • Common patterns across projects
  • Expensive operations (web fetches, large analyses)

How semantic caching works

┌─────────────────────────────────────────────────────────────────────────┐
│                    SEMANTIC CACHE ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  User Query: "How do I authenticate users in Azure AD?"                 │
│                           │                                             │
│                           ▼                                             │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  1. EMBEDDING GENERATION                                         │   │
│  │     Convert query → vector embedding [0.23, -0.15, 0.87, ...]   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                           │                                             │
│                           ▼                                             │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  2. SIMILARITY SEARCH                                            │   │
│  │     Find cached queries with embedding distance < threshold      │   │
│  │     • "Azure AD user authentication" (0.92 similarity) ✅        │   │
│  │     • "Set up OAuth in Azure" (0.78 similarity) ⚠️               │   │
│  │     • "Install Azure CLI" (0.31 similarity) ❌                   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                           │                                             │
│           ┌───────────────┴───────────────┐                            │
│           │                               │                            │
│           ▼                               ▼                            │
│  ┌─────────────────┐            ┌─────────────────┐                    │
│  │  CACHE HIT      │            │  CACHE MISS     │                    │
│  │  Return cached  │            │  Call LLM API   │                    │
│  │  response       │            │  Store result   │                    │
│  │  (0 tokens)     │            │  in cache       │                    │
│  └─────────────────┘            └─────────────────┘                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Implementation: GPTCache

GPTCache is an open-source semantic caching library:

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize semantic cache: configure an embedding function, vector store, and
# similarity evaluation so matching is semantic rather than exact-string
onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# Queries with similar embeddings hit the cache
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "How to authenticate Azure AD?"}]
)

# Second query with similar meaning → cache hit, no API call
response2 = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Azure AD authentication setup"}]
)

When semantic caching helps vs. hurts

| ✅ Good Use Cases | ❌ Poor Use Cases |
|-------------------|-------------------|
| Repeated documentation queries | Unique, context-specific questions |
| Common coding patterns | Code with dynamic requirements |
| FAQ-style interactions | Creative/generative tasks |
| Reference lookups | Tasks requiring current state |
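The mechanism itself is small enough to sketch without a dedicated library. A minimal in-memory version, assuming the OpenAI embeddings API and an arbitrary 0.90 similarity threshold:

import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response)

def _embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def cached_answer(query: str, threshold: float = 0.90) -> str:
    vec = _embed(query)
    for stored_vec, stored_response in _cache:
        similarity = float(np.dot(vec, stored_vec)
                           / (np.linalg.norm(vec) * np.linalg.norm(stored_vec)))
        if similarity >= threshold:
            return stored_response  # cache hit: zero completion tokens
    # Cache miss: call the model and store the result for similar future queries
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content
    _cache.append((vec, reply))
    return reply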

🎚️ Strategy 4: Model selection

Model selection uses smaller, faster models for simpler tasks—matching model capability to task complexity.

The model selection principle

From OpenAI’s latency optimization guide:

“Smaller models usually run faster (and cheaper), and when used correctly can even outperform larger models.”

When to use which model size

| Task Complexity | Recommended Model | Token Cost |
|-----------------|-------------------|------------|
| Simple classification | GPT-3.5, Claude Haiku | 10-20× cheaper |
| Structured extraction | GPT-4o-mini, Haiku | 5-10× cheaper |
| Code completion | Fine-tuned smaller model | Variable |
| Complex reasoning | GPT-4, Claude Sonnet/Opus | Full price |
| Creative/nuanced writing | GPT-4, Claude Opus | Full price |

Implementation: Task routing

# orchestrator.agent.md
---
name: task-router
description: "Route tasks to appropriate model size"
---

## Task Routing Rules

### Use SMALLER model (GPT-3.5 / Claude Haiku) for:
- Simple classification (sentiment, category)
- Structured data extraction
- Reformatting/transformation
- Simple Q&A with clear answers

### Use LARGER model (GPT-4 / Claude Sonnet) for:
- Complex reasoning chains
- Nuanced analysis
- Creative content generation
- Multi-step problem solving

Practical example: Split prompt by complexity

From OpenAI’s latency guide—split a single GPT-4 prompt into two:

┌─────────────────────────────────────────────────────────────────┐
│  BEFORE: Single GPT-4 request                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ GPT-4: Reasoning + Classification + Response Generation   │  │
│  │ Cost: $$$                                                  │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  AFTER: Split by complexity                                     │
│  ┌─────────────────────────┐   ┌─────────────────────────────┐  │
│  │ GPT-3.5: Classification │ → │ GPT-4: Response Generation  │  │
│  │ (cheap, fast)           │   │ (only when needed)          │  │
│  │ Cost: $                 │   │ Cost: $$                    │  │
│  └─────────────────────────┘   └─────────────────────────────┘  │
│  TOTAL: $$ (vs $$$ before)                                      │
└─────────────────────────────────────────────────────────────────┘
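A sketch of that split with the OpenAI Python SDK; the one-word triage prompt and the rule that only COMPLEX queries reach the larger model are illustrative choices, not a fixed recipe:

from openai import OpenAI

client = OpenAI()

def answer(query: str) -> str:
    # Step 1: cheap triage with a small model
    triage = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=5,
        messages=[{"role": "user", "content":
                   f"Classify this request as SIMPLE or COMPLEX. Reply with one word.\n\n{query}"}],
    ).choices[0].message.content.strip().upper()

    # Step 2: only complex requests pay for the larger model
    model = "gpt-4o" if "COMPLEX" in triage else "gpt-4o-mini"
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content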

Token savings from model selection

| Approach | Cost per 1M tokens | Savings |
|----------|--------------------|---------|
| All GPT-4 | ~$30 | — |
| GPT-3.5 for 70% of tasks | ~$10 | 67% |
| Fine-tuned GPT-3.5 | ~$8 | 73% |

📦 Strategy 5: Batch processing

Batch processing submits multiple requests together for asynchronous processing, receiving a 50% cost discount from both OpenAI and Anthropic.

When to use batch processing

| ✅ Good Use Cases | ❌ Poor Use Cases |
|-------------------|-------------------|
| Bulk content analysis | Real-time chat interactions |
| Large-scale evaluations | Interactive coding assistance |
| Dataset classification | Time-sensitive operations |
| Content moderation pipelines | User-facing latency-sensitive features |

Provider batch APIs

| Provider | Discount | Max Batch Size | Completion Time |
|----------|----------|----------------|-----------------|
| OpenAI | 50% | 50,000 requests or 200MB | Within 24 hours |
| Anthropic | 50% | 100,000 requests or 256MB | Within 24 hours (usually <1h) |

Implementation example (Anthropic)

import time

import anthropic

client = anthropic.Anthropic()

# Create batch with multiple requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "article-1",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Analyze this article..."}
                ]
            }
        },
        {
            "custom_id": "article-2",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Analyze this article..."}
                ]
            }
        }
        # ... up to 100,000 requests
    ]
)

# Poll for completion
while batch.processing_status != "ended":
    batch = client.messages.batches.retrieve(batch.id)
    time.sleep(60)

# Retrieve results
results = client.messages.batches.results(batch.id)

Combining batch processing with prompt caching

Batch processing and prompt caching discounts stack:

⚠️ Important: Cache hits in batch processing are best-effort due to asynchronous, concurrent processing. Anthropic reports typical cache hit rates of 30-98% depending on traffic patterns. To maximize hits: include identical cache_control blocks in every request, maintain steady request flow, and consider the 1-hour cache TTL for batch workloads.

| Optimization | Discount | Combined |
|--------------|----------|----------|
| Batch only | 50% | 50% |
| Cache read only | 90% | 90% |
| Batch + Cache read | 50% + 90% | 95% |
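To actually stack the discounts, every request in the batch should carry an identical cache_control-marked prefix. A sketch building on the earlier examples; LONG_STABLE_INSTRUCTIONS is again a placeholder for the shared static content:

import anthropic

client = anthropic.Anthropic()

shared_system = [{
    "type": "text",
    "text": LONG_STABLE_INSTRUCTIONS,  # identical in every request to maximize cache hits
    "cache_control": {"type": "ephemeral", "ttl": "1h"},  # 1-hour TTL suits batch latency
}]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"article-{i}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "system": shared_system,  # cached after the first write
                "messages": [{"role": "user", "content": f"Analyze article {i}..."}],
            },
        }
        for i in range(1, 1001)
    ]
)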

Token savings from batch processing

| Scenario | Standard API | Batch API | Savings |
|----------|--------------|-----------|---------|
| 1,000 article validations | $30 | $15 | 50% |
| With prompt caching | $15 | $7.50 | 75% (combined) |

🔗 Strategy 6: Request consolidation

Request consolidation combines multiple sequential LLM calls into a single request, eliminating round-trip latency and reducing total tokens.

The consolidation principle

From OpenAI’s latency guide:

“Each time you make a request you incur some round-trip latency. If you have sequential steps for the LLM to perform, instead of firing off one request per step consider putting them in a single prompt.”

Before vs. After consolidation

Before: Sequential requests

Request 1: "Classify this text"           → 200 tokens
Request 2: "Extract entities"             → 300 tokens  
Request 3: "Summarize findings"           → 400 tokens
                                          ─────────────
TOTAL: 3 requests, ~900 tokens + 3× round-trip latency

After: Consolidated request

Single Request: 
"Perform the following in sequence:
1. Classify this text
2. Extract entities
3. Summarize findings

Return results as JSON with keys: classification, entities, summary"

TOTAL: 1 request, ~600 tokens + 1× round-trip latency
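As an API call, the consolidated version is a single structured-output request. A sketch assuming the OpenAI Python SDK and JSON mode:

import json
from openai import OpenAI

client = OpenAI()

def analyze(text: str) -> dict:
    # One request performs classification, extraction, and summarization together
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": (
            "Perform the following in sequence and return JSON with keys "
            "classification, entities, summary:\n"
            "1. Classify this text\n"
            "2. Extract entities\n"
            "3. Summarize findings\n\n" + text
        )}],
    )
    return json.loads(response.choices[0].message.content)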

Implementation pattern

---
name: consolidated-analysis
description: "Multiple analysis steps in one request"
---

## Task

Perform ALL of the following analyses on the provided content:

1. **Classification**: Categorize the content type
2. **Entity Extraction**: Identify key entities (people, places, concepts)
3. **Sentiment Analysis**: Determine overall sentiment
4. **Summary**: Create a 2-sentence summary

## Output Format

Return a single JSON object:
{
  "classification": "...",
  "entities": [...],
  "sentiment": "positive|neutral|negative",
  "summary": "..."
}

Token savings from consolidation

| Approach | Requests | Tokens | Savings |
|----------|----------|--------|---------|
| Sequential (3 requests) | 3 | ~900 | — |
| Consolidated (1 request) | 1 | ~600 | 33% tokens + 66% latency |

📉 Strategy 7: Output token reduction

Output token reduction generates fewer response tokens—often the highest-latency step in LLM processing.

The output token principle

From OpenAI’s latency guide:

“Generating tokens is almost always the highest latency step when using an LLM: cutting 50% of your output tokens may cut ~50% your latency.”

Techniques for reducing output tokens

Technique 7.1: Explicit brevity instructions

## Response Guidelines

- Keep responses under 100 words
- Use bullet points, not paragraphs
- Omit pleasantries and preamble
- Answer directly, then stop

Technique 7.2: Shortened JSON field names

Verbose output (~120 tokens):

{
  "message_is_conversation_continuation": "True",
  "number_of_messages_in_conversation_so_far": "5",
  "user_sentiment": "Frustrated",
  "query_type": "Technical Support",
  "response_requirements": "Provide step-by-step troubleshooting"
}

Compact output (~50 tokens):

{
  "cont": true,
  "n_msg": 5,
  "tone": "frustrated",
  "type": "tech_support",
  "reqs": "troubleshoot_steps"
}

Technique 7.3: Use max_tokens and stop sequences

response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    max_tokens=500,  # Hard limit on output
    stop=["\n\n", "---"]  # Stop at section breaks
)

Token savings from output reduction

| Technique | Before | After | Savings |
|-----------|--------|-------|---------|
| Brevity instructions | ~500 tokens | ~150 tokens | 70% |
| Shortened field names | ~120 tokens | ~50 tokens | 58% |
| max_tokens limit | Variable | Capped | Predictable |

⚙️ Strategy 8: Deterministic tools

Deterministic tools bypass AI entirely for operations that don’t require intelligence—they run orders of magnitude faster and consume zero tokens.

When to use deterministic tools

| Operation Type | Use AI? | Use Deterministic Tool? |
|----------------|---------|--------------------------|
| Parse YAML frontmatter | ❌ | ✅ Regex/parser |
| Check if file exists | ❌ | ✅ File system check |
| Validate JSON schema | ❌ | ✅ JSON Schema validator |
| Count lines matching pattern | ❌ | ✅ grep + wc |
| Calculate hash/checksum | ❌ | ✅ Hash function |
| Compare timestamps | ❌ | ✅ Date comparison |
| Analyze code semantics | ✅ | ❌ |
| Generate creative content | ✅ | ❌ |
| Make judgment calls | ✅ | ❌ |

Implementation: MCP server with deterministic tools

The key insight: wrap deterministic operations in MCP tools so agents can invoke them without AI processing.

Example: Validation cache check (deterministic)

// IQPilot MCP Server - CheckValidationCache tool
[McpTool("check_validation_cache")]
public async Task<CacheCheckResult> CheckValidationCache(
    string filePath,
    string validationType,
    int cacheDays = 7)
{
    // Pure file parsing - no AI involved
    var metadata = await ParseBottomMetadata(filePath);
    
    if (metadata.Validations.TryGetValue(validationType, out var validation))
    {
        var lastRun = validation.LastRun;
        var daysSinceRun = (DateTime.UtcNow - lastRun).TotalDays;
        
        if (daysSinceRun < cacheDays)
        {
            return new CacheCheckResult
            {
                IsCached = true,
                CachedResult = validation.Outcome,
                DaysSinceRun = daysSinceRun,
                Message = $"Cache valid. Last run: {lastRun:yyyy-MM-dd}"
            };
        }
    }
    
    return new CacheCheckResult
    {
        IsCached = false,
        Message = "No valid cache found. Run validation."
    };
}

Agent prompt using deterministic tool:

---
name: grammar-review-cached
tools: ['check_validation_cache', 'read_file', 'replace_string_in_file']
---

## Process

### Phase 1: Cache Check (DETERMINISTIC - Zero tokens)

1. Call `check_validation_cache`:
   - filePath: target file
   - validationType: "grammar"
   - cacheDays: 7

2. **IF cache is valid** → Report cached result → EXIT
3. **IF no cache** → Proceed to Phase 2

### Phase 2: Grammar Validation (AI - Consumes tokens)

Only reached if cache miss. Perform full validation...

Token savings from deterministic cache checks

| Scenario | Without Deterministic Check | With Deterministic Check |
|----------|-----------------------------|---------------------------|
| Cache hit (70% of cases) | 3,000 tokens | 0 tokens |
| Cache miss (30% of cases) | 3,000 tokens | 3,000 tokens |
| Expected tokens per run | 3,000 tokens | 900 tokens |
| Savings | — | 70% |

Common deterministic operations to implement

| Operation | Implementation | Token Savings |
|-----------|----------------|---------------|
| Metadata parsing | YAML parser, regex | Avoid AI parsing docs |
| File existence check | File.Exists() | Avoid AI file searches |
| Link validation | HTTP HEAD request | Avoid AI fetching pages |
| Pattern matching | Regex engine | Avoid AI text analysis |
| Schema validation | JSON Schema validator | Avoid AI structure checks |
| Diff computation | Git diff | Avoid AI comparison |
| Timestamp comparison | Date arithmetic | Avoid AI date reasoning |
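Most of these operations need nothing beyond the standard library. A sketch of three of them in Python; the paths and URLs are placeholders, and none of this consumes a single token:

import hashlib
import urllib.request
from pathlib import Path

def file_exists(path: str) -> bool:
    return Path(path).exists()  # no AI file search needed

def content_hash(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def link_is_alive(url: str, timeout: float = 5.0) -> bool:
    # HTTP HEAD request instead of asking a model to "check" the link
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except Exception:
        return False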

⚡ Strategy 9: Streaming and parallelization

Streaming and parallelization don’t reduce actual token costs but dramatically improve perceived latency and throughput. These techniques are essential for production-quality user experiences.

Streaming: Progressive output delivery

Streaming delivers tokens as they’re generated instead of waiting for the complete response:

┌─────────────────────────────────────────────────────────────────────────┐
│                    STREAMING vs BATCH RESPONSE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  WITHOUT STREAMING:                                                     │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │ Request ─────────────────────────────────────────────▶ Response  │  │
│  │          [User waits 3-5 seconds seeing nothing...]              │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  WITH STREAMING:                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │ Request ──▶ Here ──▶ is ──▶ the ──▶ response ──▶ ...             │  │
│  │         [~100ms] [~100ms] [~100ms] [~100ms]                       │  │
│  │         [User sees progress immediately]                          │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  PERCEIVED LATENCY: 100ms vs 3-5 seconds                               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

When to use streaming

| ✅ Good Use Cases | ❌ Poor Use Cases |
|-------------------|-------------------|
| User-facing chat interfaces | Background batch processing |
| Code generation with preview | API integrations requiring complete response |
| Documentation generation | Validation pipelines |
| Interactive debugging | Structured JSON output parsing |
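Enabling streaming is usually a one-flag change. A minimal sketch with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    stream=True,  # deliver tokens as they are generated
    messages=[{"role": "user", "content": "Explain prompt caching briefly."}],
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # first tokens appear almost immediately
print()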

Parallelization: Concurrent independent tasks

When tasks don’t depend on each other, process them concurrently rather than sequentially:

┌─────────────────────────────────────────────────────────────────────────┐
│                    SEQUENTIAL vs PARALLEL EXECUTION                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  SEQUENTIAL (Bad):                                                      │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐                 │
│  │ Task 1  │──▶│ Task 2  │──▶│ Task 3  │──▶│ Task 4  │                 │
│  │  2 sec  │   │  2 sec  │   │  2 sec  │   │  2 sec  │                 │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘                 │
│  Total time: 8 seconds                                                  │
│                                                                         │
│  PARALLEL (Good - when tasks are independent):                          │
│  ┌─────────┐                                                            │
│  │ Task 1  │                                                            │
│  │  2 sec  │                                                            │
│  ├─────────┤                                                            │
│  │ Task 2  │     All complete at ~2 seconds                            │
│  │  2 sec  │                                                            │
│  ├─────────┤                                                            │
│  │ Task 3  │                                                            │
│  │  2 sec  │                                                            │
│  ├─────────┤                                                            │
│  │ Task 4  │                                                            │
│  │  2 sec  │                                                            │
│  └─────────┘                                                            │
│  Total time: ~2 seconds (4× faster)                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Implementation: Parallel validation

// Validate multiple articles in parallel
public async Task<ValidationReport[]> ValidateArticles(string[] articlePaths)
{
    var validationTasks = articlePaths.Select(async path =>
    {
        var content = await File.ReadAllTextAsync(path);
        
        // These validations can run in parallel
        var grammarTask = ValidateGrammarAsync(content);
        var readabilityTask = ValidateReadabilityAsync(content);
        var structureTask = ValidateStructureAsync(content);
        var linksTask = ValidateLinksAsync(content);
        
        await Task.WhenAll(grammarTask, readabilityTask, structureTask, linksTask);
        
        return new ValidationReport
        {
            Path = path,
            Grammar = await grammarTask,
            Readability = await readabilityTask,
            Structure = await structureTask,
            Links = await linksTask
        };
    });
    
    return await Task.WhenAll(validationTasks);
}

Speculative execution: Predicted outputs

OpenAI’s Predicted Outputs feature reduces latency when you know most of the expected output:

from openai import OpenAI

client = OpenAI()

# When editing code, most of the file stays the same
code = """
def calculate_total(items):
    total = 0
    for item in items:
        total += item.price
    return total
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Add tax calculation to this function:\n{code}"
    }],
    prediction={
        "type": "content",
        "content": code  # Most of this content will be in output
    }
)

# Latency reduced by ~2-3× because unchanged tokens 
# are confirmed rather than regenerated

When speculative execution helps

| Scenario | Latency Reduction | Best Model |
|----------|-------------------|------------|
| Code refactoring (small changes) | 2-3× | gpt-4o, gpt-4o-mini |
| Document editing | — | gpt-4o |
| Template completion | 3-4× | gpt-4o-mini |
| Translation with preserved formatting | — | gpt-4o |

Combined latency optimization

A well-optimized user-facing workflow combines these techniques:

┌─────────────────────────────────────────────────────────────────────────┐
│                    OPTIMAL LATENCY ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  User Request                                                           │
│       │                                                                 │
│       ▼                                                                 │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ PARALLEL SETUP (concurrent)                                      │   │
│  │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐              │   │
│  │ │ Check cache  │ │ Fetch context│ │ Load templates│              │   │
│  │ └──────────────┘ └──────────────┘ └──────────────┘              │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                 │
│       ▼                                                                 │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ STREAMING RESPONSE                                               │   │
│  │ User sees tokens appear immediately (~100ms first token)         │   │
│  │                                                                   │   │
│  │ If editing existing content: use PREDICTED OUTPUTS               │   │
│  │ for 2-3× additional latency reduction                            │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                 │
│       ▼                                                                 │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ PARALLEL POST-PROCESSING (while user sees response)             │   │
│  │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐              │   │
│  │ │ Update cache │ │ Log metrics  │ │ Pre-warm next│              │   │
│  │ └──────────────┘ └──────────────┘ └──────────────┘              │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  RESULT: Sub-second perceived latency                                  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

📈 Strategy comparison matrix

| Strategy | Token Savings | Implementation Effort | Best For | Limitations |
|----------|---------------|-----------------------|----------|-------------|
| 1. Context Reduction | 30-90% | Low | All workflows | Requires discipline |
| 2. Provider Caching | 50-90% (stable prefixes) | Low (structure prompts correctly) | High-volume, consistent prompts | Minimum token requirements |
| 3. Semantic Caching | 50-80% (high hit scenarios) | High (embedding infrastructure) | Repeated queries, documentation | Stale data risk, false positives |
| 4. Model Selection | 60-80% | Low | Simple tasks, high volume | May reduce quality for complex tasks |
| 5. Batch Processing | 50% cost | Low | Non-urgent, high volume | 24h latency |
| 6. Request Consolidation | 40-60% | Medium | Multi-step pipelines | Increased prompt complexity |
| 7. Output Reduction | 30-50% | Low | Verbose outputs | May lose useful detail |
| 8. Deterministic Tools | 70-100% | Medium (MCP development) | Cache checks, validation, file ops | Limited to predictable operations |
| 9. Streaming/Parallelization | Latency only | Low-Medium | User-facing, independent tasks | No token savings |

Strategy selection flowchart

┌─────────────────────────────────────────────────────────────────────────┐
│                    OPTIMIZATION STRATEGY SELECTION                       │
└─────────────────────────────────────────────────────────────────────────┘

Is the operation predictable/deterministic?
├── YES → Use DETERMINISTIC TOOL (zero tokens)
│         Examples: cache checks, file existence, schema validation
│
└── NO → Is it a repeated query pattern?
         ├── YES → Is the prefix stable?
         │         ├── YES → Use PROVIDER CACHING (90% savings)
         │         │         Structure: static first, dynamic last
         │         │
         │         └── NO → Consider SEMANTIC CACHING (50-80% savings)
         │                  If query variations are semantically similar
         │
         └── NO → Is it high volume, non-urgent?
                  ├── YES → Use BATCH PROCESSING (50% discount)
                  │
                  └── NO → Is output verbose?
                           ├── YES → Use OUTPUT REDUCTION (30-50% savings)
                           │
                           └── NO → Use CONTEXT REDUCTION (30-90% savings)
                                    • Targeted file reads
                                    • Search result limits
                                    • Progressive summarization

Combined optimization example

A well-optimized 5-phase workflow uses multiple strategies in combination:

Phase 1: Research
├── MODEL SELECTION: Use GPT-4o-mini for initial search
├── CONTEXT REDUCTION: Limit search to 5 results
├── PROVIDER CACHING: Stable research prompt prefix
└── Expected: 70% savings

Phase 2: Cache Check
├── DETERMINISTIC TOOL: Check existing validation cache
├── If cache hit: Skip remaining phases
└── Expected: 70% of runs skip AI entirely

Phase 3: Analysis (cache miss only)
├── CONTEXT REDUCTION: Read only relevant sections
├── PROVIDER CACHING: Stable analysis prompt
├── REQUEST CONSOLIDATION: Combine grammar + readability checks
└── Expected: 60% savings

Phase 4: Generation
├── CONTEXT REDUCTION: Progressive summary from Phase 3
├── SEMANTIC CACHING: Cache common generation patterns
├── OUTPUT REDUCTION: Structured JSON output only
└── Expected: 50% savings

Phase 5: Validation
├── DETERMINISTIC TOOL: Schema validation, link checks
├── BATCH PROCESSING: Queue non-urgent validations
├── Only AI for: Grammar, readability, semantic checks
└── Expected: 70% of checks bypass AI

CUMULATIVE SAVINGS: 75-90%

🔧 Implementation patterns

Pattern 1: Validation pipeline with caching

# validation-pipeline.prompt.md
---
name: validation-pipeline
description: "Multi-validation with deterministic cache checks"
tools: ['check_validation_cache', 'run_grammar_check', 'run_readability_check']
---

## Process

### Step 1: Batch Cache Check (DETERMINISTIC)

For each validation type (grammar, readability, structure, fact-check):
1. Call `check_validation_cache(file, type, days=7)`
2. Record which validations need running

### Step 2: Run Only Missing Validations (AI)

For each validation NOT in cache:
1. Run appropriate validation prompt
2. Store result in metadata cache

### Step 3: Aggregate Results

Combine cached + fresh results into unified report.

Pattern 2: Research with semantic caching

# Research pattern with semantic cache
async def research_topic(topic: str, cache: SemanticCache):
    # Check semantic cache first
    cached = await cache.get_similar(topic)
    if cached and cached.similarity > 0.90:
        return cached.response
    
    # Cache miss - perform research
    results = await perform_research(topic)
    
    # Store for future similar queries
    await cache.store(topic, results)
    
    return results

Pattern 3: Progressive summarization handoff

# builder.agent.md
---
name: builder
handoffs:
  - label: "Validate Result"
    agent: validator
    send: false    # Don't send full context
    prompt: |
      **Summary from Builder:**
      {{PHASE_SUMMARY}}
      
      **Artifact location:** {{OUTPUT_FILE}}
      
      Validate the created artifact.
---

## Phase Completion Instructions

Before any handoff, produce a PHASE_SUMMARY (max 200 tokens):

1. Decisions made (bullet list)
2. Artifacts created (file paths)
3. Key constraints applied
4. Specific validation needs

Store full details in output file for reference if needed.

⚠️ Common pitfalls

Pitfall 1: Over-caching dynamic content

Wrong: Caching responses that depend on current file state

# DON'T cache file-dependent analyses
cache.store(
    "analyze security of auth.py",  # Query seems cacheable...
    analysis_result  # But result depends on file content!
)

Right: Include content hash in cache key

import hashlib

content_hash = hashlib.md5(file_content.encode()).hexdigest()
cache.store(
    f"analyze security of auth.py:{content_hash}",
    analysis_result
)

Pitfall 2: Cache key collisions

Wrong: Overly broad cache keys

cache.store("validate article", result)  # Which article?

Right: Include all relevant context in key

cache.store(f"validate:{file_path}:{validation_type}:{content_hash}", result)

Pitfall 3: Ignoring cache write costs

For Anthropic, cache writes cost 25% more than regular input tokens.

Wrong: Caching tiny prefixes that are rarely reused

Right: Only cache prefixes that will be reused within the cache lifetime (ideally several times)

Break-even calculation (Anthropic):
- Cache write: 1.25× base cost
- Cache read: 0.1× base cost

A single cache hit already offsets the write premium:
  1.25 (write) + 0.1 (read) = 1.35
  vs. 1.0 + 1.0 = 2.0 without caching

Savings start at the 2nd use (first reuse) and grow with every additional hit, provided reuse happens before the cache expires.

Pitfall 4: Placing dynamic content before static

Wrong: User input first

## User Request: {{input}}

## Instructions (static)
[These won't be cached because they come after dynamic content]

Right: Static first, dynamic last

## Instructions (static - cached)
[1,000+ tokens of stable content]

## User Request: {{input}}

🎯 Conclusion

Token optimization isn’t optional for production AI workflows—it’s the difference between sustainable and unsustainable costs.

Key takeaways

  1. Context reduction is the foundation: targeted reads, limited searches, progressive summarization
  2. Provider caching offers up to 90% savings—structure prompts with static content first
  3. Semantic caching captures similar queries—powerful for documentation and reference lookups
  4. Model selection and batch processing provide substantial cost reduction for high-volume workflows
  5. Request consolidation and output reduction minimize token counts per interaction
  6. Deterministic tools bypass AI entirely for predictable operations—cache checks, validation, file operations
  7. Streaming/parallelization improve perceived latency without changing token costs

Implementation priority

| Priority | Strategy | Expected Savings | Effort |
|----------|----------|------------------|--------|
| 1 | Context reduction (targeted reads) | 30-50% | Low |
| 2 | Provider caching (prompt structure) | 50-90% | Low |
| 3 | Model selection (right-size tasks) | 60-80% | Low |
| 4 | Batch processing (async high-volume) | 50% cost | Low |
| 5 | Output reduction (structured output) | 30-50% | Low |
| 6 | Request consolidation (combine steps) | 40-60% | Medium |
| 7 | Deterministic tools (cache checks) | 70%+ for cached ops | Medium |
| 8 | Semantic caching | 50-80% | High |
| 9 | Streaming/parallelization | Latency only | Low-Medium |

Next steps

  • Audit current prompts for context reduction opportunities
  • Restructure prompts for provider caching (static first)
  • Evaluate model selection for different task complexities
  • Identify batch opportunities for non-urgent high-volume tasks
  • Identify deterministic operations to move to MCP tools
  • Monitor token usage to validate savings

For information flow patterns between phases, see: How to Manage Information Flow During Prompt Orchestrations.


📚 References

Official Documentation

OpenAI Latency Optimization Guide 📘 [Official]
Seven principles for optimizing latency: process tokens faster, generate fewer tokens, use fewer input tokens, make fewer requests, parallelize, make users wait less, and don’t default to LLM.

OpenAI Prompt Caching Guide 📘 [Official]
Comprehensive guide to OpenAI’s automatic prompt caching, including requirements (1024+ tokens), cache duration, and best practices for structuring prompts to maximize cache hits.

Anthropic Prompt Caching Documentation 📘 [Official]
Comprehensive guide to Claude’s prompt caching with cache_control breakpoints, 90% read discount, 25% write premium, 5-minute default or 1-hour extended TTL, up to 4 cache breakpoints, and model-specific minimum token requirements (1,024-4,096 tokens).

OpenAI Batch API Guide 📘 [Official]
Documentation for OpenAI’s Batch API offering 50% cost discount for async processing with 24-hour completion window and up to 50,000 requests.

Anthropic Message Batches API 📘 [Official]
Documentation for Anthropic’s batch processing with 50% cost discount, 100,000 request limit (or 256 MB size limit), typically under 1-hour completion, and 29-day result retention.

OpenAI Predicted Outputs 📘 [Official]
Guide to using predicted outputs for speculative execution, reducing latency by 2-3× when output is largely known in advance.

Community Resources

GPTCache Documentation 📗 [Verified Community]
Open-source semantic caching library by Zilliz. Provides embedding-based similarity matching to cache LLM responses for similar queries, with support for multiple vector stores and embedding providers.

Internal References

Validation Caching Pattern 📘 [Internal]
Repository-specific implementation of the 7-day validation caching pattern using bottom metadata blocks.

Tool Composition Guide 📘 [Internal]
Optimization patterns for tool usage including “narrow before wide” and lazy loading strategies.

Context Engineering Principles 📘 [Internal]
Token budget guidelines and context window management principles.

Series Navigation

Previous: How to Manage Information Flow During Prompt Orchestrations
Series Index: The GitHub Copilot customization stack