Understanding LLM models and model selection
A "good generic prompt" doesn't exist; there is only a good prompt for a specific model. Different models have fundamentally different behaviors, strengths, and optimal prompting strategies. What works brilliantly with Claude may fail with GPT-4o; what excels with Gemini may confuse reasoning models.
GitHub Copilot gives you access to models from multiple providers: OpenAI, Anthropic, Google, and more through BYOK (bring-your-own-key). This article explains how these model families differ, what makes each one strong, how to choose the right model for a given task, and how the multi-model architecture enables advanced workflows.
Table of contents
- The compiler analogy: why models matter
- Model families and their characteristics
- Standard models vs. reasoning models
- Model-specific prompting strategies
- Multi-model architecture patterns
- BYOK: bring-your-own-key providers
- Model selection decision framework
- Key considerations
- Conclusion
- References
The compiler analogy: why models matter
Think of each model as a different compiler. The same "source code" (your prompt) produces different "executables" (responses) depending on which compiler processes it. Just as you wouldn't expect C++ code to compile identically on GCC and MSVC without adjustments, you shouldn't expect the same prompt to perform identically across GPT-4o, Claude, and Gemini.
What changes between models:
| Aspect | How it differs |
|---|---|
| Sensitivity to constraints | Some models follow explicit constraints rigidly; others interpret them flexibly |
| Ambiguity handling | Models differ in whether they ask for clarification or make assumptions |
| Response patterns | Default verbosity, formatting preferences, and structure vary |
| Token interpretation | Context window utilization, attention patterns, and recency bias differ |
| Chain of thought | Some benefit from explicit CoT prompting; others do it internally |
This means that every time you change model or version, you should re-validate your prompts against the new modelβs behavior.
Model families and their characteristics
GitHub Copilot provides access to models from three major providers, plus BYOK options:
OpenAI models
| Model | Context window | Best for | Key behavior |
|---|---|---|---|
| GPT-4o / GPT-4.1 | 128K | General tasks, code generation | Fast, balanced, highly steerable |
| GPT-5 / GPT-5.2 | 1M+ | Complex tasks, broad domains | Latest capabilities, vision support |
| o3 / o4-mini | 200K | Complex reasoning, planning | Internal chain of thought |
GPT models respond best to explicit instructions with developer messages (formerly system messages), few-shot examples, and clear Markdown/XML formatting. They're the "follow my instructions precisely" family.
Anthropic models
| Model | Context window | Best for | Key behavior |
|---|---|---|---|
| Claude Sonnet 4 | 200K | Long documents, nuanced analysis | Thoughtful, cautious, detailed |
| Claude Opus 4.6 | 200K | Frontier agentic tasks | Highest-capability, multi-step reasoning |
| Claude Extended Thinking | 200K | Complex STEM, constraint problems | Deep internal reasoning |
Claude models excel with clarity and context: clear XML-tagged structure, explicit context about your norms and preferences, and well-organized reference material. Think of Claude as a brilliant but new colleague who needs explicit context about your expectations.
Google models
| Model | Context window | Best for | Key behavior |
|---|---|---|---|
| Gemini 2.0 Flash | 1M+ | Fast inference, multimodal | Quick responses, visual reasoning |
| Gemini 3 | Varies | Advanced reasoning, agentic tasks | Strong instruction following |
Gemini models respond best to structured prompts with consistent formatting and clear organization. They often perform well with zero-shot prompts but benefit from few-shot examples when specific output formats are needed.
Capability comparison
The model picker in VS Code shows capability indicators for each model:
| Model | Context | Vision | Tools | Reasoning |
|---|---|---|---|---|
| Claude Sonnet 4 | 200K | ✓ | ✓ | – |
| GPT-4o | 128K | ✓ | ✓ | – |
| Claude Opus 4.6 | 200K | ✓ | ✓ | ✓ |
| o3 | 200K | – | ✓ | ✓ |
| Gemini 2.0 Flash | 1M+ | ✓ | ✓ | – |
| GPT-5 | 1M+ | ✓ | ✓ | ✓ |
Not all models support all capabilities. Vision (image understanding), tool calling (function invocation), and reasoning (internal chain of thought) are the three key capability dimensions. Your choice of model constrains what your agents can do.
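As a sketch, the capability filtering the model picker performs can be mimicked in plain Python. The model names and flags below are an illustrative subset taken from the table above, not an API:

```python
# Illustrative capability matrix mirroring the table above (subset of models).
MODELS = {
    "Claude Sonnet 4":  {"context": 200_000,   "vision": True,  "tools": True,  "reasoning": False},
    "GPT-4o":           {"context": 128_000,   "vision": True,  "tools": True,  "reasoning": False},
    "o3":               {"context": 200_000,   "vision": False, "tools": True,  "reasoning": True},
    "Gemini 2.0 Flash": {"context": 1_000_000, "vision": True,  "tools": True,  "reasoning": False},
}

def models_with(capability: str) -> list[str]:
    """Return the models that advertise a given capability."""
    return [name for name, caps in MODELS.items() if caps.get(capability)]

print(models_with("reasoning"))  # → ['o3']
```

If your agent needs vision or tool calling, this kind of filter is the first gate: a prompt that assumes image input simply cannot run on a model without the vision capability.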
Standard models vs. reasoning models
The most important conceptual division isn't between providers; it's between standard models and reasoning models.
Standard language models
GPT-4o, Claude Sonnet 4, Gemini 2.0 Flash: these models benefit from explicit, detailed instructions:
- Provide step-by-step guidance
- Use few-shot examples liberally
- Explicitly state constraints and output formats
- Use chain-of-thought prompting when reasoning is needed
Think of standard models as junior colleagues who need clear, detailed instructions.
Reasoning models
o3, o4-mini, Claude Extended Thinking: these models perform internal reasoning before responding:
- Give high-level goals, not step-by-step instructions
- Trust the model to work out the details
- Be specific about success criteria and constraints
- Don't include "think step by step"; they already do this internally
Think of reasoning models as senior colleagues who need goals, not instructions.
Side-by-side comparison
| Aspect | Standard models | Reasoning models |
|---|---|---|
| Instruction style | Detailed, step-by-step | High-level goals |
| Chain of thought | Must be prompted explicitly | Happens internally |
| "Think step by step" | Helpful | Unnecessary or harmful |
| Few-shot examples | Often required | Try zero-shot first |
| Constraints | Embedded in instructions | Specify success criteria |
| Speed | Fast | Slower (thinking time) |
| Cost | Lower per token | Higher per token |
| Best for | Well-defined tasks | Ambiguous, complex problems |
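The contrast in instruction style can be made concrete. A hedged sketch for the same hypothetical task, with prompt wording that is ours rather than from any official guide:

```python
TASK = "migrate the payment module from callbacks to async/await"

# Standard model: detailed, step-by-step instructions plus a small example.
standard_prompt = f"""You are a senior Python developer.
Task: {TASK}.
Steps:
1. List every callback-based function in the module.
2. Rewrite each one using async def / await.
3. Update all call sites.
Output format: a unified diff per file.
Example: `def fetch(cb)` becomes `async def fetch()`.
"""

# Reasoning model: the goal and the success criteria, nothing more.
reasoning_prompt = f"""Goal: {TASK}.
Success criteria: all tests pass, no callback-style APIs remain,
public function signatures stay backward compatible.
"""
```

Note that the reasoning-model prompt says nothing about *how* to do the migration; it constrains only the outcome.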
When to use each category
Standard models:
- Code generation with clear requirements
- Formatting and text transformation
- Following established patterns
- High-volume, latency-sensitive tasks
Reasoning models:
- Complex multi-step planning
- Ambiguous tasks requiring interpretation
- Large document analysis (needle in haystack)
- Nuanced decision-making with many factors
- Scientific and mathematical reasoning
Model-specific prompting strategies
Each model family has an optimal prompting style. Here's a conceptual overview:
GPT models: explicit instruction optimization
```markdown
# Identity
You are a [role] specializing in [domain].

# Instructions
* [Specific rule 1]
* [Specific rule 2]

# Examples
[Input] → [Output]

# Context
[Additional information]
```
Key techniques: developer messages for identity/rules, Markdown/XML formatting, few-shot examples, prompt caching optimization (static content first).
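In API terms, the identity and rules above typically travel in a developer (formerly system) message. A minimal sketch of the message layout, with placeholder content:

```python
# Message layout for a GPT-style chat request; all content is placeholder text.
messages = [
    {
        "role": "developer",  # identity and rules live here
        "content": "# Identity\nYou are a code reviewer.\n# Instructions\n* Be concise.",
    },
    # Few-shot example expressed as a user/assistant pair.
    {"role": "user", "content": "def add(a, b): return a + b"},
    {"role": "assistant", "content": "OK: pure function, no issues."},
    # The actual request comes last, keeping the static content above cacheable.
    {"role": "user", "content": "def div(a, b): return a / b"},
]
```

Putting the static parts (identity, rules, examples) first is what makes prompt caching effective: the prefix stays byte-identical across requests.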
Claude models: clarity and context optimization
```xml
<role>You are a technical documentation specialist.</role>
<context>You are reviewing API documentation.</context>
<instructions>
1. Check for completeness
2. Verify all parameters are documented
3. Flag missing error codes
</instructions>
<output_format>Markdown table</output_format>
```

Key techniques: XML tags for structure, explicit context about norms/preferences, chain-of-thought with tags for complex tasks, long-context with critical instructions at the beginning.
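Building such XML-tagged prompts programmatically keeps the structure consistent. A small helper sketch; the tag names follow the example above, the helper itself is ours:

```python
def tag(name: str, body: str) -> str:
    """Wrap body in a matching XML tag pair."""
    return f"<{name}>\n{body}\n</{name}>"

prompt = "\n".join([
    tag("role", "You are a technical documentation specialist."),
    tag("context", "You are reviewing API documentation."),
    tag("instructions", "1. Check for completeness\n2. Verify all parameters"),
    tag("output_format", "Markdown table"),
])
```

Because every open tag is generated with its close, malformed structure (a common source of Claude misreads) cannot slip in.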
Gemini models: structured prompting
Key techniques: consistent formatting (XML or Markdown headers, pick one and stay with it), zero-shot first then add examples if needed, completion patterns for format control, context anchoring after large blocks.
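As an illustration of the "pick one format and stay with it" advice, a Gemini-style prompt using Markdown headers throughout. The content is a made-up example, not from Google's guide:

```markdown
## Task
Summarize the changelog below.

## Constraints
- Maximum 5 bullet points
- Plain language, no jargon

## Output format
A Markdown bullet list.

## Changelog
[paste changelog here]
```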
Reasoning models: minimal guidance
Key techniques: high-level goals instead of steps, specify success criteria, reserve tokens for internal reasoning (at least 25K for o3/o4-mini, minimum 1024 budget for Claude Extended Thinking), trust the modelβs process.
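The token-reservation advice translates into request parameters. A hedged sketch; the parameter names follow the providers' public reasoning APIs (`reasoning_effort` for OpenAI o-series, `thinking.budget_tokens` for Claude extended thinking) but should be checked against current documentation:

```python
# Illustrative request fragments, expressed as plain dicts.

# OpenAI o-series: leave generous room for internal reasoning tokens,
# since reasoning and visible output share the completion budget.
openai_request = {
    "model": "o3",
    "reasoning_effort": "medium",     # low / medium / high
    "max_completion_tokens": 25_000,
}

# Claude extended thinking: an explicit thinking budget (minimum 1024 tokens).
anthropic_request = {
    "model": "claude-sonnet-4",
    "max_tokens": 16_000,
    "thinking": {"type": "enabled", "budget_tokens": 4_096},
}
```

Starving a reasoning model of thinking tokens tends to produce truncated or shallow answers, so err on the generous side.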
Multi-model architecture patterns
Production systems often benefit from using different models for different tasks within the same workflow. In Copilot, this is possible through the model field in prompt/agent YAML and through subagent delegation.
Pattern 1: planner + executors
```
User request
     │
     ▼
┌─────────────────────────┐
│  Reasoning model (o3)   │  ← analyzes the request, decomposes it into steps
│      "The planner"      │
└─────────────────────────┘
     │
     ├──────────────────────────┐
     ▼                          ▼
┌───────────────────┐  ┌───────────────────┐
│      GPT-4o       │  │  Claude Sonnet 4  │
│   Fast code gen   │  │ Long doc analysis │
└───────────────────┘  └───────────────────┘
```
A reasoning model handles the complex planning, then delegates execution to faster, cheaper standard models.
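A hedged sketch of the routing logic, with the model calls stubbed out (`call_model`, the hard-coded plan, and the model names are placeholders, not a real API):

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real model invocation."""
    return f"[{model}] {prompt}"

def plan(request: str) -> list[dict]:
    """The planner (a reasoning model) decomposes the request.
    The decomposition below is hard-coded for illustration."""
    call_model("o3", f"Decompose into steps: {request}")
    return [
        {"task": "generate migration script", "kind": "code"},
        {"task": "summarize the design doc", "kind": "long_doc"},
    ]

# Each task kind routes to a cheaper, faster executor model.
EXECUTORS = {"code": "gpt-4o", "long_doc": "claude-sonnet-4"}

def run(request: str) -> list[str]:
    return [call_model(EXECUTORS[step["kind"]], step["task"]) for step in plan(request)]

results = run("migrate billing service")
```

The expensive reasoning model is invoked once for planning; the per-step work fans out to standard models.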
Pattern 2: task-specific model selection
| Task type | Recommended model | Why |
|---|---|---|
| Agent orchestration | GPT-4o | Fast, balanced, reliable |
| Long document analysis | Claude Sonnet 4 | 200K context, strong comprehension |
| Complex reasoning | o3 / o4-mini | Internal chain of thought |
| Code generation | GPT-4o / Claude | Fast, accurate output |
| Multimodal (image + text) | Gemini 2.0 / GPT-4o | Strong vision capabilities |
| Evaluation / grading | o3 | Nuanced judgment, high accuracy |
| Agentic multi-step workflows | Claude Opus 4.6 | Highest agentic capability |
| Deep analysis, research | Claude Opus 4.6 | Multi-step reasoning |
Pattern 3: model-specific reviewers
Create dedicated review agents optimized for each modelβs prompting style:
```markdown
# openai-prompt-reviewer.agent.md
---
name: openai-prompt-reviewer
description: Reviews prompts for GPT model optimization
model: gpt-4o
tools: ['codebase', 'search']
---
```

Each reviewer checks that prompts follow the optimal patterns for its target model: developer message structure for GPT, XML tags for Claude, consistent formatting for Gemini.
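The checks such a reviewer applies can be sketched as simple lint rules. This toy example for a Claude-targeted reviewer uses heuristics of our own invention:

```python
def review_for_claude(prompt: str) -> list[str]:
    """Flag prompts that miss Claude-friendly conventions (illustrative checks only)."""
    issues = []
    if "<" not in prompt:
        issues.append("no XML tags: Claude prompts benefit from tagged structure")
    if len(prompt) < 40:
        issues.append("little context: state norms and expectations explicitly")
    return issues

print(review_for_claude("fix this"))  # flags both missing tags and missing context
```

A GPT-targeted reviewer would instead check for a developer-message section and few-shot examples; a Gemini reviewer would check for one consistent formatting scheme.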
BYOK: bring-your-own-key providers
GitHub Copilot's model picker isn't limited to the built-in models. BYOK (bring-your-own-key) lets you connect external model providers using your own API keys.
Available BYOK providers
| Provider | Models available | Key advantage |
|---|---|---|
| Cerebras | Llama 3.3, DeepSeek v3.2, GLM-4.6 | Extremely fast inference |
| OpenRouter | 100+ models | Unified API for multiple providers |
| Ollama | Local models | Fully local, no API calls |
| Azure OpenAI | GPT-4o, GPT-4 Turbo | Enterprise deployment |
| Anthropic (direct) | Claude models | Direct API access |
HuggingFace integration
The HuggingFace Inference Provider extension enables access to open-weights models:
- Multiple inference providers: HuggingFace API, Nebius, SambaNova, Together AI
- Automatic routing: fastest or cheapest mode
- Open-weights models: Llama, Mistral, DeepSeek, Qwen
Quota implications
BYOK models don't consume your GitHub Copilot quota, but:
- An active Copilot subscription is still required
- BYOK costs are billed directly by the provider
- Background query refinement (using GPT-4o Mini) doesn't count against quota
- Full prompt logging is available in the output channel for debugging
Model selection decision framework
```
What's your top priority?
│
├─ Speed and cost
│    └─ GPT-4o mini / Gemini 2.0 Flash
│
├─ Accuracy and reliability
│    ├─ Is the task complex/ambiguous?
│    │    ├─ Yes → o3 or Claude Extended Thinking
│    │    └─ No  → GPT-4o or Claude Sonnet 4
│    └─ Does it need agentic multi-step work?
│         └─ Yes → Claude Opus 4.6 or GPT-5
│
├─ Long context (>100K tokens)
│    └─ Claude Sonnet 4 or Gemini 2.0
│
├─ Multimodal (images + text)
│    └─ Gemini 2.0 or GPT-4o
│
└─ Local/private (no cloud)
     └─ Ollama via BYOK
```
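The decision tree above can be expressed directly as code; a sketch with illustrative model identifiers:

```python
def pick_model(priority: str, complex_task: bool = False, agentic: bool = False) -> str:
    """The decision tree above, as a function. Model names are illustrative."""
    if priority == "speed_cost":
        return "gpt-4o-mini"
    if priority == "accuracy":
        if complex_task:
            return "o3"
        if agentic:
            return "claude-opus-4.6"
        return "gpt-4o"
    if priority == "long_context":
        return "claude-sonnet-4"
    if priority == "multimodal":
        return "gemini-2.0-flash"
    if priority == "local":
        return "ollama (BYOK)"
    raise ValueError(f"unknown priority: {priority}")

print(pick_model("accuracy", complex_task=True))  # → o3
```

Encoding the choice as a function makes the routing testable and keeps model selection out of individual prompts.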
Quick reference table
| Scenario | Primary model | Fallback |
|---|---|---|
| Production agent orchestration | GPT-4o | Claude Sonnet 4 |
| Complex multi-step reasoning | o3 | o4-mini (faster) |
| Document summarization (long) | Claude Sonnet 4 | Gemini 2.0 |
| Code generation | GPT-4o | Claude Sonnet 4 |
| Visual reasoning | Gemini 2.0 | GPT-4o |
| Mathematical problems | o3 | Claude Extended Thinking |
| Agentic planning | o3 | GPT-5 |
| Agentic workflows | Claude Opus 4.6 | GPT-5, o3 |
| Deep research and analysis | Claude Opus 4.6 | Claude Extended Thinking |
Key considerations
The re-validation rule
Every time you change model or version:
- Read the official prompting guide for that model
- Re-validate existing prompts against the new modelβs behavior
- Update your test pipeline with latest guide recommendations
This isn't optional for production systems. Model updates can change behavior in subtle ways that break previously working prompts.
Cost vs. capability trade-off
More capable models cost more per token and respond more slowly. For production systems, this creates a design tension:
- Don't use o3 for tasks that GPT-4o handles well
- Do use reasoning models for genuinely complex planning
- Consider multi-model architectures that route tasks to the appropriate model
Context window isnβt everything
A model with a 1M+ context window doesn't automatically handle long documents well. Context rot (attention degradation in the middle of long prompts) affects all models. Large context windows help, but you still need to structure your prompts so critical information appears at the beginning and end.
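One common mitigation is to repeat the critical instructions after the bulk material. A sketch of that assembly step (the layout is the technique described above; the helper is ours):

```python
def assemble_long_prompt(instructions: str, documents: list[str]) -> str:
    """Place critical instructions first, bulk material in the middle,
    and repeat the instructions at the end to counter mid-context attention loss."""
    middle = "\n\n".join(documents)
    return f"{instructions}\n\n{middle}\n\nReminder of the task:\n{instructions}"

p = assemble_long_prompt("Summarize each doc in one line.", ["doc A...", "doc B..."])
```

The duplication costs a few tokens but keeps the task statement inside the high-attention regions at both ends of the context.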
Conclusion
Model selection is a first-class prompt engineering concern. Each model family brings distinct strengths: GPT excels at following explicit instructions, Claude at nuanced analysis with rich context, Gemini at structured multimodal tasks, and reasoning models at complex planning. Understanding these differences, and designing your agents, prompts, and orchestrations to leverage them, is what separates generic prompt engineering from production-quality systems.
Key takeaways
- There is no "best model", only the best model for a specific task and prompting style
- The compiler analogy captures the core insight: same prompt, different models, different results
- Standard models need detailed instructions; reasoning models need high-level goals
- Multi-model architectures let you route different tasks to different models within the same workflow
- BYOK extends your options to 100+ models through OpenRouter, Ollama, Cerebras, and HuggingFace
- Re-validate prompts every time you change model or version; this isn't optional for production
- Context rot affects all models; structure prompts with critical information at the beginning and end
Next steps
- How to optimize prompts for specific models: detailed, actionable optimization techniques per model family
- How Copilot assembles and processes prompts: understanding context windows and attention patterns
- Chat modes, Agent HQ, and execution contexts: how the `model` field interacts with execution modes
References
OpenAI Prompt Engineering Guide [Official] Comprehensive guide for GPT-4o, GPT-5, and latest OpenAI models. Covers developer messages, few-shot examples, prompt caching, and model-specific optimization.
Anthropic Prompt Engineering Overview [Official] Master guide for Claude models. Covers XML tagging, chain-of-thought prompting, extended thinking, and long-context optimization.
Google Gemini Prompt Design Strategies [Official] Comprehensive guide for Gemini 2.0 and Gemini 3 models. Covers structured prompting, completion patterns, and multimodal inputs.
OpenAI Reasoning Models Guide [Official] Technical documentation for o3 and o4-mini reasoning models. Covers when to use reasoning, effort levels, and token budgeting.
VS Code Copilot Language Models Documentation [Official] Microsoft's documentation for model selection in VS Code, including the Language Models Editor, BYOK provider configuration, and capability filtering.