Understanding LLM models and model selection

tech
prompt-engineering
github-copilot
concepts
models
Understand the different model families available in GitHub Copilot, how they behave differently, how to select the right model for each task, and how BYOK providers extend your options.
Author

Dario Airoldi

Published

March 1, 2026


A “good generic prompt” doesn’t exist; there is only a good prompt for a specific model. Different models have fundamentally different behaviors, strengths, and optimal prompting strategies. What works brilliantly with Claude may fail with GPT-4o; what excels with Gemini may confuse a reasoning model.

GitHub Copilot gives you access to models from multiple providers: OpenAI, Anthropic, Google, and more through BYOK (bring-your-own-key). This article explains how these model families differ, what makes each one strong, how to choose the right model for a given task, and how the multi-model architecture enables advanced workflows.


🎯 The compiler analogy: why models matter

Think of each model as a different compiler. The same “source code” (your prompt) produces different “executables” (responses) depending on which compiler processes it. Just as you wouldn’t expect C++ code to compile identically on GCC and MSVC without adjustments, you shouldn’t expect the same prompt to perform identically across GPT-4o, Claude, and Gemini.

What changes between models:

Aspect                      How it differs
──────────────────────────────────────────────────────────────────────────────
Sensitivity to constraints  Some models follow explicit constraints rigidly; others interpret them flexibly
Ambiguity handling          Models differ in whether they ask for clarification or make assumptions
Response patterns           Default verbosity, formatting preferences, and structure vary
Token interpretation        Context window utilization, attention patterns, and recency bias differ
Chain of thought            Some benefit from explicit CoT prompting; others do it internally

This means that every time you change model or version, you should re-validate your prompts against the new model’s behavior.


📊 Model families and their characteristics

GitHub Copilot provides access to models from three major providers, plus BYOK options:

OpenAI models

Model             Context window  Best for                        Key behavior
──────────────────────────────────────────────────────────────────────────────
GPT-4o / GPT-4.1  128K            General tasks, code generation  Fast, balanced, highly steerable
GPT-5 / GPT-5.2   1M+             Complex tasks, broad domains    Latest capabilities, vision support
o3 / o4-mini      200K            Complex reasoning, planning     Internal chain of thought

GPT models respond best to explicit instructions with developer messages (formerly system messages), few-shot examples, and clear Markdown/XML formatting. They’re the “follow my instructions precisely” family.

Anthropic models

Model                     Context window  Best for                           Key behavior
─────────────────────────────────────────────────────────────────────────────────────────
Claude Sonnet 4           200K            Long documents, nuanced analysis   Thoughtful, cautious, detailed
Claude Opus 4.6           200K            Frontier agentic tasks             Highest capability, multi-step reasoning
Claude Extended Thinking  200K            Complex STEM, constraint problems  Deep internal reasoning

Claude models excel with clarity and context: clear XML-tagged structure, explicit context about your norms and preferences, and well-organized reference material. Think of Claude as a brilliant but new colleague who needs explicit context about your expectations.

Google models

Model             Context window  Best for                           Key behavior
────────────────────────────────────────────────────────────────────────────────
Gemini 2.0 Flash  1M+             Fast inference, multimodal         Quick responses, visual reasoning
Gemini 3          Varies          Advanced reasoning, agentic tasks  Strong instruction following

Gemini models respond best to structured prompts with consistent formatting and clear organization. They often perform well with zero-shot prompts but benefit from few-shot examples when specific output formats are needed.

Capability comparison

The model picker in VS Code shows capability indicators for each model:

Model                    Context    Vision   Tools   Reasoning
─────────────────────────────────────────────────────────────
Claude Sonnet 4          200K       ✅       ✅      —
GPT-4o                   128K       ✅       ✅      —
Claude Opus 4.6          200K       ✅       ✅      —
o3                       200K       —        ✅      ✅
Gemini 2.0 Flash         1M+        ✅       ✅      —
GPT-5                    1M+        ✅       ✅      —

Not all models support all capabilities. Vision (image understanding), tool calling (function invocation), and reasoning (internal chain of thought) are the three key capability dimensions. Your choice of model constrains what your agents can do.
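The capability matrix above can also be treated as data when choosing a model programmatically. A minimal sketch in Python; the model names and capability flags simply mirror the table and are not an official API:

```python
# Illustrative capability matrix mirroring the table above.
MODELS = {
    "claude-sonnet-4":  {"context": 200_000,   "vision": True,  "tools": True, "reasoning": False},
    "gpt-4o":           {"context": 128_000,   "vision": True,  "tools": True, "reasoning": False},
    "o3":               {"context": 200_000,   "vision": False, "tools": True, "reasoning": True},
    "gemini-2.0-flash": {"context": 1_000_000, "vision": True,  "tools": True, "reasoning": False},
}

def models_with(capability: str) -> list[str]:
    """Return the models whose capability flag is set."""
    return [name for name, caps in MODELS.items() if caps.get(capability)]

def fits_context(tokens_needed: int) -> list[str]:
    """Return the models whose context window can hold the prompt."""
    return [name for name, caps in MODELS.items() if caps["context"] >= tokens_needed]
```

In this sketch, a 500K-token prompt rules out everything except the 1M+ window model, which is exactly the "your choice of model constrains what your agents can do" point.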


🧠 Standard models vs. reasoning models

The most important conceptual division isn’t between providers; it’s between standard models and reasoning models.

Standard language models

GPT-4o, Claude Sonnet 4, Gemini 2.0 Flash: these models benefit from explicit, detailed instructions:

  • Provide step-by-step guidance
  • Use few-shot examples liberally
  • Explicitly state constraints and output formats
  • Use chain-of-thought prompting when reasoning is needed

Think of standard models as junior colleagues who need clear, detailed instructions.
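Those four habits can be folded into a small prompt-building helper. A sketch; the section wording and the sample inputs are illustrative, not a prescribed template:

```python
def build_standard_prompt(role: str, steps: list[str],
                          examples: list[tuple[str, str]],
                          output_format: str) -> str:
    """Assemble an explicit, detailed prompt for a standard model:
    step-by-step guidance, few-shot examples, and a stated output format."""
    lines = [f"You are {role}.", "", "Follow these steps:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    lines += ["", "Examples:"]
    lines += [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    lines += ["", f"Respond in this format: {output_format}"]
    return "\n".join(lines)

prompt = build_standard_prompt(
    role="a code reviewer",
    steps=["Read the diff", "List issues", "Suggest fixes"],
    examples=[("x == None", "use `x is None`")],
    output_format="a Markdown list",
)
```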

Reasoning models

o3, o4-mini, Claude Extended Thinking: these models perform internal reasoning before responding:

  • Give high-level goals, not step-by-step instructions
  • Trust the model to work out the details
  • Be specific about success criteria and constraints
  • Don’t include “think step by step”; they already do this internally

Think of reasoning models as senior colleagues who need goals, not instructions.
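The contrast with the standard-model style can be made concrete. A sketch of a goal-oriented prompt for a reasoning model (the wording and sample task are illustrative); note there are no steps and no “think step by step”:

```python
def build_reasoning_prompt(goal: str, success_criteria: list[str],
                           constraints: list[str]) -> str:
    """Assemble a high-level prompt for a reasoning model: a goal plus
    success criteria and constraints, with no step-by-step instructions."""
    parts = [f"Goal: {goal}", "", "Success criteria:"]
    parts += [f"- {c}" for c in success_criteria]
    parts += ["", "Constraints:"]
    parts += [f"- {c}" for c in constraints]
    return "\n".join(parts)

prompt = build_reasoning_prompt(
    goal="Design a migration plan from REST to gRPC for the billing service",
    success_criteria=["Zero downtime", "Rollback path documented"],
    constraints=["No schema changes in phase 1"],
)
```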

Side-by-side comparison

Aspect                 Standard models              Reasoning models
──────────────────────────────────────────────────────────────────────
Instruction style      Detailed, step-by-step       High-level goals
Chain of thought       Must be prompted explicitly  Happens internally
“Think step by step”   Helpful                      Unnecessary or harmful
Few-shot examples      Often required               Try zero-shot first
Constraints            Embedded in instructions     Specify success criteria
Speed                  Fast                         Slower (thinking time)
Cost                   Lower per token              Higher per token
Best for               Well-defined tasks           Ambiguous, complex problems

When to use each category

Standard models:

  • Code generation with clear requirements
  • Formatting and text transformation
  • Following established patterns
  • High-volume, latency-sensitive tasks

Reasoning models:

  • Complex multi-step planning
  • Ambiguous tasks requiring interpretation
  • Large document analysis (needle in a haystack)
  • Nuanced decision-making with many factors
  • Scientific and mathematical reasoning

🔧 Model-specific prompting strategies

Each model family has an optimal prompting style. Here’s a conceptual overview:

GPT models: explicit instruction optimization

# Identity
You are a [role] specializing in [domain].

# Instructions
* [Specific rule 1]
* [Specific rule 2]

# Examples
[Input] → [Output]

# Context
[Additional information]

Key techniques: developer messages for identity/rules, Markdown/XML formatting, few-shot examples, prompt caching optimization (static content first).
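Expressed as a chat-message list, the same structure looks like this. The `developer` role is what OpenAI’s newer APIs use in place of `system` (verify against current docs for your model); the content is illustrative, and the static, cacheable material deliberately comes before the dynamic user request, per the caching advice:

```python
# Static identity/rules/examples first (cacheable), dynamic request last.
messages = [
    {"role": "developer", "content": (
        "# Identity\n"
        "You are a release-notes writer.\n\n"
        "# Instructions\n"
        "* Use past tense.\n"
        "* One bullet per change.\n\n"
        "# Examples\n"
        "fix auth bug -> * Fixed an authentication bug"
    )},
    {"role": "user", "content": "Summarize: fixed login timeout bug"},
]
```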

Claude models: clarity and context optimization

<role>You are a technical documentation specialist.</role>
<context>You are reviewing API documentation.</context>
<instructions>
1. Check for completeness
2. Verify all parameters are documented
3. Flag missing error codes
</instructions>
<output_format>Markdown table</output_format>

Key techniques: XML tags for structure, explicit context about norms/preferences, chain-of-thought with tags for complex tasks, long-context with critical instructions at the beginning.

Gemini models: structured prompting

Key techniques: consistent formatting (XML or Markdown headers, pick one and stay with it), zero-shot first then add examples if needed, completion patterns for format control, context anchoring after large blocks.

Reasoning models: minimal guidance

Key techniques: high-level goals instead of steps, specify success criteria, reserve tokens for internal reasoning (at least 25K for o3/o4-mini, a minimum 1024-token budget for Claude Extended Thinking), trust the model’s process.
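Those token reservations show up as request parameters. A sketch: the Anthropic `thinking` block follows their published extended-thinking API, while the o-series field name is an assumption to verify against the current OpenAI reasoning docs before use:

```python
# Sketch of per-model reasoning budgets (parameter names per provider
# docs at time of writing; verify before relying on them).
claude_extended_thinking_request = {
    "model": "claude-sonnet-4",  # illustrative model id
    "max_tokens": 32_000,
    # Extended thinking: budget_tokens must be at least 1024.
    "thinking": {"type": "enabled", "budget_tokens": 8_000},
}

o_series_request = {
    "model": "o3",
    # Leave headroom so internal reasoning isn't truncated
    # (at least ~25K suggested for o3/o4-mini).
    "max_completion_tokens": 25_000,
}
```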


πŸ—οΈ Multi-model architecture patterns

Production systems often benefit from using different models for different tasks within the same workflow. In Copilot, this is possible through the model field in prompt/agent YAML and through subagent delegation.
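For example, a prompt file can pin its own model in its frontmatter. A hypothetical file, following the same frontmatter convention as the agent-file example later in this article (treat the exact keys as something to verify against the VS Code documentation):

```
# summarize-docs.prompt.md (hypothetical)
---
description: Summarize long design documents
model: Claude Sonnet 4
---
Summarize the attached document in five bullet points.
```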

Pattern 1: planner + executors

User request
     │
     ▼
┌─────────────────────────┐
│  Reasoning model (o3)   │  ← Analyzes request, decomposes into steps
│  "The planner"          │
└─────────────────────────┘
     │
     ├────────────────────┐
     ▼                    ▼
┌─────────────────────┐  ┌─────────────────────┐
│  GPT-4o             │  │  Claude Sonnet 4    │
│  Fast code gen      │  │  Long doc analysis  │
└─────────────────────┘  └─────────────────────┘

A reasoning model handles the complex planning, then delegates execution to faster, cheaper standard models.
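The delegation shape can be sketched as a plain routing table. The planner’s decomposition is mocked here because the point is the routing, not the planning; model names and task types are illustrative:

```python
# Which executor model handles each task type in this sketch.
EXECUTORS = {
    "doc-analysis": "claude-sonnet-4",
    "codegen": "gpt-4o",
}

def plan(request: str) -> list[dict]:
    """Stand-in for the reasoning-model planner: decompose a request
    into typed subtasks. A real planner would call o3 here."""
    return [
        {"type": "doc-analysis", "input": f"Find constraints in the spec for: {request}"},
        {"type": "codegen", "input": f"Implement: {request}"},
    ]

def execute(request: str) -> list[tuple[str, str]]:
    """Route each planned subtask to its executor model."""
    return [(EXECUTORS[task["type"]], task["input"]) for task in plan(request)]

routing = execute("rate limiter middleware")
```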

Pattern 2: task-specific model selection

Task type                     Recommended model    Why
───────────────────────────────────────────────────────────────────────────
Agent orchestration           GPT-4o               Fast, balanced, reliable
Long document analysis        Claude Sonnet 4      200K context, strong comprehension
Complex reasoning             o3 / o4-mini         Internal chain of thought
Code generation               GPT-4o / Claude      Fast, accurate output
Multimodal (image + text)     Gemini 2.0 / GPT-4o  Strong vision capabilities
Evaluation / grading          o3                   Nuanced judgment, high accuracy
Agentic multi-step workflows  Claude Opus 4.6      Highest agentic capability
Deep analysis, research       Claude Opus 4.6      Multi-step reasoning

Pattern 3: model-specific reviewers

Create dedicated review agents optimized for each model’s prompting style:

# openai-prompt-reviewer.agent.md
---
name: openai-prompt-reviewer
description: Reviews prompts for GPT model optimization
model: gpt-4o
tools: ['codebase', 'search']
---

Each reviewer checks that prompts follow the optimal patterns for their target model: developer message structure for GPT, XML tags for Claude, consistent formatting for Gemini.


🔑 BYOK: bring-your-own-key providers

GitHub Copilot’s model picker isn’t limited to the built-in models. BYOK (bring-your-own-key) lets you connect external model providers using your own API keys.

Available BYOK providers

Provider            Models available                   Key advantage
──────────────────────────────────────────────────────────────────────────────
Cerebras            Llama 3.3, DeepSeek v3.2, GLM-4.6  Extremely fast inference
OpenRouter          100+ models                        Unified API for multiple providers
Ollama              Local models                       Fully local, no API calls
Azure OpenAI        GPT-4o, GPT-4 Turbo                Enterprise deployment
Anthropic (direct)  Claude models                      Direct API access

HuggingFace integration

The HuggingFace Inference Provider extension enables access to open-weights models:

  • Multiple inference providers: HuggingFace API, Nebius, SambaNova, Together AI
  • Automatic routing: fastest or cheapest mode
  • Open-weights models: Llama, Mistral, DeepSeek, Qwen

Quota implications

BYOK models don’t consume your GitHub Copilot quota, but:

  • An active Copilot subscription is still required
  • BYOK costs are billed directly by the provider
  • Background query refinement (using GPT-4o Mini) doesn’t count against quota
  • Full prompt logging is available in the output channel for debugging

📋 Model selection decision framework

What's your top priority?
│
├─ Speed and cost
│   └─ GPT-4o mini / Gemini 2.0 Flash
│
├─ Accuracy and reliability
│   ├─ Is the task complex/ambiguous?
│   │   ├─ Yes → o3 or Claude Extended Thinking
│   │   └─ No  → GPT-4o or Claude Sonnet 4
│   └─ Does it need agentic multi-step work?
│       └─ Yes → Claude Opus 4.6 or GPT-5
│
├─ Long context (>100K tokens)
│   └─ Claude Sonnet 4 or Gemini 2.0
│
├─ Multimodal (images + text)
│   └─ Gemini 2.0 or GPT-4o
│
└─ Local/private (no cloud)
    └─ Ollama via BYOK
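The decision tree translates directly into a selection function. A sketch; the returned names condense the article’s recommendations and the priority labels are illustrative:

```python
def select_model(priority: str, complex_task: bool = False,
                 agentic: bool = False) -> str:
    """Mirror the decision tree: pick a model family by top priority."""
    if priority == "speed-and-cost":
        return "gpt-4o-mini or gemini-2.0-flash"
    if priority == "accuracy":
        if complex_task:
            return "o3 or claude-extended-thinking"
        if agentic:
            return "claude-opus-4.6 or gpt-5"
        return "gpt-4o or claude-sonnet-4"
    if priority == "long-context":
        return "claude-sonnet-4 or gemini-2.0"
    if priority == "multimodal":
        return "gemini-2.0 or gpt-4o"
    if priority == "local":
        return "ollama (BYOK)"
    raise ValueError(f"unknown priority: {priority}")
```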

Quick reference table

Scenario                        Primary model    Fallback
─────────────────────────────────────────────────────────────────────────
Production agent orchestration  GPT-4o           Claude Sonnet 4
Complex multi-step reasoning    o3               o4-mini (faster)
Document summarization (long)   Claude Sonnet 4  Gemini 2.0
Code generation                 GPT-4o           Claude Sonnet 4
Visual reasoning                Gemini 2.0       GPT-4o
Mathematical problems           o3               Claude Extended Thinking
Agentic planning                o3               GPT-5
Agentic workflows               Claude Opus 4.6  GPT-5, o3
Deep research and analysis      Claude Opus 4.6  Claude Extended Thinking

⚠️ Key considerations

The re-validation rule

Every time you change model or version:

  1. Read the official prompting guide for that model
  2. Re-validate existing prompts against the new model’s behavior
  3. Update your test pipeline with latest guide recommendations

This isn’t optional for production systems. Model updates can change behavior in subtle ways that break previously working prompts.

Cost vs. capability trade-off

More capable models cost more per token and respond more slowly. For production systems, this creates a design tension:

  • Don’t use o3 for tasks that GPT-4o handles well
  • Do use reasoning models for genuinely complex planning
  • Consider multi-model architectures that route tasks to the appropriate model

Context window isn’t everything

A model with a 1M+ context window doesn’t automatically handle long documents well. Context rot (attention degradation in the middle of long prompts) affects all models. Large context windows help, but you still need to structure your prompts so critical information appears at the beginning and end.
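One simple mitigation is to repeat the critical instruction at both ends of a long prompt, keeping the bulk reference material in the middle. A minimal sketch; the separators and wording are illustrative:

```python
def sandwich_prompt(critical_instruction: str, long_context: str) -> str:
    """Place the critical instruction at the start and end of the prompt,
    with bulk reference material in the middle, to counter context rot."""
    return "\n\n".join([
        critical_instruction,
        "--- reference material ---",
        long_context,
        "--- end of reference material ---",
        f"Reminder: {critical_instruction}",
    ])

p = sandwich_prompt("Answer only from the provided spec.",
                    "lots of reference text here")
```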


🎯 Conclusion

Model selection is a first-class prompt engineering concern. Each model family brings distinct strengths: GPT excels at following explicit instructions, Claude at nuanced analysis with rich context, Gemini at structured multimodal tasks, and reasoning models at complex planning. Understanding these differences, and designing your agents, prompts, and orchestrations to leverage them, is what separates generic prompt engineering from production-quality systems.

Key takeaways

  • No “best model” exists; only the best model for a specific task and prompting style
  • The compiler analogy captures the core insight: same prompt, different models, different results
  • Standard models need detailed instructions; reasoning models need high-level goals
  • Multi-model architectures let you route different tasks to different models within the same workflow
  • BYOK extends your options to 100+ models through OpenRouter, Ollama, Cerebras, and HuggingFace
  • Re-validate prompts every time you change model or version; this isn’t optional for production
  • Context rot affects all models; structure prompts with critical information at the beginning and end


📚 References

OpenAI Prompt Engineering Guide [📘 Official] Comprehensive guide for GPT-4o, GPT-5, and the latest OpenAI models. Covers developer messages, few-shot examples, prompt caching, and model-specific optimization.

Anthropic Prompt Engineering Overview [📘 Official] Master guide for Claude models. Covers XML tagging, chain-of-thought prompting, extended thinking, and long-context optimization.

Google Gemini Prompt Design Strategies [📘 Official] Comprehensive guide for Gemini 2.0 and Gemini 3 models. Covers structured prompting, completion patterns, and multimodal inputs.

OpenAI Reasoning Models Guide [📘 Official] Technical documentation for the o3 and o4-mini reasoning models. Covers when to use reasoning, effort levels, and token budgeting.

VS Code Copilot Language Models Documentation [📘 Official] Microsoft’s documentation for model selection in VS Code, including the Language Models Editor, BYOK provider configuration, and capability filtering.