Understanding LLM models and model selection

tech
prompt-engineering
github-copilot
concepts
models
Understand the different model families available in GitHub Copilot, how they behave differently, how to select the right model for each task, and how BYOK providers extend your options.
Author

Dario Airoldi

Published

March 1, 2026


A “good generic prompt” doesn’t exist; there is only a good prompt for a specific model. Different models have fundamentally different behaviors, strengths, and optimal prompting strategies. What works brilliantly with Claude may fail with GPT-4o; what excels with Gemini may confuse a reasoning model.

GitHub Copilot gives you access to models from multiple providers: OpenAI, Anthropic, Google, and more through BYOK (bring-your-own-key). This article explains how these model families differ, what makes each one strong, how to choose the right model for a given task, and how the multi-model architecture enables advanced workflows.


🎯 The compiler analogy: why models matter

Think of each model as a different compiler. The same “source code” (your prompt) produces different “executables” (responses) depending on which compiler processes it. Just as you wouldn’t expect C++ code to compile identically on GCC and MSVC without adjustments, you shouldn’t expect the same prompt to perform identically across GPT-4o, Claude, and Gemini.

What changes between models:

Aspect                      How it differs
──────────────────────────────────────────────────────────────────────────────
Sensitivity to constraints  Some models follow explicit constraints rigidly; others interpret them flexibly
Ambiguity handling          Models differ in whether they ask for clarification or make assumptions
Response patterns           Default verbosity, formatting preferences, and structure vary
Token interpretation        Context window utilization, attention patterns, and recency bias differ
Chain of thought            Some benefit from explicit CoT prompting; others do it internally

This means that every time you change model or version, you should re-validate your prompts against the new model’s behavior.


📊 Model families and their characteristics

GitHub Copilot provides access to models from three major providers, plus BYOK options:

OpenAI models

Model             Context window  Best for                        Key behavior
──────────────────────────────────────────────────────────────────────────────
GPT-4o / GPT-4.1  128K            General tasks, code generation  Fast, balanced, highly steerable
GPT-5 / GPT-5.2   1M+             Complex tasks, broad domains    Latest capabilities, vision support
o3 / o4-mini      200K            Complex reasoning, planning     Internal chain of thought

GPT models respond best to explicit instructions with developer messages (formerly system messages), few-shot examples, and clear Markdown/XML formatting. They’re the “follow my instructions precisely” family.

Anthropic models

Model                     Context window  Best for                           Key behavior
─────────────────────────────────────────────────────────────────────────────────────────
Claude Sonnet 4           200K            Long documents, nuanced analysis   Thoughtful, cautious, detailed
Claude Opus 4.6           200K            Frontier agentic tasks             Highest capability, multi-step reasoning
Claude Extended Thinking  200K            Complex STEM, constraint problems  Deep internal reasoning

Claude models excel with clarity and context: clear XML-tagged structure, explicit context about your norms and preferences, and well-organized reference material. Think of Claude as a brilliant but new colleague who needs explicit context about your expectations.

Google models

Model             Context window  Best for                           Key behavior
────────────────────────────────────────────────────────────────────────────────
Gemini 2.0 Flash  1M+             Fast inference, multimodal         Quick responses, visual reasoning
Gemini 3          Varies          Advanced reasoning, agentic tasks  Strong instruction following

Gemini models respond best to structured prompts with consistent formatting and clear organization. They often perform well with zero-shot prompts but benefit from few-shot examples when specific output formats are needed.

Capability comparison

The model picker in VS Code shows capability indicators for each model:

Model                    Context    Vision   Tools   Reasoning
─────────────────────────────────────────────────────────────
Claude Sonnet 4          200K       ✅       ✅      —
GPT-4o                   128K       ✅       ✅      —
Claude Opus 4.6          200K       ✅       ✅      —
o3                       200K       —        ✅      ✅
Gemini 2.0 Flash         1M+        ✅       ✅      —
GPT-5                    1M+        ✅       ✅      —

Not all models support all capabilities. Vision (image understanding), tool calling (function invocation), and reasoning (internal chain of thought) are the three key capability dimensions. Your choice of model constrains what your agents can do.
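The capability matrix above can also be treated as data when choosing a model programmatically. A minimal sketch in Python; the model names and capability flags simply mirror the table and are not an official API:

```python
# Illustrative capability matrix mirroring the table above.
MODELS = {
    "claude-sonnet-4":  {"context": 200_000,   "vision": True,  "tools": True, "reasoning": False},
    "gpt-4o":           {"context": 128_000,   "vision": True,  "tools": True, "reasoning": False},
    "o3":               {"context": 200_000,   "vision": False, "tools": True, "reasoning": True},
    "gemini-2.0-flash": {"context": 1_000_000, "vision": True,  "tools": True, "reasoning": False},
}

def models_with(capability: str) -> list[str]:
    """Return the models whose capability flag is set."""
    return [name for name, caps in MODELS.items() if caps.get(capability)]

def fits_context(tokens_needed: int) -> list[str]:
    """Return the models whose context window can hold the prompt."""
    return [name for name, caps in MODELS.items() if caps["context"] >= tokens_needed]
```

In this sketch, a 500K-token prompt rules out everything except the 1M+ window model, which is exactly the "your choice of model constrains what your agents can do" point.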


🧠 Standard models vs. reasoning models

The most important conceptual division isn’t between providers; it’s between standard models and reasoning models.

Standard language models

GPT-4o, Claude Sonnet 4, Gemini 2.0 Flash: these models benefit from explicit, detailed instructions:

  • Provide step-by-step guidance
  • Use few-shot examples liberally
  • Explicitly state constraints and output formats
  • Use chain-of-thought prompting when reasoning is needed

Think of standard models as junior colleagues who need clear, detailed instructions.
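Those four habits can be folded into a small prompt-building helper. A sketch; the section wording and the sample inputs are illustrative, not a prescribed template:

```python
def build_standard_prompt(role: str, steps: list[str],
                          examples: list[tuple[str, str]],
                          output_format: str) -> str:
    """Assemble an explicit, detailed prompt for a standard model:
    step-by-step guidance, few-shot examples, and a stated output format."""
    lines = [f"You are {role}.", "", "Follow these steps:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    lines += ["", "Examples:"]
    lines += [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    lines += ["", f"Respond in this format: {output_format}"]
    return "\n".join(lines)

prompt = build_standard_prompt(
    role="a code reviewer",
    steps=["Read the diff", "List issues", "Suggest fixes"],
    examples=[("x == None", "use `x is None`")],
    output_format="a Markdown list",
)
```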

Reasoning models

o3, o4-mini, Claude Extended Thinking: these models perform internal reasoning before responding:

  • Give high-level goals, not step-by-step instructions
  • Trust the model to work out the details
  • Be specific about success criteria and constraints
  • Don’t include “think step by step”; they already do this internally

Think of reasoning models as senior colleagues who need goals, not instructions.
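The contrast with the standard-model style can be made concrete. A sketch of a goal-oriented prompt for a reasoning model (the wording and sample task are illustrative); note there are no steps and no “think step by step”:

```python
def build_reasoning_prompt(goal: str, success_criteria: list[str],
                           constraints: list[str]) -> str:
    """Assemble a high-level prompt for a reasoning model: a goal plus
    success criteria and constraints, with no step-by-step instructions."""
    parts = [f"Goal: {goal}", "", "Success criteria:"]
    parts += [f"- {c}" for c in success_criteria]
    parts += ["", "Constraints:"]
    parts += [f"- {c}" for c in constraints]
    return "\n".join(parts)

prompt = build_reasoning_prompt(
    goal="Design a migration plan from REST to gRPC for the billing service",
    success_criteria=["Zero downtime", "Rollback path documented"],
    constraints=["No schema changes in phase 1"],
)
```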

Side-by-side comparison

Aspect                 Standard models              Reasoning models
──────────────────────────────────────────────────────────────────────
Instruction style      Detailed, step-by-step       High-level goals
Chain of thought       Must be prompted explicitly  Happens internally
“Think step by step”   Helpful                      Unnecessary or harmful
Few-shot examples      Often required               Try zero-shot first
Constraints            Embedded in instructions     Specify success criteria
Speed                  Fast                         Slower (thinking time)
Cost                   Lower per token              Higher per token
Best for               Well-defined tasks           Ambiguous, complex problems

When to use each category

Standard models:

  • Code generation with clear requirements
  • Formatting and text transformation
  • Following established patterns
  • High-volume, latency-sensitive tasks

Reasoning models:

  • Complex multi-step planning
  • Ambiguous tasks requiring interpretation
  • Large document analysis (needle in a haystack)
  • Nuanced decision-making with many factors
  • Scientific and mathematical reasoning

🔧 Model-specific prompting strategies

Each model family has an optimal prompting style. Here’s a conceptual overview:

GPT models: explicit instruction optimization

# Identity
You are a [role] specializing in [domain].

# Instructions
* [Specific rule 1]
* [Specific rule 2]

# Examples
[Input] → [Output]

# Context
[Additional information]

Key techniques: developer messages for identity/rules, Markdown/XML formatting, few-shot examples, prompt caching optimization (static content first).
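Expressed as a chat-message list, the same structure looks like this. The `developer` role is what OpenAI’s newer APIs use in place of `system` (verify against current docs for your model); the content is illustrative, and the static, cacheable material deliberately comes before the dynamic user request, per the caching advice:

```python
# Static identity/rules/examples first (cacheable), dynamic request last.
messages = [
    {"role": "developer", "content": (
        "# Identity\n"
        "You are a release-notes writer.\n\n"
        "# Instructions\n"
        "* Use past tense.\n"
        "* One bullet per change.\n\n"
        "# Examples\n"
        "fix auth bug -> * Fixed an authentication bug"
    )},
    {"role": "user", "content": "Summarize: fixed login timeout bug"},
]
```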

Claude models: clarity and context optimization

<role>You are a technical documentation specialist.</role>
<context>You are reviewing API documentation.</context>
<instructions>
1. Check for completeness
2. Verify all parameters are documented
3. Flag missing error codes
</instructions>
<output_format>Markdown table</output_format>

Key techniques: XML tags for structure, explicit context about norms/preferences, chain-of-thought with tags for complex tasks, long-context with critical instructions at the beginning.

Gemini models: structured prompting

Key techniques: consistent formatting (XML or Markdown headers, pick one and stay with it), zero-shot first then add examples if needed, completion patterns for format control, context anchoring after large blocks.

Reasoning models: minimal guidance

Key techniques: high-level goals instead of steps, specify success criteria, reserve tokens for internal reasoning (at least 25K for o3/o4-mini, a minimum 1024-token budget for Claude Extended Thinking), trust the model’s process.
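Those token reservations show up as request parameters. A sketch: the Anthropic `thinking` block follows their published extended-thinking API, while the o-series field name is an assumption to verify against the current OpenAI reasoning docs before use:

```python
# Sketch of per-model reasoning budgets (parameter names per provider
# docs at time of writing; verify before relying on them).
claude_extended_thinking_request = {
    "model": "claude-sonnet-4",  # illustrative model id
    "max_tokens": 32_000,
    # Extended thinking: budget_tokens must be at least 1024.
    "thinking": {"type": "enabled", "budget_tokens": 8_000},
}

o_series_request = {
    "model": "o3",
    # Leave headroom so internal reasoning isn't truncated
    # (at least ~25K suggested for o3/o4-mini).
    "max_completion_tokens": 25_000,
}
```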


πŸ—οΈ Multi-model architecture patterns

Production systems often benefit from using different models for different tasks within the same workflow. In Copilot, this is possible through the model field in prompt/agent YAML and through subagent delegation.
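For example, a prompt file can pin its own model in its frontmatter. A hypothetical file, following the same frontmatter convention as the agent-file example later in this article (treat the exact keys as something to verify against the VS Code documentation):

```
# summarize-docs.prompt.md (hypothetical)
---
description: Summarize long design documents
model: Claude Sonnet 4
---
Summarize the attached document in five bullet points.
```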

Pattern 1: planner + executors

User request
     │
     ▼
┌─────────────────────────┐
│  Reasoning model (o3)   │  ← Analyzes request, decomposes into steps
│  "The planner"          │
└─────────────────────────┘
     │
     ├────────────────────┐
     ▼                    ▼
┌─────────────────────┐  ┌─────────────────────┐
│  GPT-4o             │  │  Claude Sonnet 4    │
│  Fast code gen      │  │  Long doc analysis  │
└─────────────────────┘  └─────────────────────┘

A reasoning model handles the complex planning, then delegates execution to faster, cheaper standard models.
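The delegation shape can be sketched as a plain routing table. The planner’s decomposition is mocked here because the point is the routing, not the planning; model names and task types are illustrative:

```python
# Which executor model handles each task type in this sketch.
EXECUTORS = {
    "doc-analysis": "claude-sonnet-4",
    "codegen": "gpt-4o",
}

def plan(request: str) -> list[dict]:
    """Stand-in for the reasoning-model planner: decompose a request
    into typed subtasks. A real planner would call o3 here."""
    return [
        {"type": "doc-analysis", "input": f"Find constraints in the spec for: {request}"},
        {"type": "codegen", "input": f"Implement: {request}"},
    ]

def execute(request: str) -> list[tuple[str, str]]:
    """Route each planned subtask to its executor model."""
    return [(EXECUTORS[task["type"]], task["input"]) for task in plan(request)]

routing = execute("rate limiter middleware")
```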

Pattern 2: task-specific model selection

Task type                     Recommended model    Why
───────────────────────────────────────────────────────────────────────────
Agent orchestration           GPT-4o               Fast, balanced, reliable
Long document analysis        Claude Sonnet 4      200K context, strong comprehension
Complex reasoning             o3 / o4-mini         Internal chain of thought
Code generation               GPT-4o / Claude      Fast, accurate output
Multimodal (image + text)     Gemini 2.0 / GPT-4o  Strong vision capabilities
Evaluation / grading          o3                   Nuanced judgment, high accuracy
Agentic multi-step workflows  Claude Opus 4.6      Highest agentic capability
Deep analysis, research       Claude Opus 4.6      Multi-step reasoning

Pattern 3: model-specific reviewers

Create dedicated review agents optimized for each model’s prompting style:

# openai-prompt-reviewer.agent.md
---
name: openai-prompt-reviewer
description: Reviews prompts for GPT model optimization
model: gpt-4o
tools: ['codebase', 'search']
---

Each reviewer checks that prompts follow the optimal patterns for their target model: developer message structure for GPT, XML tags for Claude, consistent formatting for Gemini.


🔑 BYOK: bring-your-own-key providers

GitHub Copilot’s model picker isn’t limited to the built-in models. BYOK (bring-your-own-key) lets you connect external model providers using your own API keys.

Available BYOK providers

Provider            Models available                   Key advantage
──────────────────────────────────────────────────────────────────────────────
Cerebras            Llama 3.3, DeepSeek v3.2, GLM-4.6  Extremely fast inference
OpenRouter          100+ models                        Unified API for multiple providers
Ollama              Local models                       Fully local, no API calls
Azure OpenAI        GPT-4o, GPT-4 Turbo                Enterprise deployment
Anthropic (direct)  Claude models                      Direct API access

HuggingFace integration

The HuggingFace Inference Provider extension enables access to open-weights models:

  • Multiple inference providers: HuggingFace API, Nebius, SambaNova, Together AI
  • Automatic routing: fastest or cheapest mode
  • Open-weights models: Llama, Mistral, DeepSeek, Qwen

Quota implications

BYOK models don’t consume your GitHub Copilot quota, but:

  • An active Copilot subscription is still required
  • BYOK costs are billed directly by the provider
  • Background query refinement (using GPT-4o Mini) doesn’t count against quota
  • Full prompt logging is available in the output channel for debugging

📋 Model selection decision framework

What's your top priority?
│
├─ Speed and cost
│   └─ GPT-4o mini / Gemini 2.0 Flash
│
├─ Accuracy and reliability
│   ├─ Is the task complex/ambiguous?
│   │   ├─ Yes → o3 or Claude Extended Thinking
│   │   └─ No  → GPT-4o or Claude Sonnet 4
│   └─ Does it need agentic multi-step work?
│       └─ Yes → Claude Opus 4.6 or GPT-5
│
├─ Long context (>100K tokens)
│   └─ Claude Sonnet 4 or Gemini 2.0
│
├─ Multimodal (images + text)
│   └─ Gemini 2.0 or GPT-4o
│
└─ Local/private (no cloud)
    └─ Ollama via BYOK
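The decision tree translates directly into a selection function. A sketch; the returned names condense the article’s recommendations and the priority labels are illustrative:

```python
def select_model(priority: str, complex_task: bool = False,
                 agentic: bool = False) -> str:
    """Mirror the decision tree: pick a model family by top priority."""
    if priority == "speed-and-cost":
        return "gpt-4o-mini or gemini-2.0-flash"
    if priority == "accuracy":
        if complex_task:
            return "o3 or claude-extended-thinking"
        if agentic:
            return "claude-opus-4.6 or gpt-5"
        return "gpt-4o or claude-sonnet-4"
    if priority == "long-context":
        return "claude-sonnet-4 or gemini-2.0"
    if priority == "multimodal":
        return "gemini-2.0 or gpt-4o"
    if priority == "local":
        return "ollama (BYOK)"
    raise ValueError(f"unknown priority: {priority}")
```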

Quick reference table

Scenario                        Primary model    Fallback
─────────────────────────────────────────────────────────────────────────
Production agent orchestration  GPT-4o           Claude Sonnet 4
Complex multi-step reasoning    o3               o4-mini (faster)
Document summarization (long)   Claude Sonnet 4  Gemini 2.0
Code generation                 GPT-4o           Claude Sonnet 4
Visual reasoning                Gemini 2.0       GPT-4o
Mathematical problems           o3               Claude Extended Thinking
Agentic planning                o3               GPT-5
Agentic workflows               Claude Opus 4.6  GPT-5, o3
Deep research and analysis      Claude Opus 4.6  Claude Extended Thinking

⚠️ Key considerations

The re-validation rule

Every time you change model or version:

  1. Read the official prompting guide for that model
  2. Re-validate existing prompts against the new model’s behavior
  3. Update your test pipeline with latest guide recommendations

This isn’t optional for production systems. Model updates can change behavior in subtle ways that break previously working prompts.

Cost vs. capability trade-off

More capable models cost more per token and respond more slowly. For production systems, this creates a design tension:

  • Don’t use o3 for tasks that GPT-4o handles well
  • Do use reasoning models for genuinely complex planning
  • Consider multi-model architectures that route tasks to the appropriate model

Context window isn’t everything

A model with a 1M+ context window doesn’t automatically handle long documents well. Context rot (attention degradation in the middle of long prompts) affects all models. Large context windows help, but you still need to structure your prompts so critical information appears at the beginning and end.
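One simple mitigation is to repeat the critical instruction at both ends of a long prompt, keeping the bulk reference material in the middle. A minimal sketch; the separators and wording are illustrative:

```python
def sandwich_prompt(critical_instruction: str, long_context: str) -> str:
    """Place the critical instruction at the start and end of the prompt,
    with bulk reference material in the middle, to counter context rot."""
    return "\n\n".join([
        critical_instruction,
        "--- reference material ---",
        long_context,
        "--- end of reference material ---",
        f"Reminder: {critical_instruction}",
    ])

p = sandwich_prompt("Answer only from the provided spec.",
                    "lots of reference text here")
```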


🎯 Conclusion

Model selection is a first-class prompt engineering concern. Each model family brings distinct strengths: GPT excels at following explicit instructions, Claude at nuanced analysis with rich context, Gemini at structured multimodal tasks, and reasoning models at complex planning. Understanding these differences, and designing your agents, prompts, and orchestrations to leverage them, is what separates generic prompt engineering from production-quality systems.

Key takeaways

  • No “best model” exists; only the best model for a specific task and prompting style
  • The compiler analogy captures the core insight: same prompt, different models, different results
  • Standard models need detailed instructions; reasoning models need high-level goals
  • Multi-model architectures let you route different tasks to different models within the same workflow
  • BYOK extends your options to 100+ models through OpenRouter, Ollama, Cerebras, and HuggingFace
  • Re-validate prompts every time you change model or version; this isn’t optional for production
  • Context rot affects all models; structure prompts with critical information at the beginning and end


📚 References

OpenAI Prompt Engineering Guide [📘 Official] Comprehensive guide for GPT-4o, GPT-5, and the latest OpenAI models. Covers developer messages, few-shot examples, prompt caching, and model-specific optimization.

Anthropic Prompt Engineering Overview [📘 Official] Master guide for Claude models. Covers XML tagging, chain-of-thought prompting, extended thinking, and long-context optimization.

Google Gemini Prompt Design Strategies [📘 Official] Comprehensive guide for Gemini 2.0 and Gemini 3 models. Covers structured prompting, completion patterns, and multimodal inputs.

OpenAI Reasoning Models Guide [📘 Official] Technical documentation for the o3 and o4-mini reasoning models. Covers when to use reasoning, effort levels, and token budgeting.

VS Code Copilot Language Models Documentation [📘 Official] Microsoft’s documentation for model selection in VS Code, including the Language Models Editor, BYOK provider configuration, and capability filtering.