Measuring Readability and Comprehension

technical-writing
readability
comprehension
usability-testing
quality-metrics
diataxis
Go beyond Flesch scores with comprehensive readability formulas, comprehension testing methodologies, information foraging theory, and documentation usability testing for measurable quality
Author

Dario Airoldi

Published

February 28, 2026

Measuring Readability and Comprehension

Move from gut feelings to evidence-based quality assessment—measuring not just whether readers can decode your words, but whether they understand, retain, and act on your documentation

Table of Contents

🎯 Introduction

Readability scores tell you whether your text is linguistically accessible. But readability isn’t comprehension. A sentence can score 65 on Flesch Reading Ease and still leave readers confused about what to do next. Measuring documentation quality requires a broader toolkit—one that spans surface-level readability, deep comprehension, information findability, and usability.

Article 01 surveys all seven readability formulas with practical targets and scoring guidance. This article goes deeper—providing mathematical foundations, comprehension testing methodologies, and quality measurement frameworks that readability scores alone can’t capture:

  • Readability formulas in depth — Coleman-Liau, SMOG, Dale-Chall, and ARI with full mathematical treatment, strengths, weaknesses, and when each outperforms the others
  • Functional quality vs. deep quality — The Diátaxis framework’s distinction between measurable standards and the subjective experience of excellent documentation
  • Comprehension testing — Cloze tests, recall tests, think-aloud protocols, and task-based testing
  • Information scent and foraging theory — Why users abandon documentation and how to keep them on the right path
  • Mental model alignment — Ensuring your documentation’s conceptual structure matches how readers think
  • Documentation usability testing — Task completion rates, time-on-task, and error rates as quality indicators
  • Quantitative benchmarks — Target scores by content type (tutorials, reference, how-to guides, explanation)
  • Tools comparison — textstat, Vale, Hemingway, readable.com, and other readability measurement tools

Why this matters: Readability and understandability are explicitly requested validation criteria in this repository (see 05-validation-and-quality-assurance.md). Without comprehensive measurement, “good enough” is just a guess.

Prerequisites: Familiarity with writing style principles (especially the readability formulas section) and validation and quality assurance is recommended.

📊 Readability formulas compared

Article 01 surveys all seven readability formulas with practical targets and score interpretation tables. This section provides deeper mathematical treatment of the four formulas beyond Flesch—Coleman-Liau, SMOG, Dale-Chall, and ARI—with full formulas, statistical validation context, and guidance on when each outperforms the others.

On deliberate overlap with Article 01: Both articles cover readability formulas, but with different purposes. Article 01 presents all seven formulas as a practical survey—what they are, what scores mean, and what targets to use. This article provides analytical depth—mathematical foundations, comparative strengths, and how formulas connect to comprehension testing and usability measurement. This intentional layering follows the series’ redundancy policy (see Article 08, acceptable redundancy).

Why multiple formulas matter

No single readability formula captures every dimension of text complexity. Each formula uses different linguistic features as proxies for difficulty:

Proxy Formulas that use it Limitation
Syllable count Flesch, FK Grade, Gunning Fog Penalizes technical terms that are actually familiar to the audience
Word length (characters) Coleman-Liau, ARI Doesn’t distinguish between common long words and rare short ones
Sentence length All formulas Doesn’t account for clause complexity or nesting depth
Vocabulary familiarity Dale-Chall List-dependent; may not reflect domain-specific audiences
Polysyllabic word count SMOG Better for health/medical content; less tested for technical docs

Using multiple formulas and comparing their results provides a more reliable assessment than relying on any single score.

Coleman-Liau Index

The Coleman-Liau Index estimates the US grade level required to understand a text. Unlike Flesch-based formulas, it uses character count instead of syllable count—making it easier to compute automatically and more reliable for machine scoring.

Formula:

\[CLI = 0.0588 \times L - 0.296 \times S - 15.8\]

Where:

  • \(L\) = average number of letters per 100 words
  • \(S\) = average number of sentences per 100 words

Strengths:

  • Doesn’t require syllable counting (syllable detection is error-prone in automated tools)
  • Designed explicitly for machine scoring
  • Strong correlation with comprehension test results

Weaknesses:

  • Character count penalizes languages with longer average words
  • Doesn’t account for vocabulary familiarity
  • Less intuitive to interpret than Flesch Reading Ease

Typical range for technical documentation: 10–14 (high school to early college)

SMOG (Simple Measure of Gobbledygook)

The SMOG grade estimates years of education needed for 100% comprehension of a text. It was developed by G. Harry McLaughlin in 1969 as a more accurate substitute for the Gunning Fog Index.

Formula:

\[SMOG = 1.0430 \times \sqrt{polysyllables \times \frac{30}{sentences}} + 3.1291\]

Where:

  • polysyllables = words with 3+ syllables in a 30-sentence sample
  • sentences = number of sentences in the sample

Strengths:

  • Yields a 0.985 correlation with comprehension test results (the highest of any readability formula)
  • Recommended for health communication materials by the American Medical Association
  • Simple to calculate manually with the approximate formula: count polysyllabic words in 30 sentences, take the square root of the nearest perfect square, add 3

Weaknesses:

  • Requires a minimum of 30 sentences for statistical validity
  • Polysyllabic word counting still penalizes familiar technical terms
  • Tends to give higher (harder) scores than Flesch-Kincaid for the same text

Typical range for technical documentation: 10–14

Dale-Chall readability formula

The Dale-Chall formula takes a fundamentally different approach: instead of measuring word length, it checks words against a list of 3,000 words that 80% of fourth-grade students could reliably understand. Any word not on this list counts as “difficult.”

Formula:

\[Raw = 0.1579 \times \left(\frac{difficult\ words}{total\ words} \times 100\right) + 0.0496 \times \left(\frac{total\ words}{sentences}\right)\]

If the percentage of difficult words exceeds 5%, add 3.6365 to get the adjusted score.

Score interpretation:

Score Reading level
4.9 or lower Easily understood by a 4th-grade student
5.0–5.9 5th- or 6th-grade student
6.0–6.9 7th- or 8th-grade student
7.0–7.9 9th- or 10th-grade student
8.0–8.9 11th- or 12th-grade student
9.0–9.9 College student

Strengths:

  • Directly measures vocabulary difficulty rather than using length as a proxy
  • More sensitive to audience-appropriate vocabulary choices
  • Updated word list (1995 revision) reflects modern English usage

Weaknesses:

  • Word list doesn’t account for domain expertise (terms like “endpoint,” “middleware,” and “deployment” aren’t on the list but are basic vocabulary for developers)
  • List-based approach requires maintenance as language evolves
  • Less useful for highly technical content where the audience knows jargon

Typical range for technical documentation: 7.0–9.0

Automated Readability Index (ARI)

The Automated Readability Index uses characters per word and words per sentence to estimate the US grade level needed to understand a text. Like Coleman-Liau, it avoids syllable counting.

Formula:

\[ARI = 4.71 \times \frac{characters}{words} + 0.5 \times \frac{words}{sentences} - 21.43\]

Strengths:

  • Very fast to compute (character and word counting only)
  • Designed specifically for real-time monitoring of readability on typewriters and early computers
  • No ambiguity in counting (unlike syllable-based formulas)

Weaknesses:

  • Crude proxy—character count captures less linguistic information than syllable count
  • Tends to overestimate difficulty for technical content with precise but familiar terms
  • Less research validation than Flesch or SMOG

Typical range for technical documentation: 10–14

Comprehensive comparison

The following table compares all seven readability formulas covered across this article and Article 01:

Formula Input Output Best for This repo’s target
Flesch Reading Ease Syllables, sentences 0–100 score (higher = easier) General readability screening 50–70
Flesch-Kincaid Grade Syllables, sentences US grade level Grade-level benchmarking 9–10
Gunning Fog Complex words, sentences Years of education Academic/professional content 8–12
Coleman-Liau Characters, sentences US grade level Automated pipelines 10–14
SMOG Polysyllabic words, sentences Years of education Health/medical content, high accuracy 10–14
Dale-Chall Unfamiliar words, sentences Adjusted score → grade level Vocabulary-sensitive assessment 7.0–9.0
Automated Readability Index (ARI) Characters, words, sentences US grade level Real-time monitoring 10–14

Practical recommendation: Use Flesch Reading Ease as your primary screening metric (it’s the most widely supported). Supplement with SMOG for accuracy validation and Dale-Chall when vocabulary difficulty is a concern. Run Coleman-Liau or ARI in automated CI pipelines where syllable counting adds unnecessary complexity.

⚖️ Functional quality vs. deep quality

The Diátaxis framework draws a critical distinction between two kinds of documentation quality that readability formulas alone can’t capture. Article 00 covers this distinction in full—definitions, characteristics, comparison table, and how Diátaxis serves each quality type. Here’s a brief recap as context for measurement.

On deliberate overlap with Article 00: Article 00 provides the full definition and theory of functional vs. deep quality—what each means, how they’re characterized, and how they relate. This article applies the distinction specifically to measurement strategy—what readability formulas can and can’t capture, and what a complete measurement approach requires. See Article 08 for the series redundancy policy.

Functional quality encompasses objectively measurable properties—accuracy, completeness, consistency, usefulness, and precision. These characteristics are independent of each other and can be assessed with metrics and checklists. Deep quality encompasses subjective, interdependent characteristics—flow, beauty, anticipation, and fitness for human needs. Deep quality can’t be reduced to scores; it requires human judgment.

The critical asymmetry: deep quality is conditional upon functional quality. Documentation won’t feel excellent if it’s inaccurate. But meeting every functional standard doesn’t guarantee it’ll feel good to use.

What this means for measurement

Readability formulas measure one aspect of functional quality. They’re necessary but insufficient. A complete measurement strategy must also:

  1. Measure all dimensions of functional quality — not just readability, but accuracy, completeness, consistency, and usefulness (the seven validation dimensions in Article 05 operationalize this)
  2. Create conditions for deep quality — through user testing, information architecture analysis, and flow assessment (see comprehension testing and documentation usability testing below)
  3. Recognize the limits of metrics — deep quality can’t be reduced to a dashboard, but it can be enquired into through qualitative methods

The Diátaxis framework helps by preventing disruptions to flow (for example, keeping explanation out of how-to guides) and by aligning documentation types with user needs. But applying Diátaxis doesn’t guarantee deep quality—it lays down conditions for its possibility.

🧪 Comprehension testing methodologies

Readability formulas predict whether text should be understandable based on linguistic features. Comprehension tests measure whether text is actually understood by real readers. They answer a fundamentally different question.

Cloze tests

A cloze test (from “closure” in Gestalt psychology) deletes every Nth word from a passage and asks readers to fill in the blanks. The percentage of correctly restored words indicates comprehension.

How to administer:

  1. Select a representative passage (250–350 words)
  2. Delete every 5th word (some researchers use every 7th)
  3. Replace deleted words with uniform-length blanks
  4. Ask test subjects to fill in the blanks
  5. Score: count exact word matches (synonyms don’t count in the standard method)

Score interpretation:

Cloze score Comprehension level Implication
60%+ Independent level Reader understands without assistance
40–59% Instructional level Reader understands with some support
Below 40% Frustration level Reader can’t understand effectively

Strengths:

  • Well-researched and statistically validated (Taylor, 1953)
  • Measures actual comprehension, not predicted readability
  • Easy to create and administer
  • Works across document types and audiences

Weaknesses:

  • Requires actual human participants
  • Results depend heavily on passage selection
  • Exact-word scoring can undercount comprehension (a reader who writes “use” instead of “utilize” clearly understood the text)
  • Doesn’t measure ability to apply knowledge

When to use: Validate that documentation meets audience reading level before publishing; compare comprehension across draft versions; assess whether technical vocabulary creates barriers.

Recall and recognition tests

Recall tests measure what readers remember after reading documentation without prompts. Recognition tests present options (like multiple-choice questions) and ask readers to identify correct information.

Free recall protocol:

  1. Ask readers to read a section of documentation
  2. Remove the documentation
  3. Ask readers to write down everything they remember
  4. Score: count the number of key concepts accurately recalled

Cued recall protocol:

  1. Ask readers to read documentation
  2. Provide questions about specific concepts (“What command starts the development server?”)
  3. Score: accuracy of responses

Recognition protocol:

  1. Ask readers to read documentation
  2. Present multiple-choice or true/false questions
  3. Score: percentage of correct answers

Recall vs. recognition comparison:

Aspect Recall Recognition
Difficulty Harder (retrieval from memory) Easier (matching against options)
What it measures Deep encoding of information Familiarity with information
Use for docs “Can users remember the steps?” “Can users identify the right approach?”
Practical value Higher—reflects real-world use Lower—doesn’t mean users can act on knowledge

Think-aloud protocols

In a think-aloud protocol, readers verbalize their thoughts while reading documentation. A researcher observes and records where confusion, satisfaction, or frustration occurs.

How to conduct:

  1. Select 3–5 representative readers from your target audience
  2. Ask them to read a section while speaking their thoughts aloud
  3. Record the session (audio or video)
  4. Code the transcript for comprehension markers:
    • Understanding signals: “Oh, so this means…” / “That makes sense because…”
    • Confusion signals: “Wait, what?” / “I don’t understand why…” / re-reading
    • Inference signals: “I think this means…” (correct or incorrect inferences reveal gaps)

Strengths:

  • Reveals where and why comprehension breaks down (not just that it did)
  • Identifies assumptions readers bring to the documentation
  • Catches problems that no formula can detect (misleading examples, ambiguous instructions, missing context)

Weaknesses:

  • Time-intensive (1–2 hours per participant for analysis)
  • Small sample sizes (typically 3–5 participants)
  • The act of thinking aloud may alter reading behavior
  • Requires skilled facilitation to avoid leading participants

When to use: Before publishing high-stakes documentation (onboarding guides, API quickstarts); when readability scores are acceptable but user feedback indicates confusion; when redesigning documentation structure.

Task-based comprehension testing

Task-based testing measures comprehension by asking readers to do something after reading documentation. This is the most authentic test because it mirrors real-world documentation use.

Protocol:

  1. Define specific tasks that documentation should enable (“Deploy an Azure Function using the CLI”)
  2. Provide relevant documentation
  3. Ask participants to complete the task
  4. Measure: task completion rate, time-on-task, errors, help requests

Metrics:

Metric What it measures Target for good docs
Task completion rate Can users succeed? 80%+ first attempt
Time-on-task How efficiently? Within 1.5× estimated time
Error rate How accurately? <2 wrong actions per task
Help requests Is documentation self-sufficient? <1 per task

Strengths:

  • Directly measures documentation’s practical value
  • Reveals gaps between what documentation says and what users need
  • Results are concrete and actionable (“Users failed at step 5—that’s where we need to improve”)

Weaknesses:

  • Most resource-intensive testing method
  • Requires a working environment for task completion
  • Hard to isolate documentation quality from tool/product usability

🔍 Information scent and foraging theory

Information foraging theory (Pirolli & Card, 1999) applies ecological foraging models to explain how people search for information. Just as animals follow scent trails to find food, users follow information scent—cues that signal whether a path will lead to useful content.

The foraging model

In the natural world, animals make continuous decisions: keep foraging in this patch, or move to a new one? The decision depends on the rate of return—when a patch becomes depleted, the rational strategy is to move on.

Information foraging applies the same logic to documentation users:

  • Information patches = documentation pages, sections, search results
  • Information scent = headings, link text, navigation labels, breadcrumbs, summaries
  • Foraging decision = continue reading this page or navigate elsewhere
  • Rate of return = amount of useful information gained per unit of time invested

Why users abandon documentation

Users leave documentation when information scent is weak:

Weak scent → abandonment:

  • Vague headings (“Overview,” “Introduction,” “Getting Started” with no specifics)
  • Navigation labels that don’t match user terminology
  • Long pages without clear section boundaries
  • Search results with misleading snippets

Strong scent → engagement:

  • Specific headings that match user queries (“Configure OAuth 2.0 for Azure Functions”)
  • Navigation labels using the user’s task language, not the product’s internal terminology
  • Summary boxes that preview section content
  • Progressive disclosure that rewards scanning

Measuring information scent

Method 1: First-click testing

  1. Present users with a documentation homepage or navigation
  2. Ask: “Where would you click to find [specific information]?”
  3. Measure: percentage of users who click the correct link first

A first-click success rate below 50% indicates weak information scent. Research shows that users who click correctly on the first try succeed at their overall task 87% of the time, compared to 46% for those who click incorrectly first.

Method 2: Navigation path analysis

  1. Track the sequence of pages users visit before finding target information
  2. Measure: number of pages visited, backtracking frequency, time to target
  3. Optimal: users reach information in 2–3 clicks with no backtracking

Method 3: Heading prediction

  1. Show users a heading and ask: “What content would you expect under this heading?”
  2. Compare predictions to actual content
  3. Misalignment indicates heading doesn’t carry accurate scent

Improving information scent in documentation

Problem Solution Example
Generic headings Use specific, task-oriented headings “Getting Started” → “Deploy your first function in 5 minutes”
Ambiguous navigation Match user vocabulary, not internal naming “Resources” → “API reference”
Long pages without landmarks Add summary boxes, anchor links, visual breaks TL;DR boxes at section start
Search result snippets Write informative descriptions for each page Meta descriptions in YAML frontmatter

🧠 Mental model alignment

A mental model is the internal representation a person holds about how something works. Documentation succeeds when its conceptual structure aligns with the reader’s mental model. It fails when it forces readers to build a new mental model just to navigate the docs.

What mental model alignment looks like

Aligned: > A developer expects “authentication” to involve tokens and API keys. The authentication section covers exactly that, organized by authentication method.

Misaligned: > The same developer looks for “authentication” but finds it under “Security” → “Identity” → “Credential Management.” The path doesn’t match how they think about the concept.

Measuring alignment

Card sorting:

  1. Write each documentation topic on a card
  2. Ask users to group cards into categories that make sense to them
  3. Compare user-generated categories to your documentation’s actual organization
  4. Overlap percentage indicates alignment
  • Open card sort: Users create their own category labels (reveals how they think)
  • Closed card sort: Users sort cards into predefined categories (tests your structure)

Tree testing:

  1. Present your documentation’s hierarchy as a text-only tree (no visual design)
  2. Ask users to find specific information by navigating the tree
  3. Measure: success rate, directness (no backtracking), time to completion

Tree testing removes visual design influence—if users can’t find information in the tree, the problem lies with the information architecture, not the page design.

Concept mapping:

  1. Ask users to draw how they think the documented system works
  2. Compare their concept maps to the system’s actual architecture
  3. Identify gaps between user understanding and reality

If documentation is effective, concept maps drawn after reading should more closely match reality than maps drawn before reading.

Common alignment failures

Failure Symptom Fix
Org-chart structure Docs mirror internal team structure, not user needs Reorganize around user tasks and concepts
Expert blind spot Docs assume knowledge the audience doesn’t have User test with representative beginners
Feature-centric Docs organized by features, not by user goals Add task-based navigation alongside feature reference
Jargon mismatch Docs use internal terms users don’t recognize Conduct vocabulary alignment testing with users

📋 Documentation usability testing

Usability testing for documentation applies the same principles as software usability testing: observe real users attempting real tasks, measure outcomes, fix what’s broken.

Planning a documentation usability test

Define objectives:

  • Which documentation sections are you testing?
  • What tasks should readers be able to complete?
  • What success metrics matter most?

Recruit participants:

  • 5 participants detect ~85% of usability issues (Nielsen & Landauer, 1993)
  • Match participants to your actual audience (developers testing developer docs)
  • Include a range of experience levels if your documentation serves multiple audiences

Design tasks:

  • Base tasks on real user goals, not documentation structure
  • Include both “find information” and “accomplish a task” scenarios
  • Order tasks from simple to complex

Core metrics

Metric Definition How to measure Benchmark
Task success rate Percentage of participants who complete the task Binary: completed or not 78%+ (Sauro & Lewis, 2016)
Time-on-task Time from task start to successful completion Stopwatch or screen recording Within 2× expert time
Error rate Number of wrong actions per task Observer counts deviations from optimal path <3 per task
Satisfaction (SUS) System Usability Scale score (0–100) Post-test questionnaire 68+ (above average)
Findability Time to first correct navigation choice Screen recording analysis <30 seconds for top-level nav

The System Usability Scale (SUS) for documentation

The System Usability Scale (Brooke, 1996) is a 10-item questionnaire that produces a single usability score. Although designed for software, it adapts well to documentation:

Adapted SUS questions for documentation:

  1. I think I would use this documentation frequently
  2. I found the documentation unnecessarily complex
  3. I thought the documentation was easy to use
  4. I think I would need support from a person to use this documentation
  5. I found the various sections were well integrated
  6. I thought there was too much inconsistency in this documentation
  7. I imagine most people would learn to navigate this documentation quickly
  8. I found the documentation very cumbersome to navigate
  9. I felt confident finding information in this documentation
  10. I needed to learn a lot before I could get going with this documentation

Scoring: SUS scores range from 0 to 100. A score of 68 is average. Above 80 is good. Above 90 is exceptional.

Lightweight alternatives

Full usability testing isn’t always feasible. These lighter methods still provide actionable data:

Five-second test:

  1. Show a documentation page for 5 seconds
  2. Ask: “What is this page about?” and “What can you do from here?”
  3. Measures: first impression clarity, information scent strength

Highlighter test:

  1. Give readers a printed documentation page and two highlighters
  2. Green = “I understand this clearly”; Pink = “I’m confused by this”
  3. Collect and stack pages—confusion clusters become visually obvious

Heuristic evaluation: Apply Jakob Nielsen’s 10 usability heuristics to documentation:

  • Visibility of system status → “Does the TOC show where the reader is?”
  • Match between system and real world → “Does the documentation use the user’s language?”
  • User control and freedom → “Can readers easily navigate back?”
  • Consistency and standards → “Do similar sections follow the same structure?”

📏 Quantitative benchmarks by content type

Different Diátaxis documentation types serve different purposes, and their readability targets should differ accordingly. The following benchmarks combine readability research with the content type characteristics described in Article 00.

Benchmarks table

Metric Tutorial How-to guide Reference Explanation
Flesch Reading Ease 60–70 55–65 45–60 50–65
FK Grade Level 8–9 9–10 10–12 9–11
Gunning Fog 8–10 9–11 10–14 9–12
Coleman-Liau 9–11 10–12 11–14 10–13
SMOG 9–11 10–12 11–14 10–13
Dale-Chall 6.5–7.5 7.0–8.0 7.5–9.0 7.0–8.5
ARI 9–11 10–12 11–14 10–13
Avg. sentence length 14–20 words 15–22 words 12–20 words 16–24 words
Cloze score 55%+ 50%+ 45%+ 50%+
Task success rate 85%+ 80%+ N/A (lookup, not task) N/A (understanding, not task)

Rationale by content type

Tutorials target the easiest readability because they serve newcomers who are learning. Shorter sentences, simpler vocabulary, and higher cloze scores reflect the need for maximum clarity. Task success rates should be high because tutorials control the environment.

How-to guides allow slightly more complexity because they assume prior knowledge. They’re task-oriented, so task success rate remains a key metric, but the tolerance for technical vocabulary increases.

Reference documentation can have the highest complexity because its audience actively seeks specific information. Dense, precise descriptions are acceptable—but sentence length should remain short because reference is consumed in fragments, not read linearly.

Explanation falls in the middle. It discusses concepts and builds understanding, requiring enough complexity to cover nuance without becoming impenetrable. Longer sentences are acceptable because readers engage in sustained reading, but vocabulary should remain accessible.

Using benchmarks effectively

Don’t: enforce benchmark targets rigidly. A reference page with a Flesch score of 43 isn’t automatically bad—it might be accurately describing complex API behavior.

Do: use benchmarks as investigation triggers. A tutorial with a Flesch score of 40 deserves a second look—is the vocabulary unnecessarily complex for newcomers? Can sentences be shortened without losing meaning?

Do: track trends over time. A series of articles that progressively drift toward lower readability scores may indicate creeping complexity.

🛠️ Tools comparison

Multiple tools can measure readability, enforce style rules, and support comprehension assessment. Here’s how they compare for technical documentation workflows.

Readability measurement tools

Tool Type Formulas supported Best for Cost
textstat Python library Flesch, FK, Gunning Fog, Coleman-Liau, SMOG, Dale-Chall, ARI, Linsear Write CI/CD pipeline integration, batch analysis Free (open-source)
readable.com Web service All major formulas + proprietary metrics One-off analysis, content marketing teams Paid (free trial)
Hemingway Editor Web/desktop app Custom (grade level) Real-time writing feedback, sentence simplification Free (web), paid (desktop)
readability-cli Node.js CLI Flesch, FK, Coleman-Liau, ARI, SMOG, Dale-Chall Command-line workflows, quick checks Free (open-source)

Prose linting tools

Tool Type What it checks Best for Cost
Vale CLI linter Style, terminology, jargon, consistency (configurable rules) Enforcing style guides at scale, CI integration Free (open-source)
Vale + Microsoft style Vale package Microsoft Writing Style Guide compliance Microsoft-aligned documentation Free (open-source)
alex CLI linter Insensitive, inconsiderate language Inclusive language enforcement Free (open-source)
write-good CLI linter Passive voice, weasel words, adverbs Quick writing quality checks Free (open-source)

Comprehensive comparison: textstat vs. Vale vs. Hemingway

Dimension textstat Vale Hemingway
Readability formulas All 7+ (programmatic) None built-in (different purpose) Grade-level equivalent
Style enforcement None Extensive (custom rules) Basic (adverbs, passive voice, complexity)
CI/CD integration Excellent (Python library) Excellent (CLI, GitHub Actions) None (manual only)
Custom rules Write Python code YAML rule definitions None
Terminology checking None Built-in (substitution, existence, occurrence rules) None
Learning curve Low (Python API) Medium (rule configuration) Very low (visual interface)
Batch processing Yes (scripted) Yes (glob patterns) No

Recommendation for this repository:

  • Primary readability tool: textstat in Python scripts for automated validation
  • Primary style tool: Vale with Microsoft style package for consistent terminology
  • Quick checks during writing: Hemingway Editor for real-time sentence simplification
  • CI integration: Vale + textstat in GitHub Actions for pre-merge validation

📌 Applying readability measurement to this repository

Current coverage

This repository currently measures readability through:

Opportunities for expanded measurement

Short-term (addable with minimal effort):

  1. Add Coleman-Liau and SMOG scores to the readability review prompt for cross-validation
  2. Include Dale-Chall analysis when reviewing vocabulary accessibility
  3. Create content-type benchmarks (the table in this article can serve as a starting reference)

Medium-term (requires tooling):

  1. Integrate textstat into a Python-based validation script
  2. Configure Vale with Microsoft style rules for terminology consistency
  3. Add readability scoring to CI pipeline for pull request validation

Long-term (requires user testing):

  1. Conduct cloze testing on representative articles in each Diátaxis type
  2. Run first-click testing on the repository’s navigation structure
  3. Perform tree testing on the table of contents hierarchy
  4. Track documentation usability metrics over time as part of quality dashboards

Practical implementation: textstat example

# Install: pip install textstat
import textstat

def analyze_readability(text: str) -> dict:
    """Calculate comprehensive readability metrics for documentation."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "gunning_fog": textstat.gunning_fog(text),
        "coleman_liau": textstat.coleman_liau_index(text),
        "smog": textstat.smog_index(text),
        "dale_chall": textstat.dale_chall_readability_score(text),
        "ari": textstat.automated_readability_index(text),
        "avg_sentence_length": textstat.avg_sentence_length(text),
    }

# Example usage:
# with open("article.md", "r") as f:
#     text = f.read()
# scores = analyze_readability(text)
# print(f"Flesch RE: {scores['flesch_reading_ease']:.1f}")
# print(f"SMOG: {scores['smog']:.1f}")
# print(f"Dale-Chall: {scores['dale_chall']:.1f}")

✅ Conclusion

Measuring documentation quality is a multi-dimensional challenge. Readability formulas provide a necessary but insufficient foundation—they measure linguistic surface features but can’t tell you whether readers understand, can act on, or enjoy your documentation.

Key takeaways

  • No single formula suffices — Use multiple readability formulas (Flesch, SMOG, Dale-Chall, Coleman-Liau) to get a reliable picture; each measures different linguistic features
  • Functional quality is necessary but not sufficient — The Diátaxis framework distinguishes between measurable functional quality (accuracy, completeness, consistency) and subjective deep quality (flow, anticipation, beauty)
  • Comprehension testing reveals what formulas can’t — Cloze tests, recall tests, think-aloud protocols, and task-based testing measure actual understanding, not predicted readability
  • Information scent drives navigation success — Users follow cues (headings, link text, navigation labels) like foraging animals follow scent; weak scent causes abandonment
  • Mental models shape comprehension — Documentation succeeds when its structure matches how readers think about the topic; card sorting and tree testing reveal alignment
  • Usability metrics ground quality in evidence — Task completion rates, time-on-task, error rates, and SUS scores provide objective evidence for documentation effectiveness
  • Benchmarks should vary by content type — Tutorials need higher readability than reference; how-to guides need higher task success rates than explanations

Next steps

📚 References

Readability formulas and research

Flesch-Kincaid Readability Tests - Wikipedia 📘 [Official]
Technical explanation of Flesch Reading Ease and Flesch-Kincaid Grade Level formulas, with scoring interpretation and history.

Coleman-Liau Index - Wikipedia 📘 [Official]
Description of the character-based readability formula designed for machine scoring, including the original 1975 formula and worked example.

SMOG - Wikipedia 📘 [Official]
Overview of the Simple Measure of Gobbledygook formula, its 0.985 correlation with comprehension tests, and its recommendation for health communication materials.

Dale-Chall Readability Formula - Wikipedia 📘 [Official]
Description of the vocabulary-based readability formula using a 3,000-word familiar word list, with formula, scoring table, and history.

Automated Readability Index - Wikipedia 📘 [Official]
The character-and-word-count formula designed for real-time readability monitoring without syllable analysis.

Diátaxis framework and quality

Towards a Theory of Quality in Documentation - Diátaxis 📗 [Verified Community]
Daniele Procida’s exploration of functional quality vs. deep quality in documentation. Distinguishes objectively measurable standards from subjective excellence. Essential reading for understanding why metrics alone don’t guarantee quality.

Diátaxis - A Systematic Approach to Technical Documentation 📗 [Verified Community]
The overarching framework for documentation types (tutorials, how-to guides, reference, explanation) that this series uses as its organizational foundation.

Comprehension testing and usability

Cloze Procedure - Wikipedia 📘 [Official]
Background on the cloze test methodology originally developed by Wilson Taylor (1953), including administration protocols and scoring interpretation.

Information Foraging Theory - Wikipedia 📘 [Official]
Overview of Pirolli and Card’s information foraging theory (1999) that models how users search for information using scent-following behavior borrowed from ecological foraging models.

Plain Language Guidelines - Federal Plain Language 📘 [Official]
US government standards for clear, accessible writing. Includes guidance on readability, audience analysis, and comprehension testing for public-facing content.

System Usability Scale - Wikipedia 📘 [Official]
Overview of Brooke’s (1996) SUS questionnaire methodology, scoring, and interpretation benchmarks. SUS produces a single composite score from ten Likert-scale items.

Tools and implementation

textstat - Python Library 📗 [Verified Community]
Open-source Python library implementing all major readability formulas (Flesch, FK, Gunning Fog, Coleman-Liau, SMOG, Dale-Chall, ARI, and more). Ideal for CI/CD pipeline integration and batch analysis.

Vale - Prose Linter 📗 [Verified Community]
Open-source prose linter supporting configurable style rules, terminology enforcement, and integration with Microsoft, Google, and custom style guides. The primary tool for automated documentation quality enforcement.

Hemingway Editor 📒 [Community]
Visual writing tool that highlights complex sentences, passive voice, and adverb overuse. Provides grade-level readability scoring. Best for real-time writing feedback during drafting.

readable.com 📒 [Community]
Web-based readability analysis service supporting all major formulas plus proprietary engagement metrics. Useful for one-off analysis and non-technical teams.

Repository-specific documentation

Validation Criteria [Internal Reference]
This repository’s seven validation dimensions with scoring thresholds, including Flesch targets (50–70) and grade-level standards (9–10).

Article Writing Instructions [Internal Reference]
Comprehensive writing guidance including readability targets, sentence length standards (15–25 words), and validation workflows.

Readability Review Prompt [Internal Reference]
AI-assisted validation prompt for analyzing Flesch scores, grade level, and suggesting readability improvements.