Measuring Readability and Comprehension
Move from gut feelings to evidence-based quality assessment—measuring not just whether readers can decode your words, but whether they understand, retain, and act on your documentation
Table of Contents
- 🎯 Introduction
- 📊 Readability formulas compared
- ⚖️ Functional quality vs. deep quality
- 🧪 Comprehension testing methodologies
- 🔍 Information scent and foraging theory
- 🧠 Mental model alignment
- 📋 Documentation usability testing
- 📏 Quantitative benchmarks by content type
- 🛠️ Tools comparison
- 📌 Applying readability measurement to this repository
- ✅ Conclusion
- 📚 References
🎯 Introduction
Readability scores tell you whether your text is linguistically accessible. But readability isn’t comprehension. A sentence can score 65 on Flesch Reading Ease and still leave readers confused about what to do next. Measuring documentation quality requires a broader toolkit—one that spans surface-level readability, deep comprehension, information findability, and usability.
Article 01 surveys all seven readability formulas with practical targets and scoring guidance. This article goes deeper—providing mathematical foundations, comprehension testing methodologies, and quality measurement frameworks that readability scores alone can’t capture:
- Readability formulas in depth — Coleman-Liau, SMOG, Dale-Chall, and ARI with full mathematical treatment, strengths, weaknesses, and when each outperforms the others
- Functional quality vs. deep quality — The Diátaxis framework’s distinction between measurable standards and the subjective experience of excellent documentation
- Comprehension testing — Cloze tests, recall tests, think-aloud protocols, and task-based testing
- Information scent and foraging theory — Why users abandon documentation and how to keep them on the right path
- Mental model alignment — Ensuring your documentation’s conceptual structure matches how readers think
- Documentation usability testing — Task completion rates, time-on-task, and error rates as quality indicators
- Quantitative benchmarks — Target scores by content type (tutorials, reference, how-to guides, explanation)
- Tools comparison — textstat, Vale, Hemingway, readable.com, and other readability measurement tools
Why this matters: Readability and understandability are explicitly requested validation criteria in this repository (see 05-validation-and-quality-assurance.md). Without comprehensive measurement, “good enough” is just a guess.
Prerequisites: Familiarity with writing style principles (especially the readability formulas section) and validation and quality assurance is recommended.
📊 Readability formulas compared
Article 01 surveys all seven readability formulas with practical targets and score interpretation tables. This section provides deeper mathematical treatment of the four formulas beyond Flesch—Coleman-Liau, SMOG, Dale-Chall, and ARI—with full formulas, statistical validation context, and guidance on when each outperforms the others.
On deliberate overlap with Article 01: Both articles cover readability formulas, but with different purposes. Article 01 presents all seven formulas as a practical survey—what they are, what scores mean, and what targets to use. This article provides analytical depth—mathematical foundations, comparative strengths, and how formulas connect to comprehension testing and usability measurement. This intentional layering follows the series’ redundancy policy (see Article 08, acceptable redundancy).
Why multiple formulas matter
No single readability formula captures every dimension of text complexity. Each formula uses different linguistic features as proxies for difficulty:
| Proxy | Formulas that use it | Limitation |
|---|---|---|
| Syllable count | Flesch, FK Grade, Gunning Fog | Penalizes technical terms that are actually familiar to the audience |
| Word length (characters) | Coleman-Liau, ARI | Doesn’t distinguish between common long words and rare short ones |
| Sentence length | All formulas | Doesn’t account for clause complexity or nesting depth |
| Vocabulary familiarity | Dale-Chall | List-dependent; may not reflect domain-specific audiences |
| Polysyllabic word count | SMOG | Better for health/medical content; less tested for technical docs |
Using multiple formulas and comparing their results provides a more reliable assessment than relying on any single score.
Coleman-Liau Index
The Coleman-Liau Index estimates the US grade level required to understand a text. Unlike Flesch-based formulas, it uses character count instead of syllable count—making it easier to compute automatically and more reliable for machine scoring.
Formula:
\[CLI = 0.0588 \times L - 0.296 \times S - 15.8\]
Where:
- \(L\) = average number of letters per 100 words
- \(S\) = average number of sentences per 100 words
Strengths:
- Doesn’t require syllable counting (syllable detection is error-prone in automated tools)
- Designed explicitly for machine scoring
- Strong correlation with comprehension test results
Weaknesses:
- Character count penalizes languages with longer average words
- Doesn’t account for vocabulary familiarity
- Less intuitive to interpret than Flesch Reading Ease
Typical range for technical documentation: 10–14 (high school to early college)
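As a concrete sketch, the Coleman-Liau computation can be automated with nothing more than regex tokenization. The naive sentence splitting below is a simplifying assumption for illustration; production tools use more robust segmentation.

```python
import re


def coleman_liau(text: str) -> float:
    """Coleman-Liau Index from raw letter, word, and sentence counts.

    A minimal sketch: words are letter runs, sentences are runs of
    terminal punctuation. Real tools handle abbreviations, numerals,
    and edge cases more carefully.
    """
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters = sum(len(w) for w in words)
    L = letters / len(words) * 100      # letters per 100 words
    S = sentences / len(words) * 100    # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8
```

Very short samples produce scores outside the usual grade range, so apply the formula to passages of a few hundred words, as with any readability formula.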
SMOG (Simple Measure of Gobbledygook)
The SMOG grade estimates years of education needed for 100% comprehension of a text. It was developed by G. Harry McLaughlin in 1969 as a more accurate substitute for the Gunning Fog Index.
Formula:
\[SMOG = 1.0430 \times \sqrt{polysyllables \times \frac{30}{sentences}} + 3.1291\]
Where:
- polysyllables = words with 3+ syllables in a 30-sentence sample
- sentences = number of sentences in the sample
Strengths:
- Yields a 0.985 correlation with comprehension test results (the highest of any readability formula)
- Recommended for health communication materials by the American Medical Association
- Simple to calculate manually with the approximate formula: count polysyllabic words in 30 sentences, take the square root of the nearest perfect square, add 3
Weaknesses:
- Requires a minimum of 30 sentences for statistical validity
- Polysyllabic word counting still penalizes familiar technical terms
- Tends to give higher (harder) scores than Flesch-Kincaid for the same text
Typical range for technical documentation: 10–14
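McLaughlin's formula is simple enough to script directly once you have a polysyllable count. The helper below takes the counts as inputs; syllable detection itself is the hard part and is left to a library or manual counting.

```python
import math


def smog_grade(polysyllables: int, sentences: int) -> float:
    """SMOG grade from a polysyllable count over a sentence sample.

    The formula is validated on samples of 30+ sentences; the
    30/sentences term rescales smaller samples at some cost in
    statistical reliability.
    """
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291
```

For example, 30 polysyllabic words in a 30-sentence sample yields a grade of roughly 8.8.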
Dale-Chall readability formula
The Dale-Chall formula takes a fundamentally different approach: instead of measuring word length, it checks words against a list of 3,000 words that 80% of fourth-grade students could reliably understand. Any word not on this list counts as “difficult.”
Formula:
\[Raw = 0.1579 \times \left(\frac{difficult\ words}{total\ words} \times 100\right) + 0.0496 \times \left(\frac{total\ words}{sentences}\right)\]
If the percentage of difficult words exceeds 5%, add 3.6365 to get the adjusted score.
Score interpretation:
| Score | Reading level |
|---|---|
| 4.9 or lower | Easily understood by a 4th-grade student |
| 5.0–5.9 | 5th- or 6th-grade student |
| 6.0–6.9 | 7th- or 8th-grade student |
| 7.0–7.9 | 9th- or 10th-grade student |
| 8.0–8.9 | 11th- or 12th-grade student |
| 9.0–9.9 | College student |
Strengths:
- Directly measures vocabulary difficulty rather than using length as a proxy
- More sensitive to audience-appropriate vocabulary choices
- Updated word list (1995 revision) reflects modern English usage
Weaknesses:
- Word list doesn’t account for domain expertise (terms like “endpoint,” “middleware,” and “deployment” aren’t on the list but are basic vocabulary for developers)
- List-based approach requires maintenance as language evolves
- Less useful for highly technical content where the audience knows jargon
Typical range for technical documentation: 7.0–9.0
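A sketch of the Dale-Chall calculation, with the familiar-word list passed in as a parameter so the example stays self-contained. The real formula uses the 3,000-word Dale-Chall list; any tiny set you pass in for testing is purely illustrative.

```python
import re


def dale_chall(text: str, familiar_words: set[str]) -> float:
    """Dale-Chall score given a caller-supplied familiar-word list.

    Words absent from familiar_words count as "difficult." If more
    than 5% of words are difficult, the 3.6365 adjustment applies.
    """
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    difficult = sum(1 for w in words if w not in familiar_words)
    pct_difficult = difficult / len(words) * 100
    raw = 0.1579 * pct_difficult + 0.0496 * (len(words) / sentences)
    return raw + 3.6365 if pct_difficult > 5 else raw
```

Swapping in a domain-specific familiar-word list (for example, adding terms like "endpoint" for a developer audience) is one way to work around the formula's main weakness.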
Automated Readability Index (ARI)
The Automated Readability Index uses characters per word and words per sentence to estimate the US grade level needed to understand a text. Like Coleman-Liau, it avoids syllable counting.
Formula:
\[ARI = 4.71 \times \frac{characters}{words} + 0.5 \times \frac{words}{sentences} - 21.43\]
Strengths:
- Very fast to compute (character and word counting only)
- Designed specifically for real-time monitoring of readability on typewriters and early computers
- No ambiguity in counting (unlike syllable-based formulas)
Weaknesses:
- Crude proxy—character count captures less linguistic information than syllable count
- Tends to overestimate difficulty for technical content with precise but familiar terms
- Less research validation than Flesch or SMOG
Typical range for technical documentation: 10–14
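Because ARI needs only character, word, and sentence counts, it fits in a few lines. The whitespace tokenization and punctuation stripping below are simplifying assumptions:

```python
import re


def ari(text: str) -> float:
    """Automated Readability Index from character/word/sentence counts.

    Characters are counted after stripping punctuation, so "sat." and
    "sat" contribute equally; real implementations vary on this detail.
    """
    words = re.findall(r"\S+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43
```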
Comprehensive comparison
The following table compares all seven readability formulas covered across this article and Article 01:
| Formula | Input | Output | Best for | This repo’s target |
|---|---|---|---|---|
| Flesch Reading Ease | Syllables, sentences | 0–100 score (higher = easier) | General readability screening | 50–70 |
| Flesch-Kincaid Grade | Syllables, sentences | US grade level | Grade-level benchmarking | 9–10 |
| Gunning Fog | Complex words, sentences | Years of education | Academic/professional content | 8–12 |
| Coleman-Liau | Characters, sentences | US grade level | Automated pipelines | 10–14 |
| SMOG | Polysyllabic words, sentences | Years of education | Health/medical content, high accuracy | 10–14 |
| Dale-Chall | Unfamiliar words, sentences | Adjusted score → grade level | Vocabulary-sensitive assessment | 7.0–9.0 |
| Automated Readability Index (ARI) | Characters, words, sentences | US grade level | Real-time monitoring | 10–14 |
Practical recommendation: Use Flesch Reading Ease as your primary screening metric (it’s the most widely supported). Supplement with SMOG for accuracy validation and Dale-Chall when vocabulary difficulty is a concern. Run Coleman-Liau or ARI in automated CI pipelines where syllable counting adds unnecessary complexity.
⚖️ Functional quality vs. deep quality
The Diátaxis framework draws a critical distinction between two kinds of documentation quality that readability formulas alone can’t capture. Article 00 covers this distinction in full—definitions, characteristics, comparison table, and how Diátaxis serves each quality type. Here’s a brief recap as context for measurement.
On deliberate overlap with Article 00: Article 00 provides the full definition and theory of functional vs. deep quality—what each means, how they’re characterized, and how they relate. This article applies the distinction specifically to measurement strategy—what readability formulas can and can’t capture, and what a complete measurement approach requires. See Article 08 for the series redundancy policy.
Functional quality encompasses objectively measurable properties—accuracy, completeness, consistency, usefulness, and precision. These characteristics are independent of each other and can be assessed with metrics and checklists. Deep quality encompasses subjective, interdependent characteristics—flow, beauty, anticipation, and fitness for human needs. Deep quality can’t be reduced to scores; it requires human judgment.
The critical asymmetry: deep quality is conditional upon functional quality. Documentation won’t feel excellent if it’s inaccurate. But meeting every functional standard doesn’t guarantee it’ll feel good to use.
What this means for measurement
Readability formulas measure one aspect of functional quality. They’re necessary but insufficient. A complete measurement strategy must also:
- Measure all dimensions of functional quality — not just readability, but accuracy, completeness, consistency, and usefulness (the seven validation dimensions in Article 05 operationalize this)
- Create conditions for deep quality — through user testing, information architecture analysis, and flow assessment (see comprehension testing and documentation usability testing below)
- Recognize the limits of metrics — deep quality can’t be reduced to a dashboard, but it can be enquired into through qualitative methods
The Diátaxis framework helps by preventing disruptions to flow (for example, keeping explanation out of how-to guides) and by aligning documentation types with user needs. But applying Diátaxis doesn’t guarantee deep quality—it lays down conditions for its possibility.
🧪 Comprehension testing methodologies
Readability formulas predict whether text should be understandable based on linguistic features. Comprehension tests measure whether text is actually understood by real readers. They answer a fundamentally different question.
Cloze tests
A cloze test (from “closure” in Gestalt psychology) deletes every Nth word from a passage and asks readers to fill in the blanks. The percentage of correctly restored words indicates comprehension.
How to administer:
- Select a representative passage (250–350 words)
- Delete every 5th word (some researchers use every 7th)
- Replace deleted words with uniform-length blanks
- Ask test subjects to fill in the blanks
- Score: count exact word matches (synonyms don’t count in the standard method)
Score interpretation:
| Cloze score | Comprehension level | Implication |
|---|---|---|
| 60%+ | Independent level | Reader understands without assistance |
| 40–59% | Instructional level | Reader understands with some support |
| Below 40% | Frustration level | Reader can’t understand effectively |
Strengths:
- Well-researched and statistically validated (Taylor, 1953)
- Measures actual comprehension, not predicted readability
- Easy to create and administer
- Works across document types and audiences
Weaknesses:
- Requires actual human participants
- Results depend heavily on passage selection
- Exact-word scoring can undercount comprehension (a reader who writes “use” instead of “utilize” clearly understood the text)
- Doesn’t measure ability to apply knowledge
When to use: Validate that documentation meets audience reading level before publishing; compare comprehension across draft versions; assess whether technical vocabulary creates barriers.
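The administration steps above can be sketched as a small generator and exact-match scorer, using every-5th-word deletion and uniform-length blanks as in the standard method:

```python
def make_cloze(passage: str, n: int = 5) -> tuple[str, list[str]]:
    """Delete every nth word; return the gapped text and the answer key."""
    words = passage.split()
    answers, out = [], []
    for i, w in enumerate(words, start=1):
        if i % n == 0:
            answers.append(w)
            out.append("_____")  # uniform-length blank
        else:
            out.append(w)
    return " ".join(out), answers


def score_cloze(responses: list[str], answers: list[str]) -> float:
    """Exact-match percentage (synonyms don't count in the standard method)."""
    correct = sum(
        r.strip().lower() == a.strip(".,;:!?").lower()
        for r, a in zip(responses, answers)
    )
    return correct / len(answers) * 100
```

In practice you would run this on a 250-350 word passage and map the resulting percentage onto the independent/instructional/frustration levels in the table above.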
Recall and recognition tests
Recall tests measure what readers remember after reading documentation without prompts. Recognition tests present options (like multiple-choice questions) and ask readers to identify correct information.
Free recall protocol:
- Ask readers to read a section of documentation
- Remove the documentation
- Ask readers to write down everything they remember
- Score: count the number of key concepts accurately recalled
Cued recall protocol:
- Ask readers to read documentation
- Provide questions about specific concepts (“What command starts the development server?”)
- Score: accuracy of responses
Recognition protocol:
- Ask readers to read documentation
- Present multiple-choice or true/false questions
- Score: percentage of correct answers
Recall vs. recognition comparison:
| Aspect | Recall | Recognition |
|---|---|---|
| Difficulty | Harder (retrieval from memory) | Easier (matching against options) |
| What it measures | Deep encoding of information | Familiarity with information |
| Use for docs | “Can users remember the steps?” | “Can users identify the right approach?” |
| Practical value | Higher—reflects real-world use | Lower—doesn’t mean users can act on knowledge |
Think-aloud protocols
In a think-aloud protocol, readers verbalize their thoughts while reading documentation. A researcher observes and records where confusion, satisfaction, or frustration occurs.
How to conduct:
- Select 3–5 representative readers from your target audience
- Ask them to read a section while speaking their thoughts aloud
- Record the session (audio or video)
- Code the transcript for comprehension markers:
- Understanding signals: “Oh, so this means…” / “That makes sense because…”
- Confusion signals: “Wait, what?” / “I don’t understand why…” / re-reading
- Inference signals: “I think this means…” (correct or incorrect inferences reveal gaps)
Strengths:
- Reveals where and why comprehension breaks down (not just that it did)
- Identifies assumptions readers bring to the documentation
- Catches problems that no formula can detect (misleading examples, ambiguous instructions, missing context)
Weaknesses:
- Time-intensive (1–2 hours per participant for analysis)
- Small sample sizes (typically 3–5 participants)
- The act of thinking aloud may alter reading behavior
- Requires skilled facilitation to avoid leading participants
When to use: Before publishing high-stakes documentation (onboarding guides, API quickstarts); when readability scores are acceptable but user feedback indicates confusion; when redesigning documentation structure.
Task-based comprehension testing
Task-based testing measures comprehension by asking readers to do something after reading documentation. This is the most authentic test because it mirrors real-world documentation use.
Protocol:
- Define specific tasks that documentation should enable (“Deploy an Azure Function using the CLI”)
- Provide relevant documentation
- Ask participants to complete the task
- Measure: task completion rate, time-on-task, errors, help requests
Metrics:
| Metric | What it measures | Target for good docs |
|---|---|---|
| Task completion rate | Can users succeed? | 80%+ first attempt |
| Time-on-task | How efficiently? | Within 1.5× estimated time |
| Error rate | How accurately? | <2 wrong actions per task |
| Help requests | Is documentation self-sufficient? | <1 per task |
Strengths:
- Directly measures documentation’s practical value
- Reveals gaps between what documentation says and what users need
- Results are concrete and actionable (“Users failed at step 5—that’s where we need to improve”)
Weaknesses:
- Most resource-intensive testing method
- Requires a working environment for task completion
- Hard to isolate documentation quality from tool/product usability
🔍 Information scent and foraging theory
Information foraging theory (Pirolli & Card, 1999) applies ecological foraging models to explain how people search for information. Just as animals follow scent trails to find food, users follow information scent—cues that signal whether a path will lead to useful content.
The foraging model
In the natural world, animals make continuous decisions: keep foraging in this patch, or move to a new one? The decision depends on the rate of return—when a patch becomes depleted, the rational strategy is to move on.
Information foraging applies the same logic to documentation users:
- Information patches = documentation pages, sections, search results
- Information scent = headings, link text, navigation labels, breadcrumbs, summaries
- Foraging decision = continue reading this page or navigate elsewhere
- Rate of return = amount of useful information gained per unit of time invested
Why users abandon documentation
Users leave documentation when information scent is weak:
Weak scent → abandonment:
- Vague headings (“Overview,” “Introduction,” “Getting Started” with no specifics)
- Navigation labels that don’t match user terminology
- Long pages without clear section boundaries
- Search results with misleading snippets
Strong scent → engagement:
- Specific headings that match user queries (“Configure OAuth 2.0 for Azure Functions”)
- Navigation labels using the user’s task language, not the product’s internal terminology
- Summary boxes that preview section content
- Progressive disclosure that rewards scanning
Measuring information scent
Method 1: First-click testing
- Present users with a documentation homepage or navigation
- Ask: “Where would you click to find [specific information]?”
- Measure: percentage of users who click the correct link first
A first-click success rate below 50% indicates weak information scent. Research shows that users who click correctly on the first try succeed at their overall task 87% of the time, compared to 46% for those who click incorrectly first.
Method 2: Navigation path analysis
- Track the sequence of pages users visit before finding target information
- Measure: number of pages visited, backtracking frequency, time to target
- Optimal: users reach information in 2–3 clicks with no backtracking
Method 3: Heading prediction
- Show users a heading and ask: “What content would you expect under this heading?”
- Compare predictions to actual content
- Misalignment indicates heading doesn’t carry accurate scent
Improving information scent in documentation
| Problem | Solution | Example |
|---|---|---|
| Generic headings | Use specific, task-oriented headings | “Getting Started” → “Deploy your first function in 5 minutes” |
| Ambiguous navigation | Match user vocabulary, not internal naming | “Resources” → “API reference” |
| Long pages without landmarks | Add summary boxes, anchor links, visual breaks | TL;DR boxes at section start |
| Search result snippets | Write informative descriptions for each page | Meta descriptions in YAML frontmatter |
🧠 Mental model alignment
A mental model is the internal representation a person holds about how something works. Documentation succeeds when its conceptual structure aligns with the reader’s mental model. It fails when it forces readers to build a new mental model just to navigate the docs.
What mental model alignment looks like
Aligned:

> A developer expects “authentication” to involve tokens and API keys. The authentication section covers exactly that, organized by authentication method.

Misaligned:

> The same developer looks for “authentication” but finds it under “Security” → “Identity” → “Credential Management.” The path doesn’t match how they think about the concept.
Measuring alignment
Card sorting:
- Write each documentation topic on a card
- Ask users to group cards into categories that make sense to them
- Compare user-generated categories to your documentation’s actual organization
- Overlap percentage indicates alignment
- Open card sort: Users create their own category labels (reveals how they think)
- Closed card sort: Users sort cards into predefined categories (tests your structure)
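One hedged way to turn open-sort results into an overlap number is a best-match Jaccard comparison between each user group and the documentation's actual sections. The group names below are illustrative, and real card-sort tools aggregate similarity matrices across many participants rather than scoring one sort in isolation:

```python
def category_overlap(
    user_groups: dict[str, set[str]],
    doc_groups: dict[str, set[str]],
) -> float:
    """Average best-match Jaccard similarity between one user's card
    groups and the documentation's sections (0.0 = no alignment,
    1.0 = perfect alignment). A crude single-participant score."""
    def jaccard(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    scores = [
        max(jaccard(user_set, doc_set) for doc_set in doc_groups.values())
        for user_set in user_groups.values()
    ]
    return sum(scores) / len(scores)
```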
Tree testing:
- Present your documentation’s hierarchy as a text-only tree (no visual design)
- Ask users to find specific information by navigating the tree
- Measure: success rate, directness (no backtracking), time to completion
Tree testing removes visual design influence—if users can’t find information in the tree, the problem lies with the information architecture, not the page design.
Concept mapping:
- Ask users to draw how they think the documented system works
- Compare their concept maps to the system’s actual architecture
- Identify gaps between user understanding and reality
If documentation is effective, concept maps drawn after reading should more closely match reality than maps drawn before reading.
Common alignment failures
| Failure | Symptom | Fix |
|---|---|---|
| Org-chart structure | Docs mirror internal team structure, not user needs | Reorganize around user tasks and concepts |
| Expert blind spot | Docs assume knowledge the audience doesn’t have | User test with representative beginners |
| Feature-centric | Docs organized by features, not by user goals | Add task-based navigation alongside feature reference |
| Jargon mismatch | Docs use internal terms users don’t recognize | Conduct vocabulary alignment testing with users |
📋 Documentation usability testing
Usability testing for documentation applies the same principles as software usability testing: observe real users attempting real tasks, measure outcomes, fix what’s broken.
Planning a documentation usability test
Define objectives:
- Which documentation sections are you testing?
- What tasks should readers be able to complete?
- What success metrics matter most?
Recruit participants:
- 5 participants detect ~85% of usability issues (Nielsen & Landauer, 1993)
- Match participants to your actual audience (developers testing developer docs)
- Include a range of experience levels if your documentation serves multiple audiences
Design tasks:
- Base tasks on real user goals, not documentation structure
- Include both “find information” and “accomplish a task” scenarios
- Order tasks from simple to complex
Core metrics
| Metric | Definition | How to measure | Benchmark |
|---|---|---|---|
| Task success rate | Percentage of participants who complete the task | Binary: completed or not | 78%+ (Sauro & Lewis, 2016) |
| Time-on-task | Time from task start to successful completion | Stopwatch or screen recording | Within 2× expert time |
| Error rate | Number of wrong actions per task | Observer counts deviations from optimal path | <3 per task |
| Satisfaction (SUS) | System Usability Scale score (0–100) | Post-test questionnaire | 68+ (above average) |
| Findability | Time to first correct navigation choice | Screen recording analysis | <30 seconds for top-level nav |
The System Usability Scale (SUS) for documentation
The System Usability Scale (Brooke, 1996) is a 10-item questionnaire that produces a single usability score. Although designed for software, it adapts well to documentation:
Adapted SUS questions for documentation:
- I think I would use this documentation frequently
- I found the documentation unnecessarily complex
- I thought the documentation was easy to use
- I think I would need support from a person to use this documentation
- I found the various sections were well integrated
- I thought there was too much inconsistency in this documentation
- I imagine most people would learn to navigate this documentation quickly
- I found the documentation very cumbersome to navigate
- I felt confident finding information in this documentation
- I needed to learn a lot before I could get going with this documentation
Scoring: SUS scores range from 0 to 100. A score of 68 is average. Above 80 is good. Above 90 is exceptional.
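Brooke's standard scoring procedure (odd items contribute the response minus 1, even items contribute 5 minus the response, and the total is multiplied by 2.5) can be scripted directly:

```python
def sus_score(responses: list[int]) -> float:
    """Standard SUS scoring for the 10 items above.

    responses: ten 1-5 Likert ratings in questionnaire order.
    Odd-numbered items are positively worded, even-numbered items
    negatively worded, which the alternating contributions reflect.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based: even index = odd item
        for i, r in enumerate(responses)
    )
    return total * 2.5
```

Note that a neutral response (3) on every item yields exactly 50, below the 68-point average, which is why SUS results should be compared against the published benchmarks rather than the scale midpoint.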
Lightweight alternatives
Full usability testing isn’t always feasible. These lighter methods still provide actionable data:
Five-second test:
- Show a documentation page for 5 seconds
- Ask: “What is this page about?” and “What can you do from here?”
- Measures: first impression clarity, information scent strength
Highlighter test:
- Give readers a printed documentation page and two highlighters
- Green = “I understand this clearly”; Pink = “I’m confused by this”
- Collect and stack pages—confusion clusters become visually obvious
Heuristic evaluation: Apply Jakob Nielsen’s 10 usability heuristics to documentation:
- Visibility of system status → “Does the TOC show where the reader is?”
- Match between system and real world → “Does the documentation use the user’s language?”
- User control and freedom → “Can readers easily navigate back?”
- Consistency and standards → “Do similar sections follow the same structure?”
📏 Quantitative benchmarks by content type
Different Diátaxis documentation types serve different purposes, and their readability targets should differ accordingly. The following benchmarks combine readability research with the content type characteristics described in Article 00.
Benchmarks table
| Metric | Tutorial | How-to guide | Reference | Explanation |
|---|---|---|---|---|
| Flesch Reading Ease | 60–70 | 55–65 | 45–60 | 50–65 |
| FK Grade Level | 8–9 | 9–10 | 10–12 | 9–11 |
| Gunning Fog | 8–10 | 9–11 | 10–14 | 9–12 |
| Coleman-Liau | 9–11 | 10–12 | 11–14 | 10–13 |
| SMOG | 9–11 | 10–12 | 11–14 | 10–13 |
| Dale-Chall | 6.5–7.5 | 7.0–8.0 | 7.5–9.0 | 7.0–8.5 |
| ARI | 9–11 | 10–12 | 11–14 | 10–13 |
| Avg. sentence length | 14–20 words | 15–22 words | 12–20 words | 16–24 words |
| Cloze score | 55%+ | 50%+ | 45%+ | 50%+ |
| Task success rate | 85%+ | 80%+ | N/A (lookup, not task) | N/A (understanding, not task) |
Rationale by content type
Tutorials target the easiest readability because they serve newcomers who are learning. Shorter sentences, simpler vocabulary, and higher cloze scores reflect the need for maximum clarity. Task success rates should be high because tutorials control the environment.
How-to guides allow slightly more complexity because they assume prior knowledge. They’re task-oriented, so task success rate remains a key metric, but the tolerance for technical vocabulary increases.
Reference documentation can have the highest complexity because its audience actively seeks specific information. Dense, precise descriptions are acceptable—but sentence length should remain short because reference is consumed in fragments, not read linearly.
Explanation falls in the middle. It discusses concepts and builds understanding, requiring enough complexity to cover nuance without becoming impenetrable. Longer sentences are acceptable because readers engage in sustained reading, but vocabulary should remain accessible.
Using benchmarks effectively
Don’t: enforce benchmark targets rigidly. A reference page with a Flesch score of 43 isn’t automatically bad—it might be accurately describing complex API behavior.
Do: use benchmarks as investigation triggers. A tutorial with a Flesch score of 40 deserves a second look—is the vocabulary unnecessarily complex for newcomers? Can sentences be shortened without losing meaning?
Do: track trends over time. A series of articles that progressively drift toward lower readability scores may indicate creeping complexity.
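To treat benchmarks as investigation triggers rather than gates, a validation script can emit flags instead of pass/fail verdicts. The ranges below copy only the Flesch Reading Ease and FK Grade rows from the table above for brevity; extending the dictionary to the other formulas is mechanical.

```python
# Benchmark ranges from this article's table (two metrics shown).
BENCHMARKS = {
    "tutorial":    {"flesch": (60, 70), "fk_grade": (8, 9)},
    "how-to":      {"flesch": (55, 65), "fk_grade": (9, 10)},
    "reference":   {"flesch": (45, 60), "fk_grade": (10, 12)},
    "explanation": {"flesch": (50, 65), "fk_grade": (9, 11)},
}


def flag_outliers(content_type: str, scores: dict[str, float]) -> list[str]:
    """Return human-readable investigation triggers, not verdicts."""
    flags = []
    for metric, (lo, hi) in BENCHMARKS[content_type].items():
        value = scores.get(metric)
        if value is not None and not (lo <= value <= hi):
            flags.append(f"{metric}={value} outside {lo}-{hi} for {content_type}")
    return flags
```

A flagged page then gets a human second look, consistent with the "investigate, don't enforce" guidance above.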
🛠️ Tools comparison
Multiple tools can measure readability, enforce style rules, and support comprehension assessment. Here’s how they compare for technical documentation workflows.
Readability measurement tools
| Tool | Type | Formulas supported | Best for | Cost |
|---|---|---|---|---|
| textstat | Python library | Flesch, FK, Gunning Fog, Coleman-Liau, SMOG, Dale-Chall, ARI, Linsear Write | CI/CD pipeline integration, batch analysis | Free (open-source) |
| readable.com | Web service | All major formulas + proprietary metrics | One-off analysis, content marketing teams | Paid (free trial) |
| Hemingway Editor | Web/desktop app | Custom (grade level) | Real-time writing feedback, sentence simplification | Free (web), paid (desktop) |
| readability-cli | Node.js CLI | Flesch, FK, Coleman-Liau, ARI, SMOG, Dale-Chall | Command-line workflows, quick checks | Free (open-source) |
Prose linting tools
| Tool | Type | What it checks | Best for | Cost |
|---|---|---|---|---|
| Vale | CLI linter | Style, terminology, jargon, consistency (configurable rules) | Enforcing style guides at scale, CI integration | Free (open-source) |
| Vale + Microsoft style | Vale package | Microsoft Writing Style Guide compliance | Microsoft-aligned documentation | Free (open-source) |
| alex | CLI linter | Insensitive, inconsiderate language | Inclusive language enforcement | Free (open-source) |
| write-good | CLI linter | Passive voice, weasel words, adverbs | Quick writing quality checks | Free (open-source) |
Comprehensive comparison: textstat vs. Vale vs. Hemingway
| Dimension | textstat | Vale | Hemingway |
|---|---|---|---|
| Readability formulas | All 7+ (programmatic) | None built-in (different purpose) | Grade-level equivalent |
| Style enforcement | None | Extensive (custom rules) | Basic (adverbs, passive voice, complexity) |
| CI/CD integration | Excellent (Python library) | Excellent (CLI, GitHub Actions) | None (manual only) |
| Custom rules | Write Python code | YAML rule definitions | None |
| Terminology checking | None | Built-in (substitution, existence, occurrence rules) | None |
| Learning curve | Low (Python API) | Medium (rule configuration) | Very low (visual interface) |
| Batch processing | Yes (scripted) | Yes (glob patterns) | No |
Recommendation for this repository:
- Primary readability tool: textstat in Python scripts for automated validation
- Primary style tool: Vale with Microsoft style package for consistent terminology
- Quick checks during writing: Hemingway Editor for real-time sentence simplification
- CI integration: Vale + textstat in GitHub Actions for pre-merge validation
📌 Applying readability measurement to this repository
Current coverage
This repository currently measures readability through:
- Flesch Reading Ease targets (50–70) defined in validation criteria
- Flesch-Kincaid Grade Level targets (9–10) for general benchmarking
- Readability review prompt (readability-review.prompt.md) for AI-assisted review
- Sentence length guidelines (15–25 words) in article-writing.instructions.md
Opportunities for expanded measurement
Short-term (addable with minimal effort):
- Add Coleman-Liau and SMOG scores to the readability review prompt for cross-validation
- Include Dale-Chall analysis when reviewing vocabulary accessibility
- Create content-type benchmarks (the table in this article can serve as a starting reference)
Medium-term (requires tooling):
- Integrate textstat into a Python-based validation script
- Configure Vale with Microsoft style rules for terminology consistency
- Add readability scoring to CI pipeline for pull request validation
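A Python-based validation script of the kind described above can be sketched without external dependencies. The following is a minimal, dependency-free approximation: it computes Flesch Reading Ease with a heuristic vowel-group syllable counter (approximate; textstat's counters are more accurate, so prefer the library in practice) and gates files against the 50–70 band from this repository's validation criteria. The function and file names are illustrative:

```python
import re
import sys

def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(1, count)

def flesch_reading_ease(text: str) -> float:
    """Flesch RE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def check_targets(text: str, low: float = 50.0, high: float = 70.0) -> bool:
    """Gate a document against the repository's Flesch target band (50-70)."""
    return low <= flesch_reading_ease(text) <= high

if __name__ == "__main__":
    # Usage: python check_readability.py article.md [more files...]
    failed = [path for path in sys.argv[1:]
              if not check_targets(open(path, encoding="utf-8").read())]
    sys.exit(1 if failed else 0)
```

In CI, the nonzero exit code fails the pull request check; swapping the hand-rolled `flesch_reading_ease` for `textstat.flesch_reading_ease` keeps the same gate while improving accuracy.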
Long-term (requires user testing):
- Conduct cloze testing on representative articles in each Diátaxis type
- Run first-click testing on the repository’s navigation structure
- Perform tree testing on the table of contents hierarchy
- Track documentation usability metrics over time as part of quality dashboards
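The cloze testing mentioned in the first long-term item can be prototyped in a few lines. This sketch follows Taylor's classic procedure (delete every fifth word, have readers fill in the blanks) with exact-match scoring; a score of 0.60 or higher is the commonly cited threshold for independent comprehension. The function names are illustrative:

```python
def _norm(word: str) -> str:
    """Normalize a token for comparison: case-fold and strip edge punctuation."""
    return word.strip().lower().strip(".,;:!?")

def make_cloze(text: str, nth: int = 5) -> tuple[str, list[str]]:
    """Blank out every nth word (Taylor's 1953 procedure) and keep the answers."""
    out, answers = [], []
    for i, word in enumerate(text.split(), start=1):
        if i % nth == 0:
            answers.append(word)
            out.append("_____")
        else:
            out.append(word)
    return " ".join(out), answers

def score_cloze(responses: list[str], answers: list[str]) -> float:
    """Fraction of exact matches; >= 0.60 commonly indicates independent comprehension."""
    if not answers:
        return 0.0
    correct = sum(_norm(r) == _norm(a) for r, a in zip(responses, answers))
    return correct / len(answers)
```

Running `make_cloze` over a representative paragraph from each Diátaxis type yields comparable comprehension scores across tutorials, how-to guides, reference, and explanation.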
Practical implementation: textstat example
```python
# Install: pip install textstat
import textstat

def analyze_readability(text: str) -> dict:
    """Calculate comprehensive readability metrics for documentation."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "gunning_fog": textstat.gunning_fog(text),
        "coleman_liau": textstat.coleman_liau_index(text),
        "smog": textstat.smog_index(text),
        "dale_chall": textstat.dale_chall_readability_score(text),
        "ari": textstat.automated_readability_index(text),
        "avg_sentence_length": textstat.avg_sentence_length(text),
    }

# Example usage:
# with open("article.md", "r") as f:
#     text = f.read()
# scores = analyze_readability(text)
# print(f"Flesch RE: {scores['flesch_reading_ease']:.1f}")
# print(f"SMOG: {scores['smog']:.1f}")
# print(f"Dale-Chall: {scores['dale_chall']:.1f}")
```
✅ Conclusion
Measuring documentation quality is a multi-dimensional challenge. Readability formulas provide a necessary but insufficient foundation—they measure linguistic surface features but can’t tell you whether readers understand, can act on, or enjoy your documentation.
Key takeaways
- No single formula suffices — Use multiple readability formulas (Flesch, SMOG, Dale-Chall, Coleman-Liau) to get a reliable picture; each measures different linguistic features
- Functional quality is necessary but not sufficient — The Diátaxis framework distinguishes between measurable functional quality (accuracy, completeness, consistency) and subjective deep quality (flow, anticipation, beauty)
- Comprehension testing reveals what formulas can’t — Cloze tests, recall tests, think-aloud protocols, and task-based testing measure actual understanding, not predicted readability
- Information scent drives navigation success — Users follow cues (headings, link text, navigation labels) like foraging animals follow scent; weak scent causes abandonment
- Mental models shape comprehension — Documentation succeeds when its structure matches how readers think about the topic; card sorting and tree testing reveal alignment
- Usability metrics ground quality in evidence — Task completion rates, time-on-task, error rates, and SUS scores provide objective evidence for documentation effectiveness
- Benchmarks should vary by content type — Tutorials need higher readability than reference; how-to guides need higher task success rates than explanations
Next steps
- Previous article: 08-consistency-standards-and-enforcement.md — Consistency enforcement that builds on measurable quality standards
- Related: 01-writing-style-and-voice-principles.md — Foundational readability formulas (Flesch, FK Grade, Gunning Fog) that this article extends
- Related: 05-validation-and-quality-assurance.md — The validation framework that these measurement approaches support
- Related: 00-foundations-of-technical-documentation.md — Diátaxis framework and quality characteristics referenced throughout
📚 References
Readability formulas and research
Flesch-Kincaid Readability Tests - Wikipedia 📘 [Official]
Technical explanation of Flesch Reading Ease and Flesch-Kincaid Grade Level formulas, with scoring interpretation and history.
Coleman-Liau Index - Wikipedia 📘 [Official]
Description of the character-based readability formula designed for machine scoring, including the original 1975 formula and worked example.
SMOG - Wikipedia 📘 [Official]
Overview of the Simple Measure of Gobbledygook formula, its 0.985 correlation with comprehension tests, and its recommendation for health communication materials.
Dale-Chall Readability Formula - Wikipedia 📘 [Official]
Description of the vocabulary-based readability formula using a 3,000-word familiar word list, with formula, scoring table, and history.
Automated Readability Index - Wikipedia 📘 [Official]
The character-and-word-count formula designed for real-time readability monitoring without syllable analysis.
Diátaxis framework and quality
Towards a Theory of Quality in Documentation - Diátaxis 📗 [Verified Community]
Daniele Procida’s exploration of functional quality vs. deep quality in documentation. Distinguishes objectively measurable standards from subjective excellence. Essential reading for understanding why metrics alone don’t guarantee quality.
Diátaxis - A Systematic Approach to Technical Documentation 📗 [Verified Community]
The overarching framework for documentation types (tutorials, how-to guides, reference, explanation) that this series uses as its organizational foundation.
Comprehension testing and usability
Cloze Procedure - Wikipedia 📘 [Official]
Background on the cloze test methodology originally developed by Wilson Taylor (1953), including administration protocols and scoring interpretation.
Information Foraging Theory - Wikipedia 📘 [Official]
Overview of Pirolli and Card’s information foraging theory (1999) that models how users search for information using scent-following behavior borrowed from ecological foraging models.
Plain Language Guidelines - Federal Plain Language 📘 [Official]
US government standards for clear, accessible writing. Includes guidance on readability, audience analysis, and comprehension testing for public-facing content.
System Usability Scale - Wikipedia 📘 [Official]
Overview of Brooke’s (1996) SUS questionnaire methodology, scoring, and interpretation benchmarks. SUS produces a single composite score from ten Likert-scale items.
Tools and implementation
textstat - Python Library 📗 [Verified Community]
Open-source Python library implementing all major readability formulas (Flesch, FK, Gunning Fog, Coleman-Liau, SMOG, Dale-Chall, ARI, and more). Ideal for CI/CD pipeline integration and batch analysis.
Vale - Prose Linter 📗 [Verified Community]
Open-source prose linter supporting configurable style rules, terminology enforcement, and integration with Microsoft, Google, and custom style guides. The primary tool for automated documentation quality enforcement.
Hemingway Editor 📒 [Community]
Visual writing tool that highlights complex sentences, passive voice, and adverb overuse. Provides grade-level readability scoring. Best for real-time writing feedback during drafting.
readable.com 📒 [Community]
Web-based readability analysis service supporting all major formulas plus proprietary engagement metrics. Useful for one-off analysis and non-technical teams.
Repository-specific documentation
Validation Criteria [Internal Reference]
This repository’s seven validation dimensions with scoring thresholds, including Flesch targets (50–70) and grade-level standards (9–10).
Article Writing Instructions [Internal Reference]
Comprehensive writing guidance including readability targets, sentence length standards (15–25 words), and validation workflows.
Readability Review Prompt [Internal Reference]
AI-assisted validation prompt for analyzing Flesch scores, grade level, and suggesting readability improvements.