Interpreting Lighthouse Accessibility Scores
A Lighthouse accessibility score is a single number between 0 and 100, and that compression is exactly why it gets misread. Teams celebrate a 100 as "we're accessible" and panic at an 88 without knowing whether the gap is one stray icon button or a systemic contrast failure. This guide explains precisely how the number is computed, why the weighting matters for triage, where pass/fail automated audits end and Lighthouse's own manual audit list begins, and what a paired manual checklist must cover so the score becomes a useful signal instead of a misleading trophy. It builds on the broader guide, Accessibility Audits with Lighthouse.
Mapped WCAG 2.1/2.2 Success Criteria:
- 1.4.3 Contrast (Minimum): The color-contrast audit is one of the most heavily weighted contributors to the score.
- 4.1.2 Name, Role, Value: Accessible-name audits (button-name, link-name, image-alt) carry significant weight.
- 2.4.3 Focus Order: Listed as a manual audit; it never affects the number.
- 2.1.1 Keyboard: Also manual; keyboard operability is invisible to the score.
Prerequisites
Before interpreting a score, generate a JSON report so you can read the underlying data rather than the rounded headline. Run an accessibility-only audit and keep the raw output:
# Produce a JSON report you can inspect for weights and node counts
npx lighthouse https://staging.example.com \
--only-categories=accessibility \
--output=json --output-path=./lh-a11y.json \
--chrome-flags="--headless=new"
You should also be comfortable with the idea that Lighthouse's Accessibility category runs axe-core under the hood—so every score is ultimately an axe result wearing a Lighthouse hat.
How the Score Is Computed
The score is a weighted average of automated audits, where each audit is a binary pass (1) or fail (0). There is no partial credit within an audit. The arithmetic is:
score = Σ(weightᵢ × passᵢ) / Σ(weightᵢ) → scaled to 0–100
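To make the arithmetic concrete, here is a minimal sketch that applies the formula to a hand-written table. The five weights and pass flags are invented for illustration and do not come from a real report, weighted_score is a hypothetical helper name, and the sketch assumes bash and awk are available.

```shell
#!/usr/bin/env bash
# Hypothetical audit table: weight, pass flag (1 = pass, 0 = fail), audit id.
# The weights are invented for illustration; real ones come from your report.
audits='10 1 image-alt
10 0 button-name
7 1 color-contrast
7 1 html-has-lang
3 1 heading-order'

# weighted_score: sum(weight * pass) / sum(weight), scaled to 0-100.
weighted_score() {
  awk '{ num += $1 * $2; den += $1 }
       END { if (den > 0) printf "%.0f\n", 100 * num / den; else print "n/a" }'
}

printf '%s\n' "$audits" | weighted_score   # prints 73
```

With these made-up numbers, the one failing weight-10 audit removes 10 of 37 weight units, so the result is 73 rather than 100.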
Every audit in the Accessibility category has a weight defined in categories.accessibility.auditRefs. Read it straight from your report:
# List each audit ref with its weight and group, sorted by weight, highest first
# (assumes the jq CLI is installed and on your PATH)
jq -r '
  .categories.accessibility.auditRefs
  | sort_by(-.weight)[]
  | "\(.weight)\t\(.id)\t\(.group // "-")"' lh-a11y.json | head -20
# Cross-reference each ref id with the audit result to see pass/fail (score 1 or 0)
jq -r '
  .categories.accessibility.auditRefs[] as $r
  | (.audits[$r.id]) as $a
  | select($a.score != null)
  | "\($r.weight)\t\($a.score)\t\($r.id)\t\($a.title)"' lh-a11y.json | sort -rn
Three facts fall out of this formula, and each one shapes how you should react to a given number:
- Audits are all-or-nothing. One <img> missing alt fails the entire image-alt audit. A page with 99 perfectly labeled images and one bare one loses the full image-alt weight, not one percent of it.
- Weights are uneven, so the score is a triage signal, not an effort meter. Restoring one heavily weighted audit (color-contrast) can lift the score more than fixing several light ones. That makes the weighting genuinely useful for prioritization.
- The denominator is automated audits only. Manual audits and not-applicable audits are excluded entirely, which is the root reason 100 ≠ compliant.
Testing Hook: Sort your failing audits by weight before touching code. The highest-weight failure is both the biggest score gain and, usually, the broadest user-facing defect.
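The cost of any single failure follows directly from the formula: it is the audit's weight divided by the total applicable weight, regardless of how many nodes failed. The sketch below assumes a hypothetical total weight of 200; audit_pass and points_lost are illustrative helper names, not part of Lighthouse.

```shell
#!/usr/bin/env bash
# audit_pass FAILING_NODES: an audit passes (1) only when nothing fails;
# one failing node scores the same as a hundred.
audit_pass() {
  if [ "$1" -eq 0 ]; then echo 1; else echo 0; fi
}

# points_lost WEIGHT TOTAL_WEIGHT FAILING_NODES: points (out of 100) the
# category score forfeits for this audit.
points_lost() {
  local pass
  pass=$(audit_pass "$3")
  awk -v w="$1" -v total="$2" -v p="$pass" \
    'BEGIN { printf "%.1f\n", 100 * w * (1 - p) / total }'
}

# Hypothetical numbers: total applicable weight of 200.
points_lost 10 200 1    # one bare <img> in a weight-10 audit: prints 5.0
points_lost 10 200 99   # ninety-nine bare images, same cost:  prints 5.0
points_lost 3 200 1     # a lighter audit costs less:          prints 1.5
points_lost 10 200 0    # a passing audit costs nothing:       prints 0.0
```

Running this for each failing audit gives a rough ranking of which fix recovers the most points, which is the same ordering you get by sorting on weight.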
Pass/Fail Audits vs. Lighthouse's Manual Audits
Lighthouse splits its accessibility results into four buckets, and only the two automated ones (passed and failed) feed the number:
- Passed audits: automated checks that succeeded. Their weight counts in both the numerator and the denominator.
- Failed audits: automated checks that found a provable defect. Their weight counts in the denominator only, which is what lowers your score.
- Not applicable: audits with no relevant elements on the page (no tables, so td-headers-attr is N/A). Excluded from the score.
- Additional items to manually check: the manual audits. These never affect the score at all.
That last bucket is the crux of interpretation. Lighthouse surfaces manual audits like keyboard operability (2.1.1 Keyboard), logical focus order (2.4.3 Focus Order), and meaningful interactive controls because it knows a static engine cannot verify them. A page can score 100 while being completely unusable by keyboard, because every manual concern sits outside the calculation. The standalone axe-core workflow draws the same line—automation proves the provable and stays silent on the rest.
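One way to see the buckets in your own report is to tally audits by their scoreDisplayMode. The sketch below runs against a tiny hand-made stand-in for lh-a11y.json (four invented entries) so that it is self-contained. bucket_counts is a hypothetical helper, it assumes the jq CLI is installed, and it lumps any result that is not manual, not applicable, passed, or failed into "other".

```shell
#!/usr/bin/env bash
# Hand-made stand-in for lh-a11y.json: four invented audits, one per bucket.
report=$(mktemp)
cat > "$report" <<'JSON'
{
  "categories": { "accessibility": { "auditRefs": [
    { "id": "image-alt",         "weight": 10 },
    { "id": "button-name",       "weight": 10 },
    { "id": "td-headers-attr",   "weight": 0 },
    { "id": "logical-tab-order", "weight": 0 }
  ] } },
  "audits": {
    "image-alt":         { "score": 1,    "scoreDisplayMode": "binary" },
    "button-name":       { "score": 0,    "scoreDisplayMode": "binary" },
    "td-headers-attr":   { "score": null, "scoreDisplayMode": "notApplicable" },
    "logical-tab-order": { "score": null, "scoreDisplayMode": "manual" }
  }
}
JSON

# bucket_counts FILE: print "bucket<TAB>count" for the accessibility category.
bucket_counts() {
  jq -r '
    .audits as $audits
    | [ .categories.accessibility.auditRefs[]
        | $audits[.id] as $a
        | if   $a.scoreDisplayMode == "manual"        then "manual"
          elif $a.scoreDisplayMode == "notApplicable" then "not-applicable"
          elif $a.score == 1                          then "passed"
          elif $a.score == 0                          then "failed"
          else "other" end ]
    | group_by(.)[]
    | "\(.[0])\t\(length)"' "$1"
}

bucket_counts "$report"
```

Pointing bucket_counts at a real report instead of the stand-in shows how many audits sit in the manual bucket, outside the calculation.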
Testing Hook: After reading the score, scroll to the manual audit list and treat every item as an open question, not a passed check. The score's silence on these is not approval.
Common High-Weight Failures
When a score drops, a handful of heavily weighted audits are usually responsible. Recognizing them lets you predict the fix before opening the node list:
- color-contrast (1.4.3 Contrast (Minimum)): the single most common score-killer. Foreground/background pairs below 4.5:1 (or 3:1 for large text). The report gives you the exact hex pair and failing ratio. A theme token regression can fail this across hundreds of nodes at once.
- button-name / link-name (4.1.2 Name, Role, Value): icon-only buttons and links with no text and no aria-label. Common in toolbars, card actions, and pagination.
- image-alt (1.1.1 Non-text Content): images with no alt. Decorative images need alt="", not a missing attribute.
- label (1.3.1, 4.1.2): form inputs with no associated <label> or accessible name.
- heading-order (1.3.1 Info and Relationships): skipped heading levels (an <h2> followed directly by an <h4>).
- html-has-lang (3.1.1 Language of Page): a missing lang attribute on <html>; cheap to fix, surprisingly common in SSR shells.
# Surface the highest-weight failing audits with their offending node counts
# (assumes the jq CLI is installed and on your PATH)
jq -r '
  .categories.accessibility.auditRefs[] as $r
  | (.audits[$r.id]) as $a
  | select($a.score != null and $a.score < 1)
  | "\($r.weight)\t\($a.id)\t\(($a.details.items // []) | length) nodes"' lh-a11y.json | sort -rn
Testing Hook: A color-contrast failure with a high node count almost always traces to a single design token, not many separate bugs. Fix the token and re-audit before chasing individual nodes.
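When checking a candidate token value before re-auditing, the WCAG 2.x contrast ratio can be computed directly from a hex pair. This is a sketch of the published relative-luminance formula, not the code that Lighthouse or axe-core runs. hex_to_rgb and contrast_ratio are hypothetical helpers, and the sketch assumes bash, awk, and six-digit hex colours with no alpha channel.

```shell
#!/usr/bin/env bash
# hex_to_rgb HEX: print the three channels of a 6-digit hex colour as decimals.
hex_to_rgb() {
  local hex=${1#\#}
  echo "$((16#${hex:0:2})) $((16#${hex:2:2})) $((16#${hex:4:2}))"
}

# contrast_ratio FG BG: WCAG 2.x contrast ratio, printed to two decimals.
contrast_ratio() {
  echo "$(hex_to_rgb "$1") $(hex_to_rgb "$2")" | awk '
    # Linearise one sRGB channel (0-255) per the relative-luminance definition.
    function lin(c) {
      c /= 255
      return (c <= 0.03928) ? c / 12.92 : ((c + 0.055) / 1.055) ^ 2.4
    }
    function lum(r, g, b) {
      return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b)
    }
    {
      a = lum($1, $2, $3); b = lum($4, $5, $6)
      hi = (a > b) ? a : b; lo = (a > b) ? b : a
      printf "%.2f\n", (hi + 0.05) / (lo + 0.05)
    }'
}

contrast_ratio '#000000' '#ffffff'   # prints 21.00
contrast_ratio '#767676' '#ffffff'   # prints 4.54 (clears 4.5:1)
contrast_ratio '#777777' '#ffffff'   # prints 4.48 (just below 4.5:1)
```

The last two lines show how narrow the margin can be: one step of grey separates a passing token from a failing one, which is why a small theme change can flip the whole audit.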
A Manual Checklist to Pair with the Score
The score covers the automated layer; this checklist covers the layer it cannot reach. Run it on every key flow regardless of how green the number is:
- Keyboard reach (2.1.1): Can you reach and operate every interactive control with Tab, Enter, Space, and arrow keys? No mouse.
- No keyboard trap (2.1.2): Can focus always leave a component (modal, menu, embedded widget)?
- Focus order (2.4.3): Does the tab sequence match the visual/reading order?
- Visible focus (2.4.7): Is the focus indicator always visible and high-contrast?
- Screen-reader meaning (1.3.1, 4.1.2): Do controls announce a useful name and role, not just a technically present one? "Button" and "Click here" pass automation but fail users.
- Status messages (4.1.3): Do dynamic updates (toasts, form errors, loading states) announce via live regions at the right moment?
- Zoom and reflow (1.4.10): Does content reflow without horizontal scrolling at 400% zoom?
For driving the interaction-heavy items in automation rather than by hand, layer in Playwright-based end-to-end accessibility tests.
How to Verify
Verify your interpretation with both a tool reading and a manual pass:
Tool check. Confirm the score's composition matches your understanding. The score in categories.accessibility.score (0–1) should equal the weighted pass ratio you can reconstruct from auditRefs:
# Reconstruct the score from weights and pass/fail, then compare to the reported value.
# The audit results are bound to $audits first, because inside map() the input is a
# single auditRef and has no .audits key of its own.
jq '
  .audits as $audits
  | [ .categories.accessibility.auditRefs[] | select(.weight > 0) ]
  | { num: (map(.weight * ($audits[.id].score // 0)) | add),
      den: (map(.weight) | add) }
  | .ratio = (.num / .den)' lh-a11y.json
# Reported value for comparison:
jq '.categories.accessibility.score' lh-a11y.json
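If the reconstructed ratio (num divided by den) matches the reported score to two decimal places, you have read the weighting correctly. Expect small differences beyond that, because the reported category score is rounded.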
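The comparison itself can be scripted so that it tolerates rounding. The sketch below is one possible check, assuming the reported score is rounded to two decimals; scores_agree is a hypothetical helper, and the numbers passed to it are invented, not taken from a real report.

```shell
#!/usr/bin/env bash
# scores_agree NUM DEN REPORTED: exit 0 when NUM/DEN is within rounding
# distance (0.005) of the reported 0-1 category score, otherwise exit 1.
scores_agree() {
  awk -v n="$1" -v d="$2" -v r="$3" 'BEGIN {
    if (d <= 0) exit 1           # no applicable weight, nothing to compare
    diff = n / d - r
    if (diff < 0) diff = -diff
    exit (diff <= 0.005) ? 0 : 1
  }'
}

# Hypothetical values: 27 of 37 weight units passed, report says 0.73.
if scores_agree 27 37 0.73; then
  echo "weights read correctly"
else
  echo "mismatch: recheck which audits are applicable"
fi
```

A mismatch usually means the reconstruction included audits the report treats as manual or not applicable, so that is the first thing to recheck.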
Manual check. Take any page scoring 100 and run the manual checklist above with the keyboard only, then with NVDA (Windows) or VoiceOver (macOS). If you find an unreachable control or a meaningless announcement on a "perfect" page, you have proven the central point: the number measures the automated subset, and conformance lives beyond it.
Common a11y Mistakes
- Reading 100 as "compliant." It means automated audits passed. Manual audits—keyboard, focus order, screen-reader meaning—are excluded from the number by design.
- Treating the score as linear. A 7-point drop from one heavy audit is not "7% broken." Binary audits and uneven weights make the score a triage signal, not a defect percentage.
- Ignoring the manual audit list. It sits right under the score and contains the checks most likely to break real users.
- Chasing node counts instead of root causes. A 200-node contrast failure is usually one token; fix the source, not each node.
- Comparing scores across different DOM states. A pre-hydration audit and a post-hydration audit of an SSR app are not comparable; pin the state.
Conclusion
A Lighthouse accessibility score is a weighted, binary, automated-only signal. Read correctly—sorted by weight, with the manual audit list treated as open work and a keyboard/screen-reader pass run on every flow—it is a sharp triage tool. Read naively, it is a trophy that hides exactly the failures users feel most. Interpret the number, don't worship it, and pair every score with the manual checklist that the formula can never include.
Frequently Asked Questions
Why isn't a Lighthouse accessibility score of 100 the same as WCAG compliance?
Because the score only averages automated audits. WCAG conformance also requires success criteria that no static engine can verify: keyboard operability, logical focus order, meaningful screen-reader output, and correctly timed status messages. Lighthouse lists these as manual audits and deliberately excludes them from the number. A 100 is a clean automated baseline; compliance requires the manual layer on top.
How are audits weighted in the Lighthouse accessibility score?
Each automated audit has a weight stored in categories.accessibility.auditRefs. The score is the sum of weight × pass (where pass is 1 or 0) divided by the sum of all weights. Heavily weighted audits like color-contrast, button-name, and image-alt move the score more than narrow checks, which makes the weighting useful for prioritizing fixes by impact.
Why does one missing alt attribute drop my score so much?
Audits are binary. The image-alt audit passes only if every relevant image has an accessible alternative; a single bare <img> fails the whole audit, costing its full weight rather than a proportional fraction. The fix is to add alt text (or alt="" for decorative images) to every offending node, then re-audit.
Which Lighthouse audits should I fix first to raise the score?
Sort your failing audits by weight and start at the top—usually color-contrast, then accessible-name audits like button-name and link-name. High-weight failures give the largest score gain and typically represent the broadest user-facing defects. Use jq to list auditRefs by weight against your failing audits to build the priority order.
Can I trust automated scores from tools other than Lighthouse?
They share the same fundamental limit. Lighthouse, the standalone axe-core tooling, and similar engines can only verify what is statically provable. Any automated number, Lighthouse's included, must be paired with manual keyboard and screen-reader testing to claim accessibility.