Testing & Automating Accessibility

Accessibility regressions are silent. A refactored <div> that used to be a <button>, a modal that stopped trapping focus, a contrast token nudged two shades lighter — none of these throw a runtime error, none fail a type check, and most slip past code review because reviewers read diffs, not accessibility trees. By the time a real user on NVDA or VoiceOver reports the breakage, the regression has shipped, compounded, and entangled itself with three more releases. The only sustainable defense is a layered, automated test strategy that fails the build before broken markup reaches production.

This guide is for frontend and UI engineers who already understand WCAG fundamentals and now need a workflow: which tool runs at which layer, what each layer can and cannot catch, and how to wire it all into CI so a violation blocks a merge instead of generating a ticket. Automation is non-negotiable at scale — but automation alone certifies nothing. Tools like axe-core reliably detect a specific, machine-decidable subset of failures: missing form labels, invalid ARIA attribute values, duplicate IDs, color-contrast ratios. They cannot tell you whether a screen reader announces your custom combobox coherently, whether Escape actually closes your dropdown, or whether the focus order matches the visual reading order. Those require a human with a keyboard and a screen reader. The discipline is knowing exactly where the machine stops and the human begins, and building both into the same pipeline.

The testing pyramid is the mental model for the rest of this guide: cheap, fast checks run on the broad base and execute constantly; slower, higher-fidelity checks run less often nearer the apex; and the entire stack funnels into a single CI gate. Manual testing is not a layer in the pyramid. It is the band that wraps the whole thing, because no automated layer can replace a human listening to actual screen reader speech.

What You'll Learn

This guide maps to five focused areas. Each handles one layer of the strategy in depth; read this page for how they fit together, then drill into whichever layer you're wiring up.

Automated Accessibility Testing with axe-core — the shared rules engine that powers nearly every other tool in this stack, how its rule set maps to WCAG, and how to configure, scope, and triage its results.
Component Testing with jest-axe — running axe-core against rendered components in JSDOM so a broken label or invalid ARIA attribute fails a unit test the moment it's introduced.
End-to-End Accessibility Testing with Playwright — testing real keyboard flows, focus order, route transitions, and dynamic states in a live browser where JSDOM can't reach.
Accessibility Audits with Lighthouse — page-level scoring and budgets that act as a coarse regression tripwire across whole routes, ideal for CI score thresholds.
Gating Accessibility in CI/CD Pipelines — turning all of the above into blocking checks so accessibility violations fail the build instead of becoming backlog.

These five layers complement, not replace, the foundations in Core Accessibility Principles for Modern Frameworks and the implementation patterns in React & Next.js Accessibility Patterns. Testing tells you whether an interface is accessible; those pillars tell you how to build it so it passes.

What Automation Catches — and the Half It Can't

The single most dangerous misconception in accessibility tooling is treating a clean axe scan or a Lighthouse score of 100 as proof of compliance. It is not. Industry analyses of axe-core and similar engines consistently put automated coverage at roughly 30–50% of WCAG success criteria, and even within the criteria a tool partially checks, it catches only the machine-decidable failures. The remaining ~50–70% requires human judgment.

Automation reliably catches deterministic, structural failures:

Missing or empty accessible names on form controls and buttons (4.1.2 Name, Role, Value).
Invalid ARIA: roles applied to elements that disallow them, required attributes missing, attribute values that aren't valid tokens.
Color contrast ratios below the threshold (1.4.3 Contrast (Minimum)).
Duplicate id values, broken aria-labelledby/aria-describedby references, document-language gaps.
Images without alt, lists with invalid children, tables without headers.
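To make "machine-decidable" concrete, here is a minimal sketch of the arithmetic behind the contrast check for 1.4.3, following the relative-luminance and contrast-ratio definitions in WCAG 2.x. The function names are illustrative and this is not axe-core's implementation; a real engine also has to resolve computed styles, opacity, and layered backgrounds, which is why the rule needs a real browser.

```typescript
// Sketch of the WCAG 2.x contrast-ratio formula for opaque "#rrggbb" colors.
// All helper names are hypothetical, chosen for this example only.

/** Parse "#rrggbb" into [r, g, b] channel values in the range 0..255. */
function parseHexColor(hex: string): [number, number, number] {
  const match = /^#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex);
  if (!match) {
    throw new Error(`Expected a #rrggbb color, got "${hex}"`);
  }
  return [parseInt(match[1], 16), parseInt(match[2], 16), parseInt(match[3], 16)];
}

/** Convert one 0..255 sRGB channel to its linear-light value (WCAG 2.x constants). */
function linearizeChannel(channel: number): number {
  const s = channel / 255;
  return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
}

/** Relative luminance: 0 for black, 1 for white. */
function relativeLuminance(hex: string): number {
  const [r, g, b] = parseHexColor(hex);
  return (
    0.2126 * linearizeChannel(r) +
    0.7152 * linearizeChannel(g) +
    0.0722 * linearizeChannel(b)
  );
}

/** Contrast ratio between two colors, from 1 (identical) to 21 (black on white). */
function contrastRatio(colorA: string, colorB: string): number {
  const lumA = relativeLuminance(colorA);
  const lumB = relativeLuminance(colorB);
  const lighter = Math.max(lumA, lumB);
  const darker = Math.min(lumA, lumB);
  return (lighter + 0.05) / (darker + 0.05);
}

/** 1.4.3 Contrast (Minimum) for normal-size text requires at least 4.5:1. */
function meetsAANormalText(foreground: string, background: string): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}

console.log(contrastRatio('#777777', '#ffffff').toFixed(2));
console.log(meetsAANormalText('#767676', '#ffffff'));
```

With these inputs, #777777 on white lands just under the 4.5:1 threshold and #767676 just over it. That is the kind of "contrast token nudged two shades lighter" regression a reviewer will not spot by eye but a deterministic check will.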

Automation cannot decide the things that depend on meaning and interaction:

Whether alt="image" is useful — it's present, so it passes, but it tells a screen reader user nothing.
Whether your custom widget's role and state actually produce coherent screen reader speech (4.1.2 Name, Role, Value is only partly machine-checkable).
Whether every interaction is operable from the keyboard, including arrow-key navigation inside composite widgets and Escape-to-close (2.1.1 Keyboard).
Whether the focus order matches the visual reading order, or whether focus is ever lost after a route change.
Whether an aria-live announcement actually fires at the right moment with the right politeness.

So the rule is blunt: automation is a regression net, not a certification. It exists to catch the boring, repetitive, high-volume failures that humans miss in review — freeing your manual testing budget for the judgment calls no tool can make. Every layer in this guide is built on that premise.

How to verify the gap is covered: run your automated suite, then pick one critical user flow per release and walk it manually with a keyboard only (no mouse) and with NVDA or VoiceOver running. If the automated suite is green but the manual walk surfaces a problem, that's exactly the ~50–70% your machine can't see, and a signal to add a Playwright assertion or a manual test case to your release checklist. For a structured approach to the manual side, see Screen Reader Compatibility Testing.

axe-core: The Shared Engine Under Everything

Before looking at layers, understand the engine. axe-core is the open-source rules engine that powers jest-axe, the official Playwright integration (@axe-core/playwright), the axe DevTools browser extension, and Lighthouse's accessibility audits. This is the most important architectural fact in the whole stack: you are running essentially the same rule set at every layer, just against different runtimes (JSDOM, a real Chromium page, a full Lighthouse audit). That consistency is a feature — a violation you fix to satisfy jest-axe stays fixed in Playwright and Lighthouse, because they share rule definitions.

axe-core tags each rule with the WCAG version and conformance level it maps to (wcag2a, wcag2aa, wcag21a, wcag21aa, wcag22aa), plus best-practice for rules that go beyond WCAG. You select which families run by tag, and you can disable individual rules where a known, documented exception applies.

// a11y/axe-config.ts — one shared config consumed by jest-axe AND Playwright
// so every layer enforces an identical rule set. Drift between layers is a
// common source of "passes in unit tests, fails in CI e2e" confusion.
import type { RunOptions } from 'axe-core';

export const axeRunOptions: RunOptions = {
  // Run only the conformance levels we actually commit to (WCAG 2.2 AA).
  runOnly: {
    type: 'tag',
    values: ['wcag2a', 'wcag2aa', 'wcag21a', 'wcag21aa', 'wcag22aa'],
  },
  rules: {
    // Contrast cannot be computed in JSDOM (no layout/paint). Keep it enabled
    // here so Playwright and Lighthouse enforce it in a real browser, and
    // override it to { enabled: false } in the component-test setup only.
    'color-contrast': { enabled: true },
    // Example of a documented, deliberate exception — keep these rare and reviewed.
    // 'region': { enabled: false },
  },
};

How to verify: after defining a shared config, deliberately introduce one violation (e.g. remove a <label>) and confirm it is reported with the same rule id in both your component test output and your Playwright run. Matching rule ids across layers proves the engine is genuinely shared and your tags line up. Full configuration depth — selectors, exclusions, custom rules, and result triage — lives in Automated Accessibility Testing with axe-core.
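The cross-layer check described above can be sketched as a small comparison of rule ids. This is a hypothetical helper, not part of axe-core, jest-axe, or Playwright; it assumes you have already collected each layer's violations and only relies on the `id` field that axe results carry.

```typescript
// Minimal shape: only the `id` field of an axe violation is needed here.
interface RuleHit {
  id: string;
}

interface RuleIdComparison {
  shared: string[];
  onlyInComponent: string[];
  onlyInE2e: string[];
}

/** Compare the rule ids reported by the component layer and the e2e layer. */
function compareRuleIds(componentHits: RuleHit[], e2eHits: RuleHit[]): RuleIdComparison {
  const componentIds = new Set(componentHits.map((hit) => hit.id));
  const e2eIds = new Set(e2eHits.map((hit) => hit.id));
  const sorted = (ids: string[]): string[] => ids.slice().sort();

  return {
    shared: sorted(Array.from(componentIds).filter((id) => e2eIds.has(id))),
    onlyInComponent: sorted(Array.from(componentIds).filter((id) => !e2eIds.has(id))),
    onlyInE2e: sorted(Array.from(e2eIds).filter((id) => !componentIds.has(id))),
  };
}

// Illustrative data: the removed <label> shows up in both layers, while
// contrast is reported only by the real-browser run.
const ruleIdReport = compareRuleIds(
  [{ id: 'label' }],
  [{ id: 'label' }, { id: 'color-contrast' }],
);
console.log(ruleIdReport);
```

An id that appears under `shared` confirms both layers ran the same rule. An id under `onlyInE2e` is expected for layout-dependent rules such as color-contrast; anything else there, or anything under `onlyInComponent`, is worth checking for tag or config drift.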

The Component Layer: jest-axe

The base of the practical pyramid (above static linting) is component-level testing. jest-axe runs axe-core against the DOM your component renders inside JSDOM, then exposes a toHaveNoViolations() matcher. This is the cheapest place to catch a structural regression because it runs with your existing unit tests, in milliseconds, on every change — long before a browser is ever spun up.

// AccountMenu.test.tsx
import { render } from '@testing-library/react';
import { axe, toHaveNoViolations } from 'jest-axe';
import { AccountMenu } from './AccountMenu';

expect.extend(toHaveNoViolations);

test('AccountMenu has no axe violations in its default state', async () => {
  const { container } = render(<AccountMenu user={{ name: 'Ada' }} />);
  // Scans the rendered subtree for machine-decidable WCAG failures:
  // missing names, invalid ARIA, broken references, bad roles.
  const results = await axe(container);
  expect(results).toHaveNoViolations();
});

The critical limitation: JSDOM has no layout engine and no real focus model, so the component layer is blind to contrast, focus order, and anything that depends on actual rendering or interaction. It catches static structure superbly and behavior not at all. That's by design — keep these tests fast and let the e2e layer handle interaction. Render each meaningful state (open menu, error state, loading state), because each state is different markup with its own potential violations.

How to verify: assert against multiple states, not just the initial render, and confirm a deliberately broken state (e.g. an error message not associated via aria-describedby) fails the test. The mechanics, matcher options, and CI-friendly reporting are covered in Component Testing with jest-axe.

The E2E Layer: Playwright for Real Behavior

Static scans can confirm a button exists with an accessible name. Only a real browser can confirm that pressing Tab reaches it, that Enter activates it, that focus moves into the dialog it opens and returns afterward, and that a client-side route change doesn't strip focus to <body>. This is the e2e layer, and Playwright is the strongest tool for it because it drives a genuine Chromium/Firefox/WebKit page with a real focus model and real event dispatch.

@axe-core/playwright runs the same axe engine against the live page — now contrast and layout-dependent rules actually work — and Playwright's input APIs let you assert the behavioral criteria axe can't reach.

// e2e/dialog.spec.ts
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('settings dialog: scan is clean and keyboard behavior is correct', async ({ page }) => {
  await page.goto('/settings');

  // Same axe engine, but in a real browser — contrast and layout rules now apply.
  const scan = await new AxeBuilder({ page })
    // Same tag list as the shared config, so the layers cannot drift.
    .withTags(['wcag2a', 'wcag2aa', 'wcag21a', 'wcag21aa', 'wcag22aa'])
    .analyze();
  expect(scan.violations).toEqual([]);

  // Behavior axe cannot check: keyboard operability (2.1.1) and focus management.
  await page.getByRole('button', { name: 'Open settings' }).focus();
  await page.keyboard.press('Enter');

  const dialog = page.getByRole('dialog', { name: 'Settings' });
  await expect(dialog).toBeVisible();
  // Focus must land inside the dialog, not on <body>.
  await expect(dialog.locator(':focus')).toBeVisible();

  // Escape must close it and focus must return to the trigger (2.4.3 Focus Order).
  await page.keyboard.press('Escape');
  await expect(dialog).toBeHidden();
  await expect(page.getByRole('button', { name: 'Open settings' })).toBeFocused();
});

This is the only automated layer that can verify 2.1.1 Keyboard and focus restoration in anything close to a realistic environment. It's slower and runs less often than component tests — typically on PRs and pre-merge — which is exactly why it sits higher on the pyramid. The full pattern set, including testing live-region announcements and route transitions, is in End-to-End Accessibility Testing with Playwright.

The Audit Layer: Lighthouse and Budgets

Lighthouse runs an axe-core-based accessibility audit across an entire rendered page and produces a 0–100 score. It overlaps heavily with the e2e scan, so don't treat it as a separate source of truth for individual violations — its value is a different one: a coarse, page-level budget you can threshold in CI. A route that suddenly drops from 100 to 86 is a loud, cheap signal that something structural regressed across that whole page, even if no single component test was watching that exact element.

// lighthouserc.js — Lighthouse CI asserts a per-route accessibility budget.
module.exports = {
  ci: {
    collect: {
      url: ['http://localhost:3000/', 'http://localhost:3000/settings'],
      numberOfRuns: 3, // multiple runs to reduce score flakiness
    },
    assert: {
      assertions: {
        // Block the build if the accessibility category drops below budget.
        'categories:accessibility': ['error', { minScore: 1 }],
      },
    },
  },
};

The trap is over-trusting the number: a Lighthouse score of 100 means "no failures the audit can detect," which is the same partial coverage as any axe-based tool — it is not a compliance certificate. Use it as a tripwire across many routes, and rely on jest-axe and Playwright for granular, element-level enforcement. Budget strategy, scoring nuances, and how to combine page-level audits with component-level scans are covered in Accessibility Audits with Lighthouse.
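The tripwire idea (a route that drops from 100 to 86) can be sketched as a comparison between two sets of per-route scores. This is an illustrative helper, not a Lighthouse CI feature; the `RouteScore` shape and function name are assumptions, and scores use the 0 to 1 scale that Lighthouse writes to its JSON report.

```typescript
interface RouteScore {
  url: string;
  score: number; // accessibility category score on a 0..1 scale
}

interface ScoreDrop {
  url: string;
  previous: number;
  current: number;
}

/**
 * Return every route whose score fell by more than `tolerance` since the
 * previous run. Routes with no previous score are skipped, not flagged.
 */
function findScoreDrops(
  previousRun: RouteScore[],
  currentRun: RouteScore[],
  tolerance: number = 0,
): ScoreDrop[] {
  const previousByUrl = new Map<string, number>();
  for (const route of previousRun) {
    previousByUrl.set(route.url, route.score);
  }

  const drops: ScoreDrop[] = [];
  for (const route of currentRun) {
    const before = previousByUrl.get(route.url);
    if (before !== undefined && before - route.score > tolerance) {
      drops.push({ url: route.url, previous: before, current: route.score });
    }
  }
  return drops;
}

// Illustrative data: /settings regressed, / did not, /billing is new.
const scoreDrops = findScoreDrops(
  [
    { url: '/', score: 1 },
    { url: '/settings', score: 1 },
  ],
  [
    { url: '/', score: 1 },
    { url: '/settings', score: 0.86 },
    { url: '/billing', score: 0.9 },
  ],
  0.02,
);
console.log(scoreDrops.map((drop) => drop.url));
```

A drop reported here says only that something on the route regressed; finding the failing element is still the job of the jest-axe and Playwright scans.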

The CI Gate: Failing the Build on Violations

A test that runs locally but doesn't block a merge is documentation, not enforcement. The entire point of the pyramid is to funnel into a CI gate that makes accessibility violations fail the build — the same severity as a failing type check or a broken unit test. Without a blocking gate, every layer above degrades into advisory noise that teams learn to ignore.

# .github/workflows/a11y.yml
name: accessibility
on: [pull_request]

jobs:
  a11y:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: npm }
      - run: npm ci

      # Layer 1 — static analysis: cheapest, fails first.
      - run: npm run lint  # includes eslint-plugin-jsx-a11y

      # Layer 2 — component tests with jest-axe.
      - run: npm test -- --ci

      # Layer 3 — e2e with Playwright + axe.
      - run: npx playwright install --with-deps chromium
      - run: npm run test:e2e

      # Layer 4 — Lighthouse budget. A non-zero exit fails the whole job,
      # which blocks the merge when the branch is protected.
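      # Build in the foreground so a failed build fails this step, then
      # start the server in the background for Lighthouse to audit.
      - run: npm run build
      - run: npm run start &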
      - run: npx wait-on http://localhost:3000
      - run: npx @lhci/cli autorun

Two practical disciplines make a gate survivable. First, start with a baseline. Turning on a hard gate against a legacy app with hundreds of existing violations will block every PR; instead, snapshot the current violations and fail only on new ones, then burn the baseline down over time. Second, fail loud and specific — surface the exact axe rule id, the failing selector, and a link to the rule's help page in the CI output, so an engineer can fix the violation without leaving the PR. The full gating playbook — baselines, severity thresholds, branch protection, and reporting — is in Gating Accessibility in CI/CD Pipelines.
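Both disciplines can be sketched in a few lines. The following is a minimal, hypothetical baseline check, not a feature of axe-core or any CI tool: it assumes violations shaped like a simplified axe result (`id`, `helpUrl`, and `nodes[].target`), keys each failure by rule id plus selector, and reports only failures missing from the stored baseline. The URLs and selectors are placeholders.

```typescript
// Simplified from the axe result shape; real `target` entries can also be
// nested arrays for shadow DOM, which this sketch ignores.
interface ViolationNode {
  target: string[];
}

interface Violation {
  id: string;
  helpUrl: string;
  nodes: ViolationNode[];
}

interface NewFailure {
  ruleId: string;
  selector: string;
  helpUrl: string;
}

/** One stable key per failing element: rule id plus its selector path. */
function violationKeys(violations: Violation[]): string[] {
  const keys: string[] = [];
  for (const violation of violations) {
    for (const node of violation.nodes) {
      keys.push(`${violation.id}::${node.target.join(' ')}`);
    }
  }
  return keys;
}

/** Keep only failures whose key is absent from the baseline snapshot. */
function findNewFailures(current: Violation[], baselineKeys: Set<string>): NewFailure[] {
  const fresh: NewFailure[] = [];
  for (const violation of current) {
    for (const node of violation.nodes) {
      const selector = node.target.join(' ');
      if (!baselineKeys.has(`${violation.id}::${selector}`)) {
        fresh.push({ ruleId: violation.id, selector, helpUrl: violation.helpUrl });
      }
    }
  }
  return fresh;
}

/** Fail loud and specific: one line per failure with rule id, selector, help link. */
function formatForCi(failures: NewFailure[]): string {
  return failures
    .map((failure) => `${failure.ruleId} at ${failure.selector} (${failure.helpUrl})`)
    .join('\n');
}

// Illustrative data only.
const baselineSnapshot: Violation[] = [
  { id: 'label', helpUrl: 'https://example.com/rules/label', nodes: [{ target: ['#email'] }] },
];
const currentScan: Violation[] = [
  {
    id: 'label',
    helpUrl: 'https://example.com/rules/label',
    nodes: [{ target: ['#email'] }, { target: ['#phone'] }],
  },
  { id: 'image-alt', helpUrl: 'https://example.com/rules/image-alt', nodes: [{ target: ['img.hero'] }] },
];

const newFailures = findNewFailures(currentScan, new Set(violationKeys(baselineSnapshot)));
console.log(formatForCi(newFailures));
// A CI script would exit non-zero here when newFailures.length > 0.
```

Keying on the selector is a deliberate simplification: a refactor that changes the selector makes an old violation look new. That errs on the side of blocking, which is usually the safer failure mode for a gate.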

Manual Verification That Must Remain

No combination of the four automated layers certifies accessibility, because the most consequential criteria are experiential. A human must periodically verify the things the machine structurally cannot:

Actual screen reader speech. Open the flow in NVDA (Windows/Firefox) and VoiceOver (macOS/Safari) and listen. Is the custom combobox announced as a combobox with its current value and option count? Does the live region read at the right moment, or get clobbered by a focus change? axe confirms the ARIA is valid; only your ears confirm it's coherent (4.1.2 Name, Role, Value).
Keyboard-only operation. Unplug the mouse. Tab through the whole flow. Every interactive element must be reachable, operable, and have an order that matches the visual layout (2.1.1 Keyboard, 2.4.3 Focus Order). Composite widgets need their arrow-key and Escape behaviors exercised by hand.
Visible focus. Confirm a clearly visible focus indicator on every focusable element as you tab, including custom-styled controls where a CSS reset may have stripped the default outline (2.4.7 Focus Visible). This is trivial to break with a global outline: none and nearly impossible for a DOM-structure scan to catch.

The most efficient model is a standing manual checklist run on critical flows each release, with the automated layers preventing the regressions the checklist already caught from ever recurring. When a manual pass finds something, the follow-up is mechanical: encode it as a Playwright assertion or a jest-axe case so the machine guards that specific failure forever after. Manual testing is expensive; spend it on discovery, then let automation handle the repetition.

Key Takeaways

Automation is a regression net, not a certificate. It catches ~30–50% of WCAG criteria: the deterministic, high-volume failures humans miss in review. The other ~50–70% need a keyboard and a screen reader.
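One engine, several runtimes. axe-core underpins jest-axe, @axe-core/playwright, and Lighthouse's accessibility audits; only the static lint layer (eslint-plugin-jsx-a11y) uses its own rules. Share a single rule config so the axe-based layers can't drift.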
Match the tool to the question. Static linting and jest-axe catch structure fast and cheap; Playwright is the only automated layer that verifies real keyboard, focus, and route behavior; Lighthouse is a coarse per-page budget tripwire.
A gate that doesn't block is just documentation. Wire every layer into CI so violations fail the build, baseline legacy debt, and fail loud with rule ids and selectors.
Manual screen reader and keyboard testing wraps everything. Spend it on discovery, then encode each finding as an automated assertion so it never regresses.

Frequently Asked Questions

Does a clean automated scan prove my app is WCAG compliant? No. Automated engines like axe-core catch roughly 30–50% of WCAG success criteria, and within those they only flag machine-decidable failures. A score of 100 means "no failures the tool can detect," not "accessible." Issues like useful alt text, coherent screen reader announcements, and logical focus order require manual verification with a keyboard and a screen reader.

Do jest-axe, Playwright, and Lighthouse duplicate each other since they all use axe-core? They share the engine but test at different layers, so the overlap is intentional, not redundant. jest-axe runs in JSDOM (no contrast, no real focus) for fast per-component structural checks. Playwright runs the engine in a real browser and drives keyboard/focus behavior the engine can't assess. Lighthouse produces a coarse per-page score for budget thresholds. Use all three — each answers a question the others can't.

Where should I run accessibility tests: pre-commit, PR, or deploy? Tier them to match cost. Static linting and jest-axe are fast enough for pre-commit and every push. Playwright e2e and Lighthouse are slower, so run them on pull requests as the blocking merge gate. Reserve full manual screen reader passes for release candidates. The goal is fast feedback on cheap checks and thorough enforcement before merge.

How do I introduce a blocking CI gate to a legacy app without blocking every PR? Snapshot the existing violations as a baseline and configure the gate to fail only on new violations beyond that baseline. This stops new regressions immediately while letting you burn down the legacy debt incrementally. A hard gate against an uncleaned legacy codebase will block every merge and get disabled within a week.

Can Playwright fully replace manual screen reader testing? No. Playwright can verify keyboard operability, focus order, focus restoration, and that valid ARIA is present — a large and valuable share of behavioral coverage. But it cannot tell you what a screen reader actually says or whether that speech is coherent to a user. Real NVDA and VoiceOver listening remains irreplaceable for 4.1.2 Name, Role, Value in custom widgets.

Which WCAG criteria are hardest to automate? The experiential and judgment-based ones. 2.1.1 Keyboard operability of composite widgets, 2.4.3 Focus Order matching visual order, 2.4.7 Focus Visible on custom-styled controls, and the meaningfulness half of 4.1.2 Name, Role, Value (valid ARIA is checkable; coherent ARIA is not). These are exactly where your manual testing budget should go.

Home — the full accessibility-for-frameworks library.
Core Accessibility Principles for Modern Frameworks — the foundations these tests verify.
React & Next.js Accessibility Patterns — implementation patterns that pass these gates.
Automated Accessibility Testing with axe-core — the shared engine, configured in depth.
Component Testing with jest-axe — fast structural checks in JSDOM.
End-to-End Accessibility Testing with Playwright — real keyboard, focus, and route behavior.
Accessibility Audits with Lighthouse — page-level scores and budgets.
Gating Accessibility in CI/CD Pipelines — making violations fail the build.
Screen Reader Compatibility Testing — the manual layer that wraps all of the above.