Accessibility Regression Testing in GitHub Actions

A clean gate on a clean codebase is easy; the hard problem is a real app that already carries accessibility debt. You cannot fail every PR on hundreds of pre-existing violations, but you also cannot let new ones slip through. The answer is regression testing: snapshot the currently accepted violations into a baseline, then fail a pull request only when it introduces something new. This guide—part of Gating Accessibility in CI/CD Pipelines—shows how to baseline, diff, and run scheduled full audits in GitHub Actions, so a regression in 4.1.2 Name, Role, Value or 1.4.3 Contrast (Minimum) fails the build while legacy debt is tracked, not blocking.

WCAG Coverage Mapping

  • 4.1.2 Name, Role, Value (Level A)
  • 1.4.3 Contrast (Minimum) (Level AA)
  • 1.3.1 Info and Relationships (Level A)

Prerequisites

  • An axe-based suite (jest-axe or @axe-core/playwright) that can emit machine-readable JSON.
  • A GitHub Actions workflow running on pull_request, already a required check.
  • A writable location for the baseline file, committed to the repository.

Snapshotting a Baseline of Accepted Violations

The baseline is a committed record of every violation you currently tolerate, keyed so that an entry survives unrelated DOM churn. A raw axe report is too volatile to diff directly—node order and attribute values shift between runs—so reduce each violation to a stable fingerprint of rule id plus a normalized selector.

// scripts/fingerprint.js — stable identity for one violation node
const crypto = require('node:crypto');

function fingerprint(ruleId, node) {
  // Normalize the selector so trivial DOM reordering doesn't churn the baseline.
  const selector = node.target.join(' ').replace(/:nth-child\(\d+\)/g, ':nth-child(n)');
  return crypto.createHash('sha1').update(`${ruleId}::${selector}`).digest('hex').slice(0, 12);
}
module.exports = { fingerprint };

// scripts/write-baseline.js: generate the accepted-debt snapshot
const fs = require('node:fs');
const { fingerprint } = require('./fingerprint');
const report = require('../a11y-report.json');

const baseline = {};
for (const v of report.violations) {
  for (const node of v.nodes) {
    baseline[fingerprint(v.id, node)] = { rule: v.id, impact: v.impact, selector: node.target.join(' ') };
  }
}
fs.writeFileSync('a11y-baseline.json', JSON.stringify(baseline, null, 2) + '\n');

Commit a11y-baseline.json to the repository. It is the source of truth for "violations we have agreed to fix later," and every PR diffs against it. The narrower, selector-scoped form of this idea for a single rule is covered in Failing Pull Requests on axe Violations.
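
To see what the normalization buys, the following sketch inlines the fingerprint helper from above and runs it on made-up nodes. The selectors and rule pairings are hypothetical sample data, not output from a real axe run.

```javascript
// Self-contained copy of the fingerprint helper, plus hypothetical sample nodes.
const crypto = require('node:crypto');

function fingerprint(ruleId, node) {
  const selector = node.target.join(' ').replace(/:nth-child\(\d+\)/g, ':nth-child(n)');
  return crypto.createHash('sha1').update(`${ruleId}::${selector}`).digest('hex').slice(0, 12);
}

// The same button before and after an unrelated sibling was inserted above it.
const before = { target: ['ul.nav > li:nth-child(2) > button'] };
const after = { target: ['ul.nav > li:nth-child(3) > button'] };
// A different element entirely.
const other = { target: ['form#signup > input'] };

// Index churn does not change the identity of the violation...
const stable = fingerprint('button-name', before) === fingerprint('button-name', after);
// ...but a different element or a different rule does.
const distinct = fingerprint('button-name', before) !== fingerprint('button-name', other);
const perRule = fingerprint('button-name', before) !== fingerprint('color-contrast', before);

console.log({ stable, distinct, perRule });
```

One trade-off worth knowing: siblings that differ only by their nth-child index collapse into a single fingerprint, so a second violating sibling next to a baselined one would not register as new. If that matters for your markup, a less aggressive normalization may be the better fit.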

Gate Hook: Regenerate the baseline only on an intentional, reviewed commit—never automatically in CI. An auto-updating baseline launders new regressions into accepted debt, defeating the entire gate.
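
One way to enforce this rule mechanically is a guard at the top of the baseline writer. The sketch below is a hypothetical helper (assertNotCI is not part of the scripts above); it relies on the CI and GITHUB_ACTIONS environment variables, which GitHub Actions sets to "true" on its runners.

```javascript
// Hypothetical guard, intended for the top of scripts/write-baseline.js.
function assertNotCI(env) {
  // GitHub Actions exports CI=true and GITHUB_ACTIONS=true for every job.
  if (env.CI === 'true' || env.GITHUB_ACTIONS === 'true') {
    throw new Error(
      'Refusing to regenerate a11y-baseline.json in CI; run this locally and commit the result for review.'
    );
  }
}

// In the real script the call would be: assertNotCI(process.env);
// Here literal objects stand in for the environment so the behavior is visible.
assertNotCI({}); // local shell: passes silently

let blocked = false;
try {
  assertNotCI({ CI: 'true' }); // CI runner: throws
} catch {
  blocked = true;
}
console.log(blocked);
```

The guard does not replace review; it only makes an accidental "regenerate in the pipeline" step fail loudly instead of silently growing the accepted set.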


Diffing a New Run Against the Baseline

On each pull request the workflow runs axe, fingerprints the fresh violations, and subtracts the baseline. Anything left is new and fails the build; anything in the baseline that disappeared is a fixed item you can prune.

// scripts/diff-baseline.js — only NEW violations fail the gate
const fs = require('node:fs');
const { fingerprint } = require('./fingerprint');
const baseline = require('../a11y-baseline.json');
const report = require('../a11y-report.json');

const seen = new Set();
const fresh = [];
for (const v of report.violations) {
  for (const node of v.nodes) {
    const fp = fingerprint(v.id, node);
    seen.add(fp);
    if (!baseline[fp]) fresh.push({ rule: v.id, impact: v.impact, selector: node.target.join(' ') });
  }
}
const fixed = Object.keys(baseline).filter((fp) => !seen.has(fp));

console.log(`### Accessibility regression check\n`);
console.log(`New: **${fresh.length}** · Fixed (prunable): **${fixed.length}**\n`);
if (fresh.length) {
  console.log('| Rule | Impact | Selector |\n| --- | --- | --- |');
  for (const f of fresh) console.log(`| ${f.rule} | ${f.impact} | \`${f.selector}\` |`);
}
// Non-zero ONLY on new violations -> red required check on regressions.
// Set exitCode instead of calling process.exit() so pending output is flushed first.
process.exitCode = fresh.length ? 1 : 0;

# .github/workflows/a11y-regression.yml
name: a11y-regression
on:
  pull_request:
    branches: [main]
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: npm }
      - run: npm ci
      - run: npm run test:a11y:json     # writes a11y-report.json
      - name: Diff against baseline
        run: node ./scripts/diff-baseline.js >> "$GITHUB_STEP_SUMMARY"

The diff makes the gate adoptable: a team with 300 known violations sees a green check until someone adds the 301st, at which point the job exits non-zero and branch protection blocks the merge.

Gate Hook: Report the count of fixed baseline entries too. Surfacing prunable items nudges engineers to shrink the baseline, turning the gate into a ratchet that only tightens.
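
Acting on the "Fixed (prunable)" count could look like the sketch below. pruneBaseline is a hypothetical helper, and the fingerprints and entries are invented sample data; the point is that pruning can only remove entries, never add them.

```javascript
// Hypothetical helper: keep only baseline entries whose fingerprint was seen in the latest run.
function pruneBaseline(baseline, seenFingerprints) {
  const seen = new Set(seenFingerprints);
  const kept = {};
  for (const [fp, entry] of Object.entries(baseline)) {
    if (seen.has(fp)) kept[fp] = entry;
  }
  return kept;
}

// Invented sample baseline with two accepted violations.
const baseline = {
  a1b2c3d4e5f6: { rule: 'color-contrast', impact: 'serious', selector: '.hero > p' },
  '0f9e8d7c6b5a': { rule: 'button-name', impact: 'critical', selector: '#menu' },
};

// The latest run still shows the contrast issue, plus a fingerprint the baseline never had.
const pruned = pruneBaseline(baseline, ['a1b2c3d4e5f6', 'ffffffffffff']);

// The fixed entry is dropped; the unknown fingerprint is NOT adopted into the baseline.
console.log(Object.keys(pruned));
```

Because the function iterates over the existing baseline rather than the fresh report, a pruning commit cannot smuggle new violations into the accepted set, which keeps the ratchet moving in one direction.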


Scheduled Full Audits Beyond PR Scope

Pull-request runs test only the routes a diff touches, so untouched pages drift. A nightly schedule trigger runs the full audit across every route, catches third-party and content regressions, and can open an issue when new violations appear on main.

# .github/workflows/a11y-nightly.yml
name: a11y-nightly
on:
  schedule:
    - cron: '0 6 * * *'          # 06:00 UTC daily, full-site sweep
  workflow_dispatch: {}          # allow manual runs
jobs:
  full-audit:
    runs-on: ubuntu-latest
    permissions:
      contents: read             # needed by actions/checkout
      issues: write              # needed by gh issue create
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: npm }
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: npm run test:e2e:a11y:full   # crawls all routes, writes report
      - name: Open issue on new debt
        if: failure()
        run: gh issue create --title "A11y regression on main" --body-file a11y-summary.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

The nightly job uses the same diff logic but against the full route map, so it surfaces regressions that no PR exercised—a CMS content change that broke contrast, or a dependency bump that stripped an accessible name. Page-level budgets that complement this sweep are detailed in Setting Lighthouse CI Accessibility Budgets.
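
Reusing the diff script for a full crawl requires a single report with a top-level violations array. The sketch below assumes the crawl produces one axe results object per route, each with url and violations; mergeReports and the example URLs are hypothetical.

```javascript
// Assumption: one axe results object per crawled route, each carrying `url` and `violations`.
function mergeReports(results) {
  const byRule = new Map();
  for (const result of results) {
    for (const v of result.violations) {
      if (!byRule.has(v.id)) byRule.set(v.id, { id: v.id, impact: v.impact, nodes: [] });
      // Tag each node with its page so a summary can say where it was found.
      for (const node of v.nodes) byRule.get(v.id).nodes.push({ ...node, url: result.url });
    }
  }
  // Same top-level shape the diff script reads: { violations: [{ id, impact, nodes }] }.
  return { violations: [...byRule.values()] };
}

// Invented per-route results for two pages.
const merged = mergeReports([
  {
    url: 'https://example.test/',
    violations: [{ id: 'color-contrast', impact: 'serious', nodes: [{ target: ['.hero > p'] }] }],
  },
  {
    url: 'https://example.test/pricing',
    violations: [
      { id: 'color-contrast', impact: 'serious', nodes: [{ target: ['.plan > small'] }] },
      { id: 'button-name', impact: 'critical', nodes: [{ target: ['#buy'] }] },
    ],
  },
]);

console.log(merged.violations.map((v) => `${v.id}: ${v.nodes.length}`));
```

Note that the fingerprint shown earlier hashes only the rule id and selector, so an identical selector on two routes maps to one baseline entry. Whether to fold the route into the fingerprint is a design decision this sketch leaves open.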


Caching and Matrix for Speed

A full sweep is heavier than a PR diff, so cache the expensive pieces and parallelize across routes with a matrix.

jobs:
  full-audit:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false           # report every shard, not just the first failure
      matrix:
        shard: [1, 2, 3, 4]      # split the route map four ways
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: npm }
      - uses: actions/cache@v4
        with:
          path: ~/.cache/ms-playwright    # reuse browser binaries
          key: pw-${{ hashFiles('package-lock.json') }}
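      - run: npm ci
      - run: npx playwright install --with-deps chromium   # skips the download on a warm cache, still installs OS deps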
      - run: npm run test:e2e:a11y -- --shard=${{ matrix.shard }}/4

fail-fast: false ensures every shard reports its own regressions instead of cancelling siblings on the first failure—critical for an audit whose job is to surface all drift.
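
Playwright's --shard flag splits test files on its own. If the sweep is instead driven by a plain list of routes, a deterministic split might look like the sketch below; routesForShard and the route list are hypothetical, with shard numbers starting at 1 to match the matrix values.

```javascript
// Hypothetical helper: pick the routes for one shard out of `total`, 1-based like the matrix.
function routesForShard(routes, shard, total) {
  if (!Number.isInteger(shard) || shard < 1 || shard > total) {
    throw new RangeError(`shard must be an integer between 1 and ${total}`);
  }
  // Sort first so every runner computes the same split regardless of input order.
  const sorted = [...routes].sort();
  // Round-robin assignment keeps shard sizes within one route of each other.
  return sorted.filter((_, i) => i % total === shard - 1);
}

// Invented route map.
const routes = ['/', '/about', '/pricing', '/docs', '/blog', '/contact'];

for (let shard = 1; shard <= 4; shard++) {
  console.log(shard, routesForShard(routes, shard, 4));
}
```

Sorting before splitting matters more than the split strategy itself: if two runners disagree about order, some routes get audited twice and others not at all.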


Uploading Reports as Artifacts

Persist the full axe JSON, the diff summary, and any HTML report so a failure is investigable after the runner is gone. Upload with if: always() so the artifact survives a red run.

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: a11y-report-shard-${{ matrix.shard }}
          path: |
            a11y-report.json
            a11y-summary.md
            playwright-report/
          retention-days: 30      # keep history to trend regressions

Thirty-day retention keeps a month of history, enough to trend the violation count and answer "when did this regress?" by diffing two archived reports.
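
Comparing two downloaded artifacts can be as simple as counting nodes per rule. The sketch below assumes both files follow the axe report shape used by the scripts above; countByRule, trend, and the sample reports are hypothetical.

```javascript
// Count violating nodes per rule id in one report.
function countByRule(report) {
  const counts = {};
  for (const v of report.violations) counts[v.id] = (counts[v.id] || 0) + v.nodes.length;
  return counts;
}

// List every rule whose node count changed between two reports.
function trend(older, newer) {
  const a = countByRule(older);
  const b = countByRule(newer);
  const rules = [...new Set([...Object.keys(a), ...Object.keys(b)])].sort();
  return rules
    .map((rule) => ({ rule, before: a[rule] || 0, after: b[rule] || 0 }))
    .map((row) => ({ ...row, delta: row.after - row.before }))
    .filter((row) => row.delta !== 0);
}

// Invented reports standing in for two archived a11y-report.json files.
const lastWeek = {
  violations: [
    { id: 'color-contrast', nodes: [{ target: ['.hero > p'] }, { target: ['.footer a'] }] },
    { id: 'button-name', nodes: [{ target: ['#menu'] }] },
  ],
};
const today = {
  violations: [
    { id: 'color-contrast', nodes: [{ target: ['.hero > p'] }] },
    { id: 'button-name', nodes: [{ target: ['#menu'] }] },
    { id: 'label', nodes: [{ target: ['#email'] }] },
  ],
};

console.table(trend(lastWeek, today));
```

Running the comparison across consecutive nightly artifacts narrows a regression to the first night on which a rule's count increased.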


How to Verify

Confirm the regression gate behaves correctly in both directions—new violations must fail, baselined ones must not.

  1. Baseline passes (tool check): With an up-to-date baseline, open a no-op PR and confirm the diff job is green and the summary reports New: 0.
  2. New violation fails (tool check): Introduce a fresh 4.1.2 defect (remove an aria-label on an un-baselined element), push, and confirm the job exits non-zero and lists the new selector in the step summary.
  3. Baselined debt does not fail (manual check): Confirm a PR that merely touches code near an existing baselined violation stays green—proving the diff suppresses accepted debt.
  4. Fixed items surface (manual check): Repair one baselined violation and confirm the summary reports it under "Fixed (prunable)," prompting a baseline trim.
  5. Nightly sweep runs (tool check): Trigger the nightly workflow via workflow_dispatch and confirm it audits routes no PR touched and uploads the artifact.

Conclusion

Regression testing is what lets an accessibility gate ship on a real codebase without a flag day. Fingerprint and commit a baseline of accepted violations, diff every PR so only new defects fail, sweep the full site on a nightly cron, and archive reports to trend the debt. Done right, the baseline only ratchets down: new regressions are blocked at the PR, and each fix shrinks the accepted set until the gate guards a fully clean tree.


Frequently Asked Questions

How is a baseline different from disabling rules? A baseline suppresses specific known instances—a rule on a specific element—while still failing on any new instance of that same rule elsewhere. Disabling a rule blinds the gate to every occurrence, including future regressions. The baseline is a tracked debt list; a disabled rule is a permanent blind spot.

Should CI update the baseline automatically? No. Auto-updating launders new regressions straight into accepted debt and silently defeats the gate. Regenerate the baseline only on an intentional, reviewed commit, so growing the accepted-debt set is always a deliberate, visible decision.

Why run a nightly audit if every PR is already gated? PR runs only exercise the routes a change touches. Untouched pages drift through CMS content edits, third-party widget updates, and dependency bumps that strip accessible names or break contrast. A scheduled full sweep catches regressions no pull request ever exercised.

How do I keep the full audit fast? Cache the Playwright browser binaries, split the route map across a build matrix with fail-fast: false, and reuse the npm cache. Sharding turns a long serial crawl into parallel jobs while still reporting every shard's regressions.

What should I store as an artifact? The raw axe JSON, the Markdown diff summary, and the HTML report, uploaded with if: always() and a retention window. Keeping history lets you diff two archived reports to pinpoint when and where a regression landed.