Most performance budgets fail because they’re set once and forgotten. A team establishes targets, maybe adds a Lighthouse check to CI, and six months later the budget is either blown past or so generous it catches nothing. Here’s how to build budgets that actually prevent regressions.

Why budgets fail

The typical failure mode: a team sets a Lighthouse performance score target of 90. Scores fluctuate between 85 and 95 depending on network conditions, test environment load, and what else happens to be running on the CI machine. The check becomes noise. Engineers start ignoring it. The budget is effectively dead.

Score-based budgets fail because scores are synthetic aggregates. They compress multiple signals into one number, making it impossible to diagnose what changed or why. A score drop from 92 to 87 could mean your JavaScript bundle grew, or your largest image got bigger, or your server response time increased. The score doesn’t tell you which.

Budget by metric, not by score

Instead of targeting a Lighthouse score, budget individual metrics:

  • Largest Contentful Paint (LCP): under 2.5 seconds on a 4G connection
  • Interaction to Next Paint (INP): under 200ms (this replaced First Input Delay as a Core Web Vital in March 2024, and it’s a harder bar to clear because it measures all interactions throughout a session, not just the first one)
  • Cumulative Layout Shift (CLS): under 0.1
  • JavaScript bundle size: under 250KB compressed for the initial load
  • First-party JavaScript: under 150KB compressed
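The per-metric budgets above are just data, which means they can be checked programmatically rather than eyeballed. A minimal sketch (metric names, thresholds, and the measured values are illustrative, taken from the list above):

```javascript
// Per-metric budgets from the list above. Units: ms for timings,
// unitless for CLS, KB (compressed) for bundle sizes.
const budgets = {
  lcp: 2500,
  inp: 200,
  cls: 0.1,
  totalJsKb: 250,
  firstPartyJsKb: 150,
};

// Compare measured values against the budgets and return one
// violation record per metric that exceeded its limit.
function checkBudgets(measured, budgets) {
  return Object.entries(budgets)
    .filter(([metric, limit]) => measured[metric] > limit)
    .map(([metric, limit]) => ({
      metric,
      limit,
      actual: measured[metric],
      overBy: measured[metric] - limit,
    }));
}

const violations = checkBudgets(
  { lcp: 2700, inp: 180, cls: 0.05, totalJsKb: 260, firstPartyJsKb: 140 },
  budgets
);
// violations → [{ metric: 'lcp', ... }, { metric: 'totalJsKb', ... }]
```

Because each violation names a single metric, the output tells you directly what to investigate, which is the whole point of budgeting by metric rather than by score.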

Each metric is independently measurable, independently actionable, and independently meaningful. When a budget breaks, you know exactly what to investigate. A bundle size regression points you to the PR that added the dependency. An LCP regression points you to image loading or server response time.

Automate enforcement in CI

A budget that requires manual checking is a budget that will be ignored. Integrate directly into your CI pipeline:

Bundle size checks on every PR using size-limit or bundlesize. These tools compare the build output against your budget and fail the PR if the budget is exceeded. The output shows exactly which bundles grew and by how much.

Configuration with size-limit is straightforward:

[
  {
    "path": "dist/main.*.js",
    "limit": "150 KB",
    "gzip": true
  },
  {
    "path": "dist/vendor.*.js",
    "limit": "100 KB",
    "gzip": true
  }
]

Lighthouse CI running against preview deployments with per-metric assertions. Instead of asserting a score, assert individual metrics:

{
  "ci": {
    "assert": {
      "assertions": {
        "largest-contentful-paint": ["error", {"maxNumericValue": 2500}],
        "total-blocking-time": ["error", {"maxNumericValue": 300}],
        "cumulative-layout-shift": ["error", {"maxNumericValue": 0.1}]
      }
    }
  }
}

Real User Monitoring (RUM) alerts when field metrics cross thresholds. Synthetic tests (Lighthouse CI) catch regressions before they ship. RUM catches regressions that only appear with real users on real devices in real network conditions. You need both.
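A RUM alert of this kind reduces to computing the 75th percentile over a window of field samples and comparing it to the budget. A rough sketch, assuming samples arrive from your RUM pipeline and using the nearest-rank percentile convention (one of several):

```javascript
// 75th percentile of an array of samples, nearest-rank method.
function p75(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.75 * sorted.length);
  return sorted[rank - 1];
}

// Fire an alert when the p75 of recent field values crosses the budget.
// `samples` would come from your RUM pipeline (e.g. the last hour of
// LCP values in ms); `threshold` is the budgeted limit in the same unit.
function shouldAlert(samples, threshold) {
  return samples.length > 0 && p75(samples) > threshold;
}

shouldAlert([1800, 2100, 2300, 2600], 2500); // → false (p75 = 2300)
shouldAlert([2400, 2600, 2800, 3000], 2500); // → true  (p75 = 2800)
```

Using p75 rather than the mean keeps the alert aligned with how Core Web Vitals are assessed in the field.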

The PR-level checks are the most important. They catch regressions before they merge, when the cost of fixing them is lowest. A developer who just added a 200KB charting library sees the budget failure immediately and can evaluate whether the library is worth the cost or whether a lighter alternative exists.

The performance ratchet

The most effective technique I’ve found is the performance ratchet: after every improvement, tighten the budget to the new baseline plus a small margin.

Ship a bundle size reduction from 280KB to 220KB? Set the new budget at 235KB. This prevents the improvement from being eroded by future changes while leaving room for legitimate feature growth.
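That arithmetic can be codified so the tightening step is never skipped. A minimal sketch, where the 7% default margin is an arbitrary choice that happens to reproduce the 220KB → 235KB example:

```javascript
// Tighten a budget to the new measured baseline plus a margin.
// Never loosens: if the computed value exceeds the current budget,
// the current budget is kept.
function ratchet(currentBudgetKb, measuredKb, marginPct = 0.07) {
  const candidate = Math.round(measuredKb * (1 + marginPct));
  return Math.min(currentBudgetKb, candidate);
}

ratchet(280, 220); // → 235: lock in the improvement, leave headroom
ratchet(200, 220); // → 200: a regression never relaxes the budget
```

Run this against the latest measured baseline whenever the budget check passes, and commit the updated number alongside the config, so the ratchet is part of the pipeline rather than a manual chore.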

The ratchet works because it aligns the budget with reality. Traditional budgets are aspirational: you set a target and hope the team meets it. Ratcheted budgets are empirical: you measure where you are, lock in the improvement, and only allow growth that has been consciously approved.

The discipline here is important. Without the ratchet, performance improvements are temporary. A team invests a sprint in reducing bundle size, ships the improvement, and over the next quarter the savings are gradually eaten by new features and new dependencies. The ratchet makes the improvement permanent.

Third-party script budgets

First-party JavaScript is the easy part. You control it, you can optimise it, and you can enforce budgets on it.

Third-party scripts (analytics, A/B testing tools, feature flags, behavioural analysis, advertising) are harder. They’re loaded asynchronously, they can change size without warning, and they often resist optimisation because you don’t control the source.

Budget third-party scripts separately:

  • Track the total weight of third-party JavaScript
  • Set a budget for third-party script count (each script is a potential performance and security risk)
  • Use requestIdleCallback or dynamic imports to defer non-critical third-party scripts
  • Review the third-party budget quarterly and remove scripts that aren’t providing measurable value
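The first two bullets, total weight and script count, are mechanical to check. A sketch, where the script inventory and both budget numbers are hypothetical:

```javascript
// Hypothetical inventory of third-party scripts and their
// compressed sizes in KB.
const thirdPartyScripts = [
  { name: 'analytics', kb: 45 },
  { name: 'ab-testing', kb: 80 },
  { name: 'feature-flags', kb: 25 },
  { name: 'session-replay', kb: 120 },
];

// Check the inventory against separate count and weight budgets.
function auditThirdParty(scripts, { maxCount, maxTotalKb }) {
  const totalKb = scripts.reduce((sum, s) => sum + s.kb, 0);
  return {
    totalKb,
    count: scripts.length,
    overCount: scripts.length > maxCount,
    overWeight: totalKb > maxTotalKb,
  };
}

auditThirdParty(thirdPartyScripts, { maxCount: 5, maxTotalKb: 200 });
// → { totalKb: 270, count: 4, overCount: false, overWeight: true }
```

Keeping the inventory in a committed file also gives the quarterly review a concrete artefact: every script has a name, a size, and an owner who can justify it.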

This is especially important when integrating tools like A/B testing platforms and behavioural analysis software. Each tool adds weight. The cumulative impact needs to be measured and budgeted, not ignored.

Make performance visible

Put your current metrics on a dashboard that the team sees regularly. Not buried in a CI log. On a screen, in a Slack channel, in the team’s routine discussions. Visibility creates accountability without process.

The dashboard should show:

  • Current Core Web Vitals from RUM data (the 75th percentile, which is what Google uses for ranking)
  • Bundle size trend over the last 30 days
  • A list of recent budget violations with links to the PRs that caused them

When engineers can see the performance trend, they start caring about it naturally. Performance becomes a shared responsibility rather than something that one engineer champions and everyone else ignores.

RUM vs. synthetic: you need both

Synthetic testing (Lighthouse CI in your pipeline) is fast, consistent, and catches regressions before they ship. But it runs in a controlled environment that doesn’t represent real users.

Real User Monitoring captures performance as actual users experience it. Different devices, different network conditions, different content states. RUM will surface problems that synthetic tests miss: slow interactions on low-end Android devices, layout shifts caused by dynamically loaded content, LCP regressions that only appear when the CDN cache is cold.

Use synthetic tests as your CI gate: fast feedback, consistent results, catches the obvious regressions. Use RUM as your source of truth: real performance, real users, real impact on business metrics.

The cultural shift

Performance budgets are ultimately a cultural tool, not a technical one. They communicate that performance is a feature, not an afterthought. The specific numbers matter less than the commitment to measuring, enforcing, and improving over time.

The best performance cultures I’ve seen treat budget violations like test failures: the build is red until someone fixes it. Not “we’ll address it next sprint.” Not “it’s only 5KB over.” Red means red. This sounds rigid, but it’s actually liberating. It takes performance debates off the table. The budget is the budget. If you need to exceed it, make the case, update the budget explicitly, and document why.