You ran Lighthouse. You got a 94. You shipped it. Three weeks later, someone pings you to say the page feels slow. You run Lighthouse again. Still a 94. Both things are true, and that’s exactly the problem.

Web performance is one of those areas where the tooling can make you feel great about a situation that your users are actively suffering through. The gap between “my score is fine” and “my site is fast” is wider than most people realise, and crossing it requires understanding not just what to measure, but where and when. Get that wrong, and you’re not measuring performance. You’re measuring your ability to game a benchmark.

This article isn’t going to tell you what Core Web Vitals are. There are plenty of resources for that. Instead, it’s going to walk you through how to build a measurement stack that actually reflects reality, and why the one you probably have right now doesn’t.

The False Confidence Problem

The dirty secret of web performance tooling is that the most accessible tools are also the least representative. Lighthouse ships with Chrome DevTools, runs in seconds, and produces a satisfying score out of 100. Even with its built-in throttling, it is also running on your MacBook Pro, on a fast connection, on your local machine’s CPU, in a tab you just opened specifically to test it.

Your users are not doing that.

They’re on mid-range Android devices with slow CPUs, on mobile networks that drop out in lifts, with seventeen other tabs open and three Chrome extensions running. The Lighthouse score you’re optimising for comes from a controlled lab environment. The experience you’re actually responsible for is anything but controlled.

This isn’t a reason to stop using Lighthouse. It’s a great tool for catching obvious problems and getting directional feedback quickly. The mistake is treating it as ground truth. A 94 in a lab doesn’t mean your users are having a good time. It means you did well in the lab.

Why Regressions Keep Happening

Here’s something I’ve seen more times than I’d like: a page scores well in CI, deploys cleanly, and then quietly gets slower over the next few weeks as other changes land around it. Nobody notices until someone actually uses the thing and mentions it in passing.

The reason this keeps happening is usually one of a few things.

The common thread is timing. These regressions are findable. They just need to be found at the right moment, in the right place.

The Measurement Stack

The goal is to catch problems as close to the source as possible, which means having tools at multiple points in the lifecycle. Here’s how to think about it, ordered by how fast they give you feedback.

Local Lighthouse and DevTools

Useful for spotting obvious problems, profiling specific interactions, and getting a quick sanity check before opening a pull request. Not useful for representing real user experience, tracking trends, or catching anything subtle.

Treat Lighthouse as a development aid, not a performance verdict. Run it often, take the numbers loosely, and resist the urge to chase the score. The more interesting tool is sitting right next to it.

The Performance tab in Chrome DevTools doesn’t get nearly enough attention. While everyone is fixated on their Lighthouse score, the Performance tab gives you a full trace of exactly what the browser is doing during a page load or interaction: every network request, every layout and paint event, every long task that’s blocking the main thread. You can see precisely which JavaScript is responsible for a slow INP, where your LCP candidate is being delayed, and whether anything is forcing an unexpected reflow.

The flame chart takes a little getting used to, but once it clicks it’s one of the most powerful diagnostic tools available. A Lighthouse score tells you something is slow. The Performance tab tells you why.
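
To make “long task” concrete, here is a minimal sketch of the arithmetic behind what the flame chart shows you. A task that occupies the main thread for more than 50 ms counts as long, and the time beyond 50 ms is the part that blocks input. The `TraceTask` shape, the function names, and the example durations are my own assumptions for illustration, not anything DevTools exports.

```typescript
// Tasks longer than 50 ms are "long tasks"; only the time beyond
// 50 ms counts as blocking.
const LONG_TASK_THRESHOLD_MS = 50;

// Hypothetical shape: one main-thread task read off a trace.
interface TraceTask {
  label: string;      // e.g. the script or function that ran
  durationMs: number; // how long it occupied the main thread
}

// Return only the long tasks, worst first, with their blocking portion.
function findLongTasks(
  tasks: TraceTask[]
): Array<TraceTask & { blockingMs: number }> {
  return tasks
    .filter((t) => t.durationMs > LONG_TASK_THRESHOLD_MS)
    .map((t) => ({ ...t, blockingMs: t.durationMs - LONG_TASK_THRESHOLD_MS }))
    .sort((a, b) => b.blockingMs - a.blockingMs);
}

// Sum of the blocking portions. This is a simplification of Total
// Blocking Time, which only counts tasks in a specific window of the load.
function totalBlockingMs(tasks: TraceTask[]): number {
  return findLongTasks(tasks).reduce((sum, t) => sum + t.blockingMs, 0);
}

// Invented example values.
const exampleTrace: TraceTask[] = [
  { label: "hydrate", durationMs: 180 },
  { label: "analytics.js", durationMs: 70 },
  { label: "render list", durationMs: 30 },
];
```

With these made-up numbers, `hydrate` contributes 130 ms of blocking time and `analytics.js` 20 ms, which is the kind of ranking the Performance tab lets you read off directly.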

Lighthouse CI and Pre-merge Checks

This is where Lighthouse becomes genuinely useful: not as a one-off check, but as an automated gate on every pull request. Most CI platforms have a way to run Lighthouse against a preview or staging URL as part of the pipeline. Wire it up with a performance budget that fails the build if key metrics drop beyond a threshold, and you’ve moved from “we’ll notice eventually” to “this can’t ship.”

The key thing is to run it against your staging environment rather than a local build, so you’re at least approximating production conditions. Even running Lighthouse against a preview deployment is a meaningful step up from a laptop benchmark.

Your budget.json defines the thresholds you care about. Set them to fail the build, not just warn. A budget you can override at will is not a budget, it’s a suggestion.

budget.json
[
  {
    "path": "/*",
    "timings": [
      { "metric": "largest-contentful-paint", "budget": 2500 },
      { "metric": "total-blocking-time", "budget": 300 }
    ]
  }
]
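
If it helps to see what “fail the build” amounts to, the gate is little more than a comparison. The sketch below is not how Lighthouse CI is implemented; it is an illustration that reuses the shape of the `timings` entries above. `MeasuredTimings`, `findBudgetViolations`, and the measured values are hypothetical.

```typescript
// Same shape as one entry in the "timings" array of budget.json.
interface TimingBudget {
  metric: string;
  budget: number; // milliseconds
}

// Hypothetical: measured values in milliseconds, keyed by metric name.
type MeasuredTimings = Partial<Record<string, number>>;

// Return one message per broken budget. An empty array means the gate passes.
function findBudgetViolations(
  budgets: TimingBudget[],
  measured: MeasuredTimings
): string[] {
  const violations: string[] = [];
  for (const { metric, budget } of budgets) {
    const value = measured[metric];
    if (value === undefined) {
      // A missing measurement is treated as a failure, not a silent pass.
      violations.push(`${metric}: no measurement recorded`);
    } else if (value > budget) {
      violations.push(`${metric}: ${value} ms exceeds budget of ${budget} ms`);
    }
  }
  return violations;
}

const exampleBudgets: TimingBudget[] = [
  { metric: "largest-contentful-paint", budget: 2500 },
  { metric: "total-blocking-time", budget: 300 },
];

// Invented run: LCP is over budget, TBT is fine.
const exampleViolations = findBudgetViolations(exampleBudgets, {
  "largest-contentful-paint": 3100,
  "total-blocking-time": 120,
});
```

A CI step built on this would exit with a non-zero status whenever the returned list is non-empty. That is the difference between a budget and a suggestion.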

If you want to go further, the @lhci/server package gives you a self-hosted dashboard that stores historical results and lets you compare scores across builds. It’s not essential to get started, but it fills a meaningful gap: your CI currently tells you whether a build passed or failed, but it doesn’t tell you whether your scores have been slowly drifting in the wrong direction over the past month. The server does. Self-hosting does add some infrastructure overhead, so it’s worth weighing up whether it’s justified for your team.
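
Drift is easy to describe and surprisingly easy to miss, so here is a rough sketch of the comparison a history of results makes possible. This is not the `@lhci/server` API. `BuildResult`, `detectDrift`, the window size, and the tolerance are all assumptions chosen for illustration.

```typescript
// Hypothetical record of one CI run.
interface BuildResult {
  build: string;
  lcpMs: number;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Compare the median of the most recent builds with the median of the
// builds just before them. Medians keep one noisy run from dominating.
function detectDrift(
  history: BuildResult[],
  windowSize: number,
  toleranceMs: number
): { drifting: boolean; deltaMs: number } {
  if (windowSize < 1 || history.length < windowSize * 2) {
    return { drifting: false, deltaMs: 0 }; // not enough history to judge
  }
  const recent = history.slice(-windowSize).map((b) => b.lcpMs);
  const baseline = history
    .slice(-windowSize * 2, -windowSize)
    .map((b) => b.lcpMs);
  const deltaMs = median(recent) - median(baseline);
  return { drifting: deltaMs > toleranceMs, deltaMs };
}

// Invented history: every build passes a 2500 ms budget, yet LCP creeps up.
const exampleHistory: BuildResult[] = [
  2000, 2050, 2100, 2150, 2200, 2250, 2300, 2350,
].map((lcpMs, i) => ({ build: `build-${i + 1}`, lcpMs }));

const exampleDrift = detectDrift(exampleHistory, 4, 100);
```

Every build in that invented history would pass the budget, but the recent median is 200 ms worse than the one before it. A pass/fail gate alone cannot show that.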

Lighthouse isn’t the only tool worth running in CI, though. Integration tests are a natural complement, and with Playwright you can go beyond functional correctness and start asserting on performance behaviour directly. This is particularly valuable for framework-level features that Lighthouse has no visibility into. If you’re using Next.js and relying on ISR, for example, verifying that pages are actually being served from cache is something worth testing explicitly. A misconfigured revalidation period or a missing Cache-Control header won’t show up in a Lighthouse score, but it will silently tank your TTFB for every user who hits an uncached page.

performance.spec.ts
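import { test, expect } from '@playwright/test';

test('ISR page is served from cache', async ({ request }) => {
  // Relative URL: assumes baseURL is set in your Playwright config.
  const response = await request.get('/your-isr-page');
  // Next.js reports the cache status of the page in this header.
  const cacheHeader = response.headers()['x-nextjs-cache'];
  expect(cacheHeader).toBe('HIT');
});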

The same principle applies to other performance-critical features: preload headers for critical assets, correct Vary headers for content negotiation, or ensuring that your CDN is actually caching what you think it is. These are things Lighthouse will never catch, but a targeted integration test will, and they belong in your CI pipeline alongside your budget checks.
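
As a sketch of what one of those targeted checks might look like, here is a small helper that inspects response headers for signs that a shared cache will not store the page. The name `findCachingProblems` and the specific rules are my assumptions, so adjust them to whatever your CDN actually honours. It expects lower-cased header names, as in the test above.

```typescript
// Hypothetical helper: returns a list of reasons a CDN or other shared
// cache is unlikely to cache this response. Empty list means no red flags.
function findCachingProblems(headers: Record<string, string>): string[] {
  const problems: string[] = [];

  const cacheControl: string | undefined = headers["cache-control"];
  if (!cacheControl) {
    problems.push("missing Cache-Control header");
  } else {
    // Split "public, s-maxage=60" into ["public", "s-maxage=60"].
    const directives = cacheControl
      .split(",")
      .map((d) => d.trim().toLowerCase());
    if (directives.indexOf("no-store") !== -1) {
      problems.push("Cache-Control: no-store prevents caching");
    }
    if (directives.indexOf("private") !== -1) {
      problems.push("Cache-Control: private prevents shared caching");
    }
  }

  const vary: string | undefined = headers["vary"];
  if (vary !== undefined && vary.trim() === "*") {
    // Vary: * tells caches that no stored response can be reused.
    problems.push("Vary: * makes the response effectively uncacheable");
  }

  return problems;
}
```

Inside a Playwright test you could pass it `response.headers()` and assert that the result is an empty array.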

Letting Changes Soak in Staging

Before anything goes to production, let it sit in staging for a meaningful window: at least a few hours for routine changes, longer for anything touching the critical render path. This isn’t just about Lighthouse. It’s about giving your monitoring enough time to surface problems that a single test run would miss.

Intermittent SSR latency is a good example. An authenticated API call that’s usually fast might have a rough window during peak load, or degrade under certain cache conditions that only appear after a deployment has been running for a while. A CI check that runs once at deploy time won’t see it. Your RUM tool running against staging overnight will.
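
To illustrate why one sample is not enough, here is a minimal sketch that summarises TTFB readings collected over a soak window. The `TtfbSample` shape, the 800 ms threshold, and every number in the example are invented for illustration.

```typescript
// Hypothetical reading taken by whatever monitoring runs against staging.
interface TtfbSample {
  takenAt: string;
  ttfbMs: number;
}

interface SoakSummary {
  total: number;
  slowCount: number;
  slowRate: number; // share of samples over the threshold, 0 to 1
  worstMs: number;
}

// Summarise a soak window: how often was the response slow, and how bad
// was the worst case?
function summariseSoak(
  samples: TtfbSample[],
  slowThresholdMs: number
): SoakSummary {
  const slowCount = samples.filter((s) => s.ttfbMs > slowThresholdMs).length;
  const worstMs = samples.reduce((max, s) => Math.max(max, s.ttfbMs), 0);
  return {
    total: samples.length,
    slowCount,
    slowRate: samples.length === 0 ? 0 : slowCount / samples.length,
    worstMs,
  };
}

// Invented soak window. A check at deploy time sees only the first reading.
const soakSamples: TtfbSample[] = [
  { takenAt: "14:00", ttfbMs: 120 },
  { takenAt: "16:00", ttfbMs: 140 },
  { takenAt: "18:00", ttfbMs: 2400 }, // peak load
  { takenAt: "20:00", ttfbMs: 130 },
  { takenAt: "22:00", ttfbMs: 1900 }, // cache expired
];

const soakSummary = summariseSoak(soakSamples, 800);
```

The deploy-time reading of 120 ms looks healthy. The window as a whole has two slow responses out of five and a worst case of 2400 ms.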

Staging soak time is cheap insurance. The main failure mode is teams that treat staging as a formality, something to pass through as quickly as possible on the way to production. The value is in the monitoring you have running while the change is there, not the act of deploying to it.

Real User Monitoring in Production

This is where the story changes. Everything above is lab data: controlled conditions, simulated users, consistent hardware. A RUM tool gives you field data: actual Core Web Vitals from actual users on actual devices and networks. Options range from purpose-built solutions like Grafana Faro and Datadog RUM to platform-native offerings like Vercel Analytics or Cloudflare Analytics if you’re already in that ecosystem. The underlying mechanism is similar across all of them: a small script in the page, often built on the web-vitals JavaScript library, captures LCP, CLS, INP, FCP, and TTFB from real page visits and sends them somewhere you can query.
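
If you are curious what that mechanism looks like without a vendor in the way, here is a minimal sketch of the reporting side. `VitalReport` mirrors a subset of the fields the web-vitals library passes to its callbacks. `PageContext`, `toRumPayload`, and the `/rum` endpoint in the comment are hypothetical names for illustration.

```typescript
// Subset of the metric object that web-vitals hands to its callbacks.
interface VitalReport {
  name: string; // e.g. "LCP", "CLS", "INP"
  value: number;
  rating: "good" | "needs-improvement" | "poor";
  id: string; // unique per metric per page load
}

// Hypothetical extra context you would want to slice by later.
interface PageContext {
  path: string;
  deviceCategory: string;
}

// Serialise one metric plus its context into a body for your collector.
function toRumPayload(metric: VitalReport, context: PageContext): string {
  return JSON.stringify({
    name: metric.name,
    value: metric.value,
    rating: metric.rating,
    id: metric.id,
    path: context.path,
    deviceCategory: context.deviceCategory,
  });
}

// In the browser, assuming the web-vitals package and a /rum endpoint:
//   import { onLCP, onINP, onCLS } from "web-vitals";
//   const send = (m: VitalReport) =>
//     navigator.sendBeacon("/rum", toRumPayload(m, context));
//   onLCP(send); onINP(send); onCLS(send);
```

The context fields are what make the data useful later. Without them you have an average, and an average hides exactly the users who are suffering.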

The difference in what you see can be humbling. Pages that look healthy in CI develop a long tail of slow LCP readings when you look at real user data by device category. SSR endpoints that pass all your benchmarks turn out to have occasional multi-second response times that never appear in a controlled test. That API call that’s usually fast? It’s not always fast.

The slicing matters too. A page that performs well on desktop in London might be a different story on mobile in a region with worse network infrastructure. Lab tools will never show you that. RUM will, and it’ll show you exactly which segment is suffering.
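
Here is a rough sketch of that slicing, assuming you have raw LCP samples tagged with device and country. Core Web Vitals are assessed at the 75th percentile, so that is the figure computed per segment. The `FieldSample` shape, the function names, and the sample values are invented for illustration. Your RUM tool will have its own query language for this.

```typescript
// Hypothetical raw sample as stored by a RUM collector.
interface FieldSample {
  deviceCategory: string;
  country: string;
  lcpMs: number;
}

// Nearest-rank percentile: the smallest value with at least p% of the
// samples at or below it. Expects a non-empty array.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Group samples by "device/country" and compute p75 LCP for each group.
function p75LcpBySegment(samples: FieldSample[]): Record<string, number> {
  const groups: Record<string, number[]> = {};
  for (const s of samples) {
    const key = `${s.deviceCategory}/${s.country}`;
    (groups[key] = groups[key] || []).push(s.lcpMs);
  }
  const result: Record<string, number> = {};
  for (const key of Object.keys(groups)) {
    result[key] = percentile(groups[key], 75);
  }
  return result;
}

// Invented samples for two segments.
const fieldSamples: FieldSample[] = [
  ...[1200, 1400, 1500, 1800].map((lcpMs) => ({
    deviceCategory: "desktop",
    country: "GB",
    lcpMs,
  })),
  ...[2600, 3100, 3900, 4800].map((lcpMs) => ({
    deviceCategory: "mobile",
    country: "IN",
    lcpMs,
  })),
];

const segmentP75 = p75LcpBySegment(fieldSamples);
```

With these made-up samples, `desktop/GB` comes out at 1500 ms and `mobile/IN` at 3900 ms. Blend them together and you would never see that one segment is well past the 2500 ms mark.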

Auditing with AI

The tools above all measure what’s happening at runtime. A complementary approach is to go looking for problems in the codebase itself before they ever make it to a browser. Most AI coding tools, such as Claude or GitHub Copilot, are capable of auditing your code for common Core Web Vitals mistakes if you give them the right context and a focused prompt.

The key is specificity. Asking an AI to “check for performance issues” will produce vague results. A reusable skill, something you can save and share with your team, changes that. Here’s a starting point:

/audit-core-web-vitals
Review the provided code for issues that could negatively impact Core Web Vitals. Focus on:

- Images missing `priority` or `fetchpriority` attributes above the fold
- Render-blocking scripts or stylesheets on the critical path
- Synchronous data fetching during SSR that could increase TTFB
- Large dependencies being imported on the critical path
- Components likely to cause layout shift (missing dimensions, dynamic content insertion)
- Any third-party scripts loaded without `async` or `defer`

For each issue found, explain which metric it affects (LCP, CLS, INP, TTFB) and suggest a fix.

It won’t catch everything, but it will catch the common mistakes that CI tends to miss because each one is too small to trip a budget threshold on its own.

If you’re already using an MCP server to give your AI tools context about your design system or codebase, this kind of audit becomes significantly more accurate. An AI that understands your component library can spot a design system image component being used without the right props far more reliably than one working from first principles.

For runtime analysis, the Chrome DevTools MCP server takes things further. Rather than reading static code, your AI agent can connect to a live Chrome instance, record a performance trace, and inspect network requests directly. Instead of guessing what might be slow, you point it at a running page and ask what actually is.

Alerting: Making Regressions Someone’s Problem

Having the data is one thing. Acting on it quickly enough to matter is another. The difference is alerting, and getting alerting right takes a bit more thought than just turning on every notification your RUM tool will give you.

Conclusion

The score is not the point. The score is a proxy: a useful one when used in the right context, a misleading one when treated as the whole story. Build a stack that covers the full lifecycle, catch obvious regressions before they merge, let changes soak in staging with real monitoring running, and track what actually happens when real users hit production.

Do that, and “the page feels slow” stops being a mystery you investigate after the fact and starts being a signal you caught three days ago.