Patching CVE-2026-44578: Next.js SSRF, Why It Mattered for Smart Island, and the Build Saga

The CVE

CVE-2026-44578 dropped 15 May 2026: a Server-Side Request Forgery (SSRF) in Next.js, CVSS 8.6 (High). The flaw is in the built-in Node.js server's handling of WebSocket upgrade requests — a crafted upgrade can cause Next.js to proxy arbitrary HTTP requests through itself to internal endpoints.

The interesting part of the attack pattern: it doesn't need an authenticated user, doesn't need a misconfigured app, doesn't need any custom code. Just an unpatched Next.js in one of the affected ranges (13.4.13 to 15.5.15, or 16.0.0 to 16.2.4), behind anything that lets HTTP/1.1 upgrades through.

Affected: Next.js 13.4.13 → 15.5.15, and 16.0.0 → 16.2.4
Fixed: 15.5.16 / 16.2.5
Not affected: Vercel-hosted (Vercel's proxy strips the relevant header before the app sees it)
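To check an installed version against those ranges, a minimal sketch follows. The function names are illustrative, the bounds are copied from the list above, and pre-release suffixes are deliberately not handled.

```typescript
// Affected ranges as listed above (inclusive bounds).
type Version = [number, number, number];

const AFFECTED_RANGES: Array<{ from: Version; to: Version }> = [
  { from: [13, 4, 13], to: [15, 5, 15] },
  { from: [16, 0, 0], to: [16, 2, 4] },
];

// Parses "15.5.12" into [15, 5, 12]. Only plain MAJOR.MINOR.PATCH strings are accepted.
function parseVersion(text: string): Version {
  const parts = text.trim().split(".");
  if (parts.length !== 3 || !parts.every((part) => /^\d+$/.test(part))) {
    throw new Error(`Unsupported version string: ${text}`);
  }
  return [Number(parts[0]), Number(parts[1]), Number(parts[2])];
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: Version, b: Version): number {
  for (let i = 0; i < 3; i += 1) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// True when the version falls inside any affected range.
function isAffected(versionText: string): boolean {
  const version = parseVersion(versionText);
  return AFFECTED_RANGES.some(
    (range) =>
      compareVersions(version, range.from) >= 0 &&
      compareVersions(version, range.to) <= 0,
  );
}
```

For example, isAffected("15.5.12") is true and isAffected("15.5.16") is false.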

This is the second case in recent memory of "self-hosting Next.js = bigger CVE surface than Vercel". Worth pricing in if you self-host: the platform vendor's edge often catches things the framework itself doesn't.

Why we cared specifically

Smart Island runs self-hosted on Azure: a single Linux VM, pm2 supervising the Next.js server, Nginx in front, Let's Encrypt SSL. We were on 15.5.12 — comfortably inside the vulnerable range.

What an attacker could have done with SSRF on this stack:

Reach the Azure Instance Metadata Service at 169.254.169.254. That's the endpoint that returns the VM's identity tokens — Azure-issued OAuth tokens for whichever managed identity the VM is configured with. If those tokens carry permission to read Storage or Key Vault, the attacker gets read access to whatever the VM has.
Reach the internal Azure SQL gateway. Our iom-smartisland.database.windows.net accepts connections from the VM's IP. SSRF turns the VM into the attacker for SQL reachability purposes.
Poke any other internal Azure service the NSG lets the VM reach.

None of those are guaranteed wins for the attacker: Azure's IMDS only answers requests that carry a Metadata: true header; SQL still requires auth; many services rate-limit. But CVSS 8.6 is High specifically because the vulnerability doesn't need much else to weaponise.
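To make "internal endpoints" concrete, here is a small sketch that classifies the IPv4 ranges an SSRF on a cloud VM is typically aimed at. The function name is hypothetical, and this is an illustration of the target space rather than a mitigation for the CVE; the fix is the upgrade.

```typescript
// Returns true for IPv4 addresses in loopback, link-local, or RFC 1918 private ranges.
function isInternalIPv4(address: string): boolean {
  const parts = address.split(".");
  if (parts.length !== 4 || !parts.every((part) => /^\d{1,3}$/.test(part))) {
    return false; // not a dotted-quad IPv4 address
  }
  const octets = parts.map((part) => Number(part));
  if (octets.some((octet) => octet > 255)) return false;

  const [a, b] = octets;
  if (a === 127) return true;                        // loopback, 127.0.0.0/8
  if (a === 169 && b === 254) return true;           // link-local, includes 169.254.169.254
  if (a === 10) return true;                         // private, 10.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return true;  // private, 172.16.0.0/12
  if (a === 192 && b === 168) return true;           // private, 192.168.0.0/16
  return false;
}
```

The metadata address 169.254.169.254 falls in the link-local range, which is why it is reachable from the VM but not from the public internet.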

So: drop everything, patch.

The patch

Trivial in principle:

cd packages/web
pnpm update next@^15.5.16
# resolves to 15.5.18 (latest 15.x)

Three lines in pnpm-lock.yaml change; commit; push; GitHub webhook fires; deploy.sh runs on the server; pm2 restarts the app. Should be a 30-second incident from CVE-aware to patched.

It was not a 30-second incident.

The build saga

This is the bit worth writing down honestly — because the gap between "the patch is trivial" and "the site is back up patched" was hours, and the reasons are instructive for anyone running a similar stack.

Hour 1 — the build wouldn't finish

The first deploy attempt was killed mid-build because pm2's autorestart loop on the old smartisland app was thrashing the box. Stopping smartisland for a deploy puts the process in pm2's "should be running, isn't, try again" state, and pm2 kept retrying next start against a half-cleared .next directory while the build was competing for the same CPU. The restart count had ticked into the hundreds of thousands before we noticed. Killing the rogue next start processes and pausing pm2's smartisland entry let the build breathe.

Hour 2 — Google Fonts timed out

Second attempt got further but failed during compile with:

app/layout.tsx
`next/font` error:
Failed to fetch `Inter` from Google Fonts.

next/font/google fetches the font CSS at build time and bundles the response. From this Azure region, fonts.googleapis.com was returning ETIMEDOUT for a stretch of the day. Eventually cleared on its own (transient Google outage or routing issue), but there's a TODO to switch the layout to next/font/local and bundle Inter directly so this never blocks a deploy again.

Hour 3 — Azure SQL ran out of DTUs during static page generation

This is the most interesting failure, and the one that took the longest to diagnose because the error message is misleading.

After compile cleared, the build moved to static page generation (the "Generating static pages (50/202)" phase). Failed almost immediately with:

prisma:error
Invalid `prisma.job.count()` invocation:
Can't reach database server at `iom-smartisland.database.windows.net:1433`.
Error code: P1001

Connectivity was fine. TCP to port 1433 succeeded. DNS resolved. prisma db pull from the same box worked perfectly in another terminal. The Prisma client could talk to Azure SQL all day.

What the error actually meant, and what Prisma's P1001 surfaces as "Can't reach database", was connection-limit exhaustion on the Azure SQL side. Our Azure SQL tier is sized in Database Transaction Units (DTUs). Low-DTU tiers throttle aggressively: when N parallel connections try to land in the same brief window, the gateway accepts the first few and refuses the rest. Prisma's resulting "can't reach" error is correct in the literal sense (the gateway said no), but it reads as "the database is down", which is a much louder, scarier statement than "the database limited you, briefly".

The trigger: Next.js's static page generator spawns multiple worker processes to render pages in parallel. Each worker imports the page module, runs its data-loader, and for pages with heavy Prisma queries the data-loader fires multiple parallel Prisma calls via Promise.all. For example, our getStats() function (used on /stats and /ai-observatory) fires 18 parallel Prisma queries, a mix of prisma.job.count(...) and aggregate(...) calls, on a single page-load.
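The fan-out behaviour of Promise.all can be shown without a database. In this sketch fakeQuery is a stand-in for a Prisma call (an assumption; the real getStats() is not shown in this post), and the counter records how many calls are in flight at once.

```typescript
// Runs `taskCount` fake queries through Promise.all and reports the peak
// number that were in flight at the same moment.
async function peakInFlight(taskCount: number): Promise<number> {
  let inFlight = 0;
  let peak = 0;

  // Stand-in for something like prisma.job.count(...): starts, waits, finishes.
  const fakeQuery = async (): Promise<number> => {
    inFlight += 1;
    peak = Math.max(peak, inFlight);
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
    inFlight -= 1;
    return 1;
  };

  // Every call is started before any of them has finished.
  await Promise.all(Array.from({ length: taskCount }, () => fakeQuery()));
  return peak;
}
```

With 18 tasks the peak is 18: Promise.all does not queue or stagger anything, so all the calls are issued at once and it is left to whatever sits underneath (a connection pool, the database gateway) to cope.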

Pages × workers × queries-per-page = enough simultaneous connections that the DTU tier said no. Build aborts. No BUILD_ID written. Site stays on the vulnerable version.
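As a back-of-envelope version of that multiplication, here is a rough upper-bound estimate. The worker count and the pool limit in the usage note are hypothetical inputs, not measured values from our build, and the per-worker pool cap is an assumption about how the client is configured.

```typescript
// Rough upper bound on simultaneous connection attempts during static generation.
// A per-worker connection pool, if one is configured, caps each worker's share.
function estimatePeakConnections(
  workers: number,
  pagesInFlightPerWorker: number,
  parallelQueriesPerPage: number,
  poolLimitPerWorker: number = Number.POSITIVE_INFINITY,
): number {
  const wantedPerWorker = pagesInFlightPerWorker * parallelQueriesPerPage;
  return workers * Math.min(wantedPerWorker, poolLimitPerWorker);
}
```

With a hypothetical 4 workers each rendering one 18-query page, the estimate is 4 × 1 × 18 = 72 simultaneous attempts; a per-worker pool limit of 9 would halve that to 36.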

The fix

Four pages converted from ISR to fully-dynamic rendering:

Page	Heavy load
`/` (home)	`getDashboardData()` — 12 parallel queries
`/ai-observatory`	`getStats()` — 18 parallel queries
`/stats`	Same `getStats()`
`/observatory`	4 parallel counts + aggregates

A one-line directive, export const dynamic = "force-dynamic", removes a page from the build-time pre-render path. The page is now rendered per request; the DB queries happen at request time, not at deploy time. Note that force-dynamic also opts the route out of Next.js's Full Route Cache, so each request runs the queries unless a cache in front of the app absorbs it.
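A sketch of where the directive sits in a page module. The file path is illustrative, getStats is stubbed so the example stands alone, and the page returns a plain string instead of JSX to keep it free of React imports.

```typescript
// app/stats/page.tsx (sketch)

// Route segment config: render on every request, never at build time.
export const dynamic = "force-dynamic";

type Stats = { jobs: number };

// Stub standing in for the real data-loader, which queries Prisma.
async function getStats(): Promise<Stats> {
  return { jobs: 0 };
}

// With force-dynamic set, this runs per request, so the build never calls getStats().
export default async function StatsPage(): Promise<string> {
  const stats = await getStats();
  return `Jobs tracked: ${stats.jobs}`;
}
```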

Trade-off: slightly slower first-byte on those four pages. None of them are hot-traffic enough for the trade-off to matter.

After the patch the build sailed through 198 of 198 remaining pages (down from 202; the 4 we converted dropped out of the static count). BUILD_ID appeared. [7/7] Restarting app... fired. pm2 brought the patched Next.js up.

Site back. CVE closed.

What I'd do differently

Three follow-ups going on the backlog from this incident, in order of importance:

1. Stop pre-rendering pages from live database queries when precomputed JSON would do. The pages we patched all read live from Prisma at build time. They could just as easily read from precomputed JSON files (the way /education/courses and /education/adaptive-capacity already do). A scraper pipeline writes dashboard-stats.json once an hour; pages read it via fs.readFileSync; build is decoupled from Azure SQL entirely. Long-term, this is the right architecture: DB is for queries, JSON is for snapshots.
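A minimal sketch of the snapshot reader that item 1 describes. The DashboardStats shape and both function names are assumptions for illustration; the real contents of dashboard-stats.json are not specified in this post.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical snapshot shape; the real fields may differ.
interface DashboardStats {
  generatedAt: string;            // ISO timestamp written by the pipeline
  counts: Record<string, number>; // named counters, e.g. { jobs: 1234 }
}

// Validates the raw JSON text so a malformed snapshot fails loudly.
function parseDashboardStats(raw: string): DashboardStats {
  const parsed = JSON.parse(raw) as Partial<DashboardStats>;
  const counts = parsed.counts;
  if (
    typeof parsed.generatedAt !== "string" ||
    typeof counts !== "object" ||
    counts === null ||
    Array.isArray(counts)
  ) {
    throw new Error("Malformed dashboard stats snapshot");
  }
  return { generatedAt: parsed.generatedAt, counts };
}

// Reads the snapshot from disk; no database connection is involved.
function readDashboardStats(filePath: string): DashboardStats {
  return parseDashboardStats(readFileSync(filePath, "utf8"));
}
```

A page would call readDashboardStats with the path to the snapshot file, so a build never has to open a connection to Azure SQL for these numbers.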

2. Decouple pm2 from deploy.sh's stop/start. Our deploy script does pm2 stop → clear build → re-build → pm2 restart. Between the stop and the restart, pm2's autorestart sees the app as "should be running, isn't" and retries. The fix is either pm2 delete (re-created in step 7) or setting min_uptime + max_restarts in the ecosystem config so a fast-failing app gives up after 10 attempts instead of looping forever.
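The second option in item 2 could look roughly like this. The option names (autorestart, min_uptime, max_restarts) are pm2's; the values, the script path, and the cwd are assumptions rather than our actual config.

```typescript
// Sketch of the app entry for pm2's ecosystem file.
const smartislandApp = {
  name: "smartisland",
  script: "node_modules/next/dist/bin/next", // assumed path to the Next.js CLI
  args: "start",
  cwd: "packages/web",                       // assumed working directory
  autorestart: true,
  min_uptime: 10000, // ms the process must stay up to count as a stable start
  max_restarts: 10,  // consecutive unstable restarts before pm2 marks the app errored
};

// In ecosystem.config.js this object would be the module's export.
const ecosystem = { apps: [smartislandApp] };
```

With these two settings a process that keeps dying within ten seconds is retried ten times and then left in the errored state, instead of being restarted indefinitely while a build is running.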

3. Self-host Inter via next/font/local. Drop the Google Fonts build-time fetch entirely. Bundle the WOFF2 in the repo. Done.

The honest take

A 30-second patch turned into a multi-hour incident because every shortcut taken in the past (pm2 autorestart on, Google Fonts at build time, heavy Prisma queries during static generation) was fine until something else went wrong. Each one was a known compromise we'd accepted. The CVE rolled them all up at once.

If you self-host Next.js, today's a good day to look at your build pipeline and ask: which of my known compromises would all fire if I had to deploy in the next 15 minutes? That's the list to chip away at before the next CVE forces it.

Smart Island is patched and serving on Next.js 15.5.18 as of 16 May 2026. CVE-2026-44578 closed.

— Manning Aguirre's adaptive-capacity framework would call this an absorption gap: the supply tap (a deployment infrastructure that can land patches in seconds) exists, but the cohort it serves grew faster than the tap. Worth fixing.