After a core rollback, halt the rest — a safety design we arrived at the hard way

In WordPress maintenance automation, you inevitably run into points where you have to decide: keep going, or stop right here? One that took us a long time to get right was this: when a WordPress core update goes wrong and gets rolled back, should the remaining plugin updates continue, or stop?

We eventually switched to the “stop” design, but we started with “keep going” — and several traps surfaced only after running it in production. Here’s how the redesign happened.

Three cases to separate

The outcome of a core update, viewed through a rollback lens, falls into three patterns:

Case 1: Core rollback succeeded, site recovered. The site is healthy again after the rollback (RB).
Case 2: Core rollback succeeded, but site did not recover. The RB ran, but the site is still broken.
Case 3: Core rollback itself failed. The RB couldn’t even complete.
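
As a minimal sketch, the three cases can be derived from two observations: whether the rollback completed, and whether the site is healthy afterwards. The names `CoreOutcome` and `classify_core_rollback` are hypothetical and only illustrate the split; they are not taken from the actual tool.

```python
from enum import Enum


class CoreOutcome(Enum):
    """Hypothetical labels for the three rollback cases."""
    RECOVERED = 1        # Case 1: rollback succeeded, site healthy again
    STILL_BROKEN = 2     # Case 2: rollback succeeded, site still broken
    ROLLBACK_FAILED = 3  # Case 3: rollback itself did not complete


def classify_core_rollback(rollback_ok: bool, site_healthy: bool) -> CoreOutcome:
    """Map the two observations onto one of the three cases."""
    if not rollback_ok:
        # Assumption: an incomplete rollback is treated as Case 3
        # regardless of what the health probe reports.
        return CoreOutcome.ROLLBACK_FAILED
    if site_healthy:
        return CoreOutcome.RECOVERED
    return CoreOutcome.STILL_BROKEN
```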

Case 1 is clearly “keep going,” and Cases 2 and 3 are clearly “abnormal.” But deciding what the run should do after an abnormal outcome isn’t as simple as that framing suggests.

The old design — disable the HTTP check and continue

The original design kept maintenance running through Cases 2 and 3:

_skip_http_check = True   # disable HTTP check after a core anomaly, keep going
# remaining plugin / theme / translation updates still run

The reasoning was: “Once core is broken, the HTTP check after any plugin update will of course return 5xx. So disable the HTTP check, and we won’t mistakenly roll back unrelated plugins.”

In practice, this did reduce false-positive rollbacks. But as the tool ran in real environments, two problems emerged.

Two problems that surfaced

Problem (a): broken state plus piled updates = untraceable

If 20 plugins are updated while core is still broken, the log records “20 updates succeeded.” The site is still broken, but the log reads as healthy.

The next day, when the agency tries to trace “where did it break?” — there’s no way to tell whether core was the cause, one of the later updates was, or some combination of them. A safety mechanism intended to reduce noise was actually inflating investigation cost.

Problem (b): genuine plugin failures became invisible

Setting _skip_http_check = True disables the HTTP check uniformly — including for plugin-side bugs that have nothing to do with core (memory leaks, dependency conflicts, PHP version incompatibility).

What was supposed to be “skip the HTTP check while core is broken” was actually “make all anomalies in this window invisible.” That’s equivalent to intentionally disabling a safety device.
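
Both problems can be reproduced in a small simulation. This is a sketch under stated assumptions: `run_plugin_updates` and `probe_statuses` are hypothetical stand-ins for the real update loop and HTTP probe, and the status codes are made up for illustration.

```python
def run_plugin_updates(probe_statuses, skip_http_check):
    """Return (plugin, outcome) log entries for one maintenance run.

    probe_statuses maps a plugin name to the HTTP status a post-update
    probe would see (hypothetical stand-in for a real HTTP check).
    """
    log = []
    for name, status in probe_statuses.items():
        if skip_http_check or status < 500:
            # With the check skipped, every update is logged as a success.
            log.append((name, "success"))
        else:
            log.append((name, "rolled_back"))
    return log


# Core is broken, so every probe returns 500, whether or not the
# plugin itself also has a bug of its own.
statuses = {"plugin-a": 500, "plugin-b": 500, "plugin-c": 500}
masked_log = run_plugin_updates(statuses, skip_http_check=True)
checked_log = run_plugin_updates(statuses, skip_http_check=False)
```

With the check skipped, `masked_log` reports three successes on a broken site, and a plugin with its own bug looks identical to a healthy one. With the check enabled, `checked_log` rolls back all three, which is the false-positive behavior the flag was meant to avoid. Neither option yields a usable signal while core is broken.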

The new design — halt in Cases 2 and 3

Based on these problems, Cases 2 and 3 now stop all subsequent plugin / theme / translation updates entirely.

if _halt_remaining:  # set to True in Case 2 / 3
    # record step_rollbacks first, then early return
    return

The key is that this isn’t just “stop and walk away”:

step_rollbacks records are kept — the full record of what happened stays in the log
The outer visual_check / browser_automation / email notification still run — final HTTP confirmation and the alert to the agency are still guaranteed
Reports are still generated after the early return — the message “stopped after core rollback” appears in the client-facing report too
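
The three points above can be sketched as an inner step that returns early and an outer function that always finishes its reporting. Everything except the identifier `step_rollbacks` is an assumption made for illustration: the function names, the record fields, and the boolean placeholders standing in for the visual check and the notification.

```python
def run_remaining_updates(halt_remaining, pending_updates, step_rollbacks):
    """Inner step: record the halt first, then return early."""
    if halt_remaining:
        step_rollbacks.append(
            {"step": "remaining_updates", "status": "halted after core rollback"}
        )
        return []
    return [f"updated {name}" for name in pending_updates]


def run_maintenance(core_abnormal, pending_updates):
    """Outer flow: the final stages run whether or not the inner step halted."""
    step_rollbacks = []
    if core_abnormal:
        step_rollbacks.append({"step": "core", "status": "rolled back"})

    applied = run_remaining_updates(core_abnormal, pending_updates, step_rollbacks)

    # Placeholders for the outer stages; a real run would call the
    # visual check and send the notification here.
    return {
        "applied": applied,
        "step_rollbacks": step_rollbacks,
        "visual_check_ran": True,
        "notification_sent": True,
    }
```

Because the early return happens only in the inner step, the report in this sketch always contains the rollback records, including the halt entry.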

Case 1 (rollback succeeded + recovery confirmed) continues as before. The site is healthy again, so the precondition for safely running the remaining plugin updates with HTTP checks is intact.

The trade-off we accepted

This change means “if core breaks today and the rollback doesn’t restore the site, all subsequent updates for the day stop.” A scheduled batch of 20 plugin updates gets deferred to the next maintenance run if a single core rollback ends in Case 2 or 3. In the short term, that feels inconvenient.

But in real operation:

The agency receives a clear signal: “fix core before retrying”
The next maintenance job resumes from a healthy state
The log carries an unambiguous “halted after core rollback” trace

Together, at the cost of skipping that day’s plugin updates, these three make a run far easier to trace than the alternative of silently continuing while the site is broken.

Takeaway — safety devices need to be left on

Designs that “disable the check during abnormal conditions” can look clever but tend to make anomalies invisible. Stopping the moment something abnormal is detected, and handing the decision back to the agency, generally gives more predictable behavior across the workflow.

When a maintenance-automation design choice is hard to settle, a useful heuristic is: don’t try to fix it automatically — communicate it clearly to a human. Surprisingly often, that’s what saves the operation.