Detecting the running site from streaming logs — why log-order inference broke and how one marker fixed it

In a screen that runs maintenance across multiple sites in sequence, you want to show progress live: a blue border on “the site being processed now,” a green border on “completed sites.” Common UI.

The approach we reached for first was “watch the streaming logs the backend emits, and infer which site is being processed.” It turned out to be surprisingly tricky — three rounds of fixes before it stabilized. This post records that trial-and-error as a design log.

The first design — “site switched, so mark the previous one done”

Logs arrive per-site as lines like [Site name] Starting maintenance. The first implementation was simple:

// Initial implementation (later found to be the root of the bugs)
function _setRunningSiteId(siteId) {
    if (_runningSiteId && _runningSiteId !== siteId) {
        _markCompleted(_runningSiteId);  // switched to another site = previous is done
    }
    _runningSiteId = siteId;
}

The inference: “a new site name appeared in the log = the previous site finished.” Intuitively plausible.

Bug 1 — log “rewind” marks an unstarted site as done

Real-world testing surfaced this: “select two sites, run, and the second site that hasn’t even started yet turns green (done).”

The cause was that the streaming log sometimes interleaves multiple site names in a non-monotonic order.

t1: log shows [Site A]        → _runningSiteId = A
t2: log shows [Site B]        → mark A done + _runningSiteId = B
t3: log shows [Site A] again  → mark B done + _runningSiteId = A

At t3, B — which hasn’t started processing — gets falsely marked “done.” The “switched, so previous is done” rule was fragile against the log’s non-monotonic ordering.

Fix 1 — reject rewind with an execution-order index

On run, we save the “planned site ID array (in execution order)” and ignore any switch whose index moves backward.

let _runningSiteIdOrder = [];   // planned site_id array (execution order)
let _runningSiteIdIndex = -1;   // current index

function _setRunningSiteId(siteId) {
    const newIdx = _runningSiteIdOrder.indexOf(siteId);
    if (newIdx < _runningSiteIdIndex) {
        return;  // rewind = ignore (e.g. index=0 log reappears while index=1 is running)
    }
    _runningSiteIdIndex = newIdx;
    _runningSiteId = siteId;
}

We also moved completion marking off the switch event entirely — completion now gets applied to the whole execution-order array at the end of the run. We separated “inferring what’s running” from “confirming what’s done.”

Bug 2 — the real root cause was the init loop’s pre-emitted logs

Even with rewind defense, another failure recurred: “A is running, but B shows as running.“

Scrutinizing the logs revealed the real root cause. The backend (maintenance_agent.py) was emitting init logs for all target sites before any actual maintenance processing began:

[Site A]   Injecting connection info from SSH profile
[Site B]   Injecting connection info from SSH profile
[Site C]   Injecting connection info from SSH profile
[Site A]   Starting maintenance   ← actual processing begins here
[Site A]   Creating backup
...

The old detection logic picked candidates from “every line containing [...],” so it falsely detected the last site in the init loop (C, the last injection log) as “running.” That was the real story behind “A is running but B/C show as running.” Rewind defense helped the ordering problem, but was powerless against “we’re reading the wrong lines in the first place.”

Fix 2 — the marker approach (narrow to one line type)

The fix was almost anticlimactically simple. Detect only the “Starting maintenance” line, which appears exactly once at the start of each site’s real processing.

const _START_MARKER = /メンテナンスを開始|Starting maintenance/;

function _detectRunningSiteFromLog(lines) {
    for (let i = lines.length - 1; i >= 0; i--) {
        if (!_START_MARKER.test(lines[i])) continue;  // ignore non-marker lines
        const siteName = extractBracketName(lines[i]);
        // ... match against allSites, then _setRunningSiteId
        return;
    }
}

Init lines like [Site B] Injecting SSH profile are now completely ignored; _runningSiteId only moves the instant [Site A] Starting maintenance appears. The shift was from “infer state from the whole log” to “pick one clear marker that represents a state transition, and watch only that.“

Bug 3 — the completion reset was too aggressive

The last one is smaller. We had a “clear all completion marks (green borders) when a maintenance run starts” step, but a user pointed out: “shouldn’t sites maintained within the last 24 hours stay green?”

Right — we should have reset only the sites in the current run’s scope, but we were also wiping the green borders of out-of-scope sites (ones recently maintained on a different occasion). Limiting the reset to the execution scope fixed it. A “clear state” step always needs its clear-scope kept in mind, or it sweeps up unrelated state.

Lessons — inferring state from streaming logs

Three principles from three rounds of fixes:

Don’t infer state from the whole log — pick one clear marker. Loose inference like “a name appeared = it’s running” is fragile against non-monotonic log output (pre-emitted init logs, interleaved names). A one-to-one “this line means this state” marker removes the inference ambiguity
Separate “inferring what’s running” from “confirming what’s done.” Marking the previous site done at switch time means a mis-detected switch also mis-marks completion. Confirming completion in one batch at run-end keeps it out of the inference jitter
Always scope your state clears. “Reset everything” is easy, but it sweeps up unrelated state. Limit the clear to the execution scope

Behind the modest request “show progress live” lurked the difficulty of inferring state from an asynchronous data source. When you’re wiring up “watch logs, reflect state in the UI,” building with the marker approach from the start is the shortcut that avoids three rounds of rework.