When SSH commands hit a csh login shell — wrapping every command in /bin/sh -c across the codebase

One day a user reported an oddly asymmetric bug. In the “add new site” modal, picking an SSH profile and clicking “auto-detect WordPress install path” always failed with “no path found.” But clicking the WP-CLI path test button on the same SSH connection worked fine. Same credentials, same host — one succeeded, the other failed.

When we traced it down, the culprit was an old foe: csh / bash incompatibility on the server side. This post walks through the fix, the sweep for the same bug across the rest of the codebase, and the static-analysis test we added to keep it from coming back.

The smoking gun — `find: 2: unknown primary or operator`

The server-side error log gave it away:

find: 2: unknown primary or operator

find itself is POSIX-standard, but it was dying on a mysterious 2 argument. That 2 is the leading digit of 2>/dev/null. The login shell never treated 2> as a stderr redirect: csh cannot redirect a file descriptor by number, so it read 2 as an ordinary argument to find and >/dev/null as a plain stdout redirect.

Note: 2>/dev/null is the standard way to silently discard stderr in Bourne shell (sh) and bash. csh (C shell) uses different syntax and doesn’t recognize it.

Sakura Internet defaults users to csh

We’ve documented this before in the four-host investigation of why WP-CLI doesn’t run: on Sakura Internet (a Japanese hosting provider), the default user login shell is csh / tcsh, not bash.

This collides with how paramiko (Python’s SSH library) works: exec_command sends the command string to the server, which runs it through the user’s login shell. Sending find ... 2>/dev/null to a Sakura host therefore means csh, not sh, parses the line, and the redirect is misread. That’s the real error.

The bash/sh idioms that fall over on csh include:

2>/dev/null (redirect)
[ -f path ] (test syntax)
for X in ...; do ... done (loop)
cmd1 && cmd2 (short-circuit)
\( ... \) (subshell)

On csh these are either rejected or parsed differently than in sh, which surfaces as errors such as “unknown primary or operator” (reported by find, as above) or “Missing }”.

“I fixed one site, so they’re all fixed” — but they weren’t

This wasn’t our first encounter with this issue. A few release rounds earlier, we’d noticed test_ssh_profile (the SSH-connection test API) was failing on csh hosts and patched it by wrapping the command in /bin/sh -c '...':

# Already-fixed test_ssh_profile
result = c.run('/bin/sh -c ' + shlex.quote('echo ok'),
               hide=True, warn=True)

Wrapping a command in /bin/sh -c '...' means the login shell only has to launch /bin/sh with a single quoted argument. POSIX sh then interprets the command itself, regardless of what login shell the user is on. Small trick, big effect.
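To make the quoting step concrete, here is a minimal sketch using only Python’s standard library. The wrap_for_sh name and the sample find command are illustrative assumptions, not code from the project. shlex.split is used only to confirm locally that a POSIX shell would receive the command back as one intact argument.

```python
import shlex


def wrap_for_sh(cmd: str) -> str:
    """Return cmd wrapped so that /bin/sh, not the login shell, parses it.

    Hypothetical helper name, used here for illustration only.
    """
    # shlex.quote single-quotes the whole command and escapes any embedded
    # single quotes, so the outer shell passes it through as one argument.
    return "/bin/sh -c " + shlex.quote(cmd)


# Sample command (an assumption): contains both quotes and a stderr redirect.
raw = "find . -name 'wp-config.php' 2>/dev/null"
wrapped = wrap_for_sh(raw)
print(wrapped)

# Round-trip check: POSIX-style splitting yields exactly three words,
# and the third one is the original command, untouched.
assert shlex.split(wrapped) == ["/bin/sh", "-c", raw]
```

The point of using shlex.quote rather than hand-written quotes is that commands containing their own single quotes still survive the extra layer of shell parsing.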

The problem: that fix only went into the one endpoint, test_ssh_profile. We didn’t grep for the same pattern elsewhere at the time, so every other SSH-command-sending API was still broken on csh hosts.

The full list of stragglers:

Location	Role
`/api/discover_server_paths`	Auto-detect WordPress install paths
WP-CLI auto-detect (on profile save)	Find the `wp` binary per server
`test_wpcli` endpoint	Standalone WP-CLI path test
Plugin-list fetch (multiple call sites)	`wp plugin list` for the dashboard etc.

Translation: all of these were silently broken on csh hosts. Because only one test endpoint had been patched, users saw the strange asymmetry: “test passes, auto-detect fails.”

The fix — funnel everything through a `_safe_run` helper

Adding /bin/sh -c wrapping at every site by hand is exactly the kind of thing you forget the next time. So we collapsed every SSH command call onto a single helper function:

import shlex


def _safe_run(c, cmd, **kwargs):
    """Wrap cmd in /bin/sh -c so it doesn't depend on the login shell.
    Default path used by every SSH-related API to ensure POSIX sh
    interprets the command.
    """
    # shlex.quote turns cmd into one single-quoted shell argument.
    wrapped = '/bin/sh -c ' + shlex.quote(cmd)
    return c.run(wrapped, **kwargs)

Now there’s essentially no “raw c.run(cmd)” left in the codebase, and adding new SSH commands naturally routes through _safe_run.

Catch-the-relapse — static analysis as the second layer

We added one more layer: a test that makes it hard to reintroduce the bug.

# tests/test_csh_safe_run_wrapping.py (conceptual)
def test_no_raw_c_run_with_sh_syntax():
    """Statically verify that no raw c.run() with sh-shaped syntax
    remains in site_manager_web.py."""
    for call in find_c_run_calls('site_manager_web.py'):
        arg = call.argument_text
        if contains_sh_syntax(arg):
            assert arg.lstrip("'").lstrip('"').startswith('/bin/sh -c'), \
                f"Raw c.run() with sh syntax at line {call.lineno}"

The scan is scoped to the inside of c.run( argument expressions, so docstrings and comments mentioning 2>/dev/null don’t cause false positives.
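The two helpers in the conceptual test are not shown in this post, so the following is only a sketch of how they could be built on Python’s ast module. Everything here is an assumption for illustration: the regex of sh markers is deliberately rough, the scanner takes source text rather than a filename, and find_violations is a made-up name. It does show why docstrings are ignored: only the first argument of each c.run(...) call is inspected.

```python
import ast
import re

# Rough markers of Bourne-shell syntax (an assumed, non-exhaustive list).
SH_SYNTAX = re.compile(r"2>|&&|\[ -[a-z] |; do\b")


def contains_sh_syntax(text: str) -> bool:
    return SH_SYNTAX.search(text) is not None


def find_c_run_calls(source: str):
    """Yield (lineno, argument_text) for each call of the form c.run(...)."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "run"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "c"
            and node.args
        ):
            # Source text of the first positional argument only.
            yield node.lineno, ast.get_source_segment(source, node.args[0])


def find_violations(source: str):
    """Line numbers of c.run() calls with sh syntax but no /bin/sh -c prefix."""
    return [
        lineno
        for lineno, arg in find_c_run_calls(source)
        if contains_sh_syntax(arg)
        and not arg.lstrip("'\"").startswith("/bin/sh -c")
    ]


SAMPLE = '''
def good(c):
    """Docstring mentioning 2>/dev/null is ignored."""
    c.run('/bin/sh -c ' + shlex.quote('find . 2>/dev/null'))

def bad(c):
    c.run('find . 2>/dev/null')
'''

print(find_violations(SAMPLE))  # prints [7]
```

A scanner like this only sees literal text inside the call, so a command assembled in a variable beforehand would slip past it. That is one more reason to route calls through a single helper rather than rely on the check alone.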

With this in CI, a future PR that writes c.run('find ... 2>/dev/null') without going through _safe_run fails the build. Forgetting to fan out a fix manually is no longer fatal — the machine catches it.

Closing — sealing “copy-paste bugs” with two layers

Two principles from this round seem worth keeping:

Grep for the same pattern when you make the first fix. Bugs whose fix has to be copy-pasted across multiple call sites overwhelm human memory. Doing a cross-grep at the moment of the first fix is the first line of defense.
Add a regression test for the forbidden pattern. A static-analysis test that catches “the wrong shape” gives you a second line of defense that runs even when the human forgets. CI catches what discipline doesn’t.

The honest reflection here: we assumed “we fixed test_ssh_profile, so csh is handled,” and that assumption sheltered the remaining bugs at every other call site. Hope this is useful for anyone working through the same shell-portability problem.

The same bug class came back again later, this time in a new feature — a DB backup progress monitoring script that called c.run() directly, outside the scope of what _safe_run covered. The story of that regression, and how wrap_remote_sh() was designed to work across three server environments (Sakura/csh, Xserver, and heteml without openssl), is in the follow-up article.