One day a multi-site administrator reported a strange bug: “After running the app’s SSH connection test 2-3 times, my IP can’t reach SSH on that server for a long while.” The errors came back as Connection refused or Connection closed by .... The server wasn’t down, and SSH from a different IP worked fine. The source IP was being temporarily banned at the server.
Two external investigation reports gave the cause: server-side protection mechanisms (fail2ban, or PerSourcePenalties in OpenSSH 9.8+) detect spikes of authentication failures within a short window and temporarily ban the source IP. But the user had only clicked the test button 2-3 times, so why were failures “spiking”?
The answer turned out to be paramiko’s default behavior.
paramiko’s default — trying many keys per connection
paramiko.SSHClient.connect() defaults two options to True:
client.connect(
    'host',
    pkey=my_key,
    # The following are True by default:
    # look_for_keys=True,  # also try ~/.ssh/id_* files
    # allow_agent=True,    # also try ssh-agent registered keys
)
When the explicitly passed pkey fails, paramiko falls back through ssh-agent registered keys → ~/.ssh/id_* files → password auth in order. Convenient for developers with a single key. Disastrous for a multi-site administrator:
- The SSH agent has multiple per-site keys registered
- ~/.ssh/ holds several id_rsa / id_ed25519 files
- A single connect call ends up trying 5-10 keys in sequence
- That blows past the server’s MaxAuthTries (default 6) on a single connection
So what looked to the user like “one connection test” was being seen by the server as “a suspicious IP racking up 5-10 auth failures in a row.” Repeat that 2-3 times and the protection mechanism declares the IP “exceeded threshold” and bans it.
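The arithmetic behind that can be sketched with a small worst-case model. This is an illustration, not paramiko's actual logic: `count_key_attempts`, the `DEFAULT_KEY_NAMES` tuple, and the figure of four agent keys are all assumptions made for the example, and the exact filenames paramiko probes vary by version.

```python
import tempfile
from pathlib import Path

# Assumed filenames for illustration; the set paramiko probes depends on its version.
DEFAULT_KEY_NAMES = ("id_rsa", "id_ecdsa", "id_ed25519")
MAX_AUTH_TRIES = 6  # sshd default mentioned above


def count_key_attempts(ssh_dir, agent_key_count, look_for_keys=True, allow_agent=True):
    """Worst-case number of public-key attempts for one connection
    when every key is rejected by the server (hypothetical helper)."""
    attempts = 1  # the explicitly passed pkey
    if allow_agent:
        attempts += agent_key_count  # every key registered in ssh-agent
    if look_for_keys:
        # every default-named key file that exists in the SSH directory
        attempts += sum((Path(ssh_dir) / name).is_file() for name in DEFAULT_KEY_NAMES)
    return attempts


# Simulate an administrator with two key files on disk and four agent keys.
with tempfile.TemporaryDirectory() as ssh_dir:
    for name in ("id_rsa", "id_ed25519"):
        (Path(ssh_dir) / name).write_text("placeholder, not a real key\n")
    with_defaults = count_key_attempts(ssh_dir, agent_key_count=4)
    isolated = count_key_attempts(
        ssh_dir, agent_key_count=4, look_for_keys=False, allow_agent=False
    )

print(with_defaults, with_defaults > MAX_AUTH_TRIES)
print(isolated)
```

With these assumed numbers, one connection with the defaults already exceeds the MaxAuthTries default, while the isolated variant stays at a single attempt.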
The fix — look_for_keys=False and allow_agent=False
paramiko exposes options to scope key trial. We set them explicitly in connect_kwargs:
connect_kwargs = {
    'pkey': my_key,
    'look_for_keys': False,  # don't try ~/.ssh/id_*
    'allow_agent': False,    # don't try ssh-agent keys
}
client.connect('host', **connect_kwargs)
Now only the explicitly passed pkey is tried, so a failing connection costs exactly one authentication attempt. The path that overran MaxAuthTries is gone.
Backward compatibility is preserved. Existing pkey / key_filename users see no change, and password-auth users are unaffected.
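One way to keep the two flags from being forgotten is to build the dictionary in a single place. The sketch below assumes a helper named `isolated_connect_kwargs`, which is hypothetical and not part of paramiko or of the app described here. It only assembles keyword arguments that `paramiko.SSHClient.connect()` accepts.

```python
def isolated_connect_kwargs(pkey=None, key_filename=None, password=None):
    """Build connect kwargs that never fall back to agent or ~/.ssh keys.

    Hypothetical helper: only the credentials passed in are offered.
    """
    kwargs = {
        "look_for_keys": False,  # don't try ~/.ssh/id_*
        "allow_agent": False,    # don't try ssh-agent keys
    }
    # Add only the credentials the caller actually supplied.
    if pkey is not None:
        kwargs["pkey"] = pkey
    if key_filename is not None:
        kwargs["key_filename"] = key_filename
    if password is not None:
        kwargs["password"] = password
    return kwargs


# Password-only profile: the two isolation flags are still present.
print(isolated_connect_kwargs(password="example-password"))
```

A call site would then pass the result through unchanged, for example `client.connect('host', **isolated_connect_kwargs(pkey=my_key))`.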
Lesson from V12: fix the same bug in 10 places at once
This is where the principle from the csh / bash-syntax SSH command bug paid off: the moment you find this kind of bug, grep for the same pattern everywhere. A full sweep of Connection(...) calls found 10 sites with missing or empty connect_kwargs:
| Location | Role |
|---|---|
| core/ssh_utils.py::get_ssh_connection | Maintenance main path |
| save_server_profile WP-CLI auto-detect | Profile save |
| test_ssh_profile | Connection test |
| discover_server_paths | Path discovery |
| test_wpcli | WP-CLI test |
| install_wpcli | WP-CLI install |
| diagnose_server | Server diagnosis |
| fetch_plugins | Plugin list fetch |
| fetch_pending_plugins_for_site | Pending plugin fetch |
| save_site WP-CLI auto-detect | Site save |
All 10 got a comment explaining the IP-block prevention rationale. If we’d patched only one, the same bug would have come back through a different code path.
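A sweep like this can also be scripted instead of relying on a text grep. The following is a minimal sketch, assuming the call sites look like `Connection(host, connect_kwargs=...)`. The function name `find_unscoped_connection_calls` is made up for the example, and it only catches calls where the keyword is missing entirely, not ones where an empty dict is passed.

```python
import ast


def find_unscoped_connection_calls(source):
    """Return line numbers of Connection(...) calls that pass no
    connect_kwargs keyword (hypothetical helper for illustration)."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both Connection(...) and some_module.Connection(...)
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "Connection":
            continue
        if not any(kw.arg == "connect_kwargs" for kw in node.keywords):
            flagged.append(node.lineno)
    return sorted(flagged)


SAMPLE = '''
a = Connection("host1", user="u")
b = Connection("host2", connect_kwargs=ck)
c = fabric.Connection("host3")
'''

print(find_unscoped_connection_calls(SAMPLE))
```

Working on the AST rather than on raw text means that a `Connection(` mentioned in a comment or docstring is not reported.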
We also removed a warning UI — automatic retries made it worse
In an earlier round we’d actually tried a different approach: diagnose private-key permissions before the connection test → show a warning + “Fix and connect” button. Well intentioned, but in this exact scenario it backfired completely.
The user-experienced sequence was:
- Warning: “Your SSH key permissions are loose. Want to fix them?”
- User clicks “Fix and connect” → internal chmod 600 → connection test auto-retries
- Retry hits the same multi-key trial → fails
- “Authentication failed” → try another key → fail again
- Failures spike → IP ban triggers
“Warn the user, auto-fix, auto-retry” turns out to be a UX pattern that amplifies failure counts when the root cause lies elsewhere. The auto-retry hides multiple attempts behind a single user click.
Worse, paramiko doesn’t apply the private-key permission check that the OpenSSH client performs (StrictModes is the related server-side sshd option), so for this app’s purposes the key works fine even with loose permissions. The “helpful” preventive warning was actually overreach in our specific context.
We removed the warning UI in the same round (-120 / +31 = net 89 lines deleted). The _diagnoseAndOfferFix() function shell is kept for backward-compatible signatures, but its body is now a no-op.
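The amplification effect in this section can be put into rough numbers. The model below is deliberately simplified and every input is an assumption chosen for illustration: it treats each rejected key as one failure, caps failures per connection at MaxAuthTries, and ignores how a particular ban mechanism actually counts events.

```python
def failures_seen_by_server(clicks, connections_per_click, keys_per_connection,
                            max_auth_tries=6):
    """Rough count of auth failures the server observes (illustrative model).

    The server stops a connection after max_auth_tries failures, so the
    per-connection count is capped at that value.
    """
    per_connection = min(keys_per_connection, max_auth_tries)
    return clicks * connections_per_click * per_connection


# Assumed scenario: 3 clicks, each followed by one hidden auto-retry, 7 candidate keys.
before = failures_seen_by_server(clicks=3, connections_per_click=2, keys_per_connection=7)
# Same 3 clicks with no auto-retry and only the explicit key offered.
after = failures_seen_by_server(clicks=3, connections_per_click=1, keys_per_connection=1)

print(before, after)
```

Under these assumed inputs the server sees 36 failures before the change and 3 after, for the same number of user clicks.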
Regression defense — an AST test that forbids empty connect_kwargs
Same two-layer defense pattern as V12: a regression test that fails the build if the bug ever comes back. tests/test_ssh_connection_isolation.py ships with 6 tests:
# Sketch
import ast

def test_all_connect_kwargs_have_look_for_keys_false():
    """Verify that every connect_kwargs / ck initialization
    contains 'look_for_keys': False and 'allow_agent': False."""
    for file in [CORE_SSH_UTILS, SITE_MANAGER_WEB]:
        for assign in find_connect_kwargs_assignments(file):
            keys = extract_dict_keys(assign.value)
            assert keys.get('look_for_keys') is False, \
                f"{file}:{assign.lineno} missing look_for_keys=False"
            assert keys.get('allow_agent') is False, \
                f"{file}:{assign.lineno} missing allow_agent=False"
If someone later adds a new SSH API and leaves connect_kwargs = {}, the build fails. Reading assign.value directly from the AST keeps comment / docstring strings from causing false positives.
Closing — three principles
- Library defaults aren’t always “correct” for your environment. paramiko’s look_for_keys=True / allow_agent=True is reasonable fallback behavior for single-key users, but dangerous in multi-key environments. Re-read library docs through the lens of your actual deployment environment before trusting defaults.
- Warning UI + auto-retry can amplify the problem. “Warn the user, auto-fix, auto-retry” is well-intentioned, but when the root cause lies somewhere else, the auto-retry compounds the failure count. UX-layer “helpfulness” can become operational debt.
- Find a same-shaped bug? Grep + AST regression test on the spot. Third entry in this pattern. The /bin/sh -c wrap / _safe_run helper (V12) and now the connect_kwargs scoping share the same shape: a fix needs to apply across every SSH path in the codebase. Cross-grep + AST regression test as a standard step changes how confident you can be about “did we get them all?”
SSH-related code tends to be exactly where “library defaults break under our environment” and “the same pattern scatters across many call sites” collide. If you’re building a multi-site administrator tool on top of paramiko, defaulting look_for_keys and allow_agent to False from the start is the safer bet — that’s what this round taught us.