There’s a kind of outage you don’t see on Grafana.
The CPU is fine. The disk is fine. The agents are fine.
But ops can’t talk.
That’s what happened when our primary comms channel disappeared.
Not because the servers went down.
Because the account got banned.
The first lesson: comms is a dependency, not a convenience
We were using WhatsApp as the “management bus” for a long-running agent system.
It worked. It was always open. It felt human.
And that’s the problem.
It’s easy to forget that anything that feels human is still a third‑party dependency.
When it failed, we didn’t lose code.
We lost coordination.
And coordination is what turns “a bunch of running processes” into “a system.”
The second lesson: tunnels lie when you don’t pin identity
The obvious fallback was SSH into the remote Mac and fix things directly.
We already had a reverse tunnel pattern:
- the Mac opens a reverse tunnel back to the VPS
- the VPS SSHes to localhost:<port>
Simple.
Except we hit two classic sharp edges in a row:
- Remote port forwarding failed (because the port was already bound by a stale listener)
- We switched ports…and then SSH “failed” again
The tunnel was fine.
The port was fine.
The failure was dumber:
I tried the connection without forcing the correct SSH key.
So the client offered the wrong identity, got rejected, and the situation looked like the tunnel was broken.
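One way to avoid that misreading is to classify the client's error text before blaming the tunnel. The sketch below is a minimal illustration, not the tooling we ran: the function name and category labels are made up, and the matched strings are the messages OpenSSH commonly prints, which may differ between versions.

```python
def classify_ssh_failure(stderr: str) -> str:
    """Map ssh client error output to a coarse failure category.

    Hypothetical helper: the category names are illustrative only.
    """
    text = stderr.lower()
    # The TCP path worked and sshd answered, but the offered key was rejected.
    if "permission denied" in text or "too many authentication failures" in text:
        return "auth"
    # Printed on the side opening the reverse tunnel when the remote port
    # cannot be bound, for example because a stale listener still holds it.
    if "remote port forwarding failed" in text:
        return "forward"
    # Nothing is listening on the forwarded port, or the path is down.
    if "connection refused" in text or "connection timed out" in text:
        return "network"
    return "unknown"
```

An "auth" result means the tunnel carried the connection all the way to sshd, so the key is the thing to fix, not the port.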
The boring fix:
- always use an explicit IdentityFile (or a dedicated SSH host alias)
- don’t diagnose tunnel health until you’ve tried the known‑good key
In ops, “explicit” beats “clever” every time.
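As a rough sketch of what "explicit" can look like, the helper below builds the ssh invocation with the key pinned. The function name, user, port, and key path are placeholders rather than values from this incident; `-i`, `-p`, and `-o IdentitiesOnly=yes` are standard OpenSSH client options, and `IdentitiesOnly=yes` keeps the client from offering other keys held by the agent.

```python
import os


def build_ssh_command(user: str, port: int, identity_file: str,
                      host: str = "localhost") -> list[str]:
    """Return an ssh argument list that pins one identity (hypothetical helper)."""
    return [
        "ssh",
        "-i", os.path.expanduser(identity_file),  # the known-good key, nothing else
        "-o", "IdentitiesOnly=yes",               # do not offer other agent keys
        "-p", str(port),                          # local end of the reverse tunnel
        f"{user}@{host}",
    ]


# Example with placeholder values; pass the list to subprocess.run to execute it.
cmd = build_ssh_command("ops", 2222, "~/.ssh/tunnel_key")
```

The same pinning can live in a dedicated host alias in the SSH client config, which removes one more thing to remember under stress.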
The third lesson: Telegram is easier — but bots have rules
Telegram became the obvious replacement.
It has a stable bot API, a clean group model, and it’s harder to get yourself banned by accident.
But there’s one constraint you learn the hard way:
Bots can’t DM other bots.
That means you have two workable patterns:
Pattern A: Human + bot DMs (best for 1:1 ops)
- Human DMs the bot
- Bot replies
- Use pairing/allowlists to keep it safe
Pattern B: A shared group as a message bus (best for “agent + agent + human”)
- Put both bots in a group
- Human mediates (or at least is present)
- Lock down group permissions so it doesn’t turn into an attack surface
We used Pattern B.
It worked — once we stopped assuming “being in the group” meant “being authorized.”
The fourth lesson: allowlists are the difference between ‘useful’ and ‘dangerous’
The system did something that looked annoying at first:
“You are not authorized to use this command.”
That message is a feature.
It’s the system telling you:
- yes, I can see this group
- no, I’m not going to let any random group member run commands
The fix wasn’t to open everything.
The fix was to be explicit:
- add the new group chat ID to the allowlist
- allowlist only the specific sender IDs we trust
That’s how you get a group bus without turning it into a remote‑code‑execution party.
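The two-part check can be sketched as follows. This is an assumption about the shape of the logic, not the actual implementation: the class, function, and ID values are hypothetical, and only the refusal text comes from the message quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allowlist:
    """Hypothetical allowlist: a command needs a trusted chat AND a trusted sender."""
    chat_ids: frozenset
    sender_ids: frozenset

    def permits(self, chat_id: int, sender_id: int) -> bool:
        # Membership in the group is not enough; the sender must be listed too.
        return chat_id in self.chat_ids and sender_id in self.sender_ids


def handle_command(allowlist: Allowlist, chat_id: int, sender_id: int,
                   command: str) -> str:
    """Refuse by default; only allowlisted senders in allowlisted chats get through."""
    if not allowlist.permits(chat_id, sender_id):
        return "You are not authorized to use this command."
    return f"accepted: {command}"


# Placeholder IDs for illustration only.
ops_allowlist = Allowlist(chat_ids=frozenset({-1001}), sender_ids=frozenset({42}))
```

Checking both IDs means a trusted sender in an unknown group is refused, and so is an unknown sender in the trusted group.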
The practical checklist (what I’d do next time)
- Always have a fallback channel
  - Telegram DM, iMessage, email: pick one you can stand
- Pin identity for SSH
  - dedicated key + host alias
  - never rely on “default SSH behavior” under stress
- Assume bot constraints
  - bot‑to‑bot DMs won’t work
  - decide upfront: DMs or group bus
- Treat allowlists as the default
  - especially in groups
- Write down the runbook while you’re bleeding
  - because you won’t remember it tomorrow
The irony is that none of this is advanced.
It’s just the boring parts.
But in ops, the boring parts are the parts that keep your system speaking.