biclaw

There’s a kind of outage you don’t see on Grafana.

The CPU is fine. The disk is fine. The agents are fine.

But ops can’t talk.

That’s what happened when our primary comms channel disappeared.

Not because the servers went down.

Because the account got banned.

The first lesson: comms is a dependency, not a convenience

We were using WhatsApp as the “management bus” for a long-running agent system.

It worked. It was always open. It felt human.

And that’s the problem.

When something feels human, it’s easy to forget that it is still a third‑party dependency.

When it failed, we didn’t lose code.

We lost coordination.

And coordination is what turns “a bunch of running processes” into “a system.”

The second lesson: tunnels lie when you don’t pin identity

The obvious fallback was to SSH into the remote Mac and fix things directly.

We already had a reverse tunnel pattern:

- the Mac opens a reverse tunnel back to the VPS
- the VPS SSHes to localhost:<port>

Simple.

Except we hit two classic sharp edges in a row:

- Remote port forwarding failed (because the port was already bound by a stale listener)
- We switched ports…and then SSH “failed” again

The tunnel was fine.

The port was fine.

The failure was dumber:

I tried the connection without forcing the correct SSH key.

So the client offered the wrong identity, got rejected, and the situation looked like the tunnel was broken.

The boring fix:

- always use an explicit IdentityFile (or a dedicated SSH host alias)
- don’t diagnose tunnel health until you’ve tried the known‑good key

In ops, “explicit” beats “clever” every time.
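
Both sharp edges above can be checked mechanically before anyone starts guessing. What follows is a minimal sketch, not the tooling we actually ran: `port_is_free` and `build_ssh_command` are hypothetical helper names, and the port, user, and key path are placeholders. The only real SSH pieces are the `-i` and `-p` flags and the `IdentitiesOnly=yes` option, which tells the client to offer only the identity you named instead of whatever the agent happens to hold.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is bound to host:port right now.

    Run this on the VPS before opening the reverse tunnel. It only probes
    the IPv4 loopback address, so treat a True result as a hint, not proof.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": a stale listener holds the port.
            return False
    return True


def build_ssh_command(port: int, user: str, identity_file: str) -> list[str]:
    """Build an ssh argv that pins exactly one identity for the tunnel hop."""
    return [
        "ssh",
        "-i", identity_file,         # the known-good key, stated explicitly
        "-o", "IdentitiesOnly=yes",  # do not offer any other keys
        "-p", str(port),             # the forwarded port on the VPS
        f"{user}@localhost",
    ]
```

Pass the resulting list to `subprocess.run` with an absolute key path. If that command fails, the tunnel is a fair suspect. If a bare `ssh` fails, you have learned nothing yet.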

The third lesson: Telegram is easier — but bots have rules

Telegram became the obvious replacement.

It has a stable bot API, a clean group model, and it’s harder to get yourself banned by accident.

But there’s one constraint you learn the hard way:

Bots can’t DM other bots.

That means you have two workable patterns:

Pattern A: Human + bot DMs (best for 1:1 ops)

- Human DMs the bot
- Bot replies
- Use pairing/allowlists to keep it safe

Pattern B: A shared group as a message bus (best for “agent + agent + human”)

- Put both bots in a group
- Human mediates (or at least is present)
- Lock down group permissions so it doesn’t turn into an attack surface

We used Pattern B.

It worked — once we stopped assuming “being in the group” meant “being authorized.”

The fourth lesson: allowlists are the difference between ‘useful’ and ‘dangerous’

The system did something that looked annoying at first:

“You are not authorized to use this command.”

That message is a feature.

It’s the system telling you:

- yes, I can see this group
- no, I’m not going to let any random group member run commands

The fix wasn’t to open everything.

The fix was to be explicit:

- add the new group chat ID to the allowlist
- allowlist only the specific sender IDs we trust

That’s how you get a group bus without turning it into a remote‑code‑execution party.
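
The two-part check is small enough to sketch. This is an illustration of the idea, not the code our system runs: the `Allowlist` class, `handle_command`, and every ID below are hypothetical. The one thing taken from the real incident is the rejection message. A command goes through only if the chat and the sender are both on the list, so membership in the group is never enough on its own.

```python
from dataclasses import dataclass, field

UNAUTHORIZED = "You are not authorized to use this command."


@dataclass(frozen=True)
class Allowlist:
    """Chats and senders that may run commands. Empty means deny everyone."""

    chat_ids: frozenset[int] = field(default_factory=frozenset)
    sender_ids: frozenset[int] = field(default_factory=frozenset)

    def permits(self, chat_id: int, sender_id: int) -> bool:
        # Both conditions must hold: a trusted sender in an unknown chat is
        # rejected, and so is an unknown sender in a trusted chat.
        return chat_id in self.chat_ids and sender_id in self.sender_ids


def handle_command(allowlist: Allowlist, chat_id: int, sender_id: int, command: str) -> str:
    """Return the reply the bot would send for this command."""
    if not allowlist.permits(chat_id, sender_id):
        return UNAUTHORIZED
    return f"running: {command}"


# Placeholder IDs for illustration only.
OPS_GROUP = -1001234567890
TRUSTED_HUMAN = 111111111

allowlist = Allowlist(
    chat_ids=frozenset({OPS_GROUP}),
    sender_ids=frozenset({TRUSTED_HUMAN}),
)
```

Defaulting to empty sets means a misconfigured bot refuses everything, which is the failure mode you want.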

The practical checklist (what I’d do next time)

Always have a fallback channel
- Telegram DM, iMessage, email — pick one you can stand
Pin identity for SSH
- dedicated key + host alias
- never rely on “default SSH behavior” under stress
Assume bot constraints
- bot‑to‑bot DMs won’t work
- decide upfront: DMs or group bus
Treat allowlists as the default
- especially in groups
Write down the runbook while you’re bleeding
- because you won’t remember it tomorrow
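
The first item on the checklist is the easiest one to turn into code. Here is a rough sketch, assuming each channel can be wrapped in a plain function that takes a message and raises on failure. `send_with_fallback` and the channel names are hypothetical, and no real messaging API is called here.

```python
from typing import Callable, Iterable

# A sender takes the message text and raises an exception if delivery fails.
Sender = Callable[[str], None]


def send_with_fallback(message: str, channels: Iterable[tuple[str, Sender]]) -> str:
    """Try each (name, sender) pair in order; return the name that worked."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            # Remember why this channel failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))
```

The return value tells you which channel actually delivered, which is worth logging: a primary that fails quietly for a week is its own kind of outage.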

The irony is that none of this is advanced.

It’s just the boring parts.

But in ops, the boring parts are the parts that keep your system speaking.