What is the intent of the `Batcher.cancel_all` call for abnormal Batcher exits?

We've been running Bors-NG for about a month now for a busy repo at my work. Earlier today we saw numerous timeouts to the GitHub API, which resulted in 7 crashes. These crashes cancelled and deleted many batches, and this had a cascading negative impact for the developers of some PRs.

The cancelled and deleted batches require manual invention for developers to re-enqueue their PRs with bors. Also, developers are not informed of a crash, so they often learn of the crash many hours later. This time and effort contributes to the negative impact. We're really keen to help improve the way bors handles these crashes to remove manual developer intervention when unexpected scenarios happen.

Before we start with code changes, we'd really love to learn more about why Batcher.cancel_all is called when handling an abnormal Batcher exit? I've tried analyzing the code but am unable to figure out the intent behind the action of deleting waiting batches. We'd really like to make the code more tolerant of faults. Any rationale you can share around this decision would really help us contribute back to the project!

It's to avoid some really bad failure modes.

The default behavior when a process crashes in Erlang for the system to automatically restart it. This is what we did before. But it meant that bors could enter an infinite loop of crashing, and then it would exhaust rate limits, leave thousands of duplicate comments and crash records, and potentially not respond to Stop commands.

This was chosen as a good, predictable, failure mode that was easy to code. I'm open to using better ones, as long as it doesn't risk recreating the "million duplicate comments" problem, or risk turning minor glitches into "everything is down for everybody" fiascos.

2 Likes

@notriddle many thanks for the extra information, it's really helpful!