We've been running Bors-NG for about a month now for a busy repo at my work. Earlier today we saw numerous timeouts to the GitHub API, which resulted in 7 crashes. These crashes cancelled and deleted many batches, and this had a cascading negative impact for the developers of some PRs.
The cancelled and deleted batches require manual invention for developers to re-enqueue their PRs with bors. Also, developers are not informed of a crash, so they often learn of the crash many hours later. This time and effort contributes to the negative impact. We're really keen to help improve the way bors handles these crashes to remove manual developer intervention when unexpected scenarios happen.
Before we start with code changes, we'd really love to learn more about why Batcher.cancel_all
is called when handling an abnormal Batcher exit? I've tried analyzing the code but am unable to figure out the intent behind the action of deleting waiting batches. We'd really like to make the code more tolerant of faults. Any rationale you can share around this decision would really help us contribute back to the project!