Summary: Make it possible to build multiple batches in parallel. When a new PR is r+'ed, start one new batch for master + this PR, and another batch for each already running batch. As batches complete, check if they can be merged without blocking batches that were started earlier. Once a batch is merged, check if any previously blocked batches can now be merged.
I allow this RFC document to be modified and redistributed under the terms of the Apache-2.0 license, or the CC-BY-NC-SA 3.0 license, at your option.
Motivation
At Tink we use Bors to build a large monorepo. As we've grown, we are experiencing growing pains when a lot of developers try to merge code at the same time. We often see batches of 10 PRs or more. A CI run takes around 20 minutes usually, but in a worst case this can still translate to 4+ hours of bisection if your urgent one-line change is on the wrong side of multiple bisections.
Initially I thought about trying to build both halves of the bisection at once - and taking whichever half succeeded first - but then I realised it might be better to stop the queue from getting so big in the first place - just start building staging + the new PR right away, even if something's already building.
Guide-level explanation
If bors.toml contains parallel = true
, Bors will not wait for the previous batch to complete when building a new batch. Instead, it will create one new batch for master + this PR, and another batch for each currently running batch.
When a batch finishes building, bors will check if it can be merged without blocking earlier batches. If not, it is put into a new CompletedPendingMerge state. If it can be, it is merged, and then Bors will check if any batches in the CompletedPendingMerge state are either now mergable as well, or need to be marked Invalid.
Some existing bors features will be incompatible with this feature:
- It will not be possible to set priority on PRs
- It will not be possible to use squash merges.
Reference-level explanation
When using parallel = true
, bors will append the batch number to the staging branch name when creating it, to permit multiple staging branches to co-exist and be built in parallel.
- Users of this feature will need to adjust their branch restrictions. Github and most CI tools support specifying them with a regexp or other wildcard expression.
A new status will be introduced, CompletedPendingMerge, to mark a batch that has finished building, but cannot yet be merged because doing so would block PRs that were r+'ed earlier.
Another new status will be introduced, Invalid, to indicate a batch that can no longer be merged because earlier batches are now on master. Filtering will be added to the UI to hide Invalid batches by default. This should keep the UI from becoming too cluttered with invalid batches.
When a PR is cancelled, all batches that include that PR will be cancelled. There should still be a valid batch for each combination of other PRs that were building at the same time.
Notifications of build statuses will need to be adjusted, so that a notification is sent when a PR is no longer being built anywhere.
Two new columns have been added to the batch table in my implementation - parent_commit
and parent_batch_id
. I'm not yet sure if these are strcitly necessary - I haven't removed the waiting state from my implementation, but with this we should be able to put the new batches directly into the Running state when we already have that information to hand. These may also be useful/necessary when sequencing merges of batches.
Drawbacks
It adds a lot of complexity to the code, and causes stuff to be built that we know we're not going to merge.
If the number of concurrent builds you can run is limited, this is probably going to perform a lot worse.
Rationale and alternatives
- This approach should make the worst-case for getting a valid PR onto master be whatever it takes to do a full CI build.
- Improving the bisection process would still provide us with some benefit. As discussed above, we could try building both halves of a bisection simultaneously. Or we could have a bot that tries to guess which PR caused the failure and boots it from the batch.
- A simpler approach would be to only create a batch for the new PR + the most recent staging batch. This could still lead to a lot of unnecessary queue time for later PRs in the event that an early PR is cancelled/fails.
Prior art
- Issue #141 proposes a slightly simpler approach, taking the latest staging and creating one new batch from it. This uses less resources but performs worse when batches fail.
- Some other bors-like systems implement a first-in-first-out appreach, such as mergequeue, but it is not clear if they support parallel builds.
Unresolved questions
- Are there any holes in the design - i.e. are there cases where this first-in-first-out approach will lose PRs? (I intend to write tests to show that this does not happen)
- What is the impact of flaky tests on this approach vs. bisection?
** Right now, a flaky test will cause a failure (+ possibly bisection) on 100% of batches!
** If flakes occur on branches that end up being marked invalid, that might be an improvement
** But that might be more than offset because we're doing so many more builds. - Is this functionality that bors is interested having merged into master? Or is this a sufficiently-different thing to what bors is that it should live in a fork.
- Does it make sense to bake this functionality into the existing
lib/worker/batcher.ex
gated behind feature checks, similar to the squash merge functionality? Or should it live in an entirely separate batcher implementation.
Future possibilities
I want to keep it out of scope for the initial implementation, but it is still possible to support priorities with this approach. Just start a new set of batches but with the higher priority PR at the start instead of the end, and sort the higher priority batches before the earlier ones when deciding merge order. It's not even necessary to mark the already-building PRs as invalid - if the higher priority PR ends up failing to build, the earlier PRs can be merged when they complete as normal.
It might make sense to pro-actively delete the branches when a batch is made invalid, to stop them from building and save CI costs.
See also
My bors fork is here: https://github.com/richardstephens/bors-ng
The first pass of my implementation of this is in the commit that is currently on master on that fork, here. I have left comments explaining the rationale for most of the changes.
Current status: So far, this creates the additional batches, but nothing further is implemented. No sequencing of merges is implemented yet. This means that if the build always takes the same amount of time, this will work, but if a later batch finishes before an earlier one, it will be blocked. I also have not handled notifications well, so this currently sends wrong notifications to a lot of PRs.
In the future, we could consider creating a batch for every currently open PR pre-emptively - before r+ is even issued - and if it passes, r+ can be implemented as
- push that commit to master
- invalidate all the other pending batches
- create a new set of batches
This functionality probably wouldn't make sense for us today, but I could see it making sense with some improvements to our pipeline - the way we detect changes in our monorepo is overly conservative currently.