Draft RFC: Support building batches in parallel

Summary: Make it possible to build multiple batches in parallel. When a new PR is r+'ed, start one new batch for master + this PR, and, for each already running batch, another batch consisting of that batch's PRs plus this one. As batches complete, check whether they can be merged without blocking batches that were started earlier. Once a batch is merged, check whether any previously blocked batches can now be merged.

I allow this RFC document to be modified and redistributed under the terms of the Apache-2.0 license, or the CC-BY-NC-SA 3.0 license, at your option.

Motivation

At Tink we use Bors to build a large monorepo. As we've grown, we have started to experience growing pains when a lot of developers try to merge code at the same time. We often see batches of 10 PRs or more. A CI run usually takes around 20 minutes, but in the worst case this can still translate to 4+ hours of bisection if your urgent one-line change ends up on the wrong side of multiple bisections.

Initially I thought about trying to build both halves of the bisection at once - and taking whichever half succeeded first - but then I realised it might be better to stop the queue from getting so big in the first place - just start building staging + the new PR right away, even if something's already building.

Guide-level explanation

If bors.toml contains parallel = true, Bors will not wait for the previous batch to complete before building a new batch. Instead, it will create one new batch for master + this PR, and, for each currently running batch, another batch consisting of that batch's PRs plus this one.
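For example, the configuration would just add the new key next to whatever is already there (the status name below is only a placeholder):

    # bors.toml
    status = ["ci/build"]  # placeholder for the repository's existing required statuses
    parallel = true        # opt in to parallel batch building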

When a batch finishes building, bors will check whether it can be merged without blocking earlier batches. If not, it is put into a new CompletedPendingMerge state. If it can be, it is merged, and Bors then checks whether any batches in the CompletedPendingMerge state are either now mergeable as well or need to be marked Invalid.

Some existing bors features will be incompatible with this feature:

  • It will not be possible to set priority on PRs.
  • It will not be possible to use squash merges.

Reference-level explanation

When using parallel = true, bors will append the batch number to the staging branch name when creating it, to permit multiple staging branches to co-exist and be built in parallel.

  • Users of this feature will need to adjust their branch restrictions. GitHub and most CI tools support specifying them with a regexp or other wildcard expression; see the example patterns below.
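For example, assuming the batch number is appended with a dot (the exact separator is an implementation detail), the rules might look like:

    staging.*          (GitHub protected-branch pattern, fnmatch-style)
    ^staging\.\d+$     (equivalent regexp for CI tools that take one)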

A new status will be introduced, CompletedPendingMerge, to mark a batch that has finished building, but cannot yet be merged because doing so would block PRs that were r+'ed earlier.

Another new status will be introduced, Invalid, to indicate a batch that can no longer be merged because earlier batches are now on master. Filtering will be added to the UI to hide Invalid batches by default. This should keep the UI from becoming too cluttered with invalid batches.
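To make the intended transitions concrete, here is a rough sketch - not the real batcher code, and the data layout and helper names are made up for illustration. It assumes each batch records the commit it was built on top of (the parent_commit column described below) and that batch ids reflect r+ order:

    # Sketch only: minimal maps stand in for the real Batch schema.
    # Each batch is %{id: n, state: s, parent_commit: sha, head_commit: sha};
    # states are :running, :completed_pending_merge, :merged, and :invalid.
    defmodule ParallelBatchSketch do
      # Called when the batch `done_id` finishes building successfully.
      # `master` is the current head of the target branch.
      # Returns {updated_batches, new_master_head}.
      def batch_succeeded(batches, done_id, master) do
        done = Enum.find(batches, &(&1.id == done_id))

        earlier_in_flight? =
          Enum.any?(batches, fn b ->
            b.id < done_id and b.state in [:running, :completed_pending_merge]
          end)

        if earlier_in_flight? do
          # Merging now would block a batch that was r+'ed earlier.
          {update_state(batches, done_id, :completed_pending_merge), master}
        else
          merge_or_invalidate(batches, done, master)
        end
      end

      # Merge the batch if master is still at the commit the batch was built
      # on; otherwise it can never land and is marked Invalid. Either way,
      # re-check the batches that were held back behind it.
      defp merge_or_invalidate(batches, batch, master) do
        if batch.parent_commit == master do
          batches
          |> update_state(batch.id, :merged)
          |> sweep_pending(batch.head_commit)
        else
          batches
          |> update_state(batch.id, :invalid)
          |> sweep_pending(master)
        end
      end

      # Look at the oldest batch that is neither merged nor invalid. If it has
      # finished building, merge or invalidate it too; if it is still running,
      # stop and wait for it.
      defp sweep_pending(batches, master) do
        next =
          batches
          |> Enum.filter(&(&1.state in [:running, :completed_pending_merge]))
          |> Enum.sort_by(& &1.id)
          |> List.first()

        case next do
          %{state: :completed_pending_merge} = batch ->
            merge_or_invalidate(batches, batch, master)

          _ ->
            {batches, master}
        end
      end

      defp update_state(batches, id, state) do
        Enum.map(batches, fn
          %{id: ^id} = b -> %{b | state: state}
          b -> b
        end)
      end
    end

Failure and cancellation would need an equivalent sweep so that held-back batches don't get stuck, but that is omitted from the sketch.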

When a PR is cancelled, all batches that include that PR will be cancelled. There should still be a valid batch for each combination of the other PRs that were building at the same time. For example, if PRs A, B, and C were r+'ed in that order, the running batches would be A, B, A+B, C, A+C, B+C, and A+B+C; cancelling B cancels the four batches that contain it and leaves A, C, and A+C, which still covers every combination of the remaining PRs.

Notifications of build statuses will need to be adjusted, so that a notification is sent when a PR is no longer being built anywhere.

Two new columns have been added to the batch table in my implementation - parent_commit and parent_batch_id. I'm not yet sure if these are strictly necessary - I haven't removed the waiting state from my implementation, but with these we should be able to put new batches directly into the Running state when we already have that information to hand. They may also be useful/necessary when sequencing merges of batches.
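For reference, the schema change itself is small. Here is a sketch of the corresponding migration, with the module name, table name, and column types guessed for illustration rather than taken from the fork:

    # Sketch only: assumes the batch table is named "batches"; the real
    # migration in the fork may differ.
    defmodule BorsNG.Database.Repo.Migrations.AddBatchParentColumns do
      use Ecto.Migration

      def change do
        alter table(:batches) do
          # Commit the staging branch for this batch was built on top of.
          add :parent_commit, :string
          # The already-running batch this batch was forked from, if any.
          add :parent_batch_id, references(:batches, on_delete: :nilify_all)
        end
      end
    end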

Drawbacks

It adds a lot of complexity to the code, and it builds many batches that we know in advance will never be merged.
If the number of concurrent builds you can run is limited, this is probably going to perform a lot worse.

Rationale and alternatives

  • This approach should reduce the worst case for getting a valid PR onto master to the time it takes to do a single full CI build.
  • Improving the bisection process would still provide us with some benefit. As discussed above, we could try building both halves of a bisection simultaneously. Or we could have a bot that tries to guess which PR caused the failure and boots it from the batch.
  • A simpler approach would be to only create a batch for the new PR + the most recent staging batch. This could still lead to a lot of unnecessary queue time for later PRs in the event that an early PR is cancelled or fails.

Prior art

  • Issue #141 proposes a slightly simpler approach, taking the latest staging and creating one new batch from it. This uses fewer resources but performs worse when batches fail.
  • Some other bors-like systems, such as mergequeue, implement a first-in-first-out approach, but it is not clear if they support parallel builds.

Unresolved questions

  • Are there any holes in the design - i.e. are there cases where this first-in-first-out approach will lose PRs? (I intend to write tests to show that this does not happen)
  • What is the impact of flaky tests on this approach vs. bisection?
    • Right now, a flaky test will cause a failure (+ possibly a bisection) on 100% of batches!
    • If flakes occur on branches that end up being marked invalid, that might be an improvement.
    • But that might be more than offset by the fact that we're doing so many more builds.
  • Is this functionality that bors is interested in having merged into master? Or is this sufficiently different from what bors is that it should live in a fork?
  • Does it make sense to bake this functionality into the existing lib/worker/batcher.ex gated behind feature checks, similar to the squash merge functionality? Or should it live in an entirely separate batcher implementation?

Future possibilities

I want to keep it out of scope for the initial implementation, but it is still possible to support priorities with this approach: start a new set of batches with the higher-priority PR at the start instead of the end, and sort the higher-priority batches before the earlier ones when deciding merge order. It's not even necessary to mark the already-building batches as invalid - if the higher-priority PR ends up failing to build, the earlier batches can be merged as normal when they complete.

It might make sense to proactively delete the staging branches when a batch is marked invalid, to stop them from building and save CI costs.

See also

My bors fork is here: https://github.com/richardstephens/bors-ng

The first pass of my implementation is in the commit that is currently on master on that fork, here. I have left comments explaining the rationale for most of the changes.

Current status: So far, this creates the additional batches, but nothing further is implemented. No sequencing of merges is implemented yet, which means that if every build takes the same amount of time this will work, but if a later batch finishes before an earlier one, it will be blocked. I also have not handled notifications well, so this currently sends incorrect notifications to a lot of PRs.

In the future, we could consider creating a batch for every currently open PR pre-emptively - before r+ is even issued - and if it passes, r+ can be implemented as:

  1. Push that commit to master.
  2. Invalidate all the other pending batches.
  3. Create a new set of batches.

This functionality probably wouldn't make sense for us today, but I could see it making sense with some improvements to our pipeline - the way we detect changes in our monorepo is currently overly conservative.

https://zuul-ci.org/ does parallel building, and they’ve documented what they’re doing pretty well. I’ve always figured bors was basically Zuul-on-a-budget.

A few other questions, also:

  • Does this cause the number of running batches to experience quadratic growth? That is, every approved PR doubles the number of batches? Or am I misunderstanding it?

  • Could you please make sure it works with priorities and squashing? I don’t like un-composable features.

Effectively yes, correct. I don't have anything close to the statistical chops to model this properly, but my hope is that this won't be too much of a problem in practice. We usually only see the big batches when something in a smaller batch fails or is cancelled and PRs continue to pile in while the bisection happens. Having said that, if 7 engineers decide to r+ within the same 20-minute time span, we would end up with 127 batches building simultaneously. This seems like a lot, but we already build every commit on every feature branch, so if those 7 branches each had 7 commits pushed to them, that's less than a 3x increase in total CI minutes spent.

But then, another doubling would see 255 simultaneous batches, which is probably beyond a number that's justifiable.
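To spell the growth out: each r+ creates one batch from master plus one per already running batch, so b(n) = 2*b(n-1) + 1, i.e. b(n) = 2^n - 1 batches when n PRs are approved within a single build window. A quick check of the numbers above, in an Elixir shell:

    # Closed form of the recurrence b(n) = 2*b(n-1) + 1, b(0) = 0.
    batches = fn n -> Integer.pow(2, n) - 1 end
    batches.(7)  #=> 127
    batches.(8)  #=> 255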

So then we probably need a configurable limit on the number of batches to build simultaneously, at which point we would have to do one of, or some combination of, the following:

  1. Revert to only starting one batch for each r+, and stop creating 'speculative' batches.
  2. Put r+'es into a queue and create a full set of speculative batches as we consume them.
  3. Revert to bors' normal batching and bisection strategy.

Doing 3 would mean we would need to keep track of whether a batch was an 'overflow' batch. Doing 1 would mean either failing PRs that hadn't been individually tested to show they were at fault, or keeping track of state to work out what to rebuild. I imagine we'll try 2 first, because it seems like the simplest to implement.

Actually, the first cut of this will probably end up just discarding r+'s and issuing a 'too busy - try again later' message if too much is already building. Again, I don't know where to begin to model this - it might turn out that we can just set the limit at 6 and this never happens, at least not at our current scale.

Yup. But I want to get the basics working first so we can see if the idea has merit or if it's totally off base.

@rvl did some modeling a while back.

I don’t have a problem with parallel functionality itself, but I do want rate limiting before we start making functionality available on the hosted instance, which always runs the latest version of bors. Creating hundreds of branches at once sounds like a good way to get cut off by GitHub’s abuse prevention.

If your fork works well for you, then you can come back with experience reports. “We used it for months and it was never a problem” is a good way to respond to theoretical concerns like I have here.

So, if I were you, I’d start planning for a long term fork.

bors will append the batch number to the staging branch name when creating it

To avoid the quadratic behaviour, could we name the staging branch after the commit that it contains?

I also just have a hunch that it should be possible to do something smart to avoid the quadratic behaviour. I think the cost of issuing a quadratic number of builds might be a bit too high for us. Eight scheduled merges would yield 1+2+3+4+5+6+7+8=36 builds, right? I think that's a lot compared to 8.

I hadn't thought about that - thank you. We'll definitely need some sort of rate limiting before running this on a real repo, even if it's just "too busy try again later".

Yup, I was anticipating that.

I'm not sure what improvement we would gain by naming the staging branch based on a commit hash - I picked the batch number because it's guaranteed to be unique to the batch.

I picked the quadratic approach I am working towards because, assuming unlimited CI resources, it is the only way I can think of to guarantee that bad PRs won't hold up good ones. There are plenty of other trade-offs we could make, but all of them compromise on building a newly r+'ed PR alongside every combination of PRs that haven't finished merging, and therefore would also require retry logic and extra delays.
