Bug report: Travis should not look at the `codecov` status if the Travis status check is still pending

DilumAluthge · July 31, 2019, 6:11am

Codecov has a status called codecov/project which succeeds if the entire project's test coverage on the staging branch is greater than the entire project's test coverage on the master branch, and fails otherwise.

I have a project (https://github.com/bcbi/ModelSanitizer.jl) that has 100% coverage on the master branch. Therefore, the codecov/project status will fail on the staging branch if the coverage on the staging branch falls below 100%. This is good - I want to enforce the policy that our coverage must never fall below 100% on the master branch.

Therefore, I have required the codecov/project status in bors.toml, i.e. my bors.toml file includes the following lines:

status = [
    "Travis CI - Branch",
    "codecov/project",
]

My test suite is divided into multiple Travis jobs. Each Travis job submits a coverage report to Codecov, and Codecov merges all of the reports together.

The coverage on each of the Travis jobs is less than 100%, but after all of the reports are merged together, the coverage is 100%.

Unfortunately, the Travis jobs do not all complete at the same time. Therefore, the coverage reports are not all submitted to CodeCov at the same time. An example timeline may look like this:

Bors pushes to the staging branch, and the Travis jobs start. For the sake of this example, suppose that there are two Travis jobs.
- Current state of GitHub status checks:
  - Travis CI - Branch: pending
  - codecov/project: this GitHub status does not yet exist
Five minutes later, Travis job #1 finishes and submits a coverage report to Codecov with 80% coverage. Since this is the only report, Codecov has nothing else to merge. Codecov creates the codecov/project GitHub status check and marks it as failing because the coverage of 80% is less than 100%.
- Current state of GitHub status checks:
  - Travis CI - Branch: pending
  - codecov/project: failure (coverage decreased from 100% to 80%)
Ten minutes later, Travis job #2 finishes and submits a coverage report to CodeCov with 60% coverage. Codecov now has two reports total, so it merges the two reports and finds that the total coverage is now 100%. Codecov updates the existing codecov/project GitHub status check and changes it from failing to passing because the coverage is 100%.
- Current state of GitHub status checks:
  - Travis CI - Branch: success
  - codecov/project: success (coverage 100% unchanged from previous)

Unfortunately, at step 2 above, Bors sees that the codecov/project status check has failed, and thus Bors fails the build.

However, the state of the codecov/project status at the end of step 2 is not accurate, because we will not know the true project coverage until all of the Travis builds have finished.

This is a bug because at the end of step 3, all of the status checks have succeeded, which means that Bors should have merged. However, because Bors looked at the codecov/project status at the end of step 2, Bors incorrectly failed the build.

Workaround

There is a workaround, but it's not great. You remove "codecov/project" from the [status] section of bors.toml, and you add codecov/project as a required status check in the GitHub protected branches setting for the master branch:

With this workaround, if the total coverage on the staging branch is less than 100%, GitHub won't allow bors to fast-forward the master branch. This causes Bors to crash with the following error:

Crash 	batch 	

{{:badmatch,
  {:error, :push, 422,
   "{\"message\":\"Required status check \\\"codecov/project\\\" is failing.\",\"documentation_url\":\"https://help.github.com/articles/about-protected-branches\"}"}},
 [
   {BorsNG.Worker.Batcher, :complete_batch, 3,
    [file: 'lib/worker/batcher.ex', line: 390]},
   {BorsNG.Worker.Batcher, :maybe_complete_batch, 1,
    [file: 'lib/worker/batcher.ex', line: 376]},
   {BorsNG.Worker.Batcher, :poll_, 1,
    [file: 'lib/worker/batcher.ex', line: 202]},
   {BorsNG.Worker.Batcher, :handle_info, 2,
    [file: 'lib/worker/batcher.ex', line: 184]},
   {:gen_server, :try_dispatch, 4,
    [file: 'gen_server.erl', line: 637]},
   {:gen_server, :handle_msg, 6,
    [file: 'gen_server.erl', line: 711]},
   {:proc_lib, :init_p_do_apply, 3,
    [file: 'proc_lib.erl', line: 249]}
 ]}

This is good because Bors was not allowed to fast-foward master to a commit with less than 100% code coverage!

This is bad because instead of posting a comment on the pull request indicating that the push failed, Bors crashes. Therefore, I receive no notification that the build failed.

DilumAluthge · August 1, 2019, 3:47am

I think that there are at least two different ways to fix this bug:

[HARD] Modify Bors so that it ignores "failed" CodeCov statuses if the Travis status is still pending. This will probably be pretty hard, and I'm not sure if we should make those changes to Bors.
[EASIER] If Bors encounters a "422 - Required status check is failing" when it tries to fast-forward master, it posts a comment on all of the pull requests in the batch saying "The Bors build succeeded but Bors encountered an error when trying to push to master."

I think option 2 is easier and less disruptive to the Bors codebase.

DilumAluthge · August 4, 2019, 12:57am

Bump.

@notriddle @umamaistempo Could you take a look at this and let me know what you think?

notriddle · August 4, 2019, 4:01am

I'm worried that neither option would actually be enough.

When CodeCov receives a coverage dump from Travis, while they may or may not update their own database synchronously (I don't know; I've never checked), they almost certainly has a queue between their API endpoint and sending completion reports to the GitHub API.

But the problem with this is that it means Travis might report a finished GitHub Check before CodeCov is done updating its own GitHub Check. In other words, if the only thing we wait on is Travis, we have a race condition. We have to go with option three: bors has to wait for CodeCov to go green, with a timeout in case it never does.

DilumAluthge · August 4, 2019, 5:33am

Would we make this a special case for CodeCov?

Or would we make it a new feature? i.e. we add a new option status_force_wait (needs a better name obviously). Statuses listed in status will follow the current behavior. Statuses listed in status_force_wait would be the ones for which Bors waits until they succeed (or Bors times out)

notriddle · August 4, 2019, 6:10am

New feature. Special-casing should be reserved for extreme circumstances, and should preferably be done in a way that has little effect on the codebase.

DilumAluthge · August 5, 2019, 7:16am

If it’s going to be a new feature, we’ll need an RFC. I’ll write it up.

What do you suggest we call the new configuration option in bors.toml? I was initially thinking status_force_wait, but that’s a pretty long name. It would be good to come up with something a little shorter.

DilumAluthge · August 5, 2019, 7:17am

Another option would be status_wait_green. Still pretty long, but maybe it’s more descriptive?

notriddle · August 5, 2019, 3:56pm

Probably this one, or something similar like status_wait_success.

DilumAluthge · August 14, 2019, 2:24am

@notriddle I am working on the RFC now. What is the default value of timeout_sec? (I am writing a little bit of explanation on how Bors will time out if the statuses in status_wait_success never turn green, and it would help to mention the default value of timeout_sec.)

notriddle · August 14, 2019, 4:41am

It's one hour.