Hello,
I’m seeing miscellaneous crashes that tend to leave the queue in a bad state. One of the first calls I’ve seen cause issues is:
{:timeout,
{GenServer, :call,
[
BorsNG.GitHub,
{:post_commit_status, {{:installation, 22}, 32},
{"sha", :error,
"Build failed",
"https://<bors-url>"}},
5000
]}}
When I examine the PR, I see that the failed status was correctly posted to it. I have bors connected to a GitHub Enterprise (GHE) instance, and I verified in the GHE logs that the response was returned within a few ms.
Another call I’ve seen cause issues:
{:timeout,
{GenServer, :call,
[
BorsNG.GitHub,
{:get_commit_status, {{:installation, 22}, 32},
{"sha"}},
5000
]}}
For this request as well, the entries in the GHE logs show that it responded without issue.
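For context on why the fast GHE responses confuse me: as far as I understand it, the 5000 ms limit in GenServer.call covers the time a request spends waiting in the server's mailbox, not just the time spent handling it. Here is a minimal sketch of that behaviour. SlowServer and the {:work, ms} message are hypothetical names for illustration only, not bors code, and I'm only assuming something similar could be happening inside BorsNG.GitHub.

```elixir
defmodule SlowServer do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: {:ok, nil}

  # Each call is handled one at a time; sleeping stands in for slow work.
  @impl true
  def handle_call({:work, ms}, _from, state) do
    Process.sleep(ms)
    {:reply, :ok, state}
  end
end

{:ok, pid} = SlowServer.start_link()

# A first caller keeps the server busy for 300 ms.
spawn(fn -> GenServer.call(pid, {:work, 300}) end)
Process.sleep(50)

# The second call needs only 10 ms of work, but it queues behind the
# first one, so it runs past its 100 ms timeout and the caller exits.
result =
  try do
    GenServer.call(pid, {:work, 10}, 100)
  catch
    :exit, {:timeout, {GenServer, :call, _args}} -> :timed_out
  end

IO.inspect(result)
```

If that is what is going on, the exit reason would have the same {:timeout, {GenServer, :call, [...]}} shape as the crashes above even though each individual request to GHE is fast.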
PRs that were queued at the time are not picked up or rescheduled, and they need a bors r- before they can be bors r+ again…
Around the time these crashes happen, I notice that webhooks (from GHE to bors) are taking > 10s to return. From what I can tell, this begins to happen when GHE is under high load (usually when it is sending a lot of webhooks). The instance running bors is never fully utilized (at least according to monitoring metrics).
I have a couple of questions:
- Is there a known case that could generate the crashes above?
- Is there a known case that could generate long webhook response times?
I’d be happy to provide more information to debug if possible.