Miscellaneous Crashes put queue into bad state

tr-argo · January 29, 2019, 2:59am

Hello,

I’m seeing miscellaneous crashes that tend to put the queue in the bad state. One of the first calls I’ve seen create issues is

{:timeout,
 {GenServer, :call,
  [
    BorsNG.GitHub,
    {:post_commit_status, {{:installation, 22}, 32},
     {"sha", :error,
      "Build failed",
      "https://<bors-urrl>"}},
    5000
  ]}}

When I examine the PR, I see the failed status was correctly posted to the PR. I have bors connected to a Github Enterprise instance, and I verified in the GHE logs that the response returned with in a few ms.

Another call I’ve seen create issues

{:timeout,
 {GenServer, :call,
  [
    BorsNG.GitHub,
    {:get_commit_status, {{:installation, 22}, 32},
     {"sha"}},
    5000
  ]}}

For this request, I’m able to see the entries in the GHE logs show times responded without issue.

PRs that were queued up at the time are not picked rescheduled, and they need to be bors r- before they can be bors r+ again…

Around this time these crashes happen I notice webhooks (from GHE to bors) were taking > 10s to return. From what I can tell this beghins to happen when GHE is under high load (usually sending a lot of webhooks). The load on the instance running bors is never fully utilized (at least according to monitoring metrics).

I have a couple questions

Is there a known case that could generate the crashes above
Is there a known case that could generate long webhook response times

I’d be happy to provide more information to debug if possible.

wchen9911 · January 29, 2019, 7:16am

More details of the above issue.

Now we believe the timeout is not a GenServer.call() timeout, instead, it is a timeout from httpoison/hackney. After increased the default GenServer.call() timeout to 10s, we now see the nested timeout exception from httpoison

    ** (EXIT) an exception was raised:
    ** (HTTPoison.Error) :timeout
        (httpoison) lib/httpoison.ex:66: HTTPoison.request!/5
        (bors) lib/github/github/server.ex:118: BorsNG.GitHub.Server.do_handle_call/3
        (bors) lib/github/github/server.ex:584: BorsNG.GitHub.Server.use_token!/3
        (stdlib) gen_server.erl:661: :gen_server.try_handle_call/4
        (stdlib) gen_server.erl:690: :gen_server.handle_msg/6
        (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

I believe the default connection timeout for httpoison is 8000ms, so it didn’t show at first.

Not familiar with the hackney client, my understanding is the connection was not able to be established in 8000ms. What could be the cause?

FWIW, we have seen some intermittent DNS issue where some hosts can’t resolve our github enterprise hostname, could that shown as connection timeout for hackney?

Thank you!

notriddle · January 29, 2019, 3:51pm

From what I’ve tested, that would’ve produced an :nxdomain error. Similarly, if it was a connection timeout, it would have produced a :connect_timeout error, if it had run out of its connection pool, it would also have produced a :connect_timeout error, and if it had already started downloading data, it would have been a :recv_timeout.

The :timeout error is supposed to appear when HTTPoison/Hackney makes a successful connection to the server, sends its request, and doesn’t get back a response within the timeout window.

I should probably switch to a different HTTP library, though. This is the second time I’ve run into an apparent bug in HTTPoison. Time 1 and now I think this is your issue.

wchen9911 · January 30, 2019, 1:37am

Thanks for the explanation, Michael.

Anyway we can enable more logging for the lib?

notriddle · January 30, 2019, 1:40am

I know the Hackney metrics API exists.

But I’ve already created a pull request switching bors-ng off of Hackney and onto httpc. I’ve done some basic testing on it, but would you mind trying it under load to see if it behaves better?

wchen9911 · January 30, 2019, 2:52am

I’d love to.

So we checked the log on our github enterprise side, its haproxy log showed everything was processed in 145 ms.

`Jan 29 05:24:58 github-[redacted]-primary haproxy[1209]: 10.225.5.149:57710 [29/Jan/2019:05:24:58.681] https_protocol~ web_unicorns/localhost 0/0/0/145/145 201 2538 - - ---- 364/363/3/4/0 0/0 {[redacted]||hackney/1.15.0} “POST /api/v3/repositories/32/statuses/a7be9956e080e8168aa7292cf20c6efc7b15781b HTTP/1.1”

So it seems the culprit is somewhere in the http client? maybe some messed up reused connection from the pool?

Anywyay, I’d love to test the new lib. how can we get it? a branch?

wchen9911 · January 30, 2019, 3:13am

btw, we also see long response time from GHE webhooks. Do you think it might be the http client or maybe the erlang VM? The box we provisioned is x5.xlarge, we don’t see the cpu, network, or anything is stressed. Is there something we can tune?

notriddle · January 30, 2019, 3:15am

Yeah, it’s in a branch. https://github.com/bors-ng/bors-ng/tree/tesla-http

No, I’m pretty sure it’s just Hackney’s pool manager being fragile when kept near exhaustion. The long webhook response times come from the webhook making HTTP requests.

wchen9911 · January 30, 2019, 3:55am

The long webhook response times come from the webhook making HTTP requests.

I am talking about the webhook of bors called from GHE. Are you suggesting GHE is slow? Or you are confirming Hackney’s pool manager is the culprit in that case? FWIW, those webhook calls (to bors) are pretty fast in general, but slow some time, and anecdotally around the same time we see the timeout exception on bors side.

notriddle · January 30, 2019, 4:20am

Wait, which one of these are you talking about:

Are you saying that it’s taking a long time for GitHub to make its webhook requests?

If you’re asking me to help with the your GH Enterprise being slow, then you are barking up the wrong tree. Ask GitHub Support for help with that.
Are you saying that GitHub is making its webhook requests quickly, but that it is taking a long time for bors to respond?

If this is the problem, then, yeah, the HTTP Client pool is the culprit. Bors has to make HTTP requests sometimes when it’s handling a webhook, so the webhook response is slow if HTTP is slow.

wchen9911 · February 9, 2019, 8:19pm

I was saying the latter.

BTW, we did some DNS tuning, and so far it hasn’t happened again. We have not tried the tesla-http branch yet. We will see what happens, and will try to update here if there’s new development.

notriddle · February 9, 2019, 11:42pm

I tested it some more myself and merged the tesla-http branch into master. So you can just upgrade to try it out.