Debugging possible "concurrent batcher" executions

I'm debugging an issue with our Bors deployment. The logs indicate that two "threads of execution" (not sure what the proper Elixir word is here) of Batcher.start_waiting_merged_batch ran concurrently for the same Batch. The project is configured to do squash merging, and we can see the same PR appearing multiple times in our into_branch.

I'm still new to Elixir, so any help or guidance you can provide would be much appreciated.

During the debugging process, we have ruled out the possibility of multiple Bors executables running at the same time.

Some questions that I'm trying to answer are:

  1. How does Bors-NG/Elixir/Erlang prevent concurrent executions of Batcher.start_waiting_merged_batch?
  2. How does Bors-NG/Elixir/Erlang prevent concurrent executions of Batcher.poll_? (may be the same answer as above)

Thanks in advance,
Adam

After doing more research into the issue, we believe that it is possible that multiple :poll messages were delivered to a project's Batcher server in a short amount of time. From our understanding, if a server received multiple :poll messages it would process them concurrently.
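
To check this assumption outside of Bors, we put together a minimal sketch of a toy GenServer. Everything in it (the PollProbe module, the 200 ms sleep standing in for slow batch work, the overlapped? helper) is hypothetical and not taken from the Bors-NG code. It sends two :poll messages back to back and reports whether the handling of the second one started before the first one finished.

```elixir
defmodule PollProbe do
  # Hypothetical toy server for experimenting; not Bors-NG code.
  use GenServer

  @impl true
  def init(intervals), do: {:ok, intervals}

  # Simulate slow work for each :poll and record {started, finished} timestamps.
  @impl true
  def handle_info(:poll, intervals) do
    started = System.monotonic_time(:millisecond)
    Process.sleep(200)
    finished = System.monotonic_time(:millisecond)
    {:noreply, [{started, finished} | intervals]}
  end

  # Return the recorded intervals in the order they were handled.
  @impl true
  def handle_call(:intervals, _from, intervals) do
    {:reply, Enum.reverse(intervals), intervals}
  end

  # Send two :poll messages in quick succession and report whether
  # the second one started being handled before the first one finished.
  def overlapped? do
    {:ok, pid} = GenServer.start_link(__MODULE__, [])
    send(pid, :poll)
    send(pid, :poll)
    [{_s1, f1}, {s2, _f2}] = GenServer.call(pid, :intervals)
    GenServer.stop(pid)
    s2 < f1
  end
end

IO.inspect(PollProbe.overlapped?(), label: "overlapped")
```

If this prints true, the two messages were handled at overlapping times; if it prints false, they were handled one after the other, and our understanding above would need revisiting.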

Looking deeper at the code that handles :poll messages in Batcher, there can be a long time between when a) "a waiting batch is chosen to be started" in Batcher:221, and b) "a started batch is marked as running" in Batcher:339. During our incident, our logs indicate this time was at least 2 minutes. This analysis indicates a possible race condition in the code if two :poll messages were received less than 2 minutes apart.

Looking over the code that sends :poll messages, we've found:

  1. inside Batcher.handle_info, i.e. polling every 30 minutes
  2. after significant batch processing events in Batcher (like when a merge conflict is detected, a batch is successful, a batch fails, etc.)
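
To make sure we are describing the same thing, here is a rough sketch of how we understand these two sources feeding one server's mailbox. It is a toy, not the Bors-NG implementation: the PollSources module, the :batch_event cast, the 200 ms period (standing in for 30 minutes) and the {:poll, :timer} / {:poll, :event} tags are all names we made up so the output shows where each poll came from.

```elixir
defmodule PollSources do
  # Hypothetical toy server; not Bors-NG code.
  use GenServer

  # Stand-in for the real 30 minute interval.
  @poll_period 200

  @impl true
  def init(polls) do
    # Source 1: schedule the first periodic poll.
    Process.send_after(self(), {:poll, :timer}, @poll_period)
    {:ok, polls}
  end

  @impl true
  def handle_info({:poll, source}, polls) do
    # Only timer polls reschedule themselves, so there is a single timer chain.
    if source == :timer do
      Process.send_after(self(), {:poll, :timer}, @poll_period)
    end

    {:noreply, [source | polls]}
  end

  @impl true
  def handle_cast(:batch_event, polls) do
    # Source 2: a batch processing event requests an immediate poll.
    send(self(), {:poll, :event})
    {:noreply, polls}
  end

  @impl true
  def handle_call(:polls, _from, polls), do: {:reply, Enum.reverse(polls), polls}

  # Trigger one event, wait past one timer tick, and list the polls seen so far.
  def demo do
    {:ok, pid} = GenServer.start_link(__MODULE__, [])
    GenServer.cast(pid, :batch_event)
    Process.sleep(300)
    polls = GenServer.call(pid, :polls)
    GenServer.stop(pid)
    polls
  end
end

IO.inspect(PollSources.demo(), label: "polls received")
```

The point of the sketch is only that both sources deliver to the same process, so an event-driven poll can arrive shortly before or after a timer-driven one.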

Thinking about all of this, we have a couple more questions:

  1. Is our analysis of the situation correct? (i.e. can one Batcher server process multiple :poll messages concurrently), and
  2. If this is true, what is the best way to go about fixing this?

Thank you again!
Adam

Moving this discussion over to a GitHub Issue.