I'm debugging an issue with our Bors deployment. The logs indicate that two "threads of execution" (I'm not sure what the proper Elixir term is) were running Batcher.start_waiting_merged_batch for the same Batch concurrently. The project is configured to do squash merging, and we can see multiple instances of the same PR in our into_branch.
I'm still new to Elixir, so any help or guidance you can provide will be really helpful.
During the debugging process, we have ruled out the possibility of multiple Bors executables running at the same time.
Some questions that I'm trying to answer are:
- How does Bors-NG/Elixir/Erlang prevent concurrent executions of Batcher.start_waiting_merged_batch?
- How does Bors-NG/Elixir/Erlang prevent concurrent executions of Batcher.poll_? (This may have the same answer as above.)
Thanks in advance,
Adam
After doing more research into the issue, we believe it is possible that multiple :poll messages were delivered to a project's Batcher server in a short amount of time. From our understanding, if a server received multiple :poll messages, it would process them concurrently.
Looking deeper at the code that handles :poll messages in Batcher, there can be a long time between (a) "a waiting batch chosen to be started" in Batcher:221 and (b) "a started batch is marked as running" in Batcher:339. During our incident, our logs indicate this time was at least 2 minutes. This analysis indicates a possible race condition in the code if two :poll messages were received less than 2 minutes apart.
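One way to check the concurrency assumption is a small experiment outside of Bors. The sketch below is not Bors-NG code: `PollProbe` is a hypothetical toy module, and the 100 ms sleep is only a stand-in for slow batch work. It sends two :poll messages back to back and records when each handler started and finished. In a plain GenServer, a process takes one message at a time from its mailbox, so the second handler should not start until the first has returned. Assuming Batcher is an ordinary GenServer, the same would hold for its :poll handling.

```elixir
defmodule PollProbe do
  # Hypothetical toy server for the experiment; not part of Bors-NG.
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, [])

  # Returns the recorded {started, finished} pairs in handling order.
  def events(pid), do: GenServer.call(pid, :events)

  @impl true
  def init(_), do: {:ok, []}

  @impl true
  def handle_info(:poll, events) do
    started = System.monotonic_time(:millisecond)
    # Stand-in for slow work between "batch chosen" and "batch marked running".
    Process.sleep(100)
    finished = System.monotonic_time(:millisecond)
    {:noreply, [{started, finished} | events]}
  end

  @impl true
  def handle_call(:events, _from, events) do
    {:reply, Enum.reverse(events), events}
  end
end

{:ok, pid} = PollProbe.start_link()

# Two polls sent back to back, far closer together than the handler's runtime.
send(pid, :poll)
send(pid, :poll)

# The call is queued behind both polls, so it returns after both are handled.
[{_s1, f1}, {s2, _f2}] = PollProbe.events(pid)

IO.inspect(s2 >= f1, label: "second poll started after first finished")
# prints: second poll started after first finished: true
```

If the handler itself spawns a task or another process to do the slow work, that work can still overlap with the next :poll, so this experiment only covers what runs inside the callback.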
Looking over the code that sends :poll messages, we've found:
- inside Batcher.handle_info here, i.e. polling every 30 minutes
- after significant batch processing events in Batcher (like when a merge conflict is detected, a batch is successful, a batch fails, etc.)
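For readers less familiar with the pattern, the first source above (a handler that re-arms its own timer) can be sketched roughly as follows. This is an illustration under assumptions, not the Bors-NG implementation: `PollTimer`, `poll_count/1`, and the `polls` counter are hypothetical names, and the 30-minute default is taken from the description above.

```elixir
defmodule PollTimer do
  # Hypothetical example module; not Bors-NG code.
  use GenServer

  # Assumed interval, taken from the "every 30 minutes" description.
  @poll_period :timer.minutes(30)

  def start_link(period \\ @poll_period) do
    GenServer.start_link(__MODULE__, period)
  end

  # Returns how many :poll messages have been handled so far.
  def poll_count(pid), do: GenServer.call(pid, :poll_count)

  @impl true
  def init(period) do
    schedule_poll(period)
    {:ok, %{period: period, polls: 0}}
  end

  @impl true
  def handle_info(:poll, state) do
    # Real work would go here; afterwards, arm the timer for the next round.
    schedule_poll(state.period)
    {:noreply, %{state | polls: state.polls + 1}}
  end

  @impl true
  def handle_call(:poll_count, _from, state) do
    {:reply, state.polls, state}
  end

  # Delivers :poll to this same process after `period` milliseconds.
  defp schedule_poll(period), do: Process.send_after(self(), :poll, period)
end
```

In this sketch every handled :poll arms a new timer, so a :poll sent from elsewhere (the second source in the list) would add to the number of :poll messages in flight rather than replace the scheduled one. Whether Batcher behaves the same way depends on its actual code.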
Thinking about all of this, we have a couple more questions:
- Is our analysis of the situation correct? (i.e. can one Batcher server process multiple :poll messages concurrently), and
- If this is true, what is the best way to go about fixing this?
Thank you again!
Adam
Moving this discussion over to a GitHub Issue.