Bug 958362
| Field | Value |
| --- | --- |
| Summary | database deadlock in beakerd |
| Product | [Retired] Beaker |
| Component | scheduler |
| Version | 0.12 |
| Status | CLOSED EOL |
| Severity | unspecified |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Dan Callaghan <dcallagh> |
| Assignee | beaker-dev-list |
| CC | mastyk, mboswell, qwan, tools-bugs |
| Keywords | Triaged |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2020-04-21 08:47:41 UTC |
| Attachments | latest deadlock (attachment 748048) |
Description (Dan Callaghan, 2013-05-01 07:04:34 UTC)
Trawling through this (and assuming I am reading it correctly), the main suspicious(!) piece of code I found is the call to self.resource.system.suspicious_abort() in the update code, which may then trigger a call to self.mark_broken(). The "session.flush()" in the second traceback is the first flush after that call. If this is accurate, we would see a report of a suspicious abort in the logs just before the deadlocks occur. We have some more recent occurrences of this, and there is no sign of any suspicious abort messages in the preceding log entries. Something else is going on :P
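(For illustration only: a rough sketch of the log check described above, scanning beakerd's log for suspicious-abort messages in the window before a deadlock. The log path, timestamp format, and the "suspicious" message text are assumptions made for this example, not Beaker's actual logging format.)

```python
import re
import sys
from datetime import datetime, timedelta

LOG_PATH = "/var/log/beaker/beakerd.log"   # assumed location
TS_FORMAT = "%Y-%m-%d %H:%M:%S"            # assumed timestamp format
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def suspicious_lines_before(deadlock_ts, window_minutes=10):
    """Yield log lines mentioning 'suspicious' in the window before the deadlock."""
    start = deadlock_ts - timedelta(minutes=window_minutes)
    with open(LOG_PATH) as log:
        for line in log:
            match = TS_RE.match(line)
            if not match:
                continue
            ts = datetime.strptime(match.group(1), TS_FORMAT)
            if start <= ts <= deadlock_ts and "suspicious" in line.lower():
                yield line.rstrip()

if __name__ == "__main__":
    # Usage: python check_logs.py "2013-05-01 07:00:00"
    deadlock_ts = datetime.strptime(sys.argv[1], TS_FORMAT)
    for line in suspicious_lines_before(deadlock_ts):
        print(line)
```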
Created attachment 748048 [details]: latest deadlock

I've seen a couple of outputs from SHOW ENGINE INNODB STATUS now, but they have all been the same as the above attachment. So my theory is that there is a session S1 that updates system A, then tries to update system B and waits. At the same time there is another session S2 (I'm assuming this is process_new_recipes) that gets an S lock on system B (via adding a row in system_recipe_map), then tries to get an S lock on system A and waits. This is what gives us the deadlock. Still unsure of the code path that creates S1.

OK, ncoghlan just enlightened me that it is most likely in schedule_queued_recipes(). As of 0.12 we wrap the whole block in one big session, so all it takes is for S1 to schedule two systems whose rows are also being inserted into system_recipe_map as part of S2. The timing could easily be such that we end up with a deadlock. The traceback above (in update_dirty_jobs) has the same root cause: updating multiple system rows in one transaction. (A minimal sketch of this lock ordering is included at the end of this report.)

Reproducer posted as http://gerrit.beaker-project.org/#/c/2023/

This appears to be a consequence of the way we currently do scheduling, which requires a cached mapping of recipes to possible systems (the system_recipe_map table) to attain reasonable performance. Eliminating the need for that cache is one of the goals of the event-driven scheduling redesign [1]. We can live with this misbehaviour until that is in place.

[1] http://beaker-project.org/dev/proposals/event-driven-scheduler.html

Closing this issue. We are not planning to address this problem within the Beaker development lifecycle; instead, we plan to continue our effort on Beaker.NEXT. If you have any questions, feel free to reach out to me.

Best regards,
Martin Styk
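(For illustration only: a minimal, self-contained sketch of the lock-ordering deadlock described in the theory above. It uses plain Python threading locks as stand-ins for the InnoDB row locks on system A and system B; the function names echo schedule_queued_recipes and process_new_recipes, but none of this is Beaker code. Unlike InnoDB, Python locks have no deadlock detection, so timeouts are used here to report the deadlock instead of rolling back one transaction.)

```python
import threading
import time

# Stand-ins for the InnoDB row locks on the two system rows.
system_a = threading.Lock()
system_b = threading.Lock()

def s1_schedule_queued_recipes():
    # S1: updates system A, then tries to update system B.
    with system_a:
        print("S1: holds lock on system A, now waiting for system B")
        time.sleep(0.1)  # give S2 time to grab system B first
        if system_b.acquire(timeout=2):
            system_b.release()
            print("S1: finished without deadlocking")
        else:
            print("S1: deadlock (gave up waiting for system B)")

def s2_process_new_recipes():
    # S2: inserting into system_recipe_map takes a (shared) lock on
    # system B, then on system A -- the opposite order to S1.
    with system_b:
        print("S2: holds lock on system B, now waiting for system A")
        time.sleep(0.1)  # give S1 time to grab system A first
        if system_a.acquire(timeout=2):
            system_a.release()
            print("S2: finished without deadlocking")
        else:
            print("S2: deadlock (gave up waiting for system A)")

s1 = threading.Thread(target=s1_schedule_queued_recipes)
s2 = threading.Thread(target=s2_process_new_recipes)
s1.start()
s2.start()
s1.join()
s2.join()
```

Run as an ordinary Python script: because the two threads take the same two locks in opposite orders, both almost always end up reporting a deadlock, which mirrors how schedule_queued_recipes and process_new_recipes can collide on the same pair of system rows.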
I've seen a couple of outputs now from SHOW ENGINE INNODB STATUS now, but they have all been the same as the above attachment. So my theory is that there is a session S1 that updates system A, then tries to update system B and waits. At the same time there is another session S2 (I'm assuming this is process_new_recipes) that gets an S lock on system B (via adding a row in system_recipe_map), then tries to get an S lock on system A and waits. This is what gives us the deadlock. Still unsure of the code path that creates S1. OK ncoghlan just enlightened me that it is most likely in the schedule_queued_recipes(). As of 0.12, we now wrap the whole block in one big session, so all we need to do is in S1 schedule two systems that are also being inserted into system_recipe_map as part of S2. The timing could easily be such that we end up with a deadlock. The reasons for the above traceback (in update_dirty_jobs) are for the exact same reason, updating multiple system rows in one transaction. Reproducer posted as http://gerrit.beaker-project.org/#/c/2023/ This appears to be a consequence of the way we currently do scheduling, which requires a cached mapping of recipes to possible systems to attain reasonable performance (the system_recipe_map table). Eliminating the need for that cache is one of the goals of the event driven scheduling [1] redesign. We can live with this misbehaviour until that is in place. [1] http://beaker-project.org/dev/proposals/event-driven-scheduler.html Closing this issue. We are not planning to address this problem in the Beaker development lifecycle. Instead of that, we are planning to continue our effort in building Beaker.NEXT. If you have any questions, feel free to reach out to me. Best regards, Martin Styk |