Bug 952587
| Summary: | Scheduling deadlock for multihost tests with pending status updates | | |
|---|---|---|---|
| Product: | [Retired] Beaker | Reporter: | Nick Coghlan <ncoghlan> |
| Component: | scheduler | Assignee: | Raymond Mancy <rmancy> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | xjia <xjia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 0.12 | CC: | asaha, dcallagh, ebaak, junichi.nomura, kueda, llim, mcsontos, qwan, rglasz, rmancy, tatsu-ab1, tnishimura, xjia, xtian |
| Target Milestone: | 0.13.x | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | Scheduler | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 889065 | Environment: | |
| Last Closed: | 2013-07-11 02:44:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 889065 | | |
| Bug Blocks: | | | |
Description
Nick Coghlan
2013-04-16 09:07:25 UTC
Nick has offered some suggestions for dealing with the systems on the scheduler side, so we don't have to make expensive database calls. This may be a viable solution.

So, in reviewing the patch, we (mostly dcallagh) realised that the core problem here is the fact that "update_dirty_jobs()" runs in a different thread from the main scheduling loop. This means that after calling queue_processed_recipes() in the main loop, some of the queued recipes *may not* yet be marked as queued - they'll be marked as processed+dirty instead.

We have a longer term plan to make this whole arrangement more sensible (see http://beaker-project.org/dev/proposals/event-driven-scheduler.html), but we want a simpler near term fix.

Dan's suggestion (and I now agree) is that we drop the separate "update_dirty_jobs_loop" and instead change main_recipes_loop to look something like this:

```python
@log_traceback(log)
def main_recipes_loop(*args, **kwargs):
    while running:
        updated = update_dirty_jobs()
        abort_dead_recipes()
        updated = update_dirty_jobs() or updated
        if _virt_enabled():
            virt = provision_virt_recipes()
        else:
            virt = False
        updated = update_dirty_jobs() or updated
        queued = schedule_queued_recipes()
        updated = update_dirty_jobs() or updated
        scheduled = provision_scheduled_recipesets()
        updated = update_dirty_jobs() or updated
        if not virt and not queued and not scheduled and not updated:
            event.wait()
    log.debug("main recipes thread exiting")
```

It's a clumsy fix, but it will eliminate this issue and should also eliminate the database deadlock that can occur when both threads try to update the system_recipe_map table.

(In reply to Nick Coghlan from comment #3)
> So, in reviewing the patch, we (mostly dcallagh) realised that the core
> problem here is the fact that "update_dirty_jobs()" runs in a different
> thread from the main scheduling loop. This means that after calling
> queue_processed_recipes() in the main loop, some of the queued recipes *may
> not* yet be marked as queued - they'll be marked as processed+dirty instead.

This has 'always' (well, since 0.12) been the case, and was by design. The need for this BZ is exactly because of this.

> We have a longer term plan to make this whole arrangement more sensible (see
> http://beaker-project.org/dev/proposals/event-driven-scheduler.html), but we
> want a simpler near term fix.
>
> Dan's suggestion (and I now agree) is that we drop the separate
> "update_dirty_jobs_loop" and instead change main_recipes_loop to look
> something like this:
>
>     @log_traceback(log)
>     def main_recipes_loop(*args, **kwargs):
>         while running:
>             updated = update_dirty_jobs()
>             abort_dead_recipes()
>             updated = update_dirty_jobs() or updated
>             if _virt_enabled():
>                 virt = provision_virt_recipes()
>             else:
>                 virt = False
>             updated = update_dirty_jobs() or updated
>             queued = schedule_queued_recipes()
>             updated = update_dirty_jobs() or updated
>             scheduled = provision_scheduled_recipesets()
>             updated = update_dirty_jobs() or updated
>             if not virt and not queued and not scheduled and not updated:
>                 event.wait()
>         log.debug("main recipes thread exiting")
>
> It's a clumsy fix, but it will eliminate this issue and should also
> eliminate the database deadlock that can occur when both threads try to
> update the system_recipe_map table.

I think you meant to say "when both threads try to lock the same system record".

Did you mean to include process_new_recipes() in this loop? That will need to be in there to fix the deadlocking issue as well.
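For illustration only, here is a minimal sketch of what the serialised loop might look like with process_new_recipes() folded in, as the question above asks. It reuses the beakerd function names mentioned in this bug and assumes their signatures are unchanged; it is not the patch that actually landed.

```python
# Sketch only: the serialised main loop with process_new_recipes() included.
# All helpers (process_new_recipes, update_dirty_jobs, abort_dead_recipes,
# provision_virt_recipes, schedule_queued_recipes,
# provision_scheduled_recipesets, _virt_enabled) are the existing beakerd
# routines named in this bug, assumed unchanged.
@log_traceback(log)
def main_recipes_loop(*args, **kwargs):
    while running:
        # Flush pending status updates before each step so the step sees
        # recipes in their final state (queued, not processed+dirty).
        updated = update_dirty_jobs()
        processed = process_new_recipes()
        updated = update_dirty_jobs() or updated
        abort_dead_recipes()
        updated = update_dirty_jobs() or updated
        if _virt_enabled():
            virt = provision_virt_recipes()
        else:
            virt = False
        updated = update_dirty_jobs() or updated
        queued = schedule_queued_recipes()
        updated = update_dirty_jobs() or updated
        scheduled = provision_scheduled_recipesets()
        updated = update_dirty_jobs() or updated
        # Sleep only when an entire pass did no work at all.
        if not (processed or virt or queued or scheduled or updated):
            event.wait()
    log.debug("main recipes thread exiting")
```

Because everything now runs in the one thread, no second thread can be holding a lock on the same system record while a scheduling step runs, which is what removes the deadlock described above.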
I guess the obvious problem with this is that serialising it means longer processing times. This is better than the deadlocks though.

FWIW I don't think this is a proper verification. However, given the way the scheduler has been changed, the original conditions cannot be reproduced. So I think that it's fine for this to stay 'VERIFIED'.

Beaker 0.13.2 has been released. (http://beaker-project.org/docs/whats-new/release-0.13.html#beaker-0-13-2)