Bug 1558776
Summary: | [Regression] beakerd tries to assign a system to a recipe which is already allocated | ||
---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Roman Joost <rjoost> |
Component: | scheduler | Assignee: | Dan Callaghan <dcallagh> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Roman Joost <rjoost> |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | 25 | CC: | achatter, dcallagh, mtyson, rjoost |
Target Milestone: | 25.1 | Keywords: | Patch, Regression |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-04-11 05:19:59 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Roman Joost
2018-03-21 01:10:11 UTC
Presumably caused by bug 1519589. That bug introduced a new field system.scheduler_status with three possible states: * reserved = system is reserved, scheduler should ignore it * pending = system is not reserved, scheduler should check if a queued recipe can use it * idle = system is not reserved, scheduler has already checked that no queued recipe can use it In this particular case, we see that the system in question became idle, and then got assigned to a queued recipe here: Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Checking for queued recipes which are runnable on apm-mustang-ev3-17.rhts.eng.bos.redhat.com Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Assigning recipe 21235 to system apm-mustang-ev3-17.rhts.eng.bos.redhat.com Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.scheduler DEBUG Reserving system apm-mustang-ev3-17.rhts.eng.bos.redhat.com for recipe 21235 Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'User', action=u'Reserved', old=u'', new=u'rjoost', user=rjoost Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.inventory DEBUG Created reservation for system apm-mustang-ev3-17.rhts.eng.bos.redhat.com with type u'recipe', service u'Scheduler', user rjoost Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.tasklibrary DEBUG Generating task library snapshot Mar 21 00:06:26 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Created watchdog for recipe id: 21235 and system: apm-mustang-ev3-17.rhts.eng.bos.redhat.com Mar 21 00:06:26 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd INFO recipe ID 21235 moved from Queued to Scheduled Then it was provisioned in the next loop: Mar 21 00:06:30 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Provisioning recipe 21235 in RS:18480 Mar 21 00:06:31 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'Distro Tree', action=u'Provision', old=u'', new=u'RHEL-7.3-20161019.0 Server aarch64', user=rjoost *But* then the scheduler tried assigning the system to a queued recipe again, even though it was already reserved: Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Checking for queued recipes which are runnable on apm-mustang-ev3-17.rhts.eng.bos.redhat.com Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Assigning recipe 21243 to system apm-mustang-ev3-17.rhts.eng.bos.redhat.com Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.scheduler DEBUG Reserving system apm-mustang-ev3-17.rhts.eng.bos.redhat.com for recipe 21243 Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd ERROR Error in schedule_pending_system(58) [...] Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: StaleSystemUserException: User rjoost has already reserved system apm-mustang-ev3-17.rhts.eng.bos.redhat.com It seems as if the schedule_pending_systems() loop keeps iterating over the same system id many times, even after the first iteration has already assigned a recipe to the system. Ah, no, the loop doesn't keep iterating the same system many times. The whole *scheduler* itself keeps going through the entire main loop repeatedly. Each time schedule_pending_system(58) fails, the system is left in pending... which means the scheduler immediately tries again to schedule it. The actual problem is that something in update_dirty_jobs() seems to cause *all* systems in the entire database to be marked as Pending. I added a little debug message at the top of schedule_pending_systems(). This is what I observed: Mar 21 01:55:10 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Updating dirty job 14055 Mar 21 01:55:14 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.scheduler DEBUG Releasing system apm-mustang-ev3-17.rhts.eng.bos.redhat.com for recipe 21243 Mar 21 01:55:14 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'User', action=u'Returned', old=u'rjoost', new=u'', user=rjoost Mar 21 01:55:17 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Going to schedule pending systems [1L, 2L, 3L, 5L, 9L, 10L, 12L, 14L, 15L, 18L, 19L, 20L, 24L, 25L, 30L, 40L, 41L, 58L, 60L, 61L, 67L, 68L, 69L, 94L, 95L, 101L, 102L, 103L] On the first run through, all those systems go back to idle because there was nothing queued for them, except 58, which is assigned another queued recipe: Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Assigning recipe 21248 to system apm-mustang-ev3-17.rhts.eng.bos.redhat.com Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.scheduler DEBUG Reserving system apm-mustang-ev3-17.rhts.eng.bos.redhat.com for recipe 21248 Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'User', action=u'Reserved', old=u'', new=u'rjoost', user=rjoost Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.inventory DEBUG Created reservation for system apm-mustang-ev3-17.rhts.eng.bos.redhat.com with type u'recipe', service u'Scheduler', user rjoost Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.tasklibrary DEBUG Generating task library snapshot Mar 21 01:55:20 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Created watchdog for recipe id: 21248 and system: apm-mustang-ev3-17.rhts.eng.bos.redhat.com Mar 21 01:55:20 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd INFO recipe ID 21248 moved from Queued to Scheduled But as soon as the recipe gets provisioned and update_dirty_jobs() runs on it, *all* those systems suddenly go back to Pending: [...] Mar 21 01:55:21 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Updating dirty job 14055 Mar 21 01:55:25 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Provisioning recipe 21248 in RS:18493 Mar 21 01:55:27 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'Distro Tree', action=u'Provision', old=u'', new=u'Fedora-25 Cloud aarch64', user=rjoost Mar 21 01:55:27 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Updating dirty job 14055 Mar 21 01:55:31 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Going to schedule pending systems [1L, 2L, 3L, 5L, 9L, 10L, 12L, 14L, 15L, 18L, 19L, 20L, 24L, 25L, 30L, 40L, 41L, 58L, 60L, 61L, 67L, 68L, 69L, 94L, 95L, 101L, 102L, 103L] Okay, I found it. We were updating scheduler_status = Pending in the wrong place. https://gerrit.beaker-project.org/6037 However, note that the only side-effect of this bug, as far as I can see, is that the scheduler would go nuts in a busy loop failing to schedule an already-scheduled system. Once the recipe finished, the system would be scheduled as normal. So this bug does not actually prevent the scheduler from working normally *but* it does cause beakerd to spew a very large amount of log messages which will cause beakerd and rsyslog to consume a lot of CPU. Scheduled two self-tests in which the bug originated from. Both self tests ran through without scheduling problems. https://beaker-devel.app.eng.bos.redhat.com/jobs/14112 https://beaker-devel.app.eng.bos.redhat.com/jobs/14120 Beaker 25.1 has been released: https://beaker-project.org/docs/whats-new/release-25.html#beaker-25-1 |