Bug 1558776

Summary: [Regression] beakerd tries to assign a system to a recipe which is already allocated
Product: [Retired] Beaker Reporter: Roman Joost <rjoost>
Component: schedulerAssignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE QA Contact: Roman Joost <rjoost>
Severity: high Docs Contact:
Priority: urgent    
Version: 25CC: achatter, dcallagh, mtyson, rjoost
Target Milestone: 25.1Keywords: Patch, Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-11 05:19:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Roman Joost 2018-03-21 01:10:11 UTC
Description of problem:

During self-tests we've encountered a regression in beakerd which results in many tracebacks (see log).

Version-Release number of selected component (if applicable):
25.0

How reproducible:
not sure at this point

Actual results:
beakerd is continously trying to assign a system which is already in use by a different recipe.

Expected results:
system is not picked and assigned to an already used system.

Additional info:
scheduler_status is pending

Comment 1 Dan Callaghan 2018-03-21 01:21:43 UTC
Presumably caused by bug 1519589.

That bug introduced a new field system.scheduler_status with three possible states:

* reserved = system is reserved, scheduler should ignore it
* pending = system is not reserved, scheduler should check if a queued recipe can use it
* idle = system is not reserved, scheduler has already checked that no queued recipe can use it

In this particular case, we see that the system in question became idle, and then got assigned to a queued recipe here:

Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Checking for queued recipes which are runnable on apm-mustang-ev3-17.rhts.eng.bos.redhat.com
Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Assigning recipe 21235 to system apm-mustang-ev3-17.rhts.eng.bos.redhat.com
Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.scheduler DEBUG Reserving system apm-mustang-ev3-17.rhts.eng.bos.redhat.com for recipe 21235
Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'User', action=u'Reserved', old=u'', new=u'rjoost', user=rjoost
Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.inventory DEBUG Created reservation for system apm-mustang-ev3-17.rhts.eng.bos.redhat.com with type u'recipe', service u'Scheduler', user rjoost
Mar 21 00:06:23 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.tasklibrary DEBUG Generating task library snapshot
Mar 21 00:06:26 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Created watchdog for recipe id: 21235 and system: apm-mustang-ev3-17.rhts.eng.bos.redhat.com
Mar 21 00:06:26 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd INFO recipe ID 21235 moved from Queued to Scheduled

Then it was provisioned in the next loop:

Mar 21 00:06:30 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Provisioning recipe 21235 in RS:18480
Mar 21 00:06:31 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'Distro Tree', action=u'Provision', old=u'', new=u'RHEL-7.3-20161019.0 Server aarch64', user=rjoost

*But* then the scheduler tried assigning the system to a queued recipe again, even though it was already reserved:

Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Checking for queued recipes which are runnable on apm-mustang-ev3-17.rhts.eng.bos.redhat.com
Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd DEBUG Assigning recipe 21243 to system apm-mustang-ev3-17.rhts.eng.bos.redhat.com
Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.model.scheduler DEBUG Reserving system apm-mustang-ev3-17.rhts.eng.bos.redhat.com for recipe 21243
Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]: bkr.server.tools.beakerd ERROR Error in schedule_pending_system(58)
[...]
Mar 21 00:06:36 beaker-devel.app.eng.bos.redhat.com beakerd[29562]:  StaleSystemUserException: User rjoost has already reserved system apm-mustang-ev3-17.rhts.eng.bos.redhat.com

Comment 2 Dan Callaghan 2018-03-21 01:44:58 UTC
It seems as if the schedule_pending_systems() loop keeps iterating over the same system id many times, even after the first iteration has already assigned a recipe to the system.

Comment 3 Dan Callaghan 2018-03-21 02:01:35 UTC
Ah, no, the loop doesn't keep iterating the same system many times. The whole *scheduler* itself keeps going through the entire main loop repeatedly. Each time schedule_pending_system(58) fails, the system is left in pending... which means the scheduler immediately tries again to schedule it.

The actual problem is that something in update_dirty_jobs() seems to cause *all* systems in the entire database to be marked as Pending.

I added a little debug message at the top of schedule_pending_systems(). This is what I observed:

Mar 21 01:55:10 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Updating dirty job 14055
Mar 21 01:55:14 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.scheduler DEBUG Releasing system apm-mustang-ev3-17.rhts.eng.bos.redhat.com for recipe 21243
Mar 21 01:55:14 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'User', action=u'Returned', old=u'rjoost', new=u'', user=rjoost
Mar 21 01:55:17 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Going to schedule pending systems [1L, 2L, 3L, 5L, 9L, 10L, 12L, 14L, 15L, 18L, 19L, 20L, 24L, 25L, 30L, 40L, 41L, 58L, 60L, 61L, 67L, 68L, 69L, 94L, 95L, 101L, 102L, 103L]

On the first run through, all those systems go back to idle because there was nothing queued for them, except 58, which is assigned another queued recipe:

Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Assigning recipe 21248 to system apm-mustang-ev3-17.rhts.eng.bos.redhat.com
Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.scheduler DEBUG Reserving system apm-mustang-ev3-17.rhts.eng.bos.redhat.com for recipe 21248
Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'User', action=u'Reserved', old=u'', new=u'rjoost', user=rjoost
Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.inventory DEBUG Created reservation for system apm-mustang-ev3-17.rhts.eng.bos.redhat.com with type u'recipe', service u'Scheduler', user rjoost
Mar 21 01:55:18 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.tasklibrary DEBUG Generating task library snapshot
Mar 21 01:55:20 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Created watchdog for recipe id: 21248 and system: apm-mustang-ev3-17.rhts.eng.bos.redhat.com
Mar 21 01:55:20 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd INFO recipe ID 21248 moved from Queued to Scheduled

But as soon as the recipe gets provisioned and update_dirty_jobs() runs on it, *all* those systems suddenly go back to Pending:

[...]
Mar 21 01:55:21 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Updating dirty job 14055
Mar 21 01:55:25 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Provisioning recipe 21248 in RS:18493
Mar 21 01:55:27 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=58L, service=u'Scheduler', field=u'Distro Tree', action=u'Provision', old=u'', new=u'Fedora-25 Cloud aarch64', user=rjoost
Mar 21 01:55:27 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Updating dirty job 14055
Mar 21 01:55:31 beaker-devel.app.eng.bos.redhat.com beakerd[597]: bkr.server.tools.beakerd DEBUG Going to schedule pending systems [1L, 2L, 3L, 5L, 9L, 10L, 12L, 14L, 15L, 18L, 19L, 20L, 24L, 25L, 30L, 40L, 41L, 58L, 60L, 61L, 67L, 68L, 69L, 94L, 95L, 101L, 102L, 103L]

Comment 4 Dan Callaghan 2018-03-21 03:17:03 UTC
Okay, I found it. We were updating scheduler_status = Pending in the wrong place.

https://gerrit.beaker-project.org/6037

However, note that the only side-effect of this bug, as far as I can see, is that the scheduler would go nuts in a busy loop failing to schedule an already-scheduled system. Once the recipe finished, the system would be scheduled as normal.

So this bug does not actually prevent the scheduler from working normally *but* it does cause beakerd to spew a very large amount of log messages which will cause beakerd and rsyslog to consume a lot of CPU.

Comment 6 Roman Joost 2018-03-29 04:10:57 UTC
Scheduled two self-tests in which the bug originated from. Both self tests ran through without scheduling problems.

https://beaker-devel.app.eng.bos.redhat.com/jobs/14112
https://beaker-devel.app.eng.bos.redhat.com/jobs/14120

Comment 7 Roman Joost 2018-04-11 05:19:59 UTC
Beaker 25.1 has been released:

https://beaker-project.org/docs/whats-new/release-25.html#beaker-25-1