Description of problem:
When submitting several jobs to use one system, many of the jobs are queued. If the system is then removed or disassociated from its lab controller, the jobs that are Queued or Waiting will stay Queued or Waiting forever, until the system comes back.

Version-Release number of selected component (if applicable):
Beaker 24.0.git.243.389e953

How reproducible:
100%

Steps to Reproduce:
1. Submit 3 jobs against the same system.
2. Cancel the installing job and set the system's lab controller to (none).

Actual results:
The jobs that are already Queued are not aborted; they stay Queued or Waiting forever.

Expected results:
I am not sure what the expected results should be; maybe it would be better to give some useful message and abort the jobs.

Additional info:
In this particular case your recipe is not stuck in the queue: it actually started and reserved the system, even though the system is not associated to a lab controller and can never be provisioned. Beaker is supposed to prevent this, and it normally does. You can't set the lab controller to "(none)" while a system is reserved, and the scheduler won't reserve a system with no lab controller. However, you happened to hit a race window between these checks, because you set the lab controller to "(none)" at the same instant the scheduler reserved the system. The activity log shows:

huiwang   Scheduler   2017-01-17 16:58:26 +10:00   Distro Tree      Provision       Fedora 24 Server x86_64
          Scheduler   2017-01-17 16:58:24 +10:00   Power            clear_netboot   Aborted: System disassociated from lab controller
          Scheduler   2017-01-17 16:58:24 +10:00   Power            off             Aborted: System disassociated from lab controller
huiwang   HTTP        2017-01-17 16:58:24 +10:00   Lab Controller   Changed         lab-devel-02.rhts.eng.bos.redhat.com
huiwang   Scheduler   2017-01-17 16:58:24 +10:00   User             Reserved        huiwang

Note the identical timestamps.

So that recipe is stuck in Waiting because it is waiting for you to manually reboot the system, which is effectively the same problem as bug 903930. (A system with no lab controller is effectively a system without power control.)

If you clone another recipe, it will abort straight away with the expected message "Recipe ID 18060 does not match any systems", which is correct, because the scheduler sees that the system is not associated to any lab controller and therefore cannot run recipes.
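To make the race window described above concrete, here is a toy, single-threaded sketch of how the two check-then-act code paths can interleave. It uses an in-memory sqlite3 table with an invented minimal schema, not Beaker's real database or code, and simply runs the steps in the unlucky order:

import sqlite3

# Toy schema standing in for Beaker's system table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE system (id INTEGER PRIMARY KEY, "
             "user_id INTEGER, lab_controller_id INTEGER)")
conn.execute("INSERT INTO system VALUES (1, NULL, 10)")

# HTTP handler's check: the system is not reserved, so it may be
# disassociated from its lab controller.
(user_id,) = conn.execute(
    "SELECT user_id FROM system WHERE id = 1").fetchone()
assert user_id is None

# Scheduler's check, running concurrently: the system has a lab
# controller, so it may be reserved for a recipe.
(lab_controller_id,) = conn.execute(
    "SELECT lab_controller_id FROM system WHERE id = 1").fetchone()
assert lab_controller_id is not None

# Both checks passed against the old state, so both updates go ahead.
conn.execute("UPDATE system SET lab_controller_id = NULL WHERE id = 1")
conn.execute("UPDATE system SET user_id = 42 WHERE id = 1")

# The system is now reserved but has no lab controller -- exactly the
# state each check was supposed to rule out.
print(conn.execute(
    "SELECT user_id, lab_controller_id FROM system WHERE id = 1").fetchone())
# prints: (42, None)

In the real system the two paths run in separate transactions, but the effect of the unlucky ordering is the same: both checks pass on the stale view, and both updates land.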
It might be possible to fix the race window, although it would be difficult. It is similar to other races we have faced, for example when reserving a system, which we solved by setting system.user_id with a conditional UPDATE and checking the rowcount. That ensures any two transactions trying to update the row are serialised, and the one that acted on a stale view fails.

But this race is not over a single column's value: it is between system.lab_controller_id being set to NULL at the same time as system.user_id being set to a non-NULL value. To avoid it we would need to SELECT ... FOR UPDATE in the HTTP handler which updates system.lab_controller_id (that part is reasonable enough), but we would also need the scheduler to SELECT ... FOR UPDATE when picking a system, and because of the way the scheduler works that means locking the entire system table, which would just lead to nightmarish lock contention.

So I think the resolution here is CANTFIX for the race condition, and DUPLICATE of bug 903930 for the recipe stuck in Waiting.
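For reference, a minimal sketch of the conditional-UPDATE-plus-rowcount pattern mentioned above, again against a toy sqlite3 schema; the reserve_system helper and the schema are invented for illustration and are not Beaker's actual code:

import sqlite3

def reserve_system(conn, system_id, user_id):
    # The UPDATE only matches the row if nobody has reserved the system
    # since we last looked at it, so concurrent reservations are
    # serialised and the loser can tell that its view was stale.
    cur = conn.execute(
        "UPDATE system SET user_id = ? "
        "WHERE id = ? AND user_id IS NULL",
        (user_id, system_id))
    if cur.rowcount != 1:
        raise RuntimeError("system %s was reserved by someone else" % system_id)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE system (id INTEGER PRIMARY KEY, "
             "user_id INTEGER, lab_controller_id INTEGER)")
conn.execute("INSERT INTO system VALUES (1, NULL, 10)")

reserve_system(conn, 1, 42)        # succeeds, rowcount is 1
try:
    reserve_system(conn, 1, 43)    # rowcount is 0, so this raises
except RuntimeError as e:
    print(e)

The guard only covers the columns named in the WHERE clause, which is why, as noted above, it does not directly help here: the conflicting write is to lab_controller_id rather than user_id, and closing that window would need the SELECT ... FOR UPDATE approach (and the table-wide locking) described in the comment.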
(In reply to Dan Callaghan from comment #2)
> It might be possible to fix the race window, although it would be
> difficult. It is similar to other races we have faced, for example when
> reserving a system, which we solved by setting system.user_id with a
> conditional UPDATE and checking the rowcount. That ensures any two
> transactions trying to update the row are serialised, and the one that
> acted on a stale view fails.
>
> But this race is not over a single column's value: it is between
> system.lab_controller_id being set to NULL at the same time as
> system.user_id being set to a non-NULL value. To avoid it we would need to
> SELECT ... FOR UPDATE in the HTTP handler which updates
> system.lab_controller_id (that part is reasonable enough), but we would
> also need the scheduler to SELECT ... FOR UPDATE when picking a system,
> and because of the way the scheduler works that means locking the entire
> system table, which would just lead to nightmarish lock contention.
>
> So I think the resolution here is CANTFIX for the race condition, and
> DUPLICATE of bug 903930 for the recipe stuck in Waiting.

That sounds reasonable, so I agree with your resolution.
Since I can't resolve the bug as both DUPLICATE and CANTFIX at the same time, I will pick the race condition and resolve this as CANTFIX, with a reference to bug 903930.