If a system has queued power commands and you set its lab controller to "none", the power commands will never be handled and will stay Queued forever. If beaker-provision was in the middle of handling one at the time, it will fail, because the server suddenly starts rejecting its calls: it considers the lab controller no longer allowed to alter commands for that system:

Aug 1 23:13:50 lab-02 beaker-provision[15819]: bkr.labcontroller.provision ERROR Command handler <Greenlet at 0x19444b0: <bound method CommandQueuePoller.handle of <bkr.labcontroller.provision.CommandQueuePoller object at 0x18a6ed0>>({'quiescent_period': 5, 'power': {'passwd': None, , [<Greenlet at 0x1944eb0>, <Greenlet at 0x1944370>,)> had unhandled exception: <Fault 1: "<type 'exceptions.ValueError'>:lab.example.com cannot update command for mrg9.example.com in wrong lab">

If a system is set to Removed, any queued power commands will still be executed even though the owner probably doesn't want that.
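To make the failure mode concrete, here is a minimal sketch of the kind of ownership check that would produce the fault above. It is not Beaker's actual server code: the `System` class and `check_can_update_command` function are hypothetical names, and only the error message text is taken from the log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class System:
    fqdn: str
    lab_controller: Optional[str]  # FQDN of the associated lab controller, or None


def check_can_update_command(caller_fqdn: str, system: System) -> None:
    """Hypothetical check: only the system's own lab controller may update its commands."""
    if system.lab_controller != caller_fqdn:
        raise ValueError('%s cannot update command for %s in wrong lab'
                         % (caller_fqdn, system.fqdn))


system = System('mrg9.example.com', 'lab.example.com')
check_can_update_command('lab.example.com', system)  # allowed: the labs match

# The lab controller is set to "none" while a command is still being handled.
system.lab_controller = None
try:
    check_can_update_command('lab.example.com', system)
    error_message = None
except ValueError as e:
    error_message = str(e)
```

Under this assumption, any command that was in flight when the association was cleared can no longer be completed or failed by the lab controller, which matches the unhandled exception in the log.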
Our production Beaker instance has a lot of very old queued power commands hanging around for systems not associated with any lab controller. So it would be good to write a data migration which cleans those up as well.
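A cleanup of that kind could look roughly like the sketch below. It runs against an in-memory SQLite database so that it is self-contained. The table and column names (`system`, `command_queue`, `lab_controller_id`, `error_message`) and the abort message are assumptions for illustration, not a statement of Beaker's real schema or migration tooling.

```python
import sqlite3

# Hypothetical, simplified schema with a little sample data.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE system (id INTEGER PRIMARY KEY, fqdn TEXT, lab_controller_id INTEGER);
CREATE TABLE command_queue (id INTEGER PRIMARY KEY, system_id INTEGER,
                            status TEXT, error_message TEXT);
INSERT INTO system VALUES (1, 'a.example.com', 7), (2, 'b.example.com', NULL);
INSERT INTO command_queue VALUES (1, 1, 'Queued', NULL),
                                 (2, 2, 'Queued', NULL),
                                 (3, 2, 'Completed', NULL);
""")


def abort_orphaned_commands(conn):
    """Abort Queued commands whose system has no lab controller; return the row count."""
    cur = conn.execute("""
        UPDATE command_queue
        SET status = 'Aborted',
            error_message = 'System has no lab controller'
        WHERE status = 'Queued'
          AND system_id IN (SELECT id FROM system WHERE lab_controller_id IS NULL)
    """)
    conn.commit()
    return cur.rowcount


aborted = abort_orphaned_commands(conn)
statuses = dict(conn.execute("SELECT id, status FROM command_queue"))
```

Only commands that are still Queued are touched, so completed or already failed commands keep their history, and queued commands for systems that still have a lab controller are left alone.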
https://gerrit.beaker-project.org/#/c/5274/2
Checked this fix; verification FAILED.

Expected result: the recipe should be aborted if the system has queued power commands when the user sets the system's lab controller to none or removes the system.

Actual result: the recipe or job stays in Queued or Waiting status.

Steps:
1. Submit 3 jobs on dev-kvm-guest-01.rhts.eng.bos.redhat.com
2. Cancel the installing job and set the system's lab controller to none
3. One job is left in Queued status and one job in Waiting status, and both jobs hang.
Added an additional fix https://gerrit.beaker-project.org/#/c/5564/
(In reply to Jon Orris from comment #6)
> Added an additional fix
>
> https://gerrit.beaker-project.org/#/c/5564/

Resubmitted as https://gerrit.beaker-project.org/#/c/5567/
1. Submit 3 jobs on calxeda-soc-01-n04.rhts.eng.bos.redhat.com
2. Cancel the installing job and set the system's lab controller to none
3. The jobs that were already in Queued status are still not aborted.
(In reply to Hui Wang from comment #9)
> 1. Submit 3 jobs on calxeda-soc-01-n04.rhts.eng.bos.redhat.com
> 2. Cancel the installing job and set none lab controller of system
>
> 3. I found the jobs which have already in Queue still won't be aborted.

Can you post a pointer to the automated test you are using to reproduce this?

Even better, can it be integrated into Beaker's test suite?
(In reply to Jon Orris from comment #10)
> (In reply to Hui Wang from comment #9)
> > 1. Submit 3 jobs on calxeda-soc-01-n04.rhts.eng.bos.redhat.com
> > 2. Cancel the installing job and set none lab controller of system
> >
> > 3. I found the jobs which have already in Queue still won't be aborted.
>
> Can you post a pointer to the automated test you are using to reproduce this?
>
> Even better, can it be integrated into Beaker's test suite?

Sure, let me try to integrate the test from comment #9 into the test suite.
(In reply to Jon Orris from comment #10)
> (In reply to Hui Wang from comment #9)
> > 1. Submit 3 jobs on calxeda-soc-01-n04.rhts.eng.bos.redhat.com
> > 2. Cancel the installing job and set none lab controller of system
> >
> > 3. I found the jobs which have already in Queue still won't be aborted.
>
> Can you post a pointer to the automated test you are using to reproduce this?
>
> Even better, can it be integrated into Beaker's test suite?

After further analysis and testing, I am sure your code aborts the queued power commands, so I can verify this bug. However, the queued jobs that are waiting for the system are still in Queued status, and these jobs will hang. This may be a separate issue. How about filing a new bug for it?
Verified this issue and filed a new bug 1414212 for my concern.
So I was a little confused about exactly what scenario we are talking about in comments 9-13, so let me summarize...

There are two related sets of expected behaviour for this bug.

1. Submit a job for a particular system
2. Wait for the system to be provisioned and some power commands to be enqueued
3. Set the system to Removed
Expected result: queued power commands should be Aborted with a message "System marked as removed", and the recipe will therefore also be aborted with a result such as "Command 104146 aborted: System marked as removed".

This works now, verified on commit 389e953.

At this point, if there were other recipes also enqueued for the now-Removed system they will be Aborted as well, with a message such as "R:1029 does not match any systems, aborting." That is handled by the dead recipes collector and is not changed by anything in this bug. It also still works as expected.

The other related behaviour *would be*:

1. Submit a job for a particular system
2. Wait for the system to be provisioned and some power commands to be enqueued
3. Set the system's lab controller to "(none)"

However, that is not actually possible, because Beaker already prevents changing the lab controller while a recipe is running. It rejects the request with an error like this:

CONFLICT: Unable to change lab controller while system is in use (return the system first)

There is a separate pair of scenarios when provisioning a system manually (if the condition is set to Manual, or you are the owner).

1. On the Provision tab of the system page, provision the system
2. Set the system to Removed
Expected result: queued commands are aborted.

1. On the Provision tab of the system page, provision the system
2. Set the system's lab controller to "(none)"
Expected result: queued commands are aborted.

Both of these scenarios also work, verified on commit 389e953.

The issues Sophia is describing above (and on bug 1414212) are actually about the behaviour of the dead recipe collector, which is supposed to handle the case where a recipe is queued waiting for a system which can never take it (for example because it's been removed). So that's a totally separate issue which we can follow up on that bz.
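The expected behaviour summarized in this comment can be captured in a small model, shown here as a sketch rather than Beaker's real implementation. The `System` and `Command` classes, their method names, and the message used when the lab controller is cleared are hypothetical. The "System marked as removed" message and the conflict text come from the comment, and `RuntimeError` stands in for the CONFLICT response.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Command:
    id: int
    status: str = 'Queued'
    message: Optional[str] = None


@dataclass
class System:
    fqdn: str
    lab_controller: Optional[str]
    status: str = 'Automated'
    in_use: bool = False  # True while a recipe is running on the system
    commands: List[Command] = field(default_factory=list)

    def abort_queued_commands(self, message: str) -> None:
        # Only commands that have not started yet are aborted.
        for cmd in self.commands:
            if cmd.status == 'Queued':
                cmd.status = 'Aborted'
                cmd.message = message

    def mark_removed(self) -> None:
        self.status = 'Removed'
        self.abort_queued_commands('System marked as removed')

    def clear_lab_controller(self) -> None:
        if self.in_use:
            # Stand-in for the CONFLICT error returned by the server.
            raise RuntimeError('Unable to change lab controller while system '
                               'is in use (return the system first)')
        self.lab_controller = None
        # Hypothetical message; the thread does not state the real wording.
        self.abort_queued_commands('System disassociated from lab controller')


# Scenario: system set to Removed with a queued power command.
removed = System('r.example.com', 'lab.example.com',
                 commands=[Command(1), Command(2, status='Completed')])
removed.mark_removed()

# Scenario: lab controller change attempted while a recipe is running.
busy = System('b.example.com', 'lab.example.com', in_use=True,
              commands=[Command(3)])
try:
    busy.clear_lab_controller()
    conflict = None
except RuntimeError as e:
    conflict = str(e)

# Scenario: manual provision, then lab controller set to "(none)".
manual = System('m.example.com', 'lab.example.com', commands=[Command(4)])
manual.clear_lab_controller()
```

The model deliberately leaves out the dead recipe collector, since aborting other queued recipes for the system is outside the scope of this bug.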
(In reply to Dan Callaghan from comment #14)
> So I was a little confused about exactly what scenario we are talking about
> in comments 9-13 so let me summarize...
>
> There are two related sets of expected behaviour for this bug.
>
> 1. Submit a job for a particular system
> 2. Wait for the system to be provisioned and some power commands to be
> enqueued
> 3. Set the system to Removed
> Expected result: queued power commands should be Aborted with a message
> "System marked as removed", and the recipe will therefore also be aborted
> with a result such as "Command 104146 aborted: System marked as removed".
>
> This works now, verified on commit 389e953.
>
> At this point, if there were other recipes also enqueued for the now-Removed
> system they will be Aborted as well, with a message such as "R:1029 does not
> match any systems, aborting." That is handled by the dead recipes collector
> and is not changed by anything in this bug. It also still works as expected.
>
> The other related behaviour *would be*:
>
> 1. Submit a job for a particular system
> 2. Wait for the system to be provisioned and some power commands to be
> enqueued
> 3. Set the system's lab controller to "(none)"
>
> However that is not actually possible, because Beaker already prevents
> changing the lab controller while a recipe is running. It rejects the
> request with an error like this:
>
> CONFLICT: Unable to change lab controller while system is in use (return the
> system first)
>
> So that scenario is not actually possible.
>
> There is a separate pair of scenarios when provisioning a system manually
> (if the condition is set to Manual, or you are the owner).
>
> 1. On the Provision tab of the system page, provision the system
> 2. Set the system to Removed
> Expected result: queued commands are aborted.
>
> 1. On the Provision tab of the system page, provision the system
> 2. Set the system's lab controller to "(none)"
> Expected result: queued commands are aborted.
>
> Both of these scenarios also work, verified on commit 389e953.
>
> The issues Sophia is describing above (and on bug 1414212) are actually
> about the behaviour of the dead recipe collector, which is supposed to
> handle the case where a recipe is queued waiting for a system which can
> never take it (for example because it's been removed). So that's a totally
> separate issue which we can follow up on that bz.

The summary is very clear, and it is what I wanted to express.
Beaker 24.0 has been released.