1362371 – abort queued power commands when a system is deassociated from its lab controller and/or set to Removed

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	general
Sub Component:
Version:	23
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	24.0
Assignee:	Jon Orris
QA Contact:	tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-08-02 05:04 UTC by Dan Callaghan
Modified:	2017-02-21 18:50 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-02-21 18:50:29 UTC
Embargoed:
Flags:	huiwang: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1362369	0	unspecified	CLOSED	if configure_netboot command fails, subsequent off and on commands for the installation should be skipped	2021-02-22 00:41:40 UTC

Internal Links: 1362369

Description Dan Callaghan 2016-08-02 05:04:03 UTC

If a system has queued power commands and you set the lab controller to "none", the power commands will never be handled and stay Queued forever.

If beaker-provision was in the process of handling one at the time, the command will fail, because the server suddenly starts rejecting the calls: it thinks the lab controller is no longer allowed to alter commands for that system. The log shows:

Aug  1 23:13:50 lab-02 beaker-provision[15819]: bkr.labcontroller.provision ERROR Command handler <Greenlet at 0x19444b0: <bound method CommandQueuePoller.handle of <bkr.labcontroller.provision.CommandQueuePoller object at 0x18a6ed0>>({'quiescent_period': 5, 'power': {'passwd': None, , [<Greenlet at 0x1944eb0>, <Greenlet at 0x1944370>,)> had unhandled exception: <Fault 1: "<type 'exceptions.ValueError'>:lab.example.com cannot update command for mrg9.example.com in wrong lab">

If a system is set to Removed, any queued power commands will still be executed even though the owner probably doesn't want that.
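
The requested behaviour can be illustrated with a minimal sketch. It is not Beaker's actual model code: the `System` and `Command` classes, their fields, and the message used for the lab controller case are assumptions made for illustration. Only the message "System marked as removed" appears later in this report.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Command:
    # Hypothetical stand-in for a queued power command.
    action: str
    status: str = "Queued"
    message: str = ""


@dataclass
class System:
    # Hypothetical stand-in for a Beaker system record.
    fqdn: str
    lab_controller: Optional[str] = None
    status: str = "Automated"
    commands: List[Command] = field(default_factory=list)

    def abort_queued_commands(self, reason: str) -> int:
        """Mark every still-Queued command as Aborted; return how many changed."""
        aborted = 0
        for cmd in self.commands:
            if cmd.status == "Queued":
                cmd.status = "Aborted"
                cmd.message = reason
                aborted += 1
        return aborted

    def set_lab_controller(self, new_lc: Optional[str]) -> None:
        # Nothing can run the commands once the lab controller is gone,
        # so abort them instead of leaving them Queued forever.
        if new_lc is None and self.lab_controller is not None:
            # Assumed message text, chosen for this sketch.
            self.abort_queued_commands("System disassociated from lab controller")
        self.lab_controller = new_lc

    def set_status(self, new_status: str) -> None:
        # A Removed system should not be powered on or off any more.
        if new_status == "Removed" and self.status != "Removed":
            self.abort_queued_commands("System marked as removed")
        self.status = new_status
```

Commands that are already Completed (or otherwise not Queued) are left untouched; only the pending ones are aborted.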

Comment 1 Dan Callaghan 2016-08-02 07:36:17 UTC

Our production Beaker instance has a lot of very old queued power commands hanging around for systems not associated with any lab controller. So it would be good to write a data migration which cleans those up as well.
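
A rough sketch of what such a cleanup could look like, using an in-memory SQLite database so it can be run standalone. The table names, column names, and message text here are assumptions for illustration and are not taken from Beaker's real schema or migration scripts.

```python
import sqlite3

# Hypothetical, heavily simplified schema with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE system (id INTEGER PRIMARY KEY, fqdn TEXT, lab_controller_id INTEGER);
    CREATE TABLE command_queue (
        id INTEGER PRIMARY KEY, system_id INTEGER, status TEXT, error_message TEXT);
    INSERT INTO system VALUES (1, 'mrg9.example.com', NULL);
    INSERT INTO system VALUES (2, 'other.example.com', 7);
    INSERT INTO command_queue VALUES (10, 1, 'Queued', NULL);
    INSERT INTO command_queue VALUES (11, 1, 'Completed', NULL);
    INSERT INTO command_queue VALUES (12, 2, 'Queued', NULL);
""")


def abort_orphaned_commands(conn):
    """Abort Queued commands whose system has no lab controller; return the row count."""
    cur = conn.execute("""
        UPDATE command_queue
        SET status = 'Aborted',
            error_message = 'System disassociated from lab controller'
        WHERE status = 'Queued'
          AND system_id IN (SELECT id FROM system WHERE lab_controller_id IS NULL)
    """)
    conn.commit()
    return cur.rowcount


aborted = abort_orphaned_commands(conn)
```

Only command 10 qualifies in the sample data: command 11 is already finished, and command 12 belongs to a system that still has a lab controller.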

Comment 3 Jon Orris 2016-09-28 21:00:48 UTC

https://gerrit.beaker-project.org/#/c/5274/2

Comment 5 Hui Wang 2016-12-29 06:34:33 UTC

Checked this fix. The result is FAILED: the behaviour does not match the expected result.

Expected result: the recipe should be aborted if the system has queued power commands when the user sets the system's lab controller to none or removes the system.

Actual result: the recipe or job stays in Queued or Waiting status.

Steps:

1. Submit 3 jobs on dev-kvm-guest-01.rhts.eng.bos.redhat.com
2. Cancel the installing job and set the system's lab controller to none.
3. One job is left in Queued status and one in Waiting status, and both jobs hang.

Comment 6 Jon Orris 2017-01-03 23:46:33 UTC

Added an additional fix

https://gerrit.beaker-project.org/#/c/5564/

Comment 7 Jon Orris 2017-01-04 17:32:09 UTC

(In reply to Jon Orris from comment #6)
> Added an additional fix
> 
> https://gerrit.beaker-project.org/#/c/5564/

Resubmitted as 
https://gerrit.beaker-project.org/#/c/5567/

Comment 9 Hui Wang 2017-01-13 07:42:43 UTC

1. Submit 3 jobs on calxeda-soc-01-n04.rhts.eng.bos.redhat.com
2. Cancel the installing job and set the system's lab controller to none.
3. The jobs that were already Queued are still not aborted.

Comment 10 Jon Orris 2017-01-16 18:48:51 UTC

(In reply to Hui Wang from comment #9)
> 1. Submit 3 jobs on calxeda-soc-01-n04.rhts.eng.bos.redhat.com
> 2. Cancel the installing job and set none lab controller of system 
> 
> 3. I found the jobs which have already in Queue still won't be aborted.

Can you post a pointer to the automated test you are using to reproduce this?

Even better, can it be integrated into Beaker's test suite?

Comment 11 Hui Wang 2017-01-17 02:12:14 UTC

(In reply to Jon Orris from comment #10)

Sure, let me try to integrate the test from comment #9 into the test suite.

Comment 12 Hui Wang 2017-01-17 07:12:57 UTC

(In reply to Jon Orris from comment #10)

After further analysis and testing, I am sure your code aborts the queued power commands, so I can mark this bug as Verified.
However, the queued jobs that are waiting for the system remain Queued and will hang. This may be a separate issue. How about filing a new bug for it?

Comment 13 Hui Wang 2017-01-18 03:29:14 UTC

Verified this issue and filed a new bug 1414212 for my concern.

Comment 14 Dan Callaghan 2017-01-18 05:37:07 UTC

So I was a little confused about exactly what scenario we are talking about in comments 9-13 so let me summarize...

There are two related sets of expected behaviour for this bug.

1. Submit a job for a particular system
2. Wait for the system to be provisioned and some power commands to be enqueued
3. Set the system to Removed
Expected result: queued power commands should be Aborted with a message "System marked as removed", and the recipe will therefore also be aborted with a result such as "Command 104146 aborted: System marked as removed".

This works now, verified on commit 389e953.

At this point, if there were other recipes also enqueued for the now-Removed system they will be Aborted as well, with a message such as "R:1029 does not match any systems, aborting." That is handled by the dead recipes collector and is not changed by anything in this bug. It also still works as expected.
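
The dead recipes behaviour described above can be sketched roughly as follows. This is a simplified illustration, not Beaker's scheduler code: the dictionaries, the `candidates` field, and the rule that only Removed systems are unusable are assumptions, and the real collector considers more conditions than this.

```python
def collect_dead_recipes(recipes, systems):
    """Abort Queued recipes that no usable system can ever run.

    `systems` maps an FQDN to its status; `recipes` is a list of dicts with
    hypothetical keys "id", "status" and "candidates" (FQDNs the recipe could use).
    Returns the abort messages that were generated.
    """
    usable = {fqdn for fqdn, status in systems.items() if status != "Removed"}
    messages = []
    for recipe in recipes:
        if recipe["status"] != "Queued":
            continue
        if not usable.intersection(recipe["candidates"]):
            recipe["status"] = "Aborted"
            messages.append("R:%d does not match any systems, aborting." % recipe["id"])
    return messages


systems = {"a.example.com": "Removed", "b.example.com": "Automated"}
recipes = [
    {"id": 1029, "status": "Queued", "candidates": ["a.example.com"]},
    {"id": 1030, "status": "Queued", "candidates": ["a.example.com", "b.example.com"]},
]
messages = collect_dead_recipes(recipes, systems)
```

In this sample, recipe 1029 can only run on the Removed system and is aborted, while recipe 1030 stays Queued because another candidate system is still usable.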

The other related behaviour *would be*:

1. Submit a job for a particular system
2. Wait for the system to be provisioned and some power commands to be enqueued
3. Set the system's lab controller to "(none)"

However that is not actually possible, because Beaker already prevents changing the lab controller while a recipe is running. It rejects the request with an error like this:

CONFLICT: Unable to change lab controller while system is in use (return the system first)

So that scenario is not actually possible.
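
For illustration, a guard of this kind might look like the sketch below. The `Conflict` exception, the dictionary layout, and the `reserved_by` field are hypothetical; only the error text is taken from the message quoted above.

```python
from typing import Optional


class Conflict(Exception):
    """Hypothetical stand-in for an HTTP 409 CONFLICT response."""


def change_lab_controller(system: dict, new_lc: Optional[str]) -> None:
    # Refuse the change while a recipe or user still holds the system.
    if system.get("reserved_by") is not None:
        raise Conflict(
            "Unable to change lab controller while system is in use "
            "(return the system first)"
        )
    system["lab_controller"] = new_lc
```

Because the check runs before the assignment, a rejected request leaves the system's lab controller unchanged.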

There is a separate pair of scenarios when provisioning a system manually (if the condition is set to Manual, or you are the owner).

1. On the Provision tab of the system page, provision the system
2. Set the system to Removed
Expected result: queued commands are aborted.

1. On the Provision tab of the system page, provision the system
2. Set the system's lab controller to "(none)"
Expected result: queued commands are aborted.

Both of these scenarios also work, verified on commit 389e953.

The issues Sophia is describing above (and on bug 1414212) are actually about the behaviour of the dead recipe collector, which is supposed to handle the case where a recipe is queued waiting for a system which can never take it (for example because it's been removed). So that's a totally separate issue which we can follow up on that bz.

Comment 15 Hui Wang 2017-01-18 05:49:34 UTC

(In reply to Dan Callaghan from comment #14)

The summary is very clear, and it is what I wanted to express.

Comment 16 Dan Callaghan 2017-02-21 18:50:29 UTC

Beaker 24.0 has been released.
