Bug 1414212

Summary: Queued/waiting jobs hang if the system is removed or disassociated from its lab controller
Product: [Retired] Beaker
Reporter: Hui Wang <huiwang>
Component: general
Assignee: beaker-dev-list
Status: CLOSED CANTFIX
QA Contact: tools-bugs <tools-bugs>
Severity: unspecified
Priority: unspecified
Version: develop
CC: dcallagh, mjia, rjoost
Keywords: FutureFeature
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Type: Bug
Last Closed: 2017-01-19 00:10:37 UTC

Description Hui Wang 2017-01-18 03:27:29 UTC
Description of problem:
When submitting several jobs that use the same system, most of them wait in the Queued state. If the system is then removed or disassociated from its lab controller, the jobs that are Queued or Waiting stay in that state forever, until the system comes back.

Version-Release number of selected component (if applicable):
Beaker 24.0.git.243.389e953 

How reproducible:
100%

Steps to Reproduce:
1. Submit 3 jobs on the same system.
2. Cancel the job that is installing and set the system's lab controller to (none).
   The jobs that are already Queued still won't be aborted.


Actual results:
The jobs that are Queued or Waiting stay Queued or Waiting forever.

Expected results:
I am not sure what the expected result should be; maybe it would be better to abort the jobs with a useful message.

Additional info:

Comment 1 Dan Callaghan 2017-01-18 05:38:24 UTC
In this particular case your recipe is not stuck in the queue, it actually started and reserved the system -- even though the system is not associated to a lab controller and can never be provisioned.

Beaker is supposed to prevent this, and it normally does. You can't set the lab controller to "(none)" while a system is reserved, and the scheduler won't reserve a system with no lab controller.

However you happened to hit a race window between these checks, because you set the lab controller to "(none)" at the same instant the scheduler reserved the system. The activity log shows:

huiwang	Scheduler	2017-01-17 16:58:26 +10:00	Distro Tree	Provision		Fedora 24 Server x86_64
	Scheduler	2017-01-17 16:58:24 +10:00	Power	clear_netboot		Aborted: System disassociated from lab controller
	Scheduler	2017-01-17 16:58:24 +10:00	Power	off		Aborted: System disassociated from lab controller
huiwang	HTTP	2017-01-17 16:58:24 +10:00	Lab Controller	Changed	lab-devel-02.rhts.eng.bos.redhat.com	
huiwang	Scheduler	2017-01-17 16:58:24 +10:00	User	Reserved		huiwang

Note the identical timestamps.

So that recipe is stuck in Waiting because it is waiting for you to manually reboot the system, effectively the same problem as bug 903930. (A system with no lab controller is effectively a system without power control.)

If you clone another recipe, it will abort straight away with the expected message "Recipe ID 18060 does not match any systems" which is correct, because the scheduler sees that the system is not associated to any lab controller and therefore cannot run recipes.

Comment 2 Dan Callaghan 2017-01-18 05:47:19 UTC
The race window might be possible to fix, although difficult. It is similar to other races we have faced, for example with reserving a system, which we solved by setting system.user_id using a conditional UPDATE and checking the rowcount. It ensures that any two transactions trying to update that row are serialised and then one fails if it used an inconsistent view.
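For illustration, here is a minimal sketch of that conditional-UPDATE-plus-rowcount pattern in SQLAlchemy-style Python. This is not the actual Beaker code; the table object (system_table), column names, and error handling are assumed for the example.

# Sketch of the conditional-UPDATE approach described above (assumed names,
# not the real Beaker implementation).
from sqlalchemy import update

def reserve_system(session, system_id, user_id):
    result = session.execute(
        update(system_table)
        .where(system_table.c.id == system_id)
        .where(system_table.c.user_id.is_(None))   # only claim a free system
        .values(user_id=user_id)
    )
    # If a competing transaction reserved the system first, the WHERE clause
    # matches nothing and rowcount is 0, so this caller fails instead of
    # silently clobbering the winner.
    if result.rowcount != 1:
        raise RuntimeError('system %s is already reserved' % system_id)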

But this race is not over a single column's value, but rather between system.lab_controller_id being set NULL at the same time as system.user_id being set to a non-NULL value.

To avoid the race we would need to SELECT... FOR UPDATE in the HTTP handler which is updating system.lab_controller_id (that's reasonable enough) but we would also need to ensure the scheduler SELECT... FOR UPDATE's when picking a system, but that means locking the entire system table because of the way the scheduler works. Which would just lead to nightmarish lock contention.
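As a rough sketch, the HTTP-handler side of that SELECT ... FOR UPDATE could look something like the following (again with assumed model and column names, not Beaker's own code); the impractical part, as noted above, is that the scheduler would need an equivalent lock across every candidate row.

# Hypothetical sketch of locking the system row before detaching its lab
# controller, using SQLAlchemy's with_for_update(); names are assumed.
def set_lab_controller(session, system_id, lab_controller_id):
    system = (session.query(System)
              .filter(System.id == system_id)
              .with_for_update()   # SELECT ... FOR UPDATE: block concurrent reservers
              .one())
    if lab_controller_id is None and system.user_id is not None:
        raise ValueError('cannot detach a reserved system from its lab controller')
    system.lab_controller_id = lab_controller_id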

So I think the resolution here is CANTFIX for the race condition, and DUPLICATE of 903930 for the recipe stuck Waiting.

Comment 3 Hui Wang 2017-01-18 06:08:04 UTC
(In reply to Dan Callaghan from comment #2)
> The race window might be possible to fix, although difficult. It is similar
> to other races we have faced, for example with reserving a system, which we
> solved by setting system.user_id using a conditional UPDATE and checking the
> rowcount. It ensures that any two transactions trying to update that row are
> serialised and then one fails if it used an inconsistent view.
> 
> But this race is not over a single column's value, but rather between
> system.lab_controller_id being set NULL at the same time as system.user_id
> being set to a non-NULL value.
> 
> To avoid the race we would need to SELECT... FOR UPDATE in the HTTP handler
> which is updating system.lab_controller_id (that's reasonable enough) but we
> would also need to ensure the scheduler SELECT... FOR UPDATE's when picking
> a system, but that means locking the entire system table because of the way
> the scheduler works. Which would just lead to nightmarish lock contention.
> 
> So I think the resolution here is CANTFIX for the race condition, and
> DUPLICATE of 903930 for the recipe stuck Waiting.

That sounds reasonable, so I agree with your resolution.

Comment 4 Roman Joost 2017-01-19 00:10:37 UTC
Since I can't resolve the bug as both DUPLICATE and CANTFIX at the same time, I'll pick the race condition and resolve it as CANTFIX with a reference to Bug 903930.