Bug 1414212
Summary: | Jobs are in queued/waiting will be hang if the system will be removed or disassociated from lab controller | ||
---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Hui Wang <huiwang> |
Component: | general | Assignee: | beaker-dev-list |
Status: | CLOSED CANTFIX | QA Contact: | tools-bugs <tools-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | develop | CC: | dcallagh, mjia, rjoost |
Target Milestone: | --- | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-01-19 00:10:37 UTC | Type: | Bug |
Description
Hui Wang
2017-01-18 03:27:29 UTC
In this particular case your recipe is not stuck in the queue, it actually started and reserved the system -- even though the system is not associated to a lab controller and can never be provisioned.

Beaker is supposed to prevent this, and it normally does. You can't set the lab controller to "(none)" while a system is reserved, and the scheduler won't reserve a system with no lab controller. However you happened to hit a race window between these checks, because you set the lab controller to "(none)" at the same instant the scheduler reserved the system. The activity log shows:

huiwang Scheduler 2017-01-17 16:58:26 +10:00 Distro Tree Provision Fedora 24 Server x86_64
Scheduler 2017-01-17 16:58:24 +10:00 Power clear_netboot Aborted: System disassociated from lab controller
Scheduler 2017-01-17 16:58:24 +10:00 Power off Aborted: System disassociated from lab controller
huiwang HTTP 2017-01-17 16:58:24 +10:00 Lab Controller Changed lab-devel-02.rhts.eng.bos.redhat.com
huiwang Scheduler 2017-01-17 16:58:24 +10:00 User Reserved huiwang

Note the identical timestamps.

So that recipe is stuck in Waiting because it is waiting for you to manually reboot the system, effectively the same problem as bug 903930. (A system with no lab controller is effectively a system without power control.)

If you clone another recipe, it will abort straight away with the expected message "Recipe ID 18060 does not match any systems" which is correct, because the scheduler sees that the system is not associated to any lab controller and therefore cannot run recipes.

The race window might be possible to fix, although difficult. It is similar to other races we have faced, for example with reserving a system, which we solved by setting system.user_id using a conditional UPDATE and checking the rowcount. It ensures that any two transactions trying to update that row are serialised and then one fails if it used an inconsistent view.
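For illustration, the conditional UPDATE pattern mentioned above can be sketched roughly as follows. This is a minimal sketch, not Beaker's actual code: it assumes a simplified `system` table in an in-memory SQLite database, and `reserve_system` is a hypothetical helper name. The idea is that the WHERE clause only matches an unreserved row, so the reported rowcount tells the caller whether it won the reservation.

```python
import sqlite3


def reserve_system(conn, system_id, user_id):
    """Hypothetical helper: try to reserve a system for user_id.

    The UPDATE only matches the row if nobody holds the system yet,
    so a rowcount of 1 means this caller won and 0 means it lost.
    """
    cur = conn.execute(
        "UPDATE system SET user_id = ? WHERE id = ? AND user_id IS NULL",
        (user_id, system_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Simplified stand-in schema (an assumption, not Beaker's real table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE system ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, lab_controller_id INTEGER)"
)
conn.execute(
    "INSERT INTO system (id, user_id, lab_controller_id) VALUES (1, NULL, 7)"
)
conn.commit()

first = reserve_system(conn, 1, 100)   # system is free, so this succeeds
second = reserve_system(conn, 1, 200)  # already reserved, so this fails
owner = conn.execute("SELECT user_id FROM system WHERE id = 1").fetchone()[0]
```

Here the second caller sees a rowcount of 0 and can back off instead of overwriting the first reservation. The check only covers the one column named in the WHERE clause.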
But this race is not over a single column's value, but rather between system.lab_controller_id being set NULL at the same time as system.user_id being set to a non-NULL value.

To avoid the race we would need to SELECT... FOR UPDATE in the HTTP handler which is updating system.lab_controller_id (that's reasonable enough) but we would also need to ensure the scheduler SELECT... FOR UPDATE's when picking a system, but that means locking the entire system table because of the way the scheduler works. Which would just lead to nightmarish lock contention.

So I think the resolution here is CANTFIX for the race condition, and DUPLICATE of 903930 for the recipe stuck Waiting.

(In reply to Dan Callaghan from comment #2)
> The race window might be possible to fix, although difficult. It is similar
> to other races we have faced, for example with reserving a system, which we
> solved by setting system.user_id using a conditional UPDATE and checking the
> rowcount. It ensures that any two transactions trying to update that row are
> serialised and then one fails if it used an inconsistent view.
>
> But this race is not over a single column's value, but rather between
> system.lab_controller_id being set NULL at the same time as system.user_id
> being set to a non-NULL value.
>
> To avoid the race we would need to SELECT... FOR UPDATE in the HTTP handler
> which is updating system.lab_controller_id (that's reasonable enough) but we
> would also need to ensure the scheduler SELECT... FOR UPDATE's when picking
> a system, but that means locking the entire system table because of the way
> the scheduler works. Which would just lead to nightmarish lock contention.
>
> So I think the resolution here is CANTFIX for the race condition, and
> DUPLICATE of 903930 for the recipe stuck Waiting.

It sounds reasonable. So I agree with your resolution.
Since I can't resolve the bug as DUPLICATE and CANTFIX at the same time, I pick the race condition and resolve it as CANTFIX with a reference to Bug 903930.