Bug 1362369

Summary: if configure_netboot command fails, subsequent off and on commands for the installation should be skipped
Product: [Retired] Beaker
Reporter: Dan Callaghan <dcallagh>
Component: scheduler
Assignee: Jon Orris <jorris>
Status: CLOSED CURRENTRELEASE
QA Contact: tools-bugs <tools-bugs>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 23
CC: dcallagh, dowang, jorris, mjia, rjoost
Target Milestone: 24.0
Keywords: Patch
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-02-21 18:49:39 UTC
Type: Bug

Description Dan Callaghan 2016-08-02 04:58:59 UTC
Description of problem:
When Beaker provisions a system it enqueues four power commands which are run in sequence: clear_logs, configure_netboot, off, and on. If any of those fail, the recipe is Aborted.

Currently, however, if configure_netboot fails, for example with this error:

    No usable URL found for distro tree 60544 in lab

then the subsequent off and on commands for that (now aborted) recipe are still run, for no real reason. Because the recipe is aborted, Beaker will also have enqueued another off command on top of them.

Moreover, since Beaker 23, the subsequent on command for the aborted recipe can actually crash in beaker-provision like this:

Aug  1 05:22:11 lab-02 beaker-provision[15819]: bkr.labcontroller.provision ERROR Command handler <Greenlet at 0x1944910: <bound method CommandQueuePoller.handle of <bkr.labcontroller.provision.CommandQueuePoller object at 0x18a6ed0>>({'quiescent_period': 5, 'power': {'passwd': <repr , [<Greenlet at 0x1944a50>, <Greenlet at 0x1944370>,)> had unhandled exception: <Fault 1: "<class 'bkr.common.bexceptions.BX'>:No watchdog exists for recipe 2917987">

causing the command to be left in Running state (until it's cleaned up by the stale command clearing process when beaker-provision is next restarted). That's because the on command is trying to extend the watchdog for the recipe (bug 1348018) but it's already aborted.

Version-Release number of selected component (if applicable):
23.0

How reproducible:
with some difficulty

Steps to Reproduce:
1. Hack a distro tree to have some invalid URL (on the Lab Controllers tab, delete the existing http:// URL and use http://example.invalid/ or similar) -- this will cause the configure_netboot command to fail
(NOTE: if beaker-pxemenu is configured in the environment, it will defeat this hackery, because beaker-provision will use the local cached images on disk instead of fetching from the invalid URL. As a workaround, rm -rf /var/lib/tftpboot/distrotrees/).
2. Schedule a recipe using this hacked distro tree with the reserve workflow, and put method=http into the kickstart metadata so that it tries to use the invalid http:// URL
3. Wait for Beaker to provision a system for the recipe

Actual results:
The configure_netboot command fails and the recipe is aborted.
Then beaker-provision powers the system off, on, and off again. The on command will be left Running due to:
Aug  2 14:56:36 lab beaker-provision[28676]: bkr.labcontroller.provision ERROR Command handler <Greenlet at 0x101d370: <bound method CommandQueuePoller.handle of <bkr.labcontroller.provision.CommandQueuePoller object at 0x7fb230aa68d0>>({'quiescent_period': 5, 'power': {'passwd': None, , [<Greenlet at 0x101d910>, <Greenlet at 0x101d550>,)> had unhandled exception: <Fault 1: "<class 'bkr.common.bexceptions.BX'>:No watchdog exists for recipe 1014">

Expected results:
The subsequent off and on commands for the recipe installation should be Aborted (or given some other status which causes them to be skipped).
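
The expected queue behaviour can be sketched as a toy model. This is not Beaker's scheduler code: the `Command` class, the `installation_id` field, and the `abort_remaining` and `run_queue` helpers are hypothetical names invented for illustration. The idea is only that when one command fails, every still-queued command from the same installation is marked Aborted, while the extra off command enqueued for the aborted recipe still runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Command:
    # Hypothetical stand-in for a queued power command; not Beaker's real model.
    action: str
    installation_id: Optional[int]
    status: str = "Queued"


def abort_remaining(queue, failed):
    """Abort every still-queued command from the same installation as `failed`."""
    for cmd in queue:
        same_installation = cmd.installation_id == failed.installation_id
        if cmd is not failed and same_installation and cmd.status == "Queued":
            cmd.status = "Aborted"


def run_queue(queue, handler):
    """Run queued commands in order; `handler` returns True on success."""
    for cmd in queue:
        if cmd.status != "Queued":
            continue  # already aborted because an earlier command failed
        cmd.status = "Running"
        if handler(cmd):
            cmd.status = "Completed"
        else:
            cmd.status = "Failed"
            abort_remaining(queue, cmd)


# The four provisioning commands for installation 1, followed by the extra
# "off" enqueued when the recipe is aborted (modelled with no installation).
queue = [Command(a, 1) for a in ("clear_logs", "configure_netboot", "off", "on")]
queue.append(Command("off", None))

# Simulate configure_netboot failing; every other command succeeds.
run_queue(queue, lambda cmd: cmd.action != "configure_netboot")
statuses = [(c.action, c.status) for c in queue]
```

In this model the system is powered off only once, by the final off command, instead of being cycled off, on, and off again.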

Also the command handler shouldn't hit an unhandled exception -- extending the watchdog time should be skipped if the recipe is already finished.
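
A minimal sketch of such a guard, under stated assumptions: `Recipe`, `extend_watchdog`, and `FINISHED_STATUSES` are hypothetical names, and the set of finished statuses is assumed rather than taken from Beaker's code. The point is that the extension becomes a no-op for a finished recipe (or one with no watchdog) instead of raising, so the command handler can carry on and mark the command finished.

```python
# Hypothetical model: the names and statuses below are illustrative only.
FINISHED_STATUSES = {"Completed", "Cancelled", "Aborted"}


class Recipe:
    def __init__(self, recipe_id, status, watchdog_seconds=None):
        self.id = recipe_id
        self.status = status
        # None models the "No watchdog exists for recipe N" situation.
        self.watchdog_seconds = watchdog_seconds


def extend_watchdog(recipe, seconds):
    """Extend the watchdog; return False if there is nothing to extend."""
    if recipe.status in FINISHED_STATUSES or recipe.watchdog_seconds is None:
        return False  # skip quietly rather than raising in the handler
    # Assumption: "extend" never shortens an existing watchdog.
    recipe.watchdog_seconds = max(recipe.watchdog_seconds, seconds)
    return True


aborted = Recipe(1014, "Aborted")
running = Recipe(1015, "Running", watchdog_seconds=60)
results = (extend_watchdog(aborted, 600), extend_watchdog(running, 600))
```

Returning a boolean rather than raising is one design choice among several; catching the fault in the command handler would be another way to avoid leaving the command in Running state.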

Comment 1 Jon Orris 2016-09-28 21:01:10 UTC
https://gerrit.beaker-project.org/#/c/5274/

Comment 4 Dan Callaghan 2017-02-21 18:49:39 UTC
Beaker 24.0 has been released.