Bug 1362369 - if configure_netboot command fails, subsequent off and on commands for the installation should be skipped
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 23
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: 24.0
Assignee: Jon Orris
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-02 04:58 UTC by Dan Callaghan
Modified: 2017-02-21 18:49 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-21 18:49:39 UTC
Embargoed:




Links
System: Red Hat Bugzilla | ID: 1362371 | Private: 0 | Priority: unspecified | Status: CLOSED | Last Updated: 2021-02-22 00:41:40 UTC
Summary: abort queued power commands when a system is deassociated from its lab controller and/or set to Removed

Internal Links: 1362371

Description Dan Callaghan 2016-08-02 04:58:59 UTC
Description of problem:
When Beaker provisions a system it enqueues four power commands which are run in sequence: clear_logs, configure_netboot, off, and on. If any of those fail, the recipe is Aborted.

However currently, if configure_netboot fails, for example with this error:

    No usable URL found for distro tree 60544 in lab

then the subsequent off and on commands relating to that (now aborted) recipe are still run, even though they serve no purpose. Because the recipe is aborted, Beaker also enqueues a further off command on top of them.

Moreover, since Beaker 23, the subsequent on command for the aborted recipe can actually crash in beaker-provision like this:

Aug  1 05:22:11 lab-02 beaker-provision[15819]: bkr.labcontroller.provision ERROR Command handler <Greenlet at 0x1944910: <bound method CommandQueuePoller.handle of <bkr.labcontroller.provision.CommandQueuePoller object at 0x18a6ed0>>({'quiescent_period': 5, 'power': {'passwd': <repr , [<Greenlet at 0x1944a50>, <Greenlet at 0x1944370>,)> had unhandled exception: <Fault 1: "<class 'bkr.common.bexceptions.BX'>:No watchdog exists for recipe 2917987">

causing the command to be left in the Running state (until it is cleaned up by the stale-command clearing process the next time beaker-provision is restarted). This happens because the on command tries to extend the watchdog for the recipe (bug 1348018), but the recipe is already aborted and no longer has a watchdog.
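
To illustrate the failure mode, here is a minimal Python sketch of a guard around the watchdog extension. It is not Beaker's actual code: the NoWatchdog exception, the dict of watchdog timeouts, and both function names are hypothetical stand-ins for the server-side "No watchdog exists" fault and the call made by the on command.

```python
class NoWatchdog(Exception):
    """Hypothetical stand-in for the 'No watchdog exists for recipe' fault."""


def extend_watchdog(watchdogs, recipe_id, seconds):
    """Extend the watchdog for recipe_id, raising if the recipe has none.

    watchdogs is a plain dict mapping recipe id -> remaining seconds,
    used here only to model the server-side state.
    """
    if recipe_id not in watchdogs:
        raise NoWatchdog('No watchdog exists for recipe %s' % recipe_id)
    watchdogs[recipe_id] += seconds
    return watchdogs[recipe_id]


def extend_watchdog_if_active(watchdogs, recipe_id, seconds):
    """Return the new watchdog value, or None if the recipe already finished.

    Swallowing the error lets the command handler carry on instead of
    dying with an unhandled exception and leaving the command Running.
    """
    try:
        return extend_watchdog(watchdogs, recipe_id, seconds)
    except NoWatchdog:
        return None
```

With a guard like this, an on command that arrives after the recipe was aborted would simply skip the extension.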

Version-Release number of selected component (if applicable):
23.0

How reproducible:
with some difficulty

Steps to Reproduce:
1. Hack a distro tree to have some invalid URL (on the Lab Controllers tab, delete the existing http:// URL and use http://example.invalid/ or similar) -- this will cause the configure_netboot command to fail
(NOTE: if beaker-pxemenu is configured in the environment, it will defeat this hackery, because beaker-provision will use the local cached images on disk instead of fetching from the invalid URL. As a workaround, rm -rf /var/lib/tftpboot/distrotrees/).
2. Schedule a recipe using this hacked distro tree with the reserve workflow, and put method=http into the kickstart metadata so that it tries to use the invalid http:// URL
3. Wait for Beaker to provision a system for the recipe

Actual results:
The configure_netboot command fails and the recipe is aborted.
Then beaker-provision powers the system off, on, and off again. The on command will be left Running due to:
Aug  2 14:56:36 lab beaker-provision[28676]: bkr.labcontroller.provision ERROR Command handler <Greenlet at 0x101d370: <bound method CommandQueuePoller.handle of <bkr.labcontroller.provision.CommandQueuePoller object at 0x7fb230aa68d0>>({'quiescent_period': 5, 'power': {'passwd': None, , [<Greenlet at 0x101d910>, <Greenlet at 0x101d550>,)> had unhandled exception: <Fault 1: "<class 'bkr.common.bexceptions.BX'>:No watchdog exists for recipe 1014">

Expected results:
The following off and on commands for the recipe installation should be Aborted (or some other status, causing them to be skipped).

Also the command handler shouldn't hit an unhandled exception -- extending the watchdog time should be skipped if the recipe is already finished.
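
The first expected result above (skipping the remaining commands) could be sketched roughly as follows. This is an illustration only, not the change that was merged: run_installation_commands and its run callback are hypothetical, and the Completed and Failed status names are assumed for the example.

```python
def run_installation_commands(commands, run):
    """Run commands in order; after the first failure, mark the rest Aborted.

    commands: list of action names, such as the four enqueued for an
              installation in this report.
    run:      hypothetical callable that takes an action name and returns
              True on success, False on failure.
    Returns a list of (action, status) tuples.
    """
    results = []
    failed = False
    for action in commands:
        if failed:
            # An earlier command for this installation failed, so this
            # one is skipped rather than executed.
            results.append((action, 'Aborted'))
            continue
        if run(action):
            results.append((action, 'Completed'))
        else:
            results.append((action, 'Failed'))
            failed = True
    return results


# Example: configure_netboot fails, so off and on are never run.
results = run_installation_commands(
    ['clear_logs', 'configure_netboot', 'off', 'on'],
    lambda action: action != 'configure_netboot')
```

The off command that Beaker enqueues separately when the recipe is aborted is outside this sequence and would still run.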

Comment 1 Jon Orris 2016-09-28 21:01:10 UTC
https://gerrit.beaker-project.org/#/c/5274/

Comment 4 Dan Callaghan 2017-02-21 18:49:39 UTC
Beaker 24.0 has been released.

