Bug 1820698

Summary: After a failed introspection, inspector's second try might fail at set_boot_device if the node is stuck in POST
Product: Red Hat OpenStack
Reporter: David Vallee Delisle <dvd>
Component: openstack-ironic-inspector
Assignee: Julia Kreger <jkreger>
Status: CLOSED ERRATA
QA Contact: mlammon
Severity: low
Priority: low
Version: 16.0 (Train)
CC: achernet, bfournie, cswanson, dhill, dtantsur, eduen, hbrock, jkreger, jparoly, jslagle, mburns, pweeks, rpittau, slinaber, stendulker
Target Milestone: beta
Keywords: Reopened, Triaged
Target Release: 17.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-ironic-inspector-10.6.2-0.20220118051837.8f97076.el8ost
Doc Type: If docs needed, set a value
Last Closed: 2022-09-21 12:09:43 UTC
Type: Bug

Description David Vallee Delisle 2020-04-03 15:45:03 UTC
Description of problem:
While troubleshooting bz1776929, the node remained in iPXE, still in POST.

After the 20-minute introspection timeout, ironic tried to set_boot_device on the node, but the call failed with UnableToModifyDuringSystemPOST. The message was well hidden, though: I had to add some custom debug logging to see it, and I opened bz1820689 to address that.

I'm wondering whether inspector should shut down the node before sending BootSourceOverrideTarget. Should this fall under ironic-inspector or redfish?
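For context on why the message was hidden: in a Redfish error response, the HTTP 400 summary only points at `@Message.ExtendedInfo`, and identifiers such as UnableToModifyDuringSystemPOST sit inside that array. Below is a minimal sketch of pulling those identifiers out of a response body. The function name and the sample payload are hypothetical (the `iLO.2.8` registry prefix is made up for illustration), and this is not the proliantutils implementation.

```python
import json

# Hypothetical sample body, shaped like a standard Redfish error response.
SAMPLE_BODY = json.dumps({
    "error": {
        "code": "iLO.0.10.ExtendedInfo",
        "message": "See @Message.ExtendedInfo for more information.",
        "@Message.ExtendedInfo": [
            {"MessageId": "iLO.2.8.UnableToModifyDuringSystemPOST"}
        ],
    }
})


def extended_message_ids(body):
    """Return the MessageId values found in a Redfish error response body.

    Returns an empty list when the body is not JSON or has no extended info.
    """
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return []
    if not isinstance(payload, dict):
        return []
    error = payload.get("error", {})
    if not isinstance(error, dict):
        return []
    infos = error.get("@Message.ExtendedInfo", [])
    if not isinstance(infos, list):
        return []
    return [
        info["MessageId"]
        for info in infos
        if isinstance(info, dict) and "MessageId" in info
    ]


if __name__ == "__main__":
    for message_id in extended_message_ids(SAMPLE_BODY):
        print(message_id)
```

Logging these identifiers next to the HTTP status would make the POST-related rejection visible without custom debug code.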

Version-Release number of selected component (if applicable):
master

How reproducible:
All the time

Steps to Reproduce:
1. Launch introspection
2. Fail to load the iPXE image and remain in the iPXE shell
3. Wait 20 minutes

Actual results:
On the second try, inspector fails with this traceback [1]

Expected results:
This shouldn't prevent inspector from doing a second try.

Additional info:

[1]
~~~
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/ilo/management.py", line 279, in set_boot_device
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     ilo_object.set_one_time_boot(boot_device)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/proliantutils/ilo/client.py", line 459, in set_one_time_boot
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return self._call_method('set_one_time_boot', value)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/proliantutils/ilo/client.py", line 341, in _call_method
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return method(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/proliantutils/redfish/redfish.py", line 610, in set_one_time_boot
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     raise exception.IloError(msg)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server proliantutils.exception.IloError: [iLO xxx] The Redfish controller failed to set one time boot device NETWORK. Error: HTTP PATCH https://xxx/redfish/v1/Systems/1 returned code 400. iLO.0.10.ExtendedInfo: See @Message.ExtendedInfo for more information.
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic_lib/metrics.py", line 60, in wrapped
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 3034, in set_boot_device
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     persistent=persistent)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic_lib/metrics.py", line 60, in wrapped
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/ilo/management.py", line 286, in set_boot_device
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     error=ilo_exception)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server ironic.common.exception.IloOperationError: Setting pxe as boot device failed, error: [iLO xxx] The Redfish controller failed to set one time boot device NETWORK. Error: HTTP PATCH https://xxx/redfish/v1/Systems/1 returned code 400. iLO.0.10.ExtendedInfo: See @Message.ExtendedInfo for more information.
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server
~~~

Comment 1 Bob Fournier 2020-04-03 16:33:48 UTC
As Ilya noted, isn't it up to proliantutils or the iLO driver to make that decision? Redfish does not require any specific power state when changing boot options.

Comment 2 Julia Kreger 2020-06-09 15:53:37 UTC
I think the managed introspection functionality that merged during the last upstream development cycle (Ussuri) should effectively solve this, because the power and boot mode settings are then managed by ironic alone within a single workflow, at least as long as [inspector]require_managed_boot is set to True.

The only realistic way to prevent this is for inspector to force the power state off before it tries to run, or for the driver to assert that the power state is off before changing the boot device. I guess the machine was already powered on when inspection was triggered?

Depending on the code path, it looks like the call goes to inspector, which then asks ironic to set the boot device to network and to reboot the node. What I don't understand is why the node is powered on before this step at all.
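The idea of powering the node off before changing the boot device can be sketched as follows. This is only an illustration under stated assumptions, not the upstream patch: `set_boot_device_powered_off`, `PowerOffTimeout` and `FakeClient` are hypothetical names, the client methods stand in for whatever API actually drives the node, and the fake rejects boot device changes while powered on to mimic the POST failure.

```python
import time


class PowerOffTimeout(Exception):
    """Raised when the node does not report 'power off' in time."""


class FakeClient:
    """Hypothetical in-memory stand-in for a node management API."""

    def __init__(self, power="power on"):
        self.power = power
        self.boot_device = None
        self.calls = []

    def get_power_state(self, node_id):
        return self.power

    def set_power_state(self, node_id, state):
        self.calls.append(("power", state))
        self.power = state

    def set_boot_device(self, node_id, device):
        # Mimic a BMC that rejects the change unless the node is off.
        if self.power != "power off":
            raise RuntimeError("UnableToModifyDuringSystemPOST")
        self.calls.append(("boot", device))
        self.boot_device = device


def set_boot_device_powered_off(client, node_id, device="pxe",
                                timeout=30.0, interval=1.0,
                                sleep=time.sleep):
    """Power the node off, set the boot device, then power it back on."""
    if client.get_power_state(node_id) != "power off":
        client.set_power_state(node_id, "power off")
        deadline = time.monotonic() + timeout
        # Poll until the node reports it is off, or give up at the deadline.
        while client.get_power_state(node_id) != "power off":
            if time.monotonic() >= deadline:
                raise PowerOffTimeout(
                    "node %s did not power off in %ss" % (node_id, timeout))
            sleep(interval)
    client.set_boot_device(node_id, device)
    client.set_power_state(node_id, "power on")


if __name__ == "__main__":
    client = FakeClient(power="power on")
    set_boot_device_powered_off(client, "node-1")
    print(client.calls)
```

Because the boot device is only touched once the node reports it is off, a node left in POST by an earlier failed attempt no longer blocks the retry in this sketch.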

Comment 3 Julia Kreger 2020-09-03 22:49:12 UTC
A patch has been uploaded upstream to address this. The process in this case is driven by ironic-inspector. The earlier focus on proliantutils was misplaced: it is failing legitimately, just without much clarity, and patches have been proposed upstream to improve that.

Comment 4 pweeks 2021-08-18 16:34:58 UTC
Low priority, no progress in the last year.
Closing as WONTFIX.
If this needs to be reconsidered, please re-open.

Comment 5 Julia Kreger 2022-03-23 13:53:52 UTC
I noticed that the patch mentioned above will be in OSP17, so I am linking it appropriately and moving this bug to the MODIFIED state.

Comment 13 errata-xmlrpc 2022-09-21 12:09:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543