Bug 1820698 - After a failed introspection, and inspector goes on a second try, it might fail to set_boot_device if it's stuck in POST
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic-inspector
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: beta
Target Release: 17.0
Assignee: Julia Kreger
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-03 15:45 UTC by David Vallee Delisle
Modified: 2023-09-07 22:40 UTC (History)
CC List: 15 users

Fixed In Version: openstack-ironic-inspector-10.6.2-0.20220118051837.8f97076.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:09:43 UTC
Target Upstream Version:
Embargoed:


Links
System                             ID              Private  Priority  Status  Summary                      Last Updated
OpenStack Storyboard               2008107         0        None      None    None                         2020-09-03 22:49:12 UTC
OpenStack gerrit                   749845          0        None      MERGED  Power off before inspection  2021-02-01 14:53:06 UTC
Red Hat Issue Tracker              OSP-7197        0        None      None    None                         2022-03-22 15:39:12 UTC
Red Hat Knowledge Base (Solution)  4962461         0        None      None    None                         2020-05-01 15:09:14 UTC
Red Hat Product Errata             RHEA-2022:6543  0        None      None    None                         2022-09-21 12:10:24 UTC

Description David Vallee Delisle 2020-04-03 15:45:03 UTC
Description of problem:
While troubleshooting bz1776929, the node remained stuck at the iPXE prompt in POST.

After the 20-minute introspection timeout, ironic called set_boot_device on the node, but the call failed and UnableToModifyDuringSystemPOST was returned. This message was well hidden, though: I had to add some custom debug logging to see it, and I opened bz1820689 to address that.

Should inspector shut down the node before sending BootSourceOverrideTarget? Should this be handled under ironic-inspector or redfish?
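
The 400 response quoted in the traceback only says "See @Message.ExtendedInfo for more information", so the real reason sits in the response body. Below is a minimal sketch of how the message IDs could be pulled out of such a body. It assumes the usual Redfish error layout (an `error` object holding an `@Message.ExtendedInfo` list whose entries carry a `MessageId`). The function names and the sample body are hypothetical and were not captured from the affected iLO; the `iLO.x.y` prefix in particular is a placeholder.

```python
import json


def extended_message_ids(body):
    """Return the MessageId values found in a Redfish error response body."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Not JSON at all: nothing to extract.
        return []
    if not isinstance(payload, dict):
        return []
    error = payload.get("error")
    if not isinstance(error, dict):
        return []
    infos = error.get("@Message.ExtendedInfo") or []
    return [
        str(info["MessageId"])
        for info in infos
        if isinstance(info, dict) and "MessageId" in info
    ]


def is_post_conflict(body):
    """True if any MessageId ends with UnableToModifyDuringSystemPOST."""
    return any(
        message_id.split(".")[-1] == "UnableToModifyDuringSystemPOST"
        for message_id in extended_message_ids(body)
    )


# Illustrative body only; the MessageId prefix is a placeholder.
SAMPLE_BODY = json.dumps({
    "error": {
        "code": "iLO.0.10.ExtendedInfo",
        "message": "See @Message.ExtendedInfo for more information.",
        "@Message.ExtendedInfo": [
            {"MessageId": "iLO.x.y.UnableToModifyDuringSystemPOST"}
        ],
    }
})
```

Matching on the last dotted component keeps the check independent of the registry version prefix, which is not shown in this report.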

Version-Release number of selected component (if applicable):
master

How reproducible:
All the time

Steps to Reproduce:
1. Launch introspection
2. Let the iPXE image fail to load so that the node remains in the iPXE shell
3. Wait 20 minutes

Actual results:
On the second try, inspector fails with the traceback in [1].

Expected results:
This shouldn't prevent inspector from doing a second try.

Additional info:

[1]
~~~
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/ilo/management.py", line 279, in set_boot_device
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     ilo_object.set_one_time_boot(boot_device)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/proliantutils/ilo/client.py", line 459, in set_one_time_boot
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return self._call_method('set_one_time_boot', value)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/proliantutils/ilo/client.py", line 341, in _call_method
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return method(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/proliantutils/redfish/redfish.py", line 610, in set_one_time_boot
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     raise exception.IloError(msg)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server proliantutils.exception.IloError: [iLO xxx] The Redfish controller failed to set one time boot device NETWORK. Error: HTTP PATCH https://xxx/redfish/v1/Systems/1 returned code 400. iLO.0.10.ExtendedInfo: See @Message.ExtendedInfo for more information.
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic_lib/metrics.py", line 60, in wrapped
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 3034, in set_boot_device
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     persistent=persistent)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic_lib/metrics.py", line 60, in wrapped
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/ilo/management.py", line 286, in set_boot_device
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server     error=ilo_exception)
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server ironic.common.exception.IloOperationError: Setting pxe as boot device failed, error: [iLO xxx] The Redfish controller failed to set one time boot device NETWORK. Error: HTTP PATCH https://xxx/redfish/v1/Systems/1 returned code 400. iLO.0.10.ExtendedInfo: See @Message.ExtendedInfo for more information.
ironic/ironic-conductor.log:2020-04-01 17:38:06.494 7 ERROR oslo_messaging.rpc.server
~~~

Comment 1 Bob Fournier 2020-04-03 16:33:48 UTC
As Ilya noted, isn't it up to proliantutils or the iLO driver to make that decision? Redfish does not require any specific power state when changing boot options.

Comment 2 Julia Kreger 2020-06-09 15:53:37 UTC
I think the managed introspection functionality that merged during the last upstream development cycle (Ussuri) should effectively solve this, since power and boot mode settings are then managed by ironic alone within a single workflow, at least as long as [inspector]require_managed_boot is set to True.

The only realistic way to prevent this is for inspector to force the power state off before it runs, or for the driver to assert power state off before changing the boot device. I guess the machine was already powered on when inspection was triggered?

Depending on the code path, it looks like the call goes to inspector, which then asks ironic to set the boot device to network and to reboot the node. My disconnect is: why is the node powered on before this step at all?
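
The ordering suggested in this comment (power off, then change the boot device, then power on) can be sketched as follows. This is an illustration under stated assumptions only: `FakeBMC`, `PostInProgress` and `start_introspection` are hypothetical names, not ironic or ironic-inspector API, and the upstream change may be structured differently. The fake rejects every boot device change while powered on, which is a simplification of the real behaviour, where the rejection only happens during POST.

```python
class PostInProgress(Exception):
    """Raised by the fake BMC when a boot change arrives while powered on."""


class FakeBMC:
    """Hypothetical stand-in for a BMC that rejects boot changes during POST."""

    def __init__(self, power_state="on"):
        self.power_state = power_state
        self.boot_device = None
        self.calls = []  # records the order of operations for inspection

    def set_power_state(self, state):
        self.calls.append(("power", state))
        self.power_state = state

    def set_boot_device(self, device):
        self.calls.append(("boot", device))
        if self.power_state != "off":
            # Simplification: any powered-on node is treated as stuck in POST.
            raise PostInProgress("cannot modify boot settings during POST")
        self.boot_device = device


def start_introspection(bmc, device="pxe"):
    """Power off first, then set the boot device, then power on."""
    if bmc.power_state != "off":
        bmc.set_power_state("off")
    bmc.set_boot_device(device)
    bmc.set_power_state("on")
```

On real hardware the power-off is asynchronous, so an implementation would presumably also need to wait for the node to report the off state before changing the boot device; the sketch leaves that out.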

Comment 3 Julia Kreger 2020-09-03 22:49:12 UTC
Patch uploaded upstream to address this. The actual process in this case is driven by ironic-inspector. The previous focus on proliantutils was not correct: it is legitimately failing, just not with much clarity, although patches have been proposed upstream to improve that.

Comment 4 pweeks 2021-08-18 16:34:58 UTC
low priority, no progress in the last year
closing wontfix
If this needs to be reconsidered, please re-open

Comment 5 Julia Kreger 2022-03-23 13:53:52 UTC
I noticed the referenced patch will be in OSP17; linking appropriately and moving to MODIFIED state.

Comment 13 errata-xmlrpc 2022-09-21 12:09:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

