Bug 2189245

Summary: In containerized RHEL 16.2 deployment, problem with headers during node introspection using Redfish
Product: Red Hat OpenStack
Reporter: Jim Bagwell <james.bagwell>
Component: python-sushy
Assignee: Julia Kreger <jkreger>
Status: CLOSED CURRENTRELEASE
QA Contact: Arik Chernetsky <achernet>
Severity: high
Docs Contact:
Priority: medium
Version: 16.2 (Train)
CC: achernet, janders, jkreger, rhos-maint, sbaker
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Flags: jkreger: needinfo? (janders)
sbaker: needinfo? (kamil.gustab)
lsvaty: needinfo? (achernet)
lsvaty: needinfo? (rhos-maint)
jkreger: needinfo-
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-24 20:01:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jim Bagwell 2023-04-24 14:05:33 UTC
Description of problem:


Version-Release number of selected component (if applicable):

Host: RHEL 8.6 OS
Deploying RHEL Openstack 16.2
Python-sushy: python3-sushy-2.0.6-2.20220913045147.484e642.el8ost.noarch




How reproducible:

100% 
Steps to Reproduce:

1. Import nodes with the redfish driver
2. Try to introspect; it fails

Actual results: (IPs have been obfuscated)

2023-04-24 10:29:35.018 7 ERROR ironic_inspector.utils [-] [node: cc96cd2f-9357-4694-a352-8914e6d120a0 state starting] Failed to set boot device to PXE: Redfish exception occurred. Error: Redfish set boot device failed for node cc96cd2f-9357-4694-a352-8914e6d120a0. Error: HTTP PATCH https://XXXXXXXX/redfish/v1/Systems/Self returned code 428. Ami.1.0.0.PreconditionHeaderMissing: The request did not provide the required precondition, such as an If-Match or If-None-Match header. (HTTP 500): ironicclient.common.apiclient.exceptions.InternalServerError: Redfish exception occurred. Error: Redfish set boot device failed for node cc96cd2f-9357-4694-a352-8914e6d120a0. Error: HTTP PATCH https://XXXXXXXX/redfish/v1/Systems/Self returned code 428. Ami.1.0.0.PreconditionHeaderMissing: The request did not provide the required precondition, such as an If-Match or If-None-Match header. (HTTP 500)
2023-04-24 10:29:35.019 7 ERROR ironic_inspector.node_cache [-] [node: cc96cd2f-9357-4694-a352-8914e6d120a0 state starting] Processing the error event because of an exception <class 'ironic_inspector.utils.Error'>: Failed to set boot device to PXE: Redfish exception occurred. Error: Redfish set boot device failed for node cc96cd2f-9357-4694-a352-8914e6d120a0. Error: HTTP PATCH https://XXXXXXXXX/redfish/v1/Systems/Self returned code 428. Ami.1.0.0.PreconditionHeaderMissing: The request did not provide the required precondition, such as an If-Match or If-None-Match header. (HTTP 500) raised by ironic_inspector.introspect._do_introspect: ironic_inspector.utils.Error: Failed to set boot device to PXE: Redfish exception occurred. Error: Redfish set boot device failed for node cc96cd2f-9357-4694-a352-8914e6d120a0. Error: HTTP PATCH https://XXXXXXXXXX/redfish/v1/Systems/Self returned code 428. Ami.1.0.0.PreconditionHeaderMissing: The request did not provide the required precondition, such as an If-Match or If-None-Match header. (HTTP 500)


Expected results:

Introspection passed.


Additional info:
Is this the correct version of sushy for the RHEL 16.2 containers? This seems old, and possibly a dev-branch version?

Comment 1 Kamil Gustab 2023-04-24 14:11:56 UTC
Hi, can you please advise?

Comment 2 Julia Kreger 2023-04-24 20:05:17 UTC
Greetings!

To answer the versioning question, that version is anticipated given the library versions are matched with the service versions of a specific release. In this case, these versions are from the Train release of OpenStack.

Specifically, it appears to be an issue with ETag support, and there have been a number of changes to ETag handling as well as the sushy library over the years. Realistically, we need more information to assist, but if you're feeling brave and want to try a newer python3-sushy version, that would be a good data point for us.

In order to help, we will need to know the model of the hardware, as well as the firmware version of the BMC. Additionally, the ironic-conductor log *should* have this error in it, and may have additional debugging detail. If we can get logs from the ironic-conductor service with those errors, it will be easier for us to track down what is going on, and whether or not existing ETag fixes will remedy this.


We do have a later version of python3-sushy via the OpenShift Container Platform repositories, python3-sushy-4.1.6-0.20221213125412.e06f1c3.el8.noarch.rpm, which might work as a test point. However, that is not something we have tested nor could we support it. But that version does have the ETag fixes which come to mind, which would help us identify a path forward.

Please let us know.

Thanks,

-Julia and the HardProv team.

Comment 4 Kamil Gustab 2023-04-25 16:48:58 UTC
Hi Julia,

So after changing the sushy RPM to the one you mentioned (python3-sushy-4.1.6-0.20221213125412.e06f1c3.el8.noarch.rpm) in the ironic_conductor container, it successfully changed the boot order to PXE, but it seems to go into a loop and then times out.

HW model: Nokia Airframe OR18
BMC version: 3.61
Ironic Conductor logs: https://pastes.io/raw/0znwocemja
Ironic Inspector logs: https://pastes.io/raw/hvpomudl1v

One more thing is that we have other nodes in Ironic which use IPMI as a driver.

Comment 5 Julia Kreger 2023-04-25 18:43:03 UTC
Greetings,

Interesting, bear with me please, I want to confirm a few details:

1) The node we're talking about is cc96cd2f-9357-4694-a352-8914e6d120a0 ?
2) which appears to have a mac address registered as 00:30:64:1d:3e:b2. Can this please be confirmed by evaluating the port assigned on `openstack baremetal node ports` output?


The mac address appears to be what we get from Ironic for the ethernet port.

However, then I see the following in the ironic-inspector log:

['system', 'kernel', 'cmdline', 'ipa-inspection-callback-url=http://172.31.0.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,numa-topology,logs systemd.journald.forward_to_console=yes BOOTIF=50:6b:4b:44:14:36 ipa-debug=1 ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=0 initrd=agent.ramdisk'], ['system', 'ipmi', 'channel', '1'], ['ipmi', 'lan', 'set-in-progress', 'Set Complete'], ['ipmi', 'lan', 'auth-type-enable', 'Callback : MD5'], ['ipmi', 'lan', 'ip-address-source', 'Static Address'], ['ipmi', 'lan', 'ip-address', 'x.x.x.150'], ['ipmi', 'lan', 'subnet-mask', '255.255.255.192'], ['ipmi', 'lan', 'mac-address', '00:30:64:1d:3e:b2']

Specifically, we see the machine boot for introspection with a MAC address of 50:6b:4b:44:14:36 (see BOOTIF), and I suspect the registered MAC address is wrong, since the additional BMC details logged show the BMC (IPMI) controller listing the originally registered MAC address as its own.
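For readers comparing the booted interface against the registered port, here is a minimal sketch of pulling the BOOTIF value out of a kernel command line like the one logged above. The function name `bootif_mac` is hypothetical, and this is not ironic-inspector's own parsing code; it also assumes that a PXELINUX-style value may carry a leading hardware-type octet (for example `01-`).

```python
def bootif_mac(cmdline):
    """Return the BOOTIF MAC from a kernel command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "BOOTIF" and sep:
            # Normalize to lower-case, colon-separated octets.
            parts = value.lower().replace("-", ":").split(":")
            # Assumption: a 7-octet value has a leading hardware-type octet.
            if len(parts) == 7:
                parts = parts[1:]
            return ":".join(parts)
    return None


# Command line fragment taken from the inspector log in this comment.
CMDLINE = (
    "ipa-inspection-callback-url=http://172.31.0.1:5050/v1/continue "
    "BOOTIF=50:6b:4b:44:14:36 ipa-debug=1 initrd=agent.ramdisk"
)
print(bootif_mac(CMDLINE))
```

The value returned here is the MAC that the registered port would need to match for the lookup to succeed.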

What ends up happening is the overall introspection fails and returns an error, in the inspector log:

2023-04-25 16:37:22.971 7 ERROR ironic_inspector.utils [-] [node: MAC 50:6b:4b:44:14:36 BMC x.x.x.150] The following failures happened during running pre-processing hooks:
Look up error: Could not find a node for attributes {'bmc_address': ['x.x.x.150'], 'mac': ['50:6b:4b:44:14:36', '50:6b:4b:4b:44:3b', '00:30:64:1d:3e:b4', '50:6b:4b:44:14:37', '50:6b:4b:4b:44:3a']}
2023-04-25 16:37:22.971 7 DEBUG ironic_inspector.main [req-3f656a69-973b-45ed-bde8-924da3464f24 - - - - -] Returning error to client: The following failures happened during running pre-processing hooks:
Look up error: Could not find a node for attributes {'bmc_address': ['x.x.x.150'], 'mac': ['50:6b:4b:44:14:36', '50:6b:4b:4b:44:3b', '00:30:64:1d:3e:b4', '50:6b:4b:44:14:37', '50:6b:4b:4b:44:3a']} error_response /usr/lib/python3.6/site-packages/ironic_inspector/main.py:122

This is caused by the data submitted during introspection not aligning. In this specific case, I suspect 00:30:64:1d:3e:b4 is an in-band virtual Ethernet interface to the BMC, whereas the interface the port was created with might be a physical BMC interface. Where things are going sideways: we extract the BMC IP address and the Ethernet interface MAC addresses transmitted in the introspection data payload, and attempt to consult our internal cache using that data. The BMC IP might not be resolvable in this case, depending on the exact details and configuration, but the MAC address is what is expected to be used to resolve the host's identity. In this case, the list submitted does not contain the MAC address registered with Ironic, so introspection fails.

I suspect that if the port is corrected, things will work as expected, i.e. add one of the physical Ethernet MACs to Ironic, and remove what appears to be the BMC MAC address from the port in Ironic.
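To illustrate why the lookup fails, here is a simplified sketch that intersects the registered port MACs with the MACs reported in the introspection payload. It is only an approximation of the matching described above, not ironic-inspector's actual lookup code, and the name `find_registered_macs` is hypothetical. The addresses are taken from the logs in this report.

```python
def find_registered_macs(registered_ports, reported_macs):
    """Return the sorted MACs present in both lists (case-insensitive)."""
    registered = {mac.lower() for mac in registered_ports}
    reported = {mac.lower() for mac in reported_macs}
    return sorted(registered & reported)


# Port MAC registered in Ironic (the BMC's own MAC address).
REGISTERED = ["00:30:64:1d:3e:b2"]

# MACs submitted by the ramdisk in the failed lookup.
REPORTED = [
    "50:6b:4b:44:14:36",
    "50:6b:4b:4b:44:3b",
    "00:30:64:1d:3e:b4",
    "50:6b:4b:44:14:37",
    "50:6b:4b:4b:44:3a",
]

# An empty result means no node can be resolved from the MAC list.
print(find_registered_macs(REGISTERED, REPORTED))
```

Registering any one of the reported physical interface MACs as the port would make the intersection non-empty.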

Going back to Redfish issues in general: interestingly, we don't actually see signs of the boot order being temporarily overridden for introspection:

2023-04-25 16:26:39.164 7 DEBUG sushy.resources.base [req-c4c9156e-30fc-49a5-941b-f944864e726a - - - - -] Received representation of System /redfish/v1/Systems/Self: {'_actions': {'reset': {'allowed_values': None, 'operation_apply_time_support': {'_maintenance_window_resource': {'resource_uri': '/redfish/v1/Systems/Self'}, 'maintenance_window_duration_in_seconds': 600, 'maintenance_window_start_time': None, 'mapped_supported_values': [<ApplyTime.IMMEDIATE: 'Immediate'>, <ApplyTime.AT_MAINTENANCE_WINDOW_START: 'AtMaintenanceWindowStart'>], 'supported_values': ['Immediate', 'AtMaintenanceWindowStart']}, 'target_uri': '/redfish/v1/Systems/Self/Actions/ComputerSystem.Reset'}}, '_oem_vendors': None, '_settings': None, 'asset_tag': 'Free form asset tag', 'bios_version': '4.21.10', 'boot': {'allowed_values': ['None', 'Pxe', 'Floppy', 'Cd', 'Usb', 'Hdd', 'BiosSetup', 'Utilities', 'Diags', 'UefiShell', 'UefiTarget', 'SDCard', 'UefiHttp', 'RemoteDrive', 'UefiBootNext'], 'enabled': <BootSourceOverrideEnabled.DISABLED: 'Disabled'>, 'mode': <BootSourceOverrideMode.LEGACY: 'Legacy'>, 'target': <BootSource.NONE: 'None'>}, 'description': 'System Self', 'hostname': None, 'identity': 'Self', 'indicator_led': None, 'links': {'oem_vendors': None}, 'maintenance_window': None, 'manufacturer': 'Nokia Solutions and Networks', 'memory_summary': {'health': 'Critical', 'size_gib': None}, 'name': 'System', 'part_number': ' ', 'power_state': <PowerState.ON: 'On'>, 'serial_number': 'CHATST0618000874', 'sku': '', 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': <Health.OK: 'OK'>, 'state': <State.ENABLED: 'Enabled'>}, 'system_type': <SystemType.PHYSICAL: 'Physical'>, 'uuid': 'FFFF6661-FFFF-FFFF-6166-0030641D3EB4'} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:656

Specifically I noticed the BootSourceOverrideTarget, represented by 'target' above, field value never gets changed, or at least, a different value seems not to be reported. It could just be the behavior of that BMC, but it still seems odd. If you correct the mac address issue, and retry, that would be helpful for us, but ultimately I suspect we're going to need to look at backporting some portion of ETag fixes to ensure the interactions with the BMC behave as expected.


If you can retry on OSP17.0, that would be a helpful data point as well, as it has most of the fixes for ETags. OSP17.1, which releases soon, has the final known ETag fix for "weak etags", which the RPM I suggested to you also contains.
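As background on "weak etags": in HTTP, a weak ETag carries a `W/` prefix in front of the quoted opaque value, while `If-Match` is specified to use strong comparison, which is one reason weak ETags can need special handling on the client side. The sketch below only shows how the two forms can be told apart. The helper name `split_etag` is hypothetical and this is not sushy's implementation.

```python
def split_etag(etag):
    """Split an ETag header value into (is_weak, quoted_opaque_value)."""
    value = etag.strip()
    is_weak = value.startswith("W/")
    if is_weak:
        # Drop the weakness indicator, keep the quoted opaque tag.
        value = value[2:]
    return is_weak, value


print(split_etag('W/"1682432842"'))
print(split_etag('"1682432842"'))
```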

Comment 6 Kamil Gustab 2023-04-26 15:55:59 UTC
Hi,

So first answering your questions:

1) Yes, that was the node
2) Yes, that was the MAC address assigned to it. This MAC address is the MAC of the BMC (as you already know)

"Specifically I noticed the BootSourceOverrideTarget, represented by 'target' above, field value never gets changed, or at least, a different value seems not to be reported."
Isn't it this request changing it?

2023-04-25 16:28:55.606 7 DEBUG sushy.connector [req-d8000c64-4038-40cd-80ae-be5239637869 f9d309eb199f47b88d1af446dd24ba94 e52242df0b404d49bc1e94cfa47a3ccb - default default] HTTP request: PATCH https://x.x.x.150/redfish/v1/Systems/Self; headers: {'If-Match': '"1682432842"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Pxe', 'BootSourceOverrideEnabled': 'Once'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:125
2023-04-25 16:29:11.480 7 DEBUG sushy.connector [req-d8000c64-4038-40cd-80ae-be5239637869 f9d309eb199f47b88d1af446dd24ba94 e52242df0b404d49bc1e94cfa47a3ccb - default default] HTTP response for PATCH https://x.x.x.150/redfish/v1/Systems/Self: status code: 204 _op /usr/lib/python3.6/site-packages/sushy/connector.py:229
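For reference, the shape of the request logged above can be sketched as follows. This only builds the headers and body and does not talk to a BMC. The function name `build_boot_override_patch` is hypothetical, and the ETag is assumed to come from a prior GET of the same resource.

```python
import json


def build_boot_override_patch(etag, target="Pxe", enabled="Once"):
    """Return (headers, json_body) for a PATCH guarded by If-Match."""
    if not etag:
        # Without a precondition header this BMC answered with HTTP 428.
        raise ValueError("an ETag from a prior GET is required for If-Match")
    headers = {
        "If-Match": etag,
        "Content-Type": "application/json",
        "OData-Version": "4.0",
    }
    body = {
        "Boot": {
            "BootSourceOverrideTarget": target,
            "BootSourceOverrideEnabled": enabled,
        }
    }
    return headers, json.dumps(body)


headers, body = build_boot_override_patch('"1682432842"')
print(headers["If-Match"], body)
```

The ETag value used here is the one visible in the logged request.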


As you suggested, I changed the MAC supplied when importing the node into Ironic from the BMC's MAC to the MAC of the NIC that should PXE boot, and now it went through introspection. I'll let you know whether the deployment goes through when it's done.

My question here is - how could we get that desired MAC address for Ironic?

We were only giving it the BMC MAC address. Where could we get the MAC of the NIC used for PXE boot, since it's not available at /redfish/Systems/Self/EthernetInterfaces?

Comment 7 Julia Kreger 2023-04-26 16:50:12 UTC
Okay, I finally found it represented in the way we would normally expect; unfortunately I just didn't find it yesterday.

2023-04-25 16:29:24.994 7 DEBUG sushy.resources.base [req-fa8d32b3-5698-4a57-b5a6-c480b674afad f9d309eb199f47b88d1af446dd24ba94 e52242df0b404d49bc1e94cfa47a3ccb - default default] Received representation of System /redfish/v1/Systems/Self: {'_actions': {'reset': {'allowed_values': None, 'operation_apply_time_support': {'_maintenance_window_resource': {'resource_uri': '/redfish/v1/Systems/Self'}, 'maintenance_window_duration_in_seconds': 600, 'maintenance_window_start_time': None, 'mapped_supported_values': [<ApplyTime.IMMEDIATE: 'Immediate'>, <ApplyTime.AT_MAINTENANCE_WINDOW_START: 'AtMaintenanceWindowStart'>], 'supported_values': ['Immediate', 'AtMaintenanceWindowStart']}, 'target_uri': '/redfish/v1/Systems/Self/Actions/ComputerSystem.Reset'}}, '_oem_vendors': None, '_settings': None, 'asset_tag': 'Free form asset tag', 'bios_version': '4.21.10', 'boot': {'allowed_values': ['None', 'Pxe', 'Floppy', 'Cd', 'Usb', 'Hdd', 'BiosSetup', 'Utilities', 'Diags', 'UefiShell', 'UefiTarget', 'SDCard', 'UefiHttp', 'RemoteDrive', 'UefiBootNext'], 'enabled': <BootSourceOverrideEnabled.ONCE: 'Once'>, 'mode': <BootSourceOverrideMode.LEGACY: 'Legacy'>, 'target': <BootSource.PXE: 'Pxe'>}, 'description': 'System Self', 'hostname': None, 'identity': 'Self', 'indicator_led': None, 'links': {'oem_vendors': None}, 'maintenance_window': None, 'manufacturer': 'Nokia Solutions and Networks', 'memory_summary': {'health': 'Critical', 'size_gib': None}, 'name': 'System', 'part_number': ' ', 'power_state': <PowerState.ON: 'On'>, 'serial_number': 'CHATST0618000874', 'sku': '', 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': <Health.OK: 'OK'>, 'state': <State.ENABLED: 'Enabled'>}, 'system_type': <SystemType.PHYSICAL: 'Physical'>, 'uuid': 'FFFF6661-FFFF-FFFF-6166-0030641D3EB4'} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:656

"'target': <BootSource.PXE: 'Pxe'"

Sorry for the confusion, I just didn't seem to find it when I was looking yesterday.
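The difference between the two System representations quoted in this thread can be checked mechanically. The sketch below works on the raw Redfish property names seen in the PATCH body (`BootSourceOverrideEnabled`, `BootSourceOverrideTarget`) rather than on sushy objects, and the helper name `pxe_override_active` is hypothetical.

```python
def pxe_override_active(system):
    """Return True if the System JSON shows an active PXE boot override."""
    boot = system.get("Boot", {})
    enabled = boot.get("BootSourceOverrideEnabled")
    target = boot.get("BootSourceOverrideTarget")
    return enabled in ("Once", "Continuous") and target == "Pxe"


# State seen at 16:26:39, before the PATCH.
BEFORE = {"Boot": {"BootSourceOverrideEnabled": "Disabled",
                   "BootSourceOverrideTarget": "None"}}
# State seen at 16:29:24, after the PATCH returned 204.
AFTER = {"Boot": {"BootSourceOverrideEnabled": "Once",
                  "BootSourceOverrideTarget": "Pxe"}}

print(pxe_override_active(BEFORE), pxe_override_active(AFTER))
```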

As for address lookups/host identification:

Performing lookup and association with only the BMC MAC address is not a feature we have at present. One concern with doing so is that we see a number of operators who explicitly disable in-band communication from the host OS to the BMC for security reasons, which forces the use of an Ethernet interface MAC address. The registered port doesn't necessarily *have* to be the interface used for PXE boot, as you're doing introspection and all of the addresses are handled.

The other issue is that ports in Ironic represent possible IP networking interfaces, not BMC MAC addresses, and use of the BMC MAC can actually create problems if that interface is chosen for network connectivity, as the host OS cannot use it.

It *might* be, if you've been using BMC MAC addresses all along, that the IPMI BMC address resolution hinted enough for things to work, whereas the feature to try the same behavior from the redfish_address field contents didn't land until Victoria upstream. That means OSP17.x may work as intended, yet the BMC MAC address should still not be used as a port registered in Ironic.

I guess your question also highlights that even if you were to directly ask Ironic to perform introspection via `openstack baremetal node inspect <uuid>` while using the redfish hardware type and inband inspection, the Ethernet interface MAC address discovery to aid in introspection would come from the Redfish EthernetInterfaces endpoint, which means that wouldn't help. Unfortunately, I suspect this is a case where some advance correlation or data collection may be needed until you're using OSP17, if you're going to use Redfish. That being said, I *think* the DMTF did finally reach consensus on enumeration of MAC addresses in the BMC, but naturally it requires that the BMC be able to communicate inside the hardware with the Ethernet controller, which not every vendor supports. A number of vendors which still support legacy boot mode do get entirely different sets of data from their BMCs, so if you switch to UEFI boot mode (which is better, since you don't prohibit device IO to the first 4GB of RAM), you may get more data from the BMC.
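Where a BMC does populate the EthernetInterfaces collection, the MAC addresses can be gathered from each member's `MACAddress` (or `PermanentMACAddress`) property. The sketch below operates on already-fetched member documents. The sample payload is hypothetical, since this particular BMC reportedly does not list the PXE NIC there, and `collect_macs` is an illustrative name rather than an Ironic or sushy function.

```python
def collect_macs(interfaces):
    """Return lower-cased MACs from Redfish EthernetInterface documents."""
    macs = []
    for nic in interfaces:
        mac = nic.get("MACAddress") or nic.get("PermanentMACAddress")
        if mac:
            macs.append(mac.lower())
    return macs


# Hypothetical member documents; real payloads vary by vendor and firmware.
SAMPLE = [
    {"Id": "usb0", "MACAddress": "00:30:64:1D:3E:B4"},
    {"Id": "nic1"},  # member without any MAC property is skipped
]

print(collect_macs(SAMPLE))
```

If the list comes back without the interface used for PXE boot, the MAC has to be collected some other way before the port is registered.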

Comment 8 Steve Baker 2023-05-08 19:44:21 UTC
We'd still recommend using the correct MAC address, and you may need to discover this manually. We're just checking to confirm that you have a way forward at this point.

Comment 9 Steve Baker 2023-05-22 19:45:12 UTC
Just setting a NEEDINFO to draw attention to the above suggestion.

Comment 12 Julia Kreger 2023-06-21 16:27:04 UTC
Lowering severity since this issue appears to be a mix of ETag issues combined with an initial incorrect configuration.

We believe the underlying ETag issues are resolved in OSP17.1. Please retry with the OSP17.1 beta and let us know. Otherwise, I believe we will close this issue out as fixed in 17.1.

Thanks,

-Julia