Bug 2103710 - BMH image provisioning error during ZTP with Lenovo server - "The ETag supplied did not match the ETag required to change this resource."
Summary: BMH image provisioning error during ZTP with Lenovo server - "The ETag suppli...
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.10
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: 4.13.0
Assignee: Jacob Anders
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-07-04 15:19 UTC by yliu1
Modified: 2022-12-02 00:20 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-02 00:20:42 UTC
Target Upstream Version:
Embargoed:
yliu1: needinfo-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 850547 0 None MERGED Do not send compression as acceptable encoding 2022-09-21 14:54:57 UTC

Description yliu1 2022-07-04 15:19:05 UTC
Description of problem:

During ZTP deployment, the bmh image provisioning failed with following error. 

  Normal  ProvisioningStarted     65s    metal3-baremetal-controller  Image provisioning started for https://assisted-image-service-multicluster-engine.apps.kni-qe-17.lab.eng.rdu2.redhat.com/images/685cefb6-52b7-4aaa-96fa-a458f30991e2?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI2ODVjZWZiNi01MmI3LTRhYWEtOTZmYS1hNDU4ZjMwOTkxZTIifQ.-vZQpCAmItXqrhWvjlWpT0EmgjYY4Gr7pRreMkBFCyElEplsQRlIQeVAaIarEcM7p0J1-m1Qf2cX2sRlOpAOmg&arch=x86_64&type=minimal-iso&version=4.9
  Normal  ProvisioningError       45s    metal3-baremetal-controller  Image provisioning failed: Deploy step deploy.deploy failed with HTTPError: HTTP PATCH https://10.16.230.34/redfish/v1/Managers/1/VirtualMedia/EXT1 returned code 412. Base.1.8.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'MessageSeverity': 'Critical', 'MessageArgs': [], 'MessageId': 'Base.1.8.PreconditionFailed', 'Resolution': 'Try the operation again using the appropriate ETag.', '@odata.type': '#Message.v1_1_0.Message', 'Message': 'The ETag supplied did not match the ETag required to change this resource.'}].


Version-Release number of selected component (if applicable):
4.10.20

How reproducible:
100% in my env

Steps to Reproduce:
1. Start spoke cluster deployment via ZTP gitops. siteconfig used is here: 
http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/242339a7e61248cfc76eea6fe8b77d6a943d6aa8/siteconfig/se450.yaml

2. Wait for iso to be created in infraenv

3. wait for provisioned state in bmh

Actual results:
3. bmh image provisioning started and failed with following error: 
  Normal  ProvisioningStarted     65s    metal3-baremetal-controller  Image provisioning started for https://assisted-image-service-multicluster-engine.apps.kni-qe-17.lab.eng.rdu2.redhat.com/images/685cefb6-52b7-4aaa-96fa-a458f30991e2?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI2ODVjZWZiNi01MmI3LTRhYWEtOTZmYS1hNDU4ZjMwOTkxZTIifQ.-vZQpCAmItXqrhWvjlWpT0EmgjYY4Gr7pRreMkBFCyElEplsQRlIQeVAaIarEcM7p0J1-m1Qf2cX2sRlOpAOmg&arch=x86_64&type=minimal-iso&version=4.9
  Normal  ProvisioningError       45s    metal3-baremetal-controller  Image provisioning failed: Deploy step deploy.deploy failed with HTTPError: HTTP PATCH https://10.16.230.34/redfish/v1/Managers/1/VirtualMedia/EXT1 returned code 412. Base.1.8.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'MessageSeverity': 'Critical', 'MessageArgs': [], 'MessageId': 'Base.1.8.PreconditionFailed', 'Resolution': 'Try the operation again using the appropriate ETag.', '@odata.type': '#Message.v1_1_0.Message', 'Message': 'The ETag supplied did not match the ETag required to change this resource.'}].


Expected results:
bmh is in provisioned state.

Additional info:
Manually attaching an iso image via BMC GUI works fine for this server.

Comment 3 Iury Gregory Melo Ferreira 2022-07-07 13:21:46 UTC
Moving to Jacob since was looking at the issue when Rhys mentioned on slack (he is already looking into alternatives)

Comment 4 Jacob Anders 2022-07-11 12:00:09 UTC
Thank you for the report, Yang.

Can I please ask you to confirm the exact Lenovo model in question?

We've had at least one similar report so I'm trying to determine if these issues are linked to the same server model.

Thank you,
Jacob

Comment 6 yliu1 2022-07-12 16:47:28 UTC
@janders model is this: ThinkEdge SE450.
And I can give you access to my server if needed.

Comment 7 mmorency 2022-07-12 17:08:53 UTC
running into the same issue on the same lenovo model. it was working fine for Hubcluster OCP 4.9 , ACM 2.4.x and ztp 4.10 but now failing for OCP 4.10.16, ACM 2.5.1. BMH is showing 412 error code. provisioning failed.

Comment 8 Jacob Anders 2022-07-20 13:17:02 UTC
Yang, Rhys, can I please confirm the BIOS/firmware versions on your machines?

We're having a discussion about this with the upstream Ironic community, this would be a very useful reference to compare results between different operators.

Thank you!

Comment 10 Jacob Anders 2022-07-20 23:44:13 UTC
Thank you Yang. So it's 1.10 BMC firmware revision.

Rhys: I dug out a similar screenshot from our slack thread and it looks like you're on 1.10 as well - please correct me if I am wrong.

Comment 11 Jacob Anders 2022-07-21 00:29:54 UTC
Added an upstream patch proposed by Julia which may help resolve this issue.

Comment 12 Rhys Oxenham 2022-07-21 09:42:35 UTC
(In reply to Jacob Anders from comment #10)
> Thank you Yang. So it's 1.10 BMC firmware revision.
> 
> Rhys: I dug out a similar screenshot from our slack thread and it looks like
> you're on 1.10 as well - please correct me if I am wrong.

Firmware as below...

BMC (Primary) 	Active 	        1.10 	USX304W 	2022-04-20
BMC (Backup) 	Inactive 	1.10 	USX304W 	2022-04-20
UEFI 	        Active 	        1.01 	CME104I 	2022-03-17

Comment 13 Rhys Oxenham 2022-07-21 11:20:49 UTC
I tested the above linked patch (https://review.opendev.org/c/openstack/sushy/+/850547) and can seemingly now make requests against Redfish without eTag issues, but I'm having a problem proceeding further. I'm not sure if it's a side effect from the patch though.

Image provisioning failed: Deploy step deploy.deploy failed: Redfish exception occurred. Error: Redfish set boot device failed for node a7573fbd-15c5-4cb6-b8c9-567b378b8b6f. Error: HTTP PATCH https://192.168.1.1/redfish/v1/Systems/1/Pending returned code 400. Base.1.8.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'@odata.type': '#Message.v1_1_0.Message', 'Resolution': 'Remove the property from the request body and resubmit the request if the operation failed.', 'MessageArgs': ['BootSourceOverrideTarget'], 'MessageSeverity': 'Warning', 'MessageId': 'Base.1.8.PropertyNotWritable', 'Message': 'The property BootSourceOverrideTarget is a read only property and cannot be assigned a value.'}].

It seems to be trying to navigate to /Pending, is that to be expected?

Comment 14 Julia Kreger 2022-07-22 15:10:46 UTC
Just chiming in since I haven't had a chance to cycle back as of yet. No, it is not expected AFAIK. We're clearly seeing varying/differing behavior here, which I think is going to require some careful re-reading of spec documents before we'll make progress.

FWIW, there is a community/operator and customer who has observed the exact same behavior with their brand new lenovo server when they also grabbed the patch to give it a try. I *believe*, based upon comments they are making and questions they are asking, that they might have found themselves getting further along with some additional changes/logic, but I've not gotten details nor have they posted a patch yet.

Comment 15 mmorency 2022-08-01 20:04:18 UTC
no patch found yet? we are waiting for some workaround to get these servers certified on OCP 4.9.x.

Comment 16 Julia Kreger 2022-08-03 16:12:33 UTC
So, the navigation to /Pending doesn't seem to be in the DMTF specification. I *suspect* it is a result of the implementor's decisions around error handling. BootSourceOverrideTarget is also explicitly listed as a read-write field, so I'm suspecting that machine is in the territory of some sort of bug.  Or it could be in some sort of unrelated state, which is temporarily treating the field as read-only. Fixing etag behavior likely wouldn't change that behavior.

In regards to the upstream Etag patch, we've requested reviews from representitives from multiple hardware vendors to validate that they do not foresee any issues.

Comment 17 Jacob Anders 2022-08-05 08:09:11 UTC
(In reply to mmorency from comment #15)
> no patch found yet? we are waiting for some workaround to get these servers
> certified on OCP 4.9.x.

Unfortunately there is no patch or workaround at the moment. As discussed on Slack I'm intending to have a closer look at this in few weeks time (will be off next week).

Comment 18 Rhys Oxenham 2022-08-05 09:27:24 UTC
Also confirming that Julia's further patch set - https://review.opendev.org/c/openstack/sushy/+/850547 - also resolves the eTag issue on the ThinkEdge SE450, but we're still left with the strange "/Pending" redirect for virtual media attachment.

Comment 19 Dmitry Tantsur 2022-08-08 14:33:09 UTC
The /Pending redirect is actually coming from the System itself:

  "@Redfish.Settings": {
    "SupportedApplyTimes": [
      "OnReset"
    ],
    "Messages": [],
    "@odata.type": "#Settings.v1_3_0.Settings",
    "SettingsObject": {
      "@odata.id": "/redfish/v1/Systems/1/Pending"
    },
    "Time": null
  },

Why it does not accept the update is beyond my understanding.

Comment 20 Jacob Anders 2022-09-06 23:48:23 UTC
Early patch is up for review:

https://review.opendev.org/c/openstack/sushy/+/856123

I hope to test it on SE450 later this week (Marc kindly provided me with access to one).

Will keep you posted.

Comment 21 Jacob Anders 2022-09-08 05:37:53 UTC
I've done some specific testing on Lenovo SE450 (big thanks to Marc and team for providing me access).

I again ran into the previously reported problem with the /Pending/ URL (as described in https://bugzilla.redhat.com/show_bug.cgi?id=2103710#c19)

I investigated the problem in detail and here's where sushy/ironic are hitting the problem:

https://opendev.org/openstack/sushy/src/branch/master/sushy/resources/system/system.py#L286

If a BMC provides a SettingsURL, we attempt to use it. 

For some reason, the SE450 provides us a non-functional Settings URL yet if we ignore it and issue a PATCH request against the System, that works.

This looks like a BMC firmware bug that should be addressed in firmware. I will raise this specifically in a discussion with the Vendor (it's been mentioned alongside eTag issues but I think we need a dedicated discussion). I will also investigate possible workarounds.

Comment 22 Jacob Anders 2022-09-09 12:22:27 UTC
Added a gerrit link to a sushy patch with a possible workaround for consideration

Comment 24 Emre Özkan 2022-09-15 15:59:33 UTC
Customer hardware: Lenovo / ThinkSystem SR650 V2

Comment 25 Rui Gouveia 2022-09-15 17:24:04 UTC
{
  'ProductName': 'ThinkSystem SR650 V2',
  'SystemId': '7Z73CTO1WW',
  'FirmwareVersion': 'AFBT20N'
}

'firmware_version': 'AFBT20N // 1.80 // 2022-04-19'

highlights from Ironic logs:

~~~
2022-09-15 11:10:38.416 1 DEBUG sushy.connector [req-e2a2a204-d345-4630-9800-cc8123146889 - - - - -] HTTP request: PATCH https://<server_ip>/redfish/v1/Managers/1/VirtualMedia/EXT1; headers: {'If-Match': 'W/"47ec134fd3872967abd","47ec134fd3872967abd"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Image': 'http://10.1.54.8:80/redfish/boot-1bfdc2e3-a486-4d0c-80a0-d043c2ff7b14.iso', 'Inserted': True, 'WriteProtected': True}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:111

2022-09-15 11:10:38.579 1 DEBUG sushy.exceptions [req-e2a2a204-d345-4630-9800-cc8123146889 - - - - -] HTTP response for PATCH https://<server_ip>/redfish/v1/Managers/1/VirtualMedia/EXT1: status code: 412, error: Base.1.11.GeneralError: A general error has occurred. See ExtendedInfo for more information., extended: [{'@odata.type': '#Message.v1_1_2.Message', 'Resolution': 'Try the operation again using the appropriate ETag.', 'Message': 'The ETag supplied did not match the ETag required to change this resource.', 'MessageArgs': [], 'MessageSeverity': 'Critical', 'MessageId': 'Base.1.11.PreconditionFailed'}] __init__ /usr/lib/python3.6/site-packages/sushy/exceptions.py:122

2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector [req-e2a2a204-d345-4630-9800-cc8123146889 - - - - -] Unable to start managed inspection for node 1bfdc2e3-a486-4d0c-80a0-d043c2ff7b14: HTTP PATCH https://10.1.82.23/redfish/v1/Managers/1/VirtualMedia/EXT1 returned code 412. Base.1.11.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'@odata.type': '#Message.v1_1_2.Message', 'Resolution': 'Try the operation again using the appropriate ETag.', 'Message': 'The ETag supplied did not match the ETag required to change this resource.', 'MessageArgs': [], 'MessageSeverity': 'Critical', 'MessageId': 'Base.1.11.PreconditionFailed'}]: sushy.exceptions.HTTPError: HTTP PATCH https://<server_ip>/redfish/v1/Managers/1/VirtualMedia/EXT1 returned code 412. Base.1.11.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'@odata.type': '#Message.v1_1_2.Message', 'Resolution': 'Try the operation again using the appropriate ETag.', 'Message': 'The ETag supplied did not match the ETag required to change this resource.', 'MessageArgs': [], 'MessageSeverity': 'Critical', 'MessageId': 'Base.1.11.PreconditionFailed'}]

2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector Traceback (most recent call last):
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/inspector.py", line 210, in _start_managed_inspection
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     task.driver.boot.prepare_ramdisk(task, ramdisk_params=params)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/boot.py", line 558, in prepare_ramdisk
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     _insert_vmedia(task, managers, iso_ref, sushy.VIRTUAL_MEDIA_CD)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/tenacity/__init__.py", line 329, in wrapped_f
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     return self.call(f, *args, **kw)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/tenacity/__init__.py", line 409, in call
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     do = self.iter(retry_state=retry_state)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/tenacity/__init__.py", line 356, in iter
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     return fut.result()
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     return self.__get_result()
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     raise self._exception
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/tenacity/__init__.py", line 412, in call
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     result = fn(*args, **kwargs)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/boot.py", line 200, in _insert_vmedia
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     write_protected=True)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/sushy/resources/manager/virtual_media.py", line 167, in insert_media
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     self._conn.patch(target_uri, data=payload, headers=headers)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/sushy/connector.py", line 247, in patch
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     **extra_session_req_kwargs)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/sushy/connector.py", line 127, in _op
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     exceptions.raise_for_response(method, url, response)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.6/site-packages/sushy/exceptions.py", line 172, in raise_for_response
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector     raise HTTPError(method, url, response)
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector sushy.exceptions.HTTPError: HTTP PATCH https://<server_ip>/redfish/v1/Managers/1/VirtualMedia/EXT1 returned code 412. Base.1.11.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'@odata.type': '#Message.v1_1_2.Message', 'Resolution': 'Try the operation again using the appropriate ETag.', 'Message': 'The ETag supplied did not match the ETag required to change this resource.', 'MessageArgs': [], 'MessageSeverity': 'Critical', 'MessageId': 'Base.1.11.PreconditionFailed'}]
2022-09-15 11:10:38.580 1 ERROR ironic.drivers.modules.inspector

Comment 27 Jacob Anders 2022-09-27 02:16:07 UTC
Hi Pavitra and Emre,

There's been some confusion around this BZ due to the fact we're trying to cover two separate issues in it:

1) HTTP 412 error as per the initial description ( https://bugzilla.redhat.com/show_bug.cgi?id=2103710#c0 )

2) Base.1.8.PropertyNotWritable error against /Pending/ URL ( https://bugzilla.redhat.com/show_bug.cgi?id=2103710#c13 )

In hindsight, I should have requested a second BZ to be opened against 2) but I was too focused on troubleshooting and we just kept going with the second issue in the same BZ. I will split it out later this week for the benefit of more clarity while tracking these issues going forward. At this point it is not clear whether issue 2) is specific to Lenovo SE450 only or whether it may impact other Lenovo models and possibly non-Lenovo machines. So far we have only observed the issue 2) on the SE450 model.

Issue 1) should be resolved by this upstream patch: https://review.opendev.org/c/openstack/sushy/+/850547/ which should be included in OpenShift 4.10.34 z-stream.

Would your customer be in position to upgrade to the latest 4.10 z-stream, try again and report back whether the fix to issue 1) has been successful? This would greatly assist in our troubleshooting and issue tracking.

Setting status of this BZ to MODIFIED to reflect the real state of the fix - it is already available in the 4.10 codebase.

Thank you,
Jacob

Comment 28 Jacob Anders 2022-10-05 23:29:20 UTC
NOTE: a new bug focusing on the /Pending/ URL and read only BootSourceOverrideEnabled: https://issues.redhat.com/browse/OCPBUGS-2052. I will continue working on this second issue in the new bug, at this stage development work on this bug is complete (verification of the fix is required).

Comment 29 Jacob Anders 2022-10-13 00:23:56 UTC
Pavira, Emre

Just following up on my comment: https://bugzilla.redhat.com/show_bug.cgi?id=2103710#c27. Would you be able to have your customers verify that the issue is fixed in the latest 4.10.z release?

Thank you,
Jacob

Comment 31 Vinu K 2022-11-14 20:15:31 UTC
Hello Team,

Any update on this?

Thanks,
Vinu K

Comment 33 Jacob Anders 2022-12-02 00:20:42 UTC
The two issues affecting Lenovo SE450 that this bug touches on have been fixed in 4.13 / master and are awaiting verification on the specific hardware to be completed:

https://issues.redhat.com/browse/OCPBUGS-3480

https://issues.redhat.com/browse/OCPBUGS-4225

Once this is done, we will commence backporting to 4.12, 4.11.z and 4.10.z.

Given these have been fixed in master (and backports will be tracked in JIRA) -  and I haven't heard further on my questions in https://bugzilla.redhat.com/show_bug.cgi?id=2103710#c32 I will proceed to close this bug.

I would normally close this as duplicate of https://issues.redhat.com/browse/OCPBUGS-3480 but given I can't mark duplicates across Bugzilla and JIRA, closing with CANTFIX instead.


Note You need to log in before you can comment on or make changes to this bug.