Bug 2278022

Summary: metalsmith list returns Request requires an ID but none was found
Product: Red Hat OpenStack
Reporter: Kenny Tordeurs <ktordeur>
Component: python-metalsmith
Assignee: Julia Kreger <jkreger>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Version: 17.1 (Wallaby)
CC: jbadiapa, jelle.hoylaerts.ext, jkreger, madgupta, mariel, sbaker
Target Milestone: z4
Keywords: Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Fixed In Version: python-metalsmith-1.4.4-17.1.20240522060758.5e7461e.el9ost
Last Closed: 2024-11-21 09:40:10 UTC
Type: Bug

Description Kenny Tordeurs 2024-04-30 18:56:33 UTC
Description of problem:
After upgrading from OSP 16.2 to OSP 17.1, metalsmith list fails:

(undercloud) [stack@dir001 ~]$ metalsmith list
[2024-04-26 10:35:27,696] Request requires an ID but none was found


We attempted the solution from the article below, without success:
https://access.redhat.com/solutions/7048029

This happens after step “2.8. Running the director upgrade” of the OSP 16.2 to OSP 17.1 upgrade procedure.

Version-Release number of selected component (if applicable):
OSP 16.2 > OSP 17.1

How reproducible:
Same as in https://bugzilla.redhat.com/show_bug.cgi?id=2233575. We already followed the updated documentation, but the issue still occurs.

Comment 9 Julia Kreger 2024-05-08 14:48:17 UTC
The tl;dr is that an allocation record is missing: comparing the "openstack baremetal node list" output against the "openstack baremetal allocation list" output shows a node with no matching allocation, which causes metalsmith to error.

The record can be created manually using:

openstack baremetal allocation create --uuid <node instance_uuid> --name <node name> --resource-class baremetal --node <node uuid>

It is going to take us a little time to understand *why* the record was missing in this specific case, and what the correct path is to prevent this going forward, in part because the logs have mostly been rotated since the time of the upgrade.
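
For anyone who wants to script the comparison described above, here is a minimal sketch. It assumes the node and allocation records have already been loaded into Python dicts keyed by Ironic API field names ("uuid", "name", "instance_uuid", "node_uuid"); the column names in the CLI's JSON output may differ, so the loading step is left out. The sample values are hypothetical placeholders, not real UUIDs. It only prints the "allocation create" command quoted above; it does not run anything.

```python
import shlex
from typing import Dict, List


def find_missing_allocations(nodes: List[Dict], allocations: List[Dict]) -> List[Dict]:
    """Return deployed nodes that no allocation record points at."""
    allocated_node_uuids = {a.get("node_uuid") for a in allocations}
    return [
        n for n in nodes
        # Only nodes with an instance_uuid are deployed and expected to have an allocation.
        if n.get("instance_uuid") and n["uuid"] not in allocated_node_uuids
    ]


def allocation_create_command(node: Dict, resource_class: str = "baremetal") -> str:
    """Build the manual-repair command for one node (assumes the node has a name)."""
    return (
        "openstack baremetal allocation create"
        f" --uuid {shlex.quote(node['instance_uuid'])}"
        f" --name {shlex.quote(node['name'])}"
        f" --resource-class {shlex.quote(resource_class)}"
        f" --node {shlex.quote(node['uuid'])}"
    )


# Hypothetical sample data, for illustration only.
nodes = [
    {"uuid": "node-1", "name": "controller-0", "instance_uuid": "inst-1"},
    {"uuid": "node-2", "name": "compute-0", "instance_uuid": "inst-2"},
    {"uuid": "node-3", "name": "spare-0", "instance_uuid": None},
]
allocations = [{"uuid": "inst-1", "name": "controller-0", "node_uuid": "node-1"}]

for node in find_missing_allocations(nodes, allocations):
    print(allocation_create_command(node))
```

With the sample data, only compute-0 is reported: spare-0 has no instance_uuid, so it is not expected to have an allocation.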

Comment 10 Kenny Tordeurs 2024-05-08 15:45:49 UTC
(In reply to Julia Kreger from comment #9)
> The tl;dr is an allocation record is missing when comparing nodes in the
> "openstack baremetal node list" and the "openstack baremetal allocation
> list" output, which causes metalsmith to error.
> 
> The record can be created manually using:
> 
> openstack baremetal allocation create --uuid <node instance_uuid> --name
> <node name> --resource-class baremetal --node <node uuid>
> 
> It is going to take us a little time to understand *why* the record was
> missing in this specific case and what the correct path to take to prevent
> this from being the case moving forward, since in part the logs have mostly
> been rotated since the time of the upgrade.

If you can tell us exactly what we need to grab, we can get the logs for you. We will start a fresh upgrade next week (in a lab environment).

Comment 11 Julia Kreger 2024-05-08 16:19:16 UTC
> 
> If you can tell us what exactly we need to grab we can get the logs for you,
> we will start the upgrade freshly next week. (in lab environment)

In what you provided, the logs had already rotated off, except for the ansible.log from where the steps were run *and* the ironic-api log.

But I see what happened after looking at the code, comparing it against the state described, and consulting the documentation. Looking at:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#running-the-overcloud-upgrade-preparation_overcloud-adoption

The Prerequisites in the documentation state that all nodes should be in the ACTIVE state and that no node should be in maintenance state. The documentation handles this by indicating you should unset maintenance on the node. The issue is that if there is a persistent fault, it needs to be resolved *before* proceeding, because the node will just be moved back to maintenance state due to the failure. Essentially, what happens under the hood is that when you run the steps defined by the documentation while the node is in maintenance state, the allocation record creation is skipped, because a node in maintenance is not in a valid state to have work performed on it.

Basically I think the "fix", at a minimum, is a documentation update, and possibly updating metalsmith or the underlying library to try to handle the case where the node's provision state and maintenance state were not checked prior to attempting the overcloud adoption.
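
As a rough illustration of the prerequisite check described above (not what metalsmith or the documented procedure actually runs), a pre-flight script could flag every node that is in maintenance or not in the active provision state before the adoption is attempted. The sketch below assumes node records are dicts using the Ironic API field names "maintenance", "maintenance_reason" and "provision_state"; the function name and sample data are hypothetical.

```python
from typing import Dict, List


def adoption_blockers(nodes: List[Dict]) -> List[str]:
    """List human-readable reasons why nodes are not ready for adoption."""
    problems = []
    for node in nodes:
        label = node.get("name") or node["uuid"]
        if node.get("maintenance"):
            # Unsetting maintenance is not enough if the underlying fault persists.
            reason = node.get("maintenance_reason") or "no reason recorded"
            problems.append(f"{label}: in maintenance ({reason})")
        state = node.get("provision_state")
        if state != "active":
            problems.append(f"{label}: provision_state is {state!r}, expected 'active'")
    return problems


# Hypothetical sample data, for illustration only.
nodes = [
    {"uuid": "node-1", "name": "controller-0", "maintenance": False,
     "provision_state": "active"},
    {"uuid": "node-2", "name": "compute-0", "maintenance": True,
     "maintenance_reason": "BMC certificate not trusted",
     "provision_state": "active"},
]

for problem in adoption_blockers(nodes):
    print(problem)
```

An empty result means every node passed both checks. Anything reported should be resolved at its root cause first, since a node with a persistent fault will simply drop back into maintenance.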

Comment 19 Kenny Tordeurs 2024-06-06 08:44:18 UTC
@Julia As requested we opened a support case and we gathered more data.

The issue here seems to be that ironic doesn't trust the self-signed certs on the BMC.
Because of this, it places the host in maintenance mode and stops communicating with it.
As a side effect, when the upgrade to metalsmith (OSP 17.1) happens, the allocation doesn't get created automatically and you end up with the "Request requires an ID but none was found" error, until you manually create the allocation and solve the self-signed cert issue.

We fully understand the risk of running self-signed certs, but that seems to be how the majority of customers work.
As these BMC networks are 100% shielded from the outside world, most customers have no use case for creating official certificates on them.

Just to show the difference between self-signed and corporate-signed certs, we have attached the appropriate logs in a private note.

And we can confirm that if you only use officially signed certs, you don't hit the issue with metalsmith/ironic.

Comment 21 Julia Kreger 2024-06-06 12:44:17 UTC
(In reply to Kenny Tordeurs from comment #19)

Greetings Kenny!

You may want to explore options such as redfish_verify_ca - https://docs.openstack.org/ironic/wallaby/admin/drivers/redfish.html.

That is an option a customer may take, but they need to understand they are telling the software it is okay not to verify the certificate on the BMC in that case. That being said, it is a decision the customer must take.

Comment 22 Kenny Tordeurs 2024-06-10 17:47:18 UTC
(In reply to Julia Kreger from comment #21)
Hello Julia,

We will test the option you provided, thanks for that!

One caveat: what if the redfish driver isn't the only one in use, and other environments also use the ilo and idrac drivers?
Is there an equivalent option for them?

Can this only be set at the bare metal node creation level, or can it be enabled via director for the entire undercloud at once?

Thank you

Comment 23 Julia Kreger 2024-06-10 18:13:45 UTC
(In reply to Kenny Tordeurs from comment #22)
> (In reply to Julia Kreger from comment #21)
> Hello Julia,
> 
> We will test the option you provided, thanks for that !
> 
> Only 1 caveat, what if the redfish driver isn't the only one used, but also
> the ilo and idrac in other environments.
> Is there an equivalent for them?
> 
> Can this only be done on the bare metal node create level or can this be
> enabled via director on the entire undercloud at once?
> 
> Thank you

Greetings, sorry, I didn't quite realize they were using the ilo hardware type.

I guess that means there are even more variables at play, but "ilo_verify_ca" is the ilo version of the setting. Dell, at a glance, doesn't appear to have an equivalent setting.
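
To make the per-driver difference concrete, here is a small sketch that maps a hardware type to its driver_info option and builds the corresponding "openstack baremetal node set --driver-info" command. The mapping only contains the two options named in this thread (redfish_verify_ca and ilo_verify_ca). The function name, node name and value are hypothetical, and the accepted values (for example a CA bundle path, or False to skip verification) should be checked against the driver documentation for your release.

```python
import shlex
from typing import Optional

# Only the options mentioned in this bug; idrac is deliberately absent.
VERIFY_CA_OPTION = {
    "redfish": "redfish_verify_ca",
    "ilo": "ilo_verify_ca",
}


def verify_ca_command(node: str, driver: str, value: str) -> Optional[str]:
    """Build the node-level command, or return None if no known option exists."""
    option = VERIFY_CA_OPTION.get(driver)
    if option is None:
        return None
    return (
        f"openstack baremetal node set {shlex.quote(node)}"
        f" --driver-info {option}={shlex.quote(value)}"
    )


# Hypothetical node name and value, for illustration only.
for driver in ("redfish", "ilo", "idrac"):
    print(driver, "->", verify_ca_command("compute-0", driver, "False"))
```

Setting the value to False tells the software not to verify the BMC certificate at all, so pointing the option at a CA bundle that covers the BMC certificates is the safer choice where that is feasible.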

Realistically, I highly recommend moving over to the redfish driver and not using the vendor-flavored drivers at this point. All vendors have been moving in the direction of the pure redfish driver. There is an idrac flavor which could also be used, but the stock driver should work for most general activities as well. In this specific case the ilo driver is set, but the driver is ultimately also trying to use redfish, which suggests the hardware is a newer machine. There is also the possibility that the machine is on the outer boundary of what HPE supports/tests for the ilo driver.

Comment 24 Julia Kreger 2024-06-10 18:23:38 UTC
Oh, and as for the question of setting this at the node create level: depending on how the nodes are created, it might not be possible. It is one of those things which, if you're using Ironic as designed, is entirely possible, but with TripleO on top of Ironic masking the user experience to be "more simple", it is only suitable if you're able to directly set "driver_info" field values. It is also not generally possible to assert a default across all drivers; only ilo has a "verify_ca" option in its configuration section, and honestly doing so is really not advisable from a security standpoint.

Comment 36 errata-xmlrpc 2024-11-21 09:40:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHOSP 17.1.4 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:9974