Bug 1814123 - ironic fails to undeploy a failed baremetal node even if the node has maintenance enabled
Summary: ironic fails to undeploy a failed baremetal node even if the node has maintenance enabled
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Alistair Tonner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-17 05:38 UTC by Takashi Kajinami
Modified: 2023-09-07 22:29 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-17 12:09:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-28384 0 None None None 2023-09-07 22:29:46 UTC

Description Takashi Kajinami 2020-03-17 05:38:13 UTC
Description of problem:

When we undeploy a baremetal node, ironic always tries to power off the node via the power driver,
and the node ends up in the error provisioning state if the power-off fails.

However, this error occurs even if the node has maintenance mode enabled,
which usually means the node is not functional at the moment.

IMO, ironic should skip powering off the node if the node has maintenance enabled,
so that we can avoid this expected failure on a failed node.


How reproducible:
Always

Steps to Reproduce:
1. Deploy overcloud
2. Disable the IPMI interface of one overcloud node
3. Undeploy the baremetal node with 'openstack baremetal node undeploy <id>' (see the sketch below)
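
A minimal CLI sketch of step 3 and how to confirm the outcome; the node UUID is a placeholder, and setting maintenance first just reflects the scenario above where the node's BMC is broken:

  $ openstack baremetal node maintenance set <node-uuid> --reason "IPMI not reachable"
  $ openstack baremetal node undeploy <node-uuid>
  # provision_state should end up as "error"; last_error should mention the failed power off
  $ openstack baremetal node show <node-uuid> -f value -c provision_state -c maintenance -c last_error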

Actual results:
Undeploy fails and the node goes into the error provisioning state

Expected results:
Undeploy completes without error


Additional info:

Comment 2 Dmitry Tantsur 2020-03-17 12:09:23 UTC
Hi,

Generally, we try to avoid provisioning actions on nodes that experience management problems. A lot of operations inside ironic rely on being able to talk to the node (e.g. with cleaning enabled, you would enter cleaning after tear down, and that would also fail). It is expected that you recover from maintenance first, then proceed with any complex actions.
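
As a sketch of that first path (node UUID is a placeholder): clear maintenance once the management interface is reachable again, then retry the undeploy:

  $ openstack baremetal node maintenance unset <node-uuid>
  $ openstack baremetal node undeploy <node-uuid>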

If the node is not recoverable, you can just delete it completely with `openstack baremetal node delete`.
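
For example, a minimal sketch assuming the node already has maintenance set, as in the scenario above (node UUID is a placeholder):

  $ openstack baremetal node delete <node-uuid>

Note that this removes the node record from ironic entirely, so the node would have to be re-enrolled if it is repaired later.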

Comment 3 Takashi Kajinami 2020-03-17 12:35:33 UTC
Hi Dmitry,


Thanks for the clarification.

> It is expected that you recover from maintenance first, then proceed with any complex actions.
I understood.


May I ask one more question regarding this behaviour?

The actual failure was observed while manipulating the overcloud to remove overcloud nodes.
In this case, "openstack overcloud undeploy" was executed directly to get rid of a failed node before removing the node from nova.
However, IIUC, when the baremetal instance is deleted in nova (which can happen when updating the heat stack with a new blacklist, for example),
nova requests the same unprovision action from ironic and gets an error if the node's IPMI is not working.

My concern is that the current installation doc does not mention that a working IPMI interface is expected when removing nodes from the overcloud,
and that nova instance deletion fails (which ends up as a stack update failure in TripleO) if the IPMI of the node to be removed is not working.

Do you think this concern is valid?
If it is, I'll submit a bug against the installation doc to add a notice about this expected error.
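
As a sketch of what such a notice could suggest checking before removing a node (node UUID is a placeholder; the field selection is only illustrative):

  $ openstack baremetal node show <node-uuid> -f value -c maintenance -c power_state -c provision_state
  # maintenance=True or an unknown power_state hints that the unprovision step may fail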

Thank you,
Takashi

Comment 4 Dmitry Tantsur 2020-03-17 13:03:50 UTC
> My concern is that the current installation doc does not mention that a working IPMI interface is expected when removing nodes from the overcloud

It's a good point. We need to make it clear that a different procedure must be followed for nodes without management access. I think we have this procedure documented somewhere; maybe we should point customers to it?

Feel free to reopen and target to documentation.

