Bug 1827537

Summary: Can't remove/replace baremetal node when its IPMI interface is not available
Product: Red Hat OpenStack
Reporter: Takashi Kajinami <tkajinam>
Component: documentation
Assignee: Irina <igallagh>
Status: CLOSED CURRENTRELEASE
QA Contact: Paras Babbar <pbabbar>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 16.2 (Train)
CC: amcleod, astillma, bdobreli, igallagh, jkreger, knoha, pbabbar, sbaker
Target Milestone: ---
Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-06-07 14:01:17 UTC
Type: Bug
Bug Blocks: 2023628    

Description Takashi Kajinami 2020-04-24 06:45:49 UTC
Description of problem:

This issue was originally discussed in bz 1814123.

In the failure we observed, the IPMI interface of the broken baremetal node becomes unavailable.
Although ironic stops polling the power status of a node whose IPMI interface is down, it still needs to
access the IPMI interface to power off the node during the deploy process.

This causes a failure when we remove or replace that node, because deleting the nova instance fails
during the stack update.
To avoid the error, we should remove the baremetal node with
 $ openstack baremetal node delete <baremetal node id>
so that nova skips undeploying the node.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Deploy the overcloud
2. Disable the IPMI interface of one of the overcloud nodes
3. Remove or replace that node according to our product documentation [1]

[1] Compute: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/director_installation_and_usage/scaling-overcloud-nodes#removing-compute-nodes
    Controller: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/director_installation_and_usage/replacing-controller-nodes

Actual results:
The stack goes to UPDATE_FAILED status because of an error while deleting the nova instance

Expected results:
The stack reaches UPDATE_COMPLETE status without any failures

Additional info:

Comment 10 Steve Baker 2022-03-14 20:23:56 UTC
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#removing-compute-nodes
16.3.8.i
We've decided this is correct, and we really don't want to recommend "openstack baremetal node delete" in general. If overcloud node delete fails in maintenance mode there could be any number of root causes so no general advice would apply.

However, 16.3.8.i should recommend waiting 2 minutes after setting maintenance mode.
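
As a rough illustration of that sequence (set maintenance mode, wait, then delete the overcloud node), here is a minimal shell sketch. The function name remove_unreachable_node, its arguments and the WAIT_SECONDS variable are hypothetical and not taken from the product documentation. The stack name defaults to "overcloud" as an assumption, and the exact options accepted by "openstack overcloud node delete" can differ between releases.

```shell
# Sketch only: put a node with an unreachable IPMI interface into
# maintenance mode, wait, then remove it from the overcloud.
# All names below are illustrative placeholders.
remove_unreachable_node() {
    node_uuid="$1"                        # ironic (baremetal) node UUID
    server_id="$2"                        # nova server ID of the overcloud node
    stack="${3:-overcloud}"               # assumed default stack name
    wait_seconds="${WAIT_SECONDS:-120}"   # the 2 minutes suggested above

    # Mark the node as being in maintenance mode.
    openstack baremetal node maintenance set "$node_uuid" || return 1

    # Wait before attempting the removal.
    sleep "$wait_seconds"

    # Remove the node from the overcloud stack.
    openstack overcloud node delete --stack "$stack" "$server_id"
}
```

Usage would look like "remove_unreachable_node <baremetal node id> <nova server id>". If the delete still fails, the cause needs to be investigated case by case, as noted above.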

[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#replacing-a-controller-node
17.4.4 (no change required; 2 minutes will elapse just reading the docs)