Description of problem:
This issue was originally discussed in bz 1814123. In the failure observed, the IPMI interface of the broken baremetal node becomes unavailable. Although ironic stops polling the power status of a node whose IPMI interface is down, it still needs to access the IPMI interface to power off the node during the undeploy process. As a result, removing or replacing that node fails, because deleting the nova instance fails during the stack update. To avoid the error, we should remove the baremetal node with

$ openstack baremetal node delete <baremetal node id>

so that nova skips undeploying the node.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Deploy the overcloud.
2. Disable the IPMI interface of one of the overcloud nodes.
3. Remove or replace that node according to the product documentation[1].

[1] Compute: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/director_installation_and_usage/scaling-overcloud-nodes#removing-compute-nodes
Controller: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/director_installation_and_usage/replacing-controller-nodes

Actual results:
The stack goes to UPDATE_FAILED status because of an error while deleting the nova instance.

Expected results:
The stack goes to UPDATE_COMPLETE status without any failures.

Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#removing-compute-nodes
16.3.8.i: We've decided this is correct, and we really don't want to recommend "openstack baremetal node delete" in general. If overcloud node delete fails in maintenance mode, there could be any number of root causes, so no general advice would apply. However, 16.3.8.i should recommend waiting 2 minutes after setting maintenance mode.

[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#replacing-a-controller-node
17.4.4: no change required (2 minutes will elapse just reading the docs).
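For illustration, the sequence recommended for 16.3.8.i (set maintenance mode, wait about 2 minutes, then run the overcloud node delete) can be sketched as a small Bash helper. This is a hypothetical sketch, not the documented procedure: the function name `remove_node_with_dead_ipmi`, the `DRY_RUN` and `WAIT_SECONDS` variables, and the default stack name `overcloud` are assumptions. It also assumes that `openstack overcloud node delete` takes the nova server UUID (from `openstack server list`) while `openstack baremetal node maintenance set` takes the ironic node UUID; check the linked documentation for the exact arguments in your release. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove an overcloud node whose IPMI interface is down,
# following the "maintenance mode, wait, then delete" sequence.
set -euo pipefail

# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: remove_node_with_dead_ipmi <server_uuid> <baremetal_node_uuid> [stack]
remove_node_with_dead_ipmi() {
  local server_uuid="$1"                     # assumed: nova server UUID from `openstack server list`
  local baremetal_node="$2"                  # ironic node UUID or name
  local stack="${3:-overcloud}"              # assumed default stack name
  local wait_seconds="${WAIT_SECONDS:-120}"  # about 2 minutes, per the comment

  # Set maintenance mode on the ironic node.
  run openstack baremetal node maintenance set "$baremetal_node"

  # Wait before deleting, as recommended for step 16.3.8.i.
  run sleep "$wait_seconds"

  # Remove the node from the overcloud stack.
  run openstack overcloud node delete --stack "$stack" "$server_uuid"
}
```

To actually run the commands, call it as `DRY_RUN=0 remove_node_with_dead_ipmi <server uuid> <baremetal node uuid> overcloud`. The sketch deliberately does not call `openstack baremetal node delete`, in line with the decision above not to recommend it in general.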
Updated content available on the Customer Portal:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/director_installation_and_usage/assembly_scaling-overcloud-nodes#proc_removing-or-replacing-a-compute-node_scaling-overcloud-nodes
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/director_installation_and_usage/assembly_scaling-overcloud-nodes#proc_removing-or-replacing-a-compute-node_scaling-overcloud-nodes