Description of problem:
A user reported that after a failed attempt to scale up their overcloud, they used "openstack overcloud node delete" to clean up the failed compute nodes. When they tried to remove one such node, Heat removed _all_ of their compute nodes, including the one that had previously been deployed successfully.

Version-Release number of selected component (if applicable):

How reproducible:
Unsure

Steps to Reproduce:
1. Attempt to run "openstack overcloud node delete" on an overcloud instance that failed to deploy completely.

Actual results:
All compute nodes deleted.

Expected results:
Just the specified compute node deleted.

Additional info:
Delete command:
openstack overcloud node delete --stack overcloud --templates /home/stack/templates -e /home/stack/network-environment.yaml -e /home/stack/templates/environments/puppet-ceph-external.yaml --debug -e /home/stack/overcloud-dev.yaml <nova_uuid>

I wonder whether this could be a documentation bug, in that deleting a failed node via Heat in this way is not intended to work. Perhaps a simple "nova delete" would have been the way to go in this case. I don't know enough about the implementation of the node delete command to say for sure, though.
The problem was most probably caused by the overcloud (OC) stack being in an inconsistent state when the node was deleted. After a failed scale-up, the number of nodes recorded in the Heat stack (ComputeCount) doesn't reflect the real number of nodes, because ComputeCount wasn't updated when the scale-up failed. When a node is then deleted, the stale ComputeCount value is used. A solution is to make sure the overcloud is in a consistent state before deleting a node (e.g. re-run "openstack overcloud deploy"). However, I'm afraid that in some situations it isn't possible to get the stack into a consistent state, so an alternative solution might be to allow the user to specify the desired number of nodes when deleting a node.
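The stale-parameter failure mode can be sketched as follows. This is a hypothetical illustration of the arithmetic only, not the real Heat or tripleo-common code; the helper name and data are invented for the example:

```python
# Hypothetical sketch: why a stale ComputeCount can delete every compute node.
# Scale-down works by shrinking the group to (recorded count - nodes removed).

def scaled_down_count(recorded_compute_count, nodes_to_delete):
    """New target size of the compute ResourceGroup after a node delete."""
    return recorded_compute_count - len(nodes_to_delete)

# Reality: the user had 1 compute node, scaled up to 3, and the scale-up
# failed. Three instances exist, but the stack parameter still says 1.
actual_nodes = ["compute-0", "compute-1", "compute-2"]
stale_compute_count = 1  # ComputeCount was never updated by the failed update

# Deleting the single failed node shrinks the group to 1 - 1 = 0 members,
# so Heat removes all compute nodes, not just the one requested.
target = scaled_down_count(stale_compute_count, ["compute-1"])
```

With an up-to-date count of 3, the same delete would correctly leave 2 nodes, which is why re-running "openstack overcloud deploy" first (to reconcile the parameter) avoids the problem.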
Zane pointed out (thanks!) that this recent upstream patch should solve the inconsistency of stack.parameters when an update operation fails: https://review.openstack.org/#/c/215618/ In other words, backporting it should be a sufficient solution; I'm testing this locally now.
Based on comment 5, switching this bug to the heat component.
This issue is solved by Zane's backport patch for BZ 1258967 (https://code.engineering.redhat.com/gerrit/#/c/56834/). Thanks to this patch, Heat returns the stack parameters from the last update operation.

How to test:
1) Deploy an overcloud.
2) Scale up the compute nodes beyond the number of available nodes.
3) When the scale-up operation fails, try to delete the instances in ERROR state.
4) Without this patch, additional instances would also be deleted.
Setting the component back to director and marking this TestOnly; it already depends on the Heat bug 1258967.
*** Bug 1261129 has been marked as a duplicate of this bug. ***
It turns out that under certain circumstances the BZ 1258967 fix is not sufficient for this issue: if a user previously tried to delete a node and that operation failed, then using the ComputeCount parameter to compute the new node count gives the wrong result. An upstream patch that computes the new node count from the nodes actually present in the ResourceGroup is here: https://review.openstack.org/226682
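The idea behind that upstream patch can be sketched as follows: derive the new count from the resources actually present in the group rather than trusting the possibly stale ComputeCount parameter. Again, this is a hypothetical illustration with invented names and data, not the real Heat objects or the patch itself:

```python
# Hypothetical sketch of the fix direction in review 226682: count the
# members that really exist in the compute ResourceGroup, then subtract
# the nodes being removed.

def new_count_from_resources(group_members, nodes_to_delete):
    """New target size derived from actual group membership."""
    return len(group_members) - len(set(nodes_to_delete))

# Actual members of the group, regardless of what ComputeCount says:
group = ["compute-0", "compute-1", "compute-2"]

new_count = new_count_from_resources(group, ["compute-1"])  # leaves 2 nodes
```

Because the count comes from the group's real membership, it stays correct even after a failed scale-up or a previously failed delete left the stack parameters inconsistent.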
I raised a separate BZ, bug 1266102, for the issue in comment #11.
Verified with:
openstack-heat-2015.1.1-4.el7ost.noarch

Thanks jprovazn for the reproduction help:

+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                   | Status | Task State | Power State | Networks              |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| d05f98fc-585b-4c6c-9221-7faf0ed66af1 | overcloud-compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.0.14 |
| 15b0aa8a-c858-4317-932a-eaab124f871f | overcloud-compute-1    | ERROR  | -          | NOSTATE     |                       |
| 5c4b6e52-f3ab-475e-946d-db44ef16d896 | overcloud-compute-2    | ACTIVE | -          | Running     | ctlplane=192.168.0.15 |
| db68c6d5-6ac6-49e4-8b37-59c36800446c | overcloud-controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+

openstack overcloud node delete --templates --stack overcloud 15b0aa8a-c858-4317-932a-eaab124f871f

[stack@undercloud ~]$ nova list
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                   | Status | Task State | Power State | Networks              |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| d05f98fc-585b-4c6c-9221-7faf0ed66af1 | overcloud-compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.0.14 |
| 5c4b6e52-f3ab-475e-946d-db44ef16d896 | overcloud-compute-2    | ACTIVE | -          | Running     | ctlplane=192.168.0.15 |
| db68c6d5-6ac6-49e4-8b37-59c36800446c | overcloud-controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+

[stack@undercloud ~]$ heat stack-list
+--------------------------------------+------------+-----------------+----------------------+
| id                                   | stack_name | stack_status    | creation_time        |
+--------------------------------------+------------+-----------------+----------------------+
| 8dbb7631-3b07-4fd8-874a-7a2502b7b018 | overcloud  | UPDATE_COMPLETE | 2015-09-22T04:46:27Z |
+--------------------------------------+------------+-----------------+----------------------+
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:1862