Bug 1255910 - overcloud node delete of one compute node removed all of them
Summary: overcloud node delete of one compute node removed all of them
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: y1
Target Release: 7.0 (Kilo)
Assignee: Jan Provaznik
QA Contact: Omri Hochman
URL:
Whiteboard:
Duplicates: 1261129
Depends On: 1258967
Blocks:
 
Reported: 2015-08-21 20:44 UTC by Ben Nemec
Modified: 2023-02-22 23:02 UTC
CC List: 9 users

Fixed In Version: openstack-heat-2015.1.1-1.el7ost
Doc Type: Bug Fix
Doc Text:
When deleting a node in the Overcloud, the director calculated the number of nodes from the Heat stack's ComputeCount parameter. However, Heat did not update stack parameters if a scale-up operation failed, so the number of nodes that Heat returned in its parameters did not reflect the real number of nodes. This caused the wrong nodes to be deleted on a failed stack. This fix ensures Heat updates the parameters even if a scale operation failed previously. Now the director deletes only the requested nodes when running "overcloud node delete" on a stack where a scale-up operation previously failed.
Clone Of:
Environment:
Last Closed: 2015-10-08 12:17:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:1862 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux OpenStack Platform 7 director update 2015-10-08 16:05:50 UTC

Description Ben Nemec 2015-08-21 20:44:45 UTC
Description of problem: A user reported that, after a failed attempt to scale up their overcloud, they used openstack overcloud node delete to clean up the failed compute nodes.  When they tried to remove one such node, Heat removed _all_ of their compute nodes, including the one that had previously been deployed successfully.


Version-Release number of selected component (if applicable):


How reproducible: Unsure


Steps to Reproduce:
1. Attempt to run openstack overcloud node delete on an overcloud instance that failed to deploy completely.

Actual results: All compute nodes deleted


Expected results: Only the specified compute node deleted.


Additional info: 
Delete command:
openstack overcloud node delete --stack overcloud --templates /home/stack/templates -e /home/stack/network-environment.yaml -e /home/stack/templates/environments/puppet-ceph-external.yaml --debug -e /home/stack/overcloud-dev.yaml <nova_uuid>

Wondering if this could be a doc bug, in that deleting a failed node via Heat in this way is not intended to work.  Maybe in this case a simple nova delete would have been the way to go.  I don't know enough about the implementation of the node delete command to say for sure though.
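A quick sanity check before running the delete (a hedged sketch, not a verified procedure; it assumes the stock ComputeCount parameter and the default overcloud-compute-* node names):

# Compare the stack's recorded compute count with what Nova actually has
heat stack-show overcloud | grep ComputeCount
nova list | grep -c compute

If the two numbers disagree, the stack's view of the node count has drifted, which is the situation described in comment 4 below.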

Comment 4 Jan Provaznik 2015-08-28 11:50:08 UTC
The problem was most probably caused by the overcloud (OC) stack being in an inconsistent state when the node was deleted. After a failed scale-up, the number of nodes recorded in the Heat stack (ComputeCount) doesn't reflect the real number of nodes (ComputeCount wasn't updated because the scale-up failed). Then, when a node is deleted, the stale ComputeCount value is used.

A solution is to make sure that the OC stack is in a consistent state before deleting a node (e.g. re-run "openstack overcloud deploy"; a sketch follows below). I'm afraid, though, that in some situations it's not possible to get the stack into a consistent state, so an alternative solution might be to allow the user to specify the desired number of nodes when deleting a node.
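A minimal sketch of that first workaround, using the template and environment files from the original report (re-running the deploy with unchanged arguments should bring the stack parameters back in line before the delete):

# Restore stack consistency, then delete the node
openstack overcloud deploy --templates /home/stack/templates \
  -e /home/stack/network-environment.yaml \
  -e /home/stack/templates/environments/puppet-ceph-external.yaml \
  -e /home/stack/overcloud-dev.yaml
openstack overcloud node delete --stack overcloud --templates /home/stack/templates <nova_uuid>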

Comment 5 Jan Provaznik 2015-08-28 14:09:14 UTC
Zane pointed out (thanks!) that this recent upstream patch should solve the inconsistency of stack.parameters if update operation fails:
https://review.openstack.org/#/c/215618/

IOW, backporting it should be a sufficient solution - I'm testing this locally now.

Comment 6 Mike Burns 2015-08-28 14:13:26 UTC
based on comment 5, switching this bug to the heat component

Comment 7 Jan Provaznik 2015-09-02 07:09:15 UTC
This issue is solved by Zane's backport patch for BZ 1258967 (https://code.engineering.redhat.com/gerrit/#/c/56834/). Thanks to this patch, Heat returns stack params from the last update operation.

How to test (a rough shell sketch follows below):
1) deploy the overcloud
2) scale up compute nodes beyond the number of available nodes
3) when the scale-up operation fails, try deleting the instances in ERROR state
4) without this patch, some additional instances would be deleted as well
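A rough transcript of those steps (hedged; the --compute-scale flag and the counts are illustrative, adjust to your environment):

# Request more compute nodes than are registered so the scale-up fails
openstack overcloud deploy --templates --compute-scale 3
# Note the UUID of the node left in ERROR state
nova list | grep ERROR
# Delete only the failed node
openstack overcloud node delete --stack overcloud --templates <error_uuid>
# Verify that only the ERROR node is gone
nova list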

Comment 8 Zane Bitter 2015-09-03 16:47:51 UTC
Setting component back to director, making TestOnly. Already depends on the Heat bug 1258967.

Comment 10 Zane Bitter 2015-09-22 17:19:58 UTC
*** Bug 1261129 has been marked as a duplicate of this bug. ***

Comment 11 Jan Provaznik 2015-09-23 10:31:38 UTC
It turns out that under certain circumstances BZ 1258967 is not a sufficient fix for this issue:
- if a user previously tried to delete a node and that operation failed, then using the ComputeCount parameter for computing the new node count is insufficient.

An upstream patch which computes the new node count from the actual nodes in the ResourceGroup is here:
https://review.openstack.org/226682
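As a hedged illustration of the idea (not the patch itself), the real count can be read from the ResourceGroup's nested stack instead of the parameter; "Compute" is the group name in the stock overcloud templates:

# Find the nested stack backing the Compute ResourceGroup
nested_id=$(heat resource-show overcloud Compute | awk '/physical_resource_id/ {print $4}')
# Count its member resources (each member is one compute node, healthy or failed)
heat resource-list "$nested_id" | grep -cE 'COMPLETE|FAILED'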

Comment 12 Zane Bitter 2015-09-24 13:18:24 UTC
I raised a separate BZ, bug 1266102, for the issue in comment #11.

Comment 13 Omri Hochman 2015-09-25 18:08:10 UTC
Verified with: openstack-heat-2015.1.1-4.el7ost.noarch

Thanks jprovazn for the reproduction help:

[stack@undercloud ~]$ nova list
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                   | Status | Task State | Power State | Networks              |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| d05f98fc-585b-4c6c-9221-7faf0ed66af1 | overcloud-compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.0.14 |
| 15b0aa8a-c858-4317-932a-eaab124f871f | overcloud-compute-1    | ERROR  | -          | NOSTATE     |                       |
| 5c4b6e52-f3ab-475e-946d-db44ef16d896 | overcloud-compute-2    | ACTIVE | -          | Running     | ctlplane=192.168.0.15 |
| db68c6d5-6ac6-49e4-8b37-59c36800446c | overcloud-controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+

openstack overcloud node delete --templates --stack overcloud 15b0aa8a-c858-4317-932a-eaab124f871f


[stack@undercloud ~]$ nova list
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                   | Status | Task State | Power State | Networks              |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| d05f98fc-585b-4c6c-9221-7faf0ed66af1 | overcloud-compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.0.14 |
| 5c4b6e52-f3ab-475e-946d-db44ef16d896 | overcloud-compute-2    | ACTIVE | -          | Running     | ctlplane=192.168.0.15 |
| db68c6d5-6ac6-49e4-8b37-59c36800446c | overcloud-controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+

[stack@undercloud ~]$ heat stack-list
+--------------------------------------+------------+-----------------+----------------------+
| id                                   | stack_name | stack_status    | creation_time        |
+--------------------------------------+------------+-----------------+----------------------+
| 8dbb7631-3b07-4fd8-874a-7a2502b7b018 | overcloud  | UPDATE_COMPLETE | 2015-09-22T04:46:27Z |
+--------------------------------------+------------+-----------------+----------------------+

Comment 15 errata-xmlrpc 2015-10-08 12:17:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:1862

