Bug 1228324

Summary: When deleting the stack, a bare metal node goes to ERROR state and is not deleted
Product: [Community] RDO
Component: openstack-heat
Version: trunk
Target Release: Kilo
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: unspecified
Status: CLOSED EOL
Last Closed: 2016-05-19 15:35:10 UTC
Reporter: Udi Kalifon <ukalifon>
Assignee: Zane Bitter <zbitter>
QA Contact: Amit Ugol <augol>
CC: jpeeler, mburns, srevivo, ukalifon, zbitter
Type: Bug
Doc Type: Bug Fix
Attachments: heat logs

Description Udi Kalifon 2015-06-04 15:28:48 UTC
Description of problem:
I make a deployment on bare metal (1 controller and 1 compute). When I delete the stack with "heat stack-delete overcloud", heat reports the stack as deleted, but "nova list" shows that one of the nodes always takes a long time to go down. Eventually it goes to ERROR state and is never deleted (so it is also not available for future deployments).

Here is the output of the "heat stack-delete", "heat stack-list" and "nova list" commands:

[stack@puma01 ~]$ heat stack-delete overcloud
+-------------+------------+--------------------+----------------------+
| id          | stack_name | stack_status       | creation_time        | 
+-------------+------------+--------------------+----------------------+
| 702d88f8... | overcloud  | DELETE_IN_PROGRESS | 2015-06-04T06:23:53Z | 
+-------------+------------+--------------------+----------------------+
[stack@puma01 ~]$ heat stack-list
+----+------------+--------------+---------------+
| id | stack_name | stack_status | creation_time |
+----+------------+--------------+---------------+
+----+------------+--------------+---------------+
[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ACTIVE | deleting   | Running     |
+--------------+------------------------+--------+------------+-------------+
[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ACTIVE | deleting   | Running     |
+--------------+------------------------+--------+------------+-------------+
[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ERROR  | -          | Running     |
+--------------+------------------------+--------+------------+-------------+



Version-Release number of selected component (if applicable):
openstack-tripleo-0.0.6-dev1717.el7.centos.noarch
openstack-heat-api-2015.1.1-dev11.el7.centos.noarch
openstack-heat-engine-2015.1.1-dev11.el7.centos.noarch


How reproducible:
~100%


Steps to Reproduce:
1. Deploy on bare metals
2. Delete the stack with heat stack-delete
3. Make sure all nodes are deleted by calling "nova list"


Actual results:
* One of the nodes fails to be deleted
* The output from heat stack-list shows the stack as deleted, which is misleading. Heat should wait until all nodes are *really* deleted successfully, or else report DELETE_FAILED.
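
A loop along the following lines catches the failure automatically. This is only a sketch, assuming the undercloud credentials are sourced and the overcloud nodes are the only servers in the tenant (the 10-minute budget is arbitrary):

#!/bin/bash
# Poll "nova list" after "heat stack-delete" until every server is gone.
# Fail fast if a server drops into ERROR state instead of being deleted.
for attempt in $(seq 1 60); do        # ~10 minutes at 10s per attempt
    output=$(nova list)
    if echo "$output" | grep -q ' ERROR '; then
        echo "FAIL: node stuck in ERROR state"; echo "$output"; exit 1
    fi
    if ! echo "$output" | grep -q ' ACTIVE '; then
        echo "PASS: all nodes deleted"; exit 0
    fi
    sleep 10
done
echo "FAIL: timed out waiting for nodes to be deleted"; exit 1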

Comment 1 Zane Bitter 2015-06-17 18:10:25 UTC
Can you attach some logs from heat-engine?

Comment 2 Udi Kalifon 2015-06-21 08:09:33 UTC
Created attachment 1041364 [details]
heat logs

I just recreated the problem. After heat stack-delete, some nodes are left in ERROR state:

$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| ID                                   | Name                    | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| d8142069-54e6-48ce-a6af-ccb4b58e9a1f | overcloud-cephstorage-0 | ERROR  | -          | Running     |          |
| aa04f32d-cb59-4c1f-bb19-842363e7c4d9 | overcloud-compute-0     | ERROR  | -          | Running     |          |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+


Please look towards the end of the logs (attached) to see if there are hints to the problem. Thanks.
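
(A quick way to pull the failures out of the attached log, assuming the default log path /var/log/heat/heat-engine.log on the undercloud:)

$ grep 'DELETE FAILED' /var/log/heat/heat-engine.log | tail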

Comment 3 Zane Bitter 2015-06-30 14:58:51 UTC
The log shows a couple of the deeply nested stacks failing. It's not clear why this wouldn't cause the parent stack to fail as well; if it doesn't, that looks like a Heat bug in its own right.

2015-06-21 10:48:01.002 32209 INFO heat.engine.stack [-] Stack DELETE FAILED (overcloud-Ceph-Storage-s6fmly7ijwmc-0-7uinrcozacgx): Resource DELETE failed: Error: Server overcloud-cephstorage-0 delete failed: (None) Unknown
...
2015-06-21 10:48:04.108 32210 INFO heat.engine.stack [-] Stack DELETE FAILED (overcloud-Compute-etcgblcvn5gr-0-dubmckcnux5l): Resource DELETE failed: Error: Server overcloud-compute-0 delete failed: (500) Error destroying the instance on node 1479665f-d94d-4297-8378-fb9f16032353. Provision state still 'deleting'.
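
(As a manual workaround in the meantime, the stuck servers can usually be cleared by hand. This is only a sketch, not verified against this environment; it assumes admin credentials and the Kilo-era CLI clients, and reuses the UUIDs from comment 2 and the log line above:)

# reset the server out of ERROR, then retry the delete
$ nova reset-state --active aa04f32d-cb59-4c1f-bb19-842363e7c4d9
$ nova delete aa04f32d-cb59-4c1f-bb19-842363e7c4d9
# if the delete still fails, inspect the backing Ironic node's provision state first
$ ironic node-show 1479665f-d94d-4297-8378-fb9f16032353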

Comment 4 Zane Bitter 2015-07-16 18:33:22 UTC
I've looked more closely at the attached log, and also at the code, but I can't find any clue as to why the failure of the nested stacks is not bubbling up to their parent stacks.

Comment 5 Zane Bitter 2015-07-20 15:20:17 UTC
Bug 1244485 looks like it could easily be related.

Comment 6 Chandan Kumar 2016-05-19 15:35:10 UTC
This bug is filed against a version which has reached End of Life.
If it is still present in a supported release (http://releases.openstack.org), please update the Version field and reopen.