Description of problem:
I deployed on bare metal (1 controller and 1 compute). When I delete the stack with "heat stack-delete overcloud", the stack is reported as deleted, but "nova list" shows that one of the nodes always takes too long to go down. Eventually it goes into the ERROR state and is not deleted, so it is also not available for future deployments.

Here is the output of the "heat stack-delete", "heat stack-list" and "nova list" commands:

[stack@puma01 ~]$ heat stack-delete overcloud
+-------------+------------+--------------------+----------------------+
| id          | stack_name | stack_status       | creation_time        |
+-------------+------------+--------------------+----------------------+
| 702d88f8... | overcloud  | DELETE_IN_PROGRESS | 2015-06-04T06:23:53Z |
+-------------+------------+--------------------+----------------------+

[stack@puma01 ~]$ heat stack-list
+----+------------+--------------+---------------+
| id | stack_name | stack_status | creation_time |
+----+------------+--------------+---------------+
+----+------------+--------------+---------------+

[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ACTIVE | deleting   | Running     |
+--------------+------------------------+--------+------------+-------------+

[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ACTIVE | deleting   | Running     |
+--------------+------------------------+--------+------------+-------------+

[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ERROR  | -          | Running     |
+--------------+------------------------+--------+------------+-------------+

Version-Release number of selected component (if applicable):
openstack-tripleo-0.0.6-dev1717.el7.centos.noarch
openstack-heat-api-2015.1.1-dev11.el7.centos.noarch
openstack-heat-engine-2015.1.1-dev11.el7.centos.noarch

How reproducible:
~100%

Steps to Reproduce:
1. Deploy on bare metal
2. Delete the stack with heat stack-delete
3. Check whether all nodes were deleted by calling "nova list"

Actual results:
* One of the nodes fails to be deleted.
* The output of heat stack-list shows the stack as deleted, which is misleading. Heat should wait until all nodes are *really* deleted successfully, or else show DELETE_FAILED.
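The check in step 3 can be scripted instead of read by eye. The following is a minimal sketch in Python, assuming only the ASCII table layout that "nova list" prints above. The helpers parse_nova_list and stuck_nodes are hypothetical names for illustration and are not part of any OpenStack client.

```python
def parse_nova_list(text):
    """Parse the ASCII table printed by `nova list` into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders (+---+), shell prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stuck_nodes(rows):
    """Return the rows that are in ERROR or still have a 'deleting' task."""
    return [
        row for row in rows
        if row.get("Status") == "ERROR" or row.get("Task State") == "deleting"
    ]


# Sample input copied from the last `nova list` output in this report.
SAMPLE = """\
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ERROR  | -          | Running     |
+--------------+------------------------+--------+------------+-------------+
"""

if __name__ == "__main__":
    for node in stuck_nodes(parse_nova_list(SAMPLE)):
        print(node["ID"], node["Name"], node["Status"])
```

After a successful stack delete the table should have no data rows at all, so any row returned by stuck_nodes indicates the problem described here.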
Can you attach some logs from heat-engine?
Created attachment 1041364 [details]
heat logs

I just reproduced the problem. After heat stack-delete, some nodes are left in ERROR:

$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| ID                                   | Name                    | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| d8142069-54e6-48ce-a6af-ccb4b58e9a1f | overcloud-cephstorage-0 | ERROR  | -          | Running     |          |
| aa04f32d-cb59-4c1f-bb19-842363e7c4d9 | overcloud-compute-0     | ERROR  | -          | Running     |          |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+

Please look towards the end of the attached logs to see whether there are any hints about the problem. Thanks.
The log shows a couple of the deeply nested stacks failing. It's not clear why this doesn't cause the parent stack to fail as well, but if it doesn't, that looks like a Heat bug.

2015-06-21 10:48:01.002 32209 INFO heat.engine.stack [-] Stack DELETE FAILED (overcloud-Ceph-Storage-s6fmly7ijwmc-0-7uinrcozacgx): Resource DELETE failed: Error: Server overcloud-cephstorage-0 delete failed: (None) Unknown
...
2015-06-21 10:48:04.108 32210 INFO heat.engine.stack [-] Stack DELETE FAILED (overcloud-Compute-etcgblcvn5gr-0-dubmckcnux5l): Resource DELETE failed: Error: Server overcloud-compute-0 delete failed: (500) Error destroying the instance on node 1479665f-d94d-4297-8378-fb9f16032353. Provision state still 'deleting'.
I've looked more closely at the attached log, and also at the code, but I can't find any clue as to why the failure of the nested stacks is not bubbling up to their parent stacks.
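To make the expected behaviour concrete, here is a toy model of the roll-up rule. It is only a sketch and not Heat's code: the Stack class, its delete method and the server_delete_ok flag are hypothetical, and the stack names are simplified from the ones in the log. It shows only that a parent should end up DELETE_FAILED whenever any nested stack fails to delete.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stack:
    """Hypothetical stand-in for a Heat stack with nested child stacks."""
    name: str
    children: List["Stack"] = field(default_factory=list)
    server_delete_ok: bool = True  # stands in for the Nova server delete result
    status: str = "CREATE_COMPLETE"

    def delete(self) -> str:
        self.status = "DELETE_IN_PROGRESS"
        # Attempt every child, even after one fails, so nothing is skipped.
        child_statuses = [child.delete() for child in self.children]
        failed = (not self.server_delete_ok) or "DELETE_FAILED" in child_statuses
        self.status = "DELETE_FAILED" if failed else "DELETE_COMPLETE"
        return self.status


# Three levels of nesting, with the server delete failing at the bottom.
compute_0 = Stack("overcloud-Compute-0", server_delete_ok=False)
compute_group = Stack("overcloud-Compute", children=[compute_0])
overcloud = Stack("overcloud", children=[compute_group])

if __name__ == "__main__":
    print(overcloud.delete())
```

In this model the failure at the bottom reaches the top-level stack. What this bug reports is the opposite outcome: the nested stacks log DELETE FAILED, yet the overcloud stack disappears from heat stack-list.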
Bug 1244485 looks like it could easily be related.
This bug is filed against a Version that has reached End of Life. If it is still present in a supported release (http://releases.openstack.org), please update Version and reopen.