1228324 – When deleting the stack, a bare metal node goes to ERROR state and is not deleted

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	RDO
Classification:	Community
Component:	openstack-heat
Sub Component:
Version:	trunk
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	Kilo
Assignee:	Zane Bitter
QA Contact:	Amit Ugol
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-06-04 15:28 UTC by Udi Kalifon
Modified:	2016-05-19 15:35 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-05-19 15:35:10 UTC
Embargoed:
Dependent Products:

Attachments
heat logs (7.96 MB, application/x-gzip) 2015-06-21 08:09 UTC, Udi Kalifon	no flags

Description Udi Kalifon 2015-06-04 15:28:48 UTC

Description of problem:
I deployed on bare metal (1 controller and 1 compute node). When I delete the stack with "heat stack-delete overcloud", the stack is reported as deleted, but "nova list" shows that one of the nodes always takes too long to go down. Eventually it goes to ERROR state and is not deleted, so it is also unavailable for future deployments.

Here is the output of the "heat stack-delete", "heat stack-list" and "nova list" commands:

[stack@puma01 ~]$ heat stack-delete overcloud
+-------------+------------+--------------------+----------------------+
| id          | stack_name | stack_status       | creation_time        | 
+-------------+------------+--------------------+----------------------+
| 702d88f8... | overcloud  | DELETE_IN_PROGRESS | 2015-06-04T06:23:53Z | 
+-------------+------------+--------------------+----------------------+
[stack@puma01 ~]$ heat stack-list
+----+------------+--------------+---------------+
| id | stack_name | stack_status | creation_time |
+----+------------+--------------+---------------+
+----+------------+--------------+---------------+
[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ACTIVE | deleting   | Running     |
+--------------+------------------------+--------+------------+-------------+
[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ACTIVE | deleting   | Running     |
+--------------+------------------------+--------+------------+-------------+
[stack@puma01 ~]$ nova list
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ERROR  | -          | Running     |
+--------------+------------------------+--------+------------+-------------+



Version-Release number of selected component (if applicable):
openstack-tripleo-0.0.6-dev1717.el7.centos.noarch
openstack-heat-api-2015.1.1-dev11.el7.centos.noarch
openstack-heat-engine-2015.1.1-dev11.el7.centos.noarch


How reproducible:
~100%


Steps to Reproduce:
1. Deploy on bare metal
2. Delete the stack with "heat stack-delete"
3. Check whether all nodes were deleted by running "nova list"
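
Step 3 can be automated along these lines. This is only a sketch: it assumes the ASCII table layout shown in the "nova list" output above, and the helper names parse_cli_table and servers_not_deleted are made up for illustration, not part of any OpenStack client.

```python
def parse_cli_table(text):
    """Parse an ASCII table (as printed by the nova/heat CLIs) into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip border lines (+----+), shell prompts, and blank lines.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first |...| line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def servers_not_deleted(nova_list_output):
    """Return (name, status) for every server that is still listed."""
    return [(row["Name"], row["Status"]) for row in parse_cli_table(nova_list_output)]


# Sample taken from the last "nova list" call in the description.
SAMPLE = """\
+--------------+------------------------+--------+------------+-------------+
| ID           | Name                   | Status | Task State | Power State |
+--------------+------------------------+--------+------------+-------------+
| 9e993693-... | ov-...-NovaCompute-... | ERROR  | -          | Running     |
+--------------+------------------------+--------+------------+-------------+
"""
```

An empty result from servers_not_deleted means every node is gone; any remaining entry, in particular one with status ERROR, reproduces the problem.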


Actual results:
* One of the nodes fails to be deleted.
* The output of "heat stack-list" suggests that the stack was deleted, which is misleading. Heat should wait until all nodes are *really* deleted, or else show DELETE_FAILED.
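
The expected behaviour described above could look roughly like the following from the outside. This is a hedged sketch, not Heat's implementation: wait_for_delete and the list_servers callable are hypothetical, and the timeout and interval values are arbitrary. DELETE_COMPLETE is used as the success status alongside the DELETE_FAILED named in this report.

```python
import time


def wait_for_delete(list_servers, timeout=600.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until no servers remain, then report a stack-style status.

    list_servers is a hypothetical callable returning [(name, status), ...]
    for the servers that still exist. Returns (status, leftover_names).
    """
    deadline = clock() + timeout
    while True:
        servers = list_servers()
        if not servers:
            return "DELETE_COMPLETE", []
        # A server in ERROR will not go away by itself, so fail immediately.
        errored = [name for name, status in servers if status == "ERROR"]
        if errored:
            return "DELETE_FAILED", errored
        # Servers that are still deleting after the deadline also count as failed.
        if clock() >= deadline:
            return "DELETE_FAILED", [name for name, _ in servers]
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting; a real check would wrap an actual server listing in list_servers.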

Comment 1 Zane Bitter 2015-06-17 18:10:25 UTC

Can you attach some logs from heat-engine?

Comment 2 Udi Kalifon 2015-06-21 08:09:33 UTC

Created attachment 1041364
heat logs

I just reproduced the problem. After "heat stack-delete", some nodes are left in ERROR state:

$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| ID                                   | Name                    | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| d8142069-54e6-48ce-a6af-ccb4b58e9a1f | overcloud-cephstorage-0 | ERROR  | -          | Running     |          |
| aa04f32d-cb59-4c1f-bb19-842363e7c4d9 | overcloud-compute-0     | ERROR  | -          | Running     |          |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+


Please look towards the end of the attached logs to see if there are any hints about the problem. Thanks.

Comment 3 Zane Bitter 2015-06-30 14:58:51 UTC

The log shows a couple of the deeply nested stacks failing. It's not clear why this doesn't cause the parent stack to fail as well, but if it doesn't, that looks like a Heat bug.

2015-06-21 10:48:01.002 32209 INFO heat.engine.stack [-] Stack DELETE FAILED (overcloud-Ceph-Storage-s6fmly7ijwmc-0-7uinrcozacgx): Resource DELETE failed: Error: Server overcloud-cephstorage-0 delete failed: (None) Unknown
...
2015-06-21 10:48:04.108 32210 INFO heat.engine.stack [-] Stack DELETE FAILED (overcloud-Compute-etcgblcvn5gr-0-dubmckcnux5l): Resource DELETE failed: Error: Server overcloud-compute-0 delete failed: (500) Error destroying the instance on node 1479665f-d94d-4297-8378-fb9f16032353. Provision state still 'deleting'.
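
Lines like the two above can be pulled out of a large heat-engine log with a small filter. This is a minimal sketch that assumes the "Stack DELETE FAILED (<stack name>): <reason>" message layout seen in these excerpts; the function name find_failed_deletes is made up for illustration.

```python
import re

# Matches the message layout seen in the excerpts above:
#   Stack DELETE FAILED (<stack name>): <reason>
DELETE_FAILED_RE = re.compile(
    r"Stack DELETE FAILED \((?P<stack>[^)]+)\): (?P<reason>.*)$"
)


def find_failed_deletes(log_text):
    """Return (stack_name, reason) for every failed stack delete in the log."""
    failures = []
    for line in log_text.splitlines():
        match = DELETE_FAILED_RE.search(line)
        if match:
            failures.append((match.group("stack"), match.group("reason")))
    return failures
```

Run against the attached log, this would list each nested stack whose delete failed together with the reason string, which makes it easier to see which failures never reached the parent stack.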

Comment 4 Zane Bitter 2015-07-16 18:33:22 UTC

I've looked more closely at the attached log, and also at the code, but I can't find any clue as to why the failure of the nested stacks is not bubbling up to their parent stacks.

Comment 5 Zane Bitter 2015-07-20 15:20:17 UTC

Bug 1244485 looks like it could easily be related.

Comment 6 Chandan Kumar 2016-05-19 15:35:10 UTC

This bug was filed against a version that has reached End of Life.
If it is still present in a supported release (http://releases.openstack.org), please update the Version field and reopen.
