Description of problem:
According to Steve Hardy, Heat should no longer list stacks with status DELETE_COMPLETE - but sometimes it does. So far it only seems to happen if the stack is deleted while in the CREATE_IN_PROGRESS state, with a BrokerWaitHandle stuck in progress because an instance cannot access the metadata API. The issue was found because the F18 images used by these example templates don't set up their eth0 correctly and therefore obviously can't access the metadata:
https://github.com/openstack/heat-templates/tree/master/openshift-origin/F18
http://fedorapeople.org/groups/heat/prebuilt-jeos-images/

Version-Release number of selected component (if applicable):
openstack-heat-engine-2013.2-1.0.el6.noarch

How reproducible:
So far always when the above case was hit, but never otherwise.

Steps to Reproduce:
1. Create a stack with these images and templates:
- http://fedorapeople.org/groups/heat/prebuilt-jeos-images/F18-x86_64-openshift-origin-broker-cfntools.qcow2
- http://fedorapeople.org/groups/heat/prebuilt-jeos-images/F18-x86_64-openshift-origin-node-cfntools.qcow2
- https://github.com/openstack/heat-templates/blob/master/openshift-origin/F18/OpenShiftAutoScaling.yaml
2. Wait 10-20 minutes, just to be sure
3. Delete the stack

Actual results:
The stack is deleted successfully and put into the DELETE_COMPLETE state, but it is still listed by `heat stack-list` and in Horizon.

Expected results:
Stack is not listed anymore.

Additional info:
I created and deleted the stacks through the Horizon interface, but could verify with the heat CLI that they are still listed.
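To make the symptom concrete, here is a minimal sketch (not from the report) that checks programmatically whether supposedly deleted stacks are still returned by the stack listing. It assumes python-heatclient's v1 client API; the endpoint and token values are placeholders to adapt to your environment.

# Minimal sketch, not from the report: flag stacks that the API still returns
# even though their status is DELETE_COMPLETE. Endpoint/token are placeholders.
from heatclient.v1.client import Client

HEAT_ENDPOINT = 'http://controller:8004/v1/<tenant_id>'   # placeholder
AUTH_TOKEN = '<keystone token>'                           # placeholder

heat = Client(HEAT_ENDPOINT, token=AUTH_TOKEN)
for stack in heat.stacks.list():
    # A DELETE_COMPLETE stack should normally not appear in this listing at all.
    if stack.stack_status == 'DELETE_COMPLETE':
        print('Still listed after delete: %s (%s)' % (stack.stack_name, stack.id))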
I'll take this since I was discussing it with the reporter, and see if I can reproduce based on the info above.
Ok, so I still can't reproduce this, but I have investigated another similar report and discovered that, for some reason, it's possible for the stack DB entry to end up with action==DELETE, status==COMPLETE and deleted_at==NULL.

This should never happen, and it seems related to a probable error-path bug in Heat: in both the case reported here and the one I've just looked at, errors occurred during the delete (the most recent report had RabbitMQ broker errors requiring a restart, and a failure to delete a keystone user). But instead of ending up in the DELETE_FAILED state, the stack somehow ended up in DELETE_COMPLETE with a NULL deleted_at column.

Sandro Mathys: Can you reproduce and confirm whether the deleted_at column is NULL when you encounter this failure? Steps to get this info:

mysql -u root -p
> use heat;
> select id,action,status,deleted_at from stack where id = '<your stack ID>'
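For context on why that DB state matters: Heat soft-deletes stacks by setting deleted_at, and as far as I understand it the stack listing filters on that column rather than on action/status. The following is an illustrative, self-contained SQLAlchemy sketch (not Heat's actual models or DB API) showing why a row with action=DELETE, status=COMPLETE but deleted_at=NULL keeps showing up:

# Illustrative sketch only (not Heat's actual code): a soft-delete listing
# filters on deleted_at, so a row whose delete "completed" without setting
# deleted_at stays visible even though its status is DELETE_COMPLETE.
from sqlalchemy import Column, String, DateTime, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Stack(Base):
    __tablename__ = 'stack'
    id = Column(String(36), primary_key=True)
    action = Column(String(255))
    status = Column(String(255))
    deleted_at = Column(DateTime)

engine = create_engine('sqlite://')   # stand-in for Heat's MySQL database
Base.metadata.create_all(engine)

with Session(engine) as session:
    # The buggy row: delete reported COMPLETE, but the row was never soft-deleted.
    session.add(Stack(id='broken', action='DELETE', status='COMPLETE', deleted_at=None))
    session.commit()
    # This mirrors the kind of filter a soft-delete listing would use.
    visible = session.query(Stack).filter(Stack.deleted_at.is_(None)).all()
    print([s.id for s in visible])    # the broken row is still "visible"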
Steven, it's just as you said (filtering on deleted_at==NULL seemed easier):

mysql> select id,action,status,deleted_at from stack where deleted_at is NULL;
+--------------------------------------+--------+----------+------------+
| id                                   | action | status   | deleted_at |
+--------------------------------------+--------+----------+------------+
| 094fddb6-e8c7-4c0e-aff7-34c9660fd3f2 | DELETE | COMPLETE | NULL       |
| 2c474e57-7edf-409d-b9bd-1eec1923a9a8 | CREATE | FAILED   | NULL       |
| c9e6075d-5fb6-4860-b736-245b05059d0a | DELETE | COMPLETE | NULL       |
| e1d59858-778e-4f67-8e13-df75ba921f80 | DELETE | COMPLETE | NULL       |
+--------------------------------------+--------+----------+------------+
4 rows in set (0.00 sec)

[root@ctrl-stg ~(ks_admin)]# heat stack-list
+--------------------------------------+------------------------------+-----------------+----------------------+
| id                                   | stack_name                   | stack_status    | creation_time        |
+--------------------------------------+------------------------------+-----------------+----------------------+
| 094fddb6-e8c7-4c0e-aff7-34c9660fd3f2 | OpenShift_Origin             | DELETE_COMPLETE | 2013-10-17T13:52:18Z |
| c9e6075d-5fb6-4860-b736-245b05059d0a | OpenShift_Origin_2           | DELETE_COMPLETE | 2013-10-22T07:34:55Z |
| e1d59858-778e-4f67-8e13-df75ba921f80 | OpenShift_Origin_AutoScaling | DELETE_COMPLETE | 2013-10-22T08:25:57Z |
| 2c474e57-7edf-409d-b9bd-1eec1923a9a8 | OSO_TEST_1                   | CREATE_FAILED   | 2013-10-24T18:33:58Z |
+--------------------------------------+------------------------------+-----------------+----------------------+

FWIW, there were other stacks and they are visible in the database with action=DELETE, status=COMPLETE and deleted_at=<timestamp>, i.e. as expected - so not all is going wrong :)
Ok, I've just managed to reproduce this and raised this upstream bug: https://bugs.launchpad.net/heat/+bug/1247200
Great, thanks for the effort. Much appreciated. Once it's fixed, could you try backporting the patch to Havana for 2013.2.1 or 2013.2.2?
Can I confirm: are you running with deferred_auth_method=trusts? If, as I suspect, you're not, then my reproducer for bug 1247200 is a different route to ending up stuck in DELETE_COMPLETE than the one you are experiencing.

If that is the case, can you please reproduce and attach the backtrace from /var/log/heat/engine.log from when the delete happens? (We must be exiting the delete after setting the status but before marking the row deleted in the DB, which I'd expect to result in an engine backtrace.)

I'm about to post a fix upstream for the failure I reproduced in 1247200, but I want to ensure we fix the whole problem, not just part of it. Thanks!
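In case it helps to answer the question above, a small sketch for checking the setting; it assumes the config file is at /etc/heat/heat.conf and that deferred_auth_method, if set at all, lives in the [DEFAULT] section, with 'password' assumed as the fall-back when it isn't set explicitly:

# Sketch only: report the effective deferred_auth_method. The config path and
# the 'password' fall-back are assumptions, not confirmed from this report.
try:
    from configparser import ConfigParser                      # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2 (as on the affected EL6 host)

cfg = ConfigParser()
cfg.read('/etc/heat/heat.conf')
value = cfg.defaults().get('deferred_auth_method', 'password')
print('deferred_auth_method = %s' % value)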
I did not define deferred_auth_method, which I think means I'm using =password.

I just tried to reproduce, but this time I got "Delete Failed" with the backtrace below, FWIW. I was then able to delete the stack completely as expected, so that was not this issue. I have to run right now but will try to reproduce again tomorrow or so.

2013-11-27 16:51:53.361 2403 ERROR heat.engine.resource [-] Delete WaitConditionHandle "BrokerWaitHandle"
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource Traceback (most recent call last):
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/engine/resource.py", line 527, in delete
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     handle_data = self.handle_delete()
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/engine/signal_responder.py", line 61, in handle_delete
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     self.keystone().delete_stack_user(self.resource_id)
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/common/heat_keystoneclient.py", line 281, in delete_stack_user
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     raise exception.Error(reason)
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource Error: Timed out trying to delete user
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource
2013-11-27 16:51:53.415 2403 DEBUG heat.engine.scheduler [-] Task destroy from None cancelled cancel /usr/lib/python2.6/site-packages/heat/engine/scheduler.py:187
2013-11-27 16:51:53.447 2403 DEBUG heat.engine.scheduler [-] Task destroy from None cancelled cancel /usr/lib/python2.6/site-packages/heat/engine/scheduler.py:187
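For what it's worth, here is a greatly simplified sketch (not Heat's actual code) of the ordering concern from the previous comment: an exception while deleting resources should land the stack in DELETE_FAILED, which is what the backtrace above shows happened this time, whereas an exception after the status is set to COMPLETE but before the row is soft-deleted would leave exactly the DELETE_COMPLETE / deleted_at=NULL combination seen in this report.

import datetime

class StackRow(object):
    """Stand-in for Heat's stack DB row (illustrative only)."""
    def __init__(self):
        self.action, self.status, self.deleted_at = 'CREATE', 'IN_PROGRESS', None

def soft_delete_row(row):
    # Listings filter on this column, so this is what actually hides the stack.
    row.deleted_at = datetime.datetime.utcnow()

def delete_stack(row, delete_resources):
    row.action, row.status = 'DELETE', 'IN_PROGRESS'
    try:
        delete_resources()        # e.g. the keystone user delete that timed out above
    except Exception:
        row.status = 'FAILED'     # expected failure mode: DELETE_FAILED
        return
    row.status = 'COMPLETE'
    # Anything raising between the line above and the next call would leave the
    # row DELETE_COMPLETE with deleted_at == NULL - the symptom in this report.
    soft_delete_row(row)

row = StackRow()
delete_stack(row, delete_resources=lambda: None)
print(row.action, row.status, row.deleted_at)   # DELETE COMPLETE <timestamp>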
Steven, I'm afraid I can't reproduce this anymore and I don't know why. I don't think anything changed in this particular setup and I was able to reproduce it a number of times before.
(In reply to Sandro Mathys from comment #8)
> Steven, I'm afraid I can't reproduce this anymore and I don't know why. I
> don't think anything changed in this particular setup and I was able to
> reproduce it a number of times before.

Can you confirm, have you updated Heat since the last reproduce, and if so, between which versions?

FYI, the upstream fix for the scenario I was able to reproduce has now been merged:
https://review.openstack.org/#/c/58582/
These are all the updates that have been applied since reporting this:

Nov 08 11:36:49 Updated: device-mapper-libs-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: device-mapper-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: device-mapper-event-libs-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: nagios-plugins-1.4.16-10.el6.x86_64
Nov 08 11:36:50 Updated: postgresql-libs-8.4.18-1.el6_4.x86_64
Nov 08 11:36:50 Updated: python-six-1.4.1-1.el6.noarch
Nov 08 11:36:50 Updated: libgcrypt-1.4.5-11.el6_4.x86_64
Nov 08 11:36:51 Updated: 1:python-keystoneclient-0.4.1-3.el6.noarch
Nov 08 11:36:53 Updated: postgresql-8.4.18-1.el6_4.x86_64
Nov 08 11:36:53 Updated: device-mapper-event-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:53 Updated: lvm2-libs-2.02.98-9.el6_4.2.x86_64
Nov 08 11:36:55 Updated: kernel-firmware-2.6.32-358.123.2.openstack.el6.noarch
Nov 08 11:37:00 Installed: kernel-2.6.32-358.123.2.openstack.el6.x86_64
Nov 08 11:37:01 Updated: lvm2-2.02.98-9.el6_4.2.x86_64
Nov 08 11:37:02 Updated: postgresql-devel-8.4.18-1.el6_4.x86_64
Nov 08 11:37:03 Updated: python-heatclient-0.2.5-1.el6.noarch
Nov 08 11:37:03 Updated: gnupg2-2.0.14-6.el6_4.x86_64
Nov 08 11:37:04 Updated: libgcrypt-devel-1.4.5-11.el6_4.x86_64
Nov 08 11:37:04 Updated: python-cinderclient-1.0.7-1.el6.noarch
Nov 08 11:37:04 Updated: nagios-plugins-ping-1.4.16-10.el6.x86_64
Nov 08 11:37:04 Updated: nagios-plugins-nrpe-2.14-5.el6.x86_64
Nov 08 11:37:06 Updated: tzdata-2013g-1.el6.noarch
Nov 08 11:37:06 Updated: nrpe-2.14-5.el6.x86_64
Nov 08 11:37:06 Updated: setup-2.8.14-20.el6_4.1.noarch
Nov 08 11:37:06 Updated: 2:qemu-img-0.12.1.2-2.355.el6_4.9.x86_64
Nov 08 11:37:07 Updated: python-neutronclient-2.3.1-2.el6.noarch
Nov 08 11:37:07 Updated: rsync-3.0.6-9.el6_4.1.x86_64
Nov 08 11:37:07 Updated: rdo-release-havana-7.noarch
Nov 08 11:37:09 Updated: kernel-headers-2.6.32-358.123.2.openstack.el6.x86_64

heatclient or keystoneclient are the only ones that I think could matter. I'm not sure what version of heatclient I had installed before (it's not in yum.log), but I'm 99.9% sure it was the most current version in the repo before this one.

keystoneclient:
Oct 21 09:09:30 Updated: 1:python-keystoneclient-0.4.1-1.el6.noarch
stable/havana backport of the fix for the scenario I was able to reproduce (and some related tests) has been proposed:

https://review.openstack.org/70679
https://review.openstack.org/70680
https://review.openstack.org/70681