Bug 1021989

Summary: heat sometimes keeps listing stacks with status DELETE_COMPLETE
Product: [Community] RDO
Component: openstack-heat
Version: unspecified
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Reporter: Sandro Mathys <sandro>
Assignee: Steven Hardy <shardy>
QA Contact: Amit Ugol <augol>
CC: jpeeler, sandro, shardy, yeylon
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-03-30 23:00:18 UTC

Description Sandro Mathys 2013-10-22 12:57:35 UTC
Description of problem:
According to Steve Hardy, Heat should no longer list stacks with status DELETE_COMPLETE - but sometimes it does. So far it only seems to happen if the stack is deleted while in the CREATE_IN_PROGRESS state, with a BrokerWaitHandle stuck in progress because an instance cannot access the metadata API.

The issue was found because the F18 images used by these example templates don't set up their eth0 correctly, so the instances can't reach the metadata service:
https://github.com/openstack/heat-templates/tree/master/openshift-origin/F18
http://fedorapeople.org/groups/heat/prebuilt-jeos-images/

Version-Release number of selected component (if applicable):
openstack-heat-engine-2013.2-1.0.el6.noarch

How reproducible:
So far, always when the above case was hit, but never otherwise.

Steps to Reproduce:
1. Create a stack from these images and this template (a CLI sketch follows the list):
- http://fedorapeople.org/groups/heat/prebuilt-jeos-images/F18-x86_64-openshift-origin-broker-cfntools.qcow2
- http://fedorapeople.org/groups/heat/prebuilt-jeos-images/F18-x86_64-openshift-origin-node-cfntools.qcow2
- https://github.com/openstack/heat-templates/blob/master/openshift-origin/F18/OpenShiftAutoScaling.yaml
2. Wait 10-20 minutes, just to be sure
3. Delete the stack
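
For reference, a rough CLI equivalent of the Horizon steps - a sketch only; the image names and the template parameter shown are illustrative, check the template's Parameters section for the real ones:

glance image-create --name F18-x86_64-openshift-origin-broker-cfntools \
    --disk-format qcow2 --container-format bare \
    --file F18-x86_64-openshift-origin-broker-cfntools.qcow2
glance image-create --name F18-x86_64-openshift-origin-node-cfntools \
    --disk-format qcow2 --container-format bare \
    --file F18-x86_64-openshift-origin-node-cfntools.qcow2

heat stack-create OpenShift_Origin_AutoScaling \
    -f OpenShiftAutoScaling.yaml \
    -P "KeyName=<your keypair>"    # plus any other parameters the template requires

# wait 10-20 minutes while BrokerWaitHandle stays stuck in progress, then:
heat stack-delete OpenShift_Origin_AutoScaling
heat stack-list    # the stack shows up as DELETE_COMPLETE instead of disappearing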

Actual results:
The stack will be deleted successfully and put into the DELETE_COMPLETE state, but it will still be listed by `heat stack-list` and in Horizon.

Expected results:
Stack not listed anymore.

Additional info:
I created and deleted the stacks through the Horizon interface but could verify with the heat CLI that they are still listed.

Comment 1 Steven Hardy 2013-10-22 13:04:17 UTC
I'll take this since I was discussing it with the reporter, and see if I can reproduce based on the info above.

Comment 2 Steven Hardy 2013-10-31 10:32:29 UTC
Ok, so I still can't reproduce this, but I have investigated another similar report and discovered that, for some reason, it's possible for the stack DB entry to end up with action==DELETE, status==COMPLETE, and deleted_at==NULL.

This should never happen, and it points to a probable error-path bug in heat: in both the case reported here and the one I've just looked at, errors occurred during the delete (the most recent report had RabbitMQ broker errors requiring a restart, and a failure to delete a keystone user). But instead of ending up in the DELETE_FAILED state, we somehow ended up in DELETE_COMPLETE with a NULL deleted_at column.

Sandro Mathys: Can you reproduce and confirm if the deleted_at column is null when you encounter this failure?

Steps to get this info:
mysql -u root -p
> use heat;
> select id,action,status,deleted_at from stack where id = '<your stack ID>';
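
Side note (not needed for the check above): once such a row is confirmed, it can be hidden from `heat stack-list` by setting deleted_at by hand. This is an untested sketch, and only safe if the stack's resources really were removed:

> update stack set deleted_at = utc_timestamp()
>     where id = '<your stack ID>'
>     and action = 'DELETE' and status = 'COMPLETE' and deleted_at is NULL;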

Comment 3 Sandro Mathys 2013-10-31 12:37:53 UTC
Steven, it's just as you said (filtering on deleted_at IS NULL seemed easier):

mysql>  select id,action,status,deleted_at from stack where deleted_at is NULL;
+--------------------------------------+--------+----------+------------+
| id                                   | action | status   | deleted_at |
+--------------------------------------+--------+----------+------------+
| 094fddb6-e8c7-4c0e-aff7-34c9660fd3f2 | DELETE | COMPLETE | NULL       |
| 2c474e57-7edf-409d-b9bd-1eec1923a9a8 | CREATE | FAILED   | NULL       |
| c9e6075d-5fb6-4860-b736-245b05059d0a | DELETE | COMPLETE | NULL       |
| e1d59858-778e-4f67-8e13-df75ba921f80 | DELETE | COMPLETE | NULL       |
+--------------------------------------+--------+----------+------------+
4 rows in set (0.00 sec)

[root@ctrl-stg ~(ks_admin)]# heat stack-list
+--------------------------------------+------------------------------+-----------------+----------------------+
| id                                   | stack_name                   | stack_status    | creation_time        |
+--------------------------------------+------------------------------+-----------------+----------------------+
| 094fddb6-e8c7-4c0e-aff7-34c9660fd3f2 | OpenShift_Origin             | DELETE_COMPLETE | 2013-10-17T13:52:18Z |
| c9e6075d-5fb6-4860-b736-245b05059d0a | OpenShift_Origin_2           | DELETE_COMPLETE | 2013-10-22T07:34:55Z |
| e1d59858-778e-4f67-8e13-df75ba921f80 | OpenShift_Origin_AutoScaling | DELETE_COMPLETE | 2013-10-22T08:25:57Z |
| 2c474e57-7edf-409d-b9bd-1eec1923a9a8 | OSO_TEST_1                   | CREATE_FAILED   | 2013-10-24T18:33:58Z |
+--------------------------------------+------------------------------+-----------------+----------------------+

FWIW, there were other stacks and they are visible in the database with action=DELETE, status=COMPLETE and deleted_at=<timestamp>, i.e. as expected - so not all is going wrong :)
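
A query along these lines shows those rows (a sketch, using the same table and columns as above):

mysql> select id,action,status,deleted_at from stack where action = 'DELETE' and status = 'COMPLETE' and deleted_at is not NULL;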

Comment 4 Steven Hardy 2013-11-01 17:12:37 UTC
Ok, I've just managed to reproduce this and raised this upstream bug:

https://bugs.launchpad.net/heat/+bug/1247200

Comment 5 Sandro Mathys 2013-11-01 17:16:40 UTC
Great, thanks for the effort. Much appreciated.

Once it's fixed, could you try backporting the patch to Havana for 2013.2.1 or 2013.2.2?

Comment 6 Steven Hardy 2013-11-26 16:51:28 UTC
Can I confirm, are you running with deferred_auth_method=trusts?

If, as I suspect, you're not, then my reproducer for bug 1247200 is a different route to getting stuck in DELETE_COMPLETE than the one you are experiencing.

If that is the case, can you please reproduce and attach the backtrace from /var/log/heat/engine.log from when the delete happens? (We must be exiting the delete after setting the status but before marking the row deleted in the DB, which I'd expect to result in an engine backtrace.)

I'm about to post a fix upstream for the failure I reproduced in 1247200, but I want to make sure we fix the whole problem, not just part of it. Thanks!
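
For reference, the option in question lives in /etc/heat/heat.conf. A minimal sketch - the section name and the "password" default are my assumptions for Havana, check your heat.conf.sample:

[DEFAULT]
# unset means the default, i.e. credentials are stored for deferred operations:
#deferred_auth_method=password
# the alternative I'm asking about, using keystone trusts:
#deferred_auth_method=trusts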

Comment 7 Sandro Mathys 2013-11-27 15:55:48 UTC
I did not define deferred_auth_method, which I think means I'm using the default, =password.

I just tried to reproduce, but this time I got "Delete Failed" with the backtrace below, FWIW. I was then able to delete the stack completely as expected, so that's not this issue. I have to run right now, but will try to reproduce again tomorrow or so.


2013-11-27 16:51:53.361 2403 ERROR heat.engine.resource [-] Delete WaitConditionHandle "BrokerWaitHandle"
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource Traceback (most recent call last):
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/engine/resource.py", line 527, in delete
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     handle_data = self.handle_delete()
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/engine/signal_responder.py", line 61, in handle_delete
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     self.keystone().delete_stack_user(self.resource_id)
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/common/heat_keystoneclient.py", line 281, in delete_stack_user
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     raise exception.Error(reason)
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource Error: Timed out trying to delete user
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource 
2013-11-27 16:51:53.415 2403 DEBUG heat.engine.scheduler [-] Task destroy from None cancelled cancel /usr/lib/python2.6/site-packages/heat/engine/scheduler.py:187
2013-11-27 16:51:53.447 2403 DEBUG heat.engine.scheduler [-] Task destroy from None cancelled cancel /usr/lib/python2.6/site-packages/heat/engine/scheduler.py:187

Comment 8 Sandro Mathys 2013-11-29 13:15:58 UTC
Steven, I'm afraid I can't reproduce this anymore and I don't know why. I don't think anything changed in this particular setup and I was able to reproduce it a number of times before.

Comment 9 Steven Hardy 2013-11-29 16:27:03 UTC
(In reply to Sandro Mathys from comment #8)
> Steven, I'm afraid I can't reproduce this anymore and I don't know why. I
> don't think anything changed in this particular setup and I was able to
> reproduce it a number of times before.

Can you confirm: have you updated Heat since the last time you reproduced this, and if so, between which versions?

FYI, the upstream fix for the scenario I was able to reproduce has now been merged:

https://review.openstack.org/#/c/58582/

Comment 10 Sandro Mathys 2013-12-02 11:04:54 UTC
These are all the updates that have been applied since reporting this:
Nov 08 11:36:49 Updated: device-mapper-libs-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: device-mapper-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: device-mapper-event-libs-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: nagios-plugins-1.4.16-10.el6.x86_64
Nov 08 11:36:50 Updated: postgresql-libs-8.4.18-1.el6_4.x86_64
Nov 08 11:36:50 Updated: python-six-1.4.1-1.el6.noarch
Nov 08 11:36:50 Updated: libgcrypt-1.4.5-11.el6_4.x86_64
Nov 08 11:36:51 Updated: 1:python-keystoneclient-0.4.1-3.el6.noarch
Nov 08 11:36:53 Updated: postgresql-8.4.18-1.el6_4.x86_64
Nov 08 11:36:53 Updated: device-mapper-event-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:53 Updated: lvm2-libs-2.02.98-9.el6_4.2.x86_64
Nov 08 11:36:55 Updated: kernel-firmware-2.6.32-358.123.2.openstack.el6.noarch
Nov 08 11:37:00 Installed: kernel-2.6.32-358.123.2.openstack.el6.x86_64
Nov 08 11:37:01 Updated: lvm2-2.02.98-9.el6_4.2.x86_64
Nov 08 11:37:02 Updated: postgresql-devel-8.4.18-1.el6_4.x86_64
Nov 08 11:37:03 Updated: python-heatclient-0.2.5-1.el6.noarch
Nov 08 11:37:03 Updated: gnupg2-2.0.14-6.el6_4.x86_64
Nov 08 11:37:04 Updated: libgcrypt-devel-1.4.5-11.el6_4.x86_64
Nov 08 11:37:04 Updated: python-cinderclient-1.0.7-1.el6.noarch
Nov 08 11:37:04 Updated: nagios-plugins-ping-1.4.16-10.el6.x86_64
Nov 08 11:37:04 Updated: nagios-plugins-nrpe-2.14-5.el6.x86_64
Nov 08 11:37:06 Updated: tzdata-2013g-1.el6.noarch
Nov 08 11:37:06 Updated: nrpe-2.14-5.el6.x86_64
Nov 08 11:37:06 Updated: setup-2.8.14-20.el6_4.1.noarch
Nov 08 11:37:06 Updated: 2:qemu-img-0.12.1.2-2.355.el6_4.9.x86_64
Nov 08 11:37:07 Updated: python-neutronclient-2.3.1-2.el6.noarch
Nov 08 11:37:07 Updated: rsync-3.0.6-9.el6_4.1.x86_64
Nov 08 11:37:07 Updated: rdo-release-havana-7.noarch
Nov 08 11:37:09 Updated: kernel-headers-2.6.32-358.123.2.openstack.el6.x86_64

heatclient or keystoneclient are the only ones that I think could matter. I'm not sure which version of heatclient I had installed before (it's not in yum.log), but I'm 99.9% sure it was the most recent version in the repo before this one. keystoneclient:
Oct 21 09:09:30 Updated: 1:python-keystoneclient-0.4.1-1.el6.noarch

Comment 13 Steven Hardy 2014-02-03 10:12:23 UTC
stable/havana backports of the fix for the scenario I was able to reproduce (plus some related tests) have been proposed:

https://review.openstack.org/70679
https://review.openstack.org/70680
https://review.openstack.org/70681