Description of problem:
According to Steve Hardy, Heat should no longer list stacks with status DELETE_COMPLETE - but sometimes it does. So far it only seems to happen if the stack is deleted while in the CREATE_IN_PROGRESS state, with a BrokerWaitHandle stuck in progress because an instance cannot access the metadata API. The issue was found because the F18 images used by these example templates don't set up their eth0 correctly and therefore obviously can't access the metadata:
https://github.com/openstack/heat-templates/tree/master/openshift-origin/F18
http://fedorapeople.org/groups/heat/prebuilt-jeos-images/

Version-Release number of selected component (if applicable):
openstack-heat-engine-2013.2-1.0.el6.noarch

How reproducible:
So far always when the above case was hit, but never otherwise.

Steps to Reproduce:
1. Create a stack with these images and templates:
- http://fedorapeople.org/groups/heat/prebuilt-jeos-images/F18-x86_64-openshift-origin-broker-cfntools.qcow2
- http://fedorapeople.org/groups/heat/prebuilt-jeos-images/F18-x86_64-openshift-origin-node-cfntools.qcow2
- https://github.com/openstack/heat-templates/blob/master/openshift-origin/F18/OpenShiftAutoScaling.yaml
2. Wait 10-20 minutes, just to be sure
3. Delete the stack

Actual results:
The stack is deleted successfully and put into the DELETE_COMPLETE state, but it is still listed by `heat stack-list` and in Horizon.

Expected results:
Stack is not listed anymore.

Additional info:
I created and deleted the stacks through the Horizon interface, but could verify with the heat CLI that they are still listed.
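To make the symptom concrete, here is a minimal sketch (not from the report) that checks programmatically whether supposedly deleted stacks are still returned by the stack listing. It assumes python-heatclient's v1 client API; the endpoint and token values are placeholders to adapt to your environment.

# Minimal sketch, not from the report: flag stacks that the API still returns
# even though their status is DELETE_COMPLETE. Endpoint/token are placeholders.
from heatclient.v1.client import Client

HEAT_ENDPOINT = 'http://controller:8004/v1/<tenant_id>'   # placeholder
AUTH_TOKEN = '<keystone token>'                           # placeholder

heat = Client(HEAT_ENDPOINT, token=AUTH_TOKEN)
for stack in heat.stacks.list():
    # A DELETE_COMPLETE stack should normally not appear in this listing at all.
    if stack.stack_status == 'DELETE_COMPLETE':
        print('Still listed after delete: %s (%s)' % (stack.stack_name, stack.id))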
I'll take this since I was discussing it with the reporter, and see if I can reproduce based on the info above.
Ok, so I still can't reproduce this, but I have investigated another similar report and discovered that, for some reason, it's possible for the stack DB entry to end up with action==DELETE, status==COMPLETE and deleted_at==NULL.

This should never happen, and it seems related to a probable error-path bug in Heat: in both the case reported here and the one I've just looked at, errors occurred during the delete (the most recent report had RabbitMQ broker errors requiring a restart, and a failure to delete a keystone user). But instead of ending up in the DELETE_FAILED state, the stack somehow ended up in DELETE_COMPLETE with a NULL deleted_at column.

Sandro Mathys: Can you reproduce and confirm whether the deleted_at column is NULL when you encounter this failure? Steps to get this info:

mysql -u root -p
> use heat;
> select id,action,status,deleted_at from stack where id = '<your stack ID>'
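For context on why that DB state matters: Heat soft-deletes stacks by setting deleted_at, and as far as I understand it the stack listing filters on that column rather than on action/status. The following is an illustrative, self-contained SQLAlchemy sketch (not Heat's actual models or DB API) showing why a row with action=DELETE, status=COMPLETE but deleted_at=NULL keeps showing up:

# Illustrative sketch only (not Heat's actual code): a soft-delete listing
# filters on deleted_at, so a row whose delete "completed" without setting
# deleted_at stays visible even though its status is DELETE_COMPLETE.
from sqlalchemy import Column, String, DateTime, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Stack(Base):
    __tablename__ = 'stack'
    id = Column(String(36), primary_key=True)
    action = Column(String(255))
    status = Column(String(255))
    deleted_at = Column(DateTime)

engine = create_engine('sqlite://')   # stand-in for Heat's MySQL database
Base.metadata.create_all(engine)

with Session(engine) as session:
    # The buggy row: delete reported COMPLETE, but the row was never soft-deleted.
    session.add(Stack(id='broken', action='DELETE', status='COMPLETE', deleted_at=None))
    session.commit()
    # This mirrors the kind of filter a soft-delete listing would use.
    visible = session.query(Stack).filter(Stack.deleted_at.is_(None)).all()
    print([s.id for s in visible])    # the broken row is still "visible"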
Steven, it's just as you said (filtering on deleted_at==NULL seemed easier):

mysql> select id,action,status,deleted_at from stack where deleted_at is NULL;
+--------------------------------------+--------+----------+------------+
| id                                   | action | status   | deleted_at |
+--------------------------------------+--------+----------+------------+
| 094fddb6-e8c7-4c0e-aff7-34c9660fd3f2 | DELETE | COMPLETE | NULL       |
| 2c474e57-7edf-409d-b9bd-1eec1923a9a8 | CREATE | FAILED   | NULL       |
| c9e6075d-5fb6-4860-b736-245b05059d0a | DELETE | COMPLETE | NULL       |
| e1d59858-778e-4f67-8e13-df75ba921f80 | DELETE | COMPLETE | NULL       |
+--------------------------------------+--------+----------+------------+
4 rows in set (0.00 sec)

[root@ctrl-stg ~(ks_admin)]# heat stack-list
+--------------------------------------+------------------------------+-----------------+----------------------+
| id                                   | stack_name                   | stack_status    | creation_time        |
+--------------------------------------+------------------------------+-----------------+----------------------+
| 094fddb6-e8c7-4c0e-aff7-34c9660fd3f2 | OpenShift_Origin             | DELETE_COMPLETE | 2013-10-17T13:52:18Z |
| c9e6075d-5fb6-4860-b736-245b05059d0a | OpenShift_Origin_2           | DELETE_COMPLETE | 2013-10-22T07:34:55Z |
| e1d59858-778e-4f67-8e13-df75ba921f80 | OpenShift_Origin_AutoScaling | DELETE_COMPLETE | 2013-10-22T08:25:57Z |
| 2c474e57-7edf-409d-b9bd-1eec1923a9a8 | OSO_TEST_1                   | CREATE_FAILED   | 2013-10-24T18:33:58Z |
+--------------------------------------+------------------------------+-----------------+----------------------+

FWIW, there were other stacks and they are visible in the database with action=DELETE, status=COMPLETE and deleted_at=<timestamp>, i.e. as expected - so not all is going wrong :)
Ok, I've just managed to reproduce this and raised this upstream bug: https://bugs.launchpad.net/heat/+bug/1247200
Great, thanks for the effort. Much appreciated. Once it's fixed, could you try backporting the patch to Havana for 2013.2.1 or 2013.2.2?
Can I confirm: are you running with deferred_auth_method=trusts? If, as I suspect, you're not, then my reproducer for bug 1247200 is a different route to ending up stuck in DELETE_COMPLETE than the one you are experiencing.

If that is the case, can you please reproduce and attach the backtrace from /var/log/heat/engine.log from when the delete happens? (We must be exiting the delete after setting the status but before marking the row deleted in the DB, which I'd expect to result in an engine backtrace.)

I'm about to post a fix upstream for the failure I reproduced in 1247200, but I want to ensure we fix the whole problem, not just part of it. Thanks!
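In case it helps to answer the question above, a small sketch for checking the setting; it assumes the config file is at /etc/heat/heat.conf and that deferred_auth_method, if set at all, lives in the [DEFAULT] section, with 'password' assumed as the fall-back when it isn't set explicitly:

# Sketch only: report the effective deferred_auth_method. The config path and
# the 'password' fall-back are assumptions, not confirmed from this report.
try:
    from configparser import ConfigParser                      # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2 (as on the affected EL6 host)

cfg = ConfigParser()
cfg.read('/etc/heat/heat.conf')
value = cfg.defaults().get('deferred_auth_method', 'password')
print('deferred_auth_method = %s' % value)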
I did not define deferred_auth_method, which I think means I'm using =password.

I just tried to reproduce, but this time I got "Delete Failed" with the backtrace below, FWIW. I was then able to delete the stack completely as expected, so that was not this issue. I have to run right now but will try to reproduce again tomorrow or so.

2013-11-27 16:51:53.361 2403 ERROR heat.engine.resource [-] Delete WaitConditionHandle "BrokerWaitHandle"
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource Traceback (most recent call last):
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/engine/resource.py", line 527, in delete
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     handle_data = self.handle_delete()
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/engine/signal_responder.py", line 61, in handle_delete
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     self.keystone().delete_stack_user(self.resource_id)
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/common/heat_keystoneclient.py", line 281, in delete_stack_user
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     raise exception.Error(reason)
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource Error: Timed out trying to delete user
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource
2013-11-27 16:51:53.415 2403 DEBUG heat.engine.scheduler [-] Task destroy from None cancelled cancel /usr/lib/python2.6/site-packages/heat/engine/scheduler.py:187
2013-11-27 16:51:53.447 2403 DEBUG heat.engine.scheduler [-] Task destroy from None cancelled cancel /usr/lib/python2.6/site-packages/heat/engine/scheduler.py:187
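For what it's worth, here is a greatly simplified sketch (not Heat's actual code) of the ordering concern from the previous comment: an exception while deleting resources should land the stack in DELETE_FAILED, which is what the backtrace above shows happened this time, whereas an exception after the status is set to COMPLETE but before the row is soft-deleted would leave exactly the DELETE_COMPLETE / deleted_at=NULL combination seen in this report.

import datetime

class StackRow(object):
    """Stand-in for Heat's stack DB row (illustrative only)."""
    def __init__(self):
        self.action, self.status, self.deleted_at = 'CREATE', 'IN_PROGRESS', None

def soft_delete_row(row):
    # Listings filter on this column, so this is what actually hides the stack.
    row.deleted_at = datetime.datetime.utcnow()

def delete_stack(row, delete_resources):
    row.action, row.status = 'DELETE', 'IN_PROGRESS'
    try:
        delete_resources()        # e.g. the keystone user delete that timed out above
    except Exception:
        row.status = 'FAILED'     # expected failure mode: DELETE_FAILED
        return
    row.status = 'COMPLETE'
    # Anything raising between the line above and the next call would leave the
    # row DELETE_COMPLETE with deleted_at == NULL - the symptom in this report.
    soft_delete_row(row)

row = StackRow()
delete_stack(row, delete_resources=lambda: None)
print(row.action, row.status, row.deleted_at)   # DELETE COMPLETE <timestamp>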
Steven, I'm afraid I can't reproduce this anymore and I don't know why. I don't think anything changed in this particular setup and I was able to reproduce it a number of times before.
(In reply to Sandro Mathys from comment #8)
> Steven, I'm afraid I can't reproduce this anymore and I don't know why. I
> don't think anything changed in this particular setup and I was able to
> reproduce it a number of times before.

Can you confirm, have you updated Heat since the last reproduce, and if so, between which versions?

FYI, the upstream fix for the scenario I was able to reproduce has now been merged:
https://review.openstack.org/#/c/58582/
These are all the updates that have been applied since reporting this:

Nov 08 11:36:49 Updated: device-mapper-libs-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: device-mapper-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: device-mapper-event-libs-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: nagios-plugins-1.4.16-10.el6.x86_64
Nov 08 11:36:50 Updated: postgresql-libs-8.4.18-1.el6_4.x86_64
Nov 08 11:36:50 Updated: python-six-1.4.1-1.el6.noarch
Nov 08 11:36:50 Updated: libgcrypt-1.4.5-11.el6_4.x86_64
Nov 08 11:36:51 Updated: 1:python-keystoneclient-0.4.1-3.el6.noarch
Nov 08 11:36:53 Updated: postgresql-8.4.18-1.el6_4.x86_64
Nov 08 11:36:53 Updated: device-mapper-event-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:53 Updated: lvm2-libs-2.02.98-9.el6_4.2.x86_64
Nov 08 11:36:55 Updated: kernel-firmware-2.6.32-358.123.2.openstack.el6.noarch
Nov 08 11:37:00 Installed: kernel-2.6.32-358.123.2.openstack.el6.x86_64
Nov 08 11:37:01 Updated: lvm2-2.02.98-9.el6_4.2.x86_64
Nov 08 11:37:02 Updated: postgresql-devel-8.4.18-1.el6_4.x86_64
Nov 08 11:37:03 Updated: python-heatclient-0.2.5-1.el6.noarch
Nov 08 11:37:03 Updated: gnupg2-2.0.14-6.el6_4.x86_64
Nov 08 11:37:04 Updated: libgcrypt-devel-1.4.5-11.el6_4.x86_64
Nov 08 11:37:04 Updated: python-cinderclient-1.0.7-1.el6.noarch
Nov 08 11:37:04 Updated: nagios-plugins-ping-1.4.16-10.el6.x86_64
Nov 08 11:37:04 Updated: nagios-plugins-nrpe-2.14-5.el6.x86_64
Nov 08 11:37:06 Updated: tzdata-2013g-1.el6.noarch
Nov 08 11:37:06 Updated: nrpe-2.14-5.el6.x86_64
Nov 08 11:37:06 Updated: setup-2.8.14-20.el6_4.1.noarch
Nov 08 11:37:06 Updated: 2:qemu-img-0.12.1.2-2.355.el6_4.9.x86_64
Nov 08 11:37:07 Updated: python-neutronclient-2.3.1-2.el6.noarch
Nov 08 11:37:07 Updated: rsync-3.0.6-9.el6_4.1.x86_64
Nov 08 11:37:07 Updated: rdo-release-havana-7.noarch
Nov 08 11:37:09 Updated: kernel-headers-2.6.32-358.123.2.openstack.el6.x86_64

heatclient or keystoneclient are the only ones that I think could matter. I'm not sure what version of heatclient I had installed before (it's not in yum.log), but I'm 99.9% sure it was the most current version in the repo before this one.

keystoneclient:
Oct 21 09:09:30 Updated: 1:python-keystoneclient-0.4.1-1.el6.noarch
stable/havana backport of the fix for the scenario I was able to reproduce (and some related tests) has been proposed:

https://review.openstack.org/70679
https://review.openstack.org/70680
https://review.openstack.org/70681