Bug 1021989
Summary: | heat sometimes keeps listing stacks with status DELETE_COMPLETE | ||
---|---|---|---|
Product: | [Community] RDO | Reporter: | Sandro Mathys <sandro> |
Component: | openstack-heat | Assignee: | Steven Hardy <shardy> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Amit Ugol <augol> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | unspecified | CC: | jpeeler, sandro, shardy, yeylon |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-03-30 23:00:18 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Sandro Mathys
2013-10-22 12:57:35 UTC
I'll take this since I was discussing it with the reporter, and see if I can reproduce based on the info above.

Ok, so I still can't reproduce this, but I have investigated another, similar report and discovered that for some reason it's possible for the stack DB entry to end up with action==DELETE, status==COMPLETE, and deleted_at==NULL.
This should never happen, and it seems related to a probable error-path bug in heat: in both the case reported here and the one I've just looked at, errors occurred during the delete (the most recent report had RabbitMQ broker errors requiring a restart, and a failure to delete a keystone user). But instead of ending up in the DELETE_FAILED state, the stack somehow ended up DELETE_COMPLETE with a NULL deleted_at column.
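The suspected error path can be sketched with a toy model (hypothetical names only, not Heat's actual code): if the engine records the DELETE/COMPLETE state before stamping deleted_at, any exception raised in between (a broker error, a keystone user deletion failure) strands the row in exactly the state reported here.

```python
import datetime

class Stack:
    """Toy model of a row in heat's `stack` table (illustrative, not Heat's schema)."""
    def __init__(self):
        self.action = "CREATE"
        self.status = "COMPLETE"
        self.deleted_at = None

def delete_stack(stack, cleanup):
    """Buggy ordering: the final status is recorded before the soft-delete."""
    stack.action = "DELETE"
    stack.status = "COMPLETE"      # 1. stack is marked DELETE_COMPLETE first
    cleanup()                      # 2. e.g. delete the keystone user -- may raise
    stack.deleted_at = datetime.datetime.utcnow()  # 3. never reached on error

def failing_cleanup():
    # Simulates the failure seen in the logs.
    raise RuntimeError("Timed out trying to delete user")

stack = Stack()
try:
    delete_stack(stack, failing_cleanup)
except RuntimeError:
    pass

# The row now matches the reports: DELETE/COMPLETE with deleted_at still NULL,
# so `heat stack-list` keeps showing the stack.
```

If the order of steps 2 and 3 were swapped (or the failure were caught and the stack moved to DELETE_FAILED), the inconsistent state could not arise.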
Sandro Mathys: Can you reproduce this and confirm whether the deleted_at column is NULL when you encounter the failure?

Steps to get this info:

```
mysql -u root -p
mysql> use heat;
mysql> select id,action,status,deleted_at from stack where id = '<your stack ID>';
```
Steven, it's just as you said (filtering on deleted_at==NULL seemed easier):

```
mysql> select id,action,status,deleted_at from stack where deleted_at is NULL;
+--------------------------------------+--------+----------+------------+
| id                                   | action | status   | deleted_at |
+--------------------------------------+--------+----------+------------+
| 094fddb6-e8c7-4c0e-aff7-34c9660fd3f2 | DELETE | COMPLETE | NULL       |
| 2c474e57-7edf-409d-b9bd-1eec1923a9a8 | CREATE | FAILED   | NULL       |
| c9e6075d-5fb6-4860-b736-245b05059d0a | DELETE | COMPLETE | NULL       |
| e1d59858-778e-4f67-8e13-df75ba921f80 | DELETE | COMPLETE | NULL       |
+--------------------------------------+--------+----------+------------+
4 rows in set (0.00 sec)
```

```
[root@ctrl-stg ~(ks_admin)]# heat stack-list
+--------------------------------------+------------------------------+-----------------+----------------------+
| id                                   | stack_name                   | stack_status    | creation_time        |
+--------------------------------------+------------------------------+-----------------+----------------------+
| 094fddb6-e8c7-4c0e-aff7-34c9660fd3f2 | OpenShift_Origin             | DELETE_COMPLETE | 2013-10-17T13:52:18Z |
| c9e6075d-5fb6-4860-b736-245b05059d0a | OpenShift_Origin_2           | DELETE_COMPLETE | 2013-10-22T07:34:55Z |
| e1d59858-778e-4f67-8e13-df75ba921f80 | OpenShift_Origin_AutoScaling | DELETE_COMPLETE | 2013-10-22T08:25:57Z |
| 2c474e57-7edf-409d-b9bd-1eec1923a9a8 | OSO_TEST_1                   | CREATE_FAILED   | 2013-10-24T18:33:58Z |
+--------------------------------------+------------------------------+-----------------+----------------------+
```

FWIW, there were other stacks, and they are visible in the database with action=DELETE, status=COMPLETE and deleted_at=<timestamp>, i.e. as expected, so not all is going wrong :)

Ok, I've just managed to reproduce this; raised this upstream bug: https://bugs.launchpad.net/heat/+bug/1247200

Great, thanks for the effort. Much appreciated. Once it's fixed, could you try backporting the patch to Havana for 2013.2.1 or 2013.2.2?
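As a stopgap, stuck rows like the ones above can be reconciled by hand with an UPDATE that stamps deleted_at on DELETE/COMPLETE rows that never got it. The sketch below runs that statement against a throwaway in-memory SQLite copy of the relevant columns (not the live MySQL database, which you should back up before touching); it is an illustration, not an officially recommended procedure, and the row data here is hypothetical.

```python
import sqlite3

# Throwaway in-memory copy of the relevant columns of heat's `stack` table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stack (id TEXT, action TEXT, status TEXT, deleted_at TEXT)")
db.executemany(
    "INSERT INTO stack VALUES (?, ?, ?, ?)",
    [
        # A stuck row, as in the report: DELETE/COMPLETE but deleted_at is NULL.
        ("094fddb6-e8c7-4c0e-aff7-34c9660fd3f2", "DELETE", "COMPLETE", None),
        # A genuinely failed create -- must NOT be touched by the fix.
        ("2c474e57-7edf-409d-b9bd-1eec1923a9a8", "CREATE", "FAILED", None),
        # A hypothetical healthy soft-deleted row, already stamped.
        ("11111111-0000-0000-0000-000000000000", "DELETE", "COMPLETE", "2013-10-20 10:00:00"),
    ],
)

# The manual fix: stamp deleted_at on DELETE/COMPLETE rows that never got it,
# so those stacks stop appearing in `heat stack-list`.
db.execute(
    "UPDATE stack SET deleted_at = datetime('now') "
    "WHERE action = 'DELETE' AND status = 'COMPLETE' AND deleted_at IS NULL"
)

# Only the CREATE_FAILED stack should still have a NULL deleted_at.
remaining = db.execute("SELECT id FROM stack WHERE deleted_at IS NULL").fetchall()
```

On the live system the same UPDATE (with MySQL's NOW() instead of SQLite's datetime('now')) would hide the stuck stacks without affecting failed or healthy rows.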
Can I confirm, are you running with deferred_auth_method=trusts? If, as I suspect, you're not, then my reproducer for bug 1247200 is a different route to getting stuck in DELETE_COMPLETE than the one you are experiencing. If that is the case, can you please reproduce and attach the backtrace from /var/log/heat/engine.log from when the delete happens? (We must be exiting the delete after setting the status but before marking the row deleted in the DB, which I'd expect to result in an engine backtrace.) I'm about to post a fix upstream for the failure I reproduced in 1247200, but I want to ensure we fix the whole problem, not just part of it. Thanks!

I did not set deferred_auth_method, which I think means I'm using deferred_auth_method=password. I just tried to reproduce, but this time I got DELETE_FAILED with the backtrace below, FWIW. I was then able to delete the stack completely as expected, so this was not the same issue. I have to run right now but will try to reproduce again tomorrow or so.

```
2013-11-27 16:51:53.361 2403 ERROR heat.engine.resource [-] Delete WaitConditionHandle "BrokerWaitHandle"
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource Traceback (most recent call last):
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/engine/resource.py", line 527, in delete
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     handle_data = self.handle_delete()
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/engine/signal_responder.py", line 61, in handle_delete
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     self.keystone().delete_stack_user(self.resource_id)
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource   File "/usr/lib/python2.6/site-packages/heat/common/heat_keystoneclient.py", line 281, in delete_stack_user
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource     raise exception.Error(reason)
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource Error: Timed out trying to delete user
```
```
2013-11-27 16:51:53.361 2403 TRACE heat.engine.resource
2013-11-27 16:51:53.415 2403 DEBUG heat.engine.scheduler [-] Task destroy from None cancelled cancel /usr/lib/python2.6/site-packages/heat/engine/scheduler.py:187
2013-11-27 16:51:53.447 2403 DEBUG heat.engine.scheduler [-] Task destroy from None cancelled cancel /usr/lib/python2.6/site-packages/heat/engine/scheduler.py:187
```

Steven, I'm afraid I can't reproduce this anymore, and I don't know why. I don't think anything changed in this particular setup, and I was able to reproduce it a number of times before.

(In reply to Sandro Mathys from comment #8)
> Steven, I'm afraid I can't reproduce this anymore and I don't know why. I
> don't think anything changed in this particular setup and I was able to
> reproduce it a number of times before.

Can you confirm, have you updated Heat since the last reproduce, and if so, between which versions? FYI, the upstream fix for the scenario I was able to reproduce has now been merged: https://review.openstack.org/#/c/58582/

These are all the updates that have been applied since reporting this:

```
Nov 08 11:36:49 Updated: device-mapper-libs-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: device-mapper-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: device-mapper-event-libs-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:49 Updated: nagios-plugins-1.4.16-10.el6.x86_64
Nov 08 11:36:50 Updated: postgresql-libs-8.4.18-1.el6_4.x86_64
Nov 08 11:36:50 Updated: python-six-1.4.1-1.el6.noarch
Nov 08 11:36:50 Updated: libgcrypt-1.4.5-11.el6_4.x86_64
Nov 08 11:36:51 Updated: 1:python-keystoneclient-0.4.1-3.el6.noarch
Nov 08 11:36:53 Updated: postgresql-8.4.18-1.el6_4.x86_64
Nov 08 11:36:53 Updated: device-mapper-event-1.02.77-9.el6_4.2.x86_64
Nov 08 11:36:53 Updated: lvm2-libs-2.02.98-9.el6_4.2.x86_64
Nov 08 11:36:55 Updated: kernel-firmware-2.6.32-358.123.2.openstack.el6.noarch
Nov 08 11:37:00 Installed: kernel-2.6.32-358.123.2.openstack.el6.x86_64
Nov 08 11:37:01 Updated: lvm2-2.02.98-9.el6_4.2.x86_64
Nov 08 11:37:02 Updated: postgresql-devel-8.4.18-1.el6_4.x86_64
Nov 08 11:37:03 Updated: python-heatclient-0.2.5-1.el6.noarch
Nov 08 11:37:03 Updated: gnupg2-2.0.14-6.el6_4.x86_64
Nov 08 11:37:04 Updated: libgcrypt-devel-1.4.5-11.el6_4.x86_64
Nov 08 11:37:04 Updated: python-cinderclient-1.0.7-1.el6.noarch
Nov 08 11:37:04 Updated: nagios-plugins-ping-1.4.16-10.el6.x86_64
Nov 08 11:37:04 Updated: nagios-plugins-nrpe-2.14-5.el6.x86_64
Nov 08 11:37:06 Updated: tzdata-2013g-1.el6.noarch
Nov 08 11:37:06 Updated: nrpe-2.14-5.el6.x86_64
Nov 08 11:37:06 Updated: setup-2.8.14-20.el6_4.1.noarch
Nov 08 11:37:06 Updated: 2:qemu-img-0.12.1.2-2.355.el6_4.9.x86_64
Nov 08 11:37:07 Updated: python-neutronclient-2.3.1-2.el6.noarch
Nov 08 11:37:07 Updated: rsync-3.0.6-9.el6_4.1.x86_64
Nov 08 11:37:07 Updated: rdo-release-havana-7.noarch
Nov 08 11:37:09 Updated: kernel-headers-2.6.32-358.123.2.openstack.el6.x86_64
```

heatclient or keystoneclient are the only ones that I think could matter. Not sure what version of heatclient I had installed before (it's not in yum.log), but I'm 99.9% sure it was whatever version was most current in the repo before this one. keystoneclient:

```
Oct 21 09:09:30 Updated: 1:python-keystoneclient-0.4.1-1.el6.noarch
```

stable/havana backports of the scenario I was able to reproduce (and some related tests) have been proposed:
https://review.openstack.org/70679
https://review.openstack.org/70680
https://review.openstack.org/70681