Bug 1261077 - heat stack delete fails
Summary: heat stack delete fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: async
Target Release: 6.0 (Juno)
Assignee: Zane Bitter
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-09-08 14:17 UTC by Jeremy
Modified: 2023-02-22 23:02 UTC
CC List: 10 users

Fixed In Version: openstack-heat-2014.2.3-9.el7ost
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-14 13:53:10 UTC
Target Upstream Version:
Embargoed:


Attachments
heat stack-show <STACK_NAME> heat resource-list <STACK_NAME> (6.07 KB, text/plain)
2015-09-08 14:17 UTC, Jeremy
no flags
heat engine log (1.03 MB, text/plain)
2015-09-08 14:37 UTC, Jeremy
no flags
mysql> select * from stack; mysql> select * from resource; (3.16 MB, text/plain)
2015-09-08 15:30 UTC, Jeremy
no flags
journalctl -u openstack-heat-engine (4.38 MB, text/x-vhdl)
2015-09-08 15:47 UTC, Jeremy
no flags


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1446575 0 None None None Never
OpenStack gerrit 222724 0 None MERGED Fix TypeError exception during cancel stack update 2020-06-18 04:44:44 UTC
OpenStack gerrit 224915 0 None MERGED Make sure to delete stack when SNAPSHOT_IN_PROGRESS 2020-06-18 04:44:43 UTC
OpenStack gerrit 230130 0 None MERGED Save updated-in-place resources to backup stack 2020-06-18 04:44:43 UTC
Red Hat Product Errata RHBA-2016:0037 0 normal SHIPPED_LIVE openstack-heat bug fix advisory 2016-01-14 18:52:47 UTC

Description Jeremy 2015-09-08 14:17:16 UTC
Created attachment 1071379 [details]
heat stack-show <STACK_NAME> heat resource-list <STACK_NAME>

Description of problem: Heat stack delete fails


Version-Release number of selected component (if applicable):


How reproducible: unknown


Steps to Reproduce:
1. Attempt to delete an existing heat stack
2. Note the failure

Actual results:
Fails to delete

Expected results:
Heat stack deleted


Additional info:
Will attach more files and logs. Stack name is omt_test1.

Comment 3 Jeremy 2015-09-08 14:37:22 UTC
Created attachment 1071385 [details]
heat engine log

Comment 4 Zane Bitter 2015-09-08 14:40:52 UTC
Does it go to DELETE_FAILED, or is it stuck IN_PROGRESS forever? I'm assuming the latter from the attached output, which shows no resources in progress. This is usually due to a failure that occurs before the delete starts (see https://bugs.launchpad.net/heat/+bug/1492433), and these kinds of errors evade logging as well (see https://bugs.launchpad.net/heat/+bug/1492427).

So we really need:
* To know what actually happened: nothing at all? Did the stack move to DELETE_IN_PROGRESS and stay there? Did it go to DELETE_FAILED?
* heat-engine logs
* journal output from systemd (`journalctl -u openstack-heat-engine`)

Comment 5 Jeremy 2015-09-08 15:30:50 UTC
Created attachment 1071393 [details]
mysql> select * from stack;  mysql> select * from resource;

Comment 6 Jeremy 2015-09-08 15:47:34 UTC
Created attachment 1071399 [details]
journalctl -u openstack-heat-engine

Comment 7 Zane Bitter 2015-09-09 20:02:04 UTC
The first thing that happens is that there's an exception occurring when we attempt to cancel the in-progress update:

2015-08-18 16:28:10.178 12019 INFO heat.engine.service [req-74ed2877-55da-4b8b-a78f-409625ca59fa None] Starting cancel of updating stack omt_test1
2015-08-18 16:28:10.193 12019 ERROR oslo.messaging.rpc.dispatcher [req-74ed2877-55da-4b8b-a78f-409625ca59fa ] Exception during message handling: call() takes exactly 3 arguments (5 given)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 69, in wrapped
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     return func(self, ctx, *args, **kwargs)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 750, in stack_cancel_update
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     stack_identity, rpc_api.THREAD_CANCEL)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 850, in _remote_call
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     self.cctxt.call(cnxt, call, *args, **kwargs)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher TypeError: call() takes exactly 3 arguments (5 given)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher 

It looks like the problem here is that RPCClient.call() is being called with positional args when it requires keyword args - this was fixed upstream in Liberty by https://bugs.launchpad.net/heat/+bug/1448384 and backported to Kilo, but not Juno.
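A minimal, self-contained sketch of the mismatch behind the traceback (the class and method names below are illustrative stand-ins, not Heat's or oslo.messaging's actual code): a `call()` method that accepts only the context, the method name, and keyword arguments will raise exactly this TypeError when handed extra positional arguments.

```python
class FakeRPCClient(object):
    """Stand-in for an RPC client whose call() takes keyword args only."""

    def call(self, ctxt, method, **kwargs):
        # Exactly 3 positional parameters: self, ctxt, method.
        return (method, kwargs)


client = FakeRPCClient()

# Passing the arguments as keywords works:
result = client.call({}, 'stop_stack', stack_identity='omt_test1')

# Passing them positionally, as the Juno-era _remote_call did, fails
# just like the log: call() takes 3 positional arguments (5 given).
try:
    client.call({}, 'stop_stack', 'omt_test1', 'cancel')
except TypeError as exc:
    error = str(exc)
```

The upstream fix referenced above amounts to passing those extra values as keyword arguments rather than positionally.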

Subsequently there are several attempts to delete the stack which result in nothing happening:

2015-08-18 20:08:21.003 12019 INFO heat.engine.service [req-e85c3e54-b79f-4346-aeca-9d949feecf05 None] Deleting stack omt_test1
2015-08-18 20:09:14.869 12019 INFO heat.engine.service [req-ede60741-ddb8-4d38-9a45-64264cacd2ba None] Deleting stack omt_test1
2015-08-19 12:04:58.706 12019 INFO heat.engine.service [req-eba0fc20-4ae6-4ffb-a135-20fd02443c02 None] Deleting stack omt_test1

But at this point the trail goes cold - while the journal does contain a number of exception tracebacks, none are from Aug 18-19.

Comment 9 Zane Bitter 2015-09-11 22:08:26 UTC
Interestingly, the DB dump shows the stack in the state SUSPEND_FAILED.

The backup stack is in the DELETE_COMPLETE state, so it seems we can probably rule out a problem there as the cause (which is unfortunate, because that was the next thing to investigate).

Comment 10 Zane Bitter 2015-09-17 22:47:31 UTC
What I _can_ see in the journal:

At Aug 20 08:55:43 we see a similar problem to the one described, caused by an exception while trying to delete a stack snapshot. Matching up the logs suggests that the stack involved was named "dlr". The same error occurs again at the following times:

Sep 03 13:27:33
Sep 03 13:39:04

This bug was fixed upstream in Kilo (https://bugs.launchpad.net/heat/+bug/1437588), and could easily be backported to Juno.
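The failure mode can be sketched as follows (illustrative code, not Heat's actual implementation): stack delete iterates over the stack's snapshots, and an unhandled exception while removing one snapshot aborts the whole delete, so the stack never reaches DELETE_COMPLETE.

```python
def delete_stack(stack, tolerate_snapshot_errors=False):
    """Delete a stack's snapshots, then mark the stack deleted."""
    for snap in stack['snapshots']:
        try:
            snap['delete']()
        except Exception:
            if not tolerate_snapshot_errors:
                raise  # buggy behaviour: the delete aborts here
    stack['status'] = 'DELETE_COMPLETE'


def bad_snapshot_delete():
    raise RuntimeError('backend error while deleting snapshot')


stack = {'status': 'DELETE_IN_PROGRESS',
         'snapshots': [{'delete': bad_snapshot_delete}]}

# Buggy behaviour: the snapshot error aborts the delete entirely.
try:
    delete_stack(stack)
except RuntimeError:
    pass
stuck_status = stack['status']

# Fixed behaviour: the delete proceeds despite the snapshot error.
delete_stack(stack, tolerate_snapshot_errors=True)
final_status = stack['status']
```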

There is a second class of error, that again likely manifests in the same way as the one described. This occurs at the following times:
Aug 24 16:29:19
Aug 24 16:29:43
Aug 24 17:01:44
Aug 24 17:18:53
Sep 04 12:54:26
Sep 08 08:09:39

This one is a dependency problem involving a Neutron Pool resource:
InvalidTemplateReference: The specified reference "monitor" (in pool.Properties.monitors[0]) is incorrect.

It's likely that the cause of this is https://bugs.launchpad.net/heat/+bug/1446575 - this happens when a resource fails during an update, and then you do a second update in which the same resource fails again.
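The resulting dangling reference can be illustrated with a small sketch (hypothetical names; not Heat's actual validation code): after the second failed update, the template still contains a get_resource-style reference to "monitor", but no such resource definition survives, so reference validation fails.

```python
def validate_references(resources):
    """Check that every reference points at a defined resource."""
    errors = []
    for name, props in resources.items():
        for path, ref in props.get('refs', []):
            if ref not in resources:
                errors.append(
                    'The specified reference "%s" (in %s) is incorrect.'
                    % (ref, path))
    return errors


# A template whose "pool" still references "monitor", but where the
# "monitor" resource definition is missing -- the failure mode above:
resources = {
    'pool': {'refs': [('pool.Properties.monitors[0]', 'monitor')]},
}
errors = validate_references(resources)
```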

The logs provided don't cover any of those times, so there is no way for me to know which stacks are involved. As I mentioned in comment #3, the only mention of any attempt to delete the stack in question, omt_test1, was at the following times:
Aug 18 20:08
Aug 18 20:09
Aug 19 12:04
Aug 19 14:58

From the log, we can see that this stack fails to create a Pool resource at Aug 18 14:40. The error returned from Neutron is "No eligible backend for pool". The same happens again at Aug 18 15:16. It seems like a fairly good bet that this has the same cause as the second class of error above, even though the evidence is mysteriously absent - we see a single resource failing twice in a row, which is the known trigger, and it's the same type of resource that we see problems with later.

The fix for this is in Kilo, and has been tagged for a potential Juno backport upstream but not yet backported by the looks of it.

Comment 21 Amit Ugol 2016-01-04 13:17:38 UTC
Since I don't have the original template, I had to make do with one of my own.
All LB resources were deleted successfully, and I have opened new bugs for the resources that were not.

Comment 23 errata-xmlrpc 2016-01-14 13:53:10 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0037.html

