Bug 1261077

Summary: heat stack delete fails
Product: Red Hat OpenStack
Reporter: Jeremy <jmelvin>
Component: openstack-heat
Assignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA
QA Contact: Amit Ugol <augol>
Severity: medium
Docs Contact:
Priority: high
Version: 6.0 (Juno)
CC: chenders, dmaley, jmelvin, mburns, ohochman, rhel-osp-director-maint, sbaker, shardy, yeylon, zbitter
Target Milestone: async
Keywords: Triaged, Unconfirmed, ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-heat-2014.2.3-9.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-14 13:53:10 UTC
Type: Bug
Attachments (Description / Flags):
* heat stack-show <STACK_NAME> heat resource-list <STACK_NAME> (none)
* heat engine log (none)
* mysql> select * from stack; mysql> select * from resource; (none)
* journalctl -u openstack-heat-engine (none)

Description Jeremy 2015-09-08 14:17:16 UTC
Created attachment 1071379 [details]
heat stack-show <STACK_NAME> heat resource-list <STACK_NAME>

Description of problem: Heat stack delete fails


Version-Release number of selected component (if applicable):


How reproducible: unknown


Steps to Reproduce:
1. Attempt to delete an existing heat stack.
2. Note that the delete fails.

Actual results:
Fails to delete

Expected results:
Heat stack deleted


Additional info:
Will attach more files and logs.  Stack name is omt_test1

Comment 3 Jeremy 2015-09-08 14:37:22 UTC
Created attachment 1071385 [details]
heat engine log

Comment 4 Zane Bitter 2015-09-08 14:40:52 UTC
Does it go to DELETE_FAILED, or is it stuck IN_PROGRESS forever? I'm assuming the latter, since the attached output shows no resources in progress. This is usually due to a failure that occurs before the delete starts (see https://bugs.launchpad.net/heat/+bug/1492433), and these kinds of errors evade logging as well (see https://bugs.launchpad.net/heat/+bug/1492427).

So we really need:
* To know what actually happened: nothing at all? Did the stack move to DELETE_IN_PROGRESS and stay there? Did it go to DELETE_FAILED?
* heat-engine logs
* journal output from systemd (`journalctl -u openstack-heat-engine`)

Comment 5 Jeremy 2015-09-08 15:30:50 UTC
Created attachment 1071393 [details]
mysql> select * from stack;  mysql> select * from resource;

Comment 6 Jeremy 2015-09-08 15:47:34 UTC
Created attachment 1071399 [details]
journalctl -u openstack-heat-engine

Comment 7 Zane Bitter 2015-09-09 20:02:04 UTC
The first thing that happens is an exception when we attempt to cancel the in-progress update:

2015-08-18 16:28:10.178 12019 INFO heat.engine.service [req-74ed2877-55da-4b8b-a78f-409625ca59fa None] Starting cancel of updating stack omt_test1
2015-08-18 16:28:10.193 12019 ERROR oslo.messaging.rpc.dispatcher [req-74ed2877-55da-4b8b-a78f-409625ca59fa ] Exception during message handling: call() takes exactly 3 arguments (5 given)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 69, in wrapped
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     return func(self, ctx, *args, **kwargs)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 750, in stack_cancel_update
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     stack_identity, rpc_api.THREAD_CANCEL)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 850, in _remote_call
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     self.cctxt.call(cnxt, call, *args, **kwargs)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher TypeError: call() takes exactly 3 arguments (5 given)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher 

It looks like the problem here is that RPCClient.call() is being called with positional args when it requires keyword args. This was fixed upstream in Liberty by https://bugs.launchpad.net/heat/+bug/1448384 and backported to Kilo, but not to Juno.
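To make the mismatch concrete, here is a minimal sketch (not the actual Heat or oslo.messaging code; the keyword names at the end are purely illustrative): RPCClient.call() only accepts the context and the method name positionally, so the two extra positional arguments passed from stack_cancel_update() produce exactly the TypeError seen in the traceback.

# Stand-in for oslo.messaging's RPCClient; not the real implementation.
class SketchRPCClient(object):
    def call(self, ctxt, method, **kwargs):
        # Only (self, ctxt, method) are positional: "exactly 3 arguments".
        return method, kwargs

client = SketchRPCClient()
ctxt = object()

# What the Juno code effectively did: two extra positional arguments.
try:
    client.call(ctxt, 'send', 'stack-identity', 'cancel message')
except TypeError as exc:
    # Python 2 wording: call() takes exactly 3 arguments (5 given)
    print(exc)

# The upstream fix passes the RPC arguments as keywords instead
# (keyword names here are illustrative, not Heat's actual ones):
client.call(ctxt, 'send', stack_identity='stack-identity',
            message='cancel message')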

Subsequently there are several attempts to delete the stack which result in nothing happening:

2015-08-18 20:08:21.003 12019 INFO heat.engine.service [req-e85c3e54-b79f-4346-aeca-9d949feecf05 None] Deleting stack omt_test1
2015-08-18 20:09:14.869 12019 INFO heat.engine.service [req-ede60741-ddb8-4d38-9a45-64264cacd2ba None] Deleting stack omt_test1
2015-08-19 12:04:58.706 12019 INFO heat.engine.service [req-eba0fc20-4ae6-4ffb-a135-20fd02443c02 None] Deleting stack omt_test1

But at this point the trail goes cold - while the journal does contain a number of exception tracebacks, none are from Aug 18-19.

Comment 9 Zane Bitter 2015-09-11 22:08:26 UTC
Interestingly, the DB dump shows the stack in the state SUSPEND_FAILED.

The backup stack is in the DELETE_COMPLETE state, so it seems we can probably rule out a problem there as the cause (which is unfortunate, because that was the next thing to investigate).

Comment 10 Zane Bitter 2015-09-17 22:47:31 UTC
What I _can_ see in the journal:

At Aug 20 08:55:43 we see a similar problem to the one described, caused by an exception while trying to delete a stack snapshot. Matching up the logs suggests that the stack involved was named "dlr". The same error occurs again at the following times:

Sep 03 13:27:33
Sep 03 13:39:04

This bug was fixed upstream in Kilo (https://bugs.launchpad.net/heat/+bug/1437588), and could easily be backported to Juno.

There is a second class of error that again likely manifests in the same way as the one described. It occurs at the following times:
Aug 24 16:29:19
Aug 24 16:29:43
Aug 24 17:01:44
Aug 24 17:18:53
Sep 04 12:54:26
Sep 08 08:09:39

This one is a dependency problem involving a Neutron Pool resource:
InvalidTemplateReference: The specified reference "monitor" (in pool.Properties.monitors[0]) is incorrect.

It's likely that the cause of this is https://bugs.launchpad.net/heat/+bug/1446575 - this happens when a resource fails during an update, and then you do a second update in which the same resource fails again.
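For context, here is a minimal, hypothetical template fragment of the shape that triggers the InvalidTemplateReference above, written as a Python dict rather than HOT YAML; the resource names and property values are illustrative and not taken from the customer's template.

# Hypothetical fragment only; names and values are illustrative.
template = {
    'heat_template_version': '2013-05-23',
    'resources': {
        'monitor': {
            'type': 'OS::Neutron::HealthMonitor',
            'properties': {'type': 'PING', 'delay': 5,
                           'max_retries': 3, 'timeout': 5},
        },
        'pool': {
            'type': 'OS::Neutron::Pool',
            'properties': {
                'protocol': 'HTTP',
                'lb_method': 'ROUND_ROBIN',
                'subnet_id': 'SUBNET_ID',
                'vip': {'protocol_port': 80},
                # The reference reported as "incorrect": per the analysis
                # above (bug 1446575), the same resource failing in two
                # consecutive updates can leave this reference unresolvable,
                # which then blocks the delete.
                'monitors': [{'get_resource': 'monitor'}],
            },
        },
    },
}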

The logs provided don't cover any of those times, so there is no way for me to know which stacks are involved. As I mentioned in comment #3, the only mentions of any attempt to delete the stack in question, omt_test1, were at the following times:
Aug 18 20:08
Aug 18 20:09
Aug 19 12:04
Aug 19 14:58

From the log, we can see that this stack fails to create a Pool resource at Aug 18 14:40. The error returned from Neutron is "No eligible backend for pool". The same happens again at Aug 18 15:16. It seems like a fairly good bet that this has the same cause as the second class of error above, even though the evidence is mysteriously absent - we see a single resource failing twice in a row, which is the known trigger, and it's the same type of resource that we see problems with later.

The fix for this is in Kilo and has been tagged for a potential Juno backport upstream, but it does not appear to have been backported yet.

Comment 21 Amit Ugol 2016-01-04 13:17:38 UTC
Since I don't have the original template, I had to make do with one of my own.
All LB resources were deleted successfully, and I have opened new bugs for the resources that were not.

Comment 23 errata-xmlrpc 2016-01-14 13:53:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0037.html