Created attachment 1071379 [details]
heat stack-show <STACK_NAME>; heat resource-list <STACK_NAME>

Description of problem:
Heat stack delete fails.

Version-Release number of selected component (if applicable):

How reproducible:
Unknown

Steps to Reproduce:
1. Attempt to delete an existing heat stack
2. Note the failure

Actual results:
The stack fails to delete.

Expected results:
The heat stack is deleted.

Additional info:
Will attach more files and logs. The stack name is omt_test1.
Created attachment 1071385 [details] heat engine log
Does it go to DELETE_FAILED, or is it stuck IN_PROGRESS forever? I'm assuming the latter, since the attached output shows no resources in progress. This is usually due to a failure that occurs before the delete starts (see https://bugs.launchpad.net/heat/+bug/1492433), and these kinds of errors evade logging as well (see https://bugs.launchpad.net/heat/+bug/1492427). So we really need:

* To know what actually happened. Nothing? Did the stack move to DELETE_IN_PROGRESS and stay there? Did it go to DELETE_FAILED?
* heat-engine logs
* journal output from systemd (`journalctl -u openstack-heat-engine`)
Created attachment 1071393 [details] mysql> select * from stack; mysql> select * from resource;
Created attachment 1071399 [details] journalctl -u openstack-heat-engine
The first thing that happens is that there's an exception when we attempt to cancel the in-progress update:

2015-08-18 16:28:10.178 12019 INFO heat.engine.service [req-74ed2877-55da-4b8b-a78f-409625ca59fa None] Starting cancel of updating stack omt_test1
2015-08-18 16:28:10.193 12019 ERROR oslo.messaging.rpc.dispatcher [req-74ed2877-55da-4b8b-a78f-409625ca59fa ] Exception during message handling: call() takes exactly 3 arguments (5 given)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 69, in wrapped
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     return func(self, ctx, *args, **kwargs)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 750, in stack_cancel_update
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     stack_identity, rpc_api.THREAD_CANCEL)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 850, in _remote_call
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     self.cctxt.call(cnxt, call, *args, **kwargs)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher TypeError: call() takes exactly 3 arguments (5 given)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher

It looks like the problem here is that RPCClient.call() is being called with positional args when it requires keyword args. This was fixed upstream in Liberty by https://bugs.launchpad.net/heat/+bug/1448384 and backported to Kilo, but not to Juno.

Subsequently there are several attempts to delete the stack which result in nothing happening:

2015-08-18 20:08:21.003 12019 INFO heat.engine.service [req-e85c3e54-b79f-4346-aeca-9d949feecf05 None] Deleting stack omt_test1
2015-08-18 20:09:14.869 12019 INFO heat.engine.service [req-ede60741-ddb8-4d38-9a45-64264cacd2ba None] Deleting stack omt_test1
2015-08-19 12:04:58.706 12019 INFO heat.engine.service [req-eba0fc20-4ae6-4ffb-a135-20fd02443c02 None] Deleting stack omt_test1

But at this point the trail goes cold: while the journal does contain a number of exception tracebacks, none are from Aug 18-19.
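For illustration only, here is a minimal Python sketch of the calling convention involved. It uses a stand-in class rather than the real oslo.messaging RPCClient, and the method name and payload are placeholders, not taken from the Heat code; it only shows why extra positional arguments produce exactly this TypeError and why passing the payload as keyword arguments (as the Kilo/Liberty fix does) avoids it.

    # Stand-in for oslo.messaging's RPCClient calling convention:
    # call() accepts only (ctxt, method) positionally; the RPC payload
    # must be passed as keyword arguments.  Illustrative only.
    class FakeRPCClient(object):
        def call(self, ctxt, method, **kwargs):
            return (method, kwargs)


    cctxt = FakeRPCClient()
    stack_identity = {'stack_name': 'omt_test1'}  # placeholder payload

    # Roughly what the Juno code does: two extra positional arguments.
    try:
        cctxt.call({}, 'stop_stack', stack_identity, 'cancel')
    except TypeError as exc:
        # On Python 2.7: "call() takes exactly 3 arguments (5 given)"
        print(exc)

    # Roughly what the fixed code does: the payload goes in as keyword arguments.
    print(cctxt.call({}, 'stop_stack', stack_identity=stack_identity))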
Interestingly, the DB dump shows the stack in the state SUSPEND_FAILED. The backup stack is in the DELETE_COMPLETE state, so it seems we can probably rule out a problem there as the cause (which is unfortunate, because that was the next thing to investigate).
What I _can_ see in the journal:

At Aug 20 08:55:43 we see a similar problem to the one described, caused by an exception while trying to delete a stack snapshot. Matching up the logs suggests that the stack involved was named "dlr". The same error occurs again at the following times:

  Sep 03 13:27:33
  Sep 03 13:39:04

This bug was fixed upstream in Kilo (https://bugs.launchpad.net/heat/+bug/1437588), and could easily be backported to Juno.

There is a second class of error, which again likely manifests in the same way as the one described. This occurs at the following times:

  Aug 24 16:29:19
  Aug 24 16:29:43
  Aug 24 17:01:44
  Aug 24 17:18:53
  Sep 04 12:54:26
  Sep 08 08:09:39

This one is a dependency problem involving a Neutron Pool resource:

  InvalidTemplateReference: The specified reference "monitor" (in pool.Properties.monitors[0]) is incorrect.

It's likely that the cause of this is https://bugs.launchpad.net/heat/+bug/1446575: this happens when a resource fails during an update, and then you do a second update in which the same resource fails again. The logs provided don't cover any of those times, so there is no way for me to know which stacks are involved.

As I mentioned in comment #3, the only mention of any attempt to delete the stack in question, omt_test1, is at the following times:

  Aug 18 20:08
  Aug 18 20:09
  Aug 19 12:04
  Aug 19 14:58

From the log, we can see that this stack fails to create a Pool resource at Aug 18 14:40; the error returned from Neutron is "No eligible backend for pool". The same happens again at Aug 18 15:16. It seems like a fairly good bet that this has the same cause as the second class of error above, even though the evidence is mysteriously absent: we see a single resource failing twice in a row, which is the known trigger, and it's the same type of resource that we see problems with later. The fix for this is in Kilo, and has been tagged for a potential Juno backport upstream, but it does not appear to have been backported yet.
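To make the suspected trigger concrete, here is a hedged reproduction sketch using python-heatclient. It follows the failure sequence described above but is not a confirmed reproducer: HEAT_ENDPOINT, AUTH_TOKEN, and TEMPLATE_WITH_FAILING_POOL are placeholders rather than values from this report, and exact client call signatures may vary slightly between releases.

    # Hedged sketch of the bug 1446575 trigger: a resource fails during an
    # update, the same resource fails again in a second update, and the
    # subsequent delete then dies with InvalidTemplateReference, leaving the
    # stack apparently stuck.  All names in CAPS are placeholders.
    from heatclient import client as heat_client

    heat = heat_client.Client('1', HEAT_ENDPOINT, token=AUTH_TOKEN)

    # Create a stack whose Pool resource Neutron will reject
    # ("No eligible backend for pool").
    created = heat.stacks.create(stack_name='pool_repro',
                                 template=TEMPLATE_WITH_FAILING_POOL)
    stack_id = created['stack']['id']

    # First update: the Pool resource fails.
    heat.stacks.update(stack_id, template=TEMPLATE_WITH_FAILING_POOL)

    # Second update: the same resource fails again (the known trigger).
    heat.stacks.update(stack_id, template=TEMPLATE_WITH_FAILING_POOL)

    # The delete now fails early, before any resource is removed.
    heat.stacks.delete(stack_id)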
Since I don't have the original template, I had to make do with one of my own. All LB resources were deleted successfully, and I have opened new bugs for the resources that were not.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0037.html