Created attachment 1071379 [details]
heat stack-show <STACK_NAME>; heat resource-list <STACK_NAME>

Description of problem:
Heat stack delete fails.

Version-Release number of selected component (if applicable):

How reproducible:
Unknown

Steps to Reproduce:
1. Attempt to delete an existing heat stack
2. Note the failure

Actual results:
The stack fails to delete.

Expected results:
The heat stack is deleted.

Additional info:
Will attach more files and logs. The stack name is omt_test1.
Created attachment 1071385 [details] heat engine log
Does it go to DELETE_FAILED, or is it stuck IN_PROGRESS forever? I'm assuming the latter, since the attached output shows no resources in progress. This is usually due to a failure that occurs before the delete starts (see https://bugs.launchpad.net/heat/+bug/1492433), and these kinds of errors evade logging as well (see https://bugs.launchpad.net/heat/+bug/1492427). So we really need:

* To know what actually happened. Nothing? Did the stack move to DELETE_IN_PROGRESS and stay there? Did it go to DELETE_FAILED?
* heat-engine logs
* journal output from systemd (`journalctl -u openstack-heat-engine`)
Created attachment 1071393 [details] mysql> select * from stack; mysql> select * from resource;
Created attachment 1071399 [details] journalctl -u openstack-heat-engine
The first thing that happens is that there's an exception when we attempt to cancel the in-progress update:

2015-08-18 16:28:10.178 12019 INFO heat.engine.service [req-74ed2877-55da-4b8b-a78f-409625ca59fa None] Starting cancel of updating stack omt_test1
2015-08-18 16:28:10.193 12019 ERROR oslo.messaging.rpc.dispatcher [req-74ed2877-55da-4b8b-a78f-409625ca59fa ] Exception during message handling: call() takes exactly 3 arguments (5 given)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 69, in wrapped
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     return func(self, ctx, *args, **kwargs)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 750, in stack_cancel_update
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     stack_identity, rpc_api.THREAD_CANCEL)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 850, in _remote_call
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher     self.cctxt.call(cnxt, call, *args, **kwargs)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher TypeError: call() takes exactly 3 arguments (5 given)
2015-08-18 16:28:10.193 12019 TRACE oslo.messaging.rpc.dispatcher

It looks like the problem here is that RPCClient.call() is being called with positional args when it requires keyword args. This was fixed upstream in Liberty by https://bugs.launchpad.net/heat/+bug/1448384 and backported to Kilo, but not to Juno.

Subsequently there are several attempts to delete the stack which result in nothing happening:

2015-08-18 20:08:21.003 12019 INFO heat.engine.service [req-e85c3e54-b79f-4346-aeca-9d949feecf05 None] Deleting stack omt_test1
2015-08-18 20:09:14.869 12019 INFO heat.engine.service [req-ede60741-ddb8-4d38-9a45-64264cacd2ba None] Deleting stack omt_test1
2015-08-19 12:04:58.706 12019 INFO heat.engine.service [req-eba0fc20-4ae6-4ffb-a135-20fd02443c02 None] Deleting stack omt_test1

But at this point the trail goes cold: while the journal does contain a number of exception tracebacks, none are from Aug 18-19.
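For illustration only, here is a minimal Python sketch of the calling convention involved. It uses a stand-in class rather than the real oslo.messaging RPCClient, and the method name and payload are placeholders, not taken from the Heat code; it only shows why extra positional arguments produce exactly this TypeError and why passing the payload as keyword arguments (as the Kilo/Liberty fix does) avoids it.

    # Stand-in for oslo.messaging's RPCClient calling convention:
    # call() accepts only (ctxt, method) positionally; the RPC payload
    # must be passed as keyword arguments.  Illustrative only.
    class FakeRPCClient(object):
        def call(self, ctxt, method, **kwargs):
            return (method, kwargs)


    cctxt = FakeRPCClient()
    stack_identity = {'stack_name': 'omt_test1'}  # placeholder payload

    # Roughly what the Juno code does: two extra positional arguments.
    try:
        cctxt.call({}, 'stop_stack', stack_identity, 'cancel')
    except TypeError as exc:
        # On Python 2.7: "call() takes exactly 3 arguments (5 given)"
        print(exc)

    # Roughly what the fixed code does: the payload goes in as keyword arguments.
    print(cctxt.call({}, 'stop_stack', stack_identity=stack_identity))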
Interestingly, the DB dump shows the stack in the state SUSPEND_FAILED. The backup stack is in the DELETE_COMPLETE state, so it seems we can probably rule out a problem there as the cause (which is unfortunate, because that was the next thing to investigate).
What I _can_ see in the journal:

At Aug 20 08:55:43 we see a similar problem to the one described, caused by an exception while trying to delete a stack snapshot. Matching up the logs suggests that the stack involved was named "dlr". The same error occurs again at the following times:

  Sep 03 13:27:33
  Sep 03 13:39:04

This bug was fixed upstream in Kilo (https://bugs.launchpad.net/heat/+bug/1437588), and could easily be backported to Juno.

There is a second class of error, which again likely manifests in the same way as the one described. This occurs at the following times:

  Aug 24 16:29:19
  Aug 24 16:29:43
  Aug 24 17:01:44
  Aug 24 17:18:53
  Sep 04 12:54:26
  Sep 08 08:09:39

This one is a dependency problem involving a Neutron Pool resource:

  InvalidTemplateReference: The specified reference "monitor" (in pool.Properties.monitors[0]) is incorrect.

It's likely that the cause of this is https://bugs.launchpad.net/heat/+bug/1446575: this happens when a resource fails during an update, and then you do a second update in which the same resource fails again. The logs provided don't cover any of those times, so there is no way for me to know which stacks are involved.

As I mentioned in comment #3, the only mention of any attempt to delete the stack in question, omt_test1, is at the following times:

  Aug 18 20:08
  Aug 18 20:09
  Aug 19 12:04
  Aug 19 14:58

From the log, we can see that this stack fails to create a Pool resource at Aug 18 14:40; the error returned from Neutron is "No eligible backend for pool". The same happens again at Aug 18 15:16. It seems like a fairly good bet that this has the same cause as the second class of error above, even though the evidence is mysteriously absent: we see a single resource failing twice in a row, which is the known trigger, and it's the same type of resource that we see problems with later. The fix for this is in Kilo, and has been tagged for a potential Juno backport upstream, but it does not appear to have been backported yet.
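To make the suspected trigger concrete, here is a hedged reproduction sketch using python-heatclient. It follows the failure sequence described above but is not a confirmed reproducer: HEAT_ENDPOINT, AUTH_TOKEN, and TEMPLATE_WITH_FAILING_POOL are placeholders rather than values from this report, and exact client call signatures may vary slightly between releases.

    # Hedged sketch of the bug 1446575 trigger: a resource fails during an
    # update, the same resource fails again in a second update, and the
    # subsequent delete then dies with InvalidTemplateReference, leaving the
    # stack apparently stuck.  All names in CAPS are placeholders.
    from heatclient import client as heat_client

    heat = heat_client.Client('1', HEAT_ENDPOINT, token=AUTH_TOKEN)

    # Create a stack whose Pool resource Neutron will reject
    # ("No eligible backend for pool").
    created = heat.stacks.create(stack_name='pool_repro',
                                 template=TEMPLATE_WITH_FAILING_POOL)
    stack_id = created['stack']['id']

    # First update: the Pool resource fails.
    heat.stacks.update(stack_id, template=TEMPLATE_WITH_FAILING_POOL)

    # Second update: the same resource fails again (the known trigger).
    heat.stacks.update(stack_id, template=TEMPLATE_WITH_FAILING_POOL)

    # The delete now fails early, before any resource is removed.
    heat.stacks.delete(stack_id)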
Since I don't have the original template, I had to make do with one of my own. All LB resources were deleted successfully, and I have opened new bugs for the resources that were not.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0037.html