1267364 – Stack updates from a dead heat-engine remain IN_PROGRESS

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-heat
Sub Component:
Version:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	async
Target Release:	7.0 (Kilo)
Assignee:	Zane Bitter
QA Contact:	Amit Ugol
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-09-29 19:24 UTC by Zane Bitter
Modified:	2023-02-22 23:02 UTC
CC List:	9 users
Fixed In Version:	openstack-heat-2015.1.1-7.el7ost
Doc Type:	Bug Fix
Doc Text:	During startup, Heat incorrectly ignored nested stacks when searching for stacks with interrupted operations (for example, ones from a previous heat-engine process exiting). In addition, while those stacks that were not ignored were correctly set to FAILED, their resources remained IN_PROGRESS. Because the resources remained IN_PROGRESS, it was not possible to recover the stacks when heat-engine was restarted. With this update, nested stacks are now included when searching for interrupted operations, and IN_PROGRESS resources as well as stacks are moved to the FAILED state. Consequently, they can be recovered as expected upon restart of heat-engine.
Clone Of:
Environment:
Last Closed:	2015-11-18 16:40:30 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1487288	None	None	None	Never
Launchpad	1501828	None	None	None	Never
OpenStack gerrit	231769	None	None	None	Never
OpenStack gerrit	231802	None	None	None	Never
Red Hat Product Errata	RHBA-2015:2076	normal	SHIPPED_LIVE	openstack-heat bug fix advisory	2015-11-18 21:40:02 UTC

Description Zane Bitter 2015-09-29 19:24:34 UTC

A known bug in the upstream Kilo version of Heat (https://bugs.launchpad.net/heat/+bug/1446252) means that when the update of a nested stack resource is cancelled (e.g. because another resource in the same stack as the nested stack resource fails), the nested stack update is not stopped. It continues to run until it succeeds, fails, or times out.

This is particularly problematic for TripleO, because TripleO combines very long timeouts (4 hours) with breakpoints that prevent the nested stack from either succeeding or failing on its own.

Unfortunately, the bug cannot be fixed directly in Kilo, because the fix requires a change to the RPC API. (See bug 1253773.)

A good workaround for this problem, for TripleO specifically, should be to restart heat-engine. (Since the undercloud only supports one overcloud, other users should not be affected.) This stops any updates that are in progress, but unfortunately does not record their new status.

At startup, a reset_stack_status task (in heat/engine/service.py) is supposed to run. It should reset to FAILED the status of any stack that is IN_PROGRESS but not actually being acted on by any live engine (i.e. the stack lock is owned by an engine that no longer exists). It doesn't appear that this is actually happening. (It's also not clear that this would be sufficient, since resources in the stack may remain in the IN_PROGRESS state.)

Currently the only known workaround to recover after a heat-engine restart is to connect to the database directly and issue the following SQL commands:

UPDATE stack SET status='FAILED' WHERE status='IN_PROGRESS' AND action='UPDATE';
UPDATE resource SET status='FAILED' WHERE status='IN_PROGRESS' AND action='UPDATE';
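
The effect of the two statements can be tried on a throwaway database before touching a real one. The following is a minimal sketch, assuming a toy SQLite schema that contains only the columns the statements touch (the real Heat tables have many more columns, and a real deployment would normally run on MySQL/MariaDB). The helper names build_toy_db and fail_interrupted_updates are hypothetical and not part of Heat.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a toy (assumed) stack/resource schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stack (
            id INTEGER PRIMARY KEY, name TEXT, action TEXT, status TEXT);
        CREATE TABLE resource (
            id INTEGER PRIMARY KEY, stack_id INTEGER, name TEXT,
            action TEXT, status TEXT);

        INSERT INTO stack (id, name, action, status) VALUES
            (1, 'overcloud',        'UPDATE', 'IN_PROGRESS'),
            (2, 'overcloud-nested', 'UPDATE', 'IN_PROGRESS'),
            (3, 'finished',         'CREATE', 'COMPLETE');

        INSERT INTO resource (id, stack_id, name, action, status) VALUES
            (1, 1, 'stuck_update',     'UPDATE', 'IN_PROGRESS'),
            (2, 1, 'done_update',      'UPDATE', 'COMPLETE'),
            (3, 3, 'untouched_create', 'CREATE', 'IN_PROGRESS');
        """
    )
    return conn


def fail_interrupted_updates(conn):
    """Apply the two workaround statements in a single transaction.

    Returns a (stacks_changed, resources_changed) tuple.
    """
    with conn:  # commits on success, rolls back on error
        stacks = conn.execute(
            "UPDATE stack SET status='FAILED' "
            "WHERE status='IN_PROGRESS' AND action='UPDATE'"
        ).rowcount
        resources = conn.execute(
            "UPDATE resource SET status='FAILED' "
            "WHERE status='IN_PROGRESS' AND action='UPDATE'"
        ).rowcount
    return stacks, resources


if __name__ == "__main__":
    conn = build_toy_db()
    print(fail_interrupted_updates(conn))
    for row in conn.execute("SELECT name, action, status FROM resource"):
        print(row)
```

Only rows whose action is UPDATE are changed, so in the toy data a resource that is IN_PROGRESS for a CREATE action keeps its status.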

Comment 1 Jan Provaznik 2015-09-30 13:31:33 UTC

It seems that the reset_stack_status method ignores nested stacks (thanks Zane). After replacing:
stacks = stack_object.Stack.get_all(cnxt, filters=filters, tenant_safe=False) or []
with:
stacks = stack_object.Stack.get_all(cnxt, filters=filters, tenant_safe=False, show_nested=True) or []

all stacks are set to the FAILED state after an engine restart. Unfortunately, this is not sufficient, because resources remain in the IN_PROGRESS state. It would probably be best to set them to FAILED as well whenever the stack is set to FAILED.
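
To make the intended behaviour concrete, here is a minimal, self-contained sketch of the reset logic discussed in this bug. It is not the Heat implementation: the Stack and Resource classes, the get_all helper and the lock_owner field are simplified stand-ins invented for illustration, and only the show_nested idea and the IN_PROGRESS/FAILED states come from the report.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Resource:
    name: str
    status: str


@dataclass
class Stack:
    name: str
    status: str
    lock_owner: Optional[str]  # id of the engine holding the stack lock
    nested: bool = False
    resources: List[Resource] = field(default_factory=list)


def get_all(stacks: List[Stack], show_nested: bool = False) -> List[Stack]:
    """Toy stand-in for the stack query: IN_PROGRESS stacks only,
    skipping nested stacks unless show_nested is set."""
    return [
        s for s in stacks
        if s.status == "IN_PROGRESS" and (show_nested or not s.nested)
    ]


def reset_stack_status(stacks: List[Stack], live_engines: Set[str],
                       show_nested: bool = True) -> List[str]:
    """Mark interrupted stacks and their IN_PROGRESS resources as FAILED.

    A stack counts as interrupted when its lock owner is not a live
    engine (an unlocked IN_PROGRESS stack is treated the same way).
    Returns the names of the stacks that were reset.
    """
    reset = []
    for stack in get_all(stacks, show_nested=show_nested):
        if stack.lock_owner in live_engines:
            continue  # a live engine is still working on this stack
        stack.status = "FAILED"
        for res in stack.resources:
            if res.status == "IN_PROGRESS":
                res.status = "FAILED"
        stack.lock_owner = None  # release the stale lock
        reset.append(stack.name)
    return reset


if __name__ == "__main__":
    stacks = [
        Stack("overcloud", "IN_PROGRESS", "dead-engine",
              resources=[Resource("Controller", "IN_PROGRESS")]),
        Stack("overcloud-Controller", "IN_PROGRESS", "dead-engine",
              nested=True, resources=[Resource("0", "IN_PROGRESS")]),
    ]
    print(reset_stack_status(stacks, live_engines={"new-engine"}))
```

Calling the function with show_nested=False reproduces the symptom described above, since the nested stack is never considered. Skipping the inner loop over resources would reproduce the second symptom, where stacks are FAILED but their resources stay IN_PROGRESS.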

I have hit the same error recently too. I tend to think that this bug was exposed by some other bug fix, because previously I was able to run a package update on failed stacks without even needing to restart heat-engine (in other words, the stack didn't remain in the IN_PROGRESS state).

Comment 2 Zane Bitter 2015-10-01 19:57:37 UTC

It turns out the part about resetting the resource states is already fixed upstream in Liberty.

Comment 9 errata-xmlrpc 2015-11-18 16:40:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2076
