Bug 1267364

Summary:	Stack updates from a dead heat-engine remain IN_PROGRESS
Product:	Red Hat OpenStack	Reporter:	Zane Bitter <zbitter>
Component:	openstack-heat	Assignee:	Zane Bitter <zbitter>
Status:	CLOSED ERRATA	QA Contact:	Amit Ugol <augol>
Severity:	unspecified	Docs Contact:
Priority:	high
Version:	7.0 (Kilo)	CC:	ddomingo, gbarros, jprovazn, mburns, rhel-osp-director-maint, sbaker, shardy, yeylon, zbitter
Target Milestone:	async	Keywords:	Triaged, ZStream
Target Release:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-heat-2015.1.1-7.el7ost	Doc Type:	Bug Fix
Doc Text:	During startup, Heat incorrectly ignored nested stacks when searching for stacks with interrupted operations (for example, ones left over from a previous heat-engine process exiting). In addition, while those stacks that were not ignored were correctly set to FAILED, their resources remained IN_PROGRESS. Because the resources remained IN_PROGRESS, it was not possible to recover the stacks when heat-engine was restarted. With this update, nested stacks are now included when searching for interrupted operations, and IN_PROGRESS resources as well as stacks are moved to the FAILED state. Consequently, they can be recovered as expected upon restart of heat-engine.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-11-18 16:40:30 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Zane Bitter 2015-09-29 19:24:34 UTC

There is a known bug in the upstream Kilo version of Heat (https://bugs.launchpad.net/heat/+bug/1446252): when the update of a nested stack resource is cancelled (e.g. because another resource in the same stack as the nested stack resource fails), the nested stack update is not stopped. It continues to run until it succeeds, fails, or times out.

This is particularly problematic for TripleO, because TripleO combines very long timeouts (4 hours) with breakpoints that prevent the nested stack from either succeeding or failing on its own.

Unfortunately, the bug cannot be fixed directly in Kilo, because the fix requires a change to the RPC API. (See bug 1253773.)

A good workaround for this problem, for TripleO specifically, should be to restart heat-engine. (Since the undercloud only supports one overcloud, other users should not be affected.) This will stop any updates that are in-progress, but unfortunately does not record their new status.

At startup, a reset_stack_status task (in heat/engine/service.py) is supposed to run and reset to FAILED the status of any stack that is IN_PROGRESS but not actually being acted on by any live engine (i.e. the stack lock is owned by an engine that no longer exists). It doesn't appear that this is actually happening. (It's also not clear that this would be sufficient, since resources in the stack may remain in the IN_PROGRESS state.)
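
The intended startup check can be sketched roughly as follows. This is only an illustration of the logic described above, not the code in heat/engine/service.py: the Stack dataclass, reset_stale_stacks and the live_engines set are hypothetical names standing in for the stack table, the stack lock, and whatever liveness check the engine actually performs. The sketch also assumes that an IN_PROGRESS stack with no lock at all should be treated as stale.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Stack:
    """Hypothetical stand-in for a row in the stack table."""
    name: str
    status: str
    lock_engine_id: Optional[str] = None  # engine holding the stack lock, if any


def reset_stale_stacks(stacks: Iterable[Stack], live_engines: Set[str]) -> List[str]:
    """Mark IN_PROGRESS stacks whose lock owner is gone as FAILED.

    Returns the names of the stacks that were reset.
    """
    reset = []
    for stack in stacks:
        if stack.status != "IN_PROGRESS":
            continue
        if stack.lock_engine_id in live_engines:
            # A live engine is still working on this stack; leave it alone.
            continue
        # Assumption: a missing lock (None) is treated like a dead owner.
        stack.status = "FAILED"
        stack.lock_engine_id = None  # drop the stale lock
        reset.append(stack.name)
    return reset


stacks = [
    Stack("overcloud", "IN_PROGRESS", lock_engine_id="engine-dead"),
    Stack("other", "IN_PROGRESS", lock_engine_id="engine-live"),
    Stack("done", "COMPLETE"),
]
reset_names = reset_stale_stacks(stacks, live_engines={"engine-live"})
```

In this toy run only "overcloud" is reset, because its lock belongs to an engine that is not in the live set. Note that the sketch touches stack status only, which is exactly the gap mentioned in the parenthetical above.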

Currently the only known workaround to recover after a heat-engine restart is to connect to the database directly and issue the following SQL commands:

UPDATE stack SET status='FAILED' WHERE status='IN_PROGRESS' AND action='UPDATE';
UPDATE resource SET status='FAILED' WHERE status='IN_PROGRESS' AND action='UPDATE';
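
To see what these two statements do and do not touch, here is a small sketch that runs equivalent updates against an in-memory SQLite database. The two-table schema and the sample rows are made up for illustration and are far simpler than the real Heat schema. The actual database will normally be MySQL/MariaDB rather than SQLite.

```python
import sqlite3

# Toy schema: only the columns the workaround statements refer to.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stack (name TEXT, action TEXT, status TEXT);
    CREATE TABLE resource (name TEXT, action TEXT, status TEXT);
    INSERT INTO stack VALUES ('overcloud', 'UPDATE', 'IN_PROGRESS');
    INSERT INTO stack VALUES ('other', 'CREATE', 'IN_PROGRESS');
    INSERT INTO resource VALUES ('Controller', 'UPDATE', 'IN_PROGRESS');
    INSERT INTO resource VALUES ('Compute', 'UPDATE', 'COMPLETE');
    """
)

# Run both updates in one transaction so that stacks and resources
# are changed together or not at all.
with conn:
    for table in ("stack", "resource"):
        conn.execute(
            f"UPDATE {table} SET status = ? WHERE status = ? AND action = ?",
            ("FAILED", "IN_PROGRESS", "UPDATE"),
        )

stack_rows = conn.execute("SELECT name, status FROM stack ORDER BY name").fetchall()
resource_rows = conn.execute("SELECT name, status FROM resource ORDER BY name").fetchall()
```

Only rows that are both IN_PROGRESS and part of an UPDATE action are changed; the CREATE in progress and the COMPLETE resource are left alone.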

Comment 1 Jan Provaznik 2015-09-30 13:31:33 UTC

It seems that the reset_stack_status method ignores nested stacks (thanks Zane). After replacing:
stacks = stack_object.Stack.get_all(cnxt, filters=filters, tenant_safe=False) or []
with:
stacks = stack_object.Stack.get_all(cnxt, filters=filters, tenant_safe=False, show_nested=True) or []

all stacks are set to the FAILED state after an engine restart. Unfortunately this is not sufficient, because resources remain in the IN_PROGRESS state. It would probably be best to set them to FAILED whenever their stack is set to FAILED.
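
A toy model of both points, for illustration only. The get_all function here is a hypothetical stand-in that merely mimics the effect of the show_nested flag, and the Stack/Resource dataclasses and the owner field are assumptions, not Heat's object model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    name: str
    status: str


@dataclass
class Stack:
    name: str
    status: str
    owner: Optional[str] = None  # parent stack name; set only for nested stacks
    resources: List[Resource] = field(default_factory=list)


def get_all(stacks: List[Stack], show_nested: bool = False) -> List[Stack]:
    """Toy query: nested stacks are hidden unless show_nested is True."""
    return [s for s in stacks if show_nested or s.owner is None]


def fail_interrupted(stacks: List[Stack], show_nested: bool) -> None:
    """Fail IN_PROGRESS stacks together with their IN_PROGRESS resources."""
    for stack in get_all(stacks, show_nested=show_nested):
        if stack.status != "IN_PROGRESS":
            continue
        stack.status = "FAILED"
        for res in stack.resources:
            if res.status == "IN_PROGRESS":
                res.status = "FAILED"


def make_stacks() -> List[Stack]:
    return [
        Stack("overcloud", "IN_PROGRESS",
              resources=[Resource("Controller", "IN_PROGRESS")]),
        Stack("overcloud-Controller", "IN_PROGRESS", owner="overcloud",
              resources=[Resource("0", "IN_PROGRESS"), Resource("1", "COMPLETE")]),
    ]


without_nested = make_stacks()
fail_interrupted(without_nested, show_nested=False)

with_nested = make_stacks()
fail_interrupted(with_nested, show_nested=True)
```

With show_nested=False the nested stack "overcloud-Controller" stays IN_PROGRESS; with show_nested=True it and its IN_PROGRESS resource are moved to FAILED, while the COMPLETE resource is untouched.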

I have also hit the same error recently. I tend to think that this bug was exposed by some other bug fix, because previously I was able to run a package update on failed stacks without even needing to restart heat-engine (in other words, the stack didn't remain in the IN_PROGRESS state).

Comment 2 Zane Bitter 2015-10-01 19:57:37 UTC

It turns out the part about resetting the resource states is already fixed upstream in Liberty.

Comment 9 errata-xmlrpc 2015-11-18 16:40:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2076