There is a known bug in the upstream Kilo version of Heat (https://bugs.launchpad.net/heat/+bug/1446252): when the update of a nested stack resource is cancelled (e.g. because another resource in the same stack as the nested stack resource fails), the nested stack update is not stopped. It continues to run until it succeeds, fails, or times out.
This is particularly problematic for TripleO, because TripleO combines very long timeouts (4 hours) with breakpoints that prevent the nested stack from either succeeding or failing on its own.
Unfortunately, the bug cannot be fixed directly in Kilo, because the fix requires a change to the RPC API. (See bug 1253773.)
A good workaround for this problem, for TripleO specifically, should be to restart heat-engine. (Since the undercloud only supports one overcloud, other users should not be affected.) This will stop any updates that are in progress, but unfortunately it does not record their new status.
At startup, a reset_stack_status task (in heat/engine/service.py) is supposed to run and reset to FAILED the status of any stack that is IN_PROGRESS but not actually being acted on by any live engine (i.e. the stack lock is owned by an engine that no longer exists). It doesn't appear that this is actually happening. (It's also not clear that this would be sufficient, since resources in the stack may remain in the IN_PROGRESS state.)
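To make the intended behaviour concrete, here is a minimal toy model of that reset logic. It is not Heat's code: the Stack class, the lock_engine_id field, and reset_orphaned_stacks() are hypothetical names invented for illustration, and treating a stack with no lock at all as orphaned is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Stack:
    """Toy stand-in for a stack record (hypothetical, not Heat's model)."""
    name: str
    status: str
    lock_engine_id: Optional[str] = None  # engine that holds the stack lock


def reset_orphaned_stacks(stacks: Iterable[Stack],
                          live_engines: Set[str]) -> List[str]:
    """Set IN_PROGRESS stacks to FAILED when no live engine owns their lock.

    Returns the names of the stacks that were reset.
    """
    reset = []
    for stack in stacks:
        if stack.status != "IN_PROGRESS":
            continue  # nothing to recover
        if stack.lock_engine_id in live_engines:
            continue  # a live engine is still working on this stack
        # Assumption: a missing lock (None) is treated like a dead owner.
        stack.status = "FAILED"
        reset.append(stack.name)
    return reset
```

In this model, a stack whose lock belongs to an engine that was just restarted would be reset, while a stack locked by an engine that is still running would be left alone.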
Currently the only known workaround to recover after a heat-engine restart is to connect to the database directly and issue the following SQL commands:
UPDATE stack SET status="FAILED" WHERE status="IN_PROGRESS" AND action="UPDATE";
UPDATE resource SET status="FAILED" WHERE status="IN_PROGRESS" AND action="UPDATE";
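If the two statements are scripted rather than typed by hand, it may be safer to run them in a single transaction so that the stack and resource tables cannot end up half-updated. The following is only a sketch: it uses an in-memory SQLite database with a simplified, made-up schema as a stand-in for the real Heat database, and fail_in_progress_updates() and make_demo_db() are hypothetical helper names.

```python
import sqlite3


def fail_in_progress_updates(conn):
    """Mark IN_PROGRESS UPDATE rows as FAILED in both tables, atomically.

    Returns a dict mapping table name to the number of rows changed.
    """
    counts = {}
    with conn:  # commits on success, rolls back if an exception is raised
        for table in ("stack", "resource"):  # fixed names, never user input
            cur = conn.execute(
                f"UPDATE {table} SET status = ? "
                "WHERE status = ? AND action = ?",
                ("FAILED", "IN_PROGRESS", "UPDATE"),
            )
            counts[table] = cur.rowcount
    return counts


def make_demo_db():
    """Build a throwaway database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    for table in ("stack", "resource"):
        conn.execute(f"CREATE TABLE {table} (id TEXT, action TEXT, status TEXT)")
    conn.executemany("INSERT INTO stack VALUES (?, ?, ?)", [
        ("overcloud", "UPDATE", "IN_PROGRESS"),
        ("other", "CREATE", "IN_PROGRESS"),  # must not be touched
    ])
    conn.executemany("INSERT INTO resource VALUES (?, ?, ?)", [
        ("a", "UPDATE", "IN_PROGRESS"),
        ("b", "UPDATE", "IN_PROGRESS"),
        ("c", "UPDATE", "COMPLETE"),
    ])
    conn.commit()
    return conn
```

Checking the returned row counts before moving on gives a quick sanity check that only the expected stacks and resources were changed.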
It seems that the reset_stack_status method ignores nested stacks (thanks Zane). After replacing:
stacks = stack_object.Stack.get_all(cnxt, filters=filters, tenant_safe=False) or
with:
stacks = stack_object.Stack.get_all(cnxt, filters=filters, tenant_safe=False, show_nested=True) or
all stacks are set to the FAILED state after an engine restart. Unfortunately, this is not sufficient, because resources remain in the IN_PROGRESS state. It would probably be best to set them to FAILED as well whenever the stack is set to FAILED.
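The effect of including nested stacks, and the suggested follow-up of failing the resources along with the stack, can be illustrated with a small toy model. Nothing here is Heat's code: the Stack and Resource classes, the owner_id field used to mark a nested stack, and fail_in_progress() are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    name: str
    status: str


@dataclass
class Stack:
    name: str
    status: str
    owner_id: Optional[str] = None  # assumed marker: set only on nested stacks
    resources: List[Resource] = field(default_factory=list)


def fail_in_progress(stacks: List[Stack], show_nested: bool = True) -> List[str]:
    """Fail IN_PROGRESS stacks and their IN_PROGRESS resources.

    With show_nested=False, nested stacks are skipped, which mimics the
    behaviour described above. Returns the names of the stacks changed.
    """
    changed = []
    for stack in stacks:
        if stack.owner_id is not None and not show_nested:
            continue  # nested stack filtered out of the query
        if stack.status != "IN_PROGRESS":
            continue
        stack.status = "FAILED"
        for res in stack.resources:
            if res.status == "IN_PROGRESS":
                res.status = "FAILED"  # cascade so nothing is left hanging
        changed.append(stack.name)
    return changed
```

With show_nested=False only the top-level stack is reset and the nested stack stays IN_PROGRESS; with show_nested=True both the nested stack and its in-progress resources end up FAILED.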
I have hit the same error recently too. I tend to think that this bug was exposed by some other bug fix, because I was previously able to run a package update on failed stacks without even needing to restart heat-engine (in other words, the stack didn't remain in the IN_PROGRESS state).
It turns out the part about resetting the resource states is already fixed upstream in Liberty.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.