So the remaining issue here was this: The end result was an update failed and the recourse list showed this: ]$ heat resource-list -n 3 overcloud | grep -v COMPLETE +---------------------------------------------+-----------------------------------------------+---------------------------------------------------+--------------------+----------------------+---------------------------------------------+ | resource_name | physical_resource_id | resource_type | resource_status | updated_time | parent_resource | +---------------------------------------------+-----------------------------------------------+---------------------------------------------------+--------------------+----------------------+---------------------------------------------+ | 0 | f4b63ac6-5461-4bad-a352-637e6ab816ed | OS::TripleO::Controller | UPDATE_IN_PROGRESS | 2015-12-09T09:59:23Z | Controller | | Compute | 03c29db3-2ada-446e-9636-c41d1aa60fc9 | OS::Heat::ResourceGroup | UPDATE_FAILED | 2015-12-09T09:59:24Z | | | 0 | ee4b419e-512e-481e-8f84-1f114511a7b7 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2015-12-09T09:59:26Z | Compute | | Controller | 42661f8f-6f60-4a8e-bc7a-56a7c4f08d5d | OS::Heat::ResourceGroup | UPDATE_FAILED | 2015-12-09T10:31:05Z | | +---------------------------------------------+-----------------------------------------------+---------------------------------------------------+--------------------+----------------------+---------------------------------------------+ ...and it looks to me like this was caused by failing to cancel an operation on a child stack (which is a known problem for Kilo, fixed in Liberty: see https://bugs.launchpad.net/heat/+bug/1446252 for more details). It doesn't appear to be caused by an uncaught exception - the child stacks should eventually time out, they're unlikely to be permanently stuck in that state. Therefore, I'm going to close this bug. We can reopen if we see specific evidence that there are uncaught exceptions happening again.
*** Bug 1293117 has been marked as a duplicate of this bug. ***