openstack-heat: No way to stop a running heat deployment.

Environment:
openstack-heat-engine-5.0.1-4.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
openstack-heat-api-cloudwatch-5.0.1-4.el7ost.noarch
openstack-heat-common-5.0.1-4.el7ost.noarch
openstack-heat-api-5.0.1-4.el7ost.noarch
openstack-heat-api-cfn-5.0.1-4.el7ost.noarch

Currently, if there is a problem with a started heat deployment (such as a missing or bad argument), the user has no way to stop it. This becomes more urgent with the upgrade from 7.x to 8.0: in OSP7 we could restart heat-engine, but this approach isn't applicable in OSP8.
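For reference, the OSP7-era workaround was simply restarting the engine on the undercloud. A minimal sketch, assuming the usual systemd unit name on RHEL 7:

  # OSP7 workaround (sketch): restart heat-engine on the undercloud to abort
  # the in-progress deployment. Unit name assumed to be openstack-heat-engine.
  sudo systemctl restart openstack-heat-engine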
The approach is applicable as a last resort in OSP8, in the case where a deployment is going to hang until it times out. Ideally in the future we'll have a cancel-update command that does not always roll back (since TripleO can't deal with rollbacks), but AFAIK for now cancel-update always rolls back, so it isn't an option.

Restarting heat-engine shouldn't be the first port of call any more, however, because if a resource FAILs then all of its siblings will be stopped automatically within 4 minutes (though this may take a while to trickle down the nested stacks). In OSP7, nested stacks that were siblings of the failed resource were not stopped, and thus another update could not begin until they either completed or timed out. So in most cases there should be no need to restart heat-engine, and in fact it's undesirable because that is (and always has been) fragile.

What's of more immediate concern is that if someone restarts heat-engine mid-update anyway, it's possible for some resources to get stuck in the IN_PROGRESS state, and even further restarts fail to dislodge them. From Sasha's setup:

[stack@instack ~]$ heat resource-list -n5 overcloud | grep -v COMPLETE
| resource_name         | physical_resource_id                 | resource_type                   | resource_status    | updated_time        | stack_name                                       |
| Controller            | 39a69d8c-c3fd-4493-ba97-224b10576494 | OS::Heat::ResourceGroup         | UPDATE_FAILED      | 2016-03-23T21:47:05 | overcloud                                        |
| Compute               | 3d1fa595-3273-4c8f-b8df-8aed969b6594 | OS::Heat::ResourceGroup         | UPDATE_FAILED      | 2016-03-23T21:47:08 | overcloud                                        |
| 0                     | 883f1d3c-1c34-4cb7-8b2a-4630c29c56ff | OS::TripleO::Controller         | UPDATE_IN_PROGRESS | 2016-03-23T21:47:09 | overcloud-Controller-oe63xwdjvve3                |
| 1                     | 15a39eed-01d5-4990-8146-e96ad2350862 | OS::TripleO::Compute            | UPDATE_FAILED      | 2016-03-23T21:47:11 | overcloud-Compute-5chmvfdk4kcu                   |
| 1                     | 4c35bed2-4406-4956-bd8f-f336fce341b7 | OS::TripleO::Controller         | UPDATE_FAILED      | 2016-03-23T21:47:11 | overcloud-Controller-oe63xwdjvve3                |
| 0                     | df8d1927-4c89-42da-be0a-e0e0dc3bb629 | OS::TripleO::Compute            | UPDATE_IN_PROGRESS | 2016-03-23T21:47:13 | overcloud-Compute-5chmvfdk4kcu                   |
| 2                     | 1eab5295-90e1-416b-8661-4623fe7515fd | OS::TripleO::Controller         | UPDATE_IN_PROGRESS | 2016-03-23T21:47:14 | overcloud-Controller-oe63xwdjvve3                |
| ControllerDeployment  | da92af35-e5dc-419c-b7a4-65972311cc08 | OS::TripleO::SoftwareDeployment | UPDATE_FAILED      | 2016-03-23T21:49:05 | overcloud-Controller-oe63xwdjvve3-0-lszx5qxppmpk |
| NovaComputeDeployment | 7f1380c0-e3a9-45fc-9d15-58148bb011f5 | OS::TripleO::SoftwareDeployment | UPDATE_FAILED      | 2016-03-23T21:49:09 | overcloud-Compute-5chmvfdk4kcu-1-4w6pbutbbhzn    |
| NovaComputeDeployment | 5829233a-7c98-4f22-9b28-6d2121d2a1d3 | OS::TripleO::SoftwareDeployment | UPDATE_FAILED      | 2016-03-23T21:49:24 | overcloud-Compute-5chmvfdk4kcu-0-3xy6vq23zuus    |
| ControllerDeployment  | f4b4c0e5-59b7-46fd-a8bf-079f935f669c | OS::TripleO::SoftwareDeployment | UPDATE_FAILED      | 2016-03-23T21:49:31 | overcloud-Controller-oe63xwdjvve3-1-4udvp2s5plai |
| ControllerDeployment  | ed4d5f72-26ff-4f29-81e0-8367b56e7349 | OS::TripleO::SoftwareDeployment | UPDATE_FAILED      | 2016-03-23T21:49:43 | overcloud-Controller-oe63xwdjvve3-2-y3vqu7jua3u7 |

Since the startup code in the engine should reset any IN_PROGRESS stacks, this may be due to the stack itself being in the FAILED state but the resources within it being IN_PROGRESS. We need a way of resetting these that is still safe for convergence, or, alternatively, to make sure that resources are always set to FAILED before stacks. As far as I can tell https://bugs.launchpad.net/heat/+bug/1560688 is *NOT* related, because the resources shown as affected are all nested stacks that are never replaced (they're only ever updated in-place).
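For completeness, the cancel-update command discussed above does exist in the heat CLI today, but since it always triggers a rollback (which TripleO can't handle), it isn't usable here. Shown only to document the limitation:

  # Exists, but always rolls back -- not an option for TripleO overclouds.
  heat stack-cancel-update overcloud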
I've found a first occurrence of the problem during the reset itself, and linked the bug and the patch.
If you see this again, could you please attach a log file? We haven't been able to reproduce this so far, other than the case covered by the patch Thomas mentioned above (which only kicks in when you restart heat-engine twice in quick succession). One theory is that the thread is being stopped in such a way that the resources do not get stopped, and the exception hits the catch-all that resets the stack status but not the resources. It's not yet clear to me how this can happen, but if it does, the exception should be logged, so we ought to be able to figure out something from that. Given how hard it apparently is to reproduce, I don't think this is a blocker, so I've cleared the blocker flag. It's also vanishingly unlikely that this is a regression, so I've cleared the Regression keyword as well.
The patch Thomas linked above is included in openstack-heat-5.0.1-5.el7ost. I am *not* moving the bug to MODIFIED, though, since we don't think it addresses the actual cause of the issue.
Bug 1326126 may be another manifestation of this, and has logs attached.
*** Bug 1326126 has been marked as a duplicate of this bug. ***
From looking through the logs attached to bug 1326126, there appear to be two distinct problems that contributed to it.

The first is that while Heat resets the status of zombie stacks and resources from IN_PROGRESS to FAILED at startup, it cannot do so if it starts before Keystone is available. In this case it appears that they started around the same time (as you might expect after a reboot), with the result that some of the zombie stacks were reset and some were not. I raised https://bugs.launchpad.net/heat/+bug/1570569 for this issue.

The second is that if a user updates a zombie stack then the update will fail and also move the stack state to FAILED, but unlike the startup reset it will *not* reset the resources within the stack. The stack is thus left in a permanently zombified state, where some of its resources can never be updated. This is the bigger problem, but it only occurs when the user is able to try updating their stack after an engine has died but before any other engine starts up to reset it, or when the startup reset doesn't work because of the first bug above. I raised https://bugs.launchpad.net/heat/+bug/1570576 for this issue.
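A quick way to spot the zombified state described above (a sketch, assuming the overcloud stack is named 'overcloud'): the stack itself reports FAILED while some of its nested resources remain IN_PROGRESS.

  # Stack-level state: look for *_FAILED
  heat stack-list
  # Resource-level state across nested stacks: zombie resources show IN_PROGRESS
  heat resource-list -n 5 overcloud | grep IN_PROGRESS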
Before verifying, we should check the negative scenario mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1326126.

Test scenario: simulate an undercloud power outage during an upgrade.
Verify: the upgrade can resume and finish successfully.
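One way to simulate the abrupt power loss (a sketch, assuming the undercloud is a disposable VM you can safely crash; SysRq 'b' reboots immediately without syncing or unmounting filesystems):

  # Enable SysRq, then trigger an immediate, unclean reboot mid-upgrade.
  echo 1 | sudo tee /proc/sys/kernel/sysrq
  echo b | sudo tee /proc/sysrq-trigger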
As a note on this, we ran into the same problem when our deployment ran out of SQL connections and crashed. We found the only way to work around this was to find the stale resource in the heat database on the undercloud and change its status to FAILED. This resolved the issue, and we have since been able to update the stack successfully.
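For anyone hitting the same state, a sketch of that manual workaround, assuming the default 'heat' MariaDB database on the undercloud (take a DB backup first, and stop here if anything looks different on your system):

  # Reset resources stuck in IN_PROGRESS so the next stack-update can proceed.
  mysql heat -e "UPDATE resource SET status='FAILED' WHERE status='IN_PROGRESS';"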
Fix merged upstream in Ocata.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1245