Bug 1320771 - [Heat] Heat resources left IN_PROGRESS after engine restart during update
Summary: [Heat] Heat resources left IN_PROGRESS after engine restart during update
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Target Release: 11.0 (Ocata)
Assignee: Thomas Hervé
QA Contact: Amit Ugol
Duplicates: 1326126
Depends On:
Reported: 2016-03-24 00:42 UTC by Alexander Chuzhoy
Modified: 2019-12-16 05:33 UTC
CC List: 14 users

Fixed In Version: openstack-heat-8.0.0-4.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, while the Orchestration service could reset the status of resources when the state of the stack was incorrect, it failed to do so when an update was retriggered. This resulted in resources being stuck in progress, which required database fixes to unblock the deployment. With this release, the Orchestration service sets the status of all resources whenever it sets the status of the stack. This prevents resources from getting stuck in progress, allowing operations to be retried successfully.
Clone Of:
Last Closed: 2017-05-17 19:27:58 UTC
Target Upstream Version:


External Trackers:
Launchpad 1561429 (last updated 2016-03-24 09:50:10 UTC)
Launchpad 1570569 (last updated 2016-04-14 20:34:45 UTC)
Launchpad 1570576 (last updated 2016-04-14 20:33:35 UTC)
OpenStack gerrit 296976, MERGED: "Reset stack status after resources" (last updated 2021-02-07 23:05:00 UTC)
OpenStack gerrit 386741, MERGED: "Fix for resources stuck in progress after engine crash" (last updated 2021-02-07 23:05:00 UTC)
Red Hat Product Errata RHEA-2017:1245, SHIPPED_LIVE: "Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory" (last updated 2017-05-17 23:01:50 UTC)

Description Alexander Chuzhoy 2016-03-24 00:42:59 UTC
openstack-heat: No way to stop the running heat deployment.


Currently, if there's a problem with a running heat deployment (such as a missing or bad argument), the user has no way to stop it. This becomes more urgent with the upgrade from 7.x to 8.0.

In OSP7 we could restart the heat-engine, but this approach isn't applicable in OSP8.

Comment 2 Zane Bitter 2016-03-24 01:03:56 UTC
The approach is applicable as a last resort in OSP8, in the case where a deployment is going to hang until it times out. Ideally in the future we'll have a cancel-update command that does not always roll back (since TripleO can't deal with rollbacks), but AFAIK for now cancel-update always rolls back so it isn't an option.

Restarting heat-engine shouldn't be the first port of call any more, however, because if a resource FAILs then all of its siblings will be stopped automatically within 4 minutes (this might take a while to trickle down the nested stacks). In OSP7, nested stacks that were siblings of the failed resource were not stopped, and thus another update could not begin until they either completed or timed out. So in most cases there should be no need to restart heat-engine, and in fact it's undesirable because that is (and always has been) fragile.

What's of more immediate concern is that it appears that if someone restarts heat-engine mid-update anyway, it's possible for some resources to get stuck in the IN_PROGRESS state, and even further restarts fail to dislodge them. From Sasha's setup:

[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLETE
| resource_name                                 | physical_resource_id                          | resource_type                                     | resource_status    | updated_time        | stack_name                                                                                                                                        |
| Controller                                    | 39a69d8c-c3fd-4493-ba97-224b10576494          | OS::Heat::ResourceGroup                           | UPDATE_FAILED      | 2016-03-23T21:47:05 | overcloud                                                                                                                                         |
| Compute                                       | 3d1fa595-3273-4c8f-b8df-8aed969b6594          | OS::Heat::ResourceGroup                           | UPDATE_FAILED      | 2016-03-23T21:47:08 | overcloud                                                                                                                                         |
| 0                                             | 883f1d3c-1c34-4cb7-8b2a-4630c29c56ff          | OS::TripleO::Controller                           | UPDATE_IN_PROGRESS | 2016-03-23T21:47:09 | overcloud-Controller-oe63xwdjvve3                                                                                                                 |
| 1                                             | 15a39eed-01d5-4990-8146-e96ad2350862          | OS::TripleO::Compute                              | UPDATE_FAILED      | 2016-03-23T21:47:11 | overcloud-Compute-5chmvfdk4kcu                                                                                                                    |
| 1                                             | 4c35bed2-4406-4956-bd8f-f336fce341b7          | OS::TripleO::Controller                           | UPDATE_FAILED      | 2016-03-23T21:47:11 | overcloud-Controller-oe63xwdjvve3                                                                                                                 |
| 0                                             | df8d1927-4c89-42da-be0a-e0e0dc3bb629          | OS::TripleO::Compute                              | UPDATE_IN_PROGRESS | 2016-03-23T21:47:13 | overcloud-Compute-5chmvfdk4kcu                                                                                                                    |
| 2                                             | 1eab5295-90e1-416b-8661-4623fe7515fd          | OS::TripleO::Controller                           | UPDATE_IN_PROGRESS | 2016-03-23T21:47:14 | overcloud-Controller-oe63xwdjvve3                                                                                                                 |
| ControllerDeployment                          | da92af35-e5dc-419c-b7a4-65972311cc08          | OS::TripleO::SoftwareDeployment                   | UPDATE_FAILED      | 2016-03-23T21:49:05 | overcloud-Controller-oe63xwdjvve3-0-lszx5qxppmpk                                                                                                  |
| NovaComputeDeployment                         | 7f1380c0-e3a9-45fc-9d15-58148bb011f5          | OS::TripleO::SoftwareDeployment                   | UPDATE_FAILED      | 2016-03-23T21:49:09 | overcloud-Compute-5chmvfdk4kcu-1-4w6pbutbbhzn                                                                                                     |
| NovaComputeDeployment                         | 5829233a-7c98-4f22-9b28-6d2121d2a1d3          | OS::TripleO::SoftwareDeployment                   | UPDATE_FAILED      | 2016-03-23T21:49:24 | overcloud-Compute-5chmvfdk4kcu-0-3xy6vq23zuus                                                                                                     |
| ControllerDeployment                          | f4b4c0e5-59b7-46fd-a8bf-079f935f669c          | OS::TripleO::SoftwareDeployment                   | UPDATE_FAILED      | 2016-03-23T21:49:31 | overcloud-Controller-oe63xwdjvve3-1-4udvp2s5plai                                                                                                  |
| ControllerDeployment                          | ed4d5f72-26ff-4f29-81e0-8367b56e7349          | OS::TripleO::SoftwareDeployment                   | UPDATE_FAILED      | 2016-03-23T21:49:43 | overcloud-Controller-oe63xwdjvve3-2-y3vqu7jua3u7                                                                                                  |

Since the startup code in the engine should reset any IN_PROGRESS stacks, this may be due to the stack itself being in the FAILED state but resources within it being IN_PROGRESS. We need a way of resetting these that is still safe for convergence. Or, alternatively, to make sure that resources are always set to FAILED before stacks.
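
To make that inconsistency easy to spot, something along these lines would list resources still IN_PROGRESS under stacks that are not (illustrative only: table and column names are assumed from Heat's SQLAlchemy schema, and the connection string is a placeholder):

from sqlalchemy import create_engine, text

# Placeholder credentials; point this at the undercloud's heat database.
engine = create_engine("mysql+pymysql://heat:secret@localhost/heat")

# Resources stuck IN_PROGRESS whose owning stack has already left that state.
zombie_query = text("""
    SELECT s.name AS stack_name, r.name AS resource_name, r.status
    FROM resource r
    JOIN stack s ON r.stack_id = s.id
    WHERE r.status = 'IN_PROGRESS'
      AND s.status != 'IN_PROGRESS'
""")

with engine.connect() as conn:
    for row in conn.execute(zombie_query):
        print(row.stack_name, row.resource_name, row.status)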

As far as I can tell https://bugs.launchpad.net/heat/+bug/1560688 is *NOT* related, because the resources shown as affected are all nested stacks that are never replaced (they're only ever updated in-place).

Comment 3 Thomas Hervé 2016-03-24 09:51:39 UTC
I've found a first occurrence of the problem during the reset itself, and linked the bug and the patch.

Comment 4 Zane Bitter 2016-03-31 01:11:16 UTC
If you see this again, could you please attach a log file? We haven't been able to reproduce this so far, other than the case addressed by the patch Thomas mentioned above (which only kicks in when you restart heat-engine twice in quick succession).

One theory is that the thread is being stopped in such a way that the resources do not get stopped, and the exception hits the catch-all that resets the stack status but not the resources. It's not yet clear to me why this can happen, but if it does, the exception should be logged, so we ought to be able to figure something out from that.

Given how hard it apparently is to reproduce, I don't think this is a blocker so I've cleared the blocker flag. It's also vanishingly unlikely that this is a regression, so I've cleared the Regression keyword also.

Comment 5 Zane Bitter 2016-03-31 14:02:20 UTC
The patch Thomas linked above is included in openstack-heat-5.0.1-5.el7ost. I am *not* moving the bug to modified though, since we don't think it addresses the actual cause of the issue.

Comment 6 Zane Bitter 2016-04-12 14:47:51 UTC
Bug 1326126 may be another manifestation of this, and has logs attached.

Comment 7 Zane Bitter 2016-04-12 15:35:45 UTC
*** Bug 1326126 has been marked as a duplicate of this bug. ***

Comment 8 Zane Bitter 2016-04-14 20:33:35 UTC
From looking through the logs attached to bug 1326126, there appear to be two distinct problems that contributed to it.

The first is that while Heat resets the status of zombie stacks and resources from IN_PROGRESS to FAILED at startup, it cannot do so if it starts before Keystone is available. In this case it appears that they started around the same time (as you might expect after a reboot), with the result that some of the zombie stacks were reset and some were not. I raised https://bugs.launchpad.net/heat/+bug/1570569 for this issue.
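
A minimal sketch of the shape of a fix for this first problem, with stand-in callables rather than anything from the actual patch:

import time

def reset_zombies_when_ready(reset_fn, keystone_available, poll_seconds=5):
    """Defer the startup reset until Keystone is reachable.

    reset_fn and keystone_available are assumed callables standing in
    for Heat's reset logic and a Keystone health check respectively.
    """
    while not keystone_available():
        # Right after a reboot Keystone may still be coming up; wait
        # instead of silently skipping the reset.
        time.sleep(poll_seconds)
    reset_fn()  # reset zombie IN_PROGRESS stacks and resources to FAILED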

The second is that if a user updates a zombie stack then it will fail and also move the stack state to FAILED, but unlike the startup reset it will *not* reset the resources within it. The stack is thus left in a permanently zombified state, where some of the resources can never be updated. This is the bigger problem, but it only occurs when the user is able to try updating their stack after an engine has died but before any other engine starts up to reset it, or when the startup reset doesn't work because of the first bug above. I raised https://bugs.launchpad.net/heat/+bug/1570576 for this issue.
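
The fix that eventually shipped (per the Doc Text above) sets the status of all resources whenever it sets the status of the stack. A rough sketch of that idea, assuming Heat-like objects with a state_set() method rather than quoting the merged patch:

def fail_stack_with_resources(stack, reason):
    """Mark a stack FAILED, sweeping its IN_PROGRESS resources first.

    'stack' is assumed to expose .resources, .action, .status and a
    state_set(action, status, reason) method, loosely mirroring Heat's
    objects; this is a sketch, not the merged patch.
    """
    # Resources first: if the engine dies between the two steps we get
    # FAILED resources under an IN_PROGRESS stack, which the startup
    # reset already handles -- never the reverse, zombified case.
    for res in stack.resources:
        if res.status == 'IN_PROGRESS':
            res.state_set(res.action, 'FAILED', reason)
    stack.state_set(stack.action, 'FAILED', reason)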

Comment 9 Omri Hochman 2016-04-19 20:04:37 UTC
Before verifying, we should check the negative scenario mentioned in:

Test Scenario: simulate an undercloud power outage during the upgrade.
Verify: that the upgrade can resume and finish successfully.

Comment 11 Charlie Llewellyn 2016-08-11 18:42:22 UTC
As a note on this, we ran into the same problem when our deployment ran out of SQL connections and crashed.

We found the only way to work around this was to find the stale resource in the heat database on the undercloud and change the status to FAILED.

This resolved the issue and we have been able to successfully update the stack.
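
For reference, the change was along these lines (a sketch only: table and column names are assumed from Heat's schema, the connection string is a placeholder, and heat-engine should be stopped and the database backed up before attempting anything like this):

from sqlalchemy import create_engine, text

# Placeholder credentials; point this at the undercloud's heat database.
engine = create_engine("mysql+pymysql://heat:secret@localhost/heat")

with engine.begin() as conn:  # transaction: commits on success, rolls back on error
    conn.execute(text(
        "UPDATE resource "
        "SET status = 'FAILED', "
        "    status_reason = 'Reset manually after engine crash' "
        "WHERE status = 'IN_PROGRESS'"
    ))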

Comment 13 Zane Bitter 2017-01-26 15:10:05 UTC
Fix merged upstream in Ocata.

Comment 16 errata-xmlrpc 2017-05-17 19:27:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

