Bug 1042160

Summary: [RFE][heat]: Update Failure Recovery
Product: Red Hat OpenStack
Reporter: RHOS Integration <rhos-integ>
Component: openstack-heat
Assignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA
QA Contact: Amit Ugol <augol>
Severity: high
Priority: high
Version: unspecified
CC: ajeain, ddomingo, markmc, sbaker, sgordon, shardy, yeylon, zbitter
Target Milestone: Upstream M3
Keywords: FutureFeature
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
URL: https://blueprints.launchpad.net/heat/+spec/update-failure-recovery
Whiteboard: upstream_milestone_juno-3 upstream_status_implemented upstream_definition_approved
Fixed In Version: openstack-heat-2014.2-1.el7ost
Doc Type: Enhancement
Doc Text:
The Orchestration service now allows the user to update a stack in a FAILED state. Previously, failed stacks could only be deleted, not updated.
Last Closed: 2015-02-09 15:01:29 UTC

Description RHOS Integration 2013-12-12 21:14:17 UTC
Cloned from launchpad blueprint https://blueprints.launchpad.net/heat/+spec/update-failure-recovery.

Description:

Currently, stack updates are handled in an all-or-nothing way. If a failure occurs and rollback is enabled, we attempt to roll back to the previous state. If the rollback fails or is disabled, we leave the stack in its failed state but accept the old or new template (respectively) as a true representation of the current state of the stack. (This means that we could lose track of some resources and be unable to delete them.) We also prohibit updates to the stack from this point on; once an update has failed, you can only delete the stack.

We need to incrementally update the current template as resources are added, removed or modified. This will give us a valid picture of the true state when a failure occurs, allowing us to safely run updates in the future.
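As an illustration only (this is not Heat's code), the incremental bookkeeping described above could be sketched as follows. The function name `apply_update`, the `PENDING`/`COMPLETE` status strings, and the `apply_resource` callback are all hypothetical; the sketch assumes each resource is recorded before it is acted on, so that a half-created resource is never lost from the tracked template.

```python
import copy

# Hypothetical status markers for this sketch; not Heat's real resource states.
PENDING, COMPLETE = "PENDING", "COMPLETE"


def apply_update(tracked, target, apply_resource):
    """Move `tracked` toward `target` one resource at a time.

    tracked: dict name -> {"definition": ..., "status": ...}, mutated in place
             so it always reflects what has actually been attempted.
    target:  dict name -> desired definition.
    apply_resource(name, definition): hypothetical callback that performs the
             real change and raises on failure; definition=None means delete.
    Returns None on success, or the exception that stopped the update.
    """
    try:
        # Create new resources and modify changed ones.
        for name, definition in target.items():
            entry = tracked.get(name)
            if (entry and entry["status"] == COMPLETE
                    and entry["definition"] == definition):
                continue  # already in the desired state
            # Record the resource before acting on it, so it stays tracked
            # (and deletable) even if the operation fails half-way.
            tracked[name] = {"definition": copy.deepcopy(definition),
                             "status": PENDING}
            apply_resource(name, definition)
            tracked[name]["status"] = COMPLETE
        # Remove resources that are no longer in the target template.
        for name in [n for n in tracked if n not in target]:
            apply_resource(name, None)
            del tracked[name]
    except Exception as exc:
        return exc
    return None
```

Because `tracked` is accurate at the point of failure, calling `apply_update` again with the same or a different `target` only redoes the work that is not yet `COMPLETE`.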

Specification URL (additional information):

None

Comment 3 Zane Bitter 2014-10-20 17:33:25 UTC
The idea is that even if a resource fails during create or update, we should still be able to successfully run another update - with the same or a different template - and have the stack recover to the right state.

So some things that would be interesting to test are:
- Updating after a create failure
- Updating after an update failure with rollback disabled
- Updating after a rollback failure
- Update failures where the new template has added new parameters
- Update failures where the new template has removed existing parameters
- Update failures where parameter values are changing

BTW, one thing to note: when something fails, we now wait up to 4 minutes for other in-progress resources to complete rather than killing them immediately, since we hope to be able to recover.
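For readers who want a concrete picture of the grace-period behaviour, here is a minimal sketch using Python's standard `concurrent.futures` module. It is an assumption-laden illustration, not Heat's implementation: `run_with_grace`, the status strings, and the use of OS threads are all hypothetical choices made for this example; only the 4-minute figure comes from the comment above.

```python
import concurrent.futures as cf

GRACE_SECONDS = 240  # "up to 4 minutes", per the comment above


def run_with_grace(tasks, grace=GRACE_SECONDS):
    """Run tasks in parallel; after the first failure, let the rest finish.

    tasks: dict name -> zero-argument callable (hypothetical resource actions).
    Returns dict name -> "COMPLETE", "FAILED" or "TIMED_OUT".
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(tasks)))
    futures = {pool.submit(fn): name for name, fn in tasks.items()}

    # Block until some task raises, or until every task has finished.
    done, pending = cf.wait(futures, return_when=cf.FIRST_EXCEPTION)
    if pending:
        # Something failed: instead of abandoning the in-progress tasks
        # immediately, give them a bounded grace period to complete.
        finished, pending = cf.wait(pending, timeout=grace)
        done |= finished
    # Do not block on tasks that outlived the grace period.
    pool.shutdown(wait=False)

    result = {}
    for future in done:
        result[futures[future]] = "FAILED" if future.exception() else "COMPLETE"
    for future in pending:
        result[futures[future]] = "TIMED_OUT"
    return result
```

Tasks that finish inside the grace period end up with an accurate final status, which is what makes a later recovery update possible.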

Comment 6 errata-xmlrpc 2015-02-09 15:01:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0147.html