Bug 1042199

Summary: [RFE][heat]: Troubleshooting: pause on error
Product: Red Hat OpenStack
Reporter: RHOS Integration <rhos-integ>
Component: openstack-heat
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: unspecified
Version: unspecified
CC: markmc, sbaker, shardy, yeylon
Target Milestone: ga
Keywords: FutureFeature
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
URL: https://blueprints.launchpad.net/heat/+spec/troubleshooting-low-level-control
Whiteboard: upstream_milestone_juno-rc1 upstream_status_implemented upstream_definition_approved
Fixed In Version: openstack-heat-2014.2.1-5.el7ost
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-09 20:04:55 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description RHOS Integration 2013-12-12 21:21:04 UTC
Cloned from launchpad blueprint https://blueprints.launchpad.net/heat/+spec/troubleshooting-low-level-control.

Description:

Problem:
When a Heat template is deployed and an error occurs, the VMs are rolled back and deleted.  Server logs can help determine the problem, but we often need to log into the VMs being deployed to debug the scripts and environment.  This blueprint proposes pausing the template deployment at the point of error so that the user can inspect the partial stack and determine the problem.
 
Proposed support:
Command line + API:

   - stack-create: add a debugging option
     - error during template validation: pause, point to the exact failure in the template (with a better message and a suggested solution), and continue the stack create once the template is corrected
     - error during deployment: the Heat engine pauses the deployment, leaving all resources/components as they are.  The failed template is shown in PAUSED_ERROR state

   - stack-show: inspect the template and show the current state of each resource and component:
     - successfully deployed
     - error with message
     - not yet deployed

   - log collection: collect logs as they become available

   - stack-continue: new option to continue the deployment
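
The pause-and-continue behaviour proposed above can be illustrated with a small, self-contained Python model. This is only a sketch of the idea, not Heat code: the Stack and Resource classes, the resume() method, and every state name other than PAUSED_ERROR are hypothetical and do not correspond to any Heat API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical per-resource states; only PAUSED_ERROR comes from the blueprint.
NOT_DEPLOYED = "NOT_YET_DEPLOYED"
DEPLOYED = "DEPLOYED"
ERROR = "ERROR"


@dataclass
class Resource:
    name: str
    create: Callable[[], None]  # raises an exception on failure
    state: str = NOT_DEPLOYED
    message: str = ""


@dataclass
class Stack:
    resources: List[Resource]
    state: str = "CREATE_IN_PROGRESS"

    def deploy(self) -> str:
        """Create resources in order; pause at the first error, no rollback."""
        for res in self.resources:
            if res.state == DEPLOYED:
                continue  # created before an earlier pause, leave as is
            try:
                res.create()
            except Exception as exc:
                res.state, res.message = ERROR, str(exc)
                self.state = "PAUSED_ERROR"
                return self.state
            res.state, res.message = DEPLOYED, ""
        self.state = "CREATE_COMPLETE"
        return self.state

    def show(self) -> Dict[str, str]:
        """Rough analogue of stack-show: one status string per resource."""
        return {
            r.name: f"{r.state}: {r.message}" if r.message else r.state
            for r in self.resources
        }

    def resume(self) -> str:
        """Rough analogue of the proposed stack-continue."""
        if self.state != "PAUSED_ERROR":
            raise RuntimeError("stack is not paused")
        return self.deploy()
```

In this model a failed create leaves earlier resources in place, show() reports the three per-resource conditions listed above, and resume() retries from the failed resource after the user has fixed the underlying problem.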

 
Related blueprint:  
Use stack-update to attempt recovery of failed create or update
https://blueprints.launchpad.net/heat/+spec/retry-failed-update
 
Concern:
How to handle concurrency when pausing deployment:  
   - do nothing:  sufficient in many cases
   - serialize deployment during debugging: deterministic behavior, but not guaranteed to reproduce the error
   - trace/replay deployment
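
As a rough illustration of the "serialize deployment" option, the sketch below derives one fixed creation order from a resource dependency graph, breaking ties alphabetically so that repeated debugging runs visit resources in the same sequence. The function name serialized_order and the dependency-map format are assumptions made for this example; they are not taken from Heat.

```python
import heapq
from typing import Dict, List, Set


def serialized_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return one deterministic creation order for a dependency graph.

    depends_on maps each resource name to the names it requires.
    Ties between ready resources are broken alphabetically.
    """
    remaining = {name: set(deps) for name, deps in depends_on.items()}
    # Resources that appear only as requirements have no dependencies.
    for deps in depends_on.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    ready = [name for name, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        current = heapq.heappop(ready)
        order.append(current)
        del remaining[current]
        # Release any resource whose last unmet dependency was just created.
        for name, deps in remaining.items():
            if current in deps:
                deps.remove(current)
                if not deps:
                    heapq.heappush(ready, name)
    if remaining:
        raise ValueError(f"dependency cycle among: {sorted(remaining)}")
    return order
```

A fixed order makes a debugging session repeatable, but, as noted above, an error that only appears when resources are created concurrently may not show up once the deployment is serialized.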

Specification URL (additional information):

None

Comment 2 Scott Lewis 2015-02-09 20:04:55 UTC
This bug has been closed as a part of the RHEL-OSP 6 general availability release. For details, see https://rhn.redhat.com/errata/rhel7-rhos-6-errata.html