Bug 1216967

Summary:	Stack is in CREATE_IN_PROGRESS although a resource is in CREATE_FAILED state after clearing breakpoints
Product:	[Community] RDO	Reporter:	Udi Kalifon <ukalifon>
Component:	openstack-heat	Assignee:	Zane Bitter <zbitter>
Status:	CLOSED NOTABUG	QA Contact:	Amit Ugol <augol>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	trunk	CC:	jpeeler, yeylon
Target Milestone:	---
Target Release:	Kilo
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-07-08 21:32:55 UTC	Type:	Bug

Description Udi Kalifon 2015-04-29 10:22:45 UTC

Description of problem:
I tried to deploy an overcloud using the following hooks:
resource_registry:
  resources:
    "*NodesPostDeployment":
      "*_Step1":
          hooks: [pre-create, pre-update]
      "*_Step2":
          hooks: [pre-create, pre-update]
      "*_Step3":
          hooks: [pre-create, pre-update]
      "*_Step4":
          hooks: [pre-create, pre-update]

I then cleared three hooks, which resulted in a CREATE_FAILED status on the third resource:

heat hook-clear --pre-create overcloud ObjectStorageNodesPostDeployment/StorageDeployment_Step1
heat hook-clear --pre-create overcloud ObjectStorageNodesPostDeployment/StorageRingbuilderDeployment_Step2
heat hook-clear --pre-create overcloud ControllerNodesPostDeployment/ControllerDeploymentLoadBalancer_Step1
heat resource-list -n 5 overcloud | grep _Step
| StorageDeployment_Step1                     | cd31de94-2fac-4b6f-95b0-243fcca27553          | OS::Heat::StructuredDeployments                   | CREATE_COMPLETE    | 2015-04-29T06:49:18Z | ObjectStorageNodesPostDeployment  |
| StorageRingbuilderDeployment_Step2          | f485dccc-8eab-48f2-b74a-923f920799f2          | OS::Heat::StructuredDeployments                   | CREATE_COMPLETE    | 2015-04-29T06:49:18Z | ObjectStorageNodesPostDeployment  |
| ControllerDeploymentLoadBalancer_Step1      |                                               | OS::Heat::StructuredDeployments                   | CREATE_FAILED      | 2015-04-29T06:50:03Z | ControllerNodesPostDeployment     |

Running stack-list shows that the stack is still in the CREATE_IN_PROGRESS state:

heat stack-list
+--------------------------------------+------------+--------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        |
+--------------------------------------+------------+--------------------+----------------------+
| 1da41b7b-e6b4-42ec-9586-1f1d95dfa3e3 | overcloud  | CREATE_IN_PROGRESS | 2015-04-29T06:46:51Z |
+--------------------------------------+------------+--------------------+----------------------+

The stack status should be CREATE_FAILED if any of its resources is in a failed state.


How reproducible:
100%


Steps to reproduce:
1. Create a file called breakpoint.yaml which has the contents from the description above.
2. Edit the deployment script /bin/instack-deploy-overcloud and add "-e breakpoint.yaml" to the heat stack-create command
3. To see that the breakpoint was reached, run the command "heat resource-list overcloud" to find the resource id of the ObjectStorageNodesPostDeployment resource, and then run event-list on it. For example: heat event-list b647a518-ee71-48f4-8f82-9c8a3d8445b2
4. Clear the breakpoints with the same "heat hook-clear" commands as in the description.
5. Run a recursive resource-list to see that there is a resource in CREATE_FAILED state: heat resource-list -n 5 overcloud | grep _Step
6. See that the stack's state is not FAILED: heat stack-list
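
Steps 5 and 6 can also be checked mechanically. The following is a minimal Python sketch, not part of Heat or its client: the function names are hypothetical, and it assumes the column order seen in the resource-list output in the description (name, physical ID, type, status, ...). It parses the table text and reports whether a resource has failed while the stack is still in progress.

```python
import re

# Matches statuses such as CREATE_COMPLETE, CREATE_FAILED, CREATE_IN_PROGRESS.
STATUS_RE = re.compile(r"[A-Z]+_(IN_PROGRESS|COMPLETE|FAILED)")


def parse_statuses(table_text):
    """Return {resource_name: status} from pipe-delimited resource-list output.

    Assumes the status is the fourth column, as in the output shown in
    the description. Border lines and header rows are skipped.
    """
    statuses = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border ("+---+") or blank line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 4 or not STATUS_RE.fullmatch(cells[3]):
            continue  # header row or unexpected layout
        statuses[cells[0]] = cells[3]
    return statuses


def failed_resources(statuses):
    """Names of resources whose status ends in _FAILED, in table order."""
    return [name for name, status in statuses.items()
            if status.endswith("_FAILED")]


def is_inconsistent(stack_status, statuses):
    """True if a resource has failed but the stack is still IN_PROGRESS."""
    return (stack_status.endswith("_IN_PROGRESS")
            and bool(failed_resources(statuses)))
```

Feed parse_statuses() the text printed by the command in step 5, and pass the stack_status value from step 6 to is_inconsistent().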

Comment 1 Udi Kalifon 2015-04-29 10:33:29 UTC

Stack creation doesn't fail when breakpoints are not used, so this is not just a matter of the wrong status being displayed...

Comment 2 Zane Bitter 2015-04-30 15:59:52 UTC

When a resource fails, we don't immediately stop processing other resources that are already in progress (although we don't start any new ones), since ideally we'd like them to complete successfully so that they don't need to be replaced in a future update. At the moment we wait up to 4 minutes before stopping anything still in progress, so you may well see a different result after 4 minutes.

(Ideally we wouldn't wait for hooks, but the code that controls the timeout currently has no way of knowing that a hook is what's being processed.)
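
The behaviour described in this comment can be illustrated with a small Python model. This is a simplified sketch, not Heat's actual code: the function name and the representation of resources as plain status strings are assumptions made for illustration, and the 240-second default simply restates the "up to 4 minutes" figure above.

```python
GRACE_PERIOD = 240  # seconds; the "up to 4 minutes" mentioned in this comment


def stack_status(resource_statuses, seconds_since_first_failure=None,
                 grace_period=GRACE_PERIOD):
    """Model the stack status implied by its resources' statuses.

    After a resource fails, resources already in progress (including one
    waiting on a hook) keep the stack IN_PROGRESS until the grace period
    expires; only then is the stack reported as failed.
    """
    in_progress = any(s == "CREATE_IN_PROGRESS" for s in resource_statuses)
    failed = any(s == "CREATE_FAILED" for s in resource_statuses)

    if failed:
        elapsed = seconds_since_first_failure or 0
        if in_progress and elapsed < grace_period:
            return "CREATE_IN_PROGRESS"  # still waiting for in-flight work
        return "CREATE_FAILED"

    if all(s == "CREATE_COMPLETE" for s in resource_statuses):
        return "CREATE_COMPLETE"
    return "CREATE_IN_PROGRESS"
```

In this model, the situation in the report (one resource failed, another still waiting on an uncleared hook) yields CREATE_IN_PROGRESS until the grace period has elapsed, and CREATE_FAILED afterwards.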