Bug 1216967

Summary: Stack is in CREATE_IN_PROGRESS although a resource is in CREATE_FAILED state after clearing breakpoints
Product: [Community] RDO Reporter: Udi Kalifon <ukalifon>
Component: openstack-heat Assignee: Zane Bitter <zbitter>
Status: CLOSED NOTABUG QA Contact: Amit Ugol <augol>
Severity: medium Docs Contact:
Priority: unspecified    
Version: trunk CC: jpeeler, yeylon
Target Milestone: ---   
Target Release: Kilo   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-07-08 21:32:55 UTC Type: Bug

Description Udi Kalifon 2015-04-29 10:22:45 UTC
Description of problem:
I tried to deploy an overcloud using the following hooks:
resource_registry:
  resources:
    "*NodesPostDeployment":
      "*_Step1":
        hooks: [pre-create, pre-update]
      "*_Step2":
        hooks: [pre-create, pre-update]
      "*_Step3":
        hooks: [pre-create, pre-update]
      "*_Step4":
        hooks: [pre-create, pre-update]
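
To make it clearer which resources these wildcard entries are meant to cover, here is a minimal sketch that checks "Parent/Child" paths against the patterns. It assumes shell-style glob semantics as provided by Python's fnmatch module; Heat's own matching may differ in detail. The names HOOK_PATTERNS and has_hook are hypothetical and exist only for this illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the environment above: parent pattern -> child patterns.
# HOOK_PATTERNS and has_hook are illustrative names, not part of Heat.
HOOK_PATTERNS = {
    "*NodesPostDeployment": ["*_Step1", "*_Step2", "*_Step3", "*_Step4"],
}


def has_hook(path):
    """Return True if a 'Parent/Child' path matches a parent and child pattern."""
    parent, _, child = path.partition("/")
    return any(
        fnmatchcase(parent, parent_pat)
        and any(fnmatchcase(child, child_pat) for child_pat in child_pats)
        for parent_pat, child_pats in HOOK_PATTERNS.items()
    )


if __name__ == "__main__":
    # Paths taken from the hook-clear commands in this report.
    for path in (
        "ObjectStorageNodesPostDeployment/StorageDeployment_Step1",
        "ObjectStorageNodesPostDeployment/StorageRingbuilderDeployment_Step2",
        "ControllerNodesPostDeployment/ControllerDeploymentLoadBalancer_Step1",
    ):
        print(path, has_hook(path))
```

Under that assumption, all three paths used with hook-clear in this report match a pattern pair, so each of them would pause at the pre-create hook.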

I then cleared 3 hooks, which left the third resource in CREATE_FAILED status:

heat hook-clear --pre-create overcloud ObjectStorageNodesPostDeployment/StorageDeployment_Step1
heat hook-clear --pre-create overcloud ObjectStorageNodesPostDeployment/StorageRingbuilderDeployment_Step2
heat hook-clear --pre-create overcloud ControllerNodesPostDeployment/ControllerDeploymentLoadBalancer_Step1
heat resource-list -n 5 overcloud | grep _Step
| StorageDeployment_Step1                     | cd31de94-2fac-4b6f-95b0-243fcca27553          | OS::Heat::StructuredDeployments                   | CREATE_COMPLETE    | 2015-04-29T06:49:18Z | ObjectStorageNodesPostDeployment  |
| StorageRingbuilderDeployment_Step2          | f485dccc-8eab-48f2-b74a-923f920799f2          | OS::Heat::StructuredDeployments                   | CREATE_COMPLETE    | 2015-04-29T06:49:18Z | ObjectStorageNodesPostDeployment  |
| ControllerDeploymentLoadBalancer_Step1      |                                               | OS::Heat::StructuredDeployments                   | CREATE_FAILED      | 2015-04-29T06:50:03Z | ControllerNodesPostDeployment     |

When running stack-list you can see that the stack is still in CREATE_IN_PROGRESS state:

heat stack-list
+--------------------------------------+------------+--------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        |
+--------------------------------------+------------+--------------------+----------------------+
| 1da41b7b-e6b4-42ec-9586-1f1d95dfa3e3 | overcloud  | CREATE_IN_PROGRESS | 2015-04-29T06:46:51Z |
+--------------------------------------+------------+--------------------+----------------------+

The stack should move to CREATE_FAILED if one of its resources is in a failed state.


How reproducible:
100%


Steps to reproduce:
1. Create a file called breakpoint.yaml containing the resource_registry snippet from the description above.
2. Edit the deployment script /bin/instack-deploy-overcloud and add "-e breakpoint.yaml" to the heat stack-create command
3. To see that the breakpoint was reached, run the command "heat resource-list overcloud" to find the resource id of the ObjectStorageNodesPostDeployment resource, and then run event-list on it. For example: heat event-list b647a518-ee71-48f4-8f82-9c8a3d8445b2
4. Clear the breakpoints with the same "heat hook-clear" commands as in the description.
5. Run a recursive resource-list to see that there is a resource in FAILED state: heat resource-list -n 5 overcloud | grep _Step
6. See that the stack's state is not FAILED: heat stack-list
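
Steps 5 and 6 can also be checked mechanically. The following is a rough sketch that works on the table text printed by the two commands (captured by hand or by any other means). The function names are hypothetical, and the column positions are assumptions taken from the output pasted in the description; other client versions may lay the tables out differently.

```python
def parse_rows(table_text):
    """Split pipe-delimited table rows into lists of stripped cells."""
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        # Skip blank lines and +----+ border lines.
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line.split("|")[1:-1]])
    return rows


def failed_resources(resource_table):
    """Names of resources whose status column (index 3, assumed) ends in _FAILED."""
    return [
        row[0]
        for row in parse_rows(resource_table)
        if len(row) > 3 and row[3].endswith("_FAILED")
    ]


def stack_status(stack_table, stack_name):
    """Status of the named stack; name is column 1 and status column 2 (assumed)."""
    for row in parse_rows(stack_table):
        if len(row) > 2 and row[1] == stack_name:
            return row[2]
    return None


def stuck_in_progress(resource_table, stack_table, stack_name):
    """True when a resource has failed but the stack still reports CREATE_IN_PROGRESS."""
    status = stack_status(stack_table, stack_name)
    return bool(failed_resources(resource_table)) and status == "CREATE_IN_PROGRESS"
```

Fed the resource-list and stack-list output from the description, stuck_in_progress(...) with stack name "overcloud" reports the condition described in this bug.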

Comment 1 Udi Kalifon 2015-04-29 10:33:29 UTC
Stack creation doesn't fail when breakpoints aren't used, so this is not just a matter of the wrong status being displayed.

Comment 2 Zane Bitter 2015-04-30 15:59:52 UTC
When a resource fails, we don't immediately stop processing other resources that are already in progress (although we don't start any new ones), since ideally we'd like them to complete successfully so they don't need to be replaced in a future update. At the moment we wait for up to 4 minutes before stopping anything still in progress, so you may well see a different result after 4 minutes.

(Ideally we wouldn't wait for hooks, but the code that controls the timeout currently has no way of knowing that a hook is what's being processed.)
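
The behaviour described in this comment can be modelled roughly as follows. This is an illustrative simulation only, not Heat's scheduler code: the Resource class, the simulate function, and the CANCELLED and NOT_STARTED labels are all made up for the sketch, and the 240-second grace period is simply the "up to 4 minutes" mentioned above. A resource paused at a hook is modelled as one that never finishes on its own.

```python
import math
from dataclasses import dataclass

GRACE_PERIOD = 240.0  # seconds; "up to 4 minutes" per the comment above


@dataclass
class Resource:
    name: str
    start_at: float      # time at which the resource would be started
    duration: float      # work time; math.inf models a resource paused at a hook
    fails: bool = False  # True if the resource ends in CREATE_FAILED


def simulate(resources, grace_period=GRACE_PERIOD):
    """Return (per-resource states, final stack status, time the stack settles)."""
    # The earliest failure stops new work and starts the grace period.
    first_failure = min(
        (r.start_at + r.duration for r in resources if r.fails),
        default=math.inf,
    )
    deadline = first_failure + grace_period

    states = {}
    settled_at = 0.0
    for r in resources:
        end = r.start_at + r.duration
        if r.start_at >= first_failure:
            states[r.name] = "NOT_STARTED"  # nothing new starts after a failure
            continue
        if end > deadline:
            states[r.name] = "CANCELLED"    # still running when the grace period ran out
            end = deadline
        elif r.fails:
            states[r.name] = "CREATE_FAILED"
        else:
            states[r.name] = "CREATE_COMPLETE"
        settled_at = max(settled_at, end)

    stack = "CREATE_FAILED" if first_failure < math.inf else "CREATE_COMPLETE"
    return states, stack, settled_at
```

For example, with one resource failing at t=10 and another paused at a hook, the model keeps the stack unsettled until t=250 (10 + 240). During that window a recursive resource-list would already show a CREATE_FAILED resource while the stack still reports CREATE_IN_PROGRESS, which is the situation in the description.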