Bug 1216967 - Stack is in CREATE_IN_PROGRESS although a resource is in CREATE_FAILED state after clearing breakpoints
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: RDO
Classification: Community
Component: openstack-heat
Version: trunk
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: Kilo
Assignee: Zane Bitter
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-04-29 10:22 UTC by Udi Kalifon
Modified: 2016-04-26 19:26 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-07-08 21:32:55 UTC
Embargoed:


Description Udi Kalifon 2015-04-29 10:22:45 UTC
Description of problem:
I tried to deploy an overcloud using the following hooks:
resource_registry:
  resources:
    "*NodesPostDeployment":
      "*_Step1":
        hooks: [pre-create, pre-update]
      "*_Step2":
        hooks: [pre-create, pre-update]
      "*_Step3":
        hooks: [pre-create, pre-update]
      "*_Step4":
        hooks: [pre-create, pre-update]
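
To illustrate which resources these wildcard entries are meant to cover, here is a minimal Python sketch. It assumes the patterns behave like shell-style globs and approximates them with the standard library's fnmatchcase; Heat's own matching code may differ. HOOK_PATTERNS and hook_applies are hypothetical names used only for this illustration, and "ControllerBootstrap" is a made-up resource name.

```python
from fnmatch import fnmatchcase

# Patterns copied from the environment file above.
# Assumption: they are glob-style patterns (not confirmed against Heat's code).
HOOK_PATTERNS = {
    "*NodesPostDeployment": ["*_Step1", "*_Step2", "*_Step3", "*_Step4"],
}


def hook_applies(parent_name, resource_name, patterns=HOOK_PATTERNS):
    """Return True if some pattern pair covers resource_name inside parent_name."""
    for parent_pattern, child_patterns in patterns.items():
        if not fnmatchcase(parent_name, parent_pattern):
            continue
        if any(fnmatchcase(resource_name, p) for p in child_patterns):
            return True
    return False


if __name__ == "__main__":
    print(hook_applies("ObjectStorageNodesPostDeployment", "StorageDeployment_Step1"))
    print(hook_applies("ControllerNodesPostDeployment", "ControllerBootstrap"))
```

Under that assumption, every *_StepN resource nested in a *NodesPostDeployment resource stops at a breakpoint, which is why each one has to be cleared individually with hook-clear.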

I then cleared three hooks, which left the third resource in CREATE_FAILED status:

heat hook-clear --pre-create overcloud ObjectStorageNodesPostDeployment/StorageDeployment_Step1
heat hook-clear --pre-create overcloud ObjectStorageNodesPostDeployment/StorageRingbuilderDeployment_Step2
heat hook-clear --pre-create overcloud ControllerNodesPostDeployment/ControllerDeploymentLoadBalancer_Step1
heat resource-list -n 5 overcloud | grep _Step
| StorageDeployment_Step1                     | cd31de94-2fac-4b6f-95b0-243fcca27553          | OS::Heat::StructuredDeployments                   | CREATE_COMPLETE    | 2015-04-29T06:49:18Z | ObjectStorageNodesPostDeployment  |
| StorageRingbuilderDeployment_Step2          | f485dccc-8eab-48f2-b74a-923f920799f2          | OS::Heat::StructuredDeployments                   | CREATE_COMPLETE    | 2015-04-29T06:49:18Z | ObjectStorageNodesPostDeployment  |
| ControllerDeploymentLoadBalancer_Step1      |                                               | OS::Heat::StructuredDeployments                   | CREATE_FAILED      | 2015-04-29T06:50:03Z | ControllerNodesPostDeployment     |

When running stack-list you can see that the stack is still in CREATE_IN_PROGRESS state:

heat stack-list
+--------------------------------------+------------+--------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        |
+--------------------------------------+------------+--------------------+----------------------+
| 1da41b7b-e6b4-42ec-9586-1f1d95dfa3e3 | overcloud  | CREATE_IN_PROGRESS | 2015-04-29T06:46:51Z |
+--------------------------------------+------------+--------------------+----------------------+

The stack status should be CREATE_FAILED if one of the resources in it is in a failed state.


How reproducible:
100%


Steps to reproduce:
1. Create a file called breakpoint.yaml which has the contents from the description above.
2. Edit the deployment script /bin/instack-deploy-overcloud and add "-e breakpoint.yaml" to the heat stack-create command
3. To see that the breakpoint was reached, run the command "heat resource-list overcloud" to find the resource id of the ObjectStorageNodesPostDeployment resource, and then run event-list on it. For example: heat event-list b647a518-ee71-48f4-8f82-9c8a3d8445b2
4. Clear the breakpoints with the same "heat hook-clear" commands as in the description.
5. Run a recursive resource-list to confirm that a resource is in a FAILED state: heat resource-list -n 5 overcloud | grep _Step
6. See that the stack's state is not FAILED: heat stack-list
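
The check in steps 5 and 6 can be sketched in Python by parsing the table text that the CLI prints. This is only an illustration: it assumes the column order seen in the resource-list rows above (name, physical id, type, status, time, parent), and the function names are hypothetical rather than part of any Heat client library.

```python
# Rows copied from the resource-list output in the description.
SAMPLE = """
| StorageDeployment_Step1 | cd31de94-2fac-4b6f-95b0-243fcca27553 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-29T06:49:18Z | ObjectStorageNodesPostDeployment |
| StorageRingbuilderDeployment_Step2 | f485dccc-8eab-48f2-b74a-923f920799f2 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-29T06:49:18Z | ObjectStorageNodesPostDeployment |
| ControllerDeploymentLoadBalancer_Step1 | | OS::Heat::StructuredDeployments | CREATE_FAILED | 2015-04-29T06:50:03Z | ControllerNodesPostDeployment |
"""


def parse_table_rows(text):
    """Split each '|'-delimited table line into a list of stripped cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip blank lines and +----+ borders
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    return rows


def failed_resources(resource_list_output):
    """Return names of resources whose status (4th column) ends in _FAILED."""
    return [
        cells[0]
        for cells in parse_table_rows(resource_list_output)
        if len(cells) >= 4 and cells[3].endswith("_FAILED")
    ]


def stack_looks_inconsistent(stack_status, resource_list_output):
    """True when the stack is still in progress although a resource failed."""
    return stack_status.endswith("_IN_PROGRESS") and bool(
        failed_resources(resource_list_output)
    )


if __name__ == "__main__":
    print(failed_resources(SAMPLE))
    print(stack_looks_inconsistent("CREATE_IN_PROGRESS", SAMPLE))
```

With the sample rows from this report, the only failed resource is ControllerDeploymentLoadBalancer_Step1, and the check flags the CREATE_IN_PROGRESS stack status as inconsistent with it.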

Comment 1 Udi Kalifon 2015-04-29 10:33:29 UTC
Stack creation doesn't fail when breakpoints are not used, so this is not just a case of the wrong status being displayed...

Comment 2 Zane Bitter 2015-04-30 15:59:52 UTC
When a resource fails, we don't immediately stop other resources that are already in progress (though we don't start any new ones), since ideally we'd like them to complete successfully so that they don't need to be replaced in a future update. At the moment we wait for up to 4 minutes before stopping anything in progress, so you may well see a different result after 4 minutes.

(Ideally we wouldn't wait for hooks, but the code that controls the timeout currently has no way of knowing that a hook is what's being processed.)
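
The timing described above can be pictured with a toy model in Python. It is not Heat's scheduler code: the function name, the use of plain seconds, and the convention that None marks a resource that never finishes on its own (for example, one waiting at a hook) are all assumptions made for this sketch. Only the 4-minute figure comes from the comment.

```python
GRACE_SECONDS = 4 * 60  # the "up to 4 minutes" wait mentioned above


def stack_status_at(t, failure_time, finish_times, grace=GRACE_SECONDS):
    """Toy model of the reported stack status at time t (in seconds).

    failure_time: when the first resource failed.
    finish_times: finish time of each resource that was in progress at the
        moment of failure; None means it never finishes by itself
        (hypothetical convention for a resource blocked on a hook).
    """
    if t < failure_time:
        return "CREATE_IN_PROGRESS"
    deadline = failure_time + grace
    if any(f is None for f in finish_times):
        # Something never completes, so the full grace period elapses.
        settled_at = deadline
    else:
        # Settle when the last in-progress resource finishes, capped at the deadline.
        settled_at = min(deadline, max(finish_times, default=failure_time))
    return "CREATE_IN_PROGRESS" if t < settled_at else "CREATE_FAILED"


if __name__ == "__main__":
    # A failure at t=60 while another resource is still waiting at a hook.
    for t in (30, 100, 299, 300):
        print(t, stack_status_at(t, 60, [None]))
```

In this model, a resource left waiting at a breakpoint keeps the stack in CREATE_IN_PROGRESS for the whole grace period after the failure, which matches the window in which the report's stack-list output was captured.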

