Bug 1306857

Summary: heat stack-delete of overcloud not cleaning up overcloud baremetal nodes
Product: Red Hat OpenStack
Reporter: lokesh.jain
Component: rhosp-director
Assignee: Angus Thomas <athomas>
Status: CLOSED NOTABUG
QA Contact: Shai Revivo <srevivo>
Severity: high
Priority: low
Version: 7.0 (Kilo)
Target Release: 10.0 (Newton)
Hardware: x86_64
OS: Linux
CC: aguetta, athomas, dbecker, jcoufal, lokesh.jain, mburns, morazi, pcaruana, rhel-osp-director-maint, rkharwar, sbaker
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-10-13 19:48:23 UTC

Description lokesh.jain 2016-02-11 23:21:41 UTC
After running "heat stack-delete overcloud", the baremetal nodes still have the images from the previous deployment, which causes the subsequent deployment to fail.

Steps to reproduce:

1. Deploy overcloud on OSP-Director 7.2 with "openstack overcloud deploy --templates --control-scale 1 --compute-scale 2 --ceph-storage-scale 0 --block-storage-scale 0 --swift-storage-scale 0 --ntp-server pool.ntp.org"

2. Do a heat stack-delete overcloud to remove the previous deployment

3. Run introspection of the nodes "openstack baremetal introspection bulk start"

4. Re-deploy with the same nodes and same command "openstack overcloud deploy --templates --control-scale 1 --compute-scale 2 --ceph-storage-scale 0 --block-storage-scale 0 --swift-storage-scale 0 --ntp-server pool.ntp.org"

After these steps, CREATE failed with:
"message": "No valid host was found. Exceeded max scheduling attempts 3 for instance e10e19a2-aefb-4a98-9536-a81793773938."

The nodes still had the image from the previous deployment.

heat-api.log details:

2016-02-11 17:56:59.340 5612 INFO eventlet.wsgi.server [-] (5612) accepted ('192.0.2.1', 46766)
2016-02-11 17:56:59.342 5612 DEBUG heat.api.middleware.version_negotiation [-] Processing request: GET /v1/df00efd4218041be9abfc70e6c05f210/stacks Accept: application/json process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:50
2016-02-11 17:56:59.342 5612 DEBUG heat.api.middleware.version_negotiation [-] Matched versioned URI. Version: 1.0 process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:65
2016-02-11 17:56:59.343 5612 DEBUG keystoneclient.auth.identity.v2 [-] Making authentication request to http://192.0.2.1:35357/v2.0/tokens get_auth_ref /usr/lib/python2.7/site-packages/keystoneclient/auth/identity/v2.py:76
2016-02-11 17:56:59.512 5612 DEBUG keystoneclient.session [-] REQ: curl -g -i -X GET http://192.0.2.1:35357/v3/auth/tokens -H "X-Subject-Token: {SHA1}2984da1d915161a3f91a87e64c4bc1d3a7759427" -H "User-Agent: python-keystoneclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}03eedc6987f3137e1206798d66044c45fa1ba215" _http_log_request /usr/lib/python2.7/site-packages/keystoneclient/session.py:195
2016-02-11 17:56:59.599 5612 DEBUG keystoneclient.session [-] RESP: [200] content-length: 6349 x-subject-token: {SHA1}2984da1d915161a3f91a87e64c4bc1d3a7759427 vary: X-Auth-Token connection: keep-alive date: Thu, 11 Feb 2016 22:56:59 GMT content-type: application/json x-openstack-request-id: req-4e563a4d-050a-4f90-97c5-93c68d33e9cd 
RESP BODY: {"token": {"methods": ["password", "token"], "roles": [{"id": "9fe2ff9ee4384b1894a90878d3e92bab", "name": "_member_"}, {"id": "1f36114af125490e964851e05a972259", "name": "admin"}], "expires_at": "2016-02-12T02:56:59.000000Z", "project": {"domain": {"id": "default", "name": "Default"}, "id": "df00efd4218041be9abfc70e6c05f210", "name": "admin"}, "catalog": "<removed>", "extras": {}, "user": {"domain": {"id": "default", "name": "Default"}, "id": "b56faf710c2f476bad2199f2fc6d8127", "name": "admin"}, "audit_ids": ["ypUarebwRgOQZa0zeWMl1g"], "issued_at": "2016-02-11T22:56:59.316033"}}
 _http_log_response /usr/lib/python2.7/site-packages/keystoneclient/session.py:224
2016-02-11 17:56:59.603 5612 DEBUG heat.openstack.common.policy [req-0839eff5-5166-4694-8b63-a5c72cf10bcf b56faf710c2f476bad2199f2fc6d8127 df00efd4218041be9abfc70e6c05f210] Rules successfully reloaded _load_policy_file /usr/lib/python2.7/site-packages/heat/openstack/common/policy.py:295
2016-02-11 17:56:59.604 5612 INFO heat.openstack.common.policy [req-0839eff5-5166-4694-8b63-a5c72cf10bcf b56faf710c2f476bad2199f2fc6d8127 df00efd4218041be9abfc70e6c05f210] Can not find policy directory: policy.d
2016-02-11 17:56:59.605 5612 DEBUG heat.common.wsgi [req-0839eff5-5166-4694-8b63-a5c72cf10bcf b56faf710c2f476bad2199f2fc6d8127 df00efd4218041be9abfc70e6c05f210] Calling <heat.api.openstack.v1.stacks.StackController object at 0x3152890> : index __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:667
2016-02-11 17:56:59.606 5612 INFO heat.openstack.common.policy [req-0839eff5-5166-4694-8b63-a5c72cf10bcf b56faf710c2f476bad2199f2fc6d8127 df00efd4218041be9abfc70e6c05f210] Can not find policy directory: policy.d
2016-02-11 17:56:59.607 5612 DEBUG oslo_messaging._drivers.amqpdriver [req-0839eff5-5166-4694-8b63-a5c72cf10bcf b56faf710c2f476bad2199f2fc6d8127 df00efd4218041be9abfc70e6c05f210] MSG_ID is a2430edbb84c472da9fe38ed61fff872 _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:311
2016-02-11 17:56:59.607 5612 DEBUG oslo_messaging._drivers.amqp [req-0839eff5-5166-4694-8b63-a5c72cf10bcf b56faf710c2f476bad2199f2fc6d8127 df00efd4218041be9abfc70e6c05f210] UNIQUE_ID is 209c449205d44d46a118c994a1bd8513. _add_unique_id /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py:258
2016-02-11 17:56:59.691 5612 DEBUG heat.common.serializers [req-0839eff5-5166-4694-8b63-a5c72cf10bcf b56faf710c2f476bad2199f2fc6d8127 df00efd4218041be9abfc70e6c05f210] JSON response : {"stacks": [{"parent": null, "description": "Nova API,Keystone,Heat Engine and API,Glance,Neutron,Dedicated MySQL server,Dedicated RabbitMQ Server,Group of Nova Computes\n", "links": [{"href": "http://192.0.2.1:8004/v1/df00efd4218041be9abfc70e6c05f210/stacks/overcloud/29f0da81-5e96-43ef-83cc-c31ca9f127f0", "rel": "self"}], "stack_status_reason": "Resource CREATE failed: ResourceInError: resources.Compute.resources[0].resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. Exceeded max scheduling attempts 3 for instance 4be4a480-a82b-45d0-b574-e7b558cf600c. Last exception: [u'Traceback (most recent call last): \\n', u'  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 2261, in _do, Code: 500\"", "stack_name": "overcloud", "stack_user_project_id": "39c7227225eb489b8282175e54aff5cc", "creation_time": "2016-02-05T18:14:44Z", "updated_time": null, "stack_owner": "admin", "stack_status": "CREATE_FAILED", "id": "29f0da81-5e96-43ef-83cc-c31ca9f127f0"}]} to_json /usr/lib/python2.7/site-packages/heat/common/serializers.py:42
2016-02-11 17:56:59.692 5612 INFO eventlet.wsgi.server [req-0839eff5-5166-4694-8b63-a5c72cf10bcf b56faf710c2f476bad2199f2fc6d8127 df00efd4218041be9abfc70e6c05f210] 192.0.2.1 - - [11/Feb/2016 17:56:59] "GET /v1/df00efd4218041be9abfc70e6c05f210/stacks HTTP/1.1" 200 1228 0.351209

Comment 2 Steve Baker 2016-02-12 03:27:36 UTC
You shouldn't need to redo introspection at step 3. Assuming the stack was deleted successfully, it could be that some ironic nodes went to an ERROR state.

At step 3, please confirm the following:
"heat stack-list" is empty
"nova list" is empty
"ironic node-list" has all nodes in power off & available
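The three checks above can be sketched as a small shell helper. This is only a sketch: check_nodes is a hypothetical name, and the parsing assumes the default table output of "ironic node-list" from the OSP 7-era client (column order: UUID, Name, Instance UUID, Power State, Provisioning State, Maintenance).

```shell
# Hypothetical helper: reads "ironic node-list" table output on stdin and
# prints the name of every node that is NOT back to "power off" + "available".
check_nodes() {
  awk -F'|' '
    NF > 6 && $2 !~ /UUID/ {
      # Trim padding from the Name, Power State and Provisioning State columns.
      gsub(/^ +| +$/, "", $3); gsub(/^ +| +$/, "", $5); gsub(/^ +| +$/, "", $6)
      if ($5 != "power off" || $6 != "available") print $3
    }'
}

# Usage on the undercloud (also confirm "heat stack-list" and "nova list"
# print empty tables):
#   ironic node-list | check_nodes
```

An empty result from check_nodes, together with empty "heat stack-list" and "nova list" output, would indicate the previous deployment was cleaned up.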

Comment 3 lokesh.jain 2016-02-22 19:57:10 UTC
I am not able to confirm the previous stack-delete issue because I am running into this bug now: https://bugzilla.redhat.com//show_bug.cgi?id=1259834.
Will update the bug when I am able to reproduce this.

Comment 4 Mike Burns 2016-04-07 21:11:06 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 6 Ruchika K 2016-05-12 13:55:58 UTC
Able to reproduce this bug with relative ease.

Steve:
Here is the info you requested
[stack@undercloud ~]$ heat stack-list
+----+------------+--------------+---------------+--------------+
| id | stack_name | stack_status | creation_time | updated_time |
+----+------------+--------------+---------------+--------------+
+----+------------+--------------+---------------+--------------+
[stack@undercloud ~]$ ironic node-list
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name              | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| ce438c41-50e1-444b-a60c-3e6e0fdbae77 | over8-controller1 | None                                 | power off   | available          | False       |
| 4aa8d990-3a93-4c4e-98e1-e9f9277a6a1e | over8-controller2 | 77d25a45-8671-41ec-a7e8-aeed5282e9d1 | power on    | active             | False       |
| 98324b7c-0730-4c37-838b-c85fef324a08 | over8-controller3 | 6303ec52-a8d8-4b34-9599-107cc576c521 | power on    | deploy failed      | False       |
| 98c8bd64-095a-4268-8f29-b1bfab41f250 | over8-ceph1       | None                                 | power off   | available          | False       |
| 76e7fe8a-084a-4cfd-b00b-6ab116f4228b | over8-ceph2       | None                                 | power off   | available          | False       |
| b3fc91c4-66a0-4a58-b981-60175f8ed6e4 | over8-ceph3       | a52d88f3-0ea2-4b8d-b558-fe234d6db1ff | power on    | active             | False       |
| dfb7ef4a-0576-4a38-b78d-cd7ade6c595b | over8-compute1    | c5fff957-af09-45b7-9fdb-b6ba780b883c | power on    | active             | False       |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
[stack@undercloud ~]$ nova list
+----+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+----+------+--------+------------+-------------+----------+
+----+------+--------+------------+-------------+----------+

Comment 7 Aviv Guetta 2016-08-04 07:28:00 UTC
We have the same issue in OSPd 8 environment.
The behavior is the same; the logs are a bit different (different release), but it is the same 'error 500' message:
2016-08-02 22:06:46 [ControllerClusterDeployment]: CREATE_COMPLETE  state changed
2016-08-02 22:06:53 [NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-08-02 22:06:53 [NovaCompute]: DELETE_IN_PROGRESS  state changed
2016-08-02 22:06:55 [NovaCompute]: DELETE_COMPLETE  state changed
2016-08-02 22:07:13 [NovaCompute]: CREATE_IN_PROGRESS  state changed
2016-08-02 22:14:09 [NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-08-02 22:14:09 [NovaCompute]: DELETE_IN_PROGRESS  state changed
2016-08-02 22:14:12 [NovaCompute]: DELETE_COMPLETE  state changed
2016-08-02 22:14:45 [NovaCompute]: CREATE_IN_PROGRESS  state changed
2016-08-02 22:21:55 [NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-08-02 22:21:56 [overcloud-Compute-jinfa62mh2y4-0-wwcirrnn5yhi]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-08-02 22:21:57 [0]: CREATE_FAILED  ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-08-02 22:21:58 [overcloud-Compute-jinfa62mh2y4]: UPDATE_FAILED  ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
Stack overcloud CREATE_FAILED
Heat Stack create failed.

Comment 15 Steve Baker 2016-10-11 20:11:28 UTC
After the stack-delete, if "nova list" returns no results, then as far as heat is aware the delete was a success.

There may be manual cleanup required before attempting the next deploy. This cleanup will consist of looking at the output of "ironic node-list" and running ironic node commands to get all nodes back to a good state, specifically:

- Any nodes in "Maintenance:True" need "ironic node-set-maintenance <node> False"
- Any nodes in "Power State:power on" need "ironic node-set-power-state <node> off"
- Any nodes not in "Provision State:available" need "ironic node-set-provision-state <node> deleted"
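Those three fixes can be sketched as a helper that generates the commands from "ironic node-list" output. This is an assumption-laden sketch: emit_cleanup is a hypothetical name, it assumes the client's default table layout, and it only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical helper: reads "ironic node-list" table output on stdin and
# prints (never runs) the cleanup command each node needs, per the three
# rules above.
emit_cleanup() {
  awk -F'|' '
    NF > 6 && $2 !~ /UUID/ {
      # Trim padding from UUID, Power State, Provisioning State, Maintenance.
      for (i = 2; i <= 7; i++) gsub(/^ +| +$/, "", $i)
      if ($7 == "True")      print "ironic node-set-maintenance "     $2 " False"
      if ($5 == "power on")  print "ironic node-set-power-state "     $2 " off"
      if ($6 != "available") print "ironic node-set-provision-state " $2 " deleted"
    }'
}

# Usage: ironic node-list | emit_cleanup    # then review and run the output
```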

You can then confirm that nova has the required capacity to deploy the cloud by running "nova hypervisor-stats".

Later versions of OSP do an available node check before deploying the overcloud, which leads to early failure and a more obvious error message if there are not enough nodes available.

Comment 16 Jaromir Coufal 2016-10-13 19:48:23 UTC
Heat cleans up only nova instances; some manual cleanup (especially of ironic nodes) is required, as per comment #15.