Bug 1338002 - The "openstack overcloud deploy" command fails in one environment (while it works in all others) after the timeout period; the overcloud is in state CREATE_FAILED. This applies to both RHEL-OSP7 and RHEL-OSP8.
Summary: The "openstack overcloud deploy" command fails in one environment (while it w...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Angus Thomas
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-20 16:34 UTC by Joachim von Thadden
Modified: 2016-06-11 13:28 UTC
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-11 13:28:03 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Joachim von Thadden 2016-05-20 16:34:00 UTC
Description of problem:

The "openstack overcloud deploy" command fails in one environment after the timeout period. The overcloud is in state CREATE_FAILED. This applies both to RHEL-OSP7 and RH-OSP8.

The same setup in some other environments rolls out without problems. When the deployment fails, you can see that the task NetworkDeployment is never completed.

In the following we are looking at the controller-0 node; the same applies to all other controller and compute nodes:

[root@rhosp-director ~]# heat resource-list 25a2e3e6-10ad-4a85-8edf-af60e19b3814
+--------------------------+--------------------------------------+-------------------------------------------------+--------------------+---------------------+
| resource_name            | physical_resource_id                 | resource_type                                   | resource_status    | updated_time        |
+--------------------------+--------------------------------------+-------------------------------------------------+--------------------+---------------------+
| ControllerConfig         |                                      | OS::Heat::StructuredConfig                      | INIT_COMPLETE      | 2016-05-12T15:47:53 |
| ControllerDeployment     |                                      | OS::TripleO::SoftwareDeployment                 | INIT_COMPLETE      | 2016-05-12T15:47:53 |
| ControllerExtraConfigPre |                                      | OS::TripleO::ControllerExtraConfigPre           | INIT_COMPLETE      | 2016-05-12T15:47:53 |
| InternalApiPort          | 744a06dc-a243-4d37-beb3-8b64b467320f | OS::TripleO::Controller::Ports::InternalApiPort | CREATE_COMPLETE    | 2016-05-12T15:47:53 |
| NetIpMap                 | ea00f870-0552-4111-8d40-d93923c5892c | OS::TripleO::Network::Ports::NetIpMap           | CREATE_COMPLETE    | 2016-05-12T15:47:53 |
| NetIpSubnetMap           | 97dc7472-727a-4dac-89db-a488a36e4cb8 | OS::TripleO::Network::Ports::NetIpSubnetMap     | CREATE_COMPLETE    | 2016-05-12T15:47:53 |
| NetworkConfig            | 9aaa2bb2-6c28-4541-9c52-7bdb73d32de5 | OS::TripleO::Controller::Net::SoftwareConfig    | CREATE_COMPLETE    | 2016-05-12T15:47:53 |
| NetworkDeployment        | dbfa3900-2ee6-4882-b270-a4b4caf99a97 | OS::TripleO::SoftwareDeployment                 | CREATE_IN_PROGRESS | 2016-05-12T15:47:53 |
| NodeExtraConfig          |                                      | OS::TripleO::NodeExtraConfig                    | INIT_COMPLETE      | 2016-05-12T15:47:53 |
| NodeTLSCAData            |                                      | OS::TripleO::NodeTLSCAData                      | INIT_COMPLETE      | 2016-05-12T15:47:53 |
| NodeTLSData              |                                      | OS::TripleO::NodeTLSData                        | INIT_COMPLETE      | 2016-05-12T15:47:53 |
| Controller               | 7f1a1e62-550c-4ea6-9443-b7c3b60b5dde | OS::Nova::Server                                | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| ExternalPort             | adcefda5-254c-445e-9b16-e0057cfe3478 | OS::TripleO::Controller::Ports::ExternalPort    | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| ManagementPort           | 9330a086-928f-4a12-9b6e-5a16dc65a4d9 | OS::TripleO::Controller::Ports::ManagementPort  | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| NodeAdminUserData        | d3d341da-aa71-4208-9984-3548b57d906f | OS::TripleO::NodeAdminUserData                  | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| NodeUserData             | 756f2dee-0711-4c29-aece-b0ebd96c26e8 | OS::TripleO::NodeUserData                       | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| StorageMgmtPort          | 5e04f446-1aab-4330-adee-5524ea1e9608 | OS::TripleO::Controller::Ports::StorageMgmtPort | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| StoragePort              | 2d27bd7a-90d2-4e96-aee1-37dec5454de7 | OS::TripleO::Controller::Ports::StoragePort     | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| TenantPort               | ef6aef68-aa58-4ac4-bba4-4e6940cc79f4 | OS::TripleO::Controller::Ports::TenantPort      | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| UpdateConfig             | 2fa8f9cc-c18e-470e-ae95-cd2bc0eaeb80 | OS::TripleO::Tasks::PackageUpdate               | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| UpdateDeployment         | ce1455b2-2de6-4120-9cec-475eb9f2355b | OS::Heat::SoftwareDeployment                    | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
| UserData                 | f27e6851-dbb6-488a-a915-6b6aa638ecc4 | OS::Heat::MultipartMime                         | CREATE_COMPLETE    | 2016-05-12T15:47:54 |
+--------------------------+--------------------------------------+-------------------------------------------------+--------------------+---------------------+
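
The stuck resource can be inspected in more detail on the undercloud (a sketch; the stack ID is the controller-0 sub-stack used above, and resource-show prints the resource_status_reason among other fields):

heat resource-show 25a2e3e6-10ad-4a85-8edf-af60e19b3814 NetworkDeployment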


We have validated that the controller-0 node does in fact send a completion notice back to the director, which is acknowledged successfully:

[heat-admin@overcloud-controller-0 ~]$ journalctl -u os-collect-config | grep heat-config-notify
May 12 11:56:17 overcloud-controller-0.localdomain os-collect-config[3613]: [2016-05-12 11:56:17,932] (heat-config) [DEBUG] Running heat-config-notify /var/run/heat-config/deployed/bfa80838-99b3-43ab-a3fb-49c6689cbbfb.json < /var/run/heat-config/deployed/bfa80838-99b3-43ab-a3fb-49c6689cbbfb.notify.json
May 12 11:56:18 overcloud-controller-0.localdomain os-collect-config[3613]: [2016-05-12 11:56:18,750] (heat-config) [DEBUG] [2016-05-12 11:56:18,270] (heat-config-notify) [DEBUG] Signaling to http://192.168.105.30:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A544f4d0524ae4ebfa78d00aded709ab4%3Astacks%2Fovercloud-Controller-5sql3qekulel-0-tj6lej3cyni7%2F25a2e3e6-10ad-4a85-8edf-af60e19b3814%2Fresources%2FUpdateDeployment?Timestamp=2016-05-12T15%3A47%3A54Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=06374abd3b284618a5b9f419437ac942&SignatureVersion=2&Signature=f3NjhsXrbg7iWBVZQ2ShYof%2B3zZYlRgLxN44ywz2N4M%3D via POST
May 12 11:56:18 overcloud-controller-0.localdomain os-collect-config[3613]: [2016-05-12 11:56:18,728] (heat-config-notify) [DEBUG] Response <Response [200]>


However, on the director side, the heat engine never triggers the next task:

[root@rhosp-director ~]# grep req-07110575-33ef-4a3a-aa9d-a760e7a3bd18 /var/log/heat/*
/var/log/heat/heat-api.log-20160513:2016-05-12 17:56:12.686 32678 DEBUG oslo_policy._cache_handler [req-07110575-33ef-4a3a-aa9d-a760e7a3bd18 7473c157226146489c6ef48a90903071 544f4d0524ae4ebfa78d00aded709ab4] Reloading cached file /etc/heat/policy.json read_cached_file /usr/lib/python2.7/site-packages/oslo_policy/_cache_handler.py:38
/var/log/heat/heat-api.log-20160513:2016-05-12 17:56:12.688 32678 DEBUG oslo_policy.policy [req-07110575-33ef-4a3a-aa9d-a760e7a3bd18 7473c157226146489c6ef48a90903071 544f4d0524ae4ebfa78d00aded709ab4] Reloaded policy file: /etc/heat/policy.json _load_policy_file /usr/lib/python2.7/site-packages/oslo_policy/policy.py:441
/var/log/heat/heat-api.log-20160513:2016-05-12 17:56:12.689 32678 DEBUG heat.common.wsgi [req-07110575-33ef-4a3a-aa9d-a760e7a3bd18 7473c157226146489c6ef48a90903071 544f4d0524ae4ebfa78d00aded709ab4] Calling <heat.api.openstack.v1.events.EventController object at 0x4905550> : index __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:851
/var/log/heat/heat-api.log-20160513:2016-05-12 17:56:12.690 32678 DEBUG oslo_messaging._drivers.amqpdriver [req-07110575-33ef-4a3a-aa9d-a760e7a3bd18 7473c157226146489c6ef48a90903071 544f4d0524ae4ebfa78d00aded709ab4] MSG_ID is 33502df678094fab988dac0c9ef8b136 _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:392
/var/log/heat/heat-api.log-20160513:2016-05-12 17:56:12.731 32678 DEBUG heat.common.serializers [req-07110575-33ef-4a3a-aa9d-a760e7a3bd18 7473c157226146489c6ef48a90903071 544f4d0524ae4ebfa78d00aded709ab4] JSON response : {"events": []} to_json /usr/lib/python2.7/site-packages/heat/common/serializers.py:42
/var/log/heat/heat-api.log-20160513:2016-05-12 17:56:12.732 32678 DEBUG eventlet.wsgi.server [req-07110575-33ef-4a3a-aa9d-a760e7a3bd18 7473c157226146489c6ef48a90903071 544f4d0524ae4ebfa78d00aded709ab4] 192.168.105.30 - - [12/May/2016 17:56:12] "GET /v1/544f4d0524ae4ebfa78d00aded709ab4/stacks/overcloud-Controller-5sql3qekulel-1-bhid3zenxssk/0b051d2f-4a8f-4284-946b-ac48148a7749/events?sort_dir=asc&marker=36607635-b55b-4d7e-8d3d-3a16499ed338 HTTP/1.1" 200 227 0.047566 write /usr/lib/python2.7/site-packages/heat/common/wsgi.py:267
[root@rhosp-director ~]#


We have checked the logs and found that every task which is started is in fact completed. Nevertheless, heat does not advance. The following script will show any started but not completed stack task (or just dots for completed ones):

[root@rhosp-director ~]# for n in `zcat /var/log/heat/heat-engine.log-20160513.gz|grep "stack_task from Stack"|grep Controller|cut -d\" -f2|sort|uniq`; do zgrep "$n" /var/log/heat/heat-engine.log-20160513.gz|grep "stack_task from Stack"|grep "complete" >/dev/null; if [ "$?" = "0" ]; then echo -n "."; else echo $n; fi; done

It results in only dots in the working as well as in the broken environment.
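
For readability, the same check can also be written as a multi-line script (a sketch equivalent to the one-liner above; the log file name is the rotated heat-engine log from the failing run):

#!/bin/bash
# Print a dot for every Controller stack task that reached "complete",
# and the task name for every task that was started but never completed.
LOG=/var/log/heat/heat-engine.log-20160513.gz

for n in $(zcat "$LOG" | grep "stack_task from Stack" | grep Controller | cut -d\" -f2 | sort -u); do
    if zgrep "$n" "$LOG" | grep "stack_task from Stack" | grep -q "complete"; then
        echo -n "."
    else
        echo "$n"
    fi
done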


---- ALL LOG SNIPPETS BELOW THIS LINE REFER TO THE ENVIRONMENT WHERE THE ROLLOUT IS WORKING ----

In the working environment, you can see a large JSON response object to the heat-config-notify request from the controller; in the non-working environment this response is nearly empty:

[root@rhosp-director ~]# grep req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 /var/log/heat/*
/var/log/heat/heat-api.log:2016-05-13 10:16:53.554 32756 DEBUG oslo_policy._cache_handler [req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 624c205b78ad4a04969d4b1953933d03 1d420ee76b1a4946a15876641162367d] Reloading cached file /etc/heat/policy.json read_cached_file /usr/lib/python2.7/site-packages/oslo_policy/_cache_handler.py:38
/var/log/heat/heat-api.log:2016-05-13 10:16:53.556 32756 DEBUG oslo_policy.policy [req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 624c205b78ad4a04969d4b1953933d03 1d420ee76b1a4946a15876641162367d] Reloaded policy file: /etc/heat/policy.json _load_policy_file /usr/lib/python2.7/site-packages/oslo_policy/policy.py:441
/var/log/heat/heat-api.log:2016-05-13 10:16:53.557 32756 DEBUG heat.common.wsgi [req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 624c205b78ad4a04969d4b1953933d03 1d420ee76b1a4946a15876641162367d] Calling <heat.api.openstack.v1.events.EventController object at 0x33f8b50> : index __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:851
/var/log/heat/heat-api.log:2016-05-13 10:16:53.559 32756 DEBUG oslo_messaging._drivers.amqpdriver [req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 624c205b78ad4a04969d4b1953933d03 1d420ee76b1a4946a15876641162367d] MSG_ID is a17eda6d623c4f5ab7f1db78406ecf69 _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:392
/var/log/heat/heat-api.log:2016-05-13 10:16:54.134 32756 DEBUG heat.common.serializers [req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 624c205b78ad4a04969d4b1953933d03 1d420ee76b1a4946a15876641162367d] JSON response : {"events": [{"resource_name": "Controller", "event_time": "2016-05-13T08:16:21", "links": [{"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/Controller/events/f66ec878-1a9d-4b41-bfbb-6a5f4f8b5d09", "rel": "self"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/Controller", "rel": "resource"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb", "rel": "stack"}], "logical_resource_id": "Controller", "resource_status": "CREATE_COMPLETE", "resource_status_reason": "state changed", "physical_resource_id": "6cf4180e-8985-47af-9214-bb8fac15474a", "id": "f66ec878-1a9d-4b41-bfbb-6a5f4f8b5d09"}, {"resource_name": "StorageMgmtPort", "event_time": "2016-05-13T08:16:23", "links": [{"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/StorageMgmtPort/events/31e83d46-5d88-4f64-a794-dcdc35027720", "rel": "self"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/StorageMgmtPort", "rel": "resource"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb", "rel": "stack"}], "logical_resource_id": "StorageMgmtPort", "resource_status": "CREATE_IN_PROGRESS", "re
source_status_reason": "state changed", "physical_resource_id": null, "id": "31e83d46-5d88-4f64-a794-dcdc35027720"}, {"resource_name": "ManagementPort", "event_time": "2016-05-13T08:16:27", "links": [{"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/ManagementPort/events/e2e66a90-768a-4061-a5d8-4267d4a7c7d3", "rel": "self"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/ManagementPort", "rel": "resource"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb", "rel": "stack"}], "logical_resource_id": "ManagementPort", "resource_status": "CREATE_IN_PROGRESS", "resource_status_reason": "state changed", "physical_resource_id": null, "id": "e2e66a90-768a-4061-a5d8-4267d4a7c7d3"}, {"resource_name": "StoragePort", "event_time": "2016-05-13T08:16:32", "links": [{"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/StoragePort/events/3472b400-d68f-40ff-9397-3b8fb96bc29a", "rel": "self"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/StoragePort", "rel": "resource"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb", "rel": "stack"}], "logical_resource_id": "StoragePort", "resource_status": "CREATE_IN_PROGRESS", "resource_status_reason": "state changed", "physical_resource_id": null, "id": "3472b400-d68f-40ff-9397-3b8fb96bc29a"}, {"resource_name": "Te
nantPort", "event_time": "2016-05-13T08:16:36", "links": [{"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/TenantPort/events/627ec3f7-8d2d-4c05-ad03-b34a84696e61", "rel": "self"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/TenantPort", "rel": "resource"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb", "rel": "stack"}], "logical_resource_id": "TenantPort", "resource_status": "CREATE_IN_PROGRESS", "resource_status_reason": "state changed", "physical_resource_id": null, "id": "627ec3f7-8d2d-4c05-ad03-b34a84696e61"}, {"resource_name": "ExternalPort", "event_time": "2016-05-13T08:16:43", "links": [{"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/ExternalPort/events/be5bcf31-2d7e-4757-94db-5473ef46aecc", "rel": "self"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/ExternalPort", "rel": "resource"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb", "rel": "stack"}], "logical_resource_id": "ExternalPort", "resource_status": "CREATE_IN_PROGRESS", "resource_status_reason": "state changed", "physical_resource_id": null, "id": "be5bcf31-2d7e-4757-94db-5473ef46aecc"}, {"resource_name": "UpdateDeployment", "event_time": "2016-05-13T08:16:48", "links": [{"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcl
oud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/UpdateDeployment/events/73b434e3-1725-4438-bb75-66f8323fbb98", "rel": "self"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/UpdateDeployment", "rel": "resource"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb", "rel": "stack"}], "logical_resource_id": "UpdateDeployment", "resource_status": "CREATE_IN_PROGRESS", "resource_status_reason": "state changed", "physical_resource_id": null, "id": "73b434e3-1725-4438-bb75-66f8323fbb98"}, {"resource_name": "InternalApiPort", "event_time": "2016-05-13T08:16:53", "links": [{"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/InternalApiPort/events/463a4bfd-35a2-499a-b012-a9c200bc94d9", "rel": "self"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/resources/InternalApiPort", "rel": "resource"}, {"href": "http://192.168.105.30:8004/v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb", "rel": "stack"}], "logical_resource_id": "InternalApiPort", "resource_status": "CREATE_IN_PROGRESS", "resource_status_reason": "state changed", "physical_resource_id": null, "id": "463a4bfd-35a2-499a-b012-a9c200bc94d9"}]} to_json /usr/lib/python2.7/site-packages/heat/common/serializers.py:42
/var/log/heat/heat-api.log:2016-05-13 10:16:54.135 32756 DEBUG eventlet.wsgi.server [req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 624c205b78ad4a04969d4b1953933d03 1d420ee76b1a4946a15876641162367d] 192.168.105.30 - - [13/May/2016 10:16:54] "GET /v1/1d420ee76b1a4946a15876641162367d/stacks/overcloud-Controller-kplc32v5g5s4-1-sdmy3dhxhh6y/81d2dbc9-c15f-49be-81e8-a290428c45cb/events?sort_dir=asc&marker=5e1cc1b7-f4b2-426a-b3ce-6f3b96520484 HTTP/1.1" 200 7718 0.584925 write /usr/lib/python2.7/site-packages/heat/common/wsgi.py:267
[root@rhosp-director ~]# grep a17eda6d623c4f5ab7f1db78406ecf69 /var/log/heat/*
/var/log/heat/heat-api.log:2016-05-13 10:16:53.559 32756 DEBUG oslo_messaging._drivers.amqpdriver [req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 624c205b78ad4a04969d4b1953933d03 1d420ee76b1a4946a15876641162367d] MSG_ID is a17eda6d623c4f5ab7f1db78406ecf69 _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:392
[root@rhosp-director ~]# zgrep a17eda6d623c4f5ab7f1db78406ecf69 /var/log/heat/*
/var/log/heat/heat-api.log:2016-05-13 10:16:53.559 32756 DEBUG oslo_messaging._drivers.amqpdriver [req-5ab9fe43-7ec1-4c05-8379-8c6c6a3d1ba4 624c205b78ad4a04969d4b1953933d03 1d420ee76b1a4946a15876641162367d] MSG_ID is a17eda6d623c4f5ab7f1db78406ecf69 _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:392
[root@rhosp-director ~]#


Version-Release number of selected component (if applicable):

We have two hardware environments named "inf43" and "storm1". Each is in fact a single physical machine on which we use KVM to create multiple virtual machines in which we install (among other components) RHEL-OSP in a 1 director, 3 controller and 2 compute nodes configuration. The intent is to use nested virtualisation to run testing workloads on top of RHEL-OSP.

Both environments are created, configured and rolled out via Ansible scripts, i.e. the mechanism to create the VMs, to configure their networks and to deploy OpenStack is 100% the same. The Ansible scripts as well as a target architecture description are available from https://github.com/wrichter/hailstorm.

The machines on which the VMs are running are both inside the Red Hat VPN and can be made accessible to support and engineering if need be.

Inside the machines, there are the following networks - realised as libvirt virtual networks:

nic1: rhosp_provisioning: 192.168.105.0/24
nic2: (not used) services: 192.168.101.0/24
nic3: admin: 192.168.103.0/24
nic4: rhosp_internal: 192.168.106.0/24

The rhosp-director node has the IP addresses 192.168.103.30 and 192.168.105.30.
The controller/compute nodes are allocated IPs on the provisioning network, e.g. controller-0 has 192.168.105.45.
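
For reference, the libvirt network definitions and any NAT rules present on the virtualization host can be inspected as follows (a sketch; the libvirt network names are assumed to match the names listed above and may differ in the actual setup):

virsh net-list --all
virsh net-dumpxml rhosp_provisioning                      # shows the <ip> range and whether <forward mode='nat'/> is used
iptables -t nat -S POSTROUTING | grep 192.168.105.0/24    # any MASQUERADE rules affecting the provisioning subnet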


How reproducible:

The behaviour is 100% repeatable and has been tested over more than 10 attempts: the deployment works on the inf43 machine and fails on the storm1 machine. There is no relation to the time of day.

Steps to Reproduce:
1. openstack overcloud deploy

Actual results:
The deployment stalls until the timeout (4 hours); the overcloud ends up in state CREATE_FAILED.

Expected results:
The overcloud is installed successfully.

Additional info:
This is part of an overall Red Hat product demo called Hailstorm, which deploys Satellite, IPA, RHEV, OSP, OSE and CF in a nested testing environment.

I am attaching SOS reports from the working (inf43) and the broken (storm1) environment: the director and controller-0 for each environment.

Comment 3 Joachim von Thadden 2016-05-20 17:10:49 UTC
For us it looks like a bug in heat, as the sub-stack is in creation but never gets created. The resource is just not there, while resource-list states it is in state CREATE_IN_PROGRESS, and the creation never happens. BUT: we only see this on the very fast machine storm1, which has huge amounts of RAM and CPU and only SSDs for storage.

Comment 4 Joachim von Thadden 2016-05-20 22:58:52 UTC
The SOS reports of the very same run on working hardware can be found here:

Working Controller: https://drive.google.com/a/redhat.com/file/d/0BwBxey3wId2LME1jWFpSSXppT2M/view?usp=sharing_eid&ts=573f962c

Working Director: https://drive.google.com/a/redhat.com/file/d/0BwBxey3wId2LZFFQcHdCMDJCeDg/view?usp=sharing_eid&ts=573f9639

Comment 5 Hugh Brock 2016-05-23 08:02:31 UTC
Joachim, we don't have permission to see the SOS reports, can you make them visible?

Comment 6 Joachim von Thadden 2016-05-23 16:54:01 UTC
The link to the whole folder is https://drive.google.com/open?id=0BwBxey3wId2LUldTazRyeE0tWWs and is open for Red Hatters. Sorry for the confusion!

Comment 7 Thomas Hervé 2016-05-24 08:02:22 UTC
I'm still investigating, but I've found the following error in the controller: 

WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(110, 'Connection timed out'))
WARNING os-collect-config [-] Source [ec2] Unavailable.

The metadata service seems to be running in nova, but we don't see any requests to it. AFAICT the firewall configuration seems good too, and the controller gets the proper route to 169.254.169.254, but it doesn't manage to connect to it.
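
A quick way to verify this from the controller (a sketch; 169.254.169.254 is the standard metadata address queried by the os-collect-config ec2 collector that logged the timeout above):

ip route get 169.254.169.254                                             # confirm which interface and gateway would be used
curl -m 10 -v http://169.254.169.254/2009-04-04/meta-data/instance-id   # expected to time out in the broken environment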

Comment 8 Joachim von Thadden 2016-05-24 08:18:49 UTC
Hmmm, I see this warning on all my recent installations without any further problems arising.

Comment 9 Joachim von Thadden 2016-05-24 11:25:32 UTC
From a chat and screen session:
(11:14:39) therve: OK I'm somewhat out of my area of expertise now
(11:14:48) therve: But you need os-apply-config  --key instance-id --type raw --key-default '' to return something on the controller
(11:15:07) therve: I don't know why it doesn't work, but I think the non-access to the ec2 API is the reason
(11:15:19) therve: IE curl http://169.254.169.254/2009-04-04/meta-data/instance-id should return something


I will look into that and come back...

Comment 10 Joachim von Thadden 2016-06-09 13:12:10 UTC
Digging into it I found

[root@overcloud-controller-0 instance]# os-apply-config  --key instance-id 
[2016/06/09 09:05:18 AM] [ERROR] key instance-id does not exist in [u'/var/lib/os-collect-config/heat_local.json', u'/var/lib/os-collect-config/cfn.json', u'/var/lib/os-collect-config/NetworkDeployment.json']

which is truly a no-go: the instance-id is in none of the above-mentioned JSON files. The unavailability of

curl http://169.254.169.254/2009-04-04/meta-data/instance-id

is not the problem, as the director does not use EC2 network access but a config drive, which can be found on the first partition and holds the right instance-id key:

[root@overcloud-controller-0 instance]# fdisk -l /dev/vda

Disk /dev/vda: 64.4 GB, 64424509440 bytes, 125829120 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000e3f0a

   Device Boot      Start         End      Blocks   Id  System
/dev/vda1            2048        4095        1024   83  Linux
/dev/vda2   *        4096   125828639    62912272   83  Linux

[root@overcloud-controller-0 instance]# mount /dev/vda1 /mnt/

[root@overcloud-controller-0 instance]# grep -r instance-id /mnt
/mnt/ec2/2009-04-04/meta-data.json:{"reservation-id": "r-kqrzflhq", "hostname": "overcloud-controller-0", "security-groups": ["default"], "ramdisk-id": "ari-00000003", "public-ipv4": "", "ami-manifest-path": "FIXME", "instance-type": "control", "instance-id": "i-00000004", "local-ipv4": "192.168.105.45", "local-hostname": "overcloud-controller-0", "placement": {"availability-zone": "nova"}, "ami-launch-index": 0, "public-hostname": "overcloud-controller-0", "public-keys": {"0": {"openssh-key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6Z8jsLAJsTzLaPCgvjQjEIA/GPLM+Etz26U404m49g1iiXD5HaXpgRUlr3L47q6SvaiAqudduV/UN5yavJtFSHY0xtluEmZFT6v4ajgkEbADaC5Y75MbKT6Xg69v9cnmoFUlYNsiAYryDuGktvg8YjuLZh7sX+CDtX9cBPg0NEqs0c+y1pzS1ZFNjxDmqYiezZAuTsDmDMHe3uP31aYiZioS+JqCWvl/2b16w5yVSKN12mOJg9QtMlSPqDlaQFhRFYayhA/SjYJmX4dch7+XymnkNfNXUYMx6IFlDbL1hAyL+3BjwJFpR0Pt2t98xvEdq4uXCd89J129+vBqaglVH wolfram", "_name": "0=default"}}, "ami-id": "ami-00000001", "kernel-id": "aki-00000002", "instance-action": "none", "block-device-mapping": {"ami": "sda", "root": "/dev/sda"}}
/mnt/ec2/latest/meta-data.json:{"reservation-id": "r-kqrzflhq", "hostname": "overcloud-controller-0", "security-groups": ["default"], "ramdisk-id": "ari-00000003", "public-ipv4": "", "ami-manifest-path": "FIXME", "instance-type": "control", "instance-id": "i-00000004", "local-ipv4": "192.168.105.45", "local-hostname": "overcloud-controller-0", "placement": {"availability-zone": "nova"}, "ami-launch-index": 0, "public-hostname": "overcloud-controller-0", "public-keys": {"0": {"openssh-key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6Z8jsLAJsTzLaPCgvjQjEIA/GPLM+Etz26U404m49g1iiXD5HaXpgRUlr3L47q6SvaiAqudduV/UN5yavJtFSHY0xtluEmZFT6v4ajgkEbADaC5Y75MbKT6Xg69v9cnmoFUlYNsiAYryDuGktvg8YjuLZh7sX+CDtX9cBPg0NEqs0c+y1pzS1ZFNjxDmqYiezZAuTsDmDMHe3uP31aYiZioS+JqCWvl/2b16w5yVSKN12mOJg9QtMlSPqDlaQFhRFYayhA/SjYJmX4dch7+XymnkNfNXUYMx6IFlDbL1hAyL+3BjwJFpR0Pt2t98xvEdq4uXCd89J129+vBqaglVH wolfram", "_name": "0=default"}}, "ami-id": "ami-00000001", "kernel-id": "aki-00000002", "instance-action": "none", "block-device-mapping": {"ami": "sda", "root": "/dev/sda"}}

So the question remains: why is the config drive not read, or not fully read, on this installation, while it is on other installations?

Can anyone give me a hint where to dig deeper into this?
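
A few checks that may narrow this down (a sketch; the paths are the ones already seen above plus the os-collect-config configuration file, assuming its default location):

cat /etc/os-collect-config.conf                   # which collectors (heat_local, ec2, cfn, ...) are configured
ls -l /var/lib/os-collect-config/                 # cached metadata written by each collector
grep -ri instance-id /var/lib/os-collect-config/  # confirms the key is indeed missing from the cached data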

Comment 11 Joachim von Thadden 2016-06-09 15:31:29 UTC
I also see the following as the last line in /var/log/cloud-init.log:

May 24 04:31:44 overcloud-controller-0 cloud-init: Cloud-init v. 0.7.6 finished at Tue, 24 May 2016 08:31:43 +0000. Datasource DataSourceConfigDriveNet [net,ver=2][source=/dev/vda1].  Up 50.28 seconds

So it seems that cloud-init grabbed the right drive. Nevertheless it did not read the instance-id...
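
A way to cross-check what cloud-init itself recorded (a sketch; the paths are cloud-init's standard state directory, assuming defaults):

ls -l /var/lib/cloud/instance            # symlink named after the instance-id cloud-init detected
cat /var/lib/cloud/data/instance-id      # the instance-id cloud-init extracted from the config drive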

Comment 12 Joachim von Thadden 2016-06-11 13:28:03 UTC
It turned out that this was a bridging problem. After some searching we figured out that the rule

iptables -t nat -D POSTROUTING -s 192.168.105.0/24 ! -d 192.168.105.0/24 -j MASQUERADE

on the virtualization host led to requests from the VMs going out with a masqueraded IP and never getting answered. The reason for that lies in the bridge module, which sets

net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-arptables = 1

as the default. And this is widely discussed in 

https://bugzilla.redhat.com/show_bug.cgi?id=634736

The difference on the failing machine is simply that it has no bridges defined in /etc/sysconfig/network-scripts and thus loads the bridge module too late!
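
For anyone hitting the same symptom, the relevant state can be checked and changed on the virtualization host roughly like this (a sketch; whether disabling the bridge-netfilter calls is appropriate depends on the host's firewall requirements, see the bug linked above):

lsmod | grep -E 'bridge|br_netfilter'         # the netfilter hooks only exist once the bridge module is loaded
sysctl net.bridge.bridge-nf-call-iptables     # 1 means bridged VM traffic is passed through iptables (and the MASQUERADE rule)
sysctl -w net.bridge.bridge-nf-call-iptables=0
sysctl -w net.bridge.bridge-nf-call-ip6tables=0
sysctl -w net.bridge.bridge-nf-call-arptables=0   # persist via /etc/sysctl.d/ if the setting should survive reboots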

OpenStack now deploys successfully. Sorry for bothering everyone, and thanks to Thomas Hervé for the hint that led us in the right direction.

