Bug 1299613 - rhel-osp-director: Scale-up Ceph from 1 to 3 fails, when Overcloud is deployed with SSL (resources.EndpointMap: Timed out) .
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 7.0 (Kilo)
Hardware: x86_64 Linux
Priority: urgent
Severity: high
Target Milestone: z4
Target Release: 7.0 (Kilo)
Assigned To: Ben Nemec
QA Contact: Amit Ugol
Keywords: TestOnly, ZStream
Depends On: 1302880 1305947 1308562 1309823
Blocks: 1309816
Reported: 2016-01-18 13:45 EST by Omri Hochman
Modified: 2016-04-26 21:14 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1301627 1301629 1302593 1309816
Environment:
Last Closed: 2016-02-18 11:43:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---


Attachments
heat-engine.log (14.44 MB, application/x-bzip), 2016-01-18 14:49 EST, Omri Hochman
heat-engine.log.gz (1.37 MB, application/x-gzip), 2016-01-27 14:38 EST, Giulio Fidente


External Trackers
Tracker ID Priority Status Summary Last Updated
OpenStack gerrit 275437 None None None 2016-02-02 18:41 EST

Description Omri Hochman 2016-01-18 13:45:09 EST
rhel-osp-director:  Scale-up Ceph from 1 to 3 fails, when Overcloud is deployed with SSL (resources.EndpointMap: Timed out) .


Environment (backport-job 7.3):
------------------------------
instack-0.0.7-2.el7ost.noarch
instack-undercloud-2.1.2-37.el7ost.noarch
python-rdomanager-oscplugin-0.0.10-25.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-2015.1.2-6.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch
openstack-heat-common-2015.1.2-6.el7ost.noarch
openstack-heat-engine-2015.1.2-6.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-103.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.2-6.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-6.el7ost.noarch


Steps:
-------
(1) Deploy overcloud with SSL with 1 Ceph node.
(2) Attempt to scale up Ceph from 1 to 3 nodes.

Results:
--------
Stack update failed due to resources.EndpointMap: Timed out.


[stack@undercloud72 ~]$  openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /home/stack/ssl-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml -e ~/ssl-heat-templates/environments/enable-tls.yaml -e ~/ssl-heat-templates/environments/inject-trust-anchor.yaml --ntp-server 10.5.26.10 --neutron-network-type vxlan --neutron-tunnel-types vxlan --timeout 90
Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
Stack failed with status: MessagingTimeout: resources.EndpointMap: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
ERROR: openstack Heat Stack update failed.



[stack@undercloud72 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks              |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| 93eaf3e9-5b91-4531-9a16-fc9783117b12 | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.7  |
| cd26fb71-8d0a-4358-96de-a0aeb33dd773 | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
| 5825914e-cc18-49a7-9d7f-57ef2c6ef0f6 | overcloud-cephstorage-2 | ACTIVE | -          | Running     | ctlplane=192.168.0.12 |
| 1c604744-a783-4ab9-9a27-a55dab04e76a | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.0.8  |
| db28e938-d175-4add-9931-399538a8352c | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.0.11 |
| e6f7754e-c12f-4124-84e4-23e5916e2d57 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.0.10 |
| b4cab872-ffe2-49ef-bfd7-fa11718bf2aa | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.0.9  |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+


[stack@undercloud72 ~]$ heat resource-list -n5 overcloud | grep FAIL
| CephStorage | f5edaacd-b79e-4d03-adc3-3c22582b509e | OS::Heat::ResourceGroup  | UPDATE_FAILED | 2016-01-06T00:13:01Z | |
| EndpointMap | 2c7f6eba-03cf-4a83-b424-a882c03d2452 | OS::TripleO::EndpointMap | UPDATE_FAILED | 2016-01-06T00:13:37Z | |

[stack@undercloud72 ~]$ heat stack-show overcloud | grep -i status
| stack_status        | UPDATE_FAILED |
| stack_status_reason | MessagingTimeout: resources.EndpointMap: Timed out

[stack@undercloud72 ~]$ heat resource-show overcloud 2c7f6eba-03cf-4a83-b424-a882c03d2452
Stack or resource not found: overcloud 2c7f6eba-03cf-4a83-b424-a882c03d2452
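
The `grep FAIL` filter used above can also be applied to saved command output. The following is a minimal sketch, assuming the pipe-delimited row layout shown in the `heat resource-list` output above (name, physical ID, type, status, timestamp, and a trailing column). The function name `failed_resources` is hypothetical and is not part of python-heatclient.

```python
def failed_resources(table_text):
    """Return (name, physical_id, type, status) for rows whose status ends in _FAILED.

    Assumes the pipe-delimited layout printed by `heat resource-list` above.
    """
    failed = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip table borders, shell prompts, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 4:
            continue  # not a resource row
        name, physical_id, resource_type, status = cells[:4]
        if status.endswith("_FAILED"):
            failed.append((name, physical_id, resource_type, status))
    return failed


# Rows copied from the output in this report.
sample = """
| CephStorage | f5edaacd-b79e-4d03-adc3-3c22582b509e | OS::Heat::ResourceGroup  | UPDATE_FAILED | 2016-01-06T00:13:01Z | |
| EndpointMap | 2c7f6eba-03cf-4a83-b424-a882c03d2452 | OS::TripleO::EndpointMap | UPDATE_FAILED | 2016-01-06T00:13:37Z | |
"""

for row in failed_resources(sample):
    print(row)
```

Working from captured text rather than calling the Heat API keeps the sketch independent of any particular client version.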


heat-engine.log:
-----------------
2016-01-05 19:14:39.558 6951 DEBUG heat.engine.scheduler [-] Task update_task from Stack "overcloud-CephStorage-e6bsjx7luyo2-0-qkhbl6bpdjgg" [a6535d9f-7f18-45c5-9c4d-0fe30a850fb3] complete step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:226
2016-01-05 19:14:39.584 6951 DEBUG heat.engine.stack_lock [-] Engine 952fae67-e632-4cbc-95d8-71043d152d18 released lock on stack a6535d9f-7f18-45c5-9c4d-0fe30a850fb3 release /usr/lib/python2.7/site-packages/heat/engine/stack_lock.py:132
2016-01-05 19:14:39.620 6951 ERROR heat.engine.resources.stack_resource [-] update_stack
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource Traceback (most recent call last):
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 402, in update_with_template
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     args)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 226, in update_stack
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     args=args))
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 51, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     return client.call(ctxt, method, **kwargs)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 393, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     return self.prepare().call(ctxt, method, **kwargs)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     retry=self.retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     timeout=timeout, retry=retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     retry=retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     result = self._waiter.wait(msg_id, timeout)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     message = self.waiters.get(msg_id, timeout=timeout)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     'to message ID %s' % msg_id)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource MessagingTimeout: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource 
2016-01-05 19:14:39.635 6951 INFO heat.engine.resource [-] UPDATE: OS::TripleO::EndpointMap "EndpointMap" [2c7f6eba-03cf-4a83-b424-a882c03d2452] Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb]
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource Traceback (most recent call last):
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 528, in _action_recorder
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     yield
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 796, in update
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     args=[after, tmpl_diff, prop_diff])
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 296, in wrapper
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     step = next(subtask)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 569, in action_handler_task
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     handler_data = handler(*args)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/template_resource.py", line 290, in handle_update
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     self.child_params())
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 405, in update_with_template
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     self.raise_local_exception(ex)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 288, in raise_local_exception
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     raise ex
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource MessagingTimeout: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource 
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancel
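
Because the attached heat-engine.log is large, it can help to scan it for `MessagingTimeout` lines instead of reading it by hand. The following is a minimal sketch, assuming the line layout shown in the excerpt above (timestamp, PID, level, logger name, message). The helper name `find_timeouts` is hypothetical.

```python
import re

# Matches lines such as:
#   2016-01-05 19:14:39.620 6951 TRACE heat.engine.resource MessagingTimeout: Timed out waiting for a reply to message ID <hex>
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \d+ \w+ (?P<logger>\S+) "
    r"MessagingTimeout: Timed out waiting for a reply to message ID (?P<msg_id>[0-9a-f]+)"
)


def find_timeouts(lines):
    """Yield (timestamp, logger, message_id) for each MessagingTimeout line."""
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            yield match.group("ts"), match.group("logger"), match.group("msg_id")


# Two lines copied from the excerpt above; only the first is a timeout line.
excerpt = [
    "2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource "
    "MessagingTimeout: Timed out waiting for a reply to message ID "
    "9f543dfa60b7439789e06acbc8768548",
    "2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     raise ex",
]

for timestamp, logger, msg_id in find_timeouts(excerpt):
    print(timestamp, logger, msg_id)
```

For a real log, an open file object can be passed in place of `excerpt`, since the function only iterates over lines.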
Comment 1 Omri Hochman 2016-01-18 14:49 EST
Created attachment 1115955
heat-engine.log

heat-engine.log
Comment 2 Giulio Fidente 2016-01-27 08:18:09 EST
This same problem exists when trying to scale Ceph storage nodes in a regular IPv4 deployment.
Comment 3 Giulio Fidente 2016-01-27 14:36:24 EST
The same error happens with 2, 4, 6, or 12 workers, and it is always the same resource that fails with the "Timed out" message:

https://github.com/openstack/tripleo-heat-templates/blob/master/network/endpoints/endpoint_map.yaml
which defines a large number of
https://github.com/openstack/tripleo-heat-templates/blob/master/network/endpoints/endpoint.yaml
resources.

Adding a depends_on across the Endpoint resources doesn't help; the error is the same.
Deploying and updating a standalone stack from endpoint_map does *not* trigger the error.

Yet on a full overcloud this is consistently reproducible on both IPv4 and IPv6, but only when scaling Ceph nodes. If you scale compute nodes instead, the update completes successfully.
Comment 4 Giulio Fidente 2016-01-27 14:38 EST
Created attachment 1118860
heat-engine.log.gz

extract from heat-engine.log (debug) during a failed update attempt
Comment 5 Giulio Fidente 2016-01-27 14:38:58 EST
The only useful message I can find in the logs, apart from the TRACE, is:

  INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 4fc5bf4064674d04832ef3d638665979
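
As far as I understand oslo.messaging, this INFO line is logged when a reply arrives for a message ID that no caller is waiting on, which would be consistent with the RPC caller having already given up. The following is a minimal sketch for checking whether message IDs that timed out later received a reply, assuming the log line format quoted above. The helper name `late_replies` and the way the timed-out IDs are supplied are assumptions for illustration.

```python
import re

# Matches the INFO line quoted above and captures the message ID.
LATE_REPLY_RE = re.compile(
    r"No calling threads waiting for msg_id : (?P<msg_id>[0-9a-f]+)"
)


def late_replies(lines, timed_out_ids):
    """Return the subset of timed_out_ids for which a reply showed up in the log."""
    seen = set()
    for line in lines:
        match = LATE_REPLY_RE.search(line)
        if match and match.group("msg_id") in timed_out_ids:
            seen.add(match.group("msg_id"))
    return seen


# The line quoted in this comment. Whether this ID corresponds to a timed-out
# call in the attached log has not been verified here; it is used only to
# exercise the function.
log_lines = [
    "  INFO oslo_messaging._drivers.amqpdriver [-] No calling threads "
    "waiting for msg_id : 4fc5bf4064674d04832ef3d638665979",
]

print(late_replies(log_lines, {"4fc5bf4064674d04832ef3d638665979"}))
```

A non-empty result would suggest the remote side did finish the call, only after the caller's timeout had expired.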
Comment 6 Steve Baker 2016-02-02 18:41:27 EST
Possible alternative to generating the endpoint map proposed upstream
Comment 7 Zane Bitter 2016-02-02 18:58:37 EST
I suspect the remainder of this issue will be fixed by bug 1302880.
Comment 8 James Slagle 2016-02-03 09:33:10 EST
Ben, can you test the Endpoint map patch and if it looks good propose it downstream in tripleo-heat-templates?
Comment 9 Zane Bitter 2016-02-10 12:08:26 EST
This should be resolved by the fixes for bug 1302880 and bug 1305947.
Comment 13 errata-xmlrpc 2016-02-18 11:43:10 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0266.html
