rhel-osp-director: Scale-up Ceph from 1 to 3 fails, when Overcloud is deployed with SSL (resources.EndpointMap: Timed out).

Environment (backport-job 7.3):
------------------------------
instack-0.0.7-2.el7ost.noarch
instack-undercloud-2.1.2-37.el7ost.noarch
python-rdomanager-oscplugin-0.0.10-25.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-2015.1.2-6.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch
openstack-heat-common-2015.1.2-6.el7ost.noarch
openstack-heat-engine-2015.1.2-6.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-103.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.2-6.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-6.el7ost.noarch

Steps:
-------
(1) Deploy overcloud with SSL with 1 Ceph node.
(2) Attempt to Scale-up Ceph from 1 to 3 nodes

Results:
---------
Stack Update failed - due to resources.EndpointMap: Timed out.

[stack@undercloud72 ~]$ openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /home/stack/ssl-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml -e ~/ssl-heat-templates/environments/enable-tls.yaml -e ~/ssl-heat-templates/environments/inject-trust-anchor.yaml --ntp-server 10.5.26.10 --neutron-network-type vxlan --neutron-tunnel-types vxlan --timeout 90
Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
Stack failed with status: MessagingTimeout: resources.EndpointMap: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
ERROR: openstack Heat Stack update failed.

[stack@undercloud72 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks              |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| 93eaf3e9-5b91-4531-9a16-fc9783117b12 | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.7  |
| cd26fb71-8d0a-4358-96de-a0aeb33dd773 | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
| 5825914e-cc18-49a7-9d7f-57ef2c6ef0f6 | overcloud-cephstorage-2 | ACTIVE | -          | Running     | ctlplane=192.168.0.12 |
| 1c604744-a783-4ab9-9a27-a55dab04e76a | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.0.8  |
| db28e938-d175-4add-9931-399538a8352c | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.0.11 |
| e6f7754e-c12f-4124-84e4-23e5916e2d57 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.0.10 |
| b4cab872-ffe2-49ef-bfd7-fa11718bf2aa | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.0.9  |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+

[stack@undercloud72 ~]$ heat resource-list -n5 overcloud | grep FAIL
| CephStorage | f5edaacd-b79e-4d03-adc3-3c22582b509e | OS::Heat::ResourceGroup | UPDATE_FAILED | 2016-01-06T00:13:01Z | |
| EndpointMap | 2c7f6eba-03cf-4a83-b424-a882c03d2452 | OS::TripleO::EndpointMap | UPDATE_FAILED | 2016-01-06T00:13:37Z |

[stack@undercloud72 ~]$ heat stack-show overcloud | grep -i status
| stack_status | UPDATE_FAILED |
| stack_status_reason | MessagingTimeout: resources.EndpointMap: Timed out

[stack@undercloud72 ~]$ heat resource-show overcloud 2c7f6eba-03cf-4a83-b424-a882c03d2452
Stack or resource not found: overcloud 2c7f6eba-03cf-4a83-b424-a882c03d2452

heat-engine.log:
-----------------
2016-01-05 19:14:39.558 6951 DEBUG heat.engine.scheduler [-] Task update_task from Stack "overcloud-CephStorage-e6bsjx7luyo2-0-qkhbl6bpdjgg" [a6535d9f-7f18-45c5-9c4d-0fe30a850fb3] complete step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:226
2016-01-05 19:14:39.584 6951 DEBUG heat.engine.stack_lock [-] Engine 952fae67-e632-4cbc-95d8-71043d152d18 released lock on stack a6535d9f-7f18-45c5-9c4d-0fe30a850fb3 release /usr/lib/python2.7/site-packages/heat/engine/stack_lock.py:132
2016-01-05 19:14:39.620 6951 ERROR heat.engine.resources.stack_resource [-] update_stack
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource Traceback (most recent call last):
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 402, in update_with_template
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource args)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 226, in update_stack
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource args=args))
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 51, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource return client.call(ctxt, method, **kwargs)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 393, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource return self.prepare().call(ctxt, method, **kwargs)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource retry=self.retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource timeout=timeout, retry=retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource retry=retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource result = self._waiter.wait(msg_id, timeout)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource message = self.waiters.get(msg_id, timeout=timeout)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource 'to message ID %s' % msg_id)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource MessagingTimeout: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource
2016-01-05 19:14:39.635 6951 INFO heat.engine.resource [-] UPDATE: OS::TripleO::EndpointMap "EndpointMap" [2c7f6eba-03cf-4a83-b424-a882c03d2452] Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb]
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource Traceback (most recent call last):
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 528, in _action_recorder
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource yield
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 796, in update
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource args=[after, tmpl_diff, prop_diff])
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 296, in wrapper
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource step = next(subtask)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 569, in action_handler_task
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource handler_data = handler(*args)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/template_resource.py", line 290, in handle_update
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource self.child_params())
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 405, in update_with_template
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource self.raise_local_exception(ex)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 288, in raise_local_exception
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource raise ex
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource MessagingTimeout: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancel
Created attachment 1115955 [details] heat-engine.log
This same problem exists when trying to scale Ceph storage nodes in a regular IPv4 deployment.
The same error occurs with 2, 4, 6, or 12 workers, and it is always the same resource that fails with the "Timed out" message:
https://github.com/openstack/tripleo-heat-templates/blob/master/network/endpoints/endpoint_map.yaml
which defines a large number of resources of this type:
https://github.com/openstack/tripleo-heat-templates/blob/master/network/endpoints/endpoint.yaml

Adding a depends_on across the Endpoint resources doesn't help; the error is the same.

Deploying and updating a standalone stack from endpoint_map does *not* trigger the error, yet on a full overcloud it is consistently reproducible on both IPv4 and IPv6, but only when scaling Ceph nodes. If you scale compute nodes instead, the update completes successfully.
Created attachment 1118860 [details] heat-engine.log.gz extract from heat-engine.log (debug) during a failed update attempt
The only useful message I can find in the logs, in addition to the TRACE, is:
INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 4fc5bf4064674d04832ef3d638665979
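To check whether the timed-out calls and the "No calling threads waiting" messages in a log refer to the same message IDs, a small script along these lines may help. It is only a sketch: the two regular expressions are copied from the messages quoted in this report, while the function name correlate_timeouts and the idea of pairing the two messages by ID are assumptions made for illustration, not something Heat or oslo.messaging provides. As far as I understand oslo.messaging, the "No calling threads waiting" message is logged when a reply arrives after the caller has stopped waiting, so an ID that shows up in both sets would suggest the RPC did complete, just too late.

```python
import re

# Patterns copied from the log messages quoted in this bug report.
TIMEOUT_RE = re.compile(
    r"MessagingTimeout: Timed out waiting for a reply to message ID (\w+)"
)
LATE_REPLY_RE = re.compile(r"No calling threads waiting for msg_id : (\w+)")


def correlate_timeouts(lines):
    """Collect message IDs that timed out and IDs whose reply had no waiter.

    `lines` is any iterable of log lines (for example an open file).
    Returns a dict with three sets: IDs that timed out, IDs whose reply
    arrived with no caller waiting, and the intersection of the two.
    """
    timed_out = set()
    late_reply = set()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out.add(match.group(1))
        match = LATE_REPLY_RE.search(line)
        if match:
            late_reply.add(match.group(1))
    return {
        "timed_out": timed_out,
        "late_reply": late_reply,
        "both": timed_out & late_reply,
    }
```

Calling correlate_timeouts(open("heat-engine.log")) on the uncompressed attachment and printing the "both" set shows which timeouts, if any, were followed by a late reply. The two IDs quoted in this report differ, so they would land in separate sets.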
Possible alternative to generating the endpoint map proposed upstream
I suspect the remainder of this issue will be fixed by bug 1302880.
Ben, can you test the Endpoint map patch and if it looks good propose it downstream in tripleo-heat-templates?
This should be resolved by the fixes for bug 1302880 and bug 1305947.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-0266.html