Bug 1299613

Summary: rhel-osp-director: Scale-up Ceph from 1 to 3 fails when Overcloud is deployed with SSL (resources.EndpointMap: Timed out).
Product: Red Hat OpenStack
Reporter: Omri Hochman <ohochman>
Component: openstack-heat
Assignee: Ben Nemec <bnemec>
Status: CLOSED ERRATA
QA Contact: Amit Ugol <augol>
Severity: high
Priority: urgent
Version: 7.0 (Kilo)
CC: apevec, dyasny, gfidente, jslagle, lhh, mburns, ohochman, rhel-osp-director-maint, sbaker, shardy, ssainkar, yeylon, zbitter
Target Milestone: z4
Keywords: TestOnly, ZStream
Target Release: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Clone Of:
: 1301627 1301629 1302593 1309816
Last Closed: 2016-02-18 16:43:10 UTC
Type: Bug
Bug Depends On: 1302880, 1305947, 1308562, 1309823
Bug Blocks: 1309816
Attachments:
Description         Flags
heat-engine.log     none
heat-engine.log.gz  none

Description Omri Hochman 2016-01-18 18:45:09 UTC
rhel-osp-director: Scale-up of Ceph from 1 to 3 nodes fails when the Overcloud is deployed with SSL (resources.EndpointMap: Timed out).


Environment (backport-job 7.3):
------------------------------
instack-0.0.7-2.el7ost.noarch
instack-undercloud-2.1.2-37.el7ost.noarch
python-rdomanager-oscplugin-0.0.10-25.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-2015.1.2-6.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch
openstack-heat-common-2015.1.2-6.el7ost.noarch
openstack-heat-engine-2015.1.2-6.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-103.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.2-6.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-6.el7ost.noarch


Steps:
-------
(1) Deploy the overcloud with SSL and 1 Ceph node.
(2) Attempt to scale up Ceph from 1 to 3 nodes.

Results:
--------
Stack update failed due to resources.EndpointMap: Timed out.


[stack@undercloud72 ~]$  openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /home/stack/ssl-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml -e ~/ssl-heat-templates/environments/enable-tls.yaml -e ~/ssl-heat-templates/environments/inject-trust-anchor.yaml --ntp-server 10.5.26.10 --neutron-network-type vxlan --neutron-tunnel-types vxlan --timeout 90
Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
Stack failed with status: MessagingTimeout: resources.EndpointMap: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
ERROR: openstack Heat Stack update failed.



[stack@undercloud72 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks              |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| 93eaf3e9-5b91-4531-9a16-fc9783117b12 | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.7  |
| cd26fb71-8d0a-4358-96de-a0aeb33dd773 | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
| 5825914e-cc18-49a7-9d7f-57ef2c6ef0f6 | overcloud-cephstorage-2 | ACTIVE | -          | Running     | ctlplane=192.168.0.12 |
| 1c604744-a783-4ab9-9a27-a55dab04e76a | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.0.8  |
| db28e938-d175-4add-9931-399538a8352c | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.0.11 |
| e6f7754e-c12f-4124-84e4-23e5916e2d57 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.0.10 |
| b4cab872-ffe2-49ef-bfd7-fa11718bf2aa | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.0.9  |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+


[stack@undercloud72 ~]$ heat resource-list -n5 overcloud | grep FAIL
| CephStorage                                   | f5edaacd-b79e-4d03-adc3-3c22582b509e          | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2016-01-06T00:13:01Z |                                               |
| EndpointMap                                   | 2c7f6eba-03cf-4a83-b424-a882c03d2452          | OS::TripleO::EndpointMap                          | UPDATE_FAILED   | 2016-01-06T00:13:37Z |

[stack@undercloud72 ~]$ heat stack-show overcloud | grep -i status
| stack_status          | UPDATE_FAILED |
| stack_status_reason   | MessagingTimeout: resources.EndpointMap: Timed out

[stack@undercloud72 ~]$ heat resource-show overcloud 2c7f6eba-03cf-4a83-b424-a882c03d2452
Stack or resource not found: overcloud 2c7f6eba-03cf-4a83-b424-a882c03d2452
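
The failed rows in the resource-list output above can also be picked out programmatically. The following is a minimal Python sketch, not part of the original report; it assumes the column order seen in the rows above (name, physical ID, type, status), and the function name failed_resources is hypothetical. As a side note, heat resource-show expects the resource name (here EndpointMap) as its second argument, which is likely why passing the physical ID returned "not found".

```python
def failed_resources(table_text):
    """Return (name, physical_id, type, status) for every row whose
    status column ends in _FAILED.

    Assumes the pipe-delimited layout printed by `heat resource-list`,
    with the first four columns in the order shown in this report.
    """
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        # Skip borders (+----+), blank lines and shell prompts.
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 4:
            continue
        name, physical_id, rtype, status = cells[:4]
        if status.endswith("_FAILED"):
            rows.append((name, physical_id, rtype, status))
    return rows
```

Feeding it the saved output of "heat resource-list -n5 overcloud" would return the CephStorage and EndpointMap rows; the name in the first element is what resource-show should then be given.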


heat-engine.log:
-----------------
2016-01-05 19:14:39.558 6951 DEBUG heat.engine.scheduler [-] Task update_task from Stack "overcloud-CephStorage-e6bsjx7luyo2-0-qkhbl6bpdjgg" [a6535d9f-7f18-45c5-9c4d-0fe30a850fb3] complete step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:226
2016-01-05 19:14:39.584 6951 DEBUG heat.engine.stack_lock [-] Engine 952fae67-e632-4cbc-95d8-71043d152d18 released lock on stack a6535d9f-7f18-45c5-9c4d-0fe30a850fb3 release /usr/lib/python2.7/site-packages/heat/engine/stack_lock.py:132
2016-01-05 19:14:39.620 6951 ERROR heat.engine.resources.stack_resource [-] update_stack
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource Traceback (most recent call last):
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 402, in update_with_template
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     args)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 226, in update_stack
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     args=args))
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 51, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     return client.call(ctxt, method, **kwargs)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 393, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     return self.prepare().call(ctxt, method, **kwargs)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     retry=self.retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     timeout=timeout, retry=retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     retry=retry)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     result = self._waiter.wait(msg_id, timeout)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     message = self.waiters.get(msg_id, timeout=timeout)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource     'to message ID %s' % msg_id)
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource MessagingTimeout: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
2016-01-05 19:14:39.620 6951 TRACE heat.engine.resources.stack_resource 
2016-01-05 19:14:39.635 6951 INFO heat.engine.resource [-] UPDATE: OS::TripleO::EndpointMap "EndpointMap" [2c7f6eba-03cf-4a83-b424-a882c03d2452] Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb]
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource Traceback (most recent call last):
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 528, in _action_recorder
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     yield
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 796, in update
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     args=[after, tmpl_diff, prop_diff])
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 296, in wrapper
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     step = next(subtask)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 569, in action_handler_task
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     handler_data = handler(*args)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/template_resource.py", line 290, in handle_update
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     self.child_params())
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 405, in update_with_template
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     self.raise_local_exception(ex)
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 288, in raise_local_exception
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource     raise ex
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource MessagingTimeout: Timed out waiting for a reply to message ID 9f543dfa60b7439789e06acbc8768548
2016-01-05 19:14:39.635 6951 TRACE heat.engine.resource 
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.729 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:246
2016-01-05 19:14:39.730 6951 DEBUG heat.engine.scheduler [-] Task _resource_update from Stack "overcloud" [2587a53c-8251-4fad-9e96-254fcda003bb] Update cancel
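
For longer logs, the timed-out RPC message IDs can be collected with a small script instead of reading the traces by eye. This is a rough Python sketch under the assumption that the log lines keep the "timestamp pid LEVEL logger message" layout shown above; the name timed_out_messages is hypothetical.

```python
import re

# Matches the MessagingTimeout lines as they appear in the trace above.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r".*MessagingTimeout: Timed out waiting for a reply to message ID "
    r"(?P<msg_id>[0-9a-f]{32})"
)


def timed_out_messages(lines):
    """Map each timed-out message ID to the first timestamp it was logged at."""
    first_seen = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # The same ID is logged once per traceback; keep the earliest.
            first_seen.setdefault(match.group("msg_id"), match.group("ts"))
    return first_seen
```

Called as timed_out_messages(open("heat-engine.log")), it yields one entry per distinct message ID, which makes it easier to see whether a single RPC call or several are timing out.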

Comment 1 Omri Hochman 2016-01-18 19:49:02 UTC
Created attachment 1115955
heat-engine.log

heat-engine.log

Comment 2 Giulio Fidente 2016-01-27 13:18:09 UTC
This same problem exists when trying to scale Ceph storage nodes in a regular IPv4 deployment.

Comment 3 Giulio Fidente 2016-01-27 19:36:24 UTC
The same error happens with 2/4/6/12 workers, and it is always the same resource that fails with the Timed out message:

https://github.com/openstack/tripleo-heat-templates/blob/master/network/endpoints/endpoint_map.yaml
which defines a large number of
https://github.com/openstack/tripleo-heat-templates/blob/master/network/endpoints/endpoint.yaml
resources.

Adding a depends_on across the Endpoint resources doesn't help; the same error occurs.
Deploying and updating a standalone stack from endpoint_map does *not* trigger the error.

Yet on a full overcloud this is consistently reproducible on both IPv4 and IPv6, but only when scaling Ceph nodes. If you scale compute nodes instead, the update completes successfully.

Comment 4 Giulio Fidente 2016-01-27 19:38:01 UTC
Created attachment 1118860
heat-engine.log.gz

extract from heat-engine.log (debug) during a failed update attempt

Comment 5 Giulio Fidente 2016-01-27 19:38:58 UTC
The only useful message I can find in the logs, besides the TRACE, is:

  INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 4fc5bf4064674d04832ef3d638665979
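
As far as I understand it, oslo.messaging logs this line when a reply arrives for a message ID that no caller is waiting on any more, for example because the caller already gave up with a MessagingTimeout. If that reading is right, IDs that appear in both kinds of message point at replies that arrived too late rather than replies that were never sent. A minimal Python sketch of that cross-check follows; the function name late_replies is hypothetical and the patterns are taken from the log lines quoted in this bug.

```python
import re

TIMED_OUT_RE = re.compile(
    r"Timed out waiting for a reply to message ID ([0-9a-f]{32})"
)
ORPHAN_REPLY_RE = re.compile(
    r"No calling threads waiting for msg_id : ([0-9a-f]{32})"
)


def late_replies(lines):
    """Return the sorted message IDs that both timed out and later
    received a reply nobody was waiting for."""
    timed_out = set()
    orphaned = set()
    for line in lines:
        match = TIMED_OUT_RE.search(line)
        if match:
            timed_out.add(match.group(1))
        match = ORPHAN_REPLY_RE.search(line)
        if match:
            orphaned.add(match.group(1))
    return sorted(timed_out & orphaned)
```

A non-empty result would suggest the engine did answer, only after the caller's timeout had expired; an empty result leaves the cause open.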

Comment 6 Steve Baker 2016-02-02 23:41:27 UTC
Possible alternative to generating the endpoint map proposed upstream

Comment 7 Zane Bitter 2016-02-02 23:58:37 UTC
I suspect the remainder of this issue will be fixed by bug 1302880.

Comment 8 James Slagle 2016-02-03 14:33:10 UTC
Ben, can you test the Endpoint map patch and if it looks good propose it downstream in tripleo-heat-templates?

Comment 9 Zane Bitter 2016-02-10 17:08:26 UTC
This should be resolved by the fixes for bug 1302880 and bug 1305947.

Comment 13 errata-xmlrpc 2016-02-18 16:43:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0266.html