Bug 1395712

Summary: [ocp-on-osp] stack update fails because "MessagingTimeout: resources.openshift_nodes: Timed out waiting for a reply to message ID"
Product: OpenShift Container Platform Reporter: Jan Provaznik <jprovazn>
Component: Reference Architecture Assignee: scollier
Status: CLOSED WONTFIX QA Contact: Johnny Liu <jialiu>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.4.0 CC: aos-bugs, jeder, jokerman, mmccomas, tsedovic, zbitter
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-34
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-02-22 14:46:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1394920    
Bug Blocks:    

Description Jan Provaznik 2016-11-16 13:45:45 UTC
Description of problem:
When scaling out from 50 to 100 nodes, stack-update fails. The stack events show:
MessagingTimeout: resources.openshift_nodes: Timed out waiting for a reply to message ID 25ab52ae1dc2462e9e93b8f6359b3fc3

and in heat-engine error logs:
2016-11-16 13:31:01.828 47502 INFO heat.engine.resource [req-6a9c852b-1d7a-407f-8c9a-07ceeadc9974 - - - - -] UPDATE: AutoScalingResourceGroup "openshift_nodes" [bf913780-de53-4af5-8f9c-d4750ea33f73] Stack "test" [75c575ea-f8a2-4261-a8b1-a41158db4da0]
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource Traceback (most recent call last):
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 743, in _action_recorder
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     yield
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1318, in update
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     prop_diff])
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 336, in wrapper
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     step = next(subtask)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 790, in action_handler_task
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     handler_data = handler(*args)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/aws/autoscaling/autoscaling_group.py", line 278, in handle_update
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     self.resize(new_capacity)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/instance_group.py", line 356, in resize
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     updater = self.update_with_template(new_template)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 484, in update_with_template
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     **kwargs)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 323, in _update_stack
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     version='1.29')
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 84, in call
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     return client.call(ctxt, method, **kwargs)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     retry=self.retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     timeout=timeout, retry=retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 464, in send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     retry=retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 453, in _send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     result = self._waiter.wait(msg_id, timeout)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in wait
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     message = self.waiters.get(msg_id, timeout=timeout)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 238, in get
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     'to message ID %s' % msg_id)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource MessagingTimeout: Timed out waiting for a reply to message ID 861a41d5710e42fa92d8e218ad13f57a
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource 
2016-11-16 13:31:02.360 47502 INFO heat.engine.stack [req-6a9c852b-1d7a-407f-8c9a-07ceeadc9974 - - - - -] Stack UPDATE FAILED (test): MessagingTimeout: resources.openshift_nodes: Timed out waiting for a reply to message ID 861a41d5710e42fa92d8e218ad13f57a
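
The bottom frames of the traceback show a caller blocking on a reply with a timeout and raising when none arrives in time. A minimal sketch of that pattern follows. It is a simplified stand-in built on the Python standard library, not oslo.messaging code: the `call` helper, the handler functions, and the local `MessagingTimeout` class are all hypothetical names introduced for illustration.

```python
import queue
import threading
import time


class MessagingTimeout(Exception):
    """Local stand-in for the exception named in the traceback (assumption)."""


def call(handler, timeout):
    """Run handler in a worker thread and wait at most `timeout` seconds for its reply."""
    replies = queue.Queue()
    worker = threading.Thread(target=lambda: replies.put(handler()), daemon=True)
    worker.start()
    try:
        # Blocks until a reply is queued or the timeout expires.
        return replies.get(timeout=timeout)
    except queue.Empty:
        raise MessagingTimeout("Timed out waiting for a reply")


def fast_update():
    # Hypothetical handler that replies immediately.
    return "UPDATE_COMPLETE"


def slow_update():
    # Hypothetical handler that takes longer than the caller is willing to wait.
    time.sleep(0.2)
    return "UPDATE_COMPLETE"


if __name__ == "__main__":
    print(call(fast_update, timeout=1.0))
    try:
        call(slow_update, timeout=0.05)
    except MessagingTimeout as exc:
        print("caller gave up:", exc)
```

In this sketch the timeout belongs to the caller: the slow handler is not at fault for any single step, it simply does not answer before the caller stops waiting.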


Version-Release number of selected component (if applicable):
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cloudwatch-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch


Steps to Reproduce:
1. create openshift-on-openstack with 5 nodes
2. scale out to 50
3. scale out to 100

Actual results:
stack fails

Expected results:
stack update completes

Additional info:
rpc_response_timeout is set to 180 in heat.conf.

We started hitting this issue yesterday, even when just creating the stack. The cause turned out to be constraints in a nested template, which triggered external API calls during stack validation on stack-update. Each node took ~6 secs to validate, so validation alone took about 5 minutes. Removing these constraints solved the issue for stacks with <50 nodes; now the problem occurs with 100 nodes, so I suppose it is the same validation issue again.
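
For reference, the arithmetic behind the numbers above can be sketched as follows. This assumes validation runs node by node and that the cost grows linearly with the node count, which the report suggests but does not confirm. The 6 seconds per node was measured before the constraints were removed, so the current per-node cost may differ. The function names are hypothetical.

```python
RPC_RESPONSE_TIMEOUT = 180  # seconds, as set in heat.conf per this report
SECONDS_PER_NODE = 6        # approximate validation cost per node, from this report


def validation_seconds(node_count, seconds_per_node=SECONDS_PER_NODE):
    """Estimated total validation time, assuming nodes are validated one after another."""
    return node_count * seconds_per_node


def max_nodes_within(timeout, seconds_per_node=SECONDS_PER_NODE):
    """Largest node count whose estimated validation still fits inside the timeout."""
    return timeout // seconds_per_node


if __name__ == "__main__":
    for nodes in (5, 50, 100):
        total = validation_seconds(nodes)
        verdict = "exceeds" if total > RPC_RESPONSE_TIMEOUT else "fits within"
        print(f"{nodes} nodes: ~{total}s, {verdict} the {RPC_RESPONSE_TIMEOUT}s timeout")
    print("largest node count that fits:", max_nodes_within(RPC_RESPONSE_TIMEOUT))
```

Under these assumptions, 50 nodes at 6 seconds each come to 300 seconds, which matches the roughly 5 minutes observed and is well past the 180-second rpc_response_timeout.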

Comment 1 Jan Provaznik 2016-11-16 13:51:24 UTC
Note that a few days ago (on Saturday) I was able to scale out to 250 nodes without hitting the MessagingTimeout issue, so I still wonder whether something in the scalelab env has become slower.

Comment 2 Zane Bitter 2016-11-16 16:52:01 UTC
I'll add comments on the Heat issue, bug 1394920.

Comment 5 scollier 2018-02-22 14:46:59 UTC
Team, closing this as the Heat templates have been deprecated. Future functionality and integration capabilities will move to openshift-ansible.