Bug 1395712 - [ocp-on-osp] stack update fails because "MessagingTimeout: resources.openshift_nodes: Timed out waiting for a reply to message ID"
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Reference Architecture
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: scollier
QA Contact: Johnny Liu
URL:
Whiteboard: aos-scalability-34
Depends On: 1394920
Blocks:
Reported: 2016-11-16 13:45 UTC by Jan Provaznik
Modified: 2018-02-22 14:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-22 14:46:59 UTC
Target Upstream Version:
Embargoed:


Description Jan Provaznik 2016-11-16 13:45:45 UTC
Description of problem:
When scaling out from 50 to 100 nodes, the stack update fails. The stack events show:
MessagingTimeout: resources.openshift_nodes: Timed out waiting for a reply to message ID 25ab52ae1dc2462e9e93b8f6359b3fc3

The heat-engine error log shows:
2016-11-16 13:31:01.828 47502 INFO heat.engine.resource [req-6a9c852b-1d7a-407f-8c9a-07ceeadc9974 - - - - -] UPDATE: AutoScalingResourceGroup "openshift_nodes" [bf913780-de53-4af5-8f9c-d4750ea33f73] Stack "test" [75c575ea-f8a2-4261-a8b1-a41158db4da0]
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource Traceback (most recent call last):
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 743, in _action_recorder
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     yield
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1318, in update
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     prop_diff])
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 336, in wrapper
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     step = next(subtask)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 790, in action_handler_task
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     handler_data = handler(*args)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/aws/autoscaling/autoscaling_group.py", line 278, in handle_update
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     self.resize(new_capacity)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/instance_group.py", line 356, in resize
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     updater = self.update_with_template(new_template)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 484, in update_with_template
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     **kwargs)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 323, in _update_stack
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     version='1.29')
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 84, in call
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     return client.call(ctxt, method, **kwargs)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     retry=self.retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     timeout=timeout, retry=retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 464, in send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     retry=retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 453, in _send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     result = self._waiter.wait(msg_id, timeout)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in wait
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     message = self.waiters.get(msg_id, timeout=timeout)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 238, in get
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     'to message ID %s' % msg_id)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource MessagingTimeout: Timed out waiting for a reply to message ID 861a41d5710e42fa92d8e218ad13f57a
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource 
2016-11-16 13:31:02.360 47502 INFO heat.engine.stack [req-6a9c852b-1d7a-407f-8c9a-07ceeadc9974 - - - - -] Stack UPDATE FAILED (test): MessagingTimeout: resources.openshift_nodes: Timed out waiting for a reply to message ID 861a41d5710e42fa92d8e218ad13f57a
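
The traceback ends in a blocking wait for an RPC reply (self._waiter.wait(msg_id, timeout)). As a rough illustration of that failure mode only, the sketch below uses the Python standard library to show a caller giving up on a reply that arrives too late. It does not use oslo.messaging; ReplyTimeout, call, and slow_update are hypothetical names invented for this example.

```python
import queue
import threading
import time


class ReplyTimeout(Exception):
    """Stand-in for a messaging timeout; NOT the oslo.messaging exception class."""


def call(handler, timeout_s):
    """Run handler in a worker thread and wait at most timeout_s for its reply."""
    replies = queue.Queue(maxsize=1)
    worker = threading.Thread(target=lambda: replies.put(handler()), daemon=True)
    worker.start()
    try:
        # Block until the reply arrives or the timeout expires.
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        raise ReplyTimeout("Timed out waiting for a reply") from None


def slow_update(duration_s):
    """Hypothetical handler that needs duration_s seconds before it replies."""
    def handler():
        time.sleep(duration_s)
        return "UPDATE accepted"
    return handler


# The reply arrives well inside the timeout.
print(call(slow_update(0.01), timeout_s=0.5))

# The handler needs longer than the caller is willing to wait.
try:
    call(slow_update(0.5), timeout_s=0.05)
except ReplyTimeout as exc:
    print("caller gave up:", exc)
```

In this sketch the worker keeps running after the caller has given up, so a timeout on the calling side says nothing about whether the work eventually finished.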


Version-Release number of selected component (if applicable):
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cloudwatch-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch


Steps to Reproduce:
1. create openshift-on-openstack with 5 nodes
2. scale out to 50
3. scale out to 100

Actual results:
stack fails

Expected results:
stack update completes

Additional info:
rpc_response_timeout is set to 180 in heat.conf.

We started hitting this issue yesterday, even when creating the stack. The problem turned out to be constraints in a nested template, which caused external API calls to be executed during stack validation on stack-update. Each node took ~6 secs to validate, so validation alone took about 5 minutes. Removing these constraints solved the issue for stacks <50; the problem now occurs with 100 nodes, so I suppose it is the same validation issue again.
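
The arithmetic behind that estimate can be sketched as follows. This is a back-of-the-envelope model, not a measurement: it assumes nodes are validated one after another at the ~6 seconds per node reported above (the figure observed while the constraints were still in place), and it uses the rpc_response_timeout of 180 from heat.conf. The per-node cost after the constraints were removed is not stated in this report, so the function takes it as a parameter.

```python
RPC_RESPONSE_TIMEOUT_S = 180  # rpc_response_timeout from heat.conf, per the report
VALIDATION_S_PER_NODE = 6     # approximate per-node validation time, per the report


def estimated_validation_seconds(node_count, per_node_s=VALIDATION_S_PER_NODE):
    """Assume serial validation: every node costs per_node_s seconds."""
    return node_count * per_node_s


def max_nodes_within_timeout(timeout_s=RPC_RESPONSE_TIMEOUT_S,
                             per_node_s=VALIDATION_S_PER_NODE):
    """Largest node count whose serial validation still fits in timeout_s."""
    return timeout_s // per_node_s


for nodes in (5, 50, 100):
    total = estimated_validation_seconds(nodes)
    verdict = "exceeds" if total > RPC_RESPONSE_TIMEOUT_S else "fits within"
    print(f"{nodes} nodes: ~{total}s of validation, "
          f"{verdict} the {RPC_RESPONSE_TIMEOUT_S}s timeout")
```

Under these assumptions, 50 nodes at 6 seconds each come to 300 seconds, which matches the 5 minutes reported and is already above the 180-second timeout. Any per-node cost that grows linearly with the node count will cross a fixed timeout at some stack size.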

Comment 1 Jan Provaznik 2016-11-16 13:51:24 UTC
Note that a few days ago (on Saturday) I was able to scale out to 250 nodes without hitting the MessagingTimeout issue, so I still wonder whether something in the scalelab environment has become slower.

Comment 2 Zane Bitter 2016-11-16 16:52:01 UTC
I'll add comments on the Heat issue, bug 1394920.

Comment 5 scollier 2018-02-22 14:46:59 UTC
Team, closing this as the Heat templates have been deprecated. Future functionality and integration capabilities will move to openshift-ansible.

