Description of problem:
When scaling up from 50 to 100 nodes, stack-update fails. The stack events show:

MessagingTimeout: resources.openshift_nodes: Timed out waiting for a reply to message ID 25ab52ae1dc2462e9e93b8f6359b3fc3

and the heat-engine error log shows:

2016-11-16 13:31:01.828 47502 INFO heat.engine.resource [req-6a9c852b-1d7a-407f-8c9a-07ceeadc9974 - - - - -] UPDATE: AutoScalingResourceGroup "openshift_nodes" [bf913780-de53-4af5-8f9c-d4750ea33f73] Stack "test" [75c575ea-f8a2-4261-a8b1-a41158db4da0]
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource Traceback (most recent call last):
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 743, in _action_recorder
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     yield
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1318, in update
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     prop_diff])
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 336, in wrapper
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     step = next(subtask)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 790, in action_handler_task
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     handler_data = handler(*args)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/aws/autoscaling/autoscaling_group.py", line 278, in handle_update
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     self.resize(new_capacity)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/instance_group.py", line 356, in resize
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     updater = self.update_with_template(new_template)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 484, in update_with_template
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     **kwargs)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 323, in _update_stack
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     version='1.29')
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/rpc/client.py", line 84, in call
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     return client.call(ctxt, method, **kwargs)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     retry=self.retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     timeout=timeout, retry=retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 464, in send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     retry=retry)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 453, in _send
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     result = self._waiter.wait(msg_id, timeout)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in wait
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     message = self.waiters.get(msg_id, timeout=timeout)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 238, in get
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource     'to message ID %s' % msg_id)
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource MessagingTimeout: Timed out waiting for a reply to message ID 861a41d5710e42fa92d8e218ad13f57a
2016-11-16 13:31:01.828 47502 ERROR heat.engine.resource
2016-11-16 13:31:02.360 47502 INFO heat.engine.stack [req-6a9c852b-1d7a-407f-8c9a-07ceeadc9974 - - - - -] Stack UPDATE FAILED (test): MessagingTimeout: resources.openshift_nodes: Timed out waiting for a reply to message ID 861a41d5710e42fa92d8e218ad13f57a

Version-Release number of selected component (if applicable):
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cloudwatch-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch

Steps to Reproduce:
1. create an openshift-on-openstack stack with 5 nodes
2. scale out to 50 nodes
3. scale out to 100 nodes

Actual results:
stack update fails

Expected results:
stack update completes

Additional info:
rpc_response_timeout is set to 180 in heat.conf. We started hitting this issue yesterday even when creating the stack; the cause turned out to be constraints in a nested template, which made Heat execute external API calls during stack validation on stack-update. Each node took ~6 seconds to validate, so validation alone took about 5 minutes. Removing these constraints solved the issue for stacks under 50 nodes; now the problem occurs at 100 nodes, so I suspect it is the same validation issue again.
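For context, the constraints mentioned above are most likely Heat custom_constraint entries on parameters of the nested node template; each one makes heat-engine call the backing service (Glance, Nova, Neutron, ...) when the parameter is validated, once per node. A minimal sketch of what such parameters could look like follows; the parameter names node_image and node_flavor are hypothetical, not taken from the actual openshift-on-openstack templates:

heat_template_version: 2016-04-08

parameters:
  node_image:
    type: string
    description: Glance image used for the node servers
    constraints:
      # each custom_constraint triggers an external API lookup during stack
      # validation; at ~6 s per node, 100 nodes exceed rpc_response_timeout
      - custom_constraint: glance.image
  node_flavor:
    type: string
    constraints:
      - custom_constraint: nova.flavor

Dropping the custom_constraint lines keeps the parameters but skips the per-node service lookups at validation time, which is what made the under-50-node case pass.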
Note that a few days ago (on Saturday) I was able to scale out to 250 nodes without hitting the MessagingTimeout issue, so I still wonder whether something in the scalelab environment has become slower.
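Whether or not the environment became slower, the failure mode is the nested-stack update RPC taking longer than rpc_response_timeout (currently 180). A possible stopgap, assuming the per-node validation cost cannot be reduced further, is raising the timeout in heat.conf and restarting the Heat services; the value below is only an illustrative example, not a recommended setting:

[DEFAULT]
# was 180; raising it only buys headroom for slow validation,
# it does not remove the per-node API calls
rpc_response_timeout = 600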
I'll add comments on the Heat issue, bug 1394920.
Team, closing this as the Heat templates have been deprecated. Future functionality and integration capabilities will move to openshift-ansible going forward.