Bug 1394920

Summary: Timeout error when heat stack-update is used for big cluster
Product: Red Hat OpenStack
Reporter: Vikas Laad <vlaad>
Component: openstack-heat
Assignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA
QA Contact: Amit Ugol <augol>
Severity: unspecified
Priority: unspecified
Version: 10.0 (Newton)
CC: athomas, gkeegan, jeder, jprovazn, jruzicka, jschluet, mburns, rhel-osp-director-maint, sbaker, shardy, srevivo, therve, Tuomo.Tikkanen, vcojot, vlaad, zbitter
Target Milestone: Upstream M2
Keywords: Triaged
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-34
Fixed In Version: openstack-heat-8.0.0-5.el7ost
Last Closed: 2017-05-17 19:46:18 UTC
Type: Bug
Bug Blocks: 1395712    
Attachments:
  controller0 heat-engine.log
  controller2 heat-engine.log
  heat engine debug log snippet

Description Vikas Laad 2016-11-14 18:12:31 UTC
Description of problem:
I am trying to create a 300-node cluster in the scale lab, where we are scaling up the cluster in batches of 50. After 150 nodes in the cluster, the heat client times out with the following error. The update actually starts running, but the client comes back with this error.

[heat-admin@overcloud-controller-0 ~]$ heat stack-update test -e openshift-on-openstack/env_scale_lab.yaml -f openshift-on-openstack/openshift.yaml
WARNING (shell) "heat stack-update" is deprecated, please use "openstack stack update" instead
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>


Version-Release number of selected component (if applicable):
1.4.0

How reproducible:


Steps to Reproduce:
1. Create a small cluster, 3 masters, 3 etcd, 2 infra and 2 nodes
2. Start scaling up the cluster in batches of 50
3. Update the template to have 200 nodes and run "heat stack-update"

Actual results:
Error above

Expected results:
The client should show the stack list after requesting the update.

Additional info:

Comment 1 Steve Baker 2016-11-14 20:10:39 UTC
Can you please attach /var/log/heat/heat-engine.log and /var/log/heat/heat-api.log for the period of the failed scale up?

Also the version of the openstack-heat-engine package?

Also, can you confirm these are the exact templates you are deploying? https://github.com/redhat-openstack/openshift-on-openstack

Comment 2 Zane Bitter 2016-11-14 21:06:50 UTC
Judging by bug 1393802 it sounds like we have a HAProxy timeout of 60s. Maybe we should look to increase it.
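
For reference, raising the relevant timeouts in /etc/haproxy/haproxy.cfg on the controllers might look roughly like this (a sketch only - the listener name and addresses are placeholders, not taken from this deployment):

listen heat_api
  bind 172.16.0.10:8004
  # raise the 60s default so a long-running stack update/validation doesn't return a 504
  timeout client 10m
  timeout server 10m
  server controller-0 172.16.0.26:8004 check
  server controller-1 172.16.0.11:8004 check
  server controller-2 172.16.0.29:8004 check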

Comment 3 Vikas Laad 2016-11-15 14:38:52 UTC
(In reply to Steve Baker from comment #1)
I will provide the logs next time we run into this again, I don't have logs anymore.

Yes we are using the same exact template.

Comment 4 Steve Baker 2016-11-16 00:00:11 UTC
The last advice was to comment out the custom_constraints in node.yaml (and the other resource group templates), since these may be generating a swarm of REST API requests to other services.
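
For illustration, a custom constraint on a template parameter looks roughly like this (the parameter names below are made up, not taken from node.yaml); commenting out the constraints block stops Heat from calling out to the other service to validate the value:

parameters:
  server_image:
    type: string
    # constraints:
    #   - custom_constraint: glance.image
  flavor:
    type: string
    # constraints:
    #   - custom_constraint: nova.flavor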

Please update this bug when the results of that are known.

Comment 5 Vikas Laad 2016-11-16 16:45:29 UTC
Another bug was created https://bugzilla.redhat.com/show_bug.cgi?id=1395712 to track the issues you are referring to in comment #4 

This issue was noticed when running a stack update: the stack update actually happens and then fails because of the other bug, but the response from stack-update is a 504 error.

Comment 6 Zane Bitter 2016-11-16 17:02:35 UTC
It sounds like removing the custom constraints on parameters helped, but not as much as we need. Custom constraints are also checked on resource properties (and this can't be disabled in the template), so that's likely where most of the remaining time is still being burned.

It is possible to tell Heat to cache this data in memcache, to avoid making repeated calls to other OpenStack APIs to validate the same thing over and over. This uses the oslo_cache library, so it's likely the same config options used by any other service using memcache: http://git.openstack.org/cgit/openstack/oslo.cache/tree/oslo_cache/_opts.py#n21

Comment 7 Zane Bitter 2016-11-16 17:23:12 UTC
Also, increasing the rpc_response_timeout even further is a legitimate option here IMHO. The TripleO undercloud already uses 600s for that timeout.
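
For example, in /etc/heat/heat.conf (600 is the value the TripleO undercloud uses; restart the heat services afterwards):

[DEFAULT]
rpc_response_timeout = 600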

Comment 8 Jan Provaznik 2016-11-17 13:23:32 UTC
Instructions for setting up memcached in heat (thanks ramishra):
update /etc/heat/heat.conf with:

[cache]
backend = dogpile.cache.memcached
enabled = True

Comment 9 Jan Provaznik 2016-11-17 22:12:20 UTC
After using memcached with this configuration:
[cache]
backend=dogpile.cache.memcached
enabled=True
memcache_servers=172.16.0.26:11211,172.16.0.11:11211,172.16.0.29:11211
debug_cache_backend=True

Then validation of the AutoScalingResourceGroup, which usually hit the timeout before, was much quicker: it took 3 minutes instead of 20:
2016-11-17 21:31:24.062 91594 INFO heat.engine.resource [req-a84a726b-c6f9-4b52-8cbb-c1c4cbf5cb76 - - - - -] Validating AutoScalingResourceGroup "openshift_nodes"
2016-11-17 21:34:23.207 91594 INFO heat.engine.resource [req-f154b367-c535-47bf-8ec3-6581210d3da9 - - - - -] updating AutoScalingResourceGroup "openshift_nodes" [bf913780-de53-4af5-8f9c-d4750ea33f73] Stack "test" [75c575ea-f8a2-4261-a8b1-a41158db4da0]

Unfortunately it seems that validation is done twice (I guess because of the nested stack), so I can see in the logs that there was another RPC request which took ~19 minutes:
2016-11-17 21:34:34.088 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Validating SoftwareConfig "run_ansible"
2016-11-17 21:34:34.089 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Validating CinderVolume "cinder_volume"
2016-11-17 21:34:34.090 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-
.......... <validating all 350 nodes here> .........
 
2016-11-17 21:53:15.078 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Validating SoftwareDeployment "deployment_run_ansible"
2016-11-17 21:53:15.078 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Validating SoftwareDeployment "deployment_bastion_node_cleanup"
2016-11-17 21:53:15.346 20655 INFO heat.engine.stack [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Stack UPDATE IN_PROGRESS (test-openshift_nodes-cq4spc7or5a3): Stack UPDATE started
2016-11-17 21:53:21.176 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] updating TemplateResource "qciylhzvmcee" [1c320c97-3946-4992-9df6-3c85dbfde263] Stack "test-openshift_nodes-cq4spc7or5a3" [bf913780-de53-4af5-8f9c-d4750ea33f73]

So I'm afraid that scaling higher and higher also means increasing rpc_response_timeout and waiting longer and longer for validation to finish.

Comment 10 Zane Bitter 2016-11-17 23:06:02 UTC
I suspect that we're validating the nested stack when we're validating the parent stack, and then validating it again when we create/update the nested stack. There's probably some way we can cut it down in the future. (I'm not sure that we're doing the same amount of validation in each case though.) In any event the caching ought to have helped speed up the slowest validation path, so if it didn't then I'd suspect that it may not be working.

Comment 11 Vikas Laad 2016-11-18 01:10:36 UTC
The next scale-up to 400 nodes failed again; here is the memory usage on one of the controller nodes:

713.6 MiB +   2.0 MiB = 715.7 MiB	glance-api
  2.5 GiB + 633.5 KiB =   2.5 GiB	beam.smp
  3.8 GiB +  66.1 MiB =   3.9 GiB	heat-api (57)
  5.2 GiB +  37.6 MiB =   5.3 GiB	nova-conductor (57)
  6.0 GiB +  63.4 MiB =   6.0 GiB	cinder-api (57)
  7.1 GiB +  30.1 MiB =   7.1 GiB	heat-api-cfn (57)
  8.7 GiB +   4.4 MiB =   8.7 GiB	mysqld
  9.7 GiB +  46.7 MiB =   9.7 GiB	neutron-server (115)
  9.9 GiB +  43.5 MiB =  10.0 GiB	httpd (132)
 10.8 GiB +  29.7 MiB =  10.8 GiB	heat-engine (57)
 16.7 GiB + 128.3 MiB =  16.8 GiB	nova-api (113)
---------------------------------
                         85.1 GiB

Are this many httpd/nova-api/heat-engine processes expected to run?

Comment 12 Jan Provaznik 2016-11-18 05:41:13 UTC
(In reply to Zane Bitter from comment #10)
> I suspect that we're validating the nested stack when we're validating the
> parent stack, and then validating it again when we create/update the nested
> stack. There's probably some way we can cut it down in the future. (I'm not
> sure that we're doing the same amount of validation in each case though.) In
> any event the caching ought to have helped speed up the slowest validation
> path, so if it didn't then I'd suspect that it may not be working.

The slowest validation used to be the AutoScalingResourceGroup (in comment 9) - with memcached it took 3 minutes instead of 20, so I *think* memcached is working, but I don't know why it wasn't used for the second validation (probably done in the nested stacks).

Comment 13 Jan Provaznik 2016-11-18 05:57:34 UTC
(In reply to Steve Baker from comment #1)
> Can you please attach /var/log/heat/heat-engine.log and
> /var/log/heat/heat-api.log for the period of the failed scale up?
> 
> Also the version of the openstack-heat-engine package?
> 

python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cloudwatch-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch


> Also, can you confirm these are the exact templates you are deploying?
> https://github.com/redhat-openstack/openshift-on-openstack

Some custom changes were applied to the templates for this setup; we use:
https://github.com/redhat-openstack/openshift-on-openstack/tree/slab-changes

The only difference which might be related is that "boot from volume" is used for booting all nodes.

Comment 14 Jan Provaznik 2016-11-18 11:25:35 UTC
Created attachment 1221837 [details]
controller0 heat-engine.log

Comment 15 Jan Provaznik 2016-11-18 11:26:19 UTC
Created attachment 1221838 [details]
controller2 heat-engine.log

Comment 16 Jan Provaznik 2016-11-18 11:32:44 UTC
I attached the heat-engine logs; the last timeout is logged on controller2 at 23:30:28. After a closer look it's possible I was wrong in comment #9 about faster validation. It's possible that memcached was not used - I'm not sure how to check whether it is or is not used (no mention in the heat-engine logs). Though if I enable debug mode there, I can see in the logs:
2016-11-18 11:00:29.051 30386 DEBUG oslo_service.service [-] cache.enabled                  = True log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630
2016-11-18 11:00:29.054 30386 DEBUG oslo_service.service [-] cache.memcache_servers         = ['172.16.0.26:11211', '172.16.0.11:11211', '172.16.0.29:11211'] log_opt_values /usr/lib/python2.
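
One way to check whether Heat is actually hitting memcached (assuming memcached-tool from the memcached package is available on the controllers) is to compare the hit/miss counters before and after a validation run, e.g.:

memcached-tool 172.16.0.26:11211 stats | grep -E 'get_hits|get_misses|curr_items'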

Comment 17 Jan Provaznik 2016-11-18 12:19:30 UTC
Created attachment 1221871 [details]
heat engine debug log snippet

Comment 18 Jan Provaznik 2016-11-18 12:23:30 UTC
I uploaded a snippet of the heat-engine log in debug mode while validation is happening - it seems there is a bunch of requests to http://172.100.0.12:8774/v2.1/limits when validating nova hosts, and I suppose this causes the major slowdown.

Comment 19 Jan Provaznik 2016-11-18 12:32:24 UTC
And one last update - in debug mode I can see messages like this in heat-engine:
2016-11-18 12:05:48.155 128362 DEBUG dogpile.lock [req-83864202-d4f5-4f67-85a3-1bdde833ff1c - - - - -] value creation lock <dogpile.cache.region._LockWrapper object at 0x64b6a50> acquired _enter_create /usr/lib/python2.7/site-packages/dogpile/lock.py:127

And on heat-engine start:
2016-11-18 11:00:29.051 30386 DEBUG oslo_service.service [-] cache.enabled                  = True log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630
2016-11-18 11:00:29.054 30386 DEBUG oslo_service.service [-] cache.memcache_servers         = ['172.16.0.26:11211', '172.16.0.11:11211', '172.16.0.29:11211'] log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630

So I think that the last try was done with caching enabled, but it didn't help much.

Comment 20 Zane Bitter 2016-11-18 13:32:55 UTC
(In reply to Vikas Laad from comment #11)
> Next scale up to 400 nodes failed again, here is the memory usage on one of
> the controller nodes
> 
> 713.6 MiB +   2.0 MiB = 715.7 MiB	glance-api
>   2.5 GiB + 633.5 KiB =   2.5 GiB	beam.smp
>   3.8 GiB +  66.1 MiB =   3.9 GiB	heat-api (57)
>   5.2 GiB +  37.6 MiB =   5.3 GiB	nova-conductor (57)
>   6.0 GiB +  63.4 MiB =   6.0 GiB	cinder-api (57)
>   7.1 GiB +  30.1 MiB =   7.1 GiB	heat-api-cfn (57)
>   8.7 GiB +   4.4 MiB =   8.7 GiB	mysqld
>   9.7 GiB +  46.7 MiB =   9.7 GiB	neutron-server (115)
>   9.9 GiB +  43.5 MiB =  10.0 GiB	httpd (132)
>  10.8 GiB +  29.7 MiB =  10.8 GiB	heat-engine (57)
>  16.7 GiB + 128.3 MiB =  16.8 GiB	nova-api (113)
> ---------------------------------
>                          85.1 GiB
> 
> Are these many httpd/nova-api/heat-engine processes expected to run?

The default for heat is one worker process per CPU - there are 56 CPUs on this box, so yes, it's expected. (A similar story applies for the other processes.) However, that's probably a lot more than you need, and limiting it to a smaller number in the config file should reduce the total memory used.
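
A rough sketch of capping the worker counts in /etc/heat/heat.conf (the values are illustrative, and the section names should be checked against the installed config):

[DEFAULT]
# default is one engine worker per CPU (56 here)
num_engine_workers = 8

[heat_api]
workers = 4

[heat_api_cfn]
workers = 4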

Comment 21 Jan Provaznik 2016-11-23 08:42:23 UTC
Because MessagingTimeout is caused by the validation step, there are 2 potential solutions:
1) this patch speeds up validation:
https://review.openstack.org/#/c/400961/1/heat/engine/stack.py

2) there will be a new option to disable validation (if the user wants to):
https://review.openstack.org/#/q/status:open+project:openstack/heat+branch:master+topic:bug/1640829

Comment 22 Zane Bitter 2016-12-01 16:05:37 UTC
The patch https://review.openstack.org/400961 has merged upstream and will be in Ocata, so I've targeted this bug for OSP11. I'd be open to backporting the patch to OSP10 though if that'd be useful. It's well-contained and fairly low-risk.

Comment 23 Tuomo Tikkanen 2016-12-09 20:18:39 UTC
(In reply to Zane Bitter from comment #22)
> The patch https://review.openstack.org/400961 has merged upstream and will
> be in Ocata, so I've targeted this bug for OSP11. I'd be open to backporting
> the patch to OSP10 though if that'd be useful. It's well-contained and
> fairly low-risk.

Please backport to OSP10 too.

I have an OSP9 cloud with only 26 nodes (including 3 ceph and 3 controllers), and an interactive stack update dies with this "504 Gateway Time-out" error after it has run for about 3 minutes. By activating memcached as mentioned in comment #8, I managed to update the stack fine. The Director host is a KVM VM with 32 GB of memory, which IMHO should be more than enough for an undercloud host for an overcloud of this size.

Comment 24 Zane Bitter 2016-12-15 23:38:18 UTC
(In reply to Tuomo Tikkanen from comment #23)
> Please backport to OSP10 too.

That patch is unlikely to make a huge difference to TripleO, because I believe most TripleO ResourceGroups use index substitution, which results in each member of the group having a different definition, so no time is saved by ignoring identical definitions.

Gory details here:
http://lists.openstack.org/pipermail/openstack-dev/2016-November/107875.html

A patch that *is* likely to help TripleO is https://review.openstack.org/#/c/403828/ but that is a riskier backport.

Comment 34 errata-xmlrpc 2017-05-17 19:46:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245