Bug 1394920
| Field | Value |
|---|---|
| Summary | Timeout error when heat stack-update is used for big cluster |
| Product | Red Hat OpenStack |
| Component | openstack-heat |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Version | 10.0 (Newton) |
| Target Milestone | Upstream M2 |
| Target Release | 11.0 (Ocata) |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | aos-scalability-34 |
| Fixed In Version | openstack-heat-8.0.0-5.el7ost |
| Last Closed | 2017-05-17 19:46:18 UTC |
| Type | Bug |
| Reporter | Vikas Laad <vlaad> |
| Assignee | Zane Bitter <zbitter> |
| QA Contact | Amit Ugol <augol> |
| CC | athomas, gkeegan, jeder, jprovazn, jruzicka, jschluet, mburns, rhel-osp-director-maint, sbaker, shardy, srevivo, therve, Tuomo.Tikkanen, vcojot, vlaad, zbitter |
| Keywords | Triaged |
| Doc Type | If docs needed, set a value |
| Bug Blocks | 1395712 |
Description: Vikas Laad, 2016-11-14 18:12:31 UTC
Can you please attach /var/log/heat/heat-engine.log and /var/log/heat/heat-api.log for the period of the failed scale up? Also the version of the openstack-heat-engine package? Also, can you confirm these are the exact templates you are deploying? https://github.com/redhat-openstack/openshift-on-openstack

Judging by bug 1393802 it sounds like we have a HAProxy timeout of 60s. Maybe we should look at increasing it.

(In reply to Steve Baker from comment #1) I will provide the logs the next time we run into this; I don't have the logs anymore. Yes, we are using the exact same template.

The last advice was to comment out the custom_constraints in node.yaml (and the other resource group templates), since these may be generating a swarm of REST API requests to other services. Please update this bug when the results of that are known.

Another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1395712, was created to track the issues referred to in comment #4. This issue was noticed when running a stack update: the update actually ran and failed because of the other bug, but the response from the stack-update call was a 504 error.

It sounds like removing the custom constraints on parameters helped, but not as much as we need. Custom constraints are also checked on resource properties (and this can't be disabled in the template), so that's likely where most of the remaining time is still being burned. It is possible to tell Heat to cache this data in memcache, to avoid making repeated calls to other OpenStack APIs to validate the same thing over and over. This uses the oslo_cache library, so it's likely the same config options used by any other service that uses memcache: http://git.openstack.org/cgit/openstack/oslo.cache/tree/oslo_cache/_opts.py#n21

Also, increasing the rpc_response_timeout even further is a legitimate option here IMHO. The TripleO undercloud already uses 600s for that timeout.
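As a concrete illustration of the last suggestion (the value is an example mirroring the TripleO undercloud, not a setting confirmed for this deployment), the RPC timeout is an oslo.messaging option in the [DEFAULT] section of /etc/heat/heat.conf:

```ini
[DEFAULT]
# Seconds to wait for a response from an RPC call; the default is 60,
# which large stack validations can easily exceed. The TripleO
# undercloud ships with 600 for this reason.
rpc_response_timeout = 600
```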
Instructions for setting up memcached in heat (thanks ramishra): update /etc/heat/heat.conf with:

[cache]
backend = dogpile.cache.memcached
enabled = True

After using memcached with this configuration:

[cache]
backend = dogpile.cache.memcached
enabled = True
memcache_servers = 172.16.0.26:11211,172.16.0.11:11211,172.16.0.29:11211
debug_cache_backend = True

validation of the autoscaling group, which usually hit the timeout before, was much quicker - it took 3 minutes instead of 20:

2016-11-17 21:31:24.062 91594 INFO heat.engine.resource [req-a84a726b-c6f9-4b52-8cbb-c1c4cbf5cb76 - - - - -] Validating AutoScalingResourceGroup "openshift_nodes"
2016-11-17 21:34:23.207 91594 INFO heat.engine.resource [req-f154b367-c535-47bf-8ec3-6581210d3da9 - - - - -] updating AutoScalingResourceGroup "openshift_nodes" [bf913780-de53-4af5-8f9c-d4750ea33f73] Stack "test" [75c575ea-f8a2-4261-a8b1-a41158db4da0]

Unfortunately it seems that validation is done twice (I guess because of the nested stack), so I can see in the logs that there was another RPC request which took ~19 minutes:

2016-11-17 21:34:34.088 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Validating SoftwareConfig "run_ansible"
2016-11-17 21:34:34.089 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Validating CinderVolume "cinder_volume"
2016-11-17 21:34:34.090 20655 INFO heat.engine.resource [req-d0db50ef-e1ed- .......... <validating all 350 nodes here> .........
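The point of the [cache] section is memoization: repeated identical lookups against other OpenStack APIs are answered from the cache instead of making a fresh REST call each time. A minimal sketch of that pattern, using only the standard library so it runs anywhere - Heat actually uses oslo.cache/dogpile against memcached, and `fetch_limits` here is a hypothetical stand-in for a real validation call:

```python
from functools import lru_cache

calls = []  # track how often the backing "API" is really hit

@lru_cache(maxsize=None)
def fetch_limits(tenant_id):
    # Hypothetical stand-in for an expensive REST call (e.g. GET /v2.1/limits)
    # made during property validation. oslo.cache memoizes such lookups in
    # memcached instead of in-process, so the cache is shared across all
    # heat-engine workers on all controllers.
    calls.append(tenant_id)
    return {"maxTotalInstances": 500}

# 350 nodes validating against the same tenant: only the first lookup
# reaches the backing service; the rest are served from the cache.
for _ in range(350):
    fetch_limits("demo")

print(len(calls))  # → 1
```

The difference from `lru_cache` is only where the cache lives; the win - one API round trip instead of hundreds of identical ones - is the same.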
2016-11-17 21:53:15.078 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Validating SoftwareDeployment "deployment_run_ansible"
2016-11-17 21:53:15.078 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Validating SoftwareDeployment "deployment_bastion_node_cleanup"
2016-11-17 21:53:15.346 20655 INFO heat.engine.stack [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] Stack UPDATE IN_PROGRESS (test-openshift_nodes-cq4spc7or5a3): Stack UPDATE started
2016-11-17 21:53:21.176 20655 INFO heat.engine.resource [req-d0db50ef-e1ed-43cc-83f3-240f5b9125ff - - - - -] updating TemplateResource "qciylhzvmcee" [1c320c97-3946-4992-9df6-3c85dbfde263] Stack "test-openshift_nodes-cq4spc7or5a3" [bf913780-de53-4af5-8f9c-d4750ea33f73]

So I'm afraid that scaling higher and higher also means increasing rpc_response_timeout and waiting longer and longer for validation to finish.

I suspect that we're validating the nested stack when we're validating the parent stack, and then validating it again when we create/update the nested stack. There's probably some way we can cut it down in the future. (I'm not sure that we're doing the same amount of validation in each case though.) In any event the caching ought to have helped speed up the slowest validation path, so if it didn't then I'd suspect that it may not be working.
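The double-validation suspicion can be illustrated with a toy model (an assumption about the flow, not Heat's actual code): the parent stack's validation pass recurses into the nested stack's template, and the later update of the nested stack validates its resources a second time:

```python
from collections import Counter

validations = []  # record every resource-level validation as (stack, resource)

def validate_stack(name, resources, nested=None):
    for r in resources:
        validations.append((name, r))
    if nested:  # parent validation recurses into the nested stack's template
        validate_stack(*nested)

def update_stack(name, resources, nested=None):
    validate_stack(name, resources, nested)  # validate before acting
    if nested:
        update_stack(*nested)  # updating the nested stack validates it again

# Hypothetical shape of the stacks in this bug: a parent "test" stack whose
# resource group expands into a nested stack holding the node resources.
nested = ("test-openshift_nodes", ["node-%d" % i for i in range(3)], None)
update_stack("test", ["openshift_nodes"], nested)

counts = Counter(validations)
print(counts[("test-openshift_nodes", "node-0")])  # → 2 (validated twice)
```

Each nested-stack resource is visited once via the parent's validation pass and once when the nested stack itself is updated, which matches the two multi-minute validation runs seen in the logs.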
Next scale up to 400 nodes failed again; here is the memory usage on one of the controller nodes:

713.6 MiB + 2.0 MiB = 715.7 MiB  glance-api
2.5 GiB + 633.5 KiB = 2.5 GiB  beam.smp
3.8 GiB + 66.1 MiB = 3.9 GiB  heat-api (57)
5.2 GiB + 37.6 MiB = 5.3 GiB  nova-conductor (57)
6.0 GiB + 63.4 MiB = 6.0 GiB  cinder-api (57)
7.1 GiB + 30.1 MiB = 7.1 GiB  heat-api-cfn (57)
8.7 GiB + 4.4 MiB = 8.7 GiB  mysqld
9.7 GiB + 46.7 MiB = 9.7 GiB  neutron-server (115)
9.9 GiB + 43.5 MiB = 10.0 GiB  httpd (132)
10.8 GiB + 29.7 MiB = 10.8 GiB  heat-engine (57)
16.7 GiB + 128.3 MiB = 16.8 GiB  nova-api (113)
---------------------------------
85.1 GiB

Are this many httpd/nova-api/heat-engine processes expected to run?

(In reply to Zane Bitter from comment #10)
> I suspect that we're validating the nested stack when we're validating the
> parent stack, and then validating it again when we create/update the nested
> stack. There's probably some way we can cut it down in the future. (I'm not
> sure that we're doing the same amount of validation in each case though.) In
> any event the caching ought to have helped speed up the slowest validation
> path, so if it didn't then I'd suspect that it may not be working.

The slowest validation used to be the AutoScalingResourceGroup (see comment #9) - with memcached it took 3 minutes instead of 20, so I *think* memcached is working, but I don't know why it wasn't used for the second validation (probably the one done in the nested stacks).

(In reply to Steve Baker from comment #1)
> Can you please attach /var/log/heat/heat-engine.log and
> /var/log/heat/heat-api.log for the period of the failed scale up?
>
> Also the version of the openstack-heat-engine package?
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cloudwatch-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch

> Also, can you confirm these are the exact templates you are deploying?
> https://github.com/redhat-openstack/openshift-on-openstack

Some custom changes were applied to the templates for this setup; we use https://github.com/redhat-openstack/openshift-on-openstack/tree/slab-changes. The only difference that might be related is that "boot from volume" is used for booting all the nodes.

Created attachment 1221837 [details]
controller0 heat-engine.log
Created attachment 1221838 [details]
controller2 heat-engine.log
I attached heat-engine logs; the last timeout is logged on controller2 at 23:30:28.

After a closer look it's possible I was wrong in comment #9 about faster validation. It's possible that memcached was not used - I'm not sure how to check whether it is or is not being used (no mention in the heat-engine logs). Though if I enable debug mode there, I can see in the logs:

2016-11-18 11:00:29.051 30386 DEBUG oslo_service.service [-] cache.enabled = True log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630
2016-11-18 11:00:29.054 30386 DEBUG oslo_service.service [-] cache.memcache_servers = ['172.16.0.26:11211', '172.16.0.11:11211', '172.16.0.29:11211'] log_opt_values /usr/lib/python2.

Created attachment 1221871 [details]
heat engine debug log snippet
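One way to settle the "is memcached actually being used?" question raised above is to query each server's hit/miss counters with the memcached text-protocol `stats` command, and watch whether they move while a stack validation runs. A sketch - the tiny fake server exists only so the example runs standalone; in practice you would point `memcached_stats()` at the real servers from heat.conf (e.g. 172.16.0.26:11211):

```python
import socket
import socketserver
import threading

def memcached_stats(host, port, keys=("get_hits", "get_misses")):
    """Send the memcached text-protocol 'stats' command; return chosen counters."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(b"stats\r\n")
        data = b""
        while b"END\r\n" not in data:  # memcached terminates the reply with END
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    stats = {}
    for line in data.decode().splitlines():
        parts = line.split()  # lines look like: STAT get_hits 42
        if len(parts) == 3 and parts[0] == "STAT" and parts[1] in keys:
            stats[parts[1]] = int(parts[2])
    return stats

# Tiny fake memcached so the example is self-contained. Against a real
# deployment, call memcached_stats("172.16.0.26", 11211) before and after a
# validation: rising get_hits means the cache is being consulted.
class _FakeMemcached(socketserver.BaseRequestHandler):
    def handle(self):
        self.request.recv(1024)  # read the 'stats' command
        self.request.sendall(b"STAT get_hits 42\r\nSTAT get_misses 7\r\nEND\r\n")

server = socketserver.TCPServer(("127.0.0.1", 0), _FakeMemcached)
threading.Thread(target=server.serve_forever, daemon=True).start()
stats_result = memcached_stats("127.0.0.1", server.server_address[1])
server.shutdown()
print(stats_result)  # → {'get_hits': 42, 'get_misses': 7}
```

The same check works with `echo stats | nc <server> 11211` from a shell; the point is that a cache that heat-engine never touches will show flat counters during a 20-minute validation.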
I uploaded a snippet of the heat-engine log in debug mode while validation is happening - it seems there is a bunch of requests to http://172.100.0.12:8774/v2.1/limits when validating nova hosts; I suppose this causes the major slowdown.

One last update - in debug mode I can see messages like this in heat-engine:

2016-11-18 12:05:48.155 128362 DEBUG dogpile.lock [req-83864202-d4f5-4f67-85a3-1bdde833ff1c - - - - -] value creation lock <dogpile.cache.region._LockWrapper object at 0x64b6a50> acquired _enter_create /usr/lib/python2.7/site-packages/dogpile/lock.py:127

And on heat-engine start:

2016-11-18 11:00:29.051 30386 DEBUG oslo_service.service [-] cache.enabled = True log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630
2016-11-18 11:00:29.054 30386 DEBUG oslo_service.service [-] cache.memcache_servers = ['172.16.0.26:11211', '172.16.0.11:11211', '172.16.0.29:11211'] log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2630

So I think that the last try was done with caching enabled, but it didn't help much.

(In reply to Vikas Laad from comment #11)
> Next scale up to 400 nodes failed again, here is the memory usage on one of
> the controller nodes
>
> 713.6 MiB + 2.0 MiB = 715.7 MiB  glance-api
> 2.5 GiB + 633.5 KiB = 2.5 GiB  beam.smp
> 3.8 GiB + 66.1 MiB = 3.9 GiB  heat-api (57)
> 5.2 GiB + 37.6 MiB = 5.3 GiB  nova-conductor (57)
> 6.0 GiB + 63.4 MiB = 6.0 GiB  cinder-api (57)
> 7.1 GiB + 30.1 MiB = 7.1 GiB  heat-api-cfn (57)
> 8.7 GiB + 4.4 MiB = 8.7 GiB  mysqld
> 9.7 GiB + 46.7 MiB = 9.7 GiB  neutron-server (115)
> 9.9 GiB + 43.5 MiB = 10.0 GiB  httpd (132)
> 10.8 GiB + 29.7 MiB = 10.8 GiB  heat-engine (57)
> 16.7 GiB + 128.3 MiB = 16.8 GiB  nova-api (113)
> ---------------------------------
> 85.1 GiB
>
> Are these many httpd/nova-api/heat-engine processes expected to run?

The default for heat is one worker process per CPU - there are 56 CPUs on this box, so yes, it's expected. (A similar story applies to other processes.)
However, that's probably a lot more than you need, and limiting it to a smaller number in the config file should reduce the total amount of memory used.

Because the MessagingTimeout is caused by the validation step, there are 2 potential solutions:

1) this patch speeds up validation: https://review.openstack.org/#/c/400961/1/heat/engine/stack.py
2) there will be a new option to disable validation (if the user wants to): https://review.openstack.org/#/q/status:open+project:openstack/heat+branch:master+topic:bug/1640829

The patch https://review.openstack.org/400961 has merged upstream and will be in Ocata, so I've targeted this bug for OSP11. I'd be open to backporting the patch to OSP10 though if that'd be useful. It's well-contained and fairly low-risk.

(In reply to Zane Bitter from comment #22)
> The patch https://review.openstack.org/400961 has merged upstream and will
> be in Ocata, so I've targeted this bug for OSP11. I'd be open to backporting
> the patch to OSP10 though if that'd be useful. It's well-contained and
> fairly low-risk.

Please backport to OSP10 too. I have an OSP9 cloud with only 26 nodes (including 3 ceph and 3 controller nodes), and an interactive stack update dies with this "504 Gateway Time-out" error after it has run for about 3 minutes. By activating memcached as mentioned in comment #8 I managed to update the stack fine. The director host is a KVM VM with 32G of memory, which IMHO should be more than enough for an undercloud host for an overcloud of this size.

(In reply to Tuomo Tikkanen from comment #23)
> Please backport to OSP10 too.

That patch is unlikely to make a huge difference to TripleO, because I believe most TripleO ResourceGroups use index substitution, which results in each member of the group having a different definition, so no time is saved by ignoring identical definitions.
Gory details here: http://lists.openstack.org/pipermail/openstack-dev/2016-November/107875.html

A patch that *is* likely to help TripleO is https://review.openstack.org/#/c/403828/ but that is a riskier backport.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245
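The trade-off between the merged optimization and index substitution can be sketched with a toy model (a simplified illustration of the idea behind patch 400961, not Heat's actual implementation): validate each *distinct* member definition in a scaling group once, instead of once per member.

```python
def validate_group(definitions):
    """Count how many real validations a group of member definitions needs
    when identical definitions are validated only once."""
    seen = set()
    validated = 0
    for d in definitions:
        if d in seen:
            continue  # identical definition already validated; skip it
        seen.add(d)
        validated += 1
    return validated

# Plain scaling group: all 400 members share one definition -> 1 validation.
plain = ["flavor=m1.large,image=rhel7"] * 400
print(validate_group(plain))  # → 1

# With index substitution (as in TripleO), the per-member index makes every
# definition unique, so nothing is deduplicated -> 400 validations.
indexed = ["flavor=m1.large,name=node-%d" % i for i in range(400)]
print(validate_group(indexed))  # → 400
```

This is why the patch helps the identical-member groups in this bug but, as noted above, does little for TripleO's index-substituted groups.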