Bug 1739997
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | OSP 13 fails deploy at 300+ baremetal nodes | | |
| Product: | Red Hat OpenStack | Reporter: | Dave Wilson <dwilson> |
| Component: | openstack-heat | Assignee: | Rabi Mishra <ramishra> |
| Status: | CLOSED ERRATA | QA Contact: | Jad Haj Yahya <jhajyahy> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 13.0 (Queens) | CC: | afariasa, aschultz, chris.smart, emacchi, fiezzi, harsh.kotak, marjones, mbayer, mburns, pmannidi, ramishra, sbaker, shardy, smalleni |
| Target Milestone: | --- | Keywords: | Reopened, TestOnly, Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-heat-10.0.3-7.el7ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Cloned To: | 1826325 (view as bug list) | Environment: | |
| Last Closed: | 2020-03-10 11:25:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1826325 | | |
Description
Dave Wilson
2019-08-12 04:25:51 UTC
To provide more context: this happens during a scale up from 252 nodes to 367 nodes, after all of the new nodes go into ACTIVE. It happens consistently across multiple redeploys. Keystone and heat both have memcached caching enabled. There are no errors in the keystone logs. There are some shutdown_errors in the rabbitmq logs on the undercloud; however, on retrying a stack update we do not see rabbitmq errors, yet the stack update still fails with the same error. Could this be related to rabbitmq going down during the initial scale up and the stack going into a bad state, so that even subsequent updates to the stack fail?

In the heat logs, keystone requests are timing out, so this is possibly a keystone performance issue. I see 12 admin/public workers configured in keystone. Maybe those need to be increased if caching is not helping. The Keystone team may be able to help here.

```
2019-08-12 03:37:25.874 284316 WARNING keystoneauth.identity.generic.base [req-a4e090be-c3e0-4e25-9c2c-1a2c7fd19a24 - admin - default default] Failed to discover available identity versions when contacting http://192.168.0.1:35357. Attempting to parse version from URL.: ConnectFailure: Unable to establish connection to http://192.168.0.1:35357: HTTPConnectionPool(host='192.168.0.1', port=35357): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcf87ba4c10>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
```

Changing the Keystone admin and main services to 24 processes and 1 thread fixed the issue.

I don't think it's a good idea to increase the default keystone workers for the undercloud. It uses the ::os_workers fact, whose value is the larger of (<# processors> / 2) and 2, capped at 12 for performance reasons [1]. We can possibly add this to the scale documentation among other things. If we need to keep the bug open for fixing the docs, feel free to re-open it and assign it to the correct DFG.

[1] https://github.com/openstack/puppet-openstacklib/commit/e6b658bd007924f0f71d0f2d8a6928976567cc98

Even after the changes to the keystone worker count, all further updates of the stack are failing. The errors in the heat logs are as follows:

https://gist.githubusercontent.com/smalleni/96ab99d19cc4a5b14ae0f79bfec6bbcb/raw/d0d04be7d166aaa50fe60012c0b24964eb6d2842/gistfile1.txt
https://gist.githubusercontent.com/smalleni/d5bcb9ba419eaf573188507d317a531b/raw/4a2ff44c8ba2460af40b333d7574eb0a20bc8299/gistfile1.txt

We have tried executor_thread_pool_size set to 32 as well as 16. We have also tried reducing the heat engine workers to 12 to see if that fixes the problem, but with 12 workers the stack update does not progress. We have rpc_response_timeout set to 1200 in heat.conf.

Tweaks tried so far and current settings:

Heat
- engine workers = 12/24 (tried 12 and 24; 24 is the default)
- executor_thread_pool_size = 12/32/64 (64 is the default; other values tried as well)
- rpc_response_timeout increased to 1200 from the default 600

Keystone
- increased admin to 48 processes and 1 thread
- increased main to 24 processes and 1 thread

Ironic
- sync_power_state_interval = 180

Mistral
- rpc_response_timeout = 600
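For reference, a minimal sketch of where these tweaks land on disk, assuming the standard OSP 13 undercloud config file locations; the option names are the standard heat/ironic/mistral and oslo.messaging settings, and the values simply restate what was tried above rather than a recommended configuration:

```ini
# /etc/heat/heat.conf (sketch; values are the ones tried above)
[DEFAULT]
num_engine_workers = 24          # heat-engine worker processes (12 also tried)
executor_thread_pool_size = 64   # oslo.messaging executor threads (12 and 32 also tried)
rpc_response_timeout = 1200      # raised from the default of 600

# /etc/ironic/ironic.conf (sketch)
[conductor]
sync_power_state_interval = 180  # sync node power state less frequently at scale

# /etc/mistral/mistral.conf (sketch)
[DEFAULT]
rpc_response_timeout = 600
```

The keystone worker counts are not set in keystone.conf; on the undercloud they live in the httpd mod_wsgi vhost configuration, whose default comes from the os_workers fact described above (roughly min(max(processors / 2, 2), 12)).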
Hey Rabi, many thanks for your keen interest in helping us get past these issues. If I understand correctly, because of the keystoneauth change you applied manually, we would only need to apply:

1. https://code.engineering.redhat.com/gerrit/#/c/178588/
2. https://review.opendev.org/#/c/676733/

The following two patches are not needed, because you manually patched /usr/lib/python2.7/site-packages/keystoneauth1/identity/v3/base.py:

```python
.......
        _logger.debug('Making authentication request to %s', token_url)
        resp = session.post(token_url, json=body, headers=headers,
                            authenticated=False, log=False,
                            connect_retries=10, **rkwargs)
.......
```

1. https://review.opendev.org/#/c/676664/
2. https://review.opendev.org/#/c/676648/1

Is this correct?

Need to backport https://review.opendev.org/#/c/676821/ as well.

At 400+ nodes, we are running with 48 workers and 48 threads for heat-engine.

Another quick note: we are seeing mistral-engine bottlenecked on CPU as soon as we kick off a stack update at the 300+ node scale, as it creates/updates the plan, which slows down the stack update. We need to figure out whether we can run more than one worker for mistral-engine.

At around 400 nodes, or a little less, Rabi made another change to prevent too many large messages being pushed to the queue when updating SshKnownHostsDeployment resources, which are replaced (deleted/created) for every node added to a role, with a new config (including the new node details) created each time:

overcloud.j2.yaml

```
  {{role.name}}SshKnownHostsDeployment:
    type: OS::TripleO::Ssh::KnownHostsDeployment
+   update_policy:
+     rolling_update:
+       max_batch_size: {get_param: NodeCreateBatchSize}  # 30 is the default for this parameter
    properties:
      name: {{role.name}}SshKnownHostsDeployment
      config: {get_resource: SshKnownHostsConfig}
      servers: {get_attr: [{{role.name}}Servers, value]}
```

max_batch_size is 30 per role. We are not sure this change was really needed, but we are documenting it for future reference.
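If this batching change were kept, the batch size would normally be tuned from an environment file rather than by editing the template. A minimal sketch, assuming NodeCreateBatchSize is exposed as a top-level overcloud parameter as the diff implies (the file name here is hypothetical):

```yaml
# batch-size.yaml (hypothetical file name); pass it at deploy time with:
#   openstack overcloud deploy ... -e batch-size.yaml
parameter_defaults:
  NodeCreateBatchSize: 10  # smaller batches mean smaller RPC messages per update, at the cost of more batches
```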
> another change that was made by Rabi to prevent too many large messages pushed to the queue when updating SshKnownHostsDeployment resources

I've reverted it after some testing today, and it did not seem to make any difference.
To add to Rabi's observation: with the SshKnownHostsDeployment rolling_update change shown above (overcloud.j2.yaml, max_batch_size via NodeCreateBatchSize), adding a node to an existing 505-node overcloud takes 80 minutes; without the change it takes 65 minutes. Given that it slows down the deployment and we are not currently seeing any rabbitmq issues without the patch, it does not seem necessary.

According to our records, this should be resolved by openstack-heat-10.0.3-8.el7ost. This build is available now.

Went through all the patches and verified their existence.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0761