Description of problem: The customer has an RHOSP 16.1 deployment with beefy controller services. From time to time memcached stops working on all controller nodes. Because of bug #1891034 we can't tell what is going on from memcached's perspective. However, the controller logs show that at some point the memcached healthcheck starts failing intermittently and then breaks completely (no successful healthchecks). The customer provided sosreports from the controller nodes, collected at the time of the failure. I kindly ask the developers to provide troubleshooting tips and help find a workaround.
https://bugs.launchpad.net/oslo.cache/+bug/1888394 has been noted as a likely cause of this; its comments mention neutron too.
Bug #1915700 was reported to investigate Neutron behavior.
Does creating a cronjob on the controller nodes to restart the Memcached service every X hours (~12 hrs) sound like a valid workaround? Can such a procedure be safely implemented in production environments? Thank you.
Hello Yadnesh, did you redeploy the stack with that config? If not, you can update the neutron config directly and restart the service to apply it.
So far, setting keystone_authtoken/memcache_use_advanced_pool has proven to be a valid workaround, and it is considered to be the actual fix. I reported a bug[1] in Launchpad against tripleo and submitted a draft patch here[2].
[1] https://bugs.launchpad.net/tripleo/+bug/1931047
[2] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795010
I'd appreciate any feedback on that patch, especially on the following points:
- Currently this patch enables advanced_pool for all services, based on the fact that advanced_pool is now recommended.
- A new option is added to tht in case a user wants to switch back to the "legacy pool". However, considering that advanced_pool is now recommended, maybe we can just hard-code the parameter instead.
So far the following template can be used to try enabling memcache_use_advanced_pool for all overcloud services.
~~~
parameter_defaults:
  ControllerExtraConfig:
    aodh::keystone::authtoken::memcache_use_advanced_pool: true
    barbican::keystone::authtoken::memcache_use_advanced_pool: true
    cinder::keystone::authtoken::memcache_use_advanced_pool: true
    glance::api::authtoken::memcache_use_advanced_pool: true
    gnocchi::keystone::authtoken::memcache_use_advanced_pool: true
    heat::keystone::authtoken::memcache_use_advanced_pool: true
    ironic::api::authtoken::memcache_use_advanced_pool: true
    manila::keystone::authtoken::memcache_use_advanced_pool: true
    neutron::keystone::authtoken::memcache_use_advanced_pool: true
    nova::keystone::authtoken::memcache_use_advanced_pool: true
    nova::metadata::novajoin::authtoken::memcache_use_advanced_pool: true
    octavia::keystone::authtoken::memcache_use_advanced_pool: true
    panko::keystone::authtoken::memcache_use_advanced_pool: true
    placement::keystone::authtoken::memcache_use_advanced_pool: true
~~~
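To confirm that the setting actually landed in a service's rendered configuration after applying the template, something like the following sketch might help. It only checks whether memcache_use_advanced_pool is true in the [keystone_authtoken] section of the config text it is given. The function name advanced_pool_enabled is hypothetical, and using Python's configparser to read an oslo.config-style file is an assumption that holds for simple files but may not cover every oslo.config syntax.

```python
import configparser


def advanced_pool_enabled(conf_text: str) -> bool:
    """Return True if [keystone_authtoken] memcache_use_advanced_pool is true.

    Hypothetical helper, not part of any OpenStack tooling.
    """
    # strict=False tolerates repeated options; interpolation=None keeps
    # literal '%' characters in values from being treated as references.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    # A missing section or option is treated as "not enabled".
    return parser.getboolean(
        "keystone_authtoken", "memcache_use_advanced_pool", fallback=False
    )
```

It can be fed the contents of a service config file, for example the neutron config on a controller node, read with open(path).read(); the exact file location depends on the deployment.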
*** Bug 1849754 has been marked as a duplicate of this bug. ***
*** Bug 1915700 has been marked as a duplicate of this bug. ***
Hello, I am hoping to get a bit of clarity here. Was this patch rolled out into the latest 16.1 as well as 16.2? Thank you
There was some confusion caused by the earlier discussion in this bug, so I'm posting this to clear it up. The parameter we are changing to work around the problem is used by keystonemiddleware, which is used only by API processes. Thus the parameter should be set on controller nodes, and there is NO NEED to set it on compute nodes or any other nodes where no API processes are running. Please follow https://bugzilla.redhat.com/show_bug.cgi?id=1893205#c46 . It should be enough to apply the required change.
Fixed in 16.2. For 16.1, see Takashi's advice regarding a workaround.