Bug 1893205

Summary: From time to time memcached stops processing requests and brings down OpenStack control plane
Product: Red Hat OpenStack
Reporter: Alex Stupnikov <astupnik>
Component: openstack-tripleo-heat-templates
Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED NEXTRELEASE
QA Contact: Joe H. Rahme <jhakimra>
Severity: medium
Priority: high
Version: 16.1 (Train)
Keywords: Triaged
Hardware: x86_64
OS: Linux
CC: apevec, aruffin, bdobreli, camorris, dciabrin, dhill, dhruv, dsedgmen, enothen, ggrimaux, hberaud, jmelvin, jraju, jschluet, lhh, lmiccini, mbayer, mburns, mgarciac, michal.vasko, michele, msecaur, satmakur, schhabdi, tkajinam, xili, ykulkarn, yusuf, yusufhadiwinata
Clones: 2046185, 2100879 (view as bug list)
Last Closed: 2023-08-11 12:22:42 UTC
Type: Bug
Bug Blocks: 2046185, 2100879, 2101864, 2101865

Description Alex Stupnikov 2020-10-30 14:46:28 UTC
Description of problem:

The customer has an RHOSP 16.1 deployment with beefy controllers. From time to time memcached stops working on all controller nodes.

Because of bug #1891034 we can't tell what's going on from the memcached perspective, but in the controllers' logs we can see that at some point the memcached healthcheck starts failing intermittently and then breaks down completely (no successful healthchecks at all).

The customer provided sosreports from the controller nodes, collected at the time of the failure. I kindly ask developers to provide troubleshooting tips and help find a workaround.

Comment 14 Michael Bayer 2021-01-12 16:37:55 UTC
https://bugs.launchpad.net/oslo.cache/+bug/1888394 has been noted and is a likely cause of this; its comments mention Neutron as well.

Comment 17 Alex Stupnikov 2021-01-13 09:33:07 UTC
Bug #1915700 was reported to investigate Neutron behavior.

Comment 22 Srinivas Atmakuri 2021-01-28 04:06:45 UTC
Does creating a cronjob on the controller nodes to have the memcached service restarted every X hours (~12 hrs) sound like a valid workaround?

Can such a procedure be safely implemented in production environments?

Thank you.
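
For illustration, such a cron job might look like the sketch below. It assumes memcached runs as a podman container managed by the tripleo_memcached systemd unit (the usual name on RHOSP 16, but please verify it on your nodes), and the 12-hour schedule is only an example.

~~~
# Hypothetical root crontab entry on each controller node.
# Restarts memcached every 12 hours (at 00:00 and 12:00); verify the unit
# name first with: systemctl list-units 'tripleo_memcached*'
0 0,12 * * * /usr/bin/systemctl restart tripleo_memcached
~~~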

Comment 35 Hervé Beraud 2021-05-25 12:13:44 UTC
Hello Yadnesh,

Did you redeploy the stack with that config?

Otherwise, you can update the neutron config directly and restart the service to apply it.
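
For example, something like the following sketch (the config-data path and systemd unit name are the usual TripleO ones, but verify them on your deployment; also note that manual edits like this are overwritten by the next stack update):

~~~
# Set the option in the neutron server config used by the container
# (path assumed from the standard TripleO layout on controllers).
crudini --set /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf \
    keystone_authtoken memcache_use_advanced_pool true
# Restart the neutron API container to pick up the change.
systemctl restart tripleo_neutron_api
~~~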

Comment 45 Takashi Kajinami 2021-06-07 09:27:25 UTC
So far, setting keystone_authtoken/memcache_use_advanced_pool is a valid workaround,
and it is considered to be the actual fix.
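
For reference, this is the keystonemiddleware option that ends up in each API service's configuration under the [keystone_authtoken] section (illustrative snippet; the actual config file path differs per service):

~~~
[keystone_authtoken]
# Use oslo.cache's advanced memcached connection pool instead of the
# legacy per-process connection handling.
memcache_use_advanced_pool = true
~~~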

I reported a bug[1] in launchpad against tripleo and submitted a draft patch here[2].
 [1] https://bugs.launchpad.net/tripleo/+bug/1931047
 [2] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795010

I'd appreciate any feedback on that patch, especially on the following points:
- Currently the patch enables advanced_pool for all services, based on the fact
  that advanced_pool is now recommended.

- A new option is added to tht in case a user wants to switch back to the "legacy pool".
  However, considering that advanced_pool is now recommended, maybe
  we can just hard-code the parameter instead.

Comment 46 Takashi Kajinami 2021-06-08 15:19:47 UTC
For now, the following environment file can be used to enable memcache_use_advanced_pool for all overcloud services.

~~~
parameter_defaults:
  ControllerExtraConfig:
    aodh::keystone::authtoken::memcache_use_advanced_pool: true
    barbican::keystone::authtoken::memcache_use_advanced_pool: true
    cinder::keystone::authtoken::memcache_use_advanced_pool: true
    glance::api::authtoken::memcache_use_advanced_pool: true
    gnocchi::keystone::authtoken::memcache_use_advanced_pool: true
    heat::keystone::authtoken::memcache_use_advanced_pool: true
    ironic::api::authtoken::memcache_use_advanced_pool: true
    manila::keystone::authtoken::memcache_use_advanced_pool: true
    neutron::keystone::authtoken::memcache_use_advanced_pool: true
    nova::keystone::authtoken::memcache_use_advanced_pool: true
    nova::metadata::novajoin::authtoken::memcache_use_advanced_pool: true
    octavia::keystone::authtoken::memcache_use_advanced_pool: true
    panko::keystone::authtoken::memcache_use_advanced_pool: true
    placement::keystone::authtoken::memcache_use_advanced_pool: true
~~~
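
Save it as an environment file and pass it to the deploy command so it persists across stack updates, e.g. (a sketch; keep all of your existing -e arguments, and substitute whatever file name you used):

~~~
# Hypothetical invocation; memcache_advanced_pool.yaml is the file above.
openstack overcloud deploy --templates \
    [your existing -e environment files] \
    -e memcache_advanced_pool.yaml
~~~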

Comment 47 Damien Ciabrini 2021-06-25 13:08:24 UTC
*** Bug 1849754 has been marked as a duplicate of this bug. ***

Comment 48 Takashi Kajinami 2021-07-04 13:11:42 UTC
*** Bug 1915700 has been marked as a duplicate of this bug. ***

Comment 63 aruffin@redhat.com 2022-04-07 17:46:51 UTC
Hello,

I am hoping to get a bit of clarity here. Was this patch rolled out into the latest 16.1 as well as 16.2?

Thank you

Comment 69 Takashi Kajinami 2022-08-03 12:05:24 UTC
There has been some confusion caused by the past discussion in this bug, so I'm posting this to clear it up.

The parameter we are changing to work around the problem is used by keystonemiddleware,
which is used only by API processes.
Thus the parameter should be set on controller nodes, and there is NO NEED to set the same
on compute nodes or any other nodes where no API processes are running.

Please follow https://bugzilla.redhat.com/show_bug.cgi?id=1893205#c46 ;
applying that change should be enough.

Comment 70 Luca Miccini 2023-08-11 12:22:42 UTC
Fixed in 16.2; for 16.1, see Takashi's advice above regarding the workaround.