Bug 1893205 - From time to time memcached stops processing requests and brings down OpenStack control plane
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Damien Ciabrini
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Duplicates: 1849754 1915700
Depends On:
Blocks: 2046185 2100879 2101864 2101865
Reported: 2020-10-30 14:46 UTC by Alex Stupnikov
Modified: 2024-03-25 16:52 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2046185 2100879
Environment:
Last Closed: 2023-08-11 12:22:42 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Launchpad 1634646 0 None None None 2021-10-08 08:39:45 UTC
Launchpad 1931047 0 None None None 2021-06-07 09:30:20 UTC
OpenStack gerrit 795497 0 None NEW Add support for keystone_authtoken/memcache_use_advanced_pool 2021-06-09 11:24:07 UTC
Red Hat Issue Tracker OSP-642 0 None None None 2021-11-10 17:31:49 UTC

Description Alex Stupnikov 2020-10-30 14:46:28 UTC
Description of problem:

The customer has an RHOSP 16.1 deployment with beefy controller services. From time to time memcached stops working on all controller nodes.

Because of bug #1891034 we can't tell what's going on from memcached's perspective. But in the controllers' logs we can see that at some point the memcached healthcheck starts failing intermittently and then becomes completely broken (no successful healthchecks).

The customer provided sosreports from the controller nodes, collected at the time of the failure. I kindly ask developers to provide troubleshooting tips and help find a workaround.

Comment 14 Michael Bayer 2021-01-12 16:37:55 UTC
https://bugs.launchpad.net/oslo.cache/+bug/1888394 has been noted and is a likely cause of this; its comments mention neutron too.

Comment 17 Alex Stupnikov 2021-01-13 09:33:07 UTC
Bug #1915700 was reported to investigate Neutron behavior.

Comment 22 Srinivas Atmakuri 2021-01-28 04:06:45 UTC
Does creating a cron job on the controller nodes to restart the Memcached service every X hours (~12 hrs) sound like a valid workaround?

Can such a procedure be safely implemented in production environments?

Thank you.

Comment 35 Hervé Beraud 2021-05-25 12:13:44 UTC
Hello Yadnesh,

Did you redeploy the stack with that config?

Otherwise, you can update the neutron config directly and restart the service to apply it.
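
For a quick test without redeploying, the direct edit described above amounts to setting `memcache_use_advanced_pool = true` in the `[keystone_authtoken]` section of the service's config file. Below is a minimal Python sketch of that edit, operating on the file's text. The helper name `enable_advanced_pool` is hypothetical and not part of any OpenStack tooling. The location of neutron.conf and the way the service is restarted depend on the deployment and are not shown.

```python
def enable_advanced_pool(conf_text: str) -> str:
    """Return conf_text with memcache_use_advanced_pool = true set in the
    [keystone_authtoken] section (hypothetical helper, for illustration)."""
    section = "[keystone_authtoken]"
    option = "memcache_use_advanced_pool"
    out, in_section, done = [], False, False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving the target section without having seen the option: add it.
            if in_section and not done:
                out.append(f"{option} = true")
                done = True
            in_section = stripped == section
        elif in_section and stripped.split("=")[0].strip() == option:
            # Option already present: overwrite its value.
            line = f"{option} = true"
            done = True
        out.append(line)
    if not done:
        # The section was the last one in the file, or is missing entirely.
        if not in_section:
            out.append(section)
        out.append(f"{option} = true")
    return "\n".join(out) + "\n"
```

Any change made this way is likely to be overwritten by the next stack update unless the same setting is also carried in the deployment templates.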

Comment 45 Takashi Kajinami 2021-06-07 09:27:25 UTC
So far, setting keystone_authtoken/memcache_use_advanced_pool is a valid workaround,
and it is considered to be the actual fix.

I reported a bug[1] in launchpad against tripleo and submitted a draft patch here[2].
 [1] https://bugs.launchpad.net/tripleo/+bug/1931047
 [2] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795010

I'd appreciate any feedback on that patch, especially on the following points:
- Currently this patch enables advanced_pool for all services, based on the fact
  that advanced_pool is now recommended.

- A new option is added to tht in case a user wants to switch back to the "legacy pool".
  However, considering that advanced_pool is now recommended, maybe
  we can just hard-code the parameter instead.

Comment 46 Takashi Kajinami 2021-06-08 15:19:47 UTC
So far the following template can be used to try enabling memcache_use_advanced_pool for all overcloud services.

~~~
parameter_defaults:
  ControllerExtraConfig:
    aodh::keystone::authtoken::memcache_use_advanced_pool: true
    barbican::keystone::authtoken::memcache_use_advanced_pool: true
    cinder::keystone::authtoken::memcache_use_advanced_pool: true
    glance::api::authtoken::memcache_use_advanced_pool: true
    gnocchi::keystone::authtoken::memcache_use_advanced_pool: true
    heat::keystone::authtoken::memcache_use_advanced_pool: true
    ironic::api::authtoken::memcache_use_advanced_pool: true
    manila::keystone::authtoken::memcache_use_advanced_pool: true
    neutron::keystone::authtoken::memcache_use_advanced_pool: true
    nova::keystone::authtoken::memcache_use_advanced_pool: true
    nova::metadata::novajoin::authtoken::memcache_use_advanced_pool: true
    octavia::keystone::authtoken::memcache_use_advanced_pool: true
    panko::keystone::authtoken::memcache_use_advanced_pool: true
    placement::keystone::authtoken::memcache_use_advanced_pool: true
~~~

Comment 47 Damien Ciabrini 2021-06-25 13:08:24 UTC
*** Bug 1849754 has been marked as a duplicate of this bug. ***

Comment 48 Takashi Kajinami 2021-07-04 13:11:42 UTC
*** Bug 1915700 has been marked as a duplicate of this bug. ***

Comment 63 aruffin@redhat.com 2022-04-07 17:46:51 UTC
Hello,

I am hoping to get a bit of clarity here. Was this patch rolled out into the latest 16.1 as well as 16.2?

Thank you

Comment 69 Takashi Kajinami 2022-08-03 12:05:24 UTC
There was some confusion caused by the past discussion in this bug, so I'm posting this to clear it up.

The parameter we are changing to work around the problem is used by keystonemiddleware,
which is used only by API processes.
Thus the parameter should be set on controller nodes, and there is NO NEED to set it
on compute nodes or any other nodes where no API processes are running.

Please follow https://bugzilla.redhat.com/show_bug.cgi?id=1893205#c46 .
It should be enough to apply the required change.
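
To confirm that the change actually reached a controller's API services, one option is to check each rendered service config for the option. Below is a minimal Python sketch using the standard-library `configparser`. The helper name `uses_advanced_pool` is hypothetical, and it assumes the config file parses as plain INI, which may not hold for every oslo.config file.

```python
import configparser


def uses_advanced_pool(conf_text: str) -> bool:
    """Report whether [keystone_authtoken] memcache_use_advanced_pool is
    enabled in the given config text (hypothetical helper)."""
    # strict=False tolerates repeated keys; interpolation=None keeps
    # literal '%' characters in values from being treated as references.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.getboolean(
        "keystone_authtoken", "memcache_use_advanced_pool", fallback=False
    )
```

A file that lacks the section or the option is reported as `False`, which matches the case where the legacy pool is still in use.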

Comment 70 Luca Miccini 2023-08-11 12:22:42 UTC
Fixed in 16.2; for 16.1, see Takashi's advice regarding a workaround.
