Bug 1646640 - healthcheck_curl() causes massive dentry cache growth
Summary: healthcheck_curl() causes massive dentry cache growth
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z6
Target Release: 13.0 (Queens)
Assignee: Vinay Kapalavai
QA Contact: Leonid Natapov
URL:
Whiteboard:
Duplicates: 1737305
Depends On:
Blocks:
Reported: 2018-11-05 19:32 UTC by Andrew Austin
Modified: 2019-09-17 04:12 UTC (History)
CC List: 12 users

Fixed In Version: openstack-tripleo-common-8.6.6-18.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, `NSS_SDB_USE_CACHE=no` was not set before calling curl in the container health checks, so the `dentry` cache on controller nodes grew continuously. Controller nodes with a large amount of RAM experienced soft lockups when memory pressure forced a reclamation of the extraneous cache entries. With this update, the `NSS_SDB_USE_CACHE=no` environment variable is set before curl runs in the container health check. As a result, the `dentry` cache on controller nodes no longer grows continuously and no longer causes soft lockups.
Clone Of:
Clones: 1685732
Environment:
Last Closed: 2019-04-30 17:27:35 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 620649 'None' MERGED Stops growth of massive dentry cache growth 2020-06-15 02:37:43 UTC
Red Hat Bugzilla 1044666 'medium' 'CLOSED' 'Can curl HTTPS requests make fewer access system calls?' 2019-12-09 07:06:24 UTC
Red Hat Knowledge Base (Solution) 3679411 None None None 2018-11-06 16:24:48 UTC
Red Hat Product Errata RHBA-2019:0939 None None None 2019-04-30 17:27:45 UTC

Description Andrew Austin 2018-11-05 19:32:05 UTC
Description of problem:
Since NSS_SDB_USE_CACHE=no is not set before calling curl in the container health checks, the dentry cache on controller nodes grows continually. On controllers with a larger amount of RAM, this can lead to soft lockups once memory pressure forces a reclamation of the extraneous cache entries.

See RHBZ 1044666 for background.

Version-Release number of selected component (if applicable):
13.0

How reproducible:
Run the following systemtap script (saved as dentry.stap in the results below) on a containerized OSP13 controller:

probe kernel.function("d_alloc").return { log(reverse_path_walk($return)) }

Observe many repeated calls referencing lib/docker/overlay2/<some_id>/diff/etc/pki/nssdb/.<some_number>_dOeSnotExist_.db

Steps to Reproduce:
1. Deploy an OSP13 environment with containerized control plane
2. Run the systemtap script above for 1 minute 
3. Observe many calls for nonexistent NSS DB files
4. Run the following to add NSS_SDB_USE_CACHE=no to the healthcheck function:

 docker ps -q | xargs -I {} docker exec -u root {} sed -i '/^healthcheck_curl/a \ \ export NSS_SDB_USE_CACHE=no' /usr/share/openstack-tripleo-common/healthcheck/common.sh

5. Re-run the systemtap script
6. Observe a large reduction (90%+) in d_alloc calls (dentry allocations) over 1 minute
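
To preview what the sed expression in step 4 does before running it in every container, it can be tried on a scratch file first. This is only a minimal sketch: the function body below is a hypothetical stand-in (the real contents of common.sh are not shown in this report), and the one-line `a` form is assumed to be handled as GNU sed handles it.

```shell
#!/bin/sh
# Scratch file standing in for common.sh. The function body is HYPOTHETICAL;
# only the function name comes from this bug report.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
healthcheck_curl () {
  curl "$@"
}
EOF

# Same sed expression as in step 4: append an export line directly after
# the line that starts with "healthcheck_curl" (GNU sed one-line "a" form).
sed -i '/^healthcheck_curl/a \ \ export NSS_SDB_USE_CACHE=no' "$tmp"

# Show the result; the export should appear on the second line.
cat "$tmp"

# Remove the scratch file when done: rm -f "$tmp"
```

Because the export is placed inside the function, the variable is set each time the health check runs, before curl is invoked.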

Actual results:
[root@ctl01 ~]# stap -o test1.out -T 60 dentry.stap 
[root@ctl01 ~]# wc -l test1.out 
186526 test1.out
[root@ctl01 ~]# grep dOeSnotExist test1.out | wc -l
158649

Expected results:
[root@ctl01 ~]# stap -o test2.out -T 60 dentry.stap 
[root@ctl01 ~]# wc -l test2.out 
17544 test2.out
[root@ctl01 ~]# grep dOeSnotExist test2.out | wc -l
0
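
For reference, the proportions implied by the two runs above can be worked out from the line counts. The sketch below only redoes that arithmetic; the variable names are illustrative and not part of the original report.

```shell
#!/bin/sh
# Line counts copied from the "Actual results" and "Expected results" runs.
total_before=186526   # all d_alloc lines logged in 60 s, before the change
nss_before=158649     # of those, lines matching dOeSnotExist
total_after=17544     # all d_alloc lines logged in 60 s, after the change

# Share of allocations caused by the nonexistent NSS DB lookups.
nss_share=$(LC_ALL=C awk -v n="$nss_before" -v t="$total_before" \
  'BEGIN { printf "%.1f", 100 * n / t }')

# Overall reduction in logged allocations between the two runs.
reduction=$(LC_ALL=C awk -v a="$total_after" -v b="$total_before" \
  'BEGIN { printf "%.1f", 100 * (1 - a / b) }')

echo "NSS lookups before the change: ${nss_share}% of d_alloc calls"
echo "Reduction after the change:    ${reduction}%"
```

This gives roughly 85.1% of allocations attributable to the NSS lookups and a 90.6% overall reduction, consistent with the "90%+" figure in step 6.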

Comment 1 Andrew Austin 2018-11-05 21:45:24 UTC
The customer environment where this behavior was originally noted had approximately 1 GB of dentry cache growth every 10 minutes. After applying the in-place change to /usr/share/openstack-tripleo-common/healthcheck/common.sh, that dropped to less than 10 MB per 10 minutes.
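
One rough way to watch this kind of growth without systemtap is to sample /proc/sys/fs/dentry-state, whose first field is the number of allocated dentries. The sketch below is an assumption about how one might measure it, not the method used for the figures above, and the helper names are hypothetical. It reports dentry counts, not bytes.

```shell
#!/bin/sh
# Print the first field (nr_dentry, allocated dentries) of a dentry-state
# file. Defaults to the kernel's file; a path argument allows testing.
nr_dentry() {
  awk '{ print $1; exit }' "${1:-/proc/sys/fs/dentry-state}"
}

# Growth between two samples: dentry_delta <earlier> <later>
dentry_delta() {
  echo $(( $2 - $1 ))
}

# One-shot reading, only where the file is available (Linux).
if [ -r /proc/sys/fs/dentry-state ]; then
  echo "nr_dentry now: $(nr_dentry)"
fi
```

Taking one reading, waiting 10 minutes, taking a second, and passing both to dentry_delta gives the growth in dentries over that interval, which can be compared before and after the change.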

Comment 24 Leonid Natapov 2019-04-15 10:32:13 UTC
openstack-tripleo-common-8.6.8-3.el7ost.noarch

Comment 26 errata-xmlrpc 2019-04-30 17:27:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0939

Comment 27 Ken Gaillot 2019-09-17 04:12:20 UTC
*** Bug 1737305 has been marked as a duplicate of this bug. ***

