Bug 1646640

Summary: healthcheck_curl() causes massive dentry cache growth
Product: Red Hat OpenStack
Reporter: Andrew Austin <aaustin>
Component: openstack-tripleo-common
Assignee: Vinay Kapalavai <vkapalav>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Docs Contact:
Priority: high
Version: 13.0 (Queens)
CC: amcleod, astupnik, broskos, ccopello, mburns, mmagr, mzheng, pkilambi, ramishra, rzaleski, slinaber, vkapalav
Target Milestone: z6
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-common-8.6.6-18.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, `NSS_SDB_USE_CACHE=no` was not set before calling curl in the container health checks, and the `dentry` cache on controller nodes grew continuously. Controller nodes with a large amount of RAM experienced soft lockups when memory pressure forced reclamation of the extraneous cache entries. With this update, the `NSS_SDB_USE_CACHE=no` environment variable is set before curl runs in the container health check. As a result, the `dentry` cache on controller nodes no longer grows continuously and no longer causes soft lockups.
Story Points: ---
Clone Of:
: 1685732
Environment:
Last Closed: 2019-04-30 17:27:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Andrew Austin 2018-11-05 19:32:05 UTC
Description of problem:
Since NSS_SDB_USE_CACHE=no is not set before calling curl in the container health checks, the dentry cache on controller nodes grows continually. On controllers with a larger amount of RAM, this can lead to soft lockups once memory pressure forces a reclamation of the extraneous cache entries.

See RHBZ 1044666 for background.
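
For illustration, the shape of the fix can be sketched as a health check helper that exports the variable before it invokes curl. This is a minimal sketch, not the shipped code: the function name healthcheck_curl_sketch and the curl flags are assumptions made for the example, and the real healthcheck_curl in common.sh may differ. As I understand the background bug, the setting stops NSS from probing for the nonexistent ".<n>_dOeSnotExist_.db" files during initialization.

```shell
# Hypothetical stand-in for the health check helper. The real function lives
# in /usr/share/openstack-tripleo-common/healthcheck/common.sh and may differ.
healthcheck_curl_sketch () {
  # Set the variable before curl starts so NSS sees it at initialization.
  export NSS_SDB_USE_CACHE=no
  # The flags below are illustrative assumptions, not the shipped options.
  curl --silent --fail --max-time 10 "$@"
}
```

The export sits inside the function body, which is where the sed command in the reproduction steps places it, so every health check that goes through the helper picks it up.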

Version-Release number of selected component (if applicable):
13.0

How reproducible:
Run the following systemtap script on a containerized OSP13 controller:

probe kernel.function("d_alloc").return { log(reverse_path_walk($return)) }

Observe many repeated calls referencing lib/docker/overlay2/<some_id>/diff/etc/pki/nssdb/.<some_number>_dOeSnotExist_.db

Steps to Reproduce:
1. Deploy an OSP13 environment with containerized control plane
2. Run the systemtap script above for 1 minute 
3. Observe many calls for nonexistent NSS DB files
4. Run the following to add NSS_SDB_USE_CACHE=no to the healthcheck function:

 docker ps -q | xargs -I {} docker exec -u root {} sed -i '/^healthcheck_curl/a \ \ export NSS_SDB_USE_CACHE=no' /usr/share/openstack-tripleo-common/healthcheck/common.sh

5. Re-run the systemtap script
6. Observe a large reduction (90%+) in logged d_alloc calls over 1 minute
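
Before running the sed command from step 4 against live containers, it can be tried on a scratch copy. The sketch below assumes GNU sed and uses a made-up sample file in place of the real common.sh; the helper names patch_healthcheck_copy and make_sample_common_sh are hypothetical, and the sample assumes the function's opening brace is on the same line as its name.

```shell
# Apply the sed edit from step 4 to a scratch file instead of a container.
# Requires GNU sed (the -i option and one-line "a" form are GNU extensions).
patch_healthcheck_copy () {
  sed -i '/^healthcheck_curl/a \ \ export NSS_SDB_USE_CACHE=no' "$1"
}

# Write a hypothetical stand-in for common.sh; the real file has more content.
make_sample_common_sh () {
  cat > "$1" <<'EOF'
healthcheck_curl () {
  curl --fail --max-time 10 "$@"
}
EOF
}

# Example usage:
#   tmp=$(mktemp)
#   make_sample_common_sh "$tmp"
#   patch_healthcheck_copy "$tmp"
#   cat "$tmp"      # the export line should follow the function header
#   bash -n "$tmp"  # syntax check of the patched copy
```

Because the sed expression appends after any line that starts with healthcheck_curl, it is worth checking on the real file that only the function definition matches.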

Actual results:
[root@ctl01 ~]# stap -o test1.out -T 60 dentry.stap 
[root@ctl01 ~]# wc -l test1.out 
186526 test1.out
[root@ctl01 ~]# grep dOeSnotExist test1.out | wc -l
158649

Expected results:
[root@ctl01 ~]# stap -o test2.out -T 60 dentry.stap 
[root@ctl01 ~]# wc -l test2.out 
17544 test2.out
[root@ctl01 ~]# grep dOeSnotExist test2.out | wc -l
0
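
The counts above can be turned into percentages with a small helper. This is only arithmetic on the numbers already shown; the function names pct_share and pct_reduction are made up for the example.

```shell
# Share of "part" in "total", as a percentage with two decimals.
pct_share () {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", part * 100 / total }'
}

# Percentage drop from "before" to "after", with two decimals.
pct_reduction () {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", (before - after) * 100 / before }'
}

# Example usage with the counts from test1.out and test2.out:
#   pct_share 158649 186526      # prints 85.05
#   pct_reduction 186526 17544   # prints 90.59
```

So roughly 85% of the d_alloc calls in the first run referenced the nonexistent NSS DB files, and the total number of logged calls fell by about 90.6% after the change.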

Comment 1 Andrew Austin 2018-11-05 21:45:24 UTC
The customer environment where this behavior was originally noted had approximately 1 GB of dentry cache growth every 10 minutes. After applying the in-place change to /usr/share/openstack-tripleo-common/healthcheck/common.sh, that dropped to less than 10 MB per 10 minutes.
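
One rough way to watch for this kind of growth without systemtap is to sample the first field (nr_dentry) of /proc/sys/fs/dentry-state twice and take the difference. This is a sketch under assumptions: the helper names dentry_count and dentry_growth are hypothetical, and the result is a number of dentries, not bytes, so it is not directly comparable to the GB and MB figures above.

```shell
# Print the number of allocated dentries (first field of dentry-state).
# An alternative file can be passed as $1, for example a saved copy.
dentry_count () {
  awk '{ print $1; exit }' "${1:-/proc/sys/fs/dentry-state}"
}

# Print how many dentries were added over an interval given in seconds.
dentry_growth () {
  local interval="$1" file="${2:-/proc/sys/fs/dentry-state}"
  local before after
  before=$(dentry_count "$file")
  sleep "$interval"
  after=$(dentry_count "$file")
  echo $(( after - before ))
}

# Example usage: growth over 10 minutes on the controller
#   dentry_growth 600
```

Running it before and after the change to common.sh should show the same trend as the systemtap counts, although other activity on the node also contributes to the number.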

Comment 24 Leonid Natapov 2019-04-15 10:32:13 UTC
openstack-tripleo-common-8.6.8-3.el7ost.noarch

Comment 26 errata-xmlrpc 2019-04-30 17:27:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0939

Comment 27 Ken Gaillot 2019-09-17 04:12:20 UTC
*** Bug 1737305 has been marked as a duplicate of this bug. ***