1646640 – healthcheck_curl() causes massive dentry cache growth

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-common
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	z6
Target Release:	13.0 (Queens)
Assignee:	Vinay Kapalavai
QA Contact:	Leonid Natapov
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1737305
Depends On:
Blocks:

Reported:	2018-11-05 19:32 UTC by Andrew Austin
Modified:	2022-03-13 16:07 UTC
CC List:	12 users
Fixed In Version:	openstack-tripleo-common-8.6.6-18.el7ost
Doc Type:	Bug Fix
Doc Text:	Previously, `NSS_SDB_USE_CACHE=no` was not set before calling curl in the container health checks, and the `dentry` cache on controller nodes grew continuously. Controller nodes with a large amount of RAM experienced soft lockups when memory pressure forced reclamation of the extraneous cache entries. With this update, the `NSS_SDB_USE_CACHE=no` environment variable is set before curl is executed in the container health check. As a result, the `dentry` cache on controller nodes no longer grows continuously and does not cause soft lockups.
Clone Of:
Clones:	1685732
Environment:
Last Closed:	2019-04-30 17:27:35 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	620649	'None'	MERGED	Stops growth of massive dentry cache growth	2021-02-17 22:16:41 UTC
Red Hat Bugzilla	1044666	medium	CLOSED	Can curl HTTPS requests make fewer access system calls?	2021-02-22 00:41:40 UTC
Red Hat Issue Tracker	OSP-13706	None	None	None	2022-03-13 16:07:31 UTC
Red Hat Knowledge Base (Solution)	3679411	None	None	None	2018-11-06 16:24:48 UTC
Red Hat Product Errata	RHBA-2019:0939	None	None	None	2019-04-30 17:27:45 UTC

Description Andrew Austin 2018-11-05 19:32:05 UTC

Description of problem:
Since NSS_SDB_USE_CACHE=no is not set before calling curl in the container health checks, the dentry cache on controller nodes grows continually. On controllers with a larger amount of RAM, this can lead to soft lockups once memory pressure forces a reclamation of the extraneous cache entries.

See RHBZ 1044666 for background.
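
A minimal sketch of the idea, not the actual contents of common.sh: the function name healthcheck_curl_sketch and the curl options below are assumptions chosen for illustration. The only part taken from this report is that NSS_SDB_USE_CACHE=no is exported before curl runs.

```shell
#!/bin/bash
# Hypothetical stand-in for a container health check helper. The real
# healthcheck_curl in common.sh may use different curl options.
healthcheck_curl_sketch() {
  # Set the variable named in this report before curl is invoked, so the
  # curl process inherits it.
  export NSS_SDB_USE_CACHE=no
  # Illustrative options: fail on HTTP errors, stay quiet, discard the body.
  curl --fail --silent --show-error --output /dev/null --max-time 10 "$@"
}
```

A health check script would call it as `healthcheck_curl_sketch <url>`. Because the variable is exported inside the function, every later curl call in the same shell inherits it as well.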

Version-Release number of selected component (if applicable):
13.0

How reproducible:
Run the following systemtap script on a containerized OSP13 controller:

probe kernel.function("d_alloc").return { log(reverse_path_walk($return)) }

Observe many repeated calls referencing lib/docker/overlay2/<some_id>/diff/etc/pki/nssdb/.<some_number>_dOeSnotExist_.db

Steps to Reproduce:
1. Deploy an OSP13 environment with containerized control plane
2. Run the systemtap script above for 1 minute 
3. Observe many calls for nonexistent NSS DB files
4. Run the following to add NSS_SDB_USE_CACHE=no to the healthcheck function:

 docker ps -q | xargs -I {} docker exec -u root {} sed -i '/^healthcheck_curl/a \ \ export NSS_SDB_USE_CACHE=no' /usr/share/openstack-tripleo-common/healthcheck/common.sh

5. Re-run the systemtap script
6. Observe a large reduction (90%+) in dentry allocations (d_alloc calls) over 1 minute
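
The sed command in step 4 appends a new line each time it runs, so running it twice would add the export twice. The following is a hedged sketch of a guarded variant; the function name add_nss_cache_export is hypothetical, and it assumes GNU sed (as the original one-line `a` form does). Exact leading whitespace of the inserted line may vary between sed versions.

```shell
#!/bin/bash
# Hypothetical helper: add the export after the healthcheck_curl line of the
# given file, unless the file already mentions NSS_SDB_USE_CACHE=no.
add_nss_cache_export() {
  local file="$1"
  # Skip files that were already patched so repeated runs are harmless.
  if grep -q 'NSS_SDB_USE_CACHE=no' "$file"; then
    return 0
  fi
  # Same sed expression as in step 4 (GNU sed one-line append).
  sed -i '/^healthcheck_curl/a \ \ export NSS_SDB_USE_CACHE=no' "$file"
}
```

It can be tried on a scratch copy of common.sh first, for example `add_nss_cache_export /tmp/common.sh`, before touching the file inside a container.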

Actual results:
[root@ctl01 ~]# stap -o test1.out -T 60 dentry.stap 
[root@ctl01 ~]# wc -l test1.out 
186526 test1.out
[root@ctl01 ~]# grep dOeSnotExist test1.out | wc -l
158649

Expected results:
[root@ctl01 ~]# stap -o test2.out -T 60 dentry.stap 
[root@ctl01 ~]# wc -l test2.out 
17544 test2.out
[root@ctl01 ~]# grep dOeSnotExist test2.out | wc -l
0
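
The counts above can be turned into percentages with a small helper. This is only a convenience sketch; percent_of is a hypothetical name, and the inputs are the line counts from the two traces.

```shell
#!/bin/bash
# Print what percentage `part` is of `total`, to one decimal place.
percent_of() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Share of dOeSnotExist lines in the first trace.
percent_of 158649 186526                  # prints 85.1
# Reduction in total trace lines between the first and second run.
percent_of "$((186526 - 17544))" 186526   # prints 90.6
```

So roughly 85% of the lines in the first trace referenced the nonexistent NSS DB files, and the total dropped by about 90.6% after the change, consistent with the "90%+" noted in step 6.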

Comment 1 Andrew Austin 2018-11-05 21:45:24 UTC

The customer environment where this behavior was noted originally had approximately 1 GB of dentry cache growth every 10 minutes. After applying the in-place change to /usr/share/openstack-tripleo-common/healthcheck/common.sh, that dropped to less than 10 MB per 10 minutes.
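
One rough way to watch for this kind of growth is to sample the kernel's dentry counter over an interval. The sketch below is an assumption about how one might measure it, not the method used for the figures above: it reads the first field (nr_dentry) of /proc/sys/fs/dentry-state, so it reports a change in dentry count rather than in bytes. The function names are hypothetical.

```shell
#!/bin/bash
# Print the number of allocated dentries (first field of dentry-state).
# An alternate file can be passed for testing.
dentry_count() {
  awk '{ print $1 }' "${1:-/proc/sys/fs/dentry-state}"
}

# Sample the counter twice, `interval` seconds apart (default 600, that is
# 10 minutes), and print the difference.
dentry_growth() {
  local interval="${1:-600}" file="${2:-/proc/sys/fs/dentry-state}"
  local before after
  before=$(dentry_count "$file")
  sleep "$interval"
  after=$(dentry_count "$file")
  echo "$((after - before))"
}
```

Running `dentry_growth 600` on a controller before and after the change gives a comparable count-based view of the growth rate.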

Comment 24 Leonid Natapov 2019-04-15 10:32:13 UTC

openstack-tripleo-common-8.6.8-3.el7ost.noarch

Comment 26 errata-xmlrpc 2019-04-30 17:27:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0939

Comment 27 Ken Gaillot 2019-09-17 04:12:20 UTC

*** Bug 1737305 has been marked as a duplicate of this bug. ***
