2097811 – certmonger startup very slow using default NSS sqlite database backend [rhel-7.9.z]

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	nss-softokn
Sub Component:
Version:	7.9
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Bob Relyea
QA Contact:	BaseOS QE Security Team
Docs Contact:
URL:
Whiteboard:
Depends On:	2084334 2097816
Blocks:

Reported:	2022-06-16 15:36 UTC by Bob Relyea
Modified:	2022-09-26 15:18 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When dbm databases containing many certificates with private keys are upgraded, the resulting sqlite database becomes extremely slow to access, because it contains unnecessary extra trust objects for these certificates. Consequence: Accessing the resulting sqlite database is extremely slow. Fix: 1) This patch speeds up access to trust objects that do not affect the actual trust values. 2) It fixes dbm so that it no longer creates the extra trust objects for certificates that have private keys. Result: Access to these sqlite databases is now faster. Customers can get even faster results by re-upgrading the databases from the original dbm after the patch has been applied.
Clone Of:	2084334
Environment:
Last Closed:	2022-09-26 15:18:33 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	CRYPTO-7584	0	None	None	None	2022-06-16 16:14:57 UTC
Red Hat Issue Tracker	RHELPLAN-125504	0	None	None	None	2022-06-16 16:15:22 UTC

Description Bob Relyea 2022-06-16 15:36:24 UTC

+++ This bug was initially created as a clone of Bug #2084334 +++

Description of problem:

Background: The nss deployed by rhel8 defaults to the sqlite backend for NSS databases, as opposed to the dbm backend for NSS databases on rhel7. This change affects certmonger, which will consider cert storage of type nssdb to be in sqlite storage format. If it encounters an NSS database for cert/key storage that is in (legacy) dbm format, it performs an automatic migration and uses sqlite storage from then on. 

The challenge: We use certmonger with a remote SCEP CA. To migrate our production certificate management from rhel7 to rhel8.5, we copied

* the directory /var/lib/certmonger
* the NSS database containing approx. 100 keys/certs

from old to new machine. All files are located on virtualized storage (same as in the rhel7 installation), on xfs (also same).

The problem: When starting certmonger.service, service startup time exceeded the system default timeout (120 seconds), and we had to increase it to >1000 seconds to be able to start the service at all. A startup time in the minutes is not acceptable for our certificate management.
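
As a stopgap for the timeout described above, the start timeout can be raised with a systemd drop-in. The following is a minimal sketch, not the reporter's actual configuration: the helper name `write_timeout_dropin` and the file name `timeout.conf` are hypothetical, and the 1000-second value is only taken from the figure mentioned above.

```shell
# Sketch: write a systemd drop-in that raises the start timeout for certmonger.
# The function name and the drop-in file name "timeout.conf" are hypothetical.
write_timeout_dropin() {
    # Default to the usual per-unit override directory; pass another path to test.
    dropin_dir="${1:-/etc/systemd/system/certmonger.service.d}"
    timeout_sec="${2:-1000}"
    mkdir -p "$dropin_dir"
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout_sec" > "$dropin_dir/timeout.conf"
}

# Usage (as root), then reload systemd so the override is picked up:
#   write_timeout_dropin
#   systemctl daemon-reload
#   systemctl start certmonger.service
```

This only hides the symptom; the slow key and certificate lookups remain.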

The workaround: Analyzing the cause of the performance regression (with some help from certmonger devs! thank you!), we found that if we force certmonger to switch to (legacy) dbm storage, performance increases manyfold, to levels comparable with the rhel7 installation. We accomplished this by

1. sed -E -i 's/(cert|key)_storage_location=/&dbm:/' /var/lib/certmonger/requests/*
2. Prepend the nss directory location with "dbm:" when calling "getcert":
   getcert ... -d dbm:<nss directory>
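
The request-file rewrite from step 1 can be sketched as a small helper. This is an illustration under assumptions: the function name `force_dbm_storage` is hypothetical, GNU sed is assumed (for `-i` and `-E`), and the guard that skips lines already carrying the "dbm:" prefix is an addition so the helper can be run twice safely. Try it on a copy of the requests directory first.

```shell
# Sketch: prefix the storage locations in certmonger request files with "dbm:".
# force_dbm_storage is a hypothetical helper; it assumes GNU sed.
force_dbm_storage() {
    requests_dir="$1"
    for f in "$requests_dir"/*; do
        [ -f "$f" ] || continue
        # -E enables the (cert|key) alternation; "&" re-inserts the matched text.
        # Lines that already contain "=dbm:" are left alone (added guard).
        sed -E -i '/(cert|key)_storage_location=dbm:/!s/(cert|key)_storage_location=/&dbm:/' "$f"
    done
}

# Usage, with certmonger stopped:
#   force_dbm_storage /var/lib/certmonger/requests
```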

Version-Release number of selected component (if applicable):

* certmonger-0.79.13-3.el8.x86_64
* nss-3.67.0-7.el8_5.x86_64
* nss-tools-3.67.0-7.el8_5.x86_64

How reproducible:

Always.

Steps to Reproduce:

1. Use certmonger on rhel7 to create an nss database directory containing multiple entries (the startup time grows linearly with the number of requests), like "getcert add-ca ..." and "getcert -c <my-ca> -d <my-nss-dir> -P <my-nss-dir-pin> -I <task-nickname> -n <cert-nickname> -N <subject>"

2. Install certmonger on rhel8, copy /var/lib/certmonger and the nss dir from rhel7. 

3. Set OPTS="-d 4" in /etc/sysconfig/certmonger.

4. Start certmonger.service

Actual results:

Starting certmonger.service will take several seconds per request to look up key and certificate, possibly exceeding systemd's service startup timeout. Initial start is even longer since certmonger will perform an NSS database migration from dbm to sqlite.

Expected results:

For ~ 100 managed requests, certmonger.service startup time should be in the seconds, not in the minutes. 

Additional info:

The underlying problem seems to be with the change of the default database backend in NSS as designed here: https://fedoraproject.org/wiki/Changes/NSSDefaultFileFormatSql where such performance impact was apparently not considered/foreseen.

--- Additional comment from Rob Crittenden on 2022-05-11 19:08:08 PDT ---

The issue seems to be with a database that is migrated from dbm to sqlite.

The attached script generates a self-signed CA and 100 server certificates.

To run it:

$ mkdir /tmp/nssdb
$ bash gencert dbm:/tmp/nssdb
$ echo httptest > /tmp/nssdb/passwd

Listing all the keys takes less than a second:

$ time certutil -K -d dbm:/tmp/nssdb -f /tmp/nssdb/passwd
real    0m0.559s
user    0m0.444s
sys     0m0.086s

Upgrade it to sqlite:

$ certutil -d sql:/tmp/nssdb/ -N -f /tmp/nssdb/passwd  -@ /tmp/nssdb/passwd

Same listing of keys:

$ time certutil -K -d dbm:/tmp/nssdb -f /tmp/nssdb/passwd
real    0m46.905s
user    0m45.400s
sys     0m0.177s

Now if we create the database directly as sqlite the timing is more in line with dbm:

$ mkdir /tmp/nssdb2
$ bash gencert sql:/tmp/nssdb2
$ echo httptest > /tmp/nssdb2/passwd

And list the keys:
$ time certutil -K -d sql:/tmp/nssdb2 -f /tmp/nssdb2/passwd

real    0m0.742s
user    0m0.581s
sys     0m0.032s
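
To compare the two databases side by side, the key listing can be wrapped in a small timing helper. This is only a sketch: `elapsed_seconds` is a hypothetical name, and it reports whole wall-clock seconds, so it is too coarse for the sub-second cases and mainly useful for spotting the slow database.

```shell
# Sketch: report whole wall-clock seconds taken by a command.
# elapsed_seconds is a hypothetical helper, not part of nss-tools.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage against the upgraded and the natively created sqlite databases:
#   elapsed_seconds certutil -K -d sql:/tmp/nssdb  -f /tmp/nssdb/passwd
#   elapsed_seconds certutil -K -d sql:/tmp/nssdb2 -f /tmp/nssdb2/passwd
```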

Also worth mentioning that generating the sqlite database using gencert takes significantly longer than the dbm database. It's plausible that entropy on this VM is simply exhausted.

Reproduced with nss-3.67.0-6.el8_4.x86_64

--- Additional comment from Rob Crittenden on 2022-05-11 19:08:49 PDT ---



--- Additional comment from Rob Crittenden on 2022-05-11 19:09:17 PDT ---

Changing component to nss for review.

--- Additional comment from Bob Relyea on 2022-05-17 07:55:33 PDT ---

This is almost certainly caused by the cache thrashing bug introduced when we added integrity to AES. The issue is that the key for the decrypt and the key for the integrity check are different, and they would throw each other out of the cache, so you ended up doing the PBE for every key. (The issue is seen with databases with large numbers of private keys.) Does this happen on RHEL-9? If not, it should be fixed on the next NSS rebase next month.

--- Additional comment from Bob Relyea on 2022-05-26 14:41:18 PDT ---

Rob, can you verify that this does not happen on fedora (I also think RHEL-9 has the appropriate patches as well).

bob

--- Additional comment from Rob Crittenden on 2022-05-27 06:24:15 PDT ---

dbm isn't allowed in Fedora since I think Fedora 32 or 33.

$ certutil -N -d dbm:/tmp/nssdb
certutil: function failed: SEC_ERROR_LEGACY_DATABASE: The certificate/key database is in an old, unsupported format.

--- Additional comment from Bob Relyea on 2022-05-27 09:25:09 PDT ---

Oh, I thought that it was just that the database on sqlite was being slow. Hmmm. If you copy the dbm-upgraded database to rhel-9 or fedora, is it still slow?

There is an upstream bug that was fixed where if you have 100 or so keys, sqlite was really slow listing them. The fix for this is not in RHEL-8. I wonder why we aren't tripping over this when you create the database in sqlite?

bob

--- Additional comment from Rob Crittenden on 2022-05-27 11:00:19 PDT ---

It takes about 4s to list the 100 keys from the same database using nss-3.71.0-1.fc33.x86_64

$ time certutil -K -d sql:/tmp/nssdb/ -f /tmp/nssdb/passwd
real    0m4.155s
user    0m4.102s
sys     0m0.031s

--- Additional comment from Bob Relyea on 2022-05-27 13:49:25 PDT ---

Thanks Rob. That looks like there may be multiple issues, the main one being cache thrashing.

--- Additional comment from Bob Relyea on 2022-05-27 14:11:28 PDT ---

slight error in comment 1

> Upgrade it to sqlite:
>
> $ certutil -d sql:/tmp/nssdb/ -N -f /tmp/nssdb/passwd  -@ /tmp/nssdb/passwd
>
> Same listing of keys:
>
> $ time certutil -K -d dbm:/tmp/nssdb -f /tmp/nssdb/passwd

This last line should be:

$ time certutil -K -d sql:/tmp/nssdb -f /tmp/nssdb/passwd

--- Additional comment from Bob Relyea on 2022-05-31 12:43:31 PDT ---

> Also worth mentioning that generating the sqlite database using gencert takes significantly longer than the dbm database. 
> It's plausible that entropy on this VM is simply exhausted.

No, keygen against the sql database definitely takes longer. I can see that in both the rhel-8 certutil and my current upstream certutil.

--- Additional comment from Bob Relyea on 2022-05-31 13:57:15 PDT ---

So the issue is that the CERTDB_USER bit in the trust fools the legacydb (dbm) into presenting trust objects that are actually empty trust objects. Since NSS checks the integrity of trust objects if you've logged in (which you have to do to display the keys), it takes quite some time to display each cert.

There are two fixes: 1) We can skip the integrity check if the value we are checking is the value we would default to if there wasn't any trust value (which is what you get when the integrity check fails). This speeds up the listing of databases with these dead trust values by about 10x. 2) Fix dbm to correctly skip cert trust objects with the CERTDB_USER bit and nothing else. This will fix the case that created the bad databases, but won't fix the displaying of the bad databases.

NSS 3.79 shipped today, so it won't be upstreamed in time to patch this there. We'll carry the patch until the next release of NSS.

Comment 18 errata-xmlrpc 2022-09-26 15:18:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (nss, nss-softokn, nss-util, and nspr bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6712
