2084334 – certmonger startup very slow using default NSS sqlite database backend [rhel-8.7.0]


Bug 2084334 - certmonger startup very slow using default NSS sqlite database backend [rhel-8.7.0]


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	nss
Sub Component:
Version:	8.5
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Bob Relyea
QA Contact:	BaseOS QE Security Team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2097811 2097816 2097900

Reported:	2022-05-11 21:56 UTC by Tilman Kranz
Modified:	2023-06-05 16:48 UTC
CC List:	4 users (show)
Fixed In Version:	nss-3.79.0-7.el8_6
Doc Type:	Bug Fix
Doc Text:	Cause: When dbm databases containing many certificates with private keys are upgraded, the resulting sqlite database contains extra, unnecessary trust objects for these certificates. Consequence: Accessing the resulting sqlite database becomes extremely slow. Fix: 1) The patch speeds up access to trust objects that do not affect the actual trust values. 2) dbm no longer creates the extra trust objects for certificates that have private keys. Result: Access to these sqlite databases is now faster. Customers can get still faster results by re-upgrading the databases from the original dbm after the patch has been applied.
Clone Of:
Clones:	2097811 2097816 2097900
Environment:
Last Closed:	2023-06-05 16:45:32 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Attachments
Script to generate a CA and 100 server certificates (3.29 KB, text/plain) 2022-05-12 02:08 UTC, Rob Crittenden

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	CRYPTO-7258	0	None	None	None	2022-05-17 15:03:19 UTC
Red Hat Issue Tracker	RHELPLAN-121780	0	None	None	None	2022-05-11 22:00:20 UTC

Description Tilman Kranz 2022-05-11 21:56:16 UTC

Description of problem:

Background: The nss deployed by rhel8 defaults to the sqlite backend for NSS databases, as opposed to the dbm backend for NSS databases on rhel7. This change affects certmonger, which will consider cert storage of type nssdb to be in sqlite storage format. If it encounters an NSS database for cert/key storage that is in (legacy) dbm format, it performs an automatic migration and uses sqlite storage from then on.

The challenge: We use certmonger with a remote SCEP CA. To migrate our production certificate management from rhel7 to rhel8.5, we copied

* the directory /var/lib/certmonger
* the NSS database containing approx. 100 keys/certs

from old to new machine. All files are located on virtualized storage (same as in the rhel7 installation), on xfs (also same).

The problem: When starting certmonger.service, service startup time exceeded the system default timeout (120 seconds), and we had to increase it to >1000 seconds to be able to start the service at all. A startup time in the minutes is not acceptable for our certificate management.
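The timeout increase mentioned above could be applied with a systemd drop-in. This is only a sketch: TimeoutStartSec is the standard systemd setting for the start timeout, but the value 1200 (chosen to exceed the ">1000 seconds" reported here), the file name timeout.conf, and the helper name write_timeout_dropin are assumptions, not something taken from this report.

```shell
# write_timeout_dropin [DIR]: write a drop-in that raises the start timeout.
# DIR defaults to the usual drop-in directory for certmonger.service.
# The file name and the value 1200 are illustrative assumptions.
write_timeout_dropin() {
    local unit_dir="${1:-/etc/systemd/system/certmonger.service.d}"
    mkdir -p "$unit_dir" || return 1
    cat > "$unit_dir/timeout.conf" <<'EOF'
[Service]
TimeoutStartSec=1200
EOF
}
```

After calling write_timeout_dropin as root, "systemctl daemon-reload" is needed before the new timeout takes effect. This only hides the slow startup; it does not make the key lookups faster.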

The workaround: Analyzing the cause of the performance regression (with some help from certmonger devs! thank you!) we found that if we force certmonger to switch to (legacy) dbm storage, performance improves manyfold, to levels comparable with the rhel7 installation. We accomplished this by

1. sed -i -E 's/(cert|key)_storage_location=/&dbm:/' /var/lib/certmonger/requests/*
2. Prepend the nss directory location with "dbm:" when calling "getcert":
getcert ... -d dbm:<nss directory>
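A slightly more defensive variant of step 1 is sketched below. It assumes, as in this report, that every request file uses NSS database storage. The function name add_dbm_prefix and the guard that skips values already carrying a "dbm:" or "sql:" prefix are additions for illustration, so that running it twice does not produce "dbm:dbm:".

```shell
# add_dbm_prefix DIR: prefix cert/key storage locations in certmonger
# request files with "dbm:". Values that already start with "dbm:" or
# "sql:" are left alone (this guard is an illustrative addition).
# Requires GNU sed (-i and -E).
add_dbm_prefix() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        sed -i -E \
            '/^(cert|key)_storage_location=(dbm|sql):/! s/^(cert|key)_storage_location=/&dbm:/' \
            "$f"
    done
}
```

It would be called as add_dbm_prefix /var/lib/certmonger/requests, presumably with certmonger.service stopped and a backup of the directory at hand.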

Version-Release number of selected component (if applicable):

* certmonger-0.79.13-3.el8.x86_64
* nss-3.67.0-7.el8_5.x86_64
* nss-tools-3.67.0-7.el8_5.x86_64

How reproducible:

Always.

Steps to Reproduce:

1. Use certmonger on rhel7 to create an nss database directory containing multiple entries (the startup time grows linearly with the number of requests), using commands like "getcert add-ca ..." and "getcert -c <my-ca> -d <my-nss-dir> -P <my-nss-dir-pin> -I <task-nickname> -n <cert-nickname> -N <subject>".

2. Install certmonger on rhel8, copy /var/lib/certmonger and the nss dir from rhel7.

3. Set OPTS="-d 4" in /etc/sysconfig/certmonger.

4. Start certmonger.service

Actual results:

Starting certmonger.service will take several seconds per request to look up key and certificate, possibly exceeding systemd's service startup timeout. Initial start is even longer since certmonger will perform an NSS database migration from dbm to sqlite.

Expected results:

For ~ 100 managed requests, certmonger.service startup time should be in the seconds, not in the minutes.

Additional info:

The underlying problem seems to be with the change of the default database backend in NSS as designed here: https://fedoraproject.org/wiki/Changes/NSSDefaultFileFormatSql where such performance impact was apparently not considered/foreseen.

Comment 1 Rob Crittenden 2022-05-12 02:08:08 UTC

The issue seems to be with a database that is migrated from dbm to sqlite.

The attached script generates a self-signed CA and 100 server certificates.

To run it:

$ mkdir /tmp/nssdb
$ bash gencert dbm:/tmp/nssdb
$ echo httptest > /tmp/nssdb/passwd

Listing all the keys takes less than a second:

$ time certutil -K -d dbm:/tmp/nssdb -f /tmp/nssdb/passwd
real    0m0.559s
user    0m0.444s
sys     0m0.086s

Upgrade it to sqlite:

$ certutil -d sql:/tmp/nssdb/ -N -f /tmp/nssdb/passwd  -@ /tmp/nssdb/passwd

Same listing of keys:

$ time certutil -K -d dbm:/tmp/nssdb -f /tmp/nssdb/passwd
real    0m46.905s
user    0m45.400s
sys     0m0.177s

Now if we create the database directly as sqlite the timing is more in line with dbm:

$ mkdir /tmp/nssdb2
$ bash gencert sql:/tmp/nssdb2
$ echo httptest > /tmp/nssdb2/passwd

And list the keys:
$ time certutil -K -d sql:/tmp/nssdb2 -f /tmp/nssdb2/passwd

real    0m0.742s
user    0m0.581s
sys     0m0.032s

Also worth mentioning that generating the sqlite database using gencert takes significantly longer than the dbm database. It's plausible that entropy on this VM is simply exhausted.

Reproduced with nss-3.67.0-6.el8_4.x86_64
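The timing comparison in this comment can be scripted roughly as follows. The helper name elapsed_ms is hypothetical; it relies on GNU date for nanosecond timestamps and only measures wall-clock time, not the user/sys split that "time" reports.

```shell
# elapsed_ms CMD [ARGS...]: run a command, discard its output, and print
# the wall-clock time it took in milliseconds (needs GNU date for %N).
elapsed_ms() {
    local start end
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example use with the databases from this comment (not run here):
# for db in dbm:/tmp/nssdb sql:/tmp/nssdb sql:/tmp/nssdb2; do
#     echo "$db: $(elapsed_ms certutil -K -d "$db" -f "${db#*:}/passwd") ms"
# done
```

The commented loop lists the keys through the legacy dbm files, the upgraded sqlite database, and the natively created sqlite database, which are the three cases compared here.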

Comment 2 Rob Crittenden 2022-05-12 02:08:49 UTC

Created attachment 1878803
Script to generate a CA and 100 server certificates

Comment 3 Rob Crittenden 2022-05-12 02:09:17 UTC

Changing component to nss for review.

Comment 4 Bob Relyea 2022-05-17 14:55:33 UTC

This is almost certainly caused by the cache thrashing bug introduced when we added integrity to AES. The issue is that the key for the decrypt and the key for the integrity check are different, and they would throw each other out of the cache, so you ended up doing the PBE for every key. (The issue is seen with databases with large numbers of private keys.) Does this happen on RHEL-9? If not, it should be fixed in the next NSS rebase next month.

Comment 5 Bob Relyea 2022-05-26 21:41:18 UTC

Rob, can you verify that this does not happen on Fedora? (I think RHEL-9 has the appropriate patches as well.)

bob

Comment 6 Rob Crittenden 2022-05-27 13:24:15 UTC

dbm isn't allowed in Fedora since I think Fedora 32 or 33.

$ certutil -N -d dbm:/tmp/nssdb
certutil: function failed: SEC_ERROR_LEGACY_DATABASE: The certificate/key database is in an old, unsupported format.

Comment 7 Bob Relyea 2022-05-27 16:25:09 UTC

Oh, I thought that it was just that the database on sqlite was being slow. Hmmm. If you copy the dbm-upgraded database to rhel-9 or fedora, is it still slow?

There is an upstream bug that was fixed where if you have 100 or so keys, sqlite was really slow listing them. The fix for this is not in RHEL-8. I wonder why we aren't tripping over this when you create the database in sqlite?

bob

Comment 8 Rob Crittenden 2022-05-27 18:00:19 UTC

It takes about 4s to list the 100 keys from the same database using nss-3.71.0-1.fc33.x86_64

$ time certutil -K -d sql:/tmp/nssdb/ -f /tmp/nssdb/passwd
real    0m4.155s
user    0m4.102s
sys     0m0.031s

Comment 9 Bob Relyea 2022-05-27 20:49:25 UTC

Thanks Rob. That looks like there may be multiple issues, the main one being cache thrashing.

Comment 10 Bob Relyea 2022-05-27 21:11:28 UTC

slight error in comment 1

> Upgrade it to sqlite:
>
> $ certutil -d sql:/tmp/nssdb/ -N -f /tmp/nssdb/passwd  -@ /tmp/nssdb/passwd
>
> Same listing of keys:
>
> $ time certutil -K -d dbm:/tmp/nssdb -f /tmp/nssdb/passwd

This last line should be:

$ time certutil -K -d sql:/tmp/nssdb -f /tmp/nssdb/passwd

Comment 11 Bob Relyea 2022-05-31 19:43:31 UTC

> Also worth mentioning that generating the sqlite database using gencert takes significantly longer than the dbm database. 
> It's plausible that entropy on this VM is simply exhausted.

No, keygen against the sql database definitely takes longer. I can see that in both the rhel-8 certutil and my current upstream certutil.

Comment 12 Bob Relyea 2022-05-31 20:57:15 UTC

So the issue is that the CERTDB_USER bit in the trust fools the legacydb (dbm) into presenting trust objects that are actually empty. Since NSS checks the integrity of trust objects once you have logged in (which you have to do to display the keys), it takes quite some time to display each cert.

There are two fixes: 1) We can skip the integrity check if the value we are checking is the value we would default to if there wasn't any trust value (which is what you get when the integrity check fails). This speeds up the listing of databases with these dead trust values by about 10x. 2) Fix dbm to correctly skip cert trust objects with the CERTDB_USER bit and nothing else. This will fix the case that created the bad databases, but won't fix the displaying of the bad databases.

NSS 3.79 shipped today, so it won't be upstreamed in time to patch this there. We'll carry the patch until the next release of NSS.
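Since the second fix does not repair databases that were already upgraded, one way to benefit from it is to redo the upgrade from the original dbm files once a patched NSS is installed. The sketch below only covers the preparation step, under several assumptions: the sqlite files use the standard NSS names cert9.db and key4.db, the legacy cert8.db and key3.db are still present in the same directory, and the helper name stash_sqlite_db and the backup directory name are hypothetical.

```shell
# stash_sqlite_db DIR: move the sqlite-format files out of the way so the
# legacy dbm files (cert8.db, key3.db) can be upgraded again. Prints the
# backup directory. Refuses to act if the legacy files are missing.
stash_sqlite_db() {
    local dir="$1" backup f
    if [ ! -f "$dir/cert8.db" ] || [ ! -f "$dir/key3.db" ]; then
        echo "no legacy dbm files in $dir; refusing to continue" >&2
        return 1
    fi
    backup="$dir/sqlite-backup.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup" || return 1
    for f in cert9.db key4.db; do
        if [ -f "$dir/$f" ]; then
            mv "$dir/$f" "$backup/" || return 1
        fi
    done
    echo "$backup"
}
```

After that, the upgrade command from comment 1 (certutil -d sql:<dir> -N -f <pwfile> -@ <pwfile>) would be run again with the patched NSS. Any certificates or keys added to the sqlite database after the first upgrade would not be carried over, so this is only reasonable when the dbm files are still the authoritative copy.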

Comment 20 Clemens Lang 2023-06-05 16:45:32 UTC

RHEL 8.7 contains nss-3.79.0-10.el8_6.
