Bug 1994390 - FIPS: deadlock between PID 1 and "modprobe crypto-jitterentropy_rng" at boot, preventing system to boot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Herbert Xu
QA Contact: Linqing Lu
URL:
Whiteboard:
Depends On:
Blocks: 2029365
Reported: 2021-08-17 08:24 UTC by Renaud Métrich
Modified: 2022-05-10 16:01 UTC

Fixed In Version: kernel-4.18.0-355.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2029365 (view as bug list)
Environment:
Last Closed: 2022-05-10 15:06:29 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Gitlab | redhat/rhel/src/kernel rhel-8 merge_requests 1758 | 0 | None | None | None | 2021-12-02 06:19:20 UTC
Red Hat Bugzilla | 1729309 | 1 | None | None | None | 2021-08-17 08:33:23 UTC
Red Hat Issue Tracker | RHELPLAN-93771 | 0 | None | None | None | 2021-08-17 08:25:26 UTC
Red Hat Knowledge Base (Solution) | 4384951 | 0 | None | None | None | 2021-08-17 08:33:23 UTC
Red Hat Product Errata | RHSA-2022:1988 | 0 | None | None | None | 2022-05-10 15:06:58 UTC

Description Renaud Métrich 2021-08-17 08:24:33 UTC
Description of problem:

We already have reports from 2 customers that updating the kernel to 8.2.z, 8.3.z or 8.4.0 leaves the system unable to boot in FIPS mode:
"modprobe crypto-jitterentropy_rng" is hanging indefinitely.

This seems to happen only on specific CPU models from AMD:
 - AMD EPYC 7262 8-Core Processor
 - AMD EPYC 7F72 24-Core Processor

The issue is 100% reproducible on these CPUs, even when booting a QEMU/KVM guest.

The vmcore shows that PID 1 (init/systemd) and "modprobe crypto-jitterentropy_rng" deadlock, with PID 1 owning the crypto_default_rng_lock mutex:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
crash> bt 1
PID: 1      TASK: ffff97004f484740  CPU: 0   COMMAND: "init"
 #0 [ffffa84a00323b10] __schedule at ffffffff96747d64
 #1 [ffffa84a00323ba8] schedule at ffffffff967481d8
 #2 [ffffa84a00323bb8] schedule_timeout at ffffffff9674b916
 #3 [ffffa84a00323c50] wait_for_completion_killable at ffffffff967492b7
 #4 [ffffa84a00323c90] call_usermodehelper_exec at ffffffff95efb065
 #5 [ffffa84a00323cd0] __request_module at ffffffff95f09910
 #6 [ffffa84a00323dc8] crypto_alg_mod_lookup at ffffffff96212366
 #7 [ffffa84a00323df0] crypto_alloc_tfm at ffffffff96212502
 #8 [ffffa84a00323e30] drbg_kcapi_seed at ffffffff962320e9
 #9 [ffffa84a00323ea8] crypto_rng_reset at ffffffff9622deb6
#10 [ffffa84a00323ed8] crypto_get_default_rng at ffffffff9622e07c
#11 [ffffa84a00323ef0] crypto_devrandom_read at ffffffff9622e3ea
#12 [ffffa84a00323f08] __x64_sys_getrandom at ffffffff963b8381
#13 [ffffa84a00323f38] do_syscall_64 at ffffffff95e0420b
#14 [ffffa84a00323f50] entry_SYSCALL_64_after_hwframe at ffffffff968000ad
…

crash> bt 259
PID: 259    TASK: ffff970075ce97c0  CPU: 1   COMMAND: "modprobe"
 #0 [ffffa84a0050bd98] __schedule at ffffffff96747d64
 #1 [ffffa84a0050be30] schedule at ffffffff967481d8
 #2 [ffffa84a0050be40] schedule_preempt_disabled at ffffffff9674851a
 #3 [ffffa84a0050be48] __mutex_lock at ffffffff9674a220
 #4 [ffffa84a0050bed8] crypto_get_default_rng at ffffffff9622e023
 #5 [ffffa84a0050bef0] crypto_devrandom_read at ffffffff9622e3ea
 #6 [ffffa84a0050bf08] __x64_sys_getrandom at ffffffff963b8381
 #7 [ffffa84a0050bf38] do_syscall_64 at ffffffff95e0420b
 #8 [ffffa84a0050bf50] entry_SYSCALL_64_after_hwframe at ffffffff968000ad
…

crash> dis -r ffffffff9622e023
…
0xffffffff9622e016 <crypto_get_default_rng+0x6>:        mov    $0xffffffff975260c0,%rdi
…
0xffffffff9622e01e <crypto_get_default_rng+0xe>:        callq  0xffffffff9674a400 <mutex_lock>
…

crash> sym 0xffffffff975260c0
ffffffff975260c0 (d) crypto_default_rng_lock

crash> mutex.owner crypto_default_rng_lock
  owner = {
    counter = 0xffff97004f484741
  }

crash> ps -m 0xffff97004f484740
[0 00:04:15.421] [UN]  PID: 1      TASK: ffff97004f484740  CPU: 0   COMMAND: "init"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
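
The "mutex.owner" value above (0x...741) differs from the task address of PID 1 (0x...740) only in its low bits. A minimal sketch of how that value can be decoded, assuming the upstream kernel layout where the low three bits of the owner word are flag bits (MUTEX_FLAG_WAITERS, MUTEX_FLAG_HANDOFF, MUTEX_FLAG_PICKUP in kernel/locking/mutex.c); the helper name decode_mutex_owner is hypothetical:

```python
# Flag bits stored in the low bits of mutex.owner (assumed to match
# upstream kernel/locking/mutex.c; not taken from this bug report).
MUTEX_FLAG_WAITERS = 0x01
MUTEX_FLAG_HANDOFF = 0x02
MUTEX_FLAG_PICKUP = 0x04
MUTEX_FLAGS = 0x07


def decode_mutex_owner(counter):
    """Split a raw mutex.owner value into (task_struct address, flag names)."""
    task = counter & ~MUTEX_FLAGS
    names = (
        ("WAITERS", MUTEX_FLAG_WAITERS),
        ("HANDOFF", MUTEX_FLAG_HANDOFF),
        ("PICKUP", MUTEX_FLAG_PICKUP),
    )
    flags = [name for name, bit in names if counter & bit]
    return task, flags


# Value printed by "mutex.owner crypto_default_rng_lock" in the crash session.
task, flags = decode_mutex_owner(0xFFFF97004F484741)
print(hex(task), flags)
```

Under that assumption, the masked address is the task shown by "ps -m" for PID 1, and the set bit indicates that at least one other task (here, modprobe) is waiting on the mutex.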

The vmcore also shows the following log message:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
[    1.113168] jitterentropy: Initialization failed with host not compliant with requirements: 9
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

The latter seems related to commit 4aa229703f32d.

We are currently bisecting the 8.3 kernels with one of the customers and are now testing this exact commit; the result is expected later today.


Version-Release number of selected component (if applicable):

8.2.z (don't have exact version number)
8.3.z (4.18.0-240.15.1.el8_3 fails, 4.18.0-240.10.1.el8_3 ok)
8.4.0+ (4.18.0-305.el8 fails)


How reproducible:

Always

Steps to Reproduce:
1. Boot an AMD EPYC 7F72 or AMD EPYC 7262 system in FIPS mode
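
To check from a running system (for example, one booted on an older kernel) whether it matches the configuration reported here, something like the following sketch could be used. It assumes the usual "model name" field in /proc/cpuinfo and the /proc/sys/crypto/fips_enabled flag; the function names and the model list (taken from this report) are illustrative only and not an exhaustive list of affected CPUs:

```python
# CPU models named in this bug report; other models may be affected too.
REPORTED_MODELS = ("AMD EPYC 7262", "AMD EPYC 7F72", "AMD EPYC 74F3")


def cpu_model(cpuinfo_text):
    """Return the first 'model name' value found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def matches_report(cpuinfo_text, fips_enabled_text):
    """True if FIPS mode is on and the CPU is one of the reported models."""
    model = cpu_model(cpuinfo_text)
    in_fips = fips_enabled_text.strip() == "1"
    return in_fips and model is not None and model.startswith(REPORTED_MODELS)


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        cpuinfo = f.read()
    try:
        with open("/proc/sys/crypto/fips_enabled") as f:
            fips = f.read()
    except FileNotFoundError:
        fips = "0"  # flag file absent: treat as FIPS mode disabled
    print(matches_report(cpuinfo, fips))
```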

Actual results:

Boot hangs, modprobe hanging:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
[  246.752132] INFO: task modprobe:259 blocked for more than 120 seconds.
[  246.753445]       Not tainted 4.18.0-305.el8.x86_64 #1
[  246.753958] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.754760] modprobe        D    0   259      8 0x00000000
[  246.755327] Call Trace:
[  246.755586]  __schedule+0x2c4/0x700
[  246.755942]  schedule+0x38/0xa0
[  246.756272]  schedule_preempt_disabled+0xa/0x10
[  246.756739]  __mutex_lock.isra.6+0x2d0/0x4a0
[  246.757183]  crypto_get_default_rng+0x13/0x90
[  246.757630]  crypto_devrandom_read+0x1a/0x40
[  246.758073]  __x64_sys_getrandom+0x61/0xf0
[  246.758497]  do_syscall_64+0x5b/0x1a0
[  246.758873]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  246.759395] RIP: 0033:0x7f6e60b2852d
[  246.759765] Code: Unable to access opcode bytes at RIP 0x7f6e60b28503.
[  246.760433] RSP: 002b:00007fff9cf2d3a8 EFLAGS: 00000246 ORIG_RAX: 000000000000013e
[  246.761204] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007f6e60b2852d
[  246.761921] RDX: 0000000000000001 RSI: 0000000000000010 RDI: 000055baabe42570
[  246.762652] RBP: 0000000000000002 R08: 0000000000000003 R09: 0000000000000000
[  246.763377] R10: 00000000000000ca R11: 0000000000000246 R12: 000055baabe42520
[  246.764101] R13: 0000000000000001 R14: 00007f6e6126a480 R15: 0000000000000001
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Expected results:

No hang

Additional info:

Comment 2 Renaud Métrich 2021-08-17 13:02:48 UTC
Hello,

We found out through bisecting that the culprit is the following commit:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
commit baf657e49985982cc880be5b08bbd394326226e5 (HEAD)
Author: Vladis Dronov <vdronov>
Date:   Fri Dec 11 20:01:29 2020 -0500

    [crypto] crypto: drbg - always seeded with SP800-90B compliant noise source
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Since there is apparently no available workaround and running prior kernels is not really an option, please work on this asap.


Additionally, from my understanding (please correct me if I'm wrong), the deadlock happens because
- systemd wants a random number
but
- there is no random number facility yet because "jitterentropy" failed to initialize
- any modprobe that would push a random number source would fail due to the lock being already held by systemd

If that's the case, then something smarter must be implemented to make sure there is always a source early during boot, even if that source should be disabled later.
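
The lock dependency visible in the two backtraces can be sketched in user space as follows. This is only an illustration in Python, not kernel code: rng_lock stands in for crypto_default_rng_lock, the "init" thread holds it while waiting for a helper (as __request_module waits for modprobe), and the helper needs the same lock (as modprobe's getrandom() call does). A timeout is added so the sketch terminates; the kernel path has no such timeout, hence the hang.

```python
import threading


def simulate_boot_deadlock(timeout=0.2):
    """Replay the init/modprobe lock dependency with a bounded wait."""
    rng_lock = threading.Lock()  # stand-in for crypto_default_rng_lock
    events = []

    def modprobe():
        # The helper itself asks for random data and needs the same lock.
        if rng_lock.acquire(timeout=timeout):
            rng_lock.release()
            events.append("modprobe: got lock")
        else:
            events.append("modprobe: blocked on lock")

    def init():
        with rng_lock:
            events.append("init: holds lock")
            helper = threading.Thread(target=modprobe)
            helper.start()
            # Wait for the helper while still holding the lock.
            helper.join()
            events.append("init: helper finished")

    t = threading.Thread(target=init)
    t.start()
    t.join()
    return events


print(simulate_boot_deadlock())
```

In the sketch the helper gives up after the timeout, which lets "init" continue; without the timeout both threads would wait on each other forever, which corresponds to the hang described in this report.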

Comment 4 Herbert Xu 2021-08-17 13:50:57 UTC
This is a failure of the Jitter RNG.  I will refer this to the author of Jitter, Stephan Mueller, for his opinion.

Comment 19 Herbert Xu 2021-08-20 12:57:10 UTC
Thanks Renaud, I've relayed the information to Stephan.

Comment 49 Renaud Métrich 2021-12-14 07:50:57 UTC
This also happens on "AMD EPYC 74F3 24-Core Processor"

Comment 54 errata-xmlrpc 2022-05-10 15:06:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1988

