Bug 1778762
Summary: | Please backport Jitter Entropy patches | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Simo Sorce <ssorce> | |
Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> | |
kernel sub component: | Crypto | QA Contact: | Vilém Maršík <vmarsik> | |
Status: | CLOSED ERRATA | Docs Contact: | Khushbu Borole <kborole> | |
Severity: | high | |||
Priority: | high | CC: | bhu, dornelas, dsanzmor, herbert.xu, jklech, jlebon, lilu, miabbott, omosnace, rheinzma, rvr, sferguso, smeisner, ssorce, syangsao, tmraz, vmarsik, walters | |
Version: | 8.2 | Keywords: | TestBlocker, ZStream | |
Target Milestone: | rc | |||
Target Release: | 8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | kernel-4.18.0-198.el8 | Doc Type: | Bug Fix | |
Doc Text: |
.Boot process no longer fails due to lack of entropy
Previously, the boot process failed due to lack of entropy. A better mechanism is now used to allow the kernel to gather entropy early in the boot process, which does not depend on any hardware specific interrupts. This update fixes the problem by ensuring availability of sufficient entropy to secure random generation in early boot. As a result, the fix prevents kickstart timeout or slow boots and the boot process works as expected.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1884682 (view as bug list) | Environment: | ||
Last Closed: | 2020-11-04 00:56:06 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1186913, 1788715, 1819241, 1825061, 1884682 |
Description
Simo Sorce
2019-12-02 13:09:46 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25088771 Test build with 50ee7529ec4500c88f8664560770a7a1b65db72b backported.

Simo, is there any special testing you would like to do on this?

CCing Tomas, who may still have a reference to some pathological cases.

For example, there is bug 1712776, which we were able to reproduce even with RHEL-8.1 snapshots. We did not try 8.1 GA though.

Asking the owners of bz 1712776 to test.

Acking. How do we test this? I can do SanityOnly if we have no better way.

Vilem, there is a proposed test method in bz 1712776, comment 51.

Thanks for the first shot, but it is still not 100% clear:
1. Does this really reproduce everywhere, as Hardware: All at bz 1712776 suggests, or do I need a system with no HW entropy sources (no RDRAND, no TPM)?
2. The slash number mismatch at bz 1712776, comment 51 (https:// vs. https:/// vs. https:/) is a typo, right?
3. So reproducing/verifying this requires running an installation of pre-patch / post-patch RHEL with just one inst.ks=https://... ?
4. And "inst.ks.all" is a part of the workaround, and should not be used for reproducing/verifying?

In response:
1. I'm honestly not sure, though I would expect that, generally speaking, a system with less entropy would hit this more often.
2. I think it has to be a typo, yes.
3. That's my read of the reproducer, correct. By specifying multiple ks files and asking to use them all, we give anaconda multiple opportunities to set up an SSL socket and pull enough entropy out of the kernel to do so.
4. Yes, correct. With the fix in place in the kernel, you should only need one ks line.

This would help OpenShift a lot in general, particularly scenarios like VMware where we aren't currently getting entropy from the hypervisor.
https://github.com/openshift/machine-config-operator/issues/854
https://bugzilla.redhat.com/show_bug.cgi?id=1781902

Colin, if you could please test on either one of those issues above, to ensure that we've resolved the issue, I can post this ASAP.
Sure; it should be pretty easy to reproduce this; just have an initramfs which runs a userspace process that calls getrandom() very early on, and for KVM don't provide the VM the virtio-rng device.

We have several ways we can test this. I was asking if you had a reliable method; could you please do that?

Sorry for the delay on this; first, AFAICS the build linked above https://bugzilla.redhat.com/show_bug.cgi?id=1778762#c1 has been garbage collected? (If CentOS Stream worked we could be doing this there...) I think I figured out how to resubmit the build from the same source: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28229163

Also, a very simple test is to check for "uninitialized urandom read" in the kernel dmesg: https://github.com/openshift/machine-config-operator/issues/854#issuecomment-620216001

You can't resubmit builds like you did unless you have the same commit history in the tree you are submitting from. I've resubmitted the build here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28229801

Colin, please perform whatever tests you feel are needed to confirm that this bug fixes the problem you are seeing.

OK:
```
[root@cosa-devsh ~]# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
* ostree://2d9147852129fd08386ff8c0df6df04f6eccf6e57aa08ce03a7e9b9761b26253
  Version: 45.81.202003032014-0 (2020-03-03T20:18:46Z)
  ReplacedBasePackages: kernel-modules kernel-modules-extra kernel-core kernel 4.18.0-147.5.1.el8_1 -> 4.18.0-159.el8.test
[root@cosa-devsh ~]# dmesg | grep -i 'uninitialized urandom'
[ 1.203431] random: modprobe: uninitialized urandom read (16 bytes read)
[ 1.205217] random: modprobe: uninitialized urandom read (16 bytes read)
[ 1.217429] random: modprobe: uninitialized urandom read (16 bytes read)
[root@cosa-devsh ~]#
```
So not what I was expecting to see, but then I just re-read the code carefully and I was wrong; we of course don't block for urandom, so we'll see this message still.
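The dmesg check suggested above can be scripted. What follows is only a minimal sketch: the function name `uninitialized_urandom_reads` and the regular expression are illustrative assumptions, written against the three warning lines quoted in this bug rather than against any documented kernel log format.

```python
import re

# Assumed shape of the warning, taken from the lines quoted in this bug:
# "[ 1.203431] random: modprobe: uninitialized urandom read (16 bytes read)"
_WARNING = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+random:\s+(?P<proc>[^:]+):\s+"
    r"uninitialized urandom read \((?P<nbytes>\d+) bytes read\)"
)


def uninitialized_urandom_reads(dmesg_text):
    """Return a (seconds_since_boot, process, bytes_read) tuple per warning."""
    return [
        (float(m.group("ts")), m.group("proc"), int(m.group("nbytes")))
        for m in _WARNING.finditer(dmesg_text)
    ]


# Sample input: the dmesg lines reported in this bug.
SAMPLE = """\
[ 1.203431] random: modprobe: uninitialized urandom read (16 bytes read)
[ 1.205217] random: modprobe: uninitialized urandom read (16 bytes read)
[ 1.217429] random: modprobe: uninitialized urandom read (16 bytes read)
"""

reads = uninitialized_urandom_reads(SAMPLE)
for ts, proc, nbytes in reads:
    print(f"{ts:>10.6f}s  {proc}  {nbytes} bytes")
```

As noted in the comment above, these warnings come from non-blocking urandom reads, so a non-empty result does not by itself show that the jitter entropy patch is missing.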
The blocking is only for /dev/random - I am not sure we have much in our initramfs using /dev/random (or

But that led me to investigate why we're not seeing any of that in Fedora CoreOS, and then I realized the other factor that's quite different between Fedora and RHEL8 (even with this patch):

[root@fedora-coreos ~]# grep RANDOM_TRUST_CPU /usr/lib/modules/5.5.15-200.fc31.x86_64/config
CONFIG_RANDOM_TRUST_CPU=y
[root@rhel-coreos ~]# grep TRUST /usr/lib/modules/4.18.0-159.el8.test.x86_64/config
# CONFIG_RANDOM_TRUST_CPU is not set

Which is a big factor here, I'm sure; it's really common to have RDRAND hardware nowadays. Was backporting that discussed too?

But, let me see if I can get a reproducer scenario again.

OK, I wrote this script, which runs qemu with an emulated SandyBridge (pre-RDRAND) processor: https://github.com/cgwalters/playground/blob/3906daab8a759ce1c825223381bdcb284e00f00f/run-coreos-hostile-entropy

Which I'm invoking like this so I can get a shell:

~/src/github/cgwalters/playground/run-coreos-hostile-entropy /srv/walters/rhcos-4.4/builds/latest/x86_64/rhcos-45.82.202004282336-0-qemu.x86_64.qcow2 ~/src/github/cgwalters/playground/fcct/autologin.ign2

The script is basically a one-liner:

qemu-kvm -m 4096 -cpu SandyBridge -drive if=virtio,file=${disk},snapshot=on -fw_cfg name=opt/com.coreos/config,file=${config} "$@"

And of that, really all you need to do is run qemu with `-cpu SandyBridge` on a booted OS. It doesn't have to be Fedora/RHEL CoreOS, and doesn't need to be Anaconda - it seems totally valid to me to test this by just replacing the kernel in a booted system, rm /var/lib/systemd/random-seed, then reboot.

Note that AIUI it's not uncommon in datacenters that have modern hardware to end up doing a "lowest common denominator" processor type so they can do live migration. I wouldn't be surprised if the VMware boot hangs we've seen reported are something similar.
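The CONFIG_RANDOM_TRUST_CPU comparison between the Fedora and RHEL kernels could also be done programmatically. The sketch below is an illustration only: `kconfig_value` is a hypothetical helper, and it assumes the two conventional config-file line shapes seen in the grep output above (`OPTION=value` and `# OPTION is not set`).

```python
def kconfig_value(config_text, option):
    """Return the value of a kernel config option, or None if unset or absent.

    Assumes the two line shapes seen in this bug:
      CONFIG_RANDOM_TRUST_CPU=y
      # CONFIG_RANDOM_TRUST_CPU is not set
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Sample inputs copied from the grep output quoted in this bug.
FEDORA_CONFIG = "CONFIG_RANDOM_TRUST_CPU=y\n"
RHEL_CONFIG = "# CONFIG_RANDOM_TRUST_CPU is not set\n"

for name, text in (("fedora-coreos", FEDORA_CONFIG), ("rhel-coreos", RHEL_CONFIG)):
    value = kconfig_value(text, "CONFIG_RANDOM_TRUST_CPU")
    print(f"{name}: CONFIG_RANDOM_TRUST_CPU = {value!r}")
```

On a real system the text would come from a file such as /usr/lib/modules/<version>/config, as in the grep commands above.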
Running hostile-entropy on latest FCOS, it seems to boot OK, although there are some interesting warnings like:

Apr 28 23:59:51 localhost NetworkManager[859]: <warn> [1588118391.9351] secret-key: failure to generate good random data for secret-key (use non-persistent key)

Plus we see crng init 12 seconds after boot:

[ 12.818853] random: crng init done

Running hostile-entropy on the latest RHCOS (without the patched kernel here) I see "random: crng init done" a full 26 seconds after boot, which shows just how bad entropy generation is without RDRAND. Also notable is that at 0 seconds:

random: get_random_u64 called from cache_random_seq_create with crng_init=0

A quick git grep shows that as the kernel slab allocator initialization; see that in all cases.

Anyways, finally to the point here now that we have a "baseline":

Trying this kernel...the boot just hangs right after "Probing EDD (edd=off to disable)...ok"

I haven't tried to debug this, but my guess is that it's something like slab init calling try_to_generate_entropy() calling into the timer code which calls back into slab or something?

And to eliminate all other variables here, simply unpacking the RPM and doing:

qemu-kvm -cpu host -kernel lib/modules/4.18.0-159.el8.test.x86_64/vmlinuz

works fine. But this hangs with the same symptoms:

qemu-kvm -cpu SandyBridge -kernel lib/modules/4.18.0-159.el8.test.x86_64/vmlinuz

And just to double check, grabbing a random recent RHEL8 kernel build from brew: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1179819 also works fine with -cpu SandyBridge.

That's....really early in the boot process, before we've even switched out of protected mode. We should have at least seen the linux banner and kernel command line there before anything started touching try_to_generate_entropy. I'll try to recreate it.

hmmm, wow, so I managed to reproduce it.
With some earlyprintk logging, I managed to find the problem:

PANIC: early exception 0x06 IP 10:ffffffff8a223e7a error 0 cr2 0xffff8a8ee0c01000
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Tainted: G W --------- - - 4.18.0+ #1
[ 0.000000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ 0.000000] RIP: 0010:bug_at+0x1d/0x23
[ 0.000000] Code: 9c 01 01 00 00 00 eb d9 e8 93 d1 08 00 e8 ae db 9d 00 41 89 f0 48 89 f9 48 89 fa 48 89 fe 48 c7 c7 78 a5 27 8b e8 94 60 0f 00 <0f> 0b 90 90 90 90 e8 8b db 9d 00 48 8b 05 c8 51 3e 01 f6 c4 02 74
[ 0.000000] RSP: 0000:ffffffff8b403ed0 EFLAGS: 00010046 ORIG_RAX: 0000000000000000
[ 0.000000] RAX: 0000000000000071 RBX: ffffffff8b5e1b90 RCX: ffffffff8b45ac28
[ 0.000000] RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000046
[ 0.000000] RBP: ffffffff8bbf0f50 R08: 0000000000000091 R09: 0000000000000080
[ 0.000000] R10: 6562616c5f706d75 R11: 6c61746146203a6c R12: ffffffff8bbf0f50
[ 0.000000] R13: ffff8a8efffad140 R14: 0000000000000000 R15: 0000000000000000
[ 0.000000] FS: 0000000000000000(0000) GS:ffff8a8effc00000(0000) knlGS:0000000000000000
[ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.000000] CR2: ffff8a8ee0c01000 CR3: 000000005fe0a000 CR4: 00000000000406b0
[ 0.000000] Call Trace:
[ 0.000000] __jump_label_transform.isra.0+0x62/0x150
[ 0.000000] jump_label_init+0x9b/0xda
[ 0.000000] start_kernel+0x241/0x55b
[ 0.000000] ? load_ucode_bsp+0x42/0x12e
[ 0.000000] secondary_startup_64+0xb7/0xc0

The problem is that I don't seem to see any relation of this to the introduction of the jitter entropy source. There is a prior warning about the __use_tsc static key being used before its jump_label_init is called, and I have a vague recollection about that being a bug that was fixed awhile ago, so I'm wondering if I didn't branch this feature from a kernel prior to that getting fixed.
I'm going to rebase the patch to the latest RHEL8 head and see if the problem persists.

Yup, rebasing to the current head of the RHEL8.2 tree seems to have corrected the problem, at least for me. Here's a new build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28280629

Colin, if you could confirm with your reproducer asap, I would appreciate it. Then I can post this.

OK, so this is a *dramatic* improvement in the simple "boot RHCOS in qemu with -cpu Skylake and no virtio-rng" scenario. Without this, I see:

[ 101.679027] random: crng init done

And hence SSH key generation slows down the bootup by over a minute. With this patch:

[ 15.294050] random: crng init done

Which is much, much better. Another good view is:

Before:

# systemctl status sshd-keygen@rsa
...
Apr 30 17:53:41 localhost systemd[1]: Starting OpenSSH rsa Server Key Generation...
Apr 30 17:55:12 localhost systemd[1]: Started OpenSSH rsa Server Key Generation.

After:

Apr 30 17:54:40 localhost systemd[1]: Starting OpenSSH rsa Server Key Generation
Apr 30 17:54:42 localhost systemd[1]: Started OpenSSH rsa Server Key Generation.

i.e. in the first case RSA keygen took a minute and a half! With this, just 2 seconds.

I haven't tried to reproduce the OpenShift-in-VSphere scenarios, but with even more demands on entropy I am sure this will help a lot. It's basically the same thing - we may not have CPU entropy and we have no hypervisor entropy (like virtio-rng).

So ship it!

BUT: I would like to open discussion of also enabling CONFIG_RANDOM_TRUST_CPU per https://bugzilla.redhat.com/show_bug.cgi?id=1778762#c17 - should we roll that into this bug or do you want it separate? Because even today in a top-tier case like AWS, we don't have hypervisor entropy, and a lot of modern machines do have RDRAND but still hit crng init about 12 seconds after boot because we're not trusting the CPU.
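The before/after figures quoted above can be checked with a few lines of arithmetic. This is a rough sketch, not part of the original test procedure: both helper names are hypothetical, and the year 2020 is an assumption added only because the journal timestamps omit one.

```python
import re
from datetime import datetime


def crng_init_seconds(dmesg_text):
    """Seconds since boot at which 'random: crng init done' was logged, or None."""
    m = re.search(r"\[\s*(\d+\.\d+)\]\s+random: crng init done", dmesg_text)
    return float(m.group(1)) if m else None


def journal_elapsed(start, end, year=2020):
    """Seconds between two 'Apr 30 17:53:41'-style stamps.

    The journal lines carry no year, so one is assumed (2020 by default).
    """
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return (t1 - t0).total_seconds()


# Values copied from the comment above.
before_init = crng_init_seconds("[ 101.679027] random: crng init done")
after_init = crng_init_seconds("[ 15.294050] random: crng init done")
before_keygen = journal_elapsed("Apr 30 17:53:41", "Apr 30 17:55:12")
after_keygen = journal_elapsed("Apr 30 17:54:40", "Apr 30 17:54:42")

print(f"crng init:  {before_init:.1f}s -> {after_init:.1f}s")
print(f"RSA keygen: {before_keygen:.0f}s -> {after_keygen:.0f}s")
```

With the quoted values, RSA key generation takes 91 seconds before the patch and 2 seconds after, which matches the "minute and a half" versus "just 2 seconds" summary.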
That should be a separate bug, though looking at the setup for that code, you really shouldn't need to configure it on at build time. It's equivalent to adding trust_cpu=true to the kernel command line, without any changes to build-time configuration.

I'll post this shortly.

OK, filed as https://bugzilla.redhat.com/show_bug.cgi?id=1830280

*** Bug 1781902 has been marked as a duplicate of this bug. ***

I have a public bug dup'd against this, and I think something in Bugzilla changed to default to private bugs recently. There's nothing confidential I can see here, and I'd like to make the bug public so customers and the community can track progress. Any objections to lifting the "redhat" group field?

*** Bug 1833335 has been marked as a duplicate of this bug. ***

Patch(es) available on kernel-4.18.0-198.el8

I am having difficulties reproducing this on available HW, have failed to prepare a suitable VM so far, and am running out of time. Passed SanityOnly checks: the patch is implemented and looks sane, and the rngd stack is working. Closing as SanityOnly for now.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:4431