Description of problem:
Installing a cluster using 4.5.0-0.nightly-2020-05-08-075620. When any host is configured to enable TPM disk encryption, encryption succeeds on the bootstrap RHCOS version 44.81.202004250133-0, but after the host is automatically updated to 45.81.202005080327-0 (the version required by the MCO), it never comes up.

Attached:
- console log for booting on 44.81.202004250133-0
- journal log during upgrade from 44.81.202004250133-0 to 45.81.202005080327-0

Version-Release number of selected component (if applicable):
Installer: 4.5.0-0.nightly-2020-05-08-075620
Bootstrap RHCOS Version: 44.81.202004250133-0
MCO RHCOS Version: 45.81.202005080327-0

How reproducible:

Steps to Reproduce:
1. Create manifest to enable TPM encryption
2. Install cluster on baremetal
3. Watch for host installation

Actual results:
Installation is completed. The server boots the bootstrap RHCOS version and encrypts the disk, then reboots into the MCO RHCOS version, but that boot never finishes.

Expected results:
Server boots on MCO RHCOS version with disk encrypted

Additional info:
Ahh OK, I see you have console logs for the bootstrap boot, so I'm guessing this is *not* on Packet, right? Does this reproduce in a VM with a virtual TPM? Will try it out.
OK, I'm fairly confident this is another entropy issue. The logs here are a telltale sign:

```
May 08 14:13:30 localhost systemd[1]: Starting dracut pre-mount hook...
May 08 14:13:30 localhost coreos-cryptfs[1275]: coreos-cryptfs: /dev/sdc4 is configured for Clevis pin 'tpm2'
May 08 14:13:30 localhost systemd[1]: Started dracut pre-mount hook.
May 08 14:14:56 localhost systemd-journald[664]: Missed 3 kernel messages
May 08 14:14:56 localhost kernel: random: crng init done
May 08 14:14:56 localhost kernel: random: 7 urandom warning(s) missed due to ratelimiting
May 08 14:14:58 localhost systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start timed out.
May 08 14:14:58 localhost systemd[1]: Timed out waiting for device dev-disk-by\x2dlabel-root.device.
```

Note the `random: crng init done` message, which happens a full 1m30s later. One can easily reproduce this locally by provisioning a VM without a virtio-rng device and turning on TPM2 encryption. It does not reproduce with a virtio-rng device attached.

*** This bug has been marked as a duplicate of bug 1778762 ***
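For anyone triaging similar boot hangs, the gap described above can be measured straight from a saved journal excerpt. The following is only a rough sketch, not tooling used in this bug: the helper names (`parse_ts`, `first_ts`, `crng_delay_seconds`) are made up for illustration, and it assumes the short journal timestamp format shown in the log, an English month abbreviation, and the year 2020.

```python
from datetime import datetime

# Excerpt copied from the journal log quoted in this bug.
LOG = """\
May 08 14:13:30 localhost systemd[1]: Starting dracut pre-mount hook...
May 08 14:13:30 localhost coreos-cryptfs[1275]: coreos-cryptfs: /dev/sdc4 is configured for Clevis pin 'tpm2'
May 08 14:13:30 localhost systemd[1]: Started dracut pre-mount hook.
May 08 14:14:56 localhost systemd-journald[664]: Missed 3 kernel messages
May 08 14:14:56 localhost kernel: random: crng init done
May 08 14:14:56 localhost kernel: random: 7 urandom warning(s) missed due to ratelimiting
"""


def parse_ts(line, year=2020):
    """Parse the leading 'Mon DD HH:MM:SS' stamp (the year is an assumption)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def first_ts(lines, needle):
    """Return the timestamp of the first line containing needle, or None."""
    for line in lines:
        if needle in line:
            return parse_ts(line)
    return None


def crng_delay_seconds(log_text):
    """Seconds between the first coreos-cryptfs message and 'crng init done'."""
    lines = [l for l in log_text.splitlines() if l.strip()]
    start = first_ts(lines, "coreos-cryptfs")
    done = first_ts(lines, "crng init done")
    if start is None or done is None:
        return None
    return (done - start).total_seconds()


print(crng_delay_seconds(LOG))  # 86.0 for the excerpt above
```

A delay this close to the 1m30s device timeout suggests the unlock was blocked waiting for the kernel RNG to initialize, which matches the entropy explanation.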
Note we merged https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/937 into master, which is targeting 4.6. Once that's in an ART build, we can sanity check that it helps with TPM testing on Packet and backport to 4.5 and 4.4.
Let's re-open this as a tracker for the OCP issue, so that it is not missing from the OCP 4.5 release blocker list. Once https://bugzilla.redhat.com/show_bug.cgi?id=1778762 is fixed, OpenShift QE will check this bug and verify it.
Working on this. Nodes fail to boot on 45.81.202005200134-0. Only after adding the following parameter to the grub linux line does encryption complete and the host boot correctly: `random.trust_cpu=on`
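To confirm on a booted host that the parameter actually took effect, one option is to look at the kernel command line and the entropy counter. This is a minimal sketch rather than part of the verification done here: `has_kernel_arg` and `read_text` are hypothetical helpers, and the script only reads the standard Linux files `/proc/cmdline` and `/proc/sys/kernel/random/entropy_avail`, reporting `None` where they are unavailable.

```python
from pathlib import Path
from typing import Optional


def has_kernel_arg(cmdline: str, arg: str) -> bool:
    """True if arg appears as a whole token on the kernel command line."""
    return arg in cmdline.split()


def read_text(path: str) -> Optional[str]:
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


cmdline = read_text("/proc/cmdline")
entropy = read_text("/proc/sys/kernel/random/entropy_avail")

if cmdline is None:
    print("could not read /proc/cmdline (not a Linux host?)")
else:
    print("random.trust_cpu=on present:", has_kernel_arg(cmdline, "random.trust_cpu=on"))
print("entropy_avail:", entropy)
```

Matching on whole tokens avoids a false positive from a similar-looking argument such as `random.trust_cpu=off`.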
I can confirm that 4.5.0-0.nightly-2020-05-22-111153, using a recent RHCOS version, can complete a TPM-enabled installation. You can mark this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days