Created attachment 1663130 [details] virt-console-log.txt Description of problem: CRC team use libvirt provider to create a single node openshift cluster and the shutdown the VM after cluster is created. This VM disk imaage used on different platform to run openshift locally. Generating the disk image part of: https://github.com/code-ready/snc/blob/master/createdisk.sh From 4.3 time frame our CentOS CI experiencing lot of failure because this disk image not able to boot correctly on the CI node and it fails with `coreos-luks-open.service: Main process exited, code=killed, status=15/TERM error` error Version-Release number of selected component (if applicable): ``` $ cat /etc/os-release NAME="Red Hat Enterprise Linux CoreOS" VERSION="43.81.202002032142.0" VERSION_ID="4.3" OPENSHIFT_VERSION="4.3" RHEL_VERSION=8.0 PRETTY_NAME="Red Hat Enterprise Linux CoreOS 43.81.202002032142.0 (Ootpa)" ID="rhcos" ID_LIKE="rhel fedora" ANSI_COLOR="0;31" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform" REDHAT_BUGZILLA_PRODUCT_VERSION="4.3" REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform" REDHAT_SUPPORT_PRODUCT_VERSION="4.3" OSTREE_VERSION='43.81.202002032142.0' ``` How reproducible: Running the CI job with created image is always fails on the centos CI. Actual results: [ 12.510624] loop: module loaded [ 12.640185] coreos-cryptfs[817]: coreos-cryptfs: /dev/vda4 uses the 'cipher_null-ecb' [ 12.644748] coreos-cryptfs[817]: coreos-cryptfs: /dev/vda4 is not encrypted. Device will be mounted as device-mapper linear target. [ 13.000541] vda: vda1 vda2 vda3 vda4 [ 46.576110] random: crng init done [* 46.581526] random: 7 urandom warning(s) missed due to ratelimiting [ *** ] (1 of 2) A start job is running for…S LUKS Opener (1min 6s / no limit)[ 73.326302] systemd-udevd[506]: seq 1511 '/devices/pci0000:00/0000:00:05.0/virtio2/block/vda' is taking a long time [ TIME ] Timed out waiting for device dev-disk-by\x2dlabel-root.device. [DEPEND] Dependency failed for CoreOS: Mount (subsequent) /sysroot. [ 96.597431] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start timed out. Starting OSTree Prepare OS/... [ 96.612372] systemd[1]: Timed out waiting for device dev-disk-by\x2dlabel-root.device. [ 96.621701] systemd[1]: Dependency failed for CoreOS: Mount (subsequent) /sysroot. [ 96.629609] systemd[1]: ignition-ostree-mount-subsequent-sysroot.service: Job ignition-ostree-mount-subsequent-sysroot.service/start failed with result 'dependency'. [ 96.643539] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start failed with result 'timeout'. [ 96.654760] ostree-prepare-root[887]: ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.1/rhcos/359d8bc4f6e23689c281de46db291779a6261f8339e5e01c1b37ead9f1986e78/0': No such file or directory [FAILED] Failed to start OSTree Prepare OS/. Expected results: Should able to run successfully on this platform. Additional info: ``` # lshw -short H/W path Device Class Description ==================================================== system SM15000-64-CC-AA-1Ox1 (830-7367-03) /0 bus AMD Server CRB /0/0 memory 128KiB BIOS /0/4 processor AMD Opteron(tm) Processor 4365 EE /0/4/a memory 80KiB L1 cache /0/4/b memory 2MiB L2 cache /0/4/c memory 8MiB L3 cache /0/12 memory 32GiB System Memory /0/12/0 memory 8GiB DDR3 Synchronous 1333 MHz (0.8 ns) /0/12/1 memory 8GiB DDR3 Synchronous 1333 MHz (0.8 ns) /0/12/2 memory 8GiB DDR3 Synchronous 1333 MHz (0.8 ns) /0/12/3 memory 8GiB DDR3 Synchronous 1333 MHz (0.8 ns) /0/100 bridge RD890S/SR5650 Host Bridge /0/100/2 bridge RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GFX port 0) /0/100/2/0 scsi0 storage 82801IB (ICH9) 4 port SATA Controller [AHCI mode] /0/100/2/0/0.0.0 /dev/sda disk 239GB SMvD14250C548CC1 /0/100/2/0/0.0.0/1 volume 500MiB EXT4 volume /0/100/2/0/0.0.0/2 volume 222GiB Linux LVM Physical Volume partition /0/100/2/0.1 eth0 network 82540EM Gigabit Ethernet Controller /0/100/3 bridge RD890/RD9x0 PCI to PCI bridge (PCI Express GFX port 1) /0/100/3/0 storage 82801IB (ICH9) 4 port SATA Controller [AHCI mode] /0/100/3/0.1 eth1 network 82540EM Gigabit Ethernet Controller /0/100/4 bridge RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 0) /0/100/4/0 storage 82801IB (ICH9) 4 port SATA Controller [AHCI mode] /0/100/4/0.1 eth2 network 82540EM Gigabit Ethernet Controller ```
Hmm, do you have a link to the postprocessed image to try out? The postprocessing workflow might be conflicting with the new LUKS container approach in 4.3. > Running the CI job with created image is always fails on the centos CI. Chatting on Slack, I had the impression this didn't always happen. Is that not the case? Is it a consistent success/failure for a *specific* postprocessed image, or will the same image sometimes boot and sometimes not?
I believe that this is a problem with Hyperkit. I downloaded tarball and confirmed that it boots correctly. However, if I remove the VirtIO RNG device, then it boots _slowly_ and failed the time out in one case. Is it possible to include a `virto-rng` device?
Assigning this to myself.
> I believe that this is a problem with Hyperkit. I downloaded tarball and confirmed that it boots correctly. However, if I remove the VirtIO RNG device, then it boots _slowly_ and failed the time out in one case. We didn't noticed any failure on the hyperkit side yet but on the CentosCI as I mentioned in the bug. -
We are unable to test this on 4.4 because of https://bugzilla.redhat.com/show_bug.cgi?id=1805034 and this does not seem to have been backported to 4.3, so we cannot provide useful testing on the proposed fix.
Merged to 4.3.
Pushing this back to POST until it lands in a 4.3.z build. Then we'll let the errata sweeper pick it up once it is in MODIFIED again.
Ah dang it...I thought this was targeted for 4.3...unwinding my changes
Booting up the latest RHCOS 4.4 shows that the changes to how `dm_crypt` are present. But it would be best to get the reporter to confirm the fix is successful. ``` $ rpm-ostree status State: idle AutomaticUpdates: disabled Deployments: * ostree://558f27e963ce5314dbe005b5cacef5f230ee111209ad3037a801e7a4e88d407d Version: 44.81.202003130330-0 (2020-03-13T03:36:02Z) $ grep -C5 load_kmods /usr/libexec/coreos-cryptfs fatal() { msg ${@}; exit 1; } load_kmods() { # Load kernel modules only when required. In BZ# 1803130, # it was discovered that virtual hardware that lacks an entrophy # source fails to boot. Ensure that we only load the required modules # if needed. modprobe -a dm_crypt loop -- # Determine if the volume is clevis managed or automatic unlock. # The automatic clevis hooks will NOT work since they rely on the systemd # crypttab generator. local clevis_id="$(cryptsetup luksDump ${dev} | sed -rn 's|^\s+([0-9]+): clevis|\1|p')" if [ -n "${clevis_id}" ]; then load_kmods local pin=$(cryptsetup token export --token-id "${cleivs_id}" "${dev}" \ | jq -rM '.jwe.protected' | base64 -d | jq -rM '.clevis.pin') msg "${dev} is configured for Clevis pin '${pin}'" [ "${pin}" == "tpm2" ] || net_waiter clevis-luks-unlock -d "${dev}" -- if [ -z "${pin}" ]; then msg "No Clevis pin provided, encryption is not enabled" exit 0 fi load_kmods validate_pin "${pin}" msg "detected pin=${pin}" # Strip off the build token that is set via the CoreOS installer. token_rm "${dev}" ```
PTAL Praveen
@Steve at the moment we are block by https://bugzilla.redhat.com/show_bug.cgi?id=1805034 one to generated a 4.4 disk image, will test and report back once we have that fixed.
Moving to 4.5 since the dependent BZ is targeted to 4.5.
@praveen, it looks like https://bugzilla.redhat.com/show_bug.cgi?id=1805034 is verified. Can you verify that this fix works for you on a ocp 4.4 nightly?
@Michael On the CentOS CI side we are still getting the same issue, I tested the 4.4.0-rc.11 and still same. ``` [** ] A stop job is running for CoreOS LUKS Opener (33s / 1min 29s)[ 129.254905] systemd[1]: coreos-luks-open.service: Failed with result 'signal'. [ OK ] Stopped CoreOS LUKS Opener. [ 129.261846] systemd[1]: Stopped CoreOS LUKS Opener. [ OK ] Stopped target Initrd Root File System. [ 129.266796] systemd[1]: coreos-luks-open.service: Triggering OnFailure= dependencies. Starting Setup Virtual Console... [ 129.272759] systemd[1]: Stopped target Initrd Root File System. [ OK ] Stopped target Initrd File Systems. [ OK ] Stopped target Basic System. [ 129.278955] systemd[1]: Starting Setup Virtual Console... [ OK ] Stopped target System Initialization. [ 129.283158] systemd[1]: Stopped target Initrd File Systems. [ OK ] Stopped Apply Kernel Variables. [ 129.287146] systemd[1]: Stopped target Basic System. [ 129.289788] systemd[1]: Stopped target System Initialization. [ 129.292680] systemd[1]: Stopped Apply Kernel Variables. Stopping udev Kernel Device Manager... [ 129.296466] systemd[1]: Stopping udev Kernel Device Manager... [ OK ] Stopped Create Volatile Files and Directories. [ 129.300858] systemd[1]: Stopped Create Volatile Files and Directories. [ OK ] Stopped target Swap. [ 129.304952] systemd[1]: Stopped target Swap. [ OK ] Stopped target Local File Systems. [ OK ] Stopped target Local File Systems (Pre). [ 129.309751] systemd[1]: Stopped target Local File Systems. Stopping Device-Mapper Multipath Device Controller... [ 129.313923] systemd[1]: Stopped target Local File Systems (Pre). [ 129.313988] systemd[1]: Stopping Journal Service... [ 129.316938] systemd[1]: Stopping Device-Mapper Multipath Device Controller... Stopping Journal Service... [ 129.323429] systemd[1]: Stopped target Local Encrypted Volumes. [ 129.323524] systemd-journald[305]: Received SIGTERM from PID 1 (systemd). [ OK ] Stopped target Local Encrypted Volumes. [ 129.331097] systemd[1]: Stopped target Slices. [ OK ] Stopped target Slices. [ 129.389042] systemd[1]: Stopped target Sockets. [ OK ] Stopped target Sockets. [ 129.392622] systemd[1]: Closed Open-iSCSI iscsid Socket. [ OK ] Closed Open-iSCSI iscsid Socket. [ 129.396637] systemd[1]: Closed Open-iSCSI iscsiuio Socket. [ OK ] Closed Open-iSCSI iscsiuio Socket. [ 129.400635] systemd[1]: Stopped target Paths. [ OK ] Stopped target Paths. [ 129.408051] systemd[1]: Stopped Dispatch Password Requests to Console Directory Watch. [ OK ] Stopped Dispatch Password Requests to Console Directory Watch. [ 129.414735] systemd[1]: Stopped Journal Service. [ OK ] Stopped Journal Service. [ 129.418730] systemd[1]: Stopped udev Kernel Device Manager. [ OK ] Stopped udev Kernel Device Manager. [ 129.423505] systemd[1]: Started Setup Virtual Console. [ OK ] Started Setup Virtual Console. [ 129.428637] systemd[1]: Stopped Device-Mapper Multipath Device Controller. [ OK ] Stopped Device-Mapper Multipath Device Controller. [ 129.434447] systemd[1]: Stopped udev Wait for Complete Device Initialization. [ OK ] Stopped udev Wait for Complete Device Initialization. [ 129.440805] systemd[1]: Stopped udev Coldplug all Devices. [ OK ] Stopped udev Coldplug all Devices. [ 129.446348] systemd[1]: Started Emergency Shell. [ OK ] Started Emergency Shell. [ 129.450277] systemd[1]: Reached target Emergency Mode. [ OK ] Reached target Emergency Mode. [ 129.454534] systemd[1]: Stopped Create Static Device Nodes in /dev. [ OK ] Stopped Create Static Device Nodes in /dev. [ 129.459492] systemd[1]: Stopped Create list of required static device nodes for the current kernel. [ OK ] Stopped Create list of required sta…vice nodes for the current kernel. [ 129.466122] systemd[1]: Closed udev Kernel Socket. [ OK ] Closed udev Kernel Socket. [ 129.469744] systemd[1]: Closed udev Control Socket. [ OK ] Closed udev Control Socket. [ 129.473361] systemd[1]: Startup finished in 5.066s (kernel) + 0 (initrd) + 2min 4.407s (userspace) = 2min 9.473s. Press Enter for emergency shell or wait 5 minutes for reboot. ```
Will you clone one bug to track 4.4 issue? Will 4.4 issue be one 4.4 release blocker?
``` [ OK ] Started dracut pre-mount hook. [ 15.155394] coreos-cryptfs[704]: coreos-cryptfs: /dev/vda4 uses the 'cipher_null-ecb' [ 15.160410] coreos-cryptfs[704]: coreos-cryptfs: /dev/vda4 is not encrypted. Device will be mounted as device-mapper linear target. [ 15.523096] vda: vda1 vda2 vda3 vda4 [ ***[ 76.176833] systemd-udevd[585]: seq 1524 '/devices/pci0000:00/0000:00:05.0/virtio2/block/vda' is taking a long time [[ 101.176992] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start timed out. TIME ] Timed out waiting for device dev-disk-by\x2dlabe[ 101.191838] systemd[1]: Timed out waiting for device dev-disk-by\x2dlabel-root.device. l-root.device. [[ 101.203725] systemd[1]: Dependency failed for CoreOS: Mount (subsequent) /sysroot. DEPEND] Dependency failed for CoreOS: Mount (subsequent) /sysroot. [ 101.218655] systemd[1]: ignition-ostree-mount-subsequent-sysroot.service: Job ignition-ostree-mount-subsequent-sysroot.service/start failed with result 'dependency'. [ 101.233181] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start failed with result 'timeout'. Starting OSTree Prepare OS/... [ 101.246547] systemd[1]: Starting OSTree Prepare OS/... [ 101.268452] ostree-prepare-root[769]: ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.1/rhcos/caa4be84aa481b7bfdacdec9e872bc72f6bbe80607b29a2dd97306c7901dd510/0': No such file or directory [ 101.284786] o[ 101.285142] systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE stree-prepare-root[769]: ostree-prepare-root: Couldn't find spec[ 101.294626] systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'. ified OSTree root '/sysroot//ostree/boot.1/rhcos/caa4be84aa481b7[ 101.303692] systemd[1]: Failed to start OSTree Prepare OS/. bfdacdec9e872bc72f6bbe80607b29a2dd97306c7901dd510/0': No such file or directory [FAILED] Failed to start OSTree Prepare OS/. See 'systemctl status ostree-prepare-root.service' for details. [ 101.321999] systemd[1]: ostree-prepare-root.service: Triggering OnFailure= dependencies. /# lsmod Module Size Used by crct10dif_pclmul 16384 1 crc32_pclmul 16384 0 crc32c_intel 24576 0 ata_generic 16384 0 ata_piix 36864 0 libata 270336 2 ata_piix,ata_generic virtio_net 53248 0 ghash_clmulni_intel 16384 0 net_failover 24576 1 virtio_net virtio_console 36864 0 serio_raw 16384 0 virtio_blk 20480 0 failover 16384 1 net_failover dm_multipath 32768 0 dm_mirror 28672 0 dm_region_hash 20480 1 dm_mirror dm_log 20480 2 dm_region_hash,dm_mirror dm_mod 151552 3 dm_multipath,dm_log,dm_mirror be2iscsi 122880 0 bnx2i 57344 0 cnic 69632 1 bnx2i uio 20480 1 cnic cxgb4i 49152 0 cxgb4 430080 1 cxgb4i libcxgbi 69632 1 cxgb4i libcxgb 20480 1 libcxgbi qla4xxx 299008 0 iscsi_boot_sysfs 16384 2 be2iscsi,qla4xxx iscsi_tcp 20480 0 libiscsi_tcp 24576 3 libcxgbi,iscsi_tcp,cxgb4i libiscsi 61440 7 be2iscsi,libiscsi_tcp,bnx2i,libcxgbi,iscsi_tcp,qla4xxx,cxgb4i scsi_transport_iscsi 110592 6 be2iscsi,bnx2i,libcxgbi,iscsi_tcp,qla4xxx,libiscsi :/# grep -C5 load_kmods /usr/libexec/coreos-cryptfs fatal() { msg ${@}; exit 1; } load_kmods() { # Load kernel modules only when required. In BZ# 1803130, # it was discovered that virtual hardware that lacks an entrophy # source fails to boot. Ensure that we only load the required modules # if needed. modprobe -a dm_crypt loop -- # Determine if the volume is clevis managed or automatic unlock. # The automatic clevis hooks will NOT work since they rely on the systemd # crypttab generator. local clevis_id="$(cryptsetup luksDump ${dev} | sed -rn 's|^\s+([0-9]+): clevis|\1|p')" if [ -n "${clevis_id}" ]; then load_kmods local pin=$(cryptsetup token export --token-id "${cleivs_id}" "${dev}" \ | jq -rM '.jwe.protected' | base64 -d | jq -rM '.clevis.pin') msg "${dev} is configured for Clevis pin '${pin}'" [ "${pin}" == "tpm2" ] || net_waiter clevis-luks-unlock -d "${dev}" -- if [ -z "${pin}" ]; then msg "No Clevis pin provided, encryption is not enabled" exit 0 fi load_kmods validate_pin "${pin}" msg "detected pin=${pin}" # Strip off the build token that is set via the CoreOS installer. token_rm "${dev}" ```
Maybe related to what you're doing in: https://github.com/code-ready/snc/blob/5c9cb9a570a52bb9117c00615538afb5d1ef0d56/createdisk.sh#L38 Rather than doing all that, what you should be doing is just ${QEMU_IMG} create -f qcow2 $tmpFile 31G that should be sufficient for the OS to grow on firstboot - and you should do that before passing the image to openshift-install. But..that's probably not the bug. Hmm, this is a hint: [ ***[ 76.176833] systemd-udevd[585]: seq 1524 '/devices/pci0000:00/0000:00:05.0/virtio2/block/vda' is taking a long time From a quick search, we'll see that message when a udev rule is timing out. Just a sanity check, are you providing a virtio-rng device? How are you running this VM? Via libvirt?
> Just a sanity check, are you providing a virtio-rng device? How are you running this VM? Via libvirt? @cwalters Running this via libvirt. ``` # virsh dumpxml crc <domain type='kvm' id='3'> <name>crc</name> <uuid>9e9a5ca3-8072-4c9f-b56c-183a944a4e84</uuid> <memory unit='KiB'>9000960</memory> <currentMemory unit='KiB'>9000960</currentMemory> <vcpu placement='static'>4</vcpu> <resource> <partition>/machine</partition> </resource> <os> <type arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type> <boot dev='hd'/> <bootmenu enable='no'/> </os> <features> <acpi/> <apic/> <pae/> </features> <cpu mode='host-passthrough' check='none'> <feature policy='disable' name='rdrand'/> </cpu> <clock offset='utc'/> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>destroy</on_crash> <devices> <emulator>/usr/libexec/qemu-kvm</emulator> <disk type='file' device='disk'> <driver name='qemu' type='qcow2' io='threads'/> <source file='/home/crc_ci/.crc/machines/crc/crc'/> <backingStore/> <target dev='vda' bus='virtio'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </disk> <controller type='usb' index='0' model='piix3-uhci'> <alias name='usb'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> </controller> <controller type='pci' index='0' model='pci-root'> <alias name='pci.0'/> </controller> <controller type='virtio-serial' index='0'> <alias name='virtio-serial0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </controller> <interface type='network'> <mac address='52:fd:fc:07:21:82'/> <source network='crc' bridge='crc'/> <target dev='vnet0'/> <model type='virtio'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> <serial type='pty'> <source path='/dev/pts/2'/> <target type='isa-serial' port='0'> <model name='isa-serial'/> </target> <alias name='serial0'/> </serial> <console type='pty' tty='/dev/pts/2'> <source path='/dev/pts/2'/> <target type='serial' port='0'/> <alias name='serial0'/> </console> <channel type='pty'> <source path='/dev/pts/3'/> <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/> <alias name='channel0'/> <address type='virtio-serial' controller='0' bus='0' port='1'/> </channel> <input type='mouse' bus='ps2'> <alias name='input0'/> </input> <input type='keyboard' bus='ps2'> <alias name='input1'/> </input> <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'> <listen type='address' address='127.0.0.1'/> </graphics> <video> <model type='cirrus' vram='16384' heads='1' primary='yes'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='virtio'> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </memballoon> <rng model='virtio'> <backend model='random'>/dev/random</backend> <alias name='rng0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </rng> </devices> <seclabel type='dynamic' model='selinux' relabel='yes'> <label>system_u:system_r:svirt_t:s0:c672,c810</label> <imagelabel>system_u:object_r:svirt_image_t:s0:c672,c810</imagelabel> </seclabel> <seclabel type='dynamic' model='dac' relabel='yes'> <label>+107:+107</label> <imagelabel>+107:+107</imagelabel> </seclabel> </domain> ```
(In reply to Praveen Kumar from comment #24) > <rng model='virtio'> > <backend model='random'>/dev/random</backend> > <alias name='rng0'/> > <address type='pci' domain='0x0000' bus='0x00' slot='0x07' > function='0x0'/> > </rng> Hmm, this is using `/dev/random` instead of `/dev/urandom`. My understanding is that one should just use /dev/random nowadays (see urandom(4)). Especially for the use case here, which is for a testing/development focused environment. Depending on the execution environment (how exactly is this being run in CentOS CI? Directly on bare metal? Or e.g. nested?) and how many of these tests are running in simultaneously, it may be exhausting the host entropy and delaying guest booting. Can you try with `/dev/urandom` instead? And actually looking at the regular libvirt OpenShift installer now (but without testing it), I think it also implicitly uses urandom: https://github.com/dmacvicar/terraform-provider-libvirt/blob/8d6f2c419514d3e3c0565b4a44a8b325f6459c23/libvirt/domain_def.go#L89.
/dev/urandom is also recommended upstream https://libvirt.org/formatdomain.html#elementsRng (but it's not the default value).
@praveen can you please retest this BZ using `/dev/urandom` as noted in comment #27 and comment #28?
@Micah We did try the `/dev/urandom` and in CentOS CI we still have same issue.
I think that the root cause is NOT RHCOS but: - Per [1] the disk is resized and the FS is grown. - The disk is sparse, but the root-filesystem goes from ~4GB allocated to ~31GB allocated - When RHCOS runs the disk-encryption stage, it now has apply encryption to the entire 31GB - This means that RHCOS is encrypting roughly ~25Gb of zeros. Unless you're doing CPU pinning for the vCPU's you could end up with slow encryption. I concur with what Colin's saying in https://bugzilla.redhat.com/show_bug.cgi?id=1803130#c23. Resize the disk and then let RHCOS do the resize on first boot. [1] https://github.com/code-ready/snc/blob/master/createdisk.sh#L38-L89
Given that the root cause may be in the CRC `createdisk.sh` script, I'm going to mark this verified with RHCOS 45.82.202006191325-0 I can boot a single node successfully and the `coreos-cryptfs` script shows the change from MR 830 (https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/830) has been included. ``` $ rpm-ostree status -b State: idle AutomaticUpdates: disabled BootedDeployment: * ostree://3c1ded0e8180de2d9ef1c836ae2bd6f1f8215b28b55c96da1d13ec840ee58c7d Version: 45.82.202006191325-0 (2020-06-19T13:29:15Z) [core@cosa-devsh ~]$ cat /usr/libexec/coreos-cryptfs | grep -C5 load_kmods fatal() { msg ${@}; exit 1; } load_kmods() { # Load kernel modules only when required. In BZ# 1803130, # it was discovered that virtual hardware that lacks an entrophy # source fails to boot. Ensure that we only load the required modules # if needed. modprobe -a dm_crypt loop -- # Determine if the volume is clevis managed or automatic unlock. # The automatic clevis hooks will NOT work since they rely on the systemd # crypttab generator. local clevis_id="$(cryptsetup luksDump ${dev} | sed -rn 's|^\s+([0-9]+): clevis|\1|p')" if [ -n "${clevis_id}" ]; then load_kmods local pin=$(cryptsetup token export --token-id "${cleivs_id}" "${dev}" \ | jq -rM '.jwe.protected' | base64 -d | jq -rM '.clevis.pin') msg "${dev} is configured for Clevis pin '${pin}'" [ "${pin}" == "tpm2" ] || net_waiter clevis-luks-unlock -d "${dev}" -- if [ -z "${pin}" ]; then msg "No Clevis pin provided, encryption is not enabled" exit 0 fi load_kmods validate_pin "${pin}" msg "detected pin=${pin}" # Strip off the build token that is set via the CoreOS installer. token_rm "${dev}" ```
(In reply to Ben Howard from comment #31) > I think that the root cause is NOT RHCOS but: > - Per [1] the disk is resized and the FS is grown. > - The disk is sparse, but the root-filesystem goes from ~4GB allocated to > ~31GB allocated > [*] When RHCOS runs the disk-encryption stage, it now has apply encryption to > the entire 31GB > - This means that RHCOS is encrypting roughly ~25Gb of zeros. > > Unless you're doing CPU pinning for the vCPU's you could end up with slow > encryption. > > I concur with what Colin's saying in > https://bugzilla.redhat.com/show_bug.cgi?id=1803130#c23. Resize the disk and > then let RHCOS do the resize on first boot. > > [1] https://github.com/code-ready/snc/blob/master/createdisk.sh#L38-L89 We finally have time to try to let RHCOS do the resize itself. However, I'm not sure I understand the step I marked with [*]. What is being done is https://github.com/code-ready/snc/blob/9034bac608bf96a0b7f488b7888484aa0e22a51c/createdisk.sh#L66-L73 so we are filling the mounted luks file system with 0, wouldn't they be encrypted as they are written? Or is this 'disk-encryption stage' doing more things?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409