Bug 1803130 - RHCOS fails to boot with coreos-luks-open.service: Main process exited, code=killed, status=15/TERM error
Summary: RHCOS fails to boot with coreos-luks-open.service: Main process exited, code=...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.0
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ---
: 4.5.0
Assignee: Ben Howard
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks: 1813047 1847673
TreeView+ depends on / blocked
 
Reported: 2020-02-14 14:38 UTC by Praveen Kumar
Modified: 2020-07-13 17:15 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1813047 1847673 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:15:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
virt-console-log.txt (21.03 KB, text/plain)
2020-02-14 14:38 UTC, Praveen Kumar
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:15:58 UTC

Description Praveen Kumar 2020-02-14 14:38:10 UTC
Created attachment 1663130 [details]
virt-console-log.txt

Description of problem: CRC team use libvirt provider to create a single node openshift cluster and the shutdown the VM after cluster is created. This VM disk imaage used on different platform to run openshift locally.

Generating the disk image part of: https://github.com/code-ready/snc/blob/master/createdisk.sh

From 4.3 time frame our CentOS CI experiencing lot of failure because this disk image not able to boot correctly on the CI node and it fails with `coreos-luks-open.service: Main process exited, code=killed, status=15/TERM error` error


Version-Release number of selected component (if applicable):
```
$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="43.81.202002032142.0"
VERSION_ID="4.3"
OPENSHIFT_VERSION="4.3"
RHEL_VERSION=8.0
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 43.81.202002032142.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.3"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.3"
OSTREE_VERSION='43.81.202002032142.0'
```


How reproducible:
Running the CI job with created image is always fails on the centos CI.


Actual results:
[   12.510624] loop: module loaded
[   12.640185] coreos-cryptfs[817]: coreos-cryptfs: /dev/vda4 uses the 'cipher_null-ecb'
[   12.644748] coreos-cryptfs[817]: coreos-cryptfs: /dev/vda4 is not encrypted. Device will be mounted as device-mapper linear target.
[   13.000541]  vda: vda1 vda2 vda3 vda4
[   46.576110] random: crng init done
[*   46.581526] random: 7 urandom warning(s) missed due to ratelimiting
[  *** ] (1 of 2) A start job is running for…S LUKS Opener (1min 6s / no limit)[   73.326302] systemd-udevd[506]: seq 1511 '/devices/pci0000:00/0000:00:05.0/virtio2/block/vda' is taking a long time
[ TIME ] Timed out waiting for device dev-disk-by\x2dlabel-root.device.
[DEPEND] Dependency failed for CoreOS: Mount (subsequent) /sysroot.
[   96.597431] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start timed out.
         Starting OSTree Prepare OS/...
[   96.612372] systemd[1]: Timed out waiting for device dev-disk-by\x2dlabel-root.device.
[   96.621701] systemd[1]: Dependency failed for CoreOS: Mount (subsequent) /sysroot.
[   96.629609] systemd[1]: ignition-ostree-mount-subsequent-sysroot.service: Job ignition-ostree-mount-subsequent-sysroot.service/start failed with result 'dependency'.
[   96.643539] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start failed with result 'timeout'.
[   96.654760] ostree-prepare-root[887]: ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.1/rhcos/359d8bc4f6e23689c281de46db291779a6261f8339e5e01c1b37ead9f1986e78/0': No such file or directory
[FAILED] Failed to start OSTree Prepare OS/.


Expected results:
Should able to run successfully on this platform.


Additional info:
```
# lshw -short
H/W path            Device    Class      Description
====================================================
                              system     SM15000-64-CC-AA-1Ox1 (830-7367-03)
/0                            bus        AMD Server CRB
/0/0                          memory     128KiB BIOS
/0/4                          processor  AMD Opteron(tm) Processor 4365 EE
/0/4/a                        memory     80KiB L1 cache
/0/4/b                        memory     2MiB L2 cache
/0/4/c                        memory     8MiB L3 cache
/0/12                         memory     32GiB System Memory
/0/12/0                       memory     8GiB DDR3 Synchronous 1333 MHz (0.8 ns)
/0/12/1                       memory     8GiB DDR3 Synchronous 1333 MHz (0.8 ns)
/0/12/2                       memory     8GiB DDR3 Synchronous 1333 MHz (0.8 ns)
/0/12/3                       memory     8GiB DDR3 Synchronous 1333 MHz (0.8 ns)
/0/100                        bridge     RD890S/SR5650 Host Bridge
/0/100/2                      bridge     RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GFX port 0)
/0/100/2/0          scsi0     storage    82801IB (ICH9) 4 port SATA Controller [AHCI mode]
/0/100/2/0/0.0.0    /dev/sda  disk       239GB SMvD14250C548CC1
/0/100/2/0/0.0.0/1            volume     500MiB EXT4 volume
/0/100/2/0/0.0.0/2            volume     222GiB Linux LVM Physical Volume partition
/0/100/2/0.1        eth0      network    82540EM Gigabit Ethernet Controller
/0/100/3                      bridge     RD890/RD9x0 PCI to PCI bridge (PCI Express GFX port 1)
/0/100/3/0                    storage    82801IB (ICH9) 4 port SATA Controller [AHCI mode]
/0/100/3/0.1        eth1      network    82540EM Gigabit Ethernet Controller
/0/100/4                      bridge     RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 0)
/0/100/4/0                    storage    82801IB (ICH9) 4 port SATA Controller [AHCI mode]
/0/100/4/0.1        eth2      network    82540EM Gigabit Ethernet Controller
```

Comment 1 Jonathan Lebon 2020-02-14 14:50:04 UTC
Hmm, do you have a link to the postprocessed image to try out? The postprocessing workflow might be conflicting with the new LUKS container approach in 4.3.

> Running the CI job with created image is always fails on the centos CI.

Chatting on Slack, I had the impression this didn't always happen. Is that not the case?
Is it a consistent success/failure for a *specific* postprocessed image, or will the same image sometimes boot and sometimes not?

Comment 3 Ben Howard 2020-02-17 23:06:48 UTC
I believe that this is a problem with Hyperkit.  I downloaded tarball and confirmed that it boots correctly. However, if I remove the VirtIO RNG device, then it boots _slowly_ and failed the time out in one case. 

Is it possible to include a `virto-rng` device?

Comment 4 Ben Howard 2020-02-17 23:21:10 UTC
Assigning this to myself.

Comment 6 Praveen Kumar 2020-02-19 07:11:56 UTC
> I believe that this is a problem with Hyperkit.  I downloaded tarball and confirmed that it boots correctly. However, if I remove the VirtIO RNG device, then it boots _slowly_ and failed the time out in one case. 


We didn't noticed any failure on the hyperkit side yet but on the CentosCI as I mentioned in the bug.

-

Comment 10 Christophe Fergeau 2020-03-10 10:40:56 UTC
We are unable to test this on 4.4 because of https://bugzilla.redhat.com/show_bug.cgi?id=1805034 and this does not seem to have been backported to 4.3, so we cannot provide useful testing on the proposed fix.

Comment 11 Ben Howard 2020-03-12 19:18:02 UTC
Merged to 4.3.

Comment 12 Micah Abbott 2020-03-12 19:27:53 UTC
Pushing this back to POST until it lands in a 4.3.z build.  Then we'll let the errata sweeper pick it up once it is in MODIFIED again.

Comment 13 Micah Abbott 2020-03-12 19:30:49 UTC
Ah dang it...I thought this was targeted for 4.3...unwinding my changes

Comment 14 Micah Abbott 2020-03-13 19:04:25 UTC
Booting up the latest RHCOS 4.4 shows that the changes to how `dm_crypt` are present.  But it would be best to get the reporter to confirm the fix is successful.

```
$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://558f27e963ce5314dbe005b5cacef5f230ee111209ad3037a801e7a4e88d407d
                   Version: 44.81.202003130330-0 (2020-03-13T03:36:02Z)

$ grep -C5 load_kmods /usr/libexec/coreos-cryptfs 
fatal() {
    msg ${@};
    exit 1;
}

load_kmods() {
    # Load kernel modules only when required. In BZ# 1803130,
    # it was discovered that virtual hardware that lacks an entrophy
    # source fails to boot. Ensure that we only load the required modules
    # if needed.
    modprobe -a dm_crypt loop
--
    # Determine if the volume is clevis managed or automatic unlock.
    # The automatic clevis hooks will NOT work since they rely on the systemd
    # crypttab generator.
    local clevis_id="$(cryptsetup luksDump ${dev} | sed -rn 's|^\s+([0-9]+): clevis|\1|p')"
    if [ -n "${clevis_id}" ]; then
        load_kmods
        local pin=$(cryptsetup token export --token-id "${cleivs_id}" "${dev}" \
            | jq -rM '.jwe.protected' | base64 -d | jq -rM '.clevis.pin')
        msg "${dev} is configured for Clevis pin '${pin}'"
        [ "${pin}" == "tpm2" ] || net_waiter
        clevis-luks-unlock -d "${dev}"
--
    if [ -z "${pin}" ]; then
        msg "No Clevis pin provided, encryption is not enabled"
        exit 0
    fi

    load_kmods
    validate_pin "${pin}"
    msg "detected pin=${pin}"

    # Strip off the build token that is set via the CoreOS installer.
    token_rm "${dev}"
```

Comment 15 Steve Milner 2020-03-16 18:13:32 UTC
PTAL Praveen

Comment 16 Praveen Kumar 2020-03-17 03:52:11 UTC
@Steve at the moment we are block by https://bugzilla.redhat.com/show_bug.cgi?id=1805034 one to generated a 4.4 disk image, will test and report back once we have that fixed.

Comment 17 Steve Milner 2020-04-06 18:20:58 UTC
Moving to 4.5 since the dependent BZ is targeted to 4.5.

Comment 18 Michael Nguyen 2020-04-22 18:36:47 UTC
@praveen, it looks like https://bugzilla.redhat.com/show_bug.cgi?id=1805034 is verified.  Can you verify that this fix works for you on a ocp 4.4 nightly?

Comment 19 Praveen Kumar 2020-04-27 04:19:10 UTC
@Michael On the CentOS CI side we are still getting the same issue, I tested the 4.4.0-rc.11 and still same.

```
[**    ] A stop job is running for CoreOS LUKS Opener (33s / 1min 29s)[  129.254905] systemd[1]: coreos-luks-open.service: Failed with result 'signal'.
[  OK  ] Stopped CoreOS LUKS Opener.
[  129.261846] systemd[1]: Stopped CoreOS LUKS Opener.
[  OK  ] Stopped target Initrd Root File System.
[  129.266796] systemd[1]: coreos-luks-open.service: Triggering OnFailure= dependencies.
         Starting Setup Virtual Console...
[  129.272759] systemd[1]: Stopped target Initrd Root File System.
[  OK  ] Stopped target Initrd File Systems.
[  OK  ] Stopped target Basic System.
[  129.278955] systemd[1]: Starting Setup Virtual Console...
[  OK  ] Stopped target System Initialization.
[  129.283158] systemd[1]: Stopped target Initrd File Systems.
[  OK  ] Stopped Apply Kernel Variables.
[  129.287146] systemd[1]: Stopped target Basic System.
[  129.289788] systemd[1]: Stopped target System Initialization.
[  129.292680] systemd[1]: Stopped Apply Kernel Variables.
         Stopping udev Kernel Device Manager...
[  129.296466] systemd[1]: Stopping udev Kernel Device Manager...
[  OK  ] Stopped Create Volatile Files and Directories.
[  129.300858] systemd[1]: Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Swap.
[  129.304952] systemd[1]: Stopped target Swap.
[  OK  ] Stopped target Local File Systems.
[  OK  ] Stopped target Local File Systems (Pre).
[  129.309751] systemd[1]: Stopped target Local File Systems.
         Stopping Device-Mapper Multipath Device Controller...
[  129.313923] systemd[1]: Stopped target Local File Systems (Pre).
[  129.313988] systemd[1]: Stopping Journal Service...
[  129.316938] systemd[1]: Stopping Device-Mapper Multipath Device Controller...
         Stopping Journal Service...
[  129.323429] systemd[1]: Stopped target Local Encrypted Volumes.
[  129.323524] systemd-journald[305]: Received SIGTERM from PID 1 (systemd).
[  OK  ] Stopped target Local Encrypted Volumes.
[  129.331097] systemd[1]: Stopped target Slices.
[  OK  ] Stopped target Slices.
[  129.389042] systemd[1]: Stopped target Sockets.
[  OK  ] Stopped target Sockets.
[  129.392622] systemd[1]: Closed Open-iSCSI iscsid Socket.
[  OK  ] Closed Open-iSCSI iscsid Socket.
[  129.396637] systemd[1]: Closed Open-iSCSI iscsiuio Socket.
[  OK  ] Closed Open-iSCSI iscsiuio Socket.
[  129.400635] systemd[1]: Stopped target Paths.
[  OK  ] Stopped target Paths.
[  129.408051] systemd[1]: Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  129.414735] systemd[1]: Stopped Journal Service.
[  OK  ] Stopped Journal Service.
[  129.418730] systemd[1]: Stopped udev Kernel Device Manager.
[  OK  ] Stopped udev Kernel Device Manager.
[  129.423505] systemd[1]: Started Setup Virtual Console.
[  OK  ] Started Setup Virtual Console.
[  129.428637] systemd[1]: Stopped Device-Mapper Multipath Device Controller.
[  OK  ] Stopped Device-Mapper Multipath Device Controller.
[  129.434447] systemd[1]: Stopped udev Wait for Complete Device Initialization.
[  OK  ] Stopped udev Wait for Complete Device Initialization.
[  129.440805] systemd[1]: Stopped udev Coldplug all Devices.
[  OK  ] Stopped udev Coldplug all Devices.
[  129.446348] systemd[1]: Started Emergency Shell.
[  OK  ] Started Emergency Shell.
[  129.450277] systemd[1]: Reached target Emergency Mode.
[  OK  ] Reached target Emergency Mode.
[  129.454534] systemd[1]: Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  129.459492] systemd[1]: Stopped Create list of required static device nodes for the current kernel.
[  OK  ] Stopped Create list of required sta…vice nodes for the current kernel.
[  129.466122] systemd[1]: Closed udev Kernel Socket.
[  OK  ] Closed udev Kernel Socket.
[  129.469744] systemd[1]: Closed udev Control Socket.
[  OK  ] Closed udev Control Socket.
[  129.473361] systemd[1]: Startup finished in 5.066s (kernel) + 0 (initrd) + 2min 4.407s (userspace) = 2min 9.473s.
Press Enter for emergency shell or wait 5 minutes for reboot.                 

```

Comment 20 Wei Sun 2020-04-27 07:47:43 UTC
Will you clone one bug to track 4.4 issue? Will 4.4 issue be one 4.4 release blocker?

Comment 22 Praveen Kumar 2020-04-27 12:27:14 UTC
```
[  OK  ] Started dracut pre-mount hook.
[   15.155394] coreos-cryptfs[704]: coreos-cryptfs: /dev/vda4 uses the 'cipher_null-ecb'
[   15.160410] coreos-cryptfs[704]: coreos-cryptfs: /dev/vda4 is not encrypted. Device will be mounted as device-mapper linear target.
[   15.523096]  vda: vda1 vda2 vda3 vda4
[  ***[   76.176833] systemd-udevd[585]: seq 1524 '/devices/pci0000:00/0000:00:05.0/virtio2/block/vda' is taking a long time
[[  101.176992] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start timed out.
 TIME ] Timed out waiting for device dev-disk-by\x2dlabe[  101.191838] systemd[1]: Timed out waiting for device dev-disk-by\x2dlabel-root.device.
l-root.device.
[[  101.203725] systemd[1]: Dependency failed for CoreOS: Mount (subsequent) /sysroot.
DEPEND] Dependency failed for CoreOS: Mount (subsequent) /sysroot.
[  101.218655] systemd[1]: ignition-ostree-mount-subsequent-sysroot.service: Job ignition-ostree-mount-subsequent-sysroot.service/start failed with result 'dependency'.
         [  101.233181] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start failed with result 'timeout'.
Starting OSTree Prepare OS/...
[  101.246547] systemd[1]: Starting OSTree Prepare OS/...
[  101.268452] ostree-prepare-root[769]: ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.1/rhcos/caa4be84aa481b7bfdacdec9e872bc72f6bbe80607b29a2dd97306c7901dd510/0': No such file or directory
[  101.284786] o[  101.285142] systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE
stree-prepare-root[769]: ostree-prepare-root: Couldn't find spec[  101.294626] systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
ified OSTree root '/sysroot//ostree/boot.1/rhcos/caa4be84aa481b7[  101.303692] systemd[1]: Failed to start OSTree Prepare OS/.
bfdacdec9e872bc72f6bbe80607b29a2dd97306c7901dd510/0': No such file or directory
[FAILED] Failed to start OSTree Prepare OS/.
See 'systemctl status ostree-prepare-root.service' for details.
[  101.321999] systemd[1]: ostree-prepare-root.service: Triggering OnFailure= dependencies.


/# lsmod 
Module                  Size  Used by
crct10dif_pclmul       16384  1
crc32_pclmul           16384  0
crc32c_intel           24576  0
ata_generic            16384  0
ata_piix               36864  0
libata                270336  2 ata_piix,ata_generic
virtio_net             53248  0
ghash_clmulni_intel    16384  0
net_failover           24576  1 virtio_net
virtio_console         36864  0
serio_raw              16384  0
virtio_blk             20480  0
failover               16384  1 net_failover
dm_multipath           32768  0
dm_mirror              28672  0
dm_region_hash         20480  1 dm_mirror
dm_log                 20480  2 dm_region_hash,dm_mirror
dm_mod                151552  3 dm_multipath,dm_log,dm_mirror
be2iscsi              122880  0
bnx2i                  57344  0
cnic                   69632  1 bnx2i
uio                    20480  1 cnic
cxgb4i                 49152  0
cxgb4                 430080  1 cxgb4i
libcxgbi               69632  1 cxgb4i
libcxgb                20480  1 libcxgbi
qla4xxx               299008  0
iscsi_boot_sysfs       16384  2 be2iscsi,qla4xxx
iscsi_tcp              20480  0
libiscsi_tcp           24576  3 libcxgbi,iscsi_tcp,cxgb4i
libiscsi               61440  7 be2iscsi,libiscsi_tcp,bnx2i,libcxgbi,iscsi_tcp,qla4xxx,cxgb4i
scsi_transport_iscsi   110592  6 be2iscsi,bnx2i,libcxgbi,iscsi_tcp,qla4xxx,libiscsi


:/# grep -C5 load_kmods /usr/libexec/coreos-cryptfs
fatal() {
    msg ${@};
    exit 1;
}

load_kmods() {
    # Load kernel modules only when required. In BZ# 1803130,
    # it was discovered that virtual hardware that lacks an entrophy
    # source fails to boot. Ensure that we only load the required modules
    # if needed.
    modprobe -a dm_crypt loop
--
    # Determine if the volume is clevis managed or automatic unlock.
    # The automatic clevis hooks will NOT work since they rely on the systemd
    # crypttab generator.
    local clevis_id="$(cryptsetup luksDump ${dev} | sed -rn 's|^\s+([0-9]+): clevis|\1|p')"
    if [ -n "${clevis_id}" ]; then
        load_kmods
        local pin=$(cryptsetup token export --token-id "${cleivs_id}" "${dev}" \
            | jq -rM '.jwe.protected' | base64 -d | jq -rM '.clevis.pin')
        msg "${dev} is configured for Clevis pin '${pin}'"
        [ "${pin}" == "tpm2" ] || net_waiter
        clevis-luks-unlock -d "${dev}"
--
    if [ -z "${pin}" ]; then
        msg "No Clevis pin provided, encryption is not enabled"
        exit 0
    fi

    load_kmods
    validate_pin "${pin}"
    msg "detected pin=${pin}"

    # Strip off the build token that is set via the CoreOS installer.
    token_rm "${dev}"

```

Comment 23 Colin Walters 2020-04-27 12:51:21 UTC
Maybe related to what you're doing in:
https://github.com/code-ready/snc/blob/5c9cb9a570a52bb9117c00615538afb5d1ef0d56/createdisk.sh#L38
Rather than doing all that, what you should be doing is just 
${QEMU_IMG} create -f qcow2 $tmpFile 31G

that should be sufficient for the OS to grow on firstboot - and you should do that before passing the image to openshift-install.

But..that's probably not the bug.  Hmm, this is a hint:

[  ***[   76.176833] systemd-udevd[585]: seq 1524 '/devices/pci0000:00/0000:00:05.0/virtio2/block/vda' is taking a long time

From a quick search, we'll see that message when a udev rule is timing out.

Just a sanity check, are you providing a virtio-rng device?  How are you running this VM?  Via libvirt?

Comment 24 Praveen Kumar 2020-04-27 13:10:17 UTC
> Just a sanity check, are you providing a virtio-rng device?  How are you running this VM?  Via libvirt?


@cwalters Running this via libvirt.

```
# virsh dumpxml crc
<domain type='kvm' id='3'>
  <name>crc</name>
  <uuid>9e9a5ca3-8072-4c9f-b56c-183a944a4e84</uuid>
  <memory unit='KiB'>9000960</memory>
  <currentMemory unit='KiB'>9000960</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>
    <boot dev='hd'/>
    <bootmenu enable='no'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <feature policy='disable' name='rdrand'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' io='threads'/>
      <source file='/home/crc_ci/.crc/machines/crc/crc'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='piix3-uhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='network'>
      <mac address='52:fd:fc:07:21:82'/>
      <source network='crc' bridge='crc'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/2'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='pty'>
      <source path='/dev/pts/3'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/random</backend>
      <alias name='rng0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c672,c810</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c672,c810</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>

```

Comment 27 Jonathan Lebon 2020-05-11 21:17:04 UTC
(In reply to Praveen Kumar from comment #24)
>     <rng model='virtio'>
>       <backend model='random'>/dev/random</backend>
>       <alias name='rng0'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
> function='0x0'/>
>     </rng>

Hmm, this is using `/dev/random` instead of `/dev/urandom`. My understanding is that one should just use /dev/random nowadays (see urandom(4)). Especially for the use case here, which is for a testing/development focused environment.  Depending on the execution environment (how exactly is this being run in CentOS CI? Directly on bare metal? Or e.g. nested?) and how many of these tests are running in simultaneously, it may be exhausting the host entropy and delaying guest booting.

Can you try with `/dev/urandom` instead? And actually looking at the regular libvirt OpenShift installer now (but without testing it), I think it also implicitly uses urandom: https://github.com/dmacvicar/terraform-provider-libvirt/blob/8d6f2c419514d3e3c0565b4a44a8b325f6459c23/libvirt/domain_def.go#L89.

Comment 28 Christophe Fergeau 2020-05-12 12:14:23 UTC
/dev/urandom is also recommended upstream https://libvirt.org/formatdomain.html#elementsRng (but it's not the default value).

Comment 29 Micah Abbott 2020-06-16 19:08:58 UTC
@praveen can you please retest this BZ using `/dev/urandom` as noted in comment #27 and comment #28?

Comment 30 Praveen Kumar 2020-06-17 05:58:12 UTC
@Micah We did try the `/dev/urandom` and in CentOS CI we still have same issue.

Comment 31 Ben Howard 2020-06-19 19:37:54 UTC
I think that the root cause is NOT RHCOS but:
- Per [1] the disk is resized and the FS is grown.
- The disk is sparse, but the root-filesystem goes from ~4GB allocated to ~31GB allocated
- When RHCOS runs the disk-encryption stage, it now has apply encryption to the entire 31GB
- This means that RHCOS is encrypting roughly ~25Gb of zeros. 

Unless you're doing CPU pinning for the vCPU's you could end up with slow encryption.

I concur with what Colin's saying in https://bugzilla.redhat.com/show_bug.cgi?id=1803130#c23. Resize the disk and then let RHCOS do the resize on first boot. 

[1] https://github.com/code-ready/snc/blob/master/createdisk.sh#L38-L89

Comment 32 Micah Abbott 2020-06-19 19:44:19 UTC
Given that the root cause may be in the CRC `createdisk.sh` script, I'm going to mark this verified with RHCOS 45.82.202006191325-0

I can boot a single node successfully and the `coreos-cryptfs` script shows the change from MR 830 (https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/830) has been included.

```
$ rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
* ostree://3c1ded0e8180de2d9ef1c836ae2bd6f1f8215b28b55c96da1d13ec840ee58c7d
                   Version: 45.82.202006191325-0 (2020-06-19T13:29:15Z)
[core@cosa-devsh ~]$ cat /usr/libexec/coreos-cryptfs | grep -C5 load_kmods
fatal() {
    msg ${@};
    exit 1;
}

load_kmods() {
    # Load kernel modules only when required. In BZ# 1803130,
    # it was discovered that virtual hardware that lacks an entrophy
    # source fails to boot. Ensure that we only load the required modules
    # if needed.
    modprobe -a dm_crypt loop
--
    # Determine if the volume is clevis managed or automatic unlock.
    # The automatic clevis hooks will NOT work since they rely on the systemd
    # crypttab generator.
    local clevis_id="$(cryptsetup luksDump ${dev} | sed -rn 's|^\s+([0-9]+): clevis|\1|p')"
    if [ -n "${clevis_id}" ]; then
        load_kmods
        local pin=$(cryptsetup token export --token-id "${cleivs_id}" "${dev}" \
            | jq -rM '.jwe.protected' | base64 -d | jq -rM '.clevis.pin')
        msg "${dev} is configured for Clevis pin '${pin}'"
        [ "${pin}" == "tpm2" ] || net_waiter
        clevis-luks-unlock -d "${dev}"
--
    if [ -z "${pin}" ]; then
        msg "No Clevis pin provided, encryption is not enabled"
        exit 0
    fi

    load_kmods
    validate_pin "${pin}"
    msg "detected pin=${pin}"

    # Strip off the build token that is set via the CoreOS installer.
    token_rm "${dev}"
```

Comment 34 Christophe Fergeau 2020-06-25 14:05:36 UTC
(In reply to Ben Howard from comment #31)
> I think that the root cause is NOT RHCOS but:
> - Per [1] the disk is resized and the FS is grown.
> - The disk is sparse, but the root-filesystem goes from ~4GB allocated to
> ~31GB allocated
> [*] When RHCOS runs the disk-encryption stage, it now has apply encryption to
> the entire 31GB
> - This means that RHCOS is encrypting roughly ~25Gb of zeros. 
> 
> Unless you're doing CPU pinning for the vCPU's you could end up with slow
> encryption.
> 
> I concur with what Colin's saying in
> https://bugzilla.redhat.com/show_bug.cgi?id=1803130#c23. Resize the disk and
> then let RHCOS do the resize on first boot. 
> 
> [1] https://github.com/code-ready/snc/blob/master/createdisk.sh#L38-L89

We finally have time to try to let RHCOS do the resize itself. However, I'm not sure I understand the step I marked with [*]. What is being done is https://github.com/code-ready/snc/blob/9034bac608bf96a0b7f488b7888484aa0e22a51c/createdisk.sh#L66-L73 so we are filling the mounted luks file system with 0, wouldn't they be encrypted as they are written? Or is this 'disk-encryption stage' doing more things?

Comment 36 errata-xmlrpc 2020-07-13 17:15:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409


Note You need to log in before you can comment on or make changes to this bug.