Bug 1806532

Summary: Do not use --enable-rdrand when building json-c
Product: Red Hat Enterprise Linux 8 Reporter: Rhys Oxenham <roxenham>
Component: json-c Assignee: Joe Orton <jorton>
Status: CLOSED ERRATA QA Contact: Ondrej Mejzlik <omejzlik>
Severity: medium Docs Contact:
Priority: medium    
Version: 8.1 CC: amit, berrange, cfergeau, dwmw2, gbraad, itamar, jorton, miabbott, omejzlik, pbonzini, prkumar, psklenar, rjones, virt-maint
Target Milestone: rc Keywords: AutoVerified, EasyFix, TestCaseProvided, Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-18 14:47:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1771008, 1894575    
Attachments:
Description Flags
CoreOS boot failure with default CPU setup none

Description Rhys Oxenham 2020-02-24 13:12:34 UTC
Created attachment 1665426
CoreOS boot failure with default CPU setup

Description of problem:

When testing virtualised OpenShift 4.4 (with CoreOS 4.4) deployments on my AMD Ryzen 9 3950X based Fedora 31 system, the CoreOS installation fails on first boot because the "cryptsetup luksDump /dev/vda4 | sed..." command hangs indefinitely. I saw the same behaviour with a RHEL 8.1 guest, so this is not limited to CoreOS.

On the underlying host, the luksDump command runs without failure and returns instantly; the hang only shows up in the virtual machine. I was initially using host-model, and while experimenting with CPU flags I got it to work by disabling *rdrand* in libvirt:

  <cpu mode="host-model" check="partial">
    <model fallback="allow"/>
    <feature policy="disable" name="rdrand"/>
  </cpu>

Which yields:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <vendor>AMD</vendor>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='tsc-deadline'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='clwb'/>
    <feature policy='require' name='umip'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='require' name='perfctr_core'/>
    <feature policy='require' name='wbnoinvd'/>
    <feature policy='require' name='amd-ssbd'/>
    <feature policy='require' name='virt-ssbd'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
    <feature policy='disable' name='monitor'/>
    <feature policy='disable' name='rdrand'/>
    <feature policy='disable' name='svm'/>
    <feature policy='require' name='topoext'/>
  </cpu>

Now my deployments work successfully and without fault.
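
To confirm from inside the guest that the flag is really hidden after this change, one could look at the "flags" line of /proc/cpuinfo. The following is a minimal sketch, not something taken from the report: the function name has_cpu_flag and its optional file argument are illustrative assumptions.

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]
# Exits 0 if FLAG appears as a whole word on a "flags" line, non-zero otherwise.
# CPUINFO_FILE defaults to /proc/cpuinfo; the parameter is hypothetical and
# exists only so the check can be run against a saved copy of the file.
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    grep -E '^flags[[:space:]]*:' "$file" | grep -qw -- "$flag"
}

if has_cpu_flag rdrand; then
    echo "rdrand is exposed to this machine"
else
    echo "rdrand is not exposed to this machine"
fi
```

Run inside the guest, this should report that rdrand is not exposed once the <feature policy="disable" name="rdrand"/> element has taken effect.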

Version-Release number of selected component (if applicable):

# uname -r
5.4.19-200.fc31.x86_64

# rpm -qa | egrep '(qemu-system|qemu-common|libvirt-daemon|kvm)' | sort 
libvirt-daemon-5.6.0-5.fc31.x86_64
libvirt-daemon-config-network-5.6.0-5.fc31.x86_64
libvirt-daemon-config-nwfilter-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-interface-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-libxl-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-lxc-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-network-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-nodedev-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-nwfilter-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-qemu-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-secret-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-core-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-disk-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-gluster-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-iscsi-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-iscsi-direct-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-logical-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-mpath-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-rbd-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-scsi-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-sheepdog-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-storage-zfs-5.6.0-5.fc31.x86_64
libvirt-daemon-driver-vbox-5.6.0-5.fc31.x86_64
libvirt-daemon-kvm-5.6.0-5.fc31.x86_64
qemu-common-4.1.1-1.fc31.x86_64
qemu-kvm-4.1.1-1.fc31.x86_64
qemu-system-x86-4.1.1-1.fc31.x86_64
qemu-system-x86-core-4.1.1-1.fc31.x86_64

# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 9 3950X 16-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2195.600
CPU max MHz:                     3500.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        6987.01
Virtualization:                  AMD-V
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        8 MiB
L3 cache:                        64 MiB
NUMA node0 CPU(s):               0-31
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP always-on, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca


How reproducible:

This appears to happen every time something involving cryptsetup and LUKS runs in a virtual machine with rdrand exposed. I have not yet tried switching my host OS to RHEL or testing other guests, but I am happy to do so if the maintainers need more information.

Steps to Reproduce:
1. On an AMD Ryzen (or *maybe* EPYC?) based system, provision a RHEL 8.1 VM with host-model setup, i.e. leave the defaults.

2. Download the CoreOS raw disk image from here onto your VM (note it has to be >4.2 as that's when they introduced luks support):

https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/

3. Decompress the image (gzip -d /path/to/raw.gz)

4. Create a loop device from the disk image so you can access it like a block device (losetup -f -P /path/to/raw)

5. Attempt to dump the crypt/luks info from it (cryptsetup luksDump /dev/loop0p4)
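
The loop-device and luksDump steps above can be wrapped in a deadline so that a hang is reported instead of blocking the shell. This is a rough sketch under stated assumptions: the helper name run_with_deadline, the 30-second limit, and taking the decompressed image path as the first script argument are all illustrative choices, not part of the original report. It relies on timeout(1) from coreutils, which exits with status 124 when the deadline expires.

```shell
#!/bin/sh
# run_with_deadline SECONDS COMMAND [ARGS...]
# Prints "ok" if COMMAND succeeds, "hang" if it had to be killed after
# SECONDS, or "error (exit N)" for any other failure.
run_with_deadline() {
    secs=$1
    shift
    timeout "$secs" "$@" >/dev/null 2>&1
    rc=$?
    case $rc in
        0)   echo "ok" ;;
        124) echo "hang" ;;
        *)   echo "error (exit $rc)" ;;
    esac
}

# Only touch block devices when an image path is given (requires root).
if [ -n "${1:-}" ]; then
    # --show prints the loop device that was allocated, e.g. /dev/loop0
    loopdev=$(losetup -f -P --show "$1") || exit 1
    # Partition 4 is the LUKS partition used in the report.
    run_with_deadline 30 cryptsetup luksDump "${loopdev}p4"
    losetup -d "$loopdev"
fi
```

On an affected guest this would be expected to print "hang"; with rdrand disabled it should print "ok".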

Actual results:

With rdrand enabled this currently hangs indefinitely, and must be killed.

Expected results:

The crypt information can be dumped immediately, and in terms of the original use-case, the CoreOS first boot is successful.

Thanks!

Comment 1 Gerard Braad (Red Hat) 2020-02-26 09:13:59 UTC
We are also experiencing this issue with our OpenShift cluster images that contain RHCOS. A user has reported this on Hyper-V as well, so it seems to be a hardware issue: https://github.com/code-ready/crc/issues/1035

Comment 2 Christophe Fergeau 2020-02-26 14:58:48 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1745333 for a Fedora bug which seems fairly similar:
« The rdrand instruction is apparently broken on some motherboards with the new Ryzen 3000 CPUs. This issue is supposed to be fixed by BIOS update, but that's not available for all boards yet.

json-c seems to use the rdrand in its initialization. On the broken boards it enters an infinite loop. This causes the boot to hang on systems with encrypted storage (due to cryptsetup using json-c?).

Please consider disabling the rdrand support in json-c. It doesn't seem to be very useful and causes problems. Alternatively, check first if rdrand is reported in /proc/cpuinfo to make the nordrand option effective for json-c. »

The json-c package in RHEL 8.1 is still built with --enable-rdrand, which was disabled in Fedora to avoid that bug.
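
The quoted Fedora report leaves open whether cryptsetup actually uses json-c. One way to check on a given system is to look at the shared libraries the binary links against. A minimal sketch, assuming ldd is available; the helper name links_library is illustrative, and libjson-c is the usual base name of the json-c shared library:

```shell
#!/bin/sh
# links_library BINARY NAME
# Exits 0 if ldd lists a library whose name contains NAME for BINARY,
# 1 if it does not, and 2 if BINARY cannot be found in PATH.
links_library() {
    bin=$(command -v "$1") || return 2
    ldd "$bin" 2>/dev/null | grep -q -- "$2"
}

if links_library cryptsetup libjson-c; then
    echo "cryptsetup links against json-c"
else
    echo "cryptsetup not found, or it does not link against json-c"
fi
```

This only shows direct and transitive dynamic dependencies as resolved by ldd; it says nothing about how json-c itself was configured at build time.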

Comment 3 Gerard Braad (Red Hat) 2020-02-28 09:39:33 UTC
The OP of the issue for CRC has, as instructed, updated his system's (mainboard) BIOS and was not able to reproduce the issue after this. It looks like the BIOS might have contained an updated microcode for the Ryzen 3000 series of CPUs which resolved the issue.

Comment 10 Christophe Fergeau 2020-10-23 15:34:37 UTC
Regarding this bug: according to https://src.fedoraproject.org/rpms/json-c/c/a826638ccf5afc?branch=master this seems to have been fixed upstream in json-c 0.14. The upstream issue was https://github.com/json-c/json-c/issues/588

Comment 20 errata-xmlrpc 2021-05-18 14:47:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (json-c bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1601