Run this command:

  $ while guestfish -a /dev/null -v run >& /tmp/log; do echo -n . ; done

Occasionally (less than 1 in 100 times) it will hang. See the attached log, but the last lines of output are:

  [ 0.070120] x86/cpu: User Mode Instruction Prevention (UMIP) activated
  [ 0.070120] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
  [ 0.070120] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0
  [ 0.070120] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
  [ 0.070120] Spectre V2 : Mitigation: Retpolines
  [ 0.070120] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
  [ 0.070120] Spectre V2 : Spectre v2 / SpectreRSB : Filling RSB on VMEXIT
  [ 0.070120] Spectre V2 : Enabling Speculation Barrier for firmware calls
  [ 0.070120] RETBleed: Mitigation: untrained return thunk
  [ 0.070120] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
  [ 0.070120] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
  [ 0.070120] Freeing SMP alternatives memory: 48K

Normally we would expect to see this line next:

  [ 0.070794] smpboot: CPU0: AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)

Reproducible: Sometimes

Steps to Reproduce:
1. See description above
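The loop in the description never exits once guestfish hangs, so an unattended run needs a time limit. The following is a minimal sketch of one way to do that, not part of the original report: the helper name run_until_hang, the 60 second limit and the iteration cap are all assumptions chosen for illustration. It relies on coreutils timeout, which exits with status 124 when the limit is reached.

```shell
#!/bin/bash
# Hypothetical helper (not from the original report): repeat a command
# until one run exceeds a time limit, which is treated as a hang.
# Usage: run_until_hang LIMIT_SECONDS MAX_ITERATIONS COMMAND [ARGS...]
run_until_hang() {
    local limit=$1 max=$2 i status
    shift 2
    for ((i = 1; i <= max; i++)); do
        # Same log path as the original reproducer; each run overwrites
        # it, so the log of the hung run is what remains afterwards.
        timeout "$limit" "$@" > /tmp/log 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            # coreutils timeout exits with 124 when the limit expires.
            echo "hang detected on iteration $i"
            return 1
        elif [ "$status" -ne 0 ]; then
            echo "command failed with status $status on iteration $i"
            return 2
        fi
    done
    echo "no hang in $max iterations"
    return 0
}
```

For example, `run_until_hang 60 1000 guestfish -a /dev/null -v run` would stop at the first iteration that takes longer than 60 seconds and leave that run's output in /tmp/log. The limit should be much longer than a normal appliance boot on the test machine.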
Host kernel 6.4.0-0.rc2.23.fc39.x86_64
Guest kernel 6.4.0-0.rc5.41.fc39.x86_64
qemu-system-x86-8.0.0-4.fc39.x86_64
glibc-2.37.9000-10.fc39.x86_64
Created attachment 1969619 [details] log file
Created attachment 1969620 [details] guestfs-m5oblnwjlg44snt8.log (qemu command)

Note that "qemu-kvm: terminating on signal 15" in this log comes from me killing qemu after the hang; it is not related to the hang itself.
Still happens with kernel 6.4.0-0.rc5.41.fc39.x86_64 and qemu-system-x86-8.0.0-4.fc39.x86_64 which are the latest in Rawhide at time of writing.
Created attachment 1969789 [details] another hang log

I reproduced the hang after about 280 iterations on Fedora 37 on Intel hardware:

host kernel-6.2.15-200.fc37.x86_64
qemu-system-x86-7.0.0-15.fc37.x86_64
glibc-2.36-9.fc37.x86_64

$ grep 'model name' /proc/cpuinfo | head -n1
model name : Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
Upstream bug: https://gitlab.com/qemu-project/qemu/-/issues/1696
It seems to be a kernel bug: https://lkml.org/lkml/2023/6/13/733
Seems to be fixed by: https://lore.kernel.org/all/bead7acb-ed71-4a14-0094-f8e39323a3b5@grsecurity.net/T/#m3ba5981ed4d5b534fa53589e226d832639584826
Fixed in kernel commit 13bb06f8dd42071cb9a49f6e21099eea05d4b856
Any chance we could get the fix backported? This causes problems not only in our CI, but also in our build system (which uses guestfish quite a lot to manipulate disk images).
We absolutely need to fix this everywhere the original bug ended up (including RHEL if it's there), because this is a serious issue.
FWIW I've seen stable backports posted by Greg K-H to the following branches: 6.3, 6.1, 5.15, 5.10, 5.4.

So all of those versions of Linux are potentially affected if they've been following the upstream stable branch.