Bug 2213346

Summary: CONFIG_PRINTK_TIME causes occasional hangs when booting on qemu
Product: [Fedora] Fedora
Reporter: Richard W.M. Jones <rjones>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: POST ---
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: unspecified
Version: rawhide
CC: acaringi, accounts+fedora, adscvr, airlied, alciregi, berrange, bskeggs, cfergeau, crobinso, dustymabe, eblake, hdegoede, hpa, jarodwilson, josef, kernel-maint, lgoncalv, linville, masami256, mcascell, mchehab, pbonzini, philmd, ptalbert, rjones, steved, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
URL: https://lkml.org/lkml/2023/6/13/733
Bug Blocks: 910269    
Attachments (description, flags):
log file, flags: none
guestfs-m5oblnwjlg44snt8.log (qemu command), flags: none
another hang log, flags: none

Description Richard W.M. Jones 2023-06-07 21:30:58 UTC
Run this command:

$ while guestfish -a /dev/null -v run >& /tmp/log; do echo -n . ; done

Occasionally (less than 1 in 100 times) it will hang.  See the attached log, but the last lines of output are:

[    0.070120] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.070120] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
[    0.070120] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0
[    0.070120] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.070120] Spectre V2 : Mitigation: Retpolines
[    0.070120] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    0.070120] Spectre V2 : Spectre v2 / SpectreRSB : Filling RSB on VMEXIT
[    0.070120] Spectre V2 : Enabling Speculation Barrier for firmware calls
[    0.070120] RETBleed: Mitigation: untrained return thunk
[    0.070120] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    0.070120] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
[    0.070120] Freeing SMP alternatives memory: 48K

Normally we would expect to see this line next:

[    0.070794] smpboot: CPU0: AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
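
To tell this hang apart from unrelated failures, a saved log can be checked for the signature described above: the "Freeing SMP alternatives memory" line is present but the "smpboot: CPU0:" line never appears. This is only a sketch. The function name log_shows_hang is hypothetical, and it assumes the two marker strings quoted in this report appear unchanged in the guest kernel output.

```shell
# Hypothetical helper: succeed (exit status 0) when a boot log matches
# the hang signature from this report, fail (exit status 1) otherwise.
# Assumption: the marker strings are exactly as quoted in the log above.
log_shows_hang() {
    log=$1
    # The last message printed before the hang must be present ...
    grep -q 'Freeing SMP alternatives memory' "$log" || return 1
    # ... and the message that normally follows it must be absent.
    if grep -q 'smpboot: CPU0:' "$log"; then
        return 1
    fi
    return 0
}

# Example (not run here): log_shows_hang /tmp/log && echo "looks like this bug"
```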


Reproducible: Sometimes

Steps to Reproduce:
1. See description above
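
Because the loop in the description blocks forever once the guest hangs, an unattended run may be easier with a time limit on each iteration. The following is a minimal sketch using timeout(1) from coreutils, which exits with status 124 when the limit expires. The function name run_until_hang, the LOG variable and the 60 second limit in the example are assumptions, not part of the original reproducer, and the limit must be longer than a normal guestfish run on the test machine.

```shell
# Hypothetical wrapper: run a command up to MAX times, each with a time
# limit of LIMIT seconds, and stop at the first run that times out.
# Returns 0 if no run hung, 1 on a timeout, 2 on any other failure.
# The output of the most recent run is kept in $LOG (default /tmp/log).
run_until_hang() {
    limit=$1
    max=$2
    shift 2
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        status=0
        timeout "$limit" "$@" > "${LOG:-/tmp/log}" 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            # timeout(1) reports an expired limit with exit status 124
            echo "hang on iteration $i"
            return 1
        elif [ "$status" -ne 0 ]; then
            echo "failure (status $status) on iteration $i"
            return 2
        fi
    done
    echo "no hang in $max iterations"
    return 0
}

# Example (not run here), using the command from the description:
#   run_until_hang 60 1000 guestfish -a /dev/null -v run
```

When the wrapper reports a hang, the log of the hung run is left in $LOG for inspection.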

Comment 1 Richard W.M. Jones 2023-06-07 21:32:34 UTC
Host kernel 6.4.0-0.rc2.23.fc39.x86_64
Guest kernel 6.4.0-0.rc5.41.fc39.x86_64
qemu-system-x86-8.0.0-4.fc39.x86_64
glibc-2.37.9000-10.fc39.x86_64

Comment 2 Richard W.M. Jones 2023-06-07 21:33:07 UTC
Created attachment 1969619 [details]
log file

Comment 3 Richard W.M. Jones 2023-06-07 21:35:01 UTC
Created attachment 1969620 [details]
guestfs-m5oblnwjlg44snt8.log (qemu command)

Note that "qemu-kvm: terminating on signal 15" in this log is from
me killing qemu after the hang; it's not related to the hang itself.

Comment 4 Richard W.M. Jones 2023-06-08 14:46:37 UTC
Still happens with kernel 6.4.0-0.rc5.41.fc39.x86_64 and
qemu-system-x86-8.0.0-4.fc39.x86_64 which are the latest in
Rawhide at time of writing.

Comment 5 Eric Blake 2023-06-08 14:52:33 UTC
Created attachment 1969789 [details]
another hang log

I reproduced the hang after about 280 iterations on Fedora 37 on Intel hardware:
host kernel-6.2.15-200.fc37.x86_64
qemu-system-x86-7.0.0-15.fc37.x86_64
glibc-2.36-9.fc37.x86_64
$ grep 'model name' /proc/cpuinfo | head -n1
model name	: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz

Comment 6 Richard W.M. Jones 2023-06-08 15:37:51 UTC
Upstream bug:
https://gitlab.com/qemu-project/qemu/-/issues/1696

Comment 7 Richard W.M. Jones 2023-06-13 14:30:56 UTC
It seems to be a kernel bug:
https://lkml.org/lkml/2023/6/13/733

Comment 9 Richard W.M. Jones 2023-06-22 14:02:13 UTC
Fixed in kernel commit 13bb06f8dd42071cb9a49f6e21099eea05d4b856

Comment 10 Dusty Mabe 2023-06-26 02:51:09 UTC
Any chance we could get the fix backported? This causes problems not only in our CI, but also in our build system (which uses guestfish quite a lot to manipulate disk images).

Comment 11 Richard W.M. Jones 2023-06-26 09:44:41 UTC
We absolutely need to fix this everywhere the original bug ended up (including
RHEL if it's there) because this is a serious issue.

Comment 12 Richard W.M. Jones 2023-06-27 07:53:10 UTC
FWIW I've seen stable backports posted by Greg K-H to the following branches:

6.3
6.1
5.15
5.10
5.4

So all of those versions of Linux are potentially affected if they've been
following the upstream stable branch.
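
As a rough aid for triage, a kernel release string can be matched against the stable series listed above. This is only a sketch: the function name in_backport_series is hypothetical, and a match says only that the kernel belongs to a series that received a backport, not whether a particular build carries the bug or the fix. Mainline kernels outside these series, such as the 6.4 release candidates in this report, are not covered by the check.

```shell
# Hypothetical helper: succeed when a kernel release string (for example
# the output of "uname -r") belongs to one of the stable series that
# comment 12 lists as having received a backport.
in_backport_series() {
    # Keep only the leading "major.minor" part, e.g. 6.1.35-200 -> 6.1
    series=$(printf '%s\n' "$1" |
        sed -n 's/^\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')
    case "$series" in
        6.3|6.1|5.15|5.10|5.4) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here):
#   in_backport_series "$(uname -r)" && echo "check that the fix is included"
```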