Bug 723822
Summary: | Boot occasionally hangs in calibrate_APIC_clock function | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Richard W.M. Jones <rjones> |
Component: | qemu | Assignee: | Fedora Virtualization Maintainers <virt-maint> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | rawhide | CC: | amit.shah, berrange, crobinso, dougsland, dwmw2, ehabkost, gansalmon, glux, itamar, jaswinder, jforbes, jonathan, kernel-maint, knoel, madhu.chinakonda, scottt.tw, tburke, virt-maint |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Fixed In Version: | | Doc Type: | Bug Fix |
Last Closed: | 2012-07-10 11:59:53 UTC | Type: | --- |
Description
Richard W.M. Jones
2011-07-21 09:22:47 UTC
Does NOT seem to happen with qemu 2:0.15.0-0.1.20110718525e3df.fc16

I can't reproduce this with qemu-0.15.0-0.2.20110718525e3df.fc16.x86_64 either. However, I'll make an observation: the next line in normal output would be the one about verifying the APIC:

    [ 0.405947] CPU0: AMD QEMU Virtual CPU version 0.14.50 stepping 03
    [ 0.516930] APIC timer disabled due to verification failure

We've had lots of problems with the qemu APIC in the past, so if we can reproduce this, we should try adding LIBGUESTFS_APPEND=noapic.

Closing since it appears to have "fixed itself" for some reason.

I've just seen this happen again with:
2:qemu-kvm-0.15.0-0.2.20110718525e3df.fc16.i686

Last messages before the hang:

    [ 0.000999] SELinux: Disabled at boot.
    [ 0.000999] Mount-cache hash table entries: 512
    [ 0.000999] Initializing cgroup subsys cpuacct
    [ 0.000999] Initializing cgroup subsys memory
    [ 0.000999] Initializing cgroup subsys devices
    [ 0.000999] Initializing cgroup subsys freezer
    [ 0.000999] Initializing cgroup subsys net_cls
    [ 0.000999] Initializing cgroup subsys blkio
    [ 0.000999] Initializing cgroup subsys perf_event
    [ 0.000999] mce: CPU supports 10 MCE banks
    [ 0.000999] SMP alternatives: switching to UP code
    [ 0.000999] Freeing SMP alternatives: 12k freed
    [ 0.000999] ftrace: allocating 24340 entries in 48 pages
    [ 0.000999] Enabling APIC mode: Flat. Using 1 I/O APICs
    [ 0.000999] CPU0: AMD QEMU Virtual CPU version 0.14.50 stepping 03

This doesn't seem to happen in Rawhide with 2:qemu-kvm-0.15.0-1.fc17.i686. Can we backport 0.15.0 to F16?

Assign to correct component.

Unfortunately I've seen this again on Rawhide 2:qemu-kvm-0.15.0-3.fc17.x86_64. Failed build:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3296778

Is there some way to find out where it's hanging?

No, I tried to reproduce this, but I can't reproduce it locally. It only happens inside Koji which is like the worst place to do any debugging.

I tried to reproduce this bug with our VPS.
The problem occurs only when the guest kernel has the parameter "noapic".

(In reply to comment #9)
> No, I tried to reproduce this, but I can't reproduce it
> locally. It only happens inside Koji which is like the
> worst place to do any debugging.

Interesting ... I'm going to remove that parameter in the next release and see if it happens again.

This bug happens without noapic (i.e. with APICs). I'm still trying to reproduce it locally in the hope of capturing a kernel dump when it happens.

After 1764 iterations of my boot test, I have reproduced this bug on my local machine! Now to capture some information from the KVM process ...

I captured %rip and the hang happens in the loop in arch/x86/kernel/apic/apic.c : calibrate_APIC_clock
http://lxr.linux.no/#linux+v3.0.4/arch/x86/kernel/apic/apic.c#L632

    while (lapic_cal_loops <= LAPIC_CAL_LOOPS)
            cpu_relax();    <--- hangs here

The assembly is:

    ffffffff81b849f3: eb 02                 jmp  ffffffff81b849f7 <setup_boot_APIC_clock+0xa3>
    ffffffff81b849f5: f3 90                 pause
    ffffffff81b849f7: 83 3d e6 f4 05 00 64  cmpl $0x64,0x5f4e6(%rip)  # ffffffff81be3ee4 <lapic_cal_loops>
    ffffffff81b849fe: 7e f5                 jle  ffffffff81b849f5 <setup_boot_APIC_clock+0xa1>

(%rip pointed to the pause instruction)

Note that it consumes lots of CPU so it seems to be spinning in this loop, not stopped at the pause.

(In reply to comment #14)
> while (lapic_cal_loops <= LAPIC_CAL_LOOPS)
> cpu_relax(); <--- hangs here

To clarify what I said: %rip points to that instruction. It doesn't "hang" there. It appears to be spinning in this loop forever.

Still happening occasionally. Note the strange times which I see often (but not always):

    [ 0.000000] Fast TSC calibration using PIT
    [ 0.000000] Detected 2532.902 MHz processor.
    [ 0.000504] Calibrating delay loop (skipped), value calculated using timer frequency.. 5065.80 BogoMIPS (lpj=2532902)
    [ 0.000999] pid_max: default: 32768 minimum: 301
    [ 0.000999] Security Framework initialized
    [ 0.000999] SELinux: Disabled at boot.
    [ 0.000999] Dentry cache hash table entries: 65536 (order: 7, 524288 bytes)
    [ 0.000999] Inode-cache hash table entries: 32768 (order: 6, 262144 bytes)
    [ 0.000999] Mount-cache hash table entries: 256
    [ 0.000999] Initializing cgroup subsys cpuacct
    [ 0.000999] Initializing cgroup subsys memory
    [ 0.000999] Initializing cgroup subsys devices
    [ 0.000999] Initializing cgroup subsys freezer
    [ 0.000999] Initializing cgroup subsys net_cls
    [ 0.000999] Initializing cgroup subsys blkio
    [ 0.000999] Initializing cgroup subsys perf_event
    [ 0.000999] mce: CPU supports 10 MCE banks
    [ 0.000999] SMP alternatives: switching to UP code
    [ 0.000999] Freeing SMP alternatives: 12k freed
    [ 0.000999] ftrace: allocating 25829 entries in 102 pages
    [ 0.000999] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
    [ 0.000999] CPU0: AMD QEMU Virtual CPU version 0.15.0 stepping 03

With the current versions of the kernel and qemu-kvm, the problem does not occur. I think this bug can be closed.

This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

Closing per Comment #17; if anyone hits it again, please reopen.
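
For anyone who hits this again, here is a minimal user-space sketch in C of why the wait loop quoted from calibrate_APIC_clock spins forever once its counter stops advancing. It is not the kernel code and not a proposed fix. As far as I can tell from the 3.0 source, lapic_cal_loops is incremented from the timer interrupt path during calibration, so the loop can only exit while those interrupts keep arriving. The names fake_timer_tick, wait_for_calibration, ticks_arrive and max_spins are hypothetical and exist only for this illustration. The constant 100 matches the 0x64 in the disassembly above (the kernel derives it from HZ, which is assumed to be 1000 here).

```c
#include <assert.h>
#include <stdbool.h>

/* 0x64 in the disassembly; assumption: HZ=1000, so HZ/10 == 100. */
#define LAPIC_CAL_LOOPS 100

/* Stand-in for the kernel counter that the timer interrupt path bumps. */
static volatile int lapic_cal_loops;

/* Hypothetical stand-in for one timer interrupt being delivered. */
static void fake_timer_tick(void)
{
    lapic_cal_loops++;
}

/*
 * Hypothetical bounded version of the wait loop.
 * ticks_arrive selects whether simulated timer interrupts are delivered.
 * Returns true if the counter passed LAPIC_CAL_LOOPS within max_spins
 * iterations, false if the loop would still be spinning.
 */
static bool wait_for_calibration(bool ticks_arrive, long max_spins)
{
    assert(max_spins > 0);
    lapic_cal_loops = 0;

    for (long spins = 0; spins < max_spins; spins++) {
        /* Same exit condition as the kernel loop, inverted. */
        if (lapic_cal_loops > LAPIC_CAL_LOOPS)
            return true;
        /* In the guest this happens asynchronously, if at all. */
        if (ticks_arrive)
            fake_timer_tick();
    }
    return lapic_cal_loops > LAPIC_CAL_LOOPS;
}
```

Calling wait_for_calibration(true, 1000000) returns true once the counter exceeds 100, while wait_for_calibration(false, 1000000) returns false after exhausting max_spins. The second case models the behaviour seen in the guest: the exit condition never becomes true, so an unbounded loop burns CPU indefinitely.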