Description of problem:
A RHEL6 32-bit HVM guest fails to boot with 4 vcpus on the large x86_64 host; booting also fails with vcpus=2. Installing the RHEL6 32-bit HVM guest via PXE or an ISO image fails as well.

Version-Release number of selected component (if applicable):
host: x86_64, kernel-xen-2.6.18-236.el5, xen-3.0.3-120.el5, 96-core, 980G ram
guest: RHEL-Server-6.0-32-20100922.1-hvm.raw

How reproducible:
100%

Steps to Reproduce:
1. Set vcpus=4 in the config file and run "xm create $config -c"

Actual results:
The RHEL6 32-bit HVM guest does not boot with 4 vcpus and prints many "BUG: soft lockup - CPU#0 stuck for 108s" messages.

Expected results:
The RHEL6 32-bit HVM guest should boot normally and work well.

Additional info:
1. The RHEL6 32-bit HVM guest boots with 4 vcpus on my own work machine, which has 4 CPUs and 8G of RAM.
2. The same case also fails on the large host with kernel-xen-231.
Created attachment 467390 [details] config file
Created attachment 467392 [details] console output when failing boot
Created attachment 467394 [details] xend log for failing boot
Created attachment 467395 [details] xm dmesg for failing boot
xm info on the large host:

[root@intel-e7450-512-1 xen]# xm info
host                   : intel-e7450-512-1.englab.nay.redhat.com
release                : 2.6.18-236.el5xen
version                : #1 SMP Mon Dec 6 19:01:22 EST 2010
machine                : x86_64
nr_cpus                : 96
nr_nodes               : 1
sockets_per_node       : 16
cores_per_socket       : 6
threads_per_core       : 1
cpu_mhz                : 2398
hw_caps                : bfebfbff:20100800:00000000:00000940:000ce3bd:00000000:00000001
total_memory           : 982014
free_memory            : 930427
node_to_cpu            : node0:0-95
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-236.el5
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)
cc_compile_by          : mockbuild
cc_compile_domain      : redhat.com
cc_compile_date        : Mon Dec 6 18:38:03 EST 2010
xend_config_format     : 2
1. Can you reproduce the "INFO: task khubd:45 blocked for more than 120 seconds." message if you retry the boot?
2. Can you please get a crash dump of the guest?
3. It would be interesting to install "hwloc" (http://www.open-mpi.org/projects/hwloc/) on the host from source, and run "lstopo host.png". If you have time for this, please attach "host.png" (shouldn't be big).

Thanks.
Just to be clear, since it wasn't written anywhere explicitly, the guest can be booted with 1 vcpu, correct? Are you attempting to boot with or without pv-on-hvm drivers? Or both?
(In reply to comment #7)
> Just to be clear, since it wasn't written anywhere explicitly, the guest can be
> booted with 1 vcpu, correct?

Yes. The guest can be booted with 1 vcpu.

> Are you attempting to boot with or without
> pv-on-hvm drivers? Or both?

byu will confirm this tomorrow.
(In reply to comment #8)
> (In reply to comment #7)
> > Just to be clear, since it wasn't written anywhere explicitly, the guest can be
> > booted with 1 vcpu, correct?
>
> Yes. The guest can be booted with 1 vcpu.
>
> > Are you attempting to boot with or without
> > pv-on-hvm drivers? Or both?
>
> This would be confirmed by byu tomorrow.

We booted the guest with pv_on_hvm=enable when the bug occurred. We will try to provide the information requested in comment 6 and comment 7, probably next week, because the big machine is currently in use by another team.
Ok, so I understand that we haven't tried > 1 vcpu without pv_on_hvm yet. We should definitely try that when the machine is returned. Thanks!
(In reply to comment #10)
> Ok, so I understand that we haven't tried > 1 vcpu without pv_on_hvm yet. We
> should definitely try that when the machine is returned. Thanks!

The problem also exists when the guest is started without pv_on_hvm=enable. So far it can only be reproduced on the machine that the reporter used.
This appears to be clocksource related. When I reproduced the problem I saw that we get soft lockups during boot that eventually make the kernel give up. I looked closer at the logs that came out prior to the first soft lockup and saw

...
TSC synchronization [CPU#0 -> CPU#1]:
Measured 20035715238 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to check_tsc_sync_source failed
...
* Found PM-Timer Bug on the chipset. Due to workarounds for a bug,
* this clock source is slow. Consider trying other clock sources
...
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
hpet0: 3 comparators, 64-bit 74.964345 MHz counter
Switching to clocksource hpet
BUG: soft lockup - CPU#0 stuck for 106s! [swapper:1]
...

Adding clocksource=jiffies to the command line allowed me to boot and use the machine fine with 4 vcpus, both with and without pv_on_hvm drivers enabled. This problem isn't seen on most machines (like my test box) because the tsc generally works on them.

I also tried with the latest 6.1 beta kernel on this machine. The problem reproduced without clocksource=jiffies and went away with clocksource=jiffies. That's not too unexpected considering we haven't changed anything in the clocksource-related code. It does mean I need to look into a 6.1 fix. I'll start by testing an upstream kernel to see if a fix exists.
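Once a guest is up (for example after booting with clocksource=jiffies), one way to confirm which clocksource the kernel settled on is to read the standard clocksource files under sysfs. The following is only a minimal sketch: the helper names are made up for illustration, and it assumes the guest kernel exposes /sys/devices/system/clocksource/clocksource0.

```python
from pathlib import Path

# Standard sysfs location of the clocksource controls on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as reported by the kernel.

    `base` is a parameter only so the function can be pointed at a copy
    of the two files; the helper name itself is hypothetical.
    """
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def is_hpet_active(base=CLOCKSOURCE_DIR):
    """True if the kernel is currently using the hpet clocksource."""
    current, _ = read_clocksources(base)
    return current == "hpet"
```

Calling read_clocksources() with no arguments inside the guest returns the active clocksource and the list the kernel considers usable, which makes it easy to see whether a command-line override actually took effect.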
Another note is that I installed a 64-bit rhel 6.1 beta HVM guest and got soft lockups with it as well. So this isn't 32-bit specific.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: RHEL6 Xen HVM guests may experience constant soft lockups when booting on some machines. A possible workaround is to boot with clocksource=jiffies on the guest's kernel command line.
The latest Linux kernel (2.6.38-rc4+) also detects warp for the tsc and switches to hpet on this machine and eventually fails to boot. The symptoms of the boot failure are different, but if you look at where it BUGs it's likely due to a softirq not firing. It can be worked around in the same way, with 'clocksource=jiffies'.

Another note is that with both the rhel kernel and the latest upstream kernel the hpet appears to be stable and usable if the vcpus are pinned to pcpus that are on the same node of this numa machine.
The problem seems to be that the HPET is unreliable on RHEL5 Xen. Perhaps we can blacklist it in the kernel, so that it goes straight from tsc to jiffies?
*** Bug 712319 has been marked as a duplicate of this bug. ***
This needs a patch like this:

diff --git a/tools/ioemu/hw/piix4acpi.c b/tools/ioemu/hw/piix4acpi.c
index f607074..0229773 100644
--- a/tools/ioemu/hw/piix4acpi.c
+++ b/tools/ioemu/hw/piix4acpi.c
@@ -532,7 +532,7 @@ void pci_piix4_acpi_init(PCIBus *bus, int devfn)
     pci_conf[0x01] = 0x80;
     pci_conf[0x02] = 0x13;
     pci_conf[0x03] = 0x71;
-    pci_conf[0x08] = 0x01; /* B0 stepping */
+    pci_conf[0x08] = 0x03;
     pci_conf[0x09] = 0x00; /* base class */
     pci_conf[0x0a] = 0x80; /* Sub class */
     pci_conf[0x0b] = 0x06;

which is a backport of this upstream qemu commit:

commit a78b03cb6985466beb006b4e0eec4ba22d537c43
Author: balrog <balrog@c046a42c-6fe2-441c-8c8c-71466251a162>
Date:   Mon Jan 14 03:43:18 2008 +0000

    Bump ACPI/SMBus PIIX4 controller revision to 3 (Marcelo Tosatti).
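To make the effect of the patch concrete: the byte at offset 0x08 of PCI configuration space is the revision ID, so the change makes the emulated PIIX4 ACPI device report revision 3 instead of 1. The sketch below decodes those header fields from raw configuration bytes. It rests on two assumptions that are not stated in this report: that byte 0x00 holds 0x86 (the low byte of the Intel vendor ID 0x8086), and that the guest kernel applies its "PM-Timer Bug" slow path only to revisions below 3. All function names are hypothetical.

```python
import struct


def parse_pci_header(config: bytes):
    """Extract (vendor ID, device ID, revision ID) from PCI config space.

    Offsets follow the standard PCI header: vendor at 0x00 and device at
    0x02 (both little-endian 16-bit values), revision ID at 0x08.
    """
    if len(config) < 9:
        raise ValueError("need at least 9 bytes of config space")
    vendor, device = struct.unpack_from("<HH", config, 0)
    revision = config[0x08]
    return vendor, device, revision


def uses_slow_pm_timer_path(revision: int) -> bool:
    """Assumption: revisions below 3 trigger the guest's PM-Timer workaround."""
    return revision < 3


def sample_config(revision: int) -> bytes:
    """Build an illustrative header using the values visible in the patch."""
    cfg = bytearray(64)
    cfg[0x00] = 0x86  # assumed: not shown in the diff context
    cfg[0x01] = 0x80
    cfg[0x02] = 0x13
    cfg[0x03] = 0x71
    cfg[0x08] = revision
    return bytes(cfg)
```

With the old value, parse_pci_header(sample_config(0x01)) reports revision 1, which falls under the assumed workaround threshold; with the patched value 0x03 it does not.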
Deleted Technical Notes Contents. Old Contents: RHEL6 Xen HVM guests may experience constant soft lockups when booting on some machines. A possible workaround is to boot with clocksource=jiffies on the guest's kernel command line.
Hi, can you please retest the problem with RHEL 5.7 xen package and with package from brew build: https://brewweb.devel.redhat.com/taskinfo?taskID=3503146 Is the problem solved in brew build?
Hi Miroslav,

The problem still exists in both xen132 and xen-3.0.3-132.el5661211. Boot logs attached.

Yuyu Zhou
Created attachment 514114 [details] boot log of xen132
Created attachment 514115 [details] boot log of xen-3.0.3-132.el5661211
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: In some cases, Red Hat Enterprise Linux 6 guests running fully-virtualized under Red Hat Enterprise Linux 5 experience time drift or fail to boot. In some cases, drifting may start after migration of the virtual machine to a host with different speed. This is due to limitations in the Red Hat Enterprise Linux 5 Xen hypervisor. To work around this, add "clocksource=acpi_pm" to the kernel command line for the guest. Alternatively, if running under Red Hat Enterprise Linux 5.7 or newer, locate the guest configuration file for the guest and add "hpet=0" there.
Yes, "acpi_pm" should work, what does the call trace look like? Same as comment 2? Also, can you attach the boot log for "hpet=0"? Changing the technote to jiffies doesn't sound too bad anyway, but I'd like to understand what's going on.
Created attachment 527662 [details] 20111012-661211-acpi_pm-boot-log Call Trace when set "clocksource=acpi_pm".
Created attachment 527663 [details] 20111012-661211-hpet-0-boot-log Boot log for hpet=0.
Setting clocksource=acpi_pm isn't enough to override the hpet. We have this in the boot log before the traces start

...
Switching to clocksource hpet
...
Override clocksource acpi_pm is not HRT compatible. Cannot switch while in HRT/NOHZ mode
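When comparing several attached boot logs, it can help to pull out just the clocksource-related lines. The following is a rough sketch that scans a log for the three message formats quoted in this bug; the function name and the shape of the returned dictionary are made up for illustration.

```python
import re

# Patterns taken from the messages quoted in this report.
SWITCH_RE = re.compile(r"Switching to clocksource (\S+)")
OVERRIDE_FAIL_RE = re.compile(r"Override clocksource (\S+) is not HRT compatible")
LOCKUP_RE = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s")


def summarize_boot_log(text):
    """Collect clocksource switches, rejected overrides and soft lockups."""
    return {
        "switched_to": SWITCH_RE.findall(text),
        "rejected_overrides": OVERRIDE_FAIL_RE.findall(text),
        "lockups": [(int(cpu), int(secs)) for cpu, secs in LOCKUP_RE.findall(text)],
    }
```

A log where "rejected_overrides" contains acpi_pm and "switched_to" ends with hpet matches the situation described above: the override was requested but the kernel stayed on hpet.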
Then it must be jiffies.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-In some cases, Red Hat Enterprise Linux 6 guests running fully-virtualized under Red Hat Enterprise Linux 5 experience time drift or fail to boot. In some cases, drifting may start after migration of the virtual machine to a host with different speed. This is due to limitations in the Red Hat Enterprise Linux 5 Xen hypervisor. To work around this, add "clocksource=acpi_pm" to the kernel command line for the guest. Alternatively, if running under Red Hat Enterprise Linux 5.7 or newer, locate the guest configuration file for the guest and add "hpet=0" there.
+In some cases, Red Hat Enterprise Linux 6 guests running fully-virtualized under Red Hat Enterprise Linux 5 experience time drift or fail to boot. In some cases, drifting may start after migration of the virtual machine to a host with different speed. This is due to limitations in the Red Hat Enterprise Linux 5 Xen hypervisor. To work around this, add "clocksource=jiffies" to the kernel command line for the guest. Alternatively, if running under Red Hat Enterprise Linux 5.7 or newer, locate the guest configuration file for the guest and add "hpet=0" there.
I put jiffies in the meanwhile. However, I found this too: https://lkml.org/lkml/2011/5/19/490 and I'll brew a kernel for testing soon. If that fixes acpi_pm, we should include that patch in RHEL6 too.
Just putting nohpet on the guest's kernel command line might also work.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-In some cases, Red Hat Enterprise Linux 6 guests running fully-virtualized under Red Hat Enterprise Linux 5 experience time drift or fail to boot. In some cases, drifting may start after migration of the virtual machine to a host with different speed. This is due to limitations in the Red Hat Enterprise Linux 5 Xen hypervisor. To work around this, add "clocksource=jiffies" to the kernel command line for the guest. Alternatively, if running under Red Hat Enterprise Linux 5.7 or newer, locate the guest configuration file for the guest and add "hpet=0" there.
+In some cases, Red Hat Enterprise Linux 6 guests running fully-virtualized under Red Hat Enterprise Linux 5 experience time drift or fail to boot. In some cases, drifting may start after migration of the virtual machine to a host with different speed. This is due to limitations in the Red Hat Enterprise Linux 5 Xen hypervisor. To work around this, add "clocksource=acpi_pm" or "clocksource=jiffies" to the kernel command line for the guest. Alternatively, if running under Red Hat Enterprise Linux 5.7 or newer, locate the guest configuration file for the guest and add "hpet=0" there.
Documented at https://access.redhat.com/kb/docs/DOC-65074
Verified this problem with 2.6.18-302.el5xen.

Version:
kernel-xen-2.6.18-302.el5
xen-3.0.3-135.el5
xen-libs-3.0.3-135.el5

Host CPU: Intel E7450

Steps:
1. Create a RHEL6.2 HVM guest with vcpus=8.
2. With no clocksource specified, the guest prints a call trace with the message "BUG: soft lockup - CPU#5 stuck" on the console.
3. The guest starts successfully when the clocksource is specified in one of the following ways:
- Set "hpet=0" in the guest conf
- Add "clocksource=acpi_pm" to the guest kernel command line
- Add "clocksource=jiffies" to the guest kernel command line
- Add "nohpet" to the guest kernel command line
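For the first option, the change to the guest configuration file might look roughly like the fragment below. This is a sketch only: it assumes the xm configuration format, which uses Python syntax, and it shows just the settings mentioned in this report plus the HVM builder line. The rest of the guest's configuration (name, memory, disk, and so on) is omitted and stays as it was.

```python
# Fragment of an xm guest configuration file (the format uses Python syntax).
builder = "hvm"  # fully virtualized guest, as in this report
vcpus = 8        # value used in the verification steps
hpet = 0         # per the technical note; requires RHEL 5.7 or newer on the host
```

The other three options are set inside the guest instead, by appending the parameter to the kernel line of the guest's boot loader configuration.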
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0160.html