Description of problem:
After a successful save and restore of a RHEL-5 PV guest, I see continual softlockup messages on the guest's console.

My setup:
Dom0: RHEL 5.1 i686, 4 CPU, 2GB memory
DomU: RHEL 5.0 i686 PV, 4 vCPU, 1.5GB memory

First I boot the domU with the 2.6.18-8.el5 kernel. Then on the dom0, I run:

# xm save rhel5pv /var/lib/xen/save/rhel5pv-save
# sleep 10
# xm restore /var/lib/xen/save/rhel5pv-save

The domain comes back up properly, but after a few minutes, I'll start seeing on the guest console:

BUG: soft lockup detected on CPU#3!
 [<c043ea67>] softlockup_tick+0x98/0xa6
 [<c0408b7d>] timer_interrupt+0x504/0x557
 [<c043ec9b>] handle_IRQ_event+0x27/0x51
 [<c043ed58>] __do_IRQ+0x93/0xe8
 [<c040672b>] do_IRQ+0x93/0xae
 [<c053a045>] evtchn_do_upcall+0x64/0x9b
 [<c0404ec5>] hypervisor_callback+0x3d/0x48
 [<c0407fd1>] raw_safe_halt+0x8c/0xaf
 [<c0402bca>] xen_idle+0x22/0x2e
 [<c0402ce9>] cpu_idle+0x91/0xab
=======================

The domain continues to run; it just spits out one of these messages every once in a while. Note that these softlockups seem to always come on one of the secondary CPUs (i.e. not CPU 0). I'm not sure whether that is relevant, but it is something I noticed.

Next, I shut down the domain, booted into the 2.6.18-37.el5 kernel (with the patch for BZ 250420), and repeated exactly the same test. I got the same result, so nothing has changed in this department between GA and the 5.1 Beta.
I believe this issue was fixed by http://xenbits.xensource.com/xen-unstable.hg/?cs=1a411820230b and the subsequent fix in http://xenbits.xensource.com/xen-unstable.hg/?cs=818da23b7571. These also correspond to http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/61ed8662b69c, which might be easier to extract the patch from. I'm not entirely sure that a save/restore iteration is necessary to trigger the bug, but it does seem to be the easiest way.
Ian, Thanks for the pointers; I will take a look at them and see if they make a difference. Chris Lalancette
Ian, Well, the patch definitely solves the problem; I don't see the softlockup problems anymore. However, I have to admit I don't quite understand how the patch fixes the issue, and it makes me nervous since it changes the code path for bare-metal as well. Are we going to miss softlockup events because of this patch? Has there been any thought to pushing this patch to mainline Linux? Thanks, Chris Lalancette
Firstly, the real culprit here is the CPU hotplug that happens as part of the suspend/resume process, rather than the suspend/resume itself. The observed behaviour was that the recently plugged CPU was going into nohz mode (via stop_hz_timer()) for an extended amount of time just after waking up. The patch clamps this to be less than the time to the next softlockup tick event.

The reason for the long timeout is that the softlockup kick moved out of the normal timer wheel mechanism and into an implicit kick in the timer interrupt (via update_process_times()->run_local_timers()). This change was made somewhere between 2.6.16 and 2.6.18. Previously we were assuming that all events we might need to wake up for were in either the regular timer wheels, the hrtimers, or RCU. Just after plugging the CPU in, it's possible for all of these to be empty. The patch simply adds one more place we need to check for events.

The effect of the patch is only ever to reduce the time the CPU spends in nohz mode, so if anything it will prevent us from missing soft lockup events rather than cause us to miss them. As for the impact on bare metal -- I think next_timer_interrupt() is only used by Xen and S/390, so it'll be pretty limited ;-)

It might be possible to just touch the softlockup watchdog in time-xen.c:timer_interrupt if blocked > 5*HZ (as we already do for stolen > 5*HZ), but I haven't really thought hard about that solution and I'm not immediately convinced it wouldn't hide real lockups.
*** Bug 233435 has been marked as a duplicate of this bug. ***
Some independent testing by one of the hardware vendors seems to confirm that this patch fixes some dom0 softlockup issues as well. Chris Lalancette
Created attachment 200301 [details] Patch to prevent softlockup from upstream This is a patch from upstream Xen to prevent the softlockup problems. It has been ported back to RHEL-5. Chris Lalancette
It seems to still happen on RHEL 5.1 (kernel-xen-2.6.18-53.el5 - x86_64 on a non-VT Xeon) - no manual suspend/resume was done; there just happen to be a couple of these oopses in the dmesg log:

BUG: soft lockup detected on CPU#1!
Call Trace:
 <IRQ> [<ffffffff802aaab5>] softlockup_tick+0xd5/0xe7
 [<ffffffff8026cb4a>] timer_interrupt+0x396/0x3f2
 [<ffffffff80210aed>] handle_IRQ_event+0x2d/0x60
 [<ffffffff802aae3d>] __do_IRQ+0xa4/0x105
 [<ffffffff8028870f>] _local_bh_enable+0x61/0xc5
 [<ffffffff8026a90e>] do_IRQ+0xe7/0xf5
 [<ffffffff803965e3>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff8025d8ce>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026be87>] raw_safe_halt+0x84/0xa8
 [<ffffffff80269453>] xen_idle+0x38/0x4a
 [<ffffffff80247b98>] cpu_idle+0x97/0xba
Engineering,

A customer was facing a similar problem. The description is given below.

Description of problem:
A "BUG: soft lockup detected on CPU#N!" message and call trace are printed to the domU console when the domain is unpaused after a pause of 15-30 seconds. This problem occurs for both PV and FV domU. After this problem occurs, the OS still continues working without a problem. For example, if "vmstat 1" is executed before "xm pause", it continues running normally after the "soft lockup" message is printed.

Version-Release number of selected component:
RHEL5.1GA-RC
xen-3.0.3-41.el5
kernel-xen-2.6.18-52.el5

How reproducible:
Almost always, but it rarely happens only in a para-virtualized x86_64 domU on an x86_64 dom0 (it may not happen in 5 tries).

Steps to Reproduce:
- boot domU
- execute "xm console <domid>"
- execute "xm pause" on the domU
- wait 15-30 seconds or more
- execute "xm unpause" on the domU

They tested with the kernel at http://people.redhat.com/clalance/rpms/kernel-xen-2.6.18-56.el5softlock.x86_64.rpm, but they still hit the problem. The results are as follows:

 dom0   | domU            | result
--------+--------+--------+-----------
        | full   | vcpu=1 | Failure
        |        | vcpu=2 | Failure
 i386   +--------+--------+-----------
        | para   | vcpu=1 | Okay
        |        | vcpu=2 | Okay
--------+--------+--------+-----------
        | full   | vcpu=1 | Okay
        | i386   | vcpu>1 | not tested (IT#129501)
        +--------+--------+-----------
        | full   | vcpu=1 | Failure
        | x86_64 | vcpu=2 | Failure
 x86_64 +--------+--------+-----------
        | para   | vcpu=1 | Failure
        | i386   | vcpu=2 | Failure
        +--------+--------+-----------
        | para   | vcpu=1 | Failure
        | x86_64 | vcpu=2 | Failure

Let me know if you would like me to create a new bz for this issue.

This event sent from IssueTracker by sprabhu
 issue 136310
*** Bug 425972 has been marked as a duplicate of this bug. ***
I have a comparable problem with a Red Hat cluster and live migration. After the migration the clock stops running for a few minutes. We get lots of log messages saying "BUG: soft lockup detected on CPU#<1|2|3>!". The kernel version of the dom0 and the domU is 2.6.18-53.1.4.el5xen. The domU is a paravirtualized host.
in 2.6.18-66.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
I have installed this kernel as the dom0 and domU kernel on the cluster. Live migration now works without the messages, but I still see some problems with the clock after migration. On the first migration (while sleep 1;do date;done):

Fri Jan 11 11:57:06 CET 2008
Fri Jan 11 11:57:07 CET 2008
Fri Jan 11 11:57:08 CET 2008
Fri Jan 11 12:08:18 CET 2008
Fri Jan 11 12:08:19 CET 2008
Fri Jan 11 12:08:20 CET 2008

Why does the clock jump 10 minutes? After that I migrated the host back; the clock ran for a few seconds and then stopped for 10 minutes, until it was the same as on the dom0 host:

Fri Jan 11 12:11:08 CET 2008
Fri Jan 11 12:11:09 CET 2008
Fri Jan 11 12:11:10 CET 2008
Fri Jan 11 12:11:11 CET 2008
Fri Jan 11 12:11:12 CET 2008
Fri Jan 11 12:11:13 CET 2008
Fri Jan 11 12:11:13 CET 2008
... (10 minutes)
Fri Jan 11 12:11:13 CET 2008
Fri Jan 11 12:11:13 CET 2008 (dom0 has the same time)
Fri Jan 11 12:11:14 CET 2008
Fri Jan 11 12:11:15 CET 2008

NOTE: Both dom0 systems have the same time (synced to an NTP clock).
This BZ is specifically about PV guests having softlockups. The problem with fully virtualized guests may or may not be solvable. The issue is that fully virtualized guests don't actually know when they are being paused, etc., so they can't necessarily take steps to correct for it. When they do wake up, they will realize they haven't run for a long time and think they have a softlockup. I'm not 100% certain that is what is happening in the fully-virt case, but it seems likely. In any case, this BZ does not address that at all. Chris Lalancette
No, again, this is *only* for PV. It will have no effect on FV guests. Chris Lalancette
It's just the way that virtualization works. In theory, it is a bug that can be fixed by paravirtualized timers (as I described earlier). In practice, paravirtualized timers are not yet written, and even once they are written, they probably can't be backported to RHEL-4 or RHEL-5. RHEL-6 is the earliest time frame in which you might expect to see them. From that perspective, I would close this as WONTFIX and release note it. Chris Lalancette
Well, I don't know the exact circumstances of your customer, or your exact bug. You should open a bug describing the problem and giving the logs and package versions, and then we can look at what the problem is. Chris Lalancette
*** Bug 433522 has been marked as a duplicate of this bug. ***
I have filed a service request as well. The number of that request is 1797096.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html
We also hit this failure under RHEL 5.2, both x86 and x86_64.
We experienced the same problem with RHEL 5.5 x86_64.

Apr 21 17:05:36 machine kernel: BUG: soft lockup - CPU#11 stuck for 10s! [perl:25938]
Apr 21 17:05:36 machine kernel: BUG: soft lockup - CPU#9 stuck for 10s! [irqbalance:3560]