Bug 250994 - [RHEL5]: Softlockup after save/restore in PV guest
Summary: [RHEL5]: Softlockup after save/restore in PV guest
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.1
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Chris Lalancette
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 233435 425972 433522
Depends On:
Blocks: 391501 444402 445799 1300182
 
Reported: 2007-08-06 13:30 UTC by Chris Lalancette
Modified: 2018-10-19 23:06 UTC
CC List: 15 users

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 14:48:33 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch to prevent softlockup from upstream (2.07 KB, patch)
2007-09-19 23:45 UTC, Chris Lalancette


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0314 0 normal SHIPPED_LIVE Updated kernel packages for Red Hat Enterprise Linux 5.2 2008-05-20 18:43:34 UTC

Description Chris Lalancette 2007-08-06 13:30:48 UTC
Description of problem:
After a successful save and restore of a RHEL-5 PV guest, I see continual
softlockup messages on the guest's console.  My setup:

Dom0: RHEL 5.1 i686, 4 cpu, 2GB memory
DomU: RHEL 5.0 i686 PV, 4 vCPU, 1.5GB memory

First I boot the domU with the 2.6.18-8.el5 kernel.  Then on the dom0, I run:

# xm save rhel5pv /var/lib/xen/save/rhel5pv-save
# sleep 10
# xm restore /var/lib/xen/save/rhel5pv-save

The domain comes back up properly, but after a few minutes, I'll start seeing on
the guest console:

BUG: soft lockup detected on CPU#3!
 [<c043ea67>] softlockup_tick+0x98/0xa6
 [<c0408b7d>] timer_interrupt+0x504/0x557
 [<c043ec9b>] handle_IRQ_event+0x27/0x51
 [<c043ed58>] __do_IRQ+0x93/0xe8
 [<c040672b>] do_IRQ+0x93/0xae
 [<c053a045>] evtchn_do_upcall+0x64/0x9b
 [<c0404ec5>] hypervisor_callback+0x3d/0x48
 [<c0407fd1>] raw_safe_halt+0x8c/0xaf
 [<c0402bca>] xen_idle+0x22/0x2e
 [<c0402ce9>] cpu_idle+0x91/0xab
 =======================

The domain continues to run; it just spits out one of these messages every
once in a while.  Note that these softlockups always seem to come on one of the
secondary CPUs (i.e. not CPU 0).  I'm not sure if that is relevant or not, but
it is something I noticed.

Next, I shut down the domain, booted into the 2.6.18-37.el5 kernel (with the
patch for BZ 250420), and repeated exactly the same test.  I got the same result,
so nothing has changed in this department between GA and the 5.1 Beta.

Comment 1 Ian Campbell 2007-08-06 14:33:16 UTC
I believe this issue was fixed by
http://xenbits.xensource.com/xen-unstable.hg/?cs=1a411820230b and the subsequent
fix in http://xenbits.xensource.com/xen-unstable.hg/?cs=818da23b7571.  These also
correspond to http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/61ed8662b69c,
which might be easier to extract the patch from.

I'm not entirely sure that a save/restore iteration would be necessary to
trigger the bug, but it does seem to be the easiest way.

Comment 2 Chris Lalancette 2007-08-06 14:35:06 UTC
Ian,
     Thanks for the pointers; I will take a look at them and see if they make a
difference.

Chris Lalancette

Comment 4 Chris Lalancette 2007-08-06 21:14:47 UTC
Ian,
     Well, the patch definitely solves the problem; I don't see the softlockup
problems anymore.  However, I have to admit I don't quite understand how the
patch fixes the issue, and it makes me nervous since it changes the code path
for bare-metal as well.  Are we going to miss softlockup events because of this
patch?  Has there been any thought to pushing this patch to mainline Linux?

Thanks,
Chris Lalancette


Comment 5 Ian Campbell 2007-08-07 08:13:07 UTC
Firstly, the real culprit here is the CPU hotplug that happens as part of the
suspend/resume process, rather than the suspend/resume itself.

The observed behaviour was that the recently plugged CPU was going into nohz mode
(via stop_hz_timer()) for an extended amount of time just after waking up.  The
patch clamps this to be less than the time to the next softlockup tick event.

The reason for the long timeout is that the softlockup kick was moved out of the
normal timer wheel mechanism and into an implicit kick in the timer interrupt
(via update_process_times()->run_local_timers()).  This change was made somewhere
between 2.6.16 and 2.6.18.

Previously we were assuming that all events we might need to wake up for
were in either the regular timer wheel, the hrtimers, or RCU.  Just after
plugging the CPU in, it's possible for all of these to be empty.  The patch
simply adds one more place we need to check for events.

The effect of the patch is only ever to reduce the time the CPU goes into nohz
mode for, so if anything it will prevent us from missing softlockup events
rather than cause us to miss them.  As for the impact on bare metal -- I think
next_timer_interrupt() is only used by Xen and S/390, so it'll be pretty limited ;-)

It might be possible to just touch the softlockup watchdog in
time-xen.c:timer_interrupt if blocked > 5*HZ (as we already do for stolen >
5*HZ), but I haven't really thought hard about this solution, and I'm not
immediately convinced it wouldn't hide real lockups.
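
To make the clamping idea concrete, here is a minimal, self-contained sketch of
it in user-space C.  It is not the actual changeset or kernel code: the names,
HZ value, and threshold below are invented for the example.  The point is only
that the idle sleep target becomes the earliest of the next timer-wheel event
and the next point at which the softlockup watchdog has to be kicked, so a
freshly plugged vCPU can never sleep past the watchdog threshold.

/*
 * Illustrative sketch only; not the RHEL-5 patch or kernel code.  It models
 * the clamp described above: even when the timer wheel, hrtimers, and RCU
 * have nothing pending, the sleep target is bounded by the next softlockup
 * kick.  All names and constants are invented.
 */
#include <stdio.h>

#define HZ 250                           /* assumed tick rate */
#define SOFTLOCKUP_THRESHOLD (10 * HZ)   /* watchdog fires after ~10s */

/* Latest jiffy we may sleep to without risking a spurious lockup report;
 * the safety margin here is arbitrary. */
static unsigned long softlockup_deadline(unsigned long last_watchdog_touch)
{
    return last_watchdog_touch + SOFTLOCKUP_THRESHOLD / 2;
}

/* Clamp the nohz sleep target to the earlier of the two events. */
static unsigned long nohz_sleep_target(unsigned long next_timer_event,
                                       unsigned long last_watchdog_touch)
{
    unsigned long deadline = softlockup_deadline(last_watchdog_touch);

    return next_timer_event < deadline ? next_timer_event : deadline;
}

int main(void)
{
    unsigned long now = 1000;

    /* Freshly hot-plugged vCPU: the timer wheel is empty, so the "next
     * timer event" is far away.  Without the clamp the CPU sleeps straight
     * past the softlockup threshold and the BUG message fires on wakeup. */
    unsigned long far_future = now + 60 * HZ;

    printf("unclamped wakeup at jiffy %lu\n", far_future);
    printf("clamped wakeup at jiffy   %lu\n",
           nohz_sleep_target(far_future, now));
    return 0;
}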

Comment 6 Chris Lalancette 2007-08-13 20:18:36 UTC
*** Bug 233435 has been marked as a duplicate of this bug. ***

Comment 7 Chris Lalancette 2007-08-20 16:57:54 UTC
Some independent testing by one of the hardware vendors seems to confirm that
this patch fixes some dom0 softlockup issues as well.

Chris Lalancette

Comment 11 Chris Lalancette 2007-09-19 23:45:27 UTC
Created attachment 200301 [details]
Patch to prevent softlockup from upstream

This is a patch from upstream Xen to prevent the softlockup problems.  It has
been ported back to RHEL-5.

Chris Lalancette

Comment 13 Trond Eivind Glomsrød 2007-11-12 09:53:41 UTC
It still seems to happen on RHEL 5.1 (kernel-xen-2.6.18-53.el5, x86_64 on a
non-VT Xeon).  No manual suspend/resume was done; there just happen to be a
couple of these oopses in the dmesg log:

BUG: soft lockup detected on CPU#1!

Call Trace:
 <IRQ>  [<ffffffff802aaab5>] softlockup_tick+0xd5/0xe7
 [<ffffffff8026cb4a>] timer_interrupt+0x396/0x3f2
 [<ffffffff80210aed>] handle_IRQ_event+0x2d/0x60
 [<ffffffff802aae3d>] __do_IRQ+0xa4/0x105
 [<ffffffff8028870f>] _local_bh_enable+0x61/0xc5
 [<ffffffff8026a90e>] do_IRQ+0xe7/0xf5
 [<ffffffff803965e3>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff8025d8ce>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026be87>] raw_safe_halt+0x84/0xa8
 [<ffffffff80269453>] xen_idle+0x38/0x4a
 [<ffffffff80247b98>] cpu_idle+0x97/0xba




Comment 16 Issue Tracker 2007-12-13 11:51:08 UTC
Engineering,

A customer was facing a similar problem.  The description is given below:

Description of problem:
"BUG: soft lockup detect on CPU##!" message and call trace are printed
out
to domU console, when unpaused after 15 -- 30 seconds pause duration.
This problem occurs both PV and FV domU.

After this problem occurs, the OS still continues working without a problem.
For example, if "vmstat 1" is executed before "xm pause", it still continues
running normally after the "soft lockup detected" message is printed out.

Version-Release number of selected component:
RHEL5.1GA-RC
 xen-3.0.3-41.el5
 kernel-xen-2.6.18-52.el5

How reproducible:
Almost always; the only exception is a paravirtualized x86_64 domU on an
x86_64 dom0, where it rarely happens (it may not happen in 5 tries).

Steps to Reproduce:
- boot the domU.
- execute "xm console <domid>"
- execute "xm pause" on the domU
- wait 15 ~ 30 seconds or more
- execute "xm unpause" on the domU

They tested with the kernel at

http://people.redhat.com/clalance/rpms/kernel-xen-2.6.18-56.el5softlock.x86_64.rpm

However, they still see the problem.  The results are as follows:

dom0   | domU type | domU arch | vCPUs  | result
-------+-----------+-----------+--------+------------------------
i386   | full      |           | vcpu=1 | Failure
i386   | full      |           | vcpu=2 | Failure
i386   | para      |           | vcpu=1 | Okay
i386   | para      |           | vcpu=2 | Okay
x86_64 | full      | i386      | vcpu=1 | Okay
x86_64 | full      | i386      | vcpu>1 | not tested (IT#129501)
x86_64 | full      | x86_64    | vcpu=1 | Failure
x86_64 | full      | x86_64    | vcpu=2 | Failure
x86_64 | para      | i386      | vcpu=1 | Failure
x86_64 | para      | i386      | vcpu=2 | Failure
x86_64 | para      | x86_64    | vcpu=1 | Failure
x86_64 | para      | x86_64    | vcpu=2 | Failure


Let me know if you would like me to create a new bz for this issue.



This event sent from IssueTracker by sprabhu 
 issue 136310

Comment 17 Chris Lalancette 2007-12-17 15:27:15 UTC
*** Bug 425972 has been marked as a duplicate of this bug. ***

Comment 19 Marcel Nijenhof 2008-01-10 17:01:53 UTC
I have a comparable problem with a Red Hat cluster and live migration.

After the migration the clock stops running for a few minutes.

We see lots of log messages of the form "BUG: soft lockup detected on CPU#<1|2|3>!".

The kernel version of the dom0 and the domU is 2.6.18-53.1.4.el5xen.
The domU is a paravirtualized host.

Comment 20 Don Zickus 2008-01-10 20:41:59 UTC
The fix is included in 2.6.18-66.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 21 Marcel Nijenhof 2008-01-11 11:15:10 UTC
I have installed this kernel as the dom0 and domU kernel on the cluster.

Live migration now works without the messages, but I still see some problems
with the clock after migration.

On the first migration (while sleep 1;do date;done):
  Fri Jan 11 11:57:06 CET 2008
  Fri Jan 11 11:57:07 CET 2008
  Fri Jan 11 11:57:08 CET 2008
  Fri Jan 11 12:08:18 CET 2008
  Fri Jan 11 12:08:19 CET 2008
  Fri Jan 11 12:08:20 CET 2008

Why does the clock jump by 10 minutes?

After that I migrated the guest back; the clock ran for a few seconds and
then stopped for 10 minutes, until it was the same as on the dom0 host.
  Fri Jan 11 12:11:08 CET 2008
  Fri Jan 11 12:11:09 CET 2008
  Fri Jan 11 12:11:10 CET 2008
  Fri Jan 11 12:11:11 CET 2008
  Fri Jan 11 12:11:12 CET 2008
  Fri Jan 11 12:11:13 CET 2008
  Fri Jan 11 12:11:13 CET 2008
  ...                            (10 minuts)
  Fri Jan 11 12:11:13 CET 2008
  Fri Jan 11 12:11:13 CET 2008   (dom0 has the same time)
  Fri Jan 11 12:11:14 CET 2008
  Fri Jan 11 12:11:15 CET 2008

NOTE:
  Both dom0 systems have the same time (synced via NTP).

Comment 26 Chris Lalancette 2008-02-13 13:43:28 UTC
This BZ is specifically about PV guests having softlockups.  The problem with
fully virtualized guests may or may not be solvable; the problem is that fully
virtualized guests don't actually know when they are being paused, and so on,
so they can't necessarily take steps to correct for it.  When they do wake up,
they will realize they haven't run for a long time and think they have a softlockup.

I'm not 100% certain that is what is happening in the fully-virt case, but it
seems likely.  In any case, this BZ does not address that at all.

Chris Lalancette
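
For reference, the check that produces these messages amounts to comparing the
current time against the last time the per-CPU watchdog was touched.  The
sketch below is a simplified user-space illustration of that comparison, not
the kernel's softlockup.c; the structure, names, and threshold are invented.
It shows why a fully virtualized guest that is paused for longer than the
threshold prints the BUG message on its first tick after "xm unpause", even
though nothing is actually stuck.

/*
 * Simplified illustration only; not the kernel's softlockup.c.  The detector
 * just compares "now" against a per-CPU touch timestamp.
 */
#include <stdio.h>

#define HZ 250
#define SOFTLOCKUP_THRESHOLD (10 * HZ)   /* illustrative ~10s threshold */

struct cpu_watchdog {
    unsigned long touch_timestamp;       /* last time the watchdog thread ran */
};

/* Called from the per-CPU timer tick. */
static void softlockup_tick_sketch(const struct cpu_watchdog *wd,
                                   unsigned long now, int cpu)
{
    if (now - wd->touch_timestamp > SOFTLOCKUP_THRESHOLD)
        printf("BUG: soft lockup detected on CPU#%d!\n", cpu);
}

int main(void)
{
    struct cpu_watchdog wd = { .touch_timestamp = 1000 };

    /* Guest paused for ~30 seconds with "xm pause"; on unpause the guest's
     * clock jumps forward while the watchdog thread never got to run. */
    unsigned long now_after_unpause = 1000 + 30 * HZ;

    softlockup_tick_sketch(&wd, now_after_unpause, 1);
    return 0;
}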

Comment 28 Chris Lalancette 2008-02-14 01:37:54 UTC
No, again, this is *only* for PV.  It will have no effect on FV guests.

Chris Lalancette

Comment 30 Chris Lalancette 2008-02-14 03:04:06 UTC
It's just the way that virtualization works.  In theory, it is a bug that can be
fixed by paravirtualized timers (as I described earlier).  In practice,
paravirtualized timers are not yet written, and even if they are written, they
probably can't be backported to RHEL-4 or RHEL-5.  RHEL-6 is the earliest time
frame you might expect to see them.  From that perspective, I would close it as
WONTFIX and release note it.

Chris Lalancette

Comment 35 Chris Lalancette 2008-02-27 02:18:50 UTC
Well, I don't know the exact circumstances of your customer or your exact bug.
You should open a bug describing the problem and including the logs and package
versions, and then we can look at what the problem is.

Chris Lalancette

Comment 36 Bill Burns 2008-02-29 12:53:21 UTC
*** Bug 433522 has been marked as a duplicate of this bug. ***

Comment 37 Marcel Nijenhof 2008-03-01 21:22:11 UTC
I have filed a service request as well.
The number of that request is 1797096.

Comment 50 errata-xmlrpc 2008-05-21 14:48:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


Comment 56 Robin Luo 2008-07-07 06:24:35 UTC
We also hit this failure under RHEL 5.2 on x86 and x86_64.

Comment 58 Simon Gao 2011-04-22 21:48:30 UTC
We experienced the same problem with RHEL 5.5 x86_64.

Apr 21 17:05:36 machine kernel: BUG: soft lockup - CPU#11 stuck for 10s! [perl:25938]
Apr 21 17:05:36 machine kernel: BUG: soft lockup - CPU#9 stuck for 10s! [irqbalance:3560]

