Bug 466630

Summary: processes in rawhide guest hang in schedule_timeout (problem with timer interrupts?)
Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: ---
Reporter: Jeff Layton <jlayton>
Assignee: Rik van Riel <riel>
QA Contact: Virtualization Bugs <virt-bugs>
CC: clalance, mathieu-acct, steved, xen-maint
Doc Type: Bug Fix
Last Closed: 2009-07-31 11:10:47 UTC

Description Jeff Layton 2008-10-11 20:48:36 UTC
Opening this as a xen bug, but it's possible that the problem is with the guest kernel.

This is a bit of a strange problem, but I've been able to reproduce it several times. I have a FV rawhide xen guest that I use for testing upstream work. The symptoms I see are that the guest ends up in a state where processes go to sleep in schedule_timeout(), but never wake up.

Recently, I've been working on a patchset to fix some mount/umount races in CIFS. I have a reproducer for that problem that has 2 scripts doing mounts and unmounts of the same samba share in a tight loop.

After several hours of running this on my rawhide xen guest with the patchset that I have, the guest will fall into this state. I can no longer log in (sshd hangs) and can't get in on the console. All of the processes are stuck in schedule_timeout(). Existing logins work fine however (as long as I avoid syscalls that would sleep like this).

The reproducer scripts hang calling umount.cifs -- cifs_umount() calls msleep(1) and never wakes up. Here's a stack trace from it:

umount.cifs   D 0000000000000000  4920  6437   6436
 ffff88000c1d3e38 0000000000000046 0000000000000000 ffff88001d410758
 ffffffff81ac0980 ffffffff81ac0980 ffff88001d410000 ffffffff815304e0
 ffff88001d4103b0 000000001d410000 ffffffff81067ab8 ffff88001d4103b0
Call Trace:
 [<ffffffff81067ab8>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81067a87>] ? trace_hardirqs_on_caller+0x10f/0x133
 [<ffffffff8138984e>] schedule_timeout+0x95/0xcb
 [<ffffffff8104f345>] ? process_timeout+0x0/0x10
 [<ffffffff8104fa6a>] ? __mod_timer+0x21/0xd5
 [<ffffffff813898a2>] schedule_timeout_uninterruptible+0x1e/0x20
 [<ffffffff8104fb37>] msleep+0x19/0x23
 [<ffffffffa02a04bb>] cifs_umount_begin+0xbb/0xe4 [cifs]
 [<ffffffff810e9ed1>] sys_umount+0x129/0x30c
 [<ffffffff8138b07e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff81067a87>] ? trace_hardirqs_on_caller+0x10f/0x133
 [<ffffffff8101139a>] system_call_fastpath+0x16/0x1b

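For reference, here's a simplified sketch of the msleep()/schedule_timeout() path that shows up in the trace above. It's paraphrased from 2.6.27-era kernel sources rather than quoted verbatim, but it shows why a lost timer interrupt leaves the task asleep: schedule_timeout() arms a one-shot kernel timer, and that timer's callback (process_timeout) is the only thing that will wake the sleeper.

/*
 * Simplified paraphrase of the msleep()/schedule_timeout() path from the
 * trace above (kernel/timer.c in 2.6.27-era kernels); not verbatim source.
 */
signed long schedule_timeout(signed long timeout)
{
        struct timer_list timer;
        unsigned long expire = timeout + jiffies;

        /* process_timeout() just calls wake_up_process() on the sleeper */
        setup_timer(&timer, process_timeout, (unsigned long)current);
        __mod_timer(&timer, expire);            /* arm a one-shot timer      */

        schedule();                             /* sleep until woken...      */

        del_singleshot_timer_sync(&timer);      /* ...normally by that timer */
        timeout = expire - jiffies;
        return timeout < 0 ? 0 : timeout;
}

void msleep(unsigned int msecs)
{
        unsigned long timeout = msecs_to_jiffies(msecs) + 1;

        while (timeout)
                timeout = schedule_timeout_uninterruptible(timeout);
}

If the timer interrupt that should expire that one-shot timer never arrives, process_timeout() never runs, wake_up_process() is never called, and the task sits in schedule() indefinitely -- which is exactly what these hung processes look like.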

...here's an sshd from an attempted login when the guest got into this state:

sshd          D 0000000000000000  5000  6460   2234
 ffff88001d963d58 0000000000000046 0000000000000000 ffffffff810674e2
 ffffffff81ac0980 ffffffff81ac0980 ffff88000c03a2d0 ffffffff815304e0
 ffff88000c03a680 0000000000000050 ffff88001d963d38 ffff88000c03a680
Call Trace:
 [<ffffffff810674e2>] ? mark_lock+0x22/0x3a2
 [<ffffffff813897e0>] schedule_timeout+0x27/0xcb
 [<ffffffff81067ab8>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81067a87>] ? trace_hardirqs_on_caller+0x10f/0x133
 [<ffffffff81067ab8>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8138b4c6>] ? _spin_unlock_irq+0x30/0x3c
 [<ffffffff813896a3>] wait_for_common+0xfa/0x152
 [<ffffffff8103dbe9>] ? default_wake_function+0x0/0xf
 [<ffffffff81067a87>] ? trace_hardirqs_on_caller+0x10f/0x133
 [<ffffffff81389785>] wait_for_completion+0x18/0x1a
 [<ffffffff81057fd8>] synchronize_rcu+0x3a/0x41
 [<ffffffff81058061>] ? wakeme_after_rcu+0x0/0x17
 [<ffffffff8114ffc4>] install_session_keyring+0xe9/0xfe
 [<ffffffff8138b07e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff81150006>] join_session_keyring+0x2d/0x110
 [<ffffffff8114f7c7>] keyctl_join_session_keyring+0x2d/0x2f
 [<ffffffff8114f808>] sys_keyctl+0x3f/0xb3
 [<ffffffff8101139a>] system_call_fastpath+0x16/0x1b

...finally, here's an agetty (serial console):

agetty        S ffff88001cc69288  3744  2613      1
 ffff880013c75d48 0000000000000046 ffff880013d4c5a0 0000000000000050
 ffffffff81ac0980 ffffffff81ac0980 ffff880013d4c5a0 ffff880014eac5a0
 ffff880013d4c950 0000000100000001 ffff880013d4c5a0 ffff880013d4c950
Call Trace:
 [<ffffffff81066a75>] ? trace_hardirqs_off_caller+0x21/0xa4
 [<ffffffff810674e2>] ? mark_lock+0x22/0x3a2
 [<ffffffff810678c9>] ? mark_held_locks+0x67/0x82
 [<ffffffff813897e0>] schedule_timeout+0x27/0xcb
 [<ffffffff81067ab8>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8105a435>] ? add_wait_queue+0x41/0x4a
 [<ffffffff811fa04b>] read_chan+0x504/0x814
 [<ffffffff8103dbe9>] ? default_wake_function+0x0/0xf
 [<ffffffff811fc889>] ? tty_ldisc_ref_wait+0x18/0xa6
 [<ffffffff811f6a1f>] tty_read+0x8d/0xca
 [<ffffffff810d2d8a>] vfs_read+0xad/0x107
 [<ffffffff810d2eb2>] sys_read+0x4c/0x73
 [<ffffffff8101139a>] system_call_fastpath+0x16/0x1b

...I'd suspect that my patchset is bad, but I can't reproduce this on a bare-metal machine running the same kernels. I'm also not sure why a change specific to CIFS would cause unrelated processes to hang like this.

Memory corruption seems unlikely -- this is a debug kernel. It really seems like something has gone wrong with timers, which makes me suspect a problem with the APIC.

Here's my kernel command line:

ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 hpet=disable

...I've disabled the hpet since I get really strange clock behavior with it enabled.

Another, possibly related problem -- the clock in the guest runs oddly once this happens:

[root@dhcp231-229 ~]# date
Sat Oct 11 13:11:13 EDT 2008
[root@dhcp231-229 ~]# date
Sat Oct 11 13:11:11 EDT 2008
[root@dhcp231-229 ~]# date
Sat Oct 11 13:11:10 EDT 2008
[root@dhcp231-229 ~]# date
Sat Oct 11 13:11:11 EDT 2008
[root@dhcp231-229 ~]# date
Sat Oct 11 13:11:11 EDT 2008
[root@dhcp231-229 ~]# date
Sat Oct 11 13:11:12 EDT 2008

...it also seems to run generally too fast before the problem starts.

Comment 1 Jeff Layton 2008-10-11 20:50:33 UTC
BTW, I have a core dump from the guest in this state if it's helpful. I can also generally reproduce this within a day or so of running the reproducer for the CIFS problem.

Let me know if you need other info or access to the dom0 and domU in order to troubleshoot this.

Comment 2 Jeff Layton 2008-10-12 01:01:37 UTC
Collecting a bit more data while I have the guest in this state. Doing:

[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745073, 353702315}) = 0
Sat Oct 11 13:11:13 EDT 2008
[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745074, 225444917}) = 0
Sat Oct 11 13:11:14 EDT 2008
[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745070, 174429480}) = 0
Sat Oct 11 13:11:10 EDT 2008
[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745070, 706055296}) = 0
Sat Oct 11 13:11:10 EDT 2008
[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745071, 144712846}) = 0
Sat Oct 11 13:11:11 EDT 2008
[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745071, 506527926}) = 0
Sat Oct 11 13:11:11 EDT 2008
[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745071, 894165569}) = 0
Sat Oct 11 13:11:11 EDT 2008
[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745072, 218262996}) = 0
Sat Oct 11 13:11:12 EDT 2008
[root@dhcp231-229 ~]# strace -e clock_gettime date 
clock_gettime(CLOCK_REALTIME, {1223745072, 517668451}) = 0
Sat Oct 11 13:11:12 EDT 2008

...so the seconds field is jumping backward and forward all over the place. Unfortunately, I can't install debuginfo right now to look at xtime directly. I'll make a point of getting debuginfo installed before I reproduce this again so we can gather more info.
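
For next time, a small helper like this (hypothetical, not part of the reproducer) could poll CLOCK_REALTIME and log whenever it runs backwards, which is what the strace output above shows:

/* clockcheck.c - poll CLOCK_REALTIME and report backward jumps.
 * Build with: gcc -O2 -o clockcheck clockcheck.c -lrt
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct timespec prev, now;

        clock_gettime(CLOCK_REALTIME, &prev);
        for (;;) {
                clock_gettime(CLOCK_REALTIME, &now);
                if (now.tv_sec < prev.tv_sec ||
                    (now.tv_sec == prev.tv_sec && now.tv_nsec < prev.tv_nsec))
                        printf("clock went backwards: %ld.%09ld -> %ld.%09ld\n",
                               (long)prev.tv_sec, prev.tv_nsec,
                               (long)now.tv_sec, now.tv_nsec);
                prev = now;
                usleep(100000);         /* ~10 samples/sec */
        }
        return 0;
}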

Comment 3 Jeff Layton 2008-10-12 02:53:37 UTC
Booted without hpet=disable and the problem seems to occur much more quickly. I did:

# echo t > /proc/sysrq-trigger

multiple times and checked the per-CPU jiffies values each time. They never changed:

[root@dhcp231-229 ~]# echo t > /proc/sysrq-trigger 
[root@dhcp231-229 ~]# dmesg | grep jiffies
  .jiffies                       : 4300384685
  .jiffies                       : 4300384685
[root@dhcp231-229 ~]# echo t > /proc/sysrq-trigger 
[root@dhcp231-229 ~]# dmesg | grep jiffies
  .jiffies                       : 4300384685
  .jiffies                       : 4300384685
[root@dhcp231-229 ~]# echo t > /proc/sysrq-trigger 
[root@dhcp231-229 ~]# dmesg | grep jiffies
  .jiffies                       : 4300384685
  .jiffies                       : 4300384685

...though maybe this is expected with a tickless kernel. I'm going to try booting with nohz=0 and see whether it works around the problem at all...
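
Next time it hangs, something like this (a quick hypothetical helper, assuming the 2.6.27-era /proc/timer_list format with its "jiffies:" line) could confirm whether jiffies is advancing at all, without going through sysrq-t:

/* jiffcheck.c - read the "jiffies:" line from /proc/timer_list twice
 * and report whether it advanced.  Assumes the 2.6.27-era file format.
 */
#include <stdio.h>
#include <unistd.h>

static unsigned long long read_jiffies(void)
{
        char line[256];
        unsigned long long j = 0;
        FILE *f = fopen("/proc/timer_list", "r");

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, " jiffies: %llu", &j) == 1)
                        break;
        fclose(f);
        return j;
}

int main(void)
{
        unsigned long long before = read_jiffies();
        sleep(2);
        unsigned long long after = read_jiffies();

        printf("jiffies: %llu -> %llu (%s)\n", before, after,
               after > before ? "advancing" : "NOT advancing");
        return 0;
}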

Comment 4 Jeff Layton 2008-10-12 10:56:50 UTC
Booting the guest with nohz=0 allowed the reproducer to run overnight without a hang. So whatever this problem is, it does seem to be related to using a tickless guest kernel. That said, the guest seems to have lost ~13 minutes of wallclock time in the ~7.5 hours since I booted it:

dom0$ date
Sun Oct 12 06:51:09 EDT 2008

domU# date
Sun Oct 12 06:38:05 EDT 2008

I should also mention that the dom0 is running 2.6.18-117.el5.jtltest.51xen. This is basically a -117.el5 kernel with a few extra patches (mostly CIFS and NFS stuff).

Hand-wavy hypothesis:

The guest is losing timer interrupts occasionally. On a "normal" (periodic-tick) guest, this still mostly works but shows up as clock drift vs. the dom0. On a tickless guest, we rely on a particular timer interrupt both to wake up sleeping processes and to arm the timer for the next interrupt. Losing that interrupt is much more serious: processes end up stuck because their timers never fire.
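
To illustrate the hypothesis, here's a toy model (not kernel code; the names and logic are purely illustrative) of one-shot tick handling, where each tick handler is responsible for arming the next tick:

/* Toy model of the dependency described above.  These names are
 * illustrative only, not kernel APIs. */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        bool next_tick_armed = true;

        for (int tick = 1; tick <= 6; tick++) {
                if (!next_tick_armed) {
                        printf("tick %d: timer never armed, sleepers stay stuck\n", tick);
                        continue;
                }
                if (tick == 3) {                        /* simulate one lost interrupt */
                        printf("tick %d: interrupt lost\n", tick);
                        next_tick_armed = false;        /* nobody arms the next tick */
                        continue;
                }
                printf("tick %d: expire timers, wake sleepers, arm next tick\n", tick);
        }
        return 0;
}

With a periodic tick (nohz=0), the next interrupt shows up regardless of whether the previous one was handled, so a lost tick only skews the clock instead of wedging sleepers -- which would explain why nohz=0 works around the hang.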

Comment 6 Rik van Riel 2008-10-13 13:52:56 UTC
Sounds like maybe part of the hardware emulation needed for a tickless guest to work correctly isn't doing the right thing.

Jeff, does "xm dmesg" in dom0 show any error messages around the time your tickless FV guest gets stuck?

Comment 7 Jeff Layton 2008-10-13 14:14:47 UTC
I don't think so. Here's a snip of the tail of "xm dmesg". Note that I've reproduced this problem several times in the domU since rebooting the dom0:

(XEN) Scrubbing Free RAM: .done.
(XEN) Xen trace buffers: disabled
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen).
(XEN) Freed 100kB init memory.
(XEN) ioapic_guest_write: apic=0, pin=4, old_irq=4, new_irq=4
(XEN) ioapic_guest_write: old_entry=000009f1, new_entry=000109f1
(XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ!

Still, I've set up the reproducer again and will check whether any new messages pop up once the problem shows up.

Comment 8 Jeff Layton 2008-10-13 19:26:26 UTC
Confirmed. I was able to reproduce the lockup, and no messages popped up in "xm dmesg"...

The interesting bit is that the guest ran for a *long* time booted with no extra kernel params and didn't lock up. I then booted it with hpet=disable and it reproduced after just a little while. Yesterday, I was able to reproduce it faster when booted without that param. Very inconsistent, unfortunately :(

Any other info you'd like me to gather while the guest is in this state?

Comment 9 Jeff Layton 2008-10-14 14:39:21 UTC
The guest ran the test overnight without the hpet=disable parm on the kernel command line. I'm fairly sure that I did see it lock up before with that parameter set, so I'm not sure why it's not doing so now.

In any case, I'm rerunning the test now with hpet=disable set:

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
acpi_pm

I've had more consistent results in reproducing it this way. Don mentioned booting the guest with "divider=10". Once I reconfirm to myself that I can reproduce the problem, I'll retry it with that.

Comment 10 Jeff Layton 2008-10-15 10:54:44 UTC
Was able to reproduce the lockup yesterday after about 4 hours of the reproducer running. I just booted the guest with divider=10, but I don't see any indication that the parameter was recognized. Still, rerunning the test now...

Comment 11 Jeff Layton 2008-10-16 00:06:52 UTC
I was able to also reproduce the hang today with divider=10. Complete kernel command line is:

ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 hpet=disable divider=10

Guest kernel is:

2.6.27-3.fc10.x86_64.debug

Based on a cursory look at the code, I sort of suspect that the tick divider isn't in rawhide kernels, but that's not really an area I'm that familiar with. I've still not been able to reproduce this again without the hpet disabled. I'm going to rerun the reproducer tonight and try that again.

Comment 12 Jeff Layton 2008-10-16 12:09:28 UTC
The guest ran the reproducer overnight using the hpet timer:

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
hpet

...so either I was mistaken that I had booted the guest with the hpet timer before, or it's just harder to reproduce it there and I just got lucky. Either way, it's pretty consistently reproducible within 8 hours or so when the guest uses the acpi_pm timer.

Comment 13 Jeff Layton 2008-10-17 14:25:35 UTC
Ran the reproducer overnight last night and the box was hung this morning. This time I was using hpet timer. I have a core if it'll be helpful:

crash> p saved_command_line
saved_command_line = $10 = 0xffff8800015f3580 "ro root=/dev/VolGroup00/LogVol00 
console=ttyS0,115200"

crash> bt
PID: 22188  TASK: ffff880018cc8000  CPU: 1   COMMAND: "umount.cifs"
 #0 [ffff880003681d88] trace_hardirqs_on at ffffffff81067af0
 #1 [ffff880003681da0] trace_hardirqs_on_caller at ffffffff81067abf
 #2 [ffff880003681dc0] trace_hardirqs_on at ffffffff81067af0
 #3 [ffff880003681df0] __mod_timer at ffffffff8104fb44
 #4 [ffff880003681e40] schedule_timeout at ffffffff8138a01e
 #5 [ffff880003681eb0] schedule_timeout_uninterruptible at ffffffff8138a072
 #6 [ffff880003681ec0] msleep at ffffffff8104fb6f
 #7 [ffff880003681ed0] cifs_umount_begin at ffffffffa01d44bb
 #8 [ffff880003681ef0] sys_umount at ffffffff810ea2d9
 #9 [ffff880003681f80] system_call_fastpath at ffffffff8101139a
    RIP: 000000300fae9ab7  RSP: 00007fff685e4b78  RFLAGS: 00010206
    RAX: 00000000000000a6  RBX: ffffffff8101139a  RCX: 4f4800662d003173
    RDX: 00000000ff534d42  RSI: 0000000000000001  RDI: 00007fff685e6968
    RBP: 00007fff685e4c50   R8: fefefefefefefeff   R9: 0000000000000001
    R10: 00007fff685e4900  R11: 0000000000000206  R12: 0000000000000000
    R13: 0000000000000000  R14: 00007fff685e4d20  R15: 0000000000400ac0
    ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b

...a lot of other processes are stuck in schedule_timeout with similar stack traces.

The guest kernel is:

2.6.27-13.fc10.x86_64.debug

Comment 15 Rik van Riel 2009-04-13 16:29:39 UTC
This is a bit of a long shot, but it wouldn't hurt trying out my test RPMs to
see whether this is a duplicate of bugs 480689 and 449346.

This bug could be caused by a combination of two main factors:
- while doing disk IO, one VCPU of an HVM guest can miss timer ticks
- Xen did not re-deliver those missed timer ticks later on, causing clock skew
between VCPUs inside an HVM guest

Both of these issues should be resolved with the backport of the AIO disk
handling code and upstream Xen 'no missed-tick accounting' timer code. Please
test the test RPMs from http://people.redhat.com/riel/.xenaiotime/ and let us
know if those (experimental!) test packages resolve the issue.

Comment 16 Jeff Layton 2009-07-29 13:39:23 UTC
Oh goodness...I had forgotten about this bug...

My rawhide guests are currently pretty stable with the hpet timer, but I'm booting them with nohz=off. I'll plan to do some testing today with a tickless guest and see how it behaves.

Comment 17 Chris Lalancette 2009-07-29 14:08:58 UTC
(In reply to comment #16)
> Oh goodness...I had forgotten about this bug...
> 
> My rawhide guests are currently pretty stable with the hpet timer, but I'm
> booting them with nohz=off. I'll plan to do some testing today with a tickless
> guest and see how it behaves.  

Heh, no problem, I was just going through the list and looking at what we need to do for 5.5, and I came across it.  Thanks for doing the additional testing.

Chris Lalancette

Comment 18 Jeff Layton 2009-07-31 11:10:47 UTC
I stress tested my rawhide guest the other day for several hours and never saw a hang. The clock seems to drift a lot more without the nohz=off option, but it didn't hang on me.

I suggest we go ahead and close this as being fixed, and I'll plan to reopen it if the problem resurfaces.

Comment 19 Chris Lalancette 2009-07-31 11:31:11 UTC
(In reply to comment #18)
> I stress tested my rawhide guest the other day for several hours and never saw
> a hang. The clock seems to drift a lot more without the nohz=off option, but it
> didn't hang on me.
> 
> I suggest we go ahead and close this as being fixed, and I'll plan to reopen it
> if the problem resurfaces.  

OK, great, thanks a lot Jeff.  If you want to open a bug about the nohz time drift problem, feel free to do so.

Chris Lalancette

Comment 20 Paolo Bonzini 2010-04-08 15:50:32 UTC
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).