Bug 733672 - xen PV guest kernel 2.6.32 processes lock up in D state
Summary: xen PV guest kernel 2.6.32 processes lock up in D state
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Andrew Jones
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 653816 743047
 
Reported: 2011-08-26 13:12 UTC by Paolo Bonzini
Modified: 2011-12-06 14:06 UTC (History)
7 users (show)

Clone Of: 550724
Last Closed: 2011-12-06 14:06:45 UTC


Attachments (Terms of Use)
backport/revert patch (2.94 KB, patch)
2011-08-31 12:13 UTC, Paolo Bonzini
no flags


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1530 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux 6 kernel security, bug fix and enhancement update 2011-12-06 01:45:35 UTC

Description Paolo Bonzini 2011-08-26 13:12:25 UTC
Daniel Stodden reported that the issue still exists in RHEL6.1.  The problem was that in 6.1, after the fix, we also backported upstream commit 3588fe2 (xen/events: change to using fasteoi, 2010-08-27), which reintroduces the bug, though with even lower frequency.

Upstream later reverted to edge-triggering with 7e186bd (xen: do not clear and mask evtchns in __xen_evtchn_do_upcall, 2011-05-06), which is huge but---stripped of stuff we do not have in RHEL6---is almost exactly a revert of 3588fe2.

Email thread follows.  After the thread is a summary of the comment trail from cloned bug 550724.

The one I'm after revolves around v2.6.32.21-2-gc578392 (on
linux-2.6.32.y/master): "xen: handle events as edge-triggered".

---------------------------------------------------------

Daniel Stodden 06:53 PM 08/25/2011:

Symptom is the occasional I/O deadlock when irqbalanced is running, due
to event migration under I/O load, which can drop notifications.

What happens is that blockfront misses a backend kick, because it's got this
supersized hardirq context, which makes it more likely to be affected than
other places.

The symptom would be the disk wedging.

---------------------------------------------------------

Paolo Bonzini 10:34 AM 08/26/2011:

Hi,

was this RHEL6.0 or RHEL6.1?  For "regular" events, after changing them 
to edge-triggered, upstream switched to fasteoi with commit 3588fe2 and 
then back to edge-triggered with 7e186bd.

We have (buggy) level-triggered in RHEL6.0.  In RHEL6.1 we changed it to 
fasteoi as part of resyncing events.c with upstream; we do not have 
7e186bd in either RHEL6.1 or RHEL6.2, and likely will not backport it 
since we do not support pcifront.

Shall we change it back to edge-triggered, i.e. revert 3588fe2?

---------------------------------------------------------

Stefano Stabellini 02:03 PM 08/26/2011:

As the author of 7e186bd, I think the best thing you could do is to
backport 7e186bd plus all the other related upstream fixes to events.c
(see recent email by Ian Campbell related to Debian
http://marc.info/?l=xen-devel&m=131422114313553).
Reverting 3588fe2 would give you an events.c very different from both
upstream Linux and Xenolinux kernels.

---------------------------------------------------------

Paolo Bonzini 03:03 PM 08/26/2011


Ok, I see.  I'm quite sure the full backport is not an option for 6.2.  The "new" IRQ infrastructure in upstream also makes the backport pretty hard.

That's why I was hoping to get by with a revert.  But actually, once I strip 7e186bd of all the pirq stuff, I get exactly a revert of 3588fe2 except for the addition of mask_ack_dynirq.  I attach the 2.6.32 backport.
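
For reference, mask_ack_dynirq is only a tiny helper; roughly as it appears upstream after 7e186bd (a sketch from memory, and the chip-callback signature in the 2.6.32 backport may differ):

static void mask_ack_dynirq(struct irq_data *data)
{
        /* mask the event channel, then clear its pending bit */
        disable_dynirq(data);
        ack_dynirq(data);
}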

+++ This bug was initially created as a clone of Bug #550724 +++

Created attachment 380486 [details]
A terminal window with a top running shows the progressively degrading state.

Description of problem:

Sometimes under a small load, processes start locking up in D state. Load climbs steadily. A reboot is required.

Version-Release number of selected component (if applicable):

Happens with both:
kernel-2.6.31.6-166.fc12.x86_64
kernel-2.6.31.9-174.fc12.x86_64


How reproducible:

Frequency is roughly once every 24 hours. Happens when several LTSP sessions are started, sometimes during logwatch processing, or during rsync backups.


Additional info:

System is a large Xen DomU. Dom0 is able to destroy and re-create the system.


--- Additional comment from norm@turing.une.edu.au on 2010-02-19 00:31:23 EST ---

This kernel thread seems to mention something like this problem:

http://www.mail-archive.com/kvm@vger.kernel.org/msg23039.html

There didn't seem to be any resolution to the problem in that thread.

--- Additional comment from norm@turing.une.edu.au on 2010-02-22 15:25:51 EST ---

I've posted this bug in the upstream kernel bugzilla

http://bugzilla.kernel.org/show_bug.cgi?id=15370

"task jbd2/dm-1-8:973 blocked for more than 120 seconds."

--- Additional comment from norm@turing.une.edu.au on 2010-02-24 21:26:16 EST ---

Updated to kernel-2.6.32.9-64.fc12.x86_64 and it crashed after about 19 hours.

This problem is quite frustrating. I have not been able to reproduce the problem in a different VM on either the same hardware or different hardware. Other VMs seem to withstand heavy disk tests with no problem.

It's something about the LTSP workload. The network I'm sure is in this mix somewhere. In the not very conclusive thread mentioned in comment #26, it is said: "I kept hitting this issue for 2.6.31 guest kernel even with a simple network test."


The system can hum along with labs full of students and work fine for several hours, then suddenly one of the jbd2/dm- tasks won't come out of D state and the rest of the system follows on as more and more processes catch the D state. Other jbd2/dm- tasks catch the D as well after the first one goes.

Depending on which jbd2/dm- goes D first, the growth of the D's can be very fast or start slow and then spread at an increasing pace.

The problem is load related, but it is not a big load. It's some combination of network and disk, I think.

I'm at a loss as to what more information I can provide.

--- Additional comment from norm@turing.une.edu.au on 2010-03-01 02:48:33 EST ---

Created attachment 396977 [details]
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64  just before panic

I also ran xenctx on each processor just before it panicked, as the processes started to hang.

Unfortunately I didn't get the console output of the panic other than the line:

Kernel panic - not syncing: hung_task: blocked tasks

There was no output in xm dmesg other than the messages mentioned in comment #9

--- Additional comment from norm@turing.une.edu.au on 2010-03-08 22:59:08 EST ---

Created attachment 398674 [details]
A different domU VM hung

I guess the xenctx dumps were also inconclusive. I have some slightly new information.

I have another domU that has hung. This was a test system that I was about to use to get 2.6.33 running, and it hung in an rpm -i kernel... command. The hang seems to be on the /boot filesystem, so I can still run commands and get logs if I don't access /boot or run commands like ps.

Same dom0 hardware, so that is still a possible problem, although I've not seen any signs of trouble in dom0. This dom0 only runs two domUs: this test system and the main LTSP system from which I have logged all of the above. Previous disk tests have not been able to make the test system fail.

I guess this tells us something. It's probably not an ext4 problem as /boot is ext3. It's probably nothing to do with the network. The only network activity is my ssh logins. It's not quotas as they are not turned on. There are not even any users on this system, it's just a small test system.

This one is running 2.6.32.3-10.fc12.x86_64 and it seems to have first hung in:

INFO: task kjournald:580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D ffff88012f57a300     0   580      2 0x00000080
 ffff88012fc2fc00 0000000000000246 0000000000000000 ffffffff817f3a80
 0000000001865b1e ffffffffb47a46fe ffff88012fc2ffd8 ffff88012fc2ffd8
 ffff88012f7ae158 000000000000f980 0000000000015740 ffff88012f7ae158
Call Trace:
 [<ffffffff81140972>] ? sync_buffer+0x0/0x44
 [<ffffffff81140972>] ? sync_buffer+0x0/0x44
 [<ffffffff8145337d>] io_schedule+0x43/0x5d
 [<ffffffff811409b2>] sync_buffer+0x40/0x44
 [<ffffffff814538d0>] __wait_on_bit+0x48/0x7b
 [<ffffffff81012e5b>] ? xen_hypervisor_callback+0x1b/0x20
 [<ffffffff81453971>] out_of_line_wait_on_bit+0x6e/0x79
 [<ffffffff81140972>] ? sync_buffer+0x0/0x44
 [<ffffffff81074a64>] ? wake_bit_function+0x0/0x33
 [<ffffffff8113f765>] ? submit_bh+0x109/0x117
 [<ffffffff811408d5>] __wait_on_buffer+0x24/0x26
 [<ffffffff811bfc12>] wait_on_buffer+0x3d/0x41
 [<ffffffff811c019a>] journal_commit_transaction+0x4cf/0xefd
 [<ffffffff8104eb5b>] ? finish_task_switch+0x8a/0xac
 [<ffffffff81074a2b>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff8100e87f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81454b03>] ? _spin_unlock_irqrestore+0x19/0x1b
 [<ffffffff81065479>] ? try_to_del_timer_sync+0x73/0x81
 [<ffffffff8100e87f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff811c384c>] kjournald+0xe8/0x225
 [<ffffffff81074a2b>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811c3764>] ? kjournald+0x0/0x225
 [<ffffffff8107473e>] kthread+0x7f/0x87
 [<ffffffff81012d2a>] child_rip+0xa/0x20
 [<ffffffff81011f11>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101269d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012d20>] ? child_rip+0x0/0x20



It's still possible to log into this system and do a restricted set of things. Is there anything other information I can collect while it's in this state?

For example cat /proc/*/stat can show me the blocked processes. Anything else? The system is installed on one LV with that divided into a number of partitions:

/dev/mapper/SYSTEM-root /                       ext4    defaults        1 1
UUID=3118ec38-93e1-456c-995f-c67217901296 /boot                   ext3    defaults        1 2
/dev/mapper/SYSTEM-tmp  /tmp                    ext4    defaults        1 2
/dev/mapper/SYSTEM-var  /var                    ext4    defaults        1 2
/dev/mapper/SYSTEM-mail /var/spool/mail         ext4    defaults        1 2
/dev/mapper/SYSTEM-opt  /opt                    ext4    defaults        1 2
/dev/mapper/SYSTEM-swap swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0

--- Additional comment from drjones@redhat.com on 2010-03-09 04:35:07 EST ---

Sorry, I haven't had any time to look at this. The backtraces you've captured are useful though. The one where we panic'ed immediately after detecting D-state is this

Call Trace:
  [<ffffffff8122e328>] delay_tsc+0x48 <--
  [<ffffffff8122e26f>] __delay+0xf
  [<ffffffff8122e2b3>] __const_udelay+0x42
  [<ffffffff81453982>] panic+0x131
  [<ffffffff810acbfd>] watchdog+0x18e
  [<ffffffff8100e87f>] xen_restore_fl_direct_reloc+0x4
  [<ffffffff810aca6f>] hung_task_panic+0x17
  [<ffffffff810745b6>] kthread+0x7f
  [<ffffffff81012d6a>] child_rip+0xa
  [<ffffffff81011f51>] int_ret_from_sys_call+0x7
  [<ffffffff810126dd>] retint_restore_args+0x5
  [<ffffffff81012d60>] kernel_thread+0xe0

It shows that the watchdog tripped after we called xen_restore_fl_direct(). xen_restore_fl_direct() just checks whether there are pending events and, if so, makes a simple xen-version hypercall in order to trap to Xen, which will force event channel processing.

All the other stacks you've captured also contain xen_restore_fl_direct(). This indicates to me that Xen is getting messed up at some point, and then each process eventually gets hung up after trapping down to it. It further looks like the bug is in Xen, and not the guest, since the D-state problem transferred to the other domain on your system as well, once again with xen_restore_fl_direct() on the stack.
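
For reference, the C version of this path (xen_restore_fl() in arch/x86/xen/irq.c; the *_direct variant is hand-written assembly with the same behaviour, so take this as an approximate sketch rather than the exact code):

static void xen_restore_fl(unsigned long flags)
{
        struct vcpu_info *vcpu;

        /* convert from the IF flag to the event upcall mask */
        flags = !(flags & X86_EFLAGS_IF);

        preempt_disable();
        vcpu = percpu_read(xen_vcpu);
        vcpu->evtchn_upcall_mask = flags;
        preempt_enable_no_resched();

        if (flags == 0) {
                barrier(); /* unmask first, then check for pending events */
                if (unlikely(vcpu->evtchn_upcall_pending))
                        xen_force_evtchn_callback(); /* HYPERVISOR_xen_version(0, NULL) */
        }
}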

All that said, I don't know how exactly, or what, is getting Xen messed up. It does seem likely to be related to your LTSP workload though, since you're currently the only reporter I'm aware of.

Maybe we can get 'xm dmesg' output before you start the LTSP workload, then at some regular interval after you start it, and then of course right after all the processes lock up in D state. Please also make sure all debug messages are turned on by adding the following to your xen.gz line in grub:

loglvl=all guest_loglvl=all debug_stack_lines=80
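
In grub.conf that means appending the options to the kernel line that loads the hypervisor, for example (the version and existing options below are placeholders, adjust to your setup):

kernel /xen.gz-<version> <existing options> loglvl=all guest_loglvl=all debug_stack_lines=80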

--- Additional comment from vevroux@msn.com on 2010-04-30 07:27:02 EDT ---

Hi, I have the exact same issue with xen 3.1.4 running on FC8 DOM0 2.6.21.7-5.fc8xen.

INFO: task kjournald:851 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D ffff88001ce9ca80     0   851      2 0x00000000
 ffff880002d1daf0 0000000000000246 ffff88001ce1d908 ffff88001e0cb008
 000000000073335c 0000000000000001 ffff880002d1db58 ffff88001d5bdb00
 ffff88001cde1b48 000000000000f8e0 ffff88001cde1b48 0000000000015600
Call Trace:
 [<ffffffff8106e897>] ? clocksource_read+0xf/0x11
 [<ffffffff8106aca7>] ? ktime_get_ts+0x4e/0x53
 [<ffffffff81419393>] io_schedule+0x31/0x42
 [<ffffffff811e9503>] get_request_wait+0xeb/0x179
 [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811e982c>] __make_request+0x29b/0x3c5
 [<ffffffff811e71cb>] generic_make_request+0x24e/0x2ac
 [<ffffffff8100e7f2>] ? check_events+0x12/0x20
 [<ffffffff811e730c>] submit_bio+0xe3/0x100
 [<ffffffff8100e7f2>] ? check_events+0x12/0x20
 [<ffffffff8111b2f3>] submit_bh+0xf4/0x117
 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81196b39>] journal_do_submit_data+0x3c/0x4a
 [<ffffffff81197055>] journal_commit_transaction+0x3be/0xe55
 [<ffffffff81048fcb>] ? finish_task_switch+0x6d/0xe6
 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b
 [<ffffffff8105c11f>] ? try_to_del_timer_sync+0x5a/0x66
 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8119a78b>] kjournald+0xeb/0x230
 [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b
 [<ffffffff8119a6a0>] ? kjournald+0x0/0x230
 [<ffffffff81067765>] kthread+0x91/0x99
 [<ffffffff81012daa>] child_rip+0xa/0x20
 [<ffffffff81011f97>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101271d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012da0>] ? child_rip+0x0/0x20


vif3.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1048 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2082 errors:0 dropped:32 overruns:0 carrier:0
          collisions:0 txqueuelen:32
          RX bytes:114497 (111.8 KiB)  TX bytes:248113 (242.2 KiB)

vif4.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:363 errors:0 dropped:0 overruns:0 frame:0
          TX packets:414 errors:0 dropped:35 overruns:0 carrier:0
          collisions:0 txqueuelen:32
          RX bytes:43364 (42.3 KiB)  TX bytes:55012 (53.7 KiB)

vif5.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3127 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2850 errors:0 dropped:16 overruns:0 carrier:0
          collisions:0 txqueuelen:32
          RX bytes:443692 (433.2 KiB)  TX bytes:404390 (394.9 KiB)

vif6.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:436 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1087 errors:0 dropped:226 overruns:0 carrier:0
          collisions:0 txqueuelen:32
          RX bytes:63671 (62.1 KiB)  TX bytes:127551 (124.5 KiB)

--- Additional comment from drjones@redhat.com on 2010-05-05 03:23:17 EDT ---

Is this really linked to network use? Have you tried running the guests with the network down?

--- Additional comment from norm@turing.une.edu.au on 2010-05-07 03:40:16 EDT ---

Andrew, I get the impression you would be surprised if this problem were related to network use. Is there a reason for this?

--- Additional comment from norm@turing.une.edu.au on 2010-05-14 12:57:50 EDT ---

Created attachment 414109 [details]
guest panic on disk test

Well I finally got brave and ran some disk tests in production. The result was a panic in the guest within seconds. Attached is the stack trace. syslogd also wall'ed this:

Message from syslogd@turing at May 15 02:24:59 ...
 kernel:------------[ cut here ]------------

Message from syslogd@turing at May 15 02:24:59 ...
 kernel:invalid opcode: 0000 [#1] SMP 

Message from syslogd@turing at May 15 02:24:59 ...
 kernel:last sysfs file: /sys/devices/vbd-51712/block/xvda/stat

Message from syslogd@turing at May 15 02:24:59 ...
 kernel:Stack:

Message from syslogd@turing at May 15 02:24:59 ...
 kernel:Call Trace:

Message from syslogd@turing at May 15 02:24:59 ...
 kernel:Code: 8b 38 48 8b 70 08 48 89 77 08 48 89 3e 48 89 00 48 89 40 08 66 ff 03 ff 14 25 c8 a7 6b 81 48 8b 42 f8 48 83 e0 fc 48 39 c3 74 04 <0f> 0b eb fe f0 80 62 f8 fe 48 89 cf ff 55 98 48 8b 45 80 8b 90 
Timeout, server not responding.


The disk test was an fio test, run with "sudo fio four-threads-randio.fio" where four-threads-randio.fio contains:
; Four threads, two query, two writers.

[global]
rw=randread
size=256m
directory=/tmp/fio-testing/data
ioengine=libaio
iodepth=4
invalidate=1
direct=1

[bgwriter]
rw=randwrite
iodepth=32

[queryA]
iodepth=1
ioengine=mmap
direct=0
thinktime=3

[queryB]
iodepth=1
ioengine=mmap
direct=0
thinktime=5

[bgupdater]
rw=randrw
iodepth=16
thinktime=40
size=32m


Note that Dom0 logged no disk errors, or any error, in dmesg, syslog or "xm dmesg" on this crash.

--- Additional comment from norm@turing.une.edu.au on 2010-05-31 23:15:26 EDT ---

The most consistent way to trigger this bug (>50%) is to run, on a remote system:

ssh turing dump 3uf - /dev/mapper/SYSTEM-root | /usr/bin/gzip -9 > backup.root.gz

turing is the DomU system that crashes. The remote system is located on totally different hardware.

Seems to me that the workload here is a lot of disk reads and a lot of network writes.

--- Additional comment from norm@turing.une.edu.au on 2010-06-20 17:41:26 EDT ---

I have finally been able to reproduce this on completely different hardware while doing a command similar to the one in comment #58. No stack traces, as nothing was logged to syslog and I didn't turn on hung_task_panic=1 for this kernel.

This is different hardware to all the other reports in this bug. The host software is CentOS 5.5 running 2.6.18-194.3.1.el5xen.

I had a top running in a screen session on the single Fedora 12 domU:

top - 07:26:30 up 2 days, 21:34,  1 user,  load average: 24.99, 24.97, 24.53
Tasks: 149 total,   2 running, 146 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.1%us,  0.1%sy,  0.0%ni,  0.0%id, 99.8%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   4060828k total,  4040752k used,    20076k free,  3486120k buffers
Swap:  6291448k total,      320k used,  6291128k free,   134136k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  235 root      20   0     0    0    0 D  0.0  0.0   0:00.18 jbd2/dm-0-8
  887 root      20   0     0    0    0 D  0.0  0.0   0:00.24 flush-253:0
 1321 ntp       20   0 60716 2776 1904 D  0.0  0.1   0:00.27 ntpd
 1384 postgres  20   0  183m 4988 4192 D  0.0  0.1   0:07.96 postmaster
 1409 postgres  20   0  155m 1232  444 D  0.0  0.0   0:00.08 postmaster
 1411 root      20   0 62464 1436  844 D  0.0  0.0   0:00.32 amd
 1441 postgres  20   0  183m 1652  716 R  0.0  0.0   0:03.75 postmaster
 1442 postgres  20   0  155m 1392  516 D  0.0  0.0   0:04.80 postmaster
 1470 root      20   0  255m 5852 4212 D  0.0  0.1   0:00.37 abrtd
 9076 root      20   0 54076 2588 1856 R  0.0  0.1   8:09.33 top
22487 root      20   0 23256 1948  144 D  0.0  0.0   0:14.18 dump
22488 root      20   0 23256 1948  144 D  0.0  0.0   0:14.07 dump
22489 root      20   0 23256 1948  144 D  0.0  0.0   0:14.47 dump
22530 root      20   0  109m  788  236 D  0.0  0.0   0:00.00 crond
22607 root      20   0 71840 2016  420 D  0.0  0.0   0:00.00 sendmail
22608 smmsp     20   0 59156 1628  408 D  0.0  0.0   0:00.00 sendmail
22609 root      20   0  109m  788  236 D  0.0  0.0   0:00.00 crond
22610 root      20   0 71840 2016  420 D  0.0  0.0   0:00.00 sendmail
22611 smmsp     20   0 59156 1628  408 D  0.0  0.0   0:00.00 sendmail
22612 root      20   0  109m  788  236 D  0.0  0.0   0:00.00 crond
22613 root      20   0 71840 2016  420 D  0.0  0.0   0:00.00 sendmail
22614 smmsp     20   0 59156 1628  408 D  0.0  0.0   0:00.00 sendmail
22615 root      20   0  109m  788  236 D  0.0  0.0   0:00.00 crond
22616 root      20   0 71840 2016  420 D  0.0  0.0   0:00.00 sendmail
22617 smmsp     20   0 59156 1628  408 D  0.0  0.0   0:00.00 sendmail

--- Additional comment from norm@turing.une.edu.au on 2010-06-21 02:00:04 EDT ---

Created attachment 425531 [details]
A sysrq l stack of domU

Tried to do an "xm sysrq N w" to show stacks of blocked tasks on the hung domU. Nothing was displayed. An "xm sysrq N l" showed the attached stack. Looks a bit different from the other stacks, though I'm not sure if it's useful.

--- Additional comment from norm@turing.une.edu.au on 2010-07-01 23:40:32 EDT ---

Could this problem be related to timers or the clock source in any way? In a bug that I think is similar to this, #526627, the clock source was the suspected cause for a while. Also, comment #54 above reports a very similar problem with KVM in bug #563672.

My domU once logged:

Jun 11 01:12:51 turing kernel: hrtimer: interrupt took 1645273 ns

out of the blue. That's the only time I have ever seen that message.

DomU currently reports:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
xen tsc 
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
xen

--- Additional comment from norm@turing.une.edu.au on 2010-07-28 03:03:28 EDT ---

Just noticed that this bug is referenced here:

http://lists.xensource.com/archives/html/xen-devel/2010-06/msg00590.html

No solution there either. One comment from Rick Boone was:

> I booted into 2.6.32.12 yesterday, but I'm still getting a hung task crash. (Initially, I got a soft lockup crash on one of the cpu's, after switching to 2.6.32.12, but it seems I was able to resolve that by switching the kernel to tickless)

--- Additional comment from drjones@redhat.com on 2010-08-20 01:56:50 EDT ---

I think I've reproduced this bug on a very recent rhel6 kernel.  When I mount these file systems I get this log to the console 

JBD: barrier-based sync failed on xvdb - disabling barriers

and the first tasks to start hanging (with ext3) are

kjournald, flush-202 - not always that order (reproduced 3 times)

and now on the ext4 reproduced case it was

jbd2/xvda2-8

My reproducer is currently to simply build a kernel on these file systems and wait.

Andrew

--- Additional comment from norm@turing.une.edu.au on 2010-08-22 06:39:14 EDT ---

(In reply to comment #74)
> Ah, nevermind. Just got it to reproduce with ext4. Only notes I have right now
> is that when I mount these file systems I get this log to the console

All my crashes have been with ext4 (with or without quotas enabled) except for the crash mentioned in comment #34 which was ext3 on a /boot partition.

> JBD: barrier-based sync failed on xvdb - disabling barriers

Hmm, this message is familiar for some reason, but I can't find it in any of my current logs.

> and the first tasks to start hanging (with ext3) are
> 
> kjournald, flush-202 - not always that order (reproduced 3 times)
> 
> and now on the ext4 reproduced case it was
> 
> jbd2/xvda2-8

jbd2/dm* for my ext4 filesystems. Any one could go first but the most likely seems to be /var

All systems that have had this problem for me have been on one or more LVs of the host system. I have not tried image files.

> My reproducer is currently to simply build a kernel on these file systems and
> wait.

My feeling on the workload that reproduces this is that it has to be stop/start disk activity. Just straight heavy I/O rarely hits this problem in my testing. The network also figures prominently in the mix. After the jbd2 hang, the first processes to hang are usually things like httpd, nscd, smbd and imap. Also, backups with dump across the network seem to trigger the problem reasonably often.

I'll put my test system (which has seen the problem only twice) into a loop of kernel building and see what happens.

The production system that sees the problem often is an LTSP and nxclient server. So that's lots of desktop users using browsers, compilers, editors, etc. Lots of stop/start network and disk I/O.

--- Additional comment from norm@turing.une.edu.au on 2010-08-23 17:48:48 EDT ---

(In reply to comment #75)
> (In reply to comment #74)> 
> > JBD: barrier-based sync failed on xvdb - disabling barriers

> Hmm, this message is familiar for some reason, but I can't find it in any of my
> current logs.

I saw a discussion of this message on the ext4 mailing list. I don't get it in my logs as far as I can tell.

http://marc.info/?t=128017340300003&r=1&w=2


> > My reproducer is currently to simply build a kernel on these file systems and
> > wait.

> I'll put my test system (with has seen the problem only twice) into a loop of
> kernel building and see what happens.

My test system, building the kernel in a loop, went into this state after 2 days. However, it did so when doing a backup (dump) to a remote system again and/or sending an email.

top - 07:29:45 up 4 days, 14:24,  3 users,  load average: 25.99, 25.96, 25.35
Tasks: 153 total,   1 running, 152 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni,  0.0%id, 99.9%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4060836k total,  2080144k used,  1980692k free,   128332k buffers
Swap:  6291448k total,     5444k used,  6286004k free,   374432k cached
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20361 root      20   0 14908 1172  856 R  0.3  0.0   4:18.03 top
  237 root      20   0     0    0    0 D  0.0  0.0   6:29.73 jbd2/dm-0-8
  825 root      20   0     0    0    0 D  0.0  0.0   1:36.20 flush-253:0
  919 root      20   0  8456  908  580 D  0.0  0.0   0:00.10 patch
  920 root      20   0  170m 2644 1680 D  0.0  0.1   0:00.00 crond
  923 root      20   0  170m 2644 1680 D  0.0  0.1   0:00.00 crond
  926 root      20   0  111m  976  404 D  0.0  0.0   0:00.00 crond
  927 smmsp     20   0 59152 1656  420 D  0.0  0.0   0:00.00 sendmail
  928 root      20   0 71836 2040  432 D  0.0  0.1   0:00.00 sendmail
  929 root      20   0  115m 1112  524 D  0.0  0.0   0:00.00 crond
  930 smmsp     20   0 59152 1656  420 D  0.0  0.0   0:00.00 sendmail
  932 root      20   0 71836 2040  432 D  0.0  0.1   0:00.00 sendmail
  933 root      20   0  117m 1136  540 D  0.0  0.0   0:00.00 crond
  934 smmsp     20   0 59152 1656  420 D  0.0  0.0   0:00.00 sendmail
  935 root      20   0 71836 2040  432 D  0.0  0.1   0:00.00 sendmail
  936 root      20   0  128m 1244  612 D  0.0  0.0   0:00.00 crond
  937 smmsp     20   0 59152 1656  420 D  0.0  0.0   0:00.00 sendmail
  938 root      20   0 71836 2044  432 D  0.0  0.1   0:00.00 sendmail
 1218 ntp       20   0 60708 1736  868 D  0.0  0.0   0:00.33 ntpd
 1281 postgres  20   0  183m 3416 2776 D  0.0  0.1   0:01.25 postmaster
 1373 postgres  20   0  183m 1068  396 D  0.0  0.0   0:01.44 postmaster
 1374 postgres  20   0  155m  908  268 D  0.0  0.0   0:01.61 postmaster
32393 root      20   0 23256 1920  144 D  0.0  0.0   0:00.39 dump
32394 root      20   0 23256 1920  144 D  0.0  0.0   0:00.31 dump
32395 root      20   0 23256 1920  144 D  0.0  0.0   0:00.29 dump

--- Additional comment from norm@turing.une.edu.au on 2010-08-23 17:58:51 EDT ---

(In reply to comment #76)
> > > JBD: barrier-based sync failed on xvdb - disabling barriers

> I don't get it in my logs as far as I can tell.

Er, yes I do. It has appeared in some of the ancient logs of my test system.

--- Additional comment from norm@turing.une.edu.au on 2010-08-26 21:43:23 EDT ---

In 5 days of rebuilding the kernel, my test system crashed once, as described in comment #76. So it's still pretty tough to reproduce this in test. My production system crashed 13 times during the same period.

One observation: the crash in comment #34 was a pivotal moment. It was the one that got this bug moved from a Fedora kernel (guest) to the enterprise kernel (the host).

Do we really think this problem is in the host kernel or even Xen? KVM guests seem to also see something similar.

Comment #34 was also interesting in that there was little network activity on the guest that crashed. My main crashing guest was also running on the host at the same time but it didn't crash (then).

--- Additional comment from drjones@redhat.com on 2010-09-02 12:34:20 EDT ---

Right, good question (where does this bug belong). If the problem on KVM hosts really seems to be the same (I'll try to reproduce that soon), then we can focus more on the guest. At this time though, I still want to dig deeper on what I saw in comment 36. I'll try to get to this asap, but I still have some other issues to knock off the TODO list first, sorry.

Drew

--- Additional comment from eric@pkill.info on 2010-09-04 01:02:17 EDT ---

We have experienced the same problem: for high CPU tasks, the PV F12 DomU works well; for high I/O tasks, the F12 DomU crashes and the Dom0 also crashes.

One oops message:

Sep  2 21:27:33 node04 kernel: INFO: task ext4-dio-unwrit:990 blocked for more than 120 seconds.
Sep  2 21:27:33 node04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  2 21:27:33 node04 kernel: ext4-dio-unwr D ffff8800157890c0     0   990      2 0x00000000
Sep  2 21:27:33 node04 kernel: ffff880012e39d40 0000000000000246 0000000003f44d9c 0000000000006ad1
Sep  2 21:27:33 node04 kernel: ffff8800124e44d8 ffff8800124e44a0 ffff880012e39fd8 ffff880012e39fd8
Sep  2 21:27:33 node04 kernel: ffff8800124e4870 0000000000007d30 000000000000b0c0 ffff8800124e4870
Sep  2 21:27:33 node04 kernel: Call Trace:
Sep  2 21:27:33 node04 kernel: [<ffffffff8102d986>] ? set_next_entity+0x28/0x78
Sep  2 21:27:33 node04 kernel: [<ffffffff8140ae01>] __mutex_lock_common+0x11e/0x18d
Sep  2 21:27:33 node04 kernel: [<ffffffff8140ae89>] __mutex_lock_slowpath+0x19/0x1b
Sep  2 21:27:33 node04 kernel: [<ffffffff8140af9a>] mutex_lock+0x31/0x4b
Sep  2 21:27:33 node04 kernel: [<ffffffff81157b29>] ext4_end_aio_dio_work+0x2c/0x6d
Sep  2 21:27:33 node04 kernel: [<ffffffff81053705>] worker_thread+0x1c3/0x24f
Sep  2 21:27:33 node04 kernel: [<ffffffff81157afd>] ? ext4_end_aio_dio_work+0x0/0x6d
Sep  2 21:27:33 node04 kernel: [<ffffffff81057b54>] ? autoremove_wake_function+0x0/0x39
Sep  2 21:27:33 node04 kernel: [<ffffffff81053542>] ? worker_thread+0x0/0x24f
Sep  2 21:27:33 node04 kernel: [<ffffffff81057711>] kthread+0x7f/0x87
Sep  2 21:27:33 node04 kernel: [<ffffffff8100ab2a>] child_rip+0xa/0x20
Sep  2 21:27:33 node04 kernel: [<ffffffff81057692>] ? kthread+0x0/0x87
Sep  2 21:27:33 node04 kernel: [<ffffffff8100ab20>] ? child_rip+0x0/0x20
Sep  2 21:27:33 node04 kernel: INFO: task tapdisk:18044 blocked for more than 120 seconds.
Sep  2 21:27:33 node04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  2 21:27:33 node04 kernel: tapdisk       D ffff8800157890c0     0 18044      1 0x00000080
Sep  2 21:27:33 node04 kernel: ffff88006cc29738 0000000000000286 ffff88006cc29708 ffff88007e3d4cb0
Sep  2 21:27:33 node04 kernel: ffff88007e3d4cb0 ffff880000000002 ffff88006cc29fd8 ffff88006cc29fd8
Sep  2 21:27:33 node04 kernel: ffff88004064c870 0000000000007d30 000000000000b0c0 ffff88004064c870
Sep  2 21:27:33 node04 kernel: Call Trace:
Sep  2 21:27:33 node04 kernel: [<ffffffff81189bab>] do_get_write_access+0x200/0x40e
Sep  2 21:27:33 node04 kernel: [<ffffffff81057b8d>] ? wake_bit_function+0x0/0x33
Sep  2 21:27:33 node04 kernel: [<ffffffff8118fc7e>] ? jbd2_journal_add_journal_head+0xbf/0x13d
Sep  2 21:27:33 node04 kernel: [<ffffffff81189eff>] jbd2_journal_get_write_access+0x2b/0x44
Sep  2 21:27:33 node04 kernel: [<ffffffff811538ab>] ? ext4_read_block_bitmap+0x54/0x29c
Sep  2 21:27:33 node04 kernel: [<ffffffff81176639>] __ext4_journal_get_write_access+0x56/0x60
Sep  2 21:27:33 node04 kernel: [<ffffffff811791e2>] ext4_mb_mark_diskspace_used+0x72/0x333
Sep  2 21:27:33 node04 kernel: [<ffffffff81178a48>] ? ext4_mb_initialize_context+0x7d/0x18d
Sep  2 21:27:33 node04 kernel: [<ffffffff8117bfc8>] ext4_mb_new_blocks+0x1e8/0x42c
Sep  2 21:27:33 node04 kernel: [<ffffffff811736c0>] ? ext4_ext_find_extent+0x130/0x27f
Sep  2 21:27:33 node04 kernel: [<ffffffff81175945>] ext4_ext_get_blocks+0x1289/0x14e0
Sep  2 21:27:33 node04 kernel: [<ffffffff81021dd6>] ? xen_multicall_flush+0x2c/0x2e
Sep  2 21:27:33 node04 kernel: [<ffffffff810be586>] ? unmap_vmas+0x663/0x7b8
Sep  2 21:27:33 node04 kernel: [<ffffffff81188f03>] ? start_this_handle+0x3dd/0x402
Sep  2 21:27:33 node04 kernel: [<ffffffff8115b011>] ext4_get_blocks+0x152/0x30f
Sep  2 21:27:33 node04 kernel: [<ffffffff8115b246>] ext4_get_block_dio_write+0x78/0xb6
Sep  2 21:27:33 node04 kernel: [<ffffffff8110bc19>] __blockdev_direct_IO+0x521/0xae0
Sep  2 21:27:33 node04 kernel: [<ffffffff81159f76>] ext4_direct_IO+0x1b6/0x247
Sep  2 21:27:33 node04 kernel: [<ffffffff8115b1ce>] ? ext4_get_block_dio_write+0x0/0xb6
Sep  2 21:27:33 node04 kernel: [<ffffffff81157b6a>] ? ext4_end_io_dio+0x0/0x85
Sep  2 21:27:33 node04 kernel: [<ffffffff810a7cf6>] generic_file_direct_write+0xed/0x16d
Sep  2 21:27:33 node04 kernel: [<ffffffff810a7f0c>] __generic_file_aio_write+0x196/0x286
Sep  2 21:27:33 node04 kernel: [<ffffffff810a805f>] generic_file_aio_write+0x63/0xad
Sep  2 21:27:33 node04 kernel: [<ffffffff811549af>] ? ext4_file_write+0x0/0x95
Sep  2 21:27:33 node04 kernel: [<ffffffff81154a3d>] ext4_file_write+0x8e/0x95
Sep  2 21:27:33 node04 kernel: [<ffffffff810d6d97>] ? kmem_cache_alloc+0x47/0x140
Sep  2 21:27:33 node04 kernel: [<ffffffff81114e52>] aio_rw_vect_retry+0x8e/0x19d
Sep  2 21:27:33 node04 kernel: [<ffffffff81114dc4>] ? aio_rw_vect_retry+0x0/0x19d
Sep  2 21:27:33 node04 kernel: [<ffffffff811168e7>] aio_run_iocb+0x79/0x11f
Sep  2 21:27:33 node04 kernel: [<ffffffff8111748d>] sys_io_submit+0x5db/0x6c6
Sep  2 21:27:33 node04 kernel: [<ffffffff8100a078>] system_call_fastpath+0x16/0x1b
Sep  2 21:27:33 node04 kernel: [<ffffffff8100a010>] ? system_call+0x0/0x52

Another one:

INFO: task tapdisk:12896 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tapdisk       D ffff8800157890c0     0 12896      1 0x00000080
 ffff88007ce871a8 0000000000000282 0000000000000000 ffff880001c62dc0
 0000001001080001 ffff8800586e27e0 ffff88007ce87fd8 ffff88007ce87fd8
 ffff880001c63190 0000000000007d30 000000000000b0c0 ffff880001c63190
Call Trace:
 [<ffffffff8140a7fd>] io_schedule+0x73/0xb5
 [<ffffffff811d624c>] get_request_wait+0xee/0x17c
 [<ffffffff81057b54>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811cddf4>] ? elv_merge+0x167/0x19f
 [<ffffffff811d65c2>] __make_request+0x2e8/0x3fb
 [<ffffffff811d4c1d>] generic_make_request+0x2c8/0x321
 [<ffffffff810a9192>] ? mempool_alloc+0x6c/0x11e
 [<ffffffff81026966>] ? __wake_up_common+0x4e/0x84
 [<ffffffff811d4d59>] submit_bio+0xe3/0x100
 [<ffffffff811048c3>] submit_bh+0xf4/0x117
 [<ffffffff811063b4>] __block_write_full_page+0x1d8/0x2b3
 [<ffffffff8115bf23>] ? noalloc_get_block_write+0x0/0x60
 [<ffffffff81105d70>] ? end_buffer_async_write+0x0/0x1ab
 [<ffffffff81105d70>] ? end_buffer_async_write+0x0/0x1ab
 [<ffffffff8115bf23>] ? noalloc_get_block_write+0x0/0x60
 [<ffffffff81106518>] block_write_full_page_endio+0x89/0x95
 [<ffffffff81106539>] block_write_full_page+0x15/0x17
 [<ffffffff81158866>] ext4_writepage+0x2ac/0x2f0
 [<ffffffff810b16f8>] pageout+0x137/0x1fe
 [<ffffffff810b1bba>] shrink_page_list+0x265/0x483
 [<ffffffff810dae9a>] ? mem_cgroup_del_lru+0x39/0x3b
 [<ffffffff810b0e8f>] ? isolate_pages_global+0x1ba/0x201
 [<ffffffff81008ad3>] ? __switch_to+0x452/0x56f
 [<ffffffff810b21a9>] shrink_inactive_list+0x3d1/0x6d2
 [<ffffffff81057b4f>] ? wake_up_bit+0x25/0x2a
 [<ffffffff81036c87>] ? try_to_wake_up+0x2fc/0x30e
 [<ffffffff810b2825>] shrink_zone+0x37b/0x44b
 [<ffffffff810b2b17>] do_try_to_free_pages+0x222/0x399
 [<ffffffff810b2d76>] try_to_free_pages+0x6e/0x70
 [<ffffffff810b0cd5>] ? isolate_pages_global+0x0/0x201
 [<ffffffff810ac912>] __alloc_pages_nodemask+0x3f3/0x63c
 [<ffffffff810d49dc>] alloc_slab_page+0x1e/0x20
 [<ffffffff810d526e>] __slab_alloc+0x16b/0x488
 [<ffffffff8110b812>] ? __blockdev_direct_IO+0x11a/0xae0
 [<ffffffff8110b812>] ? __blockdev_direct_IO+0x11a/0xae0
 [<ffffffff810d5b7f>] kmem_cache_alloc_notrace+0x7d/0xf2
 [<ffffffff8110b812>] __blockdev_direct_IO+0x11a/0xae0
 [<ffffffff810aeee4>] ? pagevec_lookup+0x22/0x2b
 [<ffffffff810afff5>] ? invalidate_inode_pages2_range+0x27e/0x2b7
 [<ffffffff81159f76>] ext4_direct_IO+0x1b6/0x247
 [<ffffffff8115b1ce>] ? ext4_get_block_dio_write+0x0/0xb6
 [<ffffffff81157b6a>] ? ext4_end_io_dio+0x0/0x85
 [<ffffffff810a7cf6>] generic_file_direct_write+0xed/0x16d
 [<ffffffff810a7f0c>] __generic_file_aio_write+0x196/0x286
 [<ffffffff810f0803>] ? set_fd_set+0x50/0x5b
 [<ffffffff810f1858>] ? core_sys_select+0x1d6/0x20a
 [<ffffffff810a805f>] generic_file_aio_write+0x63/0xad
 [<ffffffff811549af>] ? ext4_file_write+0x0/0x95
 [<ffffffff81154a3d>] ext4_file_write+0x8e/0x95
 [<ffffffff810d6d97>] ? kmem_cache_alloc+0x47/0x140
 [<ffffffff81114e52>] aio_rw_vect_retry+0x8e/0x19d
 [<ffffffff81114dc4>] ? aio_rw_vect_retry+0x0/0x19d
 [<ffffffff811168e7>] aio_run_iocb+0x79/0x11f
 [<ffffffff8111748d>] sys_io_submit+0x5db/0x6c6
 [<ffffffff8100a078>] system_call_fastpath+0x16/0x1b
 [<ffffffff8100a010>] ? system_call+0x0/0x52

A brief list of the hardware and configuration:

Dom0's hardware platform:

Motherboard: INTEL S5500BC S5500 Quad Core Xeon Server Board
CPU: 2 x Intel Quad Core Xeon E5520 2.26G (5.86GT/sec,8M,Socket 1366)
Memory: 8 x Kingston DDR-3 1333MHz 4GB ECC REG. CL9 DIMM w/Parity & Thermal Sensor
HD: 4 x WD WD10EARS 1 TB, SATA II 3Gb/s, 64 MB Cache

Dom0: F12, with kernel 2.6.32.13 + 2.6.32 Xen patches v2 from http://gentoo-xen-kernel.googlecode.com/files/xen-patches-2.6.32-2.tar.bz2

Dom0 configuration:
Processor type and features  --->
 [*] Symmetric multi-processing support
 [*] Support sparse irq numbering
 [*] Enable Xen compatible kernel
 Preemption Model (No Forced Preemption (Server))  --->

Device Drivers  --->
 XEN  --->
 [*] Privileged Guest (domain 0)
 <*> Backend driver support (NEW)
 <*>   Block-device backend driver (NEW)
 <*>   Block-device tap backend driver (NEW)
 < >   Block-device tap backend driver 2 (NEW)
 <*>   Network-device backend driver (NEW)
 (8)     Maximum simultaneous transmit requests (as a power of 2) (NEW)
 [ ]     Pipelined transmitter (DANGEROUS) (NEW)
 < >     Network-device loopback driver (NEW)
 < >   PCI-device backend driver (NEW)
 < >   TPM-device backend driver (NEW)
 <M>   SCSI backend driver (NEW)
 <M>   USB backend driver (NEW)
 <M> Block-device frontend driver
 <M> Network-device frontend driver
 <M>   Network-device frontend driver acceleration for Solarflare NICs (NEW)
 <M> SCSI frontend driver (NEW)
 <M> USB frontend driver (NEW)
 [*]   Taking the HCD statistics (for debug) (NEW)
 [ ]   HCD suspend/resume support (DO NOT USE) (NEW)
 <*> User-space granted page access driver (NEW)
 <*> Framebuffer-device frontend driver (NEW)
 <*>   Keyboard-device frontend driver (NEW)
 [*] Disable serial port drivers (NEW)
 <*> Export Xen attributes in sysfs (NEW)
 (256) Number of guest devices (NEW)
 Xen version compatibility (no compatibility code)  --->
 [*] Place shared vCPU info in per-CPU storage (NEW)

DomU: F12 with pv_ops kernel: 2.6.32.19-163.fc12.x86_64

The details of the hardware and kernel configuration are here: http://pkill.info/b/2405/setting-up-stable-xen-dom0-with-fedora-xen-3-4-3-with-xenified-linux-kernel-2-6-32-13-in-fedora-12/

Our solution now:
Disable the hard disk cache by: hdparm -W0 /dev/sdX

After disabling the hard disk cache, the VMs work. When we do a middle-size (processing 20GB of data simultaneously on 8 VMs on top of 2 physical machines) intensive I/O test, the VMs and Dom0 don't crash. We will run a full-size (about 800GB of data on these 8 VMs) experiment and see whether it works.

--- Additional comment from norm@turing.une.edu.au on 2010-09-17 02:34:22 EDT ---

Some recent threads that look like this problem:

http://forums.citrix.com/thread.jspa?threadID=272708 (Xen block devices stop responding in guest)

That lead to this kernel mailing list thread:

http://lkml.org/lkml/2010/9/1/178 (I/O scheduler deadlocks on Xen virtual block devices)

The main points from the kernel thread:

  - Problem probably in the Xen I/O stack and not guest scheduler.

  - These patches might be the real fixes:

   xen: use percpu interrupts for IPIs and VIRQs
   xen: handle events as edge-triggered

A work-around might be to disable irqbalance in the guest. This seems to have been the case so far in the citrix thread mentioned above.

I'm now running my main crashing guest without irqbalance.

--- Additional comment from eric@pkill.info on 2010-09-17 03:07:25 EDT ---

We have done experiments on LVM-backed DomUs. The DomUs had this problem as in comment 80; the other configurations (hardware, the Dom0, DomU's kernel, etc.) are the same except the backing VBD (moved from file / tap:aio to LVM).

The 8 LVM backed DomUs work well so far with around 5GB writes and 5GB reads for each DomU.

Hope this piece of info can help.

--- Additional comment from drjones@redhat.com on 2010-09-17 09:09:26 EDT ---

Hi Norman,

Great work tracking these mail threads and potentially finding a solution. I did a quick test on my machine. First I confirmed I could still reproduce it with a kernel compile loop with irqbalance on. It did reproduce in something less than 30 minutes. Then I turned off irqbalance and ran the kernel compile loop for a few hours. After that I turned irqbalance back on to confirm I could again trigger it quickly, but I still haven't seen it in almost an hour, so currently my testing isn't super conclusive. I'm anxious to hear your results, since I know you've been plagued with it frequently.

Drew

--- Additional comment from norm@turing.une.edu.au on 2010-09-19 22:25:00 EDT ---

Turning off irqbalance is looking very promising. I have not had a crash since doing so and the system has been through some significant workload.

Since the 23rd of March this year I've been recording (most of) the times of crashes. Rough stats are:

Shortest uptime: 6.683 minutes
Longest uptime: 8.885 days
Number of Crashes: 210
Crashes/day: 0.845

The 8 days was set during a period of very little activity.

As I write this, 4 days of uptime has passed with a normal to large workload.

One note: during a particularly high load in this uptime, the message mentioned in comment #67 made a return:

Sep 20 11:42:55 turing kernel: hrtimer: interrupt took 940625 ns

On the disappointing side, I've still not been able to create a test system that will reliably crash.

--- Additional comment from drjones@redhat.com on 2010-09-20 04:13:53 EDT ---

I think we can be confident that irqbalance off does the trick. I ran the kernel compile loop with irqbalance off all weekend and it was still running this morning. I turned irqbalance back on and was able to reproduce within an hour or so. Next up: backport the patches pointed to and try another round with irqbalance on.

--- Additional comment from drjones@redhat.com on 2010-09-20 08:53:36 EDT ---


    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A work-around for this issue is to turn off irqbalance, i.e.
chkconfig --level 2345 irqbalance off
service irqbalance stop

--- Additional comment from bburns@redhat.com on 2010-09-20 08:56:07 EDT ---

Set z-stream flag as this has several BZs that may be duplicates and could have a fairly large impact in the field.

--- Additional comment from rlerch@redhat.com on 2010-09-21 01:42:12 EDT ---

Andrew, thanks for the Technical Note draft. Would I be able to get a summary of the following 2 points of information so I can write this known issue up for the 6.0 Technical Notes?

I tried to get this information from reading the bug, but got a bit lost.

* Cause
    What actions or circumstances cause the bug to present (if known).
    Note: if the cause is not known or is not fully diagnosed,
    current hypotheses or theories should be noted.
* Consequence
    What happens when the bug presents.


cheers,
ryanlerch

--- Additional comment from drjones@redhat.com on 2010-09-21 02:56:59 EDT ---

Hi Ryan,

(In reply to comment #88)
> * Cause
>     What actions or circumstances cause the bug to present (if known).
>     Note: if the cause is not known or is not fully diagnosed,
>     current hypotheses or theories should be noted.

While we're still doing some testing, I think it's 90+% likely that it's due to the way the event channel irq was set up. Xen events behave like edge triggered interrupts, but the kernel was setting them up as level. Since it's a fundamental issue, basically any action using interrupts could trigger the bug, but it seems you need high activity to increase the probability of hitting it. My reproducer was just to compile a kernel on the guest which triggered interrupts for the xen block device.
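
The gist of the upstream fix ("xen: handle events as edge-triggered") is just to register dynamic event channels with the edge-triggered flow handler instead of the level-triggered one, since an event channel's pending bit behaves like a latched edge rather than a level that can be re-sampled. Roughly (a sketch from memory; the details of the ack callback change are omitted here):

        /* before: dynamic event channels registered as level-triggered */
        set_irq_chip_and_handler_name(irq, &xen_dynamic_chip,
                                      handle_level_irq, "level");

        /* after the fix: treated as edge-triggered */
        set_irq_chip_and_handler_name(irq, &xen_dynamic_chip,
                                      handle_edge_irq, "edge");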


> * Consequence
>     What happens when the bug presents.
> 

All the tasks associated with the device that had interrupt issues (usually blkfront) will end up in D-state (i.e. hung).

--- Additional comment from drjones@redhat.com on 2010-09-22 04:53:31 EDT ---

I've been running the kernel compile loop on a RHEL6 guest patched with the two patches pointed to in comment 81 for over 30 hours (with irqbalance on, of course). I'm pretty confident that they fix the issue and will post them. The "handle events as edge-triggered" patch, which is probably the one that truly fixes the issue, is already in Linus' tree, but only as of .36-rc3, which means it won't be in Fedora until F15.

--- Additional comment from norm@turing.une.edu.au on 2010-09-22 05:57:54 EDT ---

I just saw that the 2 patches also made it into 2.6.32.22 and 2.6.35.5

So they should make it into Fedora 12 if the stable updates are pushed there. Fedora 13 is currently 2.6.34.7 so I'm not sure about that.

See:

http://lwn.net/Articles/406311/

http://lwn.net/Articles/406312/

--- Additional comment from drjones@redhat.com on 2010-09-22 08:05:10 EDT ---

(In reply to comment #91)
> I just saw that the 2 patches also made it into 2.6.32.22 and 2.6.35.5

Ah, nice. Yeah, I forgot to check the stable trees to see what got pulled in by them.

--- Additional comment from norm@turing.une.edu.au on 2010-09-22 09:10:30 EDT ---

This bug started out in the Fedora 12 kernel, moved around some, and ended up in
the RHEL 6 kernel.

Should this bug be cloned and changed to the F12 kernel to encourage the stable
2.6.32.22 updates to be applied there?

--- Additional comment from drjones@redhat.com on 2010-09-22 09:46:33 EDT ---

Right, agreed. Instead of cloning (since this bug is huge), I've just created a new bug pointing to the patches, bug 636534.

--- Additional comment from cebbert@redhat.com on 2010-09-22 11:29:26 EDT ---

(In reply to comment #93)
> Should this bug be cloned and changed to the F12 kernel to encourage the stable
> 2.6.32.22 updates to be applied there?

They're already applied.

--- Additional comment from drjones@redhat.com on 2010-09-23 11:52:52 EDT ---

QA,

This patch needs some widespread testing on different machines. I saw no problems with it on my machine when testing, but there's been a report that it causes a boot crash. See Chuck's comment in bug 636534 comment 3.

Thanks,
Drew

--- Additional comment from norm@turing.une.edu.au on 2010-09-24 00:05:07 EDT ---

(In reply to comment #96)

> This patch needs some wide-spread testing on different machines. I saw no
> problems with it on my machine when testing, but there's been a report that it
> causes a boot crash. See Chuck's comment in bug 636534 comment 3.

Seems there is a typo in some instances of the patch. See bug 636534 comment 4.

--- Additional comment from drjones@redhat.com on 2010-09-24 02:38:04 EDT ---

(In reply to comment #97)
> Seems there is a typo in some instances of the patch. See bug 636534 comment 4.

Cool, I just saw the message that there was a typo on xen-devel too. The patch I've posted doesn't have it, so we're good to go.

Thanks,
Drew

--- Additional comment from pm-rhel@redhat.com on 2010-10-04 21:50:47 EDT ---

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

--- Additional comment from drjones@redhat.com on 2010-11-16 04:07:37 EST ---

*** Bug 587265 has been marked as a duplicate of this bug. ***

--- Additional comment from arozansk@redhat.com on 2010-11-17 14:44:38 EST ---

Patch(es) available on kernel-2.6.32-83.el6

--- Additional comment from arozansk@redhat.com on 2010-11-17 14:49:20 EST ---

List of patches present on kernel-2.6.32-83.el6:
Related patch: http://patchwork.usersys.redhat.com/patch/28339
Related patch: http://patchwork.usersys.redhat.com/patch/28340

Comment 1 RHEL Product and Program Management 2011-08-29 17:30:05 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 2 Paolo Bonzini 2011-08-31 12:13:21 UTC
Created attachment 520809 [details]
backport/revert patch

Comment 7 Aristeu Rozanski 2011-09-07 16:41:02 UTC
Patch(es) available on kernel-2.6.32-195.el6

Comment 13 errata-xmlrpc 2011-12-06 14:06:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html

