Bug 550724
Summary: | xen PV guest kernel 2.6.32 processes lock up in D state | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Norman Gaywood <ngaywood> | |
Component: | kernel | Assignee: | Andrew Jones <drjones> | |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | |
Severity: | high | Docs Contact: | ||
Priority: | low | |||
Version: | 6.0 | CC: | ahecox, anton, bhubbard, dhoward, dougsland, drjones, eric, gansalmon, itamar, jforbes, kernel-maint, mishu, mshao, qcai, qwan, redhat, vevroux, xen-maint, yuzhang | |
Target Milestone: | rc | Keywords: | ZStream | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | kernel-2.6.32-83.el6 | Doc Type: | Bug Fix | |
Doc Text: |
In some cases, under a small system load involving some I/O operations, processes started to lock up in the D state (that is, became unresponsive), and the system load could climb steadily. This was due to the way the event channel IRQ (Interrupt Request) was set up: Xen events behave like edge-triggered IRQs, but the kernel was setting them up as level-triggered IRQs. As a result, any action using Xen event channels could lock up a process in the D state. With this update, the handling has been changed from level-triggered to edge-triggered IRQs, and processes no longer lock up in the D state.
|
Story Points: | --- | |
Clone Of: | ||||
: | 733672 (view as bug list) | Environment: |
Xen DomU running under Centos 5.4, kernel 2.6.18-164.6.1.el5xen
|
|
Last Closed: | 2011-05-23 20:18:58 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 523117, 661737 | |||
Attachments: |
Created attachment 380488 [details]
dmesg captured from Dom0 console
Created attachment 380490 [details]
partial sysreq output
For a kernel-2.6.31.6-166.fc12.x86_64 I was able to get a partial sysreq output.
Working on getting more detail. Sometimes the sysreq keys do not seem to work.
Created attachment 381157 [details]
sysreq info for 2.6.30.10-105.fc11.x86_64
I installed the F11 kernel on my F12 Xen domU PV system. It lasted for several days, then the load started climbing again.
This time I could look at /var/log/messages without hanging and send sysreq from the dom0.
As a side note. Our site is currently in the middle of a large network upgrade. So we often have the network go away under us and come back. I get the impression that these network problems can sometimes trigger the problem described here. httpd is amongst the first to go to D state and stay there.
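When processes start sticking in D state, a quick way to see which tasks are blocked, and roughly where, is something like the following sketch (standard procps and sysrq tools; the awk filter just matches the D state flag, and the sysrq write needs root):
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
echo w > /proc/sysrq-trigger    # kernel-side dump of blocked task stacks to dmesg/syslog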
2.6.30.10-105.fc11.x86_64 (the most stable for me) crashed again, probably while doing an automated backup. After restarting, I tried this:
echo jiffies > /sys/devices/system/clocksource/clocksource0/current_clocksource
and processes started going into D state straight away. The clock also did not seem to be updating. The previous value of current_clocksource was xen. I don't know what I'm doing here, so I don't know if this is related.
Created attachment 387042 [details]
syslog boot of kernel-2.6.32.6-36.fc12.x86_64
Just back from holiday and working on this again.
I installed the new kernel-2.6.32.6-36.fc12.x86_64 from koji, and processes started going into D state as soon as a large rsync backup started.
Created attachment 387043 [details]
top output of kernel-2.6.32.6-36.fc12.x86_64
Hi Norman,
Can you attach your guest config and the hardware details of your host? Also, check if 'xm dmesg' has anything interesting in it. What release is your host? RHEL 5.4?
Thanks, Andrew
Created attachment 387674 [details]
dmesg of dom0 host
host system is running CentOS 5.4 and kernel 2.6.18-164.11.1.el5xen
Created attachment 387675 [details]
xm dmesg
Another data point. I upgraded the host kernel to RH Enterprise 2.6.18-186.el5xen, which contains some Xen fixes. Still have the same problem. These kernels still crash:
kernel-2.6.30.10-105.fc11.x86_64
kernel-2.6.31.12-174.2.3.fc12.x86_64
I don't know if this is a Fedora problem or a RH Enterprise Xen problem.
Ok, here are the prime suspects:
- guest kernel
- backing store
- host kernel/drivers/HV
- or a problem with some combination of the above
Some of these suspects can be eliminated if we know some history. Has this machine ever had a working config? Did these problems appear after some upgrade? If so, what was kept constant and what was upgraded? We can revert one suspect at a time, or start over with a known-good and upgrade one at a time. Or is this a new system? If so, did the hardware work fine with bare-metal kernels? Or, have you seen any strange behavior in dom0?
guest: you can try building and booting the latest stable kernel, 2.6.32.7. Several xenpv patches have been pushed into the 2.6.32 tree.
backing store: what is it? Please attach the guest config file. I assume it's an LV, but if it's on NFS or another network store then that might explain some problems, since you said the network had been flaky. If it's a local drive, then what type? What driver in dom0 is being used?
host: you're already on the latest, so I don't have any suggestions for that right now, other than to keep it running the latest for now. Also, how much RAM does dom0 have?
Thanks, Andrew
Thanks Andrew, lots of good questions. This hardware is relatively new, just over 6 months old. The main idea of the system is to be a development environment for math/comp sci students. It's set up to deal with up to 60 LTSP (Linux Terminal Server Project) terminals and nxclient/ssh connections. It replaces a 4 year old HP server with Fedora 10 that did the same thing. The old HP setup ran Fedora 10, at its end of life, on the bare metal. The new server was supposed to make use of virtualization.
The hardware of the new dom0 server is an IBM x3850 M2 with 4 Xeon Quad Core E7330 80W processors and 64GB of memory. Two IBM 73.4GB 2.5in 10K HS SAS HDDs make up the system storage for dom0. At the moment we are running CentOS 5.4 with the latest kernel I could find: kernel-xen-2.6.18-186.el5. SAS-attached main storage is an IBM DS3200 with 12 750GB SATA HDDs configured as one large RAID 6 drive. We break up the large drive using LVM. Various attachments of config and dmesg of dom0 to follow.
I see no strange error messages in the dom0 (including /var/log/messages) except for the:
(XEN) traps.c:1878:d5 Domain attempted WRMSR 000000000000008b from 00000021:00000000 to 00000000:00000000.
reported by "xm dmesg". We don't use NFS in dom0 and the network around here is pretty much stable now.
One thing to note. Originally we had hoped to run a Fedora kernel as the dom0. However we struck bug #541615 (Calgary: DMA error on CalIOC2 PHB 0x3) and so were unable to get the attached storage to pass disk tests. RH Enterprise/CentOS is rock solid as a dom0 and passes any disk tests we can throw at it. More detail on the crashing guest to follow.
Created attachment 389898 [details]
dom0 dmesg
dom0 dmesg
dom0 pvdisplay --- Physical volume --- PV Name /dev/sdb2 VG Name SYSTEM PV Size 66.76 GB / not usable 10.54 MB Allocatable yes PE Size (KByte) 32768 Total PE 2136 Free PE 1415 Allocated PE 721 PV UUID i4ZeR9-aQQe-VQfd-dwaw-mpOm-swCa-qnQl7e --- Physical volume --- PV Name /dev/sda1 VG Name VM PV Size 2.72 TB / not usable 3.97 MB Allocatable yes PE Size (KByte) 4096 Total PE 713974 Free PE 403190 Allocated PE 310784 PV UUID prb5ng-TESx-WsIX-JYQs-Fy6T-I0Lv-Yshq01 dom0 lvdiplay --- Logical volume --- LV Name /dev/SYSTEM/ROOT VG Name SYSTEM LV UUID lPSZXD-wmj9-jVe3-gmDs-zPJf-oGG2-JIchhk LV Write Access read/write LV Status available # open 1 LV Size 9.75 GB Current LE 312 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:0 --- Logical volume --- LV Name /dev/SYSTEM/VAR VG Name SYSTEM LV UUID h346wl-8YvE-Wmb3-2OO6-q2CG-bZwg-SJRdPe LV Write Access read/write LV Status available # open 1 LV Size 9.88 GB Current LE 316 Segments 2 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:1 --- Logical volume --- LV Name /dev/SYSTEM/TMP VG Name SYSTEM LV UUID 3Uuw0a-jDga-7hL6-dTNv-o2om-zm1r-btCYDR LV Write Access read/write LV Status available # open 1 LV Size 992.00 MB Current LE 31 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:2 --- Logical volume --- LV Name /dev/SYSTEM/SWAP VG Name SYSTEM LV UUID PJ0dsz-tVqz-GRHV-Uvho-fcd7-wpUF-5CH3eC LV Write Access read/write LV Status available # open 1 LV Size 1.94 GB Current LE 62 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:3 --- Logical volume --- LV Name /dev/VM/system VG Name VM LV UUID HNvX19-hQBi-YnG9-9xVc-83Zh-Vs2o-2t0B6r LV Write Access read/write LV Status available # open 2 LV Size 150.00 GB Current LE 38400 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:4 --- Logical volume --- LV Name /dev/VM/home VG Name VM LV UUID Pr2FlO-BNTb-PeeW-GN0j-O7j2-espt-Em9lYG LV Write Access read/write LV Status available # open 2 LV Size 1.00 TB Current LE 262144 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:5 --- Logical volume --- LV Name /dev/VM/projects VG Name VM LV UUID o1phpe-Dp8r-HUf1-J6VQ-y6LV-PpI1-xbIlVT LV Write Access read/write LV Status available # open 2 LV Size 20.00 GB Current LE 5120 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:6 --- Logical volume --- LV Name /dev/VM/kojack VG Name VM LV UUID PBPq4R-M8eJ-Y9gZ-H52S-V21v-3mas-01vDfY LV Write Access read/write LV Status available # open 1 LV Size 20.00 GB Current LE 5120 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:7 domU config name = "turing" uuid = "8d3383a0-f0ae-f0b1-215d-c2e04115e9fe" maxmem = 30000 memory = 30000 vcpus = 12 bootloader = "/usr/bin/pygrub" on_poweroff = "destroy" on_reboot = "restart" on_crash = "restart" vfb = [ "type=vnc,vncunused=1,keymap=en-us" ] disk = [ "phy:/dev/VM/system,xvda,w", "phy:/dev/VM/home,xvdb,w", "phy:/dev/VM/projects,xvdc,w" ] vif = [ "mac=00:16:3e:7d:d0:91,bridge=xenbr0,script=vif-bridge" ] We get crashes running every 2.6.{30,31,32} kernel we have tried. 
We are currently using kernel-2.6.32.7-41.fc12.x86_64 from koji and the latest crash had a top looking like: top - 10:58:42 up 4:55, 21 users, load average: 62.42, 43.99, 22.27 Tasks: 777 total, 1 running, 776 sleeping, 0 stopped, 0 zombie Cpu(s): 0.4%us, 0.6%sy, 0.0%ni, 31.9%id, 67.2%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 30100024k total, 4208836k used, 25891188k free, 313452k buffers Swap: 10239992k total, 0k used, 10239992k free, 1557692k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 11652 root 20 0 20528 1704 872 R 1.6 0.0 1:46.83 top 973 root 20 0 0 0 0 D 0.0 0.0 0:00.33 jbd2/dm-1-8 1069 root 20 0 0 0 0 D 0.0 0.0 0:00.08 flush-253:1 2645 apache 20 0 475m 21m 6020 D 0.0 0.1 0:05.25 httpd 2647 apache 20 0 474m 19m 5996 D 0.0 0.1 0:04.74 httpd 2648 apache 20 0 474m 19m 5992 D 0.0 0.1 0:04.53 httpd 2650 apache 20 0 475m 20m 6036 D 0.0 0.1 0:06.52 httpd 2651 apache 20 0 476m 22m 5980 D 0.0 0.1 0:04.70 httpd 2652 apache 20 0 476m 21m 6028 D 0.0 0.1 0:05.44 httpd 2896 apache 20 0 476m 21m 6016 D 0.0 0.1 0:05.54 httpd 2897 apache 20 0 475m 20m 6016 D 0.0 0.1 0:05.27 httpd 2898 apache 20 0 476m 21m 6004 D 0.0 0.1 0:05.18 httpd 5028 apache 20 0 474m 20m 6020 D 0.0 0.1 0:05.01 httpd 8075 ashley 20 0 137m 6484 2040 D 0.0 0.0 0:00.40 gconfd-2 8085 ashley 20 0 332m 16m 11m D 0.0 0.1 0:01.47 gnome-panel 8929 apache 20 0 475m 20m 5964 D 0.0 0.1 0:04.40 httpd 10458 stadmin 20 0 214m 5436 3784 D 0.0 0.0 0:00.40 smbd 10740 apache 20 0 475m 20m 5916 D 0.0 0.1 0:02.48 httpd 12005 stadmin 20 0 214m 5636 4028 D 0.0 0.0 0:00.48 smbd 14550 despi 20 0 110m 1608 1248 D 0.0 0.0 0:00.02 bash 14592 despi 20 0 820m 102m 28m D 0.0 0.3 0:42.92 thunderbird-bin 15914 despi 20 0 712m 116m 26m D 0.0 0.4 3:52.89 firefox 17831 stadmin 20 0 214m 6068 4104 D 0.0 0.0 0:01.61 smbd 19420 despi 20 0 311m 17m 7108 D 0.0 0.1 0:08.50 gvim 19429 despi 20 0 149m 13m 2688 D 0.0 0.0 0:00.76 xdvi-xaw3d 21706 kankanac 20 0 227m 7736 4000 D 0.0 0.0 0:01.26 pine 22351 despi 20 0 309m 15m 6880 D 0.0 0.1 0:00.46 gvim 23276 despi 20 0 171m 91m 26m D 0.0 0.3 5:38.29 acroread 23924 math101 20 0 160m 8096 3920 D 0.0 0.0 0:00.15 vim 23944 math101 20 0 7688 332 256 D 0.0 0.0 0:00.00 fortune 23991 kvaidya2 20 0 54968 2156 1804 D 0.0 0.0 0:00.00 imap 24001 hrich 20 0 99.5m 3832 1132 D 0.0 0.0 0:00.00 sendmail 24032 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24053 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24062 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24067 bea 20 0 66268 2068 1452 D 0.0 0.0 0:00.03 sftp-server 24109 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24118 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24127 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24136 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24138 stadmin 20 0 214m 5412 3844 D 0.0 0.0 0:00.04 smbd 24158 markev 20 0 55112 2144 1764 D 0.0 0.0 0:00.03 imap 24166 stadmin 20 0 214m 5036 3500 D 0.0 0.0 0:00.03 smbd 24175 bea 20 0 66268 2052 1448 D 0.0 0.0 0:00.03 sftp-server 24218 stadmin 20 0 214m 5404 3836 D 0.0 0.0 0:00.03 smbd 24244 bea 20 0 66268 2052 1448 D 0.0 0.0 0:00.03 sftp-server 24295 comp131 20 0 107m 20m 3976 D 0.0 0.1 0:00.28 index.cgi 24311 ashley 20 0 20676 792 708 D 0.0 0.0 0:00.00 procmail 24318 markev 20 0 55144 2252 1836 D 0.0 0.0 0:00.00 imap 24336 apache 20 0 374m 9.8m 1908 D 0.0 0.0 0:00.00 httpd 24344 apache 20 0 374m 9.8m 1880 D 0.0 0.0 0:00.00 httpd 24345 apache 20 0 374m 9712 1628 D 0.0 0.0 0:00.00 httpd 24346 apache 20 0 374m 9.8m 1884 D 0.0 0.0 0:00.00 httpd 24347 apache 20 0 374m 9.8m 1900 D 0.0 0.0 0:00.00 
httpd 24356 stadmin 20 0 214m 5264 3716 D 0.0 0.0 0:00.03 smbd 24360 apache 20 0 374m 9352 1292 D 0.0 0.0 0:00.00 httpd 24365 apache 20 0 374m 9668 1576 D 0.0 0.0 0:00.00 httpd 24367 markev 20 0 117m 1432 584 D 0.0 0.0 0:00.00 sshd 24370 apache 20 0 476m 19m 3996 D 0.0 0.1 0:00.11 httpd 24374 ashoka 20 0 55340 2472 1876 D 0.0 0.0 0:00.90 pop3 Created attachment 389900 [details]
domU dmesg
kernel-2.6.32.7-41.fc12.x86_64
Created attachment 389904 [details]
/var/log/messages at start of crash
These messages appear in /var/log/messages as the system dies.
Created attachment 389905 [details]
sysreq W and sysreq t info
I sent a sysreq C and M from dom0 and these messages were logged in the domU /var/log/messages.
That was a sysreq W and t (not C and M). I didn't send an M because I thought the other ones did not work. I didn't have a tail on /var/log/messages, and depending on what hangs, messages may not be saved. This time, messages did make it into /var/log/messages.
Some observations and things I've tried:
- The crash can happen at any time. The best we've had is about 2 days of uptime, the worst is a few hours.
- There are no other domU systems running. This is the only one.
- domU is not very loaded. There are users but they are mostly doing desktop-type tasks. System load never gets above two and is usually below 1.
- At first I suspected NFS and our (then) bad network. However I removed all use of NFS and the problem still occurred.
- I also suspected quota and ext4 problems. I disabled quotas and it stopped the "kernel: JBD: Spotted dirty metadata buffer" error messages but did not stop the crash.
- I also suspected our amd automounter and its complex maps. But I disabled that and the system still died.
Not sure what to try next.
The crash dump in https://bugzilla.redhat.com/show_bug.cgi?id=551552#c3 looks similar to the output in comment 19. https://bugzilla.redhat.com/show_bug.cgi?id=526627 may be another case of this bug?
Created attachment 394253 [details]
"xm top" from dom0
Output of "xm top" a few minutes after "turing" went into D state lockup.
If anyone has suggestions of any other information I could gather, I'd love to know.
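One low-impact way to gather more state from dom0 is to trigger sysrq in the guest via the toolstack; a sketch (the <domid> placeholder is whatever "xm list" reports for the guest):
xm list                  # note the guest's numeric domain id
xm sysrq <domid> w       # dump blocked (D-state) task stacks to the guest console/log
xm sysrq <domid> t       # dump all task states
xm sysrq <domid> m       # memory usage summary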
Tried the kernel parameter "elevator=noop" in both:
kernel-2.6.32.7-41.fc12.x86_64
kernel-2.6.31.12-174.2.19.fc12.x86_64
Both crash and burn, so I've taken that parameter out.
The system had its first real login load of about 16 LTSP terminals. It lasted a few minutes. Did an xm destroy / xm create and it lasted another 2 minutes. Same xm destroy and xm create and it's been running 20 minutes now. This is not good. Can't see how we are going to be running student labs with this setup.
This kernel thread seems to mention something like this problem: http://www.mail-archive.com/kvm@vger.kernel.org/msg23039.html There didn't seem to be any resolution to the problem in that thread.
I've posted this bug in the upstream kernel bugzilla: http://bugzilla.kernel.org/show_bug.cgi?id=15370 "task jbd2/dm-1-8:973 blocked for more than 120 seconds."
Created attachment 396182 [details]
2.6.33-1 won't boot on an F12 domU
The hope of 2.6.33 solving this problem has been put on hold.
On a test F12 system I updated to
grubby-7.0.13-1.fc13.x86_64.rpm
linux-firmware-20100106-3.fc13.noarch.rpm
and installed
kernel-2.6.33-1.fc13.x86_64.rpm
and it would not boot. Boot messages attached.
Should I make another bugzilla about this?
Updated to kernel-2.6.32.9-64.fc12.x86_64 and it crashed after about 19 hours. This problem is quite frustrating. I have not been able to reproduce the problem in a different VM on either the same hardware or different hardware. Other VMs seem to withstand heavy disk tests with no problem. It's something about the LTSP workload. The network, I'm sure, is in this mix somewhere. In the not very conclusive thread mentioned in comment #26, it is said: "I kept hitting this issue for 2.6.31 guest kernel even with a simple network test."
The system can hum along with labs full of students and work fine for several hours, then suddenly one of the jbd2/dm- tasks won't come out of D state and the rest of the system follows as more and more processes catch the D state. Other jbd2/dm- tasks catch the D as well after the first one goes. Depending on which jbd2/dm- goes D first, the growth of the D's can be very fast, or start slow and then spread at an increasing pace. The problem is load related but it is not a big load. It's some combination of network and disk, I think. I'm at a loss as to what more information I can provide.
(In reply to comment #28)
> kernel-2.6.33-1.fc13.x86_64.rpm
> and it would not boot. Boot messages attached.
> Should I make another bugzilla about this?
Yes, that's new bug material; there's already a new bug written about .33 not booting as a PV guest though. Check out bug 568415 to see if it has the same symptoms you saw.
We might be able to get a more useful stack if you boot your F12 guest with the hung_task_panic=1 kernel command line parameter. That should cause your machine to panic as soon as the first hung task is detected. Then you can grab the stack from it either with xm dmesg (it might be there...), sysreq, and/or the xenctx command, "/usr/lib/xen/bin/xenctx -s <System.map-of-guest-kernel>.map <domid>"
Created attachment 396973 [details]
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 after panic
Thanks Andrew for the suggestions.
This is the result of running:
for p in $(seq 0 11)
do
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 27 $p
done
on the system after it panicked with the hung_task_panic=1 passed to the F12 guest.
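A quick way to confirm inside the guest that the hung-task options actually took effect is to read the settings back; a small sketch (the values in the comments are the usual defaults and may differ on a given build):
cat /proc/cmdline                              # should show hung_task_panic=1
cat /proc/sys/kernel/hung_task_panic           # 1 = panic on the first hung task
cat /proc/sys/kernel/hung_task_timeout_secs    # 120 by default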
Created attachment 396977 [details]
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 just before panic
I also ran the xenctx on each processor just before it panicked as the processes started to hang. Unfortunately I didn't get the console output of the panic other than the line:
Kernel panic - not syncing: hung_task: blocked tasks
There was no output in xm dmesg other than the messages mentioned in comment #9.
Created attachment 398674 [details]
A different domU VM hung
I guess the xenctx dumps were also inconclusive. I have some slightly new information.
I have another domU that has hung. This was a test system that I was about to use to get a 2.6.33 kernel to run, and it hung in an rpm -i kernel... command. The hang seems to be on the /boot filesystem, so I can still run commands and get logs if I don't access /boot or run commands like ps.
Same dom0 hardware, so that is still a possible problem, although I've not seen any signs of trouble in dom0. This dom0 only runs two domUs: this test system and the main LTSP system I have logged in all the above. Previous disk tests have not been able to make the test system fail.
I guess this tells us something. It's probably not an ext4 problem as /boot is ext3. It's probably nothing to do with the network. The only network activity is my ssh logins. It's not quotas as they are not turned on. There are not even any users on this system, it's just a small test system.
This one is running 2.6.32.3-10.fc12.x86_64 and it seems to have first hung in:
INFO: task kjournald:580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D ffff88012f57a300 0 580 2 0x00000080
ffff88012fc2fc00 0000000000000246 0000000000000000 ffffffff817f3a80
0000000001865b1e ffffffffb47a46fe ffff88012fc2ffd8 ffff88012fc2ffd8
ffff88012f7ae158 000000000000f980 0000000000015740 ffff88012f7ae158
Call Trace:
[<ffffffff81140972>] ? sync_buffer+0x0/0x44
[<ffffffff81140972>] ? sync_buffer+0x0/0x44
[<ffffffff8145337d>] io_schedule+0x43/0x5d
[<ffffffff811409b2>] sync_buffer+0x40/0x44
[<ffffffff814538d0>] __wait_on_bit+0x48/0x7b
[<ffffffff81012e5b>] ? xen_hypervisor_callback+0x1b/0x20
[<ffffffff81453971>] out_of_line_wait_on_bit+0x6e/0x79
[<ffffffff81140972>] ? sync_buffer+0x0/0x44
[<ffffffff81074a64>] ? wake_bit_function+0x0/0x33
[<ffffffff8113f765>] ? submit_bh+0x109/0x117
[<ffffffff811408d5>] __wait_on_buffer+0x24/0x26
[<ffffffff811bfc12>] wait_on_buffer+0x3d/0x41
[<ffffffff811c019a>] journal_commit_transaction+0x4cf/0xefd
[<ffffffff8104eb5b>] ? finish_task_switch+0x8a/0xac
[<ffffffff81074a2b>] ? autoremove_wake_function+0x0/0x39
[<ffffffff8100e87f>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff81454b03>] ? _spin_unlock_irqrestore+0x19/0x1b
[<ffffffff81065479>] ? try_to_del_timer_sync+0x73/0x81
[<ffffffff8100e87f>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff811c384c>] kjournald+0xe8/0x225
[<ffffffff81074a2b>] ? autoremove_wake_function+0x0/0x39
[<ffffffff811c3764>] ? kjournald+0x0/0x225
[<ffffffff8107473e>] kthread+0x7f/0x87
[<ffffffff81012d2a>] child_rip+0xa/0x20
[<ffffffff81011f11>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff8101269d>] ? retint_restore_args+0x5/0x6
[<ffffffff81012d20>] ? child_rip+0x0/0x20
It's still possible to log into this system and do a restricted set of things. Is there any other information I can collect while it's in this state?
For example, cat /proc/*/stat can show me the blocked processes. Anything else? The system is installed on one LV, divided into a number of partitions:
/dev/mapper/SYSTEM-root / ext4 defaults 1 1
UUID=3118ec38-93e1-456c-995f-c67217901296 /boot ext3 defaults 1 2
/dev/mapper/SYSTEM-tmp /tmp ext4 defaults 1 2
/dev/mapper/SYSTEM-var /var ext4 defaults 1 2
/dev/mapper/SYSTEM-mail /var/spool/mail ext4 defaults 1 2
/dev/mapper/SYSTEM-opt /opt ext4 defaults 1 2
/dev/mapper/SYSTEM-swap swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
I also did an xm dump-core of the test system in the hung state. Not sure what I can do with this 5GB file, however.
Sorry, I haven't had any time to look at this. The backtraces you've captured are useful though. The one where we panic'ed immediately after detecting D-state is this:
Call Trace:
[<ffffffff8122e328>] delay_tsc+0x48 <--
[<ffffffff8122e26f>] __delay+0xf
[<ffffffff8122e2b3>] __const_udelay+0x42
[<ffffffff81453982>] panic+0x131
[<ffffffff810acbfd>] watchdog+0x18e
[<ffffffff8100e87f>] xen_restore_fl_direct_reloc+0x4
[<ffffffff810aca6f>] hung_task_panic+0x17
[<ffffffff810745b6>] kthread+0x7f
[<ffffffff81012d6a>] child_rip+0xa
[<ffffffff81011f51>] int_ret_from_sys_call+0x7
[<ffffffff810126dd>] retint_restore_args+0x5
[<ffffffff81012d60>] kernel_thread+0xe0
It shows that the watchdog tripped after we called xen_restore_fl_direct(). xen_restore_fl_direct() just checks if there are pending events and, if so, makes a simple xen-version hypercall in order to trap to Xen, which will force event channel processing. All the other stacks you've captured also contain xen_restore_fl_direct(). This indicates to me that Xen is getting messed up at some point, and then each process eventually gets hung up after trapping down to it. It further looks like the bug is in Xen, and not the guest, since the D-state problem transferred to the other domain on your system as well, once again with xen_restore_fl_direct() on the stack.
All that said, I don't know how exactly, or what, is getting Xen messed up. It does seem likely to be related to your LTSP workload though, since you're currently the only reporter I'm aware of. Maybe we can get 'xm dmesg' output before you start the LTSP workload, then at some regular interval after you start it, and then of course right after all the processes lock up in D state. Please also make sure all debug messages are turned on by adding the following to your xen.gz line in grub:
loglvl=all guest_loglvl=all debug_stack_lines=80
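For reference, an illustrative dom0 grub.conf stanza with those hypervisor options added (the xen.gz/vmlinuz versions and root device here are examples only; adjust to the installed packages):
title CentOS 5 (xen, debug logging)
    root (hd0,0)
    kernel /xen.gz-2.6.18-186.el5 loglvl=all guest_loglvl=all debug_stack_lines=80
    module /vmlinuz-2.6.18-186.el5xen ro root=/dev/SYSTEM/ROOT
    module /initrd-2.6.18-186.el5xen.img
After rebooting, "xm dmesg | grep -i 'command line'" should echo the options back if the hypervisor picked them up.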
hypervisor boot messages
I've restarted dom0 with the requested logging.
Attached is the hypervisor.log found in /var/log/xen/console after startup.
This seems to contain the xm dmesg output. I have XENCONSOLED_LOG_HYPERVISOR=yes enabled in /etc/sysconfig/xend
The LTSP domU has also been started.
I guess I have to hope now that something is logged when it goes wrong.
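The console-logging setup referred to in the comment above lives in /etc/sysconfig/xend; a sketch of the relevant lines (XENCONSOLED_LOG_HYPERVISOR=yes and the log directory are quoted from the comment, the other variable is an assumption based on the usual RHEL5/CentOS 5 defaults):
XENCONSOLED_LOG_HYPERVISOR=yes
XENCONSOLED_LOG_GUESTS=yes
XENCONSOLED_LOG_DIR=/var/log/xen/console
Restarting xend (service xend restart) should pick the change up.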
Usage is building in domU and the dom0 dmesg logs have been quiet. The only additions were this, several hours after the restart:
(XEN) traps.c:2044: GPF (0060): ffff828c8013076c -> ffff828c80199f78
and 8 minutes later the same thing again:
(XEN) traps.c:2044: GPF (0060): ffff828c8013076c -> ffff828c80199f78
No crash in domU yet. On activity there are our standard ext4 quota problems with several:
kernel: JBD: Spotted dirty metadata buffer (dev = dm-1, blocknr = 98662). There's a risk of filesystem corruption in case of system crash.
Andrew, you say:
> since you're currently the only reporter I'm aware of.
The two bugs mentioned in comment #23 look similar to me. At least xen_restore_fl_direct_end() features in their stack traces. They haven't done any xenctx traces that might have xen_restore_fl_direct().
Had a few more
(XEN) traps.c:2044: GPF (0060): ffff828c8013076c -> ffff828c80199f78
in the hypervisor.log, followed by a crash. No stacks in the logs from the crash of either the guest or the hypervisor. (Sometimes I don't get the "hung task" message, depending on which filesystem locks up.) After the guest reboot we get a bunch more of the WRMSR messages immediately after the reboot. After the reboot we sporadically get more of the traps.c:2044 messages and nothing else. Hope the traps.c:2044 thing means something.
After about a day or so the traps.c:2044: GPF (0060) messages seem to have stopped. The guest is still crashing several times a day. I have not rebooted dom0 since turning on the extra logging, however it does not look like I am getting anything significant in the dmesg logs anyway. Any suggestions on what tools to use to extract more information?
I've updated dom0 to 2.6.18-191.el5xen, with still the same problem. Here is a theory on the problem, or perhaps a grab for a straw. As I said at the start, we have a new network rolled out across all our systems. We have noticed occasional packet loss. Not big outages, but occasional dropped packets according to ping. Could this be the trigger for this bug? Some error code not handling the situation properly perhaps? Could xen dom0 get "messed up" by this? Here is the ifconfig output of dom0.
We can see some dropped packets in vif3.0 eth0 Link encap:Ethernet HWaddr 00:1A:64:DB:DC:4C inet addr:129.180.124.36 Bcast:129.180.125.255 Mask:255.255.254.0 inet6 addr: 2001:388:10:1606:21a:64ff:fedb:dc4c/64 Scope:Global inet6 addr: fe80::21a:64ff:fedb:dc4c/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1155927 errors:0 dropped:0 overruns:0 frame:0 TX packets:6966 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:1330910570 (1.2 GiB) TX bytes:1373077 (1.3 MiB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:6042 errors:0 dropped:0 overruns:0 frame:0 TX packets:6042 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:581960 (568.3 KiB) TX bytes:581960 (568.3 KiB) peth0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING NOARP MTU:1500 Metric:1 RX packets:26444158 errors:0 dropped:0 overruns:0 frame:0 TX packets:39923089 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:4611777466 (4.2 GiB) TX bytes:39579860067 (36.8 GiB) Interrupt:22 Memory:f2000000-f2012800 vif0.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING NOARP MTU:1500 Metric:1 RX packets:6977 errors:0 dropped:0 overruns:0 frame:0 TX packets:1155938 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:1375739 (1.3 MiB) TX bytes:1330911296 (1.2 GiB) vif3.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING NOARP MTU:1500 Metric:1 RX packets:12139711 errors:0 dropped:0 overruns:0 frame:0 TX packets:18153943 errors:0 dropped:4363 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:26347167293 (24.5 GiB) TX bytes:3148543642 (2.9 GiB) virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00 inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0 inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:6 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:468 (468.0 b) xenbr0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF UP BROADCAST RUNNING NOARP MTU:1500 Metric:1 RX packets:1146793 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:1313566897 (1.2 GiB) TX bytes:0 (0.0 b) Is this related? Apr 9 00:47:44 xback2 kernel: INFO: task kthreadd:2 blocked for more than 120 seconds. Apr 9 00:47:44 xback2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 9 00:47:44 xback2 kernel: kthreadd D 0000000000000002 0 2 0 0x00000000 Apr 9 00:47:44 xback2 kernel: ffff88007dbfd4c0 0000000000000046 0000000000000000 0000000a00000000 Apr 9 00:47:44 xback2 kernel: ffff880000000001 ffff880079f9b800 ffff88007dbfdfd8 ffff88007dbfdfd8 Apr 9 00:47:44 xback2 kernel: ffff88007dbf1b38 000000000000f980 0000000000015740 ffff88007dbf1b38 Apr 9 00:47:44 xback2 kernel: Call Trace: Apr 9 00:47:44 xback2 kernel: [<ffffffff8107c30d>] ? ktime_get_ts+0x85/0x8e Apr 9 00:47:44 xback2 kernel: [<ffffffff810d604d>] ? sync_page+0x0/0x4a Apr 9 00:47:44 xback2 kernel: [<ffffffff810d604d>] ? 
sync_page+0x0/0x4a Apr 9 00:47:44 xback2 kernel: [<ffffffff814546f5>] io_schedule+0x43/0x5d Apr 9 00:47:44 xback2 kernel: [<ffffffff810d6093>] sync_page+0x46/0x4a Apr 9 00:47:44 xback2 kernel: [<ffffffff81454c48>] __wait_on_bit+0x48/0x7b Apr 9 00:47:44 xback2 kernel: [<ffffffff810d6265>] wait_on_page_bit+0x72/0x79 ... Apr 9 00:47:44 xback2 kernel: INFO: task async/mgr:14 blocked for more than 120 seconds. Apr 9 00:47:44 xback2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 9 00:47:44 xback2 kernel: async/mgr D ffff88007d045d80 0 14 2 0x00000000 Apr 9 00:47:44 xback2 kernel: ffff88007d055ca0 0000000000000046 ffff88007d055c68 ffff88007d055c64 Apr 9 00:47:44 xback2 kernel: 0000000000000000 0000000000008a00 ffff88007d055fd8 ffff88007d055fd8 Apr 9 00:47:44 xback2 kernel: ffff88007d046158 000000000000f980 0000000000015740 ffff88007d046158 Apr 9 00:47:44 xback2 kernel: Call Trace: Apr 9 00:47:44 xback2 kernel: [<ffffffff814549eb>] schedule_timeout+0x36/0xe3 Apr 9 00:47:44 xback2 kernel: [<ffffffff81046355>] ? check_preempt_wakeup+0x1d4/0x270 Apr 9 00:47:44 xback2 kernel: [<ffffffff8104815e>] ? enqueue_task_fair+0x68/0x6d We don't have any virtual machines on this system. It's 2.6.32.10-90.fc12.x86_64 running an rsync job on an MD file system (xfs). We have an ext3 system similarly affected. Unfortunately the system crashed completely when I tried to use sysrq on it (C I think). Here is there memory information I grabbed: Apr 12 11:55:04 xback2 kernel: SysRq : Show Memory Apr 12 11:55:04 xback2 kernel: Mem-Info: Apr 12 11:55:04 xback2 kernel: Node 0 DMA per-cpu: Apr 12 11:55:04 xback2 kernel: CPU 0: hi: 0, btch: 1 usd: 0 Apr 12 11:55:04 xback2 kernel: CPU 1: hi: 0, btch: 1 usd: 0 Apr 12 11:55:04 xback2 kernel: Node 0 DMA32 per-cpu: Apr 12 11:55:04 xback2 kernel: CPU 0: hi: 186, btch: 31 usd: 159 Apr 12 11:55:04 xback2 kernel: CPU 1: hi: 186, btch: 31 usd: 115 Apr 12 11:55:04 xback2 kernel: active_anon:1757 inactive_anon:10553 isolated_anon:0 Apr 12 11:55:04 xback2 kernel: active_file:155930 inactive_file:155529 isolated_file:99 Apr 12 11:55:04 xback2 kernel: unevictable:0 dirty:69285 writeback:2073 unstable:0 Apr 12 11:55:04 xback2 kernel: free:72852 slab_reclaimable:81782 slab_unreclaimable:13993 Apr 12 11:55:04 xback2 kernel: mapped:2703 shmem:66 pagetables:1325 bounce:0 Apr 12 11:55:04 xback2 kernel: Node 0 DMA free:8044kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:2280kB inactive_file:492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4944kB slab_unreclaimable:112kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Apr 12 11:55:04 xback2 kernel: lowmem_reserve[]: 0 2003 2003 2003 Apr 12 11:55:04 xback2 kernel: Node 0 DMA32 free:283364kB min:5704kB low:7128kB high:8556kB active_anon:7028kB inactive_anon:42212kB active_file:621440kB inactive_file:621624kB unevictable:0kB isolated(anon):0kB isolated(file):396kB present:2051180kB mlocked:0kB dirty:277140kB writeback:8292kB mapped:10812kB shmem:264kB slab_reclaimable:322184kB slab_unreclaimable:55860kB kernel_stack:1776kB pagetables:5300kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no Apr 12 11:55:04 xback2 kernel: lowmem_reserve[]: 0 0 0 0 Apr 12 11:55:04 xback2 kernel: Node 0 DMA: 19*4kB 20*8kB 28*16kB 28*32kB 17*64kB 12*128kB 5*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 8044kB Apr 12 11:55:04 xback2 kernel: Node 0 DMA32: 64907*4kB 2541*8kB 129*16kB 2*32kB 6*64kB 3*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 283364kB Apr 12 11:55:04 xback2 kernel: 311650 total pagecache pages Apr 12 11:55:04 xback2 kernel: 47 pages in swap cache Apr 12 11:55:04 xback2 kernel: Swap cache stats: add 592, delete 545, find 67/87 Apr 12 11:55:04 xback2 kernel: Free swap = 4199984kB Apr 12 11:55:04 xback2 kernel: Total swap = 4200988kB Apr 12 11:55:04 xback2 kernel: 524000 pages RAM Apr 12 11:55:04 xback2 kernel: 9892 pages reserved Apr 12 11:55:04 xback2 kernel: 176159 pages shared Apr 12 11:55:04 xback2 kernel: 274020 pages non-shared Still seeing this problem with 2.6.18-194.el5xen I also tried updating the Xen to 3.4.2 from the repo provided at: http://www.gitco.de/repo/xen3.4.2 and still experienced the same D state lockup in my guest. I've now downgraded back to the standard Xen that comes with RH/Centos. Another observation, the most common processes to lockup after jbd2/dm and flush are httpd and nscd. Both of these processes could possibly be doing LDAP lookups. Not sure want any of this is telling me. Jeremy, re comment #33. I does not look to be the same problem to me but I'm not really qualified to judge that. Hi, I have the exact same issue with xen 3.1.4 running on FC8 DOM0 2.6.21.7-5.fc8xen. INFO: task kjournald:851 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kjournald D ffff88001ce9ca80 0 851 2 0x00000000 ffff880002d1daf0 0000000000000246 ffff88001ce1d908 ffff88001e0cb008 000000000073335c 0000000000000001 ffff880002d1db58 ffff88001d5bdb00 ffff88001cde1b48 000000000000f8e0 ffff88001cde1b48 0000000000015600 Call Trace: [<ffffffff8106e897>] ? clocksource_read+0xf/0x11 [<ffffffff8106aca7>] ? ktime_get_ts+0x4e/0x53 [<ffffffff81419393>] io_schedule+0x31/0x42 [<ffffffff811e9503>] get_request_wait+0xeb/0x179 [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811e982c>] __make_request+0x29b/0x3c5 [<ffffffff811e71cb>] generic_make_request+0x24e/0x2ac [<ffffffff8100e7f2>] ? check_events+0x12/0x20 [<ffffffff811e730c>] submit_bio+0xe3/0x100 [<ffffffff8100e7f2>] ? check_events+0x12/0x20 [<ffffffff8111b2f3>] submit_bh+0xf4/0x117 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff81196b39>] journal_do_submit_data+0x3c/0x4a [<ffffffff81197055>] journal_commit_transaction+0x3be/0xe55 [<ffffffff81048fcb>] ? finish_task_switch+0x6d/0xe6 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b [<ffffffff8105c11f>] ? try_to_del_timer_sync+0x5a/0x66 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8119a78b>] kjournald+0xeb/0x230 [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39 [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b [<ffffffff8119a6a0>] ? kjournald+0x0/0x230 [<ffffffff81067765>] kthread+0x91/0x99 [<ffffffff81012daa>] child_rip+0xa/0x20 [<ffffffff81011f97>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff8101271d>] ? retint_restore_args+0x5/0x6 [<ffffffff81012da0>] ? 
child_rip+0x0/0x20 vif3.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1048 errors:0 dropped:0 overruns:0 frame:0 TX packets:2082 errors:0 dropped:32 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:114497 (111.8 KiB) TX bytes:248113 (242.2 KiB) vif4.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:363 errors:0 dropped:0 overruns:0 frame:0 TX packets:414 errors:0 dropped:35 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:43364 (42.3 KiB) TX bytes:55012 (53.7 KiB) vif5.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:3127 errors:0 dropped:0 overruns:0 frame:0 TX packets:2850 errors:0 dropped:16 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:443692 (433.2 KiB) TX bytes:404390 (394.9 KiB) vif6.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:436 errors:0 dropped:0 overruns:0 frame:0 TX packets:1087 errors:0 dropped:226 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:63671 (62.1 KiB) TX bytes:127551 (124.5 KiB) Herve, is that stack trace from a guest or the Dom0 kernel. The lockups for me were in the guests, not Dom0. If it was a guest, what was it running? The stack trace is from one of my fc12 guest (2.6.32). The only one I managed to get as usually all the guests get stuck (even the guest running old fc8). Sometimes I can get the system back by destroying the Guest were the blockage occur, but usually only a reboot of the dom0 helps. 2 days ago I had issue with the interfaces not being released properly. Don’t know if it’s of any help. Found interface vif19.0 but domain 19 does not exist. Found interface vif20.0 but domain 20 does not exist. Found interface vbd-19-2050 but domain 19 does not exist. Found interface vbd-19-2049 but domain 19 does not exist. Found interface vif1.0 but domain 1 does not exist. Found interface vif19.0 but domain 19 does not exist. Found interface vif20.0 but domain 20 does not exist. Found interface vbd-19-2050 but domain 19 does not exist. Found interface vbd-19-2049 but domain 19 does not exist. Found interface vbd-1-51718 but domain 1 does not exist. Found interface vbd-1-51717 but domain 1 does not exist. Found interface vbd-1-51716 but domain 1 does not exist. Found interface vbd-1-51715 but domain 1 does not exist. Found interface vbd-1-51714 but domain 1 does not exist. Found interface vbd-1-51713 but domain 1 does not exist. .... Herve, could you describe your hardware, some history (how long have you been seeing this, etc), and perhaps a description of your workload? If it is the same problem I'm seeing, it seems to be a very rare problem. Yet I have not had a guest that has made 3 days of uptime. The other bugzillas that looked similar have gone silent. I also can't reproduce the problem on different hardware but then I don't have enough hardware to re-create exactly the same conditions. 
I've just become aware of several bugs to do with the Broadcom NICs that we use: # lspci | grep -i Broadcom 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20) 02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20) # ethtool -i peth0 driver: bnx2 version: 2.0.2 firmware-version: 4.6.4 IPMI 0.1.18 bus-info: 0000:02:00.0 Thing is that most of these problems are supposed to fixed in the RedHat 5.5 kernels. However I did notice some fixes in 2.6.18-198: [net] bnx2: fix lost MSI-X problem on 5709 NICs (John Feeney) [511368] [net] bnx2: avoid restarting cnic in some contexts (Andy Gospodarek) [554706] So I installed that kernel but it did not fix my problem. The guest still locks up in D state processes. But at least I've found another candidate for the cause here. Any suggestions of things to look for would be well received. About to try: options bnx2 "disable_msi=1" in /etc/modprobe.conf of dom0 After that, I think I'll try putting in some Intel NICs and see if that helps. Is this really linked to network use? Have you tried running the guests with the network down? I have not tried recently to run any tests with the network down. This is a production system now and pretty hard to find a time to do this. Before the system went live in late Dec 2009 we ran bonnie++ and other tests and did not notice this problem. We first noticed it 2 days after going live. In comment #34 I mention another guest with this problem where there was very little network activity. The hang occurred during an rpm install of a new kernel. At the same time this happened, the other guest was doing a normal busy network workload. So yes it could still be a storage problem, I don't now. There is no sign in dom0 of problems with the disks. Just over a day of uptime with 'options bnx2 "disable_msi=1"'. 2 days to go to set a new uptime record. Crash and burn. 'options bnx2 "disable_msi=1"' made no difference. Andrew, I get the impression you would be surprised if this problem was related to network use? Is there a reason for this? Looks similar to my bug with KVM: https://bugzilla.redhat.com/show_bug.cgi?id=563672 I have also this problem on servers with high disk IO. Created attachment 414109 [details]
guest panic on disk test
Well I finally got brave and ran some disk tests in production. The result was a panic in the guest within seconds. Attached is the stack trace. syslogd also wall'ed this:
Message from syslogd@turing at May 15 02:24:59 ...
kernel:------------[ cut here ]------------
Message from syslogd@turing at May 15 02:24:59 ...
kernel:invalid opcode: 0000 [#1] SMP
Message from syslogd@turing at May 15 02:24:59 ...
kernel:last sysfs file: /sys/devices/vbd-51712/block/xvda/stat
Message from syslogd@turing at May 15 02:24:59 ...
kernel:Stack:
Message from syslogd@turing at May 15 02:24:59 ...
kernel:Call Trace:
Message from syslogd@turing at May 15 02:24:59 ...
kernel:Code: 8b 38 48 8b 70 08 48 89 77 08 48 89 3e 48 89 00 48 89 40 08 66 ff 03 ff 14 25 c8 a7 6b 81 48 8b 42 f8 48 83 e0 fc 48 39 c3 74 04 <0f> 0b eb fe f0 80 62 f8 fe 48 89 cf ff 55 98 48 8b 45 80 8b 90
Timeout, server not responding.
The disk test was an fio test, run with "sudo fio four-threads-randio.fio" where four-threads-randio.fio contains:
; Four threads, two query, two writers.
[global]
rw=randread
size=256m
directory=/tmp/fio-testing/data
ioengine=libaio
iodepth=4
invalidate=1
direct=1
[bgwriter]
rw=randwrite
iodepth=32
[queryA]
iodepth=1
ioengine=mmap
direct=0
thinktime=3
[queryB]
iodepth=1
ioengine=mmap
direct=0
thinktime=5
[bgupdater]
rw=randrw
iodepth=16
thinktime=40
size=32m
Note that Dom0 logged no disk errors, or any error, in dmesg, syslog or "xm dmesg" on this crash.
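For anyone re-running the job file above, a usage sketch (the mkdir step is an assumption; fio expects the directory named in the job file to exist before the run):
mkdir -p /tmp/fio-testing/data
sudo fio four-threads-randio.fio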
The crash is comment #55 has no new information as far as I can tell. Running this test again, the system survives and runs without problem. Hope to get some testing time in the next few weeks were I can take the system offline and really hammer it. One odd thing about this system is that it has two problems that I am the only one pushing the reporting of. The other problem is bug #578674 (JBD: Spotted dirty metadata buffer) where most times setquota(8) is run, syslog reports messages like: kernel: JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 118115505). There's a risk of filesystem corruption in case of system crash. Occasionally this stack is written: Jun 1 09:55:05 turing kernel: ------------[ cut here ]------------ Jun 1 09:55:05 turing kernel: WARNING: at fs/quota/dquot.c:975 dquot_claim_space+0xa1/0x11f() Jun 1 09:55:05 turing kernel: Modules linked in: fuse inet_diag nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 dm_multipath uinput joydev xen_netfront xen_blkfront [last unloaded: scsi_wait_scan] Jun 1 09:55:05 turing kernel: Pid: 783, comm: flush-253:0 Not tainted 2.6.32.12-115.fc12.x86_64 #1 Jun 1 09:55:05 turing kernel: Call Trace: Jun 1 09:55:05 turing kernel: [<ffffffff81056370>] warn_slowpath_common+0x7c/0x94 Jun 1 09:55:05 turing kernel: [<ffffffff8105639c>] warn_slowpath_null+0x14/0x16 Jun 1 09:55:05 turing kernel: [<ffffffff81164ef3>] dquot_claim_space+0xa1/0x11f Jun 1 09:55:05 turing kernel: [<ffffffff811b6f66>] ext4_mb_mark_diskspace_used+0x272/0x31a Jun 1 09:55:05 turing kernel: [<ffffffff811ba0ed>] ext4_mb_new_blocks+0x1e8/0x423 Jun 1 09:55:05 turing kernel: [<ffffffff81216636>] ? blk_peek_request+0x191/0x1a7 Jun 1 09:55:05 turing kernel: [<ffffffff8129f691>] ? gnttab_free_grant_references+0x6a/0x73 Jun 1 09:55:05 turing kernel: [<ffffffff8118e4f8>] ext4_new_meta_blocks+0x52/0xaa Jun 1 09:55:05 turing kernel: [<ffffffff811b0d4a>] ? ext4_ext_find_extent+0x53/0x279 Jun 1 09:55:05 turing kernel: [<ffffffff811b0b07>] ext4_ext_new_meta_block+0x3b/0x42 Jun 1 09:55:05 turing kernel: [<ffffffff811b1a4d>] ext4_ext_insert_extent+0x79d/0xb1a Jun 1 09:55:05 turing kernel: [<ffffffff8110e3ad>] ? __kmalloc+0x14c/0x15e Jun 1 09:55:05 turing kernel: [<ffffffff811b0d4a>] ? ext4_ext_find_extent+0x53/0x279 Jun 1 09:55:05 turing kernel: [<ffffffff8100e87f>] ? xen_restore_fl_direct_end+0x0/0x1 Jun 1 09:55:05 turing kernel: [<ffffffff811b29ba>] ext4_ext_get_blocks+0xbf0/0x14b7 Jun 1 09:55:05 turing kernel: [<ffffffff8122a83b>] ? __lookup_tag+0x6d/0x12d Jun 1 09:55:05 turing kernel: [<ffffffff81195be8>] ext4_get_blocks+0x120/0x286 Jun 1 09:55:05 turing kernel: [<ffffffff81195e9e>] mpage_da_map_blocks+0x9e/0x31c Jun 1 09:55:05 turing kernel: [<ffffffff8119644b>] ext4_da_writepages+0x32f/0x535 Jun 1 09:55:05 turing kernel: [<ffffffff810dd537>] do_writepages+0x21/0x2a Jun 1 09:55:05 turing kernel: [<ffffffff8113a320>] writeback_single_inode+0xcc/0x1f1 Jun 1 09:55:05 turing kernel: [<ffffffff8113aac1>] writeback_inodes_wb+0x37a/0x411 Jun 1 09:55:05 turing kernel: [<ffffffff8100e1b5>] ? xen_force_evtchn_callback+0xd/0xf Jun 1 09:55:05 turing kernel: [<ffffffff8100e892>] ? check_events+0x12/0x20 Jun 1 09:55:05 turing kernel: [<ffffffff8113ac80>] wb_writeback+0x128/0x1a7 Jun 1 09:55:05 turing kernel: [<ffffffff8113aef1>] wb_do_writeback+0x13c/0x152 Jun 1 09:55:05 turing kernel: [<ffffffff8113af46>] bdi_writeback_task+0x3f/0xaf Jun 1 09:55:05 turing kernel: [<ffffffff810ec716>] ? 
bdi_start_fn+0x0/0xda Jun 1 09:55:05 turing kernel: [<ffffffff810ec787>] bdi_start_fn+0x71/0xda Jun 1 09:55:05 turing kernel: [<ffffffff810ec716>] ? bdi_start_fn+0x0/0xda Jun 1 09:55:05 turing kernel: [<ffffffff8107460a>] kthread+0x7f/0x87 Jun 1 09:55:05 turing kernel: [<ffffffff81012d6a>] child_rip+0xa/0x20 Jun 1 09:55:05 turing kernel: [<ffffffff81011f51>] ? int_ret_from_sys_call+0x7/0x1b Jun 1 09:55:05 turing kernel: [<ffffffff810126dd>] ? retint_restore_args+0x5/0x6 Jun 1 09:55:05 turing kernel: [<ffffffff81012d60>] ? child_rip+0x0/0x20 Jun 1 09:55:05 turing kernel: ---[ end trace 93666c5e61b00755 ]--- So two strange errors on the one system. I wonder if they are related? The most consistent way to trigger this bug (>50%) is to run, on a remote system: ssh turing dump 3uf - /dev/mapper/SYSTEM-root | /usr/bin/gzip -9 > backup.root.gz turing is the DomU system that crashes. The remote system is located on totally different hardware. Seems to me that the workload here is a lot of disk reads and a lot of network writes. I have finally been alble to reproduce this on completely different hardware while doing a similar command to comment #58. No stack traces as nothing logged to syslog and I didn't turn on hung_task_panic=1 for this kernel. This is different hardware to all he other reports in this bug. The host software is Centos 5.5 running 2.6.18-194.3.1.el5xen. I had a top running in a screen session of the single Fedora 12 domU: top - 07:26:30 up 2 days, 21:34, 1 user, load average: 24.99, 24.97, 24.53 Tasks: 149 total, 2 running, 146 sleeping, 0 stopped, 1 zombie Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 0.0%id, 99.8%wa, 0.0%hi, 0.0%si, 0.1%st Mem: 4060828k total, 4040752k used, 20076k free, 3486120k buffers Swap: 6291448k total, 320k used, 6291128k free, 134136k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 235 root 20 0 0 0 0 D 0.0 0.0 0:00.18 jbd2/dm-0-8 887 root 20 0 0 0 0 D 0.0 0.0 0:00.24 flush-253:0 1321 ntp 20 0 60716 2776 1904 D 0.0 0.1 0:00.27 ntpd 1384 postgres 20 0 183m 4988 4192 D 0.0 0.1 0:07.96 postmaster 1409 postgres 20 0 155m 1232 444 D 0.0 0.0 0:00.08 postmaster 1411 root 20 0 62464 1436 844 D 0.0 0.0 0:00.32 amd 1441 postgres 20 0 183m 1652 716 R 0.0 0.0 0:03.75 postmaster 1442 postgres 20 0 155m 1392 516 D 0.0 0.0 0:04.80 postmaster 1470 root 20 0 255m 5852 4212 D 0.0 0.1 0:00.37 abrtd 9076 root 20 0 54076 2588 1856 R 0.0 0.1 8:09.33 top 22487 root 20 0 23256 1948 144 D 0.0 0.0 0:14.18 dump 22488 root 20 0 23256 1948 144 D 0.0 0.0 0:14.07 dump 22489 root 20 0 23256 1948 144 D 0.0 0.0 0:14.47 dump 22530 root 20 0 109m 788 236 D 0.0 0.0 0:00.00 crond 22607 root 20 0 71840 2016 420 D 0.0 0.0 0:00.00 sendmail 22608 smmsp 20 0 59156 1628 408 D 0.0 0.0 0:00.00 sendmail 22609 root 20 0 109m 788 236 D 0.0 0.0 0:00.00 crond 22610 root 20 0 71840 2016 420 D 0.0 0.0 0:00.00 sendmail 22611 smmsp 20 0 59156 1628 408 D 0.0 0.0 0:00.00 sendmail 22612 root 20 0 109m 788 236 D 0.0 0.0 0:00.00 crond 22613 root 20 0 71840 2016 420 D 0.0 0.0 0:00.00 sendmail 22614 smmsp 20 0 59156 1628 408 D 0.0 0.0 0:00.00 sendmail 22615 root 20 0 109m 788 236 D 0.0 0.0 0:00.00 crond 22616 root 20 0 71840 2016 420 D 0.0 0.0 0:00.00 sendmail 22617 smmsp 20 0 59156 1628 408 D 0.0 0.0 0:00.00 sendmail The bug in comment #57 is un-realated. An upstream patch has been found for it. Created attachment 425530 [details]
dmesg of dom0
Attached is the dmesg output of the new dom0 that is seeing the domU crash. The output of lspci in dom0 is:
00:00.0 Host bridge: Intel Corporation E7230/3000/3010 Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation E7230/3000/3010 PCI Express Root Port
00:1c.0 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation N10/ICH7 Family USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #3 (rev 01)
00:1d.7 USB Controller: Intel Corporation N10/ICH 7 Family USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation N10/ICH7 Family SATA IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation N10/ICH 7 Family SMBus Controller (rev 01)
02:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
06:05.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
Created attachment 425531 [details]
A sysrq l stack of domU
Tried to do an "xm sysrq N w" to show stacks of blocked tasks on hung domU. Nothing was displayed. An "xm sysrq N l" showed the attached stack. Looks a bit different to the other stacks though I'm not sure if it's useful.
Anything in 'xm dmesg' on this new machine? Can you attach /proc/cpuinfo for the machines on which you're seeing the problem? Created attachment 425719 [details]
/proc/cpuinfo of dom0 orginal system
Created attachment 425720 [details]
/proc/cpuinfo of dom0 new system
Nothing extraordinary in 'xm dmesg' of new machine. I've now restarted dom0 with: kernel /xen.gz-2.6.18-194.3.1.el5 loglvl=all guest_loglvl=all debug_stack_lines=80 although that did not increase useful logging on the original system. Could this problem is related to timers or clock source in any way? In a bug that I think is similar to this, #526627, the clock source was the suspected cause for awhile. Also comment #54 above reports a very similar problem with KVM in bug #563672 My domU once logged: Jun 11 01:12:51 turing kernel: hrtimer: interrupt took 1645273 ns out of the blue. That's the only time I have ever seen that message. DomU currently reports: $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource xen tsc $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource xen I'm trying clocksource=tsc now in my second system. Early on in this bug I reported setting clocksource to jiffies in an earlier Fedora kernel and also in bug #526627 I reported booting with this. I got the characteristic hanging processes very quickly. clocksource=tsc does not fix this problem. Tried it on my main unstable system and it hung once again during backups. Also, the host was running kernel-xen-2.6.18-204.el5.x86_64 Just noticed that this bug is referenced here: http://lists.xensource.com/archives/html/xen-devel/2010-06/msg00590.html No solution there either. One comment from Rick Boone was: > I booted into 2.6.32.12 yesterday, but I'm still getting a hung task crash. (Initially, I got a soft lockup crash on one of the cpu's, after switching to 2.6.32.12, but it seems I was able to resolve that by switching the kernel to tickless) After my next crash, probably during backups tonight, I'll try booting with: nohz=off highres=off which I think takes me *AWAY* from a tickless kernel. (In reply to comment #68) > clocksource=tsc does not fix this problem. > > Tried it on my main unstable system and it hung once again during backups. > > Also, the host was running kernel-xen-2.6.18-204.el5.x86_64 When you set clocksource=tsc on the command line, did you check to see if it was accepted? There are some bugs that make the clocksource= option get ignored... (In reply to comment #70) > (In reply to comment #68) > > clocksource=tsc does not fix this problem. > When you set clocksource=tsc on the command line, did you check to see if it > was accepted? There are some bugs that make the clocksource= option get > ignored... Yes I'm pretty sure I checked /sys/devices/system/clocksource/clocksource0/current_clocksource and also looking back in my console logs I see in the boot messages: kernel: Switching to clocksource tsc (In reply to comment #69) > After my next crash, probably during backups tonight, I'll try booting with: > > nohz=off highres=off > > which I think takes me *AWAY* from a tickless kernel. I can now report that this did not solve the problem either. Same characteristic crash with nohz=off highres=off Although, comment #70 made me wonder how do I check that those options "took"? I can see them in the command line boot messages but what on the system indicates that I am using those options? I have no ideas of what to try next for this bug. I think I've reproduced this bug on a very recent rhel6 kernel. I still haven't got all the ingredients nailed down, but I'm suspicious it had something to do with an ext3 file system that I had on a file-backed blkback device. I still don't think I've ever seen this problem with ext4 though. 
Which fs is being used on the system having the problems here? If it's ext3, could you try to find a way to try ext4? I know that's asking a lot if it's a production system.

Andrew

Ah, nevermind. Just got it to reproduce with ext4. Only notes I have right now is that when I mount these file systems I get this log to the console

JBD: barrier-based sync failed on xvdb - disabling barriers

and the first tasks to start hanging (with ext3) are

kjournald, flush-202 - not always that order (reproduced 3 times)

and now on the ext4 reproduced case it was

jbd2/xvda2-8

My reproducer is currently to simply build a kernel on these file systems and wait.

Andrew

(In reply to comment #74)
> Ah, nevermind. Just got it to reproduce with ext4. Only notes I have right now
> is that when I mount these file systems I get this log to the console
All my crashes have been with ext4 (with or without quotas enabled), except for the crash mentioned in comment #34, which was ext3 on a /boot partition.

> JBD: barrier-based sync failed on xvdb - disabling barriers
Hmm, this message is familiar for some reason but I can't find it in any of my current logs.

> and the first tasks to start hanging (with ext3) are
>
> kjournald, flush-202 - not always that order (reproduced 3 times)
>
> and now on the ext4 reproduced case it was
>
> jbd2/xvda2-8
jbd2/dm* for my ext4 filesystems. Any one could go first, but the most likely seems to be /var. All systems that have had this problem for me have been on one or more LVs of the host system. I have not tried image files.

> My reproducer is currently to simply build a kernel on these file systems and
> wait.
My feeling on the workload that reproduces this is that it has to be stop/start disk activity. Just straight heavy I/O rarely hits this problem in my testing. The network also figures prominently in the mix. After the jbd2 hang, the first processes to hang are usually things like httpd, nscd, smbd and imap. Also, backups with dump across the network seem to trigger the problem reasonably often.

I'll put my test system (which has seen the problem only twice) into a loop of kernel building and see what happens.

The production system that sees the problem often is an LTSP and nxclient server. So that's lots of desktop users using browsers, compilers, editors, etc. Lots of stop/start network and disk I/O.

(In reply to comment #75)
> (In reply to comment #74)
> > > JBD: barrier-based sync failed on xvdb - disabling barriers
> Hmm, this message is familiar for some reason but I can't find it in any of my
> current logs.
I saw a discussion of this message on the ext4 mailing list. I don't get it in my logs as far as I can tell. http://marc.info/?t=128017340300003&r=1&w=2

> > My reproducer is currently to simply build a kernel on these file systems and
> > wait.
> I'll put my test system (which has seen the problem only twice) into a loop of
> kernel building and see what happens.
My test system, building the kernel, went into this state after 2 days. However, it did it when doing a backup (dump) to a remote system again and/or sending an email.
top - 07:29:45 up 4 days, 14:24, 3 users, load average: 25.99, 25.96, 25.35
Tasks: 153 total, 1 running, 152 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 0.0%id, 99.9%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4060836k total, 2080144k used, 1980692k free, 128332k buffers
Swap: 6291448k total, 5444k used, 6286004k free, 374432k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
20361 root     20  0 14908 1172  856 R  0.3  0.0 4:18.03 top
  237 root     20  0     0    0    0 D  0.0  0.0 6:29.73 jbd2/dm-0-8
  825 root     20  0     0    0    0 D  0.0  0.0 1:36.20 flush-253:0
  919 root     20  0  8456  908  580 D  0.0  0.0 0:00.10 patch
  920 root     20  0  170m 2644 1680 D  0.0  0.1 0:00.00 crond
  923 root     20  0  170m 2644 1680 D  0.0  0.1 0:00.00 crond
  926 root     20  0  111m  976  404 D  0.0  0.0 0:00.00 crond
  927 smmsp    20  0 59152 1656  420 D  0.0  0.0 0:00.00 sendmail
  928 root     20  0 71836 2040  432 D  0.0  0.1 0:00.00 sendmail
  929 root     20  0  115m 1112  524 D  0.0  0.0 0:00.00 crond
  930 smmsp    20  0 59152 1656  420 D  0.0  0.0 0:00.00 sendmail
  932 root     20  0 71836 2040  432 D  0.0  0.1 0:00.00 sendmail
  933 root     20  0  117m 1136  540 D  0.0  0.0 0:00.00 crond
  934 smmsp    20  0 59152 1656  420 D  0.0  0.0 0:00.00 sendmail
  935 root     20  0 71836 2040  432 D  0.0  0.1 0:00.00 sendmail
  936 root     20  0  128m 1244  612 D  0.0  0.0 0:00.00 crond
  937 smmsp    20  0 59152 1656  420 D  0.0  0.0 0:00.00 sendmail
  938 root     20  0 71836 2044  432 D  0.0  0.1 0:00.00 sendmail
 1218 ntp      20  0 60708 1736  868 D  0.0  0.0 0:00.33 ntpd
 1281 postgres 20  0  183m 3416 2776 D  0.0  0.1 0:01.25 postmaster
 1373 postgres 20  0  183m 1068  396 D  0.0  0.0 0:01.44 postmaster
 1374 postgres 20  0  155m  908  268 D  0.0  0.0 0:01.61 postmaster
32393 root     20  0 23256 1920  144 D  0.0  0.0 0:00.39 dump
32394 root     20  0 23256 1920  144 D  0.0  0.0 0:00.31 dump
32395 root     20  0 23256 1920  144 D  0.0  0.0 0:00.29 dump

(In reply to comment #76)
> > > JBD: barrier-based sync failed on xvdb - disabling barriers
> I don't get it in my logs as far as I can tell.
Er, yes I do. It has appeared in some of the ancient logs of my test system.

In 5 days of rebuilding the kernel, my test system crashed once as described in comment #76. So it is still pretty tough to reproduce this in test. My production system crashed 13 times during the same period.

One observation. The crash in comment #34 was a pivotal moment. It was the one that got this bug moved from a Fedora kernel (guest) to the enterprise kernel (the host). Do we really think this problem is in the host kernel or even Xen? KVM guests seem to also see something similar. Comment #34 was also interesting in that there was little network activity on the guest that crashed. My main crashing guest was also running on the host at the same time but it didn't crash (then).

Right, good question (where does this bug belong). If the problem on KVM hosts really seems to be the same (I'll try to reproduce that soon), then we can focus more on the guest. At this time though, I still want to dig deeper on what I saw in comment 36. I'll try to get to this asap, but I still have some other issues to knock off the TODO list first, sorry.

Drew

We have experienced the same problem: for high-CPU tasks, the PV F12 DomU works well; for high-I/O tasks, the F12 DomU crashes and the Dom0 also crashes. One oops message:

Sep 2 21:27:33 node04 kernel: INFO: task ext4-dio-unwrit:990 blocked for more than 120 seconds.
Sep 2 21:27:33 node04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 2 21:27:33 node04 kernel: ext4-dio-unwr D ffff8800157890c0 0 990 2 0x00000000 Sep 2 21:27:33 node04 kernel: ffff880012e39d40 0000000000000246 0000000003f44d9c 0000000000006ad1 Sep 2 21:27:33 node04 kernel: ffff8800124e44d8 ffff8800124e44a0 ffff880012e39fd8 ffff880012e39fd8 Sep 2 21:27:33 node04 kernel: ffff8800124e4870 0000000000007d30 000000000000b0c0 ffff8800124e4870 Sep 2 21:27:33 node04 kernel: Call Trace: Sep 2 21:27:33 node04 kernel: [<ffffffff8102d986>] ? set_next_entity+0x28/0x78 Sep 2 21:27:33 node04 kernel: [<ffffffff8140ae01>] __mutex_lock_common+0x11e/0x18d Sep 2 21:27:33 node04 kernel: [<ffffffff8140ae89>] __mutex_lock_slowpath+0x19/0x1b Sep 2 21:27:33 node04 kernel: [<ffffffff8140af9a>] mutex_lock+0x31/0x4b Sep 2 21:27:33 node04 kernel: [<ffffffff81157b29>] ext4_end_aio_dio_work+0x2c/0x6d Sep 2 21:27:33 node04 kernel: [<ffffffff81053705>] worker_thread+0x1c3/0x24f Sep 2 21:27:33 node04 kernel: [<ffffffff81157afd>] ? ext4_end_aio_dio_work+0x0/0x6d Sep 2 21:27:33 node04 kernel: [<ffffffff81057b54>] ? autoremove_wake_function+0x0/0x39 Sep 2 21:27:33 node04 kernel: [<ffffffff81053542>] ? worker_thread+0x0/0x24f Sep 2 21:27:33 node04 kernel: [<ffffffff81057711>] kthread+0x7f/0x87 Sep 2 21:27:33 node04 kernel: [<ffffffff8100ab2a>] child_rip+0xa/0x20 Sep 2 21:27:33 node04 kernel: [<ffffffff81057692>] ? kthread+0x0/0x87 Sep 2 21:27:33 node04 kernel: [<ffffffff8100ab20>] ? child_rip+0x0/0x20 Sep 2 21:27:33 node04 kernel: INFO: task tapdisk:18044 blocked for more than 120 seconds. Sep 2 21:27:33 node04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 2 21:27:33 node04 kernel: tapdisk D ffff8800157890c0 0 18044 1 0x00000080 Sep 2 21:27:33 node04 kernel: ffff88006cc29738 0000000000000286 ffff88006cc29708 ffff88007e3d4cb0 Sep 2 21:27:33 node04 kernel: ffff88007e3d4cb0 ffff880000000002 ffff88006cc29fd8 ffff88006cc29fd8 Sep 2 21:27:33 node04 kernel: ffff88004064c870 0000000000007d30 000000000000b0c0 ffff88004064c870 Sep 2 21:27:33 node04 kernel: Call Trace: Sep 2 21:27:33 node04 kernel: [<ffffffff81189bab>] do_get_write_access+0x200/0x40e Sep 2 21:27:33 node04 kernel: [<ffffffff81057b8d>] ? wake_bit_function+0x0/0x33 Sep 2 21:27:33 node04 kernel: [<ffffffff8118fc7e>] ? jbd2_journal_add_journal_head+0xbf/0x13d Sep 2 21:27:33 node04 kernel: [<ffffffff81189eff>] jbd2_journal_get_write_access+0x2b/0x44 Sep 2 21:27:33 node04 kernel: [<ffffffff811538ab>] ? ext4_read_block_bitmap+0x54/0x29c Sep 2 21:27:33 node04 kernel: [<ffffffff81176639>] __ext4_journal_get_write_access+0x56/0x60 Sep 2 21:27:33 node04 kernel: [<ffffffff811791e2>] ext4_mb_mark_diskspace_used+0x72/0x333 Sep 2 21:27:33 node04 kernel: [<ffffffff81178a48>] ? ext4_mb_initialize_context+0x7d/0x18d Sep 2 21:27:33 node04 kernel: [<ffffffff8117bfc8>] ext4_mb_new_blocks+0x1e8/0x42c Sep 2 21:27:33 node04 kernel: [<ffffffff811736c0>] ? ext4_ext_find_extent+0x130/0x27f Sep 2 21:27:33 node04 kernel: [<ffffffff81175945>] ext4_ext_get_blocks+0x1289/0x14e0 Sep 2 21:27:33 node04 kernel: [<ffffffff81021dd6>] ? xen_multicall_flush+0x2c/0x2e Sep 2 21:27:33 node04 kernel: [<ffffffff810be586>] ? unmap_vmas+0x663/0x7b8 Sep 2 21:27:33 node04 kernel: [<ffffffff81188f03>] ? 
start_this_handle+0x3dd/0x402 Sep 2 21:27:33 node04 kernel: [<ffffffff8115b011>] ext4_get_blocks+0x152/0x30f Sep 2 21:27:33 node04 kernel: [<ffffffff8115b246>] ext4_get_block_dio_write+0x78/0xb6 Sep 2 21:27:33 node04 kernel: [<ffffffff8110bc19>] __blockdev_direct_IO+0x521/0xae0 Sep 2 21:27:33 node04 kernel: [<ffffffff81159f76>] ext4_direct_IO+0x1b6/0x247 Sep 2 21:27:33 node04 kernel: [<ffffffff8115b1ce>] ? ext4_get_block_dio_write+0x0/0xb6 Sep 2 21:27:33 node04 kernel: [<ffffffff81157b6a>] ? ext4_end_io_dio+0x0/0x85 Sep 2 21:27:33 node04 kernel: [<ffffffff810a7cf6>] generic_file_direct_write+0xed/0x16d Sep 2 21:27:33 node04 kernel: [<ffffffff810a7f0c>] __generic_file_aio_write+0x196/0x286 Sep 2 21:27:33 node04 kernel: [<ffffffff810a805f>] generic_file_aio_write+0x63/0xad Sep 2 21:27:33 node04 kernel: [<ffffffff811549af>] ? ext4_file_write+0x0/0x95 Sep 2 21:27:33 node04 kernel: [<ffffffff81154a3d>] ext4_file_write+0x8e/0x95 Sep 2 21:27:33 node04 kernel: [<ffffffff810d6d97>] ? kmem_cache_alloc+0x47/0x140 Sep 2 21:27:33 node04 kernel: [<ffffffff81114e52>] aio_rw_vect_retry+0x8e/0x19d Sep 2 21:27:33 node04 kernel: [<ffffffff81114dc4>] ? aio_rw_vect_retry+0x0/0x19d Sep 2 21:27:33 node04 kernel: [<ffffffff811168e7>] aio_run_iocb+0x79/0x11f Sep 2 21:27:33 node04 kernel: [<ffffffff8111748d>] sys_io_submit+0x5db/0x6c6 Sep 2 21:27:33 node04 kernel: [<ffffffff8100a078>] system_call_fastpath+0x16/0x1b Sep 2 21:27:33 node04 kernel: [<ffffffff8100a010>] ? system_call+0x0/0x52 Another one: INFO: task tapdisk:12896 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. tapdisk D ffff8800157890c0 0 12896 1 0x00000080 ffff88007ce871a8 0000000000000282 0000000000000000 ffff880001c62dc0 0000001001080001 ffff8800586e27e0 ffff88007ce87fd8 ffff88007ce87fd8 ffff880001c63190 0000000000007d30 000000000000b0c0 ffff880001c63190 Call Trace: [<ffffffff8140a7fd>] io_schedule+0x73/0xb5 [<ffffffff811d624c>] get_request_wait+0xee/0x17c [<ffffffff81057b54>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811cddf4>] ? elv_merge+0x167/0x19f [<ffffffff811d65c2>] __make_request+0x2e8/0x3fb [<ffffffff811d4c1d>] generic_make_request+0x2c8/0x321 [<ffffffff810a9192>] ? mempool_alloc+0x6c/0x11e [<ffffffff81026966>] ? __wake_up_common+0x4e/0x84 [<ffffffff811d4d59>] submit_bio+0xe3/0x100 [<ffffffff811048c3>] submit_bh+0xf4/0x117 [<ffffffff811063b4>] __block_write_full_page+0x1d8/0x2b3 [<ffffffff8115bf23>] ? noalloc_get_block_write+0x0/0x60 [<ffffffff81105d70>] ? end_buffer_async_write+0x0/0x1ab [<ffffffff81105d70>] ? end_buffer_async_write+0x0/0x1ab [<ffffffff8115bf23>] ? noalloc_get_block_write+0x0/0x60 [<ffffffff81106518>] block_write_full_page_endio+0x89/0x95 [<ffffffff81106539>] block_write_full_page+0x15/0x17 [<ffffffff81158866>] ext4_writepage+0x2ac/0x2f0 [<ffffffff810b16f8>] pageout+0x137/0x1fe [<ffffffff810b1bba>] shrink_page_list+0x265/0x483 [<ffffffff810dae9a>] ? mem_cgroup_del_lru+0x39/0x3b [<ffffffff810b0e8f>] ? isolate_pages_global+0x1ba/0x201 [<ffffffff81008ad3>] ? __switch_to+0x452/0x56f [<ffffffff810b21a9>] shrink_inactive_list+0x3d1/0x6d2 [<ffffffff81057b4f>] ? wake_up_bit+0x25/0x2a [<ffffffff81036c87>] ? try_to_wake_up+0x2fc/0x30e [<ffffffff810b2825>] shrink_zone+0x37b/0x44b [<ffffffff810b2b17>] do_try_to_free_pages+0x222/0x399 [<ffffffff810b2d76>] try_to_free_pages+0x6e/0x70 [<ffffffff810b0cd5>] ? 
isolate_pages_global+0x0/0x201 [<ffffffff810ac912>] __alloc_pages_nodemask+0x3f3/0x63c [<ffffffff810d49dc>] alloc_slab_page+0x1e/0x20 [<ffffffff810d526e>] __slab_alloc+0x16b/0x488 [<ffffffff8110b812>] ? __blockdev_direct_IO+0x11a/0xae0 [<ffffffff8110b812>] ? __blockdev_direct_IO+0x11a/0xae0 [<ffffffff810d5b7f>] kmem_cache_alloc_notrace+0x7d/0xf2 [<ffffffff8110b812>] __blockdev_direct_IO+0x11a/0xae0 [<ffffffff810aeee4>] ? pagevec_lookup+0x22/0x2b [<ffffffff810afff5>] ? invalidate_inode_pages2_range+0x27e/0x2b7 [<ffffffff81159f76>] ext4_direct_IO+0x1b6/0x247 [<ffffffff8115b1ce>] ? ext4_get_block_dio_write+0x0/0xb6 [<ffffffff81157b6a>] ? ext4_end_io_dio+0x0/0x85 [<ffffffff810a7cf6>] generic_file_direct_write+0xed/0x16d [<ffffffff810a7f0c>] __generic_file_aio_write+0x196/0x286 [<ffffffff810f0803>] ? set_fd_set+0x50/0x5b [<ffffffff810f1858>] ? core_sys_select+0x1d6/0x20a [<ffffffff810a805f>] generic_file_aio_write+0x63/0xad [<ffffffff811549af>] ? ext4_file_write+0x0/0x95 [<ffffffff81154a3d>] ext4_file_write+0x8e/0x95 [<ffffffff810d6d97>] ? kmem_cache_alloc+0x47/0x140 [<ffffffff81114e52>] aio_rw_vect_retry+0x8e/0x19d [<ffffffff81114dc4>] ? aio_rw_vect_retry+0x0/0x19d [<ffffffff811168e7>] aio_run_iocb+0x79/0x11f [<ffffffff8111748d>] sys_io_submit+0x5db/0x6c6 [<ffffffff8100a078>] system_call_fastpath+0x16/0x1b [<ffffffff8100a010>] ? system_call+0x0/0x52

A brief list of the hardware and configuration:

Dom0's hardware platform:
Motherboard: INTEL S5500BC S5500 Quad Core Xeon Server Board
CPU: 2 x Intel Quad Core Xeon E5520 2.26G (5.86GT/sec,8M,Socket 1366)
Memory: 8 x Kingston DDR-3 1333MHz 4GB ECC REG. CL9 DIMM w/Parity & Thermal Sensor
HD: 4 x WD WD10EARS 1 TB, SATA II 3Gb/s, 64 MB Cache

Dom0: F12, with kernel 2.6.32.13 + 2.6.32 Xen patches v2 from http://gentoo-xen-kernel.googlecode.com/files/xen-patches-2.6.32-2.tar.bz2

Dom0 configuration:
Processor type and features --->
  [*] Symmetric multi-processing support
  [*] Support sparse irq numbering
  [*] Enable Xen compatible kernel
  Preemption Model (No Forced Preemption (Server)) --->
Device Drivers --->
  XEN --->
    [*] Privileged Guest (domain 0)
    <*> Backend driver support (NEW)
    <*> Block-device backend driver (NEW)
    <*> Block-device tap backend driver (NEW)
    < > Block-device tap backend driver 2 (NEW)
    <*> Network-device backend driver (NEW)
    (8) Maximum simultaneous transmit requests (as a power of 2) (NEW)
    [ ] Pipelined transmitter (DANGEROUS) (NEW)
    < > Network-device loopback driver (NEW)
    < > PCI-device backend driver (NEW)
    < > TPM-device backend driver (NEW)
    <M> SCSI backend driver (NEW)
    <M> USB backend driver (NEW)
    <M> Block-device frontend driver
    <M> Network-device frontend driver
    <M> Network-device frontend driver acceleration for Solarflare NICs (NEW)
    <M> SCSI frontend driver (NEW)
    <M> USB frontend driver (NEW)
    [*] Taking the HCD statistics (for debug) (NEW)
    [ ] HCD suspend/resume support (DO NOT USE) (NEW)
    <*> User-space granted page access driver (NEW)
    <*> Framebuffer-device frontend driver (NEW)
    <*> Keyboard-device frontend driver (NEW)
    [*] Disable serial port drivers (NEW)
    <*> Export Xen attributes in sysfs (NEW)
    (256) Number of guest devices (NEW)
    Xen version compatibility (no compatibility code) --->
    [*] Place shared vCPU info in per-CPU storage (NEW)

DomU: F12 with pv_ops kernel: 2.6.32.19-163.fc12.x86_64

The details of the hardware and kernel configuration are here: http://pkill.info/b/2405/setting-up-stable-xen-dom0-with-fedora-xen-3-4-3-with-xenified-linux-kernel-2-6-32-13-in-fedora-12/

Our solution now: Disable the hard disk cache by:

hdparm -W0 /dev/sdX
After disabling the hard disk cache, the VMs work. When we do a middle-sized intensive I/O test (processing 20GB of data simultaneously on 8 VMs on top of 2 physical machines), the VMs and Dom0 don't crash. We will run a full-size experiment (about 800GB of data on these 8 VMs) and see whether it works.

Some recent threads that look like this problem:

http://forums.citrix.com/thread.jspa?threadID=272708 (Xen block devices stop responding in guest)

That led to this kernel mailing list thread:

http://lkml.org/lkml/2010/9/1/178 (I/O scheduler deadlocks on Xen virtual block devices)

The main points from the kernel thread:
- The problem is probably in the Xen I/O stack and not the guest scheduler.
- These patches might be the real fixes:
  xen: use percpu interrupts for IPIs and VIRQs
  xen: handle events as edge-triggered

A work-around might be to disable irqbalance in the guest. This seems to have been the case so far in the citrix thread mentioned above. I'm now running my main crashing guest without irqbalance.

We have done experiments on LVM-backed DomUs. The DomUs having this problem are as in comment 80, and the other configurations (hardware, the Dom0, DomU's kernel, etc.) are the same except the backing VBD (moved from file / tap:aio to LVM-backed). The 8 LVM-backed DomUs work well so far with around 5GB of writes and 5GB of reads for each DomU. Hope this piece of info can help.

Hi Norman,

Great work tracking these mail threads and potentially finding a solution. I did a quick test on my machine. First I confirmed I could still reproduce it with a kernel compile loop with irqbalance on. It did reproduce in something less than 30 minutes. Then I turned off irqbalance and ran the kernel compile loop a few hours. Then I turned irqbalance back on to confirm I could again trigger it quickly, but I still haven't seen it in almost an hour, so currently my testing isn't super conclusive. I'm anxious to hear your results, since I know you've been plagued with it frequently.

Drew

Turning off irqbalance is looking very promising. I have not had a crash since doing so and the system has been through some significant workload.

Since the 23rd of March this year I've been recording (most) of the times of crashes. Rough stats are:

Shortest uptime: 6.683 minutes
Longest uptime: 8.885 days
Number of Crashes: 210
Crashes/day: 0.845

The 8 days was set during a period of very little activity. As I write this, 4 days of uptime has passed with a normal to large workload.

One note: during a particularly high load during this uptime, the message mentioned in comment #67 made a return:

Sep 20 11:42:55 turing kernel: hrtimer: interrupt took 940625 ns

On the disappointing side, I've still not been able to create a test system that will reliably crash.

I think we can be confident that irqbalance off does the trick. I ran the kernel compile loop with irqbalance off all weekend and it was still running this morning. I turned irqbalance back on and was able to reproduce within an hour or so.

Next up: backport the patches pointed to and try another round with irqbalance on.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
A work-around for this issue is to turn off irqbalance, i.e.
chkconfig --level 2345 irqbalance off
service irqbalance stop

I've been running the kernel compile loop on a RHEL6 guest patched with the two patches pointed to in comment 81 for over 30 hours (with irqbalance on, of course). I'm pretty confident that they fix the issue and will post them.

The "handle events as edge-triggered" patch, which is probably the one that truly fixes the issue, is already in Linus' tree, but only as of .36-rc3, which means it won't be in Fedora until F15.

I just saw that the 2 patches also made it into 2.6.32.22 and 2.6.35.5, so they should make it into Fedora 12 if the stable updates are pushed there. Fedora 13 is currently 2.6.34.7 so I'm not sure about that. See:

http://lwn.net/Articles/406311/
http://lwn.net/Articles/406312/

(In reply to comment #91)
> I just saw that the 2 patches also made it into 2.6.32.22 and 2.6.35.5
Ah, nice. Yeah, I forgot to check the stable trees to see what got pulled in by them.

This bug started out in the Fedora 12 kernel, moved around some, and ended up in the RHEL 6 kernel. Should this bug be cloned and changed to the F12 kernel to encourage the stable 2.6.32.22 updates to be applied there?

Right, agreed. Instead of cloning (since this bug is huge), I've just created a new bug pointing to the patches, bug 636534.

(In reply to comment #93)
> Should this bug be cloned and changed to the F12 kernel to encourage the stable
> 2.6.32.22 updates to be applied there?
They're already applied.

QA,

This patch needs some wide-spread testing on different machines. I saw no problems with it on my machine when testing, but there's been a report that it causes a boot crash. See Chuck's comment in bug 636534 comment 3.

Thanks,
Drew

(In reply to comment #96)
> This patch needs some wide-spread testing on different machines. I saw no
> problems with it on my machine when testing, but there's been a report that it
> causes a boot crash. See Chuck's comment in bug 636534 comment 3.
Seems there is a typo in some instances of the patch. See bug 636534 comment 4.

(In reply to comment #97)
> Seems there is a typo in some instances of the patch. See bug 636534 comment 4.
Cool, I just saw the message that there was a typo on xen-devel too. The patch I've posted doesn't have it, so we're good to go.

Thanks,
Drew

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

*** Bug 587265 has been marked as a duplicate of this bug. ***

Patch(es) available on kernel-2.6.32-83.el6

Just to confirm before QA acks this bug: could comment #90 be used as a case to verify this bug?

(In reply to comment #103)
> Just to confirm before QA ack this bug: Could comment #90 be used as a case to
> verify this bug?
It worked for me. Just using the system under load seemed to accelerate the chances of hitting it (with irqbalance on). However, it seems to be a bit different for each report. Try booting an unpatched kernel on a test machine, start a kernel compile loop with irqbalance on, and then just leave it. If it doesn't hit the bug within a day or two, then I'm not sure what to say, other than trying another machine.
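For anyone setting up that verification run, here is a minimal sketch of the kernel compile loop described above, with a check for D-state tasks. The kernel tree path and job count are assumptions; any sustained build on the affected file system should do.

  # the bug only reproduced with irqbalance running, so make sure it is on
  service irqbalance start

  cd /usr/src/kernels/linux-2.6.32   # hypothetical path to an unpacked kernel tree
  make defconfig > /dev/null         # assumes no .config is present yet
  while true; do
      make clean > /dev/null
      make -j4 > /dev/null 2>&1
      # report tasks stuck in uninterruptible sleep (D state)
      echo "$(date): $(ps -eo state= | grep -c '^D') tasks in D state"
      ps -eo pid,state,wchan:30,comm | awk '$2 == "D"'
  done

On an unpatched kernel the D-state count should start climbing within a day or two, much like the top output earlier in this report; on kernel-2.6.32-83.el6 the loop should keep running.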
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,3 +1 @@
-A work-around for this issue is to turn off irqbalance, i.e.
+In some cases, under a small system load involving some I/O operation, processes started to lock up in the D state (that is, became unresponsive). The system load could in some cases climb steadily. This was due to the way the event channel IRQ (Interrupt Request) was set up. Xen events behave like edge-triggered IRQs; however, the kernel was setting them up as level-triggered IRQs. As a result, any action using Xen event channels could lock up a process in the D state. With this update, the handling has been changed from level-triggered to edge-triggered IRQs and processes no longer lock up in the D state.
-chkconfig --level 2345 irqbalance off
-service irqbalance stop

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
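As a quick check on a guest, before reopening this report, it may be worth confirming that the running kernel is at or past the fixed build. The changelog grep below is only a heuristic; the exact changelog wording is an assumption.

  uname -r        # the fix is in kernel-2.6.32-83.el6 and later
  rpm -q kernel   # installed kernel packages
  rpm -q --changelog kernel | grep -i edge | head   # look for the edge-triggered event channel fix (wording may differ)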
Created attachment 380486 [details]
A terminal window with top running shows the progressively degrading state.

Description of problem:
Sometimes, under a small load, processes start locking up in D state. Load climbs steadily. Reboot required.

Version-Release number of selected component (if applicable):
Happens with both:
kernel-2.6.31.6-166.fc12.x86_64
kernel-2.6.31.9-174.fc12.x86_64

How reproducible:
Frequency is roughly every 24 hours. Happens when several LTSP sessions are started, or sometimes during logwatch processing, or during rsync backups.

Additional info:
System is a large Xen DomU. Dom0 is able to destroy and re-create the system.
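One loose end from the clocksource/nohz discussion above was how to confirm that boot options such as clocksource=tsc, nohz=off and highres=off actually took effect. A minimal check, assuming a pv_ops guest with the usual sysfs layout, might look like:

  cat /proc/cmdline    # the options must at least appear here
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  dmesg | grep -i 'switching to clocksource'
  dmesg | grep -iE 'nohz|tick'   # the exact wording of the NO_HZ/tick boot messages varies by kernel

If current_clocksource still reads xen, or the "Switching to clocksource" line names a different source, the clocksource= option was ignored.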