Bug 550724
Summary: | xen PV guest kernel 2.6.32 processes lock up in D state | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Norman Gaywood <ngaywood> | |
Component: | kernel | Assignee: | Andrew Jones <drjones> | |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | |
Severity: | high | Docs Contact: | ||
Priority: | low | |||
Version: | 6.0 | CC: | ahecox, anton, bhubbard, dhoward, dougsland, drjones, eric, gansalmon, itamar, jforbes, kernel-maint, mishu, mshao, qcai, qwan, redhat, vevroux, xen-maint, yuzhang | |
Target Milestone: | rc | Keywords: | ZStream | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | kernel-2.6.32-83.el6 | Doc Type: | Bug Fix | |
Doc Text: |
In some cases, under a small system load involving some I/O operations, processes started to lock up in the D state (that is, became unresponsive), and the system load could climb steadily. This was due to the way the event channel IRQ (Interrupt Request) was set up: Xen events behave like edge-triggered IRQs, but the kernel was setting them up as level-triggered IRQs. As a result, any action using Xen event channels could lock up a process in the D state. With this update, the handling has been changed from level-triggered to edge-triggered IRQs, and processes no longer lock up in the D state.
|
Story Points: | --- | |
Clone Of: | ||||
: | 733672 (view as bug list) | Environment: |
Xen DomU running under Centos 5.4, kernel 2.6.18-164.6.1.el5xen
|
|
Last Closed: | 2011-05-23 20:18:58 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 523117, 661737 | |||
Attachments: |
Created attachment 380488 [details]
dmesg captured from Dom0 console
Created attachment 380490 [details]
partial sysreq output
For a kernel-2.6.31.6-166.fc12.x86_64 I was able to get a partial sysreq output.
Working on getting more detail. Sometimes the sysreq keys do not seem to work.
Created attachment 381157 [details]
sysreq info for 2.6.30.10-105.fc11.x86_64
I installed the F11 kernel on my F12 Xen domU PV system. It lasted for several days, then the load started climbing again.
This time I could look at /var/log/messages without hanging and send sysreq from the dom0.
As a side note. Our site is currently in the middle of a large network upgrade. So we often have the network go away under us and come back. I get the impression that these network problems can sometimes trigger the problem described here. httpd is amongst the first to go to D state and stay there.
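When processes start sticking in D state, a quick way to see which tasks are blocked, and roughly where, is something like the following sketch (standard procps and sysrq tools; the awk filter just matches the D state flag, and the sysrq write needs root):
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
echo w > /proc/sysrq-trigger    # kernel-side dump of blocked task stacks to dmesg/syslog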
2.6.30.10-105.fc11.x86_64 (the most stable for me) crashed again, probably while doing an automated backup. After restarting, I tried this:
echo jiffies > /sys/devices/system/clocksource/clocksource0/current_clocksource
and processes started going into D state straight away. The clock also did not seem to be updating. The previous value of current_clocksource was xen. I don't know what I'm doing here, so I don't know if this is related.
Created attachment 387042 [details]
syslog boot of kernel-2.6.32.6-36.fc12.x86_64
Just back from holiday and working on this again.
I installed the new kernel-2.6.32.6-36.fc12.x86_64 from koji, and processes started going into D state as soon as a large rsync backup started.
Created attachment 387043 [details]
top output of kernel-2.6.32.6-36.fc12.x86_64
Hi Norman,
Can you attach your guest config and the hardware details of your host? Also, check if 'xm dmesg' has anything interesting in it. What release is your host? RHEL 5.4?
Thanks, Andrew
Created attachment 387674 [details]
dmesg of dom0 host
host system is running CentOS 5.4 and kernel 2.6.18-164.11.1.el5xen
Created attachment 387675 [details]
xm dmesg
Another data point. I upgraded the host kernel to RH Enterprise 2.6.18-186.el5xen, which contains some Xen fixes. Still have the same problem. These kernels still crash:
kernel-2.6.30.10-105.fc11.x86_64
kernel-2.6.31.12-174.2.3.fc12.x86_64
I don't know if this is a Fedora problem or a RH Enterprise Xen problem.
Ok, here are the prime suspects:
- guest kernel
- backing store
- host kernel/drivers/HV
- or a problem with some combination of the above
Some of these suspects can be eliminated if we know some history. Has this machine ever had a working config? Did these problems appear after some upgrade? If so, what was kept constant and what was upgraded? We can revert one suspect at a time, or start over with a known-good and upgrade one at a time. Or is this a new system? If so, did the hardware work fine with bare-metal kernels? Or, have you seen any strange behavior in dom0?
guest: you can try building and booting the latest stable kernel, 2.6.32.7. Several xenpv patches have been pushed into the 2.6.32 tree.
backing store: what is it? Please attach the guest config file. I assume it's an LV, but if it's on NFS or another network store then that might explain some problems, since you said the network had been flaky. If it's a local drive, then what type? What driver in dom0 is being used?
host: you're already on the latest, so I don't have any suggestions for that right now, other than to keep it running the latest for now. Also, how much RAM does dom0 have?
Thanks, Andrew
Thanks Andrew, lots of good questions. This hardware is relatively new, just over 6 months old. The main idea of the system is to be a development environment for math/comp sci students. It's set up to deal with up to 60 LTSP (Linux Terminal Server Project) terminals and nxclient/ssh connections. It replaces a 4 year old HP server with Fedora 10 that did the same thing. The old HP setup ran Fedora 10, at its end of life, on the bare metal. The new server was supposed to make use of virtualization.
The hardware of the new dom0 server is an IBM x3850 M2 with 4 Xeon Quad Core E7330 80W processors and 64GB of memory. Two IBM 73.4GB 2.5in 10K HS SAS HDDs make up the system storage for dom0. At the moment we are running CentOS 5.4 with the latest kernel I could find: kernel-xen-2.6.18-186.el5. SAS-attached main storage is an IBM DS3200 with 12 750GB SATA HDDs configured as one large RAID 6 drive. We break up the large drive using LVM. Various attachments of config and dmesg of dom0 to follow.
I see no strange error messages in the dom0 (including /var/log/messages) except for the:
(XEN) traps.c:1878:d5 Domain attempted WRMSR 000000000000008b from 00000021:00000000 to 00000000:00000000.
reported by "xm dmesg". We don't use NFS in dom0 and the network around here is pretty much stable now.
One thing to note. Originally we had hoped to run a Fedora kernel as the dom0. However we struck bug #541615 (Calgary: DMA error on CalIOC2 PHB 0x3) and so were unable to get the attached storage to pass disk tests. RH Enterprise/CentOS is rock solid as a dom0 and passes any disk tests we can throw at it. More detail on the crashing guest to follow.
Created attachment 389898 [details]
dom0 dmesg
dom0 dmesg
dom0 pvdisplay --- Physical volume --- PV Name /dev/sdb2 VG Name SYSTEM PV Size 66.76 GB / not usable 10.54 MB Allocatable yes PE Size (KByte) 32768 Total PE 2136 Free PE 1415 Allocated PE 721 PV UUID i4ZeR9-aQQe-VQfd-dwaw-mpOm-swCa-qnQl7e --- Physical volume --- PV Name /dev/sda1 VG Name VM PV Size 2.72 TB / not usable 3.97 MB Allocatable yes PE Size (KByte) 4096 Total PE 713974 Free PE 403190 Allocated PE 310784 PV UUID prb5ng-TESx-WsIX-JYQs-Fy6T-I0Lv-Yshq01 dom0 lvdiplay --- Logical volume --- LV Name /dev/SYSTEM/ROOT VG Name SYSTEM LV UUID lPSZXD-wmj9-jVe3-gmDs-zPJf-oGG2-JIchhk LV Write Access read/write LV Status available # open 1 LV Size 9.75 GB Current LE 312 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:0 --- Logical volume --- LV Name /dev/SYSTEM/VAR VG Name SYSTEM LV UUID h346wl-8YvE-Wmb3-2OO6-q2CG-bZwg-SJRdPe LV Write Access read/write LV Status available # open 1 LV Size 9.88 GB Current LE 316 Segments 2 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:1 --- Logical volume --- LV Name /dev/SYSTEM/TMP VG Name SYSTEM LV UUID 3Uuw0a-jDga-7hL6-dTNv-o2om-zm1r-btCYDR LV Write Access read/write LV Status available # open 1 LV Size 992.00 MB Current LE 31 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:2 --- Logical volume --- LV Name /dev/SYSTEM/SWAP VG Name SYSTEM LV UUID PJ0dsz-tVqz-GRHV-Uvho-fcd7-wpUF-5CH3eC LV Write Access read/write LV Status available # open 1 LV Size 1.94 GB Current LE 62 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:3 --- Logical volume --- LV Name /dev/VM/system VG Name VM LV UUID HNvX19-hQBi-YnG9-9xVc-83Zh-Vs2o-2t0B6r LV Write Access read/write LV Status available # open 2 LV Size 150.00 GB Current LE 38400 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:4 --- Logical volume --- LV Name /dev/VM/home VG Name VM LV UUID Pr2FlO-BNTb-PeeW-GN0j-O7j2-espt-Em9lYG LV Write Access read/write LV Status available # open 2 LV Size 1.00 TB Current LE 262144 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:5 --- Logical volume --- LV Name /dev/VM/projects VG Name VM LV UUID o1phpe-Dp8r-HUf1-J6VQ-y6LV-PpI1-xbIlVT LV Write Access read/write LV Status available # open 2 LV Size 20.00 GB Current LE 5120 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:6 --- Logical volume --- LV Name /dev/VM/kojack VG Name VM LV UUID PBPq4R-M8eJ-Y9gZ-H52S-V21v-3mas-01vDfY LV Write Access read/write LV Status available # open 1 LV Size 20.00 GB Current LE 5120 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:7 domU config name = "turing" uuid = "8d3383a0-f0ae-f0b1-215d-c2e04115e9fe" maxmem = 30000 memory = 30000 vcpus = 12 bootloader = "/usr/bin/pygrub" on_poweroff = "destroy" on_reboot = "restart" on_crash = "restart" vfb = [ "type=vnc,vncunused=1,keymap=en-us" ] disk = [ "phy:/dev/VM/system,xvda,w", "phy:/dev/VM/home,xvdb,w", "phy:/dev/VM/projects,xvdc,w" ] vif = [ "mac=00:16:3e:7d:d0:91,bridge=xenbr0,script=vif-bridge" ] We get crashes running every 2.6.{30,31,32} kernel we have tried. 
We are currently using kernel-2.6.32.7-41.fc12.x86_64 from koji and the latest crash had a top looking like: top - 10:58:42 up 4:55, 21 users, load average: 62.42, 43.99, 22.27 Tasks: 777 total, 1 running, 776 sleeping, 0 stopped, 0 zombie Cpu(s): 0.4%us, 0.6%sy, 0.0%ni, 31.9%id, 67.2%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 30100024k total, 4208836k used, 25891188k free, 313452k buffers Swap: 10239992k total, 0k used, 10239992k free, 1557692k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 11652 root 20 0 20528 1704 872 R 1.6 0.0 1:46.83 top 973 root 20 0 0 0 0 D 0.0 0.0 0:00.33 jbd2/dm-1-8 1069 root 20 0 0 0 0 D 0.0 0.0 0:00.08 flush-253:1 2645 apache 20 0 475m 21m 6020 D 0.0 0.1 0:05.25 httpd 2647 apache 20 0 474m 19m 5996 D 0.0 0.1 0:04.74 httpd 2648 apache 20 0 474m 19m 5992 D 0.0 0.1 0:04.53 httpd 2650 apache 20 0 475m 20m 6036 D 0.0 0.1 0:06.52 httpd 2651 apache 20 0 476m 22m 5980 D 0.0 0.1 0:04.70 httpd 2652 apache 20 0 476m 21m 6028 D 0.0 0.1 0:05.44 httpd 2896 apache 20 0 476m 21m 6016 D 0.0 0.1 0:05.54 httpd 2897 apache 20 0 475m 20m 6016 D 0.0 0.1 0:05.27 httpd 2898 apache 20 0 476m 21m 6004 D 0.0 0.1 0:05.18 httpd 5028 apache 20 0 474m 20m 6020 D 0.0 0.1 0:05.01 httpd 8075 ashley 20 0 137m 6484 2040 D 0.0 0.0 0:00.40 gconfd-2 8085 ashley 20 0 332m 16m 11m D 0.0 0.1 0:01.47 gnome-panel 8929 apache 20 0 475m 20m 5964 D 0.0 0.1 0:04.40 httpd 10458 stadmin 20 0 214m 5436 3784 D 0.0 0.0 0:00.40 smbd 10740 apache 20 0 475m 20m 5916 D 0.0 0.1 0:02.48 httpd 12005 stadmin 20 0 214m 5636 4028 D 0.0 0.0 0:00.48 smbd 14550 despi 20 0 110m 1608 1248 D 0.0 0.0 0:00.02 bash 14592 despi 20 0 820m 102m 28m D 0.0 0.3 0:42.92 thunderbird-bin 15914 despi 20 0 712m 116m 26m D 0.0 0.4 3:52.89 firefox 17831 stadmin 20 0 214m 6068 4104 D 0.0 0.0 0:01.61 smbd 19420 despi 20 0 311m 17m 7108 D 0.0 0.1 0:08.50 gvim 19429 despi 20 0 149m 13m 2688 D 0.0 0.0 0:00.76 xdvi-xaw3d 21706 kankanac 20 0 227m 7736 4000 D 0.0 0.0 0:01.26 pine 22351 despi 20 0 309m 15m 6880 D 0.0 0.1 0:00.46 gvim 23276 despi 20 0 171m 91m 26m D 0.0 0.3 5:38.29 acroread 23924 math101 20 0 160m 8096 3920 D 0.0 0.0 0:00.15 vim 23944 math101 20 0 7688 332 256 D 0.0 0.0 0:00.00 fortune 23991 kvaidya2 20 0 54968 2156 1804 D 0.0 0.0 0:00.00 imap 24001 hrich 20 0 99.5m 3832 1132 D 0.0 0.0 0:00.00 sendmail 24032 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24053 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24062 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24067 bea 20 0 66268 2068 1452 D 0.0 0.0 0:00.03 sftp-server 24109 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24118 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24127 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24136 despi 20 0 17760 532 420 D 0.0 0.0 0:00.00 xauth 24138 stadmin 20 0 214m 5412 3844 D 0.0 0.0 0:00.04 smbd 24158 markev 20 0 55112 2144 1764 D 0.0 0.0 0:00.03 imap 24166 stadmin 20 0 214m 5036 3500 D 0.0 0.0 0:00.03 smbd 24175 bea 20 0 66268 2052 1448 D 0.0 0.0 0:00.03 sftp-server 24218 stadmin 20 0 214m 5404 3836 D 0.0 0.0 0:00.03 smbd 24244 bea 20 0 66268 2052 1448 D 0.0 0.0 0:00.03 sftp-server 24295 comp131 20 0 107m 20m 3976 D 0.0 0.1 0:00.28 index.cgi 24311 ashley 20 0 20676 792 708 D 0.0 0.0 0:00.00 procmail 24318 markev 20 0 55144 2252 1836 D 0.0 0.0 0:00.00 imap 24336 apache 20 0 374m 9.8m 1908 D 0.0 0.0 0:00.00 httpd 24344 apache 20 0 374m 9.8m 1880 D 0.0 0.0 0:00.00 httpd 24345 apache 20 0 374m 9712 1628 D 0.0 0.0 0:00.00 httpd 24346 apache 20 0 374m 9.8m 1884 D 0.0 0.0 0:00.00 httpd 24347 apache 20 0 374m 9.8m 1900 D 0.0 0.0 0:00.00 
httpd 24356 stadmin 20 0 214m 5264 3716 D 0.0 0.0 0:00.03 smbd 24360 apache 20 0 374m 9352 1292 D 0.0 0.0 0:00.00 httpd 24365 apache 20 0 374m 9668 1576 D 0.0 0.0 0:00.00 httpd 24367 markev 20 0 117m 1432 584 D 0.0 0.0 0:00.00 sshd 24370 apache 20 0 476m 19m 3996 D 0.0 0.1 0:00.11 httpd 24374 ashoka 20 0 55340 2472 1876 D 0.0 0.0 0:00.90 pop3 Created attachment 389900 [details]
domU dmesg
kernel-2.6.32.7-41.fc12.x86_64
Created attachment 389904 [details]
/var/log/messages at start of crash
These messages appear in /var/log/messages as the system dies.
Created attachment 389905 [details]
sysreq W and sysreq t info
I sent a sysreq C and M from dom0 and these messages were logged in the domU /var/log/messages.
That was a sysreq W and t (not C and M). I didn't send an M because I thought the other ones did not work. I didn't have a tail on /var/log/messages, and depending on what hangs, messages may not be saved. This time, messages did make it into /var/log/messages.
Some observations and things I've tried:
- The crash can happen at any time. The best we've had is about 2 days of uptime, the worst is a few hours.
- There are no other domU systems running. This is the only one.
- domU is not very loaded. There are users but they are mostly doing desktop-type tasks. System load never gets above two and is usually below 1.
- At first I suspected NFS and our (then) bad network. However I removed all use of NFS and the problem still occurred.
- I also suspected quota and ext4 problems. I disabled quotas and it stopped the "kernel: JBD: Spotted dirty metadata buffer" error messages but did not stop the crash.
- I also suspected our amd automounter and its complex maps. But I disabled that and the system still died.
Not sure what to try next.
The crash dump in https://bugzilla.redhat.com/show_bug.cgi?id=551552#c3 looks similar to the output in comment 19. https://bugzilla.redhat.com/show_bug.cgi?id=526627 may be another case of this bug?
Created attachment 394253 [details]
"xm top" from dom0
Output of "xm top" a few minutes after "turing" went into D state lockup.
If anyone has suggestions of any other information I could gather, I'd love to know.
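One low-impact way to gather more state from dom0 is to trigger sysrq in the guest via the toolstack; a sketch (the <domid> placeholder is whatever "xm list" reports for the guest):
xm list                  # note the guest's numeric domain id
xm sysrq <domid> w       # dump blocked (D-state) task stacks to the guest console/log
xm sysrq <domid> t       # dump all task states
xm sysrq <domid> m       # memory usage summary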
Tried the kernel parameter "elevator=noop" in both:
kernel-2.6.32.7-41.fc12.x86_64
kernel-2.6.31.12-174.2.19.fc12.x86_64
Both crash and burn, so I've taken that parameter out.
The system had its first real login load of about 16 LTSP terminals. It lasted a few minutes. Did an xm destroy / xm create and it lasted another 2 minutes. Same xm destroy and xm create and it's been running 20 minutes now. This is not good. Can't see how we are going to be running student labs with this setup.
This kernel thread seems to mention something like this problem: http://www.mail-archive.com/kvm@vger.kernel.org/msg23039.html There didn't seem to be any resolution to the problem in that thread.
I've posted this bug in the upstream kernel bugzilla: http://bugzilla.kernel.org/show_bug.cgi?id=15370 "task jbd2/dm-1-8:973 blocked for more than 120 seconds."
Created attachment 396182 [details]
2.6.33-1 won't boot on an F12 domU
The hope of 2.6.33 solving this problem has been put on hold.
On a test F12 system I updated to
grubby-7.0.13-1.fc13.x86_64.rpm
linux-firmware-20100106-3.fc13.noarch.rpm
and installed
kernel-2.6.33-1.fc13.x86_64.rpm
and it would not boot. Boot messages attached.
Should I make another bugzilla about this?
Updated to kernel-2.6.32.9-64.fc12.x86_64 and it crashed after about 19 hours. This problem is quite frustrating. I have not been able to reproduce the problem in a different VM on either the same hardware or different hardware. Other VMs seem to withstand heavy disk tests with no problem. It's something about the LTSP workload. The network, I'm sure, is in this mix somewhere. In the not very conclusive thread mentioned in comment #26, it is said: "I kept hitting this issue for 2.6.31 guest kernel even with a simple network test."
The system can hum along with labs full of students and work fine for several hours, then suddenly one of the jbd2/dm- tasks won't come out of D state and the rest of the system follows as more and more processes catch the D state. Other jbd2/dm- tasks catch the D as well after the first one goes. Depending on which jbd2/dm- goes D first, the growth of the D's can be very fast, or start slow and then spread at an increasing pace. The problem is load related but it is not a big load. It's some combination of network and disk, I think. I'm at a loss as to what more information I can provide.
(In reply to comment #28)
> kernel-2.6.33-1.fc13.x86_64.rpm
> and it would not boot. Boot messages attached.
> Should I make another bugzilla about this?
Yes, that's new bug material; there's already a new bug written about .33 not booting as a PV guest though. Check out bug 568415 to see if it has the same symptoms you saw.
We might be able to get a more useful stack if you boot your F12 guest with the hung_task_panic=1 kernel command line parameter. That should cause your machine to panic as soon as the first hung task is detected. Then you can grab the stack from it either with xm dmesg (it might be there...), sysreq, and/or the xenctx command, "/usr/lib/xen/bin/xenctx -s <System.map-of-guest-kernel>.map <domid>"
Created attachment 396973 [details]
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 after panic
Thanks Andrew for the suggestions.
This is the result of running:
for p in $(seq 0 11)
do
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 27 $p
done
on the system after it panicked with the hung_task_panic=1 passed to the F12 guest.
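A quick way to confirm inside the guest that the hung-task options actually took effect is to read the settings back; a small sketch (the values in the comments are the usual defaults and may differ on a given build):
cat /proc/cmdline                              # should show hung_task_panic=1
cat /proc/sys/kernel/hung_task_panic           # 1 = panic on the first hung task
cat /proc/sys/kernel/hung_task_timeout_secs    # 120 by default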
Created attachment 396977 [details]
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 just before panic
I also ran the xenctx on each processor just before it panicked as the processes started to hang. Unfortunately I didn't get the console output of the panic other than the line:
Kernel panic - not syncing: hung_task: blocked tasks
There was no output in xm dmesg other than the messages mentioned in comment #9.
Created attachment 398674 [details]
A different domU VM hung
I guess the xenctx dumps were also inconclusive. I have some slightly new information.
I have another domU that has hung. This was a test system that I was about to use to get a 2.6.33 kernel to run, and it hung in an rpm -i kernel... command. The hang seems to be on the /boot filesystem, so I can still run commands and get logs if I don't access /boot or run commands like ps.
Same dom0 hardware, so that is still a possible problem, although I've not seen any signs of trouble in dom0. This dom0 only runs two domUs: this test system and the main LTSP system I have logged in all the above. Previous disk tests have not been able to make the test system fail.
I guess this tells us something. It's probably not an ext4 problem as /boot is ext3. It's probably nothing to do with the network. The only network activity is my ssh logins. It's not quotas as they are not turned on. There are not even any users on this system, it's just a small test system.
This one is running 2.6.32.3-10.fc12.x86_64 and it seems to have first hung in:
INFO: task kjournald:580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D ffff88012f57a300 0 580 2 0x00000080
ffff88012fc2fc00 0000000000000246 0000000000000000 ffffffff817f3a80
0000000001865b1e ffffffffb47a46fe ffff88012fc2ffd8 ffff88012fc2ffd8
ffff88012f7ae158 000000000000f980 0000000000015740 ffff88012f7ae158
Call Trace:
[<ffffffff81140972>] ? sync_buffer+0x0/0x44
[<ffffffff81140972>] ? sync_buffer+0x0/0x44
[<ffffffff8145337d>] io_schedule+0x43/0x5d
[<ffffffff811409b2>] sync_buffer+0x40/0x44
[<ffffffff814538d0>] __wait_on_bit+0x48/0x7b
[<ffffffff81012e5b>] ? xen_hypervisor_callback+0x1b/0x20
[<ffffffff81453971>] out_of_line_wait_on_bit+0x6e/0x79
[<ffffffff81140972>] ? sync_buffer+0x0/0x44
[<ffffffff81074a64>] ? wake_bit_function+0x0/0x33
[<ffffffff8113f765>] ? submit_bh+0x109/0x117
[<ffffffff811408d5>] __wait_on_buffer+0x24/0x26
[<ffffffff811bfc12>] wait_on_buffer+0x3d/0x41
[<ffffffff811c019a>] journal_commit_transaction+0x4cf/0xefd
[<ffffffff8104eb5b>] ? finish_task_switch+0x8a/0xac
[<ffffffff81074a2b>] ? autoremove_wake_function+0x0/0x39
[<ffffffff8100e87f>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff81454b03>] ? _spin_unlock_irqrestore+0x19/0x1b
[<ffffffff81065479>] ? try_to_del_timer_sync+0x73/0x81
[<ffffffff8100e87f>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff811c384c>] kjournald+0xe8/0x225
[<ffffffff81074a2b>] ? autoremove_wake_function+0x0/0x39
[<ffffffff811c3764>] ? kjournald+0x0/0x225
[<ffffffff8107473e>] kthread+0x7f/0x87
[<ffffffff81012d2a>] child_rip+0xa/0x20
[<ffffffff81011f11>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff8101269d>] ? retint_restore_args+0x5/0x6
[<ffffffff81012d20>] ? child_rip+0x0/0x20
It's still possible to log into this system and do a restricted set of things. Is there any other information I can collect while it's in this state?
For example, cat /proc/*/stat can show me the blocked processes. Anything else? The system is installed on one LV, divided into a number of partitions:
/dev/mapper/SYSTEM-root / ext4 defaults 1 1
UUID=3118ec38-93e1-456c-995f-c67217901296 /boot ext3 defaults 1 2
/dev/mapper/SYSTEM-tmp /tmp ext4 defaults 1 2
/dev/mapper/SYSTEM-var /var ext4 defaults 1 2
/dev/mapper/SYSTEM-mail /var/spool/mail ext4 defaults 1 2
/dev/mapper/SYSTEM-opt /opt ext4 defaults 1 2
/dev/mapper/SYSTEM-swap swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
I also did an xm dump-core of the test system in the hung state. Not sure what I can do with this 5GB file, however.
Sorry, I haven't had any time to look at this. The backtraces you've captured are useful though. The one where we panic'ed immediately after detecting D-state is this:
Call Trace:
[<ffffffff8122e328>] delay_tsc+0x48 <--
[<ffffffff8122e26f>] __delay+0xf
[<ffffffff8122e2b3>] __const_udelay+0x42
[<ffffffff81453982>] panic+0x131
[<ffffffff810acbfd>] watchdog+0x18e
[<ffffffff8100e87f>] xen_restore_fl_direct_reloc+0x4
[<ffffffff810aca6f>] hung_task_panic+0x17
[<ffffffff810745b6>] kthread+0x7f
[<ffffffff81012d6a>] child_rip+0xa
[<ffffffff81011f51>] int_ret_from_sys_call+0x7
[<ffffffff810126dd>] retint_restore_args+0x5
[<ffffffff81012d60>] kernel_thread+0xe0
It shows that the watchdog tripped after we called xen_restore_fl_direct(). xen_restore_fl_direct() just checks if there are pending events and, if so, makes a simple xen-version hypercall in order to trap to Xen, which will force event channel processing. All the other stacks you've captured also contain xen_restore_fl_direct(). This indicates to me that Xen is getting messed up at some point, and then each process eventually gets hung up after trapping down to it. It further looks like the bug is in Xen, and not the guest, since the D-state problem transferred to the other domain on your system as well, once again with xen_restore_fl_direct() on the stack.
All that said, I don't know how exactly, or what, is getting Xen messed up. It does seem likely to be related to your LTSP workload though, since you're currently the only reporter I'm aware of. Maybe we can get 'xm dmesg' output before you start the LTSP workload, then at some regular interval after you start it, and then of course right after all the processes lock up in D state. Please also make sure all debug messages are turned on by adding the following to your xen.gz line in grub:
loglvl=all guest_loglvl=all debug_stack_lines=80
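For reference, an illustrative dom0 grub.conf stanza with those hypervisor options added (the xen.gz/vmlinuz versions and root device here are examples only; adjust to the installed packages):
title CentOS 5 (xen, debug logging)
    root (hd0,0)
    kernel /xen.gz-2.6.18-186.el5 loglvl=all guest_loglvl=all debug_stack_lines=80
    module /vmlinuz-2.6.18-186.el5xen ro root=/dev/SYSTEM/ROOT
    module /initrd-2.6.18-186.el5xen.img
After rebooting, "xm dmesg | grep -i 'command line'" should echo the options back if the hypervisor picked them up.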
hypervisor boot messages
I've restarted dom0 with the requested logging.
Attached is the hypervisor.log found in /var/log/xen/console after startup.
This seems to contain the xm dmesg output. I have XENCONSOLED_LOG_HYPERVISOR=yes enabled in /etc/sysconfig/xend
The LTSP domU has also been started.
I guess I have to hope now that something is logged when it goes wrong.
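The console-logging setup referred to in the comment above lives in /etc/sysconfig/xend; a sketch of the relevant lines (XENCONSOLED_LOG_HYPERVISOR=yes and the log directory are quoted from the comment, the other variable is an assumption based on the usual RHEL5/CentOS 5 defaults):
XENCONSOLED_LOG_HYPERVISOR=yes
XENCONSOLED_LOG_GUESTS=yes
XENCONSOLED_LOG_DIR=/var/log/xen/console
Restarting xend (service xend restart) should pick the change up.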
Usage is building in domU and the dom0 dmesg logs have been quiet. The only additions were this, several hours after the restart:
(XEN) traps.c:2044: GPF (0060): ffff828c8013076c -> ffff828c80199f78
and 8 minutes later the same thing again:
(XEN) traps.c:2044: GPF (0060): ffff828c8013076c -> ffff828c80199f78
No crash in domU yet. On activity there are our standard ext4 quota problems with several:
kernel: JBD: Spotted dirty metadata buffer (dev = dm-1, blocknr = 98662). There's a risk of filesystem corruption in case of system crash.
Andrew, you say:
> since you're currently the only reporter I'm aware of.
The two bugs mentioned in comment #23 look similar to me. At least xen_restore_fl_direct_end() features in their stack traces. They haven't done any xenctx traces that might have xen_restore_fl_direct().
Had a few more
(XEN) traps.c:2044: GPF (0060): ffff828c8013076c -> ffff828c80199f78
in the hypervisor.log, followed by a crash. No stacks in the logs from the crash of either the guest or the hypervisor. (Sometimes I don't get the "hung task" message, depending on which filesystem locks up.) After the guest reboot we get a bunch more of the WRMSR messages immediately after the reboot. After the reboot we sporadically get more of the traps.c:2044 messages and nothing else. Hope the traps.c:2044 thing means something.
After about a day or so the traps.c:2044: GPF (0060) messages seem to have stopped. The guest is still crashing several times a day. I have not rebooted dom0 since turning on the extra logging, however it does not look like I am getting anything significant in the dmesg logs anyway. Any suggestions on what tools to use to extract more information?
I've updated dom0 to 2.6.18-191.el5xen, with still the same problem. Here is a theory on the problem, or perhaps a grab for a straw. As I said at the start, we have a new network rolled out across all our systems. We have noticed occasional packet loss. Not big outages, but occasional dropped packets according to ping. Could this be the trigger for this bug? Some error code not handling the situation properly perhaps? Could xen dom0 get "messed up" by this? Here is the ifconfig output of dom0.
We can see some dropped packets in vif3.0 eth0 Link encap:Ethernet HWaddr 00:1A:64:DB:DC:4C inet addr:129.180.124.36 Bcast:129.180.125.255 Mask:255.255.254.0 inet6 addr: 2001:388:10:1606:21a:64ff:fedb:dc4c/64 Scope:Global inet6 addr: fe80::21a:64ff:fedb:dc4c/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1155927 errors:0 dropped:0 overruns:0 frame:0 TX packets:6966 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:1330910570 (1.2 GiB) TX bytes:1373077 (1.3 MiB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:6042 errors:0 dropped:0 overruns:0 frame:0 TX packets:6042 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:581960 (568.3 KiB) TX bytes:581960 (568.3 KiB) peth0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING NOARP MTU:1500 Metric:1 RX packets:26444158 errors:0 dropped:0 overruns:0 frame:0 TX packets:39923089 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:4611777466 (4.2 GiB) TX bytes:39579860067 (36.8 GiB) Interrupt:22 Memory:f2000000-f2012800 vif0.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING NOARP MTU:1500 Metric:1 RX packets:6977 errors:0 dropped:0 overruns:0 frame:0 TX packets:1155938 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:1375739 (1.3 MiB) TX bytes:1330911296 (1.2 GiB) vif3.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING NOARP MTU:1500 Metric:1 RX packets:12139711 errors:0 dropped:0 overruns:0 frame:0 TX packets:18153943 errors:0 dropped:4363 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:26347167293 (24.5 GiB) TX bytes:3148543642 (2.9 GiB) virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00 inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0 inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:6 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:468 (468.0 b) xenbr0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF UP BROADCAST RUNNING NOARP MTU:1500 Metric:1 RX packets:1146793 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:1313566897 (1.2 GiB) TX bytes:0 (0.0 b) Is this related? Apr 9 00:47:44 xback2 kernel: INFO: task kthreadd:2 blocked for more than 120 seconds. Apr 9 00:47:44 xback2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 9 00:47:44 xback2 kernel: kthreadd D 0000000000000002 0 2 0 0x00000000 Apr 9 00:47:44 xback2 kernel: ffff88007dbfd4c0 0000000000000046 0000000000000000 0000000a00000000 Apr 9 00:47:44 xback2 kernel: ffff880000000001 ffff880079f9b800 ffff88007dbfdfd8 ffff88007dbfdfd8 Apr 9 00:47:44 xback2 kernel: ffff88007dbf1b38 000000000000f980 0000000000015740 ffff88007dbf1b38 Apr 9 00:47:44 xback2 kernel: Call Trace: Apr 9 00:47:44 xback2 kernel: [<ffffffff8107c30d>] ? ktime_get_ts+0x85/0x8e Apr 9 00:47:44 xback2 kernel: [<ffffffff810d604d>] ? sync_page+0x0/0x4a Apr 9 00:47:44 xback2 kernel: [<ffffffff810d604d>] ? 
sync_page+0x0/0x4a Apr 9 00:47:44 xback2 kernel: [<ffffffff814546f5>] io_schedule+0x43/0x5d Apr 9 00:47:44 xback2 kernel: [<ffffffff810d6093>] sync_page+0x46/0x4a Apr 9 00:47:44 xback2 kernel: [<ffffffff81454c48>] __wait_on_bit+0x48/0x7b Apr 9 00:47:44 xback2 kernel: [<ffffffff810d6265>] wait_on_page_bit+0x72/0x79 ... Apr 9 00:47:44 xback2 kernel: INFO: task async/mgr:14 blocked for more than 120 seconds. Apr 9 00:47:44 xback2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 9 00:47:44 xback2 kernel: async/mgr D ffff88007d045d80 0 14 2 0x00000000 Apr 9 00:47:44 xback2 kernel: ffff88007d055ca0 0000000000000046 ffff88007d055c68 ffff88007d055c64 Apr 9 00:47:44 xback2 kernel: 0000000000000000 0000000000008a00 ffff88007d055fd8 ffff88007d055fd8 Apr 9 00:47:44 xback2 kernel: ffff88007d046158 000000000000f980 0000000000015740 ffff88007d046158 Apr 9 00:47:44 xback2 kernel: Call Trace: Apr 9 00:47:44 xback2 kernel: [<ffffffff814549eb>] schedule_timeout+0x36/0xe3 Apr 9 00:47:44 xback2 kernel: [<ffffffff81046355>] ? check_preempt_wakeup+0x1d4/0x270 Apr 9 00:47:44 xback2 kernel: [<ffffffff8104815e>] ? enqueue_task_fair+0x68/0x6d We don't have any virtual machines on this system. It's 2.6.32.10-90.fc12.x86_64 running an rsync job on an MD file system (xfs). We have an ext3 system similarly affected. Unfortunately the system crashed completely when I tried to use sysrq on it (C I think). Here is there memory information I grabbed: Apr 12 11:55:04 xback2 kernel: SysRq : Show Memory Apr 12 11:55:04 xback2 kernel: Mem-Info: Apr 12 11:55:04 xback2 kernel: Node 0 DMA per-cpu: Apr 12 11:55:04 xback2 kernel: CPU 0: hi: 0, btch: 1 usd: 0 Apr 12 11:55:04 xback2 kernel: CPU 1: hi: 0, btch: 1 usd: 0 Apr 12 11:55:04 xback2 kernel: Node 0 DMA32 per-cpu: Apr 12 11:55:04 xback2 kernel: CPU 0: hi: 186, btch: 31 usd: 159 Apr 12 11:55:04 xback2 kernel: CPU 1: hi: 186, btch: 31 usd: 115 Apr 12 11:55:04 xback2 kernel: active_anon:1757 inactive_anon:10553 isolated_anon:0 Apr 12 11:55:04 xback2 kernel: active_file:155930 inactive_file:155529 isolated_file:99 Apr 12 11:55:04 xback2 kernel: unevictable:0 dirty:69285 writeback:2073 unstable:0 Apr 12 11:55:04 xback2 kernel: free:72852 slab_reclaimable:81782 slab_unreclaimable:13993 Apr 12 11:55:04 xback2 kernel: mapped:2703 shmem:66 pagetables:1325 bounce:0 Apr 12 11:55:04 xback2 kernel: Node 0 DMA free:8044kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:2280kB inactive_file:492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4944kB slab_unreclaimable:112kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Apr 12 11:55:04 xback2 kernel: lowmem_reserve[]: 0 2003 2003 2003 Apr 12 11:55:04 xback2 kernel: Node 0 DMA32 free:283364kB min:5704kB low:7128kB high:8556kB active_anon:7028kB inactive_anon:42212kB active_file:621440kB inactive_file:621624kB unevictable:0kB isolated(anon):0kB isolated(file):396kB present:2051180kB mlocked:0kB dirty:277140kB writeback:8292kB mapped:10812kB shmem:264kB slab_reclaimable:322184kB slab_unreclaimable:55860kB kernel_stack:1776kB pagetables:5300kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no Apr 12 11:55:04 xback2 kernel: lowmem_reserve[]: 0 0 0 0 Apr 12 11:55:04 xback2 kernel: Node 0 DMA: 19*4kB 20*8kB 28*16kB 28*32kB 17*64kB 12*128kB 5*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 8044kB Apr 12 11:55:04 xback2 kernel: Node 0 DMA32: 64907*4kB 2541*8kB 129*16kB 2*32kB 6*64kB 3*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 283364kB Apr 12 11:55:04 xback2 kernel: 311650 total pagecache pages Apr 12 11:55:04 xback2 kernel: 47 pages in swap cache Apr 12 11:55:04 xback2 kernel: Swap cache stats: add 592, delete 545, find 67/87 Apr 12 11:55:04 xback2 kernel: Free swap = 4199984kB Apr 12 11:55:04 xback2 kernel: Total swap = 4200988kB Apr 12 11:55:04 xback2 kernel: 524000 pages RAM Apr 12 11:55:04 xback2 kernel: 9892 pages reserved Apr 12 11:55:04 xback2 kernel: 176159 pages shared Apr 12 11:55:04 xback2 kernel: 274020 pages non-shared Still seeing this problem with 2.6.18-194.el5xen I also tried updating the Xen to 3.4.2 from the repo provided at: http://www.gitco.de/repo/xen3.4.2 and still experienced the same D state lockup in my guest. I've now downgraded back to the standard Xen that comes with RH/Centos. Another observation, the most common processes to lockup after jbd2/dm and flush are httpd and nscd. Both of these processes could possibly be doing LDAP lookups. Not sure want any of this is telling me. Jeremy, re comment #33. I does not look to be the same problem to me but I'm not really qualified to judge that. Hi, I have the exact same issue with xen 3.1.4 running on FC8 DOM0 2.6.21.7-5.fc8xen. INFO: task kjournald:851 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kjournald D ffff88001ce9ca80 0 851 2 0x00000000 ffff880002d1daf0 0000000000000246 ffff88001ce1d908 ffff88001e0cb008 000000000073335c 0000000000000001 ffff880002d1db58 ffff88001d5bdb00 ffff88001cde1b48 000000000000f8e0 ffff88001cde1b48 0000000000015600 Call Trace: [<ffffffff8106e897>] ? clocksource_read+0xf/0x11 [<ffffffff8106aca7>] ? ktime_get_ts+0x4e/0x53 [<ffffffff81419393>] io_schedule+0x31/0x42 [<ffffffff811e9503>] get_request_wait+0xeb/0x179 [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811e982c>] __make_request+0x29b/0x3c5 [<ffffffff811e71cb>] generic_make_request+0x24e/0x2ac [<ffffffff8100e7f2>] ? check_events+0x12/0x20 [<ffffffff811e730c>] submit_bio+0xe3/0x100 [<ffffffff8100e7f2>] ? check_events+0x12/0x20 [<ffffffff8111b2f3>] submit_bh+0xf4/0x117 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff81196b39>] journal_do_submit_data+0x3c/0x4a [<ffffffff81197055>] journal_commit_transaction+0x3be/0xe55 [<ffffffff81048fcb>] ? finish_task_switch+0x6d/0xe6 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b [<ffffffff8105c11f>] ? try_to_del_timer_sync+0x5a/0x66 [<ffffffff8100e7df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8119a78b>] kjournald+0xeb/0x230 [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39 [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b [<ffffffff8119a6a0>] ? kjournald+0x0/0x230 [<ffffffff81067765>] kthread+0x91/0x99 [<ffffffff81012daa>] child_rip+0xa/0x20 [<ffffffff81011f97>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff8101271d>] ? retint_restore_args+0x5/0x6 [<ffffffff81012da0>] ? 
child_rip+0x0/0x20 vif3.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1048 errors:0 dropped:0 overruns:0 frame:0 TX packets:2082 errors:0 dropped:32 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:114497 (111.8 KiB) TX bytes:248113 (242.2 KiB) vif4.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:363 errors:0 dropped:0 overruns:0 frame:0 TX packets:414 errors:0 dropped:35 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:43364 (42.3 KiB) TX bytes:55012 (53.7 KiB) vif5.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:3127 errors:0 dropped:0 overruns:0 frame:0 TX packets:2850 errors:0 dropped:16 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:443692 (433.2 KiB) TX bytes:404390 (394.9 KiB) vif6.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:436 errors:0 dropped:0 overruns:0 frame:0 TX packets:1087 errors:0 dropped:226 overruns:0 carrier:0 collisions:0 txqueuelen:32 RX bytes:63671 (62.1 KiB) TX bytes:127551 (124.5 KiB) Herve, is that stack trace from a guest or the Dom0 kernel. The lockups for me were in the guests, not Dom0. If it was a guest, what was it running? The stack trace is from one of my fc12 guest (2.6.32). The only one I managed to get as usually all the guests get stuck (even the guest running old fc8). Sometimes I can get the system back by destroying the Guest were the blockage occur, but usually only a reboot of the dom0 helps. 2 days ago I had issue with the interfaces not being released properly. Don’t know if it’s of any help. Found interface vif19.0 but domain 19 does not exist. Found interface vif20.0 but domain 20 does not exist. Found interface vbd-19-2050 but domain 19 does not exist. Found interface vbd-19-2049 but domain 19 does not exist. Found interface vif1.0 but domain 1 does not exist. Found interface vif19.0 but domain 19 does not exist. Found interface vif20.0 but domain 20 does not exist. Found interface vbd-19-2050 but domain 19 does not exist. Found interface vbd-19-2049 but domain 19 does not exist. Found interface vbd-1-51718 but domain 1 does not exist. Found interface vbd-1-51717 but domain 1 does not exist. Found interface vbd-1-51716 but domain 1 does not exist. Found interface vbd-1-51715 but domain 1 does not exist. Found interface vbd-1-51714 but domain 1 does not exist. Found interface vbd-1-51713 but domain 1 does not exist. .... Herve, could you describe your hardware, some history (how long have you been seeing this, etc), and perhaps a description of your workload? If it is the same problem I'm seeing, it seems to be a very rare problem. Yet I have not had a guest that has made 3 days of uptime. The other bugzillas that looked similar have gone silent. I also can't reproduce the problem on different hardware but then I don't have enough hardware to re-create exactly the same conditions. 
I've just become aware of several bugs to do with the Broadcom NICs that we use: # lspci | grep -i Broadcom 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20) 02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20) # ethtool -i peth0 driver: bnx2 version: 2.0.2 firmware-version: 4.6.4 IPMI 0.1.18 bus-info: 0000:02:00.0 Thing is that most of these problems are supposed to fixed in the RedHat 5.5 kernels. However I did notice some fixes in 2.6.18-198: [net] bnx2: fix lost MSI-X problem on 5709 NICs (John Feeney) [511368] [net] bnx2: avoid restarting cnic in some contexts (Andy Gospodarek) [554706] So I installed that kernel but it did not fix my problem. The guest still locks up in D state processes. But at least I've found another candidate for the cause here. Any suggestions of things to look for would be well received. About to try: options bnx2 "disable_msi=1" in /etc/modprobe.conf of dom0 After that, I think I'll try putting in some Intel NICs and see if that helps. Is this really linked to network use? Have you tried running the guests with the network down? I have not tried recently to run any tests with the network down. This is a production system now and pretty hard to find a time to do this. Before the system went live in late Dec 2009 we ran bonnie++ and other tests and did not notice this problem. We first noticed it 2 days after going live. In comment #34 I mention another guest with this problem where there was very little network activity. The hang occurred during an rpm install of a new kernel. At the same time this happened, the other guest was doing a normal busy network workload. So yes it could still be a storage problem, I don't now. There is no sign in dom0 of problems with the disks. Just over a day of uptime with 'options bnx2 "disable_msi=1"'. 2 days to go to set a new uptime record. Crash and burn. 'options bnx2 "disable_msi=1"' made no difference. Andrew, I get the impression you would be surprised if this problem was related to network use? Is there a reason for this? Looks similar to my bug with KVM: https://bugzilla.redhat.com/show_bug.cgi?id=563672 I have also this problem on servers with high disk IO. Created attachment 414109 [details]
guest panic on disk test
Well I finally got brave and ran some disk tests in production. The result was a panic in the guest within seconds. Attached is the stack trace. syslogd also wall'ed this:
Message from syslogd@turing at May 15 02:24:59 ...
kernel:------------[ cut here ]------------
Message from syslogd@turing at May 15 02:24:59 ...
kernel:invalid opcode: 0000 [#1] SMP
Message from syslogd@turing at May 15 02:24:59 ...
kernel:last sysfs file: /sys/devices/vbd-51712/block/xvda/stat
Message from syslogd@turing at May 15 02:24:59 ...
kernel:Stack:
Message from syslogd@turing at May 15 02:24:59 ...
kernel:Call Trace:
Message from syslogd@turing at May 15 02:24:59 ...
kernel:Code: 8b 38 48 8b 70 08 48 89 77 08 48 89 3e 48 89 00 48 89 40 08 66 ff 03 ff 14 25 c8 a7 6b 81 48 8b 42 f8 48 83 e0 fc 48 39 c3 74 04 <0f> 0b eb fe f0 80 62 f8 fe 48 89 cf ff 55 98 48 8b 45 80 8b 90
Timeout, server not responding.
The disk test was an fio test, run with "sudo fio four-threads-randio.fio" where four-threads-randio.fio contains:
; Four threads, two query, two writers.
[global]
rw=randread
size=256m
directory=/tmp/fio-testing/data
ioengine=libaio
iodepth=4
invalidate=1
direct=1
[bgwriter]
rw=randwrite
iodepth=32
[queryA]
iodepth=1
ioengine=mmap
direct=0
thinktime=3
[queryB]
iodepth=1
ioengine=mmap
direct=0
thinktime=5
[bgupdater]
rw=randrw
iodepth=16
thinktime=40
size=32m
Note that Dom0 logged no disk errors, or any error, in dmesg, syslog or "xm dmesg" on this crash.
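For anyone re-running the job file above, a usage sketch (the mkdir step is an assumption; fio expects the directory named in the job file to exist before the run):
mkdir -p /tmp/fio-testing/data
sudo fio four-threads-randio.fio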
The crash is comment #55 has no new information as far as I can tell. Running this test again, the system survives and runs without problem. Hope to get some testing time in the next few weeks were I can take the system offline and really hammer it. One odd thing about this system is that it has two problems that I am the only one pushing the reporting of. The other problem is bug #578674 (JBD: Spotted dirty metadata buffer) where most times setquota(8) is run, syslog reports messages like: kernel: JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 118115505). There's a risk of filesystem corruption in case of system crash. Occasionally this stack is written: Jun 1 09:55:05 turing kernel: ------------[ cut here ]------------ Jun 1 09:55:05 turing kernel: WARNING: at fs/quota/dquot.c:975 dquot_claim_space+0xa1/0x11f() Jun 1 09:55:05 turing kernel: Modules linked in: fuse inet_diag nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 dm_multipath uinput joydev xen_netfront xen_blkfront [last unloaded: scsi_wait_scan] Jun 1 09:55:05 turing kernel: Pid: 783, comm: flush-253:0 Not tainted 2.6.32.12-115.fc12.x86_64 #1 Jun 1 09:55:05 turing kernel: Call Trace: Jun 1 09:55:05 turing kernel: [<ffffffff81056370>] warn_slowpath_common+0x7c/0x94 Jun 1 09:55:05 turing kernel: [<ffffffff8105639c>] warn_slowpath_null+0x14/0x16 Jun 1 09:55:05 turing kernel: [<ffffffff81164ef3>] dquot_claim_space+0xa1/0x11f Jun 1 09:55:05 turing kernel: [<ffffffff811b6f66>] ext4_mb_mark_diskspace_used+0x272/0x31a Jun 1 09:55:05 turing kernel: [<ffffffff811ba0ed>] ext4_mb_new_blocks+0x1e8/0x423 Jun 1 09:55:05 turing kernel: [<ffffffff81216636>] ? blk_peek_request+0x191/0x1a7 Jun 1 09:55:05 turing kernel: [<ffffffff8129f691>] ? gnttab_free_grant_references+0x6a/0x73 Jun 1 09:55:05 turing kernel: [<ffffffff8118e4f8>] ext4_new_meta_blocks+0x52/0xaa Jun 1 09:55:05 turing kernel: [<ffffffff811b0d4a>] ? ext4_ext_find_extent+0x53/0x279 Jun 1 09:55:05 turing kernel: [<ffffffff811b0b07>] ext4_ext_new_meta_block+0x3b/0x42 Jun 1 09:55:05 turing kernel: [<ffffffff811b1a4d>] ext4_ext_insert_extent+0x79d/0xb1a Jun 1 09:55:05 turing kernel: [<ffffffff8110e3ad>] ? __kmalloc+0x14c/0x15e Jun 1 09:55:05 turing kernel: [<ffffffff811b0d4a>] ? ext4_ext_find_extent+0x53/0x279 Jun 1 09:55:05 turing kernel: [<ffffffff8100e87f>] ? xen_restore_fl_direct_end+0x0/0x1 Jun 1 09:55:05 turing kernel: [<ffffffff811b29ba>] ext4_ext_get_blocks+0xbf0/0x14b7 Jun 1 09:55:05 turing kernel: [<ffffffff8122a83b>] ? __lookup_tag+0x6d/0x12d Jun 1 09:55:05 turing kernel: [<ffffffff81195be8>] ext4_get_blocks+0x120/0x286 Jun 1 09:55:05 turing kernel: [<ffffffff81195e9e>] mpage_da_map_blocks+0x9e/0x31c Jun 1 09:55:05 turing kernel: [<ffffffff8119644b>] ext4_da_writepages+0x32f/0x535 Jun 1 09:55:05 turing kernel: [<ffffffff810dd537>] do_writepages+0x21/0x2a Jun 1 09:55:05 turing kernel: [<ffffffff8113a320>] writeback_single_inode+0xcc/0x1f1 Jun 1 09:55:05 turing kernel: [<ffffffff8113aac1>] writeback_inodes_wb+0x37a/0x411 Jun 1 09:55:05 turing kernel: [<ffffffff8100e1b5>] ? xen_force_evtchn_callback+0xd/0xf Jun 1 09:55:05 turing kernel: [<ffffffff8100e892>] ? check_events+0x12/0x20 Jun 1 09:55:05 turing kernel: [<ffffffff8113ac80>] wb_writeback+0x128/0x1a7 Jun 1 09:55:05 turing kernel: [<ffffffff8113aef1>] wb_do_writeback+0x13c/0x152 Jun 1 09:55:05 turing kernel: [<ffffffff8113af46>] bdi_writeback_task+0x3f/0xaf Jun 1 09:55:05 turing kernel: [<ffffffff810ec716>] ? 
bdi_start_fn+0x0/0xda Jun 1 09:55:05 turing kernel: [<ffffffff810ec787>] bdi_start_fn+0x71/0xda Jun 1 09:55:05 turing kernel: [<ffffffff810ec716>] ? bdi_start_fn+0x0/0xda Jun 1 09:55:05 turing kernel: [<ffffffff8107460a>] kthread+0x7f/0x87 Jun 1 09:55:05 turing kernel: [<ffffffff81012d6a>] child_rip+0xa/0x20 Jun 1 09:55:05 turing kernel: [<ffffffff81011f51>] ? int_ret_from_sys_call+0x7/0x1b Jun 1 09:55:05 turing kernel: [<ffffffff810126dd>] ? retint_restore_args+0x5/0x6 Jun 1 09:55:05 turing kernel: [<ffffffff81012d60>] ? child_rip+0x0/0x20 Jun 1 09:55:05 turing kernel: ---[ end trace 93666c5e61b00755 ]--- So two strange errors on the one system. I wonder if they are related? The most consistent way to trigger this bug (>50%) is to run, on a remote system: ssh turing dump 3uf - /dev/mapper/SYSTEM-root | /usr/bin/gzip -9 > backup.root.gz turing is the DomU system that crashes. The remote system is located on totally different hardware. Seems to me that the workload here is a lot of disk reads and a lot of network writes. I have finally been alble to reproduce this on completely different hardware while doing a similar command to comment #58. No stack traces as nothing logged to syslog and I didn't turn on hung_task_panic=1 for this kernel. This is different hardware to all he other reports in this bug. The host software is Centos 5.5 running 2.6.18-194.3.1.el5xen. I had a top running in a screen session of the single Fedora 12 domU: top - 07:26:30 up 2 days, 21:34, 1 user, load average: 24.99, 24.97, 24.53 Tasks: 149 total, 2 running, 146 sleeping, 0 stopped, 1 zombie Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 0.0%id, 99.8%wa, 0.0%hi, 0.0%si, 0.1%st Mem: 4060828k total, 4040752k used, 20076k free, 3486120k buffers Swap: 6291448k total, 320k used, 6291128k free, 134136k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 235 root 20 0 0 0 0 D 0.0 0.0 0:00.18 jbd2/dm-0-8 887 root 20 0 0 0 0 D 0.0 0.0 0:00.24 flush-253:0 1321 ntp 20 0 60716 2776 1904 D 0.0 0.1 0:00.27 ntpd 1384 postgres 20 0 183m 4988 4192 D 0.0 0.1 0:07.96 postmaster 1409 postgres 20 0 155m 1232 444 D 0.0 0.0 0:00.08 postmaster 1411 root 20 0 62464 1436 844 D 0.0 0.0 0:00.32 amd 1441 postgres 20 0 183m 1652 716 R 0.0 0.0 0:03.75 postmaster 1442 postgres 20 0 155m 1392 516 D 0.0 0.0 0:04.80 postmaster 1470 root 20 0 255m 5852 4212 D 0.0 0.1 0:00.37 abrtd 9076 root 20 0 54076 2588 1856 R 0.0 0.1 8:09.33 top 22487 root 20 0 23256 1948 144 D 0.0 0.0 0:14.18 dump 22488 root 20 0 23256 1948 144 D 0.0 0.0 0:14.07 dump 22489 root 20 0 23256 1948 144 D 0.0 0.0 0:14.47 dump 22530 root 20 0 109m 788 236 D 0.0 0.0 0:00.00 crond 22607 root 20 0 71840 2016 420 D 0.0 0.0 0:00.00 sendmail 22608 smmsp 20 0 59156 1628 408 D 0.0 0.0 0:00.00 sendmail 22609 root 20 0 109m 788 236 D 0.0 0.0 0:00.00 crond 22610 root 20 0 71840 2016 420 D 0.0 0.0 0:00.00 sendmail 22611 smmsp 20 0 59156 1628 408 D 0.0 0.0 0:00.00 sendmail 22612 root 20 0 109m 788 236 D 0.0 0.0 0:00.00 crond 22613 root 20 0 71840 2016 420 D 0.0 0.0 0:00.00 sendmail 22614 smmsp 20 0 59156 1628 408 D 0.0 0.0 0:00.00 sendmail 22615 root 20 0 109m 788 236 D 0.0 0.0 0:00.00 crond 22616 root 20 0 71840 2016 420 D 0.0 0.0 0:00.00 sendmail 22617 smmsp 20 0 59156 1628 408 D 0.0 0.0 0:00.00 sendmail The bug in comment #57 is un-realated. An upstream patch has been found for it. Created attachment 425530 [details]
dmesg of dom0
Attached is the dmesg output of the new dom0 that is seeing the domU crash. The output of lspci in dom0 is:
00:00.0 Host bridge: Intel Corporation E7230/3000/3010 Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation E7230/3000/3010 PCI Express Root Port
00:1c.0 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation N10/ICH7 Family USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #3 (rev 01)
00:1d.7 USB Controller: Intel Corporation N10/ICH 7 Family USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation N10/ICH7 Family SATA IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation N10/ICH 7 Family SMBus Controller (rev 01)
02:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
06:05.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
Created attachment 425531 [details]
A sysrq l stack of domU
Tried to do an "xm sysrq N w" to show stacks of blocked tasks on hung domU. Nothing was displayed. An "xm sysrq N l" showed the attached stack. Looks a bit different to the other stacks though I'm not sure if it's useful.
Anything in 'xm dmesg' on this new machine? Can you attach /proc/cpuinfo for the machines on which you're seeing the problem? Created attachment 425719 [details]
/proc/cpuinfo of dom0 orginal system
Created attachment 425720 [details]
/proc/cpuinfo of dom0 new system
Nothing extraordinary in 'xm dmesg' of new machine. I've now restarted dom0 with: kernel /xen.gz-2.6.18-194.3.1.el5 loglvl=all guest_loglvl=all debug_stack_lines=80 although that did not increase useful logging on the original system. Could this problem is related to timers or clock source in any way? In a bug that I think is similar to this, #526627, the clock source was the suspected cause for awhile. Also comment #54 above reports a very similar problem with KVM in bug #563672 My domU once logged: Jun 11 01:12:51 turing kernel: hrtimer: interrupt took 1645273 ns out of the blue. That's the only time I have ever seen that message. DomU currently reports: $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource xen tsc $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource xen I'm trying clocksource=tsc now in my second system. Early on in this bug I reported setting clocksource to jiffies in an earlier Fedora kernel and also in bug #526627 I reported booting with this. I got the characteristic hanging processes very quickly. clocksource=tsc does not fix this problem. Tried it on my main unstable system and it hung once again during backups. Also, the host was running kernel-xen-2.6.18-204.el5.x86_64 Just noticed that this bug is referenced here: http://lists.xensource.com/archives/html/xen-devel/2010-06/msg00590.html No solution there either. One comment from Rick Boone was: > I booted into 2.6.32.12 yesterday, but I'm still getting a hung task crash. (Initially, I got a soft lockup crash on one of the cpu's, after switching to 2.6.32.12, but it seems I was able to resolve that by switching the kernel to tickless) After my next crash, probably during backups tonight, I'll try booting with: nohz=off highres=off which I think takes me *AWAY* from a tickless kernel. (In reply to comment #68) > clocksource=tsc does not fix this problem. > > Tried it on my main unstable system and it hung once again during backups. > > Also, the host was running kernel-xen-2.6.18-204.el5.x86_64 When you set clocksource=tsc on the command line, did you check to see if it was accepted? There are some bugs that make the clocksource= option get ignored... (In reply to comment #70) > (In reply to comment #68) > > clocksource=tsc does not fix this problem. > When you set clocksource=tsc on the command line, did you check to see if it > was accepted? There are some bugs that make the clocksource= option get > ignored... Yes I'm pretty sure I checked /sys/devices/system/clocksource/clocksource0/current_clocksource and also looking back in my console logs I see in the boot messages: kernel: Switching to clocksource tsc (In reply to comment #69) > After my next crash, probably during backups tonight, I'll try booting with: > > nohz=off highres=off > > which I think takes me *AWAY* from a tickless kernel. I can now report that this did not solve the problem either. Same characteristic crash with nohz=off highres=off Although, comment #70 made me wonder how do I check that those options "took"? I can see them in the command line boot messages but what on the system indicates that I am using those options? I have no ideas of what to try next for this bug. I think I've reproduced this bug on a very recent rhel6 kernel. I still haven't got all the ingredients nailed down, but I'm suspicious it had something to do with an ext3 file system that I had on a file-backed blkback device. I still don't think I've ever seen this problem with ext4 though. 
Which fs is being used on the system having the problems here? If it's ext3, could you try to find a way to try ext4? I know that's asking a lot if it's a production system.

Andrew

Ah, nevermind. Just got it to reproduce with ext4. Only notes I have right now is that when I mount these file systems I get this log to the console

JBD: barrier-based sync failed on xvdb - disabling barriers

and the first tasks to start hanging (with ext3) are

kjournald, flush-202 - not always that order (reproduced 3 times)

and now on the ext4 reproduced case it was

jbd2/xvda2-8

My reproducer is currently to simply build a kernel on these file systems and wait.

Andrew

(In reply to comment #74)
> Ah, nevermind. Just got it to reproduce with ext4. Only notes I have right now
> is that when I mount these file systems I get this log to the console
All my crashes have been with ext4 (with or without quotas enabled), except for the crash mentioned in comment #34, which was ext3 on a /boot partition.

> JBD: barrier-based sync failed on xvdb - disabling barriers
Hmm, this message is familiar for some reason but I can't find it in any of my current logs.

> and the first tasks to start hanging (with ext3) are
>
> kjournald, flush-202 - not always that order (reproduced 3 times)
>
> and now on the ext4 reproduced case it was
>
> jbd2/xvda2-8
jbd2/dm* for my ext4 filesystems. Any one could go first, but the most likely seems to be /var. All systems that have had this problem for me have been on one or more LVs of the host system. I have not tried image files.

> My reproducer is currently to simply build a kernel on these file systems and
> wait.
My feeling on the workload that reproduces this is that it has to be stop/start disk activity. Just straight heavy I/O rarely hits this problem in my testing. The network also figures prominently in the mix. After the jbd2 hang, the first processes to hang are usually things like httpd, nscd, smbd and imap. Also, backups with dump across the network seem to trigger the problem reasonably often.

I'll put my test system (which has seen the problem only twice) into a loop of kernel building and see what happens.

The production system that sees the problem often is an LTSP and nxclient server. So that's lots of desktop users using browsers, compilers, editors, etc. Lots of stop/start network and disk I/O.

(In reply to comment #75)
> (In reply to comment #74)
> > > JBD: barrier-based sync failed on xvdb - disabling barriers
> Hmm, this message is familiar for some reason but I can't find it in any of my
> current logs.
I saw a discussion of this message on the ext4 mailing list. I don't get it in my logs as far as I can tell. http://marc.info/?t=128017340300003&r=1&w=2

> > My reproducer is currently to simply build a kernel on these file systems and
> > wait.
> I'll put my test system (which has seen the problem only twice) into a loop of
> kernel building and see what happens.
My test system, building the kernel, went into this state after 2 days. However, it did it when doing a backup (dump) to a remote system again and/or sending an email.
top - 07:29:45 up 4 days, 14:24, 3 users, load average: 25.99, 25.96, 25.35
Tasks: 153 total, 1 running, 152 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 0.0%id, 99.9%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4060836k total, 2080144k used, 1980692k free, 128332k buffers
Swap: 6291448k total, 5444k used, 6286004k free, 374432k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
20361 root     20  0 14908 1172  856 R  0.3  0.0 4:18.03 top
  237 root     20  0     0    0    0 D  0.0  0.0 6:29.73 jbd2/dm-0-8
  825 root     20  0     0    0    0 D  0.0  0.0 1:36.20 flush-253:0
  919 root     20  0  8456  908  580 D  0.0  0.0 0:00.10 patch
  920 root     20  0  170m 2644 1680 D  0.0  0.1 0:00.00 crond
  923 root     20  0  170m 2644 1680 D  0.0  0.1 0:00.00 crond
  926 root     20  0  111m  976  404 D  0.0  0.0 0:00.00 crond
  927 smmsp    20  0 59152 1656  420 D  0.0  0.0 0:00.00 sendmail
  928 root     20  0 71836 2040  432 D  0.0  0.1 0:00.00 sendmail
  929 root     20  0  115m 1112  524 D  0.0  0.0 0:00.00 crond
  930 smmsp    20  0 59152 1656  420 D  0.0  0.0 0:00.00 sendmail
  932 root     20  0 71836 2040  432 D  0.0  0.1 0:00.00 sendmail
  933 root     20  0  117m 1136  540 D  0.0  0.0 0:00.00 crond
  934 smmsp    20  0 59152 1656  420 D  0.0  0.0 0:00.00 sendmail
  935 root     20  0 71836 2040  432 D  0.0  0.1 0:00.00 sendmail
  936 root     20  0  128m 1244  612 D  0.0  0.0 0:00.00 crond
  937 smmsp    20  0 59152 1656  420 D  0.0  0.0 0:00.00 sendmail
  938 root     20  0 71836 2044  432 D  0.0  0.1 0:00.00 sendmail
 1218 ntp      20  0 60708 1736  868 D  0.0  0.0 0:00.33 ntpd
 1281 postgres 20  0  183m 3416 2776 D  0.0  0.1 0:01.25 postmaster
 1373 postgres 20  0  183m 1068  396 D  0.0  0.0 0:01.44 postmaster
 1374 postgres 20  0  155m  908  268 D  0.0  0.0 0:01.61 postmaster
32393 root     20  0 23256 1920  144 D  0.0  0.0 0:00.39 dump
32394 root     20  0 23256 1920  144 D  0.0  0.0 0:00.31 dump
32395 root     20  0 23256 1920  144 D  0.0  0.0 0:00.29 dump

(In reply to comment #76)
> > > JBD: barrier-based sync failed on xvdb - disabling barriers
> I don't get it in my logs as far as I can tell.
Er, yes I do. It has appeared in some of the ancient logs of my test system.

In 5 days of rebuilding the kernel, my test system crashed once as described in comment #76. So it is still pretty tough to reproduce this in test. My production system crashed 13 times during the same period.

One observation. The crash in comment #34 was a pivotal moment. It was the one that got this bug moved from a Fedora kernel (guest) to the enterprise kernel (the host). Do we really think this problem is in the host kernel or even Xen? KVM guests seem to also see something similar. Comment #34 was also interesting in that there was little network activity on the guest that crashed. My main crashing guest was also running on the host at the same time but it didn't crash (then).

Right, good question (where does this bug belong). If the problem on KVM hosts really seems to be the same (I'll try to reproduce that soon), then we can focus more on the guest. At this time though, I still want to dig deeper on what I saw in comment 36. I'll try to get to this asap, but I still have some other issues to knock off the TODO list first, sorry.

Drew

We have experienced the same problem: for high-CPU tasks, the PV F12 DomU works well; for high-I/O tasks, the F12 DomU crashes and the Dom0 also crashes. One oops message:

Sep 2 21:27:33 node04 kernel: INFO: task ext4-dio-unwrit:990 blocked for more than 120 seconds.
Sep 2 21:27:33 node04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 2 21:27:33 node04 kernel: ext4-dio-unwr D ffff8800157890c0 0 990 2 0x00000000 Sep 2 21:27:33 node04 kernel: ffff880012e39d40 0000000000000246 0000000003f44d9c 0000000000006ad1 Sep 2 21:27:33 node04 kernel: ffff8800124e44d8 ffff8800124e44a0 ffff880012e39fd8 ffff880012e39fd8 Sep 2 21:27:33 node04 kernel: ffff8800124e4870 0000000000007d30 000000000000b0c0 ffff8800124e4870 Sep 2 21:27:33 node04 kernel: Call Trace: Sep 2 21:27:33 node04 kernel: [<ffffffff8102d986>] ? set_next_entity+0x28/0x78 Sep 2 21:27:33 node04 kernel: [<ffffffff8140ae01>] __mutex_lock_common+0x11e/0x18d Sep 2 21:27:33 node04 kernel: [<ffffffff8140ae89>] __mutex_lock_slowpath+0x19/0x1b Sep 2 21:27:33 node04 kernel: [<ffffffff8140af9a>] mutex_lock+0x31/0x4b Sep 2 21:27:33 node04 kernel: [<ffffffff81157b29>] ext4_end_aio_dio_work+0x2c/0x6d Sep 2 21:27:33 node04 kernel: [<ffffffff81053705>] worker_thread+0x1c3/0x24f Sep 2 21:27:33 node04 kernel: [<ffffffff81157afd>] ? ext4_end_aio_dio_work+0x0/0x6d Sep 2 21:27:33 node04 kernel: [<ffffffff81057b54>] ? autoremove_wake_function+0x0/0x39 Sep 2 21:27:33 node04 kernel: [<ffffffff81053542>] ? worker_thread+0x0/0x24f Sep 2 21:27:33 node04 kernel: [<ffffffff81057711>] kthread+0x7f/0x87 Sep 2 21:27:33 node04 kernel: [<ffffffff8100ab2a>] child_rip+0xa/0x20 Sep 2 21:27:33 node04 kernel: [<ffffffff81057692>] ? kthread+0x0/0x87 Sep 2 21:27:33 node04 kernel: [<ffffffff8100ab20>] ? child_rip+0x0/0x20 Sep 2 21:27:33 node04 kernel: INFO: task tapdisk:18044 blocked for more than 120 seconds. Sep 2 21:27:33 node04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 2 21:27:33 node04 kernel: tapdisk D ffff8800157890c0 0 18044 1 0x00000080 Sep 2 21:27:33 node04 kernel: ffff88006cc29738 0000000000000286 ffff88006cc29708 ffff88007e3d4cb0 Sep 2 21:27:33 node04 kernel: ffff88007e3d4cb0 ffff880000000002 ffff88006cc29fd8 ffff88006cc29fd8 Sep 2 21:27:33 node04 kernel: ffff88004064c870 0000000000007d30 000000000000b0c0 ffff88004064c870 Sep 2 21:27:33 node04 kernel: Call Trace: Sep 2 21:27:33 node04 kernel: [<ffffffff81189bab>] do_get_write_access+0x200/0x40e Sep 2 21:27:33 node04 kernel: [<ffffffff81057b8d>] ? wake_bit_function+0x0/0x33 Sep 2 21:27:33 node04 kernel: [<ffffffff8118fc7e>] ? jbd2_journal_add_journal_head+0xbf/0x13d Sep 2 21:27:33 node04 kernel: [<ffffffff81189eff>] jbd2_journal_get_write_access+0x2b/0x44 Sep 2 21:27:33 node04 kernel: [<ffffffff811538ab>] ? ext4_read_block_bitmap+0x54/0x29c Sep 2 21:27:33 node04 kernel: [<ffffffff81176639>] __ext4_journal_get_write_access+0x56/0x60 Sep 2 21:27:33 node04 kernel: [<ffffffff811791e2>] ext4_mb_mark_diskspace_used+0x72/0x333 Sep 2 21:27:33 node04 kernel: [<ffffffff81178a48>] ? ext4_mb_initialize_context+0x7d/0x18d Sep 2 21:27:33 node04 kernel: [<ffffffff8117bfc8>] ext4_mb_new_blocks+0x1e8/0x42c Sep 2 21:27:33 node04 kernel: [<ffffffff811736c0>] ? ext4_ext_find_extent+0x130/0x27f Sep 2 21:27:33 node04 kernel: [<ffffffff81175945>] ext4_ext_get_blocks+0x1289/0x14e0 Sep 2 21:27:33 node04 kernel: [<ffffffff81021dd6>] ? xen_multicall_flush+0x2c/0x2e Sep 2 21:27:33 node04 kernel: [<ffffffff810be586>] ? unmap_vmas+0x663/0x7b8 Sep 2 21:27:33 node04 kernel: [<ffffffff81188f03>] ? 
start_this_handle+0x3dd/0x402 Sep 2 21:27:33 node04 kernel: [<ffffffff8115b011>] ext4_get_blocks+0x152/0x30f Sep 2 21:27:33 node04 kernel: [<ffffffff8115b246>] ext4_get_block_dio_write+0x78/0xb6 Sep 2 21:27:33 node04 kernel: [<ffffffff8110bc19>] __blockdev_direct_IO+0x521/0xae0 Sep 2 21:27:33 node04 kernel: [<ffffffff81159f76>] ext4_direct_IO+0x1b6/0x247 Sep 2 21:27:33 node04 kernel: [<ffffffff8115b1ce>] ? ext4_get_block_dio_write+0x0/0xb6 Sep 2 21:27:33 node04 kernel: [<ffffffff81157b6a>] ? ext4_end_io_dio+0x0/0x85 Sep 2 21:27:33 node04 kernel: [<ffffffff810a7cf6>] generic_file_direct_write+0xed/0x16d Sep 2 21:27:33 node04 kernel: [<ffffffff810a7f0c>] __generic_file_aio_write+0x196/0x286 Sep 2 21:27:33 node04 kernel: [<ffffffff810a805f>] generic_file_aio_write+0x63/0xad Sep 2 21:27:33 node04 kernel: [<ffffffff811549af>] ? ext4_file_write+0x0/0x95 Sep 2 21:27:33 node04 kernel: [<ffffffff81154a3d>] ext4_file_write+0x8e/0x95 Sep 2 21:27:33 node04 kernel: [<ffffffff810d6d97>] ? kmem_cache_alloc+0x47/0x140 Sep 2 21:27:33 node04 kernel: [<ffffffff81114e52>] aio_rw_vect_retry+0x8e/0x19d Sep 2 21:27:33 node04 kernel: [<ffffffff81114dc4>] ? aio_rw_vect_retry+0x0/0x19d Sep 2 21:27:33 node04 kernel: [<ffffffff811168e7>] aio_run_iocb+0x79/0x11f Sep 2 21:27:33 node04 kernel: [<ffffffff8111748d>] sys_io_submit+0x5db/0x6c6 Sep 2 21:27:33 node04 kernel: [<ffffffff8100a078>] system_call_fastpath+0x16/0x1b Sep 2 21:27:33 node04 kernel: [<ffffffff8100a010>] ? system_call+0x0/0x52 Another one: INFO: task tapdisk:12896 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. tapdisk D ffff8800157890c0 0 12896 1 0x00000080 ffff88007ce871a8 0000000000000282 0000000000000000 ffff880001c62dc0 0000001001080001 ffff8800586e27e0 ffff88007ce87fd8 ffff88007ce87fd8 ffff880001c63190 0000000000007d30 000000000000b0c0 ffff880001c63190 Call Trace: [<ffffffff8140a7fd>] io_schedule+0x73/0xb5 [<ffffffff811d624c>] get_request_wait+0xee/0x17c [<ffffffff81057b54>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811cddf4>] ? elv_merge+0x167/0x19f [<ffffffff811d65c2>] __make_request+0x2e8/0x3fb [<ffffffff811d4c1d>] generic_make_request+0x2c8/0x321 [<ffffffff810a9192>] ? mempool_alloc+0x6c/0x11e [<ffffffff81026966>] ? __wake_up_common+0x4e/0x84 [<ffffffff811d4d59>] submit_bio+0xe3/0x100 [<ffffffff811048c3>] submit_bh+0xf4/0x117 [<ffffffff811063b4>] __block_write_full_page+0x1d8/0x2b3 [<ffffffff8115bf23>] ? noalloc_get_block_write+0x0/0x60 [<ffffffff81105d70>] ? end_buffer_async_write+0x0/0x1ab [<ffffffff81105d70>] ? end_buffer_async_write+0x0/0x1ab [<ffffffff8115bf23>] ? noalloc_get_block_write+0x0/0x60 [<ffffffff81106518>] block_write_full_page_endio+0x89/0x95 [<ffffffff81106539>] block_write_full_page+0x15/0x17 [<ffffffff81158866>] ext4_writepage+0x2ac/0x2f0 [<ffffffff810b16f8>] pageout+0x137/0x1fe [<ffffffff810b1bba>] shrink_page_list+0x265/0x483 [<ffffffff810dae9a>] ? mem_cgroup_del_lru+0x39/0x3b [<ffffffff810b0e8f>] ? isolate_pages_global+0x1ba/0x201 [<ffffffff81008ad3>] ? __switch_to+0x452/0x56f [<ffffffff810b21a9>] shrink_inactive_list+0x3d1/0x6d2 [<ffffffff81057b4f>] ? wake_up_bit+0x25/0x2a [<ffffffff81036c87>] ? try_to_wake_up+0x2fc/0x30e [<ffffffff810b2825>] shrink_zone+0x37b/0x44b [<ffffffff810b2b17>] do_try_to_free_pages+0x222/0x399 [<ffffffff810b2d76>] try_to_free_pages+0x6e/0x70 [<ffffffff810b0cd5>] ? 
isolate_pages_global+0x0/0x201 [<ffffffff810ac912>] __alloc_pages_nodemask+0x3f3/0x63c [<ffffffff810d49dc>] alloc_slab_page+0x1e/0x20 [<ffffffff810d526e>] __slab_alloc+0x16b/0x488 [<ffffffff8110b812>] ? __blockdev_direct_IO+0x11a/0xae0 [<ffffffff8110b812>] ? __blockdev_direct_IO+0x11a/0xae0 [<ffffffff810d5b7f>] kmem_cache_alloc_notrace+0x7d/0xf2 [<ffffffff8110b812>] __blockdev_direct_IO+0x11a/0xae0 [<ffffffff810aeee4>] ? pagevec_lookup+0x22/0x2b [<ffffffff810afff5>] ? invalidate_inode_pages2_range+0x27e/0x2b7 [<ffffffff81159f76>] ext4_direct_IO+0x1b6/0x247 [<ffffffff8115b1ce>] ? ext4_get_block_dio_write+0x0/0xb6 [<ffffffff81157b6a>] ? ext4_end_io_dio+0x0/0x85 [<ffffffff810a7cf6>] generic_file_direct_write+0xed/0x16d [<ffffffff810a7f0c>] __generic_file_aio_write+0x196/0x286 [<ffffffff810f0803>] ? set_fd_set+0x50/0x5b [<ffffffff810f1858>] ? core_sys_select+0x1d6/0x20a [<ffffffff810a805f>] generic_file_aio_write+0x63/0xad [<ffffffff811549af>] ? ext4_file_write+0x0/0x95 [<ffffffff81154a3d>] ext4_file_write+0x8e/0x95 [<ffffffff810d6d97>] ? kmem_cache_alloc+0x47/0x140 [<ffffffff81114e52>] aio_rw_vect_retry+0x8e/0x19d [<ffffffff81114dc4>] ? aio_rw_vect_retry+0x0/0x19d [<ffffffff811168e7>] aio_run_iocb+0x79/0x11f [<ffffffff8111748d>] sys_io_submit+0x5db/0x6c6 [<ffffffff8100a078>] system_call_fastpath+0x16/0x1b [<ffffffff8100a010>] ? system_call+0x0/0x52

A brief list of the hardware and configuration:

Dom0's hardware platform:
Motherboard: INTEL S5500BC S5500 Quad Core Xeon Server Board
CPU: 2 x Intel Quad Core Xeon E5520 2.26G (5.86GT/sec,8M,Socket 1366)
Memory: 8 x Kingston DDR-3 1333MHz 4GB ECC REG. CL9 DIMM w/Parity & Thermal Sensor
HD: 4 x WD WD10EARS 1 TB, SATA II 3Gb/s, 64 MB Cache

Dom0: F12, with kernel 2.6.32.13 + 2.6.32 Xen patches v2 from http://gentoo-xen-kernel.googlecode.com/files/xen-patches-2.6.32-2.tar.bz2

Dom0 configuration:
Processor type and features --->
  [*] Symmetric multi-processing support
  [*] Support sparse irq numbering
  [*] Enable Xen compatible kernel
  Preemption Model (No Forced Preemption (Server)) --->
Device Drivers --->
  XEN --->
    [*] Privileged Guest (domain 0)
    <*> Backend driver support (NEW)
    <*> Block-device backend driver (NEW)
    <*> Block-device tap backend driver (NEW)
    < > Block-device tap backend driver 2 (NEW)
    <*> Network-device backend driver (NEW)
    (8) Maximum simultaneous transmit requests (as a power of 2) (NEW)
    [ ] Pipelined transmitter (DANGEROUS) (NEW)
    < > Network-device loopback driver (NEW)
    < > PCI-device backend driver (NEW)
    < > TPM-device backend driver (NEW)
    <M> SCSI backend driver (NEW)
    <M> USB backend driver (NEW)
    <M> Block-device frontend driver
    <M> Network-device frontend driver
    <M> Network-device frontend driver acceleration for Solarflare NICs (NEW)
    <M> SCSI frontend driver (NEW)
    <M> USB frontend driver (NEW)
    [*] Taking the HCD statistics (for debug) (NEW)
    [ ] HCD suspend/resume support (DO NOT USE) (NEW)
    <*> User-space granted page access driver (NEW)
    <*> Framebuffer-device frontend driver (NEW)
    <*> Keyboard-device frontend driver (NEW)
    [*] Disable serial port drivers (NEW)
    <*> Export Xen attributes in sysfs (NEW)
    (256) Number of guest devices (NEW)
    Xen version compatibility (no compatibility code) --->
    [*] Place shared vCPU info in per-CPU storage (NEW)

DomU: F12 with pv_ops kernel: 2.6.32.19-163.fc12.x86_64

The details of the hardware and kernel configuration are here: http://pkill.info/b/2405/setting-up-stable-xen-dom0-with-fedora-xen-3-4-3-with-xenified-linux-kernel-2-6-32-13-in-fedora-12/

Our solution now: Disable the hard disk cache by:

hdparm -W0 /dev/sdX
After disabling the hard disk cache, the VMs work. When we do a middle-sized intensive I/O test (processing 20GB of data simultaneously on 8 VMs on top of 2 physical machines), the VMs and Dom0 don't crash. We will run a full-size experiment (about 800GB of data on these 8 VMs) and see whether it works.

Some recent threads that look like this problem:

http://forums.citrix.com/thread.jspa?threadID=272708 (Xen block devices stop responding in guest)

That led to this kernel mailing list thread:

http://lkml.org/lkml/2010/9/1/178 (I/O scheduler deadlocks on Xen virtual block devices)

The main points from the kernel thread:
- The problem is probably in the Xen I/O stack and not the guest scheduler.
- These patches might be the real fixes:
  xen: use percpu interrupts for IPIs and VIRQs
  xen: handle events as edge-triggered

A work-around might be to disable irqbalance in the guest. This seems to have been the case so far in the citrix thread mentioned above. I'm now running my main crashing guest without irqbalance.

We have done experiments on LVM-backed DomUs. The DomUs having this problem are as in comment 80, and the other configurations (hardware, the Dom0, DomU's kernel, etc.) are the same except the backing VBD (moved from file / tap:aio to LVM-backed). The 8 LVM-backed DomUs work well so far with around 5GB of writes and 5GB of reads for each DomU. Hope this piece of info can help.

Hi Norman,

Great work tracking these mail threads and potentially finding a solution. I did a quick test on my machine. First I confirmed I could still reproduce it with a kernel compile loop with irqbalance on. It did reproduce in something less than 30 minutes. Then I turned off irqbalance and ran the kernel compile loop a few hours. Then I turned irqbalance back on to confirm I could again trigger it quickly, but I still haven't seen it in almost an hour, so currently my testing isn't super conclusive. I'm anxious to hear your results, since I know you've been plagued with it frequently.

Drew

Turning off irqbalance is looking very promising. I have not had a crash since doing so and the system has been through some significant workload.

Since the 23rd of March this year I've been recording (most) of the times of crashes. Rough stats are:

Shortest uptime: 6.683 minutes
Longest uptime: 8.885 days
Number of Crashes: 210
Crashes/day: 0.845

The 8 days was set during a period of very little activity. As I write this, 4 days of uptime has passed with a normal to large workload.

One note: during a particularly high load during this uptime, the message mentioned in comment #67 made a return:

Sep 20 11:42:55 turing kernel: hrtimer: interrupt took 940625 ns

On the disappointing side, I've still not been able to create a test system that will reliably crash.

I think we can be confident that irqbalance off does the trick. I ran the kernel compile loop with irqbalance off all weekend and it was still running this morning. I turned irqbalance back on and was able to reproduce within an hour or so.

Next up: backport the patches pointed to and try another round with irqbalance on.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
A work-around for this issue is to turn off irqbalance, i.e.
chkconfig --level 2345 irqbalance off
service irqbalance stop

I've been running the kernel compile loop on a RHEL6 guest patched with the two patches pointed to in comment 81 for over 30 hours (with irqbalance on, of course). I'm pretty confident that they fix the issue and will post them.

The "handle events as edge-triggered" patch, which is probably the one that truly fixes the issue, is already in Linus' tree, but only as of .36-rc3, which means it won't be in Fedora until F15.

I just saw that the 2 patches also made it into 2.6.32.22 and 2.6.35.5, so they should make it into Fedora 12 if the stable updates are pushed there. Fedora 13 is currently 2.6.34.7 so I'm not sure about that. See:

http://lwn.net/Articles/406311/
http://lwn.net/Articles/406312/

(In reply to comment #91)
> I just saw that the 2 patches also made it into 2.6.32.22 and 2.6.35.5
Ah, nice. Yeah, I forgot to check the stable trees to see what got pulled in by them.

This bug started out in the Fedora 12 kernel, moved around some, and ended up in the RHEL 6 kernel. Should this bug be cloned and changed to the F12 kernel to encourage the stable 2.6.32.22 updates to be applied there?

Right, agreed. Instead of cloning (since this bug is huge), I've just created a new bug pointing to the patches, bug 636534.

(In reply to comment #93)
> Should this bug be cloned and changed to the F12 kernel to encourage the stable
> 2.6.32.22 updates to be applied there?
They're already applied.

QA,

This patch needs some wide-spread testing on different machines. I saw no problems with it on my machine when testing, but there's been a report that it causes a boot crash. See Chuck's comment in bug 636534 comment 3.

Thanks,
Drew

(In reply to comment #96)
> This patch needs some wide-spread testing on different machines. I saw no
> problems with it on my machine when testing, but there's been a report that it
> causes a boot crash. See Chuck's comment in bug 636534 comment 3.
Seems there is a typo in some instances of the patch. See bug 636534 comment 4.

(In reply to comment #97)
> Seems there is a typo in some instances of the patch. See bug 636534 comment 4.
Cool, I just saw the message that there was a typo on xen-devel too. The patch I've posted doesn't have it, so we're good to go.

Thanks,
Drew

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

*** Bug 587265 has been marked as a duplicate of this bug. ***

Patch(es) available on kernel-2.6.32-83.el6

Just to confirm before QA acks this bug: could comment #90 be used as a case to verify this bug?

(In reply to comment #103)
> Just to confirm before QA ack this bug: Could comment #90 be used as a case to
> verify this bug?
It worked for me. Just using the system under load seemed to accelerate the chances of hitting it (with irqbalance on). However, it seems to be a bit different for each report. Try booting an unpatched kernel on a test machine, start a kernel compile loop with irqbalance on, and then just leave it. If it doesn't hit the bug within a day or two, then I'm not sure what to say, other than trying another machine.
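For anyone setting up that verification run, here is a minimal sketch of the kernel compile loop described above, with a check for D-state tasks. The kernel tree path and job count are assumptions; any sustained build on the affected file system should do.

  # the bug only reproduced with irqbalance running, so make sure it is on
  service irqbalance start

  cd /usr/src/kernels/linux-2.6.32   # hypothetical path to an unpacked kernel tree
  make defconfig > /dev/null         # assumes no .config is present yet
  while true; do
      make clean > /dev/null
      make -j4 > /dev/null 2>&1
      # report tasks stuck in uninterruptible sleep (D state)
      echo "$(date): $(ps -eo state= | grep -c '^D') tasks in D state"
      ps -eo pid,state,wchan:30,comm | awk '$2 == "D"'
  done

On an unpatched kernel the D-state count should start climbing within a day or two, much like the top output earlier in this report; on kernel-2.6.32-83.el6 the loop should keep running.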
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,3 +1 @@
-A work-around for this issue is to turn off irqbalance, i.e.
+In some cases, under a small system load involving some I/O operation, processes started to lock up in the D state (that is, became unresponsive). The system load could in some cases climb steadily. This was due to the way the event channel IRQ (Interrupt Request) was set up. Xen events behave like edge-triggered IRQs; however, the kernel was setting them up as level-triggered IRQs. As a result, any action using Xen event channels could lock up a process in the D state. With this update, the handling has been changed from level-triggered to edge-triggered IRQs and processes no longer lock up in the D state.
-chkconfig --level 2345 irqbalance off
-service irqbalance stop

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
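As a quick check on a guest, before reopening this report, it may be worth confirming that the running kernel is at or past the fixed build. The changelog grep below is only a heuristic; the exact changelog wording is an assumption.

  uname -r        # the fix is in kernel-2.6.32-83.el6 and later
  rpm -q kernel   # installed kernel packages
  rpm -q --changelog kernel | grep -i edge | head   # look for the edge-triggered event channel fix (wording may differ)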
Created attachment 380486 [details]
A terminal window with top running shows the progressively degrading state.

Description of problem:
Sometimes, under a small load, processes start locking up in D state. Load climbs steadily. Reboot required.

Version-Release number of selected component (if applicable):
Happens with both:
kernel-2.6.31.6-166.fc12.x86_64
kernel-2.6.31.9-174.fc12.x86_64

How reproducible:
Frequency is roughly every 24 hours. Happens when several LTSP sessions are started, or sometimes during logwatch processing, or during rsync backups.

Additional info:
System is a large Xen DomU. Dom0 is able to destroy and re-create the system.
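One loose end from the clocksource/nohz discussion above was how to confirm that boot options such as clocksource=tsc, nohz=off and highres=off actually took effect. A minimal check, assuming a pv_ops guest with the usual sysfs layout, might look like:

  cat /proc/cmdline    # the options must at least appear here
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  dmesg | grep -i 'switching to clocksource'
  dmesg | grep -iE 'nohz|tick'   # the exact wording of the NO_HZ/tick boot messages varies by kernel

If current_clocksource still reads xen, or the "Switching to clocksource" line names a different source, the clocksource= option was ignored.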