---Problem Description--- Test suite "bonnie" generates call trace with RHEL 5.6 guest ---uname output--- Linux mx3650b.in.ibm.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux Machine Type = HS22 ---Steps to Reproduce--- 1) Install rhel5.6 (64bit) on rhel5.6 host 2) run "bonnie" testsuite in the guest. ---KVM Component Data--- RHEV Build date: Red Hat Enterprise Linux Server release 5.6 (Tikanga) Kernel \r on an \m RHEL Kernel version: Linux mx3650b.in.ibm.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux Qemu version: QEMU PC emulator version 0.9.1 (kvm-83-maint-snapshot-20090205), Copyright (c) 2003-2008 Fabrice Bellard KVM version: filename: /lib/modules/2.6.18-238.el5/weak-updates/kmod-kvm/kvm.ko license: GPL author: Qumranet version: kvm-83-224.el5 srcversion: DFFDF9C90F86F76B4F0935A depends: vermagic: 2.6.18-237.el5 SMP mod_unload gcc-4.1 parm: oos_shadow:bool parm: force_kvmclock:bool Guest OS: RHEL 5.6 32-bit Guest OS Image storage type: localfile Host Machine Type: HS22 Test Type: automated Qemu Command Line: /usr/libexec/qemu-kvm -name 'vm1' -monitor unix:'/tmp/monitor-humanmonitor1-20110125-191701-mD75',server,nowait -serial unix:'/tmp/serial-20110125-191701-mD75',server,nowait -drive file='/tmp/kvm_autotest_root/images/rhel5-64.qcow2',index=0,if=virtio,cache=none,boot=on -net nic,vlan=0,model=virtio,macaddr='9a:f8:23:cf:93:08' -net tap,vlan=0,ifname='t0-191701-mD75',script='/root/autotest/client/tests/kvm/scripts/qemu-ifup',downscript='no' -m 4096 -smp 4 -redir tcp:5000::22 -vnc :2 Call trace ------------- Jan 24 23:10:31 localhost kernel: psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization, throwing 1 bytes away. Jan 24 23:10:31 localhost kernel: psmouse.c: failed to re-enable mouse on isa0060/serio1 Jan 24 23:10:31 localhost kernel: psmouse.c: resync failed, issuing reconnect request Jan 24 23:10:31 localhost kernel: INFO: task syslogd:1995 blocked for more than 120 seconds. Jan 24 23:10:31 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 24 23:10:31 localhost kernel: syslogd D ffff81013b97f7b0 0 1995 1 1998 1987 (NOTLB) Jan 24 23:10:31 localhost kernel: ffff8101346add88 0000000000000082 0000000000000296 0000000000000003 Jan 24 23:10:31 localhost kernel: ffff8101346add18 0000000000000009 ffff81013e8bb040 ffff81013e8207e0 Jan 24 23:10:31 localhost kernel: 00000024f11a0b5d 0000000000014838 ffff81013e8bb228 000000033b97f7b0 Jan 24 23:10:31 localhost kernel: Call Trace: Jan 24 23:10:31 localhost kernel: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5 Jan 24 23:10:31 localhost kernel: [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e Jan 24 23:10:31 localhost kernel: [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff Jan 24 23:10:31 localhost kernel: [<ffffffff8002fcd8>] __writeback_single_inode+0x1d9/0x318 Jan 24 23:10:31 localhost kernel: [<ffffffff800e2b79>] do_readv_writev+0x26e/0x291 Jan 24 23:10:31 localhost kernel: [<ffffffff800f5992>] sync_inode+0x24/0x33 Jan 24 23:10:31 localhost kernel: [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8 Jan 24 23:10:31 localhost kernel: [<ffffffff80050310>] do_fsync+0x52/0xa4 Jan 24 23:10:31 localhost kernel: [<ffffffff800e33fe>] __do_fsync+0x23/0x36 Jan 24 23:10:31 localhost kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Jan 24 23:10:31 localhost kernel: Jan 24 23:10:59 localhost shutdown[4088]: shutting down for system halt bonnie is part of autotest testsuite, can be downloaded from the link http://autotest.kernel.org/wiki/Download We get the folloing error message on starting RHEL 5.6 guest on RHEL 6 host. The folloing error message occurs instead of the call trace hdc: drive_cmd: status=0x41 { DriveReady Error } hdc: drive_cmd: error=0x04 { AbortedCommand } ide: failed opcode was: 0xec Bonnie failed to run with RHEL5.6 guest on RHEL6 host -Server architecture(s) (x86/POWER6/Z/etc.): x86 -Server type (9117-MMA/HS20/s390/etc.): hs22 -General component (desktop/kernel/base OS/dev tools/etc.): -Kernel: Linux mx3650b.in.ibm.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux -Other components involved (ixgbe/java/emulex/etc.): qemu, kvm -Does the server have the latest GA firmware? yes -Has the problem been shown to occur on more than one system? All x86-64 machines -Is a tested patch available? no -What is the latest official Red Hat build on which this bug has been? RHEL6 GA, RHEL5.6 GA host -Attach sosreport
Created attachment 486954 [details] sos report for rhel5.6 guest running on rhel6 host
Did you have the chance to test on RHEL6 host as well ?
I'm a bit confused, as this bug seems to be two bugs in one report: - on a rhel5.6 host we get a softlockup in ext3's fsync. Given how slow ext3 fsync is that doesn't surprise me in general. - on a rhel6 host you get a warning from the ATA driver. That's the same guest migrated to a rhel6 host, or a different installation of the same guest image? Note that the 0ex ATA command is WIN_IDENTIFY, which is something we normally don't do under load.
Also note that the config seems to use cache=writethrough, which could explain all kinds of timeouts if backed by ext3's incredibly slow O_SYNC code. Does the problem go away if you add cache=none to the qemu command line, or use a different filesystem like xfs or ext4?
can reproduce with cache=none 1. cmd /usr/libexec/qemu-kvm -monitor stdio -serial unix:'/tmp/serial-20110718-171628-6Bso',server,nowait -drive file='/home/images/RHEL-Server-5.7-64-virtio.qcow2',index=0,if=ide,media=disk,cache=none,format=qcow2 -net nic,vlan=0,model=rtl8139,macaddr='9a:ba:99:1b:74:e0' -net tap,vlan=0,ifname='t0-171628-6Bso',script='/home/qemu-ifup-switch',downscript='no' -m 2048 -smp 4,cores=1,threads=1,sockets=4 -cpu qemu64,+sse2 -soundhw ac97 -vnc :0 -rtc-td-hack -M rhel5.6.0 -boot c -usbdevice tablet -no-kvm-pit-reinjection 2. host 2.6.18-274.el5 kvm-83-239.el5 3. guest rhel5.7.64 4. INFO: task syslogd:2377 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syslogd D ffff810002536420 0 2377 1 2380 2364 (NOTLB) ffff810074ad1d98 0000000000000082 ffff810074ad1d18 0000000000000246 0000000000000000 0000000000000009 ffff81007d248080 ffffffff80314b60 0000037ef15f64ad 00000000000058a9 ffff81007d248268 00000000880317be Call Trace: [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e [<ffffffff8803179a>] :jbd:journal_stop+0x1d3/0x203 [<ffffffff8002fa5f>] __writeback_single_inode+0x1dd/0x31c [<ffffffff800e3463>] do_readv_writev+0x26e/0x291 [<ffffffff800f64f4>] sync_inode+0x24/0x33 [<ffffffff8804c37e>] :ext3:ext3_sync_file+0xce/0xf8 [<ffffffff8004fe95>] do_fsync+0x52/0xa4 [<ffffffff800e3cf0>] __do_fsync+0x23/0x36 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task syslogd:2377 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syslogd D ffff810002536420 0 2377 1 2380 2364 (NOTLB) ffff810074ad1d98 0000000000000082 ffff810074ad1d18 0000000000000246 0000000000000000 0000000000000009 ffff81007d248080 ffffffff80314b60 0000037ef15f64ad 00000000000058a9 ffff81007d248268 00000000880317be Call Trace: [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e [<ffffffff8803179a>] :jbd:journal_stop+0x1d3/0x203 [<ffffffff8002fa5f>] __writeback_single_inode+0x1dd/0x31c [<ffffffff800e3463>] do_readv_writev+0x26e/0x291 [<ffffffff800f64f4>] sync_inode+0x24/0x33 [<ffffffff8804c37e>] :ext3:ext3_sync_file+0xce/0xf8 [<ffffffff8004fe95>] do_fsync+0x52/0xa4 [<ffffffff800e3cf0>] __do_fsync+0x23/0x36 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Just to clarify things: these warnings are in the guest or in the host? Either way it just seems like ext3 fsync beeing the pain it is and causing soft lockup warnings. I can't see how this is related to kvm, except for an added layer of indirection not helping to improve an already painful experience. I would recommend not using ext3 for virt setups, but if you really strongly care about it open a bug against the ext3 driver in the kernel
Moved to the kernel component per chellwig recommendation
------- Comment From prem.karat.ibm.com 2011-08-30 01:31 EDT------- (In reply to comment #25) > Moved to the kernel component per chellwig recommendation Hi Redhat, Did you get a chance to look into this? Cheers, Prem
------- Comment From prem.karat.ibm.com 2011-09-13 09:02 EDT------- Hi Redhat, We are awaiting your response on this. Cheers, Prem
(In reply to comment #10) > ------- Comment From prem.karat.ibm.com 2011-09-13 09:02 EDT------- > Hi Redhat, > > We are awaiting your response on this. > > Cheers, > Prem Christoph was asking you to try our ext4 instead. It seems it is not a hypervisor issue but a file system one.
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.8 and Red Hat does not plan to fix this issue the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
------- Comment From prem.karat.ibm.com 2012-01-09 23:38 EDT------- Hi RH, You can close this request. -prem