Bug 690039

Summary: Test suite "bonnie" generates call trace with RHEL 5.6 guest
Product: Red Hat Enterprise Linux 5 Reporter: IBM Bug Proxy <bugproxy>
Component: kernelAssignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED WONTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 5.6CC: akong, gcosta, jasowang, kwolf, leiwang, mkenneth, qwan, rhod, shuang, tburke, virt-maint, yuzhang
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-01-12 15:58:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 580948    
Attachments:
Description Flags
sos report for rhel5.6 guest running on rhel6 host none

Description IBM Bug Proxy 2011-03-23 05:40:53 UTC
---Problem Description---
Test suite "bonnie" generates call trace with  RHEL 5.6 guest

---uname output---
Linux mx3650b.in.ibm.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64
GNU/Linux
Machine Type = HS22 

---Steps to Reproduce---
1) Install rhel5.6 (64bit) on rhel5.6 host
2) run "bonnie" testsuite in the guest.

---KVM Component Data---
RHEV Build date:
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
Kernel \r on an \m

RHEL Kernel version:
 Linux mx3650b.in.ibm.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010
x86_64 x86_64 x86_64 GNU/Linux

Qemu version: 
QEMU PC emulator version 0.9.1 (kvm-83-maint-snapshot-20090205), Copyright (c)
2003-2008 Fabrice Bellard

KVM version: filename:   
/lib/modules/2.6.18-238.el5/weak-updates/kmod-kvm/kvm.ko
license:        GPL
author:         Qumranet
version:        kvm-83-224.el5
srcversion:     DFFDF9C90F86F76B4F0935A
depends:        
vermagic:       2.6.18-237.el5 SMP mod_unload gcc-4.1
parm:           oos_shadow:bool
parm:           force_kvmclock:bool

Guest OS:  RHEL 5.6 32-bit
Guest OS Image storage type: localfile
Host Machine Type: HS22
Test Type: automated
Qemu Command Line:
 /usr/libexec/qemu-kvm -name 'vm1' -monitor
unix:'/tmp/monitor-humanmonitor1-20110125-191701-mD75',server,nowait -serial
unix:'/tmp/serial-20110125-191701-mD75',server,nowait -drive
file='/tmp/kvm_autotest_root/images/rhel5-64.qcow2',index=0,if=virtio,cache=none,boot=on
-net nic,vlan=0,model=virtio,macaddr='9a:f8:23:cf:93:08' -net
tap,vlan=0,ifname='t0-191701-mD75',script='/root/autotest/client/tests/kvm/scripts/qemu-ifup',downscript='no'
-m 4096 -smp 4 -redir tcp:5000::22 -vnc :2

Call trace
-------------
Jan 24 23:10:31 localhost kernel: psmouse.c: Explorer Mouse at
isa0060/serio1/input0 lost synchronization, throwing 1 bytes away.
Jan 24 23:10:31 localhost kernel: psmouse.c: failed to re-enable mouse on
isa0060/serio1
Jan 24 23:10:31 localhost kernel: psmouse.c: resync failed, issuing reconnect
request
Jan 24 23:10:31 localhost kernel: INFO: task syslogd:1995 blocked for more than
120 seconds.
Jan 24 23:10:31 localhost kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 24 23:10:31 localhost kernel: syslogd       D ffff81013b97f7b0     0  1995    1          1998 
1987 (NOTLB)
Jan 24 23:10:31 localhost kernel:  ffff8101346add88 0000000000000082    0000000000000296
0000000000000003
Jan 24 23:10:31 localhost kernel:  ffff8101346add18 0000000000000009    ffff81013e8bb040
ffff81013e8207e0
Jan 24 23:10:31 localhost kernel:  00000024f11a0b5d 0000000000014838 ffff81013e8bb228 000000033b97f7b0
Jan 24 23:10:31 localhost kernel: Call Trace:
Jan 24 23:10:31 localhost kernel:  [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
Jan 24 23:10:31 localhost kernel:  [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
Jan 24 23:10:31 localhost kernel:  [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
Jan 24 23:10:31 localhost kernel:  [<ffffffff8002fcd8>] __writeback_single_inode+0x1d9/0x318
Jan 24 23:10:31 localhost kernel:  [<ffffffff800e2b79>] do_readv_writev+0x26e/0x291
Jan 24 23:10:31 localhost kernel:  [<ffffffff800f5992>] sync_inode+0x24/0x33
Jan 24 23:10:31 localhost kernel:  [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8
Jan 24 23:10:31 localhost kernel:  [<ffffffff80050310>] do_fsync+0x52/0xa4
Jan 24 23:10:31 localhost kernel:  [<ffffffff800e33fe>] __do_fsync+0x23/0x36
Jan 24 23:10:31 localhost kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jan 24 23:10:31 localhost kernel:
Jan 24 23:10:59 localhost shutdown[4088]: shutting down for system halt


bonnie is part of autotest testsuite, can be downloaded from the link
http://autotest.kernel.org/wiki/Download

We get the folloing error message on starting RHEL 5.6 guest on RHEL 6 host.

The folloing error message occurs instead of the call trace
hdc: drive_cmd: status=0x41 { DriveReady Error }
hdc: drive_cmd: error=0x04 { AbortedCommand }
ide: failed opcode was: 0xec

Bonnie failed to run with RHEL5.6 guest on RHEL6 host 


-Server architecture(s) (x86/POWER6/Z/etc.):  x86
-Server type (9117-MMA/HS20/s390/etc.):  hs22
-General component (desktop/kernel/base OS/dev tools/etc.):  
-Kernel: Linux mx3650b.in.ibm.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64
x86_64 GNU/Linux
-Other components involved (ixgbe/java/emulex/etc.): qemu, kvm
-Does the server have the latest GA firmware? yes
-Has the problem been shown to occur on more than one system? All x86-64 machines
-Is a tested patch available? no
-What is the latest official Red Hat build on which this bug has been? RHEL6 GA, RHEL5.6 GA host
-Attach sosreport

Comment 1 IBM Bug Proxy 2011-03-23 05:41:07 UTC
Created attachment 486954 [details]
sos report for rhel5.6 guest running on rhel6 host

Comment 2 Glauber Costa 2011-03-23 13:07:38 UTC
Did you have the chance to test on RHEL6 host as well ?

Comment 4 chellwig@redhat.com 2011-03-23 19:26:08 UTC
I'm a bit confused, as this bug seems to be two bugs in one report:

 - on a rhel5.6 host we get a softlockup in ext3's fsync.  Given how slow ext3 fsync is that doesn't surprise me in general.

 - on a rhel6 host you get a warning from the ATA driver.  That's the same guest migrated to a rhel6 host, or a different installation of the same guest image?

Note that the 0ex ATA command is WIN_IDENTIFY, which is something we normally don't do under load.

Comment 5 chellwig@redhat.com 2011-03-23 19:28:19 UTC
Also note that the config seems to use cache=writethrough, which could explain all kinds of timeouts if backed by ext3's incredibly slow O_SYNC code.

Does the problem go away if you add cache=none to the qemu command line, or use a different filesystem like xfs or ext4?

Comment 6 Suqin Huang 2011-07-19 08:40:46 UTC
can reproduce with cache=none

1. cmd
/usr/libexec/qemu-kvm -monitor stdio -serial unix:'/tmp/serial-20110718-171628-6Bso',server,nowait -drive file='/home/images/RHEL-Server-5.7-64-virtio.qcow2',index=0,if=ide,media=disk,cache=none,format=qcow2 -net nic,vlan=0,model=rtl8139,macaddr='9a:ba:99:1b:74:e0' -net tap,vlan=0,ifname='t0-171628-6Bso',script='/home/qemu-ifup-switch',downscript='no' -m 2048 -smp 4,cores=1,threads=1,sockets=4 -cpu qemu64,+sse2 -soundhw ac97 -vnc :0 -rtc-td-hack -M rhel5.6.0 -boot c  -usbdevice tablet -no-kvm-pit-reinjection

2. host
2.6.18-274.el5
kvm-83-239.el5

3. guest
rhel5.7.64

4. 

INFO: task syslogd:2377 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D ffff810002536420     0  2377      1          2380  2364 (NOTLB)
 ffff810074ad1d98 0000000000000082 ffff810074ad1d18 0000000000000246
 0000000000000000 0000000000000009 ffff81007d248080 ffffffff80314b60
 0000037ef15f64ad 00000000000058a9 ffff81007d248268 00000000880317be
Call Trace:
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803179a>] :jbd:journal_stop+0x1d3/0x203
 [<ffffffff8002fa5f>] __writeback_single_inode+0x1dd/0x31c
 [<ffffffff800e3463>] do_readv_writev+0x26e/0x291
 [<ffffffff800f64f4>] sync_inode+0x24/0x33
 [<ffffffff8804c37e>] :ext3:ext3_sync_file+0xce/0xf8
 [<ffffffff8004fe95>] do_fsync+0x52/0xa4
 [<ffffffff800e3cf0>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task syslogd:2377 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D ffff810002536420     0  2377      1          2380  2364 (NOTLB)
 ffff810074ad1d98 0000000000000082 ffff810074ad1d18 0000000000000246
 0000000000000000 0000000000000009 ffff81007d248080 ffffffff80314b60
 0000037ef15f64ad 00000000000058a9 ffff81007d248268 00000000880317be
Call Trace:
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803179a>] :jbd:journal_stop+0x1d3/0x203
 [<ffffffff8002fa5f>] __writeback_single_inode+0x1dd/0x31c
 [<ffffffff800e3463>] do_readv_writev+0x26e/0x291
 [<ffffffff800f64f4>] sync_inode+0x24/0x33
 [<ffffffff8804c37e>] :ext3:ext3_sync_file+0xce/0xf8
 [<ffffffff8004fe95>] do_fsync+0x52/0xa4
 [<ffffffff800e3cf0>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 7 chellwig@redhat.com 2011-07-28 11:18:30 UTC
Just to clarify things: these warnings are in the guest or in the host?

Either way it just seems like ext3 fsync beeing the pain it is and causing soft lockup warnings.  I can't see how this is related to kvm, except for an added layer of indirection not helping to improve an already painful experience.

I would recommend not using ext3 for virt setups, but if you really strongly care about it open a bug against the ext3 driver in the kernel

Comment 8 Dor Laor 2011-08-17 00:01:11 UTC
Moved to the kernel component per chellwig recommendation

Comment 9 IBM Bug Proxy 2011-08-30 05:40:58 UTC
------- Comment From prem.karat.ibm.com 2011-08-30 01:31 EDT-------
(In reply to comment #25)
> Moved to the kernel component per chellwig recommendation

Hi Redhat,

Did you get a chance to look into this?

Cheers,
Prem

Comment 10 IBM Bug Proxy 2011-09-13 13:13:26 UTC
------- Comment From prem.karat.ibm.com 2011-09-13 09:02 EDT-------
Hi Redhat,

We are awaiting your response on this.

Cheers,
Prem

Comment 11 Dor Laor 2011-09-19 12:15:48 UTC
(In reply to comment #10)
> ------- Comment From prem.karat.ibm.com 2011-09-13 09:02 EDT-------
> Hi Redhat,
> 
> We are awaiting your response on this.
> 
> Cheers,
> Prem

Christoph was asking you to try our ext4 instead. It seems it is not a hypervisor issue but a file system one.

Comment 12 RHEL Program Management 2012-01-09 14:22:09 UTC
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.8 and Red Hat does not plan to fix this issue the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 13 IBM Bug Proxy 2012-01-10 04:41:15 UTC
------- Comment From prem.karat.ibm.com 2012-01-09 23:38 EDT-------
Hi RH,

You can close this request.

-prem