Bug 460693 - Xen domU, RAID1, LVM, iscsi target export with blockio bug
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: i386 Linux
Priority: medium
Severity: high
Target Milestone: rc
Assigned To: Chris Lalancette
QA Contact: Red Hat Kernel QE team
Blocks: 490148 492568 512913
Reported: 2008-08-29 14:42 EDT by Nenad Opsenica
Modified: 2010-12-22 04:04 EST
CC: 8 users
Doc Type: Bug Fix
Clones: 490148
Last Closed: 2009-09-02 04:40:45 EDT

Attachments
Backport of upstream Linux 9e973e64ac6dc504e6447d52193d4fff1a670156 (2.86 KB, patch)
2009-03-01 13:05 EST, Chris Lalancette

Description Nenad Opsenica 2008-08-29 14:42:45 EDT
Description of problem:

My goal was to export, over iSCSI, individual logical volumes carved out of a software RAID1 device created inside a domU.
The RAID members are two dom0 logical volumes, passed to the domU as block devices. The RAID1 device, /dev/md0, is created inside the domU; a PV, a VG, and LVs are then created on top of /dev/md0. Individual logical volumes from /dev/md0 are exported through iSCSI target software in "blockio" mode.
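
A rough sketch of the whole stack, using the device and volume names from the configuration shown under "Additional info" below (sizes are arbitrary examples):

	# dom0: create the two LVs that become the RAID members
	lvcreate -L 10G -n test1 containers
	lvcreate -L 10G -n test2 containers
	# (they reach the domU as sdc and sdd via 'phy:' entries in the domU config)

	# domU: assemble the RAID1 and layer LVM on top
	mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
	pvcreate /dev/md0
	vgcreate somevg /dev/md0
	lvcreate -L 5G -n somelv somevg
	# /dev/somevg/somelv is then exported in BLOCKIO mode (see scst.conf below)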

After starting the iSCSI target software, connecting to the targets from another computer succeeds, but creating a filesystem triggers a BUG in the domU's blkfront.c, as does writing a larger amount of data (~128 MB) to the target with dd. If a filesystem already exists on the target, mounting it succeeds, but writing a large amount of data to it with

	dd if=/dev/zero of=dummy-file-1 bs=1024 count=$[1024*512]
	
triggers the same bug.

When the iSCSI target software uses "fileio" mode, everything works fine.
Exporting the whole /dev/md0 as an iSCSI target also works.

The same thing happens with both iSCSI Enterprise Target (IET) and iscsi-scst.




Version-Release number of selected component (if applicable):
(CentOS 5.2)
kernel-xen-2.6.18-92.1.6.el5
xen-3.0.3-64.el5_2.1

iscsitarget-0.4.16

or

scst-1.0.0-2.6.18.92.1.6.el5xen, iscsi-scst-1.0.0



How reproducible:
Easy.



Steps to Reproduce:
1. make two LVs in dom0 and pass them to the domU
2. inside the domU, make a RAID1 /dev/md0 consisting of these two devices
3. create logical volumes on /dev/md0
4. export the logical volumes as separate iSCSI targets, in "blockio" mode
5. connect to the iSCSI target(s) from another computer
6. try to write a large amount of data to the iSCSI target(s), with either mkfs or dd (see the sketch below)
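
A minimal sketch of steps 5 and 6 from the initiator side, using open-iscsi (the portal IP is a placeholder; the IQN is the one from the iscsi-scst.conf shown below):

	# discover and log in to the target (portal IP is an example)
	iscsiadm -m discovery -t sendtargets -p 192.168.1.10
	iscsiadm -m node -T iqn.2008-06.net.panline:sometarget -p 192.168.1.10 --login
	# either of these is enough to trigger the BUG in the domU:
	mkfs.ext3 /dev/sdX                                   # sdX = the new iSCSI disk
	dd if=/dev/zero of=/dev/sdX bs=1024 count=$[1024*512]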


  
Actual results:
The BUG fires in the domU that is running the iSCSI target software, and the domU reboots:


------------[ cut here ]------------
kernel BUG at drivers/xen/blkfront/blkfront.c:567!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /block/ram0/dev
Modules linked in: iscsi_scst(FU) scst_disk(U) scst_vdisk(U) scst(U) iscsi_tcp(U) libiscsi(U) scsi_transport_iscsi(U) scsi_mod lock_dlm gfs2(U) dlm configfs ipv6 xfrm_nalgo crypto_api dm_multipath raid1 parport_pc lp parport pcspkr xenblk xennet dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0061:[<da0b8704>]    Tainted: GF     VLI
EFLAGS: 00010046   (2.6.18-92.1.6.el5xen #1)
EIP is at do_blkif_request+0x182/0x37b [xenblk]
eax: 0000000c   ebx: c0dd17e0   ecx: 00000008   edx: 0000000b
esi: 00000000   edi: 0000bc26   ebp: d8f97628   esp: c0c52dec
ds: 007b   es: 007b   ss: 0069
Process md0_raid1 (pid: 2344, ti=c0c52000 task=d4fbc000 task.ti=c0c52000)
Stack: d8f6e468 c08ec000 d8f6abe4 00000003 c08ec000 00000001 00000177 00000000
       d261fec4 c0dd17e0 00000008 00000000 0000000b ffffffff d8f6e468 d8f6e468
       00000000 00000060 c04d5418 d8f6abe4 c04d7530 00000000 00001000 c0660000
Call Trace:
 [<c04d5418>] __generic_unplug_device+0x1d/0x1f
 [<c04d7530>] __make_request+0x31d/0x36a
 [<c04d4824>] generic_make_request+0x248/0x258
 [<c059f59e>] bitmap_unplug+0x135/0x14c
 [<c0429828>] del_timer+0x41/0x47
 [<da0bfa0b>] raid1d+0xec/0xc44 [raid1]
 [<c0607bac>] schedule+0x718/0x7cd
 [<c0607be0>] schedule+0x74c/0x7cd
 [<c0609350>] _spin_lock_irqsave+0x8/0x28
 [<c042914a>] lock_timer_base+0x15/0x2f
 [<c04291a8>] try_to_del_timer_sync+0x44/0x4a
 [<c04291b8>] del_timer_sync+0xa/0x14
 [<c060832e>] schedule_timeout+0x78/0x8c
 [<c0609350>] _spin_lock_irqsave+0x8/0x28
 [<c059c2c1>] md_thread+0xdf/0xf5
 [<c043190f>] autoremove_wake_function+0x0/0x2d
 [<c059c1e2>] md_thread+0x0/0xf5
 [<c043184d>] kthread+0xc0/0xeb
 [<c043178d>] kthread+0x0/0xeb
 [<c0403005>] kernel_thread_helper+0x5/0xb
 =======================
Code: 0f b7 5b 1a 6b c3 0c 89 5c 24 2c 89 44 24 20 8b 52 30 c7 44 24 30 00 00 00 00 01 d0




Expected results:
No bug :)




Additional info:
domU disk configuration is:
	disk = [ 'phy:/dev/system/root.container1,sda1,w',
	         'phy:/dev/system/swap.container1,sda2,w',
	         'phy:/dev/containers/test1,sdc,w',
	         'phy:/dev/containers/test2,sdd,w' ]



/etc/scst.conf file:

[HANDLER vdisk]
DEVICE SOMETARGET,/dev/somevg/somelv,BLOCKIO,512

#[ASSIGNMENT Default]
DEVICE SOMETARGET,0





/etc/iscsi-scst.conf file:

Target iqn.2008-06.net.panline:sometarget
        # Users, who can access this target. The same rules as for discovery
        # users apply here.
        # Leave them alone if you don't want to use authentication.
        #IncomingUser joe secret
        #OutgoingUser jim 12charpasswd
        # Alias name for this target
        # Alias Test
        # various iSCSI parameters
        # (not all are used right now, see also iSCSI spec for details)
        #MaxConnections         1
        InitialR2T              No
        ImmediateData           Yes
        MaxRecvDataSegmentLength 1048576
        MaxXmitDataSegmentLength 1048576
        MaxBurstLength          1048576
        FirstBurstLength        1048576
        #DefaultTime2Wait       2
        #DefaultTime2Retain     20
        #MaxOutstandingR2T      20
        #DataPDUInOrder         Yes
        #DataSequenceInOrder    Yes
        #ErrorRecoveryLevel     0
        #HeaderDigest           CRC32C,None
        #DataDigest             CRC32C,None
        # various target parameters
        #QueuedCommands         32
Comment 1 Issue Tracker 2009-01-06 18:31:39 EST
In IT234267, the customer is experiencing occasional crashes while
installing a DomU.  All of the crashes go through the following code
path:

 [<c04d51b8>] __generic_unplug_device+0x1d/0x1f
 [<c04d5f0d>] generic_unplug_device+0x15/0x25
 [<ed1ef2b4>] unplug_slaves+0x4f/0x83 [raid1]
 [<ed1ef300>] raid1_unplug+0xe/0x1a [raid1]
 [<ed247840>] dm_table_unplug_all+0x22/0x2e [dm_mod]
 [<ed245c79>] dm_unplug_all+0x17/0x21 [dm_mod]
 [<c04d7373>] blk_backing_dev_unplug+0x56/0x5d
 [<c044e5c4>] sync_page+0x0/0x3b
 [<c046e748>] block_sync_page+0x31/0x32
 [<c044e5f7>] sync_page+0x33/0x3b
 [<c060811e>] __wait_on_bit_lock+0x2a/0x52
 [<c044e537>] __lock_page+0x52/0x59
 [<c043192c>] wake_bit_function+0x0/0x3c
 [<c0450f4b>] filemap_nopage+0x22e/0x313

I believe the underlying cause in IT234267 is the same as the one in this
BZ.


Bill


Issue escalated to RHEL 5 Kernel by: bbraswel.
Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by bbraswel 
 issue 234267
Comment 2 Chris Lalancette 2009-02-04 16:49:59 EST
FYI: this problem *may* be solved by the upstream patch posted here:

http://lists.xensource.com/archives/html/xen-devel/2009-02/msg00117.html

Chris Lalancette
Comment 3 Chris Lalancette 2009-02-05 08:35:19 EST
I've done a quick port of that upstream change to the RHEL-5 kernel, and did a quick test here.  Could someone who can reproduce the error (I wasn't able to) download the kernel at:

http://new-people.redhat.com/clalance/bz460693

and see if it fixes the issue for them?

Chris Lalancette
Comment 4 Nenad Opsenica 2009-02-05 08:56:42 EST
I will test the kernel later today or tomorrow morning.
Comment 5 Nenad Opsenica 2009-02-06 09:20:25 EST
Oops, I was not able to reproduce the error either. It looks like something in my test configuration has changed in the last 6 months. I will try several other tests, but I'm not sure this will lead to anything particularly useful. :(
Comment 6 Chris Lalancette 2009-02-06 09:29:57 EST
Ah, OK.  Thanks for trying; I appreciate the effort.  If you *do* get some result, please be sure to report it here.

In the meantime, there were a couple of other people who had reported problems in this area, so I'm hoping one of them can reproduce the error and try this test patch out.

Thanks again!
Chris Lalancette
Comment 7 Chris Lalancette 2009-02-12 08:59:33 EST
For anyone else (hint, hint) who was having problems with this bug, I've folded this patch into the main virttest kernels, since the patch referenced in Comment #2 is headed upstream.  You can get that kernel at:

http://new-people.redhat.com/clalance/virttest

Please give it a test to ensure we get it into the next RHEL release!
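
Installing is the usual test-kernel drill; the exact package filename below is just an example:

	# download and install the test kernel alongside the running one
	wget http://new-people.redhat.com/clalance/virttest/kernel-xen-2.6.18-131.el5virttest9.i686.rpm
	rpm -ivh kernel-xen-2.6.18-131.el5virttest9.i686.rpm
	# reboot into it and check the running version
	uname -r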

Chris Lalancette
Comment 8 Chris Chen 2009-02-20 17:43:45 EST
I've tripped over this same bug in 5.2--same line in blkfront.c when the kernel panics. This happens mostly when doing a kickstart. I'll be giving 5.3 a test pretty soon.

Unfortunately, because I'm seeing this during kickstart, I need the right kickstart initrds to test it--I've tried rolling the new modules into the existing initrd.img I have for 5.2, but there's a problem.
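
Roughly what I attempted, via the standard cpio unpack/repack (paths are examples, and the installer initrd layout may differ from a regular boot initrd):

	# unpack the existing initrd
	mkdir /tmp/initrd && cd /tmp/initrd
	zcat /path/to/initrd.img | cpio -idmv
	# ...swap in xenblk.ko (and friends) from the test kernel here...
	# repack it
	find . | cpio -o -H newc | gzip -9 > /tmp/initrd.img.new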
Comment 9 Chris Lalancette 2009-02-21 05:13:18 EST
OK.  Well, my guess is that 5.3 won't change the issue for you; we didn't do anything in 5.3 to address this.  There is a patch in the virttest kernels that may address this problem, although I haven't been able to confirm it since I can't reproduce the issue at all.  Do you happen to have a reproduction scenario I could try?

Chris Lalancette
Comment 10 Chris Lalancette 2009-02-21 05:15:10 EST
Oh, I should also mention that the kernels have now moved to:

http://people.redhat.com/clalance/virttest

Chris Lalancette
Comment 11 Chris Lalancette 2009-03-01 13:05:18 EST
Created attachment 333658 [details]
Backport of upstream Linux 9e973e64ac6dc504e6447d52193d4fff1a670156

This is the patch we are currently carrying in the virttest kernels.  It still needs verification that it fixes the problem.
Comment 12 Chris Chen 2009-03-11 13:22:00 EDT
My problem is that I haven't had time to build a working kickstart initrd from these test kernels. I have them running in my Xen domU guests (already installed, with a running md RAID1), and they're just fine.
Comment 13 Chris Chen 2009-03-16 20:56:50 EDT
Hrm, more work, something new today:

I created an md RAID1 from two xvd devices, put LVM on top, and ran bonnie++.

Then a panic!

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/xen/blkfront/blkfront.c:567
invalid opcode: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: nls_utf8 hfsplus i2c_dev i2c_core nfs lockd fscache nfs_acl sunrpc xennet ipv6 xfrm_nalgo crypto_api dm_multipath parport_pc lp parport pcspkr dm_snapshot dm_zero dm_mirror dm_mod xenblk raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-92.1.22.el5xen #1
RIP: e030:[<ffffffff8808472b>]  [<ffffffff8808472b>] :xenblk:do_blkif_request+0x181/0x384
RSP: e02b:ffffffff8062fdd8  EFLAGS: 00010046
RAX: 000000000000000b RBX: 0000000000000008 RCX: 0000000000000000
RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000f48
RBP: ffff88007fd42430 R08: ffff8800471b95f8 R09: 0000070000000335
R10: 0000070000000476 R11: 0000070000000410 R12: ffff88007fe9c000
R13: ffff8800497c2570 R14: ffff8800471b95f8 R15: ffff8800502b9540
FS:  00002aac547fce00(0000) GS:ffffffff805b0000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process swapper (pid: 0, threadinfo ffffffff805f0000, task ffffffff804d8b00)
Stack:  ffff88007ff01928  ffff88007fe9c000  0000000250290ec0  00000000000001e9
 000000000000000e  0000000800001000  0000000b00001000  ffffffff8022d658
 ffffffff0000010a  ffff88007ff01928
Call Trace:
 <IRQ>  [<ffffffff8022d658>] __end_that_request_first+0x1b2/0x4ff
 [<ffffffff8032d66f>] blk_start_queue+0x5a/0x7a
 [<ffffffff88084946>] :xenblk:kick_pending_request_queues+0x18/0x24
 [<ffffffff88084d55>] :xenblk:blkif_int+0x179/0x19e
 [<ffffffff802112c1>] handle_IRQ_event+0x2d/0x60
 [<ffffffff802b1af5>] __do_IRQ+0xa4/0x103
 [<ffffffff8028fd9d>] _local_bh_enable+0x61/0xc5
 [<ffffffff8026db48>] do_IRQ+0xe7/0xf5
 [<ffffffff803a0c69>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026f139>] raw_safe_halt+0x84/0xa8
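
For reference, the test was along these lines (device, VG, and mount names are examples):

	mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/xvdc /dev/xvdd
	pvcreate /dev/md0
	vgcreate testvg /dev/md0
	lvcreate -L 10G -n testlv testvg
	mkfs.ext3 /dev/testvg/testlv
	mount /dev/testvg/testlv /mnt/test
	bonnie++ -d /mnt/test -u root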
Comment 14 Chris Lalancette 2009-03-17 03:37:57 EDT
> Hrm, more work, something new today:
> 
> Created a md raid 1 from two xvd devices, lvm and running bonnie++:
> 
> Then a panic!
> 
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at drivers/xen/blkfront/blkfront.c:567
> invalid opcode: 0000 [1] SMP
> last sysfs file: /block/ram0/dev
> CPU 0
> Modules linked in: nls_utf8 hfsplus i2c_dev i2c_core nfs lockd fscache nfs_acl
> sunrpc xennet ipv6 xfrm_nalgo crypto_api dm_multipath parport_pc lp parport
> pcspkr dm_snapshot dm_zero dm_mirror dm_mod xenblk raid1 ext3 jbd uhci_hcd
> ohci_hcd ehci_hcd
> Pid: 0, comm: swapper Not tainted 2.6.18-92.1.22.el5xen #1
> RIP: e030:[<ffffffff8808472b>]  [<ffffffff8808472b>]
> :xenblk:do_blkif_request+0x181/0x384

Cool, this is the same panic.  I tried setting up something similar to your test, and ran it overnight, but I didn't get a crash.  Is this reproducible for you?  If so, can you give the virttest kernels a whirl, and see if the issue goes away then?

Thanks,
Chris Lalancette
Comment 15 Chris Chen 2009-03-17 12:25:00 EDT
(In reply to comment #14)
> > Hrm, more work, something new today:
> > 
> > Created a md raid 1 from two xvd devices, lvm and running bonnie++:
> > 
> > Then a panic!
> > 
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at drivers/xen/blkfront/blkfront.c:567
> > invalid opcode: 0000 [1] SMP
> > last sysfs file: /block/ram0/dev
> > CPU 0
> > Modules linked in: nls_utf8 hfsplus i2c_dev i2c_core nfs lockd fscache nfs_acl
> > sunrpc xennet ipv6 xfrm_nalgo crypto_api dm_multipath parport_pc lp parport
> > pcspkr dm_snapshot dm_zero dm_mirror dm_mod xenblk raid1 ext3 jbd uhci_hcd
> > ohci_hcd ehci_hcd
> > Pid: 0, comm: swapper Not tainted 2.6.18-92.1.22.el5xen #1
> > RIP: e030:[<ffffffff8808472b>]  [<ffffffff8808472b>]
> > :xenblk:do_blkif_request+0x181/0x384
> 
> Cool, this is the same panic.  I tried setting up something similar to your
> test, and ran it overnight, but I didn't get a crash.  Is this reproducible for
> you?  If so, can you give the virttest kernels a whirl, and see if the issue
> goes away then?
> 
> Thanks,
> Chris Lalancette  

I've rebooted with the virttest kernel 2.6.18-131.el5virttest9xen #1 SMP Fri
Feb 20 06:20:21 EST 2009 x86_64 x86_64 x86_64 GNU/Linux and the problem hasn't
recurred.

Thanks!

cc
Comment 16 Don Zickus 2009-03-23 11:52:39 EDT
in kernel-2.6.18-136.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However, feel free
to provide a comment indicating that this fix has been verified.
Comment 18 Chris Ward 2009-07-03 14:07:28 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 21 errata-xmlrpc 2009-09-02 04:40:45 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
