Bug 509220

Summary: i386 rhel4.8 kvm guests crash in virtio during installation
Product: Red Hat Enterprise Linux 4 Reporter: Gurhan Ozen <gozen>
Component: kernel Assignee: Chris Lalancette <clalance>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 4.8 CC: clalance, dhoward, dyasny, jburke, jwm, mjenner, plyons, qzhang, rdassen, tao, tburke, virt-maint, ykaul
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-16 15:49:09 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 582911    
Attachments:
Description Flags
Backport of linux-2.6 patch for dynamic scatterlist none

Description Gurhan Ozen 2009-07-01 19:35:48 UTC
Description of problem:
------------[ cut here ]------------
kernel BUG at drivers/block/virtio_blk.c:157!
invalid operand: 0000 [#1]
Modules linked in: dm_snapshot dm_mirror dm_zero dm_mod ext3 jbd msdos raid6 raid5 xor raid1 raid0 virtio_blk virtio_net virtio_pci virtio virtio_ring uhci_hcd sr_mod sd_mod scsi_mod edd floppy loop nfs nfs_acl lockd sunrpc vfat fat cramfs
CPU:    0
EIP:    0060:[<f88352a0>]    Not tainted VLI
EFLAGS: 00010082   (2.6.9-89.EL) 
EIP is at do_virtblk_request+0x1a/0x85 [virtio_blk]
eax: c3137400   ebx: dfe54f4c   ecx: 00000000   edx: dfe57530
esi: f74a6028   edi: 00000000   ebp: f754c000   esp: f7471cf0
ds: 007b   es: 007b   ss: 0068
Process anaconda (pid: 542, threadinfo=f7471000 task=f74860f0)
Stack: f74a6028 00000018 c3137400 dfe54f4c c025e1ab f74a6028 c025e228 00000000 
       c025ef6e 00000001 00000000 f74a6028 f74860f0 c0121d93 f7471d28 f7471d28 
       f7500880 0000021e f754f2cc 00000010 dfe54f4c f7500880 f754f2cc c02670b6 
Call Trace:
 [<c025e1ab>] __generic_unplug_device+0x2b/0x2d
 [<c025e228>] generic_unplug_device+0x7b/0xe0
 [<c025ef6e>] blk_execute_rq+0xbb/0xe3
 [<c0121d93>] autoremove_wake_function+0x0/0x2d
 [<c02670b6>] cfq_set_request+0x33/0x6b
 [<c0267083>] cfq_set_request+0x0/0x6b
 [<c025cef6>] elv_set_request+0xa/0x17
 [<c025eb13>] get_request+0x395/0x39f
 [<c0262ec2>] sg_scsi_ioctl+0x2c2/0x3c4
 [<c0263397>] scsi_cmd_ioctl+0x3d3/0x478
 [<c01ed3a9>] kobject_get+0xf/0x13
 [<c0262218>] get_disk+0x29/0x62
 [<c0261bb3>] exact_lock+0x7/0xd
 [<c025b10e>] kobj_lookup+0x123/0x185
 [<c0261ba5>] exact_match+0x0/0x7
 [<c017840b>] do_open+0x1b2/0x4ec
 [<c018f64a>] wake_up_inode+0x6/0x23
 [<c01787c7>] blkdev_open+0x1a/0x42
 [<c016dad2>] __dentry_open+0xcf/0x188
 [<c016d9a2>] filp_open+0x51/0x65
 [<f88353f7>] virtblk_ioctl+0x76/0x80 [virtio_blk]
 [<c026164e>] blkdev_ioctl+0x32b/0x337
 [<c0178aca>] block_ioctl+0x11/0x13
 [<c0183485>] sys_ioctl+0x297/0x336
 [<c0322f1b>] syscall_call+0x7/0xb
Code: 78 04 b8 01 00 00 00 89 57 04 89 3a 5b 5e 5f 5d c3 55 31 ed 57 31 ff 56 89 c6 53 eb 5a 66 81 7b 54 83 00 8b 43 48 8b 68 38 76 08 <0f> 0b 9d 00 bd 57 83 f8 89 d9 89 ea 89 f0 e8 8f fe ff ff 84 c0 
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
Guest:
RHEL4-U8 i386 released distro.
Host:
kvm-qemu-img-83-80.el5
kmod-kvm-83-80.el5
kernel-2.6.18-155.el5
kvm-83-80.el5


How reproducible:
Very. Only happens in i386 guests.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Dor Laor 2009-07-06 15:24:12 UTC
Seems like it is a guest driver issue. Changing the component.

Comment 3 Chris Lalancette 2009-07-15 19:33:31 UTC
I'll take a look at this tomorrow.

Chris Lalancette

Comment 4 Chris Lalancette 2009-08-13 10:09:04 UTC
OK, I took a look at this.  First, I wasn't able to reproduce with an F-11 based host, so I'll try a RHEL-5 based one next.  Gurhan, could you give me details about which piece(s) of hardware you were able to reproduce this on?

Looking at the stack trace, we hit this:

0x2a0 is in do_virtblk_request (drivers/block/virtio_blk.c:157).
152		struct request *req;
153		unsigned int issued = 0;
154	
155		while ((req = elv_next_request(q)) != NULL) {
156			vblk = req->rq_disk->private_data;
157			BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
158	
159			/* If this request fails, stop queue and wait for something to
160			   finish to restart it. */
161			if (!do_req(q, vblk, req)) {

Which means (I think) that we were handed more segments than the ring can handle.  I'll have to look further into how we got into that situation.

Chris Lalancette

Comment 5 Chris Lalancette 2009-08-13 12:06:24 UTC
OK, I have a thought as to how this might happen.  I think the problem might be in the size of our scatterlist vs. what the host told us.  In the current RHEL-4 code, there is a hardcoded scatterlist size of (3+MAX_PHYS_SEGMENTS).  However, I think it is possible for the host to tell us to use more than that, and if it does so, then we set up the block layer to give us whatever the host tells us, irrespective of our internally hardcoded size.  If that happens, then the block layer can hand us a request with more segments than our array holds, and we'll run into this BUG_ON().
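
To make the suspected mismatch concrete, here is a small user-space model of it. This is only a sketch of the analysis above, not the RHEL-4 driver code: every model_* name is hypothetical, MODEL_MAX_PHYS_SEGMENTS is an illustrative value rather than the kernel's real constant, and the host limits used in the example are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the compile-time segment limit (assumed value). */
#define MODEL_MAX_PHYS_SEGMENTS 128
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Minimal stand-in for a scatterlist entry. */
struct model_sg {
    void *addr;
    unsigned int len;
};

/* Driver state with a hardcoded scatterlist, as described in the comment. */
struct model_vblk {
    struct model_sg sg[3 + MODEL_MAX_PHYS_SEGMENTS];
    unsigned int queue_max_segments; /* limit handed to the block layer */
};

/* Probe: the queue limit comes from the host, not from the array size. */
static void model_probe(struct model_vblk *vblk, unsigned int host_seg_max)
{
    assert(vblk != NULL);
    vblk->queue_max_segments = host_seg_max;
}

/* Same comparison as the BUG_ON() at drivers/block/virtio_blk.c:157. */
static bool model_request_would_bug(const struct model_vblk *vblk,
                                    unsigned int nr_phys_segments)
{
    return nr_phys_segments > ARRAY_SIZE(vblk->sg);
}
```

In this model, probing with a host limit of 256 and then submitting a request that uses the full queue limit trips the check, while a host limit of 64 never can, because the array holds 131 entries.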

Luckily, upstream has moved to a dynamically allocated scatterlist.  Assuming my above analysis is right, then this should fix the problem.  I've done a backport of that patch to RHEL-4, and now I just need a place to test it.  Hopefully Gurhan can provide me with the machines I need to do that.
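
The idea behind a dynamically allocated scatterlist can be sketched in user space as follows. This is a rough model under stated assumptions, not the upstream patch or the backport: the model_dyn_* names are hypothetical, malloc stands in for the kernel allocator, and the two extra entries (MODEL_EXTRA_ELEMS) are an assumption about per-request bookkeeping slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-in for a scatterlist entry. */
struct model_sg {
    void *addr;
    unsigned int len;
};

/* Assumed number of non-data entries needed per request. */
#define MODEL_EXTRA_ELEMS 2u

/* Driver state whose scatterlist is sized at probe time. */
struct model_dyn_vblk {
    unsigned int sg_elems; /* entries actually allocated */
    struct model_sg sg[];  /* flexible array member */
};

/* Probe: size the array from the host's limit. Returns NULL on failure.
 * The caller releases the result with free(). */
static struct model_dyn_vblk *model_dyn_probe(unsigned int host_seg_max)
{
    struct model_dyn_vblk *vblk;
    size_t elems = (size_t)host_seg_max + MODEL_EXTRA_ELEMS;

    vblk = malloc(sizeof(*vblk) + elems * sizeof(vblk->sg[0]));
    if (vblk == NULL)
        return NULL;
    vblk->sg_elems = (unsigned int)elems;
    return vblk;
}

/* A request fits when its data segments plus the extras fit the array. */
static bool model_dyn_request_fits(const struct model_dyn_vblk *vblk,
                                   unsigned int nr_phys_segments)
{
    assert(vblk != NULL);
    return (size_t)nr_phys_segments + MODEL_EXTRA_ELEMS <= vblk->sg_elems;
}
```

Because the array length and the limit given to the block layer both derive from the same host value, a request that respects the queue limit always fits, whatever the host advertises.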

Chris Lalancette

Comment 6 Chris Lalancette 2009-08-13 12:57:07 UTC
Created attachment 357314 [details]
Backport of linux-2.6 patch for dynamic scatterlist

This patch is what I have in mind.  I've lightly tested it, and it seems to work (at least basically), but I'll still need to test it on the problem machine.

Chris Lalancette

Comment 12 Chris Lalancette 2009-08-20 10:13:20 UTC
Hm, yeah, I'm wondering if Dor's last comment is why things are different.  In any case, I've now done a test of the installer with the patch in place, and things look pretty good; the install with that particular ks.cfg no longer fails.  I'll get this patch queued up for 4.9.

Chris Lalancette

Comment 13 RHEL Program Management 2009-08-20 10:39:53 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 14 Vivek Goyal 2009-08-25 19:23:32 UTC
Committed in 89.11.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 15 Dan Yasny 2010-04-09 12:52:33 UTC
The test kernel has been tested by a customer, who verified that the issue is no longer reproducible: see https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=737433&gid=23498&view_type=lifoall#eid_6681283

Comment 25 errata-xmlrpc 2011-02-16 15:49:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html