Description of problem:

I'm running a RHEL-5 dom0 on kernel 2.6.18-116.el5. I've installed a RHEL-4 i386 fully virtualized guest. I boot the kernel-smp-2.6.9-78.EL kernel inside the guest, then "modprobe xen-vbd" inside the guest to load the PV-on-HVM VBD driver. Now, on the dom0, I run:

  # xm block-attach rhel4fv_i386 tap:aio:/storage/clalance/test.dsk /dev/xvda w

This results in some error messages being spit out on the dom0:

  tap tap-12-51712: 2 getting info
  blktap: ring-ref 8, event-channel 7, protocol 1 (unspecified, assuming native)
  Registering block device major 202
  xen-vbd: registered block device major 202
  (XEN) grant_table.c:229:d0 Bad ref (-630782208).
  (XEN) grant_table.c:229:d0 Bad ref (-630782208).
  blk_tap: invalid kernel buffer -- could not remap it
  blk_tap: invalid user buffer -- could not remap it
  blk_tap: Reached Fail_flush
  Buffer I/O error on device xvda, logical block 0
  blk_tap: Bad number of segments in request (0)

And then it leads to a crash inside the guest:

  Unable to handle kernel paging request at virtual address f28d5cf4
  printing eip:
  e00d8a74
  *pde = 00000000
  Oops: 0000 [#1]
  SMP
  Modules linked in: xen_vbd md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc cpufreq_powersave loop button battery ac 8139cp mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
  CPU:    3
  EIP:    0060:[<e00d8a74>]    Not tainted VLI
  EFLAGS: 00010086   (2.6.9-78.ELsmp)
  EIP is at blkif_int+0x5b/0x13a [xen_vbd]
  eax: da67c06c   ebx: f28d5c00   ecx: c03e6fb8   edx: da67c0ac
  esi: ca000100   edi: da8cc000   ebp: 00000001   esp: c0406f88
  ds: 007b   es: 007b   ss: 0068
  Process swapper (pid: 0, threadinfo=c0406000 task=df2105b0)
  Stack: 00000286 00000002 da67c0ac 00000000 00000030 00000007 00000002 c047c7b0
         c02332b4 00000007 00000000 00000001 e0014000 e00d8a19 00000007 00000000
         c03e6fb8 df2e5140 00000001 00000000 c03e6fb8 c01074d2 c03e6f9c c0406000
  Call Trace:
   [<c02332b4>] evtchn_interrupt+0x13d/0x1ca
   [<e00d8a19>] blkif_int+0x0/0x13a [xen_vbd]
   [<c01074d2>] handle_IRQ_event+0x25/0x4f
   [<c0107a32>] do_IRQ+0x11c/0x1ae
   =======================
   [<c02e13b4>] common_interrupt+0x18/0x20
   [<c012007b>] arch_init_sched_domains+0x4d7/0x63b
   [<c0126d7b>] __do_softirq+0x43/0xb1
   [<c01081a3>] do_softirq+0x4f/0x56
   =======================
   [<c01174c8>] smp_apic_timer_interrupt+0x9a/0x9c
   [<c02e1436>] apic_timer_interrupt+0x1a/0x20
   [<c0104018>] default_idle+0x0/0x2f
   [<c03b007b>] powernow_cpu_init+0x12a/0x1c0
   [<c0104041>] default_idle+0x29/0x2f
   [<c01040a0>] cpu_idle+0x26/0x3b
  Code: 76 00 8b 6f 20 39 c5 0f 84 a6 00 00 00 8b 47 24 48 21 e8 6b c0 6c 03 47 28 8d 50 40 89 54 24 08 8b 70 40 69 de 9c 00 00 00 01 fb <8b> 83 f4 00 00 00 89 44 24 0c 8d 83 88 00 00 00 e8 2f 01 00 00
  <0>Kernel panic - not syncing: Fatal exception in interrupt

Now, this might be a dom0-side bug; however, even if that is the case, this should probably not crash the guest kernel like this. I can reproduce this on demand, so if you need more information, let me know.
Can you provide a dump of the crash? It should capture a crash dump in the guest, since the panic notifier is built into the xenpvhvm subsystem, and it's fully set up before the block-attach is executed. Additionally, can you try with a 5.2 dom0 to ensure it isn't a new 'feature' in 5.3's dom0? Finally, is this the bug-fixed/patched -116 or the buggy/crashing -116?
I'll have to set up my machine again for the test to get the crash; not a big deal, but I'm not sure if I'll get to it today. I should have mentioned: I tested this with both -116 (the funky kernel) and with -92 (the 5.2 kernel), and I got the same results with both.

Chris Lalancette
I duped the problem on my RHEL 5.2 dom0 with a RHEL 4 FV guest doing an xm block-attach as well. After numerous dumps with debug printk's, I found that the data structure passed from the dom0 back end to the FV front end (blkfront) was mis-aligned. Sure enough, blkfront (and only blkfront) has two flavors of the blkif_*{} structures, one for 32-bit and one for 64-bit systems; i.e., the structures were never made 32/64-bit neutral. So the blkfront driver must tell the blkback driver what type of guest it is. (Note: netfront doesn't have this problem.) This code exists in upstream Xen, and in RHEL 5's blkfront, but RHEL 4's (old) snapshot was never updated.

So, adding the following to blkfront's talk_to_backend() function:

	err = xenbus_printf(xbt, dev->nodename, "protocol", "%s",
			    XEN_IO_PROTO_ABI_NATIVE);
	if (err) {
		message = "writing protocol";
		goto abort_transaction;
	}

and adding include/xen/interface/io/protocols.h to RHEL 4 (copied from RHEL 5 or upstream Xen), and voila... all is well.

Will create a patch and post it tomorrow for RHEL 4.8. Q: should this be patched for RHEL 4.7.z?

Note/Question: I would guess that rhel4-xenU-i386 on rhel5.2-xen-x86_64 should fail for similar reasons. We may also want to investigate whether the following cset from xen-unstable should be applied to the xen-tools for 32-on-64 migration support:

  cset 17635: xend: fix block protocol mismatch on save/restore

  The protocol field of the blkif interface is correct at startup for a
  guest of a different mode from dom0 (e.g. 32-bit dom0, 64-bit guest).
  However, this property is not persisted on save, so a later restore (or
  migrate) will set up the block interface with the wrong mode.
The 'protocol' field is initialized automatically by XenD, based on information returned from the hypervisor, since the hypervisor knows whether the guest domain is 32- or 64-bit. This should all 'just work' correctly... except if you are running a 32-bit fully virtualized operating system in a 64-bit guest domain, which is almost certainly the case you tripped up on. So I think it should be sufficient to add your patch to just the PV-on-HVM drivers, but we might as well add it to the PV RHEL4 guest too for the sake of completeness.
The patch is to blkfront.c, and it is PV / PV-on-HVM agnostic, so the fix will go into both (the FV + PV-on-HVM kernels and the PV kernels). Note that the patch I'm working on makes blkfront and blkback converse with the same structs; the cset I pointed to was for xend, for save/restore of 32-bit guests on 64-bit dom0s.
Yes, I actually think you are both right. I've never run into problems running 32-bit PV RHEL-4 on a 64-bit RHEL-5 dom0, which is probably because of what danpb pointed out. But I did observe the problem with 32-bit FV on a 64-bit dom0, which is what the patch should fix.

Re: whether to backport this to the z-stream: I would hold off for now. If customers complain about it, then we can port it into the z-stream.

Re: other 32-on-64 patches: yes, we probably eventually want them. I'm hoping to get 32-on-64 really solid in 5.4, but we'll see if I have time.

In any case, good work on this one, Don, and let's get this into 4.8.

Chris Lalancette
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 319955 [details] Posted patch for fix to rhel4.8.
Committed in 78.15.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
*** Bug 484698 has been marked as a duplicate of this bug. ***
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

* Cause: Running a 32-bit RHEL4.6 as a fully-virtualized (FV) guest that is using the xen pv block driver (xen-vbd.ko) in the guest will not work on a 64-bit host, e.g. 2.6.9-78.el4 i686 guest on 2.6.18-92.el5 x86_64 dom0/host. The cause is that the block back end does not know that the guest block front end is using the 32-bit protocol instead of the 64-bit protocol.

* Consequence: Disks/partitions being attached to an i686 RHEL4 FV guest using the xen pv block driver on an x86_64 RHEL5 host will fail to be attached/accessible from the guest.

* Fix: A fix to the xen block front driver (block.c) for RHEL4 was applied that has the block front driver inform the block back driver that it is using the 32-bit protocol, instead of the default/expected 64-bit protocol.

* Result: Disks/partitions exported to i686 RHEL4 FV guests on x86_64 RHEL5 hosts connect and mount correctly.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:

@@ -1,17 +1 @@
-* Cause: Running a 32-bit RHEL4.6 as a fully-virtualized (FV) guest that is
-using the xen pv block driver (xen-vbd.ko) in the guest will not work on a
-64-bit host, e.g. 2.6.9-78.el4 i686 guest on 2.6.18-92.el5 x86_64 dom0/host.
-The cause is due to the block back end not knowing that the guest block front
-end is using a 32-bit protocol instead of a 64-bit protocol.
-
-* Consequence: Disks/partitions being attached to an i686 RHEL4 FV guest using
-the xen pv block driver on an x86_64 RHEL5 host will fail to be
-attached/accessible from the guest.
-
-* Fix: A fix to the xen block front driver (block.c) for RHEL4 was applied
-that
- has the block front driver inform the block back driver that it is
-using a 32-bit protocol, instead of the default/expected 64-bit protocol.
-
-* Result: Disks/partitions exported to i686 RHEL4 FV guests on x86_64 RHEL5
-hosts connect and mount correctly.
+Previously, attempting to mount disks or partitions in a 32-bit Red Hat Enterprise Linux 4.6 fully virtualized guest using the paravirtualized block driver (xen-vbd.ko) on a 64-bit host would fail. With this update, the block front driver (block.c) has been updated to inform the block back driver that the guest is using the 32-bit protocol, which resolves this issue.
For what it's worth, running 2.6.9-87 from http://people.redhat.com/vgoyal/rhel4/ resolves our issue where i686 RHEL4 HVM guests crash on top of RHEL5 x86_64 xen dom0 when trying to add xenpv block devices.
Great. Thanks for the testing; while I tested it to work myself, it's always good to get additional confirmation. Thanks again, Chris Lalancette
Any updates here? Has this issue been resolved in the RHEL 4.8 Beta? later kernel?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html