Description of problem:

I'm running a RHEL-5 dom0 on kernel 2.6.18-116.el5. I've installed a RHEL-4 i386 fully virtualized guest. I boot the kernel-smp-2.6.9-78.EL kernel inside the guest, then "modprobe xen-vbd" inside the guest to load the PV-on-HVM VBD driver. Now, on the dom0, I run:

  # xm block-attach rhel4fv_i386 tap:aio:/storage/clalance/test.dsk /dev/xvda w

This results in some error messages being spit out on the dom0:

  tap tap-12-51712: 2 getting info
  blktap: ring-ref 8, event-channel 7, protocol 1 (unspecified, assuming native)
  Registering block device major 202
  xen-vbd: registered block device major 202
  (XEN) grant_table.c:229:d0 Bad ref (-630782208).
  (XEN) grant_table.c:229:d0 Bad ref (-630782208).
  blk_tap: invalid kernel buffer -- could not remap it
  blk_tap: invalid user buffer -- could not remap it
  blk_tap: Reached Fail_flush
  Buffer I/O error on device xvda, logical block 0
  blk_tap: Bad number of segments in request (0)

And then it leads to a crash inside the guest:

  Unable to handle kernel paging request at virtual address f28d5cf4
  printing eip:
  e00d8a74
  *pde = 00000000
  Oops: 0000 [#1]
  SMP
  Modules linked in: xen_vbd md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc cpufreq_powersave loop button battery ac 8139cp mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
  CPU:    3
  EIP:    0060:[<e00d8a74>]    Not tainted VLI
  EFLAGS: 00010086   (2.6.9-78.ELsmp)
  EIP is at blkif_int+0x5b/0x13a [xen_vbd]
  eax: da67c06c   ebx: f28d5c00   ecx: c03e6fb8   edx: da67c0ac
  esi: ca000100   edi: da8cc000   ebp: 00000001   esp: c0406f88
  ds: 007b   es: 007b   ss: 0068
  Process swapper (pid: 0, threadinfo=c0406000 task=df2105b0)
  Stack: 00000286 00000002 da67c0ac 00000000 00000030 00000007 00000002 c047c7b0
         c02332b4 00000007 00000000 00000001 e0014000 e00d8a19 00000007 00000000
         c03e6fb8 df2e5140 00000001 00000000 c03e6fb8 c01074d2 c03e6f9c c0406000
  Call Trace:
   [<c02332b4>] evtchn_interrupt+0x13d/0x1ca
   [<e00d8a19>] blkif_int+0x0/0x13a [xen_vbd]
   [<c01074d2>] handle_IRQ_event+0x25/0x4f
   [<c0107a32>] do_IRQ+0x11c/0x1ae
   =======================
   [<c02e13b4>] common_interrupt+0x18/0x20
   [<c012007b>] arch_init_sched_domains+0x4d7/0x63b
   [<c0126d7b>] __do_softirq+0x43/0xb1
   [<c01081a3>] do_softirq+0x4f/0x56
   =======================
   [<c01174c8>] smp_apic_timer_interrupt+0x9a/0x9c
   [<c02e1436>] apic_timer_interrupt+0x1a/0x20
   [<c0104018>] default_idle+0x0/0x2f
   [<c03b007b>] powernow_cpu_init+0x12a/0x1c0
   [<c0104041>] default_idle+0x29/0x2f
   [<c01040a0>] cpu_idle+0x26/0x3b
  Code: 76 00 8b 6f 20 39 c5 0f 84 a6 00 00 00 8b 47 24 48 21 e8 6b c0 6c 03 47 28 8d 50 40 89 54 24 08 8b 70 40 69 de 9c 00 00 00 01 fb <8b> 83 f4 00 00 00 89 44 24 0c 8d 83 88 00 00 00 e8 2f 01 00 00
  <0>Kernel panic - not syncing: Fatal exception in interrupt

Now, this might be a dom0-side bug; however, even if that is the case, this should probably not crash the guest kernel like this. I can reproduce this on demand, so if you need more information, let me know.
Can you provide a dump of the crash? It should capture a crash dump in the guest, since the panic notifier is built into the xenpvhvm subsystem, and it's fully set up before the block-attach is executed. Additionally, can you try with a 5.2 dom0 to ensure it isn't a new 'feature' in 5.3's dom0? Finally, is this the bug-fixed/patched -116 or the buggy/crashing -116?
I'll have to set up my machine again for the test to get the crash; not a big deal, but I'm not sure if I'll get to it today. I should have mentioned: I tested this with both -116 (the funky kernel) and with -92 (the 5.2 kernel), and I got the same results with both.

Chris Lalancette
I duped the problem on my RHEL 5.2 dom0 with a RHEL 4 FV guest doing an xm block-attach as well. After numerous dumps with debug printk's, I found that the data structure passed from the dom0 back end to the FV front end (blkfront) was mis-aligned. Sure enough, blkfront (and only blkfront) has two flavors of the blkif_*{} structures, one for 32-bit and one for 64-bit systems; i.e., the structures were never made 32/64-bit neutral. So the blkfront driver must tell the blkback driver what type of guest it is. (Note: netfront doesn't have this problem.) This code exists in upstream Xen, and in RHEL 5's blkfront, but RHEL 4's (old) snapshot was never updated.

So, adding the following to blkfront's talk_to_backend() function:

	err = xenbus_printf(xbt, dev->nodename, "protocol", "%s",
			    XEN_IO_PROTO_ABI_NATIVE);
	if (err) {
		message = "writing protocol";
		goto abort_transaction;
	}

and adding include/xen/interface/io/protocols.h to RHEL 4 (copied from RHEL 5 or upstream Xen), and voila... all is well.

Will create a patch and post it tomorrow for RHEL 4.8. Q: should this be patched for RHEL 4.7.z?

Note/Question: I would guess that rhel4-xenU-i386 on rhel5.2-xen-x86_64 should fail for similar reasons. We may also want to investigate whether the following cset from xen-unstable should be applied to the xen-tools for 32-on-64 migration support:

  cset 17635: xend: fix block protocol mismatch on save/restore

  The protocol field of the blkif interface is correct at startup for a
  guest of a different mode from dom0 (e.g. 32-bit dom0, 64-bit guest).
  However, this property is not persisted on save, so a later restore (or
  migrate) will set up the block interface with the wrong mode.
The 'protocol' field is initialized automatically by XenD, based on information returned from the hypervisor, since the hypervisor knows whether the guest domain is 32- or 64-bit. This should all 'just work' correctly... except if you are running a 32-bit fully virtualized operating system in a 64-bit guest domain, which is almost certainly the case you tripped up on. So I think it should be sufficient to add your patch to just the PV-on-HVM drivers, but we might as well add it to the PV RHEL4 guest too for the sake of completeness.
The patch is to blkfront.c, and it is PV / PV-on-HVM agnostic, so the fix will go into both (the FV + PV-on-HVM kernels and the PV kernels). Note that the patch I'm working on makes blkfront and blkback converse with the same structs; the cset I pointed to was for xend, for save/restore of 32-bit guests on 64-bit dom0s.
Yes, I actually think you are both right. I've never run into problems running 32-bit PV RHEL-4 on a 64-bit RHEL-5 dom0, which is probably because of what danpb pointed out. But I did observe the problem with 32-bit FV on a 64-bit dom0, which is what the patch should fix.

Re: whether to backport this to the z-stream: I would hold off for now. If customers complain about it, then we can port it into the z-stream.

Re: other 32-on-64 patches: yes, we probably eventually want them. I'm hoping to get 32-on-64 really solid in 5.4, but we'll see if I have time.

In any case, good work on this one, Don, and let's get this into 4.8.

Chris Lalancette
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 319955 [details] Posted patch for fix to rhel4.8.
Committed in 78.15.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
*** Bug 484698 has been marked as a duplicate of this bug. ***
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

* Cause: Running a 32-bit RHEL4.6 as a fully-virtualized (FV) guest that is using the xen pv block driver (xen-vbd.ko) in the guest will not work on a 64-bit host, e.g. 2.6.9-78.el4 i686 guest on 2.6.18-92.el5 x86_64 dom0/host. The cause is that the block back end does not know that the guest block front end is using the 32-bit protocol instead of the 64-bit protocol.

* Consequence: Disks/partitions being attached to an i686 RHEL4 FV guest using the xen pv block driver on an x86_64 RHEL5 host will fail to be attached/accessible from the guest.

* Fix: A fix to the xen block front driver (block.c) for RHEL4 was applied that has the block front driver inform the block back driver that it is using the 32-bit protocol, instead of the default/expected 64-bit protocol.

* Result: Disks/partitions exported to i686 RHEL4 FV guests on x86_64 RHEL5 hosts connect and mount correctly.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:

@@ -1,17 +1 @@
-* Cause: Running a 32-bit RHEL4.6 as a fully-virtualized (FV) guest that is
-using the xen pv block driver (xen-vbd.ko) in the guest will not work on a
-64-bit host, e.g. 2.6.9-78.el4 i686 guest on 2.6.18-92.el5 x86_64 dom0/host.
-The cause is due to the block back end not knowing that the guest block front
-end is using a 32-bit protocol instead of a 64-bit protocol.
-
-* Consequence: Disks/partitions being attached to an i686 RHEL4 FV guest using
-the xen pv block driver on an x86_64 RHEL5 host will fail to be
-attached/accessible from the guest.
-
-* Fix: A fix to the xen block front driver (block.c) for RHEL4 was applied
-that
- has the block front driver inform the block back driver that it is
-using a 32-bit protocol, instead of the default/expected 64-bit protocol.
-
-* Result: Disks/partitions exported to i686 RHEL4 FV guests on x86_64 RHEL5
-hosts connect and mount correctly.
+Previously, attempting to mount disks or partitions in a 32-bit Red Hat Enterprise Linux 4.6 fully virtualized guest using the paravirtualized block driver (xen-vbd.ko) on a 64-bit host would fail. With this update, the block front driver (block.c) has been updated to inform the block back driver that the guest is using the 32-bit protocol, which resolves this issue.
For what it's worth, running 2.6.9-87 from http://people.redhat.com/vgoyal/rhel4/ resolves our issue where i686 RHEL4 HVM guests crash on top of RHEL5 x86_64 xen dom0 when trying to add xenpv block devices.
Great. Thanks for the testing; while I tested it to work myself, it's always good to get additional confirmation. Thanks again, Chris Lalancette
Any updates here? Has this issue been resolved in the RHEL 4.8 Beta? later kernel?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html