Bug 497356

Summary: HVM-PV RHEL5.3 panics post-install with Solaris dom0
Product: Red Hat Enterprise Linux 5
Reporter: John Levon <levon>
Component: kernel-xen
Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Docs Contact:
Priority: low
Version: 5.3
CC: clalance, drjones, fajar, riek, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Other   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-06-24 08:10:15 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 514491
Attachments:
xenstore trace (flags: none)

Description John Levon 2009-04-23 14:56:07 UTC
After a virt-install of RHEL5.3 from DVD, it panics during boot, with traces similar to:

Starting udev: Unable to handle kernel paging request at 0000000100000008 RIP: 
 [<ffffffff8017c2ab>] acpi_ns_map_handle_to_node+0x14/0x1d
PGD 3d2b8067 PUD 0 
Oops: 0000 [1] SMP 
last sysfs file: /class/net/lo/type
CPU 0 
Modules linked in: xen_vbd 8139too 8139cp i2c_piix4 pcspkr i2c_core ide_cd serio_raw parport_pc mii cdrom xen_platform_pci parport dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 912, comm: modprobe Not tainted 2.6.18-128.el5 #1
RIP: 0010:[<ffffffff8017c2ab>]  [<ffffffff8017c2ab>] acpi_ns_map_handle_to_node+0x14/0x1d
RSP: 0000:ffff81003d9c7e80  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000001001 RCX: 0000000000000000
RDX: 000000000000162a RSI: 0000000000000246 RDI: 0000000100000000
RBP: 0000000100000000 R08: ffff81003d9c7f80 R09: 0000000000000000
R10: ffffffff803d9520 R11: 000000000000007e R12: ffff81003d9c7ec8
R13: ffffffff80186080 R14: 000000001a4a0138 R15: 000000001a4a98a0
FS:  00002b1e8ae126e0(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000100000008 CR3: 000000003d2ae000 CR4: 00000000000006e0
Process modprobe (pid: 912, threadinfo ffff81003d9c6000, task ffff81003d654100)
Stack:  ffffffff8017b746 ffff81003d250d00 ffff81003d996448 ffff81003d996448
 ffff81003d70efa0 0000000000000000 ffffffff801861cf 0000000000000296
 ffffffff801b96a2 ffff81003d996448 ffffffff801861ff ffff81003d996448
Call Trace:
 [<ffffffff8017b746>] acpi_get_data+0x3e/0x6e
 [<ffffffff801861cf>] acpi_get_physical_device+0x15/0x30
 [<ffffffff801b96a2>] devres_release_all+0x77/0x7e
 [<ffffffff801861ff>] acpi_platform_notify_remove+0x15/0x63
 [<ffffffff801b5266>] device_del+0x142/0x1a9
 [<ffffffff8014942d>] kobject_release+0x0/0x9
 [<ffffffff801b52ea>] device_unregister+0x9/0x12
 [<ffffffff88196051>] :xen_platform_pci:xvd_dev_shutdown+0x55/0x84
 [<ffffffff88251014>] :xen_vbd:xlblk_init+0x14/0x18
 [<ffffffff800a3e5d>] sys_init_module+0xaf/0x1e8
 [<ffffffff8005d116>] system_call+0x7e/0x83

The traces are often different but always involve device shutdown.

If the PV drivers are removed, the domain can boot OK.

This was observed with a Solaris dom0 on both Xen 3.1.4 and Xen 3.3.2-pre.

Another example:

Setting clock  (utc): Thu Apr 23 12:04:49 EDT 2009 [  OK  ]
Starting udev: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff80064b3e>] _spin_lock+0x1/0xa
PGD 3d85e067 PUD 3eec4067 PMD 0 
Oops: 0002 [1] SMP 
last sysfs file: /class/input/input1/event1/dev
CPU 0 
Modules linked in: xen_vnif xen_vbd xen_balloon 8139too 8139cp mii ide_cd xen_platform_pci pcspkr i2c_piix4 i2c_core parport_pc serio_raw parport cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1016, comm: modprobe Not tainted 2.6.18-128.el5 #1
RIP: 0010:[<ffffffff80064b3e>]  [<ffffffff80064b3e>] _spin_lock+0x1/0xa
RSP: 0018:ffff81003dd13ec0  EFLAGS: 00010292
RAX: 0000000000000001 RBX: ffff81003d897070 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8014942d RDI: 0000000000000000
RBP: ffff81003d897048 R08: 000000000418d138 R09: ffffffff8014942d
R10: ffffffff803d9520 R11: ffffffff881e9cf4 R12: 0000000000000000
R13: ffffffff881f8848 R14: 000000000418d138 R15: 00000000041968a0
FS:  00002acdbf27d6e0(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003dca5000 CR4: 00000000000006e0
Process modprobe (pid: 1016, threadinfo ffff81003dd12000, task ffff81003ee71080)
Stack:  ffffffff8026f4ac ffffffff80320730 ffff81003d897048 ffff81003e836f20
 ffffffff801b5146 ffffffff8014942d ffff81003d897048 ffff81003e8367c0
 ffff81003e836f20 00000000041968a0 ffffffff801b52ea ffff81003d897048
Call Trace:
 [<ffffffff8026f4ac>] klist_del+0x15/0x2a
 [<ffffffff801b5146>] device_del+0x22/0x1a9
 [<ffffffff8014942d>] kobject_release+0x0/0x9
 [<ffffffff801b52ea>] device_unregister+0x9/0x12
 [<ffffffff881e9051>] :xen_platform_pci:xvd_dev_shutdown+0x55/0x84
 [<ffffffff88260014>] :xen_vbd:xlblk_init+0x14/0x18
 [<ffffffff800a3e5d>] sys_init_module+0xaf/0x1e8
 [<ffffffff8005d116>] system_call+0x7e/0x83
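
The claim that the traces differ but always involve device shutdown can be checked by comparing the symbols in the two call traces above. The following is a minimal Python sketch, not part of the original report; `frames` and `common_frames` are hypothetical helper names, and the regular expression assumes the 2.6.18-style frame format shown in these oopses.

```python
import re

# Matches frames such as "[<ffffffff801b5266>] device_del+0x142/0x1a9" and
# module frames such as "[<...>] :xen_vbd:xlblk_init+0x14/0x18".
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(\w+):)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def frames(trace_text):
    """Return the symbols of a call trace in order ('module:symbol' or 'symbol')."""
    result = []
    for module, symbol in FRAME_RE.findall(trace_text):
        result.append(f"{module}:{symbol}" if module else symbol)
    return result

def common_frames(first, second):
    """Return the symbols present in both traces, in the order of the first."""
    seen = set(frames(second))
    return [frame for frame in frames(first) if frame in seen]

# Call trace from the first oops in this report.
TRACE_A = """
 [<ffffffff8017b746>] acpi_get_data+0x3e/0x6e
 [<ffffffff801861cf>] acpi_get_physical_device+0x15/0x30
 [<ffffffff801b96a2>] devres_release_all+0x77/0x7e
 [<ffffffff801861ff>] acpi_platform_notify_remove+0x15/0x63
 [<ffffffff801b5266>] device_del+0x142/0x1a9
 [<ffffffff8014942d>] kobject_release+0x0/0x9
 [<ffffffff801b52ea>] device_unregister+0x9/0x12
 [<ffffffff88196051>] :xen_platform_pci:xvd_dev_shutdown+0x55/0x84
 [<ffffffff88251014>] :xen_vbd:xlblk_init+0x14/0x18
 [<ffffffff800a3e5d>] sys_init_module+0xaf/0x1e8
 [<ffffffff8005d116>] system_call+0x7e/0x83
"""

# Call trace from the second oops in this report.
TRACE_B = """
 [<ffffffff8026f4ac>] klist_del+0x15/0x2a
 [<ffffffff801b5146>] device_del+0x22/0x1a9
 [<ffffffff8014942d>] kobject_release+0x0/0x9
 [<ffffffff801b52ea>] device_unregister+0x9/0x12
 [<ffffffff881e9051>] :xen_platform_pci:xvd_dev_shutdown+0x55/0x84
 [<ffffffff88260014>] :xen_vbd:xlblk_init+0x14/0x18
 [<ffffffff800a3e5d>] sys_init_module+0xaf/0x1e8
 [<ffffffff8005d116>] system_call+0x7e/0x83
"""

if __name__ == "__main__":
    for frame in common_frames(TRACE_A, TRACE_B):
        print(frame)
```

Both traces share the path from xlblk_init through xvd_dev_shutdown and device_unregister into device_del; they diverge only inside device_del.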

Comment 1 Chris Lalancette 2009-04-23 15:03:27 UTC
There's a patch going into the 5.4 kernel that *may* address this issue; see https://bugzilla.redhat.com/show_bug.cgi?id=477005.  Can you try booting the kernel at:

http://people.redhat.com/dzickus/el5

And see if it makes a difference?

Chris Lalancette

Comment 2 John Levon 2009-04-23 15:08:07 UTC
# virsh dumpxml domu-220
<domain type='xen' id='14'>
  <name>domu-220</name>
  <uuid>5cc2cb8c-53f0-c687-3bad-078feeef00c9</uuid>
  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type>hvm</type>
    <loader>/usr/lib/xen/boot/hvmloader</loader>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>preserve</on_crash>
  <distro>
    <type>linux</type>
    <variant>rhel5</variant>
  </distro>
  <devices>
    <emulator>/usr/lib/xen/bin/qemu-dm</emulator>
    <interface type='bridge'>
      <mac address='00:16:3e:3e:d8:58'/>
      <source bridge='e1000g0'/>
      <script path='/usr/lib/xen/scripts/vif-vnic'/>
      <target dev='vif14.0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/2'/>
      <target port='0'/>
    </serial>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target port='0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5900' autoport='yes' keymap='en-us'/>
  </devices>
</domain>

(I'd removed the empty CD device, but that didn't help.) The virt-install line:

virt-install --hvm --os-type=linux --os-variant=rhel5 -r 1024 -m `~johnlev/bin/maca domu-220` -n domu-220 -f /dev/zvol/dsk/export/dom/domu-220-root --vnc -c /net/heaped/export/netimage/linux/CentOS-5.3-x86_64-bin-DVD.iso

Comment 3 John Levon 2009-04-23 15:17:05 UTC
Chris, this is a royal pain for me to do, it seems very unlikely that it's the
same issue...

I found in xenstore:

control = ""
error = ""
 device = ""
  vbd = ""
   768 = ""
    error = "19 xlvbd_add at /local/domain/0/backend/vbd/14/768"

I'm attaching a xenstore trace. Of interest:

DOM  PID      TX     OP
14   0        0      XS_READ: /local/domain/0/backend/vbd/14/768/state -> 4
14   0        0      XS_READ: /local/domain/0/backend/vbd/14/768/sectors -> 20971520
14   0        0      XS_READ: /local/domain/0/backend/vbd/14/768/info -> 0
14   0        0      XS_READ: /local/domain/0/backend/vbd/14/768/sector-size -> 512
14   0        0      XS_WRITE: error/device/vbd/768/error 19 xlvbd_add at /local/domain/0/backend/vbd/14/768 -> OK
14   0        0      XS_READ: device/vbd/768/state -> 3
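
For readers decoding the trace, here is a minimal Python sketch (not part of the original comment) that interprets the values above. It assumes the error node holds a decimal errno followed by a message, that the script runs on a Linux host so errno numbers match the guest's, and that the state numbers follow the XenbusState enum in Xen's public io/xenbus.h header. `decode_xenstore_error` is a hypothetical helper name.

```python
import errno
import re

# Xenbus device states, following the XenbusState enum in Xen's io/xenbus.h.
XENBUS_STATES = {
    0: "Unknown",
    1: "Initialising",
    2: "InitWait",
    3: "Initialised",
    4: "Connected",
    5: "Closing",
    6: "Closed",
    7: "Reconfiguring",
    8: "Reconfigured",
}

def decode_xenstore_error(value):
    """Split '<errno> <message>' into (errno number, errno name, message)."""
    match = re.match(r"(\d+)\s+(.*)", value)
    if match is None:
        raise ValueError(f"unrecognised xenstore error value: {value!r}")
    code = int(match.group(1))
    # errno.errorcode reflects the host's numbering (assumed to be Linux here).
    return code, errno.errorcode.get(code, "UNKNOWN"), match.group(2)

if __name__ == "__main__":
    error = "19 xlvbd_add at /local/domain/0/backend/vbd/14/768"
    print(decode_xenstore_error(error))
    print("backend state 4:", XENBUS_STATES[4])
    print("frontend state 3:", XENBUS_STATES[3])
```

On Linux, errno 19 is ENODEV, so under these assumptions the trace reads as: the backend is Connected (4), xlvbd_add in the frontend fails with ENODEV, and the frontend device remains at Initialised (3).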

Comment 4 John Levon 2009-04-23 15:20:42 UTC
Created attachment 340953 [details]
xenstore trace

Comment 5 John Levon 2009-04-23 15:21:06 UTC
I'm told that full-PV install of RHEL5.3 works perfectly.

Comment 6 Chris Lalancette 2009-04-23 15:28:28 UTC
(In reply to comment #3)
> Chris, this is a royal pain for me to do, it seems very unlikely that it's the
> same issue...

I'm not sure why, on either count.  If you can just boot without loading the PV-on-HVM drivers, install the updated kernel, and then boot again with the PV-on-HVM drivers, that should show it.  Also, the call trace here looks very similar to the one in 477005; while that one was about our -debug variant, the fact remains that the debug code is pointing out a potential problem that could hit the non-debug variant.

It still may be a different issue, but it should be a relatively easy test to see if it works with the updated kernels.

Chris Lalancette

Comment 7 John Levon 2009-04-23 15:33:27 UTC
Is there a boot option to disable the PV drivers then?

Otherwise I have to learn how to access the disk from another Linux domU. I don't know much about LVM...

Comment 8 John Levon 2009-04-23 16:15:25 UTC
Oops, that dumpxml is wrong. It should have:

    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/zvol/dsk/export/dom/domu-220-root'/>
      <target dev='hda' bus='ide'/>
    </disk>

There. I changed this to:

    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/zvol/dsk/export/dom/domu-220-root'/>
      <target dev='xvda' bus='xen'/>
    </disk>

and it booted:

Setting up Logical Volume Management:   Found duplicate PV Yj7apcCwwN5g7myctf7jTqggVFJt9f3x: using /dev/xvda2 not /dev/hda2
  2 logical volume(s) in volume group "VolGroup00" now active
[  OK  ]

Something must be going wrong in the device calculations?
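
As background for the device numbers involved, the sketch below (an illustration, not part of the original comment) decodes a legacy Xen virtual device number of the form (major << 8) | minor. The 768 seen in xenstore corresponds to hda, while a disk configured as xvda would be 51712. `decode_vbd` is a hypothetical helper, and only the IDE and xvd majors are handled; it does not by itself explain the panic.

```python
# Block major numbers from the Linux device list:
# 3 = first IDE interface (hda, hdb), 22 = second IDE interface (hdc, hdd),
# 202 = Xen virtual block device (xvd*).
IDE_MAJORS = {3: "ab", 22: "cd"}
XVD_MAJOR = 202

def decode_vbd(number):
    """Decode a legacy Xen virtual device number into (major, minor, name).

    The name is None for majors this sketch does not know about.
    """
    major, minor = number >> 8, number & 0xFF
    if major in IDE_MAJORS and minor < 128:
        # IDE disks use 64 minors each: 0-63 for the master, 64-127 for the slave.
        disk, partition = divmod(minor, 64)
        name = "hd" + IDE_MAJORS[major][disk]
    elif major == XVD_MAJOR:
        # xvd disks use 16 minors each.
        disk, partition = divmod(minor, 16)
        name = "xvd" + chr(ord("a") + disk)
    else:
        return major, minor, None
    if partition:
        name += str(partition)
    return major, minor, name

if __name__ == "__main__":
    for number in (768, 51712):
        print(number, decode_vbd(number))
```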

Comment 11 Fajar A. Nugraha 2009-05-08 20:22:01 UTC
This is a post I made earlier to xen-discuss, reposted here at John Levon's request.

================================

I'm using RHEL5.3 x86_64 as dom0 and CentOS 5.3 x86_64 as an HVM domU, with
Xen 3.3.1. By default the PV drivers are LOADED, but not USED. hda is
still handled by ata_piix, and eth0 by 8139cp. xen-vbd is
loaded, but it does not have any devices (because hda is already
handled by ata_piix). xen-net is loaded and gets eth1 with the same MAC
address as eth0.

During boot, you can see that xen-vbd tries to grab hda but cannot
(I'll get to this later). This works fine with a Linux dom0: the domU can
continue to boot. To get the domU to actually USE the PV drivers for hda
and eth0, some steps are required: http://pastebin.com/fb6fe631

The point here is that on a Linux dom0, the domU keeps working
whichever driver handles hda.

Now here comes the funny part. On an OpenSolaris dom0, xen-vbd trying
to grab hda actually causes a KERNEL PANIC.

So here's what I did to get it working on an OpenSolaris dom0:
- do a zvol-backed fresh install of CentOS 5.3 on the OpenSolaris dom0. I
got kernel-2.6.18-128.el5. It panics on the first reboot.
- export the zvol over iSCSI, mount it on Linux, and move
/lib/modules/2.6.18-128.el5/kernel/drivers/xenpv_hvm/blkfront/xen-vbd.ko
out of the way (I simply renamed it to xen-vbd.ko.disabled)
- start up the domU again on OpenSolaris
- edit /etc/modprobe.conf and add "alias scsi_hostadapter2 xen-vbd". This
ensures xen-vbd is included in the initrd later.
- yum -y install kernel. I got kernel-2.6.18-128.1.6.el5, and
during the installation it also generated
initrd-2.6.18-128.1.6.el5.img, which contains xen-vbd.ko.
- edit /boot/grub/menu.lst to look like this

title CentOS (2.6.18-128.1.6.el5)
       root (hd0,0)
       kernel /boot/vmlinuz-2.6.18-128.1.6.el5 ro root=LABEL=/ ide0=noprobe
       initrd /boot/initrd-2.6.18-128.1.6.el5.img

The "ide0=noprobe" part tells ata_piix (or is it libata?) NOT to probe
hda and hdb.
- reboot

During the reboot you can see that xen-vbd now handles hda. Note
that hdc (the cdrom) is still handled by ide-cd. As a bonus, xen-net
now handles eth0.

[root@localhost ~]# ls -lad /sys/block/hd*/device /sys/block/hd*/device/driver/module /sys/class/net/eth*/device/driver/module
lrwxrwxrwx 1 root root 0 May  8 03:57 /sys/block/hda/device -> ../../devices/xen/vbd-768
lrwxrwxrwx 1 root root 0 May  8 04:19 /sys/block/hda/device/driver/module -> ../../../../module/xen_vbd
lrwxrwxrwx 1 root root 0 May  8 03:57 /sys/block/hdc/device -> ../../devices/pci0000:00/0000:00:01.1/ide1/1.0
lrwxrwxrwx 1 root root 0 May  8 04:19 /sys/block/hdc/device/driver/module -> ../../../../module/ide_cd
lrwxrwxrwx 1 root root 0 May  8 04:17 /sys/class/net/eth0/device/driver/module -> ../../../../module/xen_vnif
lrwxrwxrwx 1 root root 0 May  8 04:19 /sys/class/net/eth1/device/driver/module -> ../../../../module/8139cp
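
The same check can be scripted. The following minimal Python sketch (an illustration, not part of the original post) resolves the driver/module symlinks listed above for every block device. `driver_module` and `block_drivers` are hypothetical helper names, and the sysfs layout is assumed to match the 2.6.18 listing shown here.

```python
import glob
import os

def driver_module(device_dir):
    """Return the kernel module bound to a sysfs device directory, or None."""
    link = os.path.join(device_dir, "driver", "module")
    try:
        # The link target looks like ../../../../module/xen_vbd.
        return os.path.basename(os.readlink(link))
    except OSError:
        # No driver bound, or no module link for this device.
        return None

def block_drivers(sys_root="/sys"):
    """Map each block device under sys_root to the module driving it."""
    result = {}
    pattern = os.path.join(sys_root, "block", "*", "device")
    for device_dir in sorted(glob.glob(pattern)):
        name = os.path.basename(os.path.dirname(device_dir))
        result[name] = driver_module(device_dir)
    return result

if __name__ == "__main__":
    for name, module in block_drivers().items():
        print(f"{name}: {module}")
```

On the domU described above, hda would be expected to map to xen_vbd and hdc to ide_cd.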

Comment 13 Andrew Jones 2010-06-23 15:41:48 UTC
John,

Have you tried (and had better success) booting RHEL 5.4 or 5.5 on this platform?

thanks,
Andrew

Comment 14 John Levon 2010-06-23 17:52:33 UTC
Sorry, I no longer have time or inclination to work on this issue.

Comment 15 Andrew Jones 2010-06-24 08:10:15 UTC
Thanks John.

We'll close this out for now. It can always be reopened if necessary.

Andrew

Comment 16 Chris Lalancette 2010-07-19 13:50:17 UTC
Clearing out old flags for reporting purposes.

Chris Lalancette