Description of problem:
Removing a module causes a kernel panic saying "unable to handle kernel NULL pointer dereference at virtual address 00000180". The address is always 00000180. This panic is caused by modprobe xen-balloon or modprobe -r xen-platform-pci.

Version-Release number of selected component (if applicable):
283.el5 kernel

How reproducible:

Steps to Reproduce:
1. modprobe -a xen-balloon xen-platform-pci xen-vnif
2. modprobe -r xen-platform-pci
3. modprobe -r xen-vnif

Actual results:
Kernel panic with the following output (the modprobe warning and the kernel oops were interleaved on the console):

WARNING: Error removing xen_platform_pci (/lib/modules/2.6.18-283.el5/kernel/drivers/xenpv_hvm/platform-pci/xen-platform-pci.ko): Device or resource busy

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000180
 printing eip: 00000180
*pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /class/misc/aer_inject/dev
Modules linked in: xen_balloon xen_platform_pci ipoib_helper i5k_amb hwmon capifs bas_gigaset usb_gigaset gigaset isdn slhc crc_ccitt i2c_amd756 ovcamchip reed_solomon chipreg ide_cd nsc_gpio i8xx_tco ipmi_si ipmi_msghandler nls_cp932 td
CPU: 1
EIP: 0060:[<00000180>] Tainted: G ---- VLI
EFLAGS: 00010202 (2.6.18-283.el5 #1)
EIP is at 0x180
eax: 00000180 ebx: 00000001 ecx: ec095e84 edx: ffff0001
esi: eb8953c0 edi: ec095e98 ebp: eb8953f8 esp: ec095e7c
ds: 007b es: 007b ss: 0068
Process modprobe (pid: 22483, ti=ec095000 task=ee371550 task.ti=ec095000)
Stack: f885a0a0 f896464a 00000000 00000000 00000000 00000000 00007ff0 f8965c00
       eb895000 c043fa06 ec095ed8 ec095efc f8965c00 00000000 00000000 00000000
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:
 [<f885a0a0>] balloon_init+0xa0/0xc5 [xen_balloon]
 [<c043fa06>] sys_init_module+0x1af3/0x1cb8
 [<c0473ed1>] __kmalloc+0x0/0x72
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
Code: Bad EIP value.
EIP: [<00000180>] 0x180 SS:ESP 0068:ec095e7c
 <0>Kernel panic - not syncing: Fatal exception

Expected results:
No kernel panic

Additional info:
So the problem is that after loading Xen modules on a bare-metal kernel and then attempting to unload those modules, we panic. This could probably just be thrown out with a "just don't do that" statement, but we can consider adding a simple check in the modules' init functions to bail if they're not on Xen.
I also got a panic just by running "modprobe xen-balloon", no removing needed:

[root@dell-pe1650-02 ~]# modprobe xen-balloon
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000180
 printing eip: 00000180
*pde = 2596a067
Oops: 0000 [#1] SMP
last sysfs file: /block/ram0/dev
Modules linked in: xen_balloon xen_platform_pci autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport floppy sg pcspkr scb2_flash mtdcore chipreg serio_raw tpm_tis i2c_piix4 i2c_core e1000 ide_cd cdrom tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod aacraid sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU: 0
EIP: 0060:[<00000180>] Not tainted VLI
EFLAGS: 00010202 (2.6.18-274.el5 #1)
EIP is at 0x180
eax: 00000180 ebx: 00000001 ecx: e4563e84 edx: ffff0001
esi: f5c273c0 edi: e4563e98 ebp: f5c273f8 esp: e4563e7c
ds: 007b es: 007b ss: 0068
Process modprobe (pid: 3258, ti=e4563000 task=f5d67550 task.ti=e4563000)
Stack: f88f90a0 f8d1664a 00000000 00000000 00000000 00000000 00007ff0 f8d17c00
       f5c27000 c043fa0a e4563ed8 e4563efc f8d17c00 00000000 00000000 00000000
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:
 [<f88f90a0>] balloon_init+0xa0/0xc5 [xen_balloon]
 [<c043fa0a>] sys_init_module+0x1af3/0x1cb8
 [<c0473ecd>] __kmalloc+0x0/0x72
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
Code: Bad EIP value.
EIP: [<00000180>] 0x180 SS:ESP 0068:e4563e7c
 <0>Kernel panic - not syncing: Fatal exception
It's still a shooting yourself in the foot type of situation, since you don't need a xen balloon driver if you don't have xen. However, I believe in gun control so I'll look into taking the gun away from the sysadmin with a simple if-statement.
balloon_init() already starts with a call to is_running_on_xen(). Unfortunately, is_running_on_xen() is a function-like macro that always returns 1; it is defined in <include/asm-i386/mach-xen/asm/hypervisor.h>. The idea is probably that the module is only ever built with CONFIG_XEN. This compile-time-static definition is wrong when we're inserting the module in an HVM guest or a bare metal kernel.

http://xenbits.xensource.com/linux-2.6.18-xen.hg/rev/407

After this change, modprobe shouldn't even be able to load (attempt to initialize) whatever depends on is_running_on_xen(), unless "xen-platform-pci.ko" is loaded. If "xen-platform-pci.ko" is loaded in the bare metal kernel, then is_running_on_xen() will return false.
(In reply to comment #4)
> If "xen-platform-pci.ko" is loaded in the bare
> metal kernel, then is_running_on_xen() will return false.

See cpuid(0x40000000) in get_hypercall_stubs().
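As a rough illustration of the kind of runtime check referred to here: the Xen hypervisor advertises itself through CPUID leaf 0x40000000, whose EBX, ECX and EDX registers spell "XenVMMXenVMM". The user-space C sketch below is not the kernel's get_hypercall_stubs(); the names is_xen_signature() and running_on_xen_sketch() are made up for this example, and it assumes an x86 machine with the <cpuid.h> header shipped by GCC or Clang.

```c
#include <cpuid.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the three registers, read as
 * little-endian bytes in EBX, ECX, EDX order, spell "XenVMMXenVMM".
 */
static int is_xen_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[13];

    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    sig[12] = '\0';

    return strcmp(sig, "XenVMMXenVMM") == 0;
}

/*
 * Hypothetical user-space stand-in for a runtime is_running_on_xen():
 * query CPUID leaf 0x40000000 and compare the returned signature.
 */
static int running_on_xen_sketch(void)
{
    unsigned int eax, ebx, ecx, edx;

    __cpuid(0x40000000, eax, ebx, ecx, edx);
    (void)eax; /* this sketch looks only at the signature registers */

    return is_xen_signature(ebx, ecx, edx);
}
```

On bare metal the leaf does not return that signature, so a check of this shape would report "not on Xen" instead of assuming 1 at compile time.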
Created attachment 522731
don't hardcode is_running_on_xen() for pv-on-hvm drivers

Allowing graceful failure of these modules when inadvertently loaded on native kernels. (Backport from linux-2.6.18-xen.hg changeset 407:5c61cd349b20.)
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(Re)tested PV-on-HVM drivers (disk and network) in HVM guest, they work.

Bare metal:

[root@lacos-workstation 2.6.18-284.el5.pv_modprobe_bz734708]# insmod \
    ./kernel/drivers/xenpv_hvm/balloon/xen-balloon.ko
insmod: error inserting './kernel/drivers/xenpv_hvm/balloon/xen-balloon.ko': -1 Unknown symbol in module

This is because the patch makes the balloon driver dependent on "hypercall_stubs", which is defined by xen-platform-pci. dmesg:

xen_balloon: Unknown symbol xenbus_scanf
xen_balloon: Unknown symbol xen_features
xen_balloon: Unknown symbol register_xenstore_notifier
xen_balloon: Unknown symbol register_xenbus_watch
xen_balloon: Unknown symbol hypercall_stubs

[root@lacos-workstation ~]# modprobe xen-balloon
FATAL: Error inserting xen_balloon (/lib/modules/2.6.18-284.el5.pv_modprobe_bz734708/kernel/drivers/xenpv_hvm/balloon/xen-balloon.ko): No such device

(This added xen_platform_pci first, and then the patched is_running_on_xen() macro worked.)

Similarly for xen-vbd. xen-vnif depends on xen-balloon. xen_platform_pci is permanent.
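For readers wondering where "No such device" comes from: that is the standard error text for ENODEV. The sketch below assumes the init routine bails out with -ENODEV when the runtime check fails, which matches the message above but is not quoted from the patch. init_sketch() and its on_xen parameter are hypothetical names for this illustration, written as plain user-space C rather than real module code.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for a module init routine such as balloon_init().
 * on_xen models the result of a runtime is_running_on_xen() check.
 */
static int init_sketch(int on_xen)
{
    if (!on_xen)
        return -ENODEV; /* modprobe reports this as "No such device" */

    /* ... the real setup work would go here ... */
    return 0;
}
```

Failing early like this leaves nothing half-initialized, so there is no stale pointer for a later load or unload to trip over.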
Patch(es) available in kernel-2.6.18-287.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Verified this fix with 2.6.18-300.el5. The bug can also be reproduced with the 5.7 released kernel (2.6.18-274.el5).

Verify steps:
1. Install a RHEL5.8 host with kernel 2.6.18-300.el5.
2. Boot up with the normal Linux kernel (without xen).
3. Install kernel-xen with the same version as the kernel.
4. Add the xen modules with modprobe; FATAL messages appear for xen-balloon, xen-vnif and xen-vbd:

# modprobe xen-platform-pci
# modprobe xen-balloon
FATAL: Error inserting xen_balloon (/lib/modules/2.6.18-298.el5/kernel/drivers/xenpv_hvm/balloon/xen-balloon.ko): No such device
# modprobe xen-vnif
FATAL: Error inserting xen_vnif (/lib/modules/2.6.18-298.el5/kernel/drivers/xenpv_hvm/netfront/xen-vnif.ko): No such device
# modprobe xen-vbd
FATAL: Error inserting xen_vbd (/lib/modules/2.6.18-298.el5/kernel/drivers/xenpv_hvm/blkfront/xen-vbd.ko): No such device

5. Check that only the xen_platform_pci module is loaded:

# lsmod | grep xen
xen_platform_pci 118125 0 [permanent]

6. Try to remove the xen modules; they cannot be removed, and the following FATAL/WARNING messages appear:

# modprobe -r xen-platform-pci
FATAL: Error removing xen_platform_pci (/lib/modules/2.6.18-298.el5/kernel/drivers/xenpv_hvm/platform-pci/xen-platform-pci.ko): Device or resource busy
# modprobe -r xen-vnif
WARNING: Error removing xen_platform_pci (/lib/modules/2.6.18-298.el5/kernel/drivers/xenpv_hvm/platform-pci/xen-platform-pci.ko): Device or resource busy
# modprobe -r xen-vbd
WARNING: Error removing xen_platform_pci (/lib/modules/2.6.18-298.el5/kernel/drivers/xenpv_hvm/platform-pci/xen-platform-pci.ko): Device or resource busy
# modprobe -r xen-balloon
WARNING: Error removing xen_platform_pci (/lib/modules/2.6.18-298.el5/kernel/drivers/xenpv_hvm/platform-pci/xen-platform-pci.ko): Device or resource busy

7. Repeat steps 4 and 5 ten times; no Call Trace or crash happens.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0150.html