Description of problem:
I started seeing panics with Veritas testing on 5.7 Snapshot 5.

Version-Release number of selected component (if applicable):
2.6.18-269.el5

How reproducible:
Multiple times

Steps to Reproduce:
1. Run the ISV-Veritas 4 or 5 tests in beaker against snapshot 5.

Actual results:
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<0000000000000000>]
PGD 41c8c4067 PUD 43cf33067 PMD 0
Oops: 0010 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 15
Modules linked in: gab(PU) llt(PU) fdd(PFU) vxportal(PFU) vxfs(PU) vxio(PFU) vxdmp(PU) autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod joydev ata_piix ide_cd cdrom pata_sil680 tpm_tis e100 serio_raw tg3 libata tpm mii tpm_bios floppy pcspkr qla2xxx scsi_transport_fc sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 26431, comm: gabconfig Tainted: PF ---- 2.6.18-269.el5 #1
RIP: 0010:[<0000000000000000>] [<0000000000000000>]
RSP: 0018:ffff81041897fdf0 EFLAGS: 00010246
RAX: ffffffff80295980 RBX: ffff810439345f90 RCX: ffff810439345f90
RDX: 0000000000000000 RSI: ffff8101fff71b80 RDI: 0000000000000000
RBP: ffff810439345f80 R08: 0000000000000000 R09: 0000000000000000
R10: ffff810425312140 R11: 0000000000000058 R12: ffff810000000000
R13: ffff8104324d86c0 R14: ffff8104324d86d0 R15: 00000000ffc106e8
FS: 0000000000000000(0000) GS:ffff8103ffe5bc40(0063) knlGS:00000000f7dd16c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000418ec2000 CR4: 00000000000006e0
Process gabconfig (pid: 26431, threadinfo ffff81041897e000, task ffff81043675b7e0)
Stack: ffffffff88bb0133 0000000000000000 00000000c0206701 0000000000000020 ffff81043c252a18 00000000ffc106e8 ffffffff88ba1d9d 0000000000000292 ffffffff8002239a ffff81042fb2db04 ffffffff88b93b23 0000000000000000
Call Trace:
 [<ffffffff88bb0133>] :gab:gab_imc_import+0xee/0x153
 [<ffffffff88ba1d9d>] :gab:gab_initllt+0x33/0x241
 [<ffffffff8002239a>] __up_read+0x19/0x7f
 [<ffffffff88b93b23>] :gab:gab_drv_init+0x9b/0x269
 [<ffffffff88b9f23c>] :gab:gab_untimeout+0x10/0x5b
 [<ffffffff88b93d83>] :gab:gab_config_drv+0x92/0x189
 [<ffffffff88b940f1>] :gab:gab_drv_config+0x277/0x353
 [<ffffffff88ba4851>] :gab:gab_linux_ioctl+0x10e/0x1a3
 [<ffffffff88ba4902>] :gab:gab_linux_compat_ioctl+0x1c/0x20
 [<ffffffff800fe162>] compat_sys_ioctl+0xc5/0x2b1
 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
Code: Bad RIP value.
RIP [<0000000000000000>] RSP <ffff81041897fdf0>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception

Expected results:

Additional info:
I'm still gathering information. It looks like some of the modules are failing to load. I got the following from my manual testing with Veritas SF 5.0 today.
[root@veritas3 ~]# tail /var/log/messages
Jun 27 14:46:29 veritas3 modprobe: WARNING: Could not open '/lib/modules/2.6.18-269.el5/veritas/vxvm/vxdmp.ko': No such file or directory
Jun 27 14:46:29 veritas3 modprobe: FATAL: Error inserting vxspec (/lib/modules/2.6.18-269.el5/veritas/vxvm/vxspec.ko): Unknown symbol in module, or unknown parameter (see dmesg)
Jun 27 14:46:29 veritas3 modprobe: WARNING: Could not open '/lib/modules/2.6.18-269.el5/veritas/vxvm/vxdmp.ko': No such file or directory
Jun 27 14:46:29 veritas3 modprobe: FATAL: Error inserting vxspec (/lib/modules/2.6.18-269.el5/veritas/vxvm/vxspec.ko): Unknown symbol in module, or unknown parameter (see dmesg)
Jun 27 14:48:19 veritas3 init: Id "x" respawning too fast: disabled for 5 minutes
Jun 27 14:53:20 veritas3 init: Id "x" respawning too fast: disabled for 5 minutes
Jun 27 14:56:27 veritas3 kernel: VxVM vxdmp V-5-0-897 dmplinux:vxdmp: Cannot find device number for root<4>
Jun 27 14:56:27 veritas3 kernel: VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
Jun 27 14:56:27 veritas3 kernel:
Jun 27 14:58:21 veritas3 init: Id "x" respawning too fast: disabled for 5 minutes

[root@veritas3 ~]# dmesg
security: 3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security: 61 classes, 80452 rules
security: 3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security: 61 classes, 80452 rules
security: 3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security: 61 classes, 80442 rules
security: 3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security: 61 classes, 80452 rules
security: 3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security: 61 classes, 80452 rules
security: 3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security: 61 classes, 80452 rules
VxVM vxdmp V-5-0-897 dmplinux:vxdmp: Cannot find device number for root<4>
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
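
For triage, the call trace in the panic above can be reduced to (module, symbol) pairs to show which frames belong to a loadable module and which to the core kernel. This is only a rough sketch: the regular expression and the `parse_frames` helper are my own assumptions about the 2.6.18 oops frame layout as it appears in this report, not part of any Red Hat or Veritas tool.

```python
import re

# Matches frames such as "[<ffffffff88bb0133>] :gab:gab_imc_import+0xee/0x153"
# (module frame) and "[<ffffffff8002239a>] __up_read+0x19/0x7f" (core kernel).
# The pattern is an assumption based on the trace pasted in this report.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+"
    r"(?::(?P<module>\w+):)?(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(trace):
    """Return (module, symbol) pairs; module is None for core kernel frames."""
    return [(m.group("module"), m.group("symbol")) for m in FRAME_RE.finditer(trace)]


# Call trace copied from the panic in this report.
TRACE = """
[<ffffffff88bb0133>] :gab:gab_imc_import+0xee/0x153
[<ffffffff88ba1d9d>] :gab:gab_initllt+0x33/0x241
[<ffffffff8002239a>] __up_read+0x19/0x7f
[<ffffffff88b93b23>] :gab:gab_drv_init+0x9b/0x269
[<ffffffff88b9f23c>] :gab:gab_untimeout+0x10/0x5b
[<ffffffff88b93d83>] :gab:gab_config_drv+0x92/0x189
[<ffffffff88b940f1>] :gab:gab_drv_config+0x277/0x353
[<ffffffff88ba4851>] :gab:gab_linux_ioctl+0x10e/0x1a3
[<ffffffff88ba4902>] :gab:gab_linux_compat_ioctl+0x1c/0x20
[<ffffffff800fe162>] compat_sys_ioctl+0xc5/0x2b1
[<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
"""

frames = parse_frames(TRACE)
module_frames = [symbol for module, symbol in frames if module is not None]
print(frames[0])           # innermost frame, the caller of the NULL pointer
print(len(module_frames))  # number of frames inside loadable modules
```

The innermost frame is the interesting one here: the jump to address 0 happens from inside gab_imc_import, which is consistent with the gab module calling through a function pointer that was never filled in.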
I did some more manual testing and found the following.

[root@veritas3 ~]# dmesg | grep -i vx
vxdmp: module license 'Proprietary. Send bug reports to support' taints kernel.
VxVM vxdmp V-5-0-141 dmplinux:vxdmp: Cannot find device number for rootvxio: no version for "vxvm_imc_cleanup" found: kernel tainted.
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load

[root@veritas3 ~]# lsmod | grep -i vx
vxportal 41488 0
vxfs 1625752 2 fdd,vxportal
vxio 1641832 0
vxdmp 249784 1 vxio

[root@veritas3 ~]# uname -a
Linux veritas3.rhts.eng.bos.redhat.com 2.6.18-269.el5 #1 SMP Tue Jun 21 16:22:46 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

There are complaints in dmesg about vxio not being loaded, but it shows up in lsmod.
Maybe this is something for Veritas people to look into?
(In reply to comment #3)
> Maybe this is something for Veritas people to look into?

I agree. I've contacted Veritas and am waiting for a response.
I just manually ran the test and verified that it works on kernel 2.6.18-268.el5. I grabbed the same information as above and don't see the "vxio not loaded" messages.

[root@veritas3 ~]# dmesg | grep -i vx
vxdmp: module license 'Proprietary. Send bug reports to support' taints kernel.
VxVM vxdmp V-5-0-141 dmplinux:vxdmp: Cannot find device number for rootvxio: no version for "vxvm_imc_cleanup" found: kernel tainted.

[root@veritas3 ~]# lsmod | grep -i vx
vxportal 41488 0
vxfs 1625752 2 fdd,vxportal
vxspec 42096 0
vxio 1641832 1
vxdmp 249784 5 vxspec,vxio

[root@veritas3 ~]# uname -a
Linux veritas3.rhts.eng.bos.redhat.com 2.6.18-268.el5 #1 SMP Tue Jun 14 18:24:50 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
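
To make the difference between the two kernels explicit, here is a small sketch that diffs the `lsmod | grep -i vx` output from the passing -268 run against the failing -269 run reported earlier in this bug. The `loaded_modules` helper and its dictionary layout are illustrative assumptions; the module lines themselves are copied from the comments in this report.

```python
def loaded_modules(lsmod_text):
    """Parse lsmod lines (name, size, use count, optional users) into a dict."""
    modules = {}
    for line in lsmod_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip the "Module Size Used by" header, if present
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = {
            "size": int(fields[1]),
            "refs": int(fields[2]),
            "users": users,
        }
    return modules


# Output from the passing kernel, 2.6.18-268.el5.
LSMOD_268 = """
vxportal 41488 0
vxfs 1625752 2 fdd,vxportal
vxspec 42096 0
vxio 1641832 1
vxdmp 249784 5 vxspec,vxio
"""

# Output from the failing kernel, 2.6.18-269.el5.
LSMOD_269 = """
vxportal 41488 0
vxfs 1625752 2 fdd,vxportal
vxio 1641832 0
vxdmp 249784 1 vxio
"""

good = loaded_modules(LSMOD_268)
bad = loaded_modules(LSMOD_269)

# Modules present on the passing kernel but absent on the failing one.
missing = sorted(set(good) - set(bad))
# Modules whose use count differs between the two runs.
changed = sorted(m for m in good if m in bad and good[m]["refs"] != bad[m]["refs"])

print("missing on -269:", missing)
print("use count changed:", changed)
```

On this data the only module missing on -269 is vxspec, which matches the modprobe "Error inserting vxspec ... Unknown symbol in module" message. vxio itself is loaded on both kernels, so the "vxio not loaded" text looks like it is printed by vxspec's own load path rather than reflecting what lsmod sees.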
When it panics we see this just before the panic:

vxdmp: module license 'Proprietary. Send bug reports to support' taints kernel.
VxVM vxdmp V-5-0-141 dmplinux:vxdmp: Cannot find device number for rootvxio: no version for "vxvm_imc_cleanup" found: kernel tainted.
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
vxfs: disagrees about version of symbol struct_module
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
LLT INFO V-14-1-10009 LLT Protocol available
LLT INFO V-14-1-10483 16-bit cluster ID (999) set. Updating protocol version from 3.7 to 4.0
GAB INFO V-15-1-20021 GAB available

When it passes we see this just before the point where it would have panicked:

vxdmp: module license 'Proprietary. Send bug reports to support' taints kernel.
VxVM vxdmp V-5-0-141 dmplinux:vxdmp: Cannot find device number for rootvxio: no version for "vxvm_imc_cleanup" found: kernel tainted.
vxfs: disagrees about version of symbol struct_module
VxVM vxdmp V-5-0-34 added disk array OTHER_DISKS, datype = OTHER_DISKS
VxVM vxdmp V-5-0-34 added disk array DISKS, datype = Disk
LLT INFO V-14-1-10009 LLT Protocol available
LLT INFO V-14-1-10483 16-bit cluster ID (999) set. Updating protocol version from 3.7 to 4.0
GAB INFO V-15-1-20021 GAB available
GAB INFO V-15-1-20026 Port a registration waiting for seed port membership

I spoke with David Howells about this issue. We are not yet sure whether his patch is part of the problem. If it were, the expectation was that the crash would be in something proc-related. Instead, the GAB module jumped to a NULL pointer, and since it is a proprietary module and we have no idea what it does, we can't be sure at this point. One theory is that his patch could show up as a corrupter if a module is allocating its own PDE objects.

He was careful to bury the wrapper inside fs/proc/ where other code can't get at it, and he would expect bugs to crop up in fs/proc/ when it tries to access the wrapper and it's not there. Nothing else should be aware that the wrapper exists.

Currently Dan Y is going to send the nm -u output from the GAB module, so we can at least see whether it accesses the proc routines. David H is building a 2.6.18-269.el5 kernel minus his patch. Once that is finished we will need Dan to rerun his tests, but he will have to install the test kernel first.

Regards,
Jeff
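
Once the nm -u output arrives, a quick filter like the following could flag undefined symbols that look proc-related. This is a minimal sketch under stated assumptions: the helper names are mine, the substring match on "proc" is only a heuristic, and the sample text is a HYPOTHETICAL stand-in, not the real symbol list from gab.ko (which has not been posted to this bug).

```python
def undefined_symbols(nm_output):
    """Extract symbol names from `nm -u` lines of the form '   U name'."""
    symbols = []
    for line in nm_output.splitlines():
        fields = line.split()
        # "U" marks an undefined symbol, "w" a weak undefined one.
        if len(fields) >= 2 and fields[-2] in ("U", "w"):
            symbols.append(fields[-1])
    return symbols


def proc_related(symbols):
    """Heuristic: keep symbols whose name contains 'proc'."""
    return sorted(s for s in symbols if "proc" in s)


# HYPOTHETICAL sample for illustration only; replace with the real
# `nm -u gab.ko` output once it is available.
SAMPLE_NM_OUTPUT = """
                 U create_proc_entry
                 U printk
                 U remove_proc_entry
"""

hits = proc_related(undefined_symbols(SAMPLE_NM_OUTPUT))
print(hits)
```

An empty result would not clear the patch by itself, since a module allocating its own PDE objects would not necessarily import any proc symbol, but a non-empty one would tell us where to look first.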
There is no vxio module installed, or it has been removed, or an upgrade of the module failed. I suggest looking there for clues. Also, the other modules are obviously implementing a compatible syscall wrapper that can't handle a missing piece, leading to a NULL dereference and panic on oops. I doubt this is a Red Hat issue.
Jon, while I don't disagree with your assessment, the issue remains that it worked with the 2.6.18-268.el5 kernel and is failing with 2.6.18-269.el5. It stands to reason that something has changed on our end that is possibly instigating this behaviour, since the test has not changed.
Patch(es) available in kernel-2.6.18-273.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
...
Note: this kernel contains patches that are under embargo until 2011.07.07, so it will not actually be available until the 7th or 8th.
moving to verified; issue is fixed in RHEL5.7-Server-20110707.3.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html