Bug 717068

Summary: Kernel panics during Veritas SF testing.
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.7
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Daniel Yeisley <dyeisley>
Assignee: David Howells <dhowells>
QA Contact: Daniel Yeisley <dyeisley>
CC: arozansk, benl, eguan, jburke, jcm, jstancek, moshiro, myamazak, pbunyan, qcai, rwheeler, syeghiay, tmuneda, vincent
Target Milestone: rc
Hardware: Unspecified
OS: Linux
Fixed In Version: kernel-2.6.18-273.el5
Doc Type: Bug Fix
Last Closed: 2011-07-21 10:05:55 UTC
Bug Depends On: 675781

Description Daniel Yeisley 2011-06-27 20:49:28 UTC
Description of problem:
I started seeing panics with Veritas testing on 5.7 Snapshot 5.

Version-Release number of selected component (if applicable):
2.6.18-269.el5

How reproducible:
Multiple times

Steps to Reproduce:
1. Run the ISV-Veritas 4 or 5 tests in beaker against snapshot 5.  
  
Actual results:
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:  
 [<0000000000000000>] 
PGD 41c8c4067 PUD 43cf33067 PMD 0  
Oops: 0010 [1] SMP  
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq 
CPU 15  
Modules linked in: gab(PU) llt(PU) fdd(PFU) vxportal(PFU) vxfs(PU) vxio(PFU) vxdmp(PU) autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod joydev ata_piix ide_cd cdrom pata_sil680 tpm_tis e100 serio_raw tg3 libata tpm mii tpm_bios floppy pcspkr qla2xxx scsi_transport_fc sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd 
Pid: 26431, comm: gabconfig Tainted: PF    ---- 2.6.18-269.el5 #1 
RIP: 0010:[<0000000000000000>]  [<0000000000000000>] 
RSP: 0018:ffff81041897fdf0  EFLAGS: 00010246 
RAX: ffffffff80295980 RBX: ffff810439345f90 RCX: ffff810439345f90 
RDX: 0000000000000000 RSI: ffff8101fff71b80 RDI: 0000000000000000 
RBP: ffff810439345f80 R08: 0000000000000000 R09: 0000000000000000 
R10: ffff810425312140 R11: 0000000000000058 R12: ffff810000000000 
R13: ffff8104324d86c0 R14: ffff8104324d86d0 R15: 00000000ffc106e8 
FS:  0000000000000000(0000) GS:ffff8103ffe5bc40(0063) knlGS:00000000f7dd16c0 
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b 
CR2: 0000000000000000 CR3: 0000000418ec2000 CR4: 00000000000006e0 
Process gabconfig (pid: 26431, threadinfo ffff81041897e000, task ffff81043675b7e0) 
Stack:  ffffffff88bb0133 0000000000000000 00000000c0206701 0000000000000020 
 ffff81043c252a18 00000000ffc106e8 ffffffff88ba1d9d 0000000000000292 
 ffffffff8002239a ffff81042fb2db04 ffffffff88b93b23 0000000000000000 
Call Trace: 
 [<ffffffff88bb0133>] :gab:gab_imc_import+0xee/0x153 
 [<ffffffff88ba1d9d>] :gab:gab_initllt+0x33/0x241 
 [<ffffffff8002239a>] __up_read+0x19/0x7f 
 [<ffffffff88b93b23>] :gab:gab_drv_init+0x9b/0x269 
 [<ffffffff88b9f23c>] :gab:gab_untimeout+0x10/0x5b 
 [<ffffffff88b93d83>] :gab:gab_config_drv+0x92/0x189 
 [<ffffffff88b940f1>] :gab:gab_drv_config+0x277/0x353 
 [<ffffffff88ba4851>] :gab:gab_linux_ioctl+0x10e/0x1a3 
 [<ffffffff88ba4902>] :gab:gab_linux_compat_ioctl+0x1c/0x20 
 [<ffffffff800fe162>] compat_sys_ioctl+0xc5/0x2b1 
 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76 
 
 
Code:  Bad RIP value. 
RIP  [<0000000000000000>] 
 RSP <ffff81041897fdf0> 
CR2: 0000000000000000 
 <0>Kernel panic - not syncing: Fatal exception 

Expected results:


Additional info:
I'm still gathering information.  It looks like some of the modules are failing to load.

I got the following from my manual testing with Veritas SF 5.0 today.

[root@veritas3 ~]# tail /var/log/messages
Jun 27 14:46:29 veritas3 modprobe: WARNING: Could not open '/lib/modules/2.6.18-269.el5/veritas/vxvm/vxdmp.ko': No such file or directory
Jun 27 14:46:29 veritas3 modprobe: FATAL: Error inserting vxspec (/lib/modules/2.6.18-269.el5/veritas/vxvm/vxspec.ko): Unknown symbol in module, or unknown parameter (see dmesg)
Jun 27 14:46:29 veritas3 modprobe: WARNING: Could not open '/lib/modules/2.6.18-269.el5/veritas/vxvm/vxdmp.ko': No such file or directory
Jun 27 14:46:29 veritas3 modprobe: FATAL: Error inserting vxspec (/lib/modules/2.6.18-269.el5/veritas/vxvm/vxspec.ko): Unknown symbol in module, or unknown parameter (see dmesg)
Jun 27 14:48:19 veritas3 init: Id "x" respawning too fast: disabled for 5 minutes
Jun 27 14:53:20 veritas3 init: Id "x" respawning too fast: disabled for 5 minutes
Jun 27 14:56:27 veritas3 kernel: VxVM vxdmp V-5-0-897 dmplinux:vxdmp: Cannot find device number for root<4>
Jun 27 14:56:27 veritas3 kernel: VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
Jun 27 14:56:27 veritas3 kernel:
Jun 27 14:58:21 veritas3 init: Id "x" respawning too fast: disabled for 5 minutes

[root@veritas3 ~]# dmesg
security:  3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security:  61 classes, 80452 rules
security:  3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security:  61 classes, 80452 rules
security:  3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security:  61 classes, 80442 rules
security:  3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security:  61 classes, 80452 rules
security:  3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security:  61 classes, 80452 rules
security:  3 users, 6 roles, 2008 types, 267 bools, 1 sens, 1024 cats
security:  61 classes, 80452 rules
VxVM vxdmp V-5-0-897 dmplinux:vxdmp: Cannot find device number for root<4>
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load

Comment 2 Daniel Yeisley 2011-06-28 14:51:44 UTC
I did some more manual testing and found the following.

[root@veritas3 ~]# dmesg | grep -i vx
vxdmp: module license 'Proprietary.  Send bug reports to support' taints kernel.
VxVM vxdmp V-5-0-141 dmplinux:vxdmp: Cannot find device number for rootvxio: no version for "vxvm_imc_cleanup" found: kernel tainted.
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load
VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load

[root@veritas3 ~]# lsmod | grep -i vx
vxportal               41488  0
vxfs                 1625752  2 fdd,vxportal
vxio                 1641832  0
vxdmp                 249784  1 vxio

[root@veritas3 ~]# uname -a
Linux veritas3.rhts.eng.bos.redhat.com 2.6.18-269.el5 #1 SMP Tue Jun 21 16:22:46 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

There are complaints in dmesg about vxio not being loaded, but it shows up in lsmod.

Comment 3 Ric Wheeler 2011-06-28 15:04:14 UTC
Maybe this is something for Veritas people to look into?

Comment 4 Daniel Yeisley 2011-06-28 15:24:32 UTC
(In reply to comment #3)
> Maybe this is something for Veritas people to look into?

I agree.  I've contacted Veritas and am waiting for a response.

Comment 5 Daniel Yeisley 2011-06-28 15:33:08 UTC
I just manually ran the test and verified that it works on kernel 2.6.18-268.el5.

I grabbed the same information as above and don't see the "vxio not loaded" messages.

[root@veritas3 ~]# dmesg | grep -i vx
vxdmp: module license 'Proprietary.  Send bug reports to support' taints kernel.
VxVM vxdmp V-5-0-141 dmplinux:vxdmp: Cannot find device number for rootvxio: no version for "vxvm_imc_cleanup" found: kernel tainted.

[root@veritas3 ~]# lsmod | grep -i vx
vxportal               41488  0 
vxfs                 1625752  2 fdd,vxportal
vxspec                 42096  0 
vxio                 1641832  1 
vxdmp                 249784  5 vxspec,vxio

[root@veritas3 ~]# uname -a
Linux veritas3.rhts.eng.bos.redhat.com 2.6.18-268.el5 #1 SMP Tue Jun 14 18:24:50 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
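
For anyone repeating this comparison, here is a minimal sketch of how the two lsmod listings could be diffed mechanically. It assumes each listing has been saved to a file; the function name and the file names are hypothetical and not part of the test suite.

```shell
# Hypothetical helper: print module names (column 1 of lsmod output) that
# appear in the first saved listing but not in the second.
#   $1 = listing from the working kernel
#   $2 = listing from the failing kernel
missing_modules() {
    # awk reads the failing-kernel listing first and records its module
    # names, then prints any name from the working-kernel listing that
    # was not recorded.
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen)       { print $1 }' "$2" "$1"
}
```

Run as `missing_modules lsmod-268.txt lsmod-269.txt` (hypothetical file names). With the two `lsmod | grep -i vx` listings from this report as input, it prints vxspec, the one module loaded on 2.6.18-268.el5 but not on 2.6.18-269.el5.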

Comment 6 Jeff Burke 2011-06-28 15:44:48 UTC
When it panics we see this just before the panic: 

 vxdmp: module license 'Proprietary.  Send bug reports to support' taints kernel. 
 VxVM vxdmp V-5-0-141 dmplinux:vxdmp: Cannot find device number for rootvxio: no version for "vxvm_imc_cleanup" found: kernel tainted.
 VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load 
 vxfs: disagrees about version of symbol struct_module 
 VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load 
 VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load 
 VxVM vxio V-5-0-472 vxspec: vxio not loaded. Aborting vxspec load 
 LLT INFO V-14-1-10009 LLT Protocol available 
 LLT INFO V-14-1-10483 16-bit cluster ID (999) set. Updating protocol version from 3.7 to 4.0 
 GAB INFO V-15-1-20021 GAB available

When it passes, we see this just before the point where it would have panicked:

vxdmp: module license 'Proprietary.  Send bug reports to support' taints kernel. 
VxVM vxdmp V-5-0-141 dmplinux:vxdmp: Cannot find device number for rootvxio: no version for "vxvm_imc_cleanup" found: kernel tainted. 
vxfs: disagrees about version of symbol struct_module 
VxVM vxdmp V-5-0-34 added disk array OTHER_DISKS, datype = OTHER_DISKS 
VxVM vxdmp V-5-0-34 added disk array DISKS, datype = Disk 
LLT INFO V-14-1-10009 LLT Protocol available 
LLT INFO V-14-1-10483 16-bit cluster ID (999) set. Updating protocol version from 3.7 to 4.0 
GAB INFO V-15-1-20021 GAB available 
GAB INFO V-15-1-20026 Port a registration waiting for seed port membership 

I spoke with David Howells about this issue. We are not yet sure whether his patch is part of the problem. If it were, the expectation was that the crash would be in something proc-related. Instead, the GAB module jumped to a NULL pointer, and since it is a proprietary module and we have no idea what it does, we can't be sure at this point.

One theory is that his patch could show up as a corrupter if a module allocates its own PDE (proc_dir_entry) objects. He was careful to bury the wrapper inside fs/proc/, where other code can't get at it, so he would expect bugs to crop up in fs/proc/ when it tries to access the wrapper and the wrapper is not there. Nothing else should be aware that the wrapper exists.

Dan Y is going to send the output of nm -u for the GAB module, so we can at least see whether it accesses the proc routines. David H is building a 2.6.18-269.el5 kernel without his patch. Once that is finished, Dan will need to install the test kernel and rerun his tests.
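
A rough sketch of the nm -u check mentioned above, assuming the goal is simply to spot imported symbols whose names contain "proc" as an underscore-delimited word (for example create_proc_entry or proc_mkdir). The function name and the name pattern are assumptions for illustration; the path to gab.ko is not given in this report.

```shell
# Hypothetical filter: read `nm -u` output on stdin and print only the
# undefined (imported) symbol names that look proc-related.
filter_proc_symbols() {
    # The symbol name is the last field of each nm output line.
    awk '{ print $NF }' | grep -E '(^|_)proc(_|$)'
}
```

Usage would be along the lines of `nm -u /path/to/gab.ko | filter_proc_symbols`, where the path is a placeholder. Empty output would only mean that no imported symbol matches the pattern, not that the module never touches proc structures.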

Regards,
Jeff

Comment 8 Jon Masters 2011-06-28 23:52:44 UTC
Either there is no vxio module installed, it has been removed, or an upgrade of the module failed. I suggest looking there for clues. Also, the other modules are obviously implementing a compatible syscall wrapper that can't handle a missing piece, leading to a NULL dereference and panic on oops. I doubt this is a Red Hat issue.
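
One way to check whether the module files are actually on disk, sketched under the assumption that each module is installed as NAME.ko somewhere under the kernel's module tree (as the /lib/modules/2.6.18-269.el5/veritas/vxvm/ paths in the description suggest). The function name is hypothetical.

```shell
# Hypothetical helper: for each named module, report whether a matching
# .ko file exists anywhere under the given module tree.
#   $1     = module tree, e.g. /lib/modules/$(uname -r)
#   $2...  = module names without the .ko suffix
check_module_files() {
    tree=$1
    shift
    for name in "$@"; do
        if [ -n "$(find "$tree" -type f -name "$name.ko" 2>/dev/null)" ]; then
            echo "$name: present"
        else
            echo "$name: missing"
        fi
    done
}
```

For example, `check_module_files "/lib/modules/$(uname -r)" vxdmp vxio vxspec` prints one present/missing line per module. This only confirms that the files exist; it says nothing about whether they load cleanly.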

Comment 9 Jeff Burke 2011-06-29 02:13:19 UTC
Jon,
 While I don't disagree with your assessment, the issue remains that it worked with the 2.6.18-268.el5 kernel and is failing with 2.6.18-269.el5.
It stands to reason that something has changed on our end that is possibly instigating this behaviour, since the test has not changed.

Comment 29 Jarod Wilson 2011-07-06 15:24:42 UTC
Patch(es) available in kernel-2.6.18-273.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
...
Note: this kernel contains patches that are under embargo until 2011.07.07, so
it will not actually be available until the 7th or 8th.

Comment 34 Daniel Yeisley 2011-07-08 13:11:27 UTC
moving to verified; issue is fixed in RHEL5.7-Server-20110707.3.

Comment 35 errata-xmlrpc 2011-07-21 10:05:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html