Escalated to Bugzilla from IssueTracker
A system experienced a crash do to a NULL pointer dereference in lpfc_nlp_put(). This occurred because the lpfc_nodelist structure had a NULL phba pointer. The system crashed when lpfc_nlp_put() tried to dereference the NULL phba pointer to reference a spin_lock. #0 [ffff8102579eda80] crash_kexec at ffffffff800ad2aa #1 [ffff8102579edb40] __die at ffffffff80065157 #2 [ffff8102579edb80] do_page_fault at ffffffff80066dd7 #3 [ffff8102579edc70] error_exit at ffffffff8005dde9 [exception RIP: _spin_lock_irqsave+3] RIP: ffffffff80064b0b RSP: ffff8102579edd20 RFLAGS: 00010006 RAX: 0000000000000206 RBX: 0000000000001578 RCX: 0000000000004d6b RDX: ffffffff8816104d RSI: 0000000000000100 RDI: 0000000000001578 RBP: ffff810fa17a7e00 R8: 0000000052000045 R9: 0000000000000000 R10: ffff81010fee4158 R11: 0000000000000282 R12: ffff8102579eded8 R13: ffff81047f6da000 R14: 0000000000000000 R15: ffff810111434000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #4 [ffff8102579edd20] lpfc_nlp_put at ffffffff88131dd0 #5 [ffff8102579edd60] lpfc_ioctl_send_mgmt_cmd at ffffffff8815623a #6 [ffff8102579eddf0] lpfcdfc_do_ioctl at ffffffff88156d9f #7 [ffff8102579ede90] lpfcdfc_compat_ioctl at ffffffff88157999 #8 [ffff8102579edf20] compat_sys_ioctl at ffffffff800f927c #9 [ffff8102579edf80] cstar_do_call at ffffffff8006160c From the vmcore, it looks like the lpfc_nodelist structure at 0xffff810fa17a7e00 was already freed, even though it was in use by the ioctl call. The memory is listed as allocated, but most fields of the structure contain invalid values. The ioctl request had a node id of 0xfffffc, and there was another lpfc_nodelist with this same id at ffff81107fe1fa00. Unlike the bad lpfc_nodelist, this other matching nodelist is on the fc_nodes list for this host while the bad one is not. So it appears the bad lpfc_nodelist was freed and later replaced with a new lpfc_nodelist for the same id.
One concern is the use of lpfc_nlp_get() in lpfc_ioctl_send_mgmt_cmd(). The return value of lpfc_nlp_get() isn't checked. So it may be possible the lpfc_nlp_get() call would not up the nlp reference count, allowing the lpfc_nodelist structure to be deleted while the ioctl path is using it.
Created attachment 441037 [details] test patch to handle lpfc_nlp_get() not increasing the refcount I've attached a test patch to handle the cases where lpfc_nlp_get() could fail and return NULL in lpfc_ioctl_send_mgmt_cmd(), signifying that the reference count was not increased. The patch also has extra WARN_ON() checks to confirm and report if the condition of lpfc_nlp_get() failing within lpfc_ioctl_send_mgmt_cmd() happens.
Hi Emulex folks, this is a customer problem, so a review of the patch and reply here would be appreciated.
David, I do not see any mention of what driver this patch is for, is it against the inbox RHEL5.4 (8.2.0.48.2p), or against a later version of the driver? Thanks, Joe
The patch was against the RHEL 5.4 driver.
(In reply to comment #6) > The patch was against the RHEL 5.4 driver. Joseph @ Emulex - anything else you need from us on this? How does the patch look in Comment #3?
This does appear to be a bug, thanks David for the analysis. However, the attached patch as it is might cause some other unexpected funtional failures. We started working on a driver fix, we hope to provide the fix/patch here for testing/verification by COB tomorrow 9/1/10.
Created attachment 442494 [details] Ndlp refcount fix Please see the patch as a fix against the RHEL5.4 8.2.0.48.2p driver. The problem was due to the management path, putting reference count on an ndlp, failed to check the return value from lpfc_nlp_get() routine. There is a possible race condition where the base driver state machine might release the ndlp, and the management path then encounters a null pointer, and without checking for that condition its deference can cause a system crash. This patch also includes a related fix to the lpfc_ioctl_send_els() routine.
(In reply to comment #9) Hi Vaios, The flow for z-stream is that a patch needs to be posted for the current minor release under development. That would be rhel5.6 in this case. I checked and it appears that this fix applies to rhel5.6 as well. Can you provide the equivalent rhel5.6 patch in another bz? I can post it and that will enable the z-stream update for this patch. This doesn't apply upstream or in rhel6 due to ioctl functionality being replaced. Rob
Hi Vaios, First off, thanks for putting this together! We seem unable to trigger the issue even with multiple lip's issued. Would it be possible for you guys to create a reproducer so that we can verify the patch? Thanks! -Guil
(In reply to comment #9) > Created attachment 442494 [details] > Ndlp refcount fix > > Please see the patch as a fix against the RHEL5.4 8.2.0.48.2p driver. > > The problem was due to the management path, putting reference count on an ndlp, > failed to check the return value from lpfc_nlp_get() routine. There is a > possible race condition where the base driver state machine might release the > ndlp, and the management path then encounters a null pointer, and without > checking for that condition its deference can cause a system crash. > > This patch also includes a related fix to the lpfc_ioctl_send_els() routine. The customer tested the patch and the box isn't even coming up, it's always restarting... bt: PID: 8218 TASK: ffff810c265a5040 CPU: 3 COMMAND: "vxesd" #0 [ffff810c0b9fba80] crash_kexec at ffffffff800ad301 #1 [ffff810c0b9fbb40] __die at ffffffff80065157 #2 [ffff810c0b9fbb80] do_page_fault at ffffffff80066dd7 #3 [ffff810c0b9fbc70] error_exit at ffffffff8005dde9 [exception RIP: _spin_lock_irqsave+3] RIP: ffffffff80064b0b RSP: ffff810c0b9fbd20 RFLAGS: 00010006 RAX: 0000000000000206 RBX: 0000000000001578 RCX: 0000000000000000 RDX: ffffffff881a502d RSI: 0000000000000100 RDI: 0000000000001578 RBP: ffff810c16700800 R8: 0000000000000040 R9: 0000000000000000 R10: ffff81000901ea90 R11: 0000000000000000 R12: ffff810c0b9fbed8 R13: ffff810c25880000 R14: 0000000000000000 R15: ffff810c09f39000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #4 [ffff810c0b9fbd20] lpfc_nlp_put at ffffffff88175dd0 #5 [ffff810c0b9fbd60] lpfc_ioctl_send_mgmt_cmd at ffffffff8819a217 #6 [ffff810c0b9fbdf0] lpfcdfc_do_ioctl at ffffffff8819ad7a #7 [ffff810c0b9fbe90] lpfcdfc_compat_ioctl at ffffffff8819b974 #8 [ffff810c0b9fbf20] compat_sys_ioctl at ffffffff800f92d3 #9 [ffff810c0b9fbf80] sysenter_do_call at ffffffff8006149d RIP: 00000000ffffe410 RSP: 00000000fffb32b8 RFLAGS: 00000296 RAX: ffffffffffffffda RBX: ffffffff8006149d RCX: 00000000c028fc01 RDX: 00000000fffb3300 RSI: 00000000f7ca3448 RDI: 00000000fffb3300 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00000000fffb32e8 ORIG_RAX: 0000000000000036 CS: 0023 SS: 002b The bug comes up at [2]: 4371 phba = ndlp->phba; <---- ! [1] 4372 spin_lock_irqsave(&phba->ndlp_lock, flags); 4373 /* Check the ndlp memory free acknowledge flag to avoid the 4374 * possible race condition that kref_put got invoked again 4375 * after previous one has done ndlp memory free. 4376 */ 4377 if (NLP_CHK_FREE_ACK(ndlp)) { [2] <--- HERE!!! 4378 spin_unlock_irqrestore(&phba->ndlp_lock, flags); but we see that at [1] we dereference it successfully at [1]: 0xffffffff88175d7a <lpfc_nlp_put>: push %r15 0xffffffff88175d7c <lpfc_nlp_put+2>: push %r14 0xffffffff88175d7e <lpfc_nlp_put+4>: push %r13 0xffffffff88175d80 <lpfc_nlp_put+6>: push %r12 0xffffffff88175d82 <lpfc_nlp_put+8>: push %rbp 0xffffffff88175d83 <lpfc_nlp_put+9>: mov %rdi,%rbp 0xffffffff88175d86 <lpfc_nlp_put+12>: push %rbx 0xffffffff88175d87 <lpfc_nlp_put+13>: sub $0x8,%rsp 0xffffffff88175d8b <lpfc_nlp_put+17>: test %rdi,%rdi 0xffffffff88175d8e <lpfc_nlp_put+20>: je 0xffffffff88175efc 0xffffffff88175d94 <lpfc_nlp_put+26>: mov 0x170(%rdi),%r9d 0xffffffff88175d9b <lpfc_nlp_put+33>: mov 0x24(%rdi),%ecx 0xffffffff88175d9e <lpfc_nlp_put+36>: mov $0xffffffff881a502d,%rdx 0xffffffff88175da5 <lpfc_nlp_put+43>: mov 0xe0(%rdi),%rdi 0xffffffff88175dac <lpfc_nlp_put+50>: mov 0x20(%rbp),%r8d 0xffffffff88175db0 <lpfc_nlp_put+54>: mov $0x100,%esi 0xffffffff88175db5 <lpfc_nlp_put+59>: callq 0xffffffff8819635c <lpfc_debugfs_disc_trc> 0xffffffff88175dba <lpfc_nlp_put+64>: mov 0xd0(%rbp),%rbx <-- [1] HERE 0xffffffff88175dc1 <lpfc_nlp_put+71>: add $0x1578,%rbx 0xffffffff88175dc8 <lpfc_nlp_put+78>: mov %rbx,%rdi 0xffffffff88175dcb <lpfc_nlp_put+81>: callq 0xffffffff80064b08 <_spin_lock_irqsave> 0xffffffff88175dd0 <lpfc_nlp_put+86>: mov 0x3e(%rbp),%dx <-- [2] BUG and at [2] we have the BUG. So, the only thing that comes to my mind is that while we are spinning on that spinlock, somebody who has it (the spinlock) taken is free()ing that pointer, so that we end up with unallocated mem after the lock. I've walked the code a bit but didn't find anything that could be doing so, but anyway I don't know the code at all... Do you know what could it be? Also, we have some veritas modules loaded, in short time we'll try to reproduce it without them... Thank you!
Created attachment 445761 [details] Additional refcount fixes The crash can still occur with the ndlp refcount fix patch. From more review, there are places in the code outside of the ioctl path which also reduce the reference count even though an attempt acquire a reference failed. The second patch is in addition to the patch to fix lpfc_ioctl_send_mgmt_cmd() .
Event posted on 09-07-2010 03:46pm EDT by fhirtz Note, we have an occurrence of this on a different system which was logged in 1287663. I'll include a pointer to where the data is staged shortly. This event sent from IssueTracker by fhirtz issue 1266703
Event posted on 09-07-2010 03:55pm EDT by fhirtz <snip> Your corefile is ready for you You may view it at megatron.gsslab.rdu.redhat.com Login with kerberos name/password $ cd /cores/20100907150202/work /cores/20100907150202/work$ ./crash Backtrace KERNEL: /cores/20100907150202/work/vmlinux DUMPFILE: /cores/20100907150202/work/vmcore_1287663_eqperforce6_090510 [PARTIAL DUMP] CPUS: 8 DATE: Sun Sep 5 18:24:08 2010 UPTIME: 05:49:25 LOAD AVERAGE: 1.91, 1.36, 1.17 TASKS: 569 NODENAME: eqperforce6 RELEASE: 2.6.18-164.17.1.el5 VERSION: #1 SMP Thu Apr 1 04:46:15 EDT 2010 MACHINE: x86_64 (2197 Mhz) MEMORY: 31.4 GB PANIC: "Oops: 0002 [1] SMP " (check log for details) <snip> mptctl: /dev/mptctl @ (major,minor=10,220) ioctl32(bpbkar:30573): Unknown cmd fd(3) cmd(5658462c){16} arg(ff87fb1c) on / ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /misc ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /var ioctl32(bpbkar:30573): Unknown cmd fd(8) cmd(5658462c){16} arg(ff87915c) on /var/lib/nfs/v4recovery ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /tmp ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /sbin ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /etc ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /root ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /bin ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /boot ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /mnt ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /lost+found ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac) on /net FS-Cache: Loaded ioctl32(bpbkar:24212): Unknown cmd fd(4) cmd(5658462c){16} arg(ffc73f7c) on /boot ioctl32(bpbkar:24212): Unknown cmd fd(4) cmd(5658462c){16} arg(ffc73f7c) on / ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /misc ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /var ioctl32(bpbkar:24212): Unknown cmd fd(8) cmd(5658462c){16} arg(ffc6d5bc) on /var/lib/nfs/v4recovery ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /tmp ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /sbin ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /etc ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /root ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /bin ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /boot ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /mnt ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /lost+found ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c) on /net Unable to handle kernel paging request at 000000000c901578 RIP: [<ffffffff80064b0b>] _spin_lock_irqsave+0x3/0xd PGD 6c1357067 PUD 6a943b067 PMD 0 Oops: 0002 [1] SMP last sysfs file: /class/fc_host/host1/active_fc4s CPU 7 Modules linked in: nfs fscache nfs_acl mptctl mptbase autofs4 dmpjbod(PU) dmpap(PU) dmpaa(PU) vxspec(PFU) vxio(PFU) vxdmp(PU) lockd sunrpc bonding ipv6 xfrm_nalgo crypto_api vxportal(PFU) fdd(PFU) vxfs(PU) dm_mirror dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i2c_amd756 i2c_core lpfc ide_cd amd_rng hpilo k8_edac e1000 tg3 cdrom edac_mc scsi_transport_fc shpchp k8temp hwmon floppy serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 21862, comm: vxesd Tainted: PF 2.6.18-164.17.1.el5 #1 RIP: 0010:[<ffffffff80064b0b>] [<ffffffff80064b0b>] _spin_lock_irqsave+0x3/0xd RSP: 0018:ffff8106b1121d20 EFLAGS: 00010006 RAX: 0000000000000206 RBX: 000000000c901578 RCX: 000000000040b370 RDX: ffffffff882b704d RSI: 0000000000000100 RDI: 000000000c901578 RBP: ffff8107ec574a00 R08: 0000000028000045 R09: 0000000000000000 R10: ffff8107ffef4068 R11: ffff8107e8d830f0 R12: ffff8106b1121ed8 R13: ffff8107fe2dc000 R14: 0000000000000000 R15: ffff81020ff42000 FS: 00002b2541078240(0000) GS:ffff8107ffe6a3c0(0063) knlGS:0000000055571590 CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b CR2: 000000000c901578 CR3: 00000006a4b7a000 CR4: 00000000000006e0 Process vxesd (pid: 21862, threadinfo ffff8106b1120000, task ffff81033ea4f860) Stack: ffffffff88287dd0 ffff81033ea4f860 0000000000000013 ffff8107fb18a4f8 ffff8106b1121ed8 ffff8107fe2dc000 0000000000000000 ffff81020ff42000 ffffffff882ac23a ffff81020ff42000 ffff8106b1121e98 0000000000000001 Call Trace: [<ffffffff88287dd0>] :lpfc:lpfc_nlp_put+0x56/0x194 [<ffffffff882ac23a>] :lpfc:lpfc_ioctl_send_mgmt_cmd+0x614/0x633 [<ffffffff80128562>] avc_has_perm+0x46/0x58 [<ffffffff882acd9f>] :lpfc:lpfcdfc_do_ioctl+0x301/0xe61 [<ffffffff882ad999>] :lpfc:lpfcdfc_compat_ioctl+0x9a/0x131 [<ffffffff800f92d3>] compat_sys_ioctl+0xc5/0x2b2 [<ffffffff8006160c>] cstar_do_call+0x1b/0x6e Code: f0 ff 0f 0f 88 f2 00 00 00 c3 fa f0 ff 0f 0f 88 f3 00 00 00 RIP [<ffffffff80064b0b>] _spin_lock_irqsave+0x3/0xd RSP <ffff8106b1121d20> </snip> This is obviously happening with the veritas stack installed, and it does appear from the log that netbackup is probing devices with some sort of unrecognized 32bit(?) ioctl on boot, though the system here ran for about 6hrs after that. Given the nature of this, that might be interesting. This event sent from IssueTracker by fhirtz issue 1266703
Event posted on 09-08-2010 11:56am EDT by fhirtz To be clear, from the client facing GSS perspective in NA, I own the interaction. I do have a question related to the current strategy. Though Veritas might be calling some sort of broken/non-translatable 32bit ioctl here, am I correct in presuming that it shouldn't actually be able to fail the entire system? Also, do we have an idea why it does so quickly with the patch(es), but not stock? This event sent from IssueTracker by fhirtz issue 1266703
Hi Vaios, Are you still looking into this? Thanks, Rob
There was a crash with the refcount patches and LOG_NODE set. We see the following messages from the lpfc driver before the crash: lpfc 0000:0d:00.0: 0:(0):0932 FIND node did xfffffa NOT FOUND. lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffa, UNUSED -> PLOGI lpfc 0000:0d:00.0: 0:(0):0929 FIND node DID Data: xffff810c25146c00 xfffffa x0 x1000000 lpfc 0000:0d:00.0: 0:(0):0929 FIND node DID Data: xffff810c25146c00 xfffffa x0 x1000000 lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffa, UNMAPPED -> PLOGI lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffc, PLOGI -> UNUSED lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffa, PLOGI -> REGLOGIN lpfc 0000:0d:00.0: 0:(0):0279 lpfc_nlp_release: ndlp:xffff810c25146c00 usgmap:xf refcnt:0 lpfc 0000:0d:00.0: 0:(0):0900 Cleanup node for NPort xfffffc Data: x0 x0 x0 lpfc 0000:0d:00.0: 0:(0):0280 lpfc_cleanup_node: ndlp:xffff810c25146c00 usgmap:xf refcnt:0 lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffa, REGLOGIN -> UNMAPPED A node for xfffffa was created at address xffff810c25146c00, started through the state machine and then was freed. (The 0280 line from lpfc_cleanup_node() means the FREE_REQ bit was set and the memory was freed at the completion of lpfc_nlp_release() .) The crash then occurred when lpfc_nlp_put() was called on the already freed xffff810c25146c00 node. The nlp address is in register RBP, which lpfc_nlp_put() used to generate the invalid spin_lock address in RDI. dis lpfc_nlp_put() ... 0xffffffff88175dce <lpfc_nlp_put+64>: mov 0xd0(%rbp),%rbx 0xffffffff88175dd5 <lpfc_nlp_put+71>: add $0x1578,%rbx 0xffffffff88175ddc <lpfc_nlp_put+78>: mov %rbx,%rdi 0xffffffff88175ddf <lpfc_nlp_put+81>: callq 0xffffffff80064b08 <_spin_lock_irqsave> 0xffffffff88175de4 <lpfc_nlp_put+86>: mov 0x3e(%rbp),%dx ... bt PID: 6419 TASK: ffff810c271330c0 CPU: 11 COMMAND: "vxesd" #0 [ffff810c0f2ada80] crash_kexec at ffffffff800ad301 #1 [ffff810c0f2adb40] __die at ffffffff80065157 #2 [ffff810c0f2adb80] do_page_fault at ffffffff80066dd7 #3 [ffff810c0f2adc70] error_exit at ffffffff8005dde9 [exception RIP: _spin_lock_irqsave+3] RIP: ffffffff80064b0b RSP: ffff810c0f2add20 RFLAGS: 00010006 RAX: 0000000000000206 RBX: 0000000000001aac RCX: 0000000000000000 RDX: ffffffff881a502d RSI: 0000000000000100 RDI: 0000000000001aac RBP: ffff810c25146c00 R8: 000000000001bb40 R9: 0000000000000000 R10: ffff810009061e90 R11: 0000000000000000 R12: ffff810c0f2aded8 R13: ffff810c25976000 R14: 0000000000000000 R15: ffff810c0e821000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #4 [ffff810c0f2add20] lpfc_nlp_put at ffffffff88175de4 #5 [ffff810c0f2add60] lpfc_ioctl_send_mgmt_cmd at ffffffff8819a233 #6 [ffff810c0f2addf0] lpfcdfc_do_ioctl at ffffffff8819ad96 #7 [ffff810c0f2ade90] lpfcdfc_compat_ioctl at ffffffff8819b990 #8 [ffff810c0f2adf20] compat_sys_ioctl at ffffffff800f92d3 #9 [ffff810c0f2adf80] sysenter_do_call at ffffffff8006149d RIP: 00000000ffffe410 RSP: 00000000ffca16a8 RFLAGS: 00000296 RAX: ffffffffffffffda RBX: ffffffff8006149d RCX: 00000000c028fc01 RDX: 00000000ffca16f0 RSI: 00000000f7bd3448 RDI: 00000000ffca16f0 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00000000ffca16d8 ORIG_RAX: 0000000000000036 CS: 0023 SS: 002b From the messages, it appears lpfc_ioctl_send_mgmt_cmd() took the !pndl path and the reference count for the newly allocated nlp hit zero while the code in lpfc_ioctl_send_mgmt_cmd() still believed it held a reference.
Created attachment 447799 [details] Updated patch for addressing NDLP refcount issue This is the updated patch that provides an additional fix based on the latest data from Comment #19. This latest patch (*-2.patch) includes (and extends) the patch entered in Comment #9, and applies to the RHEL5.4 8.2.0.48.2p LPFC driver.
Comment on attachment 447799 [details] Updated patch for addressing NDLP refcount issue Please disregard this latest -2 patch. We are in the process of building a new one.
Created attachment 447823 [details] Latest patch for ndlp refcount issue Correct updated patch for addressing NDLP refcount issue. This is the updated patch that provides an additional fix based on the latest data from Comment #19. This latest patch (*-3.patch) includes (and extends) the patch entered in Comment #9, and applies to the RHEL5.4 8.2.0.48.2p LPFC driver.
(In reply to comment #17) > Event posted on 09-08-2010 11:56am EDT by fhirtz > > To be clear, from the client facing GSS perspective in NA, I own the > interaction. > > I do have a question related to the current strategy. Though Veritas might > be calling some sort of broken/non-translatable 32bit ioctl here, am I > correct in presuming that it shouldn't actually be able to fail the > entire system? Also, do we have an idea why it does so quickly with the > patch(es), but not stock? > > > This event sent from IssueTracker by fhirtz > issue 1266703 Is it possible to get this patch into a kernel at the customer site and tested?
Customer has been running a test kernel with the last patch on 3 systems for 3 days so far. Previously, the systems would crash at startup do to ioctl calls from vxesd. With the patched test kernel, the systems did not crash at startup and have continued to run without issue.
(In reply to comment #22) > Created attachment 447823 [details] > Latest patch for ndlp refcount issue > > Correct updated patch for addressing NDLP refcount issue. > > This is the updated patch that provides an additional fix based on the latest > data from Comment #19. > > This latest patch (*-3.patch) includes (and extends) the patch entered in > Comment #9, and applies to the RHEL5.4 8.2.0.48.2p LPFC driver. Hi Vaios, Thanks for getting this patch together. We need to have a rhel5.6 version of this patch before we can get the rhel5.4 version of this patch into z-stream. Can you open another bz, copy me and Andrius on it, and attach a rhel5.6 version of this patch to it? Thanks, Rob
(In reply to comment #26) > (In reply to comment #22) > > Created attachment 447823 [details] [details] > > Latest patch for ndlp refcount issue > > > > Correct updated patch for addressing NDLP refcount issue. > > > > This is the updated patch that provides an additional fix based on the latest > > data from Comment #19. > > > > This latest patch (*-3.patch) includes (and extends) the patch entered in > > Comment #9, and applies to the RHEL5.4 8.2.0.48.2p LPFC driver. > > Hi Vaios, > > Thanks for getting this patch together. We need to have a rhel5.6 version of > this patch before we can get the rhel5.4 version of this patch into z-stream. > Can you open another bz, copy me and Andrius on it, and attach a rhel5.6 > version of this patch to it? > > Thanks, Rob Also, the rhel5.6 patch is urgent for making a 9/28 z-stream build. Rob
On Comment #25. David, can we safely declare then this latest patch has fixed the customer's issue? On Comments # 26 and 27. When we hear back from the customer the final disposition of this fix, we'll add this fix in the patchset we are planning to submit soon for the RHEL5.6 inclusion.
The lpfc_ioctl_ndlp_refcount_elx-3.patch certainly appears to have fixed the issue. Without the patch, the customer's test systems were crashing during start up, but have been running fine for days with the patch.
Thank you David. See new BZ 636215 for the corresponding RHEL5.6 version of the individual patch. Should we consider then this BZ as closed, now that the fix will appear into the z-stream?
Created attachment 448958 [details] rhel5.6 version of Ndlp refcount patch
Comment on attachment 448958 [details] rhel5.6 version of Ndlp refcount patch patch provided by Emulex for rhel5.6.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-225.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Under certain circumstances, an attempt to dereference a NULL pointer in the lpfc_nlp_put() function may have caused the system to crash. With this update, several changes have been made to ensure the correct reference count, resolving this issue.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html