Bug 625841 - lpfc ioctl crash in lpfc_nlp_put()
Summary: lpfc ioctl crash in lpfc_nlp_put()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
high
high
Target Milestone: rc
: ---
Assignee: Rob Evers
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 636215 637726 637727
TreeView+ depends on / blocked
 
Reported: 2010-08-20 15:18 UTC by Issue Tracker
Modified: 2018-11-27 23:54 UTC (History)
18 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Under certain circumstances, an attempt to dereference a NULL pointer in the lpfc_nlp_put() function may have caused the system to crash. With this update, several changes have been made to ensure the correct reference count, resolving this issue.
Clone Of:
: 636215 (view as bug list)
Environment:
Last Closed: 2011-01-13 21:11:37 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
test patch to handle lpfc_nlp_get() not increasing the refcount (1.87 KB, patch)
2010-08-25 20:02 UTC, David Jeffery
no flags Details | Diff
Ndlp refcount fix (2.07 KB, application/octet-stream)
2010-09-01 20:36 UTC, Vaios Papadimitriou
no flags Details
Additional refcount fixes (1.75 KB, patch)
2010-09-07 19:32 UTC, David Jeffery
no flags Details | Diff
Updated patch for addressing NDLP refcount issue (2.52 KB, application/octet-stream)
2010-09-16 17:01 UTC, Vaios Papadimitriou
no flags Details
Latest patch for ndlp refcount issue (3.02 KB, application/octet-stream)
2010-09-16 18:51 UTC, Vaios Papadimitriou
no flags Details
rhel5.6 version of Ndlp refcount patch (3.03 KB, patch)
2010-09-22 15:26 UTC, Rob Evers
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Issue Tracker 2010-08-20 15:18:34 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 David Jeffery 2010-08-20 15:53:10 UTC
A system experienced a crash do to a NULL pointer dereference in lpfc_nlp_put(). 
This occurred because the lpfc_nodelist structure had a NULL phba pointer.  The system crashed when lpfc_nlp_put() tried to dereference the NULL phba pointer to reference a spin_lock.

 #0 [ffff8102579eda80] crash_kexec at ffffffff800ad2aa
 #1 [ffff8102579edb40] __die at ffffffff80065157
 #2 [ffff8102579edb80] do_page_fault at ffffffff80066dd7
 #3 [ffff8102579edc70] error_exit at ffffffff8005dde9
    [exception RIP: _spin_lock_irqsave+3]
    RIP: ffffffff80064b0b  RSP: ffff8102579edd20  RFLAGS: 00010006
    RAX: 0000000000000206  RBX: 0000000000001578  RCX: 0000000000004d6b
    RDX: ffffffff8816104d  RSI: 0000000000000100  RDI: 0000000000001578
    RBP: ffff810fa17a7e00   R8: 0000000052000045   R9: 0000000000000000
    R10: ffff81010fee4158  R11: 0000000000000282  R12: ffff8102579eded8
    R13: ffff81047f6da000  R14: 0000000000000000  R15: ffff810111434000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff8102579edd20] lpfc_nlp_put at ffffffff88131dd0
 #5 [ffff8102579edd60] lpfc_ioctl_send_mgmt_cmd at ffffffff8815623a
 #6 [ffff8102579eddf0] lpfcdfc_do_ioctl at ffffffff88156d9f
 #7 [ffff8102579ede90] lpfcdfc_compat_ioctl at ffffffff88157999
 #8 [ffff8102579edf20] compat_sys_ioctl at ffffffff800f927c
 #9 [ffff8102579edf80] cstar_do_call at ffffffff8006160c



From the vmcore, it looks like the lpfc_nodelist structure at 0xffff810fa17a7e00 was already freed, even though it was in use by the ioctl call. The memory is listed as allocated, but most fields of the structure contain invalid values.

The ioctl request had a node id of 0xfffffc, and there was another lpfc_nodelist with this same id at ffff81107fe1fa00.  Unlike the bad lpfc_nodelist, this other matching nodelist is on the fc_nodes list for this host while the bad one is not.

So it appears the bad lpfc_nodelist was freed and later replaced with a new lpfc_nodelist for the same id.

Comment 2 David Jeffery 2010-08-20 15:59:32 UTC
One concern is the use of lpfc_nlp_get() in lpfc_ioctl_send_mgmt_cmd().  The return value of lpfc_nlp_get() isn't checked.  So it may be possible the lpfc_nlp_get() call would not up the nlp reference count, allowing the lpfc_nodelist structure to be deleted while the ioctl path is using it.

Comment 3 David Jeffery 2010-08-25 20:02:40 UTC
Created attachment 441037 [details]
test patch to handle lpfc_nlp_get() not increasing the refcount

I've attached a test patch to handle the cases where lpfc_nlp_get() could fail and return NULL in lpfc_ioctl_send_mgmt_cmd(), signifying that the reference count was not increased.  The patch also has extra WARN_ON() checks to confirm and report if the condition of lpfc_nlp_get() failing within lpfc_ioctl_send_mgmt_cmd() happens.

Comment 4 Tom Coughlan 2010-08-27 15:49:40 UTC
Hi Emulex folks, this is a customer problem, so a review of the patch and reply here would be appreciated.

Comment 5 Joseph Mann 2010-08-27 20:28:54 UTC
David,

I do not see any mention of what driver this patch is for, is it against the inbox RHEL5.4 (8.2.0.48.2p), or against a later version of the driver?

Thanks,
Joe

Comment 6 David Jeffery 2010-08-31 14:19:46 UTC
The patch was against the RHEL 5.4 driver.

Comment 7 Andrius Benokraitis 2010-08-31 15:49:41 UTC
(In reply to comment #6)
> The patch was against the RHEL 5.4 driver.

Joseph @ Emulex - anything else you need from us on this? How does the patch look in Comment #3?

Comment 8 Vaios Papadimitriou 2010-08-31 18:10:12 UTC
This does appear to be a bug, thanks David for the analysis.

However, the attached patch as it is might cause some other unexpected funtional failures.

We started working on a driver fix, we hope to provide the fix/patch here for testing/verification by COB tomorrow 9/1/10.

Comment 9 Vaios Papadimitriou 2010-09-01 20:36:14 UTC
Created attachment 442494 [details]
Ndlp refcount fix

Please see the patch as a fix against the RHEL5.4 8.2.0.48.2p driver. 

The problem was due to the management path, putting reference count on an ndlp, failed to check the return value from lpfc_nlp_get() routine. There is a possible race condition where the base driver state machine might release the ndlp, and the management path then encounters a null pointer, and without checking for that condition its deference can cause a system crash.

This patch also includes a related fix to the lpfc_ioctl_send_els() routine.

Comment 10 Rob Evers 2010-09-02 18:21:20 UTC
(In reply to comment #9)

Hi Vaios,

The flow for z-stream is that a patch needs to be posted for the current minor release under development.  That would be rhel5.6 in this case.

I checked and it appears that this fix applies to rhel5.6 as well.  Can you provide the equivalent rhel5.6 patch in another bz?  I can post it and that will enable the z-stream update for this patch.

This doesn't apply upstream or in rhel6 due to ioctl functionality being replaced.

Rob

Comment 11 Guil Barros 2010-09-04 16:15:38 UTC
Hi Vaios,

First off, thanks for putting this together!

We seem unable to trigger the issue even with multiple lip's issued. Would it be possible for you guys to create a reproducer so that we can verify the patch?

Thanks!
-Guil

Comment 12 Veaceslav Falico 2010-09-06 10:56:30 UTC
(In reply to comment #9)
> Created attachment 442494 [details]
> Ndlp refcount fix
> 
> Please see the patch as a fix against the RHEL5.4 8.2.0.48.2p driver. 
> 
> The problem was due to the management path, putting reference count on an ndlp,
> failed to check the return value from lpfc_nlp_get() routine. There is a
> possible race condition where the base driver state machine might release the
> ndlp, and the management path then encounters a null pointer, and without
> checking for that condition its deference can cause a system crash.
> 
> This patch also includes a related fix to the lpfc_ioctl_send_els() routine.

The customer tested the patch and the box isn't even coming up, it's always restarting...

bt:

PID: 8218   TASK: ffff810c265a5040  CPU: 3   COMMAND: "vxesd"
 #0 [ffff810c0b9fba80] crash_kexec at ffffffff800ad301
 #1 [ffff810c0b9fbb40] __die at ffffffff80065157
 #2 [ffff810c0b9fbb80] do_page_fault at ffffffff80066dd7
 #3 [ffff810c0b9fbc70] error_exit at ffffffff8005dde9
    [exception RIP: _spin_lock_irqsave+3]
    RIP: ffffffff80064b0b  RSP: ffff810c0b9fbd20  RFLAGS: 00010006
    RAX: 0000000000000206  RBX: 0000000000001578  RCX: 0000000000000000
    RDX: ffffffff881a502d  RSI: 0000000000000100  RDI: 0000000000001578
    RBP: ffff810c16700800   R8: 0000000000000040   R9: 0000000000000000
    R10: ffff81000901ea90  R11: 0000000000000000  R12: ffff810c0b9fbed8
    R13: ffff810c25880000  R14: 0000000000000000  R15: ffff810c09f39000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff810c0b9fbd20] lpfc_nlp_put at ffffffff88175dd0
 #5 [ffff810c0b9fbd60] lpfc_ioctl_send_mgmt_cmd at ffffffff8819a217
 #6 [ffff810c0b9fbdf0] lpfcdfc_do_ioctl at ffffffff8819ad7a
 #7 [ffff810c0b9fbe90] lpfcdfc_compat_ioctl at ffffffff8819b974
 #8 [ffff810c0b9fbf20] compat_sys_ioctl at ffffffff800f92d3
 #9 [ffff810c0b9fbf80] sysenter_do_call at ffffffff8006149d
    RIP: 00000000ffffe410  RSP: 00000000fffb32b8  RFLAGS: 00000296
    RAX: ffffffffffffffda  RBX: ffffffff8006149d  RCX: 00000000c028fc01
    RDX: 00000000fffb3300  RSI: 00000000f7ca3448  RDI: 00000000fffb3300
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 00000000fffb32e8
    ORIG_RAX: 0000000000000036  CS: 0023  SS: 002b

The bug comes up at [2]:

4371     phba = ndlp->phba; <---- ! [1]
4372     spin_lock_irqsave(&phba->ndlp_lock, flags);
4373     /* Check the ndlp memory free acknowledge flag to avoid the
4374      * possible race condition that kref_put got invoked again
4375      * after previous one has done ndlp memory free.
4376      */
4377     if (NLP_CHK_FREE_ACK(ndlp)) { [2] <--- HERE!!!
4378         spin_unlock_irqrestore(&phba->ndlp_lock, flags);

but we see that at [1] we dereference it successfully at [1]:

0xffffffff88175d7a <lpfc_nlp_put>:      push   %r15
0xffffffff88175d7c <lpfc_nlp_put+2>:    push   %r14
0xffffffff88175d7e <lpfc_nlp_put+4>:    push   %r13
0xffffffff88175d80 <lpfc_nlp_put+6>:    push   %r12
0xffffffff88175d82 <lpfc_nlp_put+8>:    push   %rbp
0xffffffff88175d83 <lpfc_nlp_put+9>:    mov    %rdi,%rbp
0xffffffff88175d86 <lpfc_nlp_put+12>:   push   %rbx
0xffffffff88175d87 <lpfc_nlp_put+13>:   sub    $0x8,%rsp
0xffffffff88175d8b <lpfc_nlp_put+17>:   test   %rdi,%rdi
0xffffffff88175d8e <lpfc_nlp_put+20>:   je     0xffffffff88175efc
0xffffffff88175d94 <lpfc_nlp_put+26>:   mov    0x170(%rdi),%r9d
0xffffffff88175d9b <lpfc_nlp_put+33>:   mov    0x24(%rdi),%ecx
0xffffffff88175d9e <lpfc_nlp_put+36>:   mov    $0xffffffff881a502d,%rdx
0xffffffff88175da5 <lpfc_nlp_put+43>:   mov    0xe0(%rdi),%rdi
0xffffffff88175dac <lpfc_nlp_put+50>:   mov    0x20(%rbp),%r8d
0xffffffff88175db0 <lpfc_nlp_put+54>:   mov    $0x100,%esi
0xffffffff88175db5 <lpfc_nlp_put+59>:   callq  0xffffffff8819635c <lpfc_debugfs_disc_trc>
0xffffffff88175dba <lpfc_nlp_put+64>:   mov    0xd0(%rbp),%rbx <-- [1] HERE
0xffffffff88175dc1 <lpfc_nlp_put+71>:   add    $0x1578,%rbx
0xffffffff88175dc8 <lpfc_nlp_put+78>:   mov    %rbx,%rdi
0xffffffff88175dcb <lpfc_nlp_put+81>:   callq  0xffffffff80064b08 <_spin_lock_irqsave>
0xffffffff88175dd0 <lpfc_nlp_put+86>:   mov    0x3e(%rbp),%dx <-- [2] BUG

and at [2] we have the BUG.

So, the only thing that comes to my mind is that while we are spinning on that spinlock, somebody who has it (the spinlock) taken is free()ing that pointer, so that we end up with unallocated mem after the lock. I've walked the code a bit but didn't find anything that could be doing so, but anyway I don't know the code at all...

Do you know what could it be?

Also, we have some veritas modules loaded, in short time we'll try to reproduce it without them...

Thank you!

Comment 14 David Jeffery 2010-09-07 19:32:10 UTC
Created attachment 445761 [details]
Additional refcount fixes

The crash can still occur with the ndlp refcount fix patch.  From more review, there are places in the code outside of the ioctl path which also reduce the reference count even though an attempt acquire a reference failed.

The second patch is in addition to the patch to fix lpfc_ioctl_send_mgmt_cmd() .

Comment 15 Issue Tracker 2010-09-07 19:46:59 UTC
Event posted on 09-07-2010 03:46pm EDT by fhirtz

Note, we have an occurrence of this on a different system which was logged
in 1287663. I'll include a pointer to where the data is staged shortly.


This event sent from IssueTracker by fhirtz 
 issue 1266703

Comment 16 Issue Tracker 2010-09-07 19:55:27 UTC
Event posted on 09-07-2010 03:55pm EDT by fhirtz

<snip>
Your corefile is ready for you
You may view it at megatron.gsslab.rdu.redhat.com
Login with kerberos name/password
$ cd /cores/20100907150202/work
/cores/20100907150202/work$ ./crash

Backtrace
      KERNEL: /cores/20100907150202/work/vmlinux
    DUMPFILE: /cores/20100907150202/work/vmcore_1287663_eqperforce6_090510
 [PARTIAL DUMP]
        CPUS: 8
        DATE: Sun Sep  5 18:24:08 2010
      UPTIME: 05:49:25
LOAD AVERAGE: 1.91, 1.36, 1.17
       TASKS: 569
    NODENAME: eqperforce6
     RELEASE: 2.6.18-164.17.1.el5
     VERSION: #1 SMP Thu Apr 1 04:46:15 EDT 2010
     MACHINE: x86_64  (2197 Mhz)
      MEMORY: 31.4 GB
       PANIC: "Oops: 0002 [1] SMP " (check log for details)
<snip>
mptctl: /dev/mptctl @ (major,minor=10,220)
ioctl32(bpbkar:30573): Unknown cmd fd(3) cmd(5658462c){16} arg(ff87fb1c)
on /
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /misc
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /var
ioctl32(bpbkar:30573): Unknown cmd fd(8) cmd(5658462c){16} arg(ff87915c)
on /var/lib/nfs/v4recovery
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /tmp
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /sbin
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /etc
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /root
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /bin
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /boot
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /mnt
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /lost+found
ioctl32(bpbkar:30573): Unknown cmd fd(5) cmd(5658462c){16} arg(ff87e0ac)
on /net
FS-Cache: Loaded
ioctl32(bpbkar:24212): Unknown cmd fd(4) cmd(5658462c){16} arg(ffc73f7c)
on /boot
ioctl32(bpbkar:24212): Unknown cmd fd(4) cmd(5658462c){16} arg(ffc73f7c)
on /
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /misc
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /var
ioctl32(bpbkar:24212): Unknown cmd fd(8) cmd(5658462c){16} arg(ffc6d5bc)
on /var/lib/nfs/v4recovery
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /tmp
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /sbin
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /etc
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /root
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /bin
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /boot
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /mnt
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /lost+found
ioctl32(bpbkar:24212): Unknown cmd fd(5) cmd(5658462c){16} arg(ffc7250c)
on /net
Unable to handle kernel paging request at 000000000c901578 RIP:
 [<ffffffff80064b0b>] _spin_lock_irqsave+0x3/0xd
PGD 6c1357067 PUD 6a943b067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /class/fc_host/host1/active_fc4s
CPU 7
Modules linked in: nfs fscache nfs_acl mptctl mptbase autofs4 dmpjbod(PU)
dmpap(PU) dmpaa(PU) vxspec(PFU) vxio(PFU) vxdmp(PU) lockd sunrpc bonding
ipv6 xfrm_nalgo crypto_api vxportal(PFU) fdd(PFU) vxfs(PU) dm_mirror
dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi
acpi_memhotplug ac parport_pc lp parport sg i2c_amd756 i2c_core lpfc
ide_cd amd_rng hpilo k8_edac e1000 tg3 cdrom edac_mc scsi_transport_fc
shpchp k8temp hwmon floppy serio_raw pcspkr dm_raid45 dm_message
dm_region_hash dm_log dm_mod dm_mem_cache usb_storage cciss sd_mod
scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 21862, comm: vxesd Tainted: PF     2.6.18-164.17.1.el5 #1
RIP: 0010:[<ffffffff80064b0b>]  [<ffffffff80064b0b>]
_spin_lock_irqsave+0x3/0xd
RSP: 0018:ffff8106b1121d20  EFLAGS: 00010006
RAX: 0000000000000206 RBX: 000000000c901578 RCX: 000000000040b370
RDX: ffffffff882b704d RSI: 0000000000000100 RDI: 000000000c901578
RBP: ffff8107ec574a00 R08: 0000000028000045 R09: 0000000000000000
R10: ffff8107ffef4068 R11: ffff8107e8d830f0 R12: ffff8106b1121ed8
R13: ffff8107fe2dc000 R14: 0000000000000000 R15: ffff81020ff42000
FS:  00002b2541078240(0000) GS:ffff8107ffe6a3c0(0063)
knlGS:0000000055571590
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 000000000c901578 CR3: 00000006a4b7a000 CR4: 00000000000006e0
Process vxesd (pid: 21862, threadinfo ffff8106b1120000, task
ffff81033ea4f860)
Stack:  ffffffff88287dd0 ffff81033ea4f860 0000000000000013
ffff8107fb18a4f8
 ffff8106b1121ed8 ffff8107fe2dc000 0000000000000000 ffff81020ff42000
 ffffffff882ac23a ffff81020ff42000 ffff8106b1121e98 0000000000000001
Call Trace:
 [<ffffffff88287dd0>] :lpfc:lpfc_nlp_put+0x56/0x194
 [<ffffffff882ac23a>] :lpfc:lpfc_ioctl_send_mgmt_cmd+0x614/0x633
 [<ffffffff80128562>] avc_has_perm+0x46/0x58
 [<ffffffff882acd9f>] :lpfc:lpfcdfc_do_ioctl+0x301/0xe61
 [<ffffffff882ad999>] :lpfc:lpfcdfc_compat_ioctl+0x9a/0x131
 [<ffffffff800f92d3>] compat_sys_ioctl+0xc5/0x2b2
 [<ffffffff8006160c>] cstar_do_call+0x1b/0x6e


Code: f0 ff 0f 0f 88 f2 00 00 00 c3 fa f0 ff 0f 0f 88 f3 00 00 00
RIP  [<ffffffff80064b0b>] _spin_lock_irqsave+0x3/0xd
 RSP <ffff8106b1121d20>
</snip>

This is obviously happening with the veritas stack installed, and it does
appear from the log that netbackup is probing devices with some sort of
unrecognized 32bit(?) ioctl on boot, though the system here ran for about
6hrs after that. Given the nature of this, that might be interesting.


This event sent from IssueTracker by fhirtz 
 issue 1266703

Comment 17 Issue Tracker 2010-09-08 15:56:57 UTC
Event posted on 09-08-2010 11:56am EDT by fhirtz

To be clear, from the client facing GSS perspective in NA, I own the
interaction. 

I do have a question related to the current strategy. Though Veritas might
be calling some sort of broken/non-translatable 32bit ioctl here, am I
correct in presuming that it shouldn't actually be able to fail the
entire system? Also, do we have an idea why it does so quickly with the
patch(es), but not stock?


This event sent from IssueTracker by fhirtz 
 issue 1266703

Comment 18 Rob Evers 2010-09-09 14:16:41 UTC
Hi Vaios,

Are you still looking into this?

Thanks, Rob

Comment 19 David Jeffery 2010-09-15 17:08:07 UTC
There was a crash with the refcount patches and LOG_NODE set.  We see the following messages from the lpfc driver before the crash:

lpfc 0000:0d:00.0: 0:(0):0932 FIND node did xfffffa NOT FOUND.
lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffa, UNUSED -> PLOGI
lpfc 0000:0d:00.0: 0:(0):0929 FIND node DID Data: xffff810c25146c00 xfffffa x0 x1000000
lpfc 0000:0d:00.0: 0:(0):0929 FIND node DID Data: xffff810c25146c00 xfffffa x0 x1000000
lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffa, UNMAPPED -> PLOGI
lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffc, PLOGI -> UNUSED
lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffa, PLOGI -> REGLOGIN
lpfc 0000:0d:00.0: 0:(0):0279 lpfc_nlp_release: ndlp:xffff810c25146c00 usgmap:xf refcnt:0
lpfc 0000:0d:00.0: 0:(0):0900 Cleanup node for NPort xfffffc Data: x0 x0 x0
lpfc 0000:0d:00.0: 0:(0):0280 lpfc_cleanup_node: ndlp:xffff810c25146c00 usgmap:xf refcnt:0
lpfc 0000:0d:00.0: 0:(0):0904 NPort state transition xfffffa, REGLOGIN -> UNMAPPED

A node for xfffffa was created at address xffff810c25146c00, started through the state machine and then was freed. (The 0280 line from lpfc_cleanup_node() means the FREE_REQ bit was set and the memory was freed at the completion of lpfc_nlp_release() .)


The crash then occurred when lpfc_nlp_put() was called on the already freed xffff810c25146c00 node. The nlp address is in register RBP, which lpfc_nlp_put() used to generate the invalid spin_lock address in RDI.

dis lpfc_nlp_put()
...
0xffffffff88175dce <lpfc_nlp_put+64>:   mov    0xd0(%rbp),%rbx
0xffffffff88175dd5 <lpfc_nlp_put+71>:   add    $0x1578,%rbx
0xffffffff88175ddc <lpfc_nlp_put+78>:   mov    %rbx,%rdi
0xffffffff88175ddf <lpfc_nlp_put+81>:   callq  0xffffffff80064b08 <_spin_lock_irqsave>
0xffffffff88175de4 <lpfc_nlp_put+86>:   mov    0x3e(%rbp),%dx
...


bt
PID: 6419   TASK: ffff810c271330c0  CPU: 11  COMMAND: "vxesd"
 #0 [ffff810c0f2ada80] crash_kexec at ffffffff800ad301
 #1 [ffff810c0f2adb40] __die at ffffffff80065157
 #2 [ffff810c0f2adb80] do_page_fault at ffffffff80066dd7
 #3 [ffff810c0f2adc70] error_exit at ffffffff8005dde9
    [exception RIP: _spin_lock_irqsave+3]
    RIP: ffffffff80064b0b  RSP: ffff810c0f2add20  RFLAGS: 00010006
    RAX: 0000000000000206  RBX: 0000000000001aac  RCX: 0000000000000000
    RDX: ffffffff881a502d  RSI: 0000000000000100  RDI: 0000000000001aac
    RBP: ffff810c25146c00   R8: 000000000001bb40   R9: 0000000000000000
    R10: ffff810009061e90  R11: 0000000000000000  R12: ffff810c0f2aded8
    R13: ffff810c25976000  R14: 0000000000000000  R15: ffff810c0e821000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff810c0f2add20] lpfc_nlp_put at ffffffff88175de4
 #5 [ffff810c0f2add60] lpfc_ioctl_send_mgmt_cmd at ffffffff8819a233
 #6 [ffff810c0f2addf0] lpfcdfc_do_ioctl at ffffffff8819ad96
 #7 [ffff810c0f2ade90] lpfcdfc_compat_ioctl at ffffffff8819b990
 #8 [ffff810c0f2adf20] compat_sys_ioctl at ffffffff800f92d3
 #9 [ffff810c0f2adf80] sysenter_do_call at ffffffff8006149d
    RIP: 00000000ffffe410  RSP: 00000000ffca16a8  RFLAGS: 00000296
    RAX: ffffffffffffffda  RBX: ffffffff8006149d  RCX: 00000000c028fc01
    RDX: 00000000ffca16f0  RSI: 00000000f7bd3448  RDI: 00000000ffca16f0
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 00000000ffca16d8
    ORIG_RAX: 0000000000000036  CS: 0023  SS: 002b


From the messages, it appears lpfc_ioctl_send_mgmt_cmd() took the !pndl path and the reference count for the newly allocated nlp hit zero while the code in lpfc_ioctl_send_mgmt_cmd() still believed it held a reference.

Comment 20 Vaios Papadimitriou 2010-09-16 17:01:47 UTC
Created attachment 447799 [details]
Updated patch for addressing NDLP refcount issue

This is the updated patch that provides an additional fix based on the latest data from Comment #19.

This latest patch (*-2.patch) includes (and extends) the patch entered in Comment #9, and applies to the RHEL5.4 8.2.0.48.2p LPFC driver.

Comment 21 Vaios Papadimitriou 2010-09-16 17:34:44 UTC
Comment on attachment 447799 [details]
Updated patch for addressing NDLP refcount issue

Please disregard this latest -2 patch. We are in the process of building a new one.

Comment 22 Vaios Papadimitriou 2010-09-16 18:51:47 UTC
Created attachment 447823 [details]
Latest patch for ndlp refcount issue

Correct updated patch for addressing NDLP refcount issue.

This is the updated patch that provides an additional fix based on the latest
data from Comment #19.

This latest patch (*-3.patch) includes (and extends) the patch entered in
Comment #9, and applies to the RHEL5.4 8.2.0.48.2p LPFC driver.

Comment 23 Rob Evers 2010-09-17 13:01:29 UTC
(In reply to comment #17)
> Event posted on 09-08-2010 11:56am EDT by fhirtz
> 
> To be clear, from the client facing GSS perspective in NA, I own the
> interaction. 
> 
> I do have a question related to the current strategy. Though Veritas might
> be calling some sort of broken/non-translatable 32bit ioctl here, am I
> correct in presuming that it shouldn't actually be able to fail the
> entire system? Also, do we have an idea why it does so quickly with the
> patch(es), but not stock?
> 
> 
> This event sent from IssueTracker by fhirtz 
>  issue 1266703

Is it possible to get this patch into a kernel at the customer site and tested?

Comment 25 David Jeffery 2010-09-20 19:23:40 UTC
Customer has been running a test kernel with the last patch on 3 systems for 3 days so far.  Previously, the systems would crash at startup do to ioctl calls from vxesd.  With the patched test kernel, the systems did not crash at startup and have continued to run without issue.

Comment 26 Rob Evers 2010-09-20 19:52:50 UTC
(In reply to comment #22)
> Created attachment 447823 [details]
> Latest patch for ndlp refcount issue
> 
> Correct updated patch for addressing NDLP refcount issue.
> 
> This is the updated patch that provides an additional fix based on the latest
> data from Comment #19.
> 
> This latest patch (*-3.patch) includes (and extends) the patch entered in
> Comment #9, and applies to the RHEL5.4 8.2.0.48.2p LPFC driver.

Hi Vaios,

Thanks for getting this patch together.  We need to have a rhel5.6 version of this patch before we can get the rhel5.4 version of this patch into z-stream.  Can you open another bz, copy me and Andrius on it, and attach a rhel5.6 version of this patch to it?

Thanks, Rob

Comment 27 Rob Evers 2010-09-20 20:13:01 UTC
(In reply to comment #26)
> (In reply to comment #22)
> > Created attachment 447823 [details] [details]
> > Latest patch for ndlp refcount issue
> > 
> > Correct updated patch for addressing NDLP refcount issue.
> > 
> > This is the updated patch that provides an additional fix based on the latest
> > data from Comment #19.
> > 
> > This latest patch (*-3.patch) includes (and extends) the patch entered in
> > Comment #9, and applies to the RHEL5.4 8.2.0.48.2p LPFC driver.
> 
> Hi Vaios,
> 
> Thanks for getting this patch together.  We need to have a rhel5.6 version of
> this patch before we can get the rhel5.4 version of this patch into z-stream. 
> Can you open another bz, copy me and Andrius on it, and attach a rhel5.6
> version of this patch to it?
> 
> Thanks, Rob

Also, the rhel5.6 patch is urgent for making a 9/28 z-stream build.

Rob

Comment 28 Vaios Papadimitriou 2010-09-20 20:29:47 UTC
On Comment #25.
David, can we safely declare then this latest patch has fixed the customer's issue?

On Comments # 26 and 27.
When we hear back from the customer the final disposition of this fix, we'll add this fix in the patchset we are planning to submit soon for the RHEL5.6 inclusion.

Comment 29 David Jeffery 2010-09-21 20:14:27 UTC
The lpfc_ioctl_ndlp_refcount_elx-3.patch certainly appears to have fixed the issue.  Without the patch, the customer's test systems were crashing during start up, but have been running fine for days with the patch.

Comment 30 Vaios Papadimitriou 2010-09-22 14:56:25 UTC
Thank you David. 

See new BZ 636215 for the corresponding RHEL5.6 version of the individual patch.

Should we consider then this BZ as closed, now that the fix will appear into the z-stream?

Comment 31 Rob Evers 2010-09-22 15:26:01 UTC
Created attachment 448958 [details]
rhel5.6 version of Ndlp refcount patch

Comment 32 Rob Evers 2010-09-22 15:27:55 UTC
Comment on attachment 448958 [details]
rhel5.6 version of Ndlp refcount patch

patch provided by Emulex for rhel5.6.

Comment 34 RHEL Program Management 2010-09-22 23:29:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 42 Jarod Wilson 2010-09-27 19:12:11 UTC
in kernel-2.6.18-225.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 46 Jaromir Hradilek 2010-10-12 22:53:35 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Under certain circumstances, an attempt to dereference a NULL pointer in the lpfc_nlp_put() function may have caused the system to crash. With this update, several changes have been made to ensure the correct reference count, resolving this issue.

Comment 50 errata-xmlrpc 2011-01-13 21:11:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html


Note You need to log in before you can comment on or make changes to this bug.