Bug 640225 - [NetApp 5.6 Bug] Kernel panic hit at lpfc_scsi_cmd_iocb_cmpl
Summary: [NetApp 5.6 Bug] Kernel panic hit at lpfc_scsi_cmd_iocb_cmpl
Keywords:
Status: CLOSED DUPLICATE of bug 649489
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5.z
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 5.6
Assignee: Rob Evers
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-10-05 09:37 UTC by Martin George
Modified: 2010-12-01 15:23 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-11-30 15:37:49 UTC
Target Upstream Version:
Embargoed:


Attachments
patch for 640225 panic (1.54 KB, patch)
2010-10-13 14:59 UTC, Richard Kennedy
Patch to fix new lpfc_scsi_cmd_iocb_cmpl panic (1018 bytes, text/plain)
2010-11-12 20:46 UTC, Joseph Mann

Description Martin George 2010-10-05 09:37:56 UTC
Description of problem:
This is a follow-up to bug 599487. I hit a kernel panic on a RHEL 5.5 Emulex FC host during I/O with controller faults, caused by a NULL pointer dereference in lpfc_scsi_cmd_iocb_cmpl:

lpfc 0000:03:00.0: 0:0310 Mailbox command x5 timeout Data: x0 x700 xffff810058e67c00
lpfc 0000:03:00.0: 0:0345 Resetting board due to mailbox timeout
lpfc 0000:03:00.0: 0:(0):2530 Mailbox command x23 cannot issue Data: xd00 x2
Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP: 
 [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/rport-0:0-2/target0:0:0/0:0:0:1/timeout
CPU 3 
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth
lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr
iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core
cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi
video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport sg floppy i2c_i801 i2c_core
ide_cd tg3 cdrom pcspkr e752x_edac edac_mc serio_raw dm_raid45 dm_message
dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac
scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod
ata_piix libata shpchp lpfc scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd
ohci_hcd ehci_hcd
Pid: 17, comm: events/3 Not tainted 2.6.18-194.11.1.el5.lpfc.heartbeat #1
RIP: 0010:[<ffffffff8810052d>]  [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
RSP: 0018:ffff81007ff6b948  EFLAGS: 00010286
RAX: 000000000000001e RBX: ffff81000bbf1500 RCX: 0000000000000000
RDX: ffff81007acf24c0 RSI: 0000000000000220 RDI: ffff81007acf2540
RBP: 0000000000000000 R08: ffffffff80311da8 R09: ffff810078b81188
R10: ffff81007e34fba8 R11: 000000000000000a R12: 0000000000001000
R13: ffff81007acf24c0 R14: 00000000040a0000 R15: 0000000000000016
FS:  0000000000000000(0000) GS:ffff8100026ca6c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 0000000000201000 CR4: 00000000000006e0
Process events/3 (pid: 17, threadinfo ffff8100378de000, task ffff81007ff47080)
Stack:  0000000000000000 000000000000000a ffff81007e34fba8 ffff810000000000
 ffff81007ff6ba78 ffff81007e5e8000 ffffffff80071a88 0000000000000001
 ffffffff882baddb ffff81007e34e400 ffff81007e6994f8 ffff81007e69b600
Call Trace:
 <IRQ>  [<ffffffff80071a88>] nommu_map_single+0x24/0x33
 [<ffffffff882baddb>] :tg3:tg3_start_xmit_dma_bug+0x85d/0x90b
 [<ffffffff880d563d>] :lpfc:lpfc_sli_handle_fast_ring_event+0x40b/0x60f
 [<ffffffff8002f972>] dev_queue_xmit+0x250/0x271
 [<ffffffff80031f5a>] ip_output+0x29a/0x2dd
 [<ffffffff80046e4c>] try_to_wake_up+0x472/0x484
 [<ffffffff8003d9f4>] lock_timer_base+0x1b/0x3c
 [<ffffffff8026ada7>] fn_hash_lookup+0x79/0xb2
 [<ffffffff8015081f>] __next_cpu+0x19/0x28
 [<ffffffff880d58df>] :lpfc:lpfc_sli_fp_intr_handler+0x9e/0x107
 [<ffffffff880d8bbb>] :lpfc:lpfc_sli_intr_handler+0x122/0x15e
 [<ffffffff80010bab>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800bae28>] __do_IRQ+0xa4/0x103
 [<ffffffff8006ca11>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff80064b50>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff800efc7a>] aio_complete+0x1ef/0x1fd
 [<ffffffff800f44c8>] dio_bio_end_aio+0x9f/0xbf
 [<ffffffff8002cc88>] __end_that_request_first+0x23c/0x5bf
 [<ffffffff8005c17b>] blk_run_queue+0x28/0x73
 [<ffffffff88079fe5>] :scsi_mod:scsi_end_request+0x27/0xcd
 [<ffffffff8807a1d9>] :scsi_mod:scsi_io_completion+0x14e/0x324
 [<ffffffff880a7802>] :sd_mod:sd_rw_intr+0x252/0x28c
 [<ffffffff8807a46e>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff800dca7c>] cache_reap+0x0/0x217
 [<ffffffff80037c1d>] blk_done_softirq+0x5f/0x6d
 [<ffffffff800123b4>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb8e>] do_softirq+0x2c/0x85
 [<ffffffff8006ca16>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff800dcb16>] cache_reap+0x9a/0x217
 [<ffffffff8004d624>] run_workqueue+0x94/0xe4
 [<ffffffff80049e5f>] worker_thread+0x0/0x122
 [<ffffffff80049f4f>] worker_thread+0xf0/0x122
 [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 48 8b 45 10 49 89 45 3c 48 8b 45 18 49 89 45 44 8a 83 c2 00 
RIP  [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
 RSP <ffff81007ff6b948>
CR2: 0000000000000010
 <0>Kernel panic - not syncing: Fatal exception

As per https://bugzilla.redhat.com/show_bug.cgi?id=599487#c47, this crash occurred because the pnode pointer is not checked for NULL before it is dereferenced in:

memcpy(&fast_path_evt->un.check_cond_evt.scsi_event.wwpn,
                        &pnode->nlp_portname, sizeof(struct lpfc_name));
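
The missing check described above can be illustrated with a small user-space sketch. It is not the attached patch and not the driver's code: wwn_sketch, node_sketch, event_sketch and copy_wwpn_checked are hypothetical stand-ins for the lpfc structures, which are not shown in this report. The oops appears consistent with the diagnosis: CR2 is 0x10, RBP is 0, and the first code bytes (48 8b 45 10) decode to a 64-bit load from 0x10(%rbp), i.e. a field read through a NULL base pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the driver's types; the real lpfc
 * structures are larger and laid out differently. */
struct wwn_sketch   { unsigned char wwn[8]; };
struct node_sketch  { struct wwn_sketch nlp_portname; };
struct event_sketch { struct wwn_sketch wwpn; };

/* Copy the remote port name into the event only when the node
 * pointer is valid.
 * Returns 0 on success, -1 if pnode is NULL (nothing is copied). */
int copy_wwpn_checked(struct event_sketch *evt,
                      const struct node_sketch *pnode)
{
    assert(evt != NULL);      /* the caller owns the event buffer */

    if (pnode == NULL)        /* the guard that the report says is missing */
        return -1;

    memcpy(&evt->wwpn, &pnode->nlp_portname, sizeof(struct wwn_sketch));
    return 0;
}
```

With a NULL node the helper returns -1 and leaves the event untouched. What the driver should do in that case (skip the event, log it, or fail the command) is a decision for the real fix.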


Version-Release number of selected component (if applicable):
RHEL 5.5.z (2.6.18-194.11.1.el5)
lpfc driver v8.2.0.63.3p

How reproducible:
I've hit this twice now on my Emulex host.

Comment 1 Andrius Benokraitis 2010-10-05 19:14:28 UTC
Including Emulex on this bugzilla.

Comment 2 Rob Evers 2010-10-11 15:22:19 UTC
Vaios,

Have you or anyone at Emulex had a chance to look into this?

Rob

Comment 3 Richard Kennedy 2010-10-13 14:59:34 UTC
Created attachment 453228 [details]
patch for 640225 panic

Comment 4 Vaios Papadimitriou 2010-10-13 15:03:32 UTC
Martin,

Could you please apply the patch that Dick just attached to your LPFC driver, and verify this patch fixes the issue reported in this BZ?

Thanks,
-Vaios-

Comment 5 Martin George 2010-10-27 12:36:11 UTC
(In reply to comment #4)
> Martin,
> 
> Could you please apply the patch that Dick just attached to your LPFC driver,
> and verify this patch fixes the issue reported in this BZ?
> 

The patch looks good. I have not hit the panic so far in our tests.

Comment 6 Rob Evers 2010-10-27 20:28:52 UTC
(In reply to comment #3)
> Created attachment 453228 [details]
> patch for 640225 panic

Hi Dick,

Can we get this rolled into a rhel5.6 patch update that I can post?  This will enable the z-stream update.

Thanks, Rob

Comment 7 Rob Evers 2010-10-27 20:49:01 UTC
(In reply to comment #6)
> (In reply to comment #3)
> > Created attachment 453228 [details] [details]
> > patch for 640225 panic
> 
> Hi Dick,
> 
> Can we get this rolled into a rhel5.6 patch update that I can post.  This will
> enable the z-stream update.
> 
> Thanks, Rob

Dick,

I also saw that this patch is in the latest upstream lpfc driver, but not in rhel6.  Can you or Vaios please see that this gets into the next rhel6.1 update so it can be backported to rhel6.0z as well?

Thanks, Rob

Comment 8 Martin George 2010-11-02 18:38:41 UTC
So is this being queued for the next 5.5.z release?

Comment 9 Rob Evers 2010-11-02 19:23:31 UTC
(In reply to comment #8)
> So is this being queued for the next 5.5.z release?

The equivalent patch needs to be provided by Emulex for rhel5.6, and this needs to be accepted into rhel5.6 before the patch can be accepted into rhel5.5z.

Vaios,

Can someone at Emulex please provide an update that contains this patch for rhel5.6?

Thanks, Rob

Comment 10 Martin George 2010-11-09 13:52:48 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Martin,
> > 
> > Could you please apply the patch that Dick just attached to your LPFC driver,
> > and verify this patch fixes the issue reported in this BZ?
> > 
> 
> Patch looks good. Not hit the panic so far in our tests.

Please ignore my earlier comment quoted above.

Unfortunately, even with this patch, I hit another NULL pointer dereference in the same function, lpfc_scsi_cmd_iocb_cmpl:

crash> bt
PID: 6237   TASK: ffff810079973860  CPU: 3   COMMAND: "syslogd"
 #0 [ffff81007ff6b660] crash_kexec at ffffffff800ada30
 #1 [ffff81007ff6b720] __die at ffffffff80065157
 #2 [ffff81007ff6b760] do_page_fault at ffffffff80066dd7
 #3 [ffff81007ff6b850] error_exit at ffffffff8005dde9
    [exception RIP: lpfc_scsi_cmd_iocb_cmpl+80]
    RIP: ffffffff880ffb47  RSP: ffff81007ff6b908  RFLAGS: 00010292
    RAX: 0000000000000000  RBX: ffff81007ff6ba38  RCX: 0000000000000000
    RDX: ffff81007e1b3000  RSI: ffff81007e6b84f8  RDI: ffff81007e606000
    RBP: ffff81007e606000   R8: 000000000000000d   R9: 00000000040a0000
    R10: 0000000000000296  R11: ffffffff880ff3bf  R12: ffff81007e1b3068
    R13: ffff81007ff6ba38  R14: ffff81007ff6bc68  R15: ffff81007ff6bc68
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81007ff6b9e0] lpfc_sli_handle_fast_ring_event at ffffffff880d55f7
 #5 [ffff81007ff6bba0] lpfc_sli_fp_intr_handler at ffffffff880d5899
 #6 [ffff81007ff6bbc0] lpfc_sli_intr_handler at ffffffff880d8b75
 #7 [ffff81007ff6bbf0] handle_IRQ_event at ffffffff80010bab
 #8 [ffff81007ff6bc20] __do_IRQ at ffffffff800bae74
 #9 [ffff81007ff6bc60] do_IRQ at ffffffff8006ca11
#10 [ffff81007ff6bce8] vprintk at ffffffff800923c8
#11 [ffff81007ff6bd88] printk at ffffffff80092466
#12 [ffff81007ff6be78] scsi_io_completion at ffffffff8807a370
#13 [ffff81007ff6bed8] sd_rw_intr at ffffffff880a7802
#14 [ffff81007ff6bf38] blk_done_softirq at ffffffff80037bf1
#15 [ffff81007ff6bf58] __do_softirq at ffffffff80012385
#16 [ffff81007ff6bf88] call_softirq at ffffffff8005e2fc
#17 [ffff81007ff6bfa0] do_softirq at ffffffff8006cb8e
#18 [ffff81007ff6bfb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#19 [ffff81007f22f9b8] apic_timer_interrupt at ffffffff8005dc8e
    [exception RIP: __journal_file_buffer+105]
    RIP: ffffffff88031213  RSP: ffff81007f22fa68  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff8100700f3980  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: ffff8100700f3980  RDI: ffff810055f11c10
    RBP: ffffffffffffffff   R8: 0000000000000000   R9: 0000000000000000
    R10: ffff810055f11c10  R11: 0000000000000060  R12: ffffffffffffffff
    R13: ffffffffffffffff  R14: ffff810079a6d7e0  R15: ffff81006cee50c0
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#20 [ffff81007f22fa90] journal_dirty_data at ffffffff88032c67
#21 [ffff81007f22fac0] ext3_journal_dirty_data at ffffffff8804e08f
#22 [ffff81007f22fae0] walk_page_buffers at ffffffff8804d4b8
#23 [ffff81007f22fb30] ext3_ordered_write_end at ffffffff8804ff3d
#24 [ffff81007f22fb80] generic_file_buffered_write at ffffffff8000fd6c
#25 [ffff81007f22fc80] __generic_file_aio_write_nolock at ffffffff800165e0
#26 [ffff81007f22fd30] __generic_file_write_nolock at ffffffff800c63b5
#27 [ffff81007f22fe20] generic_file_writev at ffffffff800c6416
#28 [ffff81007f22fe60] do_readv_writev at ffffffff800e0675
#29 [ffff81007f22ff40] sys_writev at ffffffff800e081e
#30 [ffff81007f22ff80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00002b8b9a571aac  RSP: 00007fffbcb42a60  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000000007  RSI: 00007fffbcb42ab0  RDI: 0000000000000001
    RBP: 0000000000000001   R8: fefefefefefefeff   R9: ff1f6d766e631f72
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fffbcb42ab0
    R13: 0000000000000007  R14: 0000000000000037  R15: 00000000a56e1b9c
    ORIG_RAX: 0000000000000014  CS: 0033  SS: 002b
crash>

So is this a new bug?

Comment 11 Martin George 2010-11-09 13:54:46 UTC
The vmcore for the above crash is available at 
ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/vmcore

Comment 12 Rob Evers 2010-11-11 03:15:04 UTC
(In reply to comment #11)
> The vmcore for the above crash is available at 
> ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/vmcore

Martin,

Can you provide:

  (1) the vmlinux file that they used with that crash session, and
  (2) the lpfc driver (.ko) and its debuginfo (.ko.debug) file

Thanks, Rob

Comment 13 Martin George 2010-11-11 12:56:28 UTC
(In reply to comment #12)
> Martin,
> 
> Can you provide:
> 
>   (1) the vmlinux file that they used with that crash session, and
>   (2) the lpfc driver (.ko) and its debuginfo (.ko.debug) file
> 
> Thanks, Rob

This is available at 
ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/vmlinux_lpfcmodules.zip

Comment 14 Rob Evers 2010-11-11 17:00:48 UTC
Hi Martin,

Thanks for providing these files.

Can you also provide all files built, assuming you built with "rpmbuild -ba"?

i.e., something similar to:

  kernel-2.6.18-128.el5.x86_64.rpm
  kernel-debuginfo-2.6.18-128.el5.x86_64.rpm
  kernel-debuginfo-common-2.6.18-128.el5.x86_64.rpm

Thanks, Rob

Comment 15 Martin George 2010-11-12 10:57:38 UTC
(In reply to comment #14)
> Hi Martin,
> 
> Thanks for providing these files.
> 
> Can you also provide all files built, assuming you built w/ "rpmbuild -ba'
> 
> ie something similar to:
> 
>   kernel-2.6.18-128.el5.x86_64.rpm
>   kernel-debuginfo-2.6.18-128.el5.x86_64.rpm
>   kernel-debuginfo-common-2.6.18-128.el5.x86_64.rpm
> 
> Thanks, Rob

Ok. The kernel packages can be accessed at 
ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/Kernel-packages.zip

Comment 16 Martin George 2010-11-12 14:57:29 UTC
Emulex has mentioned that this latest crash has been fixed in the updated lpfc driver v8.2.0.86-1. But it seems that even the current RHEL 5.6 Beta lpfc inbox driver is only v8.2.0.80.

So when is this latest lpfc driver version getting included into the RHEL5 inbox stream?

Comment 17 Rob Evers 2010-11-12 15:26:24 UTC
(In reply to comment #16)
> Emulex has mentioned that this latest crash has been fixed in the updated lpfc
> driver v8.2.0.86-1. But seems even the current RHEL 5.6 Beta lpfc inbox driver
> is only v8.2.0.80.
> 
> So when is this latest lpfc driver version getting included into the RHEL5
> inbox stream?

8.2.0.86 has already been accepted internally into rhel5.6 and will be available in an upcoming rhel5.6 update.

Another update, 8.2.0.87, still needs to be processed.  Can Emulex confirm that it contains the required fix? If so, I will work on it immediately.

Provided the patch is available in one of these 2 updates, I need to have the name of the individual patch so I can flag it for backporting to rhel5.5 z.

Comment 18 Joseph Mann 2010-11-12 18:27:57 UTC
Rob,

The latest crash that Martin mentions was addressed in the patch we submitted to Red Hat for lpfc 8.2.0.76.1p -> 8.2.0.77 (see Bug 603806).

If you have any additional questions, let me know.

Joe

Comment 19 Rob Evers 2010-11-12 19:00:13 UTC
(In reply to comment #18)
> Rob,
> 
> The latest crash that Martin mentions was addressed in the patch we submitted
> to Redhat for lpfc 8.2.0.76.1p -> 8.2.0.77 (See Bug 603806)
> 
> If you have any additional questions, let me know.
> 
> Joe

Joe,

If I understand this, we still need a patch backported for rhel5.5z that only addresses this problem.  Can you generate a patch for this problem that applies to rhel5.5 and attach it to this bugzilla?

Thanks, Rob

Comment 21 Joseph Mann 2010-11-12 20:46:45 UTC
Created attachment 460146 [details]
Patch to fix new lpfc_scsi_cmd_iocb_cmpl panic

This patch contains the fix necessary to correct the most recent kernel panic seen by Martin George.

Comment 22 Rob Evers 2010-11-12 21:18:51 UTC
Thanks, Joseph.

Martin,

Can you give this patch a try?

Thanks, Rob

Comment 23 Martin George 2010-11-15 08:27:01 UTC
(In reply to comment #17)
> 
> 8.2.0.86 which is already been accepted internally into rhel5.6 and will be
> available in an upcoming rhel5.6 update.
> 

FYI - With the latest external 8.2.0.86-1 driver, the RHEL 5.5.z host (root on dm-multipath, SAN-booted) hangs during the very first iteration of fabric faults, and this happens consistently. So it does look like there is a problem with this driver.

Meanwhile I'll test with the latest patch given by Joseph. So just to reiterate, I've now patched the RHEL 5.5.z kernel (for testing) with the following 3 lpfc patches:

1) https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10

2) https://bugzilla.redhat.com/show_bug.cgi?id=640225#c3

3) https://bugzilla.redhat.com/show_bug.cgi?id=640225#c21

Comment 24 Rob Evers 2010-11-15 16:30:13 UTC
(In reply to comment #23)
> (In reply to comment #17)
> > 
> > 8.2.0.86 which is already been accepted internally into rhel5.6 and will be
> > available in an upcoming rhel5.6 update.
> > 
> 
> FYI - With the latest external 8.2.0.86-1 driver, the RHEL 5.5.z host (root on
> dm-multipath SANbooted) hangs during the 1st iteration itself of fabric faults.
> And this is seen consistently. So it does look like there's some problem with
> this driver.

FYI to all: 8.2.0.86-1 was never provided to Red Hat.  I assume it is encompassed by 8.2.0.87, which I still need to process for inclusion in rhel5.


> 2) https://bugzilla.redhat.com/show_bug.cgi?id=640225#c3

This patch is not queued for backporting to rhel5.5z since, iirc, it was not known to address the problem at hand.

Comment 25 Martin George 2010-11-15 17:27:55 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > 2) https://bugzilla.redhat.com/show_bug.cgi?id=640225#c3
> 
> This patch is not queued for backporting to rhel5.5z since, iirc, it was not
> known to address the problem at hand.

This is confusing. Are we supposed to include only the patch in comment #21 and not the one in comment #3? If so, the comment #3 patch should have been marked as obsolete.

Comment 26 Rob Evers 2010-11-15 21:06:20 UTC
(In reply to comment #10)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > Martin,
> > > 
> > > Could you please apply the patch that Dick just attached to your LPFC driver,
> > > and verify this patch fixes the issue reported in this BZ?
> > > 
> > 
> > Patch looks good. Not hit the panic so far in our tests.
> 
> Please ignore this comment. 
> 
> Unfortunately even with this patch, I hit another NULL pointer dereference in
> the same function lpfc_scsi_cmd_iocb_cmpl:
> 
> crash> bt
> PID: 6237   TASK: ffff810079973860  CPU: 3   COMMAND: "syslogd"
>  #0 [ffff81007ff6b660] crash_kexec at ffffffff800ada30
>  #1 [ffff81007ff6b720] __die at ffffffff80065157
>  #2 [ffff81007ff6b760] do_page_fault at ffffffff80066dd7
>  #3 [ffff81007ff6b850] error_exit at ffffffff8005dde9
>     [exception RIP: lpfc_scsi_cmd_iocb_cmpl+80]
>     RIP: ffffffff880ffb47  RSP: ffff81007ff6b908  RFLAGS: 00010292
>     RAX: 0000000000000000  RBX: ffff81007ff6ba38  RCX: 0000000000000000
>     RDX: ffff81007e1b3000  RSI: ffff81007e6b84f8  RDI: ffff81007e606000
>     RBP: ffff81007e606000   R8: 000000000000000d   R9: 00000000040a0000
>     R10: 0000000000000296  R11: ffffffff880ff3bf  R12: ffff81007e1b3068
>     R13: ffff81007ff6ba38  R14: ffff81007ff6bc68  R15: ffff81007ff6bc68
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #4 [ffff81007ff6b9e0] lpfc_sli_handle_fast_ring_event at ffffffff880d55f7
>  #5 [ffff81007ff6bba0] lpfc_sli_fp_intr_handler at ffffffff880d5899
>  #6 [ffff81007ff6bbc0] lpfc_sli_intr_handler at ffffffff880d8b75
>  #7 [ffff81007ff6bbf0] handle_IRQ_event at ffffffff80010bab
>  #8 [ffff81007ff6bc20] __do_IRQ at ffffffff800bae74
>  #9 [ffff81007ff6bc60] do_IRQ at ffffffff8006ca11
> #10 [ffff81007ff6bce8] vprintk at ffffffff800923c8
> #11 [ffff81007ff6bd88] printk at ffffffff80092466
> #12 [ffff81007ff6be78] scsi_io_completion at ffffffff8807a370
> #13 [ffff81007ff6bed8] sd_rw_intr at ffffffff880a7802
> #14 [ffff81007ff6bf38] blk_done_softirq at ffffffff80037bf1
> #15 [ffff81007ff6bf58] __do_softirq at ffffffff80012385
> #16 [ffff81007ff6bf88] call_softirq at ffffffff8005e2fc
> #17 [ffff81007ff6bfa0] do_softirq at ffffffff8006cb8e
> #18 [ffff81007ff6bfb0] apic_timer_interrupt at ffffffff8005dc8e
> --- <IRQ stack> ---
> #19 [ffff81007f22f9b8] apic_timer_interrupt at ffffffff8005dc8e
>     [exception RIP: __journal_file_buffer+105]
>     RIP: ffffffff88031213  RSP: ffff81007f22fa68  RFLAGS: 00000246
>     RAX: 0000000000000000  RBX: ffff8100700f3980  RCX: 0000000000000000
>     RDX: 0000000000000001  RSI: ffff8100700f3980  RDI: ffff810055f11c10
>     RBP: ffffffffffffffff   R8: 0000000000000000   R9: 0000000000000000
>     R10: ffff810055f11c10  R11: 0000000000000060  R12: ffffffffffffffff
>     R13: ffffffffffffffff  R14: ffff810079a6d7e0  R15: ffff81006cee50c0
>     ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
> #20 [ffff81007f22fa90] journal_dirty_data at ffffffff88032c67
> #21 [ffff81007f22fac0] ext3_journal_dirty_data at ffffffff8804e08f
> #22 [ffff81007f22fae0] walk_page_buffers at ffffffff8804d4b8
> #23 [ffff81007f22fb30] ext3_ordered_write_end at ffffffff8804ff3d
> #24 [ffff81007f22fb80] generic_file_buffered_write at ffffffff8000fd6c
> #25 [ffff81007f22fc80] __generic_file_aio_write_nolock at ffffffff800165e0
> #26 [ffff81007f22fd30] __generic_file_write_nolock at ffffffff800c63b5
> #27 [ffff81007f22fe20] generic_file_writev at ffffffff800c6416
> #28 [ffff81007f22fe60] do_readv_writev at ffffffff800e0675
> #29 [ffff81007f22ff40] sys_writev at ffffffff800e081e
> #30 [ffff81007f22ff80] tracesys at ffffffff8005d28d (via system_call)
>     RIP: 00002b8b9a571aac  RSP: 00007fffbcb42a60  RFLAGS: 00000246
>     RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
>     RDX: 0000000000000007  RSI: 00007fffbcb42ab0  RDI: 0000000000000001
>     RBP: 0000000000000001   R8: fefefefefefefeff   R9: ff1f6d766e631f72
>     R10: 0000000000000000  R11: 0000000000000246  R12: 00007fffbcb42ab0
>     R13: 0000000000000007  R14: 0000000000000037  R15: 00000000a56e1b9c
>     ORIG_RAX: 0000000000000014  CS: 0033  SS: 002b
> crash>
> 
> So is this a new bug?


This looked like the same bug to me, and Emulex confirmed that the bug fix is the 2nd patch.  I obsoleted the first patch.  I hope this is all clear now.

Comment 27 RHEL Program Management 2010-11-15 21:49:35 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 28 Rob Evers 2010-11-16 15:10:45 UTC
(In reply to comment #21)
> Created attachment 460146 [details]
> Patch to fix new lpfc_scsi_cmd_iocb_cmpl panic
> 
> This patch contains the fix necessary to correct the most recent kernel panic
> seen by Martin George

Confirmed that the attached patch is already accepted into rhel5.6, as of lpfc driver update lpfc 8.2.0.76.1p -> 8.2.0.77

https://bugzilla.redhat.com/show_bug.cgi?id=603806

Comment 29 Andrius Benokraitis 2010-11-16 16:01:39 UTC

*** This bug has been marked as a duplicate of bug 603806 ***

Comment 30 Martin George 2010-11-22 13:30:44 UTC
(In reply to comment #21)
> Created attachment 460146 [details]
> Patch to fix new lpfc_scsi_cmd_iocb_cmpl panic
> 
> This patch contains the fix necessary to correct the most recent kernel panic
> seen by Martin George

The latest patch did not help. The host crashed again with a NULL pointer dereference in lpfc_scsi_cmd_iocb_cmpl:

crash> bt
PID: 29050  TASK: ffff81007682b7a0  CPU: 1   COMMAND: "dt.stable"
 #0 [ffff8100226e75d0] crash_kexec at ffffffff800adb59
 #1 [ffff8100226e7690] __die at ffffffff80065157
 #2 [ffff8100226e76d0] do_page_fault at ffffffff80066dd7
 #3 [ffff8100226e77c0] error_exit at ffffffff8005dde9
    [exception RIP: lpfc_scsi_cmd_iocb_cmpl+2549]
    RIP: ffffffff881004ec  RSP: ffff8100226e7878  RFLAGS: 00010286
    RAX: 000000000000000f  RBX: ffff81003cde3500  RCX: 0000000000000000
    RDX: ffff81003c695bc0  RSI: 0000000000000220  RDI: ffff81003c695c40
    RBP: 0000000000000000   R8: 000000000000000d   R9: 000000000000003c
    R10: ffff81007e3fd3f8  R11: 000000000000000a  R12: 0000000000000000
    R13: ffff81003c695bc0  R14: 00000000040a0000  R15: 0000000000000016
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #4 [ffff8100226e7970] __activate_task at ffffffff8008cab2
 #5 [ffff8100226e79a0] try_to_wake_up at ffffffff80046eb5
 #6 [ffff8100226e7a40] autoremove_wake_function at ffffffff800a0b68
 #7 [ffff8100226e7a50] __wake_up_common at ffffffff8008b4d7
 #8 [ffff8100226e7a90] __wake_up at ffffffff8002e219
 #9 [ffff8100226e7ad0] lpfc_sli_sp_intr_handler at ffffffff880d1298
#10 [ffff8100226e7b30] lpfc_sli_intr_handler at ffffffff880d8b75
#11 [ffff8100226e7b60] handle_IRQ_event at ffffffff80010c3a
#12 [ffff8100226e7b90] __do_IRQ at ffffffff800baff7
#13 [ffff8100226e7bd0] do_IRQ at ffffffff8006ca0d
#14 [ffff8100226e7c58] vprintk at ffffffff800924d1
#15 [ffff8100226e7cf8] printk at ffffffff8009256f
#16 [ffff8100226e7de8] __end_that_request_first at ffffffff8002cb5f
#17 [ffff8100226e7e48] scsi_end_request at ffffffff8807a0d2
#18 [ffff8100226e7e78] scsi_io_completion at ffffffff8807a48d
#19 [ffff8100226e7ed8] sd_rw_intr at ffffffff880a7802
#20 [ffff8100226e7f38] blk_done_softirq at ffffffff80037c86
#21 [ffff8100226e7f58] __do_softirq at ffffffff8001241d
#22 [ffff8100226e7f88] call_softirq at ffffffff8005e2fc
#23 [ffff8100226e7fa0] do_softirq at ffffffff8006cb8a
#24 [ffff8100226e7fb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#25 [ffff810023121f58] apic_timer_interrupt at ffffffff8005dc8e
    RIP: 0000000008056f02  RSP: 00000000ff88fed0  RFLAGS: 00000246
    RAX: 000000000000004c  RBX: 00000000ff88fee8  RCX: 0000000000000000
    RDX: 00000000000001b2  RSI: 0000000009b969b2  RDI: 00000000000099b2
    RBP: ffffffff8005d68b   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 00000000ff88fee8
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff10  CS: 0023  SS: 002b

Comment 31 Rob Evers 2010-11-22 15:52:25 UTC
Vaios,

Can you or someone from Emulex take a look into this?  A patch was provided by Richard Kennedy that I obsoleted.  Can you confirm that obsoleting was the correct action?

Is the stack trace above a different problem from the one addressed by the patch that Joseph Mann provided?

Thanks, Rob

Comment 32 Richard Kennedy 2010-11-22 18:48:23 UTC
Rob, is this latest crash with 5.5z and the 8.2.0-77 driver?

Comment 33 Martin George 2010-11-22 18:59:38 UTC
(In reply to comment #32)
> Rob, is this latest crash with 5.5z and the 8.2.0-77 driver?

This is with the latest 5.5.z kernel (2.6.18-194.26.1.el5) & lpfc driver version 8.2.0.63.3p. The kernel has been patched with the lpfc fixes given at https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10 & comment #21.

Comment 34 Martin George 2010-11-23 12:10:54 UTC
The vmcore & related files for this latest crash are available at:

1) vmcore - ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/nov22/vmcore

2) kernel packages & lpfc modules - ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/nov22/kernel-lpfc-files.zip

Comment 35 Vaios Papadimitriou 2010-11-23 13:57:15 UTC
All,

I see the original patch/fix posted by Dick on 10/13/10 was obsoleted (see comment #26).

The second - and currently active - patch by Joe addresses a DIFFERENT problem.

Both patches are needed, as they are addressing different problems.

It seems we are confusing the two issues, maybe because we use the same BZ to track two different issues.

Dick and I would recommend opening a new BZ every time a new issue/symptom is seen.

So, can you please explain why the original patch by Dick was obsoleted? We are getting a little confused here.

Thanks,
-Vaios-

Comment 36 Rob Evers 2010-11-23 16:55:37 UTC
(In reply to comment #35)
> All,
> 
> I see the original patch/fix posted by Dick on 10/13/10 was obsoleted (see
> comment #26).
> 
> The second - and currently active - patch by Joe addresses a DIFFERENT problem.
> 
> Both patches are needed, as they are addressing different problems.
> 
> It seems we are confusing the two issues, maybe because we use the same BZ to
> track two different issues.
> 
> Dick and I would recommend every time a new issue/symptom is seen to generate a
> new BZ.
> 
> So, can you please explain why the original patch by Dick was obsoleted? We are
> getting a little confused here.
> 
> Thanks,
> -Vaios-

My assessment is that the confusion was caused by multiple issues being solved in the same bz, and they were both in the same function.  All there was to go on was the offset into the function, and with patches, that added to the confusion.

Is Dick's patch somewhere in our queue of patches for rhel6?  I looked at what is in and didn't see it.  If it exists in a version of the lpfc driver, which one?

Thanks, Rob
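
As comment #36 notes, the function offset was the main clue for telling these crashes apart. One detail that can help when comparing reports: the oops text prints the offset in hex (+0x9ed/0x137d), while the crash utility's bt output prints it in decimal (+80, +2549). The helper below is a hypothetical convenience (symbol_offset is not part of any tool) that normalizes both forms to a number. Offsets are only comparable between identical builds, since each patch can shift the code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extract the byte offset from "symbol+offset" text.
 * Accepts hex with a 0x prefix (oops style) or plain decimal
 * (crash "bt" style); parsing stops at a trailing "/size".
 * Returns -1 if the text has no '+' or no digits after it. */
long symbol_offset(const char *text)
{
    const char *plus;
    char *end;
    unsigned long off;

    assert(text != NULL);

    plus = strrchr(text, '+');
    if (plus == NULL)
        return -1;

    /* Base 0 lets strtoul choose hex for "0x..." and decimal otherwise. */
    off = strtoul(plus + 1, &end, 0);
    if (end == plus + 1)      /* no digits were consumed */
        return -1;

    return (long)off;
}
```

For the three reports in this bug it yields 2541, 80 and 2549. Because the kernels were patched differently, these values alone do not show whether two crashes hit the same source line.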

Comment 37 Rob Evers 2010-11-23 21:06:33 UTC
> Is Dick's patch somewhere in our queue of patches for rhel6?  I looked at what
> is in and didn't see it.  If it exists in a version of the lpfc driver, which
> one?
> 
> Thanks, Rob


Sorry, I meant rhel5.6.

Comment 38 Martin George 2010-11-24 07:27:32 UTC
So can someone please confirm that the latest crash described in comment #30 is already addressed by Richard Kennedy's patch in comment #3?

If so, I'll rerun the tests with the latest 5.5.z kernel (2.6.18-194.26.1.el5) patched with the following outstanding lpfc fixes described at:

1) comment #3

2) comment #21

3) https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10

Right?

Comment 39 Rob Evers 2010-11-24 13:50:17 UTC
(In reply to comment #38)
> So can someone please confirm that the latest crash described in comment #30 is
> already addressed by Richard Kennedy's patch in comment #3?
> 
> If so, I'll rerun the tests with the latest 5.5.z kernel (2.6.18-194.26.1.el5)
> patched with the following outstanding lpfc fixes described at:
> 
> 1) comment #3
> 
> 2) comment #21
> 
> 3) https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10
> 
> Right?

That looks correct.

Comment 40 Richard Kennedy 2010-11-24 17:29:57 UTC
Rob,
You are right, you need:
> 1) comment #3
> 
> 2) comment #21
> 
> 3) https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10
 For 5.5z

Comment 41 Martin George 2010-11-25 14:57:11 UTC
Testing is stalled at the moment because I'm hitting bug 657345 which causes the tests to fail.

Comment 43 Andrius Benokraitis 2010-11-30 15:37:49 UTC
POSTed to two bugzillas already:

https://bugzilla.redhat.com/show_bug.cgi?id=649489
and
https://bugzilla.redhat.com/show_bug.cgi?id=603806

Can only DUPE to one of them so I'll go with the later one.

*** This bug has been marked as a duplicate of bug 649489 ***

