Description of problem:

This is a follow up to bug 599487. Hit a kernel panic on a 5.5 Emulex FC host during IO with controller faults, due to a NULL pointer dereference at lpfc_scsi_cmd_iocb_cmpl:

lpfc 0000:03:00.0: 0:0310 Mailbox command x5 timeout Data: x0 x700 xffff810058e67c00
lpfc 0000:03:00.0: 0:0345 Resetting board due to mailbox timeout
lpfc 0000:03:00.0: 0:(0):2530 Mailbox command x23 cannot issue Data: xd00 x2
Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:
 [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/rport-0:0-2/target0:0:0/0:0:0:1/timeout
CPU 3
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg floppy i2c_i801 i2c_core ide_cd tg3 cdrom pcspkr e752x_edac edac_mc serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp lpfc scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 17, comm: events/3 Not tainted 2.6.18-194.11.1.el5.lpfc.heartbeat #1
RIP: 0010:[<ffffffff8810052d>] [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
RSP: 0018:ffff81007ff6b948 EFLAGS: 00010286
RAX: 000000000000001e RBX: ffff81000bbf1500 RCX: 0000000000000000
RDX: ffff81007acf24c0 RSI: 0000000000000220 RDI: ffff81007acf2540
RBP: 0000000000000000 R08: ffffffff80311da8 R09: ffff810078b81188
R10: ffff81007e34fba8 R11: 000000000000000a R12: 0000000000001000
R13: ffff81007acf24c0 R14: 00000000040a0000 R15: 0000000000000016
FS: 0000000000000000(0000) GS:ffff8100026ca6c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 0000000000201000 CR4: 00000000000006e0
Process events/3 (pid: 17, threadinfo ffff8100378de000, task ffff81007ff47080)
Stack: 0000000000000000 000000000000000a ffff81007e34fba8 ffff810000000000
 ffff81007ff6ba78 ffff81007e5e8000 ffffffff80071a88 0000000000000001
 ffffffff882baddb ffff81007e34e400 ffff81007e6994f8 ffff81007e69b600
Call Trace:
 <IRQ>
 [<ffffffff80071a88>] nommu_map_single+0x24/0x33
 [<ffffffff882baddb>] :tg3:tg3_start_xmit_dma_bug+0x85d/0x90b
 [<ffffffff880d563d>] :lpfc:lpfc_sli_handle_fast_ring_event+0x40b/0x60f
 [<ffffffff8002f972>] dev_queue_xmit+0x250/0x271
 [<ffffffff80031f5a>] ip_output+0x29a/0x2dd
 [<ffffffff80046e4c>] try_to_wake_up+0x472/0x484
 [<ffffffff8003d9f4>] lock_timer_base+0x1b/0x3c
 [<ffffffff8026ada7>] fn_hash_lookup+0x79/0xb2
 [<ffffffff8015081f>] __next_cpu+0x19/0x28
 [<ffffffff880d58df>] :lpfc:lpfc_sli_fp_intr_handler+0x9e/0x107
 [<ffffffff880d8bbb>] :lpfc:lpfc_sli_intr_handler+0x122/0x15e
 [<ffffffff80010bab>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800bae28>] __do_IRQ+0xa4/0x103
 [<ffffffff8006ca11>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff80064b50>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff800efc7a>] aio_complete+0x1ef/0x1fd
 [<ffffffff800f44c8>] dio_bio_end_aio+0x9f/0xbf
 [<ffffffff8002cc88>] __end_that_request_first+0x23c/0x5bf
 [<ffffffff8005c17b>] blk_run_queue+0x28/0x73
 [<ffffffff88079fe5>] :scsi_mod:scsi_end_request+0x27/0xcd
 [<ffffffff8807a1d9>] :scsi_mod:scsi_io_completion+0x14e/0x324
 [<ffffffff880a7802>] :sd_mod:sd_rw_intr+0x252/0x28c
 [<ffffffff8807a46e>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff800dca7c>] cache_reap+0x0/0x217
 [<ffffffff80037c1d>] blk_done_softirq+0x5f/0x6d
 [<ffffffff800123b4>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb8e>] do_softirq+0x2c/0x85
 [<ffffffff8006ca16>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>
 [<ffffffff800dcb16>] cache_reap+0x9a/0x217
 [<ffffffff8004d624>] run_workqueue+0x94/0xe4
 [<ffffffff80049e5f>] worker_thread+0x0/0x122
 [<ffffffff80049f4f>] worker_thread+0xf0/0x122
 [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 48 8b 45 10 49 89 45 3c 48 8b 45 18 49 89 45 44 8a 83 c2 00
RIP [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
 RSP <ffff81007ff6b948>
CR2: 0000000000000010
 <0>Kernel panic - not syncing: Fatal exception

As per https://bugzilla.redhat.com/show_bug.cgi?id=599487#c47, this crash was due to the pnode pointer not being checked before dereferencing it at

memcpy(&fast_path_evt->un.check_cond_evt.scsi_event.wwpn, &pnode->nlp_portname, sizeof(struct lpfc_name));

Version-Release number of selected component (if applicable):
RHEL 5.5.z (2.6.18-194.11.1.el5)
lpfc driver v8.2.0.63.3p

How reproducible:
I've hit this twice now on my Emulex host.
Including Emulex on this bugzilla.
Vaios, Have you or anyone at Emulex had a chance to look into this? Rob
Created attachment 453228 [details] patch for 640225 panic
Martin, Could you please apply the patch that Dick just attached to your LPFC driver, and verify this patch fixes the issue reported in this BZ? Thanks, -Vaios-
(In reply to comment #4)
> Martin,
>
> Could you please apply the patch that Dick just attached to your LPFC driver,
> and verify this patch fixes the issue reported in this BZ?
>

Patch looks good. Not hit the panic so far in our tests.
(In reply to comment #3)
> Created attachment 453228 [details]
> patch for 640225 panic

Hi Dick,

Can we get this rolled into a rhel5.6 patch update that I can post? This will enable the z-stream update.

Thanks, Rob
(In reply to comment #6)
> (In reply to comment #3)
> > Created attachment 453228 [details]
> > patch for 640225 panic
>
> Hi Dick,
>
> Can we get this rolled into a rhel5.6 patch update that I can post. This will
> enable the z-stream update.
>
> Thanks, Rob

Dick,

I also saw that this patch is in the latest upstream lpfc driver, but not in rhel6. Can you or Vaios please see that this gets into the next rhel6.1 update so it can be backported to rhel6.0z as well?

Thanks, Rob
So is this being queued for the next 5.5.z release?
(In reply to comment #8)
> So is this being queued for the next 5.5.z release?

The equivalent patch needs to be provided by Emulex for rhel5.6, and this needs to be accepted into rhel5.6 before the patch can be accepted into rhel5.5z.

Vaios,

Can someone at Emulex please provide an update that contains this patch for rhel5.6?

Thanks, Rob
(In reply to comment #5)
> (In reply to comment #4)
> > Martin,
> >
> > Could you please apply the patch that Dick just attached to your LPFC driver,
> > and verify this patch fixes the issue reported in this BZ?
> >
>
> Patch looks good. Not hit the panic so far in our tests.

Please ignore this comment.

Unfortunately even with this patch, I hit another NULL pointer dereference in the same function lpfc_scsi_cmd_iocb_cmpl:

crash> bt
PID: 6237 TASK: ffff810079973860 CPU: 3 COMMAND: "syslogd"
 #0 [ffff81007ff6b660] crash_kexec at ffffffff800ada30
 #1 [ffff81007ff6b720] __die at ffffffff80065157
 #2 [ffff81007ff6b760] do_page_fault at ffffffff80066dd7
 #3 [ffff81007ff6b850] error_exit at ffffffff8005dde9
    [exception RIP: lpfc_scsi_cmd_iocb_cmpl+80]
    RIP: ffffffff880ffb47 RSP: ffff81007ff6b908 RFLAGS: 00010292
    RAX: 0000000000000000 RBX: ffff81007ff6ba38 RCX: 0000000000000000
    RDX: ffff81007e1b3000 RSI: ffff81007e6b84f8 RDI: ffff81007e606000
    RBP: ffff81007e606000 R8: 000000000000000d R9: 00000000040a0000
    R10: 0000000000000296 R11: ffffffff880ff3bf R12: ffff81007e1b3068
    R13: ffff81007ff6ba38 R14: ffff81007ff6bc68 R15: ffff81007ff6bc68
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #4 [ffff81007ff6b9e0] lpfc_sli_handle_fast_ring_event at ffffffff880d55f7
 #5 [ffff81007ff6bba0] lpfc_sli_fp_intr_handler at ffffffff880d5899
 #6 [ffff81007ff6bbc0] lpfc_sli_intr_handler at ffffffff880d8b75
 #7 [ffff81007ff6bbf0] handle_IRQ_event at ffffffff80010bab
 #8 [ffff81007ff6bc20] __do_IRQ at ffffffff800bae74
 #9 [ffff81007ff6bc60] do_IRQ at ffffffff8006ca11
#10 [ffff81007ff6bce8] vprintk at ffffffff800923c8
#11 [ffff81007ff6bd88] printk at ffffffff80092466
#12 [ffff81007ff6be78] scsi_io_completion at ffffffff8807a370
#13 [ffff81007ff6bed8] sd_rw_intr at ffffffff880a7802
#14 [ffff81007ff6bf38] blk_done_softirq at ffffffff80037bf1
#15 [ffff81007ff6bf58] __do_softirq at ffffffff80012385
#16 [ffff81007ff6bf88] call_softirq at ffffffff8005e2fc
#17 [ffff81007ff6bfa0] do_softirq at ffffffff8006cb8e
#18 [ffff81007ff6bfb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#19 [ffff81007f22f9b8] apic_timer_interrupt at ffffffff8005dc8e
    [exception RIP: __journal_file_buffer+105]
    RIP: ffffffff88031213 RSP: ffff81007f22fa68 RFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff8100700f3980 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffff8100700f3980 RDI: ffff810055f11c10
    RBP: ffffffffffffffff R8: 0000000000000000 R9: 0000000000000000
    R10: ffff810055f11c10 R11: 0000000000000060 R12: ffffffffffffffff
    R13: ffffffffffffffff R14: ffff810079a6d7e0 R15: ffff81006cee50c0
    ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#20 [ffff81007f22fa90] journal_dirty_data at ffffffff88032c67
#21 [ffff81007f22fac0] ext3_journal_dirty_data at ffffffff8804e08f
#22 [ffff81007f22fae0] walk_page_buffers at ffffffff8804d4b8
#23 [ffff81007f22fb30] ext3_ordered_write_end at ffffffff8804ff3d
#24 [ffff81007f22fb80] generic_file_buffered_write at ffffffff8000fd6c
#25 [ffff81007f22fc80] __generic_file_aio_write_nolock at ffffffff800165e0
#26 [ffff81007f22fd30] __generic_file_write_nolock at ffffffff800c63b5
#27 [ffff81007f22fe20] generic_file_writev at ffffffff800c6416
#28 [ffff81007f22fe60] do_readv_writev at ffffffff800e0675
#29 [ffff81007f22ff40] sys_writev at ffffffff800e081e
#30 [ffff81007f22ff80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00002b8b9a571aac RSP: 00007fffbcb42a60 RFLAGS: 00000246
    RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff
    RDX: 0000000000000007 RSI: 00007fffbcb42ab0 RDI: 0000000000000001
    RBP: 0000000000000001 R8: fefefefefefefeff R9: ff1f6d766e631f72
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffbcb42ab0
    R13: 0000000000000007 R14: 0000000000000037 R15: 00000000a56e1b9c
    ORIG_RAX: 0000000000000014 CS: 0033 SS: 002b
crash>

So is this a new bug?
The vmcore for the above crash is available at ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/vmcore
(In reply to comment #11)
> The vmcore for the above crash is available at
> ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/vmcore

Martin,

Can you provide:

(1) the vmlinux file that they used with that crash session, and
(2) the lpfc driver (.ko) and its debuginfo (.ko.debug) file

Thanks, Rob
(In reply to comment #12)
> Martin,
>
> Can you provide:
>
> (1) the vmlinux file that they used with that crash session, and
> (2) the lpfc driver (.ko) and its debuginfo (.ko.debug) file
>
> Thanks, Rob

This is available at ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/vmlinux_lpfcmodules.zip
Hi Martin,

Thanks for providing these files.

Can you also provide all files built, assuming you built with "rpmbuild -ba"?

i.e. something similar to:

kernel-2.6.18-128.el5.x86_64.rpm
kernel-debuginfo-2.6.18-128.el5.x86_64.rpm
kernel-debuginfo-common-2.6.18-128.el5.x86_64.rpm

Thanks, Rob
(In reply to comment #14)
> Hi Martin,
>
> Thanks for providing these files.
>
> Can you also provide all files built, assuming you built w/ "rpmbuild -ba'
>
> ie something similar to:
>
> kernel-2.6.18-128.el5.x86_64.rpm
> kernel-debuginfo-2.6.18-128.el5.x86_64.rpm
> kernel-debuginfo-common-2.6.18-128.el5.x86_64.rpm
>
> Thanks, Rob

Ok. The kernel packages can be accessed at ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/Kernel-packages.zip
Emulex has mentioned that this latest crash has been fixed in the updated lpfc driver v8.2.0.86-1. But it seems that even the current RHEL 5.6 Beta lpfc inbox driver is only v8.2.0.80.

So when is this latest lpfc driver version getting included into the RHEL5 inbox stream?
(In reply to comment #16)
> Emulex has mentioned that this latest crash has been fixed in the updated lpfc
> driver v8.2.0.86-1. But seems even the current RHEL 5.6 Beta lpfc inbox driver
> is only v8.2.0.80.
>
> So when is this latest lpfc driver version getting included into the RHEL5
> inbox stream?

8.2.0.86 has already been accepted internally into rhel5.6 and will be available in an upcoming rhel5.6 update. Another update, 8.2.0.87, needs to be processed. Can Emulex confirm that this has the required update? If so, I will work on this immediately.

Provided the patch is available in one of these 2 updates, I need to have the name of the individual patch so I can flag it for backporting to rhel5.5z.
Rob, The latest crash that Martin mentions was addressed in the patch we submitted to Redhat for lpfc 8.2.0.76.1p -> 8.2.0.77 (See Bug 603806) If you have any additional questions, let me know. Joe
(In reply to comment #18)
> Rob,
>
> The latest crash that Martin mentions was addressed in the patch we submitted
> to Redhat for lpfc 8.2.0.76.1p -> 8.2.0.77 (See Bug 603806)
>
> If you have any additional questions, let me know.
>
> Joe

Joe,

If I understand this, we still need a patch backported for rhel5.5z that only addresses this problem. Can you generate a patch for this problem that applies to rhel5.5 and attach it to this bugzilla?

Thanks, Rob
Created attachment 460146 [details]
Patch to fix new lpfc_scsi_cmd_iocb_cmpl panic

This patch contains the fix necessary to correct the most recent kernel panic seen by Martin George.
Thanks, Joseph.

Martin,

Can you give this patch a try?

Thanks, Rob
(In reply to comment #17)
>
> 8.2.0.86 which is already been accepted internally into rhel5.6 and will be
> available in an upcoming rhel5.6 update.
>

FYI - With the latest external 8.2.0.86-1 driver, the RHEL 5.5.z host (root on dm-multipath, SAN booted) hangs during the very first iteration of fabric faults. This is seen consistently, so it does look like there's some problem with this driver.

Meanwhile I'll test with the latest patch given by Joseph. So just to reiterate, I've now patched the RHEL 5.5.z kernel (for testing) with the following 3 lpfc patches:

1) https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10
2) https://bugzilla.redhat.com/show_bug.cgi?id=640225#c3
3) https://bugzilla.redhat.com/show_bug.cgi?id=640225#c21
(In reply to comment #23)
> (In reply to comment #17)
> >
> > 8.2.0.86 which is already been accepted internally into rhel5.6 and will be
> > available in an upcoming rhel5.6 update.
> >
>
> FYI - With the latest external 8.2.0.86-1 driver, the RHEL 5.5.z host (root on
> dm-multipath SANbooted) hangs during the 1st iteration itself of fabric faults.
> And this is seen consistently. So it does look like there's some problem with
> this driver.

FYI to all, 8.2.0.86-1 was never provided to Red Hat. I assume it is encompassed by 8.2.0.87, which I still need to process for inclusion in rhel5.

> 2) https://bugzilla.redhat.com/show_bug.cgi?id=640225#c3

This patch is not queued for backporting to rhel5.5z since, iirc, it was not known to address the problem at hand.
(In reply to comment #24)
> (In reply to comment #23)
> > 2) https://bugzilla.redhat.com/show_bug.cgi?id=640225#c3
>
> This patch is not queued for backporting to rhel5.5z since, iirc, it was not
> known to address the problem at hand.

This is confusing. Are we supposed to include the patch in comment #21 alone & not the one in comment #3? If so, then the comment #3 patch should have been marked as obsolete.
(In reply to comment #10)
> Unfortunately even with this patch, I hit another NULL pointer dereference in
> the same function lpfc_scsi_cmd_iocb_cmpl:
> [...]
> So is this a new bug?
This looked like the same bug to me, and Emulex confirmed that the bug fix is the 2nd patch. I obsoleted the first patch. I hope this is all clear now.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #21)
> Created attachment 460146 [details]
> Patch to fix new lpfc_scsi_cmd_iocb_cmpl panic
>
> This patch contains the fix necessary to correct the most recent kernel panic
> seen by Martin George

Confirmed that the attached patch is already accepted into rhel5.6, as of lpfc driver update lpfc 8.2.0.76.1p -> 8.2.0.77:

https://bugzilla.redhat.com/show_bug.cgi?id=603806
*** This bug has been marked as a duplicate of bug 603806 ***
(In reply to comment #21)
> Created attachment 460146 [details]
> Patch to fix new lpfc_scsi_cmd_iocb_cmpl panic
>
> This patch contains the fix necessary to correct the most recent kernel panic
> seen by Martin George

Latest patch did not help. The host crashed again with a NULL pointer dereference at lpfc_scsi_cmd_iocb_cmpl:

crash> bt
PID: 29050 TASK: ffff81007682b7a0 CPU: 1 COMMAND: "dt.stable"
 #0 [ffff8100226e75d0] crash_kexec at ffffffff800adb59
 #1 [ffff8100226e7690] __die at ffffffff80065157
 #2 [ffff8100226e76d0] do_page_fault at ffffffff80066dd7
 #3 [ffff8100226e77c0] error_exit at ffffffff8005dde9
    [exception RIP: lpfc_scsi_cmd_iocb_cmpl+2549]
    RIP: ffffffff881004ec RSP: ffff8100226e7878 RFLAGS: 00010286
    RAX: 000000000000000f RBX: ffff81003cde3500 RCX: 0000000000000000
    RDX: ffff81003c695bc0 RSI: 0000000000000220 RDI: ffff81003c695c40
    RBP: 0000000000000000 R8: 000000000000000d R9: 000000000000003c
    R10: ffff81007e3fd3f8 R11: 000000000000000a R12: 0000000000000000
    R13: ffff81003c695bc0 R14: 00000000040a0000 R15: 0000000000000016
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
 #4 [ffff8100226e7970] __activate_task at ffffffff8008cab2
 #5 [ffff8100226e79a0] try_to_wake_up at ffffffff80046eb5
 #6 [ffff8100226e7a40] autoremove_wake_function at ffffffff800a0b68
 #7 [ffff8100226e7a50] __wake_up_common at ffffffff8008b4d7
 #8 [ffff8100226e7a90] __wake_up at ffffffff8002e219
 #9 [ffff8100226e7ad0] lpfc_sli_sp_intr_handler at ffffffff880d1298
#10 [ffff8100226e7b30] lpfc_sli_intr_handler at ffffffff880d8b75
#11 [ffff8100226e7b60] handle_IRQ_event at ffffffff80010c3a
#12 [ffff8100226e7b90] __do_IRQ at ffffffff800baff7
#13 [ffff8100226e7bd0] do_IRQ at ffffffff8006ca0d
#14 [ffff8100226e7c58] vprintk at ffffffff800924d1
#15 [ffff8100226e7cf8] printk at ffffffff8009256f
#16 [ffff8100226e7de8] __end_that_request_first at ffffffff8002cb5f
#17 [ffff8100226e7e48] scsi_end_request at ffffffff8807a0d2
#18 [ffff8100226e7e78] scsi_io_completion at ffffffff8807a48d
#19 [ffff8100226e7ed8] sd_rw_intr at ffffffff880a7802
#20 [ffff8100226e7f38] blk_done_softirq at ffffffff80037c86
#21 [ffff8100226e7f58] __do_softirq at ffffffff8001241d
#22 [ffff8100226e7f88] call_softirq at ffffffff8005e2fc
#23 [ffff8100226e7fa0] do_softirq at ffffffff8006cb8a
#24 [ffff8100226e7fb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#25 [ffff810023121f58] apic_timer_interrupt at ffffffff8005dc8e
    RIP: 0000000008056f02 RSP: 00000000ff88fed0 RFLAGS: 00000246
    RAX: 000000000000004c RBX: 00000000ff88fee8 RCX: 0000000000000000
    RDX: 00000000000001b2 RSI: 0000000009b969b2 RDI: 00000000000099b2
    RBP: ffffffff8005d68b R8: 0000000000000000 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ff88fee8
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: ffffffffffffff10 CS: 0023 SS: 002b
Vaios,

Can you or someone from Emulex take a look into this? A patch was provided by Richard Kennedy that I obsoleted. Can you confirm that obsoleting was the correct action?

Is the stack trace above a different problem from the one that the patch provided by Joseph Mann addressed?

Thanks, Rob
Rob, is this latest crash with 5.5z and the 8.2.0-77 driver?
(In reply to comment #32)
> Rob, is this latest crash with 5.5z and the 8.2.0-77 driver?

This is with the latest 5.5.z kernel (2.6.18-194.26.1.el5) & lpfc driver version 8.2.0.63.3p. The kernel has been patched with the lpfc fixes given at https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10 & comment #21.
The vmcore & related files for this latest crash are available at:

1) vmcore - ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/nov22/vmcore
2) kernel packages & lpfc modules - ftp://ftp.netapp.com/pub/home/marting/pub/rh-bug640225/nov22/kernel-lpfc-files.zip
All,

I see the original patch/fix posted by Dick on 10/13/10 was obsoleted (see comment #26).

The second - and currently active - patch by Joe addresses a DIFFERENT problem.

Both patches are needed, as they are addressing different problems.

It seems we are confusing the two issues, maybe because we use the same BZ to track two different issues.

Dick and I would recommend opening a new BZ every time a new issue/symptom is seen.

So, can you please explain why the original patch by Dick was obsoleted? We are getting a little confused here.

Thanks,
-Vaios-
(In reply to comment #35)
> All,
>
> I see the original patch/fix posted by Dick on 10/13/10 was obsoleted (see
> comment #26).
>
> The second - and currently active - patch by Joe addresses a DIFFERENT problem.
>
> Both patches are needed, as they are addressing different problems.
>
> It seems we are confusing the two issues, maybe because we use the same BZ to
> track two different issues.
>
> Dick and I would recommend every time a new issue/symptom is seen to generate a
> new BZ.
>
> So, can you please explain why the original patch by Dick was obsoleted? We are
> getting a little confused here.
>
> Thanks,
> -Vaios-

My assessment is that the confusion was caused by multiple issues being solved in the same bz, and they were both in the same function. All there was to go on was the offset into the function, and with patches, that added to the confusion.

Is Dick's patch somewhere in our queue of patches for rhel6? I looked at what is in and didn't see it. If it exists in a version of the lpfc driver, which one?

Thanks, Rob
> Is Dick's patch somewhere in our queue of patches for rhel6? I looked at what > is in and didn't see it. If it exists in a version of the lpfc driver, which > one? > > Thanks, Rob Sorry, rhel5.6
So can someone please confirm that the latest crash described in comment #30 is already addressed by Richard Kennedy's patch in comment #3?

If so, I'll rerun the tests with the latest 5.5.z kernel (2.6.18-194.26.1.el5) patched with the following outstanding lpfc fixes described at:

1) comment #3
2) comment #21
3) https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10

Right?
(In reply to comment #38)
> So can someone please confirm that the latest crash described in comment #30 is
> already addressed by Richard Kennedy's patch in comment #3?
>
> If so, I'll rerun the tests with the latest 5.5.z kernel (2.6.18-194.26.1.el5)
> patched with the following outstanding lpfc fixes described at:
>
> 1) comment #3
>
> 2) comment #21
>
> 3) https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10
>
> Right?

That looks correct.
Rob, you are right. For 5.5z you need:

> 1) comment #3
>
> 2) comment #21
>
> 3) https://bugzilla.redhat.com/show_bug.cgi?id=624394#c10
Testing is stalled at the moment because I'm hitting bug 657345 which causes the tests to fail.
POSTed to two bugzillas already:

https://bugzilla.redhat.com/show_bug.cgi?id=649489
and
https://bugzilla.redhat.com/show_bug.cgi?id=603806

Can only DUPE to one of them so I'll go with the later one.

*** This bug has been marked as a duplicate of bug 649489 ***