Bug 517774 - Server panic while modprobing lpfc module
Summary: Server panic while modprobing lpfc module
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.7
Hardware: All
OS: Linux
low
low
Target Milestone: rc
: 4.8
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-08-17 05:12 UTC by Lachlan McIlroy
Modified: 2018-10-27 13:46 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-14 20:10:45 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Lachlan McIlroy 2009-08-17 05:12:09 UTC
Description of problem:
Customer added SCSI tape libraries to the host and rescanned the SCSI bus (using modprobe -r lpfc; modprobe lpfc).  This caused the host to kernel panic and the customer captured a core:

$ cd /cores/20090812000643/work
$ cat sys
     KERNEL: /cores/20090812000643/work/vmlinux
   DUMPFILE: /cores/20090812000643/work/1942674-vmcore  [PARTIAL DUMP]
       CPUS: 16
       DATE: Mon Aug 10 20:11:11 2009
     UPTIME: 103 days, 22:54:48
LOAD AVERAGE: 15.40, 14.84, 7.94
      TASKS: 1252
   NODENAME: sapkxpsepap01
    RELEASE: 2.6.9-78.0.1.ELlargesmp
    VERSION: #1 SMP Tue Jul 22 18:11:13 EDT 2008
    MACHINE: x86_64  (2132 Mhz)
     MEMORY: 16 GB
      PANIC: "Oops: 0000 [1] SMP " (check log for details)


How reproducible:
Customer has produced it once.  Server is production and they have not attempted to reproduce it again.

Steps to Reproduce:
- Have many devices using lpfc module
- modprobe -r lpfc; modprobe lpfc

Comment 1 Lachlan McIlroy 2009-08-17 05:28:28 UTC
System panicked due to a NULL pointer in lpfc_hba_down_post().

Emulex LightPulse Fibre Channel SCSI driver 8.0.16.40
Copyright(c) 2003-2008 Emulex.  All rights reserved.
ACPI: PCI Interrupt 0000:1e:00.0[A] -> GSI 80 (level, low) -> IRQ 225
PCI: Setting latency timer of device 0000:1e:00.0 to 64
lpfc 0000:1e:00.0: 0:1303 Link Up Event x1 received Data: x1 xf7 x10 x0
scsi7 : IBM 42C2071 4Gb 2-Port PCIe FC HBA for System x on PCI bus 1e device 00 irq 225 port 0
 Vendor: ATL       Model: P3000             Rev: 0100
 Type:   Medium Changer                     ANSI SCSI revision: 02
Attached scsi generic sg15 at scsi7, channel 0, id 0, lun 0,  type 8
ACPI: PCI Interrupt 0000:1e:00.1[B] -> GSI 84 (level, low) -> IRQ 233
PCI: Setting latency timer of device 0000:1e:00.1 to 64
modprobe: page allocation failure. order:5, mode:0xd0

Call Trace:<ffffffff8015fa55>{__alloc_pages+789} <ffffffff8015faee>{__get_free_pages+11}
      <ffffffff80162de0>{kmem_getpages+36} <ffffffff80163578>{cache_alloc_refill+612}
      <ffffffff80163243>{__kmalloc+123} <ffffffffa014734d>{:lpfc:lpfc_sli_queue_setup+192}
      <ffffffffa015fd2f>{:lpfc:lpfc_pci_probe_one+4631} <ffffffff801f83d4>{pci_device_probe+110}
      <ffffffff8024eac1>{bus_match+57} <ffffffff8024ebbf>{driver_attach+68}
      <ffffffff8024eedb>{bus_add_driver+143} <ffffffff801f8144>{pci_register_driver+119}
      <ffffffffa0e57046>{:lpfc:lpfc_init+70} <ffffffff80150e41>{sys_init_module+278}
      <ffffffff801102f6>{system_call+126}
Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
cpu 4 hot: low 2, high 6, batch 1
cpu 4 cold: low 0, high 2, batch 1
cpu 5 hot: low 2, high 6, batch 1
cpu 5 cold: low 0, high 2, batch 1
cpu 6 hot: low 2, high 6, batch 1
cpu 6 cold: low 0, high 2, batch 1
cpu 7 hot: low 2, high 6, batch 1
cpu 7 cold: low 0, high 2, batch 1
cpu 8 hot: low 2, high 6, batch 1
cpu 8 cold: low 0, high 2, batch 1
cpu 9 hot: low 2, high 6, batch 1
cpu 9 cold: low 0, high 2, batch 1
cpu 10 hot: low 2, high 6, batch 1
cpu 10 cold: low 0, high 2, batch 1
cpu 11 hot: low 2, high 6, batch 1
cpu 11 cold: low 0, high 2, batch 1
cpu 12 hot: low 2, high 6, batch 1
cpu 12 cold: low 0, high 2, batch 1
cpu 13 hot: low 2, high 6, batch 1
cpu 13 cold: low 0, high 2, batch 1
cpu 14 hot: low 2, high 6, batch 1
cpu 14 cold: low 0, high 2, batch 1
cpu 15 hot: low 2, high 6, batch 1
cpu 15 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
cpu 4 hot: low 32, high 96, batch 16
cpu 4 cold: low 0, high 32, batch 16
cpu 5 hot: low 32, high 96, batch 16
cpu 5 cold: low 0, high 32, batch 16
cpu 6 hot: low 32, high 96, batch 16
cpu 6 cold: low 0, high 32, batch 16
cpu 7 hot: low 32, high 96, batch 16
cpu 7 cold: low 0, high 32, batch 16
cpu 8 hot: low 32, high 96, batch 16
cpu 8 cold: low 0, high 32, batch 16
cpu 9 hot: low 32, high 96, batch 16
cpu 9 cold: low 0, high 32, batch 16
cpu 10 hot: low 32, high 96, batch 16
cpu 10 cold: low 0, high 32, batch 16
cpu 11 hot: low 32, high 96, batch 16
cpu 11 cold: low 0, high 32, batch 16
cpu 12 hot: low 32, high 96, batch 16
cpu 12 cold: low 0, high 32, batch 16
cpu 13 hot: low 32, high 96, batch 16
cpu 13 cold: low 0, high 32, batch 16
cpu 14 hot: low 32, high 96, batch 16
cpu 14 cold: low 0, high 32, batch 16
cpu 15 hot: low 32, high 96, batch 16
cpu 15 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty

Free pages:       56504kB (0kB HighMem)
Active:3399613 inactive:472304 dirty:927 writeback:14 unstable:0 free:14126 slab:95700 mapped:2666106 pagetables:75381
Node 0 DMA free:9960kB min:12kB low:24kB high:36kB active:0kB inactive:0kB present:15980kB pages_scanned:0 all_unreclaimable? yes
protections[]: 0 0 0
Node 0 Normal free:46544kB min:16356kB low:32712kB high:49068kB active:13598452kB inactive:1889216kB present:16760108kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 DMA: 0*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 2*4096kB = 9960kB
Node 0 Normal: 3124*4kB 1106*8kB 699*16kB 438*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46544kB
Node 0 HighMem: empty
2877482 pagecache pages
Swap cache: add 56082667, delete 56082989, find 25730725/30637349, race 4+906
Free swap:       20442144kB
4390912 pages of RAM
281817 reserved pages
10966917 pages shared
175 pages swap cached
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffffa0153f69>{:lpfc:lpfc_hba_down_post+33}
PML4 289e6a067 PGD 3051ae067 PMD 0
Oops: 0000 [1] SMP
CPU 9
Modules linked in: lpfc vfat fat nfsd exportfs md5 ipv6 nfs lockd nfs_acl sunrpc cpufreq_powersave ide_dump scsi_dump diskdump zlib_deflate dmpjbod(U) dmpap(U) dmpaa(U) vxspec(U) vxio(U) vxdmp(U) vxportal(U) fdd(U) vxfs(U) emcpdm(U) emcpgpx(U) emcpmpx(U) emcp(U) joydev button battery ac uhci_hcd ehci_hcd bnx2 e1000 bonding(U) st sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 ata_piix libata megaraid_sas qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 4255, comm: modprobe Tainted: PF     2.6.9-78.0.1.ELlargesmp
RIP: 0010:[<ffffffffa0153f69>] <ffffffffa0153f69>{:lpfc:lpfc_hba_down_post+33}
RSP: 0000:000001039ac83db8  EFLAGS: 00010046
RAX: 0000010382402380 RBX: 0000000000000000 RCX: 00000000d318dfd0
RDX: 00000000004407ed RSI: 0000000000000046 RDI: 0000010382402000
RBP: 0000010382402320 R08: 0000000000000002 R09: 0000000000000007
R10: 0000000000000001 R11: ffffffff802b32eb R12: 0000010382402000
R13: 0000010382402018 R14: 0000000000001130 R15: 0000002a95579010
FS:  0000002a95577b00(0000) GS:ffffffff80520500(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000004214de000 CR4: 00000000000006e0
Process modprobe (pid: 4255, threadinfo 000001039ac82000, task 0000010097eb27f0)
Stack: ffffffffffffffff 0000010382402000 0000000000000000 00000104208a2140
      0000000000001130 ffffffffa01453ed 0000000000078000 0000010382402000
      0000000000001a00 ffffffffa0148598
Call Trace:<ffffffffa01453ed>{:lpfc:lpfc_sli_brdrestart+629} <ffffffffa0148598>{:lpfc:lpfc_sli_hba_setup+312}
      <ffffffffa015fe5c>{:lpfc:lpfc_pci_probe_one+4932} <ffffffff801f83d4>{pci_device_probe+110}
      <ffffffff8024eac1>{bus_match+57} <ffffffff8024ebbf>{driver_attach+68}
      <ffffffff8024eedb>{bus_add_driver+143} <ffffffff801f8144>{pci_register_driver+119}
      <ffffffffa0e57046>{:lpfc:lpfc_init+70} <ffffffff80150e41>{sys_init_module+278}
      <ffffffff801102f6>{system_call+126}

Code: 4c 8b 33 48 39 c3 74 45 48 8b 03 48 8b 53 08 4c 89 e7 48 89
RIP <ffffffffa0153f69>{:lpfc:lpfc_hba_down_post+33} RSP <000001039ac83db8>
CR2: 0000000000000000

crash> dis lpfc_hba_down_post
0xffffffffa0153f48 <lpfc_hba_down_post>:        push   %r14
0xffffffffa0153f4a <lpfc_hba_down_post+2>:      lea    0x380(%rdi),%rax
0xffffffffa0153f51 <lpfc_hba_down_post+9>:      push   %r13
0xffffffffa0153f53 <lpfc_hba_down_post+11>:     lea    0x18(%rdi),%r13
0xffffffffa0153f57 <lpfc_hba_down_post+15>:     push   %r12
0xffffffffa0153f59 <lpfc_hba_down_post+17>:     mov    %rdi,%r12
0xffffffffa0153f5c <lpfc_hba_down_post+20>:     push   %rbp
0xffffffffa0153f5d <lpfc_hba_down_post+21>:     lea    0x320(%rdi),%rbp
0xffffffffa0153f64 <lpfc_hba_down_post+28>:     push   %rbx
0xffffffffa0153f65 <lpfc_hba_down_post+29>:     mov    0x60(%rbp),%rbx
0xffffffffa0153f69 <lpfc_hba_down_post+33>:     mov    (%rbx),%r14

%rbx is NULL.

int
lpfc_hba_down_post(struct lpfc_hba * phba)
{
	struct lpfc_sli *psli = &phba->sli;
	struct lpfc_sli_ring *pring;
	struct lpfc_dmabuf *mp, *next_mp;
	int i;

	/* Cleanup preposted buffers on the ELS ring */
	pring = &psli->ring[LPFC_ELS_RING];
	list_for_each_entry_safe(mp, next_mp, &pring->postbufq, list) {

%rbx corresponds to phba->sli.ring[LPFC_ELS_RING].postbufq.next which gets initialised in lpfc_sli_queue_setup().

int
lpfc_sli_queue_setup(struct lpfc_hba * phba)
{
	struct lpfc_sli *psli;
	struct lpfc_sli_ring *pring;
	int i, cnt;

	psli = &phba->sli;
	INIT_LIST_HEAD(&psli->mboxq);
	/* Initialize list headers for txq and txcmplq as double linked lists */
	for (i = 0; i < psli->sliinit.num_rings; i++) {
		pring = &psli->ring[i];
		pring->ringno = i;
		pring->next_cmdidx  = 0;
		pring->local_getidx = 0;
		pring->cmdidx = 0;
		INIT_LIST_HEAD(&pring->txq);
		INIT_LIST_HEAD(&pring->txcmplq);
		INIT_LIST_HEAD(&pring->iocb_continueq);
		INIT_LIST_HEAD(&pring->postbufq);
		cnt = psli->sliinit.ringinit[i].fast_iotag;
		if (cnt) {
			pring->fast_lookup =
				kmalloc(cnt * sizeof (struct lpfc_iocbq *),
					GFP_KERNEL);
			if (pring->fast_lookup == 0) {
				return (0);
			}
			memset((char *)pring->fast_lookup, 0,
			       cnt * sizeof (struct lpfc_iocbq *));
		}
	}
	return (1);
}

This is where we got the page allocation failure message from so it looks like kmalloc() failed and we returned prematurely failing to initialize the remaining items in plsi->ring[].

This fast_lookup/kmalloc() code does not exist in RHEL5 or mainline.  This is the patch from mainline that removed it:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=604a3e3042eb89ffaa4f735ef9208281aae786c7


Note You need to log in before you can comment on or make changes to this bug.