Description of problem: Customer added SCSI tape libraries to the host and rescanned the SCSI bus (using modprobe -r lpfc; modprobe lpfc). This caused the host to kernel panic and the customer captured a core: $ cd /cores/20090812000643/work $ cat sys KERNEL: /cores/20090812000643/work/vmlinux DUMPFILE: /cores/20090812000643/work/1942674-vmcore [PARTIAL DUMP] CPUS: 16 DATE: Mon Aug 10 20:11:11 2009 UPTIME: 103 days, 22:54:48 LOAD AVERAGE: 15.40, 14.84, 7.94 TASKS: 1252 NODENAME: sapkxpsepap01 RELEASE: 2.6.9-78.0.1.ELlargesmp VERSION: #1 SMP Tue Jul 22 18:11:13 EDT 2008 MACHINE: x86_64 (2132 Mhz) MEMORY: 16 GB PANIC: "Oops: 0000 [1] SMP " (check log for details) How reproducible: Customer has produced it once. Server is production and they have not attempted to reproduce it again. Steps to Reproduce: - Have many devices using lpfc module - modprobe -r lpfc; modprobe lpfc
System panicked due to a NULL pointer in lpfc_hba_down_post(). Emulex LightPulse Fibre Channel SCSI driver 8.0.16.40 Copyright(c) 2003-2008 Emulex. All rights reserved. ACPI: PCI Interrupt 0000:1e:00.0[A] -> GSI 80 (level, low) -> IRQ 225 PCI: Setting latency timer of device 0000:1e:00.0 to 64 lpfc 0000:1e:00.0: 0:1303 Link Up Event x1 received Data: x1 xf7 x10 x0 scsi7 : IBM 42C2071 4Gb 2-Port PCIe FC HBA for System x on PCI bus 1e device 00 irq 225 port 0 Vendor: ATL Model: P3000 Rev: 0100 Type: Medium Changer ANSI SCSI revision: 02 Attached scsi generic sg15 at scsi7, channel 0, id 0, lun 0, type 8 ACPI: PCI Interrupt 0000:1e:00.1[B] -> GSI 84 (level, low) -> IRQ 233 PCI: Setting latency timer of device 0000:1e:00.1 to 64 modprobe: page allocation failure. order:5, mode:0xd0 Call Trace:<ffffffff8015fa55>{__alloc_pages+789} <ffffffff8015faee>{__get_free_pages+11} <ffffffff80162de0>{kmem_getpages+36} <ffffffff80163578>{cache_alloc_refill+612} <ffffffff80163243>{__kmalloc+123} <ffffffffa014734d>{:lpfc:lpfc_sli_queue_setup+192} <ffffffffa015fd2f>{:lpfc:lpfc_pci_probe_one+4631} <ffffffff801f83d4>{pci_device_probe+110} <ffffffff8024eac1>{bus_match+57} <ffffffff8024ebbf>{driver_attach+68} <ffffffff8024eedb>{bus_add_driver+143} <ffffffff801f8144>{pci_register_driver+119} <ffffffffa0e57046>{:lpfc:lpfc_init+70} <ffffffff80150e41>{sys_init_module+278} <ffffffff801102f6>{system_call+126} Mem-info: Node 0 DMA per-cpu: cpu 0 hot: low 2, high 6, batch 1 cpu 0 cold: low 0, high 2, batch 1 cpu 1 hot: low 2, high 6, batch 1 cpu 1 cold: low 0, high 2, batch 1 cpu 2 hot: low 2, high 6, batch 1 cpu 2 cold: low 0, high 2, batch 1 cpu 3 hot: low 2, high 6, batch 1 cpu 3 cold: low 0, high 2, batch 1 cpu 4 hot: low 2, high 6, batch 1 cpu 4 cold: low 0, high 2, batch 1 cpu 5 hot: low 2, high 6, batch 1 cpu 5 cold: low 0, high 2, batch 1 cpu 6 hot: low 2, high 6, batch 1 cpu 6 cold: low 0, high 2, batch 1 cpu 7 hot: low 2, high 6, batch 1 cpu 7 cold: low 0, high 2, batch 1 cpu 8 hot: low 2, high 6, batch 1 cpu 8 cold: low 0, high 2, batch 1 cpu 9 hot: low 2, high 6, batch 1 cpu 9 cold: low 0, high 2, batch 1 cpu 10 hot: low 2, high 6, batch 1 cpu 10 cold: low 0, high 2, batch 1 cpu 11 hot: low 2, high 6, batch 1 cpu 11 cold: low 0, high 2, batch 1 cpu 12 hot: low 2, high 6, batch 1 cpu 12 cold: low 0, high 2, batch 1 cpu 13 hot: low 2, high 6, batch 1 cpu 13 cold: low 0, high 2, batch 1 cpu 14 hot: low 2, high 6, batch 1 cpu 14 cold: low 0, high 2, batch 1 cpu 15 hot: low 2, high 6, batch 1 cpu 15 cold: low 0, high 2, batch 1 Node 0 Normal per-cpu: cpu 0 hot: low 32, high 96, batch 16 cpu 0 cold: low 0, high 32, batch 16 cpu 1 hot: low 32, high 96, batch 16 cpu 1 cold: low 0, high 32, batch 16 cpu 2 hot: low 32, high 96, batch 16 cpu 2 cold: low 0, high 32, batch 16 cpu 3 hot: low 32, high 96, batch 16 cpu 3 cold: low 0, high 32, batch 16 cpu 4 hot: low 32, high 96, batch 16 cpu 4 cold: low 0, high 32, batch 16 cpu 5 hot: low 32, high 96, batch 16 cpu 5 cold: low 0, high 32, batch 16 cpu 6 hot: low 32, high 96, batch 16 cpu 6 cold: low 0, high 32, batch 16 cpu 7 hot: low 32, high 96, batch 16 cpu 7 cold: low 0, high 32, batch 16 cpu 8 hot: low 32, high 96, batch 16 cpu 8 cold: low 0, high 32, batch 16 cpu 9 hot: low 32, high 96, batch 16 cpu 9 cold: low 0, high 32, batch 16 cpu 10 hot: low 32, high 96, batch 16 cpu 10 cold: low 0, high 32, batch 16 cpu 11 hot: low 32, high 96, batch 16 cpu 11 cold: low 0, high 32, batch 16 cpu 12 hot: low 32, high 96, batch 16 cpu 12 cold: low 0, high 32, batch 16 cpu 13 hot: low 32, high 96, batch 16 cpu 13 cold: low 0, high 32, batch 16 cpu 14 hot: low 32, high 96, batch 16 cpu 14 cold: low 0, high 32, batch 16 cpu 15 hot: low 32, high 96, batch 16 cpu 15 cold: low 0, high 32, batch 16 Node 0 HighMem per-cpu: empty Free pages: 56504kB (0kB HighMem) Active:3399613 inactive:472304 dirty:927 writeback:14 unstable:0 free:14126 slab:95700 mapped:2666106 pagetables:75381 Node 0 DMA free:9960kB min:12kB low:24kB high:36kB active:0kB inactive:0kB present:15980kB pages_scanned:0 all_unreclaimable? yes protections[]: 0 0 0 Node 0 Normal free:46544kB min:16356kB low:32712kB high:49068kB active:13598452kB inactive:1889216kB present:16760108kB pages_scanned:0 all_unreclaimable? no protections[]: 0 0 0 Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no protections[]: 0 0 0 Node 0 DMA: 0*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 2*4096kB = 9960kB Node 0 Normal: 3124*4kB 1106*8kB 699*16kB 438*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46544kB Node 0 HighMem: empty 2877482 pagecache pages Swap cache: add 56082667, delete 56082989, find 25730725/30637349, race 4+906 Free swap: 20442144kB 4390912 pages of RAM 281817 reserved pages 10966917 pages shared 175 pages swap cached Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: <ffffffffa0153f69>{:lpfc:lpfc_hba_down_post+33} PML4 289e6a067 PGD 3051ae067 PMD 0 Oops: 0000 [1] SMP CPU 9 Modules linked in: lpfc vfat fat nfsd exportfs md5 ipv6 nfs lockd nfs_acl sunrpc cpufreq_powersave ide_dump scsi_dump diskdump zlib_deflate dmpjbod(U) dmpap(U) dmpaa(U) vxspec(U) vxio(U) vxdmp(U) vxportal(U) fdd(U) vxfs(U) emcpdm(U) emcpgpx(U) emcpmpx(U) emcp(U) joydev button battery ac uhci_hcd ehci_hcd bnx2 e1000 bonding(U) st sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 ata_piix libata megaraid_sas qla2xxx scsi_transport_fc sd_mod scsi_mod Pid: 4255, comm: modprobe Tainted: PF 2.6.9-78.0.1.ELlargesmp RIP: 0010:[<ffffffffa0153f69>] <ffffffffa0153f69>{:lpfc:lpfc_hba_down_post+33} RSP: 0000:000001039ac83db8 EFLAGS: 00010046 RAX: 0000010382402380 RBX: 0000000000000000 RCX: 00000000d318dfd0 RDX: 00000000004407ed RSI: 0000000000000046 RDI: 0000010382402000 RBP: 0000010382402320 R08: 0000000000000002 R09: 0000000000000007 R10: 0000000000000001 R11: ffffffff802b32eb R12: 0000010382402000 R13: 0000010382402018 R14: 0000000000001130 R15: 0000002a95579010 FS: 0000002a95577b00(0000) GS:ffffffff80520500(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000004214de000 CR4: 00000000000006e0 Process modprobe (pid: 4255, threadinfo 000001039ac82000, task 0000010097eb27f0) Stack: ffffffffffffffff 0000010382402000 0000000000000000 00000104208a2140 0000000000001130 ffffffffa01453ed 0000000000078000 0000010382402000 0000000000001a00 ffffffffa0148598 Call Trace:<ffffffffa01453ed>{:lpfc:lpfc_sli_brdrestart+629} <ffffffffa0148598>{:lpfc:lpfc_sli_hba_setup+312} <ffffffffa015fe5c>{:lpfc:lpfc_pci_probe_one+4932} <ffffffff801f83d4>{pci_device_probe+110} <ffffffff8024eac1>{bus_match+57} <ffffffff8024ebbf>{driver_attach+68} <ffffffff8024eedb>{bus_add_driver+143} <ffffffff801f8144>{pci_register_driver+119} <ffffffffa0e57046>{:lpfc:lpfc_init+70} <ffffffff80150e41>{sys_init_module+278} <ffffffff801102f6>{system_call+126} Code: 4c 8b 33 48 39 c3 74 45 48 8b 03 48 8b 53 08 4c 89 e7 48 89 RIP <ffffffffa0153f69>{:lpfc:lpfc_hba_down_post+33} RSP <000001039ac83db8> CR2: 0000000000000000 crash> dis lpfc_hba_down_post 0xffffffffa0153f48 <lpfc_hba_down_post>: push %r14 0xffffffffa0153f4a <lpfc_hba_down_post+2>: lea 0x380(%rdi),%rax 0xffffffffa0153f51 <lpfc_hba_down_post+9>: push %r13 0xffffffffa0153f53 <lpfc_hba_down_post+11>: lea 0x18(%rdi),%r13 0xffffffffa0153f57 <lpfc_hba_down_post+15>: push %r12 0xffffffffa0153f59 <lpfc_hba_down_post+17>: mov %rdi,%r12 0xffffffffa0153f5c <lpfc_hba_down_post+20>: push %rbp 0xffffffffa0153f5d <lpfc_hba_down_post+21>: lea 0x320(%rdi),%rbp 0xffffffffa0153f64 <lpfc_hba_down_post+28>: push %rbx 0xffffffffa0153f65 <lpfc_hba_down_post+29>: mov 0x60(%rbp),%rbx 0xffffffffa0153f69 <lpfc_hba_down_post+33>: mov (%rbx),%r14 %rbx is NULL. int lpfc_hba_down_post(struct lpfc_hba * phba) { struct lpfc_sli *psli = &phba->sli; struct lpfc_sli_ring *pring; struct lpfc_dmabuf *mp, *next_mp; int i; /* Cleanup preposted buffers on the ELS ring */ pring = &psli->ring[LPFC_ELS_RING]; list_for_each_entry_safe(mp, next_mp, &pring->postbufq, list) { %rbx corresponds to phba->sli.ring[LPFC_ELS_RING].postbufq.next which gets initialised in lpfc_sli_queue_setup(). int lpfc_sli_queue_setup(struct lpfc_hba * phba) { struct lpfc_sli *psli; struct lpfc_sli_ring *pring; int i, cnt; psli = &phba->sli; INIT_LIST_HEAD(&psli->mboxq); /* Initialize list headers for txq and txcmplq as double linked lists */ for (i = 0; i < psli->sliinit.num_rings; i++) { pring = &psli->ring[i]; pring->ringno = i; pring->next_cmdidx = 0; pring->local_getidx = 0; pring->cmdidx = 0; INIT_LIST_HEAD(&pring->txq); INIT_LIST_HEAD(&pring->txcmplq); INIT_LIST_HEAD(&pring->iocb_continueq); INIT_LIST_HEAD(&pring->postbufq); cnt = psli->sliinit.ringinit[i].fast_iotag; if (cnt) { pring->fast_lookup = kmalloc(cnt * sizeof (struct lpfc_iocbq *), GFP_KERNEL); if (pring->fast_lookup == 0) { return (0); } memset((char *)pring->fast_lookup, 0, cnt * sizeof (struct lpfc_iocbq *)); } } return (1); } This is where we got the page allocation failure message from so it looks like kmalloc() failed and we returned prematurely failing to initialize the remaining items in plsi->ring[]. This fast_lookup/kmalloc() code does not exist in RHEL5 or mainline. This is the patch from mainline that removed it: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=604a3e3042eb89ffaa4f735ef9208281aae786c7