Description of problem: NMI Watchdog detected LOCKUP, CPU=0, registers: CPU 0 Modules linked in: netconsole nfs lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) nfsd exportfs lockd nfs_acl i2c_dev i2c_core ipmi_devintf ipmi_si ipmi_msghandler sunrpc md5 ipv6 ide_dump scsi_dump diskdump zlib_deflate dm_round_robin dm_multipath button battery ac k8_edac edac_mc pciehp e1000 bonding(U) sata_nv libata sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptscsi mptbase usb_storage uhci_hcd ohci_hcd ehci_hcd sd_mod scsi_mod Pid: 20459, comm: ip.sh Not tainted 2.6.9-67.0.15.ELlargesmp RIP: 0010:[<ffffffff80310027>] <ffffffff80310027>{.text.lock.spinlock+5} RSP: 0018:00000100c1309cf8 EFLAGS: 00000082 RAX: 0000000000000010 RBX: 00000101f8ffc5a8 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 00000000000000d0 RDI: 00000101f8ffc5a8 RBP: 00000101f8fdc000 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000000d0 R11: 00000104eb6c6a00 R12: 00000101f8ffc548 R13: 00000101f8ffc340 R14: 00000000000000d0 R15: 00000104eb6c6a00 FS: 0000002a95aa86e0(0000) GS:ffffffff80506880(0000) knlGS:00000000f7e17b80 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000552abb4c68 CR3: 0000000000101000 CR4: 00000000000006e0 Process ip.sh (pid: 20459, threadinfo 00000100c1308000, task 0000010ff7ab97f0) Stack: 0000000000000010 ffffffff80162a24 00000000000000d0 00000000000000d0 00000101f8ffc340 00000000fffffff4 ffffffff801a406e 0000000000000000 00000104eb6c6a00 ffffffff80162963 Call Trace:<ffffffff80162a24>{cache_alloc_refill+96} <ffffffff801a406e>{load_elf_binary+0} <ffffffff80162963>{kmem_cache_alloc+90} <ffffffff801a4108>{load_elf_binary+154} <ffffffff8015d846>{generic_file_aio_read+48} <ffffffff8017a635>{do_sync_read+178} <ffffffff801a406e>{load_elf_binary+0} <ffffffff80185178>{search_binary_handler+210} <ffffffff801854ad>{do_execve+398} <ffffffff8011026a>{system_call+126} <ffffffff8010ee44>{sys_execve+52} <ffffffff8011068e>{stub_execve+106} Version-Release number of selected component (if applicable): 2.6.9-67.0.15.ELlargesmp How reproducible: Variable; seems workload dependent. Steps to Reproduce: 1. Run e.g. sybase database application for a few days Actual results: Above NMI watchdog trace. Expected results: No lockup. Additional info:
Its not possible to determine what is happening without seeing what the other CPUs are doing when this NMI occurs. Once we gat a core dump we should be able to determine that. Larry Woodman
Updating PM score.
Please attach AltSysrq-M and /proc/meminfo outputs so I can also see exactly how the memory is distributed among the nodes of this system. Larry Woodman
Whats the status with this issue? I think we were waiting for a crash dump so I could see who was holding the cachep->spinlock. Is the problem still happening??? Larry Woodman
Larry, Customer deactivated a collectl job accessing /proc slabinfo and did a service relocation on the cluster. The crash did not yet reappear since then. best regards, Hari
hi, looks like I hit the same issue.. I managed to get the vmcore. While I'm figuring out where to place it for you, I'm attaching 'bt -a' crash output. I'll let you once the vmcore is in public place. thanks, jirka KERNEL: /usr/lib/debug/lib/modules/2.6.18-194.3.1.el5prerel4.2.0_70.61.0/vmlinux DUMPFILE: ./vmcore [PARTIAL DUMP] CPUS: 4 DATE: Sun Sep 12 11:59:13 2010 UPTIME: 00:26:46 LOAD AVERAGE: 0.90, 1.10, 0.84 TASKS: 168 NODENAME: krava RELEASE: 2.6.18-194.3.1.el5prerel4.2.0_70.57.0 VERSION: #1 SMP Tue Jul 6 15:21:33 EDT 2010 MACHINE: x86_64 (2333 Mhz) MEMORY: 3.9 GB PANIC: "" PID: 0 COMMAND: "swapper" TASK: ffffffff80309b60 (1 of 4) [THREAD_INFO: ffffffff803fc000] CPU: 0 STATE: TASK_RUNNING (NMI) PID: 0 TASK: ffffffff80309b60 CPU: 0 COMMAND: "swapper" #0 [ffffffff8044bdc0] crash_kexec at ffffffff800aea87 #1 [ffffffff8044be48] .text.lock.spinlock at ffffffff80065bff (via _spin_lock) #2 [ffffffff8044be80] die_nmi at ffffffff800662a5 #3 [ffffffff8044bea0] nmi_watchdog_tick at ffffffff80066a0b #4 [ffffffff8044bef0] default_do_nmi at ffffffff80066629 #5 [ffffffff8044bf40] do_nmi at ffffffff80066896 #6 [ffffffff8044bf50] nmi at ffffffff80065eef [exception RIP: .text.lock.spinlock+5] RIP: ffffffff80065bff RSP: ffffffff80448de8 RFLAGS: 00000086 RAX: 0000000000000000 RBX: 000000000000001b RCX: 0000000000000000 RDX: 0000000000000036 RSI: ffff81012fe0b200 RDI: ffff81012ff22380 RBP: ffff81012fe0b200 R8: ffff81012fd9f870 R9: 000000000000003c R10: ffff81012fe60038 R11: ffffc20000095a18 R12: ffff81012ff22340 R13: 0000000000000000 R14: ffff81012fdfa4c0 R15: 00000000000000a8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #7 [ffffffff80448de8] .text.lock.spinlock at ffffffff80065bff (via _spin_lock) #8 [ffffffff80448de8] drain_array at ffffffff800dcd21 #9 [ffffffff80448e18] kmem_cache_free at ffffffff80007655 #10 [ffffffff80448e58] e1000_put_txbuf at ffffffff88236c8c #11 [ffffffff80448e68] e1000_poll at ffffffff88239c77 #12 [ffffffff80448ed8] __kc_adapter_clean at ffffffff882352d5 #13 [ffffffff80448ef8] net_rx_action at ffffffff8000c85b #14 [ffffffff80448f38] __do_softirq at ffffffff800123da #15 [ffffffff80448f68] call_softirq at ffffffff8005f2fc #16 [ffffffff80448f80] do_softirq at ffffffff8006db8e #17 [ffffffff80448f90] do_IRQ at ffffffff8006da16 --- <IRQ stack> --- #18 [ffffffff803fdee8] ret_from_intr at ffffffff8005e615 [exception RIP: mwait_idle+55] RIP: ffffffff800574a8 RSP: ffffffff803fdf90 RFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000090000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8030b718 RBP: ffffffff80309d50 R8: ffffffff803fc000 R9: 000000000000003c R10: ffff81012fe60038 R11: 0000000000000286 R12: 00000000001e60ac R13: 00000185601fb815 R14: ffff8100cc55b040 R15: ffffffff80309b60 ORIG_RAX: ffffffffffffff7c CS: 0010 SS: 0018 #19 [ffffffff803fdf90] cpu_idle at ffffffff80049624 PID: 24377 TASK: ffff8100c80e87e0 CPU: 1 COMMAND: "tcpreplay" #0 [ffff81012ff39f20] crash_nmi_callback at ffffffff8007bf27 #1 [ffff81012ff39f40] do_nmi at ffffffff8006688a #2 [ffff81012ff39f50] nmi at ffffffff80065eef [exception RIP: cache_alloc_refill+213] RIP: ffffffff8005c539 RSP: ffff810092995b48 RFLAGS: 00000087 RAX: ffff81012ff22340 RBX: 000000000000001a RCX: 0000000000000000 RDX: ffff8100cd561000 RSI: ffff81012ff22340 RDI: ffff8100cdb22040 RBP: ffff8100cdb22040 R8: ffff81012fe0b000 R9: ffff81012f65e000 R10: 0000000000000000 R11: 0000000000000246 R12: ffff81012fe0b000 R13: ffff81012ff22340 R14: 0000000000000001 R15: ffff81012fdfa4c0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #3 [ffff810092995b48] cache_alloc_refill at ffffffff8005c539 #4 [ffff810092995b80] kmem_cache_alloc at ffffffff8000abf4 #5 [ffff810092995ba0] __alloc_skb at ffffffff8002e1b2 #6 [ffff810092995be0] sock_alloc_send_pskb at ffffffff80228b0b #7 [ffff810092995bf0] __qdisc_run at ffffffff8024023c #8 [ffff810092995c50] packet_sendmsg at ffffffff802812f8 #9 [ffff810092995cb0] sock_sendmsg at ffffffff8005551f #10 [ffff810092995e50] sys_sendto at ffffffff802276be #11 [ffff810092995e70] kthread_create at ffffffff800a1ac0 #12 [ffff810092995ea8] thread_return at ffffffff80063ff8 #13 [ffff810092995f80] system_call at ffffffff8005e116 RIP: 00000036834d4965 RSP: 00007fffd6b5da78 RFLAGS: 00010207 RAX: 000000000000002c RBX: ffffffff8005e116 RCX: 0000000000000075 RDX: 0000000000000069 RSI: 000000001669db92 RDI: 0000000000000003 RBP: 000000001669a2f0 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000001669db92 R13: 0000000000000003 R14: 000000001669db92 R15: 0000000000000069 ORIG_RAX: 000000000000002c CS: 0033 SS: 002b PID: 0 TASK: ffff81012ff23080 CPU: 2 COMMAND: "swapper" #0 [ffff81012ff6ef20] crash_nmi_callback at ffffffff8007bf27 #1 [ffff81012ff6ef40] do_nmi at ffffffff8006688a #2 [ffff81012ff6ef50] nmi at ffffffff80065eef [exception RIP: mwait_idle+55] RIP: ffffffff800574a8 RSP: ffff81012ff67ef0 RFLAGS: 00000246 RAX: 0000000000000000 RBX: ffffffff80057472 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8030b718 RBP: 0000000000000002 R8: ffff81012ff66000 R9: 000000000000003c R10: ffff81012fe60038 R11: 0000000000000286 R12: 00000000000000ff R13: ffffffff803d4180 R14: 0000000000000200 R15: ffffffff803f62a0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #3 [ffff81012ff67ef0] mwait_idle at ffffffff800574a8 #4 [ffff81012ff67ef0] cpu_idle at ffffffff80049624 PID: 0 TASK: ffff81012fe17100 CPU: 3 COMMAND: "swapper" #0 [ffff81012fe40f20] crash_nmi_callback at ffffffff8007bf27 #1 [ffff81012fe40f40] do_nmi at ffffffff8006688a #2 [ffff81012fe40f50] nmi at ffffffff80065eef [exception RIP: mwait_idle+55] RIP: ffffffff800574a8 RSP: ffff81012fe39ef0 RFLAGS: 00000246 RAX: 0000000000000000 RBX: ffffffff80057472 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8030b718 RBP: 0000000000000003 R8: ffff81012fe38000 R9: 000000000000003d R10: ffff81012fe60008 R11: 0000000000000080 R12: 00000000000000ff R13: ffffffff803d4280 R14: 0000000000000300 R15: ffffffff803f62c0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #3 [ffff81012fe39ef0] mwait_idle at ffffffff800574a8 #4 [ffff81012fe39ef0] cpu_idle at ffffffff80049624
please check coredumps in http://pepek.sde.cz/~jirka/nmi-lockup/ jirka