Bug 460935
| Summary: | NMI detected lockup in cache_alloc_refill | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Bryn M. Reeves <bmr> |
| Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
| Status: | CLOSED WONTFIX | QA Contact: | Martin Jenner <mjenner> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 4.6.z | CC: | dmair, hklein, jolsa, tao |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2009-03-05 14:21:15 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 461297 | ||
Its not possible to determine what is happening without seeing what the other CPUs are doing when this NMI occurs. Once we gat a core dump we should be able to determine that. Larry Woodman Updating PM score. Please attach AltSysrq-M and /proc/meminfo outputs so I can also see exactly how the memory is distributed among the nodes of this system. Larry Woodman Whats the status with this issue? I think we were waiting for a crash dump so I could see who was holding the cachep->spinlock. Is the problem still happening??? Larry Woodman Larry, Customer deactivated a collectl job accessing /proc slabinfo and did a service relocation on the cluster. The crash did not yet reappear since then. best regards, Hari hi,
looks like I hit the same issue.. I managed to get the vmcore.
While I'm figuring out where to place it for you, I'm attaching 'bt -a' crash
output. I'll let you once the vmcore is in public place.
thanks,
jirka
KERNEL: /usr/lib/debug/lib/modules/2.6.18-194.3.1.el5prerel4.2.0_70.61.0/vmlinux
DUMPFILE: ./vmcore [PARTIAL DUMP]
CPUS: 4
DATE: Sun Sep 12 11:59:13 2010
UPTIME: 00:26:46
LOAD AVERAGE: 0.90, 1.10, 0.84
TASKS: 168
NODENAME: krava
RELEASE: 2.6.18-194.3.1.el5prerel4.2.0_70.57.0
VERSION: #1 SMP Tue Jul 6 15:21:33 EDT 2010
MACHINE: x86_64 (2333 Mhz)
MEMORY: 3.9 GB
PANIC: ""
PID: 0
COMMAND: "swapper"
TASK: ffffffff80309b60 (1 of 4) [THREAD_INFO: ffffffff803fc000]
CPU: 0
STATE: TASK_RUNNING (NMI)
PID: 0 TASK: ffffffff80309b60 CPU: 0 COMMAND: "swapper"
#0 [ffffffff8044bdc0] crash_kexec at ffffffff800aea87
#1 [ffffffff8044be48] .text.lock.spinlock at ffffffff80065bff (via _spin_lock)
#2 [ffffffff8044be80] die_nmi at ffffffff800662a5
#3 [ffffffff8044bea0] nmi_watchdog_tick at ffffffff80066a0b
#4 [ffffffff8044bef0] default_do_nmi at ffffffff80066629
#5 [ffffffff8044bf40] do_nmi at ffffffff80066896
#6 [ffffffff8044bf50] nmi at ffffffff80065eef
[exception RIP: .text.lock.spinlock+5]
RIP: ffffffff80065bff RSP: ffffffff80448de8 RFLAGS: 00000086
RAX: 0000000000000000 RBX: 000000000000001b RCX: 0000000000000000
RDX: 0000000000000036 RSI: ffff81012fe0b200 RDI: ffff81012ff22380
RBP: ffff81012fe0b200 R8: ffff81012fd9f870 R9: 000000000000003c
R10: ffff81012fe60038 R11: ffffc20000095a18 R12: ffff81012ff22340
R13: 0000000000000000 R14: ffff81012fdfa4c0 R15: 00000000000000a8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#7 [ffffffff80448de8] .text.lock.spinlock at ffffffff80065bff (via _spin_lock)
#8 [ffffffff80448de8] drain_array at ffffffff800dcd21
#9 [ffffffff80448e18] kmem_cache_free at ffffffff80007655
#10 [ffffffff80448e58] e1000_put_txbuf at ffffffff88236c8c
#11 [ffffffff80448e68] e1000_poll at ffffffff88239c77
#12 [ffffffff80448ed8] __kc_adapter_clean at ffffffff882352d5
#13 [ffffffff80448ef8] net_rx_action at ffffffff8000c85b
#14 [ffffffff80448f38] __do_softirq at ffffffff800123da
#15 [ffffffff80448f68] call_softirq at ffffffff8005f2fc
#16 [ffffffff80448f80] do_softirq at ffffffff8006db8e
#17 [ffffffff80448f90] do_IRQ at ffffffff8006da16
--- <IRQ stack> ---
#18 [ffffffff803fdee8] ret_from_intr at ffffffff8005e615
[exception RIP: mwait_idle+55]
RIP: ffffffff800574a8 RSP: ffffffff803fdf90 RFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000090000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8030b718
RBP: ffffffff80309d50 R8: ffffffff803fc000 R9: 000000000000003c
R10: ffff81012fe60038 R11: 0000000000000286 R12: 00000000001e60ac
R13: 00000185601fb815 R14: ffff8100cc55b040 R15: ffffffff80309b60
ORIG_RAX: ffffffffffffff7c CS: 0010 SS: 0018
#19 [ffffffff803fdf90] cpu_idle at ffffffff80049624
PID: 24377 TASK: ffff8100c80e87e0 CPU: 1 COMMAND: "tcpreplay"
#0 [ffff81012ff39f20] crash_nmi_callback at ffffffff8007bf27
#1 [ffff81012ff39f40] do_nmi at ffffffff8006688a
#2 [ffff81012ff39f50] nmi at ffffffff80065eef
[exception RIP: cache_alloc_refill+213]
RIP: ffffffff8005c539 RSP: ffff810092995b48 RFLAGS: 00000087
RAX: ffff81012ff22340 RBX: 000000000000001a RCX: 0000000000000000
RDX: ffff8100cd561000 RSI: ffff81012ff22340 RDI: ffff8100cdb22040
RBP: ffff8100cdb22040 R8: ffff81012fe0b000 R9: ffff81012f65e000
R10: 0000000000000000 R11: 0000000000000246 R12: ffff81012fe0b000
R13: ffff81012ff22340 R14: 0000000000000001 R15: ffff81012fdfa4c0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#3 [ffff810092995b48] cache_alloc_refill at ffffffff8005c539
#4 [ffff810092995b80] kmem_cache_alloc at ffffffff8000abf4
#5 [ffff810092995ba0] __alloc_skb at ffffffff8002e1b2
#6 [ffff810092995be0] sock_alloc_send_pskb at ffffffff80228b0b
#7 [ffff810092995bf0] __qdisc_run at ffffffff8024023c
#8 [ffff810092995c50] packet_sendmsg at ffffffff802812f8
#9 [ffff810092995cb0] sock_sendmsg at ffffffff8005551f
#10 [ffff810092995e50] sys_sendto at ffffffff802276be
#11 [ffff810092995e70] kthread_create at ffffffff800a1ac0
#12 [ffff810092995ea8] thread_return at ffffffff80063ff8
#13 [ffff810092995f80] system_call at ffffffff8005e116
RIP: 00000036834d4965 RSP: 00007fffd6b5da78 RFLAGS: 00010207
RAX: 000000000000002c RBX: ffffffff8005e116 RCX: 0000000000000075
RDX: 0000000000000069 RSI: 000000001669db92 RDI: 0000000000000003
RBP: 000000001669a2f0 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000001669db92
R13: 0000000000000003 R14: 000000001669db92 R15: 0000000000000069
ORIG_RAX: 000000000000002c CS: 0033 SS: 002b
PID: 0 TASK: ffff81012ff23080 CPU: 2 COMMAND: "swapper"
#0 [ffff81012ff6ef20] crash_nmi_callback at ffffffff8007bf27
#1 [ffff81012ff6ef40] do_nmi at ffffffff8006688a
#2 [ffff81012ff6ef50] nmi at ffffffff80065eef
[exception RIP: mwait_idle+55]
RIP: ffffffff800574a8 RSP: ffff81012ff67ef0 RFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffff80057472 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8030b718
RBP: 0000000000000002 R8: ffff81012ff66000 R9: 000000000000003c
R10: ffff81012fe60038 R11: 0000000000000286 R12: 00000000000000ff
R13: ffffffff803d4180 R14: 0000000000000200 R15: ffffffff803f62a0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#3 [ffff81012ff67ef0] mwait_idle at ffffffff800574a8
#4 [ffff81012ff67ef0] cpu_idle at ffffffff80049624
PID: 0 TASK: ffff81012fe17100 CPU: 3 COMMAND: "swapper"
#0 [ffff81012fe40f20] crash_nmi_callback at ffffffff8007bf27
#1 [ffff81012fe40f40] do_nmi at ffffffff8006688a
#2 [ffff81012fe40f50] nmi at ffffffff80065eef
[exception RIP: mwait_idle+55]
RIP: ffffffff800574a8 RSP: ffff81012fe39ef0 RFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffff80057472 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8030b718
RBP: 0000000000000003 R8: ffff81012fe38000 R9: 000000000000003d
R10: ffff81012fe60008 R11: 0000000000000080 R12: 00000000000000ff
R13: ffffffff803d4280 R14: 0000000000000300 R15: ffffffff803f62c0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#3 [ffff81012fe39ef0] mwait_idle at ffffffff800574a8
#4 [ffff81012fe39ef0] cpu_idle at ffffffff80049624
please check coredumps in http://pepek.sde.cz/~jirka/nmi-lockup/ jirka |
Description of problem: NMI Watchdog detected LOCKUP, CPU=0, registers: CPU 0 Modules linked in: netconsole nfs lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) nfsd exportfs lockd nfs_acl i2c_dev i2c_core ipmi_devintf ipmi_si ipmi_msghandler sunrpc md5 ipv6 ide_dump scsi_dump diskdump zlib_deflate dm_round_robin dm_multipath button battery ac k8_edac edac_mc pciehp e1000 bonding(U) sata_nv libata sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptscsi mptbase usb_storage uhci_hcd ohci_hcd ehci_hcd sd_mod scsi_mod Pid: 20459, comm: ip.sh Not tainted 2.6.9-67.0.15.ELlargesmp RIP: 0010:[<ffffffff80310027>] <ffffffff80310027>{.text.lock.spinlock+5} RSP: 0018:00000100c1309cf8 EFLAGS: 00000082 RAX: 0000000000000010 RBX: 00000101f8ffc5a8 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 00000000000000d0 RDI: 00000101f8ffc5a8 RBP: 00000101f8fdc000 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000000d0 R11: 00000104eb6c6a00 R12: 00000101f8ffc548 R13: 00000101f8ffc340 R14: 00000000000000d0 R15: 00000104eb6c6a00 FS: 0000002a95aa86e0(0000) GS:ffffffff80506880(0000) knlGS:00000000f7e17b80 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000552abb4c68 CR3: 0000000000101000 CR4: 00000000000006e0 Process ip.sh (pid: 20459, threadinfo 00000100c1308000, task 0000010ff7ab97f0) Stack: 0000000000000010 ffffffff80162a24 00000000000000d0 00000000000000d0 00000101f8ffc340 00000000fffffff4 ffffffff801a406e 0000000000000000 00000104eb6c6a00 ffffffff80162963 Call Trace:<ffffffff80162a24>{cache_alloc_refill+96} <ffffffff801a406e>{load_elf_binary+0} <ffffffff80162963>{kmem_cache_alloc+90} <ffffffff801a4108>{load_elf_binary+154} <ffffffff8015d846>{generic_file_aio_read+48} <ffffffff8017a635>{do_sync_read+178} <ffffffff801a406e>{load_elf_binary+0} <ffffffff80185178>{search_binary_handler+210} <ffffffff801854ad>{do_execve+398} <ffffffff8011026a>{system_call+126} <ffffffff8010ee44>{sys_execve+52} <ffffffff8011068e>{stub_execve+106} Version-Release number of selected component (if applicable): 2.6.9-67.0.15.ELlargesmp How reproducible: Variable; seems workload dependent. Steps to Reproduce: 1. Run e.g. sybase database application for a few days Actual results: Above NMI watchdog trace. Expected results: No lockup. Additional info: