Bug 460935 - NMI detected lockup in cache_alloc_refill
NMI detected lockup in cache_alloc_refill
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel (Show other bugs)
4.6.z
All Linux
urgent Severity urgent
: rc
: ---
Assigned To: Larry Woodman
Martin Jenner
:
Depends On:
Blocks: 461297
  Show dependency treegraph
 
Reported: 2008-09-02 14:03 EDT by Bryn M. Reeves
Modified: 2010-10-23 00:14 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-03-05 09:21:15 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Bryn M. Reeves 2008-09-02 14:03:15 EDT
Description of problem:
 NMI Watchdog detected LOCKUP, CPU=0, registers:
 CPU 0
 Modules linked in: netconsole nfs lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) nfsd exportfs lockd nfs_acl i2c_dev i2c_core ipmi_devintf ipmi_si ipmi_msghandler sunrpc md5 ipv6 ide_dump scsi_dump diskdump zlib_deflate dm_round_robin dm_multipath button battery ac k8_edac edac_mc pciehp e1000 bonding(U) sata_nv libata sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptscsi mptbase usb_storage uhci_hcd ohci_hcd ehci_hcd sd_mod scsi_mod
 Pid: 20459, comm: ip.sh Not tainted 2.6.9-67.0.15.ELlargesmp
 RIP: 0010:[<ffffffff80310027>] <ffffffff80310027>{.text.lock.spinlock+5}
 RSP: 0018:00000100c1309cf8  EFLAGS: 00000082
 RAX: 0000000000000010 RBX: 00000101f8ffc5a8 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: 00000000000000d0 RDI: 00000101f8ffc5a8
 RBP: 00000101f8fdc000 R08: 0000000000000000 R09: 0000000000000000
 R10: 00000000000000d0 R11: 00000104eb6c6a00 R12: 00000101f8ffc548
 R13: 00000101f8ffc340 R14: 00000000000000d0 R15: 00000104eb6c6a00
 FS:  0000002a95aa86e0(0000) GS:ffffffff80506880(0000) knlGS:00000000f7e17b80
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 000000552abb4c68 CR3: 0000000000101000 CR4: 00000000000006e0
 Process ip.sh (pid: 20459, threadinfo 00000100c1308000, task 0000010ff7ab97f0)
 Stack: 0000000000000010 ffffffff80162a24 00000000000000d0 00000000000000d0
        00000101f8ffc340 00000000fffffff4 ffffffff801a406e 0000000000000000
        00000104eb6c6a00 ffffffff80162963
 Call Trace:<ffffffff80162a24>{cache_alloc_refill+96} <ffffffff801a406e>{load_elf_binary+0}
        <ffffffff80162963>{kmem_cache_alloc+90} <ffffffff801a4108>{load_elf_binary+154}
        <ffffffff8015d846>{generic_file_aio_read+48} <ffffffff8017a635>{do_sync_read+178}
        <ffffffff801a406e>{load_elf_binary+0} <ffffffff80185178>{search_binary_handler+210}
        <ffffffff801854ad>{do_execve+398} <ffffffff8011026a>{system_call+126}
        <ffffffff8010ee44>{sys_execve+52} <ffffffff8011068e>{stub_execve+106}


Version-Release number of selected component (if applicable):
2.6.9-67.0.15.ELlargesmp

How reproducible:
Variable; seems workload dependent.

Steps to Reproduce:
1. Run e.g. sybase database application for a few days
  
Actual results:
Above NMI watchdog trace.

Expected results:
No lockup.

Additional info:
Comment 2 Larry Woodman 2008-09-02 15:31:56 EDT
Its not possible to determine what is happening without seeing what the other CPUs are doing when this NMI occurs.  Once we gat a core dump we should be able to determine that.

Larry Woodman
Comment 5 RHEL Product and Program Management 2008-09-03 09:00:27 EDT
Updating PM score.
Comment 6 Larry Woodman 2008-09-04 11:57:07 EDT
Please attach AltSysrq-M and /proc/meminfo outputs so I can also see exactly how the memory is distributed among the nodes of this system.

Larry Woodman
Comment 12 Larry Woodman 2008-11-11 10:42:08 EST
Whats the status with this issue?  I think we were waiting for a crash dump so I could see who was holding the cachep->spinlock.  Is the problem still happening???
  
Larry Woodman
Comment 13 Harald Klein 2008-11-11 11:03:01 EST
Larry,

Customer deactivated a collectl job accessing /proc slabinfo and did a service relocation on the cluster. The crash did not yet reappear since then. 

best regards,
Hari
Comment 20 Jiri Olsa 2010-09-12 06:47:08 EDT
hi,

looks like I hit the same issue.. I managed to get the vmcore.

While I'm figuring out where to place it for you, I'm attaching 'bt -a' crash
output. I'll let you once the vmcore is in public place.

thanks,
jirka

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-194.3.1.el5prerel4.2.0_70.61.0/vmlinux
    DUMPFILE: ./vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Sun Sep 12 11:59:13 2010
      UPTIME: 00:26:46
LOAD AVERAGE: 0.90, 1.10, 0.84
       TASKS: 168
    NODENAME: krava
     RELEASE: 2.6.18-194.3.1.el5prerel4.2.0_70.57.0
     VERSION: #1 SMP Tue Jul 6 15:21:33 EDT 2010
     MACHINE: x86_64  (2333 Mhz)
      MEMORY: 3.9 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff80309b60  (1 of 4)  [THREAD_INFO: ffffffff803fc000]
         CPU: 0
       STATE: TASK_RUNNING (NMI)

PID: 0      TASK: ffffffff80309b60  CPU: 0   COMMAND: "swapper"
 #0 [ffffffff8044bdc0] crash_kexec at ffffffff800aea87
 #1 [ffffffff8044be48] .text.lock.spinlock at ffffffff80065bff (via _spin_lock)
 #2 [ffffffff8044be80] die_nmi at ffffffff800662a5
 #3 [ffffffff8044bea0] nmi_watchdog_tick at ffffffff80066a0b
 #4 [ffffffff8044bef0] default_do_nmi at ffffffff80066629
 #5 [ffffffff8044bf40] do_nmi at ffffffff80066896
 #6 [ffffffff8044bf50] nmi at ffffffff80065eef
    [exception RIP: .text.lock.spinlock+5]
    RIP: ffffffff80065bff  RSP: ffffffff80448de8  RFLAGS: 00000086
    RAX: 0000000000000000  RBX: 000000000000001b  RCX: 0000000000000000
    RDX: 0000000000000036  RSI: ffff81012fe0b200  RDI: ffff81012ff22380
    RBP: ffff81012fe0b200   R8: ffff81012fd9f870   R9: 000000000000003c
    R10: ffff81012fe60038  R11: ffffc20000095a18  R12: ffff81012ff22340
    R13: 0000000000000000  R14: ffff81012fdfa4c0  R15: 00000000000000a8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #7 [ffffffff80448de8] .text.lock.spinlock at ffffffff80065bff (via _spin_lock)
 #8 [ffffffff80448de8] drain_array at ffffffff800dcd21
 #9 [ffffffff80448e18] kmem_cache_free at ffffffff80007655
#10 [ffffffff80448e58] e1000_put_txbuf at ffffffff88236c8c
#11 [ffffffff80448e68] e1000_poll at ffffffff88239c77
#12 [ffffffff80448ed8] __kc_adapter_clean at ffffffff882352d5
#13 [ffffffff80448ef8] net_rx_action at ffffffff8000c85b
#14 [ffffffff80448f38] __do_softirq at ffffffff800123da
#15 [ffffffff80448f68] call_softirq at ffffffff8005f2fc
#16 [ffffffff80448f80] do_softirq at ffffffff8006db8e
#17 [ffffffff80448f90] do_IRQ at ffffffff8006da16
--- <IRQ stack> ---
#18 [ffffffff803fdee8] ret_from_intr at ffffffff8005e615
    [exception RIP: mwait_idle+55]
    RIP: ffffffff800574a8  RSP: ffffffff803fdf90  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 0000000000090000  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff8030b718
    RBP: ffffffff80309d50   R8: ffffffff803fc000   R9: 000000000000003c
    R10: ffff81012fe60038  R11: 0000000000000286  R12: 00000000001e60ac
    R13: 00000185601fb815  R14: ffff8100cc55b040  R15: ffffffff80309b60
    ORIG_RAX: ffffffffffffff7c  CS: 0010  SS: 0018
#19 [ffffffff803fdf90] cpu_idle at ffffffff80049624

PID: 24377  TASK: ffff8100c80e87e0  CPU: 1   COMMAND: "tcpreplay"
 #0 [ffff81012ff39f20] crash_nmi_callback at ffffffff8007bf27
 #1 [ffff81012ff39f40] do_nmi at ffffffff8006688a
 #2 [ffff81012ff39f50] nmi at ffffffff80065eef
    [exception RIP: cache_alloc_refill+213]
    RIP: ffffffff8005c539  RSP: ffff810092995b48  RFLAGS: 00000087
    RAX: ffff81012ff22340  RBX: 000000000000001a  RCX: 0000000000000000
    RDX: ffff8100cd561000  RSI: ffff81012ff22340  RDI: ffff8100cdb22040
    RBP: ffff8100cdb22040   R8: ffff81012fe0b000   R9: ffff81012f65e000
    R10: 0000000000000000  R11: 0000000000000246  R12: ffff81012fe0b000
    R13: ffff81012ff22340  R14: 0000000000000001  R15: ffff81012fdfa4c0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff810092995b48] cache_alloc_refill at ffffffff8005c539
 #4 [ffff810092995b80] kmem_cache_alloc at ffffffff8000abf4
 #5 [ffff810092995ba0] __alloc_skb at ffffffff8002e1b2
 #6 [ffff810092995be0] sock_alloc_send_pskb at ffffffff80228b0b
 #7 [ffff810092995bf0] __qdisc_run at ffffffff8024023c
 #8 [ffff810092995c50] packet_sendmsg at ffffffff802812f8
 #9 [ffff810092995cb0] sock_sendmsg at ffffffff8005551f
#10 [ffff810092995e50] sys_sendto at ffffffff802276be
#11 [ffff810092995e70] kthread_create at ffffffff800a1ac0
#12 [ffff810092995ea8] thread_return at ffffffff80063ff8
#13 [ffff810092995f80] system_call at ffffffff8005e116
    RIP: 00000036834d4965  RSP: 00007fffd6b5da78  RFLAGS: 00010207
    RAX: 000000000000002c  RBX: ffffffff8005e116  RCX: 0000000000000075
    RDX: 0000000000000069  RSI: 000000001669db92  RDI: 0000000000000003
    RBP: 000000001669a2f0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 000000001669db92
    R13: 0000000000000003  R14: 000000001669db92  R15: 0000000000000069
    ORIG_RAX: 000000000000002c  CS: 0033  SS: 002b

PID: 0      TASK: ffff81012ff23080  CPU: 2   COMMAND: "swapper"
 #0 [ffff81012ff6ef20] crash_nmi_callback at ffffffff8007bf27
 #1 [ffff81012ff6ef40] do_nmi at ffffffff8006688a
 #2 [ffff81012ff6ef50] nmi at ffffffff80065eef
    [exception RIP: mwait_idle+55]
    RIP: ffffffff800574a8  RSP: ffff81012ff67ef0  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffffffff80057472  RCX: 0000000000000000
   RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff8030b718
    RBP: 0000000000000002   R8: ffff81012ff66000   R9: 000000000000003c
    R10: ffff81012fe60038  R11: 0000000000000286  R12: 00000000000000ff
    R13: ffffffff803d4180  R14: 0000000000000200  R15: ffffffff803f62a0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff81012ff67ef0] mwait_idle at ffffffff800574a8
 #4 [ffff81012ff67ef0] cpu_idle at ffffffff80049624

PID: 0      TASK: ffff81012fe17100  CPU: 3   COMMAND: "swapper"
 #0 [ffff81012fe40f20] crash_nmi_callback at ffffffff8007bf27
 #1 [ffff81012fe40f40] do_nmi at ffffffff8006688a
 #2 [ffff81012fe40f50] nmi at ffffffff80065eef
    [exception RIP: mwait_idle+55]
    RIP: ffffffff800574a8  RSP: ffff81012fe39ef0  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffffffff80057472  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff8030b718
    RBP: 0000000000000003   R8: ffff81012fe38000   R9: 000000000000003d
    R10: ffff81012fe60008  R11: 0000000000000080  R12: 00000000000000ff
    R13: ffffffff803d4280  R14: 0000000000000300  R15: ffffffff803f62c0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff81012fe39ef0] mwait_idle at ffffffff800574a8
 #4 [ffff81012fe39ef0] cpu_idle at ffffffff80049624
Comment 21 Jiri Olsa 2010-09-14 04:31:14 EDT
please check coredumps in
http://pepek.sde.cz/~jirka/nmi-lockup/

jirka

Note You need to log in before you can comment on or make changes to this bug.