Description of problem:

The system crashed twice at the same point, with apparently the same symptoms.

---- oops ----
Unable to handle kernel paging request at ffff8108bfe98128
RIP: [<ffffffff80046a1a>] try_to_wake_up+0x414/0x481
PGD 8063 PUD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:29/0000:29:00.0/0000:2a:00.1/irq
CPU 22
Modules linked in: lock_dlm gfs2 dlm configfs bonding mptctl dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st shpchp qla2xxx ide_cd sg e1000e cdrom bnx2 pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci aacraid ata_piix libata mptsas scsi_transport_sas mptspi scsi_transport_spi mptfc mptscsih scsi_transport_fc mptbase cciss megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 7308, comm: sshg3 Not tainted 2.6.18-128.1.6.el5 #1
RIP: 0010:[<ffffffff80046a1a>]  [<ffffffff80046a1a>] try_to_wake_up+0x414/0x481
RSP: 0018:ffff8103b9dd1b78  EFLAGS: 00010046
RAX: 0000000000000080 RBX: 00016784bc1bbe3d RCX: 00000000000003e8
RDX: ffff810009082728 RSI: ffff810009082728 RDI: ffff8108bfe98100
RBP: ffff8103b9dd1c08 R08: 0000000000000080 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000016
R13: ffff810563cc80c0 R14: ffff810009082b00 R15: ffff810009081680
FS:  0000000040486940(0063) GS:ffff81083f891e40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8108bfe98128 CR3: 000000034350d000 CR4: 00000000000006e0
Process sshg3 (pid: 7308, threadinfo ffff8103b9dd0000, task ffff8102a058d080)
Stack:
 0000000000000000 0000000000000001 0000000f00000004 0000000000000001
 000000163f822240 00016784bc16ccec 000004d000000000 ffff81011ce5c440
 0000000000000246 00000000000004d0 0000000000001000 0000000000000096
Call Trace:
 [<ffffffff800888ec>] __wake_up_common+0x3e/0x68
 [<ffffffff8002df7c>] __wake_up+0x38/0x4f
 [<ffffffff80012bda>] sock_def_readable+0x34/0x5f
 [<ffffffff8004a17e>] unix_stream_sendmsg+0x281/0x346
 [<ffffffff800376a3>] do_sock_write+0xc4/0xce
 [<ffffffff8004719e>] sock_aio_write+0x4f/0x5e
 [<ffffffff80017d2d>] do_sync_write+0xc7/0x104
 [<ffffffff8009dba8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800165b1>] vfs_write+0xe1/0x174
 [<ffffffff80016e6b>] sys_write+0x45/0x6e
 [<ffffffff8005d116>] system_call+0x7e/0x83

Code: 8b 47 28 41 39 45 28 7d 0c e8 ab 34 04 00 eb 05 45 31 e4 eb
RIP  [<ffffffff80046a1a>] try_to_wake_up+0x414/0x481
 RSP <ffff8103b9dd1b78>

Version-Release number of selected component (if applicable):
2.6.18-164.6.1.el5

How reproducible:
A couple of times so far; unpredictable.

Steps to Reproduce:
Unknown.
First core is at megatron.gsslab.rdu.redhat.com:/cores/20100414134034

These are my findings in the first vmcore.

Crash point:
RIP  [<ffffffff80046a1a>] try_to_wake_up+0x414/0x481

Disassembly:
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sched.c: 1579
 0xffffffff80046a16 <try_to_wake_up+0x410>: mov  0x50(%r15),%rdi
>0xffffffff80046a1a <try_to_wake_up+0x414>: mov  0x28(%rdi),%eax
 0xffffffff80046a1d <try_to_wake_up+0x417>: cmp  %eax,0x28(%r13)
 0xffffffff80046a21 <try_to_wake_up+0x41b>: jge  0xffffffff80046a2f <try_to_wake_up+0x429>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sched.c: 1580

C code:
 1578       if (!sync || cpu != this_cpu) {
>1579               if (TASK_PREEMPTS_CURR(p, rq))
 1580                       resched_task(rq->curr);
 1581       }

  162 #define TASK_PREEMPTS_CURR(p, rq) \
  163         ((p)->prio < (rq)->curr->prio)
                            ^^^^^^^^^^^^^^^^^

struct rq {
    [0x50] struct task_struct *curr;
    ...

struct task_struct {
    [0x28] int prio;
    ...

Working back from the line 'mov 0x28(%rdi),%eax', with RDI: ffff8108bfe98100:

crash> task_struct ffff8108bfe98100
struct task_struct struct: seek error: kernel virtual address: ffff8108bfe98100  type: "gdb_readmem_callback"
Cannot access memory at address 0xffff8108bfe98100

It's invalid, so working back from the line 'mov 0x50(%r15),%rdi', with R15: ffff810009081680:

crash> struct rq.curr ffff810009081680
  curr = 0xffff81083fe98100,

Well, curr = 0xffff81083fe98100 is different from RDI: ffff8108bfe98100, and it has just one bit wrong:

crash> eval 0xffff81083fe98100 ^ 0xffff8108bfe98100
hexadecimal: 80000000  (2GB)
    decimal: 2147483648
      octal: 20000000000
     binary: 0000000000000000000000000000000010000000000000000000000000000000

OK, so what would happen if we used the correct value based on ->curr:

crash> task_struct.prio 0xffff81083fe98100
  prio = 0x8c,

The other task (p) is in %r13 (0xffff810563cc80c0) and it is correct:

crash> task_struct.prio ffff810563cc80c0
  prio = 0x73,
crash> task_struct.comm ffff810563cc80c0
  comm = "ssh-broker-cli\000",
crash> task_struct.pid ffff810563cc80c0
  pid = 0x6641,   <--- 26177d
What happened? The CPU was executing these two lines:

 0xffffffff80046a16 <try_to_wake_up+0x410>: mov  0x50(%r15),%rdi
>0xffffffff80046a1a <try_to_wake_up+0x414>: mov  0x28(%rdi),%eax

In the first line it was accessing (struct rq *)(%r15)->curr, which is a (task_struct *) pointer. This pointer is stored in the %rdi register. Then, in the second line, the CPU tries to read the (task_struct *)->prio value, but the task_struct pointer was invalid. Interestingly, the struct rq pointer in %r15 is correct and has ->curr pointing to 0xffff81083fe98100, which does not match the pointer used in the second line. They differ by one bit only.

The other running tasks seem unrelated to my eyes.

crash> ps | grep ffff8108bfe98100    <--- bad pointer
crash> ps | grep ffff81083fe98100
>     0      1  15  ffff81083fe98100  RU   0.0       0      0  [swapper]
crash>

So if the bad pointer was a real task_struct before, the task has exited. The struct rq lock is correctly held. Searching the entire image (partial dump) didn't find any other occurrence of the bad pointer besides in the crashed task.

This is the slab cache for the task_struct:

CACHE            NAME          OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff81011ce87280 task_struct      1888       1731   1840    920     4k

and it didn't include references to that pointer either.

So either the partial dump is complicating things and I simply can't find any record of that task, or this could be a hardware problem, since that address is okay in memory but has one bit flipped in the register.

fbl
Second vmcore available at megatron.gsslab.rdu.redhat.com:/cores/20100510101329

Same crash point, same findings, but on another CPU this time. The same bit flipped:

crash> struct rq.curr ffff8100090a3fa0
  curr = 0xffff81083ffd8100,
crash> eval 0xffff81083ffd8100 ^ 0xffff8108bffd8100
hexadecimal: 80000000  (2GB)
    decimal: 2147483648
      octal: 20000000000
     binary: 0000000000000000000000000000000010000000000000000000000000000000
crash> ps | grep ffff8108bffd8100
crash> ps | grep ffff81083ffd8100
>     0      1  19  ffff81083ffd8100  RU   0.0       0      0  [swapper]
crash>

The system was fully verified and no errors were found.

Another thing that caught my attention is the fact that the bad pointer appears in the stack of the crashed task:

 #3 [ffff81015061bbb0] error_exit at ffffffff8005dde9
    [exception RIP: try_to_wake_up+0x414]
    RIP: ffffffff80047005  RSP: ffff81015061bc68  RFLAGS: 00010046
    RAX: 0000000000000080  RBX: 0000bec5fcd7dc6c  RCX: 00000000000003e8
    RDX: ffff8100090a5058  RSI: ffff8100090a5058  RDI: ffff8108bffd8100
    RBP: ffff81015061bcf8   R8: 0000000000000080   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000080  R12: 000000000000000e
    R13: ffff810194214080  R14: ffff8100090a5420  R15: ffff8100090a3fa0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    ffff81015061bbb8: ffff8100090a3fa0 ffff8100090a5420
    ffff81015061bbc8: ffff810194214080 000000000000000e
    ffff81015061bbd8: ffff81015061bcf8 0000bec5fcd7dc6c
    ffff81015061bbe8: 0000000000000080 0000000000000000
    ffff81015061bbf8: 0000000000000001 0000000000000080
    ffff81015061bc08: 0000000000000080 00000000000003e8
    ffff81015061bc18: ffff8100090a5058 ffff8100090a5058
    ffff81015061bc28: ffff8108bffd8100 ffffffffffffffff
                      ^^^^^^^^^^^^^^^^

That would indicate the pointer was real at some other time, and given the slab cache alignment it would be possible for two real task_struct addresses to differ by just a single bit, so at that point I said it no longer looked like a HW problem.
However, right after that pointer on the stack there is a pointer to the crashed instruction and then all the registers, so I guess those are the crash context saved for kdump or something similar, rather than evidence of an earlier live use of the bad pointer.

I need some help from engineering to find the root cause.

fbl