Bug 591285
Summary: | System is crashing at [<ffffffff80046a1a>] try_to_wake_up+0x414/0x481 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Flavio Leitner <fleitner> |
Component: | kernel | Assignee: | Dave Anderson <anderson> |
Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.4 | CC: | esammons, tao |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2010-06-16 14:25:24 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Flavio Leitner
2010-05-11 19:43:46 UTC
First core is at megatron.gsslab.rdu.redhat.com:/cores/20100414134034 These are my findings in the first vmcore: Crash point: RIP [<ffffffff80046a1a>] try_to_wake_up+0x414/0x481 Disass: /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sched.c: 1579 0xffffffff80046a16 <try_to_wake_up+0x410>: mov 0x50(%r15),%rdi >0xffffffff80046a1a <try_to_wake_up+0x414>: mov 0x28(%rdi),%eax 0xffffffff80046a1d <try_to_wake_up+0x417>: cmp %eax,0x28(%r13) 0xffffffff80046a21 <try_to_wake_up+0x41b>: jge 0xffffffff80046a2f <try_to_wake_up+0x429> /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sched.c: 1580 C code: 1578 if (!sync || cpu != this_cpu) { >1579 if (TASK_PREEMPTS_CURR(p, rq)) 1580 resched_task(rq->curr); 1581 } 162 #define TASK_PREEMPTS_CURR(p, rq) \ 163 ((p)->prio < (rq)->curr->prio) ^^^^^^^^^^^^^^^^^ struct rq { [0x50] struct task_struct *curr; ... struct task_struct { [0x28] int prio; ... working back on line 'mov 0x28(%rdi),%eax' RDI: ffff8108bfe98100 crash> task_struct ffff8108bfe98100 struct task_struct struct: seek error: kernel virtual address: ffff8108bfe98100 type: "gdb_readmem_callback" Cannot access memory at address 0xffff8108bfe98100 it's invalid, so working back on line: 'mov 0x50(%r15),%rdi' R15: ffff810009081680 crash> struct rq.curr ffff810009081680 curr = 0xffff81083fe98100, Well, curr = 0xffff81083fe98100 is different from RDI: ffff8108bfe98100 and it has just one bit wrong: crash> eval 0xffff81083fe98100 ^ 0xffff8108bfe98100 hexadecimal: 80000000 (2GB) decimal: 2147483648 octal: 20000000000 binary: 0000000000000000000000000000000010000000000000000000000000000000 Ok, so what would happen if we use the correct value based on ->curr: crash> task_struct.prio 0xffff81083fe98100 prio = 0x8c, The other task (p) is on %r13: 0xffff810563cc80c0 and it is correct: crash> task_struct.prio ffff810563cc80c0 prio = 0x73, crash> task_struct.comm ffff810563cc80c0 comm = "ssh-broker-cli\000", crash> task_struct.pid ffff810563cc80c0 pid = 0x6641, <--- 26177d What happened? The CPU was executing these two lines: 0xffffffff80046a16 <try_to_wake_up+0x410>: mov 0x50(%r15),%rdi >0xffffffff80046a1a <try_to_wake_up+0x414>: mov 0x28(%rdi),%eax In the first line it was trying to access (struct rq *)(%r15)->curr which is a (task_struct *) pointer. This pointer is stored in %rdi register. Later, in the second line, the CPU tries to get the (task_struct *)->prio value, but the task_struct pointer was invalid. Interesting that the pointer at %r15 (struct rq) is correct and its has ->curr pointing to 0xffff81083fe98100 which didn't match with the one used on the second line. They differ by one bit only. The other tasks running seems unrelated to my eyes. crash> ps | grep ffff8108bfe98100 <--- bad pointer crash> ps | grep ffff81083fe98100 > 0 1 15 ffff81083fe98100 RU 0.0 0 0 [swapper] crash> so if the bad pointer was a real task_struct before, it has exited. The struct rq lock is correctly holded. Searching the entire image (partial dump) didn't find any other occurrence of the bad pointer besides in the crashed task. This is the slab cache for the task_struct CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE ffff81011ce87280 task_struct 1888 1731 1840 920 4k and it didn't include references to that pointer either. So, either the partial dump is complicating things and I can't find any record of that task at all or this could be a hardware problem as that address in the memory is okay and in the register it has one bit flipped. fbl Second vmcore available at megatron.gsslab.rdu.redhat.com:/cores/20100510101329
Same crash point, same findings. However, with another CPU this time.
The same bit flipped.
crash> struct rq.curr ffff8100090a3fa0
curr = 0xffff81083ffd8100,
crash> eval 0xffff81083ffd8100 ^ 0xffff8108bffd8100
hexadecimal: 80000000 (2GB)
decimal: 2147483648
octal: 20000000000
binary: 0000000000000000000000000000000010000000000000000000000000000000
crash> ps |grep ffff8108bffd8100
crash> ps |grep ffff81083ffd8100
> 0 1 19 ffff81083ffd8100 RU 0.0 0 0 [swapper]
crash>
The system was fully verified and no errors were found.
Another thing called my attention is the fact that the bad pointer is on
the stack trace of the crashed task:
#3 [ffff81015061bbb0] error_exit at ffffffff8005dde9
[exception RIP: try_to_wake_up+0x414]
RIP: ffffffff80047005 RSP: ffff81015061bc68 RFLAGS: 00010046
RAX: 0000000000000080 RBX: 0000bec5fcd7dc6c RCX: 00000000000003e8
RDX: ffff8100090a5058 RSI: ffff8100090a5058 RDI: ffff8108bffd8100
RBP: ffff81015061bcf8 R8: 0000000000000080 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000080 R12: 000000000000000e
R13: ffff810194214080 R14: ffff8100090a5420 R15: ffff8100090a3fa0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
ffff81015061bbb8: ffff8100090a3fa0 ffff8100090a5420
ffff81015061bbc8: ffff810194214080 000000000000000e
ffff81015061bbd8: ffff81015061bcf8 0000bec5fcd7dc6c
ffff81015061bbe8: 0000000000000080 0000000000000000
ffff81015061bbf8: 0000000000000001 0000000000000080
ffff81015061bc08: 0000000000000080 00000000000003e8
ffff81015061bc18: ffff8100090a5058 ffff8100090a5058
ffff81015061bc28: ffff8108bffd8100 ffffffffffffffff
^^^^^^^^^^^^^^^^
That indicates the pointer was real another time and due to the
slabcache alignment, it could be possible to have address with a
difference of just a single bit, so I said that it didn't look
like a HW problem anymore. However, after that pointer in the stack
there is pointer to the crashed instruction and then all registers,
so I guess those were the crash context saved for kdump or something.
I need some help from engineering to find the root cause.
fbl
|