Bug 204398
Summary: | occasional panic when running MEMORY2 test on rhel4u3 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Rick Hester <rick.hester> |
Component: | kernel | Assignee: | Will Woods <wwoods> |
Status: | CLOSED WORKSFORME | QA Contact: | Will Woods <wwoods> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.3 | CC: | alex_williamson, dchapman, hcp-admin, richardl |
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2008-10-14 19:50:54 UTC | Type: | --- |
Description
Rick Hester
2006-08-28 20:58:06 UTC
The cause of the crash is a branch through a null pointer, which is reported as a "kernel NULL pointer dereference." Because this happened while interrupts were disabled (PSR.i=0), the kernel also printed the "sleeping function called from invalid context" message, which is not itself fatal.

From the oops dump, interrupts are indeed disabled:

    psr : 0000101008122010 (PSR.i=0)

Branch registers often contain return addresses that hint at where we came from:

    b0 : 0000000000000002
    b6 : a0000001000af440
    b7 : a00000010000edd0

Since b0 = 0x2, a branch through b0 probably caused the crash. Branching through b0 is the typical "return from function".

From the kernel with symbols in kernel-debuginfo-2.6.9-34.EL.ia64.rpm:

    b6 = do_futex+128
    b7 = ia64_switch_to+176 (return point from calling load_switch_stack)

I think we're in the middle of a context switch. We've completed load_switch_stack, which restores various registers:

    r28 = pr = 000000000555669b
    r19 = fpsr = 0009804c8a70033f
    r21 = a000000100068500 <context_switch+1152> (return point after calling ia64_switch_to)

b0 should be the same as r21 (restored by load_switch_stack). Why isn't it?

r12 is the stack pointer. r2, r3, r14, and r15 are cursors that move through the struct switch_stack as we restore registers.
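The PSR.i reading quoted from the oops dump can be double-checked by hand. The following is only a sketch: the bit positions (PSR.ic at bit 13, PSR.i at bit 14) are assumptions based on the usual IA-64 PSR layout and are not stated anywhere in this report, and the helper name psr_bit is made up for illustration.

```python
# Sketch: decode two IA-64 PSR bits from the value in the oops dump.
# ASSUMPTION: PSR.ic is bit 13 and PSR.i is bit 14 (usual IA-64 layout);
# check the architecture manual before relying on this.
PSR_BITS = {"ic": 13, "i": 14}


def psr_bit(psr: int, name: str) -> int:
    """Return the value (0 or 1) of the named PSR bit."""
    return (psr >> PSR_BITS[name]) & 1


psr = 0x0000101008122010  # "psr" line from the oops dump

print("PSR.i  =", psr_bit(psr, "i"))
print("PSR.ic =", psr_bit(psr, "ic"))
```

Under those assumptions, a PSR.i value of 0 agrees with the "(PSR.i=0)" annotation in the dump.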
By code inspection, we can determine the final cursor values at the exit of load_switch_stack:

    r12 : e0000007d0687e00
    r2  : e0000007d0687df8
    r3  : e0000007d0687de8
    r14 : e0000007d0687da0
    r15 : e0000007d0687da8

Member offsets into struct switch_stack:

    0x1a0 r6
    0x1a8 r7
    0x1b0 b0
    0x1b8 b1
    0x1f0 ar_unat
    0x1f8 ar_rnat
    0x200 ar_bspstore
    0x208 pr

So the cursors point at these members:

    r2  = 0xe0000007d0687df8 = &switch_stack->pr
    r3  = 0xe0000007d0687de8 = &switch_stack->ar_rnat
    r14 = 0xe0000007d0687da0 = &switch_stack->b0
    r15 = 0xe0000007d0687da8 = &switch_stack->b1

Based on the final cursor values, we can compute the stack pointer (which is the address of the struct switch_stack) at the entry to load_switch_stack:

    (using r14) 0xe0000007d0687da0 - 0x1b0 = 0xe0000007d0687bf0
    (using r2)  0xe0000007d0687df8 - 0x208 = 0xe0000007d0687bf0
    (using r3)  0xe0000007d0687de8 - 0x1f8 = 0xe0000007d0687bf0

So at entry to load_switch_stack, sp = &switch_stack (= 0xe0000007d0687bf0).

load_switch_stack returns with "br.many b7" (b7 = 0xa00000010000edd0), so we should execute this code:

    0xa00000010000edd0 <ia64_switch_to+176>: [MMB] adds r12=528,r12
    0xa00000010000edd1 <ia64_switch_to+177>:       sync.i
    0xa00000010000edd2 <ia64_switch_to+178>:       br.ret.sptk.many b0

We increment the stack pointer (r12) by 528. Using the sp value computed above, the result should be:

    0xe0000007d0687bf0 + 528 = 0xe0000007d0687e00

That matches the r12 value in the oops dump, so we probably executed the code above. load_switch_stack should have restored b0 from r21.
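The cursor arithmetic above is easy to get wrong by hand, so here is a minimal sketch that repeats it mechanically. The offsets and register values are copied from this comment; the function and variable names are made up for illustration and do not come from the kernel source.

```python
# Offsets into struct switch_stack, as listed in the analysis above.
OFFSETS = {"b0": 0x1B0, "b1": 0x1B8, "ar_rnat": 0x1F8, "pr": 0x208}

# Final cursor values from the oops dump and the member each one points at.
CURSORS = {
    "r2": (0xE0000007D0687DF8, "pr"),
    "r3": (0xE0000007D0687DE8, "ar_rnat"),
    "r14": (0xE0000007D0687DA0, "b0"),
    "r15": (0xE0000007D0687DA8, "b1"),
}

R12_FROM_OOPS = 0xE0000007D0687E00
SWITCH_STACK_ADJUST = 528  # "adds r12=528,r12" at ia64_switch_to+176


def switch_stack_base(cursors, offsets):
    """Derive &switch_stack from every cursor and require that they agree."""
    bases = {reg: addr - offsets[member] for reg, (addr, member) in cursors.items()}
    distinct = set(bases.values())
    if len(distinct) != 1:
        raise ValueError(f"cursors disagree on the base address: {bases}")
    return distinct.pop()


base = switch_stack_base(CURSORS, OFFSETS)
sp_after_return = base + SWITCH_STACK_ADJUST

print("switch_stack at", hex(base))
print("r12 after adds ", hex(sp_after_return))
print("matches oops   ", sp_after_return == R12_FROM_OOPS)
```

All four cursors yield the same base address, and adding 528 reproduces the r12 value from the dump, which supports the conclusion that the code at ia64_switch_to+176 was executed.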
From the oops dump, b0 should have been 0xa000000100068500 (based on r21). So the "br.ret.sptk.many b0" above should have returned here:

    0xa000000100068500 <context_switch+1152>: [MII] nop.m 0x0
    0xa000000100068501 <context_switch+1153>:       mov.i ar.pfs=r53
    0xa000000100068502 <context_switch+1154>:       mov b0=r52
    0xa000000100068510 <context_switch+1168>: [MIB] nop.m 0x0
    0xa000000100068511 <context_switch+1169>:       adds r12=16,r12
    0xa000000100068512 <context_switch+1170>:       br.ret.sptk.many b0;;

Above, we restored a new b0 value from r52. This is a stacked register, so the value could be restored from the register backing store in memory. If that memory were corrupted, we could see this crash.

But we should have incremented r12 again by 16. That doesn't match the oops dump. But maybe I made an error in deducing its value.

We can't reproduce the problem on RHEL4 U4. U4 contains the following change that is not in U3:

    * Tue Mar 21 2006 Jason Baron <jbaron> [2.6.9-34.6]
    -ia64: Fix corrupt ar.bspstore (Prarit Bhargava) [177297]

ar.bspstore is the backing store pointer, which points to the register backing store. If this were corrupted, we could restore r52 from the wrong place, which would lead to a corrupted b0, which could cause this crash.

The changelog references Red Hat bugzilla 177297, which is private, so I can't read it. It would be interesting to know whether that bug mentions a crash similar to this one.

This defect can be closed. Testing with the U3 kernel that Bjorn had patched with the patch referred to above yielded no panics. Testing with the U4 kernel yielded no panics. Using the same tests and configs on an unpatched U3 kernel yielded panics.