Bug 245159
Summary: | [RHEL5 RT] Oops - Null pointer, wakeup_next_waiter | | |
---|---|---|---
Product: | Red Hat Enterprise MRG | Reporter: | Jeff Burke <jburke> |
Component: | realtime-kernel | Assignee: | Steven Rostedt <srostedt> |
Status: | CLOSED WONTFIX | QA Contact: | |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 1.0 | CC: | bhu, duck, epollard, lgoncalv, williams |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=194400 | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-07-07 13:29:37 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Jeff Burke
2007-06-21 12:39:28 UTC
I spent a ton of time analyzing this bug, and unfortunately I'm thinking this may not be solvable without finding a reproducible method. Here's where I'm at...

We took the oops at the BUG_ON in:

        static inline struct rt_mutex_waiter *
        rt_mutex_top_waiter(struct rt_mutex *lock)
        {
                struct rt_mutex_waiter *w;

                w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
                                      list_entry);
                BUG_ON(w->lock != lock);

                return w;
        }

But the BUG_ON test itself wasn't what took the oops. It was the bad pointer dereference of w->lock inside the BUG_ON, which means we got a bogus pointer for w from plist_first_entry.

The faulting address was 0x40 (64 decimal), but using Arnaldo's pahole program I see that the lock member is at offset 88 decimal. That means the pointer returned to us for w is -24 (64 - 88 = -24), which matches the disassembly:

        0xffffffff802ab0e9 <wakeup_next_waiter+53>: cmp %r12,0x58(%r13)

        0x58 == 88 dec
        R13: ffffffffffffffe8 == -24

Writing a simple program to see how this can happen:

        int main(int argc, char **argv)
        {
                struct rt_mutex *lock = NULL;
                struct rt_mutex_waiter *w;

                lock = malloc(sizeof(*lock));
                memset(lock, 0, sizeof(*lock));

                w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
                                      list_entry);
                printf("w=%p\n", w);

                return 0;
        }

Returns:

        w=0xffffffffffffffe8

So somehow the plist in the lock became NULL. One thought was a bad initialization, or a race with per_cpu structures, since this is a per-CPU lock. From __cache_alloc:

        slab_irq_restore(save_flags, this_cpu);

where:

        # define slab_irq_restore(flags, cpu) \
                do { slab_irq_enable(cpu); (void) (flags); } while (0)

and:

        # define slab_irq_enable(cpu)  put_cpu_var_locked(slab_irq_locks, cpu)

where we release a per-CPU lock via put_cpu_var_locked. But this didn't make sense either. This happened on bootup, though, and that made me think this could be a case where we entered the rt_spin_lock_slowunlock path before initializing the plist.
The plist is initialized the first time there is contention, and if we get into the slowunlock path without the lock having been initialized, we will crash. So I thought there might be a bug in the initialization of the lists, but I can't find a way to get into the slowpath without having the lists initialized first.

Finally, my guess is that something corrupted the lock. It could simply touch the "owner" field; then on unlock the fast path fails and we go into the slowpath without ever having had the lists initialized. It could simply be a case of some bad memory that hit us once. But this machine is used for several tests, so that doesn't seem to be the case (though I'm still not ruling it out). Or it could be something unrelated that wrote over this memory, causing the bug. In that case, we cannot debug this further with the information at hand.

So I am going to move this to NEED_INFO, since we are stuck at the moment until we can reproduce this again.

Opened a ticket with Engineering Operations to have the memory on that system verified. https://engineering.redhat.com/rt3/Ticket/Display.html?id=13589

Jeff,
Running memtest on the box halts it. With 2GB of RAM in the box (down from the original 6GB) memtest works.

Konrad,
OK, it sounds like there is an issue with the hardware. Are you planning on replacing that memory, or are you going to leave that system at 2GB?

Jeff,
The plan is to get a new planar and memory.

Put into NEEDINFO state till the hardware is fixed.

Konrad,
Any update on this issue? Have the parts for ibm-lewis-01.lab.boston.redhat.com come in?
Thanks,
Jeff

I received new processors. No sign of the memory, but then the tests were inconclusive. I am going to aim to run the diagnostic BIOS test this week by reserving the machine, but if I don't get to it and you have the time, please go ahead and hit 'F2' during bootup - that will drop you into the BIOS diagnostic code.

Where are we on this bug?

Clark,
Konrad is no longer with Red Hat (as an IBM on-site).
I am going to add the new X server IBM on-site contact. This blade is unreliable and we are attempting to get it repaired/replaced. My apologies for this taking so long. I didn't know there was a bug waiting on it.

Good deal. We'll keep this bug open until we get a h/w change, then see if the problem is resolved.
Thanks,
Clark

Has the hardware been updated? Should we close this bug?

Clark,
Has this bug been seen on a working box? If this failure occurs only on faulty hardware, then we should close this bug. Please test on another box and see if we can still reproduce it.

I think we should wait until Ed gets that system fixed before we close this issue. Although it has only been seen when booting on this system, it still may be a "quirk" type issue in the RT kernel. But that seems unlikely at this point.

Ed Pollard,
Do you happen to have a status on this system?
Thanks,
Jeff

That blade is unreliable, unfortunately. I'm still trying to get a replacement for it. Now that the IBM systems are mostly back up and running in the new lab, I will put more attention into this.

An update from Ed... The system isn't available due to hardware issues. He has an open call with IBM to get it repaired.

Any update on h/w for this system?