Description of problem:

When we started testing the 2.6.24-30.el5rt kernel we found we were getting a number of failures due to pthread_mutex_unlock returning EPERM. These are PRIO_INHERIT mutexes. Roland Westrelin did further investigation and found that a futex lock pi syscall fails with EFAULT. He states: "My understanding of the libc code is that this return code is ignored. So I guess the user code proceeds as if it had successfully locked the mutex and later when it tries to unlock it, the user app gets the EPERM."

One of the main fixes in the -30 kernel was for:
https://bugzilla.redhat.com/show_bug.cgi?id=400541
so that might be related.

There has been some discussion of this on the mailing list:
https://www.redhat.com/mailman/private/rhel-rt-sun/2008-May/msg00012.html

Version-Release number of selected component (if applicable):
2.6.24-30.el5rt
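For reference, a minimal sketch (not the Sun reproducer; names and structure here are illustrative only, compile with gcc -pthread) of the kind of PRIO_INHERIT mutex usage being described, checking the unlock return code the way the failing code path would observe the EPERM:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&m, &attr);

    /* the kernel PI path (FUTEX_LOCK_PI) is only taken under contention;
       an uncontended lock stays in the user space fast path */
    rc = pthread_mutex_lock(&m);
    printf("lock:   %s\n", rc ? strerror(rc) : "ok");

    rc = pthread_mutex_unlock(&m);   /* the reported failure shows up here as EPERM */
    printf("unlock: %s\n", rc ? strerror(rc) : "ok");
    return 0;
}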
Loaded the reproducer sent to us in email from Sun onto rhelrt-3.farm.hsv.redhat.com (Dual 2.0GHz Opterons w/12GB RAM) as well as the RT JVM, and ran on a 2.6.24.7-65 trace kernel. The reproducer was run using the attached run_test.sh and produced a series of the following tracebacks on the serial console:

WARNING: at kernel/futex.c:543 lookup_pi_state()
Pid: 2486, comm: java Not tainted 2.6.24.7-65.el5rttrace #1

Call Trace:
 [<ffffffff8105ee5c>] futex_lock_pi+0x35d/0x9b0
 [<ffffffff8108beee>] ? ftrace_now+0x9/0xb
 [<ffffffff8105f4c8>] ? do_futex+0x19/0xb47
 [<ffffffff810605ae>] ? compat_sys_futex+0xf2/0x110
 [<ffffffff8105f4c8>] ? do_futex+0x19/0xb47
 [<ffffffff8105ffd6>] do_futex+0xb27/0xb47
 [<ffffffff810604d2>] ? compat_sys_futex+0x16/0x110
 [<ffffffff81028e29>] ? cstar_do_call+0x8c/0x148
 [<ffffffff810604d2>] ? compat_sys_futex+0x16/0x110
 [<ffffffff8100c224>] ? ftrace_call+0x5/0x2b
 [<ffffffff810605ae>] compat_sys_futex+0xf2/0x110
 [<ffffffff81028df2>] ? cstar_do_call+0x55/0x148
 [<ffffffff81028e29>] cstar_do_call+0x8c/0x148
---------------------------
| preempt count: 00000000 ]
| 0-level deep critical section nesting:
----------------------------------------
Created attachment 308695 [details] script to run Sun Java testcase on rhelrt-3.farm.hsv.redhat.com
Created attachment 308696 [details] output from /debug/tracing/trace following test run by run_test.sh

This is the output of /debug/tracing/trace following a test run on rhelrt-3 using run_test.sh on a 2.6.24.7-65.el5rt trace kernel.
Clark, could you please create an attachment with the latency-trace output of that very same test? I guess the gurus out there prefer the latency_trace format. I learned that the hard way :)
Created attachment 308719 [details] Latency trace output from /debug/tracing/latency_trace for test run

This is the output from /debug/tracing/latency_trace for a test run of the reproducer (same test run as for the attached "trace" output).
David,

Am I correct in presuming you only see this on 64-bit systems? The reason I ask is that the stack trace is going through the compat layer, meaning it's a 32-bit app running on a 64-bit kernel. Or do you see this on 32-bit kernels too?

Clark
I'd be interested in a strace output which shows the EFAULT return.
Thomas,

I am working on it. Meanwhile, strace gives some interesting info:

...
[ Process PID=21729 runs in 64 bit mode. ]
[ Process PID=21731 runs in 32 bit mode. ]
[ Process PID=21729 runs in 64 bit mode. ]
[ Process PID=21731 runs in 32 bit mode. ]
[ Process PID=21729 runs in 64 bit mode. ]
[ Process PID=21731 runs in 32 bit mode. ]
[ Process PID=21729 runs in 64 bit mode. ]
[ Process PID=21731 runs in 32 bit mode. ]
[ Process PID=21729 runs in 64 bit mode. ]
[ Process PID=21731 runs in 32 bit mode. ]
...
Created attachment 308757 [details] Strace from the java reproducer thread that receives EPERM

This is one of the several threads created by the test. I would like to highlight this detail:

...
futex(0x4813406c, FUTEX_WAKE, 2147483647) = 0
...
futex(0xffc5e838, FUTEX_UNLOCK_PI, 0) = -1 EPERM (Operation not permitted)
...

This is the only reference to EPERM in all the logs. Interestingly enough, it is trying to unlock a lock apparently owned by another thread. I can try to gather more logs if you need better strace info.
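For illustration, a tiny standalone sketch (assuming plain syscall(2) access to the futex op; not the test's code) of why an unlock of a lock the caller does not own fails: FUTEX_UNLOCK_PI checks that the TID stored in the futex word matches the caller and returns EPERM otherwise. Here the word is 0, so nobody owns it:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    int futex_word = 0;   /* no owner TID stored in the futex word */

    long ret = syscall(SYS_futex, &futex_word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
    if (ret < 0)
        printf("FUTEX_UNLOCK_PI: %s\n", strerror(errno));   /* expect EPERM */
    return 0;
}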
Clark: this happens on 32-bit and 64-bit kernels. I'm currently reproducing with the -65 32-bit kernel.

Thomas: I'm trying to generate strace data for a failure. I will upload once I have it. Meanwhile I'm uploading the strace log for a successful run, as I noticed that there was an EPERM error in there too, which seemed odd - but I have no idea from where in the VM execution it occurs, and it did not trigger our fatal error handler, so I assume it's a low-level unlock, not a direct pthread_mutex_unlock.

Luis: As per the mail on the mailing list, the issue is the EFAULT on the original lock more than the (expected) EPERM on the subsequent unlock.
Created attachment 308759 [details] strace output for successful run showing a curious EPERM on FUTEX_UNLOCK_PI
The strace log I just attached is showing the same issue that Luis previously reported.
The EPERM really looks like some bogus unlock attempt. Can you find the corresponding lock operation when you strace with the -f option ?
Thomas,

Running the test under strace -f (instead of -ff) also shows that unlock returning EPERM and no corresponding lock operation.

Two other thoughts: that unlock happens at the very beginning of the test, while the hang seems to happen dozens of seconds after that. So it is not the unbalanced unlock operation that causes the hang. Maybe the other threads are waiting for the lock that was not unlocked (picturing that an unbalanced lock exists along with the unbalanced unlock).
I think the EPERM in the strace log is a red herring. I ran on a 2.6.22 kernel from a different vendor, as well as earlier RH kernels, and that EPERM is always there.

It may also be that the original problem description here is incorrect. The EPERM failure and the EFAULT failure occurred under different conditions. I'm doing further testing to clarify the two scenarios.

One strange thing I did encounter on the new kernel is that strace -f causes the test application to fail due to a strange "Unknown error 513" that seems to originate from the low-level Process.waitFor() code. This doesn't occur on an earlier 2.6.24.4-30 kernel.
Created attachment 308893 [details] strace output for failed run where pthread_mutex_unlock returned EPERM

It seems there was a mis-communication between Roland and myself when I put the bug report together. The EPERM and the EFAULT have not actually been seen at the same time.

In normal mode our code uses pthread_mutex_lock/unlock, and that code encountered the EPERM failure on the unlock, as reported. Examination of the code in question shows that the unlock only occurs after a previous lock.

We also have a "fast mutex" mode in which we replace pthread_mutex use with a small API built directly on top of the futex syscall. In this mode Roland observed an EFAULT from the futex call when trying to perform a lock. Roland hypothesized that if a pthread_mutex_lock encountered the same EFAULT from the futex call, then the EFAULT would be ignored and so the code would think it had locked a mutex that it had not, and hence the later unlock would (correctly) fail with EPERM. But we have not actually observed the EFAULT occurring when using pthread_mutex_lock. The "fast mutex" code can be enabled using -XX:+LinuxUsesFastMutex on the "java" invocation.

My apologies for the confusion this has caused.

I've attached strace output from a failing test case. The interesting part is near the end. Search for:

stat64("/.automount/mackdrive.france/root/export/jrts/work/dh198349/rt-tests/ExecTest$1.class", {st_mode=S_IFREG|0644, st_size=419, ...}) = 0

as this is where the loading of the application class occurs. After this the log shows mainly thread creation and starting - which involves timer_create, sched_setscheduler etc. as well as the actual "clone" system call. The crash occurs while one thread is starting another.

If we look at the futex operations we see a few EAGAIN "errors":

mmap2(NULL, 528384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa507b000
futex(0x466e65b8, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x466e65b8, FUTEX_WAKE, 1) = 0
clone(child_stack=0xa50fb4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xa50fbbd8, {entry_number:6, base_addr:0xa50fbb90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xa50fbbd8) = 9500

Then the last syscall we see before the VM error handler has called abort is:

futex(0xa530c52c, FUTEX_WAIT, 3, {1152843388, 420991068}) = ? ERESTART_RESTARTBLOCK (To be restarted)
+++ killed by SIGABRT (core dumped) +++
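For context, a hypothetical sketch of what a "fast mutex" built directly on the PI futex syscall could look like - this is not Sun's implementation, only an illustration of the protocol in which an EFAULT from FUTEX_LOCK_PI would surface directly to the caller:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

static inline long futex_op(int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static int fast_lock(int *futex_word)
{
    int tid = syscall(SYS_gettid);

    /* fast path: 0 -> TID means we own it without entering the kernel */
    if (__sync_bool_compare_and_swap(futex_word, 0, tid))
        return 0;

    /* contended: let the kernel arbitrate; this is where the EFAULT was seen */
    if (futex_op(futex_word, FUTEX_LOCK_PI, 0) < 0)
        return errno;               /* e.g. EFAULT - must not be ignored */
    return 0;
}

static int fast_unlock(int *futex_word)
{
    int tid = syscall(SYS_gettid);

    /* fast path: only valid if the word is exactly our TID (no waiters bit) */
    if (__sync_bool_compare_and_swap(futex_word, tid, 0))
        return 0;

    /* waiters present, or we don't own it: EPERM if the TIDs mismatch */
    if (futex_op(futex_word, FUTEX_UNLOCK_PI, 0) < 0)
        return errno;
    return 0;
}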
> futex(0x466e65b8, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)

I have a hard time seeing how EAGAIN can be returned from futex(FUTEX_WAIT).

> futex(0xa530c52c, FUTEX_WAIT, 3, {1152843388, 420991068}) = ? ERESTART_RESTARTBLOCK (To be restarted)
> +++ killed by SIGABRT (core dumped) +++

Can you read the stack back trace from the core dump ?
The stacktrace only shows the call from the failed mutex-locking wrapper to our fatal error handler (which encounters a secondary error):

Program terminated with signal 6, Aborted.
#0  0xb7f8c410 in __kernel_vsyscall ()
(gdb) where
#0  0xb7f8c410 in __kernel_vsyscall ()
#1  0x465d1ba0 in raise () from /lib/libc.so.6
#2  0x465d34b1 in abort () from /lib/libc.so.6
#3  0xb776658c in os::abort (dump_core=1)
    at /usr/re/jrts2.1/release/kereon/src/os/linux/vm/os_linux.cpp:1995
#4  0xb78b0767 in VMError::report_and_die ()
    at /usr/re/jrts2.1/release/kereon/src/share/vm/utilities/vmError.cpp:698
#5  0xb753c5fb in report_assertion_failure (
    file_name=0xb7a149c0 "/usr/re/jrts2.1/release/kereon/src/os/linux/vm/mutex_linux.cpp",
    line_no=874,
    message=0xb7a154a0 "assert(((1<<1) & self->_internal_state) == 0,\"already locking a PIP mutex\")")
    at /usr/re/jrts2.1/release/kereon/src/share/vm/utilities/debug.cpp:212
#6  0xb7731e5c in os::PIPMutex::lock (this=0x8073320, self=0xa530f3f0,
    interruptible= {_interrupt = 0, _async = 0, _aie = 0})
    at /usr/re/jrts2.1/release/kereon/src/os/linux/vm/mutex_linux.cpp:874
#7  0xb772bfb9 in Mutex::wait_for_lock_implementation ()
    at /usr/re/jrts2.1/release/kereon/src/os/linux/vm/mutex_linux.inline.hpp:34
#8  0xb772b104 in Mutex::lock_without_safepoint_check (this=0x80732a0)
    at /usr/re/jrts2.1/release/kereon/src/share/vm/runtime/mutex.cpp:303
#9  0xb73d1c2b in MutexLockerEx::MutexLockerEx (this=0xa4df4060, mutex=0x80732a0,
    no_safepoint_check=1, spin=0)
    at /usr/re/jrts2.1/release/kereon/src/share/vm/runtime/mutexLocker.hpp:173
#10 0xb7876783 in Threads::print_on_error (st=0xa4df41e0, current=0xa530cba0,
    buf=0xb7f5f900 "SIGTERM", buflen=2000)
    at /usr/re/jrts2.1/release/kereon/src/share/vm/runtime/thread.cpp:4445
#11 0xb78afc9e in VMError::report (this=0xa4df4250, st=0xa4df41e0)
    at /usr/re/jrts2.1/release/kereon/src/share/vm/utilities/vmError.cpp:406
#12 0xb78b04f7 in VMError::report_and_die ()
    at /usr/re/jrts2.1/release/kereon/src/share/vm/utilities/vmError.cpp:644
#13 0xb753c66b in report_fatal (
    file_name=0xb7977360 "/usr/re/jrts2.1/release/kereon/src/os/linux/vm/os_linux.hpp",
    line_no=198, message=0xa4df42b0 "pthread_mutex_unlock: Operation not permitted")
    at /usr/re/jrts2.1/release/kereon/src/share/vm/utilities/debug.cpp:218
#14 0xb753c6d3 in report_fatal_vararg (
    file_name=0xb7977360 "/usr/re/jrts2.1/release/kereon/src/os/linux/vm/os_linux.hpp",
    line_no=198, format=0xb7977ace "pthread_mutex_unlock: %s")
    at /usr/re/jrts2.1/release/kereon/src/share/vm/utilities/debug.cpp:227
#15 0xb7531fc3 in os::Linux::mutex_unlock ()
    at /usr/re/jrts2.1/release/kereon/src/os/linux/vm/os_linux.hpp:198
#16 0xb7531e44 in os::Linux::LLMutex_slow::unlock (this=0xa530c1e8)
    at /usr/re/jrts2.1/release/kereon/src/os/linux/vm/os_linux.hpp:452
#17 0xb7531933 in os::Linux::OSMutex::unlock ()
    at /usr/re/jrts2.1/release/kereon/src/os/linux/vm/os_linux.hpp:573
#18 0xb7732ce4 in os::PIPMutex::lock (this=0x8073250, self=0xa530f3f0,
    interruptible= {_interrupt = 0, _async = 0, _aie = 0})
    at /usr/re/jrts2.1/release/kereon/src/os/linux/vm/mutex_linux.cpp:1289
#19 0xb772bfb9 in Mutex::wait_for_lock_implementation ()
    at /usr/re/jrts2.1/release/kereon/src/os/linux/vm/mutex_linux.inline.hpp:34
#20 0xb772b104 in Mutex::lock_without_safepoint_check (this=0x80731d0)
    at /usr/re/jrts2.1/release/kereon/src/share/vm/runtime/mutex.cpp:303
#21 0xb73d1c2b in MutexLockerEx::MutexLockerEx (this=0xa4df45e0, mutex=0x80731d0,
    no_safepoint_check=1, spin=0)
    at /usr/re/jrts2.1/release/kereon/src/share/vm/runtime/mutexLocker.hpp:173
#22 0xb764600c in JVM_SetThreadPriority (env=0xa530ccf4, jthread=0xa4df4690, prio=5)
    at /usr/re/jrts2.1/release/kereon/src/share/vm/prims/jvm.cpp:3263
#23 0xb2246ea9 in ?? ()
#24 0xa530ccf4 in ?? ()
#25 0xa4df4690 in ?? ()
#26 0x00000005 in ?? ()
#27 0xa4df465c in ?? ()
#28 0x00000000 in ?? ()
> #13 0xb753c66b in report_fatal (file_name=0xb7977360
>     "/usr/re/jrts2.1/release/kereon/src/os/linux/vm/os_linux.hpp",
>     line_no=198, message=0xa4df42b0 "pthread_mutex_unlock: Operation not permitted")

So that's the EPERM again. But where is the EFAULT happening ? I still see no connection at all.

Can you please print out the futex value and the TID of the thread which gets the EPERM in your error handler? That should give us a hint which thread is holding the mutex and which one is trying to release it.
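A rough sketch of the requested instrumentation, assuming the error handler can reach the pthread_mutex_t in question (treating the futex word as the first int of the structure is a glibc-internal layout assumption, for debugging only):

#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void dump_mutex_on_eperm(pthread_mutex_t *m)
{
    unsigned int futex_val = *(volatile unsigned int *)m;  /* glibc __data.__lock */
    unsigned int owner_tid = futex_val & 0x3fffffff;       /* FUTEX_TID_MASK */
    int self = (int)syscall(SYS_gettid);

    /* the top bits are FUTEX_WAITERS (0x80000000) and FUTEX_OWNER_DIED (0x40000000) */
    fprintf(stderr,
            "EPERM on unlock: futex word 0x%08x (owner TID %u), caller TID %d\n",
            futex_val, owner_tid, self);
}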
I just read the comments again and I have to admit that I'm pretty confused.

According to the original bug report there is an EFAULT returned from a pthread_mutex_lock which is ignored by glibc, and later on the pthread_mutex_unlock returns EPERM. I have not yet seen any indication of that EFAULT happening; just the EPERM shows up.

Now you have your own fast mutex implementation which seems to have the same EPERM issue.

I'm a bit confused about the failure description. What's the common failure mode of glibc and your implementation? What are the differences?

Thanks,
	tglx
Created attachment 308982 [details] debug patch to get more info about the EPERM issue

Luis, can you please apply the attached patch to the -rt kernel and report the kernel output for the EPERM case including information about the threads (TIDs) involved in that test.

Thanks,
	tglx
Patchin' and Compilin' :)
Created attachment 308997 [details] Dmesg output for b4482614 using Glibc futex API
Created attachment 309012 [details] Dmesg output for b4482614 using Glibc FastFutexes API
Thomas,

As per my update yesterday there is in fact no observed connection between the EPERM and the EFAULT - they are two different failure modes:

- The EPERM shows up only with pthread_mutex_unlock.
- The EFAULT has only been seen on a futex lock using our own "fast mutex" that uses futex directly.

It was hypothesized that if an EFAULT also occurred inside a pthread_mutex_lock then glibc would ignore that error and act as if the lock had been successful. That would then explain why the subsequent unlock gave EPERM. But this has not actually been observed.

Again my apologies for the confusion here.

David
No problem. The hypothesis of the EFAULT was the misleading part. If the EFAULT happened we'd see it in the strace outputs.

Luis ran my debug patch and what we see is:

futex_unlock_pi: uval 0x00000000, uad 00000000ffef72a8 TID 3194

That means something is trying to unlock a futex which is not locked, or the pointer is simply wrong. The address is not used anywhere else, but we do not see the lock operation as it is in the fast path which does not enter the kernel. It might be worthwhile for debugging to disable the user space fast path so we can see all futex operations in strace.

What worries me more is the lookup_pi_state back trace in Luis's dmesg outputs:

WARNING: at kernel/futex.c:547 lookup_pi_state()
Pid: 14109, comm: java Not tainted 2.6.24.7-65_dbg #1

This looks like memory corruption or some completely confused state. I'll come up with a debug patch for that as well.

Thanks,
	tglx
Created attachment 309042 [details] add debug output for lookup_pi_state

Luis, can you please apply that on top of the previous one and provide the output ?

Thanks,
	tglx
Created attachment 309095 [details] Dmesg output for b4482614 using Glibc futex API

Thomas, I have applied futex-debug2.patch and this is the result of the first test.
Created attachment 309124 [details] Dmesg output for b4482614 using FastFutexes API

Interestingly enough, the test failed but didn't hit the new debug code.
Thanks Luis,

lookup_pi: pid mismatch uval 0x800066f6 pid 26358 ctid 26365 keyw 0x83be000 keyp ffff81037d44ba00 keyo 3990 pio ffff81027e1e2140 piop 26363

The TID stored in the user space futex is 26358, but the pi state is owned by TID 26363.

/me twists brain :)
What's the condition for the test to be considered failed?
I'm unclear whom the previous question is directed at. :)

Our test program in the current context should only "fail" if it encounters the EPERM error and so terminates the VM. The failure mode for that test was that the test hung rather than completing. It could hang under two conditions:

a) due to some incorrect logic in Process.waitFor, or
b) because the exec'ed program produced too much output, resulting in the pipe filling and blocking the process.

As the main thread was waiting for the process to finish and the process was blocked on a pipe that nothing was reading from, the test hung.
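For what it's worth, condition (b) can be shown in plain C (the Java test goes through Process.waitFor(), but the underlying deadlock is the same); this is only an illustration of the mechanism, not the test itself:

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int pipefd[2];
    pipe(pipefd);

    pid_t pid = fork();
    if (pid == 0) {                         /* child: exec something verbose */
        dup2(pipefd[1], STDOUT_FILENO);
        close(pipefd[0]);
        close(pipefd[1]);
        execlp("sh", "sh", "-c", "yes | head -c 1000000", (char *)NULL);
        _exit(127);
    }

    close(pipefd[1]);
    /* BUG: nothing reads pipefd[0]; once the pipe (~64 KB) fills, the child
       side blocks in write() and never exits, so this waitpid() hangs forever */
    waitpid(pid, NULL, 0);
    return 0;
}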
The EPERM which shows with each run of the test program:

futex_unlock_pi: uval 0x00000000, uad 00000000ffef72a8 TID 3194

looks like memory corruption, which is either caused by a memset or a call to pthread_mutex_init() while the futex is held by a task which tries to release it later.

David, is there a chance that you can instrument this in your test code / java vm ?
Can you clarify exactly what you would like instrumented, please? I will try to run with such code in the morning.

The mutex in question is initialized when the VM is started and lasts as long as the VM does.

I'm suspicious of the fact we only see this with this test that involves fork/exec. Is there a possibility of some connection there?

David
> Can you clarify exactly what you would like instrumented please. I will try to
> run with such code in the morning.

Instrumentation of the pthread_mutex related calls.

> The mutex in question is initialized when the VM is started and lasts as long as
> the VM does.

Ah, so you know which particular mutex that is ?

> I'm suspicious of the fact we only see this with this test that involves
> fork/exec. Is there a possibility of some connection there?

Hmm, I think this one happens way before fork/exec happen.
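One possible shape for that instrumentation, sketched with hypothetical helper names: log every lock/unlock on the suspect mutex with the calling TID, the mutex address and the return code.

#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int traced_lock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    fprintf(stderr, "[tid %ld] lock   %p = %d\n",
            (long)syscall(SYS_gettid), (void *)m, rc);
    return rc;
}

static int traced_unlock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_unlock(m);
    fprintf(stderr, "[tid %ld] unlock %p = %d\n",
            (long)syscall(SYS_gettid), (void *)m, rc);
    return rc;
}

/* Given the memset/pthread_mutex_init suspicion above, any init or memset
   hitting the same mutex address would be worth logging the same way. */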
The EPERM which happens once per start is harmless.

glibc is built against 2.6.9 kernel headers, so __ASSUME_FUTEX_LOCK_PI is not set, as this was introduced in 2.6.18. pthread_mutex_init therefore does:

    case PTHREAD_PRIO_INHERIT << PTHREAD_MUTEXATTR_PROTOCOL_SHIFT:
#ifndef __ASSUME_FUTEX_LOCK_PI
      if (__builtin_expect (tpi_supported == 0, 0))
	{
	  int lock = 0;
	  INTERNAL_SYSCALL_DECL (err);
	  int ret = INTERNAL_SYSCALL (futex, err, 4, &lock, FUTEX_UNLOCK_PI, 0, 0);
	  assert (INTERNAL_SYSCALL_ERROR_P (ret, err));
	  tpi_supported = INTERNAL_SYSCALL_ERRNO (ret, err) == ENOSYS ? -1 : 1;
	}
      if (__builtin_expect (tpi_supported < 0, 0))
	return ENOTSUP;
#endif
      break;

which checks whether PI support is available in the kernel or not. The lock value is 0, so the EPERM is not a surprise.
The mutex always seems to be what we call the Safepoint_mutex, and it is used to protect various operations on threads. I'm trying to glean more information from the VM but for some reason today the bug refuses to reproduce on either of the machines I've been testing on.
The EFAULT return is really happening due to a missing fault fixup in the return path of futex_lock_pi, which later triggers the WARN_ON in lookup_pi_state. The fix for this problem is in mainline and in the latest -rt release.

David, Roland, thanks for the analysis and the reproducer.
Thomas,

I'm unclear whether this EFAULT relates to the EPERM we are seeing in pthread_mutex_unlock, or to the EFAULT we saw in the code that used futex directly?

Thanks for the clarification.

David
David,

Those are related. What happens is:

  T1, T2 are blocked on the futex.
  T3 releases the futex and T1 is the assigned pending owner.
  T2 acquires the rtmutex.
  A fork happens and maps the anon memory read-only for COW.
  Either T2 or T1 wants to fix up the user space variable and faults.

That leaves the internal state inconsistent. EFAULT is returned to user space, but user space has no way to fix that up. T2 thinks it got the futex and unlocks it, which leads to EPERM due to the mismatch of the user space variable.

The patch fixes the fault issue, which is the root cause of all of those problems.
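A rough user-space stress sketch of that window, under the assumption that hammering a contended PRIO_INHERIT mutex while another thread forks repeatedly raises the chance of hitting the COW write-protect phase during the kernel's futex-word fixup; it is not a deterministic reproducer:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t pi_mutex;

static void *contender(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&pi_mutex);
        if (pthread_mutex_unlock(&pi_mutex) == EPERM)
            abort();                     /* would match the reported failure */
    }
    return NULL;
}

static void *forker(void *arg)
{
    (void)arg;
    for (;;) {
        pid_t pid = fork();              /* fork write-protects anon pages for COW */
        if (pid == 0)
            _exit(0);
        if (pid > 0)
            waitpid(pid, NULL, 0);
    }
    return NULL;
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_t t[4];
    int i;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&pi_mutex, &attr);

    pthread_create(&t[0], NULL, forker, NULL);
    for (i = 1; i < 4; i++)
        pthread_create(&t[i], NULL, contender, NULL);
    pause();
    return 0;
}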
Okay - but why didn't the EFAULT show up in the strace output ?
Because I focussed on the first EPERM which is ok due to glibc probing the availability of PI futex support in the kernel.
Created attachment 310758 [details] gdb stacktraces for all threads in hung execution of ExecTest application

I tested the new -69 kernel today to see if this bug had been resolved. I did not reproduce it in 15 runs of the ExecTest application but the 16th execution hung. I've attached the gdb thread dump. There seem to be a number of threads stuck in fork-related malloc/free calls.
Hmm, it would have been interesting to see where the threads are stuck in the kernel. If you can reproduce it, please do

	echo t > /proc/sysrq-trigger

and provide the kernel output.
I did not reproduce this in another 200 runs, but I still have the hung process active. I tried the suggested command, and after a long delay nothing happened. I had set /proc/sys/kernel/sysrq to 1 first. Does this command send its output somewhere other than the current terminal?
Created attachment 310767 [details] kernel trace of hung java process

Discovered the output via dmesg. Attached is the info for the Java process and the perl process it was execing (shown as defunct by ps). Thanks.
Have you had a chance to use any kernel newer than -67? I believe Clark has made -69 available for your testing. These kernels have Thomas' fixes for these futex issues.
As per comment #43 we tested -69. The EPERM issue seems to have gone, but we did run into the hang discussed above.
Thanks, I missed that comment.
I cannot reproduce the hang now that we have the -72 kernel. I think this bug can be closed.
closing...
Need to reopen and set state to Modified.
Created attachment 313695 [details] testcase .jar file and wrapper scripts for reproducing bug

Unpack after installing the RT JVM, adjust the scripts run_the_java_thing.sh and do_the_java_thing.sh to point to the installation directory for the JVM, and then run the scripts.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0585.html