Description of problem:
A thread in pthread_cond_wait() is being signalled (rapidly) by two other threads, but it is not being scheduled.

Version-Release number of selected component (if applicable):

How reproducible:
This is currently reproducible by running up the Ingres database and running a particular testcase supplied by an Ingres customer. We have not yet succeeded in creating a cut-down testcase. We are currently getting clearance from the customer to supply the testcase data, at which point we could guide you through reproducing the issue on Ingres.

Steps to Reproduce:
1. To be supplied.

Actual results:

Expected results:

Additional info:
Please see the attachment for the problem details: one description from the point of view of the symptoms seen in Ingres, one from the Linux point of view.
Created attachment 142001 [details] Ingres & Linux description of the problem
Are realtime priorities involved?
No. The threads have default attributes, so default to SCHED_OTHER. The rapid signalling is a side-effect of the buffering algorithm that occurs when one or more reader threads catch up with the writer. The reader thread(s) in this situation will loop around read-locking the buffer, finding they've read the data in the buffer already, and unlocking the buffer. As part of the unlock of the buffer, they check a condwaiter flag to see if there are any threads condwaiting on the buffer, and if so, a pthread_cond_signal() is sent. What should happen is that the condwaiters are signalled, decrement the condwaiter flag down to zero, and therefore no more signals are generated. Because the condwaiter in this case is not moving off the pthread_cond_wait() line, the condwaiter flag is non-zero, and so large numbers of pthread_cond_signal() calls are executed.
Hi Jakub, Is there any news on this bug report? If you are stuck, could you let me know, as I'd be very happy to help with explanations, more evidence etc. I have some evidence that seems to suggest that pthread_cond_timedwait() doesn't suffer from the same problem as pthread_cond_wait(). What I did was to change the call to pthread_cond_wait() to pthread_cond_timedwait(), with the extra code necessary to set up the time, and code to print out a diagnostic message if the wait timed out (i.e. result == ETIMEDOUT). The original intention here was to get some extra diagnostics if possible. However, when I ran the test case, the stall did not occur, even after leaving it running for several days, nor did the timed wait time out. To eliminate the possibility that this was a timing issue (since I had added several new lines of code), I next simply replaced the pthread_cond_timedwait() call with pthread_cond_wait(), leaving all the time code and diagnostic message in place. On running the test case, the stall recreated within minutes. Regards, Kevin.
I couldn't reproduce this on a testcase myself, so a small self-contained reproducer would be highly appreciated.
I have put together a testcase, though it's not small - the 2 compressed tar files are around 30 & 200 MB in size. What's the best way to get the files to you?
Best to put it on some ftp/http server and post the URL here; if that's not possible, you can mail it to me directly. Though, does the testcase involve any binary-only code? That would make it much harder to see what exactly it is doing; especially when debugging problems with thread synchronization, it is essential to check for correct use of the synchronization primitives...
I have included the source code of the functional layer that directly calls the pthread_x functions. I will send you an email direct so that I can give you details of where to ftp the files from etc.
Hi chaps, How are you getting on? Regards, Kevin.
Jakub, Have you managed to recreate the problem? Regards, Kevin.
Sorry, couldn't reproduce the problem during more than 2 hours of running the job1 script on 4way HT i686 (i.e. 8 logical CPUs).

uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
i686
2.6.9-42.ELsmp
glibc-2.3.4-2.25.i686

 6488 ingres    16   0  228m  95m 7416 S  159  1.1 700:45.87 iidbms
It may be worth running it overnight. We reproduced it easily here on a machine with a similar version level:

uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
i686
2.6.9-42.0.3.ELsmp
glibc-2.3.4-2.25.i686

Here is the cpuinfo for the machine (processors 1-3 are identical to processor 0 apart from bogomips: 3187.97, 3188.01, 3188.06):

ingres@qasling1> cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor Sample
stepping        : 0
cpu MHz         : 1594.611
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 3190.94
Is it possible to access your test environment, so that I can help recreate the issue?
Can you reproduce it with Intel processors? I've seen at least one report that AMD does implement certain memory ordering situations correctly.
(In reply to comment #23)
> Can you reproduce it with Intel processors? I've seen at least one report that
> AMD does implement certain memory ordering situations correctly.

Ulrich, I have reproduced it on Intel running Suse Linux. Here are the details:

uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
i686
2.6.5-7.191-smp
glibc-2.3.3-98.47.i686

(Processors 1-3 are the same model as processor 0; physical ids are 0, 0, 3, 3 and bogomips are 6701.05, 6782.97, 6782.97, 6799.36.)

ingres@usilsl9-1:~> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.40GHz
stepping        : 3
cpu MHz         : 3402.132
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm pni monitor ds_cpl tm2 cid
bogomips        : 6701.05

Regards, Kevin.
(In reply to comment #20) > Sorry, couldn't reproduce the problem during more than 2 hours of running the job1 script on 4way HT i686 (i.e. 8 logical CPUs). Jakub, We can get you set up with a vpn connection to our box that this recreates on. When would you be available to log on? Regards, Kevin. Ingres Europe.
FYI, I have been running the job1 script here now for 17 hours and haven't reproduced it yet. If you set up something I can ssh into, I guess I can look at it any time today (CET timezone).
(In reply to comment #24)
> I have reproduced it on Intel running Suse Linux.

And what on earth does the suse glibc have to do with us? If these people cannot provide a stable glibc, that's no problem of ours.
(In reply to comment #27)
> (In reply to comment #24)
> > I have reproduced it on Intel running Suse Linux.
>
> And what on earth does the suse glibc have to do with us? If these people
> cannot provide a stable glibc, that's no problem of ours.

The problem occurs on both Suse and Redhat with remarkably similar symptoms. Does this mean that you think the problem is down to an unstable glibc?
(In reply to comment #28)
> The problem occurs on both Suse & Redhat with remarkably similar symptoms. Does
> this mean that you think the problem is down to an unstable glibc?

It means that a) you must never, ever report a problem seen in code shipped by other people here, and b) we cannot reproduce any problem.
(In reply to comment #20)
> Sorry, couldn't reproduce the problem during more than 2 hours of running the
> job1 script on 4way HT i686 (i.e. 8 logical CPUs).

Below are the equivalent details from the customer's Redhat system, on which the problem was first reported and recreates regularly:

uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
x86_64
2.6.9-22.0.2.ELsmp
glibc-2.3.4-2.9.x86_64
glibc-2.3.4-2.9.i686
Ulrich, I want this issue, which is an O/S problem that happens to occur on both Redhat & Suse, on Intel & AMD, fixed. If you have no constructive comments to make please leave the airwaves clear for those that do. Jakub is the owner of this issue. Please direct your comments through him. Kevin. Ingres Support
Created attachment 145480 [details] data from ps axwml and gdb Ok, here is the data I've captured (and still have the process under gdb, so can poke some more). Most of the processes are waiting on uninteresting condvars that never appear in other processes, the only interesting condvar seems to be 0xaca7055c and its associated 0xaca70544 mutex. Thread 10 is waiting on this cv, thread 9 (CPU eating one from ps axwml) is pthread_cond_signaling it, thread 8 (also CPU hungry) is pthread_mutex_unlocking the associated mutex. Does this match a little bit what you have been seeing?
(In reply to comment #32)
Hi Jakub, That sounds like it. Three threads. One in cond_wait, with the top of its stack like this:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00a6aa86 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0xf7e3bc99 in IICSMTp_semaphore (exclusive=0, sem=0xeb78c340) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmtsem.c:1078
#3  0xf78496c8 in build_pindex (ftx=0xea90c560) at /devsrc/ingresr30/smeke01/src/back/dmf/dmu/dm2uputi.c:545

The other 2 are moving rapidly (as described in the earlier attachment), locking the mutex, signalling the cond-var (to no effect, which is the problem), and unlocking the mutex. You will therefore see them with changing stacks, possibly with what look like corrupt stacks. Example:

#0  0x00a69260 in pthread_self () from /lib/tls/libpthread.so.0
#1  0xf7e3b598 in IICSMTp_semaphore (exclusive=1, sem=0xeb78c408) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmtsem.c:645
#2  0xf784974a in build_pindex (ftx=0xeb8f4560) at /devsrc/ingresr30/smeke01/src/back/dmf/dmu/dm2uputi.c:566

and:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00a6d13e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0x00a69d9b in _L_mutex_lock_32 () from /lib/tls/libpthread.so.0
#3  0x00000000 in ?? ()

The latter is the whole stack trace, so it looks as if either it's corrupt or gdb couldn't cope. Let me know if I can help further. Regards, Kevin Ingres Europe.
More info just on the 3 threads from #32:

Thread 10 (Thread 16968624 (LWP 25087)):
#0  0x004547a2 in _dl_sysinfo_int80 () at rtld.c:577
#1  0x0034ab26 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x00be8e36 in IICSMTp_semaphore () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#3  0x007c3633 in build_pindex () from /opt/Ingres/IngresII/ingres/lib/libdmf.1.so
#4  0x00130265 in scs_dbms_task () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#5  0x0013d794 in scs_sequencer () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#6  0x00be01c6 in CSMT_setup () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#7  0x00348371 in start_thread (arg=0x102ebb0) at pthread_create.c:261

(gdb) p *(pthread_cond_t *)0xaca7055c
$45 = {__data = {__lock = 1, __futex = 5866, __total_seq = 2933, __wakeup_seq = 2933, __woken_seq = 2932, __mutex = 0xaca70544, __nwaiters = 2, __broadcast_seq = 0},
  __size = "\001\000\000\000*\026\000\000u\v\000\000\000\000\000\000u\v\000\000\000\000\000\000t\v\000\000\000\000\000\000D\005**\002\000\000\000\000\000\000\000\000\000\000", __align = 25194278158337}

(gdb) p *(pthread_mutex_t *)0xaca70544
$46 = {__data = {__lock = 1, __count = 0, __owner = 25088, __kind = 0, __nusers = 2, __spins = 0},
  __size = "\001\000\000\000\000\000\000\000\000b\000\000\000\000\000\000\002\000\000\000\000\000\000", __align = 1}

This thread is waiting on __data.__futex.

(gdb) x/32w $esp
0x10244b0: 0x00000000 0x00000b74 0x00000000 0x00000000
0x10244c0: 0x00c5c2a0 0x00000000 0xaca70508 0x00be8e36
0x10244d0: 0xaca7055c 0xaca70544 0xaca70544 0x00be8939
0x10244e0: 0x00c5c2a0 0x0034a15f 0x01024538 0xaca7055c
0x10244f0: 0xaca70544 0x0102ebb0 0x01024578 0x00be8f87
0x1024500: 0x00000002 0xaca705a8 0x00009930 0x0102ebb0
0x1024510: 0x00002fa0 0x00000001 0x00000011 0x00000000
0x1024520: 0x00000001 0xacaf94e0 0xacaf94e0 0x00964e20

(i.e. the saved seq val is 2932, the saved bc_seq is 0).
Thread 9 (Thread 126991280 (LWP 25088)):
#0  pthread_cond_signal@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/pthread_cond_signal.S:61
#1  0x00be9221 in IICSMTv_semaphore () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#2  0x007c3715 in build_pindex () from /opt/Ingres/IngresII/ingres/lib/libdmf.1.so
#3  0x00130265 in scs_dbms_task () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#4  0x0013d794 in scs_sequencer () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#5  0x00be01c6 in CSMT_setup () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#6  0x00348371 in start_thread (arg=0x791bbb0) at pthread_create.c:261
#7  0x00a34ffe in clone () from /lib/tls/libc.so.6

(gdb) p *(pthread_cond_t *)0xaca7055c
$48 = {__data = {__lock = 1, __futex = 5866, __total_seq = 2933, __wakeup_seq = 2933, __woken_seq = 2932, __mutex = 0xaca70544, __nwaiters = 2, __broadcast_seq = 0},
  __size = "\001\000\000\000*\026\000\000u\v\000\000\000\000\000\000u\v\000\000\000\000\000\000t\v\000\000\000\000\000\000D\005**\002\000\000\000\000\000\000\000\000\000\000", __align = 25194278158337}

This thread is in pthread_cond_signal right after acquiring the cv's internal __data.__lock; it also holds the 0xaca70544 mutex.
Thread 8 (Thread 16313264 (LWP 25089)):
#0  0x0034a106 in __pthread_mutex_unlock (mutex=0xaca70544) at pthread_mutex_unlock.c:76
#1  0x007c3633 in build_pindex () from /opt/Ingres/IngresII/ingres/lib/libdmf.1.so
#2  0x00130265 in scs_dbms_task () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#3  0x0013d794 in scs_sequencer () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#4  0x00be01c6 in CSMT_setup () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#5  0x00348371 in start_thread (arg=0xf8ebb0) at pthread_create.c:261
#6  0x00a34ffe in clone () from /lib/tls/libc.so.6

__pthread_mutex_unlock is 1 insn before returning, so it doesn't hold the lock anymore.

The contents of the CV fields all look sane, so if it hasn't been woken up for longer than a few timeslices, the FUTEX_WAKE must have been lost. But wakeup_seq as well as futex has been incremented and the function left, so I don't see how this could be a glibc bug. I tried to reproduce this with:

#include <pthread.h>
#include <stdio.h>

pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void *tf1 (void *arg)
{
  int i;
  for (i = 0; i < 1000000000; i++)
    {
      pthread_mutex_lock (&mtx);
      pthread_cond_wait (&cond, &mtx);
      pthread_mutex_unlock (&mtx);
      if ((i & 0x7ffff) == 0)
        fputc ('.', stderr);
    }
  return NULL;
}

void *tf2 (void *arg)
{
  while (1)
    {
      pthread_mutex_lock (&mtx);
      pthread_cond_signal (&cond);
      pthread_mutex_unlock (&mtx);
    }
  return NULL;
}

int main (void)
{
  pthread_t th;
  pthread_create (&th, NULL, tf1, NULL);
  pthread_create (&th, NULL, tf2, NULL);
  pthread_create (&th, NULL, tf2, NULL);
  int i;
  while (1)
    for (i = 0; i < ~0; i++)
      asm volatile ("" : : "r" (i));
}

but that has not reproduced this for several hours. Have you tried a different kernel (e.g. RHEL5 (beta))?
If only pthread_cond_{wait,signal} and pthread_mutex_{lock,unlock} are involved here (it seems so; at least bc_seq is 0, so presumably no pthread_cond_broadcast calls have been made on this cv) and no pthread_cancel was involved, the code on the glibc side is fairly simple. The difference between the RHEL5 and RHEL4 glibc here is just in pthread_cond_signal, where RHEL5 optimizes by using FUTEX_WAKE_OP when available, only falling back to FUTEX_WAKE, while RHEL4 always uses FUTEX_WAKE.
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00a6d13e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0x00a69d9b in _L_mutex_lock_32 () from /lib/tls/libpthread.so.0
#3  0x00000000 in ?? ()

is expected on RHEL4 and fixed in RHEL5. _L_mutex_lock_* etc. are inline asm pads in separate subsections, and no unwind info was provided for them in RHEL4, because supporting it either needs GCC to emit .cfi_* directives everywhere (not even RHEL5 does that yet) or DW_CFA_val_* support and the complicated unwind info tricks used in RHEL5. In RHEL4 you'd need to finish the backtrace yourself by hand, looking at x/64w $esp dumps and at the disassembly, noting where on the stack the next return address is stored.
Hi Jakub, Many thanks indeed for the update. The central point of the issue is that the cond-var is being signalled but the thread cond-waiting on it is not being scheduled. Are the cond-signals raised correctly? Is the cond-var receiving the signals? If not, why not? If it is, why isn't the associated thread being scheduled? We have found that if we use pthread_cond_timedwait() the problem does not occur (and this is without the pthread_cond_timedwait() timing out). If we replace just the pthread_cond_timedwait() call, and nothing else, with pthread_cond_wait(), the problem occurs. Does the difference between the code paths for pthread_cond_timedwait() and pthread_cond_wait(), in terms of cond-signal handling and thread scheduling, give any clues as to what might be happening? Attaching with gdb and detaching usually results in the thread in cond-wait being scheduled (i.e. the cond-var has received a cond-signal). Does this symptom give any clues as to what is happening? We don't feel there's much to be gained by us installing the RHEL5 beta, for a number of reasons. In particular, our customers would not accept installing a new version of the OS on their production systems without being convinced that the problem had been diagnosed and the cure well understood. Regards, Kevin. Ingres Europe.
The reason for asking to test with a different (RHEL5) kernel (note: only the kernel, with unchanged userland) is to help understand where the problem is, i.e. whether we should focus on the differences between the RHEL4 and RHEL5 kernel futex code (there were many changes) or on something else. That's definitely not something we'd propose for customers. Detaching gdb from a process signals all threads with SIGCONT, and as the pthread_cond_wait process is waiting in interruptible sleep inside of futex (FUTEX_WAIT), this means the syscall returns with EINTR. But the glibc pthread_cond_wait code always checks the conditions for whether it should be woken, no matter what FUTEX_WAIT returns, and as (see above) all internal variables in the cv are correct, it is woken up. If you are able to reproduce the problem within minutes or at most a few hours, rather than days as I do here, could you please try the preload libs I'm going to attach? Each of them should be LD_PRELOADed into the iidbms process, and overrides just the pthread_cond_wait function and nothing else. Guess

/etc/rc.d/init.d/ingressII stop
LD_PRELOAD=/tmp/cvwait1.so /etc/rc.d/init.d/ingressII start

etc. could do the trick (easily verifiable by cat /proc/`pidof iidbms`/maps | grep cvwait).
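For readers unfamiliar with preload interposers, the general shape of such a library is shown below. This is a generic sketch, not the actual attached cvwait code: an LD_PRELOADed object that defines its own pthread_cond_wait and chains to the real one found via dlsym(RTLD_NEXT, ...).

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>

/* Generic LD_PRELOAD interposer: this definition preempts glibc's
   pthread_cond_wait; RTLD_NEXT locates the next (real) definition in
   dynamic-linker search order so we can forward to it. */
typedef int (*cond_wait_fn)(pthread_cond_t *, pthread_mutex_t *);

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
    static cond_wait_fn real;
    if (!real)
        real = (cond_wait_fn) dlsym(RTLD_NEXT, "pthread_cond_wait");
    /* Instrumentation or an alternate wait implementation goes here. */
    return real(cond, mutex);
}
```

Built with something like `gcc -shared -fPIC -o cvwait.so cvwait.c -ldl` and activated via `LD_PRELOAD=./cvwait.so`, this lets one swap wait implementations in a running product without relinking it, which is exactly how the cvwait1/2/3 experiments below isolate the wait path.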
Created attachment 145684 [details] cvwait.tar.bz2 I'm interested to hear which from these (if any) reproduce the problem and which (if any) don't.
(In reply to comment #38)
> I'm interested to hear which from these (if any) reproduce the problem and
> which (if any) don't.

Jakub, I have some preliminary results from 4 runs:

1. cvwait1.so: left for over an hour; did not recreate.
2. Control (LD_PRELOAD unset): recreated within a few minutes.
3. cvwait2.so: left for over an hour; did not recreate.
4. cvwait3.so: recreated after 40 minutes.

I will run cvwait1.so and cvwait2.so for longer periods, starting with cvwait1.so overnight tonight. Let me know if you want more diagnostics, different tests etc. Regards, Kevin. Ingres Europe.
Jakub, The overnight run with cvwait1.so (i.e. the 2nd time of running it) reproduced the stall. It actually stalled within 25 minutes of starting the test, but I hadn't realised. I've attached thread stack dumps for yesterday's stall with cvwait3.so and this stall with cvwait1.so. I am running cvwait2.so for a longer period to see if it will reproduce eventually. Regards, Kevin. Ingres Europe.
Created attachment 145993 [details] Thread stackdump for cvwait3.so stall
Created attachment 145994 [details] Thread stackdump for cvwait1.so stall
Created attachment 145995 [details] Thread stackdump for cvwait1.so stall
cvwait1.so was a control test; it contains the same assembly as pthread_cond_wait in libpthread.so, so I was surprised at first to see that it didn't reproduce it. cvwait2.so is the pthread_cond_timedwait assembly named pthread_cond_wait and slightly changed, so that it assumes the maximum possible timeout. cvwait3.so is the C version of pthread_cond_wait (used on powerpc, ia64 and most other arches, except i386, x86_64 and sh).
(In reply to comment #44) Jakub, I now have a run with cvwait2.so stalling. It took approx 22 hours to stall (started run on Fri Jan 19 at 11:11 am and last entry in log was Sat Jan 20 09:41). I have uploaded the stackdump. Let me know if there are any further tests I can run. Regards, Kevin. Ingres Europe.
Created attachment 146161 [details] Thread stackdump for cvwait2.so stall
Created attachment 146162 [details] Thread stackdump for cvwait2.so stall
On Tue, 2007-02-06 at 14:16 -0500, Jakub Jelinek wrote:
> I'd say the biggest help would be to try a newer kernel on a machine where
> this is reproducible quickly. Say, if you have some spare partition,
> install Fedora Core 6 or RHEL5 there, then when it boots chroot into
> the RHEL4 partition, mount the needed filesystems (/proc, /dev/pts,
> /dev/shm, ...), and see if you can reproduce it there.
> I know it sounds like a wild shot, but anything that helps to decrease the
> amount of involved suspected code is important.

Jakub, I have the results of some tests comparing RHEL4 and RHEL5. I first ran the recreation test on a 4-processor RHEL4 box. The bug recreated immediately, on the first iteration. I then had the box upgraded to RHEL5 and ran the recreation test twice: neither run recreated the bug. The first run ran overnight (approx 18 hours) before I stopped it. The second ran for an hour before I stopped it (the machine was required for QA work). The RHEL5 environment is:

$ uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
i686
2.6.18-1.2747.el5PAE
glibc-2.5-4.i686

Regards, Kevin.
For the RHEL5 run, you used the same binary but the RHEL5 runtime (libc, libpthread)? Can you try again and copy over the RHEL4 ld.so, libc, and libpthread (and other libraries you need)? Copy them all into the same directory somewhere (not overwriting system code), and then go into that directory. There run ./ld-linux.so.2 --library-path . ./the-binary If this works fine then it's exclusively a kernel issue.
Or alternatively, install RHEL5 on a different partition than RHEL4, boot into RHEL5, and chroot into the RHEL4 partition after mounting the few needed filesystems. The #c51 results suggest that it might be a RHEL4 kernel bug fixed in the RHEL5 kernel, but before reassigning to the kernel team to look at the futex.c differences between the two kernels, we really need to rule out the possibility of userland differences (for pthread_cond_* they are very minor).
----- Additional Comments From cjt.com 2007-02-22 16:24 EDT -------
I ran the ingres testcase in a chroot to RHEL4 under a RHEL5 kernel and did not experience a lockup for 3 days, whereas my test system would typically exhibit the lockup in less than 24 hours when booted on RHEL4. Under RHEL4 I grabbed a system dump with the ingres testcase locked up, but the dump shows all iidbms processes had recently run. I believe the starved iidbms thread may be caused by assuming that calling sched_yield() *will* result in a new process obtaining a mutex. A rather thorough discussion in LTC Bugzilla 19718 describes how sched_yield() does not inherently result in a new process winning a mutex fight. From Comment #16 https://bugzilla.linux.ibm.com/show_bug.cgi?id=19718#c16:

============
If a task is the only task assigned to a processor, and it's told to remove itself and go to the end of the runqueue, it will have no effect. Remember that there is a runqueue *per processor* in the O(1) scheduler. Specifically, it will remove itself, find somebody to replace itself, find that that somebody is itself, observe that they are already running on this very processor, and realize nothing further needs to be done to get this "new" process running. A few pointers get changed but the sched_yield call will return without even slowing down -- which is correct behavior.
=============

I ran a few preliminary tests to convince myself this is what is happening in the ingres testcase.

1) Using the thread_sched_test.c attachment in bugzilla 19718, with a modification to give it 3 reader threads on my 4-processor system, I saw the threads starved of mutex access on occasion. (Much like the ingres testcase reproduction steps indicate, if it doesn't happen in the first couple of iterations it's going to take a very long time.)

2) With the thread_sched_test.c test case, on RHEL5 as well as in a RHEL4 chroot under a RHEL5 kernel, this phenomenon didn't happen.
3) I intend to run the ingres testcase until I get a locked-up iteration; if I can unlock it by cycling a CPU-intensive task through the CPUs using taskset, this hypothesis would seem to be proven, and Ingres will need to try recoding to obtain the behavior they desire.
All, To summarise our understanding of the issue to date:

1. In the case as initially reported at RHEL4 the situation is as follows. We have 3 threads. One thread is in pthread_cond_wait on a condition variable. The other 2 threads never call pthread_cond_wait and are never in pthread_cond_wait. They cycle round calling pthread_mutex_lock and pthread_mutex_unlock on the mutex associated with the condition variable, and calling pthread_cond_signal on the condition variable. The question here is: why does the one thread that is waiting on the condition variable not come out of cond_wait and start running, even though it is firstly the only thread waiting on that condition variable, and secondly it is cond_signalled many, many times? At the request of the Redhat engineer, we ran a number of tests involving combinations of RHEL4 and RHEL5.

2. The testcase did not recreate on the same machine-type as the original report, upgraded to RHEL5. The testcase also did not reproduce on the same machine-type upgraded to RHEL5, but using RHEL4 libraries. The implication is that there is a difference in the way the RHEL4 and RHEL5 kernels behave, such that the problem described in (1) above occurs with the RHEL4 kernel but not with the RHEL5 kernel. This reinforces our contention that this is an OS issue, though perhaps gives Redhat some clues as to where to look in order to resolve this at an OS level in RHEL4.

Please feel free to contact us at Ingres on the email address above if you have any questions. Regards, Kevin. Ingres Europe.
----- Additional Comments From nevdull.com (prefers email at ricklind.com) 2007-02-27 20:26 EDT -------
I suspect that we are still seeing the scenario I described in the previous bug. If process A has gone to sleep waiting on some resource (condition variable) then it must be awakened and, once awakened, make its way to a processor, and then find the resource still available. If process B (the process giving up the condition variable) gives it up and takes it back too soon, A may receive a signal, may be queued to run, and may run ... only to find that the desired resource is busy again. Ironically, if they were both scheduled on the same CPU, the desired behavior would likely happen, since it's likely that the awakened process will have a higher priority than the one that is trying to yield the processor. But scheduling two processes on the same processor is "idiotic" if you have free processors, so without some artificial constraints that would not happen.

So, a question. If we show that the above is precisely what is happening, do we agree the problem is with the locking model and not with the kernel? Regardless, we've yet to demonstrate clearly that the above is happening. Do we have evidence that A is in fact receiving a signal, responding to it, and finding that the variable is again busy? If so, then we should see a change in behavior if we use sched_setaffinity() so that the holder (B) and the target (A) can each only run on the same CPU. That way "being made runnable" will still happen, but the processor that needs to be notified and needs to reevaluate its runqueue is the same one that is willing to suspend B and look for another process. It should find A ready to run and run it instead of B. I presume the same would happen if you ran on a uniproc (single processor). A more invasive method of doing the same would be to modify the code in sys_sched_yield() to record who came through and what state they found the runqueue in. I think what we'd find is that there is nobody to yield to (the awakened process is scheduled on another processor, but that other processor has not yet started it running) and so it just keeps going. Adding the debugging would be a simple modification, but the previous suggestions have the advantage of not requiring recompilation of the kernel.
I've just begun investigating this problem. The diff between the RHEL4 and RHEL5 versions of kernel/futex.c is >1500 lines, which includes a significant Priority Inheritance enhancement.
That's true, but neither these condvars nor the mutexes are PTHREAD_PRIO_INHERIT or PTHREAD_PRIO_PROTECT, nor are they robust. If you manage to reproduce it somewhere, you might, after trying RHEL4 and RHEL5 kernels with RHEL4 userland, also try FC4/FC5/FC6 GA kernels and thus narrow the search for the differences. Regarding comment #54, do you have any evidence that pthread_yield (the only function in NPTL that calls sched_yield) or sched_yield itself is used?
(In reply to comment #58) > Regarding comment #54, do you have any evidence that pthread_yield (the only > function in NPTL that calls sched_yield) or sched_yield itself is used? I can confirm that we don't use pthread_yield or sched_yield. Regards, Kevin Ingres Europe.
(In reply to comment #56) > ----- Additional Comments From nevdull.com (prefers email at ricklind.com) 2007-02-27 20:26 EDT ------- > A more invasive method of doing the same would be to modify the code in > sys_sched_yield() to record who came through and what state they found the > runqueue in. There is a chance systemtap may be of use here. If you have the tool and its prerequisites (kernel-devel and kernel-debuginfo for your version), you can trace inside sys_sched_yield(). For 2.6.9-42.0.3.EL, this little script inserts a probe just after the dequeue/enqueue statement pair:

probe kernel.statement("*@kernel/sched.c:3641")
{
	printf("%d %d %d\n", pid(), $rq->nr_running, $target)
}

(One can print other values derived from local pointers; add conditions like process id or execname; collect historical data in arrays instead of printing them out; ...) I'd be happy to help construct a script to help with this problem - I'd just need to know what values/conditions/reports are interesting.
----- Additional Comments From nevdull.com (prefers email at ricklind.com) 2007-03-06 02:27 EDT ------- Kevin said: I can confirm that we don't use pthread_yield or sched_yield. Okay, thanks. An earlier comment suggested that sched_yield() was being used, which took us down that path. I still have a gut feel that we may be grabbing a mutex back before the signalled thread has a chance to acquire it, but without some code to back it up it's only that -- a gut feel. A useful step while the other testing is going on would be to grab the RHEL4-based code for pthread_mutex_unlock() and pthread_cond_wait() and see what series of calls are made to the kernel in implementing those. The comment from petrides suggests this may already be happening if we're inspecting futex code. I'm pretty familiar with things in the kernel but not so much with the thread libraries, so that study may prove useful. Re: systemtap -- might be useful, but I'd be careful about issuing too many printfs. Introducing output can slow some sections of code down an order of magnitude or more, and by dramatically changing the timings, you can inadvertently make the event you're trying to observe simply go away.
----- Additional Comments From cjt.com 2007-03-06 10:17 EDT ------- Since I was the one who brought up sched_yield, I'd like to put closure to that topic by saying that a jprobe on sys_sched_yield(), restricted to current->comm == "iidbms", confirmed Ingres' statements and produced no results. (Not from the start of job1 to the point of the iidbms hang.) The jprobe.entry was:

asmlinkage long jdo_sched_yield(void)
{
	extern char *jprobe_comm;

	if (current && (strcmp(current->comm, jprobe_comm) == 0)) {
		printk("sched_yield from process(%d:%s)..... ",
		       current->pid, current->comm);
		dump_stack();
	}
	jprobe_return();
	return 0;
}
Sorry, I'm just getting up to speed on this problem (and I've been distracted for a couple of days working on a RHEL3 kernel HotFix).

Anyway, as a follow-up to Kevin's comment #5 regarding test results from the pthread_cond_timedwait() experiment, the kernel's handling of a futex() syscall for the FUTEX_WAIT op is nearly identical with or without a timeout (i.e., whether or not a NULL (struct timespec *) is passed as the 4th arg). Internally, the kernel *always* sets up a timeout, and if user-space passes NULL, the kernel selects a timeout value of LONG_MAX clock ticks (i.e., a large number of years). The only difference in the kernel is that a non-NULL timespec pointer causes a copy of the timespec struct from user-space and a conversion of its fields into a scalar number of clock ticks. The remainder of FUTEX_WAIT processing is absolutely identical.

Assuming that the only difference between the library implementations of pthread_cond_wait() and pthread_cond_timedwait() is that the former passes a NULL 4th arg to futex(), this suggests that the lost-wakeup race is either timing-dependent *or* related to memory coherence (e.g., a missing memory barrier in either the kernel or the pthread library).

I'm loading the test case onto a 4-processor test machine in Westford, although it's actually got two dual-hyperthreaded cpu packages. I'm also currently running Jakub's little test program from comment #34, but it hasn't reproduced the hang in its first 15 minutes of runtime.

I just now noticed that the 32-bit kernel is being used on 64-bit hardware, which is not what's on my test machine. I doubt that's relevant, but I guess I should at least go look at the kernel's 32-bit-compat code for futex() syscall handling.
Please ignore the last sentence in my comment above. The 32-bit-compat code is only relevant for a 32-bit executable running on a 64-bit kernel, which is not the environment indicated in this bug report.
----- Additional Comments From cjt.com 2007-03-09 20:31 EDT ------- We have been monitoring the futex usage on a RHEL4 system in the stalled iidbms state. What we found was one iidbms thread waiting on a futex which never gets awoken. Monitoring the futex_wake() calls, we never find that uaddr being passed. Can Ingres provide information on how to identify the "stalled" reader without using gdb? =- Curt
Jakub, why isn't there a memory barrier in __pthread_cond_wait() in between the sampling of cond->__data.__futex (line 140 of RHEL4 version of pthread_cond_wait.c) and the call to lll_mutex_unlock() 3 lines later? Shouldn't this synchronize with another (missing) memory barrier in __pthread_cond_signal() just after the increment of cond->__data.__futex (on line 43 of pthread_cond_signal.c)? Without this synchronization, isn't it possible that the update of the futex value in a waker thread would be sampled by a racing waiter thread, thus causing the futex() syscall to incorrectly wait on the advanced value?
(In reply to comment #66) > ----- Additional Comments From cjt.com 2007-03-09 20:31 EDT ------- > Can Ingres provide information on how to identify the "stalled" reader without > using gdb? > Apologies for answering with a question, but what options for identification do you have in mind? Regards, Kevin Ingres Europe
----- Additional Comments From washer.com 2007-03-10 13:57 EDT ------- Answering the question as to how to identify the threads: is there some unique function on the stack of the various threads? For example, does the name of the function first called from pthread_create() differ amongst the various threads?

Adding a bit more to what Curtis wrote: when the hang occurs, we see a couple of threads doing LOTS of wakeups on mutexes for which there are no sleepers, and we see one thread that is waiting on a different mutex. Is it possible that the wrong mutex is being signalled?

Since we don't have access (that I'm aware of) to Ingres code, it's difficult for us to know WHICH thread the process is waiting (hanging) on. How can we know for sure which thread needs to be awoken for things to continue?
(In reply to comment #69) Identifying the threads: A typical stack dump of the 3 threads involved looks like this: Thread 10 (Thread 3952077744 (LWP 6590)): #0 0x00a69260 in pthread_self () from /lib/tls/libpthread.so.0 #1 0xf7e3b598 in IICSMTp_semaphore (exclusive=1, sem=0xeb78c408) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmtsem.c:645 #2 0xf784974a in build_pindex (ftx=0xeb8f4560) at /devsrc/ingresr30/smeke01/src/back/dmf/dmu/dm2uputi.c:566 #3 0xf7dc3265 in scs_dbms_task () from /patchtest/ingres/lib/libscf.1.so #4 0xf7dd0794 in scs_sequencer () from /patchtest/ingres/lib/libscf.1.so #5 0xf7e2fe26 in CSMT_setup (parm=0xea88fe80) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmthl.c:983 #6 0x00a68341 in start_thread () from /lib/tls/libpthread.so.0 #7 0x008fcfee in clone () from /lib/tls/libc.so.6 Thread 9 (Thread 3935398832 (LWP 6591)): #0 0xffffe410 in __kernel_vsyscall () #1 0x00a6aa86 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0 #2 0xf7e3bc99 in IICSMTp_semaphore (exclusive=0, sem=0xeb78c340) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmtsem.c:1078 #3 0xf78496c8 in build_pindex (ftx=0xea90c560) at /devsrc/ingresr30/smeke01/src/back/dmf/dmu/dm2uputi.c:545 #4 0xf7dc3265 in scs_dbms_task () from /patchtest/ingres/lib/libscf.1.so #5 0xf7dd0794 in scs_sequencer () from /patchtest/ingres/lib/libscf.1.so #6 0xf7e2fe26 in CSMT_setup (parm=0xea8918c0) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmthl.c:983 #7 0x00a68341 in start_thread () from /lib/tls/libpthread.so.0 #8 0x008fcfee in clone () from /lib/tls/libc.so.6 Thread 8 (Thread 3934305200 (LWP 6592)): #0 0xffffe410 in __kernel_vsyscall () #1 0x00a6d13e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 #2 0x00a69d9b in _L_mutex_lock_32 () from /lib/tls/libpthread.so.0 #3 0x00000000 in ?? () 1 thread stack will be unchanging, having called pthread_cond_wait() (thread 9 in the example above). The other 2 will be moving. 
The stack will be the same for the 3 threads as follows up to and including the build_pindex() call: clone() -> start_thread() -> CSMT_setup() -> scs_sequencer() -> scs_dbms_task() -> build_pindex(). [However gdb seems often to have difficulty printing out the stack as you can see in thread 8 above]. build_pindex() calls the IICSMT_x ingres functions that in turn call the pthread_x functions. In the source the calls to pthread_x functions appear as platform-dependent macros. On Linux these are a straight substitution: I've supplied Jakub with a complete Ingres install and the source code for the relevant sections via our ftp site. I will send you details of this separately. The complete source code for Ingres is available on-line at www.ingres.com, though not the exact version our customer is using. If it is necessary I can supply you with the exact source code the customer is using, via our ftp site. I hope this helps, Regards, Kevin Ingres Europe.
Re: #c67 The lll_mutex_{,un}lock macros are synchronization barriers themselves; e.g. on i?86 they are asm volatile instructions touching memory with a lock prefix and a "memory" clobber, so no memory loads/stores should be reordered (neither by GCC nor by the CPU itself) across them. Also, i386/x86_64 use pthread_cond_{wait,timedwait,signal,broadcast} written in assembly rather than the .c versions; see nptl/sysdeps/unix/sysv/linux/i386/i486/*.S.

Re: #c64 Kevin has been able to reproduce the hangs even with pthread_cond_timedwait, see comments #c38 through #c47.
(In reply to comment #71) > Re: #c64 > Kevin has been able to reproduce the hangs even with pthread_cond_timedwait, see > comments #c38 through #c47. Jakub - I found that pthread_cond_timedwait didn't seem to suffer from the stall (see comment #36). I tested this by doing runs with pthread_cond_timedwait, with the extra code added to set a time value. I then did a test where everything was the same, including the (technically redundant) extra code for the time value, except that pthread_cond_timedwait() was replaced with pthread_cond_wait(). The results were that pthread_cond_timedwait() never produced the stall, even after several attempts over both short runs and long overnight periods. The code with pthread_cond_wait(), however, reproduced the stall on the first run, within 10 minutes. Regards, Kevin Ingres Europe.
In comment #c45 you said you reproduced the stall with LD_PRELOAD=cvwait2.so after 22 hours, where cvwait2.so is pthread_cond_timedwait assembly routine with pthread_cond_wait entry point that just uses maximum possible abstime.
(In reply to comment #73) > In comment #c45 you said you reproduced the stall with LD_PRELOAD=cvwait2.so > after 22 hours, where cvwait2.so is pthread_cond_timedwait assembly routine > with pthread_cond_wait entry point that just uses maximum possible abstime. Those were your diagnostics so you would know, I just ran the tests and reported the results. When I tried pthread_cond_timedwait() in our code it never reproduced the stall. When we tried pthread_cond_timedwait() on the customer's system as an attempt to produce more diagnostics we found the problem went away, and he has since been running with this for weeks as a temporary measure on his test machine without getting the stall. Kevin Ingres Europe.
----- Additional Comments From cjt.com 2007-03-12 18:00 EDT ------- In a given reproduction case we have these iidbms processes in futex_wait() PID futex_q.key === =========== 11848 0x00000003,f632f24c,0xa99 11840 0x00000003,f632f24c,0x7d9 11841 0xaabb2000,f7696080,0x1d0 11845 0x00000003,f632f24c,0x991 11849 0x00000003,f632f24c,0xaf1 None of those process have run following the stall. Apparently 11841 would be the stalled reader since the other's have futex_q keys in shared memory(???) Meanwhile an strace reveals pids 11842 and 11843 active calling sys_futex with uaddr=0xaabb21b4 and 0xaabb227c. (A jprobe on futex_wake reveals these are the only significant processes using FUTEX_WAKE and that aabb21b4 and aabb227c are the only uaddr's being used.) crash> bt -f 11841 PID: 11841 TASK: dfc98330 CPU: 2 COMMAND: "iidbms" #0 [f4ee7e54] schedule at c02d25c9 [RA: c02d2e55 SP: f4ee7e54 FP: f4ee7eb8 SIZE: 104] f4ee7e54: f4ee7eb4 00000082 00000012 00000046 f4ee7e64: f35f39b0 f35f39b0 00000019 c4a67de0 f4ee7e74: f35f39b0 00000000 c4a60740 c4a5fde0 f4ee7e84: 00000002 00000000 8bf46740 000f42f9 f4ee7e94: f35f39b0 dfc98330 dfc9849c 00000008 f4ee7ea4: f633b3b0 00000000 7fffffff f4ee7000 f4ee7eb4: f4ee7f00 c02d2e55 #1 [f4ee7eb8] schedule_timeout at c02d2e50 [RA: c01351dd SP: f4ee7ebc FP: f4ee7ef4 SIZE: 60] f4ee7ebc: 0000001b 00008900 aabb2000 aabb2000 f4ee7ecc: c01506e4 aabb2000 f4ee7f1c 00000246 f4ee7edc: f4ee7f50 f4ee7f08 c012038d 00000000 f4ee7eec: 00000000 00000000 c01351dd #2 [f4ee7ef4] futex_wait at c01351d8 [RA: c013548e SP: f4ee7ef8 FP: f4ee7f80 SIZE: 140] f4ee7ef8: 7fffffff 00066e51 c0439ea8 c0439ea8 f4ee7f08: 00000001 dead4ead f4ee7f5c f4ee7f5c f4ee7f18: c0439e9c aabb2000 f7696080 000001d0 f4ee7f28: ffffffff 00000000 00000000 dfc98330 f4ee7f38: c011e794 00000000 00000000 00000000 f4ee7f48: f4ee7f60 f76960b0 00000000 dfc98330 f4ee7f58: c011e794 f4ee7f10 f4ee7f10 f7696080 f4ee7f68: 000001b4 c02d50c0 aabb21d0 00000000 f4ee7f78: 00000000 f4ee7000 c013548e #3 [f4ee7f80] 
do_futex at c0135489 [RA: c01355c0 SP: f4ee7f84 FP: f4ee7f90 SIZE: 16] f4ee7f84: aabb21d0 00000000 00066e51 c01355c0 #4 [f4ee7f90] sys_futex at c01355bb [RA: c02d4703 SP: f4ee7f94 FP: f4ee7fc0 SIZE: 48] f4ee7f94: 7fffffff 00066e51 00000000 00b6b538 f4ee7fa4: 00000000 c03269a0 00000000 aabb21d0 f4ee7fb4: 00000000 00066e51 f4ee7000 c02d4703 #5 [f4ee7fc0] system_call at c02d46fc EAX: 000000f0 EBX: aabb21d0 ECX: 00000000 EDX: 00066e51 DS: 007b ESI: 00000000 ES: 007b EDI: 00066e51 SS: 007b ESP: 00b6b4ac EBP: 00b6b538 CS: 0073 EIP: 008da7a2 ERR: 000000f0 EFLAGS: 00000292 [RA: 8da7a2 SP: f4ee7fc4 FP: f4ee7ffc SIZE: 60] f4ee7fc4: aabb21d0 00000000 00066e51 00000000 f4ee7fd4: 00066e51 00b6b538 000000f0 0000007b f4ee7fe4: 0000007b 000000f0 008da7a2 00000073 f4ee7ff4: 00000292 00b6b4ac 0000007b sys_futex(uaddr=0xaabb21d0, op=0, val=0x66e51 ....) crash> vtop -c 11841 aabb21d0 VIRTUAL PHYSICAL aabb21d0 14426b1d0 crash> rd -p 14426b1b0 64 14426b1b0: 4d455349 00000000 00000000 00000000 ISEM............ 14426b1c0: 00000000 00000001 00000000 00000000 ................ 14426b1d0: 00066e52 00033729 00000000 00033729 Rn..)7......)7.. 14426b1e0: 00000000 00033728 00000000 aabb21b4 ....(7.......!.. 14426b1f0: 00000002 00000000 00000000 00000000 ................ That's the pthread_cond_t with __futex=66e52, __total_seq = 33729, __wakeup_seq = 33729, __woken_seq = 33728, __mutex = 0xaabb21b4, __nwaitiers = 2 ... What I'm hoping RH can explain is why the pthread_cond_t's __mutex is the FUTEX_WAKE uaddr for the active iidbms pids. =- Curt
Just clarifying the above a bit: in comment #34 above, we see gdb traces showing a print of a pthread_cond_t as such:

(gdb) p *(pthread_cond_t *)0xaca7055c
$45 = {__data = {__lock = 1, __futex = 5866, __total_seq = 2933, __wakeup_seq = 2933, __woken_seq = 2932, __mutex = 0xaca70544, __nwaiters = 2, __broadcast_seq = 0}, __size = "\001\000\000\000*\026\000\000u\v\000\000\000\000\000\000u\v\000\000\000\000\000\000t\v\000\000\000\000\000\000D\005**\002\000\000\000\000\000\000\000\000\000\000", __align = 25194278158337}
(gdb) p *(pthread_mutex_t *)0xaca70544
$46 = {__data = {__lock = 1, __count = 0, __owner = 25088, __kind = 0, __nusers = 2, __spins = 0}, __size = "\001\000\000\000\000\000\000\000\000b\000\000\000\000\000\000\002\000\000\000\000\000\000", __align = 1}

Note the futex as the second element of the cond_t, and further in the cond_t we see the pointer to the pthread_mutex. Now, in the hang that Curtis reproduced we see a thread sleeping (waiting) on a futex. If one looks at the data in user space using the crash tool, one sees a cond_t structure like the one mentioned in comment #34. Again, we see the pointer to the associated pthread_mutex_t. Everything looks good, as we see that this thread called futex_wait with the futex contained within the user-space cond_t.

Now... what's disturbing here is that we see the other two threads calling futex_wake with the address of the pthread_mutex, rather than the address of the FUTEX contained within the cond_t. This seems rather odd... userland is calling futex_wake with a mutex??? Can someone from Ingres and/or RH LIBC folks tell us why (or where we did something wrong in our analysis)?
(In reply to comment #76) > > This seems rather odd... userland is calling futex_wake with a mutex??? > > Can someone from Ingres and/or RH LIBC folks tell us why (or where we did > something wrong in our analysis)? Think this is for RH LIBC analysis. I can confirm that in our code we call pthread_cond_signal() with the condition_variable as the parameter. Kevin Ingres Europe.
Jakub, should I bounce this back over to the "glibc" component? I'm having trouble analyzing the assembly language code in nptl/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_*.S. After reading comment #75 and comment #76, I'm beginning to suspect that this is not a kernel problem anyway.
Please keep it under kernel; we have analyzed the assembly several times and it is correct.

To answer comment #75 and comment #76: condvars use 3 addresses for futexes - the cv's __data.__lock, the cv's __data.__futex and the mutex's __data.__lock. The cv's __data.__lock is a short-lived mutex which guards changes to the other cv fields; it uses the values 0/1/2 and on an uncontended cv is handled purely in userland. 0 means not locked, 1 locked with no waiters and 2 locked with possibly some waiters waiting till it is unlocked. The mutex's __data.__lock works similarly, though it depends upon the application how short-lived it is. If I remember well, iidbms, at least on the analyzed stalls, calls pthread_cond_signal with the associated mutex held. The cv's __data.__futex is a counter (it should always contain (int) (__data.__total_seq + __data.__wakeup_seq)); pthread_cond_*wait blocks on it with FUTEX_WAIT and pthread_cond_signal FUTEX_WAKEs it (pthread_cond_broadcast isn't involved here, so I won't describe it).

Around comment #40 you can see that the stall has been reproduced even with a C version of pthread_cond_wait, so you can perhaps read that, as it is more readable than the assembly version, if you want to understand what it is doing. But IMHO focusing on userland is the wrong thing now: we know a RHEL5 kernel fixes this problem and we have spent considerable time checking the userland code. So I'd say the best way forward is to narrow down the kernel futex.c changes that might be relevant (try FC4/FC5/FC6 GA kernels).
Created attachment 150237 [details] kernel-2.6.9-futex-queue_me-change.patch
----- Additional Comments From cjt.com 2007-03-16 11:18 EDT ------- backport of queue_me changes from 2.6.18-8 futex.c
----- Additional Comments From cjt.com 2007-03-16 11:21 EDT ------- Making the queue_me/lock changes for futex_wait found in the rhel5 kernel have fixed the problem on my system. These changes are basically a derivative of the patch from http://lkml.org/lkml/2005/3/17/37. The exact patch I used is now attached to this bugzilla and applies to the 2.6.9-42 kernel. Please test it and report the results.
I will follow up on all this next week (after finishing work on the RHEL3 U9 kernel). Thanks for all the help.
I can confirm that the patch in comment #81 correctly brings the RHEL4 version of futex_wait() and its subordinate functions in line with what's in RHEL5 (and also with what's upstream in 2.6.20). I might make a few non-functional changes to reduce the extent of the diffs, but overall it looks good to me. I will try to develop a small isolated reproducer based on knowledge of what the fixes address; I'd prefer to have this in hand to validate the changes.
Created attachment 152587 [details] test program for trying to reproduce pthread_cond_wait() hangs This is the test program that I've been using to attempt to reproduce hangs under pthread_cond_wait() by trying to trigger the relevant race condition in the kernel's handling of futex()/FUTEX_WAIT syscalls. Unfortunately, on 3 different test systems (i386/4-cpu, x86_64/4-cpu, x86_64/8-cpu w/o hyperthreading), I was unable to reproduce the hang in ~24 hours of test runs.
The patch below was posted for internal review on 13-Apr-2007.
Created attachment 152592 [details] This is the patch under consideration for RHEL4.6.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
(In reply to comment #96) > This request was evaluated by Red Hat Kernel Team for inclusion in a Red > Hat Enterprise Linux maintenance release, and has moved to bugzilla > status POST. Is the maintenance release containing the fix for bug 217067 now generally available? Regards, Kevin. Ingres Europe.
The fix for this problem has been proposed for RHEL4.6 (and has not yet been committed to CVS). What was released earlier this week was RHEL4.5.
committed in stream U6 build 55.2. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
Created attachment 161087 [details] kernel-2.6.9-futex-queue_me-change.patch
Created attachment 161088 [details] This is the patch under consideration for RHEL4.6.
Created attachment 161089 [details] data from ps axwml and gdb
Created attachment 161090 [details] Thread stackdump for cvwait1.so stall
Created attachment 161091 [details] cvwait.tar.bz2
Created attachment 161092 [details] test program for trying to reproduce pthread_cond_wait() hangs
Created attachment 161093 [details] Ingres & Linux description of the problem
Created attachment 161094 [details] Thread stackdump for cvwait3.so stall
Created attachment 161095 [details] Thread stackdump for cvwait2.so stall
A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com). Requested action: Please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
------- Comment From cjt.com 2007-08-31 18:56 EDT------- I ran the 2.6.9-56 kernel from the RHEL4 U6 beta for 2 days without any occurrence of the problem on the customer's test case. So, this problem appears to be fixed in the U6 beta. Someone with the proper permissions will need to update this to "PartnerVerified".
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html