Description of problem:
A thread in pthread_cond_wait() is being signalled (rapidly) by two other threads, but it is not being scheduled.

Version-Release number of selected component (if applicable):

How reproducible:
This is currently reproducible by running up the Ingres database and running a particular testcase supplied by an Ingres customer. We have not yet succeeded in creating a cut-down testcase. We are currently getting clearance from the customer to supply the testcase data, at which point we could guide you through reproducing the issue on Ingres.

Steps to Reproduce:
1. To be supplied.

Actual results:

Expected results:

Additional info:
Please see the attachment for the problem details: one description from the point of view of the symptoms seen in Ingres, one from the Linux point of view.
Created attachment 142001 [details] Ingres & Linux description of the problem
Are realtime priorities involved?
No. The threads have default attributes, so default to SCHED_OTHER. The rapid signalling is a side-effect of the buffering algorithm that occurs when one or more reader threads catch up with the writer. The reader thread(s) in this situation will loop around read-locking the buffer, finding they've read the data in the buffer already, and unlocking the buffer. As part of the unlock of the buffer, they check a condwaiter flag to see if there are any threads condwaiting on the buffer, and if so, a pthread_cond_signal() is sent. What should happen is that the condwaiters are signalled, decrement the condwaiter flag down to zero, and therefore no more signals are generated. Because the condwaiter in this case is not moving off the pthread_cond_wait() line, the condwaiter flag is non-zero, and so large numbers of pthread_cond_signal() calls are executed.
Hi Jakub, Is there any news on this bug report? If you are stuck, could you let me know, as I'd be very happy to help with explanations, more evidence etc. I have some evidence that seems to suggest that pthread_cond_timedwait() doesn't suffer from the same problem as pthread_cond_wait(). What I did was to change the call to pthread_cond_wait() to pthread_cond_timedwait(), with the extra code necessary to set up the time, and code to print out a diagnostic message if the wait timed out (i.e. result == ETIMEDOUT). The original intention here was to get some extra diagnostics if possible. However, when I ran the test case, the stall did not occur, even after leaving it running for several days, nor did the timed wait time out. To eliminate the possibility that this was a timing issue (since I had added several new lines of code), I next simply replaced the pthread_cond_timedwait() call with pthread_cond_wait(), leaving all the time code and diagnostic message in place. On running the test case, the stall recreated within minutes. Regards, Kevin.
I couldn't reproduce this on a testcase myself, so a small self-contained reproducer would be highly appreciated.
I have put together a testcase, though it's not small - the 2 compressed tar files are around 30 & 200 MB in size. What's the best way to get the files to you?
Best to put it on some ftp/http server and post the URL here; if that's not possible, you can mail it to me directly. Though, does the testcase involve any binary-only code? That would make it much harder to see what exactly it is doing; especially when debugging problems with thread synchronization, it is essential to check for correct use of the synchronization primitives...
I have included the source code of the functional layer that directly calls the pthread_x functions. I will send you an email direct so that I can give you details of where to ftp the files from etc.
Hi chaps, How are you getting on? Regards, Kevin.
Jakub, Have you managed to recreate the problem? Regards, Kevin.
Sorry, couldn't reproduce the problem during more than 2 hours of running the job1 script on 4way HT i686 (i.e. 8 logical CPUs).

uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
i686
2.6.9-42.ELsmp
glibc-2.3.4-2.25.i686

 6488 ingres    16   0  228m  95m 7416 S  159  1.1 700:45.87 iidbms
It may be worth running it overnight. We reproduced it easily here on a machine with a similar version level:

uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
i686
2.6.9-42.0.3.ELsmp
glibc-2.3.4-2.25.i686

Here is the cpuinfo for the machine (processors 1-3 are identical to processor 0 apart from bogomips: 3187.97, 3188.01, 3188.06):

ingres@qasling1> cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor Sample
stepping        : 0
cpu MHz         : 1594.611
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 3190.94
Is it possible to access your test environment, so that I can help recreate the issue?
Can you reproduce it with Intel processors? I've seen at least one report that AMD does implement certain memory ordering situations correctly.
(In reply to comment #23)
> Can you reproduce it with Intel processors? I've seen at least one report that
> AMD does implement certain memory ordering situations correctly.

Ulrich, I have reproduced it on Intel running Suse Linux. Here are the details:

uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
i686
2.6.5-7.191-smp
glibc-2.3.3-98.47.i686

(Processors 1-3 are the same model as processor 0; physical ids are 0, 0, 3, 3 and bogomips are 6701.05, 6782.97, 6782.97, 6799.36.)

ingres@usilsl9-1:~> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.40GHz
stepping        : 3
cpu MHz         : 3402.132
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm pni monitor ds_cpl tm2 cid
bogomips        : 6701.05

Regards, Kevin.
(In reply to comment #20) > Sorry, couldn't reproduce the problem during more than 2 hours of running the job1 script on 4way HT i686 (i.e. 8 logical CPUs). Jakub, We can get you set up with a vpn connection to our box that this recreates on. When would you be available to log on? Regards, Kevin. Ingres Europe.
FYI, I have been running the job1 script here now for 17 hours and haven't reproduced it yet. If you set up something I can ssh into, I guess I can look at it any time today (CET timezone).
(In reply to comment #24)
> I have reproduced it on Intel running Suse Linux.

And what on earth does the suse glibc have to do with us? If these people cannot provide a stable glibc, that's no problem of ours.
(In reply to comment #27)
> (In reply to comment #24)
> > I have reproduced it on Intel running Suse Linux.
>
> And what on earth does the suse glibc have to do with us? If these people
> cannot provide a stable glibc, that's no problem of ours.

The problem occurs on both Suse and Redhat with remarkably similar symptoms. Does this mean that you think the problem is down to an unstable glibc?
(In reply to comment #28)
> The problem occurs on both Suse & Redhat with remarkably similar symptoms. Does
> this mean that you think the problem is down to an unstable glibc?

It means that a) you must never, ever report a problem seen in code shipped by other people here, and b) we cannot reproduce any problem.
(In reply to comment #20)
> Sorry, couldn't reproduce the problem during more than 2 hours of running the
> job1 script on 4way HT i686 (i.e. 8 logical CPUs).

Below are the equivalent details from the customer's Redhat system, on which the problem was first reported and recreates regularly:

uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
x86_64
2.6.9-22.0.2.ELsmp
glibc-2.3.4-2.9.x86_64
glibc-2.3.4-2.9.i686
Ulrich, I want this issue, which is an O/S problem that happens to occur on both Redhat & Suse, on Intel & AMD, fixed. If you have no constructive comments to make please leave the airwaves clear for those that do. Jakub is the owner of this issue. Please direct your comments through him. Kevin. Ingres Support
Created attachment 145480 [details] data from ps axwml and gdb Ok, here is the data I've captured (and still have the process under gdb, so can poke some more). Most of the processes are waiting on uninteresting condvars that never appear in other processes, the only interesting condvar seems to be 0xaca7055c and its associated 0xaca70544 mutex. Thread 10 is waiting on this cv, thread 9 (CPU eating one from ps axwml) is pthread_cond_signaling it, thread 8 (also CPU hungry) is pthread_mutex_unlocking the associated mutex. Does this match a little bit what you have been seeing?
(In reply to comment #32)
Hi Jakub, That sounds like it. Three threads. One in cond_wait, with the top of its stack like this:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00a6aa86 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0xf7e3bc99 in IICSMTp_semaphore (exclusive=0, sem=0xeb78c340) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmtsem.c:1078
#3  0xf78496c8 in build_pindex (ftx=0xea90c560) at /devsrc/ingresr30/smeke01/src/back/dmf/dmu/dm2uputi.c:545

The other 2 are moving rapidly (as described in the earlier attachment), locking the mutex, signalling the cond-var (to no effect, which is the problem), and unlocking the mutex. You will therefore see them with changing stacks, possibly with what look like corrupt stacks. Example:

#0  0x00a69260 in pthread_self () from /lib/tls/libpthread.so.0
#1  0xf7e3b598 in IICSMTp_semaphore (exclusive=1, sem=0xeb78c408) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmtsem.c:645
#2  0xf784974a in build_pindex (ftx=0xeb8f4560) at /devsrc/ingresr30/smeke01/src/back/dmf/dmu/dm2uputi.c:566

and:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00a6d13e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0x00a69d9b in _L_mutex_lock_32 () from /lib/tls/libpthread.so.0
#3  0x00000000 in ?? ()

The latter is the whole stack trace, so it looks as if either it's corrupt or gdb couldn't cope. Let me know if I can help further. Regards, Kevin Ingres Europe.
More info just on the 3 threads from #32:

Thread 10 (Thread 16968624 (LWP 25087)):
#0  0x004547a2 in _dl_sysinfo_int80 () at rtld.c:577
#1  0x0034ab26 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x00be8e36 in IICSMTp_semaphore () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#3  0x007c3633 in build_pindex () from /opt/Ingres/IngresII/ingres/lib/libdmf.1.so
#4  0x00130265 in scs_dbms_task () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#5  0x0013d794 in scs_sequencer () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#6  0x00be01c6 in CSMT_setup () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#7  0x00348371 in start_thread (arg=0x102ebb0) at pthread_create.c:261

(gdb) p *(pthread_cond_t *)0xaca7055c
$45 = {__data = {__lock = 1, __futex = 5866, __total_seq = 2933, __wakeup_seq = 2933, __woken_seq = 2932, __mutex = 0xaca70544, __nwaiters = 2, __broadcast_seq = 0},
  __size = "\001\000\000\000*\026\000\000u\v\000\000\000\000\000\000u\v\000\000\000\000\000\000t\v\000\000\000\000\000\000D\005**\002\000\000\000\000\000\000\000\000\000\000", __align = 25194278158337}

(gdb) p *(pthread_mutex_t *)0xaca70544
$46 = {__data = {__lock = 1, __count = 0, __owner = 25088, __kind = 0, __nusers = 2, __spins = 0},
  __size = "\001\000\000\000\000\000\000\000\000b\000\000\000\000\000\000\002\000\000\000\000\000\000", __align = 1}

This thread is waiting on __data.__futex.

(gdb) x/32w $esp
0x10244b0: 0x00000000 0x00000b74 0x00000000 0x00000000
0x10244c0: 0x00c5c2a0 0x00000000 0xaca70508 0x00be8e36
0x10244d0: 0xaca7055c 0xaca70544 0xaca70544 0x00be8939
0x10244e0: 0x00c5c2a0 0x0034a15f 0x01024538 0xaca7055c
0x10244f0: 0xaca70544 0x0102ebb0 0x01024578 0x00be8f87
0x1024500: 0x00000002 0xaca705a8 0x00009930 0x0102ebb0
0x1024510: 0x00002fa0 0x00000001 0x00000011 0x00000000
0x1024520: 0x00000001 0xacaf94e0 0xacaf94e0 0x00964e20

(i.e. the saved seq val is 2932, the saved bc_seq is 0).
Thread 9 (Thread 126991280 (LWP 25088)):
#0  pthread_cond_signal@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/pthread_cond_signal.S:61
#1  0x00be9221 in IICSMTv_semaphore () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#2  0x007c3715 in build_pindex () from /opt/Ingres/IngresII/ingres/lib/libdmf.1.so
#3  0x00130265 in scs_dbms_task () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#4  0x0013d794 in scs_sequencer () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#5  0x00be01c6 in CSMT_setup () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#6  0x00348371 in start_thread (arg=0x791bbb0) at pthread_create.c:261
#7  0x00a34ffe in clone () from /lib/tls/libc.so.6

(gdb) p *(pthread_cond_t *)0xaca7055c
$48 = {__data = {__lock = 1, __futex = 5866, __total_seq = 2933, __wakeup_seq = 2933, __woken_seq = 2932, __mutex = 0xaca70544, __nwaiters = 2, __broadcast_seq = 0},
  __size = "\001\000\000\000*\026\000\000u\v\000\000\000\000\000\000u\v\000\000\000\000\000\000t\v\000\000\000\000\000\000D\005**\002\000\000\000\000\000\000\000\000\000\000", __align = 25194278158337}

This thread is in pthread_cond_signal right after acquiring the cv's internal __data.__lock; it also holds the 0xaca70544 mutex.
Thread 8 (Thread 16313264 (LWP 25089)):
#0  0x0034a106 in __pthread_mutex_unlock (mutex=0xaca70544) at pthread_mutex_unlock.c:76
#1  0x007c3633 in build_pindex () from /opt/Ingres/IngresII/ingres/lib/libdmf.1.so
#2  0x00130265 in scs_dbms_task () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#3  0x0013d794 in scs_sequencer () from /opt/Ingres/IngresII/ingres/lib/libscf.1.so
#4  0x00be01c6 in CSMT_setup () from /opt/Ingres/IngresII/ingres/lib/libbecompat.1.so
#5  0x00348371 in start_thread (arg=0xf8ebb0) at pthread_create.c:261
#6  0x00a34ffe in clone () from /lib/tls/libc.so.6

__pthread_mutex_unlock is 1 insn before returning, so it doesn't hold the lock anymore.

The contents of the CV fields all look sane, so if it hasn't been woken up for longer than a few timeslices, the FUTEX_WAKE must have been lost. But wakeup_seq as well as futex has been incremented and the function left, so I don't see how this could be a glibc bug. I tried to reproduce this with:

#include <pthread.h>
#include <stdio.h>

pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void *tf1 (void *arg)
{
  int i;
  for (i = 0; i < 1000000000; i++)
    {
      pthread_mutex_lock (&mtx);
      pthread_cond_wait (&cond, &mtx);
      pthread_mutex_unlock (&mtx);
      if ((i & 0x7ffff) == 0)
        fputc ('.', stderr);
    }
  return NULL;
}

void *tf2 (void *arg)
{
  while (1)
    {
      pthread_mutex_lock (&mtx);
      pthread_cond_signal (&cond);
      pthread_mutex_unlock (&mtx);
    }
  return NULL;
}

int main (void)
{
  pthread_t th;
  pthread_create (&th, NULL, tf1, NULL);
  pthread_create (&th, NULL, tf2, NULL);
  pthread_create (&th, NULL, tf2, NULL);
  int i;
  while (1)
    for (i = 0; i < ~0; i++)
      asm volatile ("" : : "r" (i));
}

but that has not reproduced this for several hours. Have you tried a different kernel (e.g. RHEL5 (beta))?
If only pthread_cond_{wait,signal} and pthread_mutex_{lock,unlock} are involved here (it seems so; at least bc_seq is 0, so presumably no pthread_cond_broadcast calls have been made on this cv) and no pthread_cancel was involved, the code on the glibc side is fairly simple. The difference between the RHEL5 and RHEL4 glibc here is just in pthread_cond_signal, where RHEL5 optimizes by using FUTEX_WAKE_OP when available, only falling back to FUTEX_WAKE, while RHEL4 always uses FUTEX_WAKE.
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00a6d13e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0x00a69d9b in _L_mutex_lock_32 () from /lib/tls/libpthread.so.0
#3  0x00000000 in ?? ()

is expected on RHEL4 and fixed in RHEL5. _L_mutex_lock_* etc. are inline asm pads in separate subsections, and no unwind info was provided for them in RHEL4, because supporting it either needs GCC to emit .cfi_* directives everywhere (not even RHEL5 does that yet) or DW_CFA_val_* support and the complicated unwind info tricks used in RHEL5. In RHEL4 you'd need to finish the backtrace yourself by hand, looking at x/64w $esp dumps and at the disassembly, noting where on the stack the next return address is stored.
Hi Jakub, Many thanks indeed for the update. The central point of the issue is that the cond-var is being signalled but the thread cond-waiting on it is not being scheduled. Are the cond-signals raised correctly? Is the cond-var receiving the signals? If not, why not? If it is, why isn't the associated thread being scheduled? We have found that if we use pthread_cond_timedwait() the problem does not occur (and this is without the pthread_cond_timedwait() timing out). If we replace just the pthread_cond_timedwait() call, and nothing else, with pthread_cond_wait(), the problem occurs. Does the difference between the code paths for pthread_cond_timedwait() and pthread_cond_wait(), in terms of cond-signal handling and thread scheduling, give any clues as to what might be happening? Attaching with gdb and detaching usually results in the thread in cond-wait being scheduled (i.e. the cond-var has received a cond-signal). Does this symptom give any clues as to what is happening? We don't feel there's much to be gained by us installing the RHEL5 beta, for a number of reasons. In particular, our customers would not accept installing a new version of the OS on their production systems without being convinced that the problem had been diagnosed and the cure well understood. Regards, Kevin. Ingres Europe.
The reason for asking to test with a different (RHEL5) kernel (note: only the kernel, with unchanged userland) is to help understand where the problem is, i.e. whether we should focus on the differences between the RHEL4 and RHEL5 kernel futex code (there were many changes) or on something else. That's definitely not something we'd propose for customers. Detaching gdb from a process signals all threads with SIGCONT, and as the pthread_cond_wait process is waiting in interruptible sleep inside of futex (FUTEX_WAIT), this means the syscall returns with EINTR. But the glibc pthread_cond_wait code always checks the conditions for whether it should be woken, no matter what FUTEX_WAIT returns, and as (see above) all internal variables in the cv are correct, it is woken up. If you are able to reproduce the problem within minutes or at most a few hours, rather than days as I do here, could you please try the preload libs I'm going to attach? Each of them should be LD_PRELOADed into the iidbms process, and overrides just the pthread_cond_wait function and nothing else. Guess

/etc/rc.d/init.d/ingressII stop
LD_PRELOAD=/tmp/cvwait1.so /etc/rc.d/init.d/ingressII start

etc. could do the trick (easily verifiable by cat /proc/`pidof iidbms`/maps | grep cvwait).
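For readers unfamiliar with preload interposers, the general shape of such a library is shown below. This is a generic sketch, not the actual attached cvwait code: an LD_PRELOADed object that defines its own pthread_cond_wait and chains to the real one found via dlsym(RTLD_NEXT, ...).

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>

/* Generic LD_PRELOAD interposer: this definition preempts glibc's
   pthread_cond_wait; RTLD_NEXT locates the next (real) definition in
   dynamic-linker search order so we can forward to it. */
typedef int (*cond_wait_fn)(pthread_cond_t *, pthread_mutex_t *);

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
    static cond_wait_fn real;
    if (!real)
        real = (cond_wait_fn) dlsym(RTLD_NEXT, "pthread_cond_wait");
    /* Instrumentation or an alternate wait implementation goes here. */
    return real(cond, mutex);
}
```

Built with something like `gcc -shared -fPIC -o cvwait.so cvwait.c -ldl` and activated via `LD_PRELOAD=./cvwait.so`, this lets one swap wait implementations in a running product without relinking it, which is exactly how the cvwait1/2/3 experiments below isolate the wait path.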
Created attachment 145684 [details] cvwait.tar.bz2 I'm interested to hear which from these (if any) reproduce the problem and which (if any) don't.
(In reply to comment #38)
> I'm interested to hear which from these (if any) reproduce the problem and
> which (if any) don't.

Jakub, I have some preliminary results from 4 runs:

1. cvwait1.so: left for over an hour; did not recreate.
2. Control (LD_PRELOAD unset): recreated within a few minutes.
3. cvwait2.so: left for over an hour; did not recreate.
4. cvwait3.so: recreated after 40 minutes.

I will run cvwait1.so and cvwait2.so for longer periods, starting with cvwait1.so overnight tonight. Let me know if you want more diagnostics, different tests etc. Regards, Kevin. Ingres Europe.
Jakub, The overnight run with cvwait1.so (i.e. the 2nd time of running it) reproduced the stall. It actually stalled within 25 minutes of starting the test, but I hadn't realised. I've attached thread stack dumps for yesterday's stall with cvwait3.so and this stall with cvwait1.so. I am running cvwait2.so for a longer period to see if it will reproduce eventually. Regards, Kevin. Ingres Europe.
Created attachment 145993 [details] Thread stackdump for cvwait3.so stall
Created attachment 145994 [details] Thread stackdump for cvwait1.so stall
Created attachment 145995 [details] Thread stackdump for cvwait1.so stall
cvwait1.so was a control test; it contains the same assembly as pthread_cond_wait in libpthread.so, so I was surprised at first to see that it didn't reproduce it. cvwait2.so is the pthread_cond_timedwait assembly named pthread_cond_wait and slightly changed, so that it assumes the maximum possible timeout. cvwait3.so is the C version of pthread_cond_wait (used on powerpc, ia64 and most other arches, except i386, x86_64 and sh).
(In reply to comment #44) Jakub, I now have a run with cvwait2.so stalling. It took approx 22 hours to stall (started run on Fri Jan 19 at 11:11 am and last entry in log was Sat Jan 20 09:41). I have uploaded the stackdump. Let me know if there are any further tests I can run. Regards, Kevin. Ingres Europe.
Created attachment 146161 [details] Thread stackdump for cvwait2.so stall
Created attachment 146162 [details] Thread stackdump for cvwait2.so stall
On Tue, 2007-02-06 at 14:16 -0500, Jakub Jelinek wrote:
> I'd say the biggest help would be to try a newer kernel on a machine where
> this is reproducible quickly. Say, if you have some spare partition,
> install Fedora Core 6 or RHEL5 there, then when it boots chroot into
> the RHEL4 partition, mount the needed filesystems (/proc, /dev/pts,
> /dev/shm, ...), and see if you can reproduce it there.
> I know it sounds like a wild shot, but anything that helps to decrease the
> amount of involved suspected code is important.

Jakub, I have the results of some tests comparing RHEL4 and RHEL5. I first ran the recreation test on a 4-processor RHEL4 box. The bug recreated immediately, on the first iteration. I then had the box upgraded to RHEL5 and ran the recreation test twice: neither run recreated the bug. The first run ran overnight (approx 18 hours) before I stopped it. The second ran for an hour before I stopped it (the machine was required for QA work). The RHEL5 environment is:

$ uname -m; uname -r; rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' glibc
i686
2.6.18-1.2747.el5PAE
glibc-2.5-4.i686

Regards, Kevin.
For the RHEL5 run, you used the same binary but the RHEL5 runtime (libc, libpthread)? Can you try again and copy over the RHEL4 ld.so, libc, and libpthread (and other libraries you need)? Copy them all into the same directory somewhere (not overwriting system code), and then go into that directory. There run ./ld-linux.so.2 --library-path . ./the-binary If this works fine then it's exclusively a kernel issue.
Or alternatively, install RHEL5 on a different partition than RHEL4, boot into RHEL5, and chroot into the RHEL4 partition after mounting the few needed filesystems. The #c51 results suggest that it might be a RHEL4 kernel bug fixed in the RHEL5 kernel, but before reassigning to the kernel team to look at the futex.c differences between the two kernels, we really need to rule out the possibility of userland differences (for pthread_cond_* they are very minor).
----- Additional Comments From cjt.com 2007-02-22 16:24 EDT -------
I ran the ingres testcase in a chroot to RHEL4 under a RHEL5 kernel and did not experience a lockup for 3 days, whereas my test system would typically exhibit the lockup in less than 24 hours when booted on RHEL4. Under RHEL4 I grabbed a system dump with the ingres testcase locked up, but the dump shows all iidbms processes had recently run. I believe the starved iidbms thread may be caused by assuming that calling sched_yield() *will* result in a new process obtaining a mutex. A rather thorough discussion in LTC Bugzilla 19718 describes how sched_yield() does not inherently result in a new process winning a mutex fight. From Comment #16 https://bugzilla.linux.ibm.com/show_bug.cgi?id=19718#c16:

============
If a task is the only task assigned to a processor, and it's told to remove itself and go to the end of the runqueue, it will have no effect. Remember that there is a runqueue *per processor* in the O(1) scheduler. Specifically, it will remove itself, find somebody to replace itself, find that that somebody is itself, observe that they are already running on this very processor, and realize nothing further needs to be done to get this "new" process running. A few pointers get changed but the sched_yield call will return without even slowing down -- which is correct behavior.
=============

I ran a few preliminary tests to convince myself this is what is happening in the ingres testcase.

1) Using the thread_sched_test.c attachment in bugzilla 19718, with a modification to give it 3 reader threads on my 4-processor system, I saw the threads starved of mutex access on occasion. (Much like the ingres testcase reproduction steps indicate, if it doesn't happen in the first couple of iterations it's going to take a very long time.)

2) With the thread_sched_test.c test case, on RHEL5 as well as in a RHEL4 chroot under a RHEL5 kernel, this phenomenon didn't happen.
3) I intend to run the ingres testcase until I get a locked-up iteration; if I can unlock it by cycling a CPU-intensive task through the CPUs using taskset, this hypothesis would seem to be proven, and Ingres will need to try recoding to obtain the behavior they desire.
All, To summarise our understanding of the issue to date:

1. In the case as initially reported at RHEL4 the situation is as follows. We have 3 threads. One thread is in pthread_cond_wait on a condition variable. The other 2 threads never call pthread_cond_wait and are never in pthread_cond_wait. They cycle round calling pthread_mutex_lock and pthread_mutex_unlock on the mutex associated with the condition variable, and calling pthread_cond_signal on the condition variable. The question here is: why does the one thread that is waiting on the condition variable not come out of cond_wait and start running, even though it is firstly the only thread waiting on that condition variable, and secondly it is cond_signalled many, many times? At the request of the Redhat engineer, we ran a number of tests involving combinations of RHEL4 and RHEL5.

2. The testcase did not recreate on the same machine-type as the original report, upgraded to RHEL5. The testcase also did not reproduce on the same machine-type upgraded to RHEL5, but using RHEL4 libraries. The implication is that there is a difference in the way the RHEL4 and RHEL5 kernels behave, such that the problem described in (1) above occurs with the RHEL4 kernel but not with the RHEL5 kernel. This reinforces our contention that this is an OS issue, though perhaps gives Redhat some clues as to where to look in order to resolve this at an OS level in RHEL4.

Please feel free to contact us at Ingres on the email address above if you have any questions. Regards, Kevin. Ingres Europe.
----- Additional Comments From nevdull.com (prefers email at ricklind.com) 2007-02-27 20:26 EDT -------
I suspect that we are still seeing the scenario I described in the previous bug. If process A has gone to sleep waiting on some resource (condition variable) then it must be awakened and, once awakened, make its way to a processor, and then find the resource still available. If process B (the process giving up the condition variable) gives it up and takes it back too soon, A may receive a signal, may be queued to run, and may run ... only to find that the desired resource is busy again. Ironically, if they were both scheduled on the same CPU, the desired behavior would likely happen, since it's likely that the awakened process will have a higher priority than the one that is trying to yield the processor. But scheduling two processes on the same processor is "idiotic" if you have free processors, so without some artificial constraints that would not happen.

So, a question. If we show that the above is precisely what is happening, do we agree the problem is with the locking model and not with the kernel? Regardless, we've yet to demonstrate clearly that the above is happening. Do we have evidence that A is in fact receiving a signal, responding to it, and finding that the variable is again busy? If so, then we should see a change in behavior if we use sched_setaffinity() so that the holder (B) and the target (A) can each only run on the same CPU. That way "being made runnable" will still happen, but the processor that needs to be notified and needs to reevaluate its runqueue is the same one that is willing to suspend B and look for another process. It should find A ready to run and run it instead of B. I presume the same would happen if you ran on a uniproc (single processor). A more invasive method of doing the same would be to modify the code in sys_sched_yield() to record who came through and what state they found the runqueue in. I think what we'd find is that there is nobody to yield to (the awakened process is scheduled on another processor, but that other processor has not yet started it running) and so it just keeps going. Adding the debugging would be a simple modification, but the previous suggestions have the advantage of not requiring recompilation of the kernel.
I've just begun investigating this problem. The diff between the RHEL4 and RHEL5 versions of kernel/futex.c is >1500 lines, which includes a significant Priority Inheritance enhancement.
That's true, but neither these condvars nor the mutexes are PTHREAD_PRIO_INHERIT or PTHREAD_PRIO_PROTECT, nor are they robust. If you manage to reproduce it somewhere, you might, after trying RHEL4 and RHEL5 kernels with RHEL4 userland, also try FC4/FC5/FC6 GA kernels and thus narrow the search for the differences. Regarding comment #54, do you have any evidence that pthread_yield (the only function in NPTL that calls sched_yield) or sched_yield itself is used?
(In reply to comment #58) > Regarding comment #54, do you have any evidence that pthread_yield (the only > function in NPTL that calls sched_yield) or sched_yield itself is used? I can confirm that we don't use pthread_yield or sched_yield. Regards, Kevin Ingres Europe.
(In reply to comment #56) > ----- Additional Comments From nevdull.com (prefers email at ricklind.com) 2007-02-27 20:26 EDT ------- > A more invasive method of doing the same would be to modify the code in > sys_sched_yield() to record who came through and what state they found the > runqueue in. There is a chance systemtap may be of use here. If you have the tool and its prerequisites (kernel-devel and kernel-debuginfo for your version), you can trace inside sys_sched_yield(). For 2.6.9-42.0.3.EL, this little script inserts a probe just after the dequeue/enqueue statement pair:

probe kernel.statement("*@kernel/sched.c:3641")
{
	printf("%d %d %d\n", pid(), $rq->nr_running, $target)
}

(One can print other values derived from local pointers; add conditions like process id or execname; collect historical data in arrays instead of printing them out; ...) I'd be happy to help construct a script to help with this problem - I'd just need to know what values/conditions/reports are interesting.
----- Additional Comments From nevdull.com (prefers email at ricklind.com) 2007-03-06 02:27 EDT ------- Kevin said: I can confirm that we don't use pthread_yield or sched_yield. Okay, thanks. An earlier comment suggested that sched_yield() was being used, which took us down that path. I still have a gut feel that we may be grabbing a mutex back before the signalled thread has a chance to acquire it, but without some code to back it up it's only that -- a gut feel. A useful step while the other testing is going on would be to grab the RHEL4-based code for pthread_mutex_unlock() and pthread_cond_wait() and see what series of calls are made to the kernel in implementing those. The comment from petrides suggests this may already be happening if we're inspecting futex code. I'm pretty familiar with things in the kernel but not so much with the thread libraries, so that study may prove useful. Re: systemtap -- might be useful, but I'd be careful about issuing too many printfs. Introducing output can slow some sections of code down an order of magnitude or more, and by dramatically changing the timings, you can inadvertently make the event you're trying to observe simply go away.
----- Additional Comments From cjt.com 2007-03-06 10:17 EDT ------- Since I was the one who brought up sched_yield, I'd like to put closure to that topic by saying that a jprobe on sys_sched_yield(), restricted to current->comm == "iidbms", confirmed Ingres' statements and produced no results. (Not from the start of job1 to the point of the iidbms hang.) The jprobe.entry was:

asmlinkage long jdo_sched_yield(void)
{
	extern char *jprobe_comm;

	if (current && (strcmp(current->comm, jprobe_comm) == 0)) {
		printk("sched_yield from process(%d:%s)..... ",
		       current->pid, current->comm);
		dump_stack();
	}
	jprobe_return();
	return 0;
}
Sorry, I'm just getting up to speed on this problem (and I've been distracted for a couple of days working on a RHEL3 kernel HotFix).

Anyway, as a follow-up to Kevin's comment #5 regarding test results from the pthread_cond_timedwait() experiment, the kernel's handling of a futex() syscall for the FUTEX_WAIT op is nearly identical with or without a timeout (i.e., whether or not a NULL (struct timespec *) is passed as the 4th arg). Internally, the kernel *always* sets up a timeout, and if user-space passes NULL, the kernel selects a timeout value of LONG_MAX clock ticks (i.e., a large number of years). The only difference in the kernel is that a non-NULL timespec pointer causes a copy of the timespec struct from user-space and a conversion of its fields into a scalar number of clock ticks. The remainder of FUTEX_WAIT processing is absolutely identical.

Assuming that the only difference between the library implementations of pthread_cond_wait() and pthread_cond_timedwait() is that the former passes a NULL 4th arg to futex(), this suggests that the lost-wakeup race is either timing-dependent *or* related to memory coherence (e.g., a missing memory barrier in either the kernel or the pthread library).

I'm loading the test case onto a 4-processor test machine in Westford, although it's actually got two dual-hyperthreaded cpu packages. I'm also currently running Jakub's little test program from comment #34, but it hasn't reproduced the hang in its first 15 minutes of runtime.

I just now noticed that the 32-bit kernel is being used on 64-bit hardware, which is not what's on my test machine. I doubt that's relevant, but I guess I should at least go look at the kernel's 32-bit-compat code for futex() syscall handling.
Please ignore the last sentence in my comment above. The 32-bit-compat code is only relevant for a 32-bit executable running on a 64-bit kernel, which is not the environment indicated in this bug report.
----- Additional Comments From cjt.com 2007-03-09 20:31 EDT ------- We have been monitoring the futex usage on a RHEL4 system in the stalled iidbms state. What we found was one iidbms thread waiting on a futex which never gets awoken. Monitoring the futex_wake() calls, we never find that uaddr being passed. Can Ingres provide information on how to identify the "stalled" reader without using gdb? =- Curt
Jakub, why isn't there a memory barrier in __pthread_cond_wait() in between the sampling of cond->__data.__futex (line 140 of RHEL4 version of pthread_cond_wait.c) and the call to lll_mutex_unlock() 3 lines later? Shouldn't this synchronize with another (missing) memory barrier in __pthread_cond_signal() just after the increment of cond->__data.__futex (on line 43 of pthread_cond_signal.c)? Without this synchronization, isn't it possible that the update of the futex value in a waker thread would be sampled by a racing waiter thread, thus causing the futex() syscall to incorrectly wait on the advanced value?
(In reply to comment #66) > ----- Additional Comments From cjt.com 2007-03-09 20:31 EDT ------- > Can Ingres provide information on how to identify the "stalled" reader without > using gdb? > Apologies for answering with a question, but what options for identification do you have in mind? Regards, Kevin Ingres Europe
----- Additional Comments From washer.com 2007-03-10 13:57 EDT ------- Answering the question as to how to identify the threads: is there some unique function on the stack of the various threads? For example, does the name of the function first called from pthread_create() differ amongst the various threads?

Adding a bit more to what Curtis wrote: when the hang occurs, we see a couple of threads doing LOTS of wakeups on mutexes for which there are no sleepers, and we see one thread that is waiting on a different mutex. Is it possible that the wrong mutex is being signalled?

Since we don't have access (that I'm aware of) to Ingres code, it's difficult for us to know WHICH thread the process is waiting (hanging) on. How can we know for sure which thread needs to be awoken for things to continue?
(In reply to comment #69) Identifying the threads: A typical stack dump of the 3 threads involved looks like this: Thread 10 (Thread 3952077744 (LWP 6590)): #0 0x00a69260 in pthread_self () from /lib/tls/libpthread.so.0 #1 0xf7e3b598 in IICSMTp_semaphore (exclusive=1, sem=0xeb78c408) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmtsem.c:645 #2 0xf784974a in build_pindex (ftx=0xeb8f4560) at /devsrc/ingresr30/smeke01/src/back/dmf/dmu/dm2uputi.c:566 #3 0xf7dc3265 in scs_dbms_task () from /patchtest/ingres/lib/libscf.1.so #4 0xf7dd0794 in scs_sequencer () from /patchtest/ingres/lib/libscf.1.so #5 0xf7e2fe26 in CSMT_setup (parm=0xea88fe80) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmthl.c:983 #6 0x00a68341 in start_thread () from /lib/tls/libpthread.so.0 #7 0x008fcfee in clone () from /lib/tls/libc.so.6 Thread 9 (Thread 3935398832 (LWP 6591)): #0 0xffffe410 in __kernel_vsyscall () #1 0x00a6aa86 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0 #2 0xf7e3bc99 in IICSMTp_semaphore (exclusive=0, sem=0xeb78c340) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmtsem.c:1078 #3 0xf78496c8 in build_pindex (ftx=0xea90c560) at /devsrc/ingresr30/smeke01/src/back/dmf/dmu/dm2uputi.c:545 #4 0xf7dc3265 in scs_dbms_task () from /patchtest/ingres/lib/libscf.1.so #5 0xf7dd0794 in scs_sequencer () from /patchtest/ingres/lib/libscf.1.so #6 0xf7e2fe26 in CSMT_setup (parm=0xea8918c0) at /devsrc/ingresr30/smeke01/src/cl/clf/csmt_unix/csmthl.c:983 #7 0x00a68341 in start_thread () from /lib/tls/libpthread.so.0 #8 0x008fcfee in clone () from /lib/tls/libc.so.6 Thread 8 (Thread 3934305200 (LWP 6592)): #0 0xffffe410 in __kernel_vsyscall () #1 0x00a6d13e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 #2 0x00a69d9b in _L_mutex_lock_32 () from /lib/tls/libpthread.so.0 #3 0x00000000 in ?? () 1 thread stack will be unchanging, having called pthread_cond_wait() (thread 9 in the example above). The other 2 will be moving. 
The stack will be the same for the 3 threads as follows up to and including the build_pindex() call: clone() -> start_thread() -> CSMT_setup() -> scs_sequencer() -> scs_dbms_task() -> build_pindex(). [However gdb seems often to have difficulty printing out the stack as you can see in thread 8 above]. build_pindex() calls the IICSMT_x ingres functions that in turn call the pthread_x functions. In the source the calls to pthread_x functions appear as platform-dependent macros. On Linux these are a straight substitution: I've supplied Jakub with a complete Ingres install and the source code for the relevant sections via our ftp site. I will send you details of this separately. The complete source code for Ingres is available on-line at www.ingres.com, though not the exact version our customer is using. If it is necessary I can supply you with the exact source code the customer is using, via our ftp site. I hope this helps, Regards, Kevin Ingres Europe.
Re: #c67 The lll_mutex_{,un}lock macros are synchronization barriers themselves; e.g. on i?86 they are asm volatile instructions touching memory with a lock prefix and a "memory" clobber, so no memory loads/stores should be reordered (neither by GCC nor by the CPU itself) across them. Also, i386/x86_64 use pthread_cond_{wait,timedwait,signal,broadcast} written in assembly rather than the .c versions; see nptl/sysdeps/unix/sysv/linux/i386/i486/*.S.

Re: #c64 Kevin has been able to reproduce the hangs even with pthread_cond_timedwait, see comments #c38 through #c47.
(In reply to comment #71) > Re: #c64 > Kevin has been able to reproduce the hangs even with pthread_cond_timedwait, see > comments #c38 through #c47. Jakub - I found that pthread_cond_timedwait didn't seem to suffer from the stall (see comment #36). I tested this by doing runs with pthread_cond_timedwait, with the extra code added to set a time value. I then did a test where everything was the same, including the (technically redundant) extra code for the time value, except that pthread_cond_timedwait() was replaced with pthread_cond_wait(). The results were that pthread_cond_timedwait() never produced the stall, even after several attempts over both short runs and long overnight periods. The code with pthread_cond_wait(), however, reproduced the stall on the first run, within 10 minutes. Regards, Kevin Ingres Europe.
In comment #c45 you said you reproduced the stall with LD_PRELOAD=cvwait2.so after 22 hours, where cvwait2.so is pthread_cond_timedwait assembly routine with pthread_cond_wait entry point that just uses maximum possible abstime.
(In reply to comment #73) > In comment #c45 you said you reproduced the stall with LD_PRELOAD=cvwait2.so > after 22 hours, where cvwait2.so is pthread_cond_timedwait assembly routine > with pthread_cond_wait entry point that just uses maximum possible abstime. Those were your diagnostics so you would know, I just ran the tests and reported the results. When I tried pthread_cond_timedwait() in our code it never reproduced the stall. When we tried pthread_cond_timedwait() on the customer's system as an attempt to produce more diagnostics we found the problem went away, and he has since been running with this for weeks as a temporary measure on his test machine without getting the stall. Kevin Ingres Europe.
----- Additional Comments From cjt.com 2007-03-12 18:00 EDT ------- In a given reproduction case we have these iidbms processes in futex_wait() PID futex_q.key === =========== 11848 0x00000003,f632f24c,0xa99 11840 0x00000003,f632f24c,0x7d9 11841 0xaabb2000,f7696080,0x1d0 11845 0x00000003,f632f24c,0x991 11849 0x00000003,f632f24c,0xaf1 None of those process have run following the stall. Apparently 11841 would be the stalled reader since the other's have futex_q keys in shared memory(???) Meanwhile an strace reveals pids 11842 and 11843 active calling sys_futex with uaddr=0xaabb21b4 and 0xaabb227c. (A jprobe on futex_wake reveals these are the only significant processes using FUTEX_WAKE and that aabb21b4 and aabb227c are the only uaddr's being used.) crash> bt -f 11841 PID: 11841 TASK: dfc98330 CPU: 2 COMMAND: "iidbms" #0 [f4ee7e54] schedule at c02d25c9 [RA: c02d2e55 SP: f4ee7e54 FP: f4ee7eb8 SIZE: 104] f4ee7e54: f4ee7eb4 00000082 00000012 00000046 f4ee7e64: f35f39b0 f35f39b0 00000019 c4a67de0 f4ee7e74: f35f39b0 00000000 c4a60740 c4a5fde0 f4ee7e84: 00000002 00000000 8bf46740 000f42f9 f4ee7e94: f35f39b0 dfc98330 dfc9849c 00000008 f4ee7ea4: f633b3b0 00000000 7fffffff f4ee7000 f4ee7eb4: f4ee7f00 c02d2e55 #1 [f4ee7eb8] schedule_timeout at c02d2e50 [RA: c01351dd SP: f4ee7ebc FP: f4ee7ef4 SIZE: 60] f4ee7ebc: 0000001b 00008900 aabb2000 aabb2000 f4ee7ecc: c01506e4 aabb2000 f4ee7f1c 00000246 f4ee7edc: f4ee7f50 f4ee7f08 c012038d 00000000 f4ee7eec: 00000000 00000000 c01351dd #2 [f4ee7ef4] futex_wait at c01351d8 [RA: c013548e SP: f4ee7ef8 FP: f4ee7f80 SIZE: 140] f4ee7ef8: 7fffffff 00066e51 c0439ea8 c0439ea8 f4ee7f08: 00000001 dead4ead f4ee7f5c f4ee7f5c f4ee7f18: c0439e9c aabb2000 f7696080 000001d0 f4ee7f28: ffffffff 00000000 00000000 dfc98330 f4ee7f38: c011e794 00000000 00000000 00000000 f4ee7f48: f4ee7f60 f76960b0 00000000 dfc98330 f4ee7f58: c011e794 f4ee7f10 f4ee7f10 f7696080 f4ee7f68: 000001b4 c02d50c0 aabb21d0 00000000 f4ee7f78: 00000000 f4ee7000 c013548e #3 [f4ee7f80] 
do_futex at c0135489 [RA: c01355c0 SP: f4ee7f84 FP: f4ee7f90 SIZE: 16] f4ee7f84: aabb21d0 00000000 00066e51 c01355c0 #4 [f4ee7f90] sys_futex at c01355bb [RA: c02d4703 SP: f4ee7f94 FP: f4ee7fc0 SIZE: 48] f4ee7f94: 7fffffff 00066e51 00000000 00b6b538 f4ee7fa4: 00000000 c03269a0 00000000 aabb21d0 f4ee7fb4: 00000000 00066e51 f4ee7000 c02d4703 #5 [f4ee7fc0] system_call at c02d46fc EAX: 000000f0 EBX: aabb21d0 ECX: 00000000 EDX: 00066e51 DS: 007b ESI: 00000000 ES: 007b EDI: 00066e51 SS: 007b ESP: 00b6b4ac EBP: 00b6b538 CS: 0073 EIP: 008da7a2 ERR: 000000f0 EFLAGS: 00000292 [RA: 8da7a2 SP: f4ee7fc4 FP: f4ee7ffc SIZE: 60] f4ee7fc4: aabb21d0 00000000 00066e51 00000000 f4ee7fd4: 00066e51 00b6b538 000000f0 0000007b f4ee7fe4: 0000007b 000000f0 008da7a2 00000073 f4ee7ff4: 00000292 00b6b4ac 0000007b sys_futex(uaddr=0xaabb21d0, op=0, val=0x66e51 ....) crash> vtop -c 11841 aabb21d0 VIRTUAL PHYSICAL aabb21d0 14426b1d0 crash> rd -p 14426b1b0 64 14426b1b0: 4d455349 00000000 00000000 00000000 ISEM............ 14426b1c0: 00000000 00000001 00000000 00000000 ................ 14426b1d0: 00066e52 00033729 00000000 00033729 Rn..)7......)7.. 14426b1e0: 00000000 00033728 00000000 aabb21b4 ....(7.......!.. 14426b1f0: 00000002 00000000 00000000 00000000 ................ That's the pthread_cond_t with __futex=66e52, __total_seq = 33729, __wakeup_seq = 33729, __woken_seq = 33728, __mutex = 0xaabb21b4, __nwaitiers = 2 ... What I'm hoping RH can explain is why the pthread_cond_t's __mutex is the FUTEX_WAKE uaddr for the active iidbms pids. =- Curt
Just clarifying the above a bit: in comment #34 above, we see gdb traces showing a print of a pthread_cond_t as such:

(gdb) p *(pthread_cond_t *)0xaca7055c
$45 = {__data = {__lock = 1, __futex = 5866, __total_seq = 2933, __wakeup_seq = 2933, __woken_seq = 2932, __mutex = 0xaca70544, __nwaiters = 2, __broadcast_seq = 0}, __size = "\001\000\000\000*\026\000\000u\v\000\000\000\000\000\000u\v\000\000\000\000\000\000t\v\000\000\000\000\000\000D\005**\002\000\000\000\000\000\000\000\000\000\000", __align = 25194278158337}
(gdb) p *(pthread_mutex_t *)0xaca70544
$46 = {__data = {__lock = 1, __count = 0, __owner = 25088, __kind = 0, __nusers = 2, __spins = 0}, __size = "\001\000\000\000\000\000\000\000\000b\000\000\000\000\000\000\002\000\000\000\000\000\000", __align = 1}

Note the futex as the second element of the cond_t, and further in the cond_t we see the pointer to the pthread_mutex. Now, in the hang that Curtis reproduced we see a thread sleeping (waiting) on a futex. If one looks at the data in user space using the crash tool, one sees a cond_t structure like the one mentioned in comment #34. Again, we see the pointer to the associated pthread_mutex_t. Everything looks good, as we see that this thread called futex_wait with the futex contained within the user-space cond_t.

Now... what's disturbing here is that we see the other two threads calling futex_wake with the address of the pthread_mutex, rather than the address of the FUTEX contained within the cond_t. This seems rather odd... userland is calling futex_wake with a mutex??? Can someone from Ingres and/or RH LIBC folks tell us why (or where we did something wrong in our analysis)?
(In reply to comment #76) > > This seems rather odd... userland is calling futex_wake with a mutex??? > > Can someone from Ingres and/or RH LIBC folks tell us why (or where we did > something wrong in our analysis)? Think this is for RH LIBC analysis. I can confirm that in our code we call pthread_cond_signal() with the condition_variable as the parameter. Kevin Ingres Europe.
Jakub, should I bounce this back over to the "glibc" component? I'm having trouble analyzing the assembly language code in nptl/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_*.S. After reading comment #75 and comment #76, I'm beginning to suspect that this is not a kernel problem anyway.
Please keep it under kernel; we have analyzed the assembly several times and it is correct.

To answer comment #75 and comment #76: condvars use 3 addresses for futexes - the cv's __data.__lock, the cv's __data.__futex and the mutex's __data.__lock. The cv's __data.__lock is a short-lived mutex which guards changes to the other cv fields; it uses the values 0/1/2 and on an uncontended cv is handled purely in userland. 0 means not locked, 1 locked with no waiters and 2 locked with possibly some waiters waiting till it is unlocked. The mutex's __data.__lock works similarly, though it depends upon the application how short-lived it is. If I remember well, iidbms, at least on the analyzed stalls, calls pthread_cond_signal with the associated mutex held. The cv's __data.__futex is a counter (it should always contain (int) (__data.__total_seq + __data.__wakeup_seq)); pthread_cond_*wait blocks on it with FUTEX_WAIT and pthread_cond_signal FUTEX_WAKEs it (pthread_cond_broadcast isn't involved here, so I won't describe it).

Around comment #40 you can see that the stall has been reproduced even with a C version of pthread_cond_wait, so you can perhaps read that, as it is more readable than the assembly version, if you want to understand what it is doing. But IMHO focusing on userland is the wrong thing now: we know a RHEL5 kernel fixes this problem and we have spent considerable time checking the userland code. So I'd say the best way forward is to narrow down the kernel futex.c changes that might be relevant (try FC4/FC5/FC6 GA kernels).
Created attachment 150237 [details] kernel-2.6.9-futex-queue_me-change.patch
----- Additional Comments From cjt.com 2007-03-16 11:18 EDT ------- backport of queue_me changes from 2.6.18-8 futex.c
----- Additional Comments From cjt.com 2007-03-16 11:21 EDT ------- Making the queue_me/lock changes for futex_wait found in the rhel5 kernel have fixed the problem on my system. These changes are basically a derivative of the patch from http://lkml.org/lkml/2005/3/17/37. The exact patch I used is now attached to this bugzilla and applies to the 2.6.9-42 kernel. Please test it and report the results.
I will follow up on all this next week (after finishing work on the RHEL3 U9 kernel). Thanks for all the help.
I can confirm that the patch in comment #81 correctly brings the RHEL4 version of futex_wait() and its subordinate functions in line with what's in RHEL5 (and also with what's upstream in 2.6.20). I might make a few non-functional changes to reduce the extent of the diffs, but overall it looks good to me. I will try to develop a small isolated reproducer based on knowledge of what the fixes address; I'd prefer to have this in hand to validate the changes.
Created attachment 152587 [details] test program for trying to reproduce pthread_cond_wait() hangs This is the test program that I've been using to attempt to reproduce hangs under pthread_cond_wait() by trying to trigger the relevant race condition in the kernel's handling of futex()/FUTEX_WAIT syscalls. Unfortunately, on 3 different test systems (i386/4-cpu, x86_64/4-cpu, x86_64/8-cpu w/o hyperthreading), I was unable to reproduce the hang in ~24 hours of test runs.
The patch below was posted for internal review on 13-Apr-2007.
Created attachment 152592 [details] This is the patch under consideration for RHEL4.6.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
(In reply to comment #96) > This request was evaluated by Red Hat Kernel Team for inclusion in a Red > Hat Enterprise Linux maintenance release, and has moved to bugzilla > status POST. Is the maintenance release containing the fix for bug 217067 now generally available? Regards, Kevin. Ingres Europe.
The fix for this problem has been proposed for RHEL4.6 (and has not yet been committed to CVS). What was released earlier this week was RHEL4.5.
committed in stream U6 build 55.2. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
Created attachment 161087 [details] kernel-2.6.9-futex-queue_me-change.patch
Created attachment 161088 [details] This is the patch under consideration for RHEL4.6.
Created attachment 161089 [details] data from ps axwml and gdb
Created attachment 161090 [details] Thread stackdump for cvwait1.so stall
Created attachment 161091 [details] cvwait.tar.bz2
Created attachment 161092 [details] test program for trying to reproduce pthread_cond_wait() hangs
Created attachment 161093 [details] Ingres & Linux description of the problem
Created attachment 161094 [details] Thread stackdump for cvwait3.so stall
Created attachment 161095 [details] Thread stackdump for cvwait2.so stall
A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com). Requested action: Please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
------- Comment From cjt.com 2007-08-31 18:56 EDT------- I ran the 2.6.9-56 kernel from the RHEL4 U6 beta for 2 days without any occurrence of the problem on the customer's test case. So, this problem appears to be fixed in the U6 beta. Someone with the proper permissions will need to update this to "PartnerVerified".
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html