Bug 139400 - (IT_49450) [RHEL3] gdb gets confused when threads deadlock
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: gdb
Version: 3.0
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Elena Zannoni
QA Contact: Jay Turner
Depends On:
Blocks: 132991 146413
Reported: 2004-11-15 15:02 EST by Elena Zannoni
Modified: 2015-01-07 19:08 EST

Doc Type: Bug Fix
Last Closed: 2005-01-26 10:18:52 EST

Attachments: None

Description Elena Zannoni 2004-11-15 15:02:43 EST
When two threads deadlock gdb appears unable to display the stack of a
deadlocking thread.

A backtrace of the thread contains the following message: 

Previous frame identical to this frame (corrupt stack?)
 
The example source code (see attached) isn't perfect: it displays most
of the stack of the locking thread before giving the error. I've seen
many examples where none of the stack is displayed.
 
To see the problem:
1) Compile the example: "g++ dl.C -o dl -lpthread"
2) "gdb dl"
3) "run"
4) wait a couple of seconds then press Ctrl-C
5) "thread 2"
6) "where"
----------
Action by: jrfuller
I followed the above instructions and get an error, but I am not sure
if the methodology is valid.

Here is the gdb output (note the SIGINT is my ctrl-c):

(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/juan/dl
[Thread debugging using libthread_db enabled]
[New Thread -1218549504 (LWP 25330)]
[New Thread 26778544 (LWP 25337)]
 
Program received signal SIGINT, Interrupt.
[Switching to Thread -1218549504 (LWP 25330)]
0x00ae56e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 26778544 (LWP 25337))]#0  0x00ae56e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) where
#0  0x00ae56e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1  0x00ae2797 in _L_mutex_lock_28 () from /lib/tls/libpthread.so.0
#2  0x00ae9a3c in __JCR_LIST__ () from /lib/tls/libpthread.so.0
#3  0x01989bb0 in ?? ()
#4  0x01989a98 in ?? ()
#5  0x080485c4 in thread_fn1 () at dl.c:13
Previous frame identical to this frame (corrupt stack?)
(gdb)

Escalating for assessment.

J


Issue escalated to Support Engineering Group by: jrfuller.

----------
Action by: gavin
I have reproduced exactly the behavior JohnRay has on an up2date RHEL3
box. 

As for valid methodology, it is true that the two threads are
deadlocked, and the apparent stack corruption may be a necessary and
valid side effect of that deadlock with NPTL, and we may have to just
explain this to the customer.  On the other hand, deadlocks are one of
the times when you most need an un-corrupt stack so you can debug
them, and so we should do what we can to protect the stack in these cases.




Issue escalated to Sustaining Engineering by: gavin.
Status set to: Accepted

----------
Action by: jrfuller
We have been able to reproduce this issue with the above test case.

Our initial assessment is that the stack corruption may be a necessary
and valid side effect of this specific deadlock within NPTL.  However,
deadlocks are one of those times when you most need an un-corrupt
stack so you can properly debug the cause, so we will do what we can
to protect the stack in these cases.

We are still investigating the cause of this issue and will report
when we know more.

J



----------
Action by: adam.eastwick
Hi,

     I did a test on my own and did not get this error when I ran the
above test case using LinuxThreads.  Could this be a problem with the
NPTL threading library or interaction between NPTL and other tools? 
I'll insert the text of my test below.

[New Thread 32769 (LWP 2013)]
[New Thread 16386 (LWP 2014)]

Program received signal SIGINT, Interrupt.
[Switching to Thread 16386 (LWP 2014)]
0xb759b074 in __pthread_sigsuspend () from /lib/i686/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 32769 (LWP 2013))]#0  0xb744f38a in poll ()
   from /lib/i686/libc.so.6
(gdb) where
#0  0xb744f38a in poll () from /lib/i686/libc.so.6
#1  0xb7597d5e in __pthread_manager () from /lib/i686/libpthread.so.0
#2  0xb759802a in __pthread_manager_event () from /lib/i686/libpthread.so.0
#3  0xb745808a in clone () from /lib/i686/libc.so.6
(gdb)


-A



Status set to: Waiting on Tech

----------
Action by: ezannoni
I suspect it's a problem with the debug information from glibc. I have
seen a few bug reports like this. The gdb team is investigating,
but I'll escalate this to bugzilla as well.
Comment 3 Johnray Fuller 2004-11-22 15:10:50 EST
Here are my results with the latest gdb (gdb-6.1post-1.20040607.52)
and glibc (glibc-2.3.2-95.30) from U4.

$ g++ dl.C -o dl -lpthread
$ gdb dl
GNU gdb Red Hat Linux (6.1post-1.20040607.52rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...(no debugging symbols found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".
 
(gdb) run
Starting program: /home/juan/.gnome-desktop/ISSUE-FOLDER/GS/GS-GDB-Thread_Confusion/dl
(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread -1218555136 (LWP 9361)]
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...[New Thread -1218557008 (LWP 9370)]
 
Program received signal SIGINT, Interrupt.
[Switching to Thread -1218555136 (LWP 9361)]
0x00e31809 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread -1218557008 (LWP 9370))]#0  0x00e31809 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) where
#0  0x00e31809 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1  0x00e2e7f7 in _L_mutex_lock_28 () from /lib/tls/libpthread.so.0
#2  0x00e35b5c in __JCR_LIST__ () from /lib/tls/libpthread.so.0
#3  0xb75e4bb0 in ?? ()
#4  0xb75e4a98 in ?? ()
#5  0x080485c4 in thread_fn1 ()
#6  0x080485c4 in thread_fn1 ()
#7  0x00e2cdec in start_thread () from /lib/tls/libpthread.so.0
#8  0x00f78a2a in clone () from /lib/tls/libc.so.6
(gdb)

Looks like it works. I will verify it works with the older gdb as well...

J
Comment 4 Johnray Fuller 2004-11-22 15:17:44 EST
Tested with gdb-6.1post-1.20040607.17 and the latest glibc
(glibc-2.3.2-95.30). It works.

J
Comment 8 Jay Turner 2005-01-26 10:18:52 EST
Closing out.
