Bug 97637 - python deadlocks when NPTL is used
Status: CLOSED CURRENTRELEASE
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 9
Hardware: All Linux
Priority: medium, Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brian Brock
Reported: 2003-06-18 13:13 EDT by Balazs Scheidler
Modified: 2007-04-18 12:54 EDT
Doc Type: Bug Fix
Last Closed: 2003-06-19 10:58:44 EDT

Description Balazs Scheidler 2003-06-18 13:13:34 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20021220
Debian/1.2.7-5

Description of problem:
When threads are used in Python, the interpreter _always_ deadlocks within a
couple of seconds. The script included below runs to completion with
LD_ASSUME_KERNEL=2.4.1 and deadlocks otherwise.

While debugging the problem I discovered that Python uses a combination of
mutexes and condition variables for locking, i.e. it explicitly signals other
threads to wake up when a given lock is released (Python/thread_pthread.h
contains the pthread-specific thread implementation). Each Python thread
releases the interpreter lock every 100 byte-code instructions, and this
release and re-acquire cycle appears to be what triggers the deadlock.

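from threading import Thread

class TestThread(Thread):
    def __init__(self, id):
        Thread.__init__(self)
        self.id = id

    def run(self):
        # Busy loop that spans many interpreter-lock release points
        # (one every 100 byte codes, per the description above).
        for i in range(1, 10000):
            pass
        # Parenthesized so it runs under both Python 2 and Python 3.
        print('ready: %d' % self.id)

# Start nine threads that compete for the interpreter lock.
for i in range(1, 10):
    t = TestThread(i)
    t.start()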
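
For illustration only, the locking scheme described in the report (a mutex
plus a condition variable, with an explicit wakeup on release) can be sketched
in Python itself. This is not the code from Python/thread_pthread.h; the class
name CondLock and the helper run_demo are hypothetical names chosen for this
sketch.

```python
import threading


class CondLock(object):
    """Hypothetical lock built from a mutex plus a condition variable."""

    def __init__(self):
        # The Condition wraps a plain mutex that protects the flag below.
        self._cond = threading.Condition(threading.Lock())
        self._locked = False

    def acquire(self):
        with self._cond:
            # Sleep until a releasing thread signals; re-check the flag
            # after every wakeup because another thread may have won.
            while self._locked:
                self._cond.wait()
            self._locked = True

    def release(self):
        with self._cond:
            self._locked = False
            # Explicitly wake one waiter. If this wakeup were lost,
            # the waiter would sleep forever.
            self._cond.notify()


def run_demo(num_threads=4, iterations=1000):
    """Increment a shared counter under CondLock and return its final value."""
    lock = CondLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

A scheme like this depends entirely on the condition-variable wakeup being
delivered: a waiter that misses the signal never re-checks the flag.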



Version-Release number of selected component (if applicable):
kernel-2.4.20-8

How reproducible:
Always

Steps to Reproduce:
1. run the program above with NPTL enabled
2. it should deadlock
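
Because a deadlocked run never exits on its own, one way to automate these
steps is to run the script under a timeout and treat a timeout as a hang. The
following is a minimal sketch for a current Python 3, not something used in
this report; the function names and the default timeout are assumptions.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=30.0, env=None):
    """Return True if cmd exits within timeout_s seconds, False if it hangs."""
    try:
        subprocess.run(cmd, timeout=timeout_s, env=env,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising.
        return False
    return True


def count_hangs(cmd, runs=10, timeout_s=30.0):
    """Run cmd several times and count how many runs timed out."""
    return sum(1 for _ in range(runs)
               if not run_with_timeout(cmd, timeout_s))
```

With the test script saved under a hypothetical name such as threadtest.py,
count_hangs([sys.executable, "threadtest.py"]) would report how many of ten
runs failed to finish.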

    

Actual Results:  the script deadlocked

Expected Results:  the script should have finished.

Additional info:
Comment 1 Ulrich Drepper 2003-06-18 22:55:40 EDT
I cannot reproduce any hangs in several hundred runs on an SMP machine.  But
then, my system is fully updated.  The originally shipped glibc had, I think,
some issues with condvars, which Python uses.

Update to the latest glibc version and the latest kernel.  If you still see
problems, report exactly what kind of hardware you're using.
Comment 2 Balazs Scheidler 2003-06-19 10:04:52 EDT
The update to the latest libc+kernel did indeed solve the problem. Thanks.
Comment 3 Ulrich Drepper 2003-06-19 10:58:44 EDT
The current code works.
Comment 4 Scott Leerssen 2003-07-18 10:07:05 EDT
We have kernel 2.4.20-18.9 and glibc-2.3.2-27.9 and are still getting fairly
consistent thread deadlocks in a massively threaded Python application. The
interesting thing is that attaching strace or starting a new thread will often
free up the other threads.  What is the "CURRENTRELEASE" that is supposed to
solve this issue?

FWIW, here's what strace tells me about all the threads of a deadlocked process:

[root@demo9 root]# strace -p 3356 -p 3357 -p 12929 -p 3596 -p 3364 -p 3595
[pid  3356] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid 12929] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3596] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3364] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3595] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3357] futex(0xb015a04, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3356] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid 12929] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3596] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3364] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3595] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3356] select(5, [4], [], [], {1, 160000} <unfinished ...>
[pid  3596] futex(0x965f88c, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3364] select(0, NULL, NULL, NULL, {0, 820000} <unfinished ...>
[pid  3595] futex(0xbc2fc2c, FUTEX_WAIT, 0, NULL <unfinished ...>


After running this strace, thread 12929 is scheduled in and the deadlock releases.
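
To make output like the above easier to read, a small parser can report which
system call each thread was last seen entering. This is a sketch that assumes
the "[pid N] call(" line layout shown in the strace output above; the function
name and the SAMPLE excerpt (four lines copied from that output) are only for
illustration.

```python
import re

# Matches lines such as:
# [pid  3357] futex(0xb015a04, FUTEX_WAIT, 0, NULL <unfinished ...>
# Signal lines ("--- SIGSTOP ...") do not match and are skipped.
_LINE = re.compile(r"^\[pid\s+(\d+)\]\s+(\w+)\(")


def blocked_syscalls(strace_text):
    """Map each pid to the last syscall it was seen entering."""
    result = {}
    for line in strace_text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            result[int(m.group(1))] = m.group(2)
    return result


SAMPLE = """\
[pid  3357] futex(0xb015a04, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3356] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  3356] select(5, [4], [], [], {1, 160000} <unfinished ...>
[pid  3596] futex(0x965f88c, FUTEX_WAIT, 0, NULL <unfinished ...>
"""
```

Threads that map to futex here are the ones sleeping in FUTEX_WAIT.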
Comment 5 Scott Leerssen 2003-07-18 11:54:48 EDT
Also, as previously noted, LD_ASSUME_KERNEL=2.4.1 makes the deadlocks go away.
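
When the application is launched from another Python program, the same
workaround can be applied by setting the variable in the child's environment
only. A minimal sketch; the helper name linuxthreads_env is hypothetical, and
the variable is only meaningful on systems whose dynamic loader honours
LD_ASSUME_KERNEL.

```python
import os


def linuxthreads_env(base_env=None, kernel_version="2.4.1"):
    """Return a copy of the environment with LD_ASSUME_KERNEL set."""
    env = dict(os.environ if base_env is None else base_env)
    # Only the copy is modified; the parent process is left untouched.
    env["LD_ASSUME_KERNEL"] = kernel_version
    return env
```

The returned dictionary can be passed as the env argument of
subprocess.Popen or subprocess.call.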
Comment 6 Scott Leerssen 2003-07-18 13:09:02 EDT
hmm... now an rpm just hung in...

[root@demo9 root]# strace -p 8367
futex(0x40586f20, FUTEX_WAIT, 0, NULL <unfinished ...>
Comment 7 Ulrich Drepper 2003-07-22 03:08:15 EDT
I don't see any problems.  If you really have some and they are not hardware
related, you might want to try Severn, the just-released beta for the RHLP.
