Bug 54368

Summary: Race condition in Electric Fence and Pthreads
Product: [Retired] Red Hat Linux
Component: ElectricFence
Version: 7.1
Hardware: i386
OS: Linux
Status: CLOSED CANTFIX
Severity: medium
Priority: medium
Reporter: William Shubert <wms>
Assignee: Petr Machata <pmachata>
CC: mnewsome, simra
Doc Type: Bug Fix
Last Closed: 2006-10-18 14:26:00 UTC
Attachments:
    Patch to efence.c to fix the deadlock problem.

Description William Shubert 2001-10-04 21:54:06 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
There seems to be a race condition between Electric Fence and pthreads. I have
a large multithreaded application that, during shutdown, has many threads
exiting and many calls to "free" all at once. In about 1 in 5 shutdowns, two
threads end up deadlocked. gdb shows that one thread is as follows:

(gdb) where
#0  0x400cf8a5 in __sigsuspend (set=0x41b228ec)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x400920d9 in __pthread_wait_for_restart_signal (self=0x41b22c00)
    at pthread.c:934
#2  0x400930ac in __new_sem_wait (sem=0x400a173c) at restart.h:34
#3  0x4009f179 in lock () from /usr/lib/libefence.so.0
#4  0x4009fa10 in free () from /usr/lib/libefence.so.0
#5  0x40092bf3 in __pthread_destroy_specifics () at specific.c:165
#6  0x4008f2d7 in pthread_exit (retval=0x0) at join.c:37
#7  0x4008fc05 in pthread_start_thread (arg=0x41b22c00) at manager.c:265

And another is:

(gdb) where
#0  0x400cf8a5 in __sigsuspend (set=0x4132260c)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x400920d9 in __pthread_wait_for_restart_signal (self=0x41322c00)
    at pthread.c:934
#2  0x400930ac in __new_sem_wait (sem=0x400a173c) at restart.h:34
#3  0x4009f179 in lock () from /usr/lib/libefence.so.0
#4  0x4009fa10 in free () from /usr/lib/libefence.so.0
#5  0x08085064 in wms_free (deadbuf=0x45d1cff0) at wms.c:109

(wms_free() is my function that calls free.) All other threads exit fine;
these two are stuck, so I'm guessing there is some kind of bad interaction
when a thread is exiting just as another is freeing memory. If you have
trouble reproducing this, let me know.

Version-Release number of selected component (if applicable):
ElectricFence-2.2.2-7

How reproducible:
Sometimes

Steps to Reproduce:
1. Uhhh... run my large non-free multithreaded server
2. Exit
3. OK, so this is hard for you to reproduce. If a glance at the code doesn't
make the bug obvious, let me know and I'll see if I can write a small program
to reproduce it. I feel really lazy telling you to look at the code instead
of doing it myself; I'm trying to put together a release in the next 2 hours
and don't have time. Maybe later today I'll see what I can find, if I can get
this release done.

Actual Results:  Two threads got deadlocked.

Expected Results:  Threads should have just plain exited.

Additional info:

It would be wonderful to get efence working with threads. I think I have a
bug that accesses memory after freeing it; I can't find it, and it happens
very rarely (about once per month of server uptime), so I need to be able to
run the production server under efence, and that isn't possible while this
shutdown deadlock is present.

Comment 1 simra 2003-02-02 01:04:50 UTC
Created attachment 89771 [details]
Patch to efence.c to fix the deadlock problem.

The problem is that efence calls sem_wait(), which deadlocks inside
pthread_exit().  The attached patch to efence.c adds the option of using a
pthread_mutex rather than a semaphore.  Note: you *must* compile with
-DUSE_MUTEX instead of -DUSE_SEMAPHORE.

I've mailed the same patch to Bruce Perens.  So far, the patch has solved my
problem, which was identical to the one reported in this bug.

Comment 2 simra 2003-02-02 01:08:14 UTC
Note: my patch was applied on a Red Hat Linux 8.0 box. The bug looks like it's
present on all releases from 7.1 to 8.0.


Comment 3 Bill Nottingham 2006-08-07 17:22:07 UTC
Red Hat Linux is no longer supported by Red Hat, Inc. If you are still
running Red Hat Linux, you are strongly advised to upgrade to a
current Fedora Core release or Red Hat Enterprise Linux or comparable.
Some information on which option may be right for you is available at
http://www.redhat.com/rhel/migrate/redhatlinux/.

Red Hat apologizes that these issues have not been resolved yet. We do
want to make sure that no important bugs slip through the cracks.
Please check if this issue is still present in a current Fedora Core
release. If so, please change the product and version to match, and
check the box indicating that the requested information has been
provided. Note that any bug still open against Red Hat Linux will be
closed as 'CANTFIX' on September 30, 2006. Thanks again for your help.


Comment 4 Bill Nottingham 2006-10-18 14:26:00 UTC

Closing as CANTFIX.