Bug 180637

Summary:	[libgcj] timer_create appears to deadlock garbage collector
Product:	[Fedora] Fedora	Reporter:	Andrew Cagney <cagney>
Component:	gcc	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	rawhide	CC:	amellan, aph, berryja, bjohnson, bnocera, caolanm, davej, ianburrell, ilya.konstantinov, ncunning, ndbecker2, nicolas.mailhot, redhatbugs, tromey, wtogami, zhouwu
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	4.1.2	Doc Type:	Bug Fix
Last Closed:	2007-03-12 13:10:30 UTC	Type:	---
Bug Depends On:	179228
Bug Blocks:	173278, 182226, 182263

Description Andrew Cagney 2006-02-09 16:44:13 UTC

Bryce suggests that the frysk problem, while a garbage collection lock-up (same
symptoms), isn't caused by the same bug; cloning.

+++ This bug was initially created as a clone of Bug #179228 +++

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922
Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
Eclipse, RSSOwl, Azureus, chainsaw, etc all run fine on 2.6.15-1.1826.2.10_FC5
with java-1.4.2-gcj-compat, but they all freeze during startup on newer kernels.

It looks like this is happening during GC.  I gathered some stack traces from an
eclipse process and I'll upload them in a minute.

AG


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Run eclipse

Additional info:

-- Additional comment from green on 2006-01-28 12:39 EST --
Created an attachment (id=123840)
stack traces from gdb


-- Additional comment from davej on 2006-02-03 15:28 EST --
*** Bug 179002 has been marked as a duplicate of this bug. ***

-- Additional comment from berryja on 2006-02-04 16:04 EST --
I am seeing this as well.  It has bitten me most when trying to update. 
gcj-dbtool gets stuck and yum sits there waiting on it.  Running "for((;;)); do
killall gcj-dbtool; sleep 1; done" allows yum to get through updating, but I'm
sure my java stuff is a mess.  I think I'm also seeing this affect mono apps,
like beagle.  Running beagle-search just sits there.  Attaching to it with gdb
shows:
0x00002ba878af615d in sem_wait () from /lib64/libpthread.so.0
(gdb) info threads
  3 Thread 1073822048 (LWP 3674)  0x00002ba878af7461 in __nanosleep_nocancel () from /lib64/libpthread.so.0
  2 Thread 1075988832 (LWP 3675)  0x00002ba878af46f7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  1 Thread 48002585361616 (LWP 3673)  0x00002ba878af615d in sem_wait () from /lib64/libpthread.so.0

Let me know what I can do to help with debugging this problem.

Jonathan

-- Additional comment from green on 2006-02-04 16:23 EST --
(In reply to comment #3)
> I think I'm also seeing this affect mono apps,
> like beagle.

This makes sense.  gcj and mono use the same GC implementation, and the problem
shows up when the collector tries to stop all threads so it can take care of
business.


-- Additional comment from cagney on 2006-02-08 12:31 EST --
Changing arch=all; a remarkably similar wedgie occurs on i386 during garbage collection.


-- Additional comment from mckinlay on 2006-02-08 14:37 EST --
Actually I'm pretty sure this is x86_64 arch specific. Something is wrong with
signals/sigsuspend. Removing the following patches from kernel-2.6.15-1.1914_FC5
fixes it for me:

Patch206: linux-2.6-x86_64-tif-restore-sigmask.patch
Patch207: linux-2.6-x86_64-generic-sigsuspend.patch 
Patch208: linux-2.6-x86_64-add-ppoll-pselect.patch 


-- Additional comment from ajocksch on 2006-02-08 15:04 EST --
Just replicated with 2.6.15-1.1826.2.10_FC5 #1 Wed Jan 11 18:12:42 EST 2006 i686
i686 i386 GNU/Linux on i386 machine using Frysk.

-- Additional comment from redhatbugs on 2006-02-09 07:48 EST --
(In reply to comment #6)
> Actually I'm pretty sure this is x86_64 arch specific. Something is wrong with
> signals/sigsuspend. Removing the following patches from kernel-2.6.15-1.1914_FC5
> fixes it for me:
> 
> Patch206: linux-2.6-x86_64-tif-restore-sigmask.patch
> Patch207: linux-2.6-x86_64-generic-sigsuspend.patch 
> Patch208: linux-2.6-x86_64-add-ppoll-pselect.patch 

What did those patches try to fix?  I mean, could they temporarily be disabled in the
next rawhide kernels?





-- Additional comment from caillon on 2006-02-09 11:11 EST --
*** Bug 177820 has been marked as a duplicate of this bug. ***

-- Additional comment from caillon on 2006-02-09 11:12 EST --
*** Bug 179304 has been marked as a duplicate of this bug. ***

-- Additional comment from caillon on 2006-02-09 11:13 EST --
*** Bug 177592 has been marked as a duplicate of this bug. ***

-- Additional comment from caillon on 2006-02-09 11:15 EST --
*** Bug 177703 has been marked as a duplicate of this bug. ***

-- Additional comment from caillon on 2006-02-09 11:20 EST --
*** Bug 180551 has been marked as a duplicate of this bug. ***

Comment 1 Bryce McKinlay 2006-02-09 17:00:59 UTC

The symptoms appear similar, but it's not clear whether or not this is the same bug:

The x86_64 bug:

- Affects *all* Java applications
- Can be made to go away by reverting some x86_64-specific kernel patches

The i686 bug:

- Seems to affect only Frysk. Eclipse and other apps appear to be working fine.
- Doesn't appear to be dependent on kernel version?

It would be good to figure out exactly what conditions cause the i686/frysk bug
to appear.

Comment 2 Andrew Cagney 2006-02-10 23:18:22 UTC

Adam and I, examining the strace, see the sequence:

- successful garbage collect
...

- creation of real-time timer manager thread (I'm guessing that is what it is) 
- timer_create
- creation of short-lived timer expired thread; then exit
- repeat; in some cases several exist in parallel; but all appear to exit

then

- garbage collect sigpwr signals
- hang

This suggests the bug is a bad interaction between the timer event and the garbage collector.
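
The strace sequence above is consistent with a POSIX timer that uses
SIGEV_THREAD notification, where the C library (not the application) creates a
thread to run each expiry callback; the backtrace below shows librt's
timer_helper_thread and timer_sigev_thread in exactly that role. A minimal
sketch of such a timer follows. It is not frysk's code, and the names
on_expire and run_one_shot_timer are hypothetical. On older glibc it needs
-lrt and -lpthread at link time.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static sem_t fired;                        /* posted by the callback thread */

/* Runs in a thread created by the C library, not by the application. */
static void on_expire(union sigval sv)
{
    (void)sv;
    sem_post(&fired);
}

/* Arms a one-shot timer and waits for its callback; returns 0 on success. */
int run_one_shot_timer(long delay_ms)
{
    struct sigevent sev;
    struct itimerspec its;
    timer_t timer;

    if (delay_ms <= 0)                     /* a zero it_value would disarm the timer */
        return -1;
    if (sem_init(&fired, 0, 0) != 0)
        return -1;

    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_THREAD;       /* deliver expiry by starting a thread */
    sev.sigev_notify_function = on_expire;
    if (timer_create(CLOCK_REALTIME, &sev, &timer) != 0)
        return -1;

    memset(&its, 0, sizeof its);
    its.it_value.tv_sec = delay_ms / 1000;
    its.it_value.tv_nsec = (delay_ms % 1000) * 1000000L;
    if (timer_settime(timer, 0, &its, NULL) != 0) {
        timer_delete(timer);
        return -1;
    }

    while (sem_wait(&fired) != 0) {        /* retry only if interrupted */
        if (errno != EINTR) {
            timer_delete(timer);
            return -1;
        }
    }
    timer_delete(timer);
    return 0;
}
```

Calling run_one_shot_timer(20) from a program linked against a collector that
wraps pthread_create would route the library's internal thread creation
through that wrapper, which is the situation the backtrace below shows.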

---------------------------------------------------------------------

at the time of a deadlock, that thread is doing:

ajocksch	#0 0x00969402 in ?? ()
ajocksch	#1 0x00134126 in __nanosleep_nocancel () from /lib/libpthread.so.0
ajocksch	#2 0x0300a95b in GC_lock () at ../../../boehm-gc/pthread_support.c:1490
ajocksch	#3 0x0300acc7 in pthread_create (new_thread=0xb34e4418, attr=0x9f2f290,
ajocksch	start_routine=0x94e3d0 <timer_sigev_thread>, arg=0x9f2f280)
ajocksch	at ../../../boehm-gc/pthread_support.c:1245
ajocksch	#4 0x0094e4f7 in timer_helper_thread () from /lib/librt.so.1
ajocksch	#5 0x0300adb4 in GC_start_routine (arg=0x12cffe0)
ajocksch	at ../../../boehm-gc/pthread_support.c:1188
ajocksch	#6 0x0012e262 in start_thread () from /lib/libpthread.so.0
ajocksch	#7 0x00a8feae in clone () from /lib/libc.so.6
ajocksch	That's the bt from the thread created by thread_create

----------------------------------------------------------------------

bryce	ajocksch: do you know if this patch is in your libgcj?
bryce	ajocksch: 2006-02-06 Jakub Jelinek <jakub>
bryce	Anthony Green <green>
bryce	Tom Tromey <tromey>
bryce	* include/gc_ext_config.h.in: Added GC_PTHREAD_SYM_VERSION.
bryce	* include/gc_config.h.in: Rebuilt.
bryce	...
ajocksch	bryce: I'm up to date with rawhide (except for today's update)
bryce	what's the libgcj ver?
bryce	I would try rolling back to an earlier libgcj that doesn't have that patch
bryce	boehm-gc is intercepting timer_create()'s pthread_create call
ajocksch	bryce libgcj-4.1.0-0.23

-------------------------------------------------------------------


gcc41-gc-pthread_create.patch, which contains that change, is in 0.23.
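
For readers unfamiliar with what "intercepting pthread_create" means here, the
following is a minimal sketch of the general technique, not boehm-gc's actual
code. A wrapper takes a lock, and the new thread starts in a trampoline that
registers itself before running the caller's routine (the role
GC_start_routine plays in the backtrace above). In boehm-gc the wrapper
carries the name pthread_create itself, so library-internal callers such as
librt's timer_helper_thread end up in it too; the sketch uses a distinct,
hypothetical name (tracked_pthread_create) to stay self-contained. All other
names are hypothetical as well.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static int live_threads;                   /* threads currently known to the "collector" */

struct start_info {
    void *(*start_routine)(void *);
    void *arg;
};

/* First code to run in the new thread: register, run the real routine, unregister. */
static void *tracked_start(void *p)
{
    struct start_info info = *(struct start_info *)p;
    void *result;

    free(p);
    pthread_mutex_lock(&registry_lock);
    live_threads++;                        /* a real collector would record stack bounds here */
    pthread_mutex_unlock(&registry_lock);

    result = info.start_routine(info.arg);

    pthread_mutex_lock(&registry_lock);
    live_threads--;                        /* skipped if the routine calls pthread_exit() */
    pthread_mutex_unlock(&registry_lock);
    return result;
}

/* Same signature as pthread_create, but every thread it starts is tracked. */
int tracked_pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                           void *(*start_routine)(void *), void *arg)
{
    struct start_info *info = malloc(sizeof *info);
    int rc;

    if (info == NULL)
        return EAGAIN;
    info->start_routine = start_routine;
    info->arg = arg;
    rc = pthread_create(thread, attr, tracked_start, info);
    if (rc != 0)
        free(info);
    return rc;
}

int tracked_thread_count(void)
{
    int n;

    pthread_mutex_lock(&registry_lock);
    n = live_threads;
    pthread_mutex_unlock(&registry_lock);
    return n;
}

/* Example start routine: reports how many tracked threads exist while it runs. */
void *report_tracked_count(void *arg)
{
    (void)arg;
    return (void *)(intptr_t)tracked_thread_count();
}
```

The point of the sketch is that the thread being created has no say in the
matter: any thread started through the wrapper becomes visible to the
collector, including helper threads that a library creates for its own use.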

Comment 3 Andrew Cagney 2006-02-11 17:49:39 UTC

Should the JVM, and hence the garbage collector, only manipulate native threads
that were explicitly attached using the JNI method (*jvm)->AttachCurrentThread?

Comment 4 Andrew Cagney 2006-02-13 21:56:00 UTC

Adding to fc5 blocker (discussed with bryce); any app manipulating non-pure java
threads and signals will likely deadlock.

Comment 5 Tom Tromey 2006-02-13 23:10:03 UTC

Yeah, this (comment #3) would be ideal.
Originally I thought we would have to wait for the real GC pthread
patch from upstream.

But it occurs to me that we could still keep our current hack in place,
and then add an additional hack so that the GC only tries to stop a
thread which has been explicitly registered.  I'm not sure if I would
want this patch in gcc svn, but we could have it in our RPM; it would
only be needed until the upstream GC is fixed.
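
The idea of stopping only explicitly registered threads can be sketched as
follows. This is an illustration under stated assumptions, not the patch that
was eventually written: SIGUSR1 and SIGUSR2 stand in for the collector's real
signals, every function and variable name is hypothetical, and there is no
detach call, so an attached thread must not exit while it is still registered.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

#define SIG_SUSPEND  SIGUSR1               /* assumption: stand-in signals */
#define SIG_RESUME   SIGUSR2
#define MAX_ATTACHED 16

static pthread_mutex_t attach_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t attached[MAX_ATTACHED];
static int attached_count;
static sem_t parked;                       /* posted once per suspended thread */
static atomic_int stop_epoch, resume_epoch;

static void on_suspend(int sig)
{
    int saved_errno = errno;
    int my_epoch = atomic_load(&stop_epoch);
    sigset_t wait_mask;

    (void)sig;
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIG_RESUME);
    sem_post(&parked);                     /* tell the collector this thread is parked */
    while (atomic_load(&resume_epoch) < my_epoch)
        sigsuspend(&wait_mask);            /* sleeps until SIG_RESUME is delivered */
    errno = saved_errno;
}

static void on_resume(int sig) { (void)sig; }

/* Installs both handlers; call once before any thread attaches. */
int collector_init(void)
{
    struct sigaction sa;

    if (sem_init(&parked, 0, 0) != 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_RESTART;
    sigfillset(&sa.sa_mask);               /* SIG_RESUME stays blocked inside on_suspend */
    sa.sa_handler = on_suspend;
    if (sigaction(SIG_SUSPEND, &sa, NULL) != 0)
        return -1;
    sa.sa_handler = on_resume;
    return sigaction(SIG_RESUME, &sa, NULL);
}

/* Explicit opt-in: only threads that call this are ever signalled. */
int attach_current_thread(void)
{
    int rc = -1;

    pthread_mutex_lock(&attach_lock);
    if (attached_count < MAX_ATTACHED) {
        attached[attached_count++] = pthread_self();
        rc = 0;
    }
    pthread_mutex_unlock(&attach_lock);
    return rc;
}

/* Suspends every attached thread except the caller; pair with start_world(). */
void stop_world(void)
{
    int i, waiting = 0;

    pthread_mutex_lock(&attach_lock);      /* held until start_world() */
    atomic_fetch_add(&stop_epoch, 1);
    for (i = 0; i < attached_count; i++)
        if (!pthread_equal(attached[i], pthread_self()) &&
            pthread_kill(attached[i], SIG_SUSPEND) == 0)
            waiting++;
    while (waiting > 0)
        if (sem_wait(&parked) == 0)
            waiting--;
}

void start_world(void)
{
    int i;

    atomic_store(&resume_epoch, atomic_load(&stop_epoch));
    for (i = 0; i < attached_count; i++)
        if (!pthread_equal(attached[i], pthread_self()))
            pthread_kill(attached[i], SIG_RESUME);
    pthread_mutex_unlock(&attach_lock);
}

/* Demo worker: optionally attaches, then counts until demo_quit is set. */
struct demo_worker_state { int attach; atomic_long count; };
static atomic_int demo_quit;

void *demo_worker(void *arg)
{
    struct demo_worker_state *w = arg;

    if (w->attach)
        attach_current_thread();
    while (!atomic_load(&demo_quit))
        atomic_fetch_add(&w->count, 1);
    return NULL;
}
```

With one worker started with attach set and another without, a call to
stop_world() freezes only the first worker's counter. The second thread,
like a library's private helper thread, is never signalled and keeps running.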

Comment 6 Andrew Cagney 2006-02-14 03:25:19 UTC

Two things I found going through the JNI book (it's scant on information so some
reading between the lines is needed :-):

- I can't see how the current hack of intercepting pthread create calls can work.
The book includes a very explicit example of how code can load and then start
the managed environment (JVM) after the program has started, using an
explicit dlopen call.  Generalizing, that could be after many sub-systems and
their libraries have been loaded, and their private threads started, and hence
well past the point where pthread create can be intercepted.

- non-Java threads need to be explicitly bound.
The book describes how a non-managed thread needs to be explicitly attached
before entering the managed environment or JVM, and then detached afterwards.
Presumably, just like for memory, if the managed environment isn't told about a
thread the managed environment should leave the thread alone.

Comment 7 Tom Tromey 2006-02-14 03:40:26 UTC

The pthread_create interception stuff is only partly related to JNI.
It is more related to the implementation of the GC itself.  Namely,
the GC needs to know the stack bounds, and there is no other way to
get this information.

Comment 8 Bryce McKinlay 2006-02-16 16:49:58 UTC

I'm working on a patch for this. We'll add a GC_attach_thread() call to be used
by JNI/CNI invocation. It'll use pthread_getattr_np() to find the stack bounds,
avoiding the need to intercept pthread_create.
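
As an illustration of the lookup described above, here is a minimal sketch
that asks glibc for the calling thread's stack bounds using
pthread_getattr_np() and pthread_attr_getstack(). It is not the proposed
GC_attach_thread() itself, whose interface is not shown in this report; the
struct stack_bounds type and the current_stack_bounds function are
hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                        /* pthread_getattr_np is a GNU extension */
#endif
#include <pthread.h>
#include <stddef.h>

/* Hypothetical record of what a collector needs in order to scan a stack. */
struct stack_bounds {
    void *low;                             /* lowest address of the stack */
    void *high;                            /* one past the highest address */
};

/* Fills *out with the calling thread's stack bounds; returns 0 on success. */
int current_stack_bounds(struct stack_bounds *out)
{
    pthread_attr_t attr;
    void *addr;
    size_t size;
    int rc;

    rc = pthread_getattr_np(pthread_self(), &attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_getstack(&attr, &addr, &size);
    pthread_attr_destroy(&attr);           /* release what pthread_getattr_np allocated */
    if (rc != 0)
        return rc;

    out->low = addr;
    out->high = (char *)addr + size;
    return 0;
}
```

Because the query works on a thread that already exists, a thread could call
something like this when it attaches, no matter how or when it was created.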

Comment 9 Andrew Cagney 2006-03-02 03:52:05 UTC

For FC-5, frysk has committed a workaround to this bug.

Comment 10 Warren Togami 2006-03-08 21:02:51 UTC

I see a March 3rd frysk in dist-fc5, so assuming this is fixed.

Comment 11 Andrew Cagney 2006-03-08 22:32:51 UTC

(In reply to comment #10)
> I see a March 3rd frysk in dist-fc5, so assuming this is fixed.

See comment #9.  Frysk re-implemented a section of code to avoid this problem;
the underlying problem with the GC code is still there though.  Perhaps what is
wanted is for it to be removed from the blocker list?

Comment 12 Warren Togami 2006-03-08 23:07:48 UTC

Yes, that is fine.  FC5Update then?  Go ahead and change the keyword as desired.

Comment 13 Tom Tromey 2006-09-27 18:32:54 UTC

I think this bug is probably fixed in FC6.
The new thread registration patch went in there.
Do we have a simple way to test this?

Comment 14 Jakub Jelinek 2007-03-12 13:10:30 UTC

Please reopen with a reproducer if you see a problem in rawhide; I really believe
this is fixed.