Bug 180637
| Summary: | [libgcj] timer_create appears to deadlock garbage collector | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Andrew Cagney <cagney> |
| Component: | gcc | Assignee: | Jakub Jelinek <jakub> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | rawhide | CC: | amellan, aph, berryja, bjohnson, bnocera, caolanm, davej, ianburrell, ilya.konstantinov, ncunning, ndbecker2, nicolas.mailhot, redhatbugs, tromey, wtogami, zhouwu |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | 4.1.2 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2007-03-12 13:10:30 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 179228 | ||
| Bug Blocks: | 173278, 182226, 182263 | ||
|
Description
Andrew Cagney
2006-02-09 16:44:13 UTC
The symptoms appear similar, but it's not clear whether or not this is the same bug:

The x86_64 bug:
- Affects *all* Java applications
- Can be made to go away by reverting some x86_64-specific kernel patches

The i686 bug:
- Seems to affect only Frysk. Eclipse and other apps appear to be working fine.
- Doesn't appear to be dependent on kernel version?

It would be good to figure out exactly what conditions caused the i686/Frysk bug to appear.

Adam and I, examining the strace, see the sequence:
- successful garbage collect
...
- creation of real-time timer manager thread (I'm guessing that is what it is)
- timer_create
- creation of short-lived timer expired thread; then exit
- repeat; in some cases several exist in parallel, but all appear to exit
then
- garbage collect SIGPWR signals
- hang

This suggests the bug is a bad interaction between the timer event and the garbage collector.

---------------------------------------------------------------------
at the time of a deadlock, that thread is doing:
ajocksch #0 0x00969402 in ?? ()
ajocksch #1 0x00134126 in __nanosleep_nocancel () from /lib/libpthread.so.0
ajocksch #2 0x0300a95b in GC_lock () at ../../../boehm-gc/pthread_support.c:1490
ajocksch #3 0x0300acc7 in pthread_create (new_thread=0xb34e4418, attr=0x9f2f290,
ajocksch start_routine=0x94e3d0 <timer_sigev_thread>, arg=0x9f2f280)
ajocksch at ../../../boehm-gc/pthread_support.c:1245
ajocksch #4 0x0094e4f7 in timer_helper_thread () from /lib/librt.so.1
ajocksch #5 0x0300adb4 in GC_start_routine (arg=0x12cffe0)
ajocksch at ../../../boehm-gc/pthread_support.c:1188
ajocksch #6 0x0012e262 in start_thread () from /lib/libpthread.so.0
ajocksch #7 0x00a8feae in clone () from /lib/libc.so.6
ajocksch That's the bt from the thread created by thread_create
----------------------------------------------------------------------
bryce ajocksch: do you know if this patch is in your libgcj?
bryce ajocksch: 2006-02-06 Jakub Jelinek <jakub>
bryce Anthony Green <green>
bryce Tom Tromey <tromey>
bryce * include/gc_ext_config.h.in: Added GC_PTHREAD_SYM_VERSION.
bryce * include/gc_config.h.in: Rebuilt.
bryce ...
ajocksch bryce: I'm up to date with rawhide (except for today's update)
bryce whats the libgcj ver?
bryce I would try rolling back to an earlier libgcj that doesn't have that patch
bryce boehm-gc is intercepting timer_create()'s pthread_create call
ajocksch bryce libgcj-4.1.0-0.23
-------------------------------------------------------------------

gcc41-gc-pthread_create.patch, which contains that patch, is in 0.23.

Should the JVM, and hence the garbage collector, only manipulate native threads that were explicitly attached using the JNI method (*jvm)->AttachCurrentThread?

Adding to fc5 blocker (discussed with bryce); any app manipulating non-pure Java threads and signals will likely deadlock.

Yeah, this (comment #3) would be ideal. Originally I thought we would have to wait for the real GC pthread patch from upstream. But it occurs to me that we could still keep our current hack in place, and then add an additional hack so that the GC only tries to stop a thread which has been explicitly registered. I'm not sure if I would want this patch in gcc svn, but we could have it in our RPM; it would only be needed until the upstream GC is fixed.

Two things I found going through the JNI book (it's scant on information, so some reading between the lines is needed :-):

- I can't see how the current hack of intercepting pthread_create calls can work

The book includes a very explicit example of how code can load and then start the managed environment (JVM) after the program has started, using an explicit dlopen call. Generalizing, that could be after many sub-systems and their libraries have been loaded and their private threads started, and hence well past the point where pthread_create can be intercepted.

- non-Java threads need to be explicitly bound

The book describes how a non-managed thread needs to be explicitly attached before entering the managed environment or JVM, and then detached after leaving it. Presumably, just like for memory, if the managed environment isn't told about a thread, it should leave that thread alone.

The pthread_create interception stuff is only partly related to JNI. It is more related to the implementation of the GC itself. Namely, the GC needs to know the stack bounds, and there is no other way to get this information.

I'm working on a patch for this. We'll add a GC_attach_thread() call to be used by JNI/CNI invocation. It'll use pthread_getattr_np() to find the stack bounds, avoiding the need to intercept pthread_create.

For FC-5, frysk has committed a workaround to this bug.

I see a March 3rd frysk in dist-fc5, so assuming this is fixed.

(In reply to comment #10)
> I see a March 3rd frysk in dist-fc5, so assuming this is fixed.

See comment #9. Frysk re-implemented a section of code to avoid this problem; the underlying problem with the GC code is still there though. Perhaps what is wanted is for it to be removed from the blocker list?

Yes, that is fine. FC5Update then? Go ahead and change the keyword as desired.

I think this bug is probably fixed in FC6. The new thread registration patch went in there. Do we have a simple way to test this?

Please reopen with a reproducer if you see a problem in rawhide; I really believe this is fixed. |
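For reference, the stack-bounds lookup mentioned in the comments above (using pthread_getattr_np() instead of intercepting pthread_create) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the actual GC_attach_thread() patch: the names stack_bounds, current_thread_stack_bounds and address_in_bounds are hypothetical and do not come from libgcj or boehm-gc. It assumes glibc on Linux, where pthread_getattr_np() is available as a GNU extension.

```c
/* Sketch only: hypothetical helpers, not the libgcj/boehm-gc code. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* pthread_getattr_np() is a GNU extension */
#endif
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Declared explicitly as well, in case an earlier include already fixed
 * the feature-test macros before _GNU_SOURCE was defined. */
int pthread_getattr_np(pthread_t thread, pthread_attr_t *attr);

/* Hypothetical record of one thread's stack range. */
struct stack_bounds {
    void  *low;    /* lowest address of the stack */
    size_t size;   /* size of the stack in bytes */
};

/*
 * Look up the stack bounds of the calling thread after it has already
 * been started, so the thread does not need to have been created
 * through an intercepted pthread_create().
 * Returns 0 on success, or an errno-style error code.
 */
int current_thread_stack_bounds(struct stack_bounds *out)
{
    pthread_attr_t attr;
    int rc = pthread_getattr_np(pthread_self(), &attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_getstack(&attr, &out->low, &out->size);
    pthread_attr_destroy(&attr);
    return rc;
}

/* Return 1 if addr lies inside [low, low + size), else 0. */
int address_in_bounds(const struct stack_bounds *b, const void *addr)
{
    uintptr_t a  = (uintptr_t)addr;
    uintptr_t lo = (uintptr_t)b->low;
    return a >= lo && a - lo < b->size;
}
```

A thread that wants to be visible to a collector would call current_thread_stack_bounds() on itself at attach time and hand the result over. Threads that never attach, such as the helper threads librt starts for timer_create, would then simply not be registered.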