Bug 234879 - FCORE locks up
Status: CLOSED UPSTREAM
Product: Fedora
Classification: Fedora
Component: frysk
Version: 6
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: Andrew Cagney
QA Contact: Len DiMaggio
Blocks: 229481
Reported: 2007-04-02 14:26 EDT by Jan Kratochvil
Modified: 2007-11-30 17:12 EST
CC: 8 users

Doc Type: Bug Fix
Last Closed: 2007-04-02 16:33:38 EDT



External Trackers:
  Sourceware 3727
  Sourceware 4313

Description Jan Kratochvil 2007-04-02 14:26:45 EDT
Description of problem:
Cannot FCORE a sample SLEEP program as it locks up (futex loops).

Version-Release number of selected component (if applicable):
frysk-0.0.1.2007.02.07.rh1-1.fc6.x86_64
coreutils-5.97-12.3.fc6.x86_64
package coreutils-debuginfo is not installed
glibc-2.5-10.fc6.x86_64
glibc-debuginfo-2.5-10.fc6.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. sleep 1h & pid=$! ;sleep 1;fcore -o /tmp/sleep.core $pid

Actual results:
Still running, empty output file.

Expected results:
core file output, short running time.

Additional info:
[pid 20688] ptrace(PTRACE_PEEKDATA, 20679, 0x2aaaad584258, [0x866b00008665]) = 0
[pid 20688] futex(0xa3492c, 0x5 /* FUTEX_??? */, 1) = 1
[pid 20686] <... futex resumed> )       = 0
[pid 20688] futex(0xa3ab8c, FUTEX_WAIT, 11264211, NULL <unfinished ...>
[pid 20686] futex(0xa34958, FUTEX_WAKE, 1) = 0
[pid 20686] futex(0xa3ab8c, 0x5 /* FUTEX_??? */, 1 <unfinished ...>
...

Thread 3 (Thread 1084229952 (LWP 20914)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x3a5f140578, mu=0x3a5f140540,
millis=<value optimized out>, nanos=0)
    at ../libjava/posix-threads.cc:166
#2  0x0000003a5dba386c in gnu::gcj::runtime::FinalizerThread::run (this=<value
optimized out>)
    at ../libjava/gnu/gcj/runtime/natFinalizerThread.cc:57
#3  0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae23fe00) at
../libjava/java/lang/natThread.cc:302
#4  0x0000003a5dbb8f77 in really_start (x=0x2aaaadfdc710) at
../libjava/posix-threads.cc:432
#5  0x0000003a5e2f6a66 in GC_start_routine (arg=<value optimized out>) at
../boehm-gc/pthread_support.c:1294
#6  0x0000003e03e06305 in start_thread (arg=<value optimized out>) at
pthread_create.c:296
#7  0x0000003e00ecd50d in clone () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()

Thread 2 (Thread 1094719808 (LWP 20915)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x2aaaae23fd48, mu=0x2aaaae23fd50,
millis=<value optimized out>, nanos=0)
    at ../libjava/posix-threads.cc:166
#2  0x0000003a5dbaed60 in java::lang::Object::wait (this=<value optimized out>,
timeout=0, nanos=0)
    at ../libjava/java/lang/natObject.cc:1333
#3  0x000000000045f7e6 in frysk.sys.Ptrace$PtraceThread.run()void
(this=0x2aaaae2f0e40) at frysk/sys/Ptrace.java:208
#4  0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae2f0e40) at
../libjava/java/lang/natThread.cc:302
#5  0x0000003a5dbb8f77 in really_start (x=0x2aaaae2e6d60) at
../libjava/posix-threads.cc:432
#6  0x0000003a5e2f6a66 in GC_start_routine (arg=<value optimized out>) at
../boehm-gc/pthread_support.c:1294
#7  0x0000003e03e06305 in start_thread (arg=<value optimized out>) at
pthread_create.c:296
#8  0x0000003e00ecd50d in clone () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 46912496204128 (LWP 20913)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x2aaaae23fd48, mu=0x2aaaae23fd50,
millis=<value optimized out>, nanos=0)
    at ../libjava/posix-threads.cc:166
#2  0x0000003a5dbaed60 in java::lang::Object::wait (this=<value optimized out>,
timeout=0, nanos=0)
    at ../libjava/java/lang/natObject.cc:1333
#3  0x000000000045f8dc in frysk.sys.Ptrace$PtraceThread.notifyPtraceThread(int,
int, long, long)long (this=0x2aaaae2f0e40, request=2, 
    pid=20911, addr=46912528250712, data=0) at frysk/sys/Ptrace.java:246
#4  0x000000000045993a in frysk::sys::PtraceByteBuffer::peek (this=<value
optimized out>, addr=46912528250712, 
    buf=<value optimized out>, off=32057176, len=10701144) at
frysk/sys/cni/PtraceByteBuffer.cxx:128
#5  0x0000000000453365 in inua.eio.ByteBuffer.peekFully(long, byte[], long,
long)void (this=0x2aaaae3338a0, caret=46912528250712, 
    bytes=0x2aaaae398000, off=32057176, len=10701144) at inua/eio/ByteBuffer.java:93
#6  0x0000000000453de3 in inua.eio.ByteBuffer.get(long, byte[], int,
int)inua.eio.ByteBuffer (this=0x2aaaae3338a0, index=46912496193536, 
    dst=0x2aaaae398000, off=0, len=<value optimized out>) at
inua/eio/ByteBuffer.java:358
#7  0x000000000043147d in
frysk.util.CoredumpAction$CoreMapsBuilder.buildMap(long, long, boolean, boolean,
boolean, boolean, long, int, int, int, int, int)void (this=0x2aaaae32e1e0,
addressLow=46912496193536, addressHigh=46912551710720, permRead=<value optimized
out>, 
    permWrite=<value optimized out>, permExecute=false, shared=false, offset=0,
devMajor=8, devMinor=1, inode=3163834, 
    pathnameOffset=1112, pathnameLength=30) at frysk/util/CoredumpAction.java:1083
#8  0x000000000045ae86 in frysk::sys::proc::MapsBuilder::construct
(this=0x2aaaae32e1e0, buf=<value optimized out>)
    at frysk/sys/proc/cni/MapsBuilder.cxx:107
#9  0x000000000045af03 in frysk::sys::proc::MapsBuilder::construct
(this=0x2aaaae32e1e0, pid=<value optimized out>)
    at frysk/sys/proc/cni/MapsBuilder.cxx:58
#10 0x000000000043109b in
frysk.util.CoredumpAction.write_elf_file(frysk.proc.Task[], frysk.proc.Proc)void
(this=0x2aaaae23fd90, 
    tasks=<value optimized out>, proc=0x2aaaae286be0) at
frysk/util/CoredumpAction.java:656
#11 0x0000000000430962 in
frysk.util.CoredumpAction.allExistingTasksCompleted()void (this=0xa3492c)
    at frysk/util/CoredumpAction.java:1172
#12 0x0000000000443b2b in
frysk.proc.ProcBlockAction.checkFinish(frysk.proc.Task)void
(this=0x2aaaae23fd90, task=<value optimized out>)
    at frysk/proc/ProcBlockAction.java:241
#13 0x0000000000443f25 in
frysk.proc.ProcBlockAction$ProcBlockTaskObserver$1.execute()void
(this=0x2aaaae2ebc78)
    at frysk/proc/ProcBlockAction.java:88
#14 0x00000000004323c8 in frysk.event.EventLoop.runEventLoop(boolean)void
(this=0x2aaaae286e60, pendingOnly=<value optimized out>)
    at frysk/event/EventLoop.java:308
#15 0x000000000043247b in frysk.event.EventLoop.run()void (this=0x2aaaae286e60)
at frysk/event/EventLoop.java:414
#16 0x00000000004286b3 in fcore.main(java.lang.String[])void (args=<value
optimized out>) at frysk/bindir/fcore.java:128
#17 0x0000003a5dba41b3 in gnu::java::lang::MainThread::call_main
(this=0x2aaaae286f00) at ../libjava/gnu/java/lang/natMainThread.cc:50
#18 0x0000003a5dbf925e in gnu.java.lang.MainThread.run()void
(this=0x2aaaae286f00) at ../libjava/gnu/java/lang/MainThread.java:108
#19 0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae286f00) at
../libjava/java/lang/natThread.cc:302
#20 0x0000003a5db75a05 in _Jv_RunMain (vm_args=<value optimized out>,
klass=0x6eb440, name=0x0, argc=4, argv=0x7fff0c1f7b98, 
    is_jar=false) at ../libjava/prims.cc:1535
#21 0x0000000000428066 in main (argc=0, argv=0x7b0e13) at /tmp/ccgEGq6e.i:14
#22 0x0000003e00e1da44 in __libc_start_main (main=0x428038 <main>, argc=4,
ubp_av=0x7fff0c1f7b98, init=<value optimized out>, 
    fini=<value optimized out>, rtld_fini=<value optimized out>,
stack_end=0x7fff0c1f7b88) at libc-start.c:231
#23 0x0000000000427fa9 in _start ()
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
Comment 1 Andrew Cagney 2007-04-02 16:33:38 EDT
You're looking at the ptrace server thread, which is why it appears to be in a
loop.  A closer examination will show that the system calls made by that thread
change; the code is making forward progress.

Two factors are at play here; both lead to correct but sub-optimal performance:

-> The kernel [effectively] limits all ptrace calls to a single thread.  To work
around this, frysk currently routes all ptrace calls through a separate ptrace
server thread.  Frysk's event-loop and ptrace threads are currently being merged
to eliminate this overhead.  Going forward, utrace will eliminate the
restriction entirely.

-> Ptrace is very inefficient for large transfers.  Current changes are
switching the code to a more efficient streaming mechanism - writing the data in
large chunks directly to disk.

fstack, which uses identical attach/detach code, gets this result:

$ sleep 1h & pid=$! ; sleep 1 ; fstack $pid
[1] 13428
Task #13428
#0 0x00dda402 in __kernel_vsyscall ()
#1 0x00ca2940 in nanosleep ()
#2 0x0804a710 in [unknown]
#3 0x0804912a in [unknown]
#4 0x00c2d4e4 in __libc_start_main ()
#5 0x08048d11 in [unknown]
$ 

Pushing bug upstream.
Comment 2 Mark Wielaard 2007-04-03 07:01:28 EDT
fcore (from CVS) takes a very long time (about 7 minutes) on the above example
and then generates the following stack trace:

Exception in thread "main" inua.eio.BufferUnderflowException
   at inua.eio.ByteBuffer.get(fcore)
   at frysk.util.CoredumpAction$CoreMapsBuilder.buildMap(fcore)
   at frysk.sys.proc.MapsBuilder.construct(fcore)
   at frysk.sys.proc.MapsBuilder.construct(fcore)
   at frysk.util.CoredumpAction.write_elf_file(fcore)
   at frysk.util.CoredumpAction.allExistingTasksCompleted(fcore)
   at frysk.proc.ProcBlockAction.checkFinish(fcore)
   at frysk.proc.ProcBlockAction$ProcBlockTaskObserver$1.execute(fcore)
   at frysk.event.EventLoop.runEventLoop(fcore)
   at frysk.event.EventLoop.run(fcore)
   at fcore.main(fcore)
Comment 3 Phil Muldoon 2007-04-03 08:56:42 EDT
Hi Mark,

Locally (from CVS) I cannot reproduce as I get:

[pmuldoon@localhost filesystems]$ sleep 1h & pid=$! ;sleep 1;fcore -o
/tmp/sleep.core $pid
[1] 8533
[pmuldoon@localhost filesystems]$ 

[pmuldoon@localhost filesystems]$ eu-readelf  -h /tmp/sleep.core.8533 
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Ident Version:                     1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)

Additionally, it completes in under 30 seconds. Could I get access to your
machine so I can test?
Comment 4 Mark Wielaard 2007-04-03 09:11:36 EDT
(In reply to comment #3)
> Locally (from CVS) I cannot reproduce as I get:
> [...]
> [pmuldoon@localhost filesystems]$ eu-readelf  -h /tmp/sleep.core.8533 
> > [...]
> Additionally it completes in < 30 seconds. Is there access to your machine so I
> can test?

Cool! That looks promising. I am using an x86_64 machine for my tests (it looks
like you are using something 32-bit based). I'll try to find you on irc.gimp.org
in #frysk to coordinate a debugging session.

Thanks,

Mark

Comment 5 Mark Wielaard 2007-04-03 09:47:17 EDT
The issue discussed in comment #2, #3 and #4 is a new issue, tracked upstream as:
http://sourceware.org/bugzilla/show_bug.cgi?id=4313
Comment 6 Phil Muldoon 2007-04-03 10:02:11 EDT
On further testing, this is a bug in CVS HEAD, not in FC6. As of right now, I
cannot replicate the symptoms in:

[root@localhost pmuldoon]# rpm -q frysk
frysk-0.0.1.2007.02.07.rh1-1.fc6

but can in CVS HEAD. I'll track further reports on the upstream bug.
Comment 7 Jan Kratochvil 2007-04-03 11:10:11 EDT
(In reply to comment #6)
> On further testing this is a bug in CVS HEAD, not in FC6. As of right now, I
> cannot replicate the symptoms in:
> 
> [root@localhost pmuldoon]# rpm -q frysk
> frysk-0.0.1.2007.02.07.rh1-1.fc6

In comment #0 you can check that I originally reported it for this FC6 version, on x86_64.

I was not talking about any crash, just that it is unusably slow.  Even if it
would end one day.
Comment 8 Phil Muldoon 2007-04-03 11:21:12 EDT
In reply to #7,

I should have prefaced the comment as referring to comments #3, #4, and #5. I'm
just trying to triage the bug and get the relevant data out of the trouble
report, so we can pursue a fix quickly and efficiently, and if required split
this bug into two bugs. I understand you were not talking about the backtrace.

Not sure what you meant by "Even if it would end one day"?
Comment 9 Jan Kratochvil 2007-04-03 12:03:15 EDT
(In reply to comment #8)
> Not sure what you meant but "Even if it would end one day"

That it is too slow.
("end" -> "finish" would probably be better English, sorry)
