Description of problem:
Cannot FCORE a sample SLEEP program as it locks up (futex loops).

Version-Release number of selected component (if applicable):
frysk-0.0.1.2007.02.07.rh1-1.fc6.x86_64
coreutils-5.97-12.3.fc6.x86_64
package coreutils-debuginfo is not installed
glibc-2.5-10.fc6.x86_64
glibc-debuginfo-2.5-10.fc6.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. sleep 1h & pid=$! ;sleep 1;fcore -o /tmp/sleep.core $pid

Actual results:
Still running, empty output file.

Expected results:
Core file output, short running time.

Additional info:

[pid 20688] ptrace(PTRACE_PEEKDATA, 20679, 0x2aaaad584258, [0x866b00008665]) = 0
[pid 20688] futex(0xa3492c, 0x5 /* FUTEX_??? */, 1) = 1
[pid 20686] <... futex resumed> ) = 0
[pid 20688] futex(0xa3ab8c, FUTEX_WAIT, 11264211, NULL <unfinished ...>
[pid 20686] futex(0xa34958, FUTEX_WAKE, 1) = 0
[pid 20686] futex(0xa3ab8c, 0x5 /* FUTEX_??? */, 1 <unfinished ...>
...

Thread 3 (Thread 1084229952 (LWP 20914)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x3a5f140578, mu=0x3a5f140540, millis=<value optimized out>, nanos=0) at ../libjava/posix-threads.cc:166
#2  0x0000003a5dba386c in gnu::gcj::runtime::FinalizerThread::run (this=<value optimized out>) at ../libjava/gnu/gcj/runtime/natFinalizerThread.cc:57
#3  0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae23fe00) at ../libjava/java/lang/natThread.cc:302
#4  0x0000003a5dbb8f77 in really_start (x=0x2aaaadfdc710) at ../libjava/posix-threads.cc:432
#5  0x0000003a5e2f6a66 in GC_start_routine (arg=<value optimized out>) at ../boehm-gc/pthread_support.c:1294
#6  0x0000003e03e06305 in start_thread (arg=<value optimized out>) at pthread_create.c:296
#7  0x0000003e00ecd50d in clone () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()

Thread 2 (Thread 1094719808 (LWP 20915)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x2aaaae23fd48, mu=0x2aaaae23fd50, millis=<value optimized out>, nanos=0) at ../libjava/posix-threads.cc:166
#2  0x0000003a5dbaed60 in java::lang::Object::wait (this=<value optimized out>, timeout=0, nanos=0) at ../libjava/java/lang/natObject.cc:1333
#3  0x000000000045f7e6 in frysk.sys.Ptrace$PtraceThread.run()void (this=0x2aaaae2f0e40) at frysk/sys/Ptrace.java:208
#4  0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae2f0e40) at ../libjava/java/lang/natThread.cc:302
#5  0x0000003a5dbb8f77 in really_start (x=0x2aaaae2e6d60) at ../libjava/posix-threads.cc:432
#6  0x0000003a5e2f6a66 in GC_start_routine (arg=<value optimized out>) at ../boehm-gc/pthread_support.c:1294
#7  0x0000003e03e06305 in start_thread (arg=<value optimized out>) at pthread_create.c:296
#8  0x0000003e00ecd50d in clone () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 46912496204128 (LWP 20913)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x2aaaae23fd48, mu=0x2aaaae23fd50, millis=<value optimized out>, nanos=0) at ../libjava/posix-threads.cc:166
#2  0x0000003a5dbaed60 in java::lang::Object::wait (this=<value optimized out>, timeout=0, nanos=0) at ../libjava/java/lang/natObject.cc:1333
#3  0x000000000045f8dc in frysk.sys.Ptrace$PtraceThread.notifyPtraceThread(int, int, long, long)long (this=0x2aaaae2f0e40, request=2, pid=20911, addr=46912528250712, data=0) at frysk/sys/Ptrace.java:246
#4  0x000000000045993a in frysk::sys::PtraceByteBuffer::peek (this=<value optimized out>, addr=46912528250712, buf=<value optimized out>, off=32057176, len=10701144) at frysk/sys/cni/PtraceByteBuffer.cxx:128
#5  0x0000000000453365 in inua.eio.ByteBuffer.peekFully(long, byte[], long, long)void (this=0x2aaaae3338a0, caret=46912528250712, bytes=0x2aaaae398000, off=32057176, len=10701144) at inua/eio/ByteBuffer.java:93
#6  0x0000000000453de3 in inua.eio.ByteBuffer.get(long, byte[], int, int)inua.eio.ByteBuffer (this=0x2aaaae3338a0, index=46912496193536, dst=0x2aaaae398000, off=0, len=<value optimized out>) at inua/eio/ByteBuffer.java:358
#7  0x000000000043147d in frysk.util.CoredumpAction$CoreMapsBuilder.buildMap(long, long, boolean, boolean, boolean, boolean, long, int, int, int, int, int)void (this=0x2aaaae32e1e0, addressLow=46912496193536, addressHigh=46912551710720, permRead=<value optimized out>, permWrite=<value optimized out>, permExecute=false, shared=false, offset=0, devMajor=8, devMinor=1, inode=3163834, pathnameOffset=1112, pathnameLength=30) at frysk/util/CoredumpAction.java:1083
#8  0x000000000045ae86 in frysk::sys::proc::MapsBuilder::construct (this=0x2aaaae32e1e0, buf=<value optimized out>) at frysk/sys/proc/cni/MapsBuilder.cxx:107
#9  0x000000000045af03 in frysk::sys::proc::MapsBuilder::construct (this=0x2aaaae32e1e0, pid=<value optimized out>) at frysk/sys/proc/cni/MapsBuilder.cxx:58
#10 0x000000000043109b in frysk.util.CoredumpAction.write_elf_file(frysk.proc.Task[], frysk.proc.Proc)void (this=0x2aaaae23fd90, tasks=<value optimized out>, proc=0x2aaaae286be0) at frysk/util/CoredumpAction.java:656
#11 0x0000000000430962 in frysk.util.CoredumpAction.allExistingTasksCompleted()void (this=0xa3492c) at frysk/util/CoredumpAction.java:1172
#12 0x0000000000443b2b in frysk.proc.ProcBlockAction.checkFinish(frysk.proc.Task)void (this=0x2aaaae23fd90, task=<value optimized out>) at frysk/proc/ProcBlockAction.java:241
#13 0x0000000000443f25 in frysk.proc.ProcBlockAction$ProcBlockTaskObserver$1.execute()void (this=0x2aaaae2ebc78) at frysk/proc/ProcBlockAction.java:88
#14 0x00000000004323c8 in frysk.event.EventLoop.runEventLoop(boolean)void (this=0x2aaaae286e60, pendingOnly=<value optimized out>) at frysk/event/EventLoop.java:308
#15 0x000000000043247b in frysk.event.EventLoop.run()void (this=0x2aaaae286e60) at frysk/event/EventLoop.java:414
#16 0x00000000004286b3 in fcore.main(java.lang.String[])void (args=<value optimized out>) at frysk/bindir/fcore.java:128
#17 0x0000003a5dba41b3 in gnu::java::lang::MainThread::call_main (this=0x2aaaae286f00) at ../libjava/gnu/java/lang/natMainThread.cc:50
#18 0x0000003a5dbf925e in gnu.java.lang.MainThread.run()void (this=0x2aaaae286f00) at ../libjava/gnu/java/lang/MainThread.java:108
#19 0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae286f00) at ../libjava/java/lang/natThread.cc:302
#20 0x0000003a5db75a05 in _Jv_RunMain (vm_args=<value optimized out>, klass=0x6eb440, name=0x0, argc=4, argv=0x7fff0c1f7b98, is_jar=false) at ../libjava/prims.cc:1535
#21 0x0000000000428066 in main (argc=0, argv=0x7b0e13) at /tmp/ccgEGq6e.i:14
#22 0x0000003e00e1da44 in __libc_start_main (main=0x428038 <main>, argc=4, ubp_av=0x7fff0c1f7b98, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fff0c1f7b88) at libc-start.c:231
#23 0x0000000000427fa9 in _start ()
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
You're looking at the ptrace server thread, which is why it appears to be in a loop. A closer examination will show that the system calls made by that thread change; the code is making forward progress. Two factors are at play here; both lead to correct, but sub-optimal, performance:

-> The kernel [effectively] limits all ptrace calls to a single thread. To work around this, frysk currently routes all ptrace calls through a separate ptrace server thread. Frysk's event-loop and ptrace threads are currently being merged to eliminate this overhead. Going forward, utrace will eliminate the restriction entirely.

-> Ptrace is very inefficient for large transfers. Current changes are switching the code to a more efficient streaming mechanism, writing the data in large chunks directly to disk.

fstack, which uses identical attach/detach code, gets this result:

$ sleep 1h & pid=$! ; sleep 1 ; fstack $pid
[1] 13428
Task #13428
#0 0x00dda402 in __kernel_vsyscall ()
#1 0x00ca2940 in nanosleep ()
#2 0x0804a710 in [unknown]
#3 0x0804912a in [unknown]
#4 0x00c2d4e4 in __libc_start_main ()
#5 0x08048d11 in [unknown]
$

Pushing bug upstream.
fcore (from CVS) takes a very long time (about 7 minutes) on the above example and then generates the following stack trace:

Exception in thread "main" inua.eio.BufferUnderflowException
   at inua.eio.ByteBuffer.get(fcore)
   at frysk.util.CoredumpAction$CoreMapsBuilder.buildMap(fcore)
   at frysk.sys.proc.MapsBuilder.construct(fcore)
   at frysk.sys.proc.MapsBuilder.construct(fcore)
   at frysk.util.CoredumpAction.write_elf_file(fcore)
   at frysk.util.CoredumpAction.allExistingTasksCompleted(fcore)
   at frysk.proc.ProcBlockAction.checkFinish(fcore)
   at frysk.proc.ProcBlockAction$ProcBlockTaskObserver$1.execute(fcore)
   at frysk.event.EventLoop.runEventLoop(fcore)
   at frysk.event.EventLoop.run(fcore)
   at fcore.main(fcore)
Hi Mark,

Locally (from CVS) I cannot reproduce, as I get:

[pmuldoon@localhost filesystems]$ sleep 1h & pid=$! ;sleep 1;fcore -o /tmp/sleep.core $pid
[1] 8533
[pmuldoon@localhost filesystems]$
[pmuldoon@localhost filesystems]$ eu-readelf -h /tmp/sleep.core.8533
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Ident Version:                     1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)

Additionally, it completes in < 30 seconds. Is there access to your machine so I can test?
(In reply to comment #3)
> Locally (from CVS) I cannot reproduce as I get:
> [...]
> [pmuldoon@localhost filesystems]$ eu-readelf -h /tmp/sleep.core.8533
> [...]
> Additionally it completes in < 30 seconds. Is there access to your machine so I
> can test?

Cool! That looks promising. I am using an x86_64 machine for my tests (it looks like you are using something 32-bit based). I'll try to find you on irc.gimp.org in #frysk to coordinate a debugging session.

Thanks,
Mark
The issue discussed in comments #2, #3, and #4 is a new issue, tracked upstream as: http://sourceware.org/bugzilla/show_bug.cgi?id=4313
On further testing, this is a bug in CVS HEAD, not in FC6. As of right now, I cannot replicate the symptoms in:

[root@localhost pmuldoon]# rpm -q frysk
frysk-0.0.1.2007.02.07.rh1-1.fc6

but can in CVS HEAD. I'll track further reports on the upstream bug.
(In reply to comment #6)
> On further testing this is a bug in CVS HEAD, not in FC6. As of right now, I
> cannot replicate the symptoms in:
>
> [root@localhost pmuldoon]# rpm -q frysk
> frysk-0.0.1.2007.02.07.rh1-1.fc6

In comment #0 you can check that I reported it originally for this FC6 version, on x86_64. I was not talking about any crash, just that it is unusably slow. Even if it would end one day.
In reply to comment #7: I should have prefaced the comment as addressing comments #3, #4, and #5. I'm just trying to triage the bug and get the relevant data out of the trouble report, so we can pursue a fix quickly and efficiently and, if required, split this bug into two bugs. I understand you were not talking about the backtrace. Not sure what you meant by "Even if it would end one day"?
(In reply to comment #8)
> Not sure what you meant by "Even if it would end one day"

That it is too slow. ("end" -> "finish" would probably be better English, sorry.)