Bug 234879

Summary: FCORE locks up
Product: Fedora
Component: frysk
Version: 6
Hardware: x86_64
OS: Linux
Status: CLOSED UPSTREAM
Severity: medium
Priority: medium
Reporter: Jan Kratochvil <jan.kratochvil>
Assignee: Andrew Cagney <cagney>
QA Contact: Len DiMaggio <ldimaggi>
CC: kasal, mcvet, mjw, npremji, pmuldoon, rmoseley, scox, timoore
Doc Type: Bug Fix
Bug Blocks: 229481
Last Closed: 2007-04-02 20:33:38 UTC

Description Jan Kratochvil 2007-04-02 18:26:45 UTC
Description of problem:
Cannot fcore a sample sleep(1) program; it locks up (loops on futex calls).

Version-Release number of selected component (if applicable):
frysk-0.0.1.2007.02.07.rh1-1.fc6.x86_64
coreutils-5.97-12.3.fc6.x86_64
package coreutils-debuginfo is not installed
glibc-2.5-10.fc6.x86_64
glibc-debuginfo-2.5-10.fc6.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. sleep 1h & pid=$! ;sleep 1;fcore -o /tmp/sleep.core $pid

Actual results:
Still running, empty output file.

Expected results:
core file output, short running time.

Additional info:
[pid 20688] ptrace(PTRACE_PEEKDATA, 20679, 0x2aaaad584258, [0x866b00008665]) = 0
[pid 20688] futex(0xa3492c, 0x5 /* FUTEX_??? */, 1) = 1
[pid 20686] <... futex resumed> )       = 0
[pid 20688] futex(0xa3ab8c, FUTEX_WAIT, 11264211, NULL <unfinished ...>
[pid 20686] futex(0xa34958, FUTEX_WAKE, 1) = 0
[pid 20686] futex(0xa3ab8c, 0x5 /* FUTEX_??? */, 1 <unfinished ...>
...

Thread 3 (Thread 1084229952 (LWP 20914)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x3a5f140578, mu=0x3a5f140540, millis=<value optimized out>, nanos=0) at ../libjava/posix-threads.cc:166
#2  0x0000003a5dba386c in gnu::gcj::runtime::FinalizerThread::run (this=<value optimized out>) at ../libjava/gnu/gcj/runtime/natFinalizerThread.cc:57
#3  0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae23fe00) at ../libjava/java/lang/natThread.cc:302
#4  0x0000003a5dbb8f77 in really_start (x=0x2aaaadfdc710) at ../libjava/posix-threads.cc:432
#5  0x0000003a5e2f6a66 in GC_start_routine (arg=<value optimized out>) at ../boehm-gc/pthread_support.c:1294
#6  0x0000003e03e06305 in start_thread (arg=<value optimized out>) at pthread_create.c:296
#7  0x0000003e00ecd50d in clone () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()

Thread 2 (Thread 1094719808 (LWP 20915)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x2aaaae23fd48, mu=0x2aaaae23fd50, millis=<value optimized out>, nanos=0) at ../libjava/posix-threads.cc:166
#2  0x0000003a5dbaed60 in java::lang::Object::wait (this=<value optimized out>, timeout=0, nanos=0) at ../libjava/java/lang/natObject.cc:1333
#3  0x000000000045f7e6 in frysk.sys.Ptrace$PtraceThread.run()void (this=0x2aaaae2f0e40) at frysk/sys/Ptrace.java:208
#4  0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae2f0e40) at ../libjava/java/lang/natThread.cc:302
#5  0x0000003a5dbb8f77 in really_start (x=0x2aaaae2e6d60) at ../libjava/posix-threads.cc:432
#6  0x0000003a5e2f6a66 in GC_start_routine (arg=<value optimized out>) at ../boehm-gc/pthread_support.c:1294
#7  0x0000003e03e06305 in start_thread (arg=<value optimized out>) at pthread_create.c:296
#8  0x0000003e00ecd50d in clone () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 46912496204128 (LWP 20913)):
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003a5dbb93bc in _Jv_CondWait (cv=0x2aaaae23fd48, mu=0x2aaaae23fd50, millis=<value optimized out>, nanos=0) at ../libjava/posix-threads.cc:166
#2  0x0000003a5dbaed60 in java::lang::Object::wait (this=<value optimized out>, timeout=0, nanos=0) at ../libjava/java/lang/natObject.cc:1333
#3  0x000000000045f8dc in frysk.sys.Ptrace$PtraceThread.notifyPtraceThread(int, int, long, long)long (this=0x2aaaae2f0e40, request=2, pid=20911, addr=46912528250712, data=0) at frysk/sys/Ptrace.java:246
#4  0x000000000045993a in frysk::sys::PtraceByteBuffer::peek (this=<value optimized out>, addr=46912528250712, buf=<value optimized out>, off=32057176, len=10701144) at frysk/sys/cni/PtraceByteBuffer.cxx:128
#5  0x0000000000453365 in inua.eio.ByteBuffer.peekFully(long, byte[], long, long)void (this=0x2aaaae3338a0, caret=46912528250712, bytes=0x2aaaae398000, off=32057176, len=10701144) at inua/eio/ByteBuffer.java:93
#6  0x0000000000453de3 in inua.eio.ByteBuffer.get(long, byte[], int, int)inua.eio.ByteBuffer (this=0x2aaaae3338a0, index=46912496193536, dst=0x2aaaae398000, off=0, len=<value optimized out>) at inua/eio/ByteBuffer.java:358
#7  0x000000000043147d in frysk.util.CoredumpAction$CoreMapsBuilder.buildMap(long, long, boolean, boolean, boolean, boolean, long, int, int, int, int, int)void (this=0x2aaaae32e1e0, addressLow=46912496193536, addressHigh=46912551710720, permRead=<value optimized out>, permWrite=<value optimized out>, permExecute=false, shared=false, offset=0, devMajor=8, devMinor=1, inode=3163834, pathnameOffset=1112, pathnameLength=30) at frysk/util/CoredumpAction.java:1083
#8  0x000000000045ae86 in frysk::sys::proc::MapsBuilder::construct (this=0x2aaaae32e1e0, buf=<value optimized out>) at frysk/sys/proc/cni/MapsBuilder.cxx:107
#9  0x000000000045af03 in frysk::sys::proc::MapsBuilder::construct (this=0x2aaaae32e1e0, pid=<value optimized out>) at frysk/sys/proc/cni/MapsBuilder.cxx:58
#10 0x000000000043109b in frysk.util.CoredumpAction.write_elf_file(frysk.proc.Task[], frysk.proc.Proc)void (this=0x2aaaae23fd90, tasks=<value optimized out>, proc=0x2aaaae286be0) at frysk/util/CoredumpAction.java:656
#11 0x0000000000430962 in frysk.util.CoredumpAction.allExistingTasksCompleted()void (this=0xa3492c) at frysk/util/CoredumpAction.java:1172
#12 0x0000000000443b2b in frysk.proc.ProcBlockAction.checkFinish(frysk.proc.Task)void (this=0x2aaaae23fd90, task=<value optimized out>) at frysk/proc/ProcBlockAction.java:241
#13 0x0000000000443f25 in frysk.proc.ProcBlockAction$ProcBlockTaskObserver$1.execute()void (this=0x2aaaae2ebc78) at frysk/proc/ProcBlockAction.java:88
#14 0x00000000004323c8 in frysk.event.EventLoop.runEventLoop(boolean)void (this=0x2aaaae286e60, pendingOnly=<value optimized out>) at frysk/event/EventLoop.java:308
#15 0x000000000043247b in frysk.event.EventLoop.run()void (this=0x2aaaae286e60) at frysk/event/EventLoop.java:414
#16 0x00000000004286b3 in fcore.main(java.lang.String[])void (args=<value optimized out>) at frysk/bindir/fcore.java:128
#17 0x0000003a5dba41b3 in gnu::java::lang::MainThread::call_main (this=0x2aaaae286f00) at ../libjava/gnu/java/lang/natMainThread.cc:50
#18 0x0000003a5dbf925e in gnu.java.lang.MainThread.run()void (this=0x2aaaae286f00) at ../libjava/gnu/java/lang/MainThread.java:108
#19 0x0000003a5dbb2738 in _Jv_ThreadRun (thread=0x2aaaae286f00) at ../libjava/java/lang/natThread.cc:302
#20 0x0000003a5db75a05 in _Jv_RunMain (vm_args=<value optimized out>, klass=0x6eb440, name=0x0, argc=4, argv=0x7fff0c1f7b98, is_jar=false) at ../libjava/prims.cc:1535
#21 0x0000000000428066 in main (argc=0, argv=0x7b0e13) at /tmp/ccgEGq6e.i:14
#22 0x0000003e00e1da44 in __libc_start_main (main=0x428038 <main>, argc=4, ubp_av=0x7fff0c1f7b98, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fff0c1f7b88) at libc-start.c:231
#23 0x0000000000427fa9 in _start ()
#0  0x0000003e03e0a416 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Comment 1 Andrew Cagney 2007-04-02 20:33:38 UTC
You're looking at the ptrace server thread, which is why it appears to be in a loop. Closer examination will show that the system calls made by that thread change; the code is making forward progress.

Two factors are at play here; both lead to correct but sub-optimal performance:

-> The kernel [effectively] limits all ptrace calls to a single thread. To work around this, frysk currently routes all ptrace calls through a separate ptrace server thread (see the first sketch after this list). Frysk's event-loop and ptrace threads are currently being merged to eliminate this overhead. Going forward, utrace will eliminate the restriction entirely.

-> Ptrace is very inefficient for large transfers, since PTRACE_PEEKDATA moves only one word per call. Current changes are switching the code to a more efficient streaming mechanism, writing the data in large chunks directly to disk (see the second sketch below).
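
To illustrate the first point, here is a minimal sketch of the server-thread pattern, assuming a hypothetical PtraceServer class; the names PtraceRequest and doPtrace are placeholders, not frysk's actual API (the real implementation lives in frysk.sys.Ptrace and its CNI code). Every caller hands its request to the one thread that performed the attach and blocks until that thread has run the call:

// Minimal sketch of a ptrace server thread. All names are hypothetical.
class PtraceServer extends Thread {

    static final class PtraceRequest {
        final int op; final int pid; final long addr; final long data;
        long result;
        boolean done;
        PtraceRequest(int op, int pid, long addr, long data) {
            this.op = op; this.pid = pid; this.addr = addr; this.data = data;
        }
    }

    private PtraceRequest pending;           // at most one outstanding request

    // Called from any thread; blocks until the server thread has performed
    // the call (compare the wait in Ptrace.java:246 in the backtrace above).
    synchronized long request(int op, int pid, long addr, long data)
            throws InterruptedException {
        while (pending != null)
            wait();                          // one request at a time
        PtraceRequest r = new PtraceRequest(op, pid, addr, data);
        pending = r;
        notifyAll();                         // wake the server thread
        while (!r.done)
            wait();                          // block until the call has run
        return r.result;
    }

    // The server thread: the only thread that ever issues ptrace calls,
    // satisfying the kernel's single-thread restriction.
    public synchronized void run() {
        try {
            while (true) {
                while (pending == null)
                    wait();                  // idle until a request arrives
                PtraceRequest r = pending;
                pending = null;
                r.result = doPtrace(r.op, r.pid, r.addr, r.data);
                r.done = true;
                notifyAll();                 // release the waiting caller
            }
        } catch (InterruptedException e) {
            // interrupted: shut the server down
        }
    }

    // Stand-in for the native ptrace(2) wrapper (CNI code in frysk);
    // this sketch just returns 0.
    private static long doPtrace(int op, int pid, long addr, long data) {
        return 0;
    }
}

Every peek of the inferior pays for this hand-off and wake-up, which is why a word-at-a-time dump of a large address space produces the dense futex traffic seen in the strace output above.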
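
For the second point, a minimal sketch of the chunked-streaming idea, here using /proc/<pid>/mem as one possible mechanism (the bug does not record which mechanism frysk actually adopted; MemStreamer and dumpRange are hypothetical names): instead of one ptrace round trip per machine word, megabyte-sized reads of the stopped inferior's memory are written straight to the output file:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical illustration: stream [low, high) of a ptrace-stopped
// process's address space to disk in 1 MiB chunks, one read(2) per chunk
// instead of one PTRACE_PEEKDATA per 8-byte word.
class MemStreamer {
    static void dumpRange(int pid, long low, long high, FileOutputStream out)
            throws IOException {
        RandomAccessFile mem = new RandomAccessFile("/proc/" + pid + "/mem", "r");
        try {
            mem.seek(low);
            byte[] chunk = new byte[1 << 20];          // 1 MiB buffer
            for (long pos = low; pos < high; ) {
                int want = (int) Math.min(chunk.length, high - pos);
                int got = mem.read(chunk, 0, want);
                if (got < 0)
                    throw new IOException("short read at 0x" + Long.toHexString(pos));
                out.write(chunk, 0, got);              // straight to the core file
                pos += got;
            }
        } finally {
            mem.close();
        }
    }
}

For the roughly 53 MiB mapping in frame #7 of the backtrace above (addressLow to addressHigh is about 55.5 million bytes), this is a few dozen chunked reads instead of roughly 7 million word-sized peeks. Note that kernels of this era only allow /proc/<pid>/mem to be read by the tracing process, so such reads would still be issued from the attached thread.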

fstack, which uses identical attach/detach code, gets this result:

$ sleep 1h & pid=$! ; sleep 1 ; fstack $pid
[1] 13428
Task #13428
#0 0x00dda402 in __kernel_vsyscall ()
#1 0x00ca2940 in nanosleep ()
#2 0x0804a710 in [unknown]
#3 0x0804912a in [unknown]
#4 0x00c2d4e4 in __libc_start_main ()
#5 0x08048d11 in [unknown]
$ 

Pushing bug upstream.

Comment 2 Mark Wielaard 2007-04-03 11:01:28 UTC
fcore (from CVS) takes a very long time (about 7 minutes) on the above example
and then generates the following stack trace:

Exception in thread "main" inua.eio.BufferUnderflowException
   at inua.eio.ByteBuffer.get(fcore)
   at frysk.util.CoredumpAction$CoreMapsBuilder.buildMap(fcore)
   at frysk.sys.proc.MapsBuilder.construct(fcore)
   at frysk.sys.proc.MapsBuilder.construct(fcore)
   at frysk.util.CoredumpAction.write_elf_file(fcore)
   at frysk.util.CoredumpAction.allExistingTasksCompleted(fcore)
   at frysk.proc.ProcBlockAction.checkFinish(fcore)
   at frysk.proc.ProcBlockAction$ProcBlockTaskObserver$1.execute(fcore)
   at frysk.event.EventLoop.runEventLoop(fcore)
   at frysk.event.EventLoop.run(fcore)
   at fcore.main(fcore)


Comment 3 Phil Muldoon 2007-04-03 12:56:42 UTC
Hi Mark,

Locally (from CVS) I cannot reproduce as I get:

[pmuldoon@localhost filesystems]$ sleep 1h & pid=$! ;sleep 1;fcore -o
/tmp/sleep.core $pid
[1] 8533
[pmuldoon@localhost filesystems]$ 

[pmuldoon@localhost filesystems]$ eu-readelf  -h /tmp/sleep.core.8533 
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Ident Version:                     1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)

Additionally it completes in < 30 seconds. Is there access to your machine so I
can test?


Comment 4 Mark Wielaard 2007-04-03 13:11:36 UTC
(In reply to comment #3)
> Locally (from CVS) I cannot reproduce as I get:
> [...]
> [pmuldoon@localhost filesystems]$ eu-readelf  -h /tmp/sleep.core.8533 
> [...]
> Additionally it completes in < 30 seconds. Is there access to your machine so I
> can test?

Cool! That looks promising. I am using an x86_64 machine for my tests (it looks
like you are using something 32-bit based). I'll try to find you on irc.gimp.org
in #frysk to coordinate a debugging session.

Thanks,

Mark



Comment 5 Mark Wielaard 2007-04-03 13:47:17 UTC
The issue discussed in comment #2, #3 and #4 is a new issue, tracked upstream as:
http://sourceware.org/bugzilla/show_bug.cgi?id=4313

Comment 6 Phil Muldoon 2007-04-03 14:02:11 UTC
On further testing, this is a bug in CVS HEAD, not in FC6. As of right now, I
cannot replicate the symptoms in:

[root@localhost pmuldoon]# rpm -q frysk
frysk-0.0.1.2007.02.07.rh1-1.fc6

but can in CVS HEAD. I'll track further reports on the upstream bug.

Comment 7 Jan Kratochvil 2007-04-03 15:10:11 UTC
(In reply to comment #6)
> On further testing this is a bug in CVS HEAD, not in FC6. As of right now, I
> cannot replicate the symptoms in:
> 
> [root@localhost pmuldoon]# rpm -q frysk
> frysk-0.0.1.2007.02.07.rh1-1.fc6

In comment 0 you can see that I originally reported it against this FC6 version, on x86_64.

I was not talking about any crash, just that it is unusably slow. Even if it
would end one day.


Comment 8 Phil Muldoon 2007-04-03 15:21:12 UTC
In reply to #7,

I should have prefaced the comment as addressing comments #3, #4, and #5. I'm
just trying to triage the bug and get the relevant data out of the trouble
report, so we can pursue a fix quickly and efficiently, and if required split
this bug into two bugs. I understand you were not talking about the backtrace.

Not sure what you meant by "Even if it would end one day"?


Comment 9 Jan Kratochvil 2007-04-03 16:03:15 UTC
(In reply to comment #8)
> Not sure what you meant by "Even if it would end one day"

That it is too slow.
("end" -> "finish" would probably be better English, sorry)