Bug 234879
Summary: | FCORE locks up | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Jan Kratochvil <jan.kratochvil> |
Component: | frysk | Assignee: | Andrew Cagney <cagney> |
Status: | CLOSED UPSTREAM | QA Contact: | Len DiMaggio <ldimaggi> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 6 | CC: | kasal, mcvet, mjw, npremji, pmuldoon, rmoseley, scox, timoore |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2007-04-02 20:33:38 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 229481 | |
Description
Jan Kratochvil
2007-04-02 18:26:45 UTC
You're looking at the ptrace server thread, which is why it appears to be in a loop. A closer examination will show that the system calls made by that thread change; the code is making forward progress. Two factors are at play here, and both lead to correct but sub-optimal performance:

- The kernel [effectively] limits all ptrace calls to a single thread. To work around this, frysk currently routes all ptrace calls through a separate ptrace server thread. Frysk's event-loop and ptrace threads are currently being merged to eliminate this overhead. Going forward, utrace will eliminate the restriction entirely.
- Ptrace is very inefficient for large transfers. Current changes are switching the code to a more efficient streaming mechanism - writing the data in large chunks directly to disk.

fstack, which uses identical attach/detach code, gets this result:

```
$ sleep 1h & pid=$! ; sleep 1 ; fstack $pid
[1] 13428
Task #13428
#0 0x00dda402 in __kernel_vsyscall ()
#1 0x00ca2940 in nanosleep ()
#2 0x0804a710 in [unknown]
#3 0x0804912a in [unknown]
#4 0x00c2d4e4 in __libc_start_main ()
#5 0x08048d11 in [unknown]
$
```

Pushing bug upstream.

---

fcore (from CVS) takes a very long time (about 7 minutes) on the above example and then generates the following stacktrace:

```
Exception in thread "main" inua.eio.BufferUnderflowException
   at inua.eio.ByteBuffer.get(fcore)
   at frysk.util.CoredumpAction$CoreMapsBuilder.buildMap(fcore)
   at frysk.sys.proc.MapsBuilder.construct(fcore)
   at frysk.sys.proc.MapsBuilder.construct(fcore)
   at frysk.util.CoredumpAction.write_elf_file(fcore)
   at frysk.util.CoredumpAction.allExistingTasksCompleted(fcore)
   at frysk.proc.ProcBlockAction.checkFinish(fcore)
   at frysk.proc.ProcBlockAction$ProcBlockTaskObserver$1.execute(fcore)
   at frysk.event.EventLoop.runEventLoop(fcore)
   at frysk.event.EventLoop.run(fcore)
   at fcore.main(fcore)
```

---

Hi Mark,

Locally (from CVS) I cannot reproduce it, as I get:

```
[pmuldoon@localhost filesystems]$ sleep 1h & pid=$! ; sleep 1 ; fcore -o /tmp/sleep.core $pid
[1] 8533
[pmuldoon@localhost filesystems]$ eu-readelf -h /tmp/sleep.core.8533
ELF Header:
  Magic:         7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:         ELF32
  Data:          2's complement, little endian
  Ident Version: 1 (current)
  OS/ABI:        UNIX - System V
  ABI Version:   0
  Type:          CORE (Core file)
```

Additionally, it completes in under 30 seconds. Is there access to your machine so I can test?

---

(In reply to comment #3)
> Locally (from CVS) I cannot reproduce as I get:
> [...]
> [pmuldoon@localhost filesystems]$ eu-readelf -h /tmp/sleep.core.8533
> [...]
> Additionally it completes in < 30 seconds. Is there access to your machine
> so I can test?

Cool! That looks promising. I am using an x86_64 machine for my tests (it looks like you use something 32-bit based). I'll try to find you on irc.gimp.org in #frysk to coordinate a debugging session.

Thanks,

Mark

---

The issue discussed in comments #2, #3 and #4 is a new issue, tracked upstream as:
http://sourceware.org/bugzilla/show_bug.cgi?id=4313

---

On further testing, this is a bug in CVS HEAD, not in FC6. As of right now, I cannot replicate the symptoms in:

```
[root@localhost pmuldoon]# rpm -q frysk
frysk-0.0.1.2007.02.07.rh1-1.fc6
```

but can in CVS HEAD. I'll track further reports on the upstream bug.

---

(In reply to comment #6)
> On further testing this is a bug in CVS HEAD, not in FC6. As of right now, I
> cannot replicate the symptoms in:
>
> [root@localhost pmuldoon]# rpm -q frysk
> frysk-0.0.1.2007.02.07.rh1-1.fc6

In Comment 0 you can see that I reported it originally for this FC6 version, on x86_64. I was not talking about any crash, just that it is unusably slow. Even if it would end one day.

---

In reply to #7: I should have prefaced the comment as addressing comments #3, #4, and #5. I'm just trying to triage the bug and get the relevant data out of the trouble report, so we can pursue a fix quickly and efficiently and, if required, split this bug into 2 bugs.
I understand you were not talking about the backtrace. Not sure what you meant by "Even if it would end one day"?

---

(In reply to comment #8)
> Not sure what you meant by "Even if it would end one day"?

That it is too slow. ("end" -> "finish" would probably be better English, sorry.)