Bug 267161
Summary: | ptrace operations create unkillable process eating 100% cpu | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Tom Horsley <horsley1953> | ||||||
Component: | kernel | Assignee: | Roland McGrath <roland> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 7 | CC: | bugsy, cebbert, jan.kratochvil, roland | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | i686 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | 2.6.22.7-85.fc7 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2007-09-25 22:34:07 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 309761 | ||||||||
Attachments: |
|
Description
Tom Horsley
2007-08-30 14:15:34 UTC
Created attachment 181121 [details]
the step-hang source code
I've now seen the test take up to 50,000 iterations before hanging, so I really have no idea what the frequency of occurrence is. For what it is worth, this doesn't seem to hang on rhel5. I ran over 100000 iterations on a similar system with 2.6.18-8.1.8.el5 and it never hung. Can you try SysRq-t and/or SysRq-p to see where it is looping in the kernel? http://en.wikipedia.org/wiki/Magic_SysRq_key I doubt SysRq-p would show registers associated with the process since the system isn't hung, just the process. Here's the hung step-hang process I just reproduced on my x86_64 fedora 7 box (so it isn't 386 specific): step-hang R running task 0 3650 1 step-hang ? 0000009810371cc4 0 3651 3650 ffff81005c617d58 0000000000000046 ffff8100780f2800 0000000000000000 ffff810060c5e2c0 ffff8100780f2800 ffff810060888000 ffff8100780f2a68 000000007f096f00 ffff810060c5e2d0 ffff810060c5e2c0 0000000000000000 Call Trace: [<ffffffff81066d38>] remove_detached+0x73/0x85 [<ffffffff8103675c>] do_exit+0x7d7/0x7db [<ffffffff810367df>] sys_exit_group+0x0/0xe [<ffffffff8103d660>] get_signal_to_deliver+0x3a5/0x3d3 [<ffffffff81009053>] do_notify_resume+0x9c/0x726 [<ffffffff8103d060>] sys_kill+0x105/0x118 [<ffffffff81249741>] paranoid_userspace1+0x3b/0x3e This is the x86_64 2.6.22.4-65.fc7 kernel. The debugger process seems to have done everything and is simply exiting when it gets hung. Perhaps not doing the final waitpid() for the final debugged child status has something to do with the hang? I'll go boot 32 bit and see if I get similar task info from a hang there. Here's the i686 2.6.22.4-65.fc7 version of the same hung task info: step-hang R running 3352 24808 1 step-hang ? 0000000f 3020 24809 24808 f5393e74 00000046 c625017d 0000000f f5393e58 00000000 f527b688 f5393000 00000001 f7ffe480 f5baec00 f5baed9c c2019a80 00000001 c0723640 f5228540 c0458007 c20f7cc8 00000000 c20f7cc8 00024ab1 c045809e 00000000 ffffffff Call Trace: [<c0458007>] check_dead_utrace+0xe5/0x122 [<c045809e>] remove_detached+0x5a/0x6d [<c042a8fc>] do_exit+0x6d0/0x6d4 [<c042a96d>] sys_exit_group+0x0/0xd [<c0430e9a>] get_signal_to_deliver+0x36d/0x391 [<c04045c2>] do_notify_resume+0x8c/0x6c7 [<c040749b>] send_sigtrap+0x60/0x67 [<c040502a>] work_notifysig+0x13/0x19 remove_detached() is part of utrace. Please try sysrq-p a few times in a row and see if you can see it looping in the kernel. It's the only way to see what a running process is up to. OK, sysrq-p does show the hung process - I'm surprised :-). Once again this is the i686 2.6.22.4-65.fc7 version. One sysrq-t followed by 3 near identical sysrq-p: step-hang R running 3312 23670 1 step-hang ? 0000003c 3020 23671 23670 ede01e74 00000046 c756903b 0000003c ede01e58 00000000 ed82d370 ede01000 00000001 00000246 ed878000 ed87819c c2019a80 00000001 c0723640 edebb700 c0458007 f7d2f638 00000000 f7d2f638 0007cf9e c045809e 00000000 ffffffff Call Trace: [<c0458007>] check_dead_utrace+0xe5/0x122 [<c045809e>] remove_detached+0x5a/0x6d [<c042a8fc>] do_exit+0x6d0/0x6d4 [<c042a96d>] sys_exit_group+0x0/0xd [<c0430e9a>] get_signal_to_deliver+0x36d/0x391 [<c04045c2>] do_notify_resume+0x8c/0x6c7 [<c040749b>] send_sigtrap+0x60/0x67 [<c040502a>] work_notifysig+0x13/0x19 ======================= SysRq : Show Regs Pid: 23670, comm: step-hang EIP: 0060:[<c060a2ae>] CPU: 0 EIP is at cond_resched+0xe/0x31 EFLAGS: 00000286 Not tainted (2.6.22.4-65.fc7 #1) EAX: 00000000 EBX: ed878000 ECX: edebb380 EDX: edfd3000 ESI: ed878c00 EDI: ed8791e8 EBP: edfd3f58 DS: 007b ES: 007b FS: 00d8 CR0: 8005003b CR2: b7f20000 CR3: 2df85000 CR4: 000006d0 [<c045ac2c>] ptrace_exit+0xb0/0xd4 [<c042a34d>] do_exit+0x121/0x6d4 [<c047980e>] vfs_write+0x120/0x15a [<c042a96d>] sys_exit_group+0x0/0xd [<c0404f8e>] syscall_call+0x7/0xb ======================= SysRq : Show Regs Pid: 23670, comm: step-hang EIP: 0060:[<c060a2a0>] CPU: 0 EIP is at cond_resched+0x0/0x31 EFLAGS: 00000246 Not tainted (2.6.22.4-65.fc7 #1) EAX: ed8791e8 EBX: ed878000 ECX: edebb380 EDX: ed8791e8 ESI: ed878c00 EDI: ed8791e8 EBP: 00000001 DS: 007b ES: 007b FS: 00d8 CR0: 8005003b CR2: b7f20000 CR3: 2df85000 CR4: 000006d0 [<c045ac2c>] ptrace_exit+0xb0/0xd4 [<c042a34d>] do_exit+0x121/0x6d4 [<c047980e>] vfs_write+0x120/0x15a [<c042a96d>] sys_exit_group+0x0/0xd [<c0404f8e>] syscall_call+0x7/0xb ======================= SysRq : Show Regs Pid: 23670, comm: step-hang EIP: 0060:[<c045abbb>] CPU: 1 EIP is at ptrace_exit+0x3f/0xd4 EFLAGS: 00000202 Not tainted (2.6.22.4-65.fc7 #1) EAX: ed8791e8 EBX: ed878000 ECX: edebb380 EDX: edfd3000 ESI: ed878c00 EDI: ed8791e8 EBP: 00000001 DS: 007b ES: 007b FS: 00d8 CR0: 8005003b CR2: b7f20000 CR3: 2df85000 CR4: 000006d0 [<c042a34d>] do_exit+0x121/0x6d4 [<c047980e>] vfs_write+0x120/0x15a [<c042a96d>] sys_exit_group+0x0/0xd [<c0404f8e>] syscall_call+0x7/0xb ======================= I just tried an experiment - if I throw in a waitpid() call in the debugger to wait on the debugged child status after I kill the debugged child, the test no longer seems to hang (at least not after running a couple of copies 300000 iterations or so :-). So it seems as if the root cause of the hang has something to do with the debugger just exiting and expecting init to inherit the zombie child status and clean it up. Created attachment 184741 [details]
new version of step-hang.c program
This new version of step-hang.c can be compiled with -DDO_NOT_HANG=1 to make
a version that does not seem to hang:
g++ -o step-nohang -DDO_NOT_HANG=1 step-hang.c
g++ -o step-hang step-hang.c
Running ./step-hang will eventually produce a hung debugger process,
running ./step-nohang hasn't yet hung in any of my tests.
Have you seen this on x86_64 as well as i686? (I notice your test program is prepared for either.) Have you noticed anything in particular that might make it more likely to hit the problem sooner? e.g. other heavy load on the system, etc. So far I haven't reproduced it. I originally saw it on a 32 bit dual xeon system, but have reproduced it on my athlon 64 X2 system at home in both 64 bit and 32 bit mode. So far, it has always happened every time I've run the program on any fedora 7 system I've tried it on (but only fedora 7 - I've tried things like rhel5 and rhel4 and never seen the hang). It does seem like other stuff happening on the system may make it more likely, but even then I've sometimes seem it take 50 or 60 thousand iterations sometimes (other times it happens right away). I think it tends to happen sooner if I run two separate copies of the test at the same time, but it is hard to be sure. I've always run it at runlevel 5 in an X session, so there has at least been all that overhead on the system at the same time. I did fix the original program which the test was derived from to always be careful to wait for child processes, and it no longer winds up hanging on fedora 7, but I guess the bug still represents a potential denial of service security problem even if any real debugger code is unlikely to encounter it (since real debuggers rarely forget about their debugged kids :-). The original program was testing the behavior of lots of different ptrace features, and I notice that it would hang almost anywhere, not just in the one test I copied to make the test program, so I don't think the ptrace operations that come before the debugger exits make much difference - the one thing that seems to trigger the hang is leaving the debugged child for init - the specific ptrace operations that came before don't seem to matter. Maybe one of these days I'll get the energy to install the fedora 8 test release and see if it still happens there. I tried an experiment that should be highly reproducible: 1. downloaded the Fedora 8 test 1 i686 live CD from the torrent, wrote the ISO to a CD, booted the CD on my AMD athlon 64 X2 +4400 box (in 32 bit mode since it was the i686 live CD). 2. let it boot with the default run from image mode and let user fedora autologin after the 1 minute timeout. 3. plugged in a USB thumb drive where I had saved the step-hang and step-nohang executables built (with 32 bit compiler) from the attachment in comment #9 above. 4. Opened a terminal using System Tools > Terminal 5. cd'ed to the thumb drive 6. ran ./step-hang 7. went to the terminal program's file menu to open another terminal. 8. At just about the exact time the new terminal opened, the original ./step-hang program from step 6 reproduced the hang (I think it was around 4300 iterations). So, the hang is still there in fedora 8 test 1, and the live CD should be an easily reproducible environment to try the test in. If the act of opening another terminal really triggered the bug, it may well be that lots of context switching is needed to make the bug happen. Another couple of data points: I tried booting two old Windows boxes with the fedora 8 liveCD, with these results: 1. athlon XP +1800 (uniprocessor) ran for about 20 minutes, 2 copies of ./step-hang each ran up to about 130000 iterations and it never hung. 2. Pentium 4 2.8 Gigahertz (hyperthreaded) only took 450 iterations to hang - I hadn't even tried to open the 2nd terminal yet. So this may be something that only happens on SMP systems. Tom, at least reproduced so far, thanks for the report. I have a fix upstream that should hit the next rawhide and f7-updates kernel builds. Just got the new 2.6.22.5-76.fc7 kernel in today's updates, and the bug is still showing up, but perhaps the above fix hadn't yet percolated into that fedora update. It hasn't been built yet. It will be -80 or higher. In kernel 2.6.22.6-81.fc7, submitted for testing. FYI still not fixed in: kernel-2.6.23-0.171.rc5.git1.fc8.x86_64 Just adding a note: The 2.6.22.7-85.fc7 kernel that showed up in updates this morning does indeed fix the problem on the original dual xeon i386 system which triggered this bug report. Thanks! |