Bug 173304
Summary: Fix for SystemTap bugzilla #1345 - return probe on do_execve
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Jim Keniston <kenistoj>
Assignee: Dave Anderson <anderson>
QA Contact: Jay Turner <jturner>
CC: jbaron, lwang, srevivo
Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Last Closed: 2006-03-07 20:48:16 UTC
Bug Blocks: 168429
Description
Jim Keniston
2005-11-16 06:24:21 UTC
Created attachment 121110 [details]
Sample module to illustrate the bug
Compile and insmod this module to demonstrate the bug.
If the bug is fixed, /var/log/messages should show output such as the
following:
Nov 15 22:01:17 xxx kernel: Registering probes for sys_execve
Nov 15 22:01:17 xxx kernel: Registering probes for do_execve
Nov 15 22:01:17 xxx kernel: Registering probes for load_elf_binary
Nov 15 22:01:17 xxx kernel: Registering probes for flush_old_exec
[The following lines are displayed after you rmmod the module. The number of
calls and returns depends on how many commands you run between insmod and
rmmod.]
Nov 15 22:01:45 xxx kernel: sys_execve: 21 calls, 21 returns, 0 missed
Nov 15 22:01:45 xxx kernel: do_execve: 21 calls, 21 returns, 0 missed
Nov 15 22:01:45 xxx kernel: load_elf_binary: 21 calls, 21 returns, 0 missed
Nov 15 22:01:45 xxx kernel: flush_old_exec: 21 calls, 21 returns, 0 missed
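A minimal sketch of how the report lines above could be checked mechanically, assuming the log format is exactly as shown ("<function>: N calls, N returns, N missed" after the "kernel: " prefix). The helper name rp_check_line is hypothetical and is not part of the attached module.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: inspect one /var/log/messages line.
 * Returns  1 if it is a report line with calls == returns and missed == 0,
 *          0 if it is a report line that fails that check,
 *         -1 if it is not a "calls/returns/missed" report line at all.
 */
static int rp_check_line(const char *line)
{
    const char *p = strstr(line, "kernel: ");
    char func[64];
    unsigned long calls, returns, missed;

    if (!p)
        return -1;
    p += strlen("kernel: ");

    /* Function name runs up to the first ':'; then the three counters. */
    if (sscanf(p, "%63[^:]: %lu calls, %lu returns, %lu missed",
               func, &calls, &returns, &missed) != 4)
        return -1;

    return calls == returns && missed == 0;
}
```

Feeding each line of the log to rp_check_line() and requiring that no report line yields 0 gives the same pass criterion as reading the output by eye. The "Registering probes for ..." lines are skipped because they return -1.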
Created attachment 121111 [details]
This patch fixes the bug on all architectures.
This patch has been tested on i386 and ppc64. It will be tested on ia64 and
x86_64 by Nov. 16.
Created attachment 121155 [details]
Patch for RHEL4 U3
This patch applies to RHEL4 U3. The previously provided patch applies to the
upstream kernel, v2.6.15-rc1.
> Patch for RHEL4 U3
>
> This patch applies to RHEL4 U3. The previously provided patch applies to the
> upstream kernel, v2.6.15-rc1.

This patch no longer applies to the current RHEL4 tree. Please provide a fixed version:

  $ patch -p1 --dry-run < $HOME/rpfix-rhel4u3.patch
  patching file arch/i386/kernel/process.c
  Hunk #1 succeeded at 337 (offset 2 lines).
  patching file arch/ia64/kernel/process.c
  Hunk #1 FAILED at 25.
  Hunk #2 succeeded at 696 (offset 1 line).
  1 out of 2 hunks FAILED -- saving rejects to file arch/ia64/kernel/process.c.rej
  patching file arch/ppc64/kernel/process.c
  patching file arch/x86_64/kernel/process.c
  $

...as well as posting ia64 and x86_64 test results.

Please also provide a short explanation of what the original problem actually is, and how the removal of the kprobe_flush_task() call from the 3 processor-specific flush_thread() calls addresses the issue. Sorry -- I have no experience or insight into kprobes/SystemTap...

Created attachment 121255 [details]
fixed version of rpfix-rhel4u3.patch
The attached patch applies cleanly to the current RHEL4-U3 tree.
In response to Comment #5:

Test results for x86_64 and ia64: ia64 (as tested by anil.s.keshavamurthy) and x86_64 (as tested by me) produce the desired results as described in Comment #1.

Problem description (sorry, it's not short):

From Documentation/kprobes.txt in the mainline kernel:
-----
1.3 How Does a Return Probe Work?

When you call register_kretprobe(), Kprobes establishes a kprobe at the entry to the function. When the probed function is called and this probe is hit, Kprobes saves a copy of the return address, and replaces the return address with the address of a "trampoline." The trampoline is an arbitrary piece of code -- typically just a nop instruction. At boot time, Kprobes registers a kprobe at the trampoline.

When the probed function executes its return instruction, control passes to the trampoline and that probe is hit. Kprobes' trampoline handler calls the user-specified handler associated with the kretprobe, then sets the saved instruction pointer to the saved return address, and that's where execution resumes upon return from the trap.

While the probed function is executing, its return address is stored in an object of type kretprobe_instance. Before calling register_kretprobe(), the user sets the maxactive field of the kretprobe struct to specify how many instances of the specified function can be probed simultaneously. register_kretprobe() pre-allocates the indicated number of kretprobe_instance objects.
-----
If a return-probed function never returns, the kretprobe_instance object will never be recycled, and you'll quickly run out. So when a program image is going away (e.g., via do_exit()), we call kprobe_flush_task() to recycle all that task's kretprobe_instance objects. do_execve() also discards the program image, so our original implementation also called kprobe_flush_task() from flush_thread() (which is called from do_execve()). This was a mistake, since do_execve() retains the stack and does indeed return.
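As a rough illustration of the bookkeeping described above, here is a user-space toy model in C. It is not kernel code and does not reproduce the actual Kprobes implementation: the rp_* names, the fixed pool size RP_MAXACTIVE, and the integer task ids are all assumptions made for the sketch. It models only three ideas from the text: a pre-allocated pool of instances, recycling on return, and a per-task flush.

```c
#include <stddef.h>

#define RP_MAXACTIVE 4  /* hypothetical pool size, stands in for maxactive */

/* Toy stand-in for kretprobe_instance: one saved return address per active call. */
struct rp_instance {
    int in_use;
    int task;               /* hypothetical id of the owning task */
    unsigned long seq;      /* allocation order, so returns unwind newest-first */
    unsigned long ret_addr; /* the saved "real" return address */
};

struct rp_pool {
    struct rp_instance inst[RP_MAXACTIVE];
    unsigned long next_seq;
    unsigned long missed;   /* entries that found no free instance */
};

/* Entry probe hit: take a free instance and save the return address. */
static int rp_entry(struct rp_pool *p, int task, unsigned long ret_addr)
{
    for (size_t i = 0; i < RP_MAXACTIVE; i++) {
        if (!p->inst[i].in_use) {
            p->inst[i].in_use = 1;
            p->inst[i].task = task;
            p->inst[i].seq = ++p->next_seq;
            p->inst[i].ret_addr = ret_addr;
            return 0;
        }
    }
    p->missed++;
    return -1;
}

/* Trampoline hit: recycle the task's newest instance and hand back the
 * saved return address, or 0 if no instance is left for this task. */
static unsigned long rp_return(struct rp_pool *p, int task)
{
    struct rp_instance *best = NULL;

    for (size_t i = 0; i < RP_MAXACTIVE; i++) {
        struct rp_instance *ri = &p->inst[i];
        if (ri->in_use && ri->task == task && (!best || ri->seq > best->seq))
            best = ri;
    }
    if (!best)
        return 0;
    best->in_use = 0;
    return best->ret_addr;
}

/* Models the role of kprobe_flush_task(): recycle all of a task's instances. */
static void rp_flush_task(struct rp_pool *p, int task)
{
    for (size_t i = 0; i < RP_MAXACTIVE; i++)
        if (p->inst[i].in_use && p->inst[i].task == task)
            p->inst[i].in_use = 0;
}

/* n tasks each enter a function that never returns (the do_exit() case).
 * Returns how many entries were missed because the pool ran dry. */
static unsigned long rp_demo_no_return(int n, int flush_on_exit)
{
    struct rp_pool pool = {0};

    for (int task = 1; task <= n; task++) {
        rp_entry(&pool, task, 0x2000);
        if (flush_on_exit)
            rp_flush_task(&pool, task);
    }
    return pool.missed;
}

/* One task enters four nested probed functions (the execve path), optionally
 * flushes mid-way, then returns from all four. Returns how many returns
 * still found their saved address. */
static int rp_demo_execve(int flush_in_exec)
{
    struct rp_pool pool = {0};
    int found = 0;

    for (unsigned long depth = 1; depth <= 4; depth++)
        rp_entry(&pool, 1, 0x1000 * depth);
    if (flush_in_exec)
        rp_flush_task(&pool, 1);
    for (unsigned long depth = 4; depth >= 1; depth--)
        if (rp_return(&pool, 1) == 0x1000 * depth)
            found++;
    return found;
}
```

In this model, rp_demo_no_return(6, 0) misses two entries because nothing recycles the four instances, while rp_demo_no_return(6, 1) misses none, which is the reason for flushing on exit. rp_demo_execve(1) finds none of the four saved addresses because the flush discarded them while the functions were still active, whereas rp_demo_execve(0) finds all four. What a real kernel does when the trampoline finds no instance depends on the architecture and Kprobes version; the model shows only that the saved addresses are gone.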
For reasons I won't go into, this worked fine on the original architectures (i386 and x86_64) but not on ppc64 and ia64. The architectures got out of sync (and RHEL4 got out of sync with the mainline kernel, apparently), and a subsequent change to Kprobes changed this harmless mistake to a fatal one.

Correct practice is to call kprobe_flush_task() from do_exit() (via exit_thread()), but not from do_execve() (via flush_thread()). The patch associated with comment #2 fixes this in the mainline kernel, and the patch in #6 (but not #3, apparently) fixes this in RHEL4 U3.

> The patch associated with comment #2 fixes this in the mainline kernel,
> and the patch in #6 (but not #3, apparently) fixes this in RHEL4 U3.

Ok -- before this can be proposed for RHEL4, we need absolute test results running i686, x86_64, ppc64 and ia64 RHEL4 kernels. In this location:

  http://people.redhat.com/anderson/BZ_173304

are the following binary rpms:

  kernel-smp-2.6.9-22.20.EL.bz173304.i686.rpm
  kernel-hugemem-2.6.9-22.20.EL.bz173304.i686.rpm
  kernel-smp-2.6.9-22.20.EL.bz173304.x86_64.rpm
  kernel-2.6.9-22.20.EL.bz173304.ia64.rpm
  kernel-2.6.9-22.20.EL.bz173304.ppc64.rpm

and the associated src.rpm used to build them:

  kernel-2.6.9-22.20.EL.bz173304.src.rpm

and the patch in the src.rpm that was applied to 2.6.9-22.20.EL:

  linux-kernel-test.patch

Note that there are both smp and hugemem i686 kernels. We need test results from both i686 kernels, due to a major screw-up with the original kprobes patch that caused the use of gdb breakpoints on a user application to crash the hugemem kernel.

Testing is in progress using the kernels you provided. What constitutes "absolute" test results?

Thanks Jim -- we appreciate the extra effort -- and just a "thumbs up" on the 5 kernels provided will be fine. It's just that we can't simply go with testing on upstream kernels only, and we're a little gun-shy after getting burnt by the kprobes/gdb/hugemem fiasco...
We have installed and tested the kernels you provided on the appropriate architectures:

  i686 smp - kevinrs.com
  i686 hugemem - kevinrs.com
  ppc64 - hien.com
  ia64 - anil.s.keshavamurthy
  x86_64 - jkenisto.com

The rp.c test (from Comment #1) passed on all architectures.

Thanks Jim. I've posted the patch internally for review.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html