Bug 490234 - systemtap panics kernel when killing concurrent staprun processes
systemtap panics kernel when killing concurrent staprun processes
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: systemtap (Show other bugs)
All Linux
low Severity medium
: rc
: ---
Assigned To: David Smith
: Patch
Depends On:
  Show dependency treegraph
Reported: 2009-03-13 18:58 EDT by Matthew Whitehead
Modified: 2011-02-16 09:09 EST (History)
4 users (show)

See Also:
Fixed In Version: systemtap-1.3-1.el4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 510282 (view as bug list)
Last Closed: 2011-02-16 09:09:10 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
Proposed patch (613 bytes, patch)
2010-03-09 11:38 EST, Marc Milgram
no flags Details | Diff
Updated patch (1.13 KB, patch)
2010-03-09 14:50 EST, Marc Milgram
no flags Details | Diff

  None (edit)
Description Matthew Whitehead 2009-03-13 18:58:36 EDT
Description of problem: apparent race condition in runtime/procfs.c:_stp_rmdir_proc_module when killing two staprun processes with a kill -9.

Version-Release number of selected component (if applicable): 


How reproducible: Unsure

Steps to Reproduce:
1. Start two systemtap scripts at the same time
2. Use kill -9 with multiple process ids on the command line to kill at the same time
3. See if you get the panic
Actual results:

Expected results:

Additional info:

Crash occurs in fs/proc/generic.c:remove_proc_entry, calling the BUG_ON(de->subdir) macro. 

Here is the code fragment:

        for (p = &parent->subdir; *p; p=&(*p)->next ) {
                if (!proc_match(len, fn, *p))
                de = *p;
                *p = de->next;
                de->next = NULL;
                if (S_ISDIR(de->mode))
                de->nlink = 0;
                if (!atomic_read(&de->count))
                else {
                        de->deleted = 1;
                        printk("remove_proc_entry: %s/%s busy, count=%d\n",
                                parent->name, de->name, atomic_read(&de->count)\

   DUMPFILE: /var/crash/crash_dumps/ap12.1236890303.dmp
       CPUS: 8
       DATE: Thu Mar 12 16:38:23 2009
     UPTIME: 25 days, 02:31:04
LOAD AVERAGE: 7.48, 8.65, 8.42
      TASKS: 916
   NODENAME: chip6.acs.census.gov
    RELEASE: 2.6.9-67.0.15.ELsmp
    VERSION: #1 SMP Tue Apr 22 13:58:43 EDT 2008
    MACHINE: x86_64  (2193 Mhz)
     MEMORY: 33 GB
      PANIC: "vmdump"
        PID: 2868
    COMMAND: "staprun"
       TASK: 104bf77a7f0  [THREAD_INFO: 1040baf0000]
        CPU: 2

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at generic:712
invalid operand: 0000 [1] SMP
Modules linked in: stap_a5058ef9256374d79b204f5489fce41b_6002(U)
stap_034dd3d28d1d58dff817bd97d399e87f_3085(U) loop sr_mod ip_vs nfsd
exportfs parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U)
gfs(U) lock_harness(U) nfs lockd dlm(U) cman(U) md5 ipv6 ipt_state
ip_conntrack iptable_filter ip_tables hw_random k8_edac edac_mc nfs_acl
sunrpc egenera_nmi(U) egenera_veth(U) sd_mod egenera_vscsi(U) scsi_mod
egenera_vmdump(U) egenera_dumpdev(U) egenera_ipmi(U) egenera_base(U)
egenera_virtual_bus(U) egenera_fs(U) dm_snapshot dm_zero dm_mirror ext3
jbd dm_mod
Pid: 2868, comm: staprun Tainted: PF     2.6.9-67.0.15.ELsmp
RIP: 0010:[<ffffffff801b0f4e>] <ffffffff801b0f4e>{remove_proc_entry+340}
RSP: 0000:000001040baf1e48  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000102381de500 RCX: 0000000000000000
RDX: 000000000000000a RSI: 000001021c68b219 RDI: ffffffff804f4700
RBP: 000001021c68b180 R08: ffffffffa03c71b5 R09: 0000000000000080
R10: 0000000000000080 R11: ffffffff801d232b R12: 00000102381de400
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000002a9557f440(0000) GS:ffffffff804f3680(0000) knlGS:00000000f6e03ba0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003c105142d8 CR3: 0000000438fc0000 CR4: 00000000000006e0
Process staprun (pid: 2868, threadinfo 000001040baf0000, task
Stack: ffffffffa03cdb00 0000000000000000 ffffffff803f0c60 ffffffffa03c71b5
      ffffffff803ed1a0 ffffffffa03cdb00 0000000000000000 ffffffffa03c2875
      0000010440000000 ffffffffa03c28b2




Code: 0f 0b 72 de 32 80 ff ff ff ff c8 02 8b 85 80 00 00 00 85 c0
RIP <ffffffff801b0f4e>{remove_proc_entry+340} RSP <000001040baf1e48>

crash> bt 2868
PID: 2868   TASK: 104bf77a7f0       CPU: 2   COMMAND: "staprun"
#0 [1040baf1c40] dump_execute at ffffffffa010c483
#1 [1040baf1c80] __do_vmdump at ffffffffa010c556
#2 [1040baf1ca0] try_crashdump at ffffffff8014c7dd
#3 [1040baf1cb0] die at ffffffff80111c04
#4 [1040baf1cd0] do_invalid_op at ffffffff80111fcc
#5 [1040baf1d90] error_exit at ffffffff80110d91
   [exception RIP: remove_proc_entry+340]
   RIP: ffffffff801b0f4e  RSP: 000001040baf1e48  RFLAGS: 00010202
   RAX: 0000000000000000  RBX: 00000102381de500  RCX: 0000000000000000
   RDX: 000000000000000a  RSI: 000001021c68b219  RDI: ffffffff804f4700
   RBP: 000001021c68b180   R8: ffffffffa03c71b5   R9: 0000000000000080
   R10: 0000000000000080  R11: ffffffff801d232b  R12: 00000102381de400
   R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#6 [1040baf1e40] remove_proc_entry at ffffffff801b0f3f
#7 [1040baf1e80] _stp_rmdir_proc_module at ffffffffa03c2875
#8 [1040baf1e90] _stp_unregister_ctl_channel at ffffffffa03c28b2
#9 [1040baf1ec0] _stp_transport_close at ffffffffa03c6e13
#10 [1040baf1ee0] sys_delete_module at ffffffff8014ed74
#11 [1040baf1f80] tracesys at ffffffff801103ce (via system_call)
   RIP: 0000003c0fbc9e79  RSP: 0000007fbffff768  RFLAGS: 00000246
   RAX: ffffffffffffffda  RBX: ffffffff801103ce  RCX: ffffffffffffffff
   RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000509060
   RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000200
   R10: 0000000000000004  R11: 0000000000000246  R12: 0000000000000000
   R13: 0000000000000003  R14: 0000000000000000  R15: 0000000000000000
   ORIG_RAX: 00000000000000b0  CS: 0033  SS: 002b
Comment 1 Marc Milgram 2009-03-26 15:23:19 EDT
The fix for http://sourceware.org/bugzilla/show_bug.cgi?id=5042 seemed to cause this problem.

Currently, the code only checks to see that /proc/systemtap is not being used before deleting /proc/systemtap.  But, it may be unused while another module has a directory under /proc/systemtap.

The solution is to add a guard to ensure that there are no other files under /proc/systemtap.

diff --git a/runtime/procfs.c b/runtime/procfs.c
index d6b7533..dfbe2de 100644
--- a/runtime/procfs.c
+++ b/runtime/procfs.c
@@ -59,7 +59,8 @@ static void _stp_rmdir_proc_module(void)
                /* modules to use it, even though it will not show up */
                /* in directory listings. */
-               if (atomic_read(&_stp_proc_stap->count) == 0) {
+               if ((atomic_read(&_stp_proc_stap->count) == 0) &&
+                   (!_stp_proc_stap->subdir)) {
                        remove_proc_entry("systemtap", NULL);
                        _stp_proc_stap = NULL;
Comment 2 David Smith 2009-07-06 10:18:35 EDT
Fixed in upstream commit 83eaf9b.  Besides checking to make sure there are no other files in /proc/systemtap before removal, the transport directory is locked before removal of '/proc/systemtap/${MODULE}' also.
Comment 3 Marc Milgram 2010-03-09 11:38:17 EST
Created attachment 398842 [details]
Proposed patch
Comment 4 David Smith 2010-03-09 11:47:40 EST
Here's a link to the upstream commit (that I mentioned earlier) that more fully fixes this problem:


In includes (basically) the same thing as your proposed patch plus better locking.
Comment 5 Marc Milgram 2010-03-09 14:50:03 EST
Created attachment 398912 [details]
Updated patch

Ported upstream commit 83eaf9b to 4.9.

Upstream uses _stp_lock_transport_dir(), but 4.9 uses _stp_lock_debugfs() - so I used 4.9 locking.
Comment 6 David Smith 2010-03-09 15:09:17 EST
Your updated patch looks reasonable to backport to systemtap-0.6.2.
Comment 9 errata-xmlrpc 2011-02-16 09:09:10 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Note You need to log in before you can comment on or make changes to this bug.