Bug 624657

Summary: Timing issue in systemtap initscript restart command
Product: Red Hat Enterprise Linux 6
Reporter: Petr Muller <pmuller>
Component: systemtap
Assignee: David Smith <dsmith>
Status: CLOSED ERRATA
QA Contact: qe-baseos-tools-bugs
Severity: medium
Docs Contact:
Priority: low
Version: 6.0
CC: fche, mjw, ohudlick
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: systemtap-1.4-2.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 644350
Environment:
Last Closed: 2011-05-19 13:54:36 UTC
Type: ---

Description Petr Muller 2010-08-17 12:00:05 UTC
Description of problem:
Our Beaker test for the initscript sometimes fails to restart the service while a script is running. While investigating, I found that this is probably a timing issue: if I add a short delay (for example, sleep 3) between the 'stop' and 'start' calls in the 'restart' function, the issue disappears.

Version-Release number of selected component (if applicable):
systemtap-1.2-9.el6

How reproducible:
On some boxes, not always. When it appears, it is usually reproducible.

Steps to Reproduce:
1. cat > /etc/systemtap/script.d/heart.stp << EOF
> probe timer.ms(500){
>   print("Beat!\n");
> }
> EOF
2. # service systemtap start; sleep 1; service systemtap restart;
  
Actual results:
Starting systemtap:  Compiling heart ... done
 Starting heart ... done
                                                           [  OK  ]
Stopping systemtap:                                        [  OK  ]
Starting systemtap: heart is dead, but another script is running.
                                                           [FAILED]

Expected results:
Starting systemtap: [  OK  ]
Stopping systemtap: [  OK  ]
Starting systemtap:  Compiling heart ... done
 Starting heart ... done
[  OK  ]

Additional info:

Comment 1 David Smith 2010-11-23 20:41:33 UTC
I haven't been able to duplicate this (tried on 3 different machines).  On a machine where this happens, can you show me the new info added to /var/log/systemtap.log?

Comment 2 Petr Muller 2010-11-24 12:58:22 UTC
David,

I had a look at the issue and found that I had omitted an important piece of information needed to reproduce it: it was probably already configured by the automated test run, so I forgot to include it. Sorry about that. I see the issue after doing:

# echo "heart_OPT='-o /tmp/stap-test.log'" > /etc/systemtap/conf.d/heart.conf

before doing step 2. I haven't managed to reproduce the problem without this. Even with it, I had to run the start-sleep-restart sequence in a loop, and it failed in about 1 of 5 runs on one box. It fails consistently on s390x, though.
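
For anyone trying to reproduce this, the start-sleep-restart loop described above could look roughly like the sketch below. It is not the Beaker test itself: the function name repro_loop, the default of 20 runs, and the SERVICE_CMD override (there only so the loop can be dry-run with a stub) are all assumptions made for illustration.

```shell
# Hypothetical reproducer loop; not taken from the original test.
# SERVICE_CMD is an assumed override so the loop can be exercised with a stub.
SERVICE_CMD=${SERVICE_CMD:-service}

# Run start / sleep 1 / restart "$1" times (default 20) and count how
# often the restart fails. Returns non-zero if any restart failed.
repro_loop() {
    local runs=${1:-20} failures=0 i=0
    while [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        "$SERVICE_CMD" systemtap start
        sleep 1
        if ! "$SERVICE_CMD" systemtap restart; then
            failures=$((failures + 1))
        fi
        # Leave the service stopped so every run starts from the same state.
        "$SERVICE_CMD" systemtap stop
    done
    echo "restart failed in $failures of $runs runs"
    [ "$failures" -eq 0 ]
}
```

Run it as root with something like "repro_loop 20" after creating heart.stp and heart.conf as shown in this report.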

This shows up in /var/log/systemtap.log:
# tail -f /var/log/systemtap.log
Nov 24 07:57:12: Starting systemtap: 
Nov 24 07:57:12:  Starting heart ... 
Nov 24 07:57:12: Exec: /usr/bin/staprun -o /tmp/stap-test.log -D /var/cache/systemtap/2.6.32-71.el6.ppc64/heart.ko
Nov 24 07:57:12: Exec: cp -f ./pid /var/run/systemtap/heart
Nov 24 07:57:12: done
Nov 24 07:57:12: Pass: systemtap startup
Nov 24 07:57:13: Stopping systemtap: 
Nov 24 07:57:13: Exec: kill -TERM 3787
Nov 24 07:57:13: Pass: systemtap stopping 
Nov 24 07:57:13: Starting systemtap: 
Nov 24 07:57:13: heart is dead, but another script is running.
Nov 24 07:57:13: Error: Failed to run "heart". (4)

Comment 3 David Smith 2010-11-30 20:27:03 UTC
Fixed in upstream commit 671a1d8:

<http://sources.redhat.com/git/gitweb.cgi?p=systemtap.git;a=commitdiff;h=671a1d824ff1320f9e2fa3ed27d5458cc44a5dcc>

Using the 'heart_OPT' configuration allowed me to reproduce this problem.  Basically we were sending stapio a signal to make it unload the module, but not waiting on the module to unload.
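
To illustrate the idea of waiting for the unload, here is a minimal sketch. It is not the code from the upstream commit. It assumes the kernel module is named after the script (heart) and relies on loaded modules appearing as directories under /sys/module. The names wait_for_unload and stop_and_wait, the MODULE_SYSFS override, and the 10 second default timeout are hypothetical.

```shell
# MODULE_SYSFS is an assumed override (useful for dry runs); loaded kernel
# modules appear as directories under /sys/module.
MODULE_SYSFS=${MODULE_SYSFS:-/sys/module}

# Wait until module "$1" is no longer loaded, polling once per second for
# at most "$2" seconds (default 10). Returns 1 on timeout.
wait_for_unload() {
    local module=$1 timeout=${2:-10} waited=0
    while [ -d "$MODULE_SYSFS/$module" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical stop helper: signal the process, then block until the
# module is gone, so a following start cannot race with the unload.
stop_and_wait() {
    local pid=$1 module=$2
    kill -TERM "$pid" 2>/dev/null
    wait_for_unload "$module"
}
```

With a helper like this, 'restart' no longer needs a fixed sleep between 'stop' and 'start': the stop step returns only once the module has gone away or the timeout has expired.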

While testing the solution to the stopping problem, I ran into a related, but different, problem when loading the module.  When the '-D' option is used, staprun detaches from the terminal and then prints the pid.  Then we'd check the contents of the pid file before it was written.

The above commit fixes both problems.
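
The second problem (reading the pid file before staprun has written it) can be avoided with the same kind of polling. The sketch below is an illustration under assumptions, not the upstream change: wait_for_pidfile and its 10 second default timeout are hypothetical, and it only checks that the file exists and is non-empty.

```shell
# Wait until pid file "$1" exists and is non-empty, then print its
# contents. Polls once per second for at most "$2" seconds (default 10)
# and returns 1 if the file never shows up with content.
wait_for_pidfile() {
    local pidfile=$1 timeout=${2:-10} waited=0
    while [ ! -s "$pidfile" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    cat "$pidfile"
}
```

A caller would use it as 'pid=$(wait_for_pidfile ./pid)' before copying the file to /var/run/systemtap/heart, as in the log above.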

Comment 7 errata-xmlrpc 2011-05-19 13:54:36 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0651.html