Bug 518480

Summary: Oprofile seems to not daemonize properly on ia64

Product: Red Hat Enterprise Linux 5
Component: oprofile
Version: 5.4
Hardware: ia64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: rc
Fixed In Version: 0.9.4-14.el5
Doc Type: Bug Fix

Reporter: Petr Muller <pmuller>
Assignee: William Cohen <wcohen>
QA Contact: BaseOS QE <qe-baseos-auto>
CC: ohudlick
Last Closed: 2010-03-30 08:51:58 UTC

Attachments:
Disconnect children running perfmon from stdin/stdout

Description Petr Muller 2009-08-20 15:09:53 UTC
Description of problem:
During the tier testing for RHEL5, we've encountered oprofile on ia64 hanging on something like this:

opcontrol --start-daemon --no-vmlinux --verbose 2>&1 | tee $TMPOUTPUT

At first we suspected that oprofile itself was hanging for some reason, but when I looked closer I saw that it is the 'tee' command which remains sitting there forever, waiting for input, even after opcontrol itself has finished. I then dissected what happens when the daemon is started, and I suspect the oprofile daemon is not reopening its streams after the fork, which makes the 'tee' command run forever - the write end of the pipe never gets closed, so tee cannot tell that it should exit. If I run 'opcontrol --shutdown' from another terminal, the command in the first one ends - this supports the theory. The only strange thing is that we observe this behavior on ia64 only, while I would expect it to be independent of the platform :/
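
For illustration only, here is a minimal C program (hypothetical, not oprofile source) reproducing the suspected mechanism: a forked child inherits the write end of the pipe feeding tee, so even after the launching process exits, tee never sees EOF until the child exits or closes its stdout.

/* Hypothetical demo -- not oprofile code. Build with: gcc demo.c -o demo
 * Run as: ./demo | tee log
 * tee keeps waiting until the sleeping child exits, even though the
 * parent (the "launcher") returned immediately. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    printf("launcher: starting daemon\n");
    fflush(stdout);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* "daemon" child: keeps the inherited stdout (the pipe) open.
         * Redirecting stdin/stdout/stderr to /dev/null here is what
         * would let the pipe reader terminate. */
        sleep(60);
        _exit(0);
    }

    /* Parent exits right away, but the pipe's write end is still held
     * by the child, so the reader blocks for ~60 seconds. */
    return 0;
}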

Version-Release number of selected component (if applicable):
oprofile-0.9.4-11.el5
but even the old (RHEL5.3) oprofile behaves like this

How reproducible:
always

Steps to Reproduce:
1. opcontrol --deinit; opcontrol --init; opcontrol --start-daemon | tee log
2. wait indefinitely
3. from another terminal, run 'opcontrol --shutdown', and see the command finish
  
Actual results:
tee keeps waiting for more input, even though the opcontrol script has ended

Expected results:
behavior like the other daemons: tee logs just the output of the launching script and exits with it, without logging any output from the daemon itself

Additional info:

Comment 1 William Cohen 2009-08-31 21:29:26 UTC
I attempted to reproduce the problem following the steps above, but was unable to replicate the behavior. This machine is subscribed to RHN and has RHEL 5.3 on it, with the exception of a newer kernel and the listed oprofile rpm. I wasn't able to reproduce the problem on this ia64 machine. I am setting up a Red Hat test machine with a clean version of RHEL 5.4.

What kernel version was used for the testing (uname -a)?

Comment 2 William Cohen 2009-09-01 02:06:55 UTC
I tried a fresh install of RHEL5.4-Server-20090819.0 on a Red Hat ia64 test machine and was still unable to recreate the hang using the steps listed. The machine had the following rpms:

kernel-2.6.18-164.el5
oprofile-0.9.4-11.el5

What was the original shell script that produced the problem on ia64? Was there something in it that intercepted signals?

Comment 3 Petr Muller 2009-09-01 12:31:27 UTC
I can reproduce the issue without any problem on RHEL5.3 and RHEL5.4 testing boxes by simply running

# opcontrol --deinit; opcontrol --init; opcontrol --start-daemon | tee log

as stated above. I did the original investigation on a RHEL5.4 box, so it is strange that you cannot reproduce it.

The versions:
oprofile-0.9.4-11.el5.ia64
on both:
2.6.18-128.el5 (rhel5.3)
2.6.18-162.el5xen (some rhel5.4 candidate)

The original script showing the behavior was the runtest.sh of the /tools/oprofile/Sanity/opcontrol-options RHTS test, at line 189, doing
'opcontrol --start-daemon --no-vmlinux --verbose 2>&1 | tee $TMPOUTPUT'

There is nothing signal-interfering that I know about.

Comment 4 Petr Muller 2009-09-01 12:33:26 UTC
Sample output of the reproducing line:

# opcontrol --deinit; opcontrol --init; opcontrol --start-daemon --verbose | tee log
Stopping profiling.
Killing daemon.
Unloading oprofile module
Parameters used:
SESSION_DIR /var/lib/oprofile
LOCK_FILE /var/lib/oprofile/lock
SAMPLES_DIR /var/lib/oprofile/samples
CURRENT_SAMPLES_DIR /var/lib/oprofile/samples/current
CPUTYPE ia64/itanium2
BUF_SIZE 500
BUF_WATERSHED 250
CPU_BUF_SIZE 1000
SEPARATE_LIB 0
SEPARATE_KERNEL 0
SEPARATE_THREAD 0
SEPARATE_CPU 0
CALLGRAPH 0
VMLINUX none
KERNEL_RANGE
XENIMAGE none
XEN_RANGE
executing oprofiled --session-dir=/var/lib/oprofile --separate-lib=0 --separate-kernel=0 --separate-thread=0 --separate-cpu=0 --events=CPU_CYCLES:18:0:150000:0:1:1, --no-vmlinux --verbose=all
Events: CPU_CYCLES:18:0:150000:0:1:1,
Using 2.6+ OProfile kernel interface.
Running perfmon child on CPU0.
Events: CPU_CYCLES:18:0:150000:0:1:1,
Using 2.6+ OProfile kernel interface.
Waiting on CPU0
Perfmon child up on CPU0
Daemon started.
(... sitting here until ctrl-c or something ...)

Comment 5 William Cohen 2009-09-01 15:27:32 UTC
Created attachment 359410 [details]
Disconnect children running perfmon from stdin/stdout

I originally misunderstood the desired behavior. I compared the ia64 behavior with the x86_64 behavior and found how the "--start-daemon" option was supposed to behave.

When the ia64 oprofiled starts up, it creates child processes to run perfmon. These child processes still have file descriptors open for stdin, stdout, and stderr. The attached patch closes those file descriptors to allow the tee operation to complete. This patch is not in its final state, but it shows what is going wrong on ia64 and the basic approach to fixing it.
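
The attached patch is the authoritative change; purely to illustrate the basic approach described above, a hypothetical C helper is sketched below (the function name detach_child_stdio and its placement are assumptions, not taken from the patch). The idea is that a freshly forked perfmon child redirects its inherited stdin/stdout/stderr to /dev/null, so the write end of the pipe feeding tee is released and tee can exit along with opcontrol.

/* Sketch of the approach only -- the authoritative fix is the attached
 * patch to the ia64 perfmon child setup in oprofiled. */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: call in the child right after fork() to detach
 * it from the terminal/pipe inherited from opcontrol. */
void detach_child_stdio(void)
{
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return;                    /* best effort */
    dup2(fd, STDIN_FILENO);        /* stop reading the inherited stdin */
    dup2(fd, STDOUT_FILENO);       /* release the pipe's write end ... */
    dup2(fd, STDERR_FILENO);       /* ... so tee finally sees EOF      */
    if (fd > STDERR_FILENO)
        close(fd);
}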

Comment 10 errata-xmlrpc 2010-03-30 08:51:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0283.html