Red Hat Bugzilla – Bug 518480
Oprofile seems to not daemonize properly on ia64
Last modified: 2016-09-19 22:05:32 EDT
Description of problem:
During the tier testing for RHEL5, we've encountered oprofile on ia64 hanging on something like this:
opcontrol --start-daemon --no-vmlinux --verbose 2>&1 | tee $TMPOUTPUT
At first we suspected that oprofile itself was hanging for some reason, but on closer inspection it is the 'tee' command that sits there forever waiting for input, even after opcontrol itself has finished. I then went through what is done when the daemon is started, and I suspect that the oprofile daemon is not reopening its streams after the fork. That makes the 'tee' command run forever: the write end of the pipe never gets closed, so tee never sees end-of-file and cannot find out that it should exit. If I run 'opcontrol --shutdown' from another terminal, the command in the first one ends, which supports this theory. The only odd thing is that we observe this behavior on ia64 only, while I would expect it to be independent of the platform :/
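The suspected mechanism can be reproduced without oprofile. The following is only a minimal sketch and is not taken from opcontrol: `leaky_launcher` is a hypothetical stand-in for the launching script, and `sleep 2` stands in for a long-lived child that inherits the launcher's stdout. Under these assumptions, tee should return only once that child exits and the last copy of the pipe's write end is closed.

```shell
#!/bin/sh
# Hypothetical stand-in for a launcher that starts a long-lived child
# and then returns immediately. The background child (sleep) inherits
# stdout, so it keeps the write end of the pipe open after the launcher
# itself has finished.
leaky_launcher() {
    sleep 2 &
    echo "launcher finished"
}

start=$(date +%s)
# tee reads until every holder of the pipe's write end has closed it,
# so it waits for the background sleep, not just for the launcher.
leaky_launcher | tee /dev/null
end=$(date +%s)

leaky_elapsed=$((end - start))
echo "tee returned after ${leaky_elapsed}s"
```

With a real daemon in place of `sleep 2`, the child never exits on its own, which would match tee waiting until 'opcontrol --shutdown' is run.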
Version-Release number of selected component (if applicable):
but even the old (RHEL5.3) oprofile behaves like this
Steps to Reproduce:
1. opcontrol --deinit; opcontrol --init; opcontrol --start-daemon | tee log
2. wait indefinitely
3. from another terminal, run 'opcontrol --shutdown', and see the command in the first terminal finish
tee keeps waiting for more input, even after the opcontrol script has ended
the same behavior as other daemons - tee logs just the output of the launching script and ends with it, not logging any output from the daemon itself
I attempted to reproduce the problem on an ia64 machine following the steps above, but was unable to replicate the behavior. This machine is subscribed to RHN and has RHEL 5.3 on it, with the exception of a newer kernel and the listed oprofile rpm. I am setting up a Red Hat test machine with a clean version of RHEL 5.4.
What kernel version was used for the testing (uname -a)?
I tried a fresh install of RHEL5.4-Server-20090819.0 on a Red Hat ia64 test machine and was still unable to recreate the hang using the steps listed. The machine had the following rpms:
What was the original shell script that produced the problem on ia64? Was there something in it that intercepted signals?
I can reproduce the issue without any problem on RHEL5.3 and RHEL5.4 testing boxes by simply running
# opcontrol --deinit; opcontrol --init; opcontrol --start-daemon | tee log
as stated above. I did the original investigation on a RHEL5.4 box, so it is weird that you cannot reproduce it.
2.6.18-162.el5xen (some rhel5.4 candidate)
The original script showing the behavior was the runtest.sh of the /tools/oprofile/Sanity/opcontrol-options RHTS test, which at line 189 runs
'opcontrol --start-daemon --no-vmlinux --verbose 2>&1 | tee $TMPOUTPUT'
There is nothing signal-interfering that I know about.
Sample output of the reproducing line:
# opcontrol --deinit; opcontrol --init; opcontrol --start-daemon --verbose | tee log
Unloading oprofile module
executing oprofiled --session-dir=/var/lib/oprofile --separate-lib=0 --separate-kernel=0 --separate-thread=0 --separate-cpu=0 --events=CPU_CYCLES:18:0:150000:0:1:1, --no-vmlinux --verbose=all
Using 2.6+ OProfile kernel interface.
Running perfmon child on CPU0.
Using 2.6+ OProfile kernel interface.
Waiting on CPU0
Perfmon child up on CPU0
(... sitting here until ctrl-c or something ...)
Created attachment 359410
Disconnect children running perfmon from stdin/stdout
I originally misunderstood the desired behavior. I compared the ia64 behavior with that on x86_64 and found out how the "--start-daemon" option was supposed to behave.
When oprofiled starts up on ia64, it creates child processes to run perfmon. These child processes still have the inherited file descriptors for stdin, stdout, and stderr open. The attached patch closes those file descriptors so that the tee operation can finish. This patch is not in its final state, but it shows what is going wrong on ia64 and the basic approach to fixing it.
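The attached patch itself is not reproduced here. As a rough shell analog of the approach it describes, the sketch below detaches each child from the inherited stdin, stdout, and stderr by redirecting them to /dev/null. The names `spawn_detached_child` and `start_daemon_sketch` are hypothetical, and `sleep 2` is only a stand-in for a perfmon child; the real fix would be made in the daemon's C code after the fork.

```shell
#!/bin/sh
# Hypothetical stand-in for one perfmon child. Redirecting fds 0, 1 and 2
# to /dev/null is the shell analog of closing the inherited stdin, stdout
# and stderr in the child after fork.
spawn_detached_child() {
    sleep 2 </dev/null >/dev/null 2>&1 &
}

# Hypothetical stand-in for daemon start-up: report, spawn one child
# per CPU, then return without waiting for the children.
start_daemon_sketch() {
    for cpu in 0 1; do
        echo "Running child on CPU$cpu."
        spawn_detached_child
    done
}

start=$(date +%s)
# The children no longer hold the pipe's write end, so tee sees
# end-of-file as soon as the start-up function returns.
output=$(start_daemon_sketch | tee /dev/null)
end=$(date +%s)

elapsed=$((end - start))
echo "$output"
echo "pipeline returned after ${elapsed}s"
```

In this sketch the pipeline returns while the children are still running, which is the behavior expected from '--start-daemon': tee logs only the start-up output and ends with the launching script.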
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.