Bug 127849 - application breaks under RHEL3, possibly because of SIGCHLD workaround
Summary: application breaks under RHEL3, possibly because of SIGCHLD workaround
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Ingo Molnar
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-07-14 18:22 UTC by Jim Burnes
Modified: 2007-11-30 22:07 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-08-18 11:24:16 UTC
Target Upstream Version:
Embargoed:



Description Jim Burnes 2004-07-14 18:22:09 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6)
Gecko/20040506 Firefox/0.8

Description of problem:
Application 'nxserver' (compiled for RH9) is exiting unexpectedly
immediately after the following error message from the kernel is
displayed:

kernel: application bug: nxserver(4788) has SIGCHLD set to SIG_IGN but calls wait().
Jul 14 11:12:17 is-fletch kernel: (see the NOTES section of 'man 2 wait'). Workaround activated.

I understand that you are not supporting NX, but something you have
changed in the kernel may well be breaking a properly working
application.  Could it be something that was backported into the
2.4.21 kernel?

Sometimes the parent process involved in this call considers the child
to have closed normally and reports the child process terminating
normally.  Sometimes the parent process loses track of the child
process completely; whether that is due to the particular workaround
chosen or whether the child process actually aborts is anyone's guess.
Possibly it's related to the child process exiting before the call to
waitpid() is even initiated by the parent.  Maybe some new behavior
exhibited in the signal handling prevents a zombie from being created.
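
For anyone trying to reproduce the condition the kernel message
describes, here is a minimal C sketch. It is an illustration only, not
nxserver's actual code, and the function name wait_with_sigchld_ignored
is hypothetical. It ignores SIGCHLD, forks a child, and then calls
waitpid(). Under POSIX.1-2001 semantics (which current Linux kernels
follow) the child is reaped automatically and never becomes a zombie, so
waitpid() fails with ECHILD and the exit status is lost. The RHEL3
2.4.21 workaround may behave differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: ignore SIGCHLD, fork a child, then waitpid() for it.
 * Returns the errno reported by waitpid(), 0 if waitpid() succeeded,
 * or -1 if the setup (sigaction/fork) failed. */
int wait_with_sigchld_ignored(void)
{
    struct sigaction ign = {0}, old;
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    if (sigaction(SIGCHLD, &ign, &old) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        sigaction(SIGCHLD, &old, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(0);                       /* child: exit immediately */

    int status = 0;
    pid_t r = waitpid(pid, &status, 0); /* parent: try to collect the child */
    int err = (r == -1) ? errno : 0;
    assert(r == -1 || r == pid);        /* waitpid(pid, ...) can only report that child */

    sigaction(SIGCHLD, &old, NULL);     /* restore the previous disposition */
    return err;
}
```

Calling wait_with_sigchld_ignored() and comparing the result with ECHILD
shows whether the parent was able to see the child's exit at all.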

(I've read the notes in the man page, along with as much as I could
find about this issue online.)

The vendor of 'nx' (nomachine) is participating in looking for the
source of this bug.


Version-Release number of selected component (if applicable):
kernel-2.4.21-15-ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL3
2. Download the eval 'nxserver' and nxclient from www.nomachine.com
3. Install them and establish a connection from a remote machine to the
server you just configured.
4. Observe /var/log/messages and notice that correctly authenticated
sessions never start up an X environment.  They either shut down
immediately or hang for around 60 seconds and then shut down.
    

Actual Results:  See above.


Expected Results:  An X Windows login session should have been
established and the selected X environment should have started up
(GNOME, KDE, or other).

Additional info:

I have seen references to this kernel message elsewhere, but very
little specific information on the workaround.

Comment 1 Ernie Petrides 2004-07-14 18:47:48 UTC
Hello, Jim.  When the 'nxserver' application is compiled on RHEL3,
does the problem still occur?  (I'm not sure whether we guarantee
application-binary-compatibility with RHL 9, but it would be nice
to remove this variable from the equation.)


Comment 2 Arjan van de Ven 2004-07-14 18:53:13 UTC
The SIGCHLD issue is that it's not valid to call wait() (and, by
extension, library functions that call wait()) when you've set SIGCHLD
to SIG_IGN. That will cause deadlocks in case the child gets reaped by
init before the wait() executes. Older kernels sort of tolerated
this; NPTL does not, though our kernels try to work around it somewhat.
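
As a rough illustration of the pattern that avoids the problem, the
sketch below keeps SIGCHLD at its default disposition while a child is
outstanding, so the child stays a zombie until the parent collects it
with waitpid(). The function name run_child_and_collect_status is
hypothetical, and this is only an assumption about how an application
might be restructured, not a description of any fix applied to nxserver.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with exit_code and return
 * the exit status the parent collects, or -1 on any failure. */
int run_child_and_collect_status(int exit_code)
{
    /* Make sure SIGCHLD is NOT ignored; an ignored disposition can be
     * inherited from the parent process across exec. */
    struct sigaction dfl = {0}, old;
    dfl.sa_handler = SIG_DFL;
    sigemptyset(&dfl.sa_mask);
    if (sigaction(SIGCHLD, &dfl, &old) != 0)
        return -1;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0)
        _exit(exit_code);               /* child */

    if (pid > 0) {                      /* parent */
        int status = 0;
        pid_t r;
        do {                            /* retry if interrupted by a signal */
            r = waitpid(pid, &status, 0);
        } while (r == -1 && errno == EINTR);

        if (r != -1) {
            assert(r == pid);           /* we asked for exactly this child */
            if (WIFEXITED(status))
                result = WEXITSTATUS(status);
        }
    }

    sigaction(SIGCHLD, &old, NULL);     /* restore the previous disposition */
    return result;
}
```

With the default disposition in place, run_child_and_collect_status(7)
should hand back the child's exit code rather than losing it.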

