Bug 166876

Summary: xinetd fails to listen on port after Xvnc exits
Product: Red Hat Enterprise Linux 3
Reporter: Dana Holgate <dana.holgate>
Component: xinetd
Assignee: Jan Safranek <jsafrane>
Status: CLOSED WONTFIX
QA Contact: Brock Organ <borgan>
Severity: medium
Priority: medium
Version: 3.0
CC: dana.holgate, sgrubb
Hardware: i686   
OS: Linux   
Last Closed: 2007-10-19 18:55:12 UTC
Attachments (description, flags):
Strace of bug occurrence with comments ("#"). (flags: none)
Strace of bug occurrence with comments ("#"). (flags: none)
strace of stopping xinetd with Xvnc running as child (flags: none)

Description Dana Holgate 2005-08-26 17:54:07 UTC
Description of problem:

xinetd is set up to run Xvnc when contacted by a VNC client.
This is a sample /etc/xinetd.d/vncserver entry:

service vncserver
{
        disable         = no
        id              = 1
        type            = UNLISTED
        port            = 5901
        socket_type     = stream
        protocol        = tcp
        wait            = yes
        user            = user_x
        server          = /usr/bin/Xvnc
        server_args     = -inetd -desktop SystemX -geometry 1280x960 -query localhost -once passwordFile=/home/user_x/.vnc/passwd
        per_source      = 5
        cps             = 100 2
        flags           = IPv4
}

Intermittently, but on a recurring basis, after logging in with a VNC client and
later logging out, the VNC client is unable to connect to the same port and
reports error 10061 (Winsock WSAECONNREFUSED), which indicates that the server
machine is reachable but is not accepting connections on the port.

lsof -i :5901   shows that the port is not held open by either xinetd or Xvnc.
xinetd is running; Xvnc is not running for that user.

Restarting xinetd or doing a hard reconfigure (SIGHUP) corrects the problem 
and VNC client is then able to connect and initiate login.

My opinion:  xinetd appears to have missed the death-of-child signal
(SIGCHLD) when Xvnc exits during logout and therefore fails to resume
listening on the port.


Version-Release number of selected component (if applicable):
xinetd-2.3.12-6.3E

How reproducible:


Steps to Reproduce:
1. Start VNC client on client system to assigned port (e.g., 5901) on server
2. Login to Red Hat Enterprise Linux
3. Logout from Linux
4. Start VNC client on client system to same assigned port on server
5. If error 10061 is reported by VNC client, problem has occurred.
6. If no error, repeat steps 3 and 4 until it does.  

Actual results:
VNC client reports error 10061, which is interpreted as "the server machine is
contactable, but is not accepting connections on the port".
Repeated attempts result in same error.


Expected results:
VNC client should prompt for VNC password and then display Linux login screen.

Additional info:
The VNC client was from RealVNC running on Windows XP or Windows 2000.

The problem has been exhibited on several different ports, each with a 
vncserver entry similar to the sample provided.

This problem has only been exhibited after logging out from the Linux system. 
It has not been seen when the VNC client is terminated without logging out. In 
that case the VNC client is able to reconnect and resume the session without 
difficulty. In this case Xvnc does not terminate.

I did find mention of a problem of "lost SIGCHLD" in report #54963, but no 
resolution of that problem.

Comment 1 Jay Fenlason 2005-08-29 14:35:12 UTC
If you can reproduce this fairly easily, try running xinetd under  

Comment 2 Jay Fenlason 2005-08-29 14:38:24 UTC
If you can reproduce this consistently, running xinetd under strace should 
give some interesting debugging information.  Use something like "strace -o 
/root/xinetd.strace.out -p {the xinetd pid}", trigger the problem, then stop 
xinetd.  Attach the xinetd.strace.out output to this bug report. 

Comment 3 Dana Holgate 2005-08-31 17:29:04 UTC
Created attachment 118314
Strace of bug occurrence with comments ("#").

Comment 4 Dana Holgate 2005-08-31 17:33:34 UTC
Created attachment 118315
Strace of bug occurrence with comments ("#").

Comment 5 Dana Holgate 2005-08-31 17:40:54 UTC
I see in the strace, as xinetd is shutting down, that it kills process 25334
apparently without an error, which seems to indicate that the suspect VNC
server (Xvnc) is still running.  Prior to stopping xinetd I did "ps -ef | grep
25334" and "ps -ef | grep Xvnc" and neither reported process 25334 running.

Is it possible the process was in a state that was not reported by "ps -ef"?

Dana

Comment 6 Steve Grubb 2005-08-31 18:11:12 UTC
The strace has something real unusual in it:

select(25, [3 5 6 8 9 10 11 12 14 16 17 20 21 23 24], NULL, NULL, NULL) = ?
ERESTARTNOHAND (To be restarted)

That errno is supposed to stay inside the kernel. As an aside, this may be
related to bz 161468.

In any event, I do not see a SIGCHLD that is associated with the exit that you
say occurs. Xinetd does a waitpid(-1, &status, WNOHANG). From the waitpid man page:

-1 means to wait for any child process; this is the same behaviour which wait
exhibits.

This means that any SIGCHLD received by xinetd will cause it to reap *all* child
processes that have exited. This is done in a loop until waitpid says there are
no more children.
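
The reaping loop described above can be sketched roughly as follows. This is a minimal Python illustration of the waitpid(-1, WNOHANG) pattern, not xinetd's actual C code; the helper names reap_children, spawn_short_lived and reap_until are made up for this example.

```python
import os
import time


def reap_children():
    """Reap every child that has already exited, without blocking.

    Returns the list of pids that were reaped on this call.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def spawn_short_lived(count):
    """Hypothetical helper: fork `count` children that exit immediately."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit without running any cleanup handlers
        pids.append(pid)
    return pids


def reap_until(count, timeout=5.0):
    """Poll reap_children() until `count` pids are collected or time runs out."""
    seen = []
    deadline = time.monotonic() + timeout
    while len(seen) < count and time.monotonic() < deadline:
        seen.extend(reap_children())
        time.sleep(0.01)
    return seen
```

Because the loop asks for any child (-1), a single SIGCHLD is enough to collect every child that has exited so far, even if several signals were coalesced into one. It can only ever return pids of the caller's own children.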

Further down in the trace is this:

kill(25334, SIGKILL)                    = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
write(4, "\21", 1)                      = 1
sigreturn()                             = ? (mask now [])
waitpid(25334, NULL, WNOHANG)           = 25334
close(13)                               = 0
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 13

This says that pid 25334 did exist and xinetd needed to kill it. As soon as kill
was called, a SIGCHLD was received and waitpid returned with 25334.

This would indicate that either the kernel was in a weird state, or vncd was
still alive.

Comment 7 Dana Holgate 2005-08-31 21:27:46 UTC
I've identified a sequence of actions that produces the behavior repeatably 
every time.

   xinetd is running
   vnc client connects 
     (xinetd starts Xvnc)
   xinetd is restarted - not a hard reconfigure
     (the Xvnc process is orphaned, parent is now pid=1)
   vnc user logs out (Xvnc terminates)
     (xinetd does not see SIGCHLD, because it's not the parent, therefore
      does not resume listening on the port)
   vnc client attempts to connect  
     (no one is listening/selecting on the port)
   vnc user sees the error

This explains the behavior, but also raises a few questions based on the
strace output.

1. How did xinetd know to kill process 25334 as it shut down?
   And, did it really successfully kill 25334?

2. Does xinetd attempt to "recover" the orphaned processes from a previous 
   instantiation of xinetd?

3. Does this behavior impact any other services started by xinetd?

This behavior can be avoided by doing a hard reconfigure (SIGHUP signal to 
xinetd). That leaves the problem that using 
System Settings > Server Settings > Services to start the Service 
Configuration window and then selecting and restarting xinetd does a restart 
and not a hard reconfigure.
   

Comment 8 Jay Fenlason 2005-08-31 21:42:51 UTC
1: xinetd knows the vncd's pid because it forked it.  It has to know the pid 
so it can wait(2) for it. 
 
2: It can't.  There's no way for the new xinetd to discover that a random vncd 
process that appears to have been started by init was actually started by a 
previous xinetd.  And even if it could, there's no way to rearrange the 
process tree such that said random vncd becomes a child of the new xinetd. 
 
3: Only if xinetd isn't correctly killing its child services.  If vncd is 
ignoring the command to terminate, that's solely a vncd problem, not xinetd's. 
 
So the bug appears to be that either xinetd does not kill the vncd on 
shutdown, or that vncd does not exit when its parent tries to shut it down. 
 
Oh, and you missed step 3a in your sequence above: when Xinetd restarts, it 
attempts to bind to the vnc port, fails (because vncd is still running), and 
disables the service.  (What else could it do?) 
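
Step 3a can be illustrated with a small sketch. It assumes the still-running server holds a listening socket on the service port, as described above; the function name try_listen and the use of 127.0.0.1 are choices made for this example only, and xinetd itself is written in C.

```python
import errno
import socket


def try_listen(port, host="127.0.0.1"):
    """Try to bind and listen on host:port.

    Returns (socket, None) on success or (None, errno) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError as exc:
        sock.close()
        return None, exc.errno
    return sock, None


if __name__ == "__main__":
    # The first socket stands in for the orphaned server that still holds the port.
    holder, _ = try_listen(0)  # port 0 lets the kernel pick a free port
    port = holder.getsockname()[1]

    # The second attempt stands in for the restarted daemon trying to bind.
    newcomer, err = try_listen(port)
    print("second bind succeeded" if newcomer else errno.errorcode[err])
    holder.close()
```

While the first socket is open, the second bind is refused with EADDRINUSE, which is the condition under which the restarted daemon gives up on the service.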
 
 

Comment 9 Dana Holgate 2005-08-31 23:47:28 UTC
Created attachment 118324
strace of stopping xinetd with Xvnc running as child


I don't think Xvnc is the problem.

The attached strace is of xinetd being stopped with SIGTERM while an Xvnc is
running.
Note that there is no evidence of xinetd trying to terminate the Xvnc server
as it terminates.   (There were no other servers as children of this xinetd.)

An strace of the Xvnc at the same time shows no evidence of any signal
received to cause termination.

The xinetd man page says the SIGTERM should cause xinetd to terminate all
running servers before terminating xinetd.

SIGQUIT should cause the behavior of terminating without terminating 
running servers.

Is it possible that the configuration of vncservers is causing this behavior?
Perhaps the  wait parameter set to "yes"?

Comment 10 Steve Grubb 2005-09-01 02:39:07 UTC
I am one of the authors of xinetd. Let me take a stab at a few things

>1. How did xinetd know to kill process 25334 as it shut down?

It maintains a list of children. When it forks, it knows the pid. When children
die, waitpid returns the child's pid. It looks up the pid when reaping children
and removes the child from its list of children.
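
As a rough sketch of that bookkeeping (Python for brevity; the class and method names are hypothetical and do not correspond to xinetd's real data structures):

```python
class ChildTable:
    """Maps the pid of each forked server to the name of its service."""

    def __init__(self):
        self._service_by_pid = {}

    def on_fork(self, pid, service):
        """Record a child right after fork() returns its pid."""
        self._service_by_pid[pid] = service

    def on_reap(self, pid):
        """Forget a pid returned by waitpid().

        Returns the service name, or None if the pid was never recorded.
        """
        return self._service_by_pid.pop(pid, None)

    def running(self):
        """Return a copy of the current pid -> service mapping."""
        return dict(self._service_by_pid)
```

A freshly started daemon begins with an empty table, so a server forked by a previous instance is simply unknown to it: on_reap() for that pid would return None, and in practice waitpid() never reports the pid in the first place because the process is not its child.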

>  And, did it really successfully kill 25334?

If it calls kill and you got a SIGCHLD, it killed a child program.

>2. Does xinetd attempt to "recover" the orphaned processes from a previous 
>   instantiation of xinetd?

As Jay explained, it does not and cannot.

>3. Does this behavior impact any other services started by xinetd?

It shouldn't. Each service should be independent regarding the port and address
being bound to. It should be noted that tcp-wait programs are fragile for
several reasons. The child application must always accept the connection, even
on error, or bad things will happen. And tcp_wrappers does not work for unaccepted
sockets, so you should always have flags = NOLIBWRAP.

Regarding comment #9, xinetd only terminates a few safe services. From the
source code in main.c:

      /* Terminate the service if it is:
       * 1) internal (if we don't, it'll zombie)
       * 2) a redirector (again, if we don't it'll zombie)
       * 3) It's RPC (we must deregister it).
       */

This is because you might be running sshd from xinetd and decide to do "service
xinetd restart". If all services were killed, you would be rudely disconnected.
If you made a config mistake, you won't be able to get back in. Maybe I should
update the man page a little to explain it.
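
The rule in that source comment amounts to a simple predicate. The sketch below is only an illustration in Python; the Service fields are hypothetical stand-ins for whatever flags the real C code checks.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical description of one running service."""
    name: str
    internal: bool = False    # built-in service handled by the daemon itself
    redirector: bool = False  # service that forwards to another host
    rpc: bool = False         # RPC service that must be deregistered


def terminated_on_shutdown(svc):
    """True if the service falls under one of the three listed cases."""
    return svc.internal or svc.redirector or svc.rpc
```

Under this rule an external server such as the Xvnc entry in this report matches none of the three cases, so it is left running when the daemon exits on SIGTERM.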

Comment 11 Dana Holgate 2005-09-01 17:37:49 UTC
About your response to my questions:

1.  What you both described is how child processes are normally done. What I 
observed did not appear to be normal.  That is, why couldn't I see process 
25334 with "ps -ef", yet xinetd was able to kill and wait for the death of the 
process?    The ps output showed to me that the Xvnc (25334) had exited. The 
strace of xinetd showed no evidence of the SIGCHLD that should have occurred. 
Then xinetd goes and kills the "non-existing" process.

Perhaps my observations were wrong. Until I see it again, I'll chalk it up to 
the hallucinogenic effects of management's latest problem solving 
directive, "Let's get creative!" (-;

2.  It might be possible, but only in a convoluted, full of holes, error prone 
way, definitely not worth doing even if it is possible.

In response to comment #9, yes, it would be nice if the man page described the 
behavior of SIGTERM and SIGQUIT to indicate what actually happens and the 
consequences to running servers.  Had I realized what was happening I'd have 
been using SIGHUP for the hard reconfig all along. Unfortunately using System 
Settings > Server Settings > Services does the SIGTERM and a quick look at 
that showed it's not set up to do anything else easily.

So a man page improvement seems all that is needed.

Thanks for your attention and time on this.

Comment 12 RHEL Program Management 2007-10-19 18:55:12 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.