Bug 166876
Summary: | xinetd fails to listen on port after Xvnc exits
---|---
Product: | Red Hat Enterprise Linux 3
Reporter: | Dana Holgate <dana.holgate>
Component: | xinetd
Assignee: | Jan Safranek <jsafrane>
Status: | CLOSED WONTFIX
QA Contact: | Brock Organ <borgan>
Severity: | medium
Priority: | medium
Version: | 3.0
CC: | dana.holgate, sgrubb
Hardware: | i686
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2007-10-19 18:55:12 UTC
Description
Dana Holgate
2005-08-26 17:54:07 UTC
If you can reproduce this consistently, running xinetd under strace should give some interesting debugging information. Use something like "strace -o /root/xinetd.strace.out -p {the xinetd pid}", trigger the problem, then stop xinetd. Attach the xinetd.strace.out output to this bug report.

Created attachment 118314 [details]
Strace of bug occurance with comments ("#").
Created attachment 118315 [details]
Strace of bug occurance with comments ("#").
I see in the strace as xinetd is shutting down that it kills process 25334 apparently without an error, which seems to indicate that the suspect vnc client (Xvnc) is still running. Prior to stopping xinetd I did "ps -ef | grep 25334" and "ps -ef | grep Xvnc" and neither reported process 25334 running. Is it possible the process was in a state that was not reported by "ps -ef"?

Dana

The strace has something really unusual in it:

    select(25, [3 5 6 8 9 10 11 12 14 16 17 20 21 23 24], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted)

That errno is supposed to stay inside the kernel. As an aside, this may be related to bz 161468. In any event, I do not see a SIGCHLD that is associated with the exit that you say occurs.

Xinetd does a waitpid(-1, &status, WNOHANG). From the waitpid man page:

    -1 means to wait for any child process; this is the same behaviour which wait exhibits.

This means that any SIGCHLD received by xinetd will cause it to reap *all* child processes that have exited. This is done in a loop until waitpid says there are no more children.

Further down in the trace is this:

    kill(25334, SIGKILL) = 0
    --- SIGCHLD (Child exited) @ 0 (0) ---
    write(4, "\21", 1) = 1
    sigreturn() = ? (mask now [])
    waitpid(25334, NULL, WNOHANG) = 25334
    close(13) = 0
    socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 13

This says that pid 25334 did exist and xinetd needed to kill it. As soon as kill was called, a SIGCHLD was received and waitpid returned with 25334. This would indicate that either the kernel was in a weird state, or vncd was still alive.

I've identified a sequence of actions that produces the behavior repeatably every time.
1. xinetd is running
2. vnc client connects (xinetd starts Xvnc)
3. xinetd is restarted - not a hard reconfigure (the Xvnc process is orphaned, parent is now pid=1)
4. vnc user logs out (Xvnc terminates) (xinetd does not see SIGCHLD, because it's not the parent, therefore does not resume listening on the port)
5. vnc client attempts to connect (no one is listening/selecting on the port)
6. vnc user sees the error

This explains the behavior, but also raises a few questions based on the strace output.

1. How did xinetd know to kill process 25334 as it shutdown? And, did it really successfully kill 25334?
2. Does xinetd attempt to "recover" the orphaned processes from a previous instantiation of xinetd?
3. Does this behavior impact any other services started by xinetd?

This behavior can be avoided by doing a hard reconfigure (SIGHUP signal to xinetd). That leaves the problem that using System Settings > Server Settings > Services to start the Service Configuration window and then selecting and restarting xinetd does a restart and not a hard reconfigure.

1: xinetd knows the vncd's pid because it forked it. It has to know the pid so it can wait(2) for it.

2: It can't. There's no way for the new xinetd to discover that a random vncd process that appears to have been started by init was actually started by a previous xinetd. And even if it could, there's no way to rearrange the process tree such that said random vncd becomes a child of the new xinetd.

3: Only if xinetd isn't correctly killing its child services. If vncd is ignoring the command to terminate, that's solely a vncd problem, not xinetd's.

So the bug appears to be that either xinetd does not kill the vncd on shutdown, or that vncd does not exit when its parent tries to shut it down.

Oh, and you missed step 3a in your sequence above: when xinetd restarts, it attempts to bind to the vnc port, fails (because vncd is still running), and disables the service. (What else could it do?)

Created attachment 118324 [details]
strace of stopping xinetd with Xvnc running as child
I don't think Xvnc is the problem.
The attached strace is of xinetd being stopped with SIGTERM while an Xvnc is
running.
Note that there is no evidence of xinetd trying to terminate the Xvnc server
as it terminates. (There were no other servers as children of this xinetd.)
An strace of the Xvnc at the same time shows no evidence of any signal
received to cause termination.
The xinetd man page says the SIGTERM should cause xinetd to terminate all
running servers before terminating xinetd.
SIGQUIT should cause the behavior of terminating without terminating
running servers.
Is it possible that the configuration of vncservers is causing this behavior?
Perhaps the wait parameter set to "yes"?
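
Step 3a from earlier in the thread (the restarted xinetd fails to bind the vnc port and disables the service) is consistent with the listening socket still being held by the orphaned server, which would be the case if the service runs with wait set to "yes" and the server inherits the listening socket. The following is a minimal Python sketch of that failure mode under those assumptions, not xinetd's code. The helper name try_bind is hypothetical, and the first socket only stands in for the socket held by the orphaned Xvnc.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Return a listening TCP socket, or None if the address is in use.

    Hypothetical helper: it mimics a super-server trying to claim a
    service port at startup.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(5)
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            # Someone else (here: the stand-in for the orphaned server)
            # still owns the port, so the service cannot be started.
            return None
        raise
    return sock
```

Binding a second socket to the port while the first one is still listening returns None; once the first socket is closed, the same call succeeds again. A process that only tries to bind at startup would never notice that the port became free later, which matches the behavior reported here.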
I am one of the authors of xinetd. Let me take a stab at a few things.

>1. How did xinetd know to kill process 25334 as it shutdown?

It maintains a list of children. When it forks, it knows the pid. When children die, waitpid returns the child's pid. It looks up the pid when reaping children and removes the child from its list of children.

> And, did it really successfully kill 25334?

If it calls kill and you got a SIGCHLD, it killed a child program.

>2. Does xinetd attempt to "recover" the orphaned processes from a previous
> instantiation of xinetd?

As Jay explained, it does not and cannot.

>3. Does this behavior impact any other services started by xinetd?

It shouldn't. Each service should be independent regarding the port and address being bound to.

It should be noted that tcp-wait programs are fragile for several reasons. The child application must always accept the connection, even on error, or bad things will happen. And tcp_wrappers does not work for unaccepted sockets, so you should always have flags = NOLIBWRAP.

Regarding comment #9, xinetd only terminates a few safe services. From the source code in main.c:

    /* Terminate the service if it is:
     * 1) internal (if we don't, it'll zombie)
     * 2) a redirector (again, if we don't it'll zombie)
     * 3) It's RPC (we must deregister it).
     */

This is because you might be running sshd from xinetd and decide to do "service xinetd restart". If all services were killed, you would be rudely disconnected. If you made a config mistake, you won't be able to get back in. Maybe I should update the man page a little to explain it.

About your response to my questions:

1. What you both described is how child processes are normally done. What I observed did not appear to be normal. That is, why couldn't I see process 25334 with "ps -ef", yet xinetd was able to kill and wait for the death of the process? The ps output showed to me that the Xvnc (25334) had exited.
The strace of xinetd showed no evidence of the SIGCHLD that should have occurred. Then xinetd goes and kills the "non-existing" process. Perhaps my observations were wrong. Until I see it again, I'll chalk it up to the hallucinogenic effects of management's latest problem solving directive, "Let's get creative!" (-;

2. It might be possible, but only in a convoluted, full of holes, error prone way, definitely not worth doing even if it is possible.

In response to comment #9, yes it would be nice if the man page described the behavior of SIGTERM and SIGQUIT to indicate what actually happens and the consequences to running servers. Had I realized what was happening I'd have been using SIGHUP for the hard reconfig all along. Unfortunately using System Settings > Server Settings > Services does the SIGTERM and a quick look at that showed it's not set up to do anything else easily. So a man page improvement seems all that is needed.

Thanks for your attention and time on this.

This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet that criteria, it is now being closed. For more information of the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.
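
For readers following the waitpid discussion in this thread, the child bookkeeping described above (record each pid at fork time, reap with waitpid(-1, ..., WNOHANG) in a loop, and drop reaped pids from the list) can be sketched as follows. This is a minimal Python sketch under those assumptions, not xinetd's implementation; ChildTable, spawn and reap are hypothetical names.

```python
import os


class ChildTable:
    """Tracks forked servers by pid (hypothetical stand-in for xinetd's list)."""

    def __init__(self):
        self.children = {}  # pid -> service name

    def spawn(self, service, target):
        """Fork a child that runs target(), and remember its pid."""
        pid = os.fork()
        if pid == 0:
            # Child: run the stand-in server body, then exit without
            # returning into the parent's code path.
            try:
                target()
            finally:
                os._exit(0)
        self.children[pid] = service
        return pid

    def reap(self):
        """Reap every child that has exited; return the services they ran."""
        freed = []
        while True:
            try:
                pid, _status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break  # this process has no children left
            if pid == 0:
                break  # children exist, but none has exited yet
            service = self.children.pop(pid, None)
            if service is not None:
                freed.append(service)
        return freed
```

A process forked by an earlier instance never appears in this table, and waitpid never reports it to the new instance, which is why the orphaned Xvnc cannot be "recovered" after a restart.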