Bug 49132 - xinetd stops responding to new connections after a period of time
Summary: xinetd stops responding to new connections after a period of time
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: xinetd
Version: 7.1
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Trond Eivind Glomsrød
QA Contact: Ben Levenson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2001-07-14 22:41 UTC by Bishop Clark
Modified: 2007-04-18 16:34 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2001-12-01 00:17:01 UTC
Embargoed:


Attachments
test fix for the disabled services (1.76 KB, patch) - 2001-09-01 02:21 UTC, Rob Braun
Userspace ping server (2.05 KB, text/plain) - 2001-10-10 10:02 UTC, Matthew Kirkwood
Userspace ping server xinetd config (143 bytes, text/plain) - 2001-10-10 10:02 UTC, Matthew Kirkwood

Description Bishop Clark 2001-07-14 22:41:54 UTC
Description of problem:
After a period of time, 3 days in this case, xinetd appears to stop
responding to incoming connections.  My netstat -a output is full of lines like
the following (shortened to avoid line wrapping):

tcp   0   0    atlas:vtun        clark.hosted.co:4033 SYN_RECV  -


How reproducible:
Sometimes

Steps to Reproduce:
1. Configure xinetd with a bunch of services
2. Hit those services infrequently - 8 times a day
3. Stare, stunned, at the 25th - or so - attempt failing.
	
Actual Results:  Well, as mentioned:
 - Clients see a Connection Refused msg
 - IPChains logs a valid, accepted connection (-j ACCEPT -l)
 - Xinetd spawns no process (as confirmed by syslog) 

Expected Results:  (seen after service xinetd restart)
 - xinetd spawns process (Jul 11 18:41:39 atlas xinetd[28405]: START:...)
 - connection proceeds.

Additional info:

IPChains, when prompted, will spit reports into my syslog that prove that 
connections are being attempted to the vtun port.

The reason I'm claiming xinetd is responsible is that the fix for
this problem is: service xinetd restart.  Currently, my servers are
restarting xinetd every 48 hours as an NT-style preventative measure.

 - I'm apparently using xinetd-2.3.0-1.71.
 - There is NO message in my syslog about looping services, nor (with one
hit every 3 hours) should there be.
 - I want to reiterate - restarting xinetd stops the problem for another 3
days.

Comment 1 Trond Eivind Glomsrød 2001-07-20 19:46:49 UTC
Can you try the one at http://people.redhat.com/teg/xinetd/?

Comment 2 Bishop Clark 2001-07-20 19:54:45 UTC
Yes.  I've picked it up and will install it late this evening.  This should give 
us some news next week.


Comment 3 Trond Eivind Glomsrød 2001-08-15 19:55:27 UTC
Can you try 2.3.0-6 from the same location?

Comment 4 Bishop Clark 2001-08-15 20:04:23 UTC
Will do.

Comment 5 Fred Feirtag 2001-08-16 16:03:50 UTC
I'm having a possibly related problem with xinetd getting hung,
failing to wait for child processes.  After some time I get:

2271 ?        S      0:01 xinetd -stayalive -reuse -pidfile /var/run/xinetd.pid
6990 ?        Z      0:00  \_ [in.tftpd <defunct>]

And no future tftp requests are answered until xinetd is restarted.

One relatively good way to tickle this is with pxelinux (from syslinux-1.63)
performing a diskless boot.  Several tftp requests are thrown out in rapid
succession, and I wouldn't be surprised if each in.tftpd's procedure of forking
a chroot child further complicates things and is a factor:

Aug 15 23:42:29 sys109 tftpd[1917]: sending /pxelinux.0
Aug 15 23:42:29 sys109 tftpd[1917]: tftp: client does not accept options
Aug 15 23:42:29 sys109 tftpd[1919]: sending /pxelinux.0
Aug 15 23:42:29 sys109 tftpd[1921]: could not serve file /pxelinux.cfg/C0A80A7A
Aug 15 23:42:29 sys109 tftpd[1923]: could not serve file /pxelinux.cfg/C0A80A7
Aug 15 23:42:29 sys109 tftpd[1925]: could not serve file /pxelinux.cfg/C0A80A
Aug 15 23:42:29 sys109 tftpd[1927]: could not serve file /pxelinux.cfg/C0A80
Aug 15 23:42:29 sys109 tftpd[1929]: could not serve file /pxelinux.cfg/C0A8
Aug 15 23:42:29 sys109 tftpd[1931]: could not serve file /pxelinux.cfg/C0A
Aug 15 23:42:29 sys109 tftpd[1933]: could not serve file /pxelinux.cfg/C0
Aug 15 23:42:29 sys109 tftpd[1935]: could not serve file /pxelinux.cfg/C
Aug 15 23:42:29 sys109 tftpd[1937]: sending /pxelinux.cfg/default
Aug 15 23:42:29 sys109 tftpd[1939]: sending /linux
Aug 15 23:42:30 sys109 tftpd[1941]: sending /initrd-diskless

Note the odd PID numbers--while it is the child process reporting, it is
the parent process which always turns out as the sticking zombie, as for
example:

 2180 ?        S      0:00  \_ /usr/sbin/sshd
 2181 pts/1    S      0:00      \_ -csh
 7191 pts/1    S      0:00          \_ xinetd -d -stayalive -reuse -loop 15
 7240 ?        Z      0:00              \_ [in.tftpd <defunct>]

and

Aug 16 15:43:53 sys109 tftpd[7235]: could not serve file /pxelinux.cfg/C0A8
Aug 16 15:43:53 sys109 tftpd[7237]: could not serve file /pxelinux.cfg/C0A
Aug 16 15:43:53 sys109 tftpd[7239]: could not serve file /pxelinux.cfg/C0
Aug 16 15:43:53 sys109 tftpd[7241]: could not serve file /pxelinux.cfg/C

PID 7241 answers the request and logs, but 7240 is the zombie.

My workaround is to change pxelinux.asm to skip searching for the files I
don't use (the hexadecimal IP-specific names) and just attempt to retrieve
the file I do use, "/pxelinux.cfg/default."  This increases the time between
failures from ~10 diskless boots to ~100 diskless boots.

I'm seeing the same thing under both 2.4.7 and 2.4.9-pre4, and with both
xinetd-2.3.0-1.71.i386.rpm and xinetd-2.3.0-7.i386.rpm.

Fred

ffeirtag

Comment 6 Rob Braun 2001-09-01 02:21:59 UTC
Created attachment 30474 [details]
test fix for the disabled services

Comment 7 Rob Braun 2001-09-01 02:23:45 UTC
Can you try the attached diff to 2.3.3?  I think it might fix the problem.  I was 
able to occasionally get defunct processes when I created a test program 
to splat tftp requests as fast as possible to xinetd, and this seemed to fix it 
for my test case.

Comment 8 Trond Eivind Glomsrød 2001-09-06 20:14:36 UTC
We tried that here, and it worked fine for a time (applied to 2.3.3).
Eventually, this started happening:

Sep  6 15:58:39 porkchop xinetd[24655]: {general_handler} (24655) Unexpected
signal: 11 (Segmentation fault)
Sep  6 15:58:39 porkchop last message repeated 9 times
Sep  6 15:58:39 porkchop xinetd[24655]: Resetting...
Sep  6 15:58:39 porkchop xinetd[24655]: {general_handler} (24655) Unexpected
signal: 11 (Segmentation fault)
Sep  6 15:58:40 porkchop last message repeated 8 times
Sep  6 15:58:40 porkchop xinetd[24655]: Resetting...
Sep  6 15:58:40 porkchop xinetd[24655]: {general_handler} (24655) Unexpected
signal: 11 (Segmentation fault)
Sep  6 15:58:40 porkchop last message repeated 8 times
Sep  6 15:58:40 porkchop xinetd[24655]: Resetting...
Sep  6 15:58:40 porkchop xinetd[24655]: {general_handler} (24655) Unexpected
signal: 11 (Segmentation fault)
Sep  6 15:58:40 porkchop last message repeated 8 times
Sep  6 15:58:40 porkchop xinetd[24655]: Resetting...
Sep  6 15:58:40 porkchop xinetd[24655]: {general_handler} (24655) Unexpected
signal: 11 (Segmentation fault)


Comment 9 Rob Braun 2001-09-06 20:24:13 UTC
Despite appearances, this is good.  The defunct process problem should
be fixed, but that patch had its own problems.  Could you try
http://www.xinetd.org/devel/xinetd-2001.9.6.tar.gz?  It includes what should
fix that segfaulting, and it also has some portability changes (which
shouldn't affect this problem).

Comment 10 Bishop Clark 2001-09-06 20:24:31 UTC
teg;

the 2.3.0-6 did the same thing while I was on vacation last week.  I forgot
until just this moment when I read your update.  Whatever it is, then, it's
present from 2.3.0 to 2.3.3.  Apologies for the delay - my services don't loop
as often, and I was a bad tester here.  I like the tftp stress tester though,
as that was a good idea, as would be a web server via xinetd with loop
protection disabled.

 - bish

Comment 11 Rob Braun 2001-09-06 21:11:11 UTC
A web server test should not show the same signs.  From my
interpretation of the original problem, this is a wait=yes problem
(regardless of udp vs. tcp, and it has nothing to do with the old linuxconf
problems).  Essentially, a wait=yes service that respawned quickly, and as
a result died quickly, could potentially be left in a suspended state
because xinetd didn't catch one of the children dying.  The end result was
a zombied process hanging around, and xinetd waiting for a SIGCHLD for
that process before resuming the service again (which of course will
never happen: it missed the first SIGCHLD, the process is zombied,
and the service will never be resumed).

Now, a web server presumably is run with wait=no, since you don't want
to serialize your web connections at the socket layer.  This should never
exhibit the "xinetd stops responding to new connections after a period of
time" problem, since the service is never suspended, since it is wait=no.
The service should always be available, unless xinetd itself has died.
Now, if xinetd is dying, I think that's a different bug.  The segfault that
occurred after the patch was applied should have been from a bug in the
patch, not part of the original problem.  If you're getting segfaults with
xinetd-2.3.3 without applying the patch, then you're seeing a different
problem and you should probably be filing a different bug.
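
To illustrate that failure mode, here is a minimal sketch (an illustration only,
not xinetd's actual source) of the flag-based pattern: the handler can only
record that at least one SIGCHLD arrived, so when two wait=yes children die
back to back, one of them is never reaped and its service is never resumed.

  /* Sketch of the flag-based pattern, not xinetd's code.  A binary flag
   * cannot count signals, so a second child death before the main loop
   * runs leaves a zombie and a permanently suspended wait=yes service. */
  #include <signal.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static volatile sig_atomic_t got_sigchld = 0;   /* binary flag: loses counts */

  static void chld_handler(int sig) { (void)sig; got_sigchld = 1; }

  int main(void)
  {
      struct sigaction sa;
      sa.sa_handler = chld_handler;
      sigemptyset(&sa.sa_mask);
      sa.sa_flags = 0;
      sigaction(SIGCHLD, &sa, NULL);

      for (;;) {
          /* ... select() on service sockets, fork() wait=yes children ... */
          if (got_sigchld) {
              got_sigchld = 0;
              pid_t pid = waitpid(-1, NULL, WNOHANG);   /* reaps at most one child */
              (void)pid;          /* resume the wait=yes service owned by pid */
          }
          pause();                /* stand-in for blocking in select() */
      }
  }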

Comment 12 Trond Eivind Glomsrød 2001-09-06 21:14:53 UTC
The service I think is triggering this is tftp, which is wait=yes.

Comment 13 Bishop Clark 2001-09-06 21:18:27 UTC
BBraun;

Quite right.  My apologies for suggesting that as a hypothetical test case.  BenL is right,
though, that the problem is being reported with a TFTP set-up, and that should
be a much better set-up than how I'm using xinetd, which now requires a good month to see
the problem recur (and I'm using wait=yes, and I'm not using it to serve
web pages, as that was entirely a suggestion, etc, etc, disclaimer, waiver, etc).

 - bish

Comment 14 Rob Braun 2001-09-06 21:45:28 UTC
I tested the first round patch just hammering away on a tftp service, and 
was able to reproduce the zombied process problem enough of the time 
to diagnose it.   I didn't see the segfaulting above when doing that test, but 
afterwards when working on other portability issues, I found the 
segfaulting you were seeing.  The xinetd-2001.9.6.tar.gz fixes all known 
problems with the first round patch.  

Bishop: no problems.  I didn't see a thorough diagnosis in the bug, and
my own mention of the patch was along the lines of "try this and call me in
the morning" without an explanation of the problem, which doesn't really
help anybody.

For the record, the way I've tackled this problem was to rework xinetd's
event handling in its main loop.  The main loop used to set flags in signal
handlers, then actually do the work in the main loop.  The problem was that
the flags were binary, so there was no way to detect whether a signal happened
more than once before the signal was addressed by the main loop.  To
get around this, I used a timer mechanism introduced in xinetd-2.3.2, with
the timer set to 0 (so it is handled immediately).  With this mechanism,
a new event is inserted into the timer queue for every action that needs
to be taken.  This handles multiple signals arriving before the main loop
iterates, and it is also integrated with the select, so there should never be a
problem of blocking in the select and ignoring a pending event.

The segfaulting in the first patch was caused by me not dealing with
some boundary cases in the timer mechanism.  These only manifested
themselves when I started using timers with a 0 length of time before they
expired: time could pass between the addition of the timer and when the
timer actually executed, potentially resulting in a negative time until the
timer expired.  That's been fixed in xinetd-2001.9.6.
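
A rough sketch of the idea only (not the xinetd-2001.9.6 source, which uses the
timer queue described above rather than this illustrative self-pipe): every
signal delivery queues its own event, the main loop drains the whole queue,
and select() watches the queue's file descriptor so a signal arriving just
before the block cannot be ignored.

  /* Sketch of "one queued event per signal" via a self-pipe; assumed
   * structure for illustration, not xinetd's implementation. */
  #include <fcntl.h>
  #include <signal.h>
  #include <sys/select.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static int sigpipe[2];                     /* [0] = read end, [1] = write end */

  static void handler(int sig)
  {
      int ev = sig;
      write(sigpipe[1], &ev, sizeof ev);     /* one queued event per delivery */
  }

  int main(void)
  {
      pipe(sigpipe);
      fcntl(sigpipe[0], F_SETFL, O_NONBLOCK);

      struct sigaction sa;
      sa.sa_handler = handler;
      sigemptyset(&sa.sa_mask);
      sa.sa_flags = 0;
      sigaction(SIGCHLD, &sa, NULL);

      for (;;) {
          fd_set rfds;
          FD_ZERO(&rfds);
          FD_SET(sigpipe[0], &rfds);         /* plus the service sockets */
          select(sigpipe[0] + 1, &rfds, NULL, NULL, NULL);

          int ev;
          while (read(sigpipe[0], &ev, sizeof ev) == (ssize_t)sizeof ev) {
              if (ev == SIGCHLD) {
                  pid_t pid;
                  while ((pid = waitpid(-1, NULL, WNOHANG)) > 0) {
                      /* re-activate the wait=yes service owned by 'pid' */
                  }
              }
          }
      }
  }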

Comment 15 Matthew Kirkwood 2001-10-10 10:01:27 UTC
I'm seeing the same problem with 2.3.3-1.  My testcase is easily reproducible
(to me, at least -- I'm on SMP, which probably helps): I can usually extract a
zombie process within a dozen packets.

Attachments to follow.
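
(The attachments aren't reproduced below.  Purely to show the shape of such a
testcase -- a sketch, not the attached code -- a wait=yes UDP service that
answers one packet and exits immediately is enough to hammer xinetd's
fork/reap path.)

  /* Illustration only, not the attached ping server.  For a wait=yes UDP
   * service, xinetd hands the datagram socket in on fd 0; answering one
   * packet and exiting forces a fork-and-reap cycle per packet. */
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[512];
      struct sockaddr_in peer;
      socklen_t peerlen = sizeof peer;

      ssize_t n = recvfrom(0, buf, sizeof buf, 0,
                           (struct sockaddr *)&peer, &peerlen);
      if (n >= 0)
          sendto(0, buf, (size_t)n, 0, (struct sockaddr *)&peer, peerlen);

      return 0;   /* exiting right away maximizes respawn pressure on xinetd */
  }

Hitting such a service with a quick burst of datagrams reproduces the hang far
faster than normal traffic would.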

Comment 16 Matthew Kirkwood 2001-10-10 10:02:16 UTC
Created attachment 33721 [details]
Userspace ping server

Comment 17 Matthew Kirkwood 2001-10-10 10:02:51 UTC
Created attachment 33722 [details]
Userspace ping server xinetd config

Comment 18 Rob Braun 2001-10-11 08:02:14 UTC
Can those having problems please try out
http://www.xinetd.org/devel/xinetd-2001.10.10.tar.gz ?
I've tested with the ping server attached, and it seems to work fairly well.
This is a work in progress, and I'd like to get some feedback.  Probably not
ready for prime-time use though.

Comment 19 Matthew Kirkwood 2001-10-30 14:43:42 UTC
I still see the problem with this snapshot.  Strace offers:

write(3, "01/10/30@14:43:36: DEBUG: {serve"..., 6301/10/30@14:43:36: DEBUG:
{server_start} Starting service ping
) = 63
fork()                                  = 2428
time([1004453016])                      = 1004453016
time([1004453016])                      = 1004453016
write(3, "01/10/30@14:43:36: DEBUG: {main_"..., 5801/10/30@14:43:36: DEBUG:
{main_loop} active_services = 0
) = 58
rt_sigsuspend([]01/10/30@14:43:36: DEBUG: {exec_server} duping 6
 <unfinished ...>
--- SIGCHLD (Child exited) ---
<... rt_sigsuspend resumed> )           = -1 EINTR (Interrupted system call)
time([1004453016])                      = 1004453016
write(3, "01/10/30@14:43:36: DEBUG: {my_ha"..., 5001/10/30@14:43:36: DEBUG:
{my_handler} writing 17
) = 50
write(5, "\21\0\0\0", 4)                = 4
sigreturn()                             = ? (mask now [])
time([1004453016])                      = 1004453016
write(3, "01/10/30@14:43:36: DEBUG: {main_"..., 5801/10/30@14:43:36: DEBUG:
{main_loop} active_services = 0
) = 58


Comment 20 Trond Eivind Glomsrød 2001-12-01 00:16:54 UTC
Can you try the rpms at http://people.redhat.com/teg/xinetd/ ?



Comment 21 Trond Eivind Glomsrød 2002-07-25 21:57:35 UTC
Closing due to lack of feedback. If it still happens with current versions,
reopen with new data.

