Description of problem:
After a period of time (3 days in this case), xinetd appears to stop
responding to incoming connections. My netstat -a log is full of lines like
the following (shortened to prevent line wrapping):

tcp        0      0 atlas:vtun     clark.hosted.co:4033    SYN_RECV   -

How reproducible:
Sometimes

Steps to Reproduce:
1. Configure xinetd with a bunch of services.
2. Hit those services infrequently - about 8 times a day.
3. Stare, stunned, at the 25th (or so) attempt failing.

Actual Results:
- Clients see a "Connection refused" message.
- ipchains logs a valid, accepted connection (-j ACCEPT -l).
- xinetd spawns no process (as confirmed by syslog).

Expected Results (seen after "service xinetd restart"):
- xinetd spawns a process (Jul 11 18:41:39 atlas xinetd[28405]: START:...)
- the connection proceeds.

Additional info:
ipchains, when prompted, will spit reports into my syslog that prove that
connections are being attempted to the vtun port. The reason I'm claiming
xinetd is responsible is that the fix for this problem is "service xinetd
restart". Currently, my servers are restarting xinetd every 48 hours as an
NT-style preventative measure.
- I'm apparently using xinetd-2.3.0-1.71.
- There is NO message in my syslog about looping services, nor (with one hit
  every 3 hours) should there be.
- I want to reiterate: restarting xinetd stops the problem for another 3 days.
Can you try the one at http://people.redhat.com/teg/xinetd/?
Yes. I've picked it up and will install it late this evening. This should give us some news next week.
Can you try 2.3.0-6 from the same location?
Will do.
I'm having a possibly related problem with xinetd getting hung, failing to
wait for child processes. After some time I get:

 2271 ?  S  0:01 xinetd -stayalive -reuse -pidfile /var/run/xinetd.pid
 6990 ?  Z  0:00  \_ [in.tftpd <defunct>]

and no future tftp requests are answered until xinetd is restarted.

One relatively good way to tickle this is with pxelinux (from syslinux-1.63)
performing a diskless boot. Several tftp requests are thrown out in rapid
succession, and I wouldn't be surprised if each in.tftpd's procedure of
forking a chroot child further complicates things and is a factor:

Aug 15 23:42:29 sys109 tftpd[1917]: sending /pxelinux.0
Aug 15 23:42:29 sys109 tftpd[1917]: tftp: client does not accept options
Aug 15 23:42:29 sys109 tftpd[1919]: sending /pxelinux.0
Aug 15 23:42:29 sys109 tftpd[1921]: could not serve file /pxelinux.cfg/C0A80A7A
Aug 15 23:42:29 sys109 tftpd[1923]: could not serve file /pxelinux.cfg/C0A80A7
Aug 15 23:42:29 sys109 tftpd[1925]: could not serve file /pxelinux.cfg/C0A80A
Aug 15 23:42:29 sys109 tftpd[1927]: could not serve file /pxelinux.cfg/C0A80
Aug 15 23:42:29 sys109 tftpd[1929]: could not serve file /pxelinux.cfg/C0A8
Aug 15 23:42:29 sys109 tftpd[1931]: could not serve file /pxelinux.cfg/C0A
Aug 15 23:42:29 sys109 tftpd[1933]: could not serve file /pxelinux.cfg/C0
Aug 15 23:42:29 sys109 tftpd[1935]: could not serve file /pxelinux.cfg/C
Aug 15 23:42:29 sys109 tftpd[1937]: sending /pxelinux.cfg/default
Aug 15 23:42:29 sys109 tftpd[1939]: sending /linux
Aug 15 23:42:30 sys109 tftpd[1941]: sending /initrd-diskless

Note the odd PID numbers -- while it is the child process that does the
logging, it is the parent process which always turns out to be the sticking
zombie. For example:

 2180 ?      S  0:00  \_ /usr/sbin/sshd
 2181 pts/1  S  0:00      \_ -csh
 7191 pts/1  S  0:00          \_ xinetd -d -stayalive -reuse -loop 15
 7240 ?      Z  0:00              \_ [in.tftpd <defunct>]

and

Aug 16 15:43:53 sys109 tftpd[7235]: could not serve file /pxelinux.cfg/C0A8
Aug 16 15:43:53 sys109 tftpd[7237]: could not serve file /pxelinux.cfg/C0A
Aug 16 15:43:53 sys109 tftpd[7239]: could not serve file /pxelinux.cfg/C0
Aug 16 15:43:53 sys109 tftpd[7241]: could not serve file /pxelinux.cfg/C

PID 7241 answers the request and logs, but 7240 is the zombie.

My workaround is to change pxelinux.asm to skip searching for the files I
don't use (the hexadecimal IP-specific names) and just attempt to retrieve
the file I do use, "/pxelinux.cfg/default". This increases the time between
failures from ~10 diskless boots to ~100 diskless boots.

I'm seeing the same thing under both 2.4.7 and 2.4.9-pre4, and with both
xinetd-2.3.0-1.71.i386.rpm and xinetd-2.3.0-7.i386.rpm.

Fred ffeirtag
Created attachment 30474 [details] test fix for the disabled services
Can you try the attached diff to 2.3.3? I think it might fix the problem. I was able to occasionally get defunct processes when I created a test program to splat tftp requests as fast as possible to xinetd, and this seemed to fix it for my test case.
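For anyone wanting to reproduce this, a stress tester along those lines can
be as simple as the sketch below. This is not the actual test program used
here - the target address, file name, and request count are placeholder
assumptions - but it fires TFTP read requests as fast as possible, which is
the same idea:

    /* Hypothetical sketch of a "splat tftp requests" stress tester.
     * Target host, file name, and count are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* TFTP RRQ: opcode 01, filename, NUL, mode, NUL.
           The string literal's trailing NUL terminates "octet". */
        const char rrq[] = "\0\1pxelinux.0\0octet";
        struct sockaddr_in dst;
        int i;

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(69);                    /* tftp */
        dst.sin_addr.s_addr = inet_addr("127.0.0.1");

        for (i = 0; i < 1000; i++) {
            /* A fresh socket per request sends each RRQ from a new
               source port; the rapid burst makes in.tftpd spawn and
               exit quickly, exercising xinetd's SIGCHLD handling. */
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            if (s < 0) { perror("socket"); exit(1); }
            sendto(s, rrq, sizeof(rrq), 0,
                   (struct sockaddr *)&dst, sizeof(dst));
            close(s);
        }
        return 0;
    }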
We tried that here, and it worked fine for a time (applied to 2.3.3).
Eventually, this started happening:

Sep 6 15:58:39 porkchop xinetd[24655]: {general_handler} (24655) Unexpected signal: 11 (Segmentation fault)
Sep 6 15:58:39 porkchop last message repeated 9 times
Sep 6 15:58:39 porkchop xinetd[24655]: Resetting...
Sep 6 15:58:39 porkchop xinetd[24655]: {general_handler} (24655) Unexpected signal: 11 (Segmentation fault)
Sep 6 15:58:40 porkchop last message repeated 8 times
Sep 6 15:58:40 porkchop xinetd[24655]: Resetting...
Sep 6 15:58:40 porkchop xinetd[24655]: {general_handler} (24655) Unexpected signal: 11 (Segmentation fault)
Sep 6 15:58:40 porkchop last message repeated 8 times
Sep 6 15:58:40 porkchop xinetd[24655]: Resetting...
Sep 6 15:58:40 porkchop xinetd[24655]: {general_handler} (24655) Unexpected signal: 11 (Segmentation fault)
Sep 6 15:58:40 porkchop last message repeated 8 times
Sep 6 15:58:40 porkchop xinetd[24655]: Resetting...
Sep 6 15:58:40 porkchop xinetd[24655]: {general_handler} (24655) Unexpected signal: 11 (Segmentation fault)
Despite appearances, this is good. The defunct-process problem should be
fixed, but that patch had its own problems. Could you try
http://www.xinetd.org/devel/xinetd-2001.9.6.tar.gz? It includes what should
fix that segfaulting, and it also has some portability changes (which
shouldn't affect this problem).
teg: 2.3.0-6 did the same thing while I was on vacation last week; I forgot
until just this moment, when I read your update. Whatever it is, then, it's
present from 2.3.0 through 2.3.3. Apologies for the delay - my services
don't loop as often, and I was a bad tester here. I like the tftp stress
tester, though; that was a good idea, as would be a web server via xinetd
with loop protection disabled. - bish
A web server test should not show the same signs. From my interpretation of
the original problem, this is a wait=yes problem (regardless of udp vs. tcp,
and it has nothing to do with the old linuxconf problems). Essentially, a
wait=yes service that respawned quickly, and as a result died quickly, could
be left in a suspended state because xinetd didn't catch one of the children
dying. The end result was a zombied process hanging around, and xinetd
waiting for a SIGCHLD for that process before resuming the service again -
which of course would never happen, since it missed the first SIGCHLD, the
process is zombied, and the service will never be resumed.

Now, a web server is presumably run with wait=no, since you don't want to
serialize your web connections at the socket layer. It should never exhibit
the "xinetd stops responding to new connections after a period of time"
problem, since a wait=no service is never suspended. The service should
always be available, unless xinetd itself has died.

If xinetd is dying, I think that's a different bug. The segfault that
occurred after the patch was applied should have been from a bug in the
patch, not part of the original problem. If you're getting segfaults with
xinetd-2.3.3 without applying the patch, then you're seeing a different
problem and should probably file a separate bug.
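Illustratively, here is a minimal sketch of the binary-flag pattern
described above and why it loses signals. This is not xinetd's actual code,
just the shape of the bug:

    /* Sketch only: a binary flag cannot count signal deliveries. */
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static volatile sig_atomic_t child_died = 0;

    static void sigchld_handler(int sig)
    {
        /* A second SIGCHLD arriving before the main loop runs just
           sets 1 to 1: the fact that TWO children exited is lost. */
        child_died = 1;
    }

    static void main_loop_iteration(void)
    {
        if (child_died) {
            child_died = 0;
            wait(NULL);   /* reaps ONE child; if two exited, the other
                             stays a zombie, and a wait=yes service
                             waiting on it is never resumed */
        }
        /* ... select() on service sockets ... */
    }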
The service I think is triggering this is tftp, which is wait=yes.
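For reference, a wait=yes tftp stanza typically looks something like the
following (the server path and arguments below are illustrative assumptions,
not necessarily the exact config in use):

    service tftp
    {
        socket_type     = dgram
        protocol        = udp
        wait            = yes
        user            = root
        server          = /usr/sbin/in.tftpd
        server_args     = -s /tftpboot
    }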
BBraun: Quite right. My apologies for suggesting that as a hypothetical test
case. BenL is right, though, that the problem is being reported with a tftp
set-up, and that should be a much better test set-up than how I'm using
xinetd, which now takes a good month for the problem to recur (and I'm using
wait=yes, and I'm not using it to serve web pages - that was entirely a
suggestion, etc, etc, disclaimer, waiver, etc). - bish
I tested the first-round patch by just hammering away on a tftp service, and
was able to reproduce the zombied-process problem often enough to diagnose
it. I didn't see the segfaulting above when doing that test, but afterwards,
while working on other portability issues, I found the segfaulting you were
seeing. xinetd-2001.9.6.tar.gz fixes all known problems with the first-round
patch.

Bishop: no problem. I didn't see a thorough diagnosis in the bug, and my own
mention of the patch was along the lines of "try this and call me in the
morning" without an explanation of the problem, which doesn't really help
anybody.

For the record, the way I tackled this problem was to rework xinetd's event
handling in its main loop. The main loop used to set flags in signal
handlers, then actually do the work in the main loop. The problem was that
the flags were binary, so there was no way to detect whether a signal had
happened more than once before it was addressed by the main loop. To get
around this, I used a timer mechanism introduced in xinetd-2.3.2, with the
timer set to 0 (so it is handled immediately). With this mechanism, a new
event is inserted into the timer queue for every action that needs to be
taken. This handles multiple signals arriving before the main loop iterates,
and it is also aware of the select, so there should never be a problem of
blocking in the select while ignoring a pending event.

The segfaulting in the first patch was caused by me not dealing with some
boundary cases in the timer mechanism. These only manifested themselves when
I started using timers with a 0 length of time before they expired: time
could pass between the addition of the timer and when the timer actually
executed, potentially resulting in a negative time until the timer expired.
So, that's been fixed in xinetd-2001.9.6.
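Sketched very roughly, the delivery side of such a mechanism looks something
like the following. The names and details here are illustrative guesses
based on the description above and the strace in a later comment (where the
handler writes the signal number down a pipe that the main select() watches);
they are not xinetd's actual sources, and the real code queues zero-delay
timer events rather than dispatching inline:

    /* Sketch, assuming a self-pipe: one write per signal delivery,
       so deliveries can never be coalesced away like a binary flag. */
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/select.h>

    static int sig_pipe[2];   /* [0] = read end in select(); [1] = write end */

    static void my_handler(int sig)
    {
        int s = sig;
        /* write() is async-signal-safe; three SIGCHLDs become three
           queued events. */
        write(sig_pipe[1], &s, sizeof(s));
    }

    static void main_loop(int maxfd, fd_set *service_fds)
    {
        pipe(sig_pipe);
        fcntl(sig_pipe[0], F_SETFL, O_NONBLOCK);
        if (sig_pipe[0] > maxfd)
            maxfd = sig_pipe[0];

        for (;;) {
            fd_set rfds = *service_fds;
            FD_SET(sig_pipe[0], &rfds);
            if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
                continue;                   /* e.g. EINTR */
            if (FD_ISSET(sig_pipe[0], &rfds)) {
                int s;
                while (read(sig_pipe[0], &s, sizeof(s)) == sizeof(s)) {
                    /* dispatch one queued event per signal, e.g.
                       reap a child for each SIGCHLD */
                }
            }
            /* ... then service any ready listener sockets ... */
        }
    }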
I'm seeing the same problem with 2.3.3-1. My test case is easily
reproducible (for me, at least - I'm on SMP, which probably helps): I can
usually extract a zombie process within a dozen packets. Attachments to
follow.
Created attachment 33721 [details] Userspace ping server
Created attachment 33722 [details] Userspace ping server xinetd config
Can those having problems please try out
http://www.xinetd.org/devel/xinetd-2001.10.10.tar.gz ?
I've tested with the ping server attached, and it seems to work fairly well.
This is a work in progress, and I'd like to get some feedback. It's probably
not ready for prime-time use, though.
I still see the problem with this snapshot. strace offers (xinetd's own
debug output, written to fd 3, is interleaved with the trace and shown
indented):

write(3, "01/10/30@14:43:36: DEBUG: {serve"..., 63) = 63
  01/10/30@14:43:36: DEBUG: {server_start} Starting service ping
fork()                    = 2428
time([1004453016])        = 1004453016
time([1004453016])        = 1004453016
write(3, "01/10/30@14:43:36: DEBUG: {main_"..., 58) = 58
  01/10/30@14:43:36: DEBUG: {main_loop} active_services = 0
rt_sigsuspend([] <unfinished ...>
  01/10/30@14:43:36: DEBUG: {exec_server} duping 6
--- SIGCHLD (Child exited) ---
<... rt_sigsuspend resumed> ) = -1 EINTR (Interrupted system call)
time([1004453016])        = 1004453016
write(3, "01/10/30@14:43:36: DEBUG: {my_ha"..., 50) = 50
  01/10/30@14:43:36: DEBUG: {my_handler} writing 17
write(5, "\21\0\0\0", 4)  = 4
sigreturn()               = ? (mask now [])
time([1004453016])        = 1004453016
write(3, "01/10/30@14:43:36: DEBUG: {main_"..., 58) = 58
  01/10/30@14:43:36: DEBUG: {main_loop} active_services = 0
Can you try the rpms at http://people.redhat.com/teg/xinetd/ ?
Closing due to lack of feedback. If it still happens with current versions, reopen with new data.