From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Description of problem:
After several hours, ftpd refuses connections. A ps shows only one in.ftpd process, in defunct status. An xinetd restart solves the problem temporarily.

Version-Release number of selected component (if applicable):
xinetd-2.3.9-0.73
wu-ftpd-2.6.2-5

How reproducible:
Sometimes

Steps to Reproduce:
1. Have a standard, heavily loaded, up-to-date ftp server (wu-ftpd enabled in xinetd).
2. Wait 1 hour, 2 hours, or 2 days...
3. When ftp connections are refused, do a "ps | grep ftpd" and a "/etc/rc.d/init.d/xinetd restart".

Additional info:

[root@triton root]# ps awux | grep ftp
root      9012  0.0  0.0     0    0 ?  ZN  19:11  0:00 [in.ftpd <defunct>]
[root@triton root]# /etc/rc.d/init.d/xinetd restart
Arrêt de xinetd :       [ OK ]     [Stopping xinetd]
Démarrage de xinetd :   [ OK ]     [Starting xinetd]
[root@triton root]# ps awux | grep ftp
ftp      11066  0.0  0.2  3620 2220 ?  SN  20:22  0:00 ftpd: 132.248.173.40: anonymous/mozilla@: RETR /ge/languages/babylon_
ftp      11212  0.0  0.2  3548 2144 ?  SN  20:28  0:00 ftpd: marseille-4-a7-62-147-63-23.dial.proxad.net: anonymous/anonymou
ftp      11218  0.0  0.2  3624 2228 ?  SN  20:29  0:00 ftpd: line-74-226.dial.freestart.hu: anonymous/mozilla@: RETR /mirror
ftp      11343  0.0  0.2  3616 2192 ?  SN  20:32  0:00 ftpd: line-74-226.dial.freestart.hu: anonymous/mozilla@: IDLE
ftp      11344  0.0  0.2  3616 2192 ?  SN  20:32  0:00 ftpd: line-74-226.dial.freestart.hu: anonymous/mozilla@: IDLE
ftp      11347  0.0  0.2  3548 2144 ?  SN  20:33  0:00 ftpd: marseille-4-a7-62-147-63-23.dial.proxad.net: anonymous/anonymou
ftp      11348  0.0  0.2  3548 2144 ?  SN  20:33  0:00 ftpd: marseille-4-a7-62-147-63-23.dial.proxad.net: anonymous/anonymou
ftp      11349  0.0  0.2  3548 2144 ?  SN  20:33  0:00 ftpd: marseille-4-a7-62-147-63-23.dial.proxad.net: anonymous/anonymou
ftp      11350  0.0  0.2  3548 2144 ?  SN  20:33  0:00 ftpd: marseille-4-a7-62-147-63-23.dial.proxad.net: anonymous/anonymou
[root@triton root]# rpm -q xinetd
xinetd-2.3.9-0.73
[root@triton root]# rpm -q wu-ftpd
wu-ftpd-2.6.2-5
I think that this is related. In the logs I get:

Nov  1 00:41:55 shell2 xinetd[15914]: Deactivating service ftp due to excessive incoming connections.  Restarting in 30 seconds.
Nov  1 00:42:25 shell2 xinetd[15914]: Activating service ftp
Nov  1 00:42:31 shell2 xinetd[15914]: file descriptor of service ftp has been closed
Nov  1 00:42:31 shell2 xinetd[15914]: select reported EBADF but no bad file descriptors were found
Nov  1 00:46:57 shell2 xinetd[15914]: Service ftp: server exit with 0 running servers
Sorry. I should have indicated that this was with xinetd-2.3.9-0.73.
No, it's not the same problem. I know that issue, and I do not get the same messages in syslog. I didn't mention it, but when ftpd stops responding, there are no more logs either. It's really a freeze: xinetd continues to serve other services, but no more ftpd, and that without any explanation. It's really painful, like a DOS, and for the moment, as a workaround, I installed proftpd in standalone mode...
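In case anyone else wants the same workaround, it amounts to roughly this (a sketch; the proftpd init-script name and config path are assumptions from the stock package, so check your own installation):

# 1. Tell xinetd to stop handling ftp: set "disable = yes" in /etc/xinetd.d/wu-ftpd
/etc/rc.d/init.d/xinetd restart
# 2. Make sure /etc/proftpd.conf contains:  ServerType standalone
# 3. Start proftpd as its own daemon, outside xinetd entirely
chkconfig proftpd on
/etc/rc.d/init.d/proftpd start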
I have this same problem (not the Deactivating service ftp problem, which I fixed) but on RH 7.2, xinetd-2.3.9-0.72, with the same [in.ftpd <defunct>] error on three ftp load balance servers at my site. I am restarting xinetd every 1/2 hour in cron right now so when the problem happens (which it still does), it will be reset automatically. Any way to get this priority raised?
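For reference, the /etc/crontab entry I'm using is essentially this (a sketch; adjust the interval and the init-script path for your own setup):

# restart xinetd on the hour and half hour as a stopgap
0,30 * * * * root /etc/rc.d/init.d/xinetd restart >/dev/null 2>&1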
Yes, please, raise the priority. xinetd did the same thing, on another server, with ipop3d... it's really critical! I also need to restart xinetd from cron.
I have the exact same problem, but in my case it is with imapd. xinetd needs to be restarted before imapd works again.
It's been 18 days since I first reported the bug. It seems that xinetd is broken and that servers all around the world have the same problem. It reminds me of bug #75128: mysqld crashed all over the world and it took a long time for RedHat to consider the problem... Is it because this isn't security-related that nobody at RedHat cares about this xinetd bug?
It's not a general bug. It appears to be a DOS attack. Somebody connects and creates a large number of partial connections. From what I can see, the connection gets torn down before xinetd actually calls ftpd. It seems to take about 30 connections to cause the problem.

The attack consists of a storm of SYN packets. When the server responds with a SYN-ACK, the attacker responds with an ACK followed by a FIN-ACK, without necessarily waiting for a response (see below). In some cases, ftpd doesn't even get around to transmitting a welcome banner. The tcpdump from an isolated stream is included below (using delta timestamps). I just noticed that the RST-ACK doesn't arrive until 5 hours later. (attachment: single1)

1 0.000000     66.139.79.23  -> 66.51.123.178 TCP 59351 > 21 [SYN]      Seq=533978424 Ack=0         Win=5840 Len=0
2 0.000021     66.51.123.178 -> 66.139.79.23  TCP 21 > 59351 [SYN, ACK] Seq=224836778 Ack=533978425 Win=5792 Len=0
3 0.091872     66.139.79.23  -> 66.51.123.178 TCP 59351 > 21 [ACK]      Seq=533978425 Ack=224836779 Win=5840 Len=0
4 0.000867     66.139.79.23  -> 66.51.123.178 TCP 59351 > 21 [FIN, ACK] Seq=533978425 Ack=224836779 Win=5840 Len=0
5 0.000676     66.51.123.178 -> 66.139.79.23  TCP 21 > 59351 [ACK]      Seq=224836779 Ack=533978426 Win=5792 Len=0
6 17440.498890 66.51.123.178 -> 66.139.79.23  TCP 21 > 59351 [RST, ACK] Seq=224836779 Ack=533978426 Win=5792 Len=0

Once the attack is successful, xinetd starts behaving weirdly. There will be a bunch of ftp servers in <defunct> state, and xinetd will respond to an incoming connection as follows: it responds to a SYN with a SYN-ACK, but then it acts like it is receiving no other packets, only responding with SYN-ACKs until the remote system gives up. (attachment: attack-failure)

The attackers appear to be spraying netblocks with their attacks. The attachment attack-sets includes a tcpdump of two sample attack sets. Note that the attacks go to multiple addresses.
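If anyone wants to check whether they're seeing the same traffic pattern, a capture along these lines should isolate it (a sketch; the interface name is an assumption for your setup):

# show only SYN/FIN/RST segments on the ftp control port, with delta timestamps
# (0x07 = FIN|SYN|RST bits of the TCP flags byte)
tcpdump -n -ttt -i eth0 'tcp port 21 and (tcp[13] & 0x07) != 0'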
Created attachment 85524 [details]
A single attack sequence -- ethereal/tcpdump file. Fast ACK / FIN-ACK timing.
Created attachment 85525 [details]
After an attack has succeeded, xinetd will respond as in this tcpdump.
Created attachment 85526 [details]
This contains 2 different attacks. Note that each one attacks addresses on one subnet at a time. -- ethereal/tcpdump file
It's looking like this attack works against the default settings of xinetd. The compiled-in defaults appear to be 25 maximum instances and 27 connections per source address(!). The standard RPM setup increases the maximum instances to 60, which means that 3 attacking IPs can lock out a system. It appears that xinetd responds to the overage by using the SYN-ACKs as a keep-alive, holding the connections open while waiting for an open connection slot.

I changed the numbers to 100 concurrent and 10 per source. When I activated the changes, it looks like xinetd cleaned up some backlog, so this looks hopeful so far. With this setup, a DOS attack is still possible, but it would require more machines to mount successfully. This should handle most situations, but could cause problems if you have customers with large source-NATed networks.

Settings (either in the defaults section of /etc/xinetd.conf or in /etc/xinetd.d/wu-ftpd):

        instances  = 100
        per_source = 10

Someone else try this and see if it helps.
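For anyone who wants to see it in context, a complete /etc/xinetd.d/wu-ftpd with those limits would look roughly like this (a sketch; everything except the last two lines is my recollection of the stock Red Hat file, so check it against your own before copying):

service ftp
{
        disable         = no
        socket_type     = stream
        wait            = no
        user            = root
        server          = /usr/sbin/in.ftpd
        server_args     = -l -a
        log_on_success  += DURATION
        log_on_failure  += USERID
        nice            = 10
        instances       = 100
        per_source      = 10
}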
I'm not sure if this is different from the attack, but when xinetd fails to start an ftp daemon I see the following:

$ lsof -p 6933
...
xinetd  6933 root  2r  CHR   1,3         66678  /dev/null
xinetd  6933 root  3r  FIFO  0,5        173349  pipe
xinetd  6933 root  4w  FIFO  0,5        173349  pipe
xinetd  6933 root  5u  IPv4            173354   TCP *:ftp (LISTEN)
xinetd  6933 root  6u  unix  0xd9731a40 426588  socket
(END)

$ strace -p 6933
recv(6,

Note that xinetd is _only_ listening on ftp. I'm not sure what the unix domain socket is, but that's where it's stuck. I'll check on the state of any ftpd's next time we hang up.
Never mind... the "fix" came because my signals to reload xinetd's config caused it to reap its dead children (one child process reaped per signal sent). Once all the children are reaped, xinetd starts accepting connections again. I realized this using USR1 signals: no state dumps were done until all of the children were reaped... then all 5 (in this case) state dumps were done at once. File available for attachment: kill-log.txt
I am seeing the same thing happen on a private server. I have wu-ftpd and ipop3d configured in xinetd. I have a message server that connects to the pop3 daemon every 4 or 5 minutes and checks 4 or 5 accounts each time. After it runs for a little while, I end up with one ipop3d in a defunct state and nobody can connect until I restart xinetd. This is a test server that is not yet open to attack, so this should not be happening because of a DOS attack. When the mail is checked by the message server there is a flurry of connections to the pop3 port. I do need this fixed pretty quickly. Any ideas?
I would like to try the fix (instances = 100, per_source = 10) on my ftp server, but it is in production and now works well with proftpd in standalone mode, and I don't want to change anything until I'm sure it will work. Meanwhile, I activated those options on my pop/imap server and deactivated my hourly cron restart of xinetd, so I'll see if it works and report back...
I thought it was clear from my message above: the settings don't seem to be the source of the problem. Rather than randomly restarting xinetd, I think I've come up with a (rather unusual) hack: sending signals to xinetd when it has defunct processes.

#!/bin/sh
while sleep 97; do
    if ps -auxww | grep 'in.ftpd <defunct>' | grep -v grep ; then
        echo date `date` found defunct
        # allow for races...
        sleep 2
        if XINETD=`cat /var/run/xinetd.pid` ; then
            while ps -auxw | grep 'in.ftpd <defunct>' | grep -v grep ; do
                echo killing
                kill -ALRM $XINETD
                sleep 3
            done
        else
            echo no xinetd service running????
        fi
    fi
done
Two notes:

1) In the script above, the "grep -v" should be changed to "grep -qv" -- otherwise it'll produce spurious output on a match (OK for debugging purposes).

2) I'm looking at the question of whether this is a kernel-related bug. I'm seeing something vaguely similar occurring with Interchange. For people having this problem, I have two questions:

A) Are you running an SMP system?
B) Are you running a bigmem kernel?

The problems that we've had with this are on a dual-processor system. The problems with Interchange have only (and consistently) occurred on dual-processor systems. I'm including the query about bigmem simply for completeness (we're running on a 6GB system... users running bigmem are reasonably rare). If you email me directly (samuel) I'll summarize the responses that I receive.
Kernel: 2.4.9 / 2.4.18
RH OS: RedHat 7.3
xinetd version: xinetd-2.3.9-0.73

Hi,

I have the same problem on several internal/external servers, affecting every service xinetd is managing. After a few hours, xinetd stops accepting any new incoming connections. It then starts to behave exactly as if it were refusing the connections because of TCP wrappers. All services are impacted.

The only way to fix this is to restart the xinetd service, or to downgrade to the version originally shipped with RedHat 7.3 (xinetd-2.3.4-0.8.i386.rpm). Right now, we are downgrading all xinetd servers to fix this problem.

Regards,
Vincent Jaussaud
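For anyone else doing the downgrade, it boils down to this (a sketch; point rpm at wherever the 7.3 package lives on your media or mirror):

# --oldpackage lets rpm install a version older than the one present
rpm -Uvh --oldpackage xinetd-2.3.4-0.8.i386.rpm
/etc/rc.d/init.d/xinetd restart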
Well, at least this is not a RedHat-rpm-specific bug: I got the same thing to reproduce with an rpm compiled from the original source from xinetd.org and a cut-down spec file from the RedHat .src.rpm (removed the extra patches, and the configure line doesn't include lib_tcpwrappers). xinetd-managed services die within hours on my server, and the only solution is to restart xinetd.

However (strange that no one has mentioned it), with both rpms (RedHat's and the one I compiled), the services remain accessible from localhost for some time after they stop responding from other systems. If I wait long enough, it also stops responding from localhost. The server is 7.32, with all errata applied.

By the way, this doesn't happen if I install the 2.3.7 xinetd from the RedHat 8.0 CD - all is working well, except for the fact that that xinetd doesn't honor my "redirect" services - it's as if the config files for them don't exist!
Sorry, in the previous comment there is a typo: the server is 7.3, not 7.32.
Given the responses that I've gotten from people, it would appear that multi-processor has nothing to do with it. People with single-processor boxes have this problem too.

Whatever the bug is, going back to xinetd-2.3.7-2 (as released for RedHat 8.0) resolves the problem. Interestingly enough, installing xinetd-2.3.9-0.73 (from RedHat 7.3) on RedHat 8.0 (on the one machine I've tried it on) does NOT seem to trigger the bug... (kernel-2.4.18-14)

In the meantime, I've managed to build a script to test the bug (attached later); it seems to do the job. What I've found is interesting. I have been able to put together a properly working version of the kill-defunct script. It turns out that kill -ALRM causes xinetd to reap dead children, but doesn't get xinetd back to a responding state. My script required at least one USR1 signal to "fix" xinetd; if I only use SIGALRMs, it seems to leave xinetd in a (different) locked-up state. Once xinetd receives the last USR1, it appears to be fixed *permanently* -- despite repeated attempts, I'm unable to lock it up again until I restart xinetd. My script could use USR1 for all signals, but when it does that, xinetd will queue the dump requests and do multiple dumps once it starts moving again.

If someone else is willing to try installing 2.3.9 on another 8.0 box and see if my script can trigger a lockup there, that would be nice (would someone from RedHat be willing to get involved?).
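For anyone experimenting, the signal sequence reduces to this (a minimal sketch of what the attached script does; the pid-file path is the stock one):

XPID=`cat /var/run/xinetd.pid`
kill -ALRM $XPID    # reaps the defunct children, but leaves xinetd wedged
kill -USR1 $XPID    # internal state dump; this is what gets it accepting again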
Created attachment 86625 [details]
Shell script which seems to trigger the bug on RedHat 7.3 boxes. Requires setup in .netrc.
Created attachment 86627 [details]
(Tested) shell script which hunts and kills defunct children of xinetd, then issues a kill -USR1 to restart it (dunno why that works yet).
Yesterday I had the same problem on a RedHat 7.3 box. xinetd was configured to run the telnet and pop3 services. Because the box was in use by the users, I only had time to do a strace on xinetd; it was blocked on the "recv(7, " system call, as found by jch.

With the ftp test script presented above, I managed to duplicate the problem and produce an explanation: the unix domain socket from which xinetd tries to receive the datagram, and on which it is stalled, is syslogd's socket, /dev/log.

xinetd is stalling on /dev/log due to a race condition. After xinetd accepts a connection, it forks, closes the file descriptor (but saves the fd in its internal data structures) and proceeds to listen for new connections and watch its children. When a child exits, the xinetd daemon tries to free the connection. To free the connection, it tries to receive a datagram on the file descriptor of the connection and to close it again. Usually the attempt to receive a datagram on the file descriptor fails because the descriptor is already closed.

The problem is that sometimes the connection to syslogd is closed during the lifetime of a process spawned by xinetd, and the new unix socket receives the same fd number as the one saved for the child's connection. When xinetd then tries to free the connection, it calls recv() not on an already-closed file descriptor but on /dev/log, and blocks there forever (or until it receives a sufficient number of SIGUSR1 signals, as proved by the "hunting" shell script above).

This happens in version 2.3.9 due to a change between 2.3.7 and 2.3.9, and is partially fixed in the latest development version. The faulty code is in conn_free() from xinetd/connection.c. 2.3.7 correctly called drain() (the routine that calls recv()) only when the connection was datagram-oriented and xinetd had failed to launch the server for the service. The code in 2.3.9 always calls drain(), even if the connection is not datagram-oriented. The latest development version fixes the problem for stream-oriented connections, but may leave it open for datagram-oriented connections.

I'll attach an strace log for parties interested in analyzing the problem.
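If you want to confirm this diagnosis on a wedged box before restarting it, the two commands used earlier in this report are enough (a sketch; the fd number in the recv() will vary):

XPID=`cat /var/run/xinetd.pid`
strace -p $XPID    # in one terminal: expect it parked in recv(N, ... with no progress
lsof -p $XPID      # in another: fd N should show up as a plain unix domain socket
                   # (syslogd's /dev/log) rather than the TCP connection xinetd
                   # thinks it is draining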
Created attachment 86644 [details]
lsof and strace output
GRR: after all of the work that we did on this, it turns out that RedHat was already working on it as bug #76146. Dunno why nobody ever bothered to mark this one as a duplicate. Makes me wonder just how closely RedHat looks at these bugzilla reports.
*** This bug has been marked as a duplicate of 76146 ***