From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Description of problem:
When testing Apache using the worker MPM I get very intermittent hangs using the loopback interface under high loads. If I edit /etc/hosts so that all my virtual hosts use an ethernet interface, the problem disappears. I never see the problem using Apache's prefork MPM (not threaded), nor do I see it with the worker MPM using fewer than 4 threads per process.

Apache's server-status shows the hung thread is in "R" state, i.e., trying to read the HTTP request line. Apache times out the hung condition after 5 minutes, then issues close(), then the client reports:

for fd 37 (after reading 0 bytes): read: Connection reset by peer

So both sides of the connection were trying to read!

gdb shows that the hung Apache worker thread is stuck in a poll() syscall when there are bytes available to be read according to netstat -at:

(gdb) bt
#0  0x420db1a7 in poll () from /lib/i686/libc.so.6
#1  0x4005d084 in apr_poll () from /home/gregames/apache/2.0.46/built/lib/libapr-0.so.0
#2  0x4005d496 in apr_wait_for_io_or_timeout () from /home/gregames/apache/2.0.46/built/lib/libapr-0.so.0
#3  0x4005470b in uapr_socket_recv () from /home/gregames/apache/2.0.46/built/lib/libapr-0.so.0
#4  0x40053d11 in apr_socket_recv () from /home/gregames/apache/2.0.46/built/lib/libapr-0.so.0
#5  0x40019b04 in socket_bucket_read () from /home/gregames/apache/2.0.46/built/lib/libaprutil-0.so.0
#6  0x4001a2ba in apr_brigade_split_line () from /home/gregames/apache/2.0.46/built/lib/libaprutil-0.so.0
#7  0x080788b3 in core_input_filter ()
#8  0x08072022 in ap_get_brigade ()
#9  0x08072022 in ap_get_brigade ()
#10 0x08072e28 in ap_rgetline_core ()
#11 0x080732e2 in read_request_line ()

netstat -st shows that TCPAbortOnClose is incremented after Apache times out and closes the connection. The kernel source says that this happens because there is unread data.
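The reset seen by the client can be illustrated outside Apache. The following is a minimal sketch, assuming Linux and Python 3; the function name close_with_unread_data and the request bytes are made up for illustration and are not part of Apache or of the client used in this report. The server side waits until the request is queued, never reads it, and closes the socket. On Linux that close aborts the connection, so the client's read is expected to fail with "Connection reset by peer".

```python
import select
import socket


def close_with_unread_data():
    """Close a loopback TCP socket that still has unread data queued.

    Returns "reset" if the peer's read fails with ECONNRESET,
    "eof" on a normal close, "timeout" if nothing arrives in time.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    try:
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        # Wait until the request is queued on the server side,
        # but deliberately never read it.
        select.select([server], [], [], 2.0)
        server.close()  # closing with unread data aborts the connection
        client.settimeout(2.0)
        try:
            data = client.recv(4096)
        except ConnectionResetError:
            return "reset"
        except socket.timeout:
            return "timeout"
        return "eof" if data == b"" else "data"
    finally:
        client.close()
        server.close()  # harmless if already closed


if __name__ == "__main__":
    print(close_with_unread_data())
```

This only reproduces the final symptom (the abort on close), not the hang itself: here the server chooses not to read, whereas in the report the worker thread wanted to read but never woke up from poll().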
It seems like there might be a race condition between the loopback driver sending data from the client's process and the poll() for readability from the Apache worker thread, where new data arrives just after tcp_poll checks for it, but before the poll sleeps. Looking at the kernel source, I can't see how this is serialized/locked, but I'm a kernel newbie.

Version-Release number of selected component (if applicable):
kernel-2.4.18-14

How reproducible:
Sometimes

Steps to Reproduce:
1. Run Apache with the worker MPM and ThreadsPerChild set to 8
2. (this part is hard) On the same machine, run a client that simulates a production web site's workload with a mixture of tiny, medium, and huge files and CGIs. SPECWeb99 might work (not confirmed). I'm using a custom client.
3. Make sure that all http traffic flows thru the loopback interface.

Expected Results:
There shouldn't be race conditions between poll() for readability and the loopback driver.

Additional info:
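The wait-then-read pattern in the backtrace (poll() with a timeout, followed by a receive) can be sketched as below. This is a minimal illustration, assuming Linux and Python 3, not APR's actual code; the helper names queued_bytes, wait_then_read, loopback_pair and run_once are hypothetical. If poll() times out, the sketch asks the kernel how many bytes are queued on the socket, so a result of ("timeout", n) with n > 0 would correspond to the condition described above.

```python
import fcntl
import select
import socket
import struct
import termios


def queued_bytes(sock):
    """Return the number of unread bytes in the socket's receive queue."""
    raw = fcntl.ioctl(sock.fileno(), termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def wait_then_read(sock, timeout_ms):
    """Poll for readability, then read; report queued bytes on timeout."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    events = poller.poll(timeout_ms)
    if not events:
        # poll() timed out. Bytes queued at this point would match the
        # symptom in the report (data available, reader still asleep).
        return ("timeout", queued_bytes(sock))
    return ("readable", len(sock.recv(4096)))


def loopback_pair():
    """Create a connected TCP socket pair over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def run_once(payload=b"GET / HTTP/1.0\r\n\r\n", timeout_ms=2000):
    """Send one request over loopback and wait for it on the server side."""
    client, server = loopback_pair()
    try:
        client.sendall(payload)
        return wait_then_read(server, timeout_ms)
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(run_once())
```

A single run like this is not expected to hit the hang; the report only sees it under heavy threaded load. The check would have to run inside a multi-threaded load test to be of any diagnostic use.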
Could you try the current erratum kernel for RHL8? At least it doesn't have remote exploits etc. and has lots of bugfixes...
OK, I'm running kernel 2.4.20-13.8 now. The bug is either gone or a lot more elusive now, but I think I hit it once or twice yesterday. The external symptoms looked the same anyway. Today I haven't been able to hit it at all and collect the netstats, backtrace, etc. to verify it's the same thing.

It looks like file caching is working better in this kernel. Almost all of my files are served out of the cache now after the first run, but they weren't with 2.4.18. I believe this is decreasing the interrupt rate and might make this bug harder to catch, whatever it is. I've been running grep -r thru /usr to add some interrupts.

If I figure out how to recreate this more reliably, I will report back.

Thanks,
Greg
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/