Bug 92114

Summary: threaded Apache hangs using loopback
Product: [Retired] Red Hat Linux
Reporter: Greg Ames <gregames>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: low
Priority: medium    
Version: 8.0   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Last Closed: 2004-09-30 15:41:02 UTC

Description Greg Ames 2003-06-02 20:04:13 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Description of problem:
When testing Apache using the worker MPM I get very intermittent hangs using the
loopback interface under high loads.  If I edit /etc/hosts so that all my
virtual hosts use an ethernet interface, the problem disappears.  I never see
the problem using Apache's prefork MPM (not threaded), nor do I see it with the
worker MPM using less than 4 threads per process.

Apache's server-status shows the hung thread is in "R" state, i.e., trying to
read the HTTP request line.  Apache times out the hung condition after 5
minutes and issues close(), at which point the client reports:

for fd 37 (after reading 0 bytes): read: Connection reset by peer

So both sides of the connection were trying to read!  gdb shows that the hung
Apache worker thread is stuck in a poll() syscall even though netstat -at shows
bytes available to be read:

(gdb) bt
#0  0x420db1a7 in poll () from /lib/i686/libc.so.6
#1  0x4005d084 in apr_poll ()
   from /home/gregames/apache/2.0.46/built/lib/libapr-0.so.0
#2  0x4005d496 in apr_wait_for_io_or_timeout ()
   from /home/gregames/apache/2.0.46/built/lib/libapr-0.so.0
#3  0x4005470b in uapr_socket_recv ()
   from /home/gregames/apache/2.0.46/built/lib/libapr-0.so.0
#4  0x40053d11 in apr_socket_recv ()
   from /home/gregames/apache/2.0.46/built/lib/libapr-0.so.0
#5  0x40019b04 in socket_bucket_read ()
   from /home/gregames/apache/2.0.46/built/lib/libaprutil-0.so.0
#6  0x4001a2ba in apr_brigade_split_line ()
   from /home/gregames/apache/2.0.46/built/lib/libaprutil-0.so.0
#7  0x080788b3 in core_input_filter ()
#8  0x08072022 in ap_get_brigade ()
#9  0x08072022 in ap_get_brigade ()
#10 0x08072e28 in ap_rgetline_core ()
#11 0x080732e2 in read_request_line ()
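
One way to cross-check the two observations above (poll() not returning while netstat shows queued bytes) from user space is to compare what poll() reports with the kernel's own count of unread bytes on the same socket. The following is a minimal sketch, assuming Linux and Python 3; the helper names (unread_bytes, poll_readable, loopback_pair) are hypothetical and not part of Apache or APR. On a healthy kernel the two views should agree.

```python
import array
import fcntl
import select
import socket
import termios


def unread_bytes(sock):
    """Return the number of bytes queued for reading on sock (FIONREAD ioctl)."""
    buf = array.array("i", [0])
    fcntl.ioctl(sock.fileno(), termios.FIONREAD, buf)
    return buf[0]


def poll_readable(sock, timeout_ms=0):
    """Return True if poll() reports sock as readable within timeout_ms."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    return any(event & select.POLLIN for _, event in poller.poll(timeout_ms))


def loopback_pair():
    """Build a connected TCP pair over 127.0.0.1 (hypothetical test harness)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


if __name__ == "__main__":
    client, server = loopback_pair()
    client.sendall(b"GET / HTTP/1.0\r\n\r\n")
    # If poll() says "not readable" while FIONREAD is non-zero, the two
    # views disagree, which is the symptom described in this report.
    print("poll readable:", poll_readable(server, 1000))
    print("unread bytes: ", unread_bytes(server))
    client.close()
    server.close()
```

A single run like this will almost certainly show agreement; the report describes a very intermittent failure under high load with several threads, so this is only a building block for a stress test, not a reproducer.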

netstat -st shows that TCPAbortOnClose is incremented after Apache times out and
closes the connection.  The kernel source says that this happens because there
is unread data in the socket when it is closed.
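
To watch that counter around a suspected hang without eyeballing netstat output, it can be read directly from /proc/net/netstat, which is where netstat -s gets the TcpExt statistics on Linux. This is a rough sketch assuming the usual layout of that file (a header line of counter names followed by a line of values, both starting with the same prefix); parse_netstat and abort_on_close are hypothetical helper names.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {prefix: {counter: value}}."""
    counters = {}
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Rows come in pairs: a header row of names, then a row of values.
    for names, values in zip(rows[0::2], rows[1::2]):
        prefix = names[0].rstrip(":")
        counters[prefix] = dict(zip(names[1:], map(int, values[1:])))
    return counters


def abort_on_close(path="/proc/net/netstat"):
    """Return the current TCPAbortOnClose count (assumes a Linux /proc)."""
    with open(path) as f:
        return parse_netstat(f.read())["TcpExt"]["TCPAbortOnClose"]
```

Sampling abort_on_close() before and after the 5 minute timeout should show the counter going up by one for each hung connection that Apache closes with data still unread.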

It seems like there might be a race condition between the loopback driver
sending data from the client's process and the poll() for readability from the
Apache worker thread, where new data arrives just after tcp_poll checks for it,
but before the poll sleeps.  Looking at the kernel source, I can't see how this
is serialized/locked, but I'm a kernel newbie.
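
The kind of race suspected here is often called a lost wakeup: the waiter checks for data, finds none, and goes to sleep just after the data (and its wakeup) has arrived. The sketch below is a user-space analogy only, using Python threads rather than anything from the kernel; the Mailbox class is hypothetical and says nothing about how tcp_poll is actually implemented. It shows the ordering that avoids the problem: the check and the sleep happen under the same lock that the sender takes to publish data.

```python
import threading


class Mailbox:
    """User-space analogy: a 'socket' whose reader must not miss a wakeup."""

    def __init__(self):
        self._cond = threading.Condition()
        self._data = []

    def deliver(self, item):
        # Sender: publish the data and wake the reader under the same lock.
        with self._cond:
            self._data.append(item)
            self._cond.notify()

    def wait_readable(self, timeout):
        # Reader: the emptiness check and the sleep are done while holding
        # the lock, so data cannot arrive between them and go unnoticed.
        with self._cond:
            if not self._cond.wait_for(lambda: bool(self._data), timeout):
                return None  # timed out with nothing to read
            return self._data.pop(0)
```

If the check were done without the lock and followed by an unconditional sleep, a deliver() landing in between would leave the reader asleep with data pending, which matches the external symptoms described above.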

Version-Release number of selected component (if applicable):
kernel-2.4.18-14

How reproducible:
Sometimes

Steps to Reproduce:
1. Run Apache with the worker MPM and ThreadsPerChild set to 8 
2. (this part is hard) On the same machine, run a client that simulates a
production web site's workload with a mixture of tiny, medium, and huge files
and CGIs.  SPECWeb99 might work (not confirmed).  I'm using a custom client. 
3. Make sure that all HTTP traffic flows through the loopback interface.

    

Expected Results:  There shouldn't be race conditions between poll() for
readability and the loopback driver.

Additional info:

Comment 1 Arjan van de Ven 2003-06-02 20:10:19 UTC
Could you try the current erratum kernel for RHL8?
At least it doesn't have remote exploits etc. and has lots of bugfixes...

Comment 2 Greg Ames 2003-06-04 14:18:43 UTC
OK, I'm running kernel 2.4.20-13.8 now.  The bug is either gone or a lot more
elusive, but I think I hit it once or twice yesterday.  The external symptoms
looked the same anyway.  Today I haven't been able to hit it at all, so I
couldn't collect the netstat output and backtrace to verify it's the same thing.

It looks like file caching is working better in this kernel.  Almost all of my
files are served out of the cache now after the first run, but they weren't with
2.4.18.  I believe this is decreasing the interrupt rate and might make this bug
harder to catch, whatever it is.  I've been running grep -r through /usr to add
some interrupts.

If I figure out how to recreate this more reliably, I will report back.

Thanks,
Greg

Comment 3 Bugzilla owner 2004-09-30 15:41:02 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/