126367 – init.d/httpd restart suspect?

Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	httpd
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Joe Orton

Reported:	2004-06-20 10:32 UTC by Peter Collinson
Modified:	2007-11-30 22:07 UTC (History)
CC List:	0 users
Doc Type:	Bug Fix
Last Closed:	2004-07-13 16:01:57 UTC

Attachments
Dump of /etc/httpd - conf/httpd.conf - and include files + mod ssl setup (31.19 KB, application/octet-stream) 2004-06-21 09:02 UTC, Peter Collinson

Description Peter Collinson 2004-06-20 10:32:21 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.4)
Gecko/20031128

Description of problem:
Sorry this seems a little vague.. I cannot put my finger on 
why this is failing  - it's one of those intermittent hard to
diagnose problems.

I am running httpd with 220 virtual hosts and SSL. The server
occasionally runs very slowly (or not at all). This seems to be
characterised by lots of processes from server-status being in 'R'
mode and running to timeout - according to strace.

I thought at first this was a DoS attack, but I am unsure. I guess
that the server could also get into this state by itself while waiting
for some system resource.

I have found on occasion that init.d/httpd restart fails to restart
the server - usually the secure server won't restart - it cannot get
hold of some resource. So I've taken to using 'stop', counting to 10, and
then 'start', which seems to work.
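
The stop, pause, start workaround described above could be scripted roughly as follows. This is only a sketch: the function name `stop_wait_start`, the injectable `run` and `sleep` parameters, and the 10-second default are illustrative assumptions; the script path is the one mentioned in this report.

```python
import subprocess
import time


def stop_wait_start(script="/etc/init.d/httpd", delay=10.0,
                    run=subprocess.call, sleep=time.sleep):
    """Stop the service, pause, then start it again.

    Returns the exit statuses of the stop and start commands as a tuple.
    `run` and `sleep` are injectable so the sequence can be exercised
    without touching a real server (hypothetical helper, not part of httpd).
    """
    stop_status = run([script, "stop"])
    sleep(delay)  # give the old children time to release their resources
    start_status = run([script, "start"])
    return stop_status, start_status
```

A non-zero second element of the returned tuple would indicate that the start step failed, which is the case worth capturing together with the error_log.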

The overnight scripts (logrotate) send a -HUP to the httpd process,
and this morning I arrived to find several processes stuck in the 'R'
state. I needed to get things working, so I did a stop/start
again using init.d/httpd.

I am beginning to wonder whether the code in init.d/httpd should use
/usr/sbin/httpd -k restart|stop|start and not send explicit signals.
It seems possible that things are not closed down properly, so that
the server doesn't restart properly.

I have no real hard evidence of any of this - and it needs someone
who is familiar with the code to add flesh or dismiss my suspicions.



Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Use /etc/init.d/httpd restart with an SSL-enabled httpd

Actual Results:  Well, sometimes this works and restarts, and
sometimes it fails.
I suspect that httpd -k restart will kill any active processes
that init.d/httpd restart won't.

Expected Results:  Clean restart

Additional info:

Comment 1 Joe Orton 2004-06-20 11:11:55 UTC

Thanks for the report.  Can you clarify a few things:

> The server
> occasionally runs very slowly (or not at all). This seems to be
> characterised by lots of processes from server-status being in 'R'
> mode and running to timeout - according to strace.

OK, you have captured the "strace" output for such a process?  What
was the strace output, and what does "running to timeout" mean precisely?

> I have found on occasion that init.d/httpd restart fails to restart
> the server - usually the secure server won't restart - it cannot get
> hold of some resource.

By "fails to restart", you mean that the server is stopped, but not
started up again?  What is the exact output from the command, and what
is printed to the error_log in that case?  Can you clarify what you
mean by "it cannot get hold of some resource"?

Also please confirm the "rpm -q httpd mod_ssl" output to ensure you're
using the latest released updates.

Comment 2 Peter Collinson 2004-06-21 06:45:03 UTC

Thanks for your prompt response...

The 'R' state is controlled by 
#
# Timeout: The number of seconds before receives and sends time out.
#
Timeout 300

in httpd.conf. When I said 'running to timeout', I meant that 5
minutes elapse before the state is terminated. The process sits and
waits. Here is the strace output...

Process 21044 attached - interrupt to quit
poll(

It sits looking like this for about 5 minutes and then

Process 21044 attached - interrupt to quit
poll([{fd=325, events=POLLIN}], 1, 300000) = 0
gettimeofday({1087731789, 745615}, NULL) = 0
gettimeofday({1087731789, 745709}, NULL) = 0
shutdown(325, 1 /* send */)             = 0
poll([{fd=325, events=POLLIN}], 1, 2000) = 0
close(325)                              = 0
read(5, 0xbfffa7c3, 1)                  = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1087731791, 755549}, NULL) = 0
semop(458753, 0xb730c6b8, 1 <unfinished ...>
Process 21044 detached

In this case things were not busy so the process waited for more work,
and I killed the strace.
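
The `poll(..., 300000) = 0` line in the trace is a read wait that expired with no data from the client. A minimal sketch of the same pattern, assuming a Linux host (`select.poll` is not available everywhere) and using a local socket pair in place of a real client; the helper name `wait_for_request` is made up for illustration:

```python
import select
import socket


def wait_for_request(sock, timeout_ms):
    """Return True if data arrived on sock before timeout_ms elapsed.

    An empty event list from poll() is the Python view of the
    `poll(...) = 0` result seen in the strace output: the timeout
    expired and the peer sent nothing.
    """
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    events = poller.poll(timeout_ms)
    return bool(events)


if __name__ == "__main__":
    server_side, client_side = socket.socketpair()
    # Idle peer: the wait runs to its (shortened) timeout.
    print("idle peer had data:", wait_for_request(server_side, 200))
    # Peer sends a request: the wait returns immediately.
    client_side.sendall(b"GET / HTTP/1.0\r\n\r\n")
    print("active peer had data:", wait_for_request(server_side, 200))
    server_side.close()
    client_side.close()
```

With `Timeout 300` the server waits up to 300000 ms in the same way, so a client that connects and then goes silent holds a child for the full five minutes.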

I am getting processes stuck in the 'R' state a lot, and have been
trying to find out why. It could be entirely unrelated to the restart
problem - and could be legitimate. It just seems to me to be happening
too frequently for peace of mind about it.

Of course, I've been unable to replicate the actual restart problem -
I think it needs things to be busy and for some processes to be
waiting. Or it could be related to the time that the server has been
live. I use PHP a lot for several big database sites and suspect that
it loses memory rather badly. Sadly, I've not got any record in my
dumps either. I've tried several times but since this server is giving
service, it's hard to bounce it a lot in case viewers are badly
affected. I'll keep trying and will report back when it happens.

This is a 'current' ES3 system:

% rpm -q httpd mod_ssl 
httpd-2.0.46-32.ent
mod_ssl-2.0.46-32.ent

The 'stuck in R state' COULD be a DoS attack - I can replicate this
behaviour by telnetting to port 80 on the machine and just waiting.
However, to test this, I need to be able to tie an incoming IP address
with the running httpd process - and I just don't know how to do that.
I can see that open fds in /proc will give me some handle on the
socket - but I don't know how to tie that up with netstat output. It
would help if the server-status code reported the IP address when in R
state. I am just suspicious that someone is making a legitimate
connection and just getting nothing back because the server is locked
in some way.
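
One way to tie the open fds in /proc to a remote address is through the socket inode: each socket fd under /proc/<pid>/fd is a symlink of the form `socket:[inode]`, and the same inode appears in the last numbered column of /proc/net/tcp. The sketch below only parses text that has already been read; the function names are illustrative, and the little-endian address decoding assumes an x86 host such as the i686 machine in this report.

```python
import re
import socket
import struct


def decode_ipv4(hex_addr):
    """Decode an 'AABBCCDD:PPPP' field from /proc/net/tcp.

    Assumes a little-endian host (e.g. i686), where the address bytes
    appear reversed in the hex string.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_proc_net_tcp(text):
    """Map socket inode -> (local, remote) from the text of /proc/net/tcp."""
    table = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 10:
            continue
        local, remote, inode = fields[1], fields[2], int(fields[9])
        table[inode] = (decode_ipv4(local), decode_ipv4(remote))
    return table


def socket_inode(link_target):
    """Extract the inode from an fd symlink target such as 'socket:[12345]'."""
    match = re.fullmatch(r"socket:\[(\d+)\]", link_target)
    return int(match.group(1)) if match else None
```

On a live system the inputs would come from `os.readlink()` on each entry in /proc/<pid>/fd and from reading /proc/net/tcp, typically as root.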

Comment 3 Joe Orton 2004-06-21 08:33:11 UTC

A few notes:

1. the fact that the 'fd' number has reached 325 may imply a resource
leak in one of the modules you're using; it may be a good idea to
reduce the MaxRequestsPerChild setting.

2. if *all* the spare httpd children get tied up in I/O timeouts, then
you should try both increasing MaxClients to allow more children and
decreasing the Timeout setting, e.g. to 60 seconds.

3. if the server-status page does not allow you to associate the
children in timeout with a particular client IP, you can use a command
like "netstat -pat | grep httpd" (as root) to do so; and
cross-reference the pid from the server-status output with the pid in
the netstat output.
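
The cross-referencing in note 3 can be sketched as a small parser over captured `netstat -pat` output. The sample text, host names, addresses and pids below are invented for illustration (the addresses are documentation ranges), and `clients_by_pid` is a hypothetical helper name.

```python
from collections import defaultdict

# Invented sample in the usual net-tools column layout; not real data.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 *:http                  *:*                     LISTEN      20990/httpd
tcp        0      0 www.example.com:http    203.0.113.7:51234       ESTABLISHED 21044/httpd
tcp        0      0 www.example.com:http    198.51.100.23:40112     ESTABLISHED 21051/httpd
tcp        0      0 www.example.com:ssh     192.0.2.9:60022         ESTABLISHED 1432/sshd
"""


def clients_by_pid(netstat_output, program="httpd"):
    """Map pid -> [(foreign address, state), ...] for one program's TCP sockets."""
    peers = defaultdict(list)
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue  # header line or a non-TCP entry
        foreign, state, owner = fields[4], fields[5], fields[6]
        pid, _, name = owner.partition("/")
        if name == program and pid.isdigit() and state != "LISTEN":
            peers[int(pid)].append((foreign, state))
    return dict(peers)
```

Looking up a pid taken from the server-status page in the returned mapping gives the remote address that child is talking to.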

Can you attach your complete httpd.conf for reference?

Comment 4 Peter Collinson 2004-06-21 09:02:24 UTC

Created attachment 101288
Dump of /etc/httpd - conf/httpd.conf - and include files  + mod ssl setup

Comment 5 Peter Collinson 2004-06-21 09:04:06 UTC

The 'high' fd is apparently OK - I've got something like 217 virtual
hosts on this server - each has their own access and error logs - so
around 300 open files are OK - they are mostly pointing to the log
files. Dunno if this is a limit in the end - has Linux got rid of the
maximum open files per process problems that plagued earlier UNIX systems?

However - I've reduced the MaxRequestsPerChild down to 500 - I guess
this will help with any memory leaks anyway.

Thanks for the netstat tip! That will do it.

The included file is a gzipped tar of portions of /etc/httpd
because my conf file is somewhat distributed..

Comment 6 Joe Orton 2004-06-21 09:20:28 UTC

Yes, with lots of vhost logs you get high fd numbers; that's fine.
(The max fds per process is configurable to "very high" in RHEL, so don't
worry about that.)

The MaxClients setting is at the default; try increasing it to 256 in
the first instance.
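
As a back-of-envelope check on why MaxClients and Timeout interact, the sketch below estimates how many stalled connections per second would be enough to keep every child busy. It assumes each stalled connection holds one child for the full Timeout and ignores normal traffic; the function name is made up for illustration.

```python
def stalled_rate_to_saturate(max_clients, timeout_s):
    """Stalled connections per second needed to occupy every child.

    Rough estimate only: in steady state the number of busy children is
    (arrival rate) x (hold time), so saturation is reached at
    max_clients / timeout_s.
    """
    return max_clients / timeout_s


if __name__ == "__main__":
    for max_clients, timeout_s in [(256, 300), (256, 60)]:
        rate = stalled_rate_to_saturate(max_clients, timeout_s)
        print(f"MaxClients={max_clients} Timeout={timeout_s}s: "
              f"{rate:.2f} stalled connections/s")
```

Under these assumptions, cutting Timeout from 300 to 60 seconds raises the stall rate the server can absorb fivefold, which is why the two settings are suggested together.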

Comment 7 Peter Collinson 2004-06-21 09:43:09 UTC

OK that's done.

Very grateful for your assistance.

Comment 8 Joe Orton 2004-06-21 16:33:25 UTC

Glad to help.  I'll leave this bug in the NEEDINFO state, pending, as
above:

1) output of "service httpd restart" when it fails
2) error_log output after it fails

Let us know if you have further performance issues.

Comment 9 Peter Collinson 2004-07-01 07:15:03 UTC

My 'stuck in reader state' problem persisted, and we had an incident
this week where nearly all of the 256 httpd processes were stuck in R
state. The machine's load average went bananas. I started to track
R state using a Perl script to analyse apachectl status output.
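
The Perl script itself is not included in this report. A rough Python equivalent of the idea, assuming the input is the machine-readable mod_status output (the `server-status?auto` form, which contains a `Scoreboard:` line) rather than the formatted page, might look like this; the sample scoreboard string is invented.

```python
from collections import Counter

# Scoreboard key as documented by mod_status.
SCOREBOARD_KEYS = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def scoreboard_counts(status_text):
    """Count children per state from the 'Scoreboard:' line, if present."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            return Counter(line.split(":", 1)[1].strip())
    return Counter()


if __name__ == "__main__":
    sample = "Total Accesses: 10\nScoreboard: RR_W_R..K\n"  # invented sample
    for key, count in sorted(scoreboard_counts(sample).items()):
        print(f"{key} ({SCOREBOARD_KEYS.get(key, 'unknown')}): {count}")
```

Logging the `R` count at intervals would show whether the number of children stuck reading requests climbs steadily or arrives in bursts.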

I now think that this may be a networking problem - ie the TCP traffic
was being hit for some reason. The remote site would connect, but some
reply would go astray. Between my server and the internet cloud is
some nameless firewall and a router. I am wondering whether this is
not handling TCP/IP fragmentation properly.

So, I've started iptables and added
-A OUTPUT -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

Because the firewall is already doing the business of blocking things, this is
the only rule. It seems that loads of ISPs just block ICMP, and
certainly this was the case for mine. I've asked them to change things
for me. However, ICMP 'fragmentation needed' messages were probably not
reaching the machine. This would probably explain why I was getting the
'hung' reader problem and why it was apparently intermittent.
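
For reference, the arithmetic behind MSS clamping is simple: for IPv4 the clamped MSS is the path MTU minus 40 bytes (a 20-byte IP header plus a 20-byte TCP header, both without options). A small sketch, with an illustrative function name and example MTU values that are not measurements from this network:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options


def mss_for_mtu(mtu):
    """Largest TCP payload per segment that fits in one IPv4 packet of size mtu."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


if __name__ == "__main__":
    # Example MTUs only: plain Ethernet and a typical PPPoE link.
    for mtu in (1500, 1492):
        print(f"MTU {mtu} -> MSS {mss_for_mtu(mtu)}")
```

Advertising the smaller MSS in the SYN means the remote end never sends segments that need fragmenting on the path, so the connection no longer depends on the ICMP messages getting through.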

I'll let you know what happens.

BTW, I've been using httpd restart with no apparent problem to date.

Comment 10 Peter Collinson 2004-07-01 13:25:38 UTC

It looks like I got the wrong end of the stick here. I am talking to
my ISP to see if it's them that are not doing the right thing with TCP
fragmentation.

Comment 11 Peter Collinson 2004-07-13 12:38:41 UTC

Well, after a LOT of persistence, I finally proved that one of the
routes from my ISP to the outside world was dropping packets massively
in what appears to have been a load related way. They've replaced a
router and things seem OK today.

Packet loss explains the symptoms I've been having with the web
service, and also explains why some people were having trouble getting
to my DNS; and also why some remote mail systems were connecting - and
then apparently hanging.

Thanks again for your help - which really eliminated the software as a
problem.

Maxim for the day - don't trust ISPs who use Windoze to manage their
systems.

Comment 12 Joe Orton 2004-07-13 16:01:57 UTC

Thanks for keeping us informed.  

I'll close this bug for the moment; if you have further issues with
"service httpd restart", please re-open giving further details.
