From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.4) Gecko/20031128

Description of problem:
Sorry, this seems a little vague. I cannot put my finger on why this is failing - it's one of those intermittent, hard-to-diagnose problems.

I am running httpd with 220 virtual hosts and SSL. The server occasionally runs very slowly (or not at all). This seems to be characterised by lots of processes shown by server-status as being in 'R' mode and running to timeout - according to strace. I thought at first this was a DoS attack, but am unsure - I guess that it's possible for the server to get into this state by itself while looking for some system resource.

I have found on occasion that init.d/httpd restart fails to restart the server - usually the secure server won't restart - it cannot get hold of some resource. So I've taken to using 'stop', counting to 10, and then 'start' - which seems to work.

The overnight scripts (logrotate) send a -HUP to the httpd process, and this morning I arrived to find several processes in the 'R' state and stuck there. I needed to get things moving, so I did a stop/start again using init.d/httpd.

I am beginning to wonder whether the code in init.d/httpd should use /usr/sbin/httpd -k restart|stop|start and not send explicit signals. It seems possible that things are not closed down properly, so that the server doesn't restart properly. I have no real hard evidence of any of this - it needs someone who is familiar with the code to add flesh or dismiss my suspicions.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Use /etc/init.d/httpd restart with an SSL-enabled httpd

Actual Results:
Sometimes this works and restarts - and sometimes it fails. I suspect that httpd -k restart will kill any active processes, which init.d/httpd restart won't.

Expected Results:
Clean restart

Additional info:
Thanks for the report. Can you clarify a few things:

> The server
> occasionally runs very slowly (or not at all). This seems to be
> characterised by lots of processes from server-status being in 'R'
> mode and running to timeout - according to strace.

OK, you have captured the "strace" output for such a process? What was the strace output, and what does "running to timeout" mean precisely?

> I have found on occasion that init.d/httpd restart fails to restart
> the server - usually the secure server won't restart - it cannot get
> hold of some resource.

By "fails to restart", you mean that the server is stopped, but not started up again? What is the exact output from the command, and what is printed to the error_log in that case? Can you clarify what you mean by "it cannot get hold of some resource"?

Also please confirm the "rpm -q httpd mod_ssl" output to ensure you're using the latest released updates.
Thanks for your prompt response...

The 'R' state is controlled by

    #
    # Timeout: The number of seconds before receives and sends time out.
    #
    Timeout 300

in httpd.conf. When I said 'running to timeout', I meant that 5 minutes elapse before the state is terminated. The process sits and waits.

Here is strace...

    Process 21044 attached - interrupt to quit
    poll(

It sits looking like this for about 5 minutes and then

    Process 21044 attached - interrupt to quit
    poll([{fd=325, events=POLLIN}], 1, 300000) = 0
    gettimeofday({1087731789, 745615}, NULL) = 0
    gettimeofday({1087731789, 745709}, NULL) = 0
    shutdown(325, 1 /* send */) = 0
    poll([{fd=325, events=POLLIN}], 1, 2000) = 0
    close(325) = 0
    read(5, 0xbfffa7c3, 1) = -1 EAGAIN (Resource temporarily unavailable)
    gettimeofday({1087731791, 755549}, NULL) = 0
    semop(458753, 0xb730c6b8, 1 <unfinished ...>
    Process 21044 detached

In this case things were not busy, so the process waited for more work, and I killed the strace.

I am getting processes stuck in the 'R' state a lot, and have been trying to find out why. It could be entirely unrelated to the restart problem - and could be legitimate. It just seems to me to be happening too frequently for peace of mind.

Of course, I've been unable to replicate the actual restart problem - I think it needs things to be busy and for some processes to be waiting. Or it could be related to the length of time that the server has been live. I use PHP a lot for several big database sites and suspect that it leaks memory rather badly.

Sadly, I've not got any record in my dumps either. I've tried several times, but since this server is giving service, it's hard to bounce it a lot in case viewers are badly affected. I'll keep trying and will report back when it happens.

This is a 'current' ES3 system:

    % rpm -q httpd mod_ssl
    httpd-2.0.46-32.ent
    mod_ssl-2.0.46-32.ent

The 'stuck in R state' COULD be a DoS attack - I can replicate this behaviour by telnetting to port 80 on the machine and just waiting.
However, to test this, I need to be able to tie an incoming IP address with the running httpd process - and I just don't know how to do that. I can see that open fds in /proc will give me some handle on the socket - but I don't know how to tie that up with netstat output. It would help if the server-status code reported the IP address when in R state. I am just suspicious that someone is making a legitimate connection and just getting nothing back because the server is locked in some way.
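Editorial sketch: one way to tie the open fds under /proc to a remote address, without going through netstat, is to match socket inode numbers. A socket fd in /proc/<pid>/fd is a symlink whose target looks like "socket:[12345]", and the same inode number appears in the last documented column of /proc/net/tcp. The Python below is a minimal illustration of that matching, not something taken from this thread: the function names and the sample row are hypothetical, it handles IPv4 only, and it assumes a little-endian host (as on x86) when decoding the hex addresses.

```python
import socket
import struct


def decode_ipv4(hex_addr_port):
    """Decode an 'AABBCCDD:PPPP' field from /proc/net/tcp.

    Assumption: little-endian host (e.g. x86), IPv4 only.
    """
    hex_addr, hex_port = hex_addr_port.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(hex_addr, 16)))
    return addr, int(hex_port, 16)


def remote_peers_by_inode(proc_net_tcp_text):
    """Map socket inode (as a string) -> (remote address, remote port)."""
    peers = {}
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 10:
            continue
        # fields[2] is rem_address, fields[9] is the socket inode
        peers[fields[9]] = decode_ipv4(fields[2])
    return peers


def socket_inodes(fd_targets):
    """Pick inode numbers out of readlink() targets such as 'socket:[12345]'."""
    prefix = "socket:["
    return [t[len(prefix):-1] for t in fd_targets
            if t.startswith(prefix) and t.endswith("]")]


# Hypothetical sample row in the /proc/net/tcp layout (not real data).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0050 0200A8C0:C350 01 00000000:00000000 00:00000000 00000000    48        0 12345 1 0000000000000000 20 4 30 10 -1
"""

peers = remote_peers_by_inode(SAMPLE)
inodes = socket_inodes(["/var/log/httpd/access_log", "socket:[12345]"])
matches = [peers[i] for i in inodes if i in peers]
```

On a live system the fd targets would come from calling os.readlink() on each entry of /proc/<pid>/fd (as root), and the table text from reading /proc/net/tcp.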
A few notes:

1. The fact that the 'fd' number has reached 325 may imply a resource leak in one of the modules you're using; it may be a good idea to reduce the MaxRequestsPerChild setting.

2. If *all* the spare httpd children get tied up in I/O timeouts, then you should try both increasing MaxClients to allow more children and decreasing the Timeout setting, e.g. to 60 seconds.

3. If the server-status page does not allow you to associate the children in timeout with a particular client IP, you can use a command like "netstat -pat | grep httpd" (as root) to do so, and cross-reference the pid from the server-status output with the pid in the netstat output.

Can you attach your complete httpd.conf for reference?
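Editorial sketch: the cross-referencing in note 3 can be automated by grouping the netstat rows by PID. The Python below assumes the usual Linux net-tools column layout (proto, recv-q, send-q, local address, foreign address, state, PID/program name). The function name and the sample rows are hypothetical: the addresses come from documentation ranges and only PID 21044 is borrowed from the strace above. Adding -n to the netstat command (an editorial suggestion, not part of the advice above) keeps the addresses numeric.

```python
def peers_by_pid(netstat_text, program="httpd"):
    """Map PID -> list of (foreign address, state) for one program's TCP sockets."""
    result = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        pid, _, name = fields[6].partition("/")
        if name != program or not pid.isdigit():
            continue  # other programs, or "-" when the PID is not visible
        result.setdefault(int(pid), []).append((fields[4], fields[5]))
    return result


# Hypothetical sample in the layout printed by "netstat -patn" (not real data).
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 192.0.2.10:80           203.0.113.7:51234       ESTABLISHED 21044/httpd
tcp        0      0 192.0.2.10:80           198.51.100.23:40022     ESTABLISHED 21050/httpd
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      900/sshd
"""

by_pid = peers_by_pid(SAMPLE)
```

Looking up a PID that server-status shows in 'R' state, e.g. by_pid.get(21044), then gives the client address that child is waiting on.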
Created attachment 101288: Dump of /etc/httpd - conf/httpd.conf - and include files + mod_ssl setup
The 'high' fd is apparently OK - I've got something like 217 virtual hosts on this server - each has their own access and error logs - so around 300 open files are OK - they are mostly pointing to the log files. Dunno if this is a limit in the end - has Linux got rid of the maximum open files per process problems that plagued earlier UNIX systems? However - I've reduced the MaxRequestsPerChild down to 500 - I guess this will help with any memory leaks anyway. Thanks for the netstat tip! That will do it. The included file is a gzipped tar of portions of /etc/httpd because my conf file is somewhat distributed..
Yes with lots of vhost logs you get high fd numbers, that's fine. (the max fds per process is configurable to "very high" in RHEL, don't worry about that) The MaxClients setting is at the default; try increasing it to 256 in the first instance.
OK that's done. Very grateful for your assistance.
Glad to help. I'll leave this bug in the NEEDINFO state, pending, as above:

1) output of "service httpd restart" when it fails
2) error_log output after it fails

Let us know if you have further performance issues.
My 'stuck in reader state' problem persisted, and we had an incident this week where nearly all of the 256 httpd processes were stuck in R state. The machine's load average went bananas. I started to track R state using a perl script to analyse apachectl status output.

I now think that this may be a networking problem - i.e. the TCP traffic was being hit for some reason. The remote site would connect, but some reply would go astray. Between my server and the internet cloud is some nameless firewall and a router. I am wondering whether this is not handling TCP/IP fragmentation properly. So, I've started iptables and added

    -A OUTPUT -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

Because the firewall is doing the business of blocking things, this is the only rule.

It seems that loads of ISPs just block ICMP, and certainly this was the case for mine. I've asked them to change things for me. With ICMP blocked, fragmentation updates were probably not reaching the machine. This would probably explain why I was getting the 'hung' reader problem and why it was apparently intermittent.

I'll let you know what happens. BTW, I've been using httpd restart with no apparent problem to date.
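Editorial sketch: the perl script mentioned above is not shown in this thread, so the following Python is only a guess at the kind of tracking it did. It assumes the machine-readable mod_status output (the server-status URL with "?auto" appended), which includes a "Scoreboard:" line where each character is one slot: "R" for reading a request, "W" for sending a reply, "_" for waiting, "." for an open slot with no process. The function names and the sample line are hypothetical.

```python
from collections import Counter


def scoreboard_counts(auto_status_text):
    """Count worker states from the 'Scoreboard:' line of server-status?auto output."""
    for line in auto_status_text.splitlines():
        if line.startswith("Scoreboard:"):
            return Counter(line.split(":", 1)[1].strip())
    return Counter()  # no scoreboard line found


def reading_fraction(counts):
    """Fraction of existing workers (everything except '.') in 'R' state."""
    workers = sum(n for state, n in counts.items() if state != ".")
    return counts["R"] / workers if workers else 0.0


# Hypothetical sample line (not real data from this server).
SAMPLE = "Scoreboard: RRR_W_R...\n"

counts = scoreboard_counts(SAMPLE)
fraction = reading_fraction(counts)
```

Logging the fraction once a minute would show whether the 'R' pile-up builds gradually or arrives in bursts, which is the kind of evidence that helps separate a slow leak from a network fault.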
It looks like I got the wrong end of the stick here. I am talking to my ISP to see if it's them that are not doing the right thing with TCP fragmentation.
Well, after a LOT of persistence, I finally proved that one of the routes from my ISP to the outside world was dropping packets massively, in what appears to have been a load-related way. They've replaced a router and things seem OK today. Packet loss explains the symptoms I've been having with the web service, and also explains why some people were having trouble getting to my DNS, and why some remote mail systems were connecting and then apparently hanging. Thanks again for your help - which really eliminated the software as a problem. Maxim for the day - don't trust ISPs who use Windoze to manage their systems.
Thanks for keeping us informed. I'll close this bug for the moment; if you have further issues with "service httpd restart", please re-open giving further details.