Description of problem:
After logrotate has run, httpd does not serve requests for up to 30 minutes. The reason is that one or more "workers" are still hanging around, waiting to exit, so the -HUP does not complete until apache itself has killed them.

Version-Release number of selected component (if applicable):
httpd-2.0.48-1.2
logrotate-3.6.10-1

How reproducible:
Not 100%, but on some servers this happens on almost every run of logrotate; on others it happens only rarely.

Steps to Reproduce:
1. install httpd and logrotate
2. let logrotate run
3. watch the number of processes
or:
1. install httpd
2. run "killall -HUP httpd; killall -HUP httpd"

Actual results:
Some httpd processes keep hanging around for up to 30 minutes.

Expected results:
httpd should immediately kill off its idle children, and the rest should die as soon as they finish serving their current request. Then the -HUP should be completed and a new logfile opened.

Additional info:
I am 99% sure the reason for this problem is the "postrotate" section in /etc/logrotate.d/httpd. It seems like logrotate itself is sending a killall -HUP, and then it is run from postrotate as well. When httpd receives two HUPs in a short timespan, some of the processes will refuse to die. Which processes and why is still a mystery to me. =;-)

Whether this should be fixed in logrotate (so it does not send a -HUP to processes that have a postrotate section) or in httpd (so it does not add a postrotate section) is for you to decide.
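Step 3 of the reproduction ("watch the number of processes") can be scripted. The following is a minimal sketch in Python, assuming a `ps` that accepts the POSIX options `-e -o comm=` and that the caller supplies the PID of the httpd parent process. The function names are hypothetical, and the script needs permission to signal httpd.

```python
import os
import signal
import subprocess
import time


def count_processes(ps_output: str, name: str = "httpd") -> int:
    """Count lines of `ps -e -o comm=` output whose command name equals `name`."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def watch_after_hup(parent_pid: int, seconds: int = 60) -> None:
    """Send one SIGHUP to the httpd parent, then print the process count every second.

    parent_pid is supplied by the caller (assumption: read from the httpd PID file).
    """
    os.kill(parent_pid, signal.SIGHUP)
    for elapsed in range(seconds):
        out = subprocess.run(
            ["ps", "-e", "-o", "comm="],
            capture_output=True, text=True, check=True,
        ).stdout
        print(elapsed, count_processes(out))
        time.sleep(1)
```

If the count stays above the normal level well after the signal was sent, the remaining processes are candidates for the hanging children described above.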
Hmm... The problem is not that the -HUP is sent twice. I can reproduce the same issue with just one HUP. But as said in the first post, not every time. There is no difference between the processes that do not die and those that exit as they should, as far as I can tell.
Created attachment 98878
Merged list of processes, error_log and a couple of straces

This is a log consisting of a "ps axuwn" taken every second during such a hang (snipped a bit where nothing happens), entries from the error_log, and straces of two hanging processes.
There appears to be a rather serious mathematical error in the algorithm which waits for children to terminate / terminates them prematurely; the parent will indeed wait for the children to terminate for up to ~24 minutes!
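To see how a backoff loop can add up to a wait of this size, here is a small arithmetic sketch in Python. The starting delay (16 * 1024 microseconds), the multiplier (4) and the number of rounds (8) are assumptions chosen because they reproduce the ~24 minute figure; the thread does not state the values httpd actually uses.

```python
def total_wait_seconds(initial_us: int = 16 * 1024,
                       factor: int = 4,
                       rounds: int = 8) -> float:
    """Sum the sleeps of a backoff loop that multiplies the delay before each sleep.

    All default values are illustrative assumptions, not taken from httpd.
    """
    total_us = 0
    delay_us = initial_us
    for _ in range(rounds):
        delay_us *= factor      # grow the delay first, then "sleep" for it
        total_us += delay_us
    return total_us / 1_000_000


if __name__ == "__main__":
    print(f"{total_wait_seconds() / 60:.1f} minutes")
```

With these assumed values the total is about 23.9 minutes, and the final sleep alone accounts for roughly 18 of them, so a single child that never exits is enough to hold up the parent for nearly the whole period.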
BTW, what modules are you using, that the children are getting stuck in futex calls? Subversion?

Process 18696 attached - interrupt to quit
futex(0x427ba0, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call)
+++ killed by SIGKILL +++
Nothing special as far as I know. This particular server is only running a webserver that serves a few PHP scripts which read status information from other processes.

# uname -prv
2.4.22-1.2174.nptl #1 Wed Feb 18 16:38:32 EST 2004 i686

# rpm -qa |grep httpd
httpd-2.0.48-1.2
httpd-manual-2.0.48-1.2

# rpm -qa |grep mod_
mod_ssl-2.0.48-1.2
mod_python-3.0.4-0.1
mod_perl-1.99_12-2

# rpm -qa |grep php
php-4.3.4-1.1
php-snmp-4.3.4-1.1
php-ldap-4.3.4-1.1
php-devel-4.3.4-1.1
php-pgsql-4.3.4-1.1
php-odbc-4.3.4-1.1
php-domxml-4.3.4-1.1
php-xmlrpc-4.3.4-1.1
php-mysql-4.3.4-1.1
php-imap-4.3.4-1.1

We could probably remove mod_ssl, mod_perl and mod_python, and most of the php packages. This is just a generic install we run, with no customized modules or other packages related to apache (according to our packager).

Anything else you need?
There are really two problems here:

1) the httpd parent process does not restart in a timely fashion when a child process has hung and ignores SIGTERM

2) some of your httpd children are blocked in futex() calls; this could possibly be a problem in a PHP script (but unlikely), or some other module.

(1) is a simple bug and is easy enough to solve. (2) is not.
Great. If 1 is fixed, then 2 should not cause such a problem. However, I would like to get 2 fixed as well. Is there any way we can try to locate what causes this? The server is only serving an automated script that is requested once a minute (from cron), and the script simply returns "OK" or "Error", so there are no complicated pages that should cause a timeout from the client or anything like that. I will try to remove all the modules we do not need and see if that helps.
I believe the problem is in mod_python somewhere. I removed all unneeded packages, and then reinstalled one by one. After installing mod_python httpd did not respond well to kill -HUP anymore. I will experiment a bit more.
Almost forgot about this bug, but I realized it still seems to be a problem from time to time. Have you been able to fix 1) in comment #6?
Not yet, sorry Ola. It's in the TODO list.
I have built experimental packages in Raw Hide which should fix the timing algorithm now. If you'd like to test these out, please see bug 132360 comment 4.
Hello, just curious if there is any update on this? It is also affecting RHEL 3U3:

httpd-2.0.46-54.ent
mod_authz_ldap-0.22-5
mod_perl-1.99_09-10.ent
mod_python-3.0.3-5.ent
mod_ssl-2.0.46-54.ent

PHP varies from php-4.3.2-26.ent to compiled 5.0.4 & 4.4.1.
The problem with slow restarts is fixed in all current Fedora Core httpd packages and all current RHEL httpd updates. Sean, please file a new bug or open a support case describing the problem you're having.