Red Hat Bugzilla – Bug 127981
rotatelogs and graceful failure
Last modified: 2007-11-30 17:07:02 EST
Description of problem:
Piping Apache logs through rotatelogs can hang some Apache children on
long-lived connections.
Steps to Reproduce:
1. pipe apache logs through rotatelogs
2. create long lived client connection
3. send httpd a graceful restart: kill -USR1 `cat /var/run/httpd.pid`
Actual results:
hung child (no log entry)
Expected results:
child exits with log entry
From previous email:
> I've run into this bug:
> Any chance of getting it fixed? From the list thread mentioned in the
> bug, this problem is easily identifiable, but seems hard to fix.
Yep, as normal please do file a bug in the Red Hat bugzilla database for
this, and we can prioritize development work on the issue appropriately.
> From my point of view, sysadmin not developer, I'm wondering why
> rotatelogs gets killed on a graceful restart. It seems that only the
> httpd children need to be restarted. yes? no? maybe?
Well, possibly that could be done, but that's not actually necessary to
fix the bug. It might be confusing behaviour too: if you restart after
making a *change* to a piped logger script, you'd expect the new script
to be used after the restart.
The simplest fix for this bug is the one outlined in the upstream report:
let the rotatelogs process for each server "generation" terminate only
once all the httpd children of that generation have terminated. This is
easy to implement using normal pipe semantics: the reading end of a pipe
sees EOF only when every process holding the writing end has exited (or
closed it).
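To illustrate the pipe semantics that fix relies on, here is a minimal POSIX sh sketch (a toy model, not Apache code). Two background subshells stand in for the httpd children of one generation and share the write end of a pipe; the read loop stands in for that generation's rotatelogs. The group that starts the children exits at once, yet the reader keeps logging until the last "child" exits, and only then sees EOF. The function name simulate_generation and the messages are hypothetical.

```sh
#!/bin/sh
# Toy model only: subshells stand in for httpd children, the read loop
# stands in for rotatelogs. Names and messages are hypothetical.

simulate_generation() {
    {
        # Both "children" inherit the write end of the pipe as stdout.
        ( echo "request from child 1" ) &
        ( sleep 1; echo "request from child 2" ) &
        # This group exits immediately; the children keep the pipe open.
    } | {
        # read fails only at EOF, i.e. once no process holds the write end.
        while IFS= read -r line; do
            echo "logged: $line"
        done
        echo "EOF: all writers gone, logger can exit"
    }
}

simulate_generation
```

Run as-is, it should print the two "logged:" lines about a second apart, followed by the EOF line. The same rule cuts the other way: any other process still holding the write end (for example a parent that never closed its copy) would keep the reader from ever seeing EOF.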
Target for RHEL3 U3? :)
Thanks for filing the bug. Even "the simplest fix" is not really a
dead-simple fix, so it is pretty unlikely we will fix this for U3 at this stage.
We have 3 licenses for RH ES3 and this is a recurring problem. Both
DEV and Production servers are intermittently failing to restart at
4:02 on Sunday morning after logrotate. Contents of
/bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
I have yet to find a 'workaround' for this anywhere on the net.
Is this an Apache problem? Could a viable script workaround for this
be posted here to ENSURE Apache will be restarted? We would prefer a
Red Hat solution to this instead of crafting our own hack.
Jolyon, the issue described here should only be triggered by a
graceful restart, and logrotate in RHEL3 does use a non-graceful
restart. Please open an issue with support or file a separate bug
describing the issue you are having.
Any news on a fix for the graceful restart hang?
The Apache bugzilla doesn't have any updates either.
Not yet, sorry; progress on the patch stalled.
/me willing to test patch(es)
Any progress on this?
If not, is there an Apache developer willing to be "sponsored" to work on this?
In fact, by fortuitous timing, Jeff Trawick has produced a patch upstream which
avoids the hangs. I'll build some test packages including it.
There are really two aspects to this problem: (1) the fact that log entries can
be lost for requests made during a graceful restart if using piped loggers, and
(2) the fact that children can hang during a graceful restart. Jeff's patch
fixes (2) but not (1); is that acceptable to you?
Works for me.
I already have to deal with both (1) and (2), so any fix is better :)
If you attach the patch or a link, I can build test packages locally.
Test packages with the patch applied are available from here:
FYI, I just put a newer set of packages there (*-2.0.46-48.ent) which fix
several other issues in the piped logger code (including a potential parent
process segfault in configurations using large numbers of piped loggers).
This is a little OT, but before filing another silly bug (cf. 164367) I'll ask
here. Is there a reason for the /etc/logrotate.d/httpd script from the httpd
package to use -HUP rather than -USR1? This thread seems to suggest that a
graceful restart may fail with piped logs. I use cronolog with this feature.
Vincent: SIGHUP is just a "reload", SIGUSR1 is a "graceful restart" - only the
latter suffers from this problem when using piped loggers.
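For reference, the two restart types map onto signals sent to the httpd parent; apachectl restart and apachectl graceful send the same SIGHUP and SIGUSR1 respectively. Below is a minimal sh sketch of a helper that picks the signal and sends it to the PID read from a PID file. The function name send_httpd_signal is hypothetical; /var/run/httpd.pid is the path used by the logrotate line quoted earlier in this bug.

```sh
#!/bin/sh
# Hypothetical helper: send_httpd_signal restart|graceful [PIDFILE]
#   restart  -> SIGHUP:  parent kills its children at once, rereads config
#   graceful -> SIGUSR1: children finish their current request, then exit
send_httpd_signal() {
    mode=$1
    pidfile=${2:-/var/run/httpd.pid}
    case "$mode" in
        restart)  sig=HUP ;;
        graceful) sig=USR1 ;;
        *)
            echo "usage: send_httpd_signal restart|graceful [PIDFILE]" >&2
            return 2
            ;;
    esac
    # Fail (status 1) if the PID file is missing or unreadable.
    pid=$(cat "$pidfile" 2>/dev/null) || return 1
    kill -s "$sig" "$pid"
}

# Example: the -HUP restart performed by the logrotate script
# send_httpd_signal restart /var/run/httpd.pid
```

Unlike the logrotate line, this sketch does not swallow errors with `|| true`, so a caller can tell whether the signal was actually delivered.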
Joe: Sure, I understand the differences. If it's the case that the only
difference between using these two signals is that -USR1 would cause connections
not to be dropped, then why is -HUP used in the log rotation script?
Is the problem described in this thread the reason for this choice?
It's really just historical that the logrotate script uses -HUP rather than
-USR1, but yes, the fact that the graceful restart can (without the fix for this
bug) cause child processes to hang if piped loggers are used is indeed a good
reason why -HUP should be used by default rather than -USR1.
In the future, when the graceful restart process is made more robust even when
piped loggers are used, then it would be good to change the logrotate script to
I've been running the patched binaries in devel for several months without problems.
I've been running the RHEL3U6beta binaries for a couple of weeks in production
without any major problems.
The only issue I have seen is when a (usually badly behaved) spider comes in using
keep-alive: the 'soon to be dead' child can hang around for a while processing
many spider requests.
Missing a log entry or two is acceptable; missing many is less desirable.
Proposal: if a child gets SIGUSR1, it stops using keep-alive.
This should allow at most one lost log message.
yes? no? maybe?
Should I post this elsewhere?
(In reply to comment #16)
> Joe: Sure, I understand the differences. If it's the case that the only
> difference between using these two signals is that -USR1 would cause connections
> not to be dropped, then why is -HUP used in the log rotation script?
> Is the problem described in this thread the reason for this choice?
A -HUP causes the child processes to be killed instantly. This means a half-completed
transaction could fail; an example would be a 20-minute Subversion commit.
SIGUSR1 would let it complete; SIGHUP would lose it.
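A toy model of that difference, as a minimal sh sketch: a background sh stands in for a child in the middle of a three-step "transaction". It traps USR1 and merely notes it, so the transaction still commits; it does not trap HUP, so the default action terminates it mid-transaction and nothing is committed. In real httpd both signals go to the parent, which then deals with its children; this sketch signals the stand-in directly, and the function name run_with_signal is hypothetical.

```sh
#!/bin/sh
# Toy model, not Apache code: a stand-in "child" runs a 3-second
# transaction; we signal it after 1 second and print what it managed to do.
run_with_signal() {
    sig=$1
    out=$(mktemp)
    sh -c '
        graceful=0
        trap "graceful=1" USR1    # graceful: remember the request, keep working
        i=0
        while [ "$i" -lt 3 ]; do  # three one-second steps of work
            sleep 1
            i=$((i + 1))
        done
        echo "transaction committed (graceful=$graceful)"
    ' > "$out" &
    child=$!
    sleep 1
    kill -s "$sig" "$child"      # HUP is not trapped, so it terminates the child
    wait "$child" || true
    cat "$out"
    rm -f "$out"
}

run_with_signal USR1   # prints: transaction committed (graceful=1)
run_with_signal HUP    # prints nothing: the work was lost
```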
For RHEL5 we can hopefully implement the complete solution so that *no* log
messages are lost after a graceful restart even if piped loggers are used. This
is technically quite simple, but it is risky and requires a kind of interface
change, so it can't really be done for RHEL3/4.
It would be something of a hack to turn off keep-alive globally after receiving a
signal, and it might not be acceptable upstream. But yes, if you could post an RFE
upstream, we could see whether there would be resistance to that --
(BTW, even if you have *minor* problems with the U6Beta packages, please let us
know, especially if they are regressions!)
The "hung children" issue was fixed in the U6/U2 errata along with the other
graceful restart/piped logger fixes. (though the bug reference was missed from