Red Hat Bugzilla – Bug 125281
Cluster manager detects false failures under heavy system load
Last modified: 2013-03-06 00:57:06 EST
Description of problem:
Under heavy system load (e.g., high CPU usage, low levels
of free memory) the quorum daemon detects cluster node
failures where none has occurred.
The system involved is an HP DL-380 with dual 3.06GHz Xeons
with 9GB of memory, running RedHat LAS2.1enterprise e.27 and
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run an application on a system with RHCM that creates a heavy load.
A mixture of heavy CPU usage, high memory consumption and high I/O
rates is best.
2. Wait for failover.
Actual Results: Unexpected failovers to the standby node occurred,
where no failure could be detected on the active host.
Expected Results: The active host should continue to run.
This problem, on the surface, appears to be a scheduling issue:
heavy loads prevent the components of the cluster manager
from running. We are currently looking at using soft realtime
to improve this situation. Any suggestions or fixes you could
provide would be welcome.
Another option would be to increase the failure detection time...
Created attachment 100868 [details]
Set/get scheduling policy/priority of an arbitrary process
Simple program which gets/sets the scheduling policy/priority of an arbitrary
process; you may use it for testing if you wish.
gcc -Wall -Wstrict-prototypes -Werror -Wshadow -o prio prio.c
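The attachment itself is not reproduced in this report. As a rough sketch only, a tool of this kind could be built on the POSIX calls sched_getscheduler(), sched_getparam() and sched_setscheduler(). The helper names policy_name, get_sched and set_sched below are hypothetical and are not taken from prio.c.

```c
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper: printable name for a scheduling policy. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/*
 * Hypothetical helper: read the policy and priority of a process.
 * A pid of 0 means the calling process.
 * Returns 0 on success, -1 with errno set on failure.
 */
int get_sched(pid_t pid, int *policy, int *priority)
{
    struct sched_param sp;
    int pol = sched_getscheduler(pid);

    if (pol == -1)
        return -1;
    if (sched_getparam(pid, &sp) == -1)
        return -1;
    *policy = pol;
    *priority = sp.sched_priority;
    return 0;
}

/*
 * Hypothetical helper: set the policy and priority of a process.
 * Switching to SCHED_FIFO or SCHED_RR normally requires root.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_sched(pid_t pid, int policy, int priority)
{
    struct sched_param sp;

    sp.sched_priority = priority;
    return sched_setscheduler(pid, policy, &sp);
}
```

For the soft realtime experiment mentioned above, one would call something like set_sched(pid_of_daemon, SCHED_RR, 1) as root; the priority value here is only an example.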
Is there a way to modify the failure detection time without
recompiling the entire source? From the code, it looks like there are
ways to set % parameters, but I do not know how to set these from the
Please file a ticket with Red Hat Support if you have not already done so:
You may include a reference to this bugzilla if you desire. FYI, here
is how to adjust the failover detection time on RHEL 2.1:
netdown * interval => Failover time (in seconds) without heartbeat
channels responding normally
netup * interval => Failover time (in seconds) with heartbeat channels
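To make the arithmetic concrete, here is a small sketch of the two products above. The function name failover_seconds and the interval used in the example are assumptions for illustration; the real interval depends on the cluster configuration.

```c
/*
 * Failover detection time in seconds, as described above:
 * (netdown or netup count) * interval.
 * Hypothetical helper, not part of clumanager.
 */
unsigned int failover_seconds(unsigned int count, unsigned int interval_sec)
{
    return count * interval_sec;
}
```

With an assumed interval of 2 seconds, a count of 7 gives 14 seconds and a count of 90 gives 180 seconds.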
Created attachment 101067 [details]
error log from clumanager
I just added an attachment that shows the results of some
experiments. First we changed MAX_SAME_TIMESTAMP_NETUP from 12 to 150
and MAX_SAME_TIMESTAMP_NETDOWN from 7 to 90. Also, we set
reallyDoShutdown to 0 in quorumd. We get ENOMEMs but do not shut
ourselves down. This is due to very heavy system load and no free
memory to allocate to the cluster manager. The cluster manager should
be resilient to this. As the item 3) shows, we eventually get a
signal 11 as a result of ignoring the ENOMEM.
It looks like the system ran out of low memory. At least, it was
extremely stressed for low memory - this is evidenced by the fact that
read() returned -1/ENOMEM. This only occurs, as far as I can tell, in
the kernel (drivers/char/raw.c) raw read/write path after it calls
There may be a kernel issue with that particular path of allocating
too much RAM (as I understand it, allocating 4 pages instead of 1) for
As for the signal 11: Cluster Manager raised it (SIGSEGV/signal 11)
when it could not figure out what to do. There's a macro which logs
at EMERG, then raises SIGSEGV to prevent further operation. As far as
I know, this can only occur if rebooting fails (or, is disabled -
which is intended for debugging purposes only).
Loss of access to shared storage is considered fatal by Cluster
Manager, which is why it tries to reboot: if it can't reach shared
storage, it is incorrect to assume that other tasks can: force failover.
In this case, the loss of access was due to the inability to allocate
kernel memory to complete the operation. This is consistent with how
Cluster Manager would have behaved had it received MAP_FAILED/ENOMEM
during the mmap() call in the diskRawRead/diskRawWrite paths: if
Cluster Manager cannot allocate memory, it is incorrect to assume
other tasks can: force failover.
Note that when the available memory is low enough, the kernel will
begin killing tasks (see mm/oom_kill.c). It is better to prevent
entering this state rather than letting it happen and trying to clean
it up later - there is no way to tell which task will get killed first
Created attachment 101120 [details]
Allows better recovery in unrecoverable cases.
Here's a patch which will allow more robust "recovery" in certain error cases
in the lock subsystem (such as the one referenced in the logs) and quorum
daemon when reboot is disabled.
Thanks for the patch, Lon. I've actually done something similar,
but in diskrawio.c, in the raw write and read paths; my code only
retries 3 times, with 1-second sleeps. This wasn't enough to get
over the bumps. I also "statically" allocated the page aligned
buffers for the raw read and writes by mmaping from /dev/zero on
init. We originally saw the ENOMEMs come out of the mmap in the
allocate code. The effect was the same.
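For readers following along, a one-time allocation of this kind might look roughly like the sketch below. It is not the code from either patch: raw_buf_init and raw_buf_get are hypothetical names, and touching the pages with memset() at init is an added assumption so that the buffer is backed by memory before the load builds.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical module state: one buffer shared by all raw reads/writes. */
static void *raw_io_buf;
static size_t raw_io_len;

/*
 * Map a page-aligned buffer from /dev/zero once, at init time.
 * Returns 0 on success (or if already initialised), -1 with errno set.
 */
int raw_buf_init(size_t min_len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t psz, len;
    void *p;
    int fd, saved;

    if (raw_io_buf != NULL)
        return 0;
    if (page <= 0)
        return -1;
    psz = (size_t)page;

    /* Round up to a whole number of pages, at least one. */
    len = ((min_len + psz - 1) / psz) * psz;
    if (len == 0)
        len = psz;

    fd = open("/dev/zero", O_RDWR);
    if (fd == -1)
        return -1;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    saved = errno;
    close(fd);              /* the mapping stays valid after close */
    if (p == MAP_FAILED) {
        errno = saved;
        return -1;
    }

    /* Assumption: fault the pages in now rather than under load. */
    memset(p, 0, len);

    raw_io_buf = p;
    raw_io_len = len;
    return 0;
}

/* Return the shared buffer; its length is stored in *len_out if non-NULL. */
void *raw_buf_get(size_t *len_out)
{
    if (len_out != NULL)
        *len_out = raw_io_len;
    return raw_io_buf;
}
```

This only removes the userland allocation from the I/O path; as noted in this report, the kernel can still return ENOMEM from the raw read/write itself.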
We've been running e40.8 that was pulled from Jason Baron's people
site, associated with bug 117460. This has helped immensely with
heavy load conditions, but we still fail with ENOMEMs occasionally.
I'll apply your patch as soon as I can and let you know what happens.
With the patch, you'll still see ENOMEM, just that the cluster will
not do the funny raise(SIGSEGV) when it happens. Also, if it gets
ENOMEM while reading the other member's status, it'll just continue as
though the other member's state has not changed.
I removed the singly-allocated buffer portion from my patch, figuring
that it wasn't directly related to handling the "error recovery paths".
(All of this is with "reallyDoShutdown = 0" in diskrawio.c, which
isn't how Red Hat ships the package.)
Created attachment 101150 [details]
Above patch + add one-time allocation in diskrawio.c
Same as above with one-time-allocated buffers for raw I/O in userland.
Regression tested; does not interfere with normal operation.
As far as failover is concerned, it is incorrect to assume that, if the
system is out of memory, the OOM killer hasn't killed any
processes - one of which may be a service.
However, this doesn't mean that the allocation during the read/write
path in drivers/char/raw.c shouldn't be looked at more closely.
Created attachment 101423 [details]
Big retry patch
Patch increases MAX_CONSECUTIVE_IO_ERRORS to 5, and delays between retries when
ENOMEM is encountered. A bit of code was added to ensure errno gets passed
back up the stack properly.
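The general shape of such a retry loop is sketched below. This is not the attached patch: the function io_retry_enomem, the raw_io_fn type and the use of MAX_CONSECUTIVE_IO_ERRORS as a cap on attempts are assumptions made for illustration. Only the value 5 and the idea of delaying on ENOMEM come from the description above.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_CONSECUTIVE_IO_ERRORS 5   /* value from the patch description */

/* Same shape as read(2), so read can be passed directly. */
typedef ssize_t (*raw_io_fn)(int fd, void *buf, size_t count);

/*
 * Hypothetical helper: run an I/O operation, retrying after a delay
 * while it fails with ENOMEM. Any other error is returned at once.
 * On failure, errno holds the error from the last attempt so that
 * callers further up the stack see the real cause.
 */
ssize_t io_retry_enomem(raw_io_fn op, int fd, void *buf, size_t count,
                        unsigned int delay_sec)
{
    int errors = 0;

    for (;;) {
        ssize_t n = op(fd, buf, count);
        int saved;

        if (n >= 0)
            return n;

        saved = errno;
        if (saved != ENOMEM || ++errors >= MAX_CONSECUTIVE_IO_ERRORS) {
            errno = saved;  /* restore in case anything clobbered it */
            return -1;
        }
        sleep(delay_sec);   /* give the kernel time to free memory */
    }
}
```

A caller would use it as io_retry_enomem(read, fd, buf, len, 1) for a one-second delay between attempts.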
OK, we've reworked raw I/O in the kernel to use a mempool during
raw reads/writes. Basically, we allocate some kiobufs in a pool and,
if the kernel gets stuck, wait for them to free up before returning
-ENOMEM. This should fix your problem. A kernel with that patch is at:
Thanks for the patch and the kernel update. We'll be installing
the 41.12 kernel today and run it over the weekend. It'll probably
be late Monday before we change the cluster manager over to the
The 41.12 kernel should obviate the need for the latest patch to
Comment on attachment 101423 [details]
Big retry patch
41.12 kernel should obviate the need for this patch.
Ok, won't do the patch. Thanks.
if you haven't updated yet, the official U5 beta kernel has just
landed, e.42...same place:
Thanks again for the update. Is there any chance that this
kernel contains the fixes in the 40.8 kernel (117902) that
'throttle' kswapd? We've been running 40.8 here with much success
and would like these changes in the new kernel.
Yes, although we have tweaked these fixes a bit, we are confident
that this beta kernel will address the issues contained in 117902.
Just wanted to let you know that we've been running e42 for a week
on 2-4 machines under a number of different conditions and have
been VERY pleased with its operation. Kswapd's behavior is much
more civilized, and we no longer get ENOMEMs on raw cluster i/o
or the 'mysterious' failovers we'd been seeing. We plan on going
forward with these fixes. Thanks again for your work.
Ok. You should be aware that:
(1) e42 is a beta kernel, the release version might change, and you
should not use it in production.
(2) The patches for clumanager should more or less be obviated by the
kernel improvements; so you should test without them.
We will remove the patches from clumanager and test, although
we have added messages to log when they come into play, and
we haven't seen any of these messages.
We're not really in production. We're developing a clustered
version of our product, so the e42 kernel will be used on our
development/test platforms; if it crashes, it's not the end of
the world. I noticed that e43 came out today; I checked the
source for the kswapd patches, but they weren't there. Do you
know if they will make it into an official release?
e.43 is a security kernel...the latest u5 is e.46 at:
Has JBaron's patch made it into the e.48 kernel? I have checked the
kernel RPM spec file and cannot find a reference to the patch.
Thanks for verifying this.
I just verified that it has not been included in e48 (at least the
pieces that I'm aware of). Is there a plan to put this in an
'official' errata release? It solves a host of problems for our
e.48 included critical security fixes...the fixes described in this
bugzilla will be included in U5, which should ship in a couple of
An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.