Bug 125281

Summary: Cluster manager detects false failures under heavy system load
Product: Red Hat Enterprise Linux 2.1
Component: clumanager
Version: 2.1
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: low
Reporter: Paul Hansen <phansen>
Assignee: Jason Baron <jbaron>
CC: hagedorn, knoel, lhh, tao
Doc Type: Bug Fix
Last Closed: 2004-08-18 14:25:41 UTC

Attachments:
Set/get scheduling policy/priority of an arbitrary process
error log from clumanager
Allows better recovery in unrecoverable cases.
Above patch + add one-time allocation in diskrawio.c
Big retry patch

Description Paul Hansen 2004-06-04 13:52:41 UTC

Description of problem:
Under heavy system load (e.g., high CPU usage, low levels
of free memory), the quorum daemon detects cluster node
failures where none have occurred.

The system involved is an HP DL380 with dual 3.06 GHz Xeons
and 9 GB of memory, running Red Hat LAS 2.1 with the e.27
enterprise kernel and clumanager 1.0.26.

Version-Release number of selected component (if applicable):
clumanager-1.0.26

How reproducible:
Sometimes

Steps to Reproduce:
1. Run an application on a system with RHCM that creates a heavy load.
   A mixture of heavy CPU usage, high memory consumption, and high I/O
   rates is best.
2. Wait for failover.

Actual Results:  Unexpected failovers to the standby node occurred,
where no failure could be detected on the active host.

Expected Results:  The active host should continue to run.

Additional info:

This problem, on the surface, appears to be a scheduling issue:
heavy loads prevent the components of the cluster manager from
running.  We are currently looking at using soft real-time
scheduling to improve the situation.  Any suggestions or fixes
you could provide would be welcome.

Comment 1 Lon Hohberger 2004-06-04 14:03:33 UTC
Another option would be to increase the failure detection time...

Comment 2 Lon Hohberger 2004-06-04 14:49:55 UTC
Created attachment 100868 [details]
Set/get scheduling policy/priority of an arbitrary process

Simple program which gets/sets the scheduling policy/priority of an arbitrary
process; you may use it for testing if you wish.

Build command:

gcc -Wall -Wstrict-prototypes -Werror -Wshadow -o prio prio.c
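
For reference, a minimal sketch of such a tool, assuming the standard
sched_getscheduler()/sched_setscheduler()/sched_getparam() interfaces
(an illustration only; the actual attachment may differ):

/* prio.c (sketch, not the attachment) -- get/set the scheduling
 * policy/priority of an arbitrary process.
 * Usage: prio <pid> [policy priority]
 *        policy: 0 = SCHED_OTHER, 1 = SCHED_FIFO, 2 = SCHED_RR */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sched.h>

int main(int argc, char **argv)
{
	struct sched_param sp;
	pid_t pid;

	if (argc != 2 && argc != 4) {
		fprintf(stderr, "usage: %s <pid> [policy priority]\n",
			argv[0]);
		return 1;
	}
	pid = (pid_t)atoi(argv[1]);

	if (argc == 4) {
		/* Set the new policy and priority. */
		sp.sched_priority = atoi(argv[3]);
		if (sched_setscheduler(pid, atoi(argv[2]), &sp) < 0) {
			fprintf(stderr, "sched_setscheduler: %s\n",
				strerror(errno));
			return 1;
		}
	}

	/* Report the (possibly new) policy and priority. */
	if (sched_getparam(pid, &sp) < 0) {
		fprintf(stderr, "sched_getparam: %s\n", strerror(errno));
		return 1;
	}
	printf("pid %d: policy %d, priority %d\n", (int)pid,
	       sched_getscheduler(pid), sp.sched_priority);
	return 0;
}

For example, "prio 1234" prints the current settings, and "prio 1234 1 50"
would move pid 1234 to SCHED_FIFO at priority 50 (root required).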

Comment 3 Michael Sporer 2004-06-04 17:12:11 UTC
Is there a way to modify the failure detection time without
recompiling the entire source?  From the code, it looks like there
are ways to set these parameters, but I do not know how to set them
from the console.

Comment 4 Lon Hohberger 2004-06-04 18:57:36 UTC
Please file a ticket with Red Hat Support if you have not already done so:

http://www.redhat.com/apps/support/

You may include a reference to this bugzilla if you desire.  FYI, here
is how to adjust the failover detection time on RHEL 2.1:

http://www.redhat.com/docs/manuals/enterprise/RHEL-AS-2.1-Manual/cluster-manager/s1-hwinfo-tunefailover.html

For simplicity:

netdown * interval => Failover time (in seconds) without heartbeat
channels responding normally

netup * interval => Failover time (in seconds) with heartbeat channels
working normally
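
For example, with a hypothetical heartbeat interval of 2 seconds and
the default netdown count of 7 (MAX_SAME_TIMESTAMP_NETDOWN, discussed
below), failover would trigger after roughly 7 * 2 = 14 seconds of
unresponsive heartbeat channels.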


Comment 5 Michael Sporer 2004-06-11 19:01:47 UTC
Created attachment 101067 [details]
error log from clumanager

Comment 6 Michael Sporer 2004-06-11 19:16:08 UTC
I just added an attachment that shows the results of some
experiments.  First we changed MAX_SAME_TIMESTAMP_NETUP from 12 to
150 and MAX_SAME_TIMESTAMP_NETDOWN from 7 to 90.  Also, we set
reallyDoShutdown to 0 in quorumd.  We get ENOMEMs but do not shut
ourselves down.  This is due to very heavy system load and no free
memory to allocate to the cluster manager.  The cluster manager
should be resilient to this.  As item 3) in the log shows, we
eventually get a signal 11 as a result of ignoring the ENOMEM.

Comment 7 Lon Hohberger 2004-06-14 17:52:50 UTC
It looks like the system ran out of low memory.  At least, it was
extremely stressed for low memory - this is evidenced by the fact that
read() returned -1/ENOMEM.  This only occurs, as far as I can tell, in
the kernel (drivers/char/raw.c) raw read/write path after it calls
alloc_kiovec_sz().

There may be a kernel issue with that particular path allocating
too much RAM (as I understand it, 4 pages instead of 1) for
the operation.

As for the signal 11: Cluster Manager raised it (SIGSEGV/signal 11)
when it could not figure out what to do.  There's a macro which logs
at EMERG, then raises SIGSEGV to prevent further operation.  As far as
I know, this can only occur if rebooting fails (or, is disabled -
which is intended for debugging purposes only).
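
The macro itself isn't quoted here, but the pattern described (log at
EMERG, then raise SIGSEGV to stop further operation) amounts to
something like this hypothetical reconstruction:

/* Hypothetical sketch of the log-and-halt pattern described above;
 * the real clumanager macro surely differs in name and detail. */
#include <signal.h>
#include <syslog.h>

#define CLU_DIE(msg)						\
	do {							\
		syslog(LOG_EMERG, "unrecoverable: %s", (msg));	\
		raise(SIGSEGV);	/* prevent further operation */	\
	} while (0)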

Loss of access to shared storage is considered fatal by Cluster
Manager, which is why it tries to reboot: if it can't reach shared
storage, it is incorrect to assume that other tasks can: force failover.

In this case, the loss of access was due to the inability to allocate
kernel memory to complete the operation.  This is consistent with how
Cluster Manager would have behaved had it received MAP_FAILED/ENOMEM
during the mmap() call in the diskRawRead/diskRawWrite paths: if
Cluster Manager can not allocate memory, it is incorrect to assume
other tasks can: force failover.

Note that when available memory is low enough, the kernel will
begin killing tasks (see mm/oom_kill.c).  It is better to prevent
entering this state than to let it happen and try to clean it up
later - there is no way to tell from userland which task will get
killed first.

Comment 8 Lon Hohberger 2004-06-14 18:03:32 UTC
Created attachment 101120 [details]
Allows better recovery in unrecoverable cases.

Here's a patch which will allow more robust "recovery" in certain error cases
in the lock subsystem (such as the one referenced in the logs) and quorum
daemon when reboot is disabled.

Comment 10 Paul Hansen 2004-06-14 19:53:27 UTC
Thanks for the patch, Lon.  I've actually done something similar,
but in diskrawio.c, in the diskRawWrite and diskRawRead paths; my
code only retries 3 times, with 1-second sleeps.  This wasn't
enough to get over the bumps.  I also "statically" allocated the
page-aligned buffers for the raw reads and writes by mmap()ing
from /dev/zero at init.  We originally saw the ENOMEMs come out of
the mmap in the allocation code.  The effect was the same.
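
A rough sketch of that one-time, page-aligned buffer allocation via
/dev/zero (illustrative names only, not the actual diskrawio.c change):

/* Map /dev/zero once at init so the raw I/O path never has to
 * allocate memory under pressure; mmap() returns page-aligned,
 * zero-filled memory.  Names here are hypothetical. */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

static void *raw_io_buf;

int raw_buf_init(size_t len)
{
	int fd = open("/dev/zero", O_RDWR);
	if (fd < 0)
		return -1;
	raw_io_buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping survives the close */
	return (raw_io_buf == MAP_FAILED) ? -1 : 0;
}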

We've been running e40.8, pulled from Jason Baron's
people.redhat.com site in connection with bug 117460.  This has
helped immensely with heavy load conditions, but we still fail
with ENOMEMs occasionally.

I'll apply your patch as soon as I can and let you know what happens.

Comment 11 Lon Hohberger 2004-06-14 21:55:41 UTC
With the patch, you'll still see ENOMEM; the cluster just won't
do the funny raise(SIGSEGV) when it happens.  Also, if it gets
ENOMEM while reading the other member's status, it'll just continue
as though the other member's state has not changed.

I removed the singly-allocated buffer portion from my patch, figuring
that it wasn't directly related to handling the "error recovery paths".

(All of this is with "reallyDoShutdown = 0" in diskrawio.c, which
isn't how Red Hat ships the package.)

Comment 12 Lon Hohberger 2004-06-15 15:00:08 UTC
Created attachment 101150 [details]
Above patch + add one-time allocation in diskrawio.c

Same as above with one-time-allocated buffers for raw I/O in userland. 
Regression tested; does not interfere with normal operation.

Comment 13 Lon Hohberger 2004-06-23 15:16:07 UTC
As far as failover is concerned, it is incorrect to assume that,
if the system is out of memory, the OOM killer hasn't killed any
processes - one of which may be a service.

However, this doesn't mean that the allocation during the read/write
path in drivers/char/raw.c shouldn't be looked at more closely.

Comment 15 Lon Hohberger 2004-06-25 16:52:02 UTC
Created attachment 101423 [details]
Big retry patch

The patch increases MAX_CONSECUTIVE_IO_ERRORS to 5 and adds delays
between retries when ENOMEM is encountered.  A bit of code was also
added to ensure errno gets passed back up the stack properly.
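
In outline, retry-with-delay on ENOMEM looks something like this
(only MAX_CONSECUTIVE_IO_ERRORS is taken from the patch description;
the function and its name are illustrative):

/* Retry a raw read when the kernel transiently returns ENOMEM,
 * sleeping between attempts to let memory free up.  Hypothetical
 * sketch, not the actual patch. */
#include <errno.h>
#include <unistd.h>

#define MAX_CONSECUTIVE_IO_ERRORS 5

ssize_t raw_read_retry(int fd, void *buf, size_t len)
{
	int tries;
	ssize_t n;

	for (tries = 0; tries < MAX_CONSECUTIVE_IO_ERRORS; tries++) {
		n = read(fd, buf, len);
		if (n >= 0 || errno != ENOMEM)
			return n;	/* success, or a non-ENOMEM error */
		sleep(1);		/* back off; let memory free up */
	}
	return -1;			/* errno still holds ENOMEM */
}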

Comment 16 Jason Baron 2004-06-25 16:55:28 UTC
OK, we've reworked raw I/O in the kernel to use a mempool during
raw reads/writes: basically, we allocate some kiobufs in a pool
and, if the kernel gets stuck, wait for them to free up before
returning -ENOMEM.  This should fix your problem.  A kernel with
that patch is at:

http://people.redhat.com/~jbaron/.private/u5/2.4.9-e.41.12/
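
For readers unfamiliar with the mempool pattern: a small reserve of
objects is allocated up front, and allocations block until an element
is free rather than failing.  A rough sketch against the later generic
kernel mempool API (the actual 2.4.9-e kiobuf patch predates this
interface and certainly differs):

/* Sketch of the mempool pattern; illustrative only.  A reserve of
 * buffers is created at init, and GFP_KERNEL allocations wait for
 * a free element instead of returning -ENOMEM. */
#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/errno.h>

#define RAWIO_POOL_MIN 4	/* hypothetical reserve size */

static mempool_t *rawio_pool;

int rawio_pool_init(size_t bufsize)
{
	rawio_pool = mempool_create(RAWIO_POOL_MIN, mempool_kmalloc,
				    mempool_kfree, (void *)bufsize);
	return rawio_pool ? 0 : -ENOMEM;
}

/* In the I/O path: blocks until a buffer is available. */
void *rawio_get_buf(void)
{
	return mempool_alloc(rawio_pool, GFP_KERNEL);
}

void rawio_put_buf(void *buf)
{
	mempool_free(buf, rawio_pool);
}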

Comment 17 Paul Hansen 2004-06-25 19:56:04 UTC
Thanks for the patch and the kernel update.  We'll be installing
the 41.12 kernel today and running it over the weekend.  It'll
probably be late Monday before we change the cluster manager over
to the patched versions.

Comment 18 Lon Hohberger 2004-06-25 19:58:36 UTC
The 41.12 kernel should obviate the need for the latest patch to
clumanager.

Comment 19 Lon Hohberger 2004-06-25 19:59:26 UTC
Comment on attachment 101423 [details]
Big retry patch

41.12 kernel should obviate the need for this patch.

Comment 20 Paul Hansen 2004-06-25 20:01:57 UTC
Ok, won't do the patch.  Thanks.

Comment 21 Jason Baron 2004-06-25 20:02:39 UTC
If you haven't updated yet: the official U5 beta kernel, e.42, has
just landed in the same place:

http://people.redhat.com/~jbaron/.private/u5/

thanks,

-Jason


Comment 22 Paul Hansen 2004-06-25 20:12:06 UTC
Thanks again for the update.  Is there any chance that this kernel
contains the fixes from the 40.8 kernel (bug 117902) that
'throttle' kswapd?  We've been running 40.8 here with much success
and would like these changes in the new kernel.

Comment 23 Jason Baron 2004-06-25 20:14:40 UTC
Yes, although we have tweaked those fixes a bit, we are confident
that this beta kernel will address the issues described in 117902.

Comment 25 Paul Hansen 2004-07-02 18:28:10 UTC
Just wanted to let you know that we've been running e42 for a week
on 2-4 machines under a number of different conditions and have
been VERY pleased with its operation.  Kswapd's behavior is much
more civilized, and we no longer get ENOMEMs on raw cluster i/o
or the 'mysterious' failovers we'd been seeing.  We plan on going
forward with these fixes.  Thanks again for your work.

Comment 26 Lon Hohberger 2004-07-02 19:00:42 UTC
OK.  You should be aware that:

(1) e42 is a beta kernel; the release version might change, and you
should not use it in production.

(2) The patches for clumanager should be more or less obviated by
the kernel improvements, so you should test without them.

Comment 27 Paul Hansen 2004-07-02 19:09:21 UTC
We will remove the patches from clumanager and test, although we
had added messages to log when they come into play, and we haven't
seen any of those messages.

We're not really in production.  We're developing a clustered
version of our product, so the e42 kernel will be used on our
development/test platforms; if it crashes, it's not the end of
the world.  I noticed that e43 came out today; I checked the
source for the kswapd patches, but they weren't there.  Do you
know if they will make it into an official release?

Comment 28 Jason Baron 2004-07-02 19:13:01 UTC
e.43 is a security kernel; the latest U5 kernel is e.46, at:

http://people.redhat.com/~jbaron/.private/u5/

Comment 31 Roger Nunn 2004-08-05 14:48:02 UTC
Hi,
Has Jason Baron's patch made it into the e.48 kernel?  I have
checked the kernel RPM spec file and cannot find a reference to
the patch.  Thanks for verifying this.
Kind regards,
Roger Nunn

Comment 32 Paul Hansen 2004-08-05 14:50:58 UTC
I just verified that it has not been included in e.48 (at least the
pieces that I'm aware of).  Is there a plan to put this in an 
'official' errata release?  It solves a host of problems for our
application.

Comment 33 Jason Baron 2004-08-05 14:54:45 UTC
e.48 included critical security fixes.  The fixes described in this
bugzilla will be included in U5, which should ship in a couple of
weeks.  Thanks.

Comment 34 John Flanagan 2004-08-18 14:25:41 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-437.html