Bug 125281
Summary: | Cluster manager detects false failures under heavy system load | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 2.1 | Reporter: | Paul Hansen <phansen> |
Component: | clumanager | Assignee: | Jason Baron <jbaron> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | low | ||
Version: | 2.1 | CC: | hagedorn, knoel, lhh, tao |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-08-18 14:25:41 UTC | Type: | --- |
Description
Paul Hansen
2004-06-04 13:52:41 UTC
Another option would be to increase the failure detection time...

Created attachment 100868
Set/get scheduling policy/priority of an arbitrary process
Simple program which gets/sets the scheduling policy/priority of an arbitrary
process; you may use it for testing if you wish.
Build command:
gcc -Wall -Wstrict-prototypes -Werror -Wshadow -o prio prio.c
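
For readers without the attachment, a minimal sketch of what such a tool does might look like the following. This is not the attached prio.c; the function names get_sched, set_sched and policy_name are made up for illustration. It assumes only the POSIX calls sched_getscheduler(), sched_getparam() and sched_setscheduler().

```c
#include <sched.h>
#include <sys/types.h>

/* Map a scheduling policy constant to a printable name. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/*
 * Read the policy and static priority of a process (pid 0 means the
 * caller). Returns 0 on success, -1 with errno set on failure.
 */
int get_sched(pid_t pid, int *policy, int *priority)
{
    struct sched_param sp;
    int pol = sched_getscheduler(pid);

    if (pol < 0)
        return -1;
    if (sched_getparam(pid, &sp) < 0)
        return -1;
    *policy = pol;
    *priority = sp.sched_priority;
    return 0;
}

/*
 * Set the policy and static priority of a process. Returns 0 on
 * success, -1 with errno set on failure (EPERM without privilege).
 */
int set_sched(pid_t pid, int policy, int priority)
{
    struct sched_param sp;

    sp.sched_priority = priority;
    return sched_setscheduler(pid, policy, &sp);
}
```

Passing pid 0 refers to the calling process. Moving a process to SCHED_FIFO or SCHED_RR normally requires root privileges, so set_sched() is expected to fail with EPERM for ordinary users.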
Is there a way to modify the failure detection time without recompiling the entire source? From the code, it looks like there are ways to set % parameters, but I do not know how to set these from the console.

Please file a ticket with Red Hat Support if you have not already done so: http://www.redhat.com/apps/support/ You may include a reference to this bugzilla if you desire.

FYI, here is how to adjust the failover detection time on RHEL 2.1: http://www.redhat.com/docs/manuals/enterprise/RHEL-AS-2.1-Manual/cluster-manager/s1-hwinfo-tunefailover.html

For simplicity:
netdown * interval => Failover time (in seconds) without heartbeat channels responding normally
netup * interval => Failover time (in seconds) with heartbeat channels working normally

Created attachment 101067
error log from clumanager
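
The failover arithmetic quoted in the tuning comment (netdown * interval and netup * interval) can be sketched as below. The struct and function names are hypothetical, and the numbers in the usage note are illustrative only, not confirmed defaults of clumanager.

```c
/* Hypothetical container for the three tuning values named in the comment. */
struct failover_tuning {
    unsigned int interval; /* seconds between checks */
    unsigned int netdown;  /* count used when heartbeat channels are not responding */
    unsigned int netup;    /* count used when heartbeat channels are working */
};

/* Failover time in seconds without heartbeat channels responding normally. */
unsigned int failover_secs_netdown(const struct failover_tuning *t)
{
    return t->netdown * t->interval;
}

/* Failover time in seconds with heartbeat channels working normally. */
unsigned int failover_secs_netup(const struct failover_tuning *t)
{
    return t->netup * t->interval;
}
```

With an assumed interval of 2 seconds, netdown = 7 and netup = 12, the two functions return 14 and 24 seconds respectively.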
I just added an attachment that shows the results of some experiments. First we changed MAX_SAME_TIMESTAMP_NETUP from 12 to 150 and MAX_SAME_TIMESTAMP_NETDOWN from 7 to 90. Also, we set reallyDoShutdown to 0 in quorumd. We get ENOMEMs but do not shut ourselves down. This is due to very heavy system load and no free memory to allocate to the cluster manager. The cluster manager should be resilient to this. As item 3) shows, we eventually get a signal 11 as a result of ignoring the ENOMEM.

It looks like the system had run out of low memory. At least, it was extremely stressed for low memory; this is evidenced by the fact that read() returned -1/ENOMEM. This only occurs, as far as I can tell, in the kernel (drivers/char/raw.c) raw read/write path after it calls alloc_kiovec_sz(). There may be a kernel issue with that particular path allocating too much RAM (as I understand it, allocating 4 pages instead of 1) for the operation.

As for the signal 11: Cluster Manager raised it (SIGSEGV/signal 11) when it could not figure out what to do. There's a macro which logs at EMERG, then raises SIGSEGV to prevent further operation. As far as I know, this can only occur if rebooting fails (or is disabled, which is intended for debugging purposes only).

Loss of access to shared storage is considered fatal by Cluster Manager, which is why it tries to reboot: if it can't reach shared storage, it is incorrect to assume that other tasks can: force failover. In this case, the loss of access was due to the inability to allocate kernel memory to complete the operation. This is consistent with how Cluster Manager would have behaved had it received MAP_FAILED/ENOMEM during the mmap() call in the diskRawRead/diskRawWrite paths: if Cluster Manager cannot allocate memory, it is incorrect to assume other tasks can: force failover.

Note that when the available memory is low enough, the kernel will begin killing tasks (see mm/oom_kill.c). It is better to prevent entering this state than to let it happen and try to clean it up later, since there is no way to tell from userland which task will get killed first.

Created attachment 101120
Allows better recovery in unrecoverable cases.
Here's a patch which will allow more robust "recovery" in certain error cases
in the lock subsystem (such as the one referenced in the logs) and quorum
daemon when reboot is disabled.

Thanks for the patch, Lon. I've actually done something similar, but in diskrawio.c, in the diskrawwrite and read paths; my code only retries 3 times, with 1 second sleeps. This wasn't enough to get over the bumps. I also "statically" allocated the page-aligned buffers for the raw reads and writes by mmapping from /dev/zero on init. We originally saw the ENOMEMs come out of the mmap in the allocate code. The effect was the same. We've been running e40.8 that was pulled from Jason Baron's people site, associated with bug 117460. This has helped immensely with heavy load conditions, but we still fail with ENOMEMs occasionally. I'll apply your patch as soon as I can and let you know what happens.

With the patch, you'll still see ENOMEM; the difference is that the cluster will not do the funny raise(SIGSEGV) when it happens. Also, if it gets ENOMEM while reading the other member's status, it'll just continue as though the other member's state has not changed.

I removed the singly-allocated buffer portion from my patch, figuring that it wasn't directly related to handling the "error recovery paths". (All of this is with "reallyDoShutdown = 0" in diskrawio.c, which isn't how Red Hat ships the package.)

Created attachment 101150
Above patch + add one-time allocation in diskrawio.c
Same as above with one-time-allocated buffers for raw I/O in userland.
Regression tested; does not interfere with normal operation.
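
The one-time allocation idea discussed in this thread (page-aligned buffers obtained once by mapping /dev/zero at init, instead of one mmap() per I/O) could be sketched roughly as follows. This is not the code from the attachment; raw_buf_init, raw_buf_get and raw_buf_size are hypothetical names, and a single buffer is assumed.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical module-level buffer, mapped once and reused for every I/O. */
static void *raw_io_buf;
static size_t raw_io_buf_len;

/*
 * Map at least len bytes from /dev/zero, rounded up to whole pages.
 * mmap() returns page-aligned memory, as raw I/O requires. Calling
 * this again after success is a no-op. Returns 0, or -1 with errno set.
 */
int raw_buf_init(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t psz;
    int fd, saved;
    void *p;

    if (raw_io_buf != NULL)
        return 0; /* already allocated at an earlier init */
    if (page <= 0 || len == 0) {
        errno = EINVAL;
        return -1;
    }
    psz = (size_t)page;
    len = (len + psz - 1) / psz * psz;

    fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return -1;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    saved = errno;
    close(fd); /* the mapping stays valid after the descriptor is closed */
    errno = saved;
    if (p == MAP_FAILED)
        return -1;

    raw_io_buf = p;
    raw_io_buf_len = len;
    return 0;
}

/* Accessors for the shared buffer; NULL/0 until raw_buf_init() succeeds. */
void *raw_buf_get(void) { return raw_io_buf; }
size_t raw_buf_size(void) { return raw_io_buf_len; }
```

The point of the design is that the only allocation that can fail with ENOMEM in userland happens once at startup, when memory pressure is presumably low, rather than on every read or write.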

As far as failover is concerned, it is incorrect to assume that if the system is out of memory that the OOM killer hasn't killed any processes - one of which may be a service. However, this doesn't mean that the allocation during the read/write path in drivers/char/raw.c shouldn't be looked at more closely.

Created attachment 101423
Big retry patch
Patch increases MAX_CONSECUTIVE_IO_ERRORS to 5, and delays between retries when
ENOMEM is encountered. A bit of code was added to ensure errno gets passed
back up the stack properly.
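
A rough sketch of the retry approach the patch description outlines (retry on ENOMEM with a delay, give up after a bounded number of attempts, and hand errno back to the caller intact) is shown below. It is not the patch itself: io_retry_enomem and the io_fn type are hypothetical, the delay is a parameter chosen by the caller, and how the real patch uses MAX_CONSECUTIVE_IO_ERRORS is not shown in this report. Only the value 5 comes from the description.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_CONSECUTIVE_IO_ERRORS 5 /* value quoted in the patch description */

/* Any I/O routine with a read()-style signature. */
typedef ssize_t (*io_fn)(int fd, void *buf, size_t len);

/*
 * Call fn until it succeeds, fails with something other than ENOMEM,
 * or has failed MAX_CONSECUTIVE_IO_ERRORS times. Sleeps delay_secs
 * between ENOMEM retries. On failure returns -1 with errno set to the
 * error of the last attempt.
 */
ssize_t io_retry_enomem(io_fn fn, int fd, void *buf, size_t len,
                        unsigned int delay_secs)
{
    ssize_t ret = -1;
    int saved = 0;
    int attempt;

    for (attempt = 0; attempt < MAX_CONSECUTIVE_IO_ERRORS; attempt++) {
        ret = fn(fd, buf, len);
        if (ret >= 0)
            return ret;
        saved = errno;
        if (saved != ENOMEM)
            break; /* only memory pressure is treated as transient */
        if (attempt + 1 < MAX_CONSECUTIVE_IO_ERRORS)
            sleep(delay_secs);
    }
    errno = saved; /* sleep() may have changed errno; restore it */
    return ret;
}
```

Because read() already has the io_fn signature, a call such as io_retry_enomem(read, fd, buf, len, 1) would retry a read with one-second pauses.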

OK, we've reworked raw I/O in the kernel to use a mempool during raw reads/writes. Basically, we allocate some kiobufs in a pool and, if the kernel gets stuck, wait for them to free up before returning -ENOMEM. This should fix your problem. A kernel with that patch is at: http://people.redhat.com/~jbaron/.private/u5/2.4.9-e.41.12/

Thanks for the patch and the kernel update. We'll be installing the 41.12 kernel today and running it over the weekend. It'll probably be late Monday before we change the cluster manager over to the patched versions.

The 41.12 kernel should obviate the need for the latest patch to clumanager.

Comment on attachment 101423
Big retry patch
41.12 kernel should obviate the need for this patch.

Ok, won't do the patch. Thanks.

If you haven't updated yet, the official U5 beta kernel has just landed, e.42...same place: http://people.redhat.com/~jbaron/.private/u5/ thanks, -Jason

Thanks again for the update. Is there any chance that this kernel contains the fixes in the 40.8 kernel (117902) that 'throttle' kswapd? We've been running 40.8 here with much success and would like these changes in the new kernel.

Yes...although we have tweaked these fixes a bit, we are confident that this beta kernel will address the issues contained in 117902.

Just wanted to let you know that we've been running e42 for a week on 2-4 machines under a number of different conditions and have been VERY pleased with its operation. Kswapd's behavior is much more civilized, and we no longer get ENOMEMs on raw cluster I/O or the 'mysterious' failovers we'd been seeing. We plan on going forward with these fixes. Thanks again for your work.

Ok. You should be aware that: (1) e42 is a beta kernel, the release version might change, and you should not use it in production. (2) The patches for clumanager should more or less be obviated by the kernel improvements, so you should test without them.

We will remove the patches from clumanager and test, although we have added messages to log when they come into play, and we haven't seen any of these messages. We're not really in production. We're developing a clustered version of our product, so the e42 kernel will be used on our development/test platforms. If it crashes, it's not the end of the world.

I noticed that e43 came out today; I checked the source for the kswapd patches, but they weren't there. Do you know if they will make it into an official release?

e.43 is a security kernel...the latest u5 is e.46 at: http://people.redhat.com/~jbaron/.private/u5/

Hi, has JBaron's patch made it into the e.48 kernel? I have checked the kernel RPM spec file and cannot find a reference to the patch. Thanks for verifying this.
Kind regards, Roger Nunn

I just verified that it has not been included in e48 (at least the pieces that I'm aware of). Is there a plan to put this in an 'official' errata release? It solves a host of problems for our application.

e.48 included critical security fixes...the fixes described in this bugzilla will be included in U5, which should ship in a couple of weeks. thanks.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-437.html