Bug 149650 - out of memory kills followed by crash running AS 3 U4 kernel
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 3
Hardware: i386 Linux
Priority: medium, Severity: high
Assigned To: Ryan O'Hara
QA Contact: GFS Bugs
Reported: 2005-02-24 15:33 EST by Jonathan Woytek
Modified: 2010-01-11 22:03 EST
Doc Type: Bug Fix
Last Closed: 2006-01-12 18:55:02 EST

Attachments
quicksilver meminfo (three hours after reboot) (756 bytes, text/plain)
2005-02-24 15:37 EST, Jonathan Woytek
quicksilver slabinfo (three hours after reboot) (5.25 KB, text/plain)
2005-02-24 15:38 EST, Jonathan Woytek
storm meminfo (three hours after reboot) (754 bytes, text/plain)
2005-02-24 15:38 EST, Jonathan Woytek
storm slabinfo (three hours after reboot) (4.80 KB, text/plain)
2005-02-24 15:39 EST, Jonathan Woytek
hazycase meminfo (up for five days) (755 bytes, text/plain)
2005-02-24 15:40 EST, Jonathan Woytek
hazycase slabinfo (up for five days) (5.51 KB, text/plain)
2005-02-24 15:40 EST, Jonathan Woytek

Description Jonathan Woytek 2005-02-24 15:33:45 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; chrome://navigator/locale/navigator.properties; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
I am running RHEL 3 AS U4 (with all current errata) on two Dell PowerEdge 1860 servers, each with dual Xeon processors and 4GB of memory.  I'm running RHCS and RHGFS (built from the current SRPMS from Red Hat) on these two machines, and a third, GFS-only node on an additional PowerEdge 1860.  All three GFS nodes run lock_gulmd and act as lock servers.  The two RHCS nodes export GFS filesystems over NFS and Samba to about 40 clients, most of which are Samba clients.

The machines run fine for a random period of time.  At some point (usually a few hours after a boot), lowmem drops to between 16 and 64MB available.  That value will fluctuate in that range for a few hours to a few days.  At a seemingly random time, syslog will start to report Out of Memory process kills on random processes.  After about a minute of OOM kills, the machine will reboot.
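The lowmem figure described above can be watched with a small script. The following is only a minimal sketch, assuming a 32-bit highmem kernel whose /proc/meminfo exposes a LowFree field in kB; the function names, the 64MB threshold, and the SAMPLE values are made up for illustration and are not taken from the attached files.

```python
import re

# Matches lines such as "LowFree:   32768 kB". The byte-count summary
# lines at the top of a 2.4-era /proc/meminfo carry no "kB" and are skipped.
MEMINFO_LINE = re.compile(r"^(\w+):\s+(\d+)\s+kB\s*$")


def parse_meminfo(text):
    """Return a dict mapping field name to its value in kB."""
    fields = {}
    for line in text.splitlines():
        match = MEMINFO_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def lowmem_status(text, warn_below_mb=64):
    """Return (free lowmem in MB, True if below threshold), or None.

    None means the kernel did not report LowFree at all (assumption:
    this happens on kernels without highmem support).
    """
    fields = parse_meminfo(text)
    if "LowFree" not in fields:
        return None
    free_mb = fields["LowFree"] / 1024.0
    return free_mb, free_mb < warn_below_mb


# Hypothetical sample, NOT the contents of the attached meminfo files.
SAMPLE = """\
MemTotal:      4000000 kB
LowTotal:       800000 kB
LowFree:         32768 kB
"""

print(lowmem_status(SAMPLE))
```

On a live node the same function could be fed the contents of /proc/meminfo from a cron job, logging a warning whenever the second element of the result is true.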

I had originally thought that this was the same problem listed in bug 132639, but after applying all updates and errata, the problem still exists.  

Note that I had crashes that would consistently happen about every four hours for a few days.  That became every day for a little while.  Then I had a period of about two weeks of solid uptime (though lowmem was still in the range listed above).  Just recently, we've moved back into a scenario with at least one crash per day.  
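To put numbers on a crash pattern like this one, the OOM kill messages in syslog can be counted per day. This is a rough sketch, assuming traditional syslog lines that begin with "Mon DD HH:MM:SS" and that the kernel message contains the phrase "Out of Memory"; the exact message wording, the host name, and the sample lines are assumptions for illustration, not log excerpts from these machines.

```python
import re
from collections import Counter

# Assumption: the kernel's OOM kill message contains this phrase.
OOM_PATTERN = re.compile(r"out of memory", re.IGNORECASE)


def oom_kills_per_day(lines):
    """Count OOM kill messages, keyed by the "Mon DD" syslog date prefix."""
    counts = Counter()
    for line in lines:
        if OOM_PATTERN.search(line):
            day = " ".join(line.split()[:2])
            counts[day] += 1
    return counts


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    "Feb 24 03:10:01 nodeA kernel: Out of Memory: Killed process 1234 (smbd).",
    "Feb 24 03:10:07 nodeA kernel: Out of Memory: Killed process 1240 (nfsd).",
    "Feb 25 11:42:30 nodeA kernel: Out of Memory: Killed process 2001 (smbd).",
    "Feb 25 11:43:00 nodeA sshd[300]: session opened",
]

for day, count in sorted(oom_kills_per_day(SAMPLE_LOG).items()):
    print(day, count)
```

Passing an open log file instead of SAMPLE_LOG works the same way, since the function only iterates over lines.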

Find attached current /proc/meminfo and /proc/slabinfo from all three nodes.  "quicksilver" is the current primary Samba and NFS share server.
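When comparing slab dumps like these across nodes, it helps to rank caches by approximate size. The sketch below assumes each data line starts with the cache name followed by active objects, total objects, and object size in bytes; it estimates memory as total objects times object size, which ignores per-slab overhead. The function name and the sample numbers are made up for illustration and do not come from the attachments.

```python
def top_slab_caches(text, limit=5):
    """Return up to `limit` (cache name, estimated KB) pairs, largest first."""
    caches = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0].startswith("#"):
            continue
        try:
            # Columns 2-4: active objects, total objects, object size (bytes).
            _active, num_objs, objsize = (int(p) for p in parts[1:4])
        except ValueError:
            continue  # header such as "slabinfo - version: ..."
        caches.append((parts[0], num_objs * objsize / 1024.0))
    caches.sort(key=lambda cache: cache[1], reverse=True)
    return caches[:limit]


# Hypothetical sample values, NOT taken from the attached slabinfo files.
SAMPLE_SLABINFO = """\
slabinfo - version: 1.1 (SMP)
dentry_cache   500  1000  128  34  34  1
inode_cache   1000  2000  512 250 250  1
"""

for name, kb in top_slab_caches(SAMPLE_SLABINFO):
    print(name, kb)
```

Running this against dumps taken at different uptimes would show which caches grow as lowmem shrinks.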

jonathan

Version-Release number of selected component (if applicable):
GFS-6.0.2-25

How reproducible:
Sometimes

Steps to Reproduce:
1.  Generate lots of filesharing activity (including backups).

Actual Results:  Sometimes this will cause the machine to go into an OOM loop and kill itself.  Sometimes it will only drag lowmem down to a few megabytes, then everything will recover when I stop the activity.  

Expected Results:  The machine should run solidly.

Additional info:

Linux quicksilver 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux
Linux storm 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux
Linux hazycase 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux
Comment 1 Jonathan Woytek 2005-02-24 15:37:49 EST
Created attachment 111396
quicksilver meminfo (three hours after reboot)
Comment 2 Jonathan Woytek 2005-02-24 15:38:26 EST
Created attachment 111397
quicksilver slabinfo (three hours after reboot)
Comment 3 Jonathan Woytek 2005-02-24 15:38:58 EST
Created attachment 111398
storm meminfo (three hours after reboot)
Comment 4 Jonathan Woytek 2005-02-24 15:39:28 EST
Created attachment 111399
storm slabinfo (three hours after reboot)
Comment 5 Jonathan Woytek 2005-02-24 15:40:03 EST
Created attachment 111400
hazycase meminfo (up for five days)
Comment 6 Jonathan Woytek 2005-02-24 15:40:28 EST
Created attachment 111401
hazycase slabinfo (up for five days)
Comment 7 Kiersten (Kerri) Anderson 2005-07-18 16:12:30 EDT
Are these problems still occurring with the latest release of software - RHEL3 U5
and GFS 6.0 for U5?
Comment 8 Ryan O'Hara 2006-01-12 18:55:02 EST
I am unable to reproduce this bug on RHEL3 U6.

Note that some kernel fixes for OOM kill issues have been made since this bug
was first reported.

Customer also reports that running a "hugemem" kernel alleviates the problem.

Marking as closed since I am unable to reproduce this bug. Please re-open this
bug if the problem persists.
Comment 9 Jonathan Woytek 2006-01-12 21:18:34 EST
Most of the problem has been alleviated by running with hyperthreading disabled
in the BIOS of all GFS cluster members.  iSCSI performance went up in general,
but GFS filesystems had the biggest performance boost (40-50% on average).  This
also seems to leave me with tons of lowmem free, so it may be possible to run
the standard kernel.  I did have to increase the GFS lock highwater setting to
clear up some additional performance issues.

jonathan
