Bug 149650

Summary: out of memory kills followed by crash running AS 3 U4 kernel
Product: [Retired] Red Hat Cluster Suite Reporter: Jonathan Woytek <woytek+>
Component: gfs Assignee: Ryan O'Hara <rohara>
Status: CLOSED WORKSFORME QA Contact: GFS Bugs <gfs-bugs>
Severity: high Docs Contact:
Priority: medium    
Version: 3   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-01-12 23:55:02 UTC Type: ---
Attachments:
Description | Flags
quicksilver meminfo (three hours after reboot) | none
quicksilver slabinfo (three hours after reboot) | none
storm meminfo (three hours after reboot) | none
storm slabinfo (three hours after reboot) | none
hazycase meminfo (up for five days) | none
hazycase slabinfo (up for five days) | none

Description Jonathan Woytek 2005-02-24 20:33:45 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; chrome://navigator/locale/navigator.properties; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
I am running RHEL 3 AS U4 (with all current errata) on two Dell PowerEdge 1860 servers, each with dual Xeon processors and 4 GB of memory. I'm running RHCS and RHGFS (built from the current SRPMS from Red Hat) on these two machines, and a third, GFS-only node on an additional PowerEdge 1860. The three GFS nodes run lock_gulmd, with all three acting as lock servers. The two nodes running RHCS provide access to GFS filesystems through NFS and Samba, serving about 40 clients (most of which are Samba clients).

The machines run fine for a random period of time. At some point (usually a few hours after boot), available lowmem drops to between 16 and 64 MB. The value fluctuates in that range for anywhere from a few hours to a few days. Then, at a seemingly random time, syslog starts to report Out of Memory kills of random processes. After about a minute of OOM kills, the machine reboots.
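For anyone trying to watch for the same lowmem drop, here is a minimal sketch of reading the LowFree value out of /proc/meminfo-style text. It assumes a 32-bit highmem kernel that exposes a "LowFree" field; the function names and the 64 MB threshold (taken from the upper end of the range above) are illustrative only and not part of any Red Hat tool.

```python
import re

# Assumption: 64 MB is used only because it is the top of the range
# reported above; pick whatever threshold suits the machine.
LOW_WATERMARK_MB = 64


def parse_meminfo(text):
    """Map field names to values in kB for lines of the form 'Name:  123 kB'."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+):\s+(\d+)\s+kB\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def lowmem_free_mb(text):
    """Return free lowmem in MB, or None if the kernel reports no LowFree field."""
    fields = parse_meminfo(text)
    if "LowFree" not in fields:
        return None
    return fields["LowFree"] / 1024.0


def lowmem_is_low(text, threshold_mb=LOW_WATERMARK_MB):
    """True when LowFree is present and below the threshold."""
    free = lowmem_free_mb(text)
    return free is not None and free < threshold_mb
```

On a live node this could be fed with open("/proc/meminfo").read() from a cron job or a small polling loop, logging a warning whenever lowmem_is_low() returns True.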

I had originally thought that this was the same problem listed in bug 132639, but after applying all updates and errata, the problem still exists.  

Note that I had crashes that would consistently happen about every four hours for a few days.  That became every day for a little while.  Then I had a period of about two weeks of solid uptime (though lowmem was still in the range listed above).  Just recently, we've moved back into a scenario with at least one crash per day.  

Attached are the current /proc/meminfo and /proc/slabinfo from all three nodes. "quicksilver" is the current primary Samba and NFS share server.
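To compare the slabinfo attachments across nodes, a rough sketch like the following can rank slab caches by the memory they hold. It assumes the 2.4-era "version 1.1" column layout (name, active objects, total objects, object size, active slabs, total slabs, pages per slab) and a 4 kB page size on i386; the function names are hypothetical, and the layout should be checked against the actual attachment before trusting the numbers.

```python
PAGE_KB = 4  # assumption: 4 kB pages, as on i386


def slab_usage_kb(text):
    """Estimate kB held by each cache as total_slabs * pages_per_slab * page size."""
    usage = {}
    for line in text.splitlines():
        # Drop any per-CPU statistics that follow a colon.
        parts = line.split(":")[0].split()
        if len(parts) < 7 or not all(p.isdigit() for p in parts[1:7]):
            continue  # header line or unexpected layout
        name = parts[0]
        total_slabs = int(parts[5])
        pages_per_slab = int(parts[6])
        usage[name] = total_slabs * pages_per_slab * PAGE_KB
    return usage


def top_slabs(text, n=10):
    """Return the n largest caches as (name, kB) pairs, largest first."""
    usage = slab_usage_kb(text)
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:n]
```

Running top_slabs() over the slabinfo from each node shows which caches are the largest consumers of slab memory, which is usually the first thing to check when lowmem is nearly exhausted.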

jonathan

Version-Release number of selected component (if applicable):
GFS-6.0.2-25

How reproducible:
Sometimes

Steps to Reproduce:
1.  Generate lots of filesharing activity (including backups).

Actual Results:  Sometimes this will cause the machine to go into an OOM loop and kill itself.  Sometimes it will only drag lowmem down to a few megabytes, then everything will recover when I stop the activity.  

Expected Results:  The machine should run solidly.

Additional info:

Linux quicksilver 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux
Linux storm 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux
Linux hazycase 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux

Comment 1 Jonathan Woytek 2005-02-24 20:37:49 UTC
Created attachment 111396
quicksilver meminfo (three hours after reboot)

Comment 2 Jonathan Woytek 2005-02-24 20:38:26 UTC
Created attachment 111397
quicksilver slabinfo (three hours after reboot)

Comment 3 Jonathan Woytek 2005-02-24 20:38:58 UTC
Created attachment 111398
storm meminfo (three hours after reboot)

Comment 4 Jonathan Woytek 2005-02-24 20:39:28 UTC
Created attachment 111399
storm slabinfo (three hours after reboot)

Comment 5 Jonathan Woytek 2005-02-24 20:40:03 UTC
Created attachment 111400
hazycase meminfo (up for five days)

Comment 6 Jonathan Woytek 2005-02-24 20:40:28 UTC
Created attachment 111401
hazycase slabinfo (up for five days)

Comment 7 Kiersten (Kerri) Anderson 2005-07-18 20:12:30 UTC
Are these problems still occurring with the latest release of software (RHEL3 U5
and GFS 6.0 for U5)?

Comment 8 Ryan O'Hara 2006-01-12 23:55:02 UTC
I am unable to reproduce this bug on RHEL3 U6.

Note that some kernel fixes for OOM kill issues have been made since this bug
was first reported.

Customer also reports that running a "hugemem" kernel alleviates the problem.

Marking as closed since I am unable to reproduce this bug. Please re-open this
bug if the problem persists.


Comment 9 Jonathan Woytek 2006-01-13 02:18:34 UTC
Most of the problem has been alleviated by running with hyperthreading disabled
in the BIOS of all GFS cluster members. iSCSI performance went up in general,
but GFS filesystems had the biggest performance boost (40-50% on average). This
also seems to leave me with tons of lowmem free, so it may be possible to run
the standard kernel. I did have to increase the GFS lock highwater setting to
clear up some additional performance issues.

jonathan