Bug 469888 - oom panic on 72 GB system on boot
Status: CLOSED DUPLICATE of bug 508829
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.2
Hardware: x86_64 Linux
Priority: medium
Severity: high
Target Milestone: rc
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2008-11-04 11:27 EST by Steve Reichard
Modified: 2009-08-06 05:10 EDT
CC List: 4 users

Doc Type: Bug Fix
Last Closed: 2009-08-06 05:10:57 EDT

Attachments: None
Description Steve Reichard 2008-11-04 11:27:14 EST
Description of problem:

Heterogeneous cluster of HP DL580 G5 systems (Intel, 16 cores, 64 GB and AMD, 8 cores, 72 GB)
running RHEL 5.2 AP and recently updated.


A cluster node was fenced for the issue seen in https://bugzilla.redhat.com/show_bug.cgi?id=469874. After booting, the node soon appeared non-responsive: the existing shell hung and new ssh connections failed, but pings to both the public and private interconnects were still answered.

After several minutes the node was power cycled.   Upon boot the following was found in the messages file:

Nov  3 15:24:31 renoir kernel: dlm_send invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0
Nov  3 15:24:31 renoir kernel:
Nov  3 15:24:31 renoir kernel: Call Trace:
Nov  3 15:24:31 renoir kernel:  [<ffffffff802bb708>] out_of_memory+0x8b/0x203
Nov  3 15:24:31 renoir kernel:  [<ffffffff8020f5f2>] __alloc_pages+0x245/0x2ce
Nov  3 15:24:31 renoir kernel:  [<ffffffff8025e80e>] cache_alloc_refill+0x269/0x4ba
Nov  3 15:24:31 renoir kernel:  [<ffffffff8020af77>] kmem_cache_alloc+0x50/0x6d
Nov  3 15:24:31 renoir kernel:  [<ffffffff80245a86>] sk_alloc+0x2e/0xf3
Nov  3 15:24:31 renoir kernel:  [<ffffffff8025c18f>] inet_create+0x137/0x270
Nov  3 15:24:31 renoir kernel:  [<ffffffff8024e8d9>] __sock_create+0x170/0x27c
Nov  3 15:24:31 renoir kernel:  [<ffffffff88513959>] :dlm:process_send_sockets+0x0/0x179
Nov  3 15:24:31 renoir kernel:  [<ffffffff885133df>] :dlm:tcp_connect_to_sock+0x70/0x1de
Nov  3 15:24:32 renoir kernel:  [<ffffffff80264905>] _spin_lock_irq+0x9/0x14
Nov  3 15:24:32 renoir kernel:  [<ffffffff80262ddb>] thread_return+0xb0/0xf7
Nov  3 15:24:32 renoir kernel:  [<ffffffff80260823>] error_exit+0x0/0x6e
Nov  3 15:24:32 renoir kernel:  [<ffffffff88513959>] :dlm:process_send_sockets+0x0/0x179
Nov  3 15:24:32 renoir kernel:  [<ffffffff80260823>] error_exit+0x0/0x6e
Nov  3 15:24:32 renoir kernel:  [<ffffffff88513959>] :dlm:process_send_sockets+0x0/0x179
Nov  3 15:24:32 renoir kernel:  [<ffffffff80260823>] error_exit+0x0/0x6e
Nov  3 15:24:32 renoir kernel:  [<ffffffff88513959>] :dlm:process_send_sockets+0x0/0x179
Nov  3 15:24:32 renoir kernel:  [<ffffffff88513979>] :dlm:process_send_sockets+0x20/0x179
Nov  3 15:24:32 renoir kernel:  [<ffffffff88513959>] :dlm:process_send_sockets+0x0/0x179
Nov  3 15:24:32 renoir kernel:  [<ffffffff8024f021>] run_workqueue+0x94/0xe4
Nov  3 15:24:32 renoir kernel:  [<ffffffff8024b987>] worker_thread+0x0/0x122
Nov  3 15:24:32 renoir kernel:  [<ffffffff8029b8c6>] keventd_create_kthread+0x0/0xc4
Nov  3 15:24:32 renoir kernel:  [<ffffffff8024ba77>] worker_thread+0xf0/0x122
Nov  3 15:24:32 renoir kernel:  [<ffffffff80288d49>] default_wake_function+0x0/0xe
Nov  3 15:24:35 renoir kernel:  [<ffffffff8029b8c6>] keventd_create_kthread+0x0/0xc4
Nov  3 15:24:35 renoir kernel:  [<ffffffff8029b8c6>] keventd_create_kthread+0x0/0xc4
Nov  3 15:24:35 renoir kernel:  [<ffffffff80233a32>] kthread+0xfe/0x132
Nov  3 15:24:35 renoir kernel:  [<ffffffff80260b24>] child_rip+0xa/0x12
Nov  3 15:24:35 renoir kernel:  [<ffffffff8029b8c6>] keventd_create_kthread+0x0/0xc4
Nov  3 15:24:35 renoir kernel:  [<ffffffff80233934>] kthread+0x0/0x132
Nov  3 15:24:35 renoir kernel:  [<ffffffff80260b1a>] child_rip+0x0/0x12

There was no entry in /var/crash upon boot.
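
For orientation, the gfp_mask and order values in the oom-killer line can be decoded as in the sketch below. This is a minimal illustration assuming the 2.6.18-era GFP flag values (it is not code from the affected kernel): 0xd0 corresponds to GFP_KERNEL, and order=1 means the slab refill on the sk_alloc path needed two contiguous pages (8 KiB).

/*
 * Sketch only: decode the values printed in the oom-killer line above.
 * The flag values are assumed from 2.6.18-era include/linux/gfp.h.
 */
#include <stdio.h>

#define __GFP_WAIT 0x10u  /* caller may sleep */
#define __GFP_IO   0x40u  /* caller may start block I/O */
#define __GFP_FS   0x80u  /* caller may recurse into filesystem code */
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)

int main(void)
{
    unsigned int gfp_mask = 0xd0; /* value from the log line */
    unsigned int order = 1;       /* 2^order contiguous pages requested */

    printf("gfp_mask 0x%x %s GFP_KERNEL (0x%x)\n",
           gfp_mask, gfp_mask == GFP_KERNEL ? "matches" : "differs from",
           GFP_KERNEL);
    printf("order %u => %u contiguous 4 KiB pages (%u KiB)\n",
           order, 1u << order, (1u << order) * 4);
    return 0;
}

In other words, even with 72 GB installed, the allocator could not satisfy a small (8 KiB) contiguous request on the dlm_send connect path at that moment, which points at memory exhaustion or fragmentation rather than a shortage of installed RAM.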



Version-Release number of selected component (if applicable):
[root@renoir crash]# cat /etc/redhat-release ;  uname -a
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Linux renoir.lab.bos.redhat.com 2.6.18-92.1.13.el5xen #1 SMP Thu Sep 4 04:07:08
EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@renoir crash]#


How reproducible:

Unable to reproduce at this time.


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 Carlos Moro 2009-03-03 13:51:40 EST
OOM reproduced here as well: a 2-node cluster on x86, just after changing from a manual to an iLO-based fencing method in cluster.conf.

Will try to reproduce and come back with more info. Let me know if you need any other info from me (conf, logs, etc).

Cheers
Comment 2 Christine Caulfield 2009-08-06 05:10:57 EDT
In the absence of any more information, I'm marking this as a duplicate of bug 508829, which is a known (and now fixed) cause of memory exhaustion in the DLM.

Feel free to reopen this bug if the problem reappears with the patched kernel.

*** This bug has been marked as a duplicate of bug 508829 ***
