Bug 474210
| Summary: | oom killer implicates dlm during clvmd startup hang | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Nate Straz <nstraz> |
| Component: | lvm2-cluster | Assignee: | LVM and device-mapper development team <lvm-team> |
| Status: | CLOSED DUPLICATE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.3 | CC: | agk, building, ccaulfie, dwysocha, edamato, fj5075fv, heinzm, jbrassow, mbroz, prockai, teigland |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-07-14 19:32:34 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Nate Straz
2008-12-02 19:58:31 UTC
It would be good to have dlm debugging messages from all the nodes when this happens. Add <dlm log_debug="1"/> to cluster.conf and we should see them on the console and in /var/log/messages.
I've been running smack into this while trying to put together a small cluster to demo to the boss. It's not entirely reproducible, but after tearing down and rebuilding the whole thing a few times I've worked out some common threads.
It seems to affect only the numerically last node, and some communication failure with node 1 always seems to be involved. When running in two-node mode, node 2 would be the one to have problems. After adding a third node, node 3 always seems to be the one where dlm_send freaks out. Concurrent with that are the following odd occurrences:
[root@node3 ~]# clustat
Cluster Status for test-cluster @ Tue Jan 13 11:55:55 2009
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
node1-cluster 1 Offline
node2-cluster 2 Online
node3-cluster 3 Online, Local
...this is inconsistent with node1 and node2, which both report everything as kosher with all members and services (I also note that node3 hasn't gotten as far as firing up rgmanager).
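If it helps anyone comparing notes, a quick way to cross-check the membership view from every node is the cman tooling itself rather than clustat (a sketch assuming the stock RHEL 5 cman_tool; output format may vary slightly by release):
# Run on each node and compare; a node listing a peer as offline while
# that peer considers itself a cluster member points at a membership or
# communication problem below rgmanager.
cman_tool nodes
cman_tool status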
There's also this bizarre log entry:
Jan 13 11:46:15 node3 openais[5238]: [CMAN ] Node 1 conflict, remote cluster name='test-clusternode1-bios', local='test-cluster'
...where "node1-bios" refers to the DRAC fence device for node1. I've seen this each time I've run into the dlm_send problem, whether they're related or not I couldn't tell you. I can say that the most recent iteration of the problem did not crop up until I configured fencing for the nodes (note, the fencing devices were configured but unused, and no problems). It all smacks of some strange configuration parsing problem to me. I am curious as to the fencing configuration of the original poster.
Per the above, I enabled dlm debug logging and, intriguingly enough, got absolutely nothing. I suspect the thing is running away before getting to any task of substance.
I'll go ahead and paste in my cluster.conf in case it provides any insight, with hostnames/IPs/login information sanitized.
<?xml version="1.0"?>
<cluster config_version="85" name="test-cluster">
  <dlm log_debug="1"/>
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node1-cluster" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="node1-bios"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2-cluster" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="node2-bios"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node3-cluster" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="node3-bios"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="999.999.999.999" login="username" name="node1-bios" passwd="********"/>
    <fencedevice agent="fence_drac" ipaddr="999.999.999.999" login="username" name="node2-bios" passwd="********"/>
    <fencedevice agent="fence_drac" ipaddr="999.999.999.999" login="username" name="node3-bios" passwd="********"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="node1-only" ordered="0" restricted="1">
        <failoverdomainnode name="node1-cluster" priority="1"/>
      </failoverdomain>
      <failoverdomain name="node2-only" restricted="1">
        <failoverdomainnode name="node2-cluster" priority="1"/>
      </failoverdomain>
      <failoverdomain name="whole-cluster" ordered="0" restricted="0"/>
      <failoverdomain name="test-nfs-servers" ordered="1" restricted="1">
        <failoverdomainnode name="node1-cluster" priority="1"/>
        <failoverdomainnode name="node2-cluster" priority="2"/>
      </failoverdomain>
      <failoverdomain name="node3-only" ordered="0" restricted="1">
        <failoverdomainnode name="node3-cluster" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <clusterfs device="/dev/mapper/test-lvol0" force_unmount="1" fsid="7496" fstype="gfs" mountpoint="/test" name="testraid" options=""/>
      <nfsexport name="test-nfs-export"/>
      <nfsclient name="test-nfs-clients-128" options="async,rw" path="/test" target="999.999.999.999/255.255.255.254"/>
      <ip address="999.999.999.999" monitor_link="1"/>
      <nfsclient name="test-nfs-clients-246" options="async,rw" path="/test" target="999.999.999.999/255.255.255.240"/>
      <nfsclient name="test-nfs-clients-227" options="async,rw" path="/test" target="999.999.999.999/255.255.255.192"/>
    </resources>
    <service autostart="1" domain="node1-only" name="node1-test-gfs">
      <clusterfs ref="testraid"/>
    </service>
    <service autostart="1" domain="node2-only" name="node2-test-gfs">
      <clusterfs ref="testraid"/>
    </service>
    <service autostart="1" domain="test-nfs-servers" name="test-nfs" recovery="relocate">
      <clusterfs ref="testraid">
        <nfsexport ref="test-nfs-export">
          <nfsclient ref="test-nfs-clients-128"/>
          <ip ref="999.999.999.999"/>
          <nfsclient ref="test-nfs-clients-246"/>
          <nfsclient ref="test-nfs-clients-227"/>
        </nfsexport>
      </clusterfs>
    </service>
  </rm>
</cluster>
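For anyone following along, here is a minimal sketch of how I'd expect a cluster.conf change like the <dlm log_debug="1"/> line to be propagated on RHEL 5, assuming the stock ccs/cman tools (exact steps may differ on other releases):
# Edit /etc/cluster/cluster.conf on one node and bump config_version
# (e.g. 85 -> 86 in this example), then push it to the other nodes via
# ccsd and tell cman about the new version.
ccs_tool update /etc/cluster/cluster.conf
cman_tool version -r 86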
I believe this has been solved, per the following: https://bugzilla.redhat.com/show_bug.cgi?id=472786#c4. The latest and greatest (cman-2.0.98-1.el5, openais-0.80.3-22.el5) appears to fix my problems.
I think I hit this during 5.4 testing on ppc. I had run cpgx on the cluster, and a few minutes after stopping it one node started spewing oom-killer messages:
[root@doral-p1 cpgx]# dlm_send invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Call Trace:
[C00000007FCDB3A0] [C000000000010434] .show_stack+0x68/0x1b0 (unreliable)
[C00000007FCDB440] [C0000000000C2918] .out_of_memory+0xa8/0x3c0
[C00000007FCDB500] [C0000000000C5E84] .__alloc_pages+0x270/0x33c
[C00000007FCDB5F0] [C0000000000EEC28] .cache_grow+0x198/0x4e0
[C00000007FCDB6C0] [C0000000000EF3A8] .cache_alloc_refill+0x1e4/0x264
[C00000007FCDB770] [C0000000000EFA50] .kmem_cache_alloc+0xac/0xd8
[C00000007FCDB800] [C00000000032E7A4] .sock_alloc_inode+0x28/0x84
[C00000007FCDB880] [C00000000011E1C4] .alloc_inode+0x50/0x1ec
[C00000007FCDB910] [C00000000011F78C] .new_inode+0x30/0xd8
[C00000007FCDB9A0] [C00000000033091C] .sock_alloc+0x28/0x94
[C00000007FCDBA20] [C000000000330AD0] .__sock_create+0x148/0x3a0
[C00000007FCDBAE0] [D000000000CEEB44] .tcp_connect_to_sock+0x98/0x2a0 [dlm]
[C00000007FCDBCA0] [D000000000CEF1D0] .process_send_sockets+0x74/0x2cc [dlm]
[C00000007FCDBD50] [C000000000082448] .run_workqueue+0xdc/0x168
[C00000007FCDBDF0] [C0000000000831B4] .worker_thread+0x12c/0x19c
[C00000007FCDBEE0] [C000000000087CB0] .kthread+0x128/0x178
[C00000007FCDBF90] [C0000000000275CC] .kernel_thread+0x4c/0x68
Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: high 6, batch 1 used:0
cpu 0 cold: high 2, batch 1 used:1
cpu 1 hot: high 6, batch 1 used:3
cpu 1 cold: high 2, batch 1 used:1
Node 0 DMA32 per-cpu: empty
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Free pages: 5760kB (0kB HighMem)
Active:436 inactive:0 dirty:0 writeback:0 unstable:0 free:90 slab:31641 mapped-file:58 mapped-anon:362 pagetables:152
Node 0 DMA free:5760kB min:5760kB low:7168kB high:8640kB active:25856kB inactive:0kB present:2097152kB pages_scanned:148902 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:2048kB low:2048kB high:2048kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB 0*8192kB 0*16384kB = 5760kB
Node 0 DMA32: empty
Node 0 Normal: empty
Node 0 HighMem: empty
436 pagecache pages
Swap cache: add 2469, delete 2107, find 19/48, race 0+3
Free swap  = 3977984kB
Total swap = 4128640kB
Free swap:       3977984kB
32768 pages of RAM
350 reserved pages
899 pages shared
362 pages swap cached
Out of memory: Killed process 2337 (ricci).
Running on kernel 2.6.18-150.el5.
*** This bug has been marked as a duplicate of bug 508829 ***
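For anyone who hits the same signature before upgrading, one rough way to catch the runaway socket allocations from dlm_send before the box falls over is to watch the slab caches the trace above is allocating from (a sketch assuming a stock RHEL 5 kernel exposing /proc/slabinfo):
# Steady, unbounded growth of sock_inode_cache (and the TCP cache)
# while clvmd/dlm is starting up matches the tcp_connect_to_sock loop
# in the call trace.
watch -n 5 "grep -E 'sock_inode_cache|^TCP ' /proc/slabinfo"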