Description of problem:
While running revolver, during a scenario in which more than quorum was shot, the cluster failed to recover. There are five nodes in the cluster. Three nodes were shot. Two of those nodes were fenced a second time after falling victim to the out-of-memory killer.

Test:
Scenario iteration 1.4 started at Mon Dec 1 16:39:56 CST 2008
Scenario: DLM kill Quorum plus one
Those picked to face the revolver... morph-04 tank-01 tank-04

All nodes rejoined the openais configuration correctly at about 16:44:46.

On morph-04, the logs show clvmd connecting at 16:46:56, but the bluetooth init script not starting until 17:27:48. It appears clvmd started up, but the vgchange command hung until 17:27:48, when tank-01 and tank-04 were fenced.

On tank-01 and tank-04, clvmd startup timed out and the system continued with the bluetooth init script. About 45 minutes later, the OOM killer started taking victims.

On tank-01, the OOM killer was invoked by dlm_recv:

dlm_recv invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
 [<c0459cb1>] out_of_memory+0x72/0x1a5
 [<c045b186>] __alloc_pages+0x216/0x297
 [<f8eb1f4e>] dlm_lowcomms_get_buffer+0xdd/0x150 [dlm]
 [<f8eb31ac>] create_rcom+0x29/0xcd [dlm]
 [<c05b148a>] sock_common_recvmsg+0x2f/0x45
 [<f8eb361d>] dlm_receive_rcom+0xe5/0x333 [dlm]
 [<f8eada09>] dlm_receive_buffer+0xe3/0xfe [dlm]
 [<f8eb097f>] dlm_process_incoming_buffer+0xed/0x122 [dlm]
 [<c045acd3>] get_page_from_freelist+0x96/0x333
 [<c045afc7>] __alloc_pages+0x57/0x297
 [<c05b0efc>] kernel_recvmsg+0x2b/0x3a
 [<f8eb260b>] receive_from_sock+0x4ed/0x5f4 [dlm]
 [<c0613d1f>] _spin_lock_bh+0x8/0x18
 [<c05b1a8e>] lock_sock+0x8e/0x96
 [<c0613d1f>] _spin_lock_bh+0x8/0x18
 [<c05b1a8e>] lock_sock+0x8e/0x96
 [<c05da880>] inet_csk_clear_xmit_timers+0x30/0x3e
 [<c04e5cc8>] __next_cpu+0x12/0x21
 [<c041cf63>] find_busiest_group+0x177/0x462
 [<f8eb18d5>] process_recv_sockets+0xe/0x14 [dlm]
 [<c043186a>] run_workqueue+0x78/0xb5
 [<f8eb18c7>] process_recv_sockets+0x0/0x14 [dlm]
 [<c043211e>] worker_thread+0xd9/0x10b
 [<c041e3d7>] default_wake_function+0x0/0xc
 [<c0432045>] worker_thread+0x0/0x10b
 [<c043452d>] kthread+0xc0/0xeb
 [<c043446d>] kthread+0x0/0xeb
 [<c0405c53>] kernel_thread_helper+0x7/0x10
 =======================

The memory summary shows that there is plenty of free high memory, but not enough low memory:

Normal free:3636kB min:3756kB low:4692kB high:5632kB active:0kB inactive:112kB present:901120kB pages_scanned:236578 all_unreclaimable? yes
HighMem free:2081144kB min:512kB low:2832kB high:5156kB active:60800kB inactive:75188kB present:2227712kB pages_scanned:0 all_unreclaimable? no

On tank-04, the OOM killer was invoked by dlm_send:

dlm_send invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
 [<c0459cb1>] out_of_memory+0x72/0x1a5
 [<c045b186>] __alloc_pages+0x216/0x297
 [<c046eee8>] cache_alloc_refill+0x26d/0x450
 [<c046ec71>] kmem_cache_alloc+0x41/0x4b
 [<c05af111>] sock_alloc_inode+0x10/0x4e
 [<c04880c0>] alloc_inode+0x12/0x166
 [<c04889cd>] new_inode+0x14/0x67
 [<c05afed6>] sock_alloc+0xd/0x4e
 [<c05affe0>] __sock_create+0xc9/0x213
 [<c05b0135>] sock_create_kern+0xb/0xe
 [<f8e1fd1e>] tcp_connect_to_sock+0x73/0x1e9 [dlm]
 [<c04e5cc8>] __next_cpu+0x12/0x21
 [<c041cf63>] find_busiest_group+0x177/0x462
 [<f8e201fa>] process_send_sockets+0x1a/0x14b [dlm]
 [<c043186a>] run_workqueue+0x78/0xb5
 [<f8e201e0>] process_send_sockets+0x0/0x14b [dlm]
 [<c043211e>] worker_thread+0xd9/0x10b
 [<c041e3d7>] default_wake_function+0x0/0xc
 [<c0432045>] worker_thread+0x0/0x10b
 [<c043452d>] kthread+0xc0/0xeb
 [<c043446d>] kthread+0x0/0xeb
 [<c0405c53>] kernel_thread_helper+0x7/0x10
 =======================

Again the memory summary shows that we're out of low memory but have plenty of high memory available:

Normal free:3756kB min:3756kB low:4692kB high:5632kB active:0kB inactive:120kB present:901120kB pages_scanned:170158 all_unreclaimable? yes
HighMem free:1034316kB min:512kB low:1740kB high:2968kB active:60616kB inactive:75324kB present:1179136kB pages_scanned:0 all_unreclaimable? no

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.40-7.el5
cman-2.0.97-1.el5
lvm2-2.02.40-6.el5

How reproducible:
Not easily. I've seen this also on large cluster configurations.

Steps to Reproduce:
Reboot a bunch of cluster nodes and restart clvmd at the same time?

Actual results:

Expected results:

Additional info:
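One note on the memory summaries above: gfp_mask=0xd0 is a GFP_KERNEL-style allocation, which on 32-bit x86 must be satisfied from lowmem, so the ~2 GB of free HighMem is not eligible. The OOM decision hinges on the Normal zone sitting below its min watermark. A small sketch of that comparison, parsing the tank-01 zone line pasted in as a literal (field names as they appear in the 2.6.18 OOM report):

```python
import re

# Zone summary line copied from the tank-01 OOM report above.
zone_line = "Normal free:3636kB min:3756kB low:4692kB high:5632kB"

# Pull out the watermark fields (all values are in kB).
fields = dict(re.findall(r"(\w+):(\d+)kB", zone_line))
free_kb, min_kb = int(fields["free"]), int(fields["min"])

# The allocator falls into out_of_memory() when an eligible zone cannot
# be brought back above its min watermark; HighMem is not eligible for a
# gfp_mask=0xd0 request, so its free pages do not help here.
if free_kb < min_kb:
    print(f"Normal zone below min watermark: {free_kb}kB < {min_kb}kB")
```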
It would be good to have dlm debugging messages from all the nodes when this happens. Add <dlm log_debug="1"/> to cluster.conf, and we should then see them on the console and in /var/log/messages.
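In case it helps anyone following along, a trimmed sketch of where that tag sits (cluster name, config_version, and the elided sections are placeholders, not from any real cluster here):

```xml
<!-- Sketch only: the dlm debug tag goes directly under <cluster>.
     Bump config_version before propagating the edited file. -->
<cluster config_version="2" name="example-cluster">
  <dlm log_debug="1"/>
  <!-- clusternodes, fencedevices, rm, etc. as before -->
</cluster>
```

On RHEL5 the updated file is typically pushed to the other nodes with ccs_tool update /etc/cluster/cluster.conf.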
I've been running smack into this while trying to put together a small cluster to demo to the boss. It's not entirely reproducible, but after tearing down and rebuilding the whole thing a few times I've worked out some common threads. It seems to affect only the numerically last node, and some communications failure with node 1 always seems to be involved. When running in two-node mode, node 2 would be the one to have problems. After adding a third node, node 3 is always the one where dlm_send freaks out. Concurrent with that are the following odd occurrences:

[root@node3 ~]# clustat
Cluster Status for test-cluster @ Tue Jan 13 11:55:55 2009
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 node1-cluster                1    Offline
 node2-cluster                2    Online
 node3-cluster                3    Online, Local

...this is inconsistent with node1 and node2, which both report everything as kosher with all members and services (I also note that node3 hasn't gotten as far as firing up rgmanager). There's also this bizarre log entry:

Jan 13 11:46:15 node3 openais[5238]: [CMAN ] Node 1 conflict, remote cluster name='test-clusternode1-bios', local='test-cluster'

...where "node1-bios" refers to the DRAC fence device for node1. I've seen this each time I've run into the dlm_send problem; whether they're related I couldn't tell you. I can say that the most recent iteration of the problem did not crop up until I configured fencing for the nodes (before that, the fence devices were defined but unused, with no problems). It all smacks of some strange configuration-parsing problem to me. I'm curious about the fencing configuration of the original poster.

Per the comment above, I enabled dlm debug logging and, intriguingly enough, got absolutely nothing. I suspect the thing is falling over before getting to any task of substance. I'll go ahead and paste in my cluster.conf in case it provides any insight, with hostnames/IPs/login information sanitized.
<?xml version="1.0"?>
<cluster config_version="85" name="test-cluster">
  <dlm log_debug="1"/>
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node1-cluster" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="node1-bios"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2-cluster" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="node2-bios"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node3-cluster" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="node3-bios"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="999.999.999.999" login="username" name="node1-bios" passwd="********"/>
    <fencedevice agent="fence_drac" ipaddr="999.999.999.999" login="username" name="node2-bios" passwd="********"/>
    <fencedevice agent="fence_drac" ipaddr="999.999.999.999" login="username" name="node3-bios" passwd="********"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="node1-only" ordered="0" restricted="1">
        <failoverdomainnode name="node1-cluster" priority="1"/>
      </failoverdomain>
      <failoverdomain name="node2-only" restricted="1">
        <failoverdomainnode name="node2-cluster" priority="1"/>
      </failoverdomain>
      <failoverdomain name="whole-cluster" ordered="0" restricted="0"/>
      <failoverdomain name="test-nfs-servers" ordered="1" restricted="1">
        <failoverdomainnode name="node1-cluster" priority="1"/>
        <failoverdomainnode name="node2-cluster" priority="2"/>
      </failoverdomain>
      <failoverdomain name="node3-only" ordered="0" restricted="1">
        <failoverdomainnode name="node3-cluster" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <clusterfs device="/dev/mapper/test-lvol0" force_unmount="1" fsid="7496" fstype="gfs" mountpoint="/test" name="testraid" options=""/>
      <nfsexport name="test-nfs-export"/>
      <nfsclient name="test-nfs-clients-128" options="async,rw" path="/test" target="999.999.999.999/255.255.255.254"/>
      <ip address="999.999.999.999" monitor_link="1"/>
      <nfsclient name="test-nfs-clients-246" options="async,rw" path="/test" target="999.999.999.999/255.255.255.240"/>
      <nfsclient name="test-nfs-clients-227" options="async,rw" path="/test" target="999.999.999.999/255.255.255.192"/>
    </resources>
    <service autostart="1" domain="node1-only" name="node1-test-gfs">
      <clusterfs ref="testraid"/>
    </service>
    <service autostart="1" domain="node2-only" name="node2-test-gfs">
      <clusterfs ref="testraid"/>
    </service>
    <service autostart="1" domain="test-nfs-servers" name="test-nfs" recovery="relocate">
      <clusterfs ref="testraid">
        <nfsexport ref="test-nfs-export">
          <nfsclient ref="test-nfs-clients-128"/>
          <ip ref="999.999.999.999"/>
          <nfsclient ref="test-nfs-clients-246"/>
          <nfsclient ref="test-nfs-clients-227"/>
        </nfsexport>
      </clusterfs>
    </service>
  </rm>
</cluster>
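Since a configuration-parsing problem is suspected, one cheap sanity check is confirming the file is well-formed XML and that every <device> referenced under a <clusternode> matches a defined <fencedevice>. A sketch using Python's stdlib parser; the inline config is a trimmed stand-in, and in practice you would parse /etc/cluster/cluster.conf instead:

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for /etc/cluster/cluster.conf; in practice use
# ET.parse("/etc/cluster/cluster.conf") on the real file.
conf = """\
<cluster config_version="85" name="test-cluster">
  <clusternodes>
    <clusternode name="node1-cluster" nodeid="1" votes="1">
      <fence><method name="1"><device name="node1-bios"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_drac" name="node1-bios"/>
  </fencedevices>
</cluster>
"""

root = ET.fromstring(conf)  # raises ParseError if not well-formed

# Cross-check fence device references against the defined devices.
defined = {fd.get("name") for fd in root.iter("fencedevice")}
referenced = {dev.get("name")
              for node in root.iter("clusternode")
              for dev in node.iter("device")}

missing = referenced - defined
print("undefined fence devices:", sorted(missing) or "none")
```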
I believe this has been solved, per the following: https://bugzilla.redhat.com/show_bug.cgi?id=472786#c4 Latest and greatest (cman-2.0.98-1.el5, openais-0.80.3-22.el5) appears to fix my problems.
I think I hit this during 5.4 testing on ppc. I had run cpgx on the cluster, and a few minutes after stopping it one node started spewing oom-killer messages:

[root@doral-p1 cpgx]# dlm_send invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Call Trace:
[C00000007FCDB3A0] [C000000000010434] .show_stack+0x68/0x1b0 (unreliable)
[C00000007FCDB440] [C0000000000C2918] .out_of_memory+0xa8/0x3c0
[C00000007FCDB500] [C0000000000C5E84] .__alloc_pages+0x270/0x33c
[C00000007FCDB5F0] [C0000000000EEC28] .cache_grow+0x198/0x4e0
[C00000007FCDB6C0] [C0000000000EF3A8] .cache_alloc_refill+0x1e4/0x264
[C00000007FCDB770] [C0000000000EFA50] .kmem_cache_alloc+0xac/0xd8
[C00000007FCDB800] [C00000000032E7A4] .sock_alloc_inode+0x28/0x84
[C00000007FCDB880] [C00000000011E1C4] .alloc_inode+0x50/0x1ec
[C00000007FCDB910] [C00000000011F78C] .new_inode+0x30/0xd8
[C00000007FCDB9A0] [C00000000033091C] .sock_alloc+0x28/0x94
[C00000007FCDBA20] [C000000000330AD0] .__sock_create+0x148/0x3a0
[C00000007FCDBAE0] [D000000000CEEB44] .tcp_connect_to_sock+0x98/0x2a0 [dlm]
[C00000007FCDBCA0] [D000000000CEF1D0] .process_send_sockets+0x74/0x2cc [dlm]
[C00000007FCDBD50] [C000000000082448] .run_workqueue+0xdc/0x168
[C00000007FCDBDF0] [C0000000000831B4] .worker_thread+0x12c/0x19c
[C00000007FCDBEE0] [C000000000087CB0] .kthread+0x128/0x178
[C00000007FCDBF90] [C0000000000275CC] .kernel_thread+0x4c/0x68
Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: high 6, batch 1 used:0
cpu 0 cold: high 2, batch 1 used:1
cpu 1 hot: high 6, batch 1 used:3
cpu 1 cold: high 2, batch 1 used:1
Node 0 DMA32 per-cpu: empty
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Free pages: 5760kB (0kB HighMem)
Active:436 inactive:0 dirty:0 writeback:0 unstable:0 free:90 slab:31641 mapped-file:58 mapped-anon:362 pagetables:152
Node 0 DMA free:5760kB min:5760kB low:7168kB high:8640kB active:25856kB inactive:0kB present:2097152kB pages_scanned:148902 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:2048kB low:2048kB high:2048kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB 0*8192kB 0*16384kB = 5760kB
Node 0 DMA32: empty
Node 0 Normal: empty
Node 0 HighMem: empty
436 pagecache pages
Swap cache: add 2469, delete 2107, find 19/48, race 0+3
Free swap  = 3977984kB
Total swap = 4128640kB
Free swap:  3977984kB
32768 pages of RAM
350 reserved pages
899 pages shared
362 pages swap cached
Out of memory: Killed process 2337 (ricci).

Running on kernel 2.6.18-150.el5.
*** This bug has been marked as a duplicate of bug 508829 ***