Description of problem:
I've reproduced this quite a few times now: I let activator run for many days and it eventually ends up causing an OOM situation. Activator is a test that basically just activates and deactivates volume groups containing clustered volumes (linears, stripes, mirrors) over and over. I'll try running this with 'clvmd -d' and see if that gives us more info.

Iteration 7903 of 10000 started at Mon Aug 24 18:54:14 CDT 2009
Deactivating volume activator1 on grant-02
Deactivating volume activator2 on grant-03
Deactivating volume activator3 on grant-01
Verifying that all volumes are deactivated
Verifying that all volumes no longer exist in device-mapper
Reactivating volume activator1 on grant-01
Reactivating volume activator2 on grant-03
Reactivating volume activator3 on grant-02

Iteration 7904 of 10000 started at Mon Aug 24 18:55:14 CDT 2009
Deactivating volume activator1 on grant-02
Deactivating volume activator2 on grant-01
Deactivating volume activator3 on grant-01
Verifying that all volumes are deactivated
Verifying that all volumes no longer exist in device-mapper
Reactivating volume activator1 on grant-03
Reactivating volume activator2 on grant-02
Reactivating volume activator3 on grant-03
Error locking on node grant-03: Command timed out
Error writing data to clvmd: Broken pipe
Internal error: Volume Group activator3 was not unlocked
Device '/dev/sdc3' has been left open.
Device '/dev/sdc3' has been left open.
Device '/dev/sdc2' has been left open.
Device '/dev/sdc2' has been left open.
Device '/dev/sdc6' has been left open.
Device '/dev/sdc1' has been left open.
Device '/dev/sdc1' has been left open.
Device '/dev/sdc5' has been left open.
Device '/dev/sdc5' has been left open.
Device '/dev/sdc5' has been left open.
Device '/dev/sdc1' has been left open.
Device '/dev/sdc6' has been left open.
Device '/dev/sdc6' has been left open.
Device '/dev/sdc2' has been left open.
Device '/dev/sdc3' has been left open.

Iteration 7905 of 10000 started at Mon Aug 24 18:57:44 CDT 2009
Deactivating volume activator1 on grant-03
connect() failed on local socket: Connection refused
WARNING: Falling back to local file-based locking.
Volume Groups with the clustered attribute will be inaccessible.
Skipping clustered volume group activator1
couldn't deactivate on grant-03

Syslog from grant-03:

Aug 24 18:56:23 grant-03 lvm[12709]: activator3-mirror2 is now in-sync
Aug 24 18:56:25 grant-03 lvm[12709]: Monitoring mirror device activator3-mirror3 for events
Aug 24 18:56:25 grant-03 lvm[12709]: activator3-mirror3 is now in-sync
Aug 24 18:56:28 grant-03 lvm[12709]: Monitoring mirror device activator3-mirror4 for events
Aug 24 18:56:28 grant-03 lvm[12709]: activator3-mirror4 is now in-sync
Aug 24 18:57:53 grant-03 kernel: btimed invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Aug 24 18:57:53 grant-03 kernel:
Aug 24 18:57:53 grant-03 kernel: Call Trace:
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff800c5b10>] out_of_memory+0x8e/0x2f3
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff8000f487>] __alloc_pages+0x245/0x2ce
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff80032768>] read_swap_cache_async+0x45/0xd8
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff8000926e>] __handle_mm_fault+0xadc/0xf99
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff880c4732>] :libata:ata_sff_pause+0x9/0x14
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff880c6081>] :libata:ata_sff_qc_issue+0x11a/0x287
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff880b8bd0>] :libata:ata_qc_issue+0x516/0x58e
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff80066b25>] do_page_fault+0x4cb/0x830
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff880bd0df>] :libata:ata_scsi_translate+0x140/0x16d
Aug 24 18:57:53 grant-03 kernel:  [<ffffffff8008b63f>] dequeue_task+0x18/0x37
Aug 24 18:56:49 grant-03 openais[12425]: [TOTEM] entering GATHER state from 12.
Aug 24 18:57:58 grant-03 kernel:  [<ffffffff80062fc8>] thread_return+0x62/0xfe
Aug 24 18:57:58 grant-03 openais[12425]: [TOTEM] Saving state aru 684320 high seq received 684320

Version-Release number of selected component (if applicable):
2.6.18-162.el5
lvm2-2.02.46-8.el5            BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
cmirror-1.1.39-2.el5          BUILT: Mon Jul 27 15:39:05 CDT 2009
kmod-cmirror-0.1.22-1.el5     BUILT: Mon Jul 27 15:28:46 CDT 2009

How reproducible:
Eventually every time.
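For reference, here is a minimal sketch of the kind of loop activator runs, matching the messages above. The VG and node names come from the test output; the qarsh invocation, the random node selection, and the device-mapper check are assumptions standing in for the real harness code, not the actual test.

#!/bin/bash
# Sketch of the activator loop (assumed reconstruction): repeatedly
# deactivate and reactivate clustered VGs on randomly chosen nodes.
NODES=(grant-01 grant-02 grant-03)
VGS=(activator1 activator2 activator3)

pick_node() { echo "${NODES[RANDOM % ${#NODES[@]}]}"; }

for i in $(seq 1 10000); do
    echo "Iteration $i of 10000 started at $(date)"
    for vg in "${VGS[@]}"; do
        node=$(pick_node)
        echo "Deactivating volume $vg on $node"
        qarsh root@"$node" "vgchange -an $vg" || echo "couldn't deactivate on $node"
    done
    echo "Verifying that all volumes are deactivated"
    echo "Verifying that all volumes no longer exist in device-mapper"
    for node in "${NODES[@]}"; do
        # Any leftover activator* DM device means deactivation failed.
        qarsh root@"$node" "dmsetup ls" | grep -q activator && echo "stale DM devices on $node"
    done
    for vg in "${VGS[@]}"; do
        node=$(pick_node)
        echo "Reactivating volume $vg on $node"
        qarsh root@"$node" "vgchange -ay $vg"
    done
done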
Reproduced this again, but 'clvmd -d' didn't really provide any help on the node that ran OOM.

Iteration 7908 of 10000 started at Mon Aug 31 09:04:52 CDT 2009
Deactivating volume activator1 on grant-03
Didn't receive heartbeat for 120 seconds
couldn't deactivate on grant-03

Aug 31 09:05:05 grant-03 lvm[9756]: activator3-mirror3 is now in-sync
Aug 31 09:05:07 grant-03 lvm[9756]: Monitoring mirror device activator3-mirror4 for events
Aug 31 09:05:07 grant-03 lvm[9756]: activator3-mirror4 is now in-sync
Aug 31 09:05:09 grant-03 lvm[9756]: Monitoring mirror device activator3-mirror5 for events
Aug 31 09:05:09 grant-03 lvm[9756]: activator3-mirror5 is now in-sync
Aug 31 09:05:09 grant-03 xinetd[9183]: START: qarsh pid=11618 from=10.15.80.47
Aug 31 09:05:09 grant-03 qarshd[11618]: Talking to peer 10.15.80.47:46436
Aug 31 09:05:09 grant-03 qarshd[11618]: Running cmdline: vgchange -an activator1
Aug 31 09:05:28 grant-03 openais[9582]: [TOTEM] entering GATHER state from 12.
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] Saving state aru a317 high seq received a317
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] Storing new sequence id for ring 994
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] entering COMMIT state.
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] entering RECOVERY state.
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] position [0] member 10.15.89.151:
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] previous ring seq 2448 rep 10.15.89.151
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] aru a318 high delivered a318 received flag 1
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] position [1] member 10.15.89.152:
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] previous ring seq 2448 rep 10.15.89.151
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] aru a318 high delivered a318 received flag 1
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] position [2] member 10.15.89.153:
Aug 31 09:05:29 grant-03 openais[9582]: [TOTEM] previous ring seq 2448 rep 10.15.89.151
Aug 31 09:05:31 grant-03 openais[9582]: [TOTEM] aru a317 high delivered a317 received flag 0
Aug 31 09:05:31 grant-03 openais[9582]: [TOTEM] Did not need to originate any messages in recovery.
Aug 31 09:05:31 grant-03 openais[9582]: [CLM ] CLM CONFIGURATION CHANGE
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] New Configuration:
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ]     r(0) ip(10.15.89.151)
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ]     r(0) ip(10.15.89.152)
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ]     r(0) ip(10.15.89.153)
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] Members Left:
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] Members Joined:
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] CLM CONFIGURATION CHANGE
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] New Configuration:
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ]     r(0) ip(10.15.89.151)
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ]     r(0) ip(10.15.89.152)
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ]     r(0) ip(10.15.89.153)
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] Members Left:
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] Members Joined:
Aug 31 09:05:32 grant-03 openais[9582]: [SYNC ] This node is within the primary component and will provide service.
Aug 31 09:05:32 grant-03 openais[9582]: [TOTEM] entering OPERATIONAL state.
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] got nodejoin message 10.15.89.153
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] got nodejoin message 10.15.89.151
Aug 31 09:05:32 grant-03 openais[9582]: [CLM ] got nodejoin message 10.15.89.152
Aug 31 09:05:32 grant-03 openais[9582]: [CPG ] got joinlist message from node 1
Aug 31 09:05:32 grant-03 openais[9582]: [CPG ] got joinlist message from node 2
Aug 31 09:05:32 grant-03 openais[9582]: [CPG ] got joinlist message from node 3
Aug 31 09:05:37 grant-03 lvm[9756]: No longer monitoring mirror device activator1-mirror1 for events
Aug 31 09:42:01 grant-03 syslogd 1.4.1: restart.

clvmd -d output:

CLVMD[431d2940]: Aug 31 09:05:37 LVM thread waiting for work
CLVMD[d4f42820]: Aug 31 09:05:37 Read on local socket 10, len = 85
CLVMD[d4f42820]: Aug 31 09:05:37 check_all_clvmds_running
CLVMD[43bd3940]: Aug 31 09:05:37 Got pre command condition...
CLVMD[43bd3940]: Aug 31 09:05:37 Writing status 0 down pipe 12
CLVMD[43bd3940]: Aug 31 09:05:37 Waiting to do post command - state = 0
CLVMD[d4f42820]: Aug 31 09:05:37 read on PIPE 11: 4 bytes: status: 0
CLVMD[d4f42820]: Aug 31 09:05:37 background routine status was 0, sock_client=0x2aac0c6c3e40
CLVMD[d4f42820]: Aug 31 09:05:37 distribute command: XID = 43604
CLVMD[d4f42820]: Aug 31 09:05:37 add_to_lvmqueue: cmd=0x2aac07f4e7d0. client=0x2aac0c6c3e40, msg=0x2aac0c6c3540, len=85, csid=(nil), xid=43604
CLVMD[d4f42820]: Aug 31 09:05:37 Sending message to all cluster nodes
CLVMD[431d2940]: Aug 31 09:05:37 process_work_item: local
CLVMD[431d2940]: Aug 31 09:05:37 process_local_command: LOCK_LV (0x32) msg=0x2aac0c6c3f50, msglen =85, client=0x2aac0c6c3e40
CLVMD[431d2940]: Aug 31 09:05:37 do_lock_lv: resource 'QZThNqybhmuZkq2bWm63sRGQ69G8aiBxvVsVIxwpA4ebewLw6NaVthm3K0xFVGMt', cmd = 0x18 LCK_LV_DEACTIVATE (NULL|LV|NONBLOCK), flags = 0x84 (DMEVENTD_MONITOR )
CLVMD[d4f42820]: Aug 31 09:05:37 Reply from node grant-02: 0 bytes
CLVMD[d4f42820]: Aug 31 09:05:37 Got 1 replies, expecting: 3
CLVMD[d4f42820]: Aug 31 09:05:37 Reply from node grant-01: 0 bytes
CLVMD[d4f42820]: Aug 31 09:05:37 Got 2 replies, expecting: 3
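Since the 'clvmd -d' output above shows nothing unusual on the node that went OOM, another way to confirm the growth is to sample clvmd's resident set size between iterations; a steady climb over thousands of iterations points at the leak even when the debug log is quiet. A minimal sketch (the log path and 60-second interval are arbitrary choices, not part of the original test):

# Sample clvmd's VmRSS once a minute and append it to a log.
while true; do
    pid=$(pidof clvmd)
    [ -n "$pid" ] && echo "$(date '+%b %d %T') $(grep VmRSS /proc/$pid/status)"
    sleep 60
done >> /var/log/clvmd-rss.log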
I am almost sure that if this happens during activation, it is a leak from bug 543146.

*** This bug has been marked as a duplicate of bug 543146 ***