Description of problem:

I was running I/O from all 3 nodes in the cluster (hayes-0[123]) to a simple 50G cmirror with GFS on top. I then reset hayes-02. Recovery on the remaining two nodes appeared fine, as hayes-02 was properly fenced. I even created another cmirror on hayes-03 to make sure nothing was deadlocked. I then brought hayes-02 back into the cluster and started cmirror and clvmd. The activation of the VGs during the clvmd start deadlocked and even resulted in hayes-02 being fenced once again.

Here are the two mirrors:

[root@hayes-03 ~]# lvs -a -o +devices
  LV                VG         Attr   LSize  Origin Snap% Move Log         Copy%  Convert Devices
  LogVol00          VolGroup00 -wi-ao 72.44G                                              /dev/sda2(0)
  LogVol01          VolGroup00 -wi-ao  1.94G                                              /dev/sda2(2318)
  mirror            hayes      Mwi-ao 50.00G                  mirror_mlog 100.00          mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] hayes      iwi-ao 50.00G                                              /dev/etherd/e1.1p1(0)
  [mirror_mimage_1] hayes      iwi-ao 50.00G                                              /dev/etherd/e1.1p2(0)
  [mirror_mlog]     hayes      lwi-ao  4.00M                                              /dev/etherd/e1.1p7(0)
  test              hayes      mwi-a-  5.00G                  test_mlog   100.00          test_mimage_0(0),test_mimage_1(0)
  [test_mimage_0]   hayes      iwi-ao  5.00G                                              /dev/etherd/e1.1p3(0)
  [test_mimage_1]   hayes      iwi-ao  5.00G                                              /dev/etherd/e1.1p4(0)
  [test_mlog]       hayes      lwi-ao  4.00M                                              /dev/etherd/e1.1p2(12800)

I'll attach the messages from each of the nodes during the time of the deadlock/second fence.

Version-Release number of selected component (if applicable):
2.6.18-71.el5
lvm2-2.02.31-1.el5
lvm2-cluster-2.02.31-1.el5
kmod-cmirror-0.1.4-1.el5
cmirror-1.1.7-1.el5
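For reference, a rough sketch of the reproduction flow described above. Node names match this report; the dd load and the sysrq reset are stand-ins for whatever I/O workload and reset method are actually used:

  # On all three nodes, run I/O to the GFS sitting on top of the cmirror
  hayes-0N# dd if=/dev/zero of=/mnt/gfs/load.$(hostname) bs=1M count=10000 &

  # Hard-reset one node; the survivors should fence it and recover
  hayes-02# echo b > /proc/sysrq-trigger

  # After reboot, bring the node back into the cluster
  hayes-02# service cman start
  hayes-02# service cmirror start
  hayes-02# service clvmd start    # "Activating VGs" is where the deadlock hits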
Created attachment 292998 [details] log from hayes-01
Created attachment 292999 [details] log from hayes-02
Created attachment 293000 [details] log from hayes-03
This issue is reproducible.
Reproduced this with the latest packages. It seems as though this may be related to bz 429599 and bz 383291, as I see those messages stream during the activation deadlocks.

[...]
Feb 5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_RESUME]
Feb 5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_RESUME]
Feb 5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: [2][0][0][0][0][0] [75]
Feb 5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Feb 5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_GET_SYNC_COUNT]
Feb 5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: [2][0][0][0][0][0] [76]
Feb 5 14:06:11 grant-01 clogd[3241]: import_checkpoint: 0 checkpoint sections found
Feb 5 14:06:15 grant-01 last message repeated 4 times
[...]
Feb 5 14:18:09 grant-01 clogd[3241]: import_checkpoint: 0 checkpoint sections found
Feb 5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Attempting to contact cluster log server...
Feb 5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_CTR] to server: -3
Feb 5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Attempting to contact cluster log server...
Feb 5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_CTR] to server: -3

cmirror-1.1.9-1.el5
kmod-cmirror-0.1.5-2.el5
-3 means the server has probably died. When you see this, check 'ps -C clogd'... and see if you can find a core dump.
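Roughly, to check (where a core lands depends on kernel.core_pattern and the daemon's working directory, so the paths below are just common defaults):

  # Is the cluster log server still alive?
  ps -C clogd

  # If not, hunt for a core dump
  cat /proc/sys/kernel/core_pattern
  ls -l /core* /var/core* 2>/dev/null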
During the deadlock, clogd is still running on all three nodes.

# After grant-01 was killed...
[root@grant-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:    [deadlock]

Feb 8 09:42:21 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
Feb 8 09:42:25 grant-01 last message repeated 4 times
Feb 8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_RESUME]
Feb 8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_RESUME]
Feb 8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered: [2][0][0][0][0][0] [1]
Feb 8 09:42:26 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
Feb 8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Feb 8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_GET_SYNC_COUNT]
Feb 8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered: [2][0][0][0][0][0] [2]
Feb 8 09:42:27 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
[...]

[root@grant-01 ~]# ps -C clogd
  PID TTY          TIME CMD
 3256 ?        00:00:00 clogd

[root@grant-01 ~]# ps -elf | grep clogd
5 S root      3256     1  0  75   0 -  4498 -      09:42 ?        00:00:00 clogd
0 S root      3389  3355  0  78   0 - 15282 pipe_w 09:46 pts/1    00:00:00 grep clogd
FYI - there is still an issue here in the latest code. Though it didn't deadlock, the activation still failed and resulted in a stuck mirror.

# Revolver output:
[...]
wait for clvm services on all nodes...
activating VGs on grant-01
Error locking on node grant-01: device-mapper: reload ioctl failed: Invalid argument

  mirror   grant   mwi-d-   5.00G   mirror_mlog   0.00

cmirror-1.1.13-1.el5 / kmod-cmirror-0.1.6-1.el5
The trailing comments hint at something different from the leading comments... are you still doing the same thing to produce this bug?

1) I/O from multiple (all) nodes
2) kill one of the machines

Also, if there are "unable to send to cluster" messages, the new logs will contain the reason for the failure... it'd be nice to have those.
I tried this again this afternoon. Same scenario that you described:

1) I/O from multiple (all) nodes
2) kill one of the machines (the cmirror server)

[root@grant-01 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
[root@grant-01 ~]# service cmirror start
Loading clustered mirror log module:                       [  OK  ]
Starting clustered mirror log server:                      [  OK  ]
[root@grant-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:    [DEADLOCK]

grant-01:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ] CLM CONFIGURATION CHANGE
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ] New Configuration:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ]  r(0) ip(10.15.89.151)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ]  r(0) ip(10.15.89.152)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ]  r(0) ip(10.15.89.153)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ] Members Left:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ] Members Joined:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ]  r(0) ip(10.15.89.152)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ]  r(0) ip(10.15.89.153)
Feb 11 17:24:34 grant-01 openais[3196]: [SYNC ] This node is within the primary component and will provide service.
Feb 11 17:24:34 grant-01 openais[3196]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:34 grant-01 openais[3196]: [CMAN ] quorum regained, resuming activity
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ] got nodejoin message 10.15.89.151
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ] got nodejoin message 10.15.89.152
Feb 11 17:24:34 grant-01 openais[3196]: [CLM ] got nodejoin message 10.15.89.153
Feb 11 17:24:34 grant-01 openais[3196]: [CPG ] got joinlist message from node 2
Feb 11 17:24:34 grant-01 openais[3196]: [CPG ] got joinlist message from node 3
Feb 11 17:24:35 grant-01 ccsd[3189]: Initial status:: Quorate
Feb 11 17:24:44 grant-01 kernel: device-mapper: dm-log-clustered: (built Feb 8 2008 11:13:43) installed
Feb 11 17:24:44 grant-01 clogd[3257]: Starting clogd:
Feb 11 17:24:44 grant-01 clogd[3257]: Built: Feb 8 2008 11:09:24
Feb 11 17:24:52 grant-01 kernel: dlm: Using TCP for communications
Feb 11 17:24:52 grant-01 kernel: dlm: connecting to 2
Feb 11 17:24:52 grant-01 kernel: dlm: got connection from 3
Feb 11 17:24:52 grant-01 kernel: dlm: got connection from 2
Feb 11 17:24:53 grant-01 clvmd: Cluster LVM daemon started - connected to CMAN
Feb 11 17:24:54 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:08 grant-01 last message repeated 14 times
Feb 11 17:25:09 grant-01 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_RESUME:2 - retrying
Feb 11 17:25:09 grant-01 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:3 - retrying
Feb 11 17:25:09 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:23 grant-01 last message repeated 14 times
Feb 11 17:25:24 grant-01 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_RESUME:4 - retrying
Feb 11 17:25:24 grant-01 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:5 - retrying
Feb 11 17:25:24 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:38 grant-01 last message repeated 14 times

grant-02:
Feb 11 17:16:09 grant-02 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:20301 - retrying
Feb 11 17:16:35 grant-02 fenced[3238]: grant-01 not a cluster member after 30 sec post_fail_delay
Feb 11 17:16:35 grant-02 fenced[3238]: fencing node "grant-01"
Feb 11 17:16:48 grant-02 fenced[3238]: fence "grant-01" success
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs1.1: jid=0: Trying to acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs3.1: jid=0: Trying to acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Trying to acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs1.1: jid=0: Looking at journal...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs3.1: jid=0: Looking at journal...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Looking at journal...
[...]
Feb 11 17:16:58 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: replays = 2, skips = 0, sames = 0
Feb 11 17:16:59 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Journal replayed in 6s
Feb 11 17:16:59 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Done
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering GATHER state from 11.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Saving state aru 15657 high seq received 15657
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Storing new sequence id for ring 150
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering COMMIT state.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering RECOVERY state.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [0] member 10.15.89.151:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep 10.15.89.151
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 9 high delivered 9 received flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [1] member 10.15.89.152:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep 10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 15657 high delivered 15657 received flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [2] member 10.15.89.153:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep 10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 15657 high delivered 15657 received flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Did not need to originate any messages in recovery.
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] CLM CONFIGURATION CHANGE
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] New Configuration:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ]  r(0) ip(10.15.89.152)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ]  r(0) ip(10.15.89.153)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] Members Left:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] Members Joined:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] CLM CONFIGURATION CHANGE
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] New Configuration:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ]  r(0) ip(10.15.89.151)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ]  r(0) ip(10.15.89.152)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ]  r(0) ip(10.15.89.153)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] Members Left:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] Members Joined:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ]  r(0) ip(10.15.89.151)
Feb 11 17:24:24 grant-02 openais[3222]: [SYNC ] This node is within the primary component and will provide service.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] got nodejoin message 10.15.89.151
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] got nodejoin message 10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [CLM ] got nodejoin message 10.15.89.153
Feb 11 17:24:24 grant-02 openais[3222]: [CPG ] got joinlist message from node 2
Feb 11 17:24:24 grant-02 openais[3222]: [CPG ] got joinlist message from node 3
Feb 11 17:24:41 grant-02 kernel: dlm: connecting to 1
Feb 11 17:24:41 grant-02 kernel: dlm: got connection from 1

grant-03:
Feb 11 17:17:26 grant-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IN_SYNC:20939 - retrying
Feb 11 17:18:21 grant-03 clogd[3276]: cpg_mcast_joined error: 9
Feb 11 17:18:21 grant-03 clogd[3276]: cluster_send failed at: local.c:212 (do_local_work)
Feb 11 17:18:21 grant-03 clogd[3276]: [] Unable to send (null) to cluster: Invalid exchange
Feb 11 17:18:21 grant-03 kernel: device-mapper: dm-log-clustered: Stray request returned: <NULL>, 0
Feb 11 17:18:21 grant-03 clogd[3276]: Bad callback on local/4
Feb 11 17:18:36 grant-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IN_SYNC:22840 - retrying
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering GATHER state from 11.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Saving state aru 15657 high seq received 15657
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Storing new sequence id for ring 150
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering COMMIT state.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering RECOVERY state.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [0] member 10.15.89.151:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep 10.15.89.151
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 9 high delivered 9 received flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [1] member 10.15.89.152:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep 10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 15657 high delivered 15657 received flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [2] member 10.15.89.153:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep 10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 15657 high delivered 15657 received flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Did not need to originate any messages in recovery.
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] CLM CONFIGURATION CHANGE
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] New Configuration:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ]  r(0) ip(10.15.89.152)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ]  r(0) ip(10.15.89.153)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] Members Left:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] Members Joined:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] CLM CONFIGURATION CHANGE
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] New Configuration:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ]  r(0) ip(10.15.89.151)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ]  r(0) ip(10.15.89.152)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ]  r(0) ip(10.15.89.153)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] Members Left:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] Members Joined:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ]  r(0) ip(10.15.89.151)
Feb 11 17:24:51 grant-03 openais[3217]: [SYNC ] This node is within the primary component and will provide service.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] got nodejoin message 10.15.89.151
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] got nodejoin message 10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [CLM ] got nodejoin message 10.15.89.153
Feb 11 17:24:51 grant-03 openais[3217]: [CPG ] got joinlist message from node 2
Feb 11 17:24:51 grant-03 openais[3217]: [CPG ] got joinlist message from node 3
Feb 11 17:25:09 grant-03 kernel: dlm: connecting to 1
Seems to be an AIS checkpoint issue.
May be an AIS checkpoint issue; opening a separate issue on that topic.
If it is an issue with AIS, it may affect posix locking too. You could try (rough sketch below):

1) mount 1 gfs node
2) take out a posix lock
3) mount the rest of the nodes
4) kill one of the nodes (not the first one)
5) bring the node back and mount the file system

I *think* the operations there would be similar... not sure w/o talking to dave though.
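A sketch of those steps, under some assumptions: "hold_fcntl_lock" is a hypothetical helper (any small program that takes and holds an fcntl(F_SETLKW) lock would do; a true POSIX fcntl lock, rather than flock(2), is the interesting case here), and device/mount names are examples:

  # 1) mount GFS on the first node only
  node1# mount -t gfs /dev/VG/LV /mnt/gfs

  # 2) take out a posix lock and keep holding it
  node1# hold_fcntl_lock /mnt/gfs/lockfile &

  # 3) mount the rest of the nodes
  node2# mount -t gfs /dev/VG/LV /mnt/gfs
  node3# mount -t gfs /dev/VG/LV /mnt/gfs

  # 4) kill one of the later nodes (not the first one)
  node3# echo b > /proc/sysrq-trigger

  # 5) bring the node back and remount the file system
  node3# service cman start && mount -t gfs /dev/VG/LV /mnt/gfs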
Reproduced this again while trying the node recovery that we mentioned in the cmirror meeting. I shot taft-03, and played around with starting and stopping the cmirror service.

2.6.18-90.el5
kmod-cmirror-0.1.8-1.el5
openais-0.80.3-15.el5

Apr 23 16:11:13 taft-03 openais[6598]: [SYNC ] This node is within the primary component and will provide service.
Apr 23 16:11:13 taft-03 openais[6598]: [TOTEM] entering OPERATIONAL state.
Apr 23 16:11:13 taft-03 openais[6598]: [CMAN ] quorum regained, resuming activity
Apr 23 16:11:13 taft-03 openais[6598]: [CLM ] got nodejoin message 10.15.89.67
Apr 23 16:11:13 taft-03 openais[6598]: [CLM ] got nodejoin message 10.15.89.68
Apr 23 16:11:13 taft-03 openais[6598]: [CLM ] got nodejoin message 10.15.89.69
Apr 23 16:11:13 taft-03 openais[6598]: [CLM ] got nodejoin message 10.15.89.70
Apr 23 16:11:13 taft-03 openais[6598]: [CPG ] got joinlist message from node 2
Apr 23 16:11:13 taft-03 openais[6598]: [CPG ] got joinlist message from node 4
Apr 23 16:11:13 taft-03 openais[6598]: [CPG ] got joinlist message from node 1
Apr 23 16:11:14 taft-03 ccsd[6592]: Initial status:: Quorate
Apr 23 16:11:24 taft-03 clogd[6655]: Starting clogd:
Apr 23 16:11:24 taft-03 clogd[6655]: Built: Feb 28 2008 14:04:28
Apr 23 16:11:29 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28 2008 14:01:10) removed
Apr 23 16:11:30 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28 2008 14:01:10) installed
Apr 23 16:11:30 taft-03 clogd[6702]: Starting clogd:
Apr 23 16:11:30 taft-03 clogd[6702]: Built: Feb 28 2008 14:04:28
Apr 23 16:11:31 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28 2008 14:01:10) removed
Apr 23 16:11:31 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28 2008 14:01:10) installed
Apr 23 16:11:31 taft-03 clogd[6749]: Starting clogd:
Apr 23 16:11:31 taft-03 clogd[6749]: Built: Feb 28 2008 14:04:28
Apr 23 16:11:32 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28 2008 14:01:10) removed
Apr 23 16:11:33 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28 2008 14:01:10) installed
Apr 23 16:11:33 taft-03 clogd[6796]: Starting clogd:
Apr 23 16:11:33 taft-03 clogd[6796]: Built: Feb 28 2008 14:04:28
Apr 23 16:11:33 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28 2008 14:01:10) removed
Apr 23 16:11:35 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28 2008 14:01:10) installed
Apr 23 16:11:35 taft-03 clogd[6843]: Starting clogd:
Apr 23 16:11:35 taft-03 clogd[6843]: Built: Feb 28 2008 14:04:28
Apr 23 16:11:44 taft-03 kernel: dlm: Using TCP for communications
Apr 23 16:11:44 taft-03 kernel: dlm: connecting to 1
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 2
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 1
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 4
Apr 23 16:11:45 taft-03 clvmd: Cluster LVM daemon started - connected to CMAN
Apr 23 16:11:46 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:46 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:11:46 taft-03 dmeventd[6917]: dmeventd ready for processing.
Apr 23 16:11:46 taft-03 dmeventd[6917]: Monitoring mirror device taft-mirror for events
Apr 23 16:11:47 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:47 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
[...]
Apr 23 16:11:59 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:59 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:12:00 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:12:00 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:12:01 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:3 - retrying
Apr 23 16:12:01 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:4 - retrying
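Roughly what was exercised on taft-03 after it came back; the exact number of stop/start cycles is a guess, but each cycle reloads dm-log-clustered, which matches the repeated removed/installed lines above. (For reference, saCkpt error 27 corresponds to SA_AIS_ERR_NO_SECTIONS in the SAF AIS error codes, consistent with the "0 checkpoint sections found" messages.)

  service cman start
  for i in 1 2 3 4 5; do
      service cmirror stop
      service cmirror start
  done
  service clvmd start    # hangs in "Activating VGs"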
Reassigning to openais.
FYI - reproduced this again w/o I/O. I was doing some lvm ops on the four nodes in the cluster, shot a node, and it deadlocked attempting to activate the volumes.

[...]
May 6 13:21:42 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May 6 13:21:42 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
May 6 13:21:43 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May 6 13:21:43 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
May 6 13:21:44 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May 6 13:21:44 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
[...]
I believe this to be a duplicate of bugzilla #400491.

*** This bug has been marked as a duplicate of 400491 ***
*** This bug has been marked as a duplicate of 400941 ***
Haven't seen any Ckpt failures in a while. Marking verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0074.html