Bug 430296 - RHEL5 cmirror tracker: recovered cluster node deadlocks during volume activation
Summary: RHEL5 cmirror tracker: recovered cluster node deadlocks during volume activation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.2
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 432877
Blocks: 430797 509886
 
Reported: 2008-01-25 20:46 UTC by Corey Marthaler
Modified: 2016-04-26 16:12 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:46:41 UTC
Target Upstream Version:
Embargoed:


Attachments
  log from hayes-01 (27.32 KB, text/plain), 2008-01-25 21:12 UTC, Corey Marthaler
  log from hayes-02 (47.76 KB, text/plain), 2008-01-25 21:13 UTC, Corey Marthaler
  log from hayes-03 (49.84 KB, text/plain), 2008-01-25 21:13 UTC, Corey Marthaler


Links
  Red Hat Product Errata RHBA-2009:0074 (normal, SHIPPED_LIVE): openais bug-fix update, last updated 2009-01-20 16:04:11 UTC

Description Corey Marthaler 2008-01-25 20:46:31 UTC
Description of problem:
I was running I/O from all 3 nodes in the cluster (hayes-0[123]) to a simple 50G
cmirror with GFS on top. I then reset hayes-02. Recovery on the remaining two
nodes appeared fine as hayes-02 was properly fenced. I even created another
cmirror on hayes-03 to make sure nothing was deadlocked. 

I then brought hayes-02 back into the cluster and started cmirror and clvmd. The
VG activation during clvmd startup deadlocked, and hayes-02 ended up being
fenced once again.
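
(For reference, the recovery sequence run on the rebooted node was essentially
the following. This is only a sketch: the clvmd init script does the activation
itself, and "hayes" is the VG shown in the lvs output below.)

service cman start        # rejoin cluster membership (openais/cman)
service cmirror start     # load dm-log-clustered and start clogd
service clvmd start       # clvmd startup; "Activating VGs" is where the hang occurs
vgchange -ay hayes        # roughly what the init script's activation step does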

Here are the two mirrors:
[root@hayes-03 ~]# lvs -a -o +devices
  LV                VG         Attr   LSize  Origin Snap%  Move Log         Copy%  Convert Devices
  LogVol00          VolGroup00 -wi-ao 72.44G                                               /dev/sda2(0)
  LogVol01          VolGroup00 -wi-ao  1.94G                                               /dev/sda2(2318)
  mirror            hayes      Mwi-ao 50.00G                    mirror_mlog 100.00         mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] hayes      iwi-ao 50.00G                                               /dev/etherd/e1.1p1(0)
  [mirror_mimage_1] hayes      iwi-ao 50.00G                                               /dev/etherd/e1.1p2(0)
  [mirror_mlog]     hayes      lwi-ao  4.00M                                               /dev/etherd/e1.1p7(0)
  test              hayes      mwi-a-  5.00G                    test_mlog   100.00         test_mimage_0(0),test_mimage_1(0)
  [test_mimage_0]   hayes      iwi-ao  5.00G                                               /dev/etherd/e1.1p3(0)
  [test_mimage_1]   hayes      iwi-ao  5.00G                                               /dev/etherd/e1.1p4(0)
  [test_mlog]       hayes      lwi-ao  4.00M                                               /dev/etherd/e1.1p2(12800)

I'll attach the messages from each of the nodes during the time of the
deadlock/second fence.

Version-Release number of selected component (if applicable):
2.6.18-71.el5
lvm2-2.02.31-1.el5
lvm2-cluster-2.02.31-1.el5
kmod-cmirror-0.1.4-1.el5
cmirror-1.1.7-1.el5

Comment 1 Corey Marthaler 2008-01-25 21:12:33 UTC
Created attachment 292998 [details]
log from hayes-01

Comment 2 Corey Marthaler 2008-01-25 21:13:06 UTC
Created attachment 292999 [details]
log from hayes-02

Comment 3 Corey Marthaler 2008-01-25 21:13:36 UTC
Created attachment 293000 [details]
log from hayes-03

Comment 4 Corey Marthaler 2008-01-25 21:55:17 UTC
This issue is reproducible.

Comment 6 Corey Marthaler 2008-02-05 20:23:04 UTC
Reproduced this with the latest packages. It seems as though this may be related to bz
429599 and bz 383291, as I see those messages stream during the activation deadlocks.

[...]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_RESUME]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_RESUME]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered:
[2][0][0][0][0][0] [75]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_GET_SYNC_COUNT]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered:
[2][0][0][0][0][0] [76]
Feb  5 14:06:11 grant-01 clogd[3241]: import_checkpoint: 0 checkpoint sections found
Feb  5 14:06:15 grant-01 last message repeated 4 times

[...]

Feb  5 14:18:09 grant-01 clogd[3241]: import_checkpoint: 0 checkpoint sections found
Feb  5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Attempting to
contact cluster log server...
Feb  5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Unable to send
cluster log request [DM_CLOG_CTR] to server: -3
Feb  5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Attempting to
contact cluster log server...
Feb  5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Unable to send
cluster log request [DM_CLOG_CTR] to server: -3


cmirror-1.1.9-1.el5
kmod-cmirror-0.1.5-2.el5

Comment 7 Jonathan Earl Brassow 2008-02-06 20:49:29 UTC
-3 means the server has probably died.

When you see this, check 'ps -C clogd'... and see if you can find a core dump.
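
(A quick way to check both of those, sketched below; the core file location
depends on ulimit/core_pattern settings, so the paths are only common defaults.)

ps -C clogd -o pid,etime,cmd           # is the daemon still running?
grep -i clogd /var/log/messages | grep -i segfault   # was a crash logged?
cat /proc/sys/kernel/core_pattern      # where core files would land on this host
ls -l /core* /var/core* 2>/dev/null    # common default locations for a core dump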


Comment 8 Corey Marthaler 2008-02-08 15:48:38 UTC
During the deadlock, clogd is still running on all three nodes.

# After grant-01 was killed...
[root@grant-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                   
[deadlock]

Feb  8 09:42:21 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
Feb  8 09:42:25 grant-01 last message repeated 4 times
Feb  8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_RESUME]
Feb  8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_RESUME]
Feb  8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered:
[2][0][0][0][0][0] [1]
Feb  8 09:42:26 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
Feb  8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Feb  8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_GET_SYNC_COUNT]
Feb  8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered:
[2][0][0][0][0][0] [2]
Feb  8 09:42:27 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
[...]


[root@grant-01 ~]# ps -C clogd
  PID TTY          TIME CMD
 3256 ?        00:00:00 clogd
[root@grant-01 ~]# ps -elf | grep clogd
5 S root      3256     1  0  75   0 -  4498 -      09:42 ?        00:00:00 clogd
0 S root      3389  3355  0  78   0 - 15282 pipe_w 09:46 pts/1    00:00:00 grep
clogd


Comment 9 Corey Marthaler 2008-02-08 17:10:19 UTC
FYI - there is still an issue here in the latest code. Though it didn't
deadlock, the activation still failed and resulted in a stuck mirror.

# Revolver output:
[...]
wait for clvm services on all nodes...
activating VGs on grant-01
  Error locking on node grant-01: device-mapper: reload ioctl failed: Invalid
argument

  mirror   grant      mwi-d-  5.00G                    mirror_mlog   0.00

cmirror-1.1.13-1.el5/kmod-cmirror-0.1.6-1.el5

Comment 11 Jonathan Earl Brassow 2008-02-11 21:46:41 UTC
The trailing comments hint at something different from the leading comments...
are you still doing the same thing to produce this bug?

1) I/O from multiple (all) nodes
2) kill one of the machines

Also, if there are "unable to send to cluster" messages, the new logs will
contain the reason for the failure... it'd be nice to have those.


Comment 12 Corey Marthaler 2008-02-11 23:50:21 UTC
I tried this again this afternoon. Same scenario that you described:

1) I/O from multiple (all) nodes
2) kill one of the machines (the cmirror server)


[root@grant-01 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
[root@grant-01 ~]# service cmirror start
Loading clustered mirror log module:                       [  OK  ]
Starting clustered mirror log server:                      [  OK  ]
[root@grant-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                                           
[DEADLOCK]


grant-01:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] New Configuration:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] Members Left:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] Members Joined:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:34 grant-01 openais[3196]: [SYNC ] This node is within the primary
component and will provide service.
Feb 11 17:24:34 grant-01 openais[3196]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:34 grant-01 openais[3196]: [CMAN ] quorum regained, resuming activity
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] got nodejoin message 10.15.89.151
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] got nodejoin message 10.15.89.152
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] got nodejoin message 10.15.89.153
Feb 11 17:24:34 grant-01 openais[3196]: [CPG  ] got joinlist message from node 2
Feb 11 17:24:34 grant-01 openais[3196]: [CPG  ] got joinlist message from node 3
Feb 11 17:24:35 grant-01 ccsd[3189]: Initial status:: Quorate
Feb 11 17:24:44 grant-01 kernel: device-mapper: dm-log-clustered: (built Feb  8
2008 11:13:43) installed
Feb 11 17:24:44 grant-01 clogd[3257]: Starting clogd:
Feb 11 17:24:44 grant-01 clogd[3257]:  Built: Feb  8 2008 11:09:24
Feb 11 17:24:52 grant-01 kernel: dlm: Using TCP for communications
Feb 11 17:24:52 grant-01 kernel: dlm: connecting to 2
Feb 11 17:24:52 grant-01 kernel: dlm: got connection from 3
Feb 11 17:24:52 grant-01 kernel: dlm: got connection from 2
Feb 11 17:24:53 grant-01 clvmd: Cluster LVM daemon started - connected to CMAN
Feb 11 17:24:54 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:08 grant-01 last message repeated 14 times
Feb 11 17:25:09 grant-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_RESUME:2 - retrying
Feb 11 17:25:09 grant-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:3 - retrying
Feb 11 17:25:09 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:23 grant-01 last message repeated 14 times
Feb 11 17:25:24 grant-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_RESUME:4 - retrying
Feb 11 17:25:24 grant-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:5 - retrying
Feb 11 17:25:24 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:38 grant-01 last message repeated 14 times


grant-02:
Feb 11 17:16:09 grant-02 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:20301 - retrying
Feb 11 17:16:35 grant-02 fenced[3238]: grant-01 not a cluster member after 30
sec post_fail_delay
Feb 11 17:16:35 grant-02 fenced[3238]: fencing node "grant-01"
Feb 11 17:16:48 grant-02 fenced[3238]: fence "grant-01" success
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs1.1: jid=0: Trying to
acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs3.1: jid=0: Trying to
acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Trying to
acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs1.1: jid=0: Looking at
journal...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs3.1: jid=0: Looking at
journal...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Looking at
journal...
[...]
Feb 11 17:16:58 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: replays = 2,
skips = 0, sames = 0
Feb 11 17:16:59 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Journal replayed
in 6s
Feb 11 17:16:59 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Done
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering GATHER state from 11.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Saving state aru 15657 high seq
received 15657
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Storing new sequence id for ring 150
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering COMMIT state.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering RECOVERY state.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [0] member 10.15.89.151:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep
10.15.89.151
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 9 high delivered 9 received
flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [1] member 10.15.89.152:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep
10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 15657 high delivered 15657
received flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [2] member 10.15.89.153:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep
10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 15657 high delivered 15657
received flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Did not need to originate any
messages in recovery.
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] New Configuration:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] Members Left:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] Members Joined:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] New Configuration:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] Members Left:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] Members Joined:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:24 grant-02 openais[3222]: [SYNC ] This node is within the primary
component and will provide service.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] got nodejoin message 10.15.89.151
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] got nodejoin message 10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] got nodejoin message 10.15.89.153
Feb 11 17:24:24 grant-02 openais[3222]: [CPG  ] got joinlist message from node 2
Feb 11 17:24:24 grant-02 openais[3222]: [CPG  ] got joinlist message from node 3
Feb 11 17:24:41 grant-02 kernel: dlm: connecting to 1
Feb 11 17:24:41 grant-02 kernel: dlm: got connection from 1



grant-03:
Feb 11 17:17:26 grant-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IN_SYNC:20939 - retrying
Feb 11 17:18:21 grant-03 clogd[3276]: cpg_mcast_joined error: 9
Feb 11 17:18:21 grant-03 clogd[3276]: cluster_send failed at: local.c:212
(do_local_work)
Feb 11 17:18:21 grant-03 clogd[3276]: [] Unable to send (null) to cluster:
Invalid exchange
Feb 11 17:18:21 grant-03 kernel: device-mapper: dm-log-clustered: Stray request
returned: <NULL>, 0
Feb 11 17:18:21 grant-03 clogd[3276]: Bad callback on local/4
Feb 11 17:18:36 grant-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IN_SYNC:22840 - retrying
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering GATHER state from 11.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Saving state aru 15657 high seq
received 15657
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Storing new sequence id for ring 150
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering COMMIT state.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering RECOVERY state.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [0] member 10.15.89.151:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep
10.15.89.151
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 9 high delivered 9 received
flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [1] member 10.15.89.152:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep
10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 15657 high delivered 15657
received flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [2] member 10.15.89.153:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep
10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 15657 high delivered 15657
received flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Did not need to originate any
messages in recovery.
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] New Configuration:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] Members Left:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] Members Joined:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] New Configuration:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] Members Left:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] Members Joined:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:51 grant-03 openais[3217]: [SYNC ] This node is within the primary
component and will provide service.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] got nodejoin message 10.15.89.151
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] got nodejoin message 10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] got nodejoin message 10.15.89.153
Feb 11 17:24:51 grant-03 openais[3217]: [CPG  ] got joinlist message from node 2
Feb 11 17:24:51 grant-03 openais[3217]: [CPG  ] got joinlist message from node 3
Feb 11 17:25:09 grant-03 kernel: dlm: connecting to 1


Comment 13 Jonathan Earl Brassow 2008-02-14 21:38:44 UTC
This seems to be an AIS checkpoint issue.


Comment 14 Steven Dake 2008-02-14 21:42:12 UTC
This may be an AIS checkpoint issue; opening a separate issue on that topic.

Comment 15 Jonathan Earl Brassow 2008-02-15 16:04:41 UTC
If it is an issue with AIS, it may affect posix locking too.  You could try:

1) mount 1 gfs node
2) take out posix lock
3) mount rest of nodes
4) kill one of the nodes (not the first one)
5) bring node back and mount the file system

I *think* the operations there would be similar... not sure w/o talking to dave
though.
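
(Sketching those steps as commands; "plock_hold" is a hypothetical helper that
takes an fcntl()/POSIX lock on a file and sleeps, and the device/mount point
names are placeholders.)

# node A: mount GFS and hold a POSIX lock
mount -t gfs /dev/vg/lv /mnt/gfs        # placeholder device and mount point
./plock_hold /mnt/gfs/lockfile &        # hypothetical helper: fcntl(F_SETLKW), then sleep

# remaining nodes: mount the same file system
mount -t gfs /dev/vg/lv /mnt/gfs

# on one of the non-lock-holding nodes: crash it so it gets fenced
echo b > /proc/sysrq-trigger

# bring that node back, rejoin the cluster, and remount
service cman start
mount -t gfs /dev/vg/lv /mnt/gfs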

Comment 16 Corey Marthaler 2008-04-24 14:46:17 UTC
Reproduced this again while trying the node recovery that we mentioned in the
cmirror meeting. I shot taft-03 and played around with starting and stopping the
cmirror service.

2.6.18-90.el5
kmod-cmirror-0.1.8-1.el5
openais-0.80.3-15.el5

Apr 23 16:11:13 taft-03 openais[6598]: [SYNC ] This node is within the primary
component and will provide service.
Apr 23 16:11:13 taft-03 openais[6598]: [TOTEM] entering OPERATIONAL state.
Apr 23 16:11:13 taft-03 openais[6598]: [CMAN ] quorum regained, resuming activity
Apr 23 16:11:13 taft-03 openais[6598]: [CLM  ] got nodejoin message 10.15.89.67
Apr 23 16:11:13 taft-03 openais[6598]: [CLM  ] got nodejoin message 10.15.89.68
Apr 23 16:11:13 taft-03 openais[6598]: [CLM  ] got nodejoin message 10.15.89.69
Apr 23 16:11:13 taft-03 openais[6598]: [CLM  ] got nodejoin message 10.15.89.70
Apr 23 16:11:13 taft-03 openais[6598]: [CPG  ] got joinlist message from node 2
Apr 23 16:11:13 taft-03 openais[6598]: [CPG  ] got joinlist message from node 4
Apr 23 16:11:13 taft-03 openais[6598]: [CPG  ] got joinlist message from node 1
Apr 23 16:11:14 taft-03 ccsd[6592]: Initial status:: Quorate
Apr 23 16:11:24 taft-03 clogd[6655]: Starting clogd:
Apr 23 16:11:24 taft-03 clogd[6655]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:29 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) removed
Apr 23 16:11:30 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) installed
Apr 23 16:11:30 taft-03 clogd[6702]: Starting clogd:
Apr 23 16:11:30 taft-03 clogd[6702]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:31 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) removed
Apr 23 16:11:31 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) installed
Apr 23 16:11:31 taft-03 clogd[6749]: Starting clogd:
Apr 23 16:11:31 taft-03 clogd[6749]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:32 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) removed
Apr 23 16:11:33 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) installed
Apr 23 16:11:33 taft-03 clogd[6796]: Starting clogd:
Apr 23 16:11:33 taft-03 clogd[6796]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:33 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) removed
Apr 23 16:11:35 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) installed
Apr 23 16:11:35 taft-03 clogd[6843]: Starting clogd:
Apr 23 16:11:35 taft-03 clogd[6843]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:44 taft-03 kernel: dlm: Using TCP for communications
Apr 23 16:11:44 taft-03 kernel: dlm: connecting to 1
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 2
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 1
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 4
Apr 23 16:11:45 taft-03 clvmd: Cluster LVM daemon started - connected to CMAN
Apr 23 16:11:46 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:46 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:11:46 taft-03 dmeventd[6917]: dmeventd ready for processing.
Apr 23 16:11:46 taft-03 dmeventd[6917]: Monitoring mirror device taft-mirror for
events
Apr 23 16:11:47 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:47 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
[...]
Apr 23 16:11:59 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:59 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:12:00 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:12:00 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:12:01 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:3 - retrying
Apr 23 16:12:01 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:4 - retrying


Comment 17 Corey Marthaler 2008-04-24 14:48:02 UTC
Reassigning to openais.

Comment 18 Corey Marthaler 2008-05-06 18:24:50 UTC
FYI - reproduced this again without I/O. I was doing some LVM operations on the four
nodes in the cluster, shot a node, and it deadlocked attempting to activate the volumes.


[...]
May  6 13:21:42 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May  6 13:21:42 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
May  6 13:21:43 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May  6 13:21:43 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
May  6 13:21:44 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May  6 13:21:44 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
[...]

Comment 19 Steven Dake 2008-05-18 21:06:55 UTC
I believe this to be a duplicate of bugzilla #400491.

*** This bug has been marked as a duplicate of 400491 ***

Comment 20 Corey Marthaler 2008-05-19 15:34:28 UTC

*** This bug has been marked as a duplicate of 400941 ***

Comment 23 Corey Marthaler 2008-08-22 21:37:17 UTC
Haven't seen any Ckpt failures in a while. Marking verified.

Comment 25 errata-xmlrpc 2009-01-20 20:46:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0074.html

