Bug 430296 - RHEL5 cmirror tracker: recovered cluster node deadlocks during volume activation
Summary: RHEL5 cmirror tracker: recovered cluster node deadlocks during volume activation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.2
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 432877
Blocks: 430797 509886
 
Reported: 2008-01-25 20:46 UTC by Corey Marthaler
Modified: 2016-04-26 16:12 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:46:41 UTC
Target Upstream Version:
Embargoed:


Attachments
  log from hayes-01 (27.32 KB, text/plain), 2008-01-25 21:12 UTC, Corey Marthaler
  log from hayes-02 (47.76 KB, text/plain), 2008-01-25 21:13 UTC, Corey Marthaler
  log from hayes-03 (49.84 KB, text/plain), 2008-01-25 21:13 UTC, Corey Marthaler


Links
  Red Hat Product Errata RHBA-2009:0074 (normal, SHIPPED_LIVE): openais bug-fix update, last updated 2009-01-20 16:04:11 UTC

Description Corey Marthaler 2008-01-25 20:46:31 UTC
Description of problem:
I was running I/O from all 3 nodes in the cluster (hayes-0[123]) to a simple 50G
cmirror with GFS on top. I then reset hayes-02. Recovery on the remaining two
nodes appeared fine as hayes-02 was properly fenced. I even created another
cmirror on hayes-03 to make sure nothing was deadlocked. 

I then brought hayes-02 back into the cluster and started cmirror and clvmd. The
VG activation during clvmd startup deadlocked, and hayes-02 ended up being
fenced once again.
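
(For reference, the recovery sequence run on the rebooted node was essentially
the following. This is only a sketch: the clvmd init script does the activation
itself, and "hayes" is the VG shown in the lvs output below.)

service cman start        # rejoin cluster membership (openais/cman)
service cmirror start     # load dm-log-clustered and start clogd
service clvmd start       # clvmd startup; "Activating VGs" is where the hang occurs
vgchange -ay hayes        # roughly what the init script's activation step does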

Here are the two mirrors:
[root@hayes-03 ~]# lvs -a -o +devices
  LV                VG         Attr   LSize  Origin Snap%  Move Log         Copy%  Convert Devices
  LogVol00          VolGroup00 -wi-ao 72.44G                                               /dev/sda2(0)
  LogVol01          VolGroup00 -wi-ao  1.94G                                               /dev/sda2(2318)
  mirror            hayes      Mwi-ao 50.00G                    mirror_mlog 100.00         mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] hayes      iwi-ao 50.00G                                               /dev/etherd/e1.1p1(0)
  [mirror_mimage_1] hayes      iwi-ao 50.00G                                               /dev/etherd/e1.1p2(0)
  [mirror_mlog]     hayes      lwi-ao  4.00M                                               /dev/etherd/e1.1p7(0)
  test              hayes      mwi-a-  5.00G                    test_mlog   100.00         test_mimage_0(0),test_mimage_1(0)
  [test_mimage_0]   hayes      iwi-ao  5.00G                                               /dev/etherd/e1.1p3(0)
  [test_mimage_1]   hayes      iwi-ao  5.00G                                               /dev/etherd/e1.1p4(0)
  [test_mlog]       hayes      lwi-ao  4.00M                                               /dev/etherd/e1.1p2(12800)

I'll attach the messages from each of the nodes during the time of the
deadlock/second fence.

Version-Release number of selected component (if applicable):
2.6.18-71.el5
lvm2-2.02.31-1.el5
lvm2-cluster-2.02.31-1.el5
kmod-cmirror-0.1.4-1.el5
cmirror-1.1.7-1.el5

Comment 1 Corey Marthaler 2008-01-25 21:12:33 UTC
Created attachment 292998 [details]
log from hayes-01

Comment 2 Corey Marthaler 2008-01-25 21:13:06 UTC
Created attachment 292999 [details]
log from hayes-02

Comment 3 Corey Marthaler 2008-01-25 21:13:36 UTC
Created attachment 293000 [details]
log from hayes-03

Comment 4 Corey Marthaler 2008-01-25 21:55:17 UTC
This issue is reproducible.

Comment 6 Corey Marthaler 2008-02-05 20:23:04 UTC
Reproduced this with the latest packages. It seems as though this may be related to bz
429599 and bz 383291, as I see those messages stream during the activation deadlocks.

[...]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_RESUME]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_RESUME]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered:
[2][0][0][0][0][0] [75]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_GET_SYNC_COUNT]
Feb  5 14:06:11 grant-01 kernel: device-mapper: dm-log-clustered:
[2][0][0][0][0][0] [76]
Feb  5 14:06:11 grant-01 clogd[3241]: import_checkpoint: 0 checkpoint sections found
Feb  5 14:06:15 grant-01 last message repeated 4 times

[...]

Feb  5 14:18:09 grant-01 clogd[3241]: import_checkpoint: 0 checkpoint sections found
Feb  5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Attempting to
contact cluster log server...
Feb  5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Unable to send
cluster log request [DM_CLOG_CTR] to server: -3
Feb  5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Attempting to
contact cluster log server...
Feb  5 14:18:10 grant-01 kernel: device-mapper: dm-log-clustered: Unable to send
cluster log request [DM_CLOG_CTR] to server: -3


cmirror-1.1.9-1.el5
kmod-cmirror-0.1.5-2.el5

Comment 7 Jonathan Earl Brassow 2008-02-06 20:49:29 UTC
-3 means the server has probably died.

When you see this, check 'ps -C clogd'... and see if you can find a core dump.
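
(A quick way to check both of those, sketched below; the core file location
depends on ulimit/core_pattern settings, so the paths are only common defaults.)

ps -C clogd -o pid,etime,cmd           # is the daemon still running?
grep -i clogd /var/log/messages | grep -i segfault   # was a crash logged?
cat /proc/sys/kernel/core_pattern      # where core files would land on this host
ls -l /core* /var/core* 2>/dev/null    # common default locations for a core dump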


Comment 8 Corey Marthaler 2008-02-08 15:48:38 UTC
During the deadlock, clogd is still running on all three nodes.

# After grant-01 was killed...
[root@grant-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                   
[deadlock]

Feb  8 09:42:21 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
Feb  8 09:42:25 grant-01 last message repeated 4 times
Feb  8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_RESUME]
Feb  8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_RESUME]
Feb  8 09:42:26 grant-01 kernel: device-mapper: dm-log-clustered:
[2][0][0][0][0][0] [1]
Feb  8 09:42:26 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
Feb  8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Feb  8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_GET_SYNC_COUNT]
Feb  8 09:42:27 grant-01 kernel: device-mapper: dm-log-clustered:
[2][0][0][0][0][0] [2]
Feb  8 09:42:27 grant-01 clogd[3256]: import_checkpoint: 0 checkpoint sections found
[...]


[root@grant-01 ~]# ps -C clogd
  PID TTY          TIME CMD
 3256 ?        00:00:00 clogd
[root@grant-01 ~]# ps -elf | grep clogd
5 S root      3256     1  0  75   0 -  4498 -      09:42 ?        00:00:00 clogd
0 S root      3389  3355  0  78   0 - 15282 pipe_w 09:46 pts/1    00:00:00 grep
clogd


Comment 9 Corey Marthaler 2008-02-08 17:10:19 UTC
FYI - there is still an issue here in the latest code. Though it didn't
deadlock, the activation still failed and resulted in a stuck mirror.

# Revolver output:
[...]
wait for clvm services on all nodes...
activating VGs on grant-01
  Error locking on node grant-01: device-mapper: reload ioctl failed: Invalid
argument

  mirror   grant      mwi-d-  5.00G                    mirror_mlog   0.00

cmirror-1.1.13-1.el5/kmod-cmirror-0.1.6-1.el5

Comment 11 Jonathan Earl Brassow 2008-02-11 21:46:41 UTC
The trailing comments hint at something different from the leading comments...
are you still doing the same thing to produce this bug?

1) I/O from multiple (all) nodes
2) kill one of the machines

Also, if there are "unable to send to cluster" messages, the new logs will
contain the reason for the failure... it'd be nice to have those.


Comment 12 Corey Marthaler 2008-02-11 23:50:21 UTC
I tried this again this afternoon. Same scenario that you described:

1) I/O from multiple (all) nodes
2) kill one of the machines (the cmirror server)


[root@grant-01 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
[root@grant-01 ~]# service cmirror start
Loading clustered mirror log module:                       [  OK  ]
Starting clustered mirror log server:                      [  OK  ]
[root@grant-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                                           
[DEADLOCK]


grant-01:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] New Configuration:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] Members Left:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] Members Joined:
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:34 grant-01 openais[3196]: [SYNC ] This node is within the primary
component and will provide service.
Feb 11 17:24:34 grant-01 openais[3196]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:34 grant-01 openais[3196]: [CMAN ] quorum regained, resuming activity
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] got nodejoin message 10.15.89.151
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] got nodejoin message 10.15.89.152
Feb 11 17:24:34 grant-01 openais[3196]: [CLM  ] got nodejoin message 10.15.89.153
Feb 11 17:24:34 grant-01 openais[3196]: [CPG  ] got joinlist message from node 2
Feb 11 17:24:34 grant-01 openais[3196]: [CPG  ] got joinlist message from node 3
Feb 11 17:24:35 grant-01 ccsd[3189]: Initial status:: Quorate
Feb 11 17:24:44 grant-01 kernel: device-mapper: dm-log-clustered: (built Feb  8
2008 11:13:43) installed
Feb 11 17:24:44 grant-01 clogd[3257]: Starting clogd:
Feb 11 17:24:44 grant-01 clogd[3257]:  Built: Feb  8 2008 11:09:24
Feb 11 17:24:52 grant-01 kernel: dlm: Using TCP for communications
Feb 11 17:24:52 grant-01 kernel: dlm: connecting to 2
Feb 11 17:24:52 grant-01 kernel: dlm: got connection from 3
Feb 11 17:24:52 grant-01 kernel: dlm: got connection from 2
Feb 11 17:24:53 grant-01 clvmd: Cluster LVM daemon started - connected to CMAN
Feb 11 17:24:54 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:08 grant-01 last message repeated 14 times
Feb 11 17:25:09 grant-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_RESUME:2 - retrying
Feb 11 17:25:09 grant-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:3 - retrying
Feb 11 17:25:09 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:23 grant-01 last message repeated 14 times
Feb 11 17:25:24 grant-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_RESUME:4 - retrying
Feb 11 17:25:24 grant-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:5 - retrying
Feb 11 17:25:24 grant-01 clogd[3257]: import_checkpoint: 0 checkpoint sections found
Feb 11 17:25:38 grant-01 last message repeated 14 times


grant-02:
Feb 11 17:16:09 grant-02 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:20301 - retrying
Feb 11 17:16:35 grant-02 fenced[3238]: grant-01 not a cluster member after 30
sec post_fail_delay
Feb 11 17:16:35 grant-02 fenced[3238]: fencing node "grant-01"
Feb 11 17:16:48 grant-02 fenced[3238]: fence "grant-01" success
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs1.1: jid=0: Trying to
acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs3.1: jid=0: Trying to
acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Trying to
acquire journal lock...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs1.1: jid=0: Looking at
journal...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs3.1: jid=0: Looking at
journal...
Feb 11 17:16:53 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Looking at
journal...
[...]
Feb 11 17:16:58 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: replays = 2,
skips = 0, sames = 0
Feb 11 17:16:59 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Journal replayed
in 6s
Feb 11 17:16:59 grant-02 kernel: GFS: fsid=GRANT:gfs2.2: jid=0: Done
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering GATHER state from 11.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Saving state aru 15657 high seq
received 15657
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Storing new sequence id for ring 150
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering COMMIT state.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering RECOVERY state.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [0] member 10.15.89.151:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep
10.15.89.151
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 9 high delivered 9 received
flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [1] member 10.15.89.152:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep
10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 15657 high delivered 15657
received flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] position [2] member 10.15.89.153:
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] previous ring seq 332 rep
10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] aru 15657 high delivered 15657
received flag 1
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] Did not need to originate any
messages in recovery.
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] New Configuration:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] Members Left:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] Members Joined:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] New Configuration:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] Members Left:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] Members Joined:
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:24 grant-02 openais[3222]: [SYNC ] This node is within the primary
component and will provide service.
Feb 11 17:24:24 grant-02 openais[3222]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] got nodejoin message 10.15.89.151
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] got nodejoin message 10.15.89.152
Feb 11 17:24:24 grant-02 openais[3222]: [CLM  ] got nodejoin message 10.15.89.153
Feb 11 17:24:24 grant-02 openais[3222]: [CPG  ] got joinlist message from node 2
Feb 11 17:24:24 grant-02 openais[3222]: [CPG  ] got joinlist message from node 3
Feb 11 17:24:41 grant-02 kernel: dlm: connecting to 1
Feb 11 17:24:41 grant-02 kernel: dlm: got connection from 1



grant-03:
Feb 11 17:17:26 grant-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IN_SYNC:20939 - retrying
Feb 11 17:18:21 grant-03 clogd[3276]: cpg_mcast_joined error: 9
Feb 11 17:18:21 grant-03 clogd[3276]: cluster_send failed at: local.c:212
(do_local_work)
Feb 11 17:18:21 grant-03 clogd[3276]: [] Unable to send (null) to cluster:
Invalid exchange
Feb 11 17:18:21 grant-03 kernel: device-mapper: dm-log-clustered: Stray request
returned: <NULL>, 0
Feb 11 17:18:21 grant-03 clogd[3276]: Bad callback on local/4
Feb 11 17:18:36 grant-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IN_SYNC:22840 - retrying
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering GATHER state from 11.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Saving state aru 15657 high seq
received 15657
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Storing new sequence id for ring 150
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering COMMIT state.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering RECOVERY state.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [0] member 10.15.89.151:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep
10.15.89.151
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 9 high delivered 9 received
flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [1] member 10.15.89.152:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep
10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 15657 high delivered 15657
received flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] position [2] member 10.15.89.153:
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] previous ring seq 332 rep
10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] aru 15657 high delivered 15657
received flag 1
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] Did not need to originate any
messages in recovery.
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] New Configuration:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] Members Left:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] Members Joined:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] New Configuration:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.152)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.153)
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] Members Left:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] Members Joined:
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ]         r(0) ip(10.15.89.151)
Feb 11 17:24:51 grant-03 openais[3217]: [SYNC ] This node is within the primary
component and will provide service.
Feb 11 17:24:51 grant-03 openais[3217]: [TOTEM] entering OPERATIONAL state.
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] got nodejoin message 10.15.89.151
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] got nodejoin message 10.15.89.152
Feb 11 17:24:51 grant-03 openais[3217]: [CLM  ] got nodejoin message 10.15.89.153
Feb 11 17:24:51 grant-03 openais[3217]: [CPG  ] got joinlist message from node 2
Feb 11 17:24:51 grant-03 openais[3217]: [CPG  ] got joinlist message from node 3
Feb 11 17:25:09 grant-03 kernel: dlm: connecting to 1


Comment 13 Jonathan Earl Brassow 2008-02-14 21:38:44 UTC
This seems to be an AIS checkpoint issue.


Comment 14 Steven Dake 2008-02-14 21:42:12 UTC
This may be an AIS checkpoint issue; opening a separate issue on that topic.

Comment 15 Jonathan Earl Brassow 2008-02-15 16:04:41 UTC
If it is an issue with AIS, it may affect posix locking too.  You could try:

1) mount 1 gfs node
2) take out posix lock
3) mount rest of nodes
4) kill one of the nodes (not the first one)
5) bring node back and mount the file system

I *think* the operations there would be similar... not sure w/o talking to dave
though.
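
(Sketching those steps as commands; "plock_hold" is a hypothetical helper that
takes an fcntl()/POSIX lock on a file and sleeps, and the device/mount point
names are placeholders.)

# node A: mount GFS and hold a POSIX lock
mount -t gfs /dev/vg/lv /mnt/gfs        # placeholder device and mount point
./plock_hold /mnt/gfs/lockfile &        # hypothetical helper: fcntl(F_SETLKW), then sleep

# remaining nodes: mount the same file system
mount -t gfs /dev/vg/lv /mnt/gfs

# on one of the non-lock-holding nodes: crash it so it gets fenced
echo b > /proc/sysrq-trigger

# bring that node back, rejoin the cluster, and remount
service cman start
mount -t gfs /dev/vg/lv /mnt/gfs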

Comment 16 Corey Marthaler 2008-04-24 14:46:17 UTC
Reproduced this again while trying the node recovery that we mentioned in the
cmirror meeting. I shot taft-03 and played around with starting and stopping the
cmirror service.

2.6.18-90.el5
kmod-cmirror-0.1.8-1.el5
openais-0.80.3-15.el5

Apr 23 16:11:13 taft-03 openais[6598]: [SYNC ] This node is within the primary
component and will provide service.
Apr 23 16:11:13 taft-03 openais[6598]: [TOTEM] entering OPERATIONAL state.
Apr 23 16:11:13 taft-03 openais[6598]: [CMAN ] quorum regained, resuming activity
Apr 23 16:11:13 taft-03 openais[6598]: [CLM  ] got nodejoin message 10.15.89.67
Apr 23 16:11:13 taft-03 openais[6598]: [CLM  ] got nodejoin message 10.15.89.68
Apr 23 16:11:13 taft-03 openais[6598]: [CLM  ] got nodejoin message 10.15.89.69
Apr 23 16:11:13 taft-03 openais[6598]: [CLM  ] got nodejoin message 10.15.89.70
Apr 23 16:11:13 taft-03 openais[6598]: [CPG  ] got joinlist message from node 2
Apr 23 16:11:13 taft-03 openais[6598]: [CPG  ] got joinlist message from node 4
Apr 23 16:11:13 taft-03 openais[6598]: [CPG  ] got joinlist message from node 1
Apr 23 16:11:14 taft-03 ccsd[6592]: Initial status:: Quorate
Apr 23 16:11:24 taft-03 clogd[6655]: Starting clogd:
Apr 23 16:11:24 taft-03 clogd[6655]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:29 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) removed
Apr 23 16:11:30 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) installed
Apr 23 16:11:30 taft-03 clogd[6702]: Starting clogd:
Apr 23 16:11:30 taft-03 clogd[6702]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:31 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) removed
Apr 23 16:11:31 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) installed
Apr 23 16:11:31 taft-03 clogd[6749]: Starting clogd:
Apr 23 16:11:31 taft-03 clogd[6749]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:32 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) removed
Apr 23 16:11:33 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) installed
Apr 23 16:11:33 taft-03 clogd[6796]: Starting clogd:
Apr 23 16:11:33 taft-03 clogd[6796]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:33 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) removed
Apr 23 16:11:35 taft-03 kernel: device-mapper: dm-log-clustered: (built Feb 28
2008 14:01:10) installed
Apr 23 16:11:35 taft-03 clogd[6843]: Starting clogd:
Apr 23 16:11:35 taft-03 clogd[6843]:  Built: Feb 28 2008 14:04:28
Apr 23 16:11:44 taft-03 kernel: dlm: Using TCP for communications
Apr 23 16:11:44 taft-03 kernel: dlm: connecting to 1
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 2
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 1
Apr 23 16:11:44 taft-03 kernel: dlm: got connection from 4
Apr 23 16:11:45 taft-03 clvmd: Cluster LVM daemon started - connected to CMAN
Apr 23 16:11:46 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:46 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:11:46 taft-03 dmeventd[6917]: dmeventd ready for processing.
Apr 23 16:11:46 taft-03 dmeventd[6917]: Monitoring mirror device taft-mirror for
events
Apr 23 16:11:47 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:47 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
[...]
Apr 23 16:11:59 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:11:59 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:12:00 taft-03 clogd[6843]: saCkptSectionIterationNext failure: 27
Apr 23 16:12:00 taft-03 clogd[6843]: import_checkpoint: 0 checkpoint sections found
Apr 23 16:12:01 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:3 - retrying
Apr 23 16:12:01 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:4 - retrying


Comment 17 Corey Marthaler 2008-04-24 14:48:02 UTC
Reassigning to openais.

Comment 18 Corey Marthaler 2008-05-06 18:24:50 UTC
FYI - reproduced this again without I/O. I was doing some LVM operations on the four
nodes in the cluster, shot a node, and it deadlocked attempting to activate the volumes.


[...]
May  6 13:21:42 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May  6 13:21:42 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
May  6 13:21:43 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May  6 13:21:43 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
May  6 13:21:44 taft-04 clogd[6903]: saCkptSectionIterationNext failure: 27
May  6 13:21:44 taft-04 clogd[6903]: import_checkpoint: 0 checkpoint sections found
[...]

Comment 19 Steven Dake 2008-05-18 21:06:55 UTC
I believe this to be a duplicate of bugzilla #400491.

*** This bug has been marked as a duplicate of 400491 ***

Comment 20 Corey Marthaler 2008-05-19 15:34:28 UTC

*** This bug has been marked as a duplicate of 400941 ***

Comment 23 Corey Marthaler 2008-08-22 21:37:17 UTC
Haven't seen any Ckpt failures in a while. Marking verified.

Comment 25 errata-xmlrpc 2009-01-20 20:46:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0074.html

