Bug 233706

Summary: failed cluster node causes mirror recovery region requests to get stuck in loop
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: cmirror
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: medium
Version: 4
CC: agk, dwysocha, jbrassow, mbroz, prockai
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-04-27 15:00:03 UTC Type: ---

Description Corey Marthaler 2007-03-23 21:27:06 UTC
Description of problem:
I was running revolver on the 4-node x86_64 link cluster with 2 gfs filesystems
(one w/ 2 legs and one with 3 legs). During the second iteration I failed
link-04, and that caused the mirror recovery to go haywire. When I brought
link-04 back into the cluster and attempted to remount the first gfs, it hung.

I'll leave the cluster in this state if you'd like to gather more info than
provided below:


Messages from link-07:
[...]
dm-cmirror: unable to notify server of completed resync work
Mar 23 15:19:50 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (8415)
Mar 23 15:19:50 link-07 kernel: dm-cmirror: Reason :: 1
Mar 23 15:19:50 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (5793)
Mar 23 15:19:50 link-07 kernel: dm-cmirror: Reason :: 1
Mar 23 15:19:50 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (5795)
Mar 23 15:19:50 link-07 kernel: dm-cmirror: Reason :: 1
Mar 23 15:19:50 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (6273)
Mar 23 15:19:50 link-07 kernel: dm-cmirror: Reason :: 1
Mar 23 15:19:51 link-07 kernel: dm-cmirror: unable to notify server of completed resync work
dm-cmirror: unable to get server (3) to mark region (8192)
dm-cmirror: Reason :: 1
Mar 23 15:20:00 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (8192)
Mar 23 15:20:00 link-07 kernel: dm-cmirror: Reason :: 1
dm-cmirror: unable to get server (3) to mark region (2067)
dm-cmirror: Reason :: 1
Mar 23 15:20:35 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (2067)
Mar 23 15:20:35 link-07 kernel: dm-cmirror: Reason :: 1
dm-cmirror: unable to get server (3) to mark region (2067)
dm-cmirror: Reason :: 1
Mar 23 15:20:35 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (2067)
Mar 23 15:20:35 link-07 kernel: dm-cmirror: Reason :: 1


Messages from link-08 (looping over and over):
[...]
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Attempt to mark a region
5578/C33UfFkJ which is being recovered.
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Attempt to mark a region
5578/C33UfFkJ which is being recovered.
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Attempt to mark a region
5578/C33UfFkJ which is being recovered.
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Attempt to mark a region
5578/C33UfFkJ which is being recovered.
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Attempt to mark a region
5578/C33UfFkJ which is being recovered.
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Attempt to mark a region
5578/C33UfFkJ which is being recovered.
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Mark requester   : 4


[root@link-07 ~]# dmsetup table
revolver-mirror2_mimage_2: 0 10485760 linear 8:49 384
revolver-mirror1_mlog: 0 8192 linear 8:113 384
revolver-mirror2_mimage_1: 0 10485760 linear 8:33 384
revolver-mirror2_mimage_0: 0 10485760 linear 8:1 10486144
revolver-mirror2_mlog: 0 8192 linear 8:17 10486144
revolver-mirror1_mimage_1: 0 10485760 linear 8:17 384
revolver-mirror1_mimage_0: 0 10485760 linear 8:1 384
revolver-mirror2: 0 10485760 mirror clustered_disk 5 253:6 1024
LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3U9Iz0MnCtDa0QV7z8Qpsi749eaAIovqe nosync
block_on_error 3 253:7 0 253:8 0 253:9 0
VolGroup00-LogVol01: 0 4063232 linear 3:2 151781760
revolver-mirror1: 0 10485760 mirror clustered_disk 5 253:2 1024
LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik34coTfAbowArYJ4dLpVYZKagWC33UfFkJ nosync
block_on_error 2 253:3 0 253:4 0
VolGroup00-LogVol00: 0 151781376 linear 3:2 384
[root@link-07 ~]# dmsetup info
Name:              revolver-mirror2_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3isDcxNcz4wZdJDn4Xe8iZUruUQJ4ZRTe

Name:              revolver-mirror1_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik34coTfAbowArYJ4dLpVYZKagWC33UfFkJ

Name:              revolver-mirror2_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3mZUe0RrJKkU91zuIoqHmYHWD8uboH8f5

Name:              revolver-mirror2_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik34L8Fe5dimlQkHP0J4VRTo6VYGF1Vl6yp

Name:              revolver-mirror2_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3U9Iz0MnCtDa0QV7z8Qpsi749eaAIovqe

Name:              revolver-mirror1_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3kUiOVkdLj6U18LdGB92sLcSI1TO7Rgts

Name:              revolver-mirror1_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3W26y7KjUX3f6NjgBr0GjYrdSuxMZHA4b

Name:              revolver-mirror2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3OTZEJ9N5A6CnpxcgWLLsFYES7vrRWrGE

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-8qGbKfLuKYoljGNFE1gsS77AYQM3dC4xQjIYEP6InPgUU5nsDPYSZl5EAEKqRWcY

Name:              revolver-mirror1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik33eay0GrlKTcJaZKdaEAig2hY2MNmHS5q

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-8qGbKfLuKYoljGNFE1gsS77AYQM3dC4xrapcuzOGNgADTzIRUNTk0MZBbtWAyXhh
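
The "C33UfFkJ" suffix in the link-08 messages appears to match the tail of the
log UUID on the revolver-mirror1 line of the dmsetup table output, which would
point at revolver-mirror1 as the mirror with the stuck region. That reading is
an inference from the output above, not something confirmed in this report. A
minimal sketch of the lookup, assuming the usual dm mirror table layout (log
type, log argument count, log arguments, leg count, then device/offset pairs);
the helper names are hypothetical and the wrapped table lines are rejoined by
hand:

```python
# Mirror lines copied from the dmsetup table output, with the wraps rejoined.
TABLE = [
    "revolver-mirror2: 0 10485760 mirror clustered_disk 5 253:6 1024 "
    "LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3U9Iz0MnCtDa0QV7z8Qpsi749eaAIovqe "
    "nosync block_on_error 3 253:7 0 253:8 0 253:9 0",
    "revolver-mirror1: 0 10485760 mirror clustered_disk 5 253:2 1024 "
    "LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik34coTfAbowArYJ4dLpVYZKagWC33UfFkJ "
    "nosync block_on_error 2 253:3 0 253:4 0",
]


def parse_mirror_line(line):
    """Split one 'dmsetup table' line; return None if it is not a mirror."""
    name, rest = line.split(":", 1)
    fields = rest.split()
    if len(fields) < 5 or fields[2] != "mirror":
        return None
    n_log_args = int(fields[4])
    log_args = fields[5:5 + n_log_args]
    pos = 5 + n_log_args
    n_legs = int(fields[pos])
    # Legs are listed as device/offset pairs; keep only the devices.
    legs = fields[pos + 1:pos + 1 + 2 * n_legs:2]
    return {"name": name, "log_type": fields[3],
            "log_args": log_args, "legs": legs}


def mirrors_matching_suffix(lines, suffix):
    """Names of mirrors whose log UUID ends with the suffix seen in the logs."""
    names = []
    for line in lines:
        info = parse_mirror_line(line)
        if info is None:
            continue
        if any(a.startswith("LVM-") and a.endswith(suffix)
               for a in info["log_args"]):
            names.append(info["name"])
    return names


print(mirrors_matching_suffix(TABLE, "C33UfFkJ"))
```

With the two mirror lines above, only revolver-mirror1 matches the suffix.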



Version-Release number of selected component (if applicable):
2.6.9-50.ELsmp
cmirror-kernel-2.6.9-25.0

Comment 1 Jonathan Earl Brassow 2007-03-24 04:30:59 UTC
'Reason' should be a negative number.  This suggests that the client is
receiving a message from the server that is not the response it is expecting.

Sequence numbers were put in (3/22/2007) to fix this problem.  The
cmirror-kernel package you are using was built 3/14/2007.
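
To illustrate the idea only (this is a rough sketch, not the dm-cmirror code;
the Message and Client names, the fields, and the "MARK_REGION" string are all
hypothetical): if each request carries a sequence number that the server copies
into its reply, the client can drop a stale or unrelated reply instead of
reading it as the answer to its current request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    seq: int        # sequence number, copied from the request into the reply
    req_type: str   # request kind, e.g. "MARK_REGION" (hypothetical name)
    region: int
    error: int = 0  # assumed convention: 0 on success, negative on failure


class Client:
    def __init__(self):
        self.next_seq = 0
        self.pending: Optional[Message] = None

    def make_request(self, req_type, region):
        """Create a request with a fresh sequence number and remember it."""
        self.next_seq += 1
        self.pending = Message(self.next_seq, req_type, region)
        return self.pending

    def accept_reply(self, reply):
        """Return True only if the reply answers the outstanding request."""
        p = self.pending
        if p is None or reply.seq != p.seq:
            return False  # stale or unexpected reply: ignore it
        if reply.req_type != p.req_type or reply.region != p.region:
            return False
        self.pending = None
        return True


client = Client()
old = client.make_request("MARK_REGION", 8415)
late_reply = Message(old.seq, "MARK_REGION", 8415)  # arrives too late
current = client.make_request("MARK_REGION", 2067)
print(client.accept_reply(late_reply))
print(client.accept_reply(Message(current.seq, "MARK_REGION", 2067)))
```

In this sketch the late reply to the first request is rejected because its
sequence number no longer matches the outstanding request, and only the reply
to the second request is accepted.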

new -> post

Comment 2 Jonathan Earl Brassow 2007-04-03 20:12:40 UTC
post -> modified

Comment 3 Corey Marthaler 2007-04-12 18:39:37 UTC
Fix verified in cmirror-kernel-2.6.9-30.0.

Comment 5 Alasdair Kergon 2010-04-27 15:00:03 UTC
Assuming this VERIFIED fix got released.  Closing.
Reopen if it's not yet resolved.