Bug 233706 - failed cluster node causes mirror recovery region requests to get stuck in loop
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cmirror
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Docs Contact:
Depends On:
Blocks:
Reported: 2007-03-23 17:27 EDT by Corey Marthaler
Modified: 2010-04-27 11:00 EDT

Doc Type: Bug Fix
Last Closed: 2010-04-27 11:00:03 EDT


Attachments: None
Description Corey Marthaler 2007-03-23 17:27:06 EDT
Description of problem:
I was running revolver on the 4-node x86_64 link cluster with 2 gfs filesystems
(one with 2 legs and one with 3 legs). During the second iteration I failed
link-04, and that caused the mirror recovery to go haywire. When I brought
link-04 back into the cluster and attempted to remount the first gfs, it hung.

I'll leave the cluster in this state if you'd like to gather more info than
provided below:


Messages from link-07:
[...]
dm-cmirror: unable to notify server of completed resync work
Mar 23 15:19:50 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (8415)
Mar 23 15:19:50 link-07 kernel: dm-cmirror: Reason :: 1
Mar 23 15:19:50 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (5793)
Mar 23 15:19:50 link-07 kernel: dm-cmirror: Reason :: 1
Mar 23 15:19:50 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (5795)
Mar 23 15:19:50 link-07 kernel: dm-cmirror: Reason :: 1
Mar 23 15:19:50 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (6273)
Mar 23 15:19:50 link-07 kernel: dm-cmirror: Reason :: 1
Mar 23 15:19:51 link-07 kernel: dm-cmirror: unable to notify server of completed resync work
dm-cmirror: unable to get server (3) to mark region (8192)
dm-cmirror: Reason :: 1
Mar 23 15:20:00 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (8192)
Mar 23 15:20:00 link-07 kernel: dm-cmirror: Reason :: 1
dm-cmirror: unable to get server (3) to mark region (2067)
dm-cmirror: Reason :: 1
Mar 23 15:20:35 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (2067)
Mar 23 15:20:35 link-07 kernel: dm-cmirror: Reason :: 1
dm-cmirror: unable to get server (3) to mark region (2067)
dm-cmirror: Reason :: 1
Mar 23 15:20:35 link-07 kernel: dm-cmirror: unable to get server (3) to mark region (2067)
Mar 23 15:20:35 link-07 kernel: dm-cmirror: Reason :: 1


Messages from link-08 (looping over and over):
[...]
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Attempt to mark a region 5578/C33UfFkJ which is being recovered.
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Attempt to mark a region 5578/C33UfFkJ which is being recovered.
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Attempt to mark a region 5578/C33UfFkJ which is being recovered.
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Attempt to mark a region 5578/C33UfFkJ which is being recovered.
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Attempt to mark a region 5578/C33UfFkJ which is being recovered.
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Attempt to mark a region 5578/C33UfFkJ which is being recovered.
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:23 link-08 kernel: dm-cmirror: Mark requester   : 4
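
To put a number on how often link-08 repeats the message above, the log can be tallied per region. This is only a minimal Python sketch: the function name and the sample text are illustrative, and the regular expression is based solely on the message wording shown in this report. Whitespace is collapsed first so that a message wrapped across two lines still matches.

```python
import re
from collections import Counter

# Pattern taken from the message text in this report; the part after the
# slash is treated as an opaque identifier suffix.
MARK_RE = re.compile(
    r"dm-cmirror: Attempt to mark a region (\d+)/(\w+) which is being recovered"
)

def count_blocked_marks(log_text):
    """Count 'Attempt to mark a region' messages per (region, suffix)."""
    # Collapse all whitespace so messages split across lines still match.
    flat = " ".join(log_text.split())
    return Counter((int(region), suffix) for region, suffix in MARK_RE.findall(flat))

# Two occurrences copied from the link-08 excerpt, wrapped as in syslog.
sample = """\
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Attempt to mark a region
5578/C33UfFkJ which is being recovered.
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Current recoverer: 1
Mar 23 11:44:21 link-08 kernel: dm-cmirror: Mark requester   : 4
Mar 23 11:44:22 link-08 kernel: dm-cmirror: Attempt to mark a region
5578/C33UfFkJ which is being recovered.
"""

counts = count_blocked_marks(sample)
for (region, suffix), n in counts.most_common():
    print(f"region {region}/{suffix}: {n} blocked mark attempts")
```

A single (region, suffix) pair dominating the tally, as region 5578 does here, is consistent with one mark request being retried in a loop rather than many different regions being contended.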


[root@link-07 ~]# dmsetup table
revolver-mirror2_mimage_2: 0 10485760 linear 8:49 384
revolver-mirror1_mlog: 0 8192 linear 8:113 384
revolver-mirror2_mimage_1: 0 10485760 linear 8:33 384
revolver-mirror2_mimage_0: 0 10485760 linear 8:1 10486144
revolver-mirror2_mlog: 0 8192 linear 8:17 10486144
revolver-mirror1_mimage_1: 0 10485760 linear 8:17 384
revolver-mirror1_mimage_0: 0 10485760 linear 8:1 384
revolver-mirror2: 0 10485760 mirror clustered_disk 5 253:6 1024 LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3U9Iz0MnCtDa0QV7z8Qpsi749eaAIovqe nosync block_on_error 3 253:7 0 253:8 0 253:9 0
VolGroup00-LogVol01: 0 4063232 linear 3:2 151781760
revolver-mirror1: 0 10485760 mirror clustered_disk 5 253:2 1024 LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik34coTfAbowArYJ4dLpVYZKagWC33UfFkJ nosync block_on_error 2 253:3 0 253:4 0
VolGroup00-LogVol00: 0 151781376 linear 3:2 384
[root@link-07 ~]# dmsetup info
Name:              revolver-mirror2_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3isDcxNcz4wZdJDn4Xe8iZUruUQJ4ZRTe

Name:              revolver-mirror1_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik34coTfAbowArYJ4dLpVYZKagWC33UfFkJ

Name:              revolver-mirror2_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3mZUe0RrJKkU91zuIoqHmYHWD8uboH8f5

Name:              revolver-mirror2_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik34L8Fe5dimlQkHP0J4VRTo6VYGF1Vl6yp

Name:              revolver-mirror2_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3U9Iz0MnCtDa0QV7z8Qpsi749eaAIovqe

Name:              revolver-mirror1_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3kUiOVkdLj6U18LdGB92sLcSI1TO7Rgts

Name:              revolver-mirror1_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3W26y7KjUX3f6NjgBr0GjYrdSuxMZHA4b

Name:              revolver-mirror2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik3OTZEJ9N5A6CnpxcgWLLsFYES7vrRWrGE

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-8qGbKfLuKYoljGNFE1gsS77AYQM3dC4xQjIYEP6InPgUU5nsDPYSZl5EAEKqRWcY

Name:              revolver-mirror1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik33eay0GrlKTcJaZKdaEAig2hY2MNmHS5q

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-8qGbKfLuKYoljGNFE1gsS77AYQM3dC4xrapcuzOGNgADTzIRUNTk0MZBbtWAyXhh
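
For reading the mirror lines in the dmsetup table output above, here is a minimal Python sketch that splits one line into named fields. It assumes the layout visible in that output: start, length, target, log type, log-argument count, the log arguments, leg count, then device/offset pairs. The function name and dictionary keys are hypothetical, the positions of the region size and UUID within the log arguments are inferred from the two mirror lines in this report, and the region count (length divided by region size, rounded up) is likewise an assumption.

```python
def parse_mirror_table(line):
    """Split one 'dmsetup table' mirror line into named fields."""
    name, rest = line.split(":", 1)
    tok = rest.split()
    start, length, target, log_type = int(tok[0]), int(tok[1]), tok[2], tok[3]
    if target != "mirror":
        raise ValueError(f"{name}: not a mirror target ({target})")

    n_log_args = int(tok[4])
    log_args = tok[5:5 + n_log_args]

    # After the log arguments comes the leg count, then <device> <offset> pairs.
    pos = 5 + n_log_args
    n_legs = int(tok[pos])
    legs = [(tok[pos + 1 + 2 * i], int(tok[pos + 2 + 2 * i])) for i in range(n_legs)]

    # Assumed positions inside the log arguments: device, region size, UUID.
    region_size = int(log_args[1])
    return {
        "name": name,
        "start": start,
        "length": length,
        "log_type": log_type,
        "log_device": log_args[0],
        "region_size": region_size,
        "log_uuid": log_args[2],
        "log_flags": log_args[3:],
        "legs": legs,
        # Number of regions, rounding up (assumption).
        "regions": -(-length // region_size),
    }

# The revolver-mirror1 line from the output above, joined onto one line.
line = (
    "revolver-mirror1: 0 10485760 mirror clustered_disk 5 253:2 1024 "
    "LVM-xVWv7JiOsSgNPv95Lg9FU6ckwsTQeik34coTfAbowArYJ4dLpVYZKagWC33UfFkJ "
    "nosync block_on_error 2 253:3 0 253:4 0"
)
info = parse_mirror_table(line)
print(info["log_type"], info["region_size"], info["regions"], info["legs"])
```

Under these assumptions each mirror has 10240 regions, so the region numbers in the kernel messages (8415, 5793, 5795, 6273, 8192, 2067, 5578) are all in range. The last eight characters of this UUID, C33UfFkJ, match the suffix in the link-08 messages, which suggests region 5578 belongs to revolver-mirror1.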



Version-Release number of selected component (if applicable):
2.6.9-50.ELsmp
cmirror-kernel-2.6.9-25.0
Comment 1 Jonathan Earl Brassow 2007-03-24 00:30:59 EDT
'Reason' should be a negative number.  This suggests that the client is
receiving a message from the server that is not the response it is expecting.

Sequence numbers were put in (3/22/2007) to fix this problem.  The
cmirror-kernel package you are using was built 3/14/2007.
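
To illustrate the general idea of sequence numbers, here is a minimal Python sketch of request/response matching. It is not the cmirror-kernel code: the Message and Client names, the field layout, and the -16 error value are all hypothetical. The client tags each request with a sequence number and ignores any incoming message whose number does not match, so a late reply to an earlier request cannot be mistaken for the answer to the current one.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Message:
    seq: int        # sequence number copied from request to response
    kind: str       # e.g. "mark_region" (hypothetical name)
    region: int
    error: int = 0  # 0 on success, negative on failure (illustrative)

class Client:
    def __init__(self):
        self._next_seq = count(1)

    def make_request(self, kind, region):
        """Build a request carrying a fresh sequence number."""
        return Message(next(self._next_seq), kind, region)

    def await_response(self, request, incoming):
        """Return the first message whose seq matches the request, else None."""
        for msg in incoming:
            if msg.seq != request.seq:
                continue  # stale or unrelated reply: skip it
            return msg
        return None

client = Client()
old = client.make_request("mark_region", 8415)
new = client.make_request("mark_region", 2067)

# A late reply to the old request arrives before the reply to the new one.
incoming = [
    Message(old.seq, "mark_region", 8415, error=0),
    Message(new.seq, "mark_region", 2067, error=-16),
]
reply = client.await_response(new, incoming)
print(reply)
```

Without the seq check, the first message in the list would have been taken as the reply to the request for region 2067.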

new -> post
Comment 2 Jonathan Earl Brassow 2007-04-03 16:12:40 EDT
post -> modified
Comment 3 Corey Marthaler 2007-04-12 14:39:37 EDT
Fix verified in cmirror-kernel-2.6.9-30.0.
Comment 5 Alasdair Kergon 2010-04-27 11:00:03 EDT
Assuming this VERIFIED fix got released.  Closing.
Reopen if it's not yet resolved.
