Bug 193596

Summary: node recovery can cause cluster mirror copying to halt
Product: [Retired] Red Hat Cluster Suite
Component: cmirror
Version: 4
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, cfeist, dwysocha, mbroz
Doc Type: Bug Fix
Last Closed: 2008-08-05 21:31:14 UTC

Description Corey Marthaler 2006-05-30 20:46:20 UTC
Description of problem:
This is either a bug with the mirror copying or a bug in the status reporting:

Before the server relocation:

[root@taft-03 ~]# lvs
  LV             VG            Attr   LSize   Origin Snap%  Move Log                 Copy%
  mirror_1_77760 mirror_1_7776 mwi-a- 136.23G                   mirror_1_77760_mlog   0.13


[root@taft-03 ~]# dmsetup status
mirror_1_7776-mirror_1_77760: 0 285696000 mirror 2 253:3 253:4 627/279000 1 AA 3 clustered_disk 253:2 A
mirror_1_7776-mirror_1_77760_mlog: 0 8192 linear
VolGroup00-LogVol01: 0 4063232 linear
VolGroup00-LogVol00: 0 40960000 linear
mirror_1_7776-mirror_1_77760_mimage_1: 0 285696000 linear
mirror_1_7776-mirror_1_77760_mimage_0: 0 285696000 linear


After the server relocation:

[root@taft-03 ~]# lvs
  LV             VG            Attr   LSize   Origin Snap%  Move Log                 Copy%
  mirror_1_77760 mirror_1_7776 mwi-a- 136.23G                   mirror_1_77760_mlog   0.00

[root@taft-03 ~]# dmsetup status
mirror_1_7776-mirror_1_77760: 0 285696000 mirror 2 253:3 253:4 0/279000 1 AA 3 clustered_disk 253:2 A
mirror_1_7776-mirror_1_77760_mlog: 0 8192 linear
VolGroup00-LogVol01: 0 4063232 linear
VolGroup00-LogVol00: 0 40960000 linear
mirror_1_7776-mirror_1_77760_mimage_1: 0 285696000 linear
mirror_1_7776-mirror_1_77760_mimage_0: 0 285696000 linear
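
For reference, the "627/279000" field in the mirror line of the dmsetup status
output is in-sync regions / total regions, which is also what the lvs Copy%
column is derived from; after the relocation it has dropped back to 0/279000
and stops advancing. A quick way to watch whether the resync is actually
moving, using the device name from the output above:

# print the sync ratio once a second; if it stops changing, the copy has stalled
while true; do
    dmsetup status mirror_1_7776-mirror_1_77760 | awk '{print $7}'
    sleep 1
done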


Version-Release number of selected component (if applicable):
[root@taft-03 ~]# rpm -qa  | grep lvm
lvm2-cluster-2.02.06-1.0.RHEL4
lvm2-cluster-debuginfo-2.02.01-1.2.RHEL4
lvm2-2.02.06-3.0.RHEL4


How reproducible:
every time

Comment 1 Corey Marthaler 2006-06-02 19:12:32 UTC
Just a note that I have also seen this where the copy percent restarts and,
instead of getting stuck at 0.00%, it gets stuck elsewhere, for example at 5.08%.

[root@taft-03 ~]# lvs
  LV         VG       Attr   LSize Origin Snap%  Move Log             Copy%
  deanmirror mirror_1 mwi-a- 1.00G                    deanmirror_mlog  16.02

(after recovery)

[root@taft-04 ~]# lvs
  LV         VG       Attr   LSize Origin Snap%  Move Log             Copy%
  deanmirror mirror_1 mwi-a- 1.00G                    deanmirror_mlog   0.00
[root@taft-04 ~]# lvs
  LV         VG       Attr   LSize Origin Snap%  Move Log             Copy%
  deanmirror mirror_1 mwi-a- 1.00G                    deanmirror_mlog   5.08
[root@taft-04 ~]# lvs
  LV         VG       Attr   LSize Origin Snap%  Move Log             Copy%
  deanmirror mirror_1 mwi-a- 1.00G                    deanmirror_mlog   5.08
[STUCK]

Comment 2 Corey Marthaler 2006-06-02 19:34:12 UTC
It turns out that recovering/relocating the mirror server isn't required for
this issue to appear. Any recovery of a node (server or not) can cause this.
Also, it appears that mirrors under 10G are more likely to trigger this.
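
For what it's worth, one way to force a non-server node recovery during the
sync is simply to crash one of the other nodes while Copy% is still climbing,
e.g. (assuming sysrq is enabled on that node):

# on any cluster node that is NOT the mirror log server, while the mirror is syncing:
echo b > /proc/sysrq-trigger    # immediate reboot, no clean shutdown; node gets fenced/recovered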

Comment 3 Jonathan Earl Brassow 2006-06-02 21:07:39 UTC
How did you relocate the server?

In any event, the following works to relocate the server and reproduce the
problem (a scripted sketch of these steps follows the list):
1) create cluster mirror
2) find log server (check log), and run 'lvchange -aln <vg>/<lv>' on it
3) recovery stops
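
A rough scripted version of those steps, using the example VG/LV names from
comment #1 (the log server announcement is the "I'm the cluster mirror log
server" message that shows up in the taft-02 log further down):

# 1) create a small cluster mirror and let it start syncing
lvcreate -m1 -L 1G -n deanmirror mirror_1

# 2) on each node, look for the log server announcement in the kernel log,
#    then deactivate the mirror locally on that node only
dmesg | grep -i "cluster mirror log server"
lvchange -aln mirror_1/deanmirror

# 3) from another node, watch Copy% -- recovery stops
lvs mirror_1/deanmirror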



Comment 4 Jonathan Earl Brassow 2006-06-02 21:55:46 UTC
Before we had the ability to detect whether or not another member node was
recovering, I made sure that only the server would recover the mirror.  However,
this meant that other nodes would receive the signal not to recover.  So, in
this case, when the server failed/left, recovery would stop.

Now that we have the ability to detect remote recovery, I have removed the
stipulation that only the server can recover.  This has fixed the problem for me.

Fix will be in next rev after cmirror-kernel*-2.6.9-4.2.

Comment 5 Corey Marthaler 2006-07-05 16:38:04 UTC
Fix verified. 

However, recovery during cmirror creation still results in bz 195392.

Comment 6 Corey Marthaler 2006-07-12 17:27:16 UTC
I may have spoken too soon.

During a mirror creation, taft-01 hit bz 198563, which caused recovery to take
place, and that in turn caused the mirror copying to halt at 96.06%.


[root@taft-02 ~]# lvs
  LV           VG          Attr   LSize   Origin Snap%  Move Log   Copy%
  mirror_1_340 mirror_1_34 mwi-a- 136.23G                    mirror_1_340_mlog  96.06


[root@taft-02 ~]# dmsetup status
mirror_1_34-mirror_1_340_mlog: 0 8192 linear
mirror_1_34-mirror_1_340_mimage_1: 0 285696000 linear
VolGroup00-LogVol01: 0 4063232 linear
mirror_1_34-mirror_1_340_mimage_0: 0 285696000 linear
VolGroup00-LogVol00: 0 138674176 linear
mirror_1_34-mirror_1_340: 0 285696000 mirror 2 253:3 253:4 268017/279000 1 AA 3 clustered_disk 253:2 A


taft-02:
Jul 12 06:13:08 taft-02 kernel: device-mapper: I'm the cluster mirror log server
for BG1zl4bX
Jul 12 06:13:08 taft-02 kernel: device-mapper: Disk Resume::
Jul 12 06:13:08 taft-02 kernel: device-mapper:   Live nodes        :: 1
Jul 12 06:13:08 taft-02 kernel: device-mapper:   In-Use Regions    :: 0
Jul 12 06:13:08 taft-02 kernel: device-mapper:   Good IUR's        :: 0
Jul 12 06:13:08 taft-02 kernel: device-mapper:   Bad IUR's         :: 0
Jul 12 06:13:08 taft-02 kernel: device-mapper:   Sync count        :: 254424
Jul 12 06:13:08 taft-02 kernel: device-mapper:   Disk Region count ::
18446744073709551615
Jul 12 06:13:08 taft-02 kernel: device-mapper:   Region count      :: 279000
Jul 12 06:13:08 taft-02 kernel: device-mapper:   NOTE:  Mapping has changed.
Jul 12 06:13:08 taft-02 kernel: device-mapper: Marked regions::
Jul 12 06:13:08 taft-02 kernel: device-mapper:   0 - 24575
Jul 12 06:13:08 taft-02 kernel: device-mapper:   Total = 24576
Jul 12 06:13:08 taft-02 dmeventd[5286]: Monitoring mirror device,
mirror_1_34-mirror_1_340 for events
Jul 12 06:13:08 taft-02 kernel: device-mapper: Out-of-sync regions::
Jul 12 06:13:08 taft-02 kernel: device-mapper:   0 - 24575
Jul 12 06:13:08 taft-02 kernel: device-mapper:   Total = 24576
Jul 12 06:13:26 taft-02 kernel: CMAN: node taft-01 has been removed from the
cluster : No response to messages
Jul 12 06:13:27 taft-02 fenced[4328]: fencing deferred to taft-03
Jul 12 06:14:07 taft-02 kernel: device-mapper: A cluster mirror log member has
failed.
Jul 12 06:14:07 taft-02 kernel: device-mapper: Disk Resume::
Jul 12 06:14:07 taft-02 kernel: device-mapper:   Live nodes        :: 2
Jul 12 06:14:07 taft-02 kernel: device-mapper:   In-Use Regions    :: 2
Jul 12 06:14:07 taft-02 kernel: device-mapper:   Good IUR's        :: 2
Jul 12 06:14:07 taft-02 kernel: device-mapper:   Bad IUR's         :: 0
Jul 12 06:14:07 taft-02 kernel: device-mapper:   Sync count        :: 255053
Jul 12 06:14:07 taft-02 kernel: device-mapper:   Disk Region count :: 279000
Jul 12 06:14:07 taft-02 kernel: device-mapper:   Region count      :: 279000
Jul 12 06:14:07 taft-02 kernel: device-mapper: Marked regions::
Jul 12 06:14:07 taft-02 kernel: device-mapper:   627 - 24575
Jul 12 06:14:07 taft-02 kernel: device-mapper:   Total = 23949
Jul 12 06:14:07 taft-02 kernel: device-mapper: Out-of-sync regions::
Jul 12 06:14:07 taft-02 kernel: device-mapper:   629 - 24575
Jul 12 06:14:07 taft-02 kernel: device-mapper:   Total = 23947
device-mapper: Error (-22) while processing request (type = 5)
Jul 12 06:16:43 taft-02 kernel: device-mapper: Request to remove unrecorded
region user (3/13590)
Jul 12 06:16:43 taft-02 kernel: device-mapper: Error (-22) while processing
request (type = 5)



taft-03:
Jul 12 06:12:29 taft-03 dmeventd[4754]: Monitoring mirror device,
mirror_1_34-mirror_1_340 for events
Jul 12 06:12:42 taft-03 kernel: CMAN: node taft-01 has been removed from the
cluster : No response to messages
Jul 12 06:13:13 taft-03 fenced[4250]: taft-01 not a cluster member after 30 sec
post_fail_delay
Jul 12 06:13:13 taft-03 fenced[4250]: fencing node "taft-01"
Jul 12 06:13:13 taft-03 fenced[4250]: fence "taft-01" success
Jul 12 06:13:19 taft-03 kernel: device-mapper: A cluster mirror log member has
failed.



taft-04:
CMAN: removing node taft-01 from the cluster : No response to messages
Jul 12 07:21:11 taft-04 kernel: CMAN: removing node taft-01 from the cluster :
No response to messages
Jul 12 07:21:12 taft-04 fenced[4166]: fencing deferred to taft-03
Jul 12 07:21:57 taft-04 dmeventd[4659]: Monitoring mirror device,
mirror_1_34-mirror_1_340 for events
device-mapper: an error occured on the server while processing our request
Jul 12 07:24:28 taft-04 kernel: device-mapper: an error occured on the server
while processing our request



Version:
[root@taft-02 ~]# rpm -q lvm2
lvm2-2.02.06-4.0.RHEL4
[root@taft-02 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.06-4.0.RHEL4
[root@taft-02 ~]# rpm -q cmirror
cmirror-1.0.1-0
[root@taft-02 ~]# rpm -q cmirror-kernel
cmirror-kernel-2.6.9-10.0
[root@taft-02 ~]# rpm -q device-mapper
device-mapper-1.02.07-3.0.RHEL4
[root@taft-02 ~]# uname -ar
Linux taft-02 2.6.9-39.1.ELsmp #1 SMP Fri Jun 16 16:47:43 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux



Comment 7 Corey Marthaler 2006-07-12 19:26:41 UTC
Just a note that once I brought the recovered node back into the cluster the
syncing started again.
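
For completeness, getting the sync moving again was just a matter of rejoining
the cluster and restarting clustered LVM on the recovered node, roughly along
these lines (standard RHEL4 cluster suite init scripts, adjust to the local
setup):

# on the rebooted/fenced node
service ccsd start
service cman start
service fenced start
service clvmd start

# then check the mirror again from any node
lvs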

Comment 8 Kiersten (Kerri) Anderson 2006-09-22 16:56:52 UTC
Devel ACK.

Comment 9 Corey Marthaler 2006-11-08 23:47:51 UTC
Just a note to reiterate comment #7: I am still able to reproduce this. The copy
percent stops when a node is recovered and stays at that percentage until that
node is brought back into the cluster and clvmd is once again started.

Comment 10 Jonathan Earl Brassow 2006-11-29 21:19:16 UTC
I am no longer able to reproduce this.

Can you recreate with latest rpms >= 11/29/2006?


Comment 11 Corey Marthaler 2006-11-30 00:03:52 UTC
I have been unable to reproduce this issue with the latest version of
cmirror-kernel, marking verified.

Comment 12 Chris Feist 2008-08-05 21:31:14 UTC
Closing as this has been fixed in the current (4.7) release.