Bug 225337

Summary: conversion of mirrors can cause the sync percent to get stuck at different spots below 100%
Product: [Retired] Red Hat Cluster Suite
Component: cmirror
Version: 4
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, dwysocha, jbrassow, mbroz, prockai
Doc Type: Bug Fix
Last Closed: 2010-04-27 14:57:34 UTC

Description Corey Marthaler 2007-01-29 22:59:03 UTC
Description of problem:
I had a syncing cmirror and did a quick reboot of one of the legs (so that it
was only off for a moment before coming back up). That caused the mirror
syncing to halt instead of failing and completely down-converting the mirror.


[root@link-08 ~]# lvs -a -o +devices
  LV                 VG    Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            corey mwi-a- 2.00G                    mirror1_mlog  43.55 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] corey iwi-ao 2.00G                                        /dev/sda1(0)
  [mirror1_mimage_1] corey iwi-ao 2.00G                                        /dev/sdh1(0)
  [mirror1_mlog]     corey lwi-ao 4.00M                                        /dev/sdg1(0)

# Here's where the quick failure happened.

[root@link-08 ~]# lvs -a -o +devices
  LV                 VG    Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            corey mwi-a- 2.00G                    mirror1_mlog  92.38 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] corey iwi-ao 2.00G                                        /dev/sda1(0)
  [mirror1_mimage_1] corey iwi-ao 2.00G                                        /dev/sdh1(0)
  [mirror1_mlog]     corey lwi-ao 4.00M                                        /dev/sdg1(0)
[root@link-08 ~]#
[root@link-08 ~]#
[root@link-08 ~]# lvs -a -o +devices
  LV                 VG    Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            corey mwi-a- 2.00G                    mirror1_mlog  99.41 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] corey iwi-ao 2.00G                                        /dev/sda1(0)
  [mirror1_mimage_1] corey iwi-ao 2.00G                                        /dev/sdh1(0)
  [mirror1_mlog]     corey lwi-ao 4.00M                                        /dev/sdg1(0)
[root@link-08 ~]# lvs -a -o +devices
  LV                 VG    Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            corey mwi-a- 2.00G                    mirror1_mlog  99.41 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] corey iwi-ao 2.00G                                        /dev/sda1(0)
  [mirror1_mimage_1] corey iwi-ao 2.00G                                        /dev/sdh1(0)
  [mirror1_mlog]     corey lwi-ao 4.00M                                        /dev/sdg1(0)
[root@link-08 ~]# lvs -a -o +devices
  LV                 VG    Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            corey mwi-a- 2.00G                    mirror1_mlog  99.41 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] corey iwi-ao 2.00G                                        /dev/sda1(0)
  [mirror1_mimage_1] corey iwi-ao 2.00G                                        /dev/sdh1(0)
  [mirror1_mlog]     corey lwi-ao 4.00M                                        /dev/sdg1(0)


Console:
Jan 29 17:10:37 link-07 kernel: device-mapper: Write error during recovery (error = 0x1)
Jan 29 17:10:37 link-07 kernel: device-mapper: incrementing error_count on 253:4
Jan 29 17:10:37 link-07 kernel: device-mapper: recovery failed on region 3243




Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.20-1.el4
lvm2-2.02.20-1.el4
cmirror-kernel-smp-2.6.9-18.6
device-mapper-1.02.16-1.el4

Comment 1 Corey Marthaler 2007-02-01 16:43:16 UTC
This is reproducible. Caused it to happen again after rebooting the cluster due
to bz 199433.

Comment 2 Corey Marthaler 2007-02-01 16:45:18 UTC
[root@salem ~]# lvs -a -o +devices
  LV               VG   Attr   LSize  Origin Snap%  Move Log        Copy%  Devices
  kool             new  -wi-a- 10.00G                                      /dev/sdc(2560)
  [kool_mimage_0]  new  vwi-a- 10.00G
  kool_mimage_1    new  -wi-a- 10.00G                                      /dev/sde(0)
  kool_mlog        new  -wi-a-  4.00M                                      /dev/sdd(0)
  salem            new  mwi-a- 10.00G                    salem_mlog   7.70 salem_mimage_0(0),salem_mimage_1(0)
  [salem_mimage_0] new  iwi-ao 10.00G                                      /dev/sdc(0)
  [salem_mimage_1] new  iwi-ao 10.00G                                      /dev/sde(2560)
  [salem_mlog]     new  lwi-ao  4.00M                                      /dev/sdd(1)
[root@salem ~]#
[root@salem ~]#
[root@salem ~]# lvconvert -m 0 /dev/new/kool
  Logical volume kool is already not mirrored.
[root@salem ~]# lvconvert -m 1 /dev/new/kool
  Internal error: Duplicate LV name kool_mlog detected in new.
  Failed to create mirror log.


Comment 3 Corey Marthaler 2007-02-01 21:40:51 UTC
This can apparently happen just by doing mirror up- and down-conversions.

[root@link-02 ~]# lvs -a -o +devices
  LV                 VG   Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            vg   Mwi-so 5.00G                    mirror1_mlog  99.92 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] vg   iwi-so 5.00G                                        /dev/sdh1(1280)
  [mirror1_mimage_1] vg   iwi-so 5.00G                                        /dev/sdb1(0)
  [mirror1_mlog]     vg   lwi-so 4.00M                                        /dev/sda1(1283)
  mirror2            vg   Mwi-so 5.00G                    mirror2_mlog  99.92 mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] vg   iwi-so 5.00G                                        /dev/sdh1(2560)
  [mirror2_mimage_1] vg   iwi-so 5.00G                                        /dev/sdc1(0)
  [mirror2_mlog]     vg   lwi-so 4.00M                                        /dev/sda1(1284)
  mirror3            vg   Mwi-so 5.00G                    mirror3_mlog 100.00 mirror3_mimage_0(0),mirror3_mimage_1(0)
  [mirror3_mimage_0] vg   iwi-so 5.00G                                        /dev/sdh1(3840)
  [mirror3_mimage_1] vg   iwi-so 5.00G                                        /dev/sdd1(0)
  [mirror3_mlog]     vg   lwi-so 4.00M                                        /dev/sda1(1282)


Comment 4 Jonathan Earl Brassow 2007-02-02 16:51:44 UTC
I suspect that in the quick failure case, only a recovery write failed.

It seems there is currently no mechanism to raise an event (which is what
causes dmeventd to reconfigure the mirror) when an error happens during
recovery.  This is probably by design, and probably the right thing to do.
This is certainly what is causing your results.

If you lvchange -an; lvchange -ay, the problem will resolve itself.

This is less than ideal though...

We keep track of where we are in the recovery process through a variable called
'sync_search'.  This never gets set back to zero unless the mirror table is
reloaded (lvchange -an; lvchange -ay).  It might be nice to reset 'sync_search'
if it is >= 'region_count' and we receive a 'get_resync_work' request, since we
won't get those requests if the client thinks the mirror is in-sync.

I'll have to give the above more thought, and this would definitely involve
kernel changes...  I am certain that this problem affects single-machine
mirroring too.

Careful consideration will need to be made for mirrors that wish to ignore
failures (i.e. pvmove).


Comment 5 Jonathan Earl Brassow 2007-02-02 17:23:12 UTC
Wait... 'sync_search' is in the logging code.  This allows me to make the change
without affecting the kernel.  The change is:

static int _core_get_resync_work(struct log_c *lc, region_t *region)
{
	if (lc->sync_search >= lc->region_count) {
		/*
		 * FIXME: pvmove is not supported yet, but when it is,
		 * an audit of sync_count changes will need to be made
		 */
		if (lc->sync_count < lc->region_count) {
			/* Regions are still out of sync; restart the search. */
			lc->sync_search = 0;
		} else {
			return 0;
		}
	}
...


Comment 6 Corey Marthaler 2007-04-11 18:15:42 UTC
This has not been reproduced after running many cmirror up- and down-convert
operations. Marking VERIFIED.

Comment 8 Alasdair Kergon 2010-04-27 14:57:34 UTC
Assuming this VERIFIED fix got released.  Closing.
Reopen if it's not yet resolved.