Bug 510526

Summary: RHEL5 cmirror tracker: core log mirrors appear to unnecessarily resync twice after upconvert following leg failure
Product: Red Hat Enterprise Linux 5 Reporter: Corey Marthaler <cmarthal>
Component: lvm2Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.4CC: agk, ccaulfie, dwysocha, edamato, heinzm, jbrassow, mbroz, prockai
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-03-30 09:01:40 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
log from taft-01
none
log from taft-02
none
log from taft-03
none
log from taft-04 none

Description Corey Marthaler 2009-07-09 16:26:28 UTC
Description of problem:
Although the copy percentage isn't stuck, I've noticed times when one or many nodes in the cluster believe the mirror to be fully synced yet another node(s) is still working on the resyncing it.

# From Helter_skelter output:
Waiting until all mirrors become fully syncd...
        0/1 mirror(s) are fully synced: ( 1=12.00% )
        0/1 mirror(s) are fully synced: ( 1=43.50% )
        0/1 mirror(s) are fully synced: ( 1=78.33% )
        1/1 mirror(s) are fully synced: ( 1=100.00% )

# At this point the mirror is fully synced according to taft-04, yet helter_skelter checks all nodes and finds that taft-01 thinks it's only at 34% finished and is still working on getting it finished.


syncd_primary_core_4legs_1 is not fully sync'ed on taft-01 (34.92%)
percent on taft-01 is now: 36.50
percent on taft-01 is now: 39.83
percent on taft-01 is now: 43.92
percent on taft-01 is now: 45.25
percent on taft-01 is now: 48.08
percent on taft-01 is now: 52.17
percent on taft-01 is now: 56.25
percent on taft-01 is now: 58.33
percent on taft-01 is now: 62.33
percent on taft-01 is now: 66.25

I'll attempt to debug this more and provide better details.

Version-Release number of selected component (if applicable):
2.6.18-150.el5

lvm2-2.02.46-8.el5    BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009

Running custom versions of cmirror/cmirror-kmod built by Brassow

How reproducible:
Every so often

Steps to Reproduce:
Up convert a mirror after a successful leg failure and recovery.

Comment 1 Corey Marthaler 2009-07-09 19:47:17 UTC
Looks like when this case hits, the mirror is actually resyncing twice. All nodes appear to see the same thing, all get to 100% synced and then all restart syncing and got up to 100% for the 2nd time.

Comment 2 Corey Marthaler 2009-07-10 14:16:21 UTC
The case that appears to hit this the easiest is:

Scenario: Kill primary leg of synced core log 4 leg mirror(s)                   

To execute just that scenario use the -e flag:

./helter_skelter -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.4 -R ../../var/share/resource_files/taft.xml -e kill_primary_synced_core_log_4_legs

I'm attaching the debugging logs from all four taft nodes just as they finished resyncing both mirrors and started syncing yet again.

Comment 3 Corey Marthaler 2009-07-10 14:16:55 UTC
Created attachment 351265 [details]
log from taft-01

Comment 4 Corey Marthaler 2009-07-10 14:17:26 UTC
Created attachment 351266 [details]
log from taft-02

Comment 5 Corey Marthaler 2009-07-10 14:17:55 UTC
Created attachment 351267 [details]
log from taft-03

Comment 6 Corey Marthaler 2009-07-10 14:18:34 UTC
Created attachment 351269 [details]
log from taft-04

Comment 7 Corey Marthaler 2009-07-10 22:02:43 UTC
Jon, 

I attempted this scenario with single machine mirrors and was unable to reproduce it. It appears to only exist with cluster mirrors.

Comment 8 Corey Marthaler 2009-07-15 19:10:30 UTC
Weird.

Waiting until all mirrors become fully syncd...
        0/3 mirror(s) are fully synced: ( 1=8.92% 2=13.42% 3=7.58% )
        0/3 mirror(s) are fully synced: ( 1=22.00% 2=28.83% 3=20.50% )
        0/3 mirror(s) are fully synced: ( 1=39.00% 2=43.92% 3=39.58% )
        0/3 mirror(s) are fully synced: ( 1=55.08% 2=57.83% 3=55.58% )
        0/3 mirror(s) are fully synced: ( 1=68.17% 2=72.50% 3=68.67% )
        0/3 mirror(s) are fully synced: ( 1=82.67% 2=87.92% 3=81.17% )
        1/3 mirror(s) are fully synced: ( 1=93.25% 2=100.00% 3=93.92% )
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=2.00% 3=100.00% )
        1/3 mirror(s) are fully synced: ( 1=100.00% 2=17.50% 3=11.58% )
        0/3 mirror(s) are fully synced: ( 1=8.75% 2=25.58% 3=19.17% )
        0/3 mirror(s) are fully synced: ( 1=23.75% 2=36.58% 3=36.50% )
        0/3 mirror(s) are fully synced: ( 1=24.25% 2=54.00% 3=54.50% )
        0/3 mirror(s) are fully synced: ( 1=24.25% 2=73.58% 3=74.42% )
        0/3 mirror(s) are fully synced: ( 1=24.25% 2=92.58% 3=94.75% )
        2/3 mirror(s) are fully synced: ( 1=24.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=24.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=24.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=49.33% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=74.83% 2=100.00% 3=100.00% )

Comment 9 Corey Marthaler 2009-07-15 20:33:32 UTC
specific cmds to repo this:

1. lvcreate --corelog -m 3 -n syncd_primary_core_4legs_1 -L 600M helter_skelter /dev/sdd1:0-1000 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdh1:0-1000

2. wait until mirror is synced

3. kill primary leg on all nodes (echo offline > /sys/block/sdd/device/state)

4. cause down convert (dd if=/dev/zero of=/dev/VG/mirror count=2)

5. re-enable the device on all nodes (echo running > /sys/block/sdd/device/state)

6. re-create pv (pvcreate /dev/sdd1)

7. re-extend the volume group (vgextend helter_skelter /dev/sdd1)

8. lvconvert --corelog -m 3 -b helter_skelter/syncd_primary_core_4legs_1 /dev/sdd1:0-1000 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdh1:0-1000

9. Watch it finish, and then start over syncing


[root@taft-01 ~]# lvs -a -o +devices
  LV                                    VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices                                                                                                                                                    
  LogVol00                              VolGroup00     -wi-ao  58.38G                                       /dev/sda2(0)                                                                                                                                               
  LogVol01                              VolGroup00     -wi-ao   9.75G                                       /dev/sda2(1868)                                                                                                                                            
  syncd_primary_core_4legs_1            helter_skelter mwi-a- 600.00M                         99.33         syncd_primary_core_4legs_1_mimage_1(0),syncd_primary_core_4legs_1_mimage_2(0),syncd_primary_core_4legs_1_mimage_3(0),syncd_primary_core_4legs_1_mimage_4(0)
  [syncd_primary_core_4legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                       /dev/sdf1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_2] helter_skelter Iwi-ao 600.00M                                       /dev/sde1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_3] helter_skelter Iwi-ao 600.00M                                       /dev/sdh1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_4] helter_skelter Iwi-ao 600.00M                                       /dev/sdd1(0)                                                                                                                                               
[root@taft-01 ~]# lvs -a -o +devices
^[[A
  LV                                    VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices                                                                                                                                                    
  LogVol00                              VolGroup00     -wi-ao  58.38G                                       /dev/sda2(0)                                                                                                                                               
  LogVol01                              VolGroup00     -wi-ao   9.75G                                       /dev/sda2(1868)                                                                                                                                            
  syncd_primary_core_4legs_1            helter_skelter mwi-a- 600.00M                          0.00         syncd_primary_core_4legs_1_mimage_1(0),syncd_primary_core_4legs_1_mimage_2(0),syncd_primary_core_4legs_1_mimage_3(0),syncd_primary_core_4legs_1_mimage_4(0)
  [syncd_primary_core_4legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                       /dev/sdf1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_2] helter_skelter Iwi-ao 600.00M                                       /dev/sde1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_3] helter_skelter Iwi-ao 600.00M                                       /dev/sdh1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_4] helter_skelter Iwi-ao 600.00M                                       /dev/sdd1(0)                                                                                                                                               
[root@taft-01 ~]# lvs -a -o +devices
  LV                                    VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices                                                                                                                                                    
  LogVol00                              VolGroup00     -wi-ao  58.38G                                       /dev/sda2(0)                                                                                                                                               
  LogVol01                              VolGroup00     -wi-ao   9.75G                                       /dev/sda2(1868)                                                                                                                                            
  syncd_primary_core_4legs_1            helter_skelter mwi-a- 600.00M                          0.67         syncd_primary_core_4legs_1_mimage_1(0),syncd_primary_core_4legs_1_mimage_2(0),syncd_primary_core_4legs_1_mimage_3(0),syncd_primary_core_4legs_1_mimage_4(0)
  [syncd_primary_core_4legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                       /dev/sdf1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_2] helter_skelter Iwi-ao 600.00M                                       /dev/sde1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_3] helter_skelter Iwi-ao 600.00M                                       /dev/sdh1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_4] helter_skelter Iwi-ao 600.00M                                       /dev/sdd1(0)

Comment 10 Jonathan Earl Brassow 2009-07-15 21:35:59 UTC
Can the bug be reproduced without the "kill primary leg" part?

Comment 11 Corey Marthaler 2009-07-16 19:11:37 UTC
Well what do you know, you can.

1. create mirror
lvcreate --corelog -m 3 -n syncd_primary_core_4legs_1 -L 600M helter_skelter
/dev/sdd1:0-1000 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdh1:0-1000

2. wait until mirror is synced

3. down convert from 4->3 legged mirror
lvconvert --corelog -m 2 helter_skelter/syncd_primary_core_4legs_1

4. Reconvert from 3->4 legged mirror and specify the devices
lvconvert --corelog -m 3 helter_skelter/syncd_primary_core_4legs_1 /dev/sdd1:0-1000 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdh1:0-1000 &

5. Watch it finish, and then start over syncing


[root@taft-01 ~]# lvs -a -o +devices
  LV                                    VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices                                                                                                                                                    
  LogVol00                              VolGroup00     -wi-ao  58.38G                                       /dev/sda2(0)                                                                                                                                               
  LogVol01                              VolGroup00     -wi-ao   9.75G                                       /dev/sda2(1868)                                                                                                                                            
  syncd_primary_core_4legs_1            helter_skelter mwi-a- 600.00M                        100.00         syncd_primary_core_4legs_1_mimage_0(0),syncd_primary_core_4legs_1_mimage_1(0),syncd_primary_core_4legs_1_mimage_2(0),syncd_primary_core_4legs_1_mimage_3(0)
  [syncd_primary_core_4legs_1_mimage_0] helter_skelter iwi-ao 600.00M                                       /dev/sdd1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_1] helter_skelter iwi-ao 600.00M                                       /dev/sdf1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_2] helter_skelter iwi-ao 600.00M                                       /dev/sde1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_3] helter_skelter iwi-ao 600.00M                                       /dev/sdh1(0)                                                                                                                                               

[root@taft-01 ~]# lvs -a -o +devices  
  Logical volume syncd_primary_core_4legs_1 converted.
  LV                                    VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices                                                                                                                                                    
  LogVol00                              VolGroup00     -wi-ao  58.38G                                       /dev/sda2(0)                                                                                                                                               
  LogVol01                              VolGroup00     -wi-ao   9.75G                                       /dev/sda2(1868)                                                                                                                                            
  syncd_primary_core_4legs_1            helter_skelter mwi-a- 600.00M                          4.00         syncd_primary_core_4legs_1_mimage_0(0),syncd_primary_core_4legs_1_mimage_1(0),syncd_primary_core_4legs_1_mimage_2(0),syncd_primary_core_4legs_1_mimage_3(0)
  [syncd_primary_core_4legs_1_mimage_0] helter_skelter Iwi-ao 600.00M                                       /dev/sdd1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                       /dev/sdf1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_2] helter_skelter Iwi-ao 600.00M                                       /dev/sde1(0)                                                                                                                                               
  [syncd_primary_core_4legs_1_mimage_3] helter_skelter Iwi-ao 600.00M                                       /dev/sdh1(0)

Comment 12 Jonathan Earl Brassow 2009-07-28 14:33:51 UTC
Doesn't appear to be a problem in the cmirror* packages, but rather LVM that is loosing tract of the 'sync' status when compressing the different mirror layers after the initial sync.

Comment 13 Corey Marthaler 2009-08-31 19:11:01 UTC
Changing the component based on comment #12.

Also, I reproduced this again today.

Comment 14 Corey Marthaler 2009-10-28 19:10:55 UTC
Repo'ed this again today. I'll comment this test case out until a fix for this is proposed.

2.6.18-160.el5

lvm2-2.02.46-10.el5    BUILT: Fri Sep 18 09:38:06 CDT 2009
lvm2-cluster-2.02.46-10.el5    BUILT: Fri Sep 18 09:39:48 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.39-2.el5    BUILT: Mon Jul 27 15:39:05 CDT 2009
kmod-cmirror-0.1.22-1.el5    BUILT: Mon Jul 27 15:28:46 CDT 2009

Comment 15 Corey Marthaler 2009-12-22 19:40:50 UTC
This issue is still in the latest build.

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 43.33% )
   0/1 mirror(s) are fully synced: ( 71.50% )
   0/1 mirror(s) are fully synced: ( 96.92% )
   1/1 mirror(s) are fully synced: ( 100.00% )
syncd_secondary_core_4legs_1 is not fully sync'ed on taft-01 (2.00%)

Comment 17 Jonathan Earl Brassow 2010-01-26 21:25:03 UTC
I don't think this bug is hugely important, but I'd still like to try to get it knocked off in 5.5... it will benefit 6.0 too, as I'm sure the bug exists there too.

Comment 18 Jonathan Earl Brassow 2010-02-03 14:57:10 UTC
The LCK_MIRROR_NOSYNC_MODE flag was not being set when collapsing up-converted mirrors.  This was not a problem for disk logs (because they rely on the sync information in the log), but core-logged mirrors would restart syncing - after syncing all of the mirrors once already.  Setting the flag by way of 'init_mirror_in_sync' in 'collapse_mirrored_lv' also was not working, suggesting that it was being called by disk-log mirrors, but not core-log mirrors.  The decision to collapse a mirror was being taken based on the results of 'find_temporary_mirror' - which checks for stacked mirrors.  Looking at the runtime layout of the structures showed that  mirrors with a core log are never stacked in the first place.  The logic in '_lvconvert_mirrors' confirms that core log mirrors are treated differently that disk log mirrors - and probably incorrectly.  Fixing the problem isn't as simple though.  We want to use mirror stacking to up-convert, but we don't have a good way of keeping the original mirror in-sync due to the granularity of the 'mirror_in_sync()' global (it is all or nothing, instead of LV based).

I'll look into stopping the second resync, but the stacking issue with corelog mirrors will be separate.

Comment 19 Jonathan Earl Brassow 2010-02-08 15:25:03 UTC
-checked-in upstream.

Comment 20 Milan Broz 2010-02-08 16:21:10 UTC
Fix in lvm2-2.02.56-7.el5.

Comment 22 Corey Marthaler 2010-02-15 23:26:55 UTC
Fix verified in lvm2-2.02.56-8.el5/lvm2-cluster-2.02.56-7.el5.

Comment 24 errata-xmlrpc 2010-03-30 09:01:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0298.html