Bug 491340

Summary: RHEL5 cmirror tracker: copy percentage can get stuck during sync attempt (creation/conversion)
Product: Red Hat Enterprise Linux 5 Reporter: Corey Marthaler <cmarthal>
Component: cmirrorAssignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: high    
Version: 5.3CC: agk, ccaulfie, cfeist, coughlan, dwysocha, edamato, heinzm, mbroz
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 11:05:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
log from taft-01
none
log from taft-02
none
log from taft-03
none
log from taft-04
none
log from taft-01
none
log from taft-02
none
log from taft-03
none
log from taft-04 none

Description Corey Marthaler 2009-03-20 14:21:47 UTC
Description of problem:
This is a dup of bug 464550 which was mistakenly closed do to the errata
process. This bug is not high priority since there is a potential work around (a write to the stuck mirror), but that doesn't mean that the bug isn't still there and the work around doesn't always work.

Comment 1 Corey Marthaler 2009-03-20 15:08:29 UTC
Just hit this again in fact. Note how the 4th mirror was stuck at 0.05% until the hack was attempted.

Waiting until all mirrors become fully syncd...              
   0/8 mirror(s) are fully synced: ( 1=18.75% 2=10.55% 3=9.62% 4=0.05% 5=4.88% 6=6.79% 7=5.08% 8=2.78% )
   0/8 mirror(s) are fully synced: ( 1=20.56% 2=11.96% 3=11.47% 4=0.05% 5=6.64% 6=9.08% 7=7.32% 8=4.69% )
   0/8 mirror(s) are fully synced: ( 1=22.80% 2=14.06% 3=13.92% 4=0.05% 5=8.98% 6=12.01% 7=10.21% 8=7.13% )
   0/8 mirror(s) are fully synced: ( 1=25.00% 2=15.92% 3=16.16% 4=0.05% 5=11.23% 6=15.14% 7=13.28% 8=9.62% )
   0/8 mirror(s) are fully synced: ( 1=26.86% 2=17.77% 3=18.36% 4=0.05% 5=13.13% 6=17.68% 7=15.77% 8=11.67% )
   0/8 mirror(s) are fully synced: ( 1=28.61% 2=19.19% 3=20.61% 4=0.05% 5=14.75% 6=20.17% 7=18.46% 8=13.62% )
sync percent for sync_check_4 hasn't changed in the past minute
HACKING AROUND BZ 464550 by dd'ing to this stale mirror in order to restart the sync
   0/8 mirror(s) are fully synced: ( 1=31.64% 2=22.36% 3=23.73% 4=0.05% 5=17.53% 6=24.37% 7=22.71% 8=16.94% )
   0/8 mirror(s) are fully synced: ( 1=33.30% 2=24.22% 3=25.73% 4=1.71% 5=19.19% 6=26.51% 7=24.76% 8=18.60% )

Comment 2 Corey Marthaler 2009-04-02 14:30:35 UTC
Hit this again after 14 iterations of helter_skelter (instead of mirror_sanity) with 4-way mirrors.

Waiting until all mirrors become fully syncd...
        0/3 mirror(s) are fully synced: ( 1=0.25% 2=7.17% 3=4.17% )
        0/3 mirror(s) are fully synced: ( 1=0.25% 2=23.92% 3=20.42% )
        0/3 mirror(s) are fully synced: ( 1=0.25% 2=40.92% 3=37.50% )
        0/3 mirror(s) are fully synced: ( 1=0.25% 2=56.08% 3=53.00% )
        0/3 mirror(s) are fully synced: ( 1=0.25% 2=69.50% 3=66.25% )
        0/3 mirror(s) are fully synced: ( 1=0.25% 2=84.25% 3=80.92% )
        0/3 mirror(s) are fully synced: ( 1=0.25% 2=98.33% 3=95.33% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
After 10 minutes the mirror(s) are still not in sync                   
HACK AROUND BZ 464550 by dd'ing to the stuck mirrors in order to restart the sync process
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
After 15 minutes the mirror(s) are still not in sync
ADDITIONAL HACK AROUND BZ 474174 by suspending/resuming the dm devices in order to restart the sync process
dmsetup suspend helter_skelter-syncd_log_4legs_1


Taft-01:
Apr  2 05:52:19 taft-01 clogd[8553]: 192:22) SEQ#=12885614, UUID=htC68eIU, TYPE=DM_CLOG_CLEAR_REGION, ORIG=4, RESP= 
Apr  2 05:52:19 taft-01 clogd[8553]: 193:23) SEQ#=12885615, UUID=htC68eIU, TYPE=DM_CLOG_FLUSH, ORIG=4, RESP=NO 
Apr  2 05:52:19 taft-01 clogd[8553]: 194:24) SEQ#=12885616, UUID=htC68eIU, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=4, RE 
Apr  2 05:52:19 taft-01 clogd[8553]: 195:25) SEQ#=12885619, UUID=0ZZDcuEW, TYPE=DM_CLOG_SET_REGION_SYNC, ORIG=4, RE 
Apr  2 05:52:19 taft-01 clogd[8553]: 196:26) SEQ#=12885620, UUID=0ZZDcuEW, TYPE=DM_CLOG_CLEAR_REGION, ORIG=4, RESP= 
Apr  2 05:52:19 taft-01 clogd[8553]: 197:27) SEQ#=12885621, UUID=0ZZDcuEW, TYPE=DM_CLOG_FLUSH, ORIG=4, RESP=NO 
Apr  2 05:52:19 taft-01 clogd[8553]: 198:28) SEQ#=12885622, UUID=0ZZDcuEW, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=4, RE 
Apr  2 05:52:19 taft-01 clogd[8553]: 199:29) SEQ#=12885625, UUID=htC68eIU, TYPE=DM_CLOG_SET_REGION_SYNC, ORIG=4, RE 
Apr  2 05:52:38 taft-01 openais[8258]: [TOTEM] Retransmit List: 3af62  
Apr  2 05:52:58 taft-01 lvm[8645]: helter_skelter-syncd_log_4legs_2 is now in-sync 
Apr  2 05:52:58 taft-01 lvm[8645]: helter_skelter-syncd_log_4legs_3 is now in-sync 
Apr  2 05:55:07 taft-01 openais[8258]: [TOTEM] Retransmit List: 3ce3a  
Apr  2 06:02:18 taft-01 openais[8258]: [TOTEM] Retransmit List: 3f584  
Apr  2 06:03:44 taft-01 openais[8258]: [TOTEM] Retransmit List: 3fd53  
Apr  2 06:10:37 taft-01 openais[8258]: [TOTEM] Retransmit List: 423b9  


[root@taft-01 ~]# lvs -a -o +devices
  LV                           VG             Attr   LSize   Origin Snap%  Move Log                    Copy%  Convert Devices                                                                                                                
  LogVol00                     VolGroup00     -wi-ao  58.38G                                                          /dev/sda2(0)                                                                                                           
  LogVol01                     VolGroup00     -wi-ao   9.75G                                                          /dev/sda2(1868)                                                                                                        
  syncd_log_4legs_1            helter_skelter mwi-ao 600.00M                    syncd_log_4legs_1_mlog   0.00         syncd_log_4legs_1_mimage_0(0),syncd_log_4legs_1_mimage_1(0),syncd_log_4legs_1_mimage_2(0),syncd_log_4legs_1_mimage_3(0)
  [syncd_log_4legs_1_mimage_0] helter_skelter Iwi-ao 600.00M                                                          /dev/sde1(0)                                                                                                           
  [syncd_log_4legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                                          /dev/sdh1(0)                                                                                                           
  [syncd_log_4legs_1_mimage_2] helter_skelter Iwi-ao 600.00M                                                          /dev/sdf1(0)                                                                                                           
  [syncd_log_4legs_1_mimage_3] helter_skelter Iwi-ao 600.00M                                                          /dev/sdg1(0)                                                                                                           
  [syncd_log_4legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                          /dev/sdc1(0)                                                                                                           
  syncd_log_4legs_2            helter_skelter mwi-ao 600.00M                    syncd_log_4legs_2_mlog 100.00         syncd_log_4legs_2_mimage_0(0),syncd_log_4legs_2_mimage_1(0),syncd_log_4legs_2_mimage_2(0),syncd_log_4legs_2_mimage_3(0)
  [syncd_log_4legs_2_mimage_0] helter_skelter iwi-ao 600.00M                                                          /dev/sde1(150)                                                                                                         
  [syncd_log_4legs_2_mimage_1] helter_skelter iwi-ao 600.00M                                                          /dev/sdh1(150)                                                                                                         
  [syncd_log_4legs_2_mimage_2] helter_skelter iwi-ao 600.00M                                                          /dev/sdf1(150)                                                                                                         
  [syncd_log_4legs_2_mimage_3] helter_skelter iwi-ao 600.00M                                                          /dev/sdg1(150)                                                                                                         
  [syncd_log_4legs_2_mlog]     helter_skelter lwi-ao   4.00M                                                          /dev/sdc1(1)                                                                                                           
  syncd_log_4legs_3            helter_skelter mwi-ao 600.00M                    syncd_log_4legs_3_mlog 100.00         syncd_log_4legs_3_mimage_0(0),syncd_log_4legs_3_mimage_1(0),syncd_log_4legs_3_mimage_2(0),syncd_log_4legs_3_mimage_3(0)
  [syncd_log_4legs_3_mimage_0] helter_skelter iwi-ao 600.00M                                                          /dev/sde1(300)                                                                                                         
  [syncd_log_4legs_3_mimage_1] helter_skelter iwi-ao 600.00M                                                          /dev/sdh1(300)                                                                                                         
  [syncd_log_4legs_3_mimage_2] helter_skelter iwi-ao 600.00M                                                          /dev/sdf1(300)                                                                                                         
  [syncd_log_4legs_3_mimage_3] helter_skelter iwi-ao 600.00M                                                          /dev/sdg1(300)                                                                                                         
  [syncd_log_4legs_3_mlog]     helter_skelter lwi-ao   4.00M                                                          /dev/sdc1(2

Comment 3 Corey Marthaler 2009-04-03 14:30:16 UTC
Ended up hitting this with 5-way mirrors as well.

Comment 4 Jonathan Earl Brassow 2009-04-07 15:45:17 UTC
I believe I have a fix pending for this in my queue.  I need to check it in.

Comment 5 Jonathan Earl Brassow 2009-05-05 16:28:39 UTC
commit 9d2e36adb7a14378ee5934ea273094428a278089
Author: Jonathan Brassow <jbrassow>
Date:   Mon Apr 20 11:02:50 2009 -0500

    clogd: Fix for bug 491340 - copy percentage can get stuck during sync attempt

    May also help fix 480000.

    Introduce a delay between suspend/resume cycles to prevent the receipt of
    stale checkpoint data.

Comment 7 Corey Marthaler 2009-06-02 16:51:20 UTC
This bug still exists, that or there's another lvconvert bug open affecting this test.

Recreating PVs /dev/sde1
  WARNING: Volume group helter_skelter is not consistent
  WARNING: Volume Group helter_skelter is not consistent
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PVs back into VG helter_skelter 
Up converting linear(s) back to mirror(s) on taft-02... 
taft-02: lvconvert --corelog -m 3 -b helter_skelter/syncd_secondary_core_4legs_1 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-1000 /dev/sdd1:0-1000                                                                                                          
Verifying the up conversion from linear(s) to mirror(s)                                                                        
Verifying device /dev/sdf1 *IS* one of the legs in the mirror(s)                                                               
Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)                                                               
Verifying device /dev/sdg1 *IS* one of the legs in the mirror(s)                                                               
Verifying device /dev/sdd1 *IS* one of the legs in the mirror(s)                                                               
Verifying LOG device 0 *IS* in the mirror(s)                                                                                   

Waiting until all mirrors become fully syncd...
        0/1 mirror(s) are fully synced: ( 1=0.17% )
        0/1 mirror(s) are fully synced: ( 1=0.17% )
        0/1 mirror(s) are fully synced: ( 1=0.17% )
        0/1 mirror(s) are fully synced: ( 1=0.17% )
[...]

[root@taft-01 ~]# lvs -a -o +devices
  LV                                      VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices                                                                                                                                                            
  LogVol00                                VolGroup00     -wi-ao  58.38G                                       /dev/sda2(0)                                                                                                                                                       
  LogVol01                                VolGroup00     -wi-ao   9.75G                                       /dev/sda2(1868)                                                                                                                                                    
  syncd_secondary_core_4legs_1            helter_skelter mwi-ao 600.00M                          0.00         syncd_secondary_core_4legs_1_mimage_0(0),syncd_secondary_core_4legs_1_mimage_2(0),syncd_secondary_core_4legs_1_mimage_3(0),syncd_secondary_core_4legs_1_mimage_4(0)
  [syncd_secondary_core_4legs_1_mimage_0] helter_skelter Iwi-ao 600.00M                                       /dev/sdf1(0)                                                                                                                                                       
  [syncd_secondary_core_4legs_1_mimage_2] helter_skelter Iwi-ao 600.00M                                       /dev/sdg1(0)                                                                                                                                                       
  [syncd_secondary_core_4legs_1_mimage_3] helter_skelter Iwi-ao 600.00M                                       /dev/sdd1(0)                                                                                                                                                       
  [syncd_secondary_core_4legs_1_mimage_4] helter_skelter Iwi-ao 600.00M                                       /dev/sde1(0)                                                                                                                                                       
[root@taft-01 ~]# dmsetup status
helter_skelter-syncd_secondary_core_4legs_1_mimage_2: 0 1228800 linear 
helter_skelter-syncd_secondary_core_4legs_1_mimage_0: 0 1228800 linear 
helter_skelter-syncd_secondary_core_4legs_1: 0 1228800 mirror 4 253:2 253:4 253:5 253:3 2/1200 1 AAAA 1 clustered_core
VolGroup00-LogVol01: 0 20447232 linear 
VolGroup00-LogVol00: 0 122421248 linear 
helter_skelter-syncd_secondary_core_4legs_1_mimage_4: 0 1228800 linear 
helter_skelter-syncd_secondary_core_4legs_1_mimage_3: 0 1228800 linear 


Name:              helter_skelter-syncd_secondary_core_4legs_1
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      3
Major, minor:      253, 6
Number of targets: 1


2.6.18-150.el5

lvm2-2.02.46-3.el5    BUILT: Mon Jun  1 14:02:52 CDT 2009
lvm2-cluster-2.02.46-3.el5    BUILT: Mon Jun  1 14:06:25 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5    BUILT: Tue May  5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009

Comment 8 Corey Marthaler 2009-06-12 14:06:55 UTC
Apear to have hit this running cmirror_lock_stress:

Creating a 3 redundant legged cmirror named taft-04.18446
  Logical volume "taft-04.18446" created

Down converting cmirror from 3 legs to 2 on taft-04
  Error locking on node taft-04-bond: Command timed out
  Failed to lock taft-04.18446
couldn't down convert cmirror on taft-04

# The syncs are stuck at 2 different percents on different nodes in the cluster.

[root@taft-03 ~]# lvs
  LV                     VG          Attr   LSize   Origin Snap%  Move Log                Copy%  Convert
  LogVol00               VolGroup00  -wi-ao  58.38G                                                     
  LogVol01               VolGroup00  -wi-ao   9.75G                                                     
  taft-01.18455          lock_stress -wi-a- 500.00M                                                     
  taft-02.18451          lock_stress -wi-a- 500.00M                                                     
  taft-03.18448          lock_stress -wi-a- 500.00M                                                     
  taft-04.18446          lock_stress mwi-s- 500.00M                    taft-04.18446_mlog  30.40        
  taft-04.18446_mimage_3 lock_stress -wi--- 500.00M                                                     


[root@taft-04 ~]# lvs
  LV                     VG          Attr   LSize   Origin Snap%  Move Log                Copy%  Convert
  LogVol00               VolGroup00  -wi-ao  58.38G                                                     
  LogVol01               VolGroup00  -wi-ao   9.75G                                                     
  taft-01.18455          lock_stress -wi-a- 500.00M                                                     
  taft-02.18451          lock_stress -wi-a- 500.00M                                                     
  taft-03.18448          lock_stress -wi-a- 500.00M                                                     
  taft-04.18446          lock_stress mwi-s- 500.00M                    taft-04.18446_mlog  31.20        
  taft-04.18446_mimage_3 lock_stress -wi--- 500.00M                                                     

2.6.18-150.el5

lvm2-2.02.46-5.el5    BUILT: Sat Jun  6 16:29:23 CDT 2009
lvm2-cluster-2.02.46-5.el5    BUILT: Sat Jun  6 16:28:13 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5    BUILT: Tue May  5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009

Comment 11 Corey Marthaler 2009-06-18 18:14:34 UTC
bumping the priority since the work around mentioned in comment #0 doesn't work.

Comment 12 Corey Marthaler 2009-07-02 20:10:59 UTC
Note for Jon... this is fairly easy to reproduce by running helter_skelter. 

I'm attaching current logs containing clogd debug output from all 4 taft nodes when this issue is present.

Comment 13 Corey Marthaler 2009-07-02 20:14:47 UTC
Created attachment 350339 [details]
log from taft-01

Comment 14 Corey Marthaler 2009-07-02 20:15:22 UTC
Created attachment 350340 [details]
log from taft-02

Comment 15 Corey Marthaler 2009-07-02 20:16:13 UTC
Created attachment 350341 [details]
log from taft-03

Comment 16 Corey Marthaler 2009-07-02 20:17:24 UTC
Created attachment 350342 [details]
log from taft-04

Comment 17 Corey Marthaler 2009-07-08 19:32:18 UTC
Posting new logs from new binary...

Comment 18 Corey Marthaler 2009-07-08 19:32:50 UTC
Created attachment 350985 [details]
log from taft-01

Comment 19 Corey Marthaler 2009-07-08 19:33:33 UTC
Created attachment 350986 [details]
log from taft-02

Comment 20 Corey Marthaler 2009-07-08 19:34:13 UTC
Created attachment 350987 [details]
log from taft-03

Comment 21 Corey Marthaler 2009-07-08 19:34:48 UTC
Created attachment 350988 [details]
log from taft-04

Comment 24 Jonathan Earl Brassow 2009-07-15 04:01:05 UTC
commit 6cf7dbc393d206e56a14261af81b1aeeb5bb160a
Author: Jonathan Brassow <jbrassow>
Date:   Wed Jul 8 17:16:19 2009 -0500

    clogd: Fix for bugs 474174 and 491340

    Whether to continue mirror recovery or not was being queued
    off of a variable that was set by doing (x - 1).  However,
    when 'x' was 0 the result was a very large number (unsigned).
    When this large number was compared against the number of
    total regions in the mirror, it was determined that recovery
    was finished.  The fix is simply to do:
        y = x ? (x - 1) : 0;
    instead of:
        y = (x - 1);

Comment 26 Corey Marthaler 2009-08-06 15:48:36 UTC
Have not seen this bug with the latest build (though we've seen plenty of other bugs that it could still potentially be hiding behind). Marking verified in cmirror-1.1.39-2.el5.

Comment 28 errata-xmlrpc 2009-09-02 11:05:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1340.html