Bug 464550

Summary: RHEL5 cmirror tracker: copy percentage can get stuck during sync attempt (creation/conversion)
Product: Red Hat Enterprise Linux 5
Reporter: Corey Marthaler <cmarthal>
Component: cmirror
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 5.3
CC: agk, ccaulfie, dwysocha, edamato, heinzm, mbroz, pep, syeghiay
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 21:25:12 UTC
Attachments: current state of clogd

Description Corey Marthaler 2008-09-29 15:13:36 UTC
Description of problem:
I ran through about 100 iterations of helter_skelter (running just the disklog failure scenarios) before I saw this issue. Since it took so long to hit, I wouldn't consider this a big deal, but figured it should be documented regardless.

It's also possible that this is related to 431621.

****** Mirror hash info for this scenario ******
* name:         syncd_log_2legs                 
* sync:         1                               
* num mirrors:  2                               
* disklog:      /dev/sdf1                       
* failpv:       /dev/sdf1                       
* leg devices:  /dev/sdg1 /dev/sde1             
************************************************

Creating mirror(s) on taft-03...
taft-03: lvcreate -m 1 -n syncd_log_2legs_1 -L 600M helter_skelter /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150
taft-03: lvcreate -m 1 -n syncd_log_2legs_2 -L 600M helter_skelter /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150

Waiting until all mirrors become fully syncd...
        0/2 mirror(s) are fully synced: ( 1=10.67% 2=0.08% )
        0/2 mirror(s) are fully synced: ( 1=63.33% 2=0.08% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=0.08% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=0.08% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=0.08% )
[...]


There's no additional information given in the logs as to why it got stuck:

Sep 27 00:01:00 taft-03 qarshd[14362]: Running cmdline: lvcreate -m 1 -n syncd_log_2legs_1 -L 600M helter_skelter /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150
Sep 27 00:01:02 taft-03 [7046]: Monitoring mirror device helter_skelter-syncd_log_2legs_1 for events
Sep 27 00:01:02 taft-03 xinetd[6377]: EXIT: qarsh status=0 pid=14362 duration=2(sec)
Sep 27 00:01:02 taft-03 xinetd[6377]: START: qarsh pid=14438 from=10.15.80.47
Sep 27 00:01:02 taft-03 qarshd[14438]: Talking to peer 10.15.80.47:33740
Sep 27 00:01:02 taft-03 qarshd[14438]: Running cmdline: lvcreate -m 1 -n syncd_log_2legs_2 -L 600M helter_skelter /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150
Sep 27 00:01:04 taft-03 lvm[7046]: Monitoring mirror device helter_skelter-syncd_log_2legs_2 for events


[root@taft-03 ~]# lvs -a -o +devices
  LV                           VG             Attr   LSize   Origin Snap%  Move Log                    Copy%  Convert Devices                                                    
  LogVol00                     VolGroup00     -wi-ao  66.19G                                                          /dev/sda2(0)                                               
  LogVol01                     VolGroup00     -wi-ao   1.94G                                                          /dev/sda2(2118)                                            
  syncd_log_2legs_1            helter_skelter mwi-a- 600.00M                    syncd_log_2legs_1_mlog 100.00         syncd_log_2legs_1_mimage_0(0),syncd_log_2legs_1_mimage_1(0)
  [syncd_log_2legs_1_mimage_0] helter_skelter iwi-ao 600.00M                                                          /dev/sdg1(0)                                               
  [syncd_log_2legs_1_mimage_1] helter_skelter iwi-ao 600.00M                                                          /dev/sde1(0)                                               
  [syncd_log_2legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                          /dev/sdf1(0)                                               
  syncd_log_2legs_2            helter_skelter mwi-a- 600.00M                    syncd_log_2legs_2_mlog   0.00         syncd_log_2legs_2_mimage_0(0),syncd_log_2legs_2_mimage_1(0)
  [syncd_log_2legs_2_mimage_0] helter_skelter Iwi-ao 600.00M                                                          /dev/sdg1(150)                                             
  [syncd_log_2legs_2_mimage_1] helter_skelter Iwi-ao 600.00M                                                          /dev/sde1(150)                                             
  [syncd_log_2legs_2_mlog]     helter_skelter lwi-ao   4.00M                                                          /dev/sdf1(1)                                               
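
Note for reading the lvs output: in the Attr column, mirror image sub-LVs show a capital 'I' while they are still out of sync and a lowercase 'i' once synced, which is why _1's legs are iwi-ao and _2's are Iwi-ao above. A quick sketch to list just the image states:

  lvs -a -o lv_name,lv_attr helter_skelter | grep _mimage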

[root@taft-03 ~]# dmsetup status
helter_skelter-syncd_log_2legs_1_mlog: 0 8192 linear 
helter_skelter-syncd_log_2legs_2_mimage_1: 0 1228800 linear 
helter_skelter-syncd_log_2legs_2_mimage_0: 0 1228800 linear 
helter_skelter-syncd_log_2legs_2_mlog: 0 8192 linear 
helter_skelter-syncd_log_2legs_1_mimage_1: 0 1228800 linear 
helter_skelter-syncd_log_2legs_1_mimage_0: 0 1228800 linear 
helter_skelter-syncd_log_2legs_2: 0 1228800 mirror 2 253:7 253:8 1/1200 1 AA 3 clustered_disk 253:6 A
VolGroup00-LogVol01: 0 4063232 linear 
helter_skelter-syncd_log_2legs_1: 0 1228800 mirror 2 253:3 253:4 1200/1200 1 AA 3 clustered_disk 253:2 A
VolGroup00-LogVol00: 0 138805248 linear 
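
In the dmsetup mirror status lines, the "1/1200"-style field is in-sync regions / total regions, so syncd_log_2legs_2 has completed only 1 of its 1200 regions while _1 is fully synced. A minimal sketch to pull just that fraction for every mirror (the field position shifts with the number of legs, hence the pattern match):

  dmsetup status | awk '/ mirror / {
      for (i = 1; i <= NF; i++)
          if ($i ~ /^[0-9]+\/[0-9]+$/) { sub(/:$/, "", $1); print $1, $i }
  }'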


Version-Release number of selected component (if applicable):
2.6.18-116.el5

lvm2-2.02.40-3.el5    BUILT: Thu Sep 25 14:59:07 CDT 2008
lvm2-cluster-2.02.40-3.el5    BUILT: Thu Sep 25 15:00:54 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.25-1.el5    BUILT: Fri Sep 19 16:27:46 CDT 2008
kmod-cmirror-0.1.17-1.el5    BUILT: Fri Sep 19 16:27:33 CDT 2008


How reproducible:
seldom if ever

Comment 1 Corey Marthaler 2008-09-29 15:16:52 UTC
FYI - a reactivation got the sync moving again.
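
For reference, the reactivation amounts to deactivating and reactivating the stuck mirror; a minimal sketch using the LV from the scenario above (the exact commands used by the test aren't captured here):

  lvchange -an helter_skelter/syncd_log_2legs_2
  lvchange -ay helter_skelter/syncd_log_2legs_2
  lvs -o lv_name,copy_percent helter_skelter/syncd_log_2legs_2   # confirm Copy% is moving again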

Comment 2 Corey Marthaler 2008-10-01 15:26:48 UTC
There's an issue that appears to be this bz where the sync can get stuck for 10+ minutes (maybe even hours) when it's almost finished.

Creating mirror(s) on taft-04...
taft-04: lvcreate --corelog -m 1 -n syncd_secondary_core_2legs_1 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000                                                                                       
taft-04: lvcreate --corelog -m 1 -n syncd_secondary_core_2legs_2 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000                                                                                       

Waiting until all mirrors become fully syncd...
        0/2 mirror(s) are fully synced: ( 1=8.00% 2=1.58% )
        0/2 mirror(s) are fully synced: ( 1=42.17% 2=38.25% )
        0/2 mirror(s) are fully synced: ( 1=75.50% 2=74.08% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
        [...]
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
After 10 minutes the mirror(s) are still not in sync

This happened last night but by the time I checked on the mirrors this morning, they were actually fully synced. I'll have to bump up the sync wait time in my test, but this bz may end up blocking me from being able to verify the corruption bugs.
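
For context, the test's wait loop boils down to polling copy_percent with a timeout, roughly like the following (a simplified sketch, not the actual sts harness code):

  timeout=600; elapsed=0      # seconds; this is the value that may need bumping
  while [ $elapsed -lt $timeout ]; do
      unsynced=$(lvs --noheadings -o copy_percent helter_skelter | awk 'NF && $1 != "100.00"' | wc -l)
      [ "$unsynced" -eq 0 ] && break
      sleep 10; elapsed=$((elapsed + 10))
  done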

Comment 3 Corey Marthaler 2008-10-01 19:21:09 UTC
This doesn't appear reproducible with single machine mirrors.

Comment 4 Jonathan Earl Brassow 2008-10-02 22:22:30 UTC
commit 7d0f35186a7b2209f77424e2e1117cb5fc3803f3
Author: Jonathan Brassow <jbrassow>
Date:   Thu Oct 2 17:19:20 2008 -0500

    clogd: Fix for bug 464550 - sync can get stuck after creation

    Nodes that get their sync/clean bits from checkpoints were being
    allowed to alter them in a resume - something only the master node
    should be allowed to do.

Comment 6 Milan Broz 2008-10-06 20:18:50 UTC
*** Bug 464933 has been marked as a duplicate of this bug. ***

Comment 7 Corey Marthaler 2008-10-07 16:06:50 UTC
I have been unable to reproduce this sync hang after creation during the small amount of testing that I've done with the latest fix (cmirror-1.1.29-1.el5).
 
However, there are still issues with up-convert syncs getting stuck, as well as pvmoves (bz 464933), though no longer every single time. So it's clear that there is still a bug lurking. I'll attempt to reproduce this and gather more info.

Here are the cmds that caused the sync to get stuck:


Oct  6 15:36:32 taft-03 qarshd[14462]: Running cmdline: lvconvert -m 1 -b helter_skelter/syncd_secondary_2legs_1 /dev/sde1:0-1000 /dev/sdg1:0-1000 /dev/sdh1:0-150
Oct  6 15:36:35 taft-03 qarshd[14527]: Running cmdline: lvconvert -m 1 -b helter_skelter/syncd_secondary_2legs_2 /dev/sde1:0-1000 /dev/sdg1:0-1000 /dev/sdh1:0-150


  syncd_secondary_2legs_1            helter_skelter mwi-ao 600.00M                    syncd_secondary_2legs_1_mlog   0.00         syncd_secondary_2legs_1_mimage_0(0),syncd_secondary_2legs_1_mimage_1(0)
  [syncd_secondary_2legs_1_mimage_0] helter_skelter Iwi-ao 600.00M                                                                /dev/sde1(0)                                                           
  [syncd_secondary_2legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                                                /dev/sdg1(0)                                                           
  [syncd_secondary_2legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                                /dev/sdh1(0)                                                           
  syncd_secondary_2legs_2            helter_skelter mwi-ao 600.00M                    syncd_secondary_2legs_2_mlog 100.00         syncd_secondary_2legs_2_mimage_0(0),syncd_secondary_2legs_2_mimage_1(0)
  [syncd_secondary_2legs_2_mimage_0] helter_skelter iwi-ao 600.00M                                                                /dev/sde1(150)                                                         
  [syncd_secondary_2legs_2_mimage_1] helter_skelter iwi-ao 600.00M                                                                /dev/sdg1(150)                                                         
  [syncd_secondary_2legs_2_mlog]     helter_skelter lwi-ao   4.00M   


helter_skelter-syncd_secondary_2legs_2: 0 1228800 mirror 2 253:7 253:8 1200/1200 1 AA 3 clustered_disk 253:6 A
helter_skelter-syncd_secondary_2legs_1: 0 1228800 mirror 2 253:3 253:4 4/1200 1 AA 3 clustered_disk 253:2 A

Comment 8 Corey Marthaler 2008-10-07 16:10:14 UTC
Here's info from the stuck pvmove:

Oct  6 15:33:38 grant-03 kernel: device-mapper: dm-log-clustered: [NMtewrfC] Request timed out: [DM_CLOG_GET_RESYNC_WORK/88695] - retrying
Oct  6 16:55:35 grant-03 clogd[9684]: cluster_queue: 1  
Oct  6 16:55:35 grant-03 clogd[9684]:   1) DM_CLOG_GET_RESYNC_WORK, originator = 0 
Oct  6 16:55:35 grant-03 clogd[9684]: free_queue   : 99 
Oct  6 16:55:35 grant-03 clogd[9684]: Official log list: 
Oct  6 16:55:35 grant-03 clogd[9684]: LVM-ShXf5vAJ7evNcZERSDEWOBNfTDb48asFw5i1zsZr4UAjTg588qFEbxtFNMtewrfC
Oct  6 16:55:35 grant-03 clogd[9684]: sync_bits:  
Oct  6 16:55:35 grant-03 clogd[9684]: [  0 -  15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]: [ 16 -  31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]: [ 32 -  47] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]: [ 48 -  63] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]: [ 64 -  79] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]: [ 80 -  95] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]: [ 96 - 111] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]: [112 - 127] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF


[root@grant-03 ~]# lvs -a -o +devices
  LV        VG         Attr   LSize  Origin Snap%  Move      Log Copy%  Convert Devices                             
  LogVol00  VolGroup00 -wi-ao 72.44G                                            /dev/sda2(0)                        
  LogVol01  VolGroup00 -wi-ao  1.94G                                            /dev/sda2(2318)                     
  linear    pv_shuffle -wi---  5.00G                                            /dev/sdb3(0)                        
  [pvmove0] pv_shuffle p-C-ao  1.67G               /dev/sdc2      34.66         /dev/sdc2(0),/dev/sdc1(0)           
  stripe    pv_shuffle -wI-a-  5.00G                                            pvmove0(0),/dev/sdb1(0),/dev/sdc4(0)

Comment 9 Corey Marthaler 2008-10-07 18:43:43 UTC
A fairly easy way to reproduce the issue of the stuck pvmove is to run pv_shuffle (a pvmove test located in sts). I've eventually hit this every time I've run that test.

Comment 10 Jonathan Earl Brassow 2008-10-08 21:16:13 UTC
commit 027ab643fa43ab1b97724b5702f55bcbe619b12e
Author: Jonathan Brassow <jbrassow>
Date:   Wed Oct 8 16:11:46 2008 -0500

    clogd:  Fix one cause of bug 464550 - copy percentage can get stuck during

    If the cluster log gets a request from the kernel to shutdown immediately
    after it was created, then it must perform some clean-up.  This covers the
    case where it is told to shutdown after it has requested a checkpoint, but
    before it has read/unlinked it.

Comment 12 Corey Marthaler 2008-10-09 18:10:07 UTC
This does appear to be fixed in cmirror-1.1.30-1.el5; however, I'm going to let the tests run overnight before marking this VERIFIED.

Comment 13 Corey Marthaler 2008-10-13 15:14:28 UTC
There is still an issue with copy percents getting stuck. 

Although it's no longer as easily reproducible, I was still able to hit this while running both the helter_skelter and pv_shuffle tests over the weekend.

In the helter_skelter case, it was just creating 3 mirrors, and the 2nd one never got fully synced.

================================================================================
Iteration 26.1 started at Fri Oct 10 06:50:59 CDT 2008                          
================================================================================
Scenario: Kill primary leg of synced 2 leg mirror(s)                            

****** Mirror hash info for this scenario ******
* name:         syncd_primary_2legs             
* sync:         1                               
* num mirrors:  3                               
* disklog:      /dev/sdg1                       
* failpv:       /dev/sde1                       
* leg devices:  /dev/sde1 /dev/sdh1             
************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate -m 1 -n syncd_primary_2legs_1 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150                                                                            
taft-04: lvcreate -m 1 -n syncd_primary_2legs_2 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150                                                                            
taft-04: lvcreate -m 1 -n syncd_primary_2legs_3 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150                                                                            

Waiting until all mirrors become fully syncd...
        0/3 mirror(s) are fully synced: ( 1=31.75% 2=0.08% 3=1.75% )
        0/3 mirror(s) are fully synced: ( 1=65.17% 2=0.08% 3=36.25% )
        0/3 mirror(s) are fully synced: ( 1=98.58% 2=0.08% 3=70.33% )
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=0.08% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=0.08% 3=100.00% )
        [...]
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=0.08% 3=100.00% )
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=0.08% 3=100.00% )
After 60 minutes the mirror(s) are still not in sync
run_iteration() method failed at /home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.

helter_skelter-syncd_primary_2legs_2: 0 1228800 mirror 2 253:7 253:8 1/1200 1 AA 3 clustered_disk 253:6 A


[root@taft-03 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms 
2.6.18-117.el5

lvm2-2.02.40-4.el5    BUILT: Mon Oct  6 05:21:53 CDT 2008
lvm2-cluster-2.02.40-4.el5    BUILT: Mon Oct  6 05:23:25 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.30-1.el5    BUILT: Wed Oct  8 18:18:36 CDT 2008
kmod-cmirror-0.1.18-1.el5    BUILT: Mon Sep 29 16:20:21 CDT 2008

Comment 14 Corey Marthaler 2008-10-13 15:15:24 UTC
The pv_shuffle test case got stuck as well, although it looks like all of lvm may be stuck in that case.

Adding /dev/sdc1 to the volume group on grant-03
Moving data from /dev/sdb1 to /dev/sdc1 on grant-03
    Finding volume group "pv_shuffle"
    Executing: /sbin/modprobe dm-log-clustered
    Archiving volume group "pv_shuffle" metadata (seqno 54).
    Creating logical volume pvmove0
    Moving 427 extents of logical volume pv_shuffle/stripe
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 55).
    [STUCK]

[root@grant-01 ~]#  lvs -a -o +devices
  LV        VG         Attr   LSize  Origin Snap%  Move      Log Copy%  Convert Devices                             
  LogVol00  VolGroup00 -wi-ao 72.44G                                            /dev/sda2(0)                        
  LogVol01  VolGroup00 -wi-ao  1.94G                                            /dev/sda2(2318)                     
  linear    pv_shuffle -wi---  5.00G                                            /dev/sdb2(0)                        
  [pvmove0] pv_shuffle p-C-ao  1.67G               /dev/sdc3      84.78         /dev/sdc3(0),/dev/sdb3(0)           
  stripe    pv_shuffle -wI-a-  5.00G                                            /dev/sdb1(0),pvmove0(0),/dev/sdc4(0)
[root@grant-01 ~]#  lvs -a -o +devices
    [HUNG]

device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84685] - retrying
device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84686] - retrying
device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84687] - retrying
device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84688] - retrying
device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84689] - retrying
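
Those "Request timed out ... retrying" lines come from the kernel half of the cluster log waiting on clogd. A simple sketch for collecting them from each node while a test runs (node names are just the ones used above):

  for node in grant-01 grant-02 grant-03; do
      echo "== $node =="
      ssh $node "grep 'dm-log-clustered.*Request timed out' /var/log/messages | tail -5"
  done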

2.6.18-117.el5

lvm2-2.02.40-4.el5    BUILT: Mon Oct  6 05:21:53 CDT 2008
lvm2-cluster-2.02.40-4.el5    BUILT: Mon Oct  6 05:23:25 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.30-1.el5    BUILT: Wed Oct  8 18:18:36 CDT 2008
kmod-cmirror-0.1.18-1.el5    BUILT: Mon Sep 29 16:20:21 CDT 2008

Comment 15 Corey Marthaler 2008-10-15 14:18:09 UTC
Both helter_skelter and pv_shuffle failed with sync issues with the latest rpms.

2.6.18-117.el5

lvm2-2.02.40-4.el5    BUILT: Mon Oct  6 05:21:53 CDT 2008
lvm2-cluster-2.02.40-4.el5    BUILT: Mon Oct  6 05:23:25 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.31-1.el5    BUILT: Tue Oct 14 14:14:55 CDT 2008
kmod-cmirror-0.1.19-1.el5    BUILT: Tue Oct 14 14:17:45 CDT 2008


The blocker flag needs to get set for this BZ.

Comment 16 Corey Marthaler 2008-10-15 15:07:47 UTC
Stumbled upon this with just a simple create.

lvcreate -m 1 -n mirror -L 1G grant

Ended up stuck at 12.50%

Oct 15 10:00:28 grant-03 kernel: device-mapper: dm-log-clustered: [HlWcfNdu] Request timed out: [DM_CLOG_GET_RESYNC_WORK/1519] - retrying

  mirror            grant      mwi-a-  1.00G                    mirror_mlog  12.50         mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] grant      Iwi-ao  1.00G                                               /dev/sdb1(0)                         
  [mirror_mimage_1] grant      Iwi-ao  1.00G                                               /dev/sdb2(0)                         
  [mirror_mlog]     grant      lwi-ao  4.00M                                               /dev/sdc4(0)

Comment 17 Jonathan Earl Brassow 2008-10-20 19:00:13 UTC
commit e07369b28d7a569e742d80152ef10c9d42bc2650
Author: Jonathan Brassow <jbrassow>
Date:   Mon Oct 20 12:27:08 2008 -0500

    clogd: Additional fixes for bug 460156 and 464550

    - Fixed a checkpoint ordering issue, where a checkpoint could be populated
      with data from the wrong moment in time.
    - Fixed problem where resending requests when a server leaves would include
      requests that should have been received after the server had gone (so no
      resend should have been necessary).

Comment 21 Corey Marthaler 2008-10-21 18:52:28 UTC
Appeared to have hit this again after 48 iterations of pv_shuffle.

Iteration 49 of 10000 started on hayes-03 at Tue Oct 21 07:21:10 CDT 2008
FREE PVS: /dev/etherd/e1.1p1 /dev/etherd/e1.1p6
INUSE PVS: /dev/etherd/e1.1p2 /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p7
NOT INUSE PVS: /dev/etherd/e1.1p3

Adding /dev/etherd/e1.1p6 to the volume group on hayes-03
Moving data from /dev/etherd/e1.1p2 to /dev/etherd/e1.1p6 on hayes-03
    Finding volume group "pv_shuffle"
    Executing: /sbin/modprobe dm-log-clustered
    Archiving volume group "pv_shuffle" metadata (seqno 244).
    Creating logical volume pvmove0
    Moving 1280 extents of logical volume pv_shuffle/linear
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 245).
    Checking progress every 15 seconds

[STUCK]

lvs shows it's stuck at 66%:
  [pvmove0] pv_shuffle p-C-ao  5.00G               /dev/etherd/e1.1p2      66.09         /dev/etherd/e1.1p2(0),/dev/etherd/e1.1p6(0)                      
  stripe    pv_shuffle -wi---  5.00G                                                     /dev/etherd/e1.1p4(0),/dev/etherd/e1.1p7(0),/dev/etherd/e1.1p5(0)

There were a bunch of time out messages as well.
device-mapper: dm-log-clustered: [4l2zWeXM] Request timed out: [DM_CLOG_FLUSH/780360] - retrying
device-mapper: dm-log-clustered: [4l2zWeXM] Request timed out: [DM_CLOG_GET_RESYNC_WORK/803642] - retrying



I'll try this again with the newly built lvm2 rpm, but for now marking fails_qa since these are the latest cmirror[-kern] rpms.

2.6.18-117.el5

lvm2-2.02.40-4.el5    BUILT: Mon Oct  6 05:21:53 CDT 2008
lvm2-cluster-2.02.40-4.el5    BUILT: Mon Oct  6 05:23:25 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.32-1.el5    BUILT: Mon Oct 20 14:15:12 CDT 2008
kmod-cmirror-0.1.20-1.el5    BUILT: Mon Oct 20 14:13:37 CDT 2008

Comment 22 Jonathan Earl Brassow 2008-10-27 22:54:33 UTC
Two things to check:
1) Is clogd running on all machines 'ps -C clogd'
- If not, then there is probably a core dump (I'll need that.)

2) If clogd is running, it could be stuck in an infinite loop
- 'gdb clogd <clogd pid>' will tell you pretty quickly if it is in a loop
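
Scripted across the cluster, those two checks would look roughly like this (node names are examples; adjust to the cluster under test):

  # 1) Is clogd still running everywhere?  If not, look for a core dump.
  for node in taft-01 taft-02 taft-03 taft-04; do
      echo "== $node =="
      ssh $node "pidof clogd || echo 'clogd not running - check for a core'"
  done

  # 2) If it is running, attach gdb and see whether it is spinning
  ssh taft-01 "gdb -batch -p \$(pidof clogd) -ex 'thread apply all bt'"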

I'm pretty certain that this bug is a result of the list corruption issue that I am seeing (bug 468438).  I'm gonna mark this bug MODIFIED (rather than NEEDINFO), because I think this bug has been fixed by:

commit cc5877e65fad20dd8657881a7ca7361e6e4c08bf
Author: Jonathan Brassow <jbrassow>
Date:   Fri Oct 24 13:42:06 2008 -0500

    clogd: Fix for bug 468438 - list corruption

    'commit e07369b28d7a569e742d80152ef10c9d42bc2650' introduced the
    concept of a delay queue to hold requests while membership changes
    occurred.  Sometimes, a request would be added to the delay_queue
    /and/ the cluster_queue, resulting in list corruption.  Depending
    on how the list was corrupted, infinite loops could occur, or
    requests could simply be lost.

Comment 25 Corey Marthaler 2008-10-28 18:55:10 UTC
I'm now seeing an issue where the copy percent goes over 100%. I'll file a new BZ for that since it may be a new/different issue from this one, though it does still appear that there are sync issues like this BZ implies.

  syncd_log_2legs_1 helter_skelter mwi-ao 600.00M                    syncd_log_2legs_1_mlog 100.00        
  syncd_log_2legs_2 helter_skelter mwi-ao 600.00M                    syncd_log_2legs_2_mlog 100.00        
  syncd_log_2legs_3 helter_skelter mwi-ao 600.00M                    syncd_log_2legs_3_mlog 101.33

Comment 26 Corey Marthaler 2008-10-28 19:14:55 UTC
The bz filed for the issue in comment #25 is 468911.

Comment 27 Corey Marthaler 2008-10-30 14:00:22 UTC
Hit this again last night.

================================================================================
Iteration 5.4 started at Wed Oct 29 15:52:03 CDT 2008                           
================================================================================
Scenario: Kill secondary leg of synced 3 leg mirror(s)                          

****** Mirror hash info for this scenario ******
* name:         syncd_secondary_3legs           
* sync:         1                               
* num mirrors:  3                               
* disklog:      /dev/sde1                       
* failpv:       /dev/sdh1                       
* leg devices:  /dev/sdf1 /dev/sdh1 /dev/sdg1   
************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate -m 2 -n syncd_secondary_3legs_1 -L 600M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150                                                                               
taft-04: lvcreate -m 2 -n syncd_secondary_3legs_2 -L 600M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150                                                                               
taft-04: lvcreate -m 2 -n syncd_secondary_3legs_3 -L 600M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150                                                                               

Waiting until all mirrors become fully syncd...
        0/3 mirror(s) are fully synced: ( 1=14.75% 2=5.92% 3=0.08% )
        0/3 mirror(s) are fully synced: ( 1=43.92% 2=35.58% 3=0.08% )
        0/3 mirror(s) are fully synced: ( 1=73.75% 2=65.08% 3=0.08% )
        1/3 mirror(s) are fully synced: ( 1=100.00% 2=96.67% 3=0.08% )
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=100.00% 3=0.08% )
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=100.00% 3=0.08% )
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=100.00% 3=0.08% )
        2/3 mirror(s) are fully synced: ( 1=100.00% 2=100.00% 3=0.08% )
After 60 minutes the mirror(s) are still not in sync
run_iteration() method failed at /home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.

[root@taft-01 ~]# lvs
  LV                      VG             Attr   LSize   Origin Snap%  Move Log                          Copy%  Convert
  LogVol00                VolGroup00     -wi-ao  66.19G                                                               
  LogVol01                VolGroup00     -wi-ao   1.94G                                                               
  syncd_secondary_3legs_1 helter_skelter mwi-a- 600.00M                    syncd_secondary_3legs_1_mlog 100.00        
  syncd_secondary_3legs_2 helter_skelter mwi-a- 600.00M                    syncd_secondary_3legs_2_mlog 100.00        
  syncd_secondary_3legs_3 helter_skelter mwi-a- 600.00M                    syncd_secondary_3legs_3_mlog   0.00        

[root@taft-01 ~]# dmsetup status
helter_skelter-syncd_secondary_3legs_2: 0 1228800 mirror 3 253:8 253:9 253:10 1200/1200 1 AAA 3 clustered_disk 253:7 A
helter_skelter-syncd_secondary_3legs_3_mimage_1: 0 1228800 linear 
helter_skelter-syncd_secondary_3legs_2_mimage_2: 0 1228800 linear 
helter_skelter-syncd_secondary_3legs_1: 0 1228800 mirror 3 253:3 253:4 253:5 1200/1200 1 AAA 3 clustered_disk 253:2 A
helter_skelter-syncd_secondary_3legs_3_mimage_0: 0 1228800 linear 
helter_skelter-syncd_secondary_3legs_2_mimage_1: 0 1228800 linear 
helter_skelter-syncd_secondary_3legs_1_mimage_2: 0 1228800 linear 
helter_skelter-syncd_secondary_3legs_2_mimage_0: 0 1228800 linear 
helter_skelter-syncd_secondary_3legs_1_mimage_1: 0 1228800 linear 
helter_skelter-syncd_secondary_3legs_1_mimage_0: 0 1228800 linear 
helter_skelter-syncd_secondary_3legs_1_mlog: 0 8192 linear 
helter_skelter-syncd_secondary_3legs_2_mlog: 0 8192 linear 
helter_skelter-syncd_secondary_3legs_3_mlog: 0 8192 linear 
VolGroup00-LogVol01: 0 4063232 linear 
VolGroup00-LogVol00: 0 138805248 linear 
helter_skelter-syncd_secondary_3legs_3: 0 1228800 mirror 3 253:13 253:14 253:15 1/1200 1 AAA 3 clustered_disk 253:12 A
helter_skelter-syncd_secondary_3legs_3_mimage_2: 0 1228800 linear 

clogd is still running on all nodes in the cluster.


2.6.18-117.el5

lvm2-2.02.40-6.el5    BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-6.el5    BUILT: Fri Oct 24 07:38:44 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.33-1.el5    BUILT: Mon Oct 27 14:10:34 CDT 2008
kmod-cmirror-0.1.20-1.el5    BUILT: Mon Oct 20 14:13:37 CDT 2008

Comment 28 Corey Marthaler 2008-11-03 15:38:47 UTC
Just a note in addition to comment #27... the pv_shuffle tests pass now without issue. I ran them over the weekend (almost 600 iterations) without any problems. So although there is still a reproducible issue here w/ mirror syncing, it is much better than before the fix referenced in comment #22.

Comment 29 Jonathan Earl Brassow 2008-11-05 21:17:13 UTC
commit 4dc9f187336bbee91872e613a6b9614a358228de
Author: Jonathan Brassow <jbrassow>
Date:   Thu Oct 30 13:45:42 2008 -0500

    clogd: Additional fixes for bug 464550 + improved debugging

    Needed to reset the 'recoverer' when the mirror log
    server was suspending.  Otherwise, sync work could stop.

    Also added signal triggered debugging output.  This gives the
    testing folks an easy way to get useful information when a bug
    is hit.

Comment 30 Corey Marthaler 2008-11-07 18:12:06 UTC
I'm still seeing this with the latest builds.

Both while running helter_skelter and mirror_sanity. It also appears, given that the following test case is taking longer to complete, that mirror syncing in general is taking longer now.


MIRROR_SANITY:
SCENARIO - [verify_sync_completions]
Create 8 mirrors and verify that their copy percents complete
hayes-01: lvcreate -m 1 -n sync_check_1 -L 1G mirror_sanity  
hayes-03: lvcreate -m 1 -n sync_check_2 -L 1G mirror_sanity  
hayes-01: lvcreate -m 1 -n sync_check_3 -L 1G mirror_sanity  
hayes-02: lvcreate -m 1 -n sync_check_4 -L 1G mirror_sanity  
hayes-02: lvcreate -m 1 -n sync_check_5 -L 1G mirror_sanity  
hayes-03: lvcreate -m 1 -n sync_check_6 -L 1G mirror_sanity  
hayes-02: lvcreate -m 1 -n sync_check_7 -L 1G mirror_sanity  
hayes-02: lvcreate -m 1 -n sync_check_8 -L 1G mirror_sanity  
Waiting until all mirrors become fully syncd...              
   1/8 mirror(s) are fully synced: ( 1=100.00% 2=6.59% 3=5.52% 4=0.05% 5=2.34% 6=39.94% 7=15.33% 8=0.05% )
   1/8 mirror(s) are fully synced: ( 1=100.00% 2=7.37% 3=6.64% 4=0.05% 5=3.17% 6=50.78% 7=26.12% 8=1.03% )
[...]
   3/8 mirror(s) are fully synced: ( 1=100.00% 2=60.60% 3=66.11% 4=0.05% 5=57.62% 6=100.00% 7=100.00% 8=55.62% )
   3/8 mirror(s) are fully synced: ( 1=100.00% 2=61.52% 3=67.14% 4=0.05% 5=58.59% 6=100.00% 7=100.00% 8=56.69% )
After 10 minutes the mirror(s) are still not in sync


[root@hayes-01 ~]# lvs
  LV           VG            Attr   LSize  Origin Snap%  Move Log               Copy%  Convert
  LogVol00     VolGroup00    -wi-ao 72.44G                                                    
  LogVol01     VolGroup00    -wi-ao  1.94G                                                    
  sync_check_1 mirror_sanity mwi-a-  1.00G                    sync_check_1_mlog 100.00        
  sync_check_2 mirror_sanity mwi-a-  1.00G                    sync_check_2_mlog 100.00        
  sync_check_3 mirror_sanity mwi-a-  1.00G                    sync_check_3_mlog 100.00        
  sync_check_4 mirror_sanity mwi-a-  1.00G                    sync_check_4_mlog   0.00        
  sync_check_5 mirror_sanity mwi-a-  1.00G                    sync_check_5_mlog 100.00        
  sync_check_6 mirror_sanity mwi-a-  1.00G                    sync_check_6_mlog 100.00        
  sync_check_7 mirror_sanity mwi-a-  1.00G                    sync_check_7_mlog 100.00        
  sync_check_8 mirror_sanity mwi-a-  1.00G                    sync_check_8_mlog 100.00        


[root@hayes-01 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms 
2.6.18-117.el5

lvm2-2.02.40-6.el5    BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-6.el5    BUILT: Fri Oct 24 07:38:44 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.34-5.el5    BUILT: Thu Nov  6 15:10:44 CST 2008
kmod-cmirror-0.1.21-2.el5    BUILT: Thu Nov  6 14:12:07 CST 2008



HELTER_SKELTER:
Enabling device sde on taft-01
Enabling device sde on taft-02
Enabling device sde on taft-03
Enabling device sde on taft-04

Recreating PV /dev/sde1
  WARNING: Volume group helter_skelter is not consistent
  WARNING: Volume Group helter_skelter is not consistent
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter  
Up converting linear(s) back to mirror(s) on taft-04... 
taft-04: lvconvert -m 1 -b helter_skelter/syncd_secondary_2legs_1 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150                                                                                                             
taft-04: lvconvert -m 1 -b helter_skelter/syncd_secondary_2legs_2 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150                                                                                                             
Verifying the up conversion from linear(s) to mirror(s)                                                         
Verifying device /dev/sdg1 *IS* one of the legs in the mirror(s)                                                
Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)                                                

Waiting until all mirrors become fully syncd...
        0/2 mirror(s) are fully synced: ( 1=0.33% 2=10.00% )
        0/2 mirror(s) are fully synced: ( 1=0.33% 2=54.75% )
        0/2 mirror(s) are fully synced: ( 1=0.33% 2=98.83% )
        1/2 mirror(s) are fully synced: ( 1=0.33% 2=100.00% )
        1/2 mirror(s) are fully synced: ( 1=0.33% 2=100.00% )
        1/2 mirror(s) are fully synced: ( 1=0.33% 2=100.00% )
        1/2 mirror(s) are fully synced: ( 1=0.33% 2=100.00% )
After 10 minutes the mirror(s) are still not in sync

[root@taft-04 ~]# lvs -a -o +devices
  LV                                 VG             Attr   LSize   Origin Snap%  Move Log                          Copy%  Convert Devices                                 
  LogVol00                           VolGroup00     -wi-ao  66.19G                                                                /dev/sda2(0)                            
  LogVol01                           VolGroup00     -wi-ao   1.94G                                                                /dev/sda2(2118)                         
  syncd_secondary_2legs_1            helter_skelter mwi-ao 600.00M                    syncd_secondary_2legs_1_mlog   0.00         syncd_secondary_2legs_1_mimage_0(0),syn)
  [syncd_secondary_2legs_1_mimage_0] helter_skelter Iwi-ao 600.00M                                                                /dev/sdg1(0)                            
  [syncd_secondary_2legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                                                /dev/sde1(0)                            
  [syncd_secondary_2legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                                /dev/sdf1(0)                            
  syncd_secondary_2legs_2            helter_skelter mwi-ao 600.00M                    syncd_secondary_2legs_2_mlog 100.00         syncd_secondary_2legs_2_mimage_0(0),syn)
  [syncd_secondary_2legs_2_mimage_0] helter_skelter iwi-ao 600.00M                                                                /dev/sdg1(150)                          
  [syncd_secondary_2legs_2_mimage_1] helter_skelter iwi-ao 600.00M                                                                /dev/sde1(150)                          
  [syncd_secondary_2legs_2_mlog]     helter_skelter lwi-ao   4.00M                                                                /dev/sdf1(1)            

[root@taft-04 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms 
2.6.18-117.el5

lvm2-2.02.40-6.el5    BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-6.el5    BUILT: Fri Oct 24 07:38:44 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.34-5.el5    BUILT: Thu Nov  6 15:10:44 CST 2008
kmod-cmirror-0.1.21-2.el5    BUILT: Thu Nov  6 14:12:07 CST 2008

Comment 31 Corey Marthaler 2008-11-10 19:58:23 UTC
Marking this as a 'Testblocker' since it prevents 2 different tests from completing.

Comment 32 Corey Marthaler 2008-11-10 21:39:07 UTC
Created attachment 323120 [details]
current state of clogd

Attaching the dump from clogd during the latest reproduction

Comment 34 Corey Marthaler 2008-11-13 22:23:15 UTC
FYI - Jon is correct (at least in the mirror_sanity [verify_sync_completions] test case): when the mirror gets stuck, a write to that device will start the sync up again.
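
In practice the kick can be as small as a single direct write to the stuck mirror LV; a sketch using one of the verify_sync_completions volumes above (note this overwrites data at that offset, so it's only appropriate on a scratch test LV):

  dd if=/dev/zero of=/dev/mirror_sanity/sync_check_4 bs=4k count=1 oflag=direct
  lvs -o lv_name,copy_percent mirror_sanity/sync_check_4   # Copy% should start climbing again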

Comment 38 Corey Marthaler 2008-12-02 19:49:57 UTC
QA has been able to hack around this bug, since there is a workaround. Removing the TestBlocker flag and proposing this bug for rhel5.4.

Comment 40 Corey Marthaler 2009-01-07 15:37:33 UTC
This issue should remain open for 5.4 and should not have been closed.

Comment 41 errata-xmlrpc 2009-01-20 21:25:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html