Bug 972993 - [cluster mirror] When created mirror is not fully synchronized on all nodes
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Keywords: TestBlocker
Depends On:
Blocks:
Reported: 2013-06-10 22:58 EDT by Marian Csontos
Modified: 2014-06-17 21:18 EDT
CC List: 11 users

See Also:
Fixed In Version: lvm2-2.02.102-1.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-13 07:27:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Relevant part from messages (37.91 KB, text/plain)
2013-07-16 19:00 EDT, Marian Csontos

Description Marian Csontos 2013-06-10 22:58:07 EDT
Description of problem:
When a mirror is created in a cluster, it is often not 100% in sync on all nodes.
Surprisingly, it is the node where lvcreate was issued that reports the mirror as not synchronized.

Version-Release number of selected component (if applicable):
lvm2-2.02.99-0.48.el7.x86_64
kernel-3.10.0-0.rc4.60.el7.x86_64
pcs-0.9.44-4.el7.x86_64
corosync-2.3.0-3.el7.x86_64
dlm-4.0.1-1.el7.x86_64

How reproducible:
More than 50% of attempts

Steps to Reproduce:
1. lvcreate -n base2 --mirrorlog core -m 1 -L 64M centipede

Actual results:
One region is not synchronized

Expected results:
Mirror should be fully synchronized

Additional info:

# lvs
  LV         VG        Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  base2      centipede mwi-a-m--  64.00m                                93.75

# dmsetup status
centipede-base2: 0 131072 mirror 2 253:8 253:9 127/128 1 AA 1 clustered-core
Comment 1 Marian Csontos 2013-06-10 23:02:02 EDT
Jonathan, do you need lvmdumps from all(?) cluster nodes, or do you prefer live access to the cluster? It's fairly easy to reproduce so whatever suits you.
Comment 2 Marian Csontos 2013-06-10 23:07:37 EDT
Suggesting a blocker flag as this breaks important functionality like pvmove.
Comment 3 Marian Csontos 2013-06-10 23:30:28 EDT
Two observations:
It is not limited to the core log.
The larger the LV, the larger the number of regions reported as not synced, but the percentage plateaus at 98 or 99%.
Comment 5 Marian Csontos 2013-07-16 19:00:59 EDT
Created attachment 774527 [details]
Relevant part from messages

It always happens when processing stored requests, as here:

    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: Log not valid yet, storing request
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: Data ready on 8
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: Issuing callback on cluster/8

Then, when the stored request is processed, I guess it goes like this:
1. bitmaps are read
2. sync_count is initialized (it is not always the first bit; sometimes the second is left unset)
3. the bitmap has advanced on other node(s)
4. the next request is setting sync_count and skipping bit(s):

    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: [8pkzeAwf]  Request from kernel received: [DM_ULOG_IN_SYNC/16348]
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: [8pkzeAwf] Region is not in-sync: 2
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: Data ready on 5
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: Data ready on 8
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: Issuing callback on local/5
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: [8pkzeAwf]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/16349]
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: Issuing callback on cluster/8
    Jul 15 10:04:42 zaphodc1-node02 cmirrord[19940]: [8pkzeAwf] loading sync_bits (sync_count = 0):
Comment 6 Corey Marthaler 2013-07-25 17:05:42 EDT
I can't seem to repro this issue because I end up hitting locking issues before I can get many mirror creations going.

SCENARIO - [verify_sync_completions]
Create 8 mirrors and verify that their copy percents complete
qalvm-03: lvcreate -m 1 -n sync_check_1 -L 500M mirror_sanity
  Error locking on node 2: device-mapper: reload ioctl on  failed: Invalid argument
  Error locking on node 1: device-mapper: reload ioctl on  failed: Invalid argument
  Error locking on node 3: device-mapper: reload ioctl on  failed: Invalid argument
  Failed to activate new LV.
couldn't create mirror:
        qalvm-03 lvcreate -m 1 -n sync_check_1 -L 500M mirror_sanity


[root@qalvm-01 ~]# service cmirrord status
cmirrord.service - LSB: Starts and stops cmirrord
   Loaded: loaded (/etc/rc.d/init.d/cmirrord)
   Active: active (running) since Thu 2013-07-25 15:43:23 CDT; 13min ago
  Process: 970 ExecStart=/etc/rc.d/init.d/cmirrord start (code=exited, status=0/SUCCESS)
 Main PID: 974 (cmirrord)
   CGroup: name=systemd:/system/cmirrord.service
           └─974 cmirrord

Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: Clustered log disk is /dev/mapper/mirror_sanity-sync_check_1_mlog
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: Unable to open log device, /dev/mapper/mirror_sanity-sync_check_1_mlog: Permission denied
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: Failed to create cluster log (LVM-0dTezE2FDJfjawsiElrTGdocYkHp2ak45Ywxnv6oBBkWY1pKJnba8C8tyUsnIWLe)
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: argv[0] = clustered-disk
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: argv[1] = 253:2
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: argv[2] = 1024
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: Programmer error: rq->u_rq.error && rq->u_rq.data_size
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: Data ready on 5
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: Issuing callback on local/5
Jul 25 15:55:32 qalvm-01.lab.msp.redhat.com cmirrord[974]: [yUsnIWLe]  Request from kernel received: [DM_ULOG_GET_REGION_SIZE/1]
cmirror is running.
Comment 7 Marian Csontos 2013-07-26 03:31:04 EDT
That looks like the SELinux-related Bug 948073 - cluster mirror creation deadlocks.
Is the qalvm cluster up to date?
Comment 8 Marian Csontos 2013-07-31 07:46:35 EDT
Jon, any update on this bug, please?

Corey has not confirmed it yet, but if it is reproducible in his (non-virtual) environment as well, it will block most of the cluster tests.
Comment 9 Corey Marthaler 2013-08-07 18:44:09 EDT
I have now reproduced this issue.

[root@qalvm-03 ~]# lvcreate -n mirror1 --mirrorlog core -m 1 -L 64M VG
[root@qalvm-03 ~]# lvcreate -n mirror2 --mirrorlog core -m 1 -L 64M VG
[root@qalvm-02 ~]# lvcreate -n mirror3 --mirrorlog core -m 1 -L 64M VG

[root@qalvm-01 ~]# lvs -a -o +devices
 LV                 VG Attr       LSize  Cpy%Sync Devices
 mirror1            VG mwi-a-m--- 64.00m   100.00 mirror1_mimage_0(0),mirror1_mimage_1(0)
 [mirror1_mimage_0] VG iwi-aom--- 64.00m          /dev/vda1(0)
 [mirror1_mimage_1] VG iwi-aom--- 64.00m          /dev/vda2(0)
 mirror2            VG mwi-a-m--- 64.00m   100.00 mirror2_mimage_0(0),mirror2_mimage_1(0)
 [mirror2_mimage_0] VG iwi-aom--- 64.00m          /dev/vda1(16)
 [mirror2_mimage_1] VG iwi-aom--- 64.00m          /dev/vda2(16)
 mirror3            VG mwi-a-m--- 64.00m   100.00 mirror3_mimage_0(0),mirror3_mimage_1(0)
 [mirror3_mimage_0] VG iwi-aom--- 64.00m          /dev/vda1(32)
 [mirror3_mimage_1] VG iwi-aom--- 64.00m          /dev/vda2(32)


[root@qalvm-02 ~]# lvs -a -o +devices
 LV                 VG Attr       LSize  Cpy%Sync Devices
 mirror1            VG mwi-a-m--- 64.00m   100.00 mirror1_mimage_0(0),mirror1_mimage_1(0)
 [mirror1_mimage_0] VG iwi-aom--- 64.00m          /dev/vda1(0)
 [mirror1_mimage_1] VG iwi-aom--- 64.00m          /dev/vda2(0)
 mirror2            VG mwi-a-m--- 64.00m   100.00 mirror2_mimage_0(0),mirror2_mimage_1(0)
 [mirror2_mimage_0] VG iwi-aom--- 64.00m          /dev/vda1(16)
 [mirror2_mimage_1] VG iwi-aom--- 64.00m          /dev/vda2(16)
 mirror3            VG mwi-a-m--- 64.00m    93.75 mirror3_mimage_0(0),mirror3_mimage_1(0)
 [mirror3_mimage_0] VG Iwi-aom--- 64.00m          /dev/vda1(32)
 [mirror3_mimage_1] VG Iwi-aom--- 64.00m          /dev/vda2(32)


[root@qalvm-03 ~]# lvs -a -o +devices
 LV                 VG Attr       LSize  Cpy%Sync Devices
 mirror1            VG mwi-a-m--- 64.00m   100.00 mirror1_mimage_0(0),mirror1_mimage_1(0)
 [mirror1_mimage_0] VG iwi-aom--- 64.00m          /dev/vda1(0)
 [mirror1_mimage_1] VG iwi-aom--- 64.00m          /dev/vda2(0)
 mirror2            VG mwi-a-m--- 64.00m   100.00 mirror2_mimage_0(0),mirror2_mimage_1(0)
 [mirror2_mimage_0] VG iwi-aom--- 64.00m          /dev/vda1(16)
 [mirror2_mimage_1] VG iwi-aom--- 64.00m          /dev/vda2(16)
 mirror3            VG mwi-a-m--- 64.00m   100.00 mirror3_mimage_0(0),mirror3_mimage_1(0)
 [mirror3_mimage_0] VG iwi-aom--- 64.00m          /dev/vda1(32)
 [mirror3_mimage_1] VG iwi-aom--- 64.00m          /dev/vda2(32)


[root@qalvm-02 ~]# pcs status
Cluster name: QALVM
Last updated: Wed Aug  7 17:41:56 2013
Last change: Wed Aug  7 15:29:23 2013 via cibadmin on qalvm-01
Stack: corosync
Current DC: qalvm-03 (3) - partition with quorum
Version: 1.1.10-3.el7-d19719c
3 Nodes configured, unknown expected votes
0 Resources configured.


Online: [ qalvm-01 qalvm-02 qalvm-03 ]

Full list of resources:


3.10.0-0.rc5.61.el7.x86_64
lvm2-2.02.100-0.110.el7    BUILT: Fri Aug  2 19:27:39 CDT 2013
lvm2-libs-2.02.100-0.110.el7    BUILT: Fri Aug  2 19:27:39 CDT 2013
lvm2-cluster-2.02.100-0.110.el7    BUILT: Fri Aug  2 19:27:39 CDT 2013
device-mapper-1.02.79-0.110.el7    BUILT: Fri Aug  2 19:27:39 CDT 2013
device-mapper-libs-1.02.79-0.110.el7    BUILT: Fri Aug  2 19:27:39 CDT 2013
device-mapper-event-1.02.79-0.110.el7    BUILT: Fri Aug  2 19:27:39 CDT 2013
device-mapper-event-libs-1.02.79-0.110.el7    BUILT: Fri Aug  2 19:27:39 CDT 2013
cmirror-2.02.100-0.110.el7    BUILT: Fri Aug  2 19:27:39 CDT 2013
Comment 10 Jonathan Earl Brassow 2013-08-19 09:43:49 EDT
I started working on this over the weekend.  I have 4 VMs backed by loopback devices for local storage and 8 physical disks for shared storage.  My test consists of a loop that creates a 64M 2-way mirror, waits for sync, checks that all machines agree that the mirror is in-sync, and then removes the mirror.  It took 576 iterations to reproduce the error the first time.  So, it is reproducible, but it takes a while.

Might be able to hit the issue more quickly with more mirrors, but that might make it harder to debug...
Comment 11 Jonathan Earl Brassow 2013-08-19 09:55:13 EDT
There are a few immediate errors that need investigation:

1) cmirrord[892]: Error:: multiple loading of bits (6)

2) cmirrord[892]: cpg_dispatch failed: 9

I don't yet know if it always happens in response to processing stored requests, like in comment 5.  The debugging output is only printed on request from the user, by issuing a SIGUSR1 to cmirrord.  I think your automated tests are simply issuing the signal when they encounter a failure.  Thus, you only see the message when the issue happens, but it is not a precursor to the problem.  It could just be printed as a result of the problem.
Comment 12 Jonathan Earl Brassow 2013-08-20 12:25:42 EDT
(In reply to Jonathan Earl Brassow from comment #11)
> There are a few immediate errors that need investigation:
> 
> 2) cmirrord[892]: cpg_dispatch failed: 9

Seems that this one is coming when shutting down a cluster mirror.  I'm not sure why yet, but it seems harmless at the moment and unrelated to this issue.

> 1) cmirrord[892]: Error:: multiple loading of bits (6)

This one, I believe, indicates the problem.  It implies that the bitmap has been populated more than once - this should never happen.  Further investigation has revealed that some bits in the bitmap were not in-sync even though the log clearly shows that the request was made to mark them as such.  I believe multiple checkpoints are being allowed to write the bitmap.  The logs indicate this as well:
Resync request history:
cmirrord[1416]: 39:1) CKPT INIT - SEQ#=X, UUID=1wfnLUDK, nodeid = X:: recovering_region=1, recoverer=3
cmirrord[1416]: 40:2) [1wfnLUDK] Initial sync_count = 1
cmirrord[1416]: 41:3) SET - SEQ#=131688, UUID=1wfnLUDK, nodeid = 3:: Setting region (1)
cmirrord[1416]: 42:4) GET - SEQ#=125574, UUID=1wfnLUDK, nodeid = 2:: Assigning resync work (2)
cmirrord[1416]: 43:5) CKPT INIT - SEQ#=X, UUID=1wfnLUDK, nodeid = X:: recovering_region=1, recoverer=3
cmirrord[1416]: 44:6) CKPT INIT - SEQ#=X, UUID=1wfnLUDK, nodeid = X:: recovering_region=2, recoverer=3
cmirrord[1416]: 45:7) GET - SEQ#=124782, UUID=1wfnLUDK, nodeid = 4:: Someone already recovering (2)
cmirrord[1416]: 46:8) GET - SEQ#=388142, UUID=1wfnLUDK, nodeid = 1:: Someone already recovering (2)
cmirrord[1416]: 47:9) SET - SEQ#=125579, UUID=1wfnLUDK, nodeid = 2:: Setting region (2)
cmirrord[1416]: 48:10) *** Previous bit not set ***
cmirrord[1416]: 49:11) SET - SEQ#=125579, UUID=1wfnLUDK, nodeid = 2:: sync_count(3) != bitmap count(2)

In the above sequence, we have:
1) CHKPT - initial sync_count = 1 (region #0 must be in-sync)
2) set region #1 in-sync - sync_count should now be 2
3) CHKPT - should be ignored, we already have bits
4) CHKPT - should be ignored, we already have bits
*) get sync work requests
5) set region #2 in-sync - sync_count should be 3
6) !! detection that the previous bit should have been set !!
7) !! bitmap code detects a mismatch between what it thinks sync_count should be and what is calculated from the bitmap !!

Looking at the checkpointing code, we see in lvm2/daemons/cmirrord/cluster.c:import_checkpoint() that the parameter 'no_read' is being completely ignored.  This parameter is meant to prevent all but the initial checkpoints from being loaded.  So it seems that steps 3 and 4 above are being allowed to trample the bitmap data set in previous steps - critically, it tramples step #2 which contains the operation that marks that portion of the bitmap in-sync.

A patch should be ready shortly after testing.
Comment 13 Alasdair Kergon 2013-08-20 12:56:29 EDT
So does that mean the data actually is in sync, just it's not being displayed as such?
Comment 14 Jonathan Earl Brassow 2013-08-20 13:52:05 EDT
yes.

That has many ramifications for our tests though - many of which wait for sync before proceeding.  In this case, even though the mirror is in-sync, it displays < 100%.

Also, some mirror operations are not allowed when the mirror is < 100%.
Comment 15 Jonathan Earl Brassow 2013-08-20 14:23:31 EDT
Fix checked in upstream:

commit f0be9ac9047a30bb2befe6a8d9130f466d37d9fd
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Tue Aug 20 13:21:09 2013 -0500

    cmirrord: Prevent secondary checkpoints from corrupting bitmaps
    
    The bug addressed by this patch manifested itself during testing
    by showing a mirror that never became 'in-sync' after creation.
    The bug is isolated to distributions that do not have support
    for openAIS checkpointing (i.e. > RHEL6, > F16).
    
    When a node joins a group that is managing a mirror log, the other
    machines in the group send it a checkpoint representing the current
    state of the bitmap.  More than one machine can send a checkpoint,
    but only the initial one should be imported.  Once the bitmap state
    has been imported from the initial checkpoint, operations (such
    as resync, mark, and clear operations) can begin.  When subsequent
    checkpoints are allowed to be imported, it has the effect of erasing
    all the log operations between the initial checkpoint and the ones
    that follow.
    
    When cmirrord was updated to handle the absence of openAIS
    checkpointing (commit 62e38da133d9801cdf36b0f2aaec615ce14b9000),
    the new import_checkpoint() function failed to honor the 'no_read'
    parameter.  This parameter was designed to avoid reading all but
    the initial checkpoint.  Honoring this parameter has solved the
    issue of corrupting bitmap data with secondary checkpoints.
Comment 16 Marian Csontos 2013-08-21 04:54:07 EDT
pvmove is one of the operations that block, and I was unable to work around it.
I will report results after retesting. Thanks.
Comment 17 Marian Csontos 2013-08-21 14:37:32 EDT
Problem solved. Thanks.
Comment 18 Marian Csontos 2013-08-21 14:40:51 EDT
BZ/Firefox lost its mind and reset the fields...
Comment 20 Nenad Peric 2013-11-19 07:27:06 EST
Tested with automatic and manual tests. 

Scenario:[verify_sync_completions]

Waiting until all mirror|raid volumes become fully syncd...
   0/8 mirror(s) are fully synced: ( 79.40% 66.40% 62.70% 53.00% 39.40% 45.60% 22.90% 19.90% )
   0/8 mirror(s) are fully synced: ( 90.50% 77.60% 74.00% 64.00% 50.30% 60.30% 33.80% 34.20% )
   1/8 mirror(s) are fully synced: ( 100.00% 88.90% 84.80% 74.80% 61.80% 74.90% 44.70% 48.90% )
   2/8 mirror(s) are fully synced: ( 100.00% 100.00% 95.70% 85.70% 77.60% 89.40% 55.60% 63.20% )
   4/8 mirror(s) are fully synced: ( 100.00% 100.00% 100.00% 98.80% 98.00% 100.00% 68.40% 78.70% )
   8/8 mirror(s) are fully synced: ( 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% )




[root@virt-001 bin]# for i in {2..4}; do qarsh root@virt-00$i lvs ; done
  LV      VG     Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  mirror1 mirror mwi-a-m--- 640.00m                               100.00        
  mirror2 mirror mwi-a-m--- 640.00m                               100.00        
  mirror3 mirror mwi-a-m--- 640.00m                               100.00        
  root    rhel   -wi-ao----   6.71g                                             
  swap    rhel   -wi-ao---- 820.00m                                             
  LV      VG     Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  mirror1 mirror mwi-a-m--- 640.00m                               100.00        
  mirror2 mirror mwi-a-m--- 640.00m                               100.00        
  mirror3 mirror mwi-a-m--- 640.00m                               100.00        
  root    rhel   -wi-ao----   6.71g                                             
  swap    rhel   -wi-ao---- 820.00m                                             
  LV      VG     Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  mirror1 mirror mwi-a-m--- 640.00m                               100.00        
  mirror2 mirror mwi-a-m--- 640.00m                               100.00        
  mirror3 mirror mwi-a-m--- 640.00m                               100.00        
  root    rhel   -wi-ao----   6.71g                                             
  swap    rhel   -wi-ao---- 820.00m   


Tested and marking verified with lvm2-2.02.103-5.el7.
Comment 21 Ludek Smid 2014-06-13 07:27:50 EDT
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
