Bug 1683893 - [geo-rep]: Checksum mismatch when 2x2 vols are converted to arbiter
Status: ASSIGNED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: hari gowtham
QA Contact: Rochelle
URL:
Whiteboard:
Keywords: ZStream
Depends On: 1686568
Blocks:
Reported: 2019-02-28 03:51 UTC by Rochelle
Modified: 2019-05-16 05:40 UTC (History)

Consequence: In a geo-replication setup, while an n*2 master volume is being converted to n*3, if the worker corresponding to a brick newly added to the replica set (including an arbiter brick) becomes 'Active', geo-replication may never sync the self-healed data, causing data loss at the slave.

Cause: 
The worker corresponding to the brick newly added to the replica set should not go to 'Faulty' until it has synced the self-healed data. If it does go to 'Faulty' and another worker becomes 'Active', a race occurs that causes this issue.

Fix/Workaround:
This is a known issue, and there is no clean workaround. An n*2 to n*3 volume conversion should therefore not be performed on the master if geo-replication is configured.
Clone Of:
: 1686568
Last Closed:



Description Rochelle 2019-02-28 03:51:23 UTC
Description of problem:
=======================
While converting 2x2 volumes to 2x(2+1) (arbiter), a checksum mismatch was observed between master and slave:

[root@dhcp43-143 ~]# ./arequal-checksum -p /mnt/master/

Entry counts
Regular files   : 10000
Directories     : 2011
Symbolic links  : 11900
Other           : 0
Total           : 23911

Metadata checksums
Regular files   : 5ce564791c
Directories     : 288ecb21ce24
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 8e69e8576625d36f9ee1866c92bfb6a3
Directories     : 4a596e7e1e792061
Symbolic links  : 756e690d61497f6a
Other           : 0
Total           : 2fbf69488baa3ac7


[root@dhcp43-143 ~]# ./arequal-checksum -p /mnt/slave/

Entry counts
Regular files   : 10000
Directories     : 2011
Symbolic links  : 11900
Other           : 0
Total           : 23911

Metadata checksums
Regular files   : 5ce564791c
Directories     : 288ecb21ce24
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 53c64bd1144f6d9855f0af3edb55e614
Directories     : 4a596e7e1e792061
Symbolic links  : 756e690d61497f6a
Other           : 0
Total           : 3901e39cb02ad487



Everything matches except in the "Checksums" section, where the "Regular files" and "Total" values differ between master and slave.
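
The comparison above was done by eye. A minimal sketch of doing it mechanically follows, assuming each arequal-checksum report has first been saved to a file. The function name compare_arequal and the file names in the usage note are hypothetical, not part of the original report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the lines that differ between two saved
# arequal-checksum reports. It returns diff's exit status: 0 when the
# reports are identical, 1 when they differ.
compare_arequal() {
    local master_report="$1" slave_report="$2"
    # diff prints only the differing lines; on success we say so explicitly.
    diff "$master_report" "$slave_report" && echo "arequal reports match"
}
```

Possible usage: run `./arequal-checksum -p /mnt/master/ > master.txt` and `./arequal-checksum -p /mnt/slave/ > slave.txt`, then `compare_arequal master.txt slave.txt`. For the two reports shown above, this should print only the "Regular files" and "Total" lines of the "Checksums" section.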



Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.12.2-45.el7rhgs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
====================
1. Create and start a geo-rep session with both master and slave being 2x2 volumes
2. Mount the volumes and start writing data to the master
3. Disable and stop self healing (prior to add-brick):

# gluster volume set VOLNAME cluster.data-self-heal off
# gluster volume set VOLNAME cluster.metadata-self-heal off
# gluster volume set VOLNAME cluster.entry-self-heal off
# gluster volume set VOLNAME self-heal-daemon off

4. Add bricks to the master and slave to convert them to 2x(2+1) arbiter volumes
5. Start rebalance on master and slave

6. Re-enable self healing:

# gluster volume set VOLNAME cluster.data-self-heal on
# gluster volume set VOLNAME cluster.metadata-self-heal on
# gluster volume set VOLNAME cluster.entry-self-heal on
# gluster volume set VOLNAME self-heal-daemon on

7. Wait for rebalance to complete
8. Compare the arequal checksums of the master and slave mounts
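
Steps 3 and 6 each issue the same four `gluster volume set` commands with only the on/off value changed. A minimal sketch that wraps them in one function follows. The function name set_self_heal and the GLUSTER override variable are assumptions added for illustration (the override allows a dry run with echo); the option names are the ones listed in the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: toggle the four self-heal options from steps 3 and 6.
# GLUSTER is an assumed override hook (e.g. GLUSTER=echo for a dry run);
# it defaults to the real gluster CLI.
set_self_heal() {
    local volname="$1" state="$2" opt
    case "$state" in
        on|off) ;;
        *) echo "usage: set_self_heal VOLNAME on|off" >&2; return 2 ;;
    esac
    for opt in cluster.data-self-heal cluster.metadata-self-heal \
               cluster.entry-self-heal self-heal-daemon; do
        # Stop at the first failing command so a partial toggle is noticed.
        ${GLUSTER:-gluster} volume set "$volname" "$opt" "$state" || return 1
    done
}
```

Possible usage: `GLUSTER=echo set_self_heal VOLNAME off` prints the four commands without running them; `set_self_heal VOLNAME off` before add-brick and `set_self_heal VOLNAME on` afterwards correspond to steps 3 and 6.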


Actual results:
===============
The checksums do not fully match: the "Regular files" and "Total" values under "Checksums" differ between master and slave


Expected results:
================
Checksum should match

