Bug 1470967

Summary: [GSS] geo-replication failed due to ENTRY failures on slave volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Abhishek Kumar <abhishku>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED ERRATA
QA Contact: Rochelle <rallan>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: abhishku, amukherj, atoborek, bkunal, ccalhoun, csaba, khiremat, rallan, rhinduja, rhs-bugs, sheggodu, srmukher, storage-qa-internal
Target Milestone: ---   
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type: Bug Fix
Doc Text:
Geo-replication expects the gfid of an entry to be the same on both master and slave. Previously, geo-replication failed to sync an entry to the slave when a file with the same name but a different gfid already existed there, and fixing the gfid conflict required manual intervention. With this fix, gfid mismatch failures are handled automatically: each conflict is verified on the master, which is the source of truth, and the appropriate decision is taken.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:34:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1503135    
Attachments:
Description      Flags
rsync error log  none
sync error       none

Description Abhishek Kumar 2017-07-14 07:24:36 UTC
Description of problem:

geo-replication failed due to ENTRY failures on slave volume

Version-Release number of selected component (if applicable):

glusterfs-3.8.4-18.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-18.el7rhgs.x86_64

How reproducible:

Customer Environment
 
Steps to Reproduce:
1. Stop the geo-rep session
2. Rename the directory on master volume
3. Create a new directory on master volume with previous name
4. Start the geo-rep session again
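
The steps above can be modelled with a small Python sketch. It is a toy model, not gsyncd code: each volume is a dict mapping entry name to gfid, and `sync_entry` and `EntryFailure` are hypothetical names introduced only for illustration. It assumes the restarted session sees only the master's current namespace, not the individual rename operation.

```python
# Toy model: a volume namespace is a dict mapping entry name -> gfid.
# sync_entry and EntryFailure are hypothetical, not gsyncd APIs.

class EntryFailure(Exception):
    """Raised when the slave already holds the name under a different gfid."""


def sync_entry(slave, name, gfid):
    """Create `name` on the slave with the gfid it has on the master."""
    existing = slave.get(name)
    if existing is not None and existing != gfid:
        raise EntryFailure(f"{name}: slave gfid {existing} != master gfid {gfid}")
    slave[name] = gfid


# State before step 1: dir1 is in sync on both sides.
master = {"dir1": "g1"}
slave = {"dir1": "g1"}

# Steps 2 and 3, performed on the master while the session is stopped.
master["dir1.1"] = master.pop("dir1")  # rename: the gfid g1 moves to the new name
master["dir1"] = "g2"                  # new directory reuses the old name

# Step 4: the restarted session tries to sync dir1 as the master now has it.
try:
    sync_entry(slave, "dir1", master["dir1"])
    failure = None
except EntryFailure as err:
    failure = str(err)

print(failure)
```

In this model the slave still holds dir1 under the old gfid, so the attempt to create dir1 with the new gfid is rejected, which corresponds to the ENTRY failure reported here.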

Actual results:

Geo-replication sync stopped and the logs report ENTRY failures.

Expected results:

Geo-replication sync should handle both the new directory and the renamed directory.

Additional info:

Comment 4 Abhishek Kumar 2017-07-14 12:19:53 UTC
Created attachment 1298294
rsync error log

Comment 5 Abhishek Kumar 2017-07-14 12:27:49 UTC
Created attachment 1298297
sync error

Comment 43 Rahul Hinduja 2017-09-21 19:02:57 UTC
The following is a summary of the qualification done on the build mentioned in comment 35:

Hybrid Rename Scenarios:
========================
=> If a file (f1) is renamed to (f2) during the hybrid crawl, both files (the original f1 and the renamed f2) appear at the slave as hardlinks to each other with the same gfid. Any subsequent creation of a file f1 with a different gfid corrects the slave.

However, if a directory is created with the name f1, the slave is not corrected: f1 remains a file there, and cleaning it up requires manual effort.


=> If a directory is renamed during the hybrid crawl, the renamed directory does not appear at the slave. However, all the data populated in the renamed directory at the master goes into the original directory at the slave (this is as designed and already known). If the directory is recreated at the master with a different gfid, the slave is corrected too.


Customer Workload:
==================

The customer workload involves the following pattern:

A. Create a file f1 => it is synced to the slave
B. Hardlink file f1 to f2 => the hardlink is synced to the slave
C. Delete file f1 => f1 stays at the slave
D. Rename file f2 to f3 => f2 stays at the slave, f1 stays from C, f3 is synced from D

In the above case, more inodes are consumed at the slave because all of these names remain there as hardlinks. There is no data loss, only the inode penalty.
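
The pattern A to D can be traced with a short Python sketch. It is only a bookkeeping model of the outcomes listed above, not geo-replication code: each volume is a dict mapping file name to gfid, names sharing a gfid stand for hardlinks, and the gfid label `g1` is made up for illustration.

```python
# Toy model: each volume maps file name -> gfid; names sharing a gfid are hardlinks.
master, slave = {}, {}

# A. Create f1: synced to the slave.
master["f1"] = "g1"
slave["f1"] = "g1"

# B. Hardlink f1 to f2: the hardlink is synced to the slave.
master["f2"] = master["f1"]
slave["f2"] = slave["f1"]

# C. Delete f1 on the master: as observed above, f1 stays at the slave.
del master["f1"]

# D. Rename f2 to f3 on the master: f3 is synced, f2 stays at the slave.
master["f3"] = master.pop("f2")
slave["f3"] = "g1"

# Names that exist only at the slave after the whole pattern has run.
stale_names = sorted(set(slave) - set(master))
print(stale_names)
```

After the four steps the master holds only f3, while the slave in this model still holds f1 and f2 in addition to f3, all under the same gfid.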

Workload mentioned in comment 10 works:
=======================================


1. CREATE DIR1  (gfid = g1)
2. RENAME DIR1 DIR1.1  (gfid = g1)
3. CREATE DIR1  (gfid = g2)
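
A minimal Python sketch of how such a mismatch could be resolved by treating the master as the source of truth is shown below. It is a simplification under stated assumptions, not the logic shipped in the build: volumes are dicts mapping entry name to gfid, `resolve_gfid_mismatch` is a hypothetical helper, and step 1 is assumed to have been synced to the slave before the rename.

```python
# Toy model: each volume is a dict mapping entry name -> gfid.
# resolve_gfid_mismatch is a hypothetical helper, not a gsyncd function.

def resolve_gfid_mismatch(master, slave, name):
    """Make slave[name] match master[name], using the master as source of truth."""
    wanted = master[name]
    stale = slave.get(name)
    if stale is not None and stale != wanted:
        # Verify on the master: under which name does the slave's gfid live now?
        new_names = [n for n, g in master.items() if g == stale]
        del slave[name]
        if new_names:
            # The entry was renamed on the master, so follow the rename on the slave.
            slave[new_names[0]] = stale
        # Otherwise the gfid is gone from the master and the slave entry is dropped.
    slave[name] = wanted


master, slave = {}, {}

# 1. CREATE DIR1 (gfid = g1), assumed to be synced to the slave.
master["DIR1"] = "g1"
slave["DIR1"] = "g1"

# 2. RENAME DIR1 DIR1.1 (gfid = g1), not yet seen by the slave.
master["DIR1.1"] = master.pop("DIR1")

# 3. CREATE DIR1 (gfid = g2).
master["DIR1"] = "g2"

resolve_gfid_mismatch(master, slave, "DIR1")
print(slave)
```

In this model the slave ends up with the same name-to-gfid mapping as the master: DIR1.1 carries g1 and DIR1 carries g2.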

Additional testing carried out on the builds:
=============================================

=> Different fops (create, chmod, chown, chgrp, hardlink, symlink, rename, truncate) during hybrid and changelog crawls.
=> Brick scenarios: add-brick, remove-brick, and brick kill.
=> Upgrade from 3.2.0 to the hotfix build.
=> Creating a directory or a file at the slave and then syncing the same entry from the master {negative case, to simulate another scenario of gfid mismatch}

The above testing covers the planned testing for the hotfix. However, please note the following:

1. The hotfix has been qualified only for the very specific scenarios mentioned above.
2. Ambiguity still exists if the file or directory is not recreated with the same name after a rename during hybrid crawl.
3. Only limited regression test coverage has been carried out.

Please set the right expectations with the customer for this hotfix build.

===== Short Summary as part of recently agreed process ==========

QE has qualified the hotfix build mentioned in comment 35 against the rename cases during hybrid crawl. Create, rename, then create of the same file/directory works with the build. A sanity check was also carried out on the hotfix to ensure the stability of the build, along with upgrade path validation.

=================================================================

Comment 71 errata-xmlrpc 2018-09-04 06:34:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607