Bug 1462210 - [RFE][GSS]Geo-replication skip the deletion of files if a slave subvolume was down [NEEDINFO]
Status: NEW
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 3.3
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Kotresh HR
QA Contact: Rahul Hinduja
Keywords: FutureFeature, ZStream
Depends On:
Blocks: 1408949 RHGS-usability-bug-GSS
Reported: 2017-06-16 08:30 EDT by Riyas Abdulrasak
Modified: 2017-09-28 13:29 EDT (History)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
avishwan: needinfo? (rgowdapp)


Attachments: None
Description Riyas Abdulrasak 2017-06-16 08:30:53 EDT
Description of problem:

Geo-replication does not sync file deletions if a subvolume of the slave volume (distribute-replicate) goes down and comes back up.

Version-Release number of selected component (if applicable):


Red Hat Gluster Storage Server 3.3.0
glusterfs-3.8.4-27.el7rhgs.x86_64

How reproducible:

Always

Steps to Reproduce:

- Create 2x2 master and slave volumes.
- Kill one replica set of bricks on the slave side. Keep the other replica set running.

E.g.:
on a 2x2 volume,
kill brick1 and brick2, which form one replica set, and keep brick4 and brick5 up.


- Geo-replication status shows the sessions as Active and Passive (no Faulty sessions).

- Delete the contents of the master volume. The delete succeeds.


- Bring up both replica bricks on the slave side.
- The files in the slave volume that were on the bricks that were down are not deleted.


This causes the slave and master volumes to be out of sync.  
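
One way to confirm the stale entries is to compare file listings from the two volumes. The following is only a minimal sketch, assuming both volumes are FUSE-mounted on the same client; the mount points /mnt/master and /mnt/slave and the helper names are hypothetical and not part of this report.

```python
import os


def list_relative_paths(root):
    """Return the set of file paths under root, relative to root."""
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            paths.add(os.path.relpath(full, root))
    return paths


def stale_on_slave(master_mount, slave_mount):
    """Return files present on the slave mount but absent from the master mount."""
    master_files = list_relative_paths(master_mount)
    slave_files = list_relative_paths(slave_mount)
    return sorted(slave_files - master_files)


if __name__ == "__main__":
    # Hypothetical mount points; adjust to the actual client mounts.
    for path in stale_on_slave("/mnt/master", "/mnt/slave"):
        print(path)
```

After the steps above, any path printed by this check is a file that was deleted on the master but still exists on the slave.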

Actual results:

The slave volume has some stale data

Expected results:

The master and slave volumes should be in sync. 


Additional info:

Customers can hit this issue easily. If a brick process gets killed, the geo-replication status does not report any errors, but the slave and master volumes will be out of sync.
Comment 2 Aravinda VK 2017-06-16 09:26:51 EDT
I think this is the expected behavior if all the bricks of a subvolume are down and the file is hashed to that subvolume.

Simple example:

gluster volume create gv1 node1:/bricks/b1 node2:/bricks/b2 force
gluster volume start gv1

mount -t glusterfs localhost:gv1 /mnt/gv1
echo "Hello World" > /mnt/gv1/f1

Check the backend to find the brick the file is hashed to, and kill that brick. Then try to access or delete the file. We always get "rm: cannot remove 'f1': No such file or directory". Geo-rep assumes the file is already deleted and proceeds without logging anything.

Geo-replication has no way to tell whether the file is already deleted or the subvolume is down. I think DHT can be enhanced to return a different error code if the subvolume is down.
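
To illustrate the idea, here is a rough sketch of how a sync worker could treat the two cases differently if DHT returned a distinct error. This is not the actual gsyncd code: the function replay_unlink and the choice of ENOTCONN as the "subvolume down" error are assumptions made only for the example.

```python
import errno
import os

# Hypothetical: the errno DHT might return when the hashed subvolume is down.
# Today, as described above, this case surfaces as ENOENT instead.
SUBVOL_DOWN_ERRNO = errno.ENOTCONN


def replay_unlink(path, unlink=os.unlink):
    """Replay a delete on the slave and classify the outcome.

    Returns "deleted" on success, "already-gone" if the file does not
    exist, and "retry" if the (hypothetical) subvolume-down error is seen.
    Any other error is re-raised.
    """
    try:
        unlink(path)
        return "deleted"
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Safe to skip: the entry really is absent.
            return "already-gone"
        if e.errno == SUBVOL_DOWN_ERRNO:
            # Not safe to skip: keep the change pending and retry later.
            return "retry"
        raise
```

With a distinct error code, only the "already-gone" case would be skipped silently; the "retry" case could be logged and reprocessed once the bricks are back.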

Adding Raghavendra to check the possibility from DHT to differentiate these errors.
