1622076 – [geo-rep]: geo-rep reverse sync in FO/FB can accidentally delete the content at original master incase of gfid conflict in 3.4.0 without explicit user rmdir

Bug 1622076 - [geo-rep]: geo-rep reverse sync in FO/FB can accidentally delete the content at original master incase of gfid conflict in 3.4.0 without explicit user rmdir

Summary: [geo-rep]: geo-rep reverse sync in FO/FB can accidentally delete the content ...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	geo-replication
Sub Component:
Version:	mainline
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kotresh HR
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1622029
Blocks:
TreeView+	depends on / blocked

Reported:	2018-08-24 11:50 UTC by Kotresh HR
Modified:	2018-10-23 15:17 UTC (History)
CC List:	9 users (show)
Fixed In Version:	glusterfs-5.0
Clone Of:	1622029
Environment:
Last Closed:	2018-10-23 15:17:29 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Comment 1 Kotresh HR 2018-08-24 11:51:31 UTC

Description of problem:
=======================

The new enhancement of automatic resolution of the gfid conflicts between Master and Slave results in data being deleted at Master if their exists a gfid conflict from Salve in FB scenario. 

Without the feature the geo-rep used to be faulty and it would log in the messages for admin intervention. 

Following are detailed scenarios: 

Scenario 1:
+++++++++++

If the file is synced to slave and than appended at master (when geo-rep is stopped) and again it got appended as part of slave write. 

With feature=> Than the geo-rep reverse sync from slave=>master will overwrite the data that was originally written on Master. 

Without feature => The content didnt change at Master. It didnt sync from slave. Theoretically it should have but HYBRID crawl seems to have not picked this up and picked directories before it. The workers keeps crashing because of directories missmatch. 


Scenario 2:
+++++++++++

If the file is synced to slave and geo-rep went down. A new file is created with the same name at slave (Has different gfid). 

With feature => Than the  geo-rep reverse sync from slave=>master will overwrite the file that was originally present at master. 

Without feature => File logs error and ignores to sync. Following are errors: 

[2018-08-24 07:57:26.793184] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '32040ac3-7437-4df4-a238-cb0d6e43cf89', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/rahul', 'op': 'MKNOD'}, 17, '0e3c52a4-03c9-4c1b-938f-2017b04a6c34')

Scenario 3:
+++++++++++

If the directory (containing some files) got synced to slave and geo-rep went down. A new directory created with same name (Has different gfid) with different content.

With feature=> Than the geo-rep reverse sync from slave=>master will overwrite the content which was present at master. 

Without feature => Worker crashes and remains faulty. It doesnt do explicit rmdir and warns the user. 

[2018-08-24 07:57:26.794929] E [master(/rhs/brick2/b3):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '86f710c9-1e2f-4c76-8695-475dc639236b', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/NEW_DIR', 'op': 'MKDIR'}, 17, '61a2eee2-05c6-4a7f-93d8-2b1e3e179896')
[2018-08-24 07:57:26.795558] E [syncdutils(/rhs/brick2/b3):280:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-24 07:57:26.796639] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '86f710c9-1e2f-4c76-8695-475dc639236b', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/NEW_DIR', 'op': 'MKDIR'}, 17, '61a2eee2-05c6-4a7f-93d8-2b1e3e179896')
[2018-08-24 07:57:26.797055] E [syncdutils(/rhs/brick1/b1):280:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-24 07:57:26.822345] I [syncdutils(/rhs/brick2/b3):253:finalize] <top>: exiting.


Version-Release number of selected component (if applicable):
=============================================================

mainline


How reproducible:
=================

Always

Additional info:
================

1. FO / FB scenarios are rare and is done only as a disaster recovery. 
2. The intention and the enhancement actually helps the user to have the auto resolution and provide better usability. 
3. This however could execute "rm -rf or rmdir" via code which is not used by the user
4. If their exists a content with same name different gfid which was written in the actual master, than that would be accidentally

Solution:
=========

1. Everyone that needs to do FO/FB has to follow the right steps mentioned in the admin guide. They are not commonly used and hence require a reference  
2. Explain these scenario as a "NOTE / WARNING / Expectation" in the admin guide for user awareness.
3. Create a config option which can disable the auto gfid resolution
4. Provide a not in admin guide to use the config option (from 3) if the user choose to handle the conflicting directories manually as a precautionary measure before deleting (from 2).

Comment 2 Worker Ant 2018-08-24 12:15:27 UTC

REVIEW: https://review.gluster.org/20986 (geo-rep: Make automatic gfid conflict resolution optional) posted (#1) for review on master by Kotresh HR

Comment 3 Worker Ant 2018-08-27 03:46:36 UTC

COMMIT: https://review.gluster.org/20986 committed in master by "Aravinda VK" <avishwan> with a commit message- geo-rep: Make automatic gfid conflict resolution optional

Autmatic gfid conflict resolution needs to be disabled
during failover/failback as it might lead to data loss
in the following scenario.

1. Master went down without syncing directory "dir1" to slave.
2. When slave is failed over to master, if a new file
   is written inside "dir1", creating dir1 again if not
   present, "dir1" ends up with different gfid on original
   slave.
3. When original master is up and failed back, due to
   automatic gfid conflict resolution, "dir1" present in
   original master is deleted losing all files and only
   new file created on original slave is restored.

Hence during failover/failback, automatic gfid conflict
resolution should be disabled. So in these cases, appropriate
decision is taken.

fixes: bz#1622076
Signed-off-by: Kotresh HR <khiremat>
Change-Id: I433616f5d3e13d4b6eb675475bd554ca34928573

Comment 4 Shyamsundar 2018-10-23 15:17:29 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.