Bug 1622029 - [geo-rep]: geo-rep reverse sync in FO/FB can accidentally delete the content at original master in case of gfid conflict in 3.4.0 without explicit user rmdir
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Kotresh HR
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1503137 1622076
 
Reported: 2018-08-24 08:08 UTC by Rahul Hinduja
Modified: 2018-09-14 05:56 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.12.2-18
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1622076
Environment:
Last Closed: 2018-09-04 06:52:17 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2018:2607 (last updated 2018-09-04 06:53:50 UTC)

Description Rahul Hinduja 2018-08-24 08:08:14 UTC
Description of problem:
=======================

3.4.0 has a design enhancement for geo-rep to auto-resolve gfid conflicts between Master and Slave. However, this can result in data being deleted at the Master in the 3.4.0 cycle if there exists a gfid conflict from the Slave in an FB (failback) scenario.

In 3.3.1, geo-rep would instead go faulty and log messages calling for admin intervention.

Following are detailed scenarios: 

Scenario 1:
+++++++++++

A file is synced to the slave, then appended at the master (while geo-rep is stopped), and also appended at the slave as part of slave writes.

3.4.0 => The geo-rep reverse sync from slave=>master then overwrites the data that was originally written on the Master.

3.3.1 => The content did not change at the Master; it did not sync from the slave. Theoretically it should have, but the HYBRID crawl seems not to have picked this file up and picked up directories before it. The workers keep crashing because of the directory mismatch.


Scenario 2:
+++++++++++

A file is synced to the slave and geo-rep goes down. A new file is then created with the same name at the slave (with a different gfid).

3.4.0 => The geo-rep reverse sync from slave=>master then overwrites the file that was originally present at the master.

3.3.1 => Geo-rep logs an error for the file and skips syncing it. The following errors are seen:

[2018-08-24 07:57:26.793184] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '32040ac3-7437-4df4-a238-cb0d6e43cf89', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/rahul', 'op': 'MKNOD'}, 17, '0e3c52a4-03c9-4c1b-938f-2017b04a6c34')

Scenario 3:
+++++++++++

A directory (containing some files) is synced to the slave and geo-rep goes down. A new directory is then created at the slave with the same name (different gfid) and different content.

3.4.0 => The geo-rep reverse sync from slave=>master then overwrites the content that was present at the master.

3.3.1 => The worker crashes and remains faulty. It does not do an explicit rmdir and warns the user.

[2018-08-24 07:57:26.794929] E [master(/rhs/brick2/b3):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '86f710c9-1e2f-4c76-8695-475dc639236b', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/NEW_DIR', 'op': 'MKDIR'}, 17, '61a2eee2-05c6-4a7f-93d8-2b1e3e179896')
[2018-08-24 07:57:26.795558] E [syncdutils(/rhs/brick2/b3):280:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-24 07:57:26.796639] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '86f710c9-1e2f-4c76-8695-475dc639236b', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/NEW_DIR', 'op': 'MKDIR'}, 17, '61a2eee2-05c6-4a7f-93d8-2b1e3e179896')
[2018-08-24 07:57:26.797055] E [syncdutils(/rhs/brick1/b1):280:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-24 07:57:26.822345] I [syncdutils(/rhs/brick2/b3):253:finalize] <top>: exiting.
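
For illustration, a minimal reproduction sketch of scenarios 2 and 3, assuming the master and slave volumes are FUSE-mounted at /mnt/master and /mnt/slave (hypothetical mount points), with the entry names taken from the logs above and the reverse slave=>master session set up afterwards as per the FO/FB procedure in the admin guide:

# stop the original master=>slave session before writing at the slave
gluster volume geo-replication <MASTERVOL> <SLAVEHOST>::<SLAVEVOL> stop

# scenario 2: recreate an already-synced file at the slave, so it gets a new gfid
rm -f /mnt/slave/rahul
echo "written at slave after failover" > /mnt/slave/rahul

# scenario 3: recreate an already-synced directory at the slave, so it gets a new gfid
rm -rf /mnt/slave/NEW_DIR
mkdir /mnt/slave/NEW_DIR
echo "slave-only content" > /mnt/slave/NEW_DIR/file.1

# in 3.4.0, once the reverse slave=>master session runs, the auto gfid resolution
# deletes the conflicting entries at the original master and replaces them with the
# slave copies; in 3.3.1 geo-rep logs ENTRY FAILED instead, as shown above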


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-server-3.12.2-16.el7rhgs.x86_64


How reproducible:
=================

Always

Additional info:
================

1. FO/FB scenarios are rare and are performed only for disaster recovery.
2. The intention of the 3.4.0 enhancement is to give the user auto resolution and provide better usability.
3. However, this can execute "rm -rf" or "rmdir" via code, which the user never explicitly issued.
4. If there exists content with the same name but a different gfid that was written at the actual master, that content would be accidentally deleted.

Solution:
=========

1. Anyone who needs to do FO/FB has to follow the correct steps mentioned in the admin guide. These steps are not commonly used and hence require a reference.
2. Explain these scenarios as a "NOTE / WARNING / Expectation" in the admin guide for user awareness.
3. Create a config option which can disable the auto gfid resolution (see the example commands below).
4. Provide a note in the admin guide to use the config option (from 3) if the user chooses to handle the conflicting directories manually as a precautionary measure before deletion (from 2).
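
For item 3, a hedged example of the config option; the option name and command syntax below match the verification transcripts in comment 8, while the volume and host names are placeholders:

# disable automatic gfid conflict resolution for the session
gluster volume geo-replication <MASTERVOL> <SLAVEHOST>::<SLAVEVOL> config gfid-conflict-resolution false

# check the current value
gluster volume geo-replication <MASTERVOL> <SLAVEHOST>::<SLAVEVOL> config gfid-conflict-resolution

# re-enable it once the conflicting entries have been handled manually
gluster volume geo-replication <MASTERVOL> <SLAVEHOST>::<SLAVEVOL> config gfid-conflict-resolution true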

Comment 7 Rahul Hinduja 2018-08-28 08:41:59 UTC
Multiple scenario Validation: Build glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64


Normal Setup => Original Master and Original Slave

Scenario 1:
-----------

Validating that the gfid conflict resolution functionality works as expected in 3.4.0, using the steps below (a command-level sketch follows the list).

A. Create geo-rep between Master and Slave.
B. Create a directory A at Slave and create data inside it.
C. Create a directory with the same name A at Master, without any data in it.
D. Create a file named file.1 at Slave with certain data.
E. Create a file with the same name file.1 at Master with a different data set.
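
A command-level sketch of steps B to E, assuming the Master and Slave volumes are FUSE-mounted at /mnt/master and /mnt/slave respectively (hypothetical paths; the geo-rep session from step A is already running):

# step B: directory A with some data at the Slave
mkdir /mnt/slave/A
echo "slave data" > /mnt/slave/A/f1

# step C: empty directory with the same name at the Master (different gfid)
mkdir /mnt/master/A

# step D: file.1 with some data at the Slave
echo "slave content" > /mnt/slave/file.1

# step E: file.1 with different data at the Master (different gfid)
echo "master content" > /mnt/master/file.1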


Expectation:

For Directory => The directory created at Master has a different gfid than the one at the Slave. The auto gfid resolution detects that and syncs the content from Master to Slave. In this case, the result is that both Master and Slave have directory A without any data in it.

Log: 

[2018-08-28 08:27:28.479295] I [master(/rhs/brick2/b5):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry        retry_count=1   entry=({'uid': 0, 'gfid': '99f45f16-5340-4740-a49a-c394f4b2354c', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_on', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'a5b6d78f-0295-4f09-a82b-eb304ebf9d77', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:27:28.480962] I [master(/rhs/brick1/b1):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry        retry_count=1   entry=({'uid': 0, 'gfid': '99f45f16-5340-4740-a49a-c394f4b2354c', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_on', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'a5b6d78f-0295-4f09-a82b-eb304ebf9d77', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:27:28.769546] I [master(/rhs/brick3/b9):1450:crawl] _GM

For File => The file created at Master has a different gfid than the one at the Slave. The auto gfid resolution detects that and syncs the content from Master to Slave. In this case, the result is that both Master and Slave have the file with the content of the Master.

Log:

[2018-08-28 08:37:57.299675] I [master(/rhs/brick3/b9):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry        retry_count=1   entry=({'uid': 0, 'gfid': '00131c50-d3f7-4360-866c-3f715a3c36fd', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/hosts', 'op': 'CREATE'}, 17, {'slave_isdir': False, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'b2f34de0-d2b5-4c8e-b8b1-0d021fe98094', 'name_mismatch': False, 'dst': False})

Comment 8 Rahul Hinduja 2018-08-28 09:34:45 UTC
Multiple scenario Validation: Build glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64


Normal Setup => Original Master and Original Slave

Scenario 2:
-----------

Validate the config CLI

A. Setting a non-boolean value via the CLI should fail.
Actual: It succeeds => a bug is in place to track this (1622957)

B. Resetting the value should work. => Works
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
rahul
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config \!gfid-conflict-resolution
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
[root@rhsauto032 scripts]# 

C. Setting up boolean values

[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution on
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
on
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution off
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
off
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution 1
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
1
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution 0
geo-replication config updated successfully
[root@rhsauto032 scripts]# 
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
0
[root@rhsauto032 scripts]#
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
0
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution true
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
true
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution false
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
false
[root@rhsauto032 scripts]#

Comment 9 Rahul Hinduja 2018-08-28 10:56:12 UTC
Multiple scenario Validation: Build glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64


Normal Setup => Original Master and Original Slave

Scenario 3:
-----------

Validating the "gfid-conflict-resolution" functionality when it is "false"

A. Setup geo-rep between Master and Slave
B. Set the gfid-conflict-resolution to "false"
C. Create a directory (A) with content at Slave
D. Create a file (file) with content at Slave
E. Create the same directory (A) at Master with different content
F. Create the same file (file) with different content at Master

For Files: 
==========

gfid-conflict-resolution on: Successfully fixes the conflict
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[2018-08-28 08:37:57.299675] I [master(/rhs/brick3/b9):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry        retry_count=1   entry=({'uid': 0, 'gfid': '00131c50-d3f7-4360-866c-3f715a3c36fd', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/hosts', 'op': 'CREATE'}, 17, {'slave_isdir': False, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'b2f34de0-d2b5-4c8e-b8b1-0d021fe98094', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:37:57.305568] I [master(/rhs/brick3/b9):930:handle_entry_failures] _GMaster: Sucessfully fixed entry ops with gfid mismatch      retry_count=1

gfid-conflict-resolution off: Logs an "ENTRY FAILED" error and moves forward. Geo-rep does not go faulty.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[2018-08-28 09:40:44.24735] E [master(/rhs/brick1/b1):785:log_failures] _GMaster: ENTRY FAILED  data=({'uid': 0, 'gfid': '14eb2254-0ddb-4866-b733-b0f268328bf6', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/hosts.allow', 'op': 'CREATE'}, 17, {'slave_isdir': False, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': '7d812da5-3b86-4f10-8234-4f1b4bfaf07f', 'name_mismatch': False, 'dst': False})
[2018-08-28 09:40:44.648068] I [master(/rhs/brick1/b1):1932:syncjob] Syncer: Sync Time Taken    duration=0.1531 num_files=1     job=1   return_code=0


For Directory: 
==============


gfid-conflict-resolution on: Successfully fixes the conflict
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[2018-08-28 08:27:28.479295] I [master(/rhs/brick2/b5):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry        retry_count=1   entry=({'uid': 0, 'gfid': '99f45f16-5340-4740-a49a-c394f4b2354c', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_on', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'a5b6d78f-0295-4f09-a82b-eb304ebf9d77', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:27:28.480962] I [master(/rhs/brick1/b1):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry        retry_count=1   entry=({'uid': 0, 'gfid': '99f45f16-5340-4740-a49a-c394f4b2354c', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_on', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'a5b6d78f-0295-4f09-a82b-eb304ebf9d77', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:27:28.769546] I [master(/rhs/brick3/b9):1450:crawl] _GMaster: slave's time       stime=(1535444200, 0)
[2018-08-28 08:27:28.798682] I [master(/rhs/brick2/b5):930:handle_entry_failures] _GMaster: Sucessfully fixed entry ops with gfid mismatch      retry_count=1


gfid-conflict-resolution off: Logs an "ENTRY FAILED" error and geo-rep remains "FAULTY". It also asks the user to fix the issue to proceed further.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


[2018-08-28 09:55:07.164000] E [master(/rhs/brick3/b9):785:log_failures] _GMaster: ENTRY FAILED data=({'uid': 0, 'gfid': 'be29aa70-e124-4985-9d58-e888b771fd00', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_off', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'afac7618-a878-4e02-b48c-447a4f8e1d7d', 'name_mismatch': False, 'dst': False})
[2018-08-28 09:55:07.164477] E [syncdutils(/rhs/brick3/b9):317:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-28 09:55:07.175034] I [syncdutils(/rhs/brick3/b9):289:finalize] <top>: exiting.


After fixing the problematic directories, things work fine.
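
As a hedged illustration of what fixing involves: the gfid mismatch can be confirmed by comparing the trusted.gfid xattr of the conflicting entry on a master brick and on a slave brick, after which one copy is renamed or removed as described in the admin guide. The master brick path and entry name below come from the logs above; the slave brick path is an assumed example:

# on a master brick
getfattr -d -m . -e hex /rhs/brick3/b9/auto_gfid_default_off | grep trusted.gfid

# on the corresponding slave brick (path is an assumption)
getfattr -d -m . -e hex /rhs/brick1/s1/auto_gfid_default_off | grep trusted.gfid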

Comment 10 Rahul Hinduja 2018-08-28 10:56:48 UTC
Based on comments 7, 8, and 9, moving this bug to the verified state for 3.4.0.

Comment 12 errata-xmlrpc 2018-09-04 06:52:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

