Description of problem:
=======================
3.4.0 has a design enhancement for geo-rep to auto-resolve gfid conflicts between Master and Slave. However, this could mean data being deleted at Master in the 3.4.0 cycle if there exists a gfid conflict from Slave in the FB (fail-back) scenario. In 3.3.1, geo-rep used to go faulty and would log a message asking for admin intervention. Following are the detailed scenarios:

Scenario 1:
+++++++++++
The file is synced to slave and then appended at master (while geo-rep is stopped), and appended again as part of a slave write.
3.4.0 => The geo-rep reverse sync from slave=>master overwrites the data that was originally written on Master.
3.3.1 => The content did not change at Master. It did not sync from slave. Theoretically it should have, but the HYBRID crawl seems to have not picked this up and picked directories before it. The workers keep crashing because of a directory mismatch.

Scenario 2:
+++++++++++
The file is synced to slave and geo-rep went down. A new file is created with the same name at slave (has a different gfid).
3.4.0 => The geo-rep reverse sync from slave=>master overwrites the file that was originally present at master.
3.3.1 => Geo-rep logs an error and skips the sync. Following are the errors:

[2018-08-24 07:57:26.793184] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '32040ac3-7437-4df4-a238-cb0d6e43cf89', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/rahul', 'op': 'MKNOD'}, 17, '0e3c52a4-03c9-4c1b-938f-2017b04a6c34')

Scenario 3:
+++++++++++
The directory (containing some files) got synced to slave and geo-rep went down. A new directory is created with the same name (has a different gfid) and different content.
3.4.0 => The geo-rep reverse sync from slave=>master overwrites the content that was present at master.
3.3.1 => Worker crashes and remains faulty. It does not do an explicit rmdir and warns the user.
[2018-08-24 07:57:26.794929] E [master(/rhs/brick2/b3):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '86f710c9-1e2f-4c76-8695-475dc639236b', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/NEW_DIR', 'op': 'MKDIR'}, 17, '61a2eee2-05c6-4a7f-93d8-2b1e3e179896')
[2018-08-24 07:57:26.795558] E [syncdutils(/rhs/brick2/b3):280:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-24 07:57:26.796639] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '86f710c9-1e2f-4c76-8695-475dc639236b', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/NEW_DIR', 'op': 'MKDIR'}, 17, '61a2eee2-05c6-4a7f-93d8-2b1e3e179896')
[2018-08-24 07:57:26.797055] E [syncdutils(/rhs/brick1/b1):280:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-24 07:57:26.822345] I [syncdutils(/rhs/brick2/b3):253:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.12.2-16.el7rhgs.x86_64

How reproducible:
=================
Always

Additional info:
================
1. FO / FB scenarios are rare and are done only as disaster recovery.
2. The intention of the 3.4.0 enhancement is to give the user auto resolution and provide better usability.
3. However, this could execute "rm -rf or rmdir" via code, which was not initiated by the user.
4. If there exists content with the same name but a different gfid that was written on the actual master, it would be accidentally deleted.

Solution:
=========
1. Everyone that needs to do FO/FB has to follow the right steps mentioned in the admin guide. They are not commonly used and hence require a reference.
2. Explain these scenarios as a "NOTE / WARNING / Expectation" in the admin guide for user awareness.
3. Create a config option which can disable the auto gfid resolution.
4. Provide a note in the admin guide to use the config option (from 3) if the user chooses to handle the conflicting directories manually as a precautionary measure before deleting (from 2).
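As a sketch of point 3 of the solution, the config option can be toggled through the geo-rep config CLI; the session names Master and rhsauto022::Slave below are taken from the validation transcript later in this report, substitute your own:

```shell
# Disable the automatic gfid-conflict resolution so conflicts are only
# logged as ENTRY FAILED, leaving the cleanup decision to the admin:
gluster volume geo-replication Master rhsauto022::Slave \
    config gfid-conflict-resolution false

# Re-enable it:
gluster volume geo-replication Master rhsauto022::Slave \
    config gfid-conflict-resolution true

# Or reset it to the default:
gluster volume geo-replication Master rhsauto022::Slave \
    config \!gfid-conflict-resolution
```

With the option off, the admin can inspect and resolve the conflicting entries manually before re-enabling the automatic behaviour.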
Multiple scenario Validation:
Build: glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64
Normal Setup => Original Master and Original Slave

Scenario 1:
-----------
Validating the gfid functionality to work as expected in 3.4.0.
A. Create geo-rep between Master and Slave.
B. Create a directory A at slave and create data inside it.
C. Create a directory with the same name A at Master without any data in it.
D. Create a file with name file.1 at slave with certain data.
E. Create a file with the same name file.1 at Master with a different data set.

Expectation:

For Directory => The directory created at Master has a different gfid than the one at slave. The auto gfid resolution detects that and syncs the content from Master to Slave. In this case, the result is that both Master and Slave have directory A without any data in it.

Log:
[2018-08-28 08:27:28.479295] I [master(/rhs/brick2/b5):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({'uid': 0, 'gfid': '99f45f16-5340-4740-a49a-c394f4b2354c', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_on', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'a5b6d78f-0295-4f09-a82b-eb304ebf9d77', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:27:28.480962] I [master(/rhs/brick1/b1):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({'uid': 0, 'gfid': '99f45f16-5340-4740-a49a-c394f4b2354c', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_on', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'a5b6d78f-0295-4f09-a82b-eb304ebf9d77', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:27:28.769546] I [master(/rhs/brick3/b9):1450:crawl] _GMaster: slave's time stime=(1535444200, 0)

For File => The file created at Master has a different gfid than the one at slave. The auto gfid resolution detects that and syncs the content from Master to Slave. In this case, the result is that both Master and Slave have a file with the content of Master.

Log:
[2018-08-28 08:37:57.299675] I [master(/rhs/brick3/b9):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({'uid': 0, 'gfid': '00131c50-d3f7-4360-866c-3f715a3c36fd', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/hosts', 'op': 'CREATE'}, 17, {'slave_isdir': False, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'b2f34de0-d2b5-4c8e-b8b1-0d021fe98094', 'name_mismatch': False, 'dst': False})
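The decision being validated above can be sketched in a few lines. This is purely illustrative (a hypothetical helper, not geo-rep source code): an entry op carries the master-side gfid, and the slave may already hold the same name under a different gfid, which appears as 'slave_gfid' in the log tuples.

```python
# Illustrative sketch only (hypothetical helper, not geo-rep source code):
# the decision behind "Fixing gfid mismatch in slave" in the logs above.

def classify_entry_op(master_gfid, name, slave_entries):
    """slave_entries maps name -> gfid on the slave side.
    Returns 'create', 'in_sync', or 'gfid_mismatch'."""
    slave_gfid = slave_entries.get(name)
    if slave_gfid is None:
        return "create"          # name is free on the slave: just create it
    if slave_gfid == master_gfid:
        return "in_sync"         # same object, nothing to fix
    return "gfid_mismatch"       # 3.4.0 deletes the slave entry and retries

# gfids taken from the scenario-1 directory log above:
slave = {"auto_gfid_default_on": "a5b6d78f-0295-4f09-a82b-eb304ebf9d77"}
print(classify_entry_op("99f45f16-5340-4740-a49a-c394f4b2354c",
                        "auto_gfid_default_on", slave))
# -> gfid_mismatch
```

In the 'gfid_mismatch' case, 3.4.0 resolves automatically (Master wins), while 3.3.1, and 3.4.0 with gfid-conflict-resolution off, only log the failure.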
Multiple scenario Validation:
Build: glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64
Normal Setup => Original Master and Original Slave

Scenario 2:
-----------
Validate the config cli

A. Setting a non-boolean value via the cli should fail.
Actual: It is successful => bug is in place (1622957)
B. Resetting the value should work. => Works

[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution rahul
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config \!gfid-conflict-resolution
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution
[root@rhsauto032 scripts]#

C. Setting boolean values

[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution on
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution on
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution off
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution off
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution 1
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution 1
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution 0
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution 0
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution 0
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution true
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution true
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution false
geo-replication config updated successfully
[root@rhsauto032 scripts]# gluster volume geo-replication Master rhsauto022::Slave config gfid-conflict-resolution false
[root@rhsauto032 scripts]#
Multiple scenario Validation:
Build: glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64
Normal Setup => Original Master and Original Slave

Scenario 3:
-----------
Validating the "gfid-conflict-resolution" functionality when it is "false"
A. Setup geo-rep between Master and Slave.
B. Set gfid-conflict-resolution to "false".
C. Create a directory (A) with content at Slave.
D. Create a file (file) with content at Slave.
E. Create the same directory (A) at Master with different content.
F. Create the same file (file) with different content at Master.

For Files:
==========
gfid conflict on: Successfully fixes the issue
++++++++++++++++++++++++++++++++++++++++++++++
[2018-08-28 08:37:57.299675] I [master(/rhs/brick3/b9):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({'uid': 0, 'gfid': '00131c50-d3f7-4360-866c-3f715a3c36fd', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/hosts', 'op': 'CREATE'}, 17, {'slave_isdir': False, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'b2f34de0-d2b5-4c8e-b8b1-0d021fe98094', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:37:57.305568] I [master(/rhs/brick3/b9):930:handle_entry_failures] _GMaster: Sucessfully fixed entry ops with gfid mismatch retry_count=1

gfid conflict off: Logs an error with "ENTRY FAILED" and moves forward. Geo-rep doesn't go to faulty.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[2018-08-28 09:40:44.24735] E [master(/rhs/brick1/b1):785:log_failures] _GMaster: ENTRY FAILED data=({'uid': 0, 'gfid': '14eb2254-0ddb-4866-b733-b0f268328bf6', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/hosts.allow', 'op': 'CREATE'}, 17, {'slave_isdir': False, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': '7d812da5-3b86-4f10-8234-4f1b4bfaf07f', 'name_mismatch': False, 'dst': False})
[2018-08-28 09:40:44.648068] I [master(/rhs/brick1/b1):1932:syncjob] Syncer: Sync Time Taken duration=0.1531 num_files=1 job=1 return_code=0

For Directory:
==============
gfid conflict on: Successfully fixes the issue
+++++++++++++++++++++++++++++++++++++++++++++++
[2018-08-28 08:27:28.479295] I [master(/rhs/brick2/b5):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({'uid': 0, 'gfid': '99f45f16-5340-4740-a49a-c394f4b2354c', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_on', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'a5b6d78f-0295-4f09-a82b-eb304ebf9d77', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:27:28.480962] I [master(/rhs/brick1/b1):814:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({'uid': 0, 'gfid': '99f45f16-5340-4740-a49a-c394f4b2354c', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_on', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'a5b6d78f-0295-4f09-a82b-eb304ebf9d77', 'name_mismatch': False, 'dst': False})
[2018-08-28 08:27:28.769546] I [master(/rhs/brick3/b9):1450:crawl] _GMaster: slave's time stime=(1535444200, 0)
[2018-08-28 08:27:28.798682] I [master(/rhs/brick2/b5):930:handle_entry_failures] _GMaster: Sucessfully fixed entry ops with gfid mismatch retry_count=1

gfid conflict off: Logs an error with "ENTRY FAILED" and geo-rep remains "FAULTY". It also states to fix the issue to proceed further.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[2018-08-28 09:55:07.164000] E [master(/rhs/brick3/b9):785:log_failures] _GMaster: ENTRY FAILED data=({'uid': 0, 'gfid': 'be29aa70-e124-4985-9d58-e888b771fd00', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_off', 'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, 'slave_gfid': 'afac7618-a878-4e02-b48c-447a4f8e1d7d', 'name_mismatch': False, 'dst': False})
[2018-08-28 09:55:07.164477] E [syncdutils(/rhs/brick3/b9):317:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-28 09:55:07.175034] I [syncdutils(/rhs/brick3/b9):289:finalize] <top>: exiting.

After fixing the problematic directories, things work fine.
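When triaging such failures, the useful details sit in the tuple after "data=", which is a Python tuple repr. The following is a minimal illustrative sketch (not part of geo-rep) for pulling those fields out of an ENTRY FAILED line, assuming the formatting shown above:

```python
import ast
import re

# Sample ENTRY FAILED line, copied from the scenario-3 directory log above:
LINE = ("[2018-08-28 09:55:07.164000] E [master(/rhs/brick3/b9):785:log_failures] "
        "_GMaster: ENTRY FAILED data=({'uid': 0, 'gfid': 'be29aa70-e124-4985-9d58-e888b771fd00', "
        "'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/auto_gfid_default_off', "
        "'op': 'MKDIR'}, 17, {'slave_isdir': True, 'gfid_mismatch': True, 'slave_name': None, "
        "'slave_gfid': 'afac7618-a878-4e02-b48c-447a4f8e1d7d', 'name_mismatch': False, 'dst': False})")

def parse_entry_failed(line):
    """Return the (op_dict, errno, detail) tuple from an ENTRY FAILED
    log line, or None if the line does not match."""
    m = re.search(r"ENTRY FAILED\s*(?:data=)?(\(.*\))\s*$", line)
    if not m:
        return None
    # The payload is a plain tuple/dict literal, so literal_eval is safe.
    return ast.literal_eval(m.group(1))

op, err, detail = parse_entry_failed(LINE)
print(op["op"], err, detail["gfid_mismatch"], detail["slave_gfid"])
# -> MKDIR 17 True afac7618-a878-4e02-b48c-447a4f8e1d7d
```

The 17 in the tuple is errno EEXIST, consistent with the entry name already existing on the slave under a different gfid.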
Based on comments 7, 8 and 9, moving this bug to the verified state for 3.4.0.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607