Description of problem:
=======================

The new enhancement that automatically resolves gfid conflicts between Master and Slave results in data being deleted at Master if there exists a gfid conflict from Slave in the failback (FB) scenario. Without the feature, geo-rep used to go faulty and log messages asking for admin intervention.

Following are the detailed scenarios:

Scenario 1:
+++++++++++

The file is synced to slave and then appended at master (while geo-rep is stopped), and it is appended again as part of a slave write.

With feature => The geo-rep reverse sync from slave=>master will overwrite the data that was originally written on Master.

Without feature => The content didn't change at Master. It didn't sync from slave. Theoretically it should have, but the HYBRID crawl seems not to have picked this up and picked the directories before it. The workers keep crashing because of the directory mismatch.

Scenario 2:
+++++++++++

The file is synced to slave and geo-rep goes down. A new file is created with the same name at slave (it has a different gfid).

With feature => The geo-rep reverse sync from slave=>master will overwrite the file that was originally present at master.

Without feature => Geo-rep logs an error for the file and does not sync it. Following are the errors:

[2018-08-24 07:57:26.793184] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '32040ac3-7437-4df4-a238-cb0d6e43cf89', 'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/rahul', 'op': 'MKNOD'}, 17, '0e3c52a4-03c9-4c1b-938f-2017b04a6c34')

Scenario 3:
+++++++++++

The directory (containing some files) is synced to slave and geo-rep goes down. A new directory is created at slave with the same name (it has a different gfid) and different content.

With feature => The geo-rep reverse sync from slave=>master will overwrite the content that was present at master.

Without feature => The worker crashes and remains faulty. It doesn't do an explicit rmdir, and it warns the user.

[2018-08-24 07:57:26.794929] E [master(/rhs/brick2/b3):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '86f710c9-1e2f-4c76-8695-475dc639236b', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/NEW_DIR', 'op': 'MKDIR'}, 17, '61a2eee2-05c6-4a7f-93d8-2b1e3e179896')
[2018-08-24 07:57:26.795558] E [syncdutils(/rhs/brick2/b3):280:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-24 07:57:26.796639] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '86f710c9-1e2f-4c76-8695-475dc639236b', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/NEW_DIR', 'op': 'MKDIR'}, 17, '61a2eee2-05c6-4a7f-93d8-2b1e3e179896')
[2018-08-24 07:57:26.797055] E [syncdutils(/rhs/brick1/b1):280:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2018-08-24 07:57:26.822345] I [syncdutils(/rhs/brick2/b3):253:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================

mainline

How reproducible:
=================

Always

Additional info:
================

1. Failover (FO) / failback (FB) scenarios are rare and are done only as disaster recovery.
2. The enhancement is intended to give the user automatic resolution and better usability, and it does.
3. However, it could execute "rm -rf" or "rmdir" from code rather than by the user.
4. If there exists content with the same name but a different gfid that was written on the actual master, then it would be accidentally deleted.

Solution:
=========

1. Everyone who needs to do FO/FB has to follow the right steps mentioned in the admin guide. These steps are not commonly used and hence require a reference.
2. Explain these scenarios as a "NOTE / WARNING / Expectation" in the admin guide for user awareness.
3. Create a config option which can disable the automatic gfid resolution.
4. Provide a note in the admin guide to use the config option (from 3) if the user chooses to handle the conflicting directories manually as a precautionary measure before deleting (from 2).
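
For anyone triaging similar logs by hand, here is a minimal sketch of pulling the conflicting gfids out of the "ENTRY FAILED" lines quoted above. The tuple layout (entry dict, errno, a second gfid) is inferred only from those samples, and reading the third field as "the gfid already present on the other side" is an assumption. The helper names are hypothetical and are not part of gsyncd.

```python
import ast
import errno
import re

# Captures the tuple that follows "ENTRY FAILED:" in the sample log lines.
ENTRY_FAILED_RE = re.compile(r"ENTRY FAILED: (\(.*\))\s*$")

# One of the log lines from this report, used as sample input.
SAMPLE = (
    "[2018-08-24 07:57:26.793184] E [master(/rhs/brick1/b1):782:log_failures] "
    "_GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '32040ac3-7437-4df4-a238-cb0d6e43cf89', "
    "'gid': 0, 'mode': 33188, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/rahul', "
    "'op': 'MKNOD'}, 17, '0e3c52a4-03c9-4c1b-938f-2017b04a6c34')"
)


def parse_entry_failure(line):
    """Return (entry, err, other_gfid) for an ENTRY FAILED line, else None."""
    match = ENTRY_FAILED_RE.search(line)
    if match is None:
        return None
    # The payload is a Python literal, so literal_eval parses it without
    # executing any code.
    entry, err, other_gfid = ast.literal_eval(match.group(1))
    return entry, err, other_gfid


def is_gfid_conflict(entry, err, other_gfid):
    """True when the op failed with EEXIST and the two gfids differ."""
    # Assumption: the third tuple field is the gfid found on the other side.
    return err == errno.EEXIST and entry["gfid"] != other_gfid


if __name__ == "__main__":
    entry, err, other_gfid = parse_entry_failure(SAMPLE)
    if is_gfid_conflict(entry, err, other_gfid):
        print(entry["op"], entry["entry"], entry["gfid"], "vs", other_gfid)
```

Listing the conflicts this way before failback lets an admin decide per entry which copy to keep, instead of relying on automatic deletion.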
REVIEW: https://review.gluster.org/20986 (geo-rep: Make automatic gfid conflict resolution optional) posted (#1) for review on master by Kotresh HR
COMMIT: https://review.gluster.org/20986 committed in master by "Aravinda VK" <avishwan> with a commit message-

geo-rep: Make automatic gfid conflict resolution optional

Automatic gfid conflict resolution needs to be disabled during failover/failback as it might lead to data loss in the following scenario.

1. Master went down without syncing directory "dir1" to slave.
2. When slave is failed over to master, if a new file is written inside "dir1", creating dir1 again if not present, "dir1" ends up with a different gfid on the original slave.
3. When the original master is up and failed back, due to automatic gfid conflict resolution, "dir1" present in the original master is deleted, losing all its files, and only the new file created on the original slave is restored.

Hence during failover/failback, automatic gfid conflict resolution should be disabled, so that in these cases an appropriate decision can be taken.

fixes: bz#1622076
Signed-off-by: Kotresh HR <khiremat>
Change-Id: I433616f5d3e13d4b6eb675475bd554ca34928573
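
The effect of making the resolution optional can be modelled roughly as below. This is an illustrative sketch, not the gsyncd code from the patch: `Entry`, `plan_sync`, the `auto_resolve` flag and the action strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for an entry being synced to the other side."""
    path: str
    gfid: str


def plan_sync(source: Entry, target_gfid: Optional[str], auto_resolve: bool) -> str:
    """Decide what to do with `source` given the gfid found at the target.

    target_gfid is None when nothing with that name exists at the target.
    """
    if target_gfid is None:
        return "create"              # no name clash at all
    if target_gfid == source.gfid:
        return "skip"                # same object, already present
    if auto_resolve:
        # Same name, different gfid: this is the branch that deletes the
        # target's copy and is therefore risky during failover/failback.
        return "delete-then-create"
    # Resolution disabled: leave both copies and surface the conflict.
    return "report-conflict"


if __name__ == "__main__":
    dir1 = Entry(path="dir1", gfid="gfid-from-original-slave")
    print(plan_sync(dir1, "gfid-from-original-master", auto_resolve=True))
    print(plan_sync(dir1, "gfid-from-original-master", auto_resolve=False))
```

With the flag off, the "dir1" from the commit message scenario survives on the original master until someone decides which copy to keep.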
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/