Bug 1468186
Summary: | [Geo-rep]: entry failed to sync to slave with ENOENT error | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> |
Component: | geo-replication | Assignee: | Kotresh HR <khiremat> |
Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | rhgs-3.3 | CC: | amukherj, bturner, bugs, csaba, khiremat, rcyriac, rhinduja, rhs-bugs, storage-qa-internal |
Target Milestone: | --- | ||
Target Release: | RHGS 3.3.0 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-3.8.4-33 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | 1467718 | Environment: | |
Last Closed: | 2017-09-21 05:02:13 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1467718, 1468198, 1468200 | ||
Bug Blocks: | 1417151 |
Description
Rahul Hinduja
2017-07-06 08:49:08 UTC
It is a very corner case: a race between how two changelogs are processed during rmdir and mkdir. The following is one such case:

1. Two subvolumes have directory d1; d1 has files f1 and f2 on the first subvolume and f3 and f4 on the second.
2. rmdir of d1 is issued, followed by a mkdir with the same name (d1). New files are created with the names f5, f6, f7, f8.

If the rmdir fails on one subvolume (A) for any reason, the recursive rmdir is retried. At the same time, some of the new files hash to a different subvolume (B). Once the rmdir is reprocessed at A, it deletes the newly created files at B, leaving only the files created after the changelog processed the mkdir on A.

Proposing as a blocker because it can cause data loss (or a data mismatch) at the slave in the specific scenario mentioned in comment 5.

Upstream patch: https://review.gluster.org/#/c/17695/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/111301/

There were multiple consequences of this bug:

1. ENTRY errors in the logs
2. Data loss at the slave (either the whole directory or a few files from it were missing)

Was able to reproduce this issue on the 3.2.0 (3.8.4-18) build using the following steps (command sketches for identifying the hashed subvolume and for the full sequence are given at the end of this report):

1. touch dir1 => this is to find which subvolume the file hashes to
2. rm dir1
3. mkdir dir1 and create some files inside it (touch {1..99})
4. Let it sync to the slave
5. Stop the geo-replication
6. Attach gdb to the mount pid and set a breakpoint at dht_rmdir_lock_cbk
7. continue
8. rm -rf dir1/
9. Kill the complete hashed subvolume (captured in step 1)
10. continue
11. Start the volume with force (bring the bricks back)
12. ls /mnt/dir1
13. Wait for DHT heal
14. Write some more files into dir1/ (touch file{1..99})
15. Start the geo-replication

On the 3.2.0_async builds, the above use case was tried twice with the following results:

1. In the first instance, some of the files were missing from the slave.
2. In the second instance, the directory dir1 was missing from the slave and entry failures were reported.

Tried the same case with build glusterfs-geo-replication-3.8.4-37.el7rhgs.x86_64: in both iterations the files were properly synced to the slave without any entry errors in the logs. Moving this bug to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
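The report does not say how the hashed subvolume was identified in step 1 of the reproduction. A minimal sketch of one common way, using Gluster's trusted.glusterfs.pathinfo virtual xattr, is shown below; the mount point is an assumed path, not a value from the report.

```bash
# Minimal sketch (assumed mount point): find which subvolume/brick the name
# "dir1" hashes to by creating it as a file and reading the
# trusted.glusterfs.pathinfo virtual xattr, which reports the backing brick(s).
MNT=/mnt/master            # assumed FUSE mount of the master volume

touch "$MNT/dir1"
getfattr -n trusted.glusterfs.pathinfo "$MNT/dir1"
rm -f "$MNT/dir1"          # step 2: remove the probe file again
```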
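Below is a rough shell sketch of the full reproduction sequence, assembled from the numbered steps above. The volume name, slave host and volume, mount point, and brick process pattern are assumptions for illustration only; the gdb step is interactive, so the sketch merely indicates where the breakpoint is placed and when execution is continued.

```bash
# Rough reproduction sketch; VOL, SLAVE, MNT, and the brick pattern below are
# assumed names, not values taken from the original report.
VOL=master
SLAVE=slavehost::slavevol
MNT=/mnt/master

# Steps 2-3: recreate dir1 as a directory and populate it
rm -f "$MNT/dir1"
mkdir "$MNT/dir1"
touch "$MNT"/dir1/{1..99}

# Steps 4-5: wait for the files to sync to the slave, then stop geo-replication
gluster volume geo-replication "$VOL" "$SLAVE" stop

# Steps 6-7: attach gdb to the client mount process and break in
# dht_rmdir_lock_cbk (interactive; 'continue' is issued from the gdb prompt)
gdb -p "$(pgrep -f "glusterfs.*$MNT")" -ex 'break dht_rmdir_lock_cbk' -ex 'continue'

# Step 8 (from another shell, while the mount is held at the breakpoint)
rm -rf "$MNT/dir1"

# Step 9: kill all brick processes of the hashed subvolume found earlier
pkill -9 -f 'glusterfsd.*hashed-subvol-brick'   # assumed brick path pattern

# Steps 10-13: 'continue' in gdb, bring the bricks back, trigger DHT heal
gluster volume start "$VOL" force
ls "$MNT/dir1"
# ...wait for DHT self-heal to finish...

# Steps 14-15: create more files and restart geo-replication
touch "$MNT"/dir1/file{1..99}
gluster volume geo-replication "$VOL" "$SLAVE" start
```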