Description of problem:

Geo-replication fails to sync data (it gets stuck on one changelog) when compiling GlusterFS (or anything else that deals with Makefiles). This is due to the following sequence of operations (a typical 'make' workload):

E 8d603c98-d6ac-48eb-90f9-58612bf8b828 MKDIR 852ee298-dfb0-423e-8827-01438fcd0af4%2FconfCxOmQZ
M 8d603c98-d6ac-48eb-90f9-58612bf8b828
E d8c0083a-e667-4e72-adca-0180ebed5af9 CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fsubs1.awk
E 454c43e2-f271-477a-b927-3549215ac06f CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fsubs.awk
E f9e3c190-3e17-45ab-a560-b43d9eb77e15 CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fdefines.awk
E fb3bff29-e864-42b1-86d3-564252d29a56 MKNOD 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fout
E 00000000-0000-0000-0000-000000000000 RENAME 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fout 852ee298-dfb0-423e-8827-01438fcd0af4%2FMakefile
E e61b3d3c-e129-403c-b849-5402729ca305 CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fconfig.h
E 00000000-0000-0000-0000-000000000000 RENAME 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fconfig.h 852ee298-dfb0-423e-8827-01438fcd0af4%2Fconfig.h
E d8c0083a-e667-4e72-adca-0180ebed5af9 UNLINK 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fsubs1.awk
E 454c43e2-f271-477a-b927-3549215ac06f UNLINK 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fsubs.awk
E f9e3c190-3e17-45ab-a560-b43d9eb77e15 UNLINK 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fdefines.awk
E 8d603c98-d6ac-48eb-90f9-58612bf8b828 RMDIR 852ee298-dfb0-423e-8827-01438fcd0af4%2FconfCxOmQZ

When this changelog is processed, the object with gfid '8d603c98-d6ac-48eb-90f9-58612bf8b828' (a directory) no longer exists on the master, as it has been rmdir'ed by then (the last entry op), so its creation on the slave is skipped. This is dangerous because the operations that depend on it are affected as well: the MKNOD and RENAME in this example would also not be performed (creation of gfid fb3bff29-e864-42b1-86d3-564252d29a56 would fail because the parent does not exist, and therefore the RENAME would fail too). This results in data loss on the slave.

Version-Release number of selected component (if applicable):

How reproducible:
Mostly

Steps to Reproduce:
1.
2.
3.

Actual results:
Data loss on the slave and gsyncd going into loops.

Expected results:
Data between master and slave should be in sync and gsyncd should be resilient to such operations.

Additional info:
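To make the ordering problem above concrete, here is a minimal Python sketch (illustration only, not the gsyncd code; exists_on_master() and apply_on_slave() are hypothetical helpers) of how skipping the creation of an already-removed parent cascades into lost dependent operations:

def replay_entries(entries, exists_on_master, apply_on_slave):
    # entries: list of dicts like {'op': 'MKNOD', 'gfid': ..., 'pargfid': ...}
    skipped = set()  # gfids whose creation was skipped
    for op in entries:
        kind, gfid, pargfid = op['op'], op['gfid'], op.get('pargfid')
        if kind in ('MKDIR', 'CREATE', 'MKNOD') and not exists_on_master(gfid):
            # gfid no longer exists on the master (e.g. removed by the trailing
            # RMDIR in the changelog above), so its creation on the slave is skipped
            skipped.add(gfid)
            continue
        if pargfid in skipped:
            # every op inside the skipped directory is lost on the slave:
            # here, the MKNOD of 'out' and the RENAME to 'Makefile'
            print('lost on slave:', op)
            continue
        apply_on_slave(op)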
This is not fixed yet. Tested in version: glusterfs-3.4.0.17rhs-1.el6rhs.x86_64

First, the 'make' fails on the gluster mount, and even after about 19 hours the files that were created are not synced to the slave volume. The arequal checksums of master and slave are very different.

I'm going to try this once more on glusterfs-3.4.0.18rhs and then move the bug to the required state.
Tried with glusterfs-3.4.0.18rhs-1.el6rhs.x86_64

Now the 'make' succeeds, but the files are not synced to the slave even after about an hour. Moving it back to ASSIGNED.
Marking 'blocker' flag as per Rich/Sayan's Blocker assessment spreadsheet.
MSV, does restarting geo-replication session fix the issue?
(In reply to Amar Tumballi from comment #5)
> MSV, does restarting geo-replication session fix the issue?

This is probably another case like the one I've mentioned in Comment #1. I'll look into it.
The following patch reduces the failure rate, but the bug is still valid. Pasting the patch here (and not sending it out for review) as:

1. It still does not _fully_ fix the issue
2. Not sure how this would impact other things

To fully fix this we may need extra entries in the changelog (creation mode etc.).

diff --git a/geo-replication/syncdaemon/resource.py b/geo-replication/syncdaemon/resource.py
index 8bd3939..76f20b9 100644
--- a/geo-replication/syncdaemon/resource.py
+++ b/geo-replication/syncdaemon/resource.py
@@ -455,6 +455,7 @@ class Server(object):
     @classmethod
     def entry_ops(cls, entries):
         pfx = gauxpfx()
+        unsafe_unlink = gconf.unsafe_unlink
         logging.debug('entries: %s' % repr(entries))
         # regular file
         def entry_pack_reg(gf, bn, st):
@@ -485,11 +486,12 @@ class Server(object):
             # to be purged is the GFID gotten from the changelog.
             # (a stat(changelog_gfid) would also be valid here)
             # The race here is between the GFID check and the purge.
-            disk_gfid = cls.gfid(entry)
-            if isinstance(disk_gfid, int):
-                return
-            if not gfid == disk_gfid:
-                return
+            if not unsafe_unlink:
+                disk_gfid = cls.gfid(entry)
+                if isinstance(disk_gfid, int):
+                    return
+                if not gfid == disk_gfid:
+                    return
             er = errno_wrap(os.unlink, [entry], [ENOENT, EISDIR])
             if isinstance(er, int):
                 if er == EISDIR:
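For context, here is a minimal sketch of the check-then-purge race the inline comment above refers to (illustration only; gfid_of() is a hypothetical stand-in, not the real gsyncd helper). Between reading the on-disk GFID and issuing the unlink, the entry can be replaced by a newer file with a different GFID, so the purge may remove the wrong object; the unsafe_unlink path in the patch trades that check away to reduce spurious skips:

import os

def purge(entry, changelog_gfid, gfid_of, unsafe=False):
    if not unsafe:
        disk_gfid = gfid_of(entry)       # (1) read the GFID currently on disk
        if disk_gfid != changelog_gfid:
            return                       # entry has been reused; do not purge
    # (2) ...another client may recreate 'entry' right here...
    try:
        os.unlink(entry)                 # (3) may now remove the newer file
    except OSError:
        pass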
https://code.engineering.redhat.com/gerrit/#/c/11750 && https://code.engineering.redhat.com/gerrit/#/c/11751/
There may still be a few more corner cases here. The basic tests we did work.
This is still failing. This time 'make' itself succeeded on the mountpoint (it was actually failing during my first test of this), but the files are not synced. I keep getting the following in a loop:

[2013-08-26 18:42:05.36058] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/270fb4e9-bddc-47fc-b182-4b4c96e3fbad [errcode: 23]
[2013-08-26 18:42:05.36920] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f289e4e7-ce78-4f20-b2c6-4611406d3a93 [errcode: 23]
[2013-08-26 18:42:05.37681] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/78162039-fba7-48a5-be66-f441fbc8d260 [errcode: 23]
[2013-08-26 18:42:05.38522] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a776ff1f-a09e-4de8-abdd-ea93ce842df8 [errcode: 23]
[2013-08-26 18:42:05.39215] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/2f8febc0-9227-4a77-86d4-8229647500fb [errcode: 23]
[2013-08-26 18:42:05.39918] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/56f2a543-268f-4c9c-b84d-2840ec3da1c0 [errcode: 23]
[2013-08-26 18:42:05.40774] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/47b98e52-3a10-4cb9-be91-d0af8a8df980 [errcode: 23]
[2013-08-26 18:42:05.41575] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/34532ebf-d45c-4cca-a862-2d945d9a463f [errcode: 23]
[2013-08-26 18:42:05.42299] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f50e981a-5168-4e44-86ae-514905099c74 [errcode: 23]
[2013-08-26 18:42:05.43118] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f9cc6b44-e8f9-4a5c-bfce-bbcac505aaf6 [errcode: 23]
[2013-08-26 18:42:05.43943] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9fcd01fd-d2e1-4d89-a6db-58630e1b0dc6 [errcode: 23]
[2013-08-26 18:42:05.44749] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/8abafb51-2e83-4a58-aed8-44ee1cdf16af [errcode: 23]
[2013-08-26 18:42:05.45574] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/39eee537-2d74-4506-aa3e-1f409f64afed [errcode: 23]
[2013-08-26 18:42:05.45943] W [master(/rhs/bricks/brick0):748:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.35.90%3Agluster%3A%2F%2F127.0.0.1%3Aslave/59ddf777397e52a13ba1333653d63854/.processing/CHANGELOG.1377518885
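rsync exit code 23 means a partial transfer. If it helps while triaging, here is a rough helper (assumption: the geo-rep log format matches the lines above) to tally which gfids keep failing across retries of the same changelog; a gfid that shows up on every retry points to a persistent failure rather than a transient one:

import re
import sys
from collections import Counter

PAT = re.compile(r'Rsync: \.gfid/([0-9a-f-]+) \[errcode: 23\]')

def failing_gfids(log_path):
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            m = PAT.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

if __name__ == '__main__':
    for gfid, n in failing_gfids(sys.argv[1]).most_common(20):
        print(n, gfid)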
Looking at comment #10, it seems very similar to 1001066 and 1001224 (fix @ https://code.engineering.redhat.com/gerrit/#/c/12027/). Can we try after the fix gets into a release?

-Amar
Yes, it might be the case that 1001224 is blocking this test case. I realized this after updating the bug: I initially did this 'make' test, and after updating the BZ I realized the same thing happens even for a simple untar.

Please move it back to ON_QA after the above patch is taken in and I will verify it again.
The above patch is already part of glusterfs-3.4.0.24rhs.
I tried with glusterfs-3.4.0.24rhs-1. The files are still not synced properly from master to slave.

[root@archimedes ~]# find /mnt/master/glusterfs-3.4.0.23rhs | wc -l
2152
[root@archimedes ~]# find /mnt/slave/glusterfs-3.4.0.23rhs | wc -l
2254
[root@archimedes ~]# ls /mnt/master/glusterfs-3.4.0.23rhs/libtool
/mnt/master/glusterfs-3.4.0.23rhs/libtool
[root@archimedes ~]# ls /mnt/slave/glusterfs-3.4.0.23rhs/libtool
ls: cannot access /mnt/slave/glusterfs-3.4.0.23rhs/libtool: No such file or directory
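The counts above show the slave has both extra entries (2254 vs. 2152) and missing ones (e.g. libtool). A quick comparison script along these lines (assuming both volumes are mounted at the paths used above) can list the exact differences instead of spot-checking individual files:

import os
import sys

def relpaths(root):
    # collect every file and directory path relative to the mount root
    out = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            out.add(os.path.relpath(os.path.join(dirpath, name), root))
    return out

if __name__ == '__main__':
    master, slave = sys.argv[1], sys.argv[2]   # e.g. /mnt/master/... /mnt/slave/...
    m, s = relpaths(master), relpaths(slave)
    print('missing on slave:')
    for p in sorted(m - s):
        print(' ', p)
    print('extra on slave:')
    for p in sorted(s - m):
        print(' ', p)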
Marking it not a blocker for GA as per Big Bend Readout on 27th Aug, 2013. Should be fixed before Update1.
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.