Bug 987082 - dist-geo-rep: running "make" fails to sync files to the slave
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Unspecified Unspecified
Priority: high, Severity: medium
: ---
: ---
Assigned To: Bug Updates Notification Mailing List
Rahul Hinduja
: ZStream
Depends On:
Blocks: 1285209 1003800
Reported: 2013-07-22 12:37 EDT by Venky Shankar
Modified: 2015-11-25 03:52 EST

See Also:
Fixed In Version: glusterfs-
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1003800 1285209
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Venky Shankar 2013-07-22 12:37:42 EDT
Description of problem:

Geo-replication fails to sync data (it gets stuck on one changelog) when compiling GlusterFS (or anything else that deals with Makefiles). This is caused by the following sequence of operations, which is typical of a 'make' workload:

E 8d603c98-d6ac-48eb-90f9-58612bf8b828 MKDIR 852ee298-dfb0-423e-8827-01438fcd0af4%2FconfCxOmQZ
M 8d603c98-d6ac-48eb-90f9-58612bf8b828
E d8c0083a-e667-4e72-adca-0180ebed5af9 CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fsubs1.awk
E 454c43e2-f271-477a-b927-3549215ac06f CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fsubs.awk
E f9e3c190-3e17-45ab-a560-b43d9eb77e15 CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fdefines.awk
E fb3bff29-e864-42b1-86d3-564252d29a56 MKNOD 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fout
E 00000000-0000-0000-0000-000000000000 RENAME 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fout 852ee298-dfb0-423e-8827-01438fcd0af4%2FMakefile
E e61b3d3c-e129-403c-b849-5402729ca305 CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fconfig.h
E 00000000-0000-0000-0000-000000000000 RENAME 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fconfig.h 852ee298-dfb0-423e-8827-01438fcd0af4%2Fconfig.h
E d8c0083a-e667-4e72-adca-0180ebed5af9 UNLINK 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fsubs1.awk
E 454c43e2-f271-477a-b927-3549215ac06f UNLINK 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fsubs.awk
E f9e3c190-3e17-45ab-a560-b43d9eb77e15 UNLINK 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fdefines.awk
E 8d603c98-d6ac-48eb-90f9-58612bf8b828 RMDIR 852ee298-dfb0-423e-8827-01438fcd0af4%2FconfCxOmQZ

When this changelog is processed, the object with gfid '8d603c98-d6ac-48eb-90f9-58612bf8b828' (a directory) no longer exists, because it has already been rmdir'ed (the last entry op), so its creation is skipped. This is dangerous because the operations that depend on it are affected: the MKNOD and RENAME in this example would not be performed either (creation of gfid fb3bff29-e864-42b1-86d3-564252d29a56 would fail because the parent does not exist, and the RENAME would therefore fail too). This would result in data loss on the slave.
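
The dependency described above can be illustrated with a small standalone sketch. It is not gsyncd code: the helper names (parse_entry_ops, transient_dirs, escaping_renames) are hypothetical, and the record layout is inferred only from the entries quoted above (type, gfid, op, then one or two URL-encoded "parent-gfid/basename" fields). It finds directories that are both created and removed within one changelog, and the renames that move a file out of such a directory, which are the entries lost when the MKDIR is skipped.

```python
from urllib.parse import unquote

# Abridged copy of the changelog entries quoted above.
CHANGELOG = """\
E 8d603c98-d6ac-48eb-90f9-58612bf8b828 MKDIR 852ee298-dfb0-423e-8827-01438fcd0af4%2FconfCxOmQZ
M 8d603c98-d6ac-48eb-90f9-58612bf8b828
E fb3bff29-e864-42b1-86d3-564252d29a56 MKNOD 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fout
E 00000000-0000-0000-0000-000000000000 RENAME 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fout 852ee298-dfb0-423e-8827-01438fcd0af4%2FMakefile
E e61b3d3c-e129-403c-b849-5402729ca305 CREATE 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fconfig.h
E 00000000-0000-0000-0000-000000000000 RENAME 8d603c98-d6ac-48eb-90f9-58612bf8b828%2Fconfig.h 852ee298-dfb0-423e-8827-01438fcd0af4%2Fconfig.h
E 8d603c98-d6ac-48eb-90f9-58612bf8b828 RMDIR 852ee298-dfb0-423e-8827-01438fcd0af4%2FconfCxOmQZ
"""


def parse_entry_ops(text):
    """Return (gfid, op, [(parent_gfid, basename), ...]) for each 'E' record."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "E":
            continue  # skip metadata ('M') records and blank lines
        gfid, op = fields[1], fields[2]
        # Each path field is "<parent gfid>%2F<basename>"; decode and split it.
        paths = [tuple(unquote(p).split("/", 1)) for p in fields[3:]]
        records.append((gfid, op, paths))
    return records


def transient_dirs(records):
    """GFIDs of directories that are both created and removed in this changelog."""
    made = {gfid for gfid, op, _ in records if op == "MKDIR"}
    removed = {gfid for gfid, op, _ in records if op == "RMDIR"}
    return made & removed


def escaping_renames(records, dirs):
    """Basenames renamed out of a transient directory into a surviving one."""
    names = []
    for _, op, paths in records:
        if op != "RENAME":
            continue
        (src_parent, _), (dst_parent, dst_name) = paths
        if src_parent in dirs and dst_parent not in dirs:
            names.append(dst_name)
    return names


records = parse_entry_ops(CHANGELOG)
dirs = transient_dirs(records)
lost = escaping_renames(records, dirs)
print("transient directories:", sorted(dirs))
print("files at risk on the slave:", lost)
```

In this sketch the files reported as at risk are the ones that outlive the temporary directory on the master, so skipping the directory's creation on the slave loses them even though the directory itself is gone by the end of the changelog.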

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Data loss on the slave and gsyncd going into loops.

Expected results:
Data between master and slave should be in sync, and gsyncd should be resilient to such operations.

Additional info:
Comment 2 M S Vishwanath Bhat 2013-08-08 04:45:29 EDT
This is not fixed yet.

Tested in Version: glusterfs-

First, the make fails on the gluster mount. Then, even after about 19 hours, the files that were created are not synced to the slave volume. The arequal checksums of master and slave are very different.

I'm going to try this once more on glusterfs- and then move the bug to the required state.
Comment 3 M S Vishwanath Bhat 2013-08-08 06:20:02 EDT
Tried with glusterfs-

Now the make succeeds, but the files are not synced to the slave even after about an hour. Moving it back to ASSIGNED.
Comment 4 Amar Tumballi 2013-08-13 04:45:35 EDT
Marking 'blocker' flag as per Rich/Sayan's Blocker assessment spreadsheet.
Comment 5 Amar Tumballi 2013-08-14 08:50:37 EDT
MSV, does restarting geo-replication session fix the issue?
Comment 6 Venky Shankar 2013-08-14 12:09:32 EDT
(In reply to Amar Tumballi from comment #5)
> MSV, does restarting geo-replication session fix the issue?

This is probably another case like I've mentioned in Comment #1. I'll look into it.
Comment 7 Venky Shankar 2013-08-20 08:57:52 EDT
The following patch reduces the failure rates but the bug is still valid.

Pasting the patch here (and not sending it out for review) as:

1. It still does not _fully_ fix the issue
2. Not sure how this would impact other things

To fully fix this we may need extra entries in the changelog (creation mode, etc.)

diff --git a/geo-replication/syncdaemon/resource.py b/geo-replication/syncdaemon/resource.py
index 8bd3939..76f20b9 100644
--- a/geo-replication/syncdaemon/resource.py
+++ b/geo-replication/syncdaemon/resource.py
@@ -455,6 +455,7 @@ class Server(object):
     def entry_ops(cls, entries):
         pfx = gauxpfx()
+        unsafe_unlink = gconf.unsafe_unlink
         logging.debug('entries: %s' % repr(entries))
         # regular file
         def entry_pack_reg(gf, bn, st):
@@ -485,11 +486,12 @@ class Server(object):
             # to be purged is the GFID gotten from the changelog.
             # (a stat(changelog_gfid) would also be valid here)
             # The race here is between the GFID check and the purge.
-            disk_gfid = cls.gfid(entry)
-            if isinstance(disk_gfid, int):
-                return
-            if not gfid == disk_gfid:
-                return
+            if not unsafe_unlink:
+                disk_gfid = cls.gfid(entry)
+                if isinstance(disk_gfid, int):
+                    return
+                if not gfid == disk_gfid:
+                    return
             er = errno_wrap(os.unlink, [entry], [ENOENT, EISDIR])
             if isinstance(er, int):
                 if er == EISDIR:
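
The effect of the flag in the patch above can be sketched in isolation. This is a simplified illustration, not the gsyncd implementation: purge_entry and the lookup_gfid callable are hypothetical stand-ins for the purge helper and cls.gfid(), and the EISDIR handling from the real code is left out (any error other than ENOENT is simply re-raised). With the check enabled, an entry whose on-disk GFID differs from the changelog GFID is left alone; with unsafe_unlink set, it is removed regardless.

```python
import errno
import os
import tempfile


def purge_entry(entry, changelog_gfid, lookup_gfid, unsafe_unlink=False):
    """Unlink `entry`; return True if it was removed.

    lookup_gfid(entry) is a hypothetical callable that returns the GFID found
    on disk, or an int errno value when the lookup fails.
    """
    if not unsafe_unlink:
        disk_gfid = lookup_gfid(entry)
        if isinstance(disk_gfid, int):
            return False  # lookup failed, nothing to purge
        if disk_gfid != changelog_gfid:
            return False  # the name now refers to a different object
    try:
        os.unlink(entry)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return False  # already gone
        raise
    return True


# Demo: the on-disk GFID does not match the one recorded in the changelog.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "subs1.awk")
    open(path, "w").close()

    def stale_lookup(entry):
        return "gfid-on-disk"

    skipped = purge_entry(path, "gfid-from-changelog", stale_lookup)
    still_there = os.path.exists(path)
    forced = purge_entry(path, "gfid-from-changelog", stale_lookup,
                         unsafe_unlink=True)
    gone = not os.path.exists(path)

print(skipped, still_there, forced, gone)
```

Skipping the check removes the window between the GFID comparison and the unlink, but it also removes the protection against deleting a different object that has since taken the same name, which may be part of why the impact on other things is uncertain.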
Comment 9 Amar Tumballi 2013-08-22 08:39:24 EDT
There may still be a few more corner cases here. The basic tests we ran work.
Comment 10 M S Vishwanath Bhat 2013-08-26 14:43:16 EDT
This is still failing. This time 'make' itself succeeded on the mountpoint (it was failing during my first test), but the files are not synced.

I keep getting the following in a loop:

[2013-08-26 18:42:05.36058] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/270fb4e9-bddc-47fc-b182-4b4c96e3fbad [errcode: 23]
[2013-08-26 18:42:05.36920] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f289e4e7-ce78-4f20-b2c6-4611406d3a93 [errcode: 23]
[2013-08-26 18:42:05.37681] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/78162039-fba7-48a5-be66-f441fbc8d260 [errcode: 23]
[2013-08-26 18:42:05.38522] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a776ff1f-a09e-4de8-abdd-ea93ce842df8 [errcode: 23]
[2013-08-26 18:42:05.39215] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/2f8febc0-9227-4a77-86d4-8229647500fb [errcode: 23]
[2013-08-26 18:42:05.39918] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/56f2a543-268f-4c9c-b84d-2840ec3da1c0 [errcode: 23]
[2013-08-26 18:42:05.40774] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/47b98e52-3a10-4cb9-be91-d0af8a8df980 [errcode: 23]
[2013-08-26 18:42:05.41575] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/34532ebf-d45c-4cca-a862-2d945d9a463f [errcode: 23]
[2013-08-26 18:42:05.42299] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f50e981a-5168-4e44-86ae-514905099c74 [errcode: 23]
[2013-08-26 18:42:05.43118] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f9cc6b44-e8f9-4a5c-bfce-bbcac505aaf6 [errcode: 23]
[2013-08-26 18:42:05.43943] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9fcd01fd-d2e1-4d89-a6db-58630e1b0dc6 [errcode: 23]
[2013-08-26 18:42:05.44749] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/8abafb51-2e83-4a58-aed8-44ee1cdf16af [errcode: 23]
[2013-08-26 18:42:05.45574] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/39eee537-2d74-4506-aa3e-1f409f64afed [errcode: 23]
[2013-08-26 18:42:05.45943] W [master(/rhs/bricks/brick0):748:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.35.90%3Agluster%3A%2F%2F127.0.0.1%3Aslave/59ddf777397e52a13ba1333653d63854/.processing/CHANGELOG.1377518885
Comment 11 Amar Tumballi 2013-08-27 06:40:27 EDT
Looking at comment #10, this seems very similar to 1001066 and 1001224 (fix @ https://code.engineering.redhat.com/gerrit/#/c/12027/). Can we try again after the fix gets into a release?

Comment 12 M S Vishwanath Bhat 2013-08-27 06:48:44 EDT
Yes, it may be that 1001224 is blocking this test case. I realized this only after updating the bug: I initially ran this 'make' test, and after updating the BZ I realized that the same happens even for a simple untar. Please move it back to ON_QA after the above patch is taken in, and I will verify it again.
Comment 13 Amar Tumballi 2013-08-27 12:46:13 EDT
The above patch is already part of glusterfs-
Comment 14 M S Vishwanath Bhat 2013-08-28 05:19:19 EDT
I tried with glusterfs-

The files are still not synced properly from master to slave.

[root@archimedes ~]# find /mnt/master/glusterfs- | wc -l

[root@archimedes ~]# find /mnt/slave/glusterfs- | wc -l

[root@archimedes ~]# ls /mnt/master/glusterfs-

[root@archimedes ~]# ls /mnt/slave/glusterfs-
ls: cannot access /mnt/slave/glusterfs- No such file or directory
Comment 15 Amar Tumballi 2013-08-28 09:40:30 EDT
Marking it not a blocker for GA as per Big Bend Readout on 27th Aug, 2013. Should be fixed before Update1.
Comment 16 Aravinda VK 2015-11-25 03:51:32 EST
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
Comment 17 Aravinda VK 2015-11-25 03:52:21 EST
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.