Bug 987082
| Summary: | dist-geo-rep: running "make" fails to sync files to the slave | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Venky Shankar <vshankar> |
| Component: | geo-replication | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
| Status: | CLOSED EOL | QA Contact: | Rahul Hinduja <rhinduja> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.1 | CC: | avishwan, chrisw, csaba, david.macdonald, rhinduja, rhs-bugs, rwheeler, vagarwal, vbhat |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.4.0.24rhs-1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1003800 1285209 (view as bug list) | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1003800, 1285209 | | |
Description
Venky Shankar
2013-07-22 16:37:42 UTC
This is not fixed yet. Tested in version glusterfs-3.4.0.17rhs-1.el6rhs.x86_64. First the make fails on the gluster mount, and then even after about 19 hours the files created are not synced to the slave volume. The arequal checksums of master and slave are very different. I'm going to try this once more on glusterfs-3.4.0.18rhs and then move the bug to the required state.

Tried with glusterfs-3.4.0.18rhs-1.el6rhs.x86_64. Now the make succeeds, but the files are not synced to the slave even after about an hour. Moving it back to ASSIGNED.

Marking 'blocker' flag as per Rich/Sayan's Blocker assessment spreadsheet.

MSV, does restarting the geo-replication session fix the issue?

(In reply to Amar Tumballi from comment #5)
> MSV, does restarting geo-replication session fix the issue?

This is probably another case like the one I've mentioned in comment #1. I'll look into it.

The following patch reduces the failure rate, but the bug is still valid. Pasting the patch here (and not sending it out for review) because:

1. It still does not _fully_ fix the issue.
2. I'm not sure how this would impact other things.

To fully fix this we may need extra entries in the changelog (creation mode, etc.).

```diff
diff --git a/geo-replication/syncdaemon/resource.py b/geo-replication/syncdaemon/resource.py
index 8bd3939..76f20b9 100644
--- a/geo-replication/syncdaemon/resource.py
+++ b/geo-replication/syncdaemon/resource.py
@@ -455,6 +455,7 @@ class Server(object):
     @classmethod
     def entry_ops(cls, entries):
         pfx = gauxpfx()
+        unsafe_unlink = gconf.unsafe_unlink
         logging.debug('entries: %s' % repr(entries))
         # regular file
         def entry_pack_reg(gf, bn, st):
@@ -485,11 +486,12 @@ class Server(object):
                 # to be purged is the GFID gotten from the changelog.
                 # (a stat(changelog_gfid) would also be valid here)
                 # The race here is between the GFID check and the purge.
-                disk_gfid = cls.gfid(entry)
-                if isinstance(disk_gfid, int):
-                    return
-                if not gfid == disk_gfid:
-                    return
+                if not unsafe_unlink:
+                    disk_gfid = cls.gfid(entry)
+                    if isinstance(disk_gfid, int):
+                        return
+                    if not gfid == disk_gfid:
+                        return
                 er = errno_wrap(os.unlink, [entry], [ENOENT, EISDIR])
                 if isinstance(er, int):
                     if er == EISDIR:
```

https://code.engineering.redhat.com/gerrit/#/c/11750 and https://code.engineering.redhat.com/gerrit/#/c/11751/

There may still be a few more corner cases here. The basic tests we did work.

This is still failing. This time 'make' itself succeeded on the mountpoint (it was failing during my first test of this), but the files are not synced. I keep getting this in a loop:

```
[2013-08-26 18:42:05.36058] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/270fb4e9-bddc-47fc-b182-4b4c96e3fbad [errcode: 23]
[2013-08-26 18:42:05.36920] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f289e4e7-ce78-4f20-b2c6-4611406d3a93 [errcode: 23]
[2013-08-26 18:42:05.37681] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/78162039-fba7-48a5-be66-f441fbc8d260 [errcode: 23]
[2013-08-26 18:42:05.38522] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a776ff1f-a09e-4de8-abdd-ea93ce842df8 [errcode: 23]
[2013-08-26 18:42:05.39215] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/2f8febc0-9227-4a77-86d4-8229647500fb [errcode: 23]
[2013-08-26 18:42:05.39918] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/56f2a543-268f-4c9c-b84d-2840ec3da1c0 [errcode: 23]
[2013-08-26 18:42:05.40774] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/47b98e52-3a10-4cb9-be91-d0af8a8df980 [errcode: 23]
[2013-08-26 18:42:05.41575] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/34532ebf-d45c-4cca-a862-2d945d9a463f [errcode: 23]
[2013-08-26 18:42:05.42299] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f50e981a-5168-4e44-86ae-514905099c74 [errcode: 23]
[2013-08-26 18:42:05.43118] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f9cc6b44-e8f9-4a5c-bfce-bbcac505aaf6 [errcode: 23]
[2013-08-26 18:42:05.43943] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9fcd01fd-d2e1-4d89-a6db-58630e1b0dc6 [errcode: 23]
[2013-08-26 18:42:05.44749] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/8abafb51-2e83-4a58-aed8-44ee1cdf16af [errcode: 23]
[2013-08-26 18:42:05.45574] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/39eee537-2d74-4506-aa3e-1f409f64afed [errcode: 23]
[2013-08-26 18:42:05.45943] W [master(/rhs/bricks/brick0):748:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.35.90%3Agluster%3A%2F%2F127.0.0.1%3Aslave/59ddf777397e52a13ba1333653d63854/.processing/CHANGELOG.1377518885
```

Looking at comment #10, this seems very similar to 1001066 and 1001224 (fix @ https://code.engineering.redhat.com/gerrit/#/c/12027/). Can we try after the fix gets into a release? -Amar

Yes, it might be the case that 1001224 is blocking this test case. I realised this only after updating the bug: I initially did this 'make' test, and after updating the BZ I saw that the same happens even for a simple untar. Please move it back to ON_QA after the above patch is taken in, and I will verify it again.

The above patch is already part of glusterfs-3.4.0.24rhs.

I tried with glusterfs-3.4.0.24rhs-1. The files are still not synced properly from master to slave.

```
[root@archimedes ~]# find /mnt/master/glusterfs-3.4.0.23rhs | wc -l
2152
[root@archimedes ~]# find /mnt/slave/glusterfs-3.4.0.23rhs | wc -l
2254
[root@archimedes ~]# ls /mnt/master/glusterfs-3.4.0.23rhs/libtool
/mnt/master/glusterfs-3.4.0.23rhs/libtool
[root@archimedes ~]# ls /mnt/slave/glusterfs-3.4.0.23rhs/libtool
ls: cannot access /mnt/slave/glusterfs-3.4.0.23rhs/libtool: No such file or directory
```

Marking it not a blocker for GA as per Big Bend Readout on 27th Aug, 2013.
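The file-count comparison above only shows that the totals differ; to see exactly which paths failed to sync, both mounts can be walked and diffed. A minimal sketch, assuming the master and slave volumes are mounted locally; `missing_on_slave` and its arguments are illustrative names, not part of glusterfs:

```python
import os

def missing_on_slave(master_root, slave_root):
    """Return paths (relative to master_root) that exist under the
    master mount but not under the slave mount, i.e. not yet synced."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(master_root):
        rel = os.path.relpath(dirpath, master_root)
        for name in dirnames + filenames:
            rel_path = os.path.normpath(os.path.join(rel, name))
            if not os.path.lexists(os.path.join(slave_root, rel_path)):
                missing.append(rel_path)
    return sorted(missing)
```

In a session like the one above, `missing_on_slave('/mnt/master/glusterfs-3.4.0.23rhs', '/mnt/slave/glusterfs-3.4.0.23rhs')` would list `libtool` among the unsynced entries.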
Should be fixed before Update 1.

Closing this bug since the RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
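For reference, the behavior of the unreviewed patch pasted in the comments above can be sketched in isolation. This is a hypothetical standalone model, not the glusterfs code: `get_disk_gfid` stands in for the server-side `cls.gfid()` lookup, and the `errno_wrap` handling is reduced to a plain try/except:

```python
import errno
import os

def guarded_unlink(entry, changelog_gfid, get_disk_gfid, unsafe_unlink=False):
    """Unlink 'entry' only if the GFID currently on disk still matches
    the GFID recorded in the changelog; otherwise skip the purge.
    Returns True if the file was unlinked, False if the purge was skipped."""
    if not unsafe_unlink:
        disk_gfid = get_disk_gfid(entry)   # stand-in for cls.gfid(entry)
        if isinstance(disk_gfid, int):     # lookup failed: errno returned
            return False
        if changelog_gfid != disk_gfid:    # entry changed since it was logged
            return False
    try:
        os.unlink(entry)                   # the race window is here
    except OSError as e:
        if e.errno not in (errno.ENOENT, errno.EISDIR):
            raise
        return False
    return True
```

As the comment in the patch notes, the race between the GFID check and the unlink remains; the guard only narrows the window, which is why the patch reduces the failure rate without fully fixing the bug.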