Description of problem:
=======================
When the active bricks were killed, while making sure that the corresponding replica bricks were online, the geo-rep session went Faulty for all the bricks in that subvolume.

The geo-rep log reports exceptions such as:

[2015-06-29 20:49:29.336266] I [master(/rhs/brick2/b2):1123:crawl] _GMaster: starting history crawl... turns: 1, stime: (1435581529, 0)
[2015-06-29 20:49:29.337650] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-06-29 20:49:29.339186] E [repce(/rhs/brick2/b2):207:__call__] RepceClient: call 22632:140213611554560:1435591169.34 (history) failed on peer with ChangelogException
[2015-06-29 20:49:29.339458] E [resource(/rhs/brick2/b2):1447:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory

With use of the meta volume, the other brick should acquire the lock and become ACTIVE; instead the whole subvolume becomes Faulty.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.1-5.el6rhs.x86_64

How reproducible:
=================
Tried once; will update the result again on the next try.

Steps Carried:
==============
1. Create master and slave clusters
2. Create and start the master volume (4x2) from four nodes (node1..node4)
3. Create and start the slave volume (2x2)
4. Create the meta volume (1x3) (node1..node3)
5. Create a geo-rep session between the master and slave volumes
6. Set the config use_meta_volume to true
7. Start the geo-rep session
8. Mount the volume over FUSE
9. Start creating data from the FUSE client
10. While data creation is in progress, kill a few active bricks (kill -9 <pid>), making sure that the corresponding replica brick is UP
11. Check the geo-rep status and log (a sketch of steps 10-11 follows below)
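For steps 10 and 11, here is a minimal Python sketch of the kill-and-check flow. It is an illustration only: the volume name "master" is an assumption, the brick path is taken from the log above, and it assumes gluster's "--xml" status output carries per-brick <path> and <pid> elements under <node>.

#!/usr/bin/env python
# Hedged sketch of steps 10-11; not the exact commands used in the test run.
import subprocess
import xml.etree.ElementTree as ET

MASTER_VOL = "master"               # assumption: master volume name
BRICK_TO_KILL = "/rhs/brick2/b2"    # brick path taken from the log above

def brick_pids(volume):
    """Map brick path -> brick process pid from 'gluster volume status --xml'."""
    xml_out = subprocess.check_output(
        ["gluster", "volume", "status", volume, "--xml"])
    pids = {}
    for node in ET.fromstring(xml_out).iter("node"):
        path, pid = node.findtext("path"), node.findtext("pid")
        if path and pid and pid.isdigit():
            pids[path] = int(pid)
    return pids

# Step 10: kill the brick process; its replica pair must stay online.
pid = brick_pids(MASTER_VOL).get(BRICK_TO_KILL)
if pid:
    subprocess.call(["kill", "-9", str(pid)])

# Step 11: check the geo-rep session status; the worker logs live under
# /var/log/glusterfs/geo-replication/ on each master node.
subprocess.call(["gluster", "volume", "geo-replication", MASTER_VOL, "status"])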
Not a valid setup, hence the issue. As per the admin doc, the Gluster cluster nodes should be kept in NTP synchronization, maintaining the same time. This setup is not:

[root@georep3 htime]# date
Wed Jul 1 20:05:08 IST 2015

[root@georep4 htime]# date
Wed Jul 1 22:41:52 IST 2015

From the above, georep4 is about 2.5 hours ahead of georep3. When the bricks on georep3 whose geo-rep workers were ACTIVE were killed, the worker on georep4 did become ACTIVE, but the history crawl failed on georep4. This is because the start time is the max within a replica pair and the min across distribute subvolumes, which here is georep3's time, so history obviously fails.

Please retest with all nodes set to the same time.
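To make that aggregation rule concrete, here is a small, hedged Python illustration. The function and numbers below are assumptions for illustration only, not the syncdaemon's code; the point is that the max-within-replica / min-across-distribute reduction hands every worker a start time stamped by the node whose clock lags behind.

# Illustration only (an assumed reading of the rule above).
def history_start_time(subvol_stimes):
    """subvol_stimes: one list of per-brick stimes per distribute subvolume."""
    return min(max(replica) for replica in subvol_stimes)

SKEW = 2 * 3600 + 37 * 60        # georep4 runs ~2h37m ahead of georep3
stamped_by_georep3 = 1435581529  # stime value seen in the log in the description
stamped_by_georep4 = stamped_by_georep3 + SKEW

subvols = [
    [stamped_by_georep4, stamped_by_georep4],  # replica pair stamped by georep4's clock
    [stamped_by_georep3, stamped_by_georep3],  # replica pair stamped by georep3's clock
]
print(history_start_time(subvols))  # 1435581529 -> georep3's (lagging) time wins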
Even when NTP is configured and the systems are in sync and in the same timezone, killing the active bricks makes the passive brick Faulty too, with the history crawl failing:

[2015-07-01 15:31:06.146286] I [master(/rhs/brick1/b1):1123:crawl] _GMaster: starting history crawl... turns: 1, stime: (1435744752, 0)
[2015-07-01 15:31:06.147336] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-07-01 15:31:06.149779]

Since it succeeds on retry, removing the blocker flag and regression keyword. Keeping the bug open to root-cause the history failure.
I got the reason for the first-time failure. The register time is the end time we pass to the history API. Since the PASSIVE worker registers much earlier, along with the ACTIVE worker, and the start time it passes is the stime, we get register time < stime. For the history API that means start time > end time, which obviously fails. When it registers a second time, register time > stime and hence it passes.

There are no side effects with respect to DATA sync; it is just the worker going down and coming back. We will fix this, but it is definitely not a BLOCKER.
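The failing comparison can be shown with a tiny, self-contained sketch. This is hedged: the guard below is a stand-in for whatever validation libgfchangelog's cl_history_changelog performs, and the register times are hypothetical values placed around the stime seen in the retest log above.

# Illustration only; not the real libgfchangelog implementation.
class ChangelogHistoryError(Exception):
    pass

def history_crawl(start, end):
    """A history crawl needs a forward range: start (stime) <= end (register time)."""
    if start > end:
        raise ChangelogHistoryError("start %d is newer than end %d" % (start, end))
    return (start, end)

stime = 1435744752          # stime from the retest log above
register_time = stime - 60  # hypothetical: PASSIVE worker registered before the stime

try:
    history_crawl(stime, register_time)   # first attempt: end < start -> fails
except ChangelogHistoryError:
    pass

register_time = stime + 60  # hypothetical: second registration lands after the stime
history_crawl(stime, register_time)       # now start <= end, so the crawl proceeds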
Upstream Patch (master): http://review.gluster.org/11524
Upstream Patch (3.7): http://review.gluster.org/#/c/11784/
Merged in upstream (master) and upstream (3.7). Hence moving it to POST.
Downstream Patch: https://code.engineering.redhat.com/gerrit/55051
Verified with build: glusterfs-3.7.1-13.el7rhgs.x86_64

When an active brick goes down, the corresponding passive brick becomes ACTIVE. Only the bricks which went offline are shown as Faulty, which is expected. Didn't see "[Errno 2] No such file or directory" for passive bricks. Marking this bug as verified.

[root@georep2 scripts]# grep -i "ChangelogException" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
[root@georep3 scripts]# grep -i "No such file or directory" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
[root@georep4 scripts]# grep -i "No such file or directory" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
[root@georep4 scripts]#
[root@georep4 scripts]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1845.html