Description of problem:
With a dist-rep master volume in a geo-rep session, the status of one of the replica pairs intermittently goes to faulty, after which that gsyncd worker sits idle and no syncing happens through it. Excerpt from the idle gsyncd log file:

[2013-07-01 18:24:03.26717] D [master(/bricks/brick2):757:volinfo_state_machine] <top>: (None, f92305f7) << (None, f92305f7) -> (None, f92305f7)
[2013-07-01 18:24:06.538655] E [syncdutils(/bricks/brick2):189:log_raise_exception] <top>: connection to peer is broken
[2013-07-01 18:24:06.541389] E [syncdutils(/bricks/brick2):206:log_raise_exception] <top>: FULL EXCEPTION TRACE:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 232, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 157, in listen
    rid, exc, res = recv(self.inf)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 48, in recv
    return pickle.load(inf)
EOFError
[2013-07-01 18:24:06.543339] I [syncdutils(/bricks/brick2):158:finalize] <top>: exiting.
[2013-07-01 18:24:06.551038] I [monitor(monitor):81:set_state] Monitor: new state: faulty

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.12rhs.beta1-1.el6rhs.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Create and start a geo-rep session with a dist-rep master and slave.
2. Create a lot of data on the master, e.g. untar a kernel source tree.
3. Check the status of the geo-rep session.

Actual results:
The geo-rep status of one of the replica pairs sometimes goes to faulty.

Expected results:
The status should remain stable.

Additional info:
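The EOFError in the trace above comes out of recv() in repce.py, which reads pickled RPC messages from the peer over a stream; when the peer dies or closes its end mid-session, pickle.load() hits end-of-stream and raises EOFError, which syncdutils logs as "connection to peer is broken" before the monitor marks the worker faulty. A minimal standalone sketch of that failure mode (plain Python, not the glusterfs sources; the pipe here just stands in for the worker's connection to its peer):

    # Demonstrates why pickle.load() raises EOFError when the peer
    # closes the connection before a complete pickled object arrives.
    import os
    import pickle

    rfd, wfd = os.pipe()
    os.close(wfd)                    # peer goes away without writing anything
    inf = os.fdopen(rfd, 'rb')
    try:
        pickle.load(inf)             # same call as recv() in repce.py
    except EOFError:
        print("connection to peer is broken")   # what syncdutils logs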
I have a fix for this. Will send out the patch soon.
*** Bug 980734 has been marked as a duplicate of this bug. ***
Verified on glusterfs-3.4.0.15rhs-1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html