Bug 1236546
Summary: | [geo-rep]: killing brick from replica pair makes geo-rep session faulty with Traceback "ChangelogException" | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> | |
Component: | geo-replication | Assignee: | Kotresh HR <khiremat> | |
Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> | |
Severity: | urgent | Docs Contact: | ||
Priority: | unspecified | |||
Version: | rhgs-3.1 | CC: | annair, asrivast, avishwan, chrisw, csaba, divya, nlevinki, nsathyan, rcyriac | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | RHGS 3.1.1 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | glusterfs-3.7.1-12 | Doc Type: | Bug Fix | |
Doc Text: | Previously, both ACTIVE and PASSIVE geo-replication workers registered with the changelog at almost the same time. When a PASSIVE worker became ACTIVE, the start and end times passed to the history API would be the current stime and the register time respectively, so the register time would be less than the stime and the history API would fail. As a consequence, a passive worker that became active died on its first attempt. With this fix, a passive worker that becomes active no longer dies on the first attempt. |
Story Points: | --- | |
Clone Of: | ||||
: | 1239044 (view as bug list) | Environment: | ||
Last Closed: | 2015-10-05 07:15:35 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1236554, 1239044, 1247882, 1251815 |
Description
Rahul Hinduja
2015-06-29 12:01:43 UTC
Not a valid setup; hence the issue. As per the admin doc, the gluster cluster nodes should be in NTP synchronization, maintaining the same time. This setup is not:

    [root@georep3 htime]# date
    Wed Jul 1 20:05:08 IST 2015
    [root@georep4 htime]# date
    Wed Jul 1 22:41:52 IST 2015

From the above, georep4 is ~2 hrs ahead of georep3. When the bricks on georep3, where the geo-rep workers were ACTIVE, were killed, the worker on georep4 did become ACTIVE, but history failed on georep4. This is because the start time is the max of the replica and the min of the distribute, which would be georep3's time, so history will obviously fail. Please retest with all nodes set to the same time.

Even when NTP is configured and the systems are in sync and in the same timezone, killing the active bricks makes the passive brick faulty too, with the history crawl failing:

    [2015-07-01 15:31:06.146286] I [master(/rhs/brick1/b1):1123:crawl] _GMaster: starting history crawl... turns: 1, stime: (1435744752, 0)
    [2015-07-01 15:31:06.147336] E [repce(agent):117:worker] <top>: call failed:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
        res = getattr(self.obj, rmeth)(*in_data[2:])
      File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
        num_parallel)
      File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
        cls.raise_changelog_err()
      File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
        raise ChangelogException(errn, os.strerror(errn))
    ChangelogException: [Errno 2] No such file or directory
    [2015-07-01 15:31:06.149779]

Since it succeeds on retry, removing the blocker flag and regression keyword. Keeping the bug open to root-cause the history failure.

Found the reason for the first-time failure. The register time is the end time we pass to the history API. Since the PASSIVE worker registers much earlier, along with the ACTIVE worker, and the start time it passes is the stime, we get register time < stime. For the history API that means start time > end time, which obviously fails. When it registers the second time, register time > stime, and hence it passes. There are no side effects with respect to DATA sync; it is just the worker going down and coming back. We will fix this, but it is definitely not a BLOCKER. (See the illustrative sketches after the comments below.)

Upstream Patch (Master): http://review.gluster.org/11524
Upstream Patch (3.7): http://review.gluster.org/#/c/11784/

Merged in upstream (master) and upstream (3.7). Hence moving it to POST.

Downstream Patch: https://code.engineering.redhat.com/gerrit/55051

Verified with build: glusterfs-3.7.1-13.el7rhgs.x86_64. When the active brick goes down, the corresponding passive brick becomes active. Only the bricks which went offline are shown as Faulty, which is expected. Didn't see "[Errno 2] No such file or directory" for passive bricks. Marking this bug as verified.
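The root-cause comment above describes the history API being handed the brick's stime as the start time and the worker's changelog register time as the end time, and failing on the worker's first turn because that window is inverted. The following is a minimal illustrative sketch of that check, assuming hypothetical names (`consume_history`, `ChangelogHistoryError`) that are not taken from the actual libgfchangelog API:

```python
# Hypothetical sketch of the history-crawl window check described in the
# root-cause comment. Names and values here are illustrative only.

class ChangelogHistoryError(Exception):
    """Raised when the requested history window is invalid."""

def consume_history(stime, register_time):
    # The history crawl scans changelogs from the start time (stime) to the
    # end time (the worker's register time). If the passive worker registered
    # before the current stime was recorded, the window is inverted
    # (end < start) and the call fails, as on the first turn in this bug.
    if register_time < stime:
        raise ChangelogHistoryError(
            "end time %d is earlier than start time %d" % (register_time, stime))
    return list(range(stime, register_time + 1))  # stand-in for changelog segments

# First turn after promotion: the worker registered (t=100) before the
# replica's stime (t=150) -> history fails and the worker restarts.
try:
    consume_history(stime=150, register_time=100)
except ChangelogHistoryError as exc:
    print("first attempt fails:", exc)

# Second turn: the worker re-registers after stime (t=200 > 150) -> succeeds.
print(consume_history(stime=150, register_time=200)[:3])
```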
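The NTP triage comment above also states that the history start time is computed as the max within a replica set and the min across the distribute set, which is why clock skew between replica peers breaks the crawl. A minimal sketch of that aggregation rule, using a hypothetical helper `volume_stime` and made-up values rather than the geo-rep source:

```python
# Hypothetical sketch of the stime aggregation rule quoted above:
# maximum within each replica set, minimum across distribute subvolumes.

def volume_stime(replica_sets):
    # replica_sets: one list of per-brick stimes for each replica set
    per_subvol = [max(stimes) for stimes in replica_sets]  # max within a replica
    return min(per_subvol)                                  # min across distribute

# Two distribute subvolumes, each a replica pair; a ~2h clock skew between
# replica peers (as between georep3 and georep4) skews the max() picked for
# that subvolume and hence the window handed to the history crawl.
print(volume_stime([[100, 100 + 7200], [95, 96]]))  # -> 96
```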
    [root@georep2 scripts]# grep -i "ChangelogException" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
    [root@georep3 scripts]# grep -i "No such file or directory" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
    [root@georep4 scripts]# grep -i "No such file or directory" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
    [root@georep4 scripts]#
    [root@georep4 scripts]#

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html