Bug 1236546
Summary: | [geo-rep]: killing brick from replica pair makes geo-rep session faulty with Traceback "ChangelogException" | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> | |
Component: | geo-replication | Assignee: | Kotresh HR <khiremat> | |
Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> | |
Severity: | urgent | Docs Contact: | ||
Priority: | unspecified | |||
Version: | rhgs-3.1 | CC: | annair, asrivast, avishwan, chrisw, csaba, divya, nlevinki, nsathyan, rcyriac | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | RHGS 3.1.1 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | glusterfs-3.7.1-12 | Doc Type: | Bug Fix | |
Doc Text: | Previously, both ACTIVE and PASSIVE geo-replication workers registered with the changelog at almost the same time. When a PASSIVE worker became ACTIVE, the start and end times passed to the history API would be the current stime and the register time respectively, so the register time would be less than the stime and the history API would fail. As a consequence, a passive worker that became active died on its first attempt. With this fix, a passive worker that becomes active no longer dies on the first attempt. |
Story Points: | --- | |
Clone Of: | ||||
: | 1239044 (view as bug list) | Environment: | ||
Last Closed: | 2015-10-05 07:15:35 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1236554, 1239044, 1247882, 1251815 |
Description
Rahul Hinduja
2015-06-29 12:01:43 UTC
Not a valid setup; hence the issue. As per the admin doc, the gluster cluster nodes should be in NTP synchronization, maintaining the same time. This setup is not:

    [root@georep3 htime]# date
    Wed Jul 1 20:05:08 IST 2015
    [root@georep4 htime]# date
    Wed Jul 1 22:41:52 IST 2015

From the above, georep4 is ~2 hrs ahead of georep3. When the bricks on georep3, where the geo-rep workers were ACTIVE, were killed, the worker on georep4 did become ACTIVE, but history failed on georep4. This is because the start time is the max of the replica and the min of the distribute, which would be georep3's time, so history will obviously fail. Please retest with all nodes set to the same time.

Even when NTP is configured and the systems are in sync and in the same timezone, killing the active bricks makes the passive brick faulty too, with the history crawl failing:

    [2015-07-01 15:31:06.146286] I [master(/rhs/brick1/b1):1123:crawl] _GMaster: starting history crawl... turns: 1, stime: (1435744752, 0)
    [2015-07-01 15:31:06.147336] E [repce(agent):117:worker] <top>: call failed:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
        res = getattr(self.obj, rmeth)(*in_data[2:])
      File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
        num_parallel)
      File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
        cls.raise_changelog_err()
      File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
        raise ChangelogException(errn, os.strerror(errn))
    ChangelogException: [Errno 2] No such file or directory
    [2015-07-01 15:31:06.149779]

Since it succeeds on retry, removing the blocker flag and regression keyword. Keeping the bug open to root-cause the history failure.

Found the reason for the first-time failure. The register time is the end time we pass to the history API. Since the PASSIVE worker registers much earlier, along with the ACTIVE worker, and the start time it passes is the stime, we get register time < stime. For the history API that means start time > end time, which obviously fails. When it registers the second time, register time > stime, and hence it passes. There are no side effects with respect to DATA sync; it is just the worker going down and coming back. We will fix this, but it is definitely not a BLOCKER. (See the illustrative sketches after the comments below.)

Upstream Patch (Master): http://review.gluster.org/11524
Upstream Patch (3.7): http://review.gluster.org/#/c/11784/

Merged in upstream (master) and upstream (3.7). Hence moving it to POST.

Downstream Patch: https://code.engineering.redhat.com/gerrit/55051

Verified with build: glusterfs-3.7.1-13.el7rhgs.x86_64. When the active brick goes down, the corresponding passive brick becomes active. Only the bricks which went offline are shown as Faulty, which is expected. Didn't see "[Errno 2] No such file or directory" for passive bricks. Marking this bug as verified.
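The root-cause comment above describes the history API being handed the brick's stime as the start time and the worker's changelog register time as the end time, and failing on the worker's first turn because that window is inverted. The following is a minimal illustrative sketch of that check, assuming hypothetical names (`consume_history`, `ChangelogHistoryError`) that are not taken from the actual libgfchangelog API:

```python
# Hypothetical sketch of the history-crawl window check described in the
# root-cause comment. Names and values here are illustrative only.

class ChangelogHistoryError(Exception):
    """Raised when the requested history window is invalid."""

def consume_history(stime, register_time):
    # The history crawl scans changelogs from the start time (stime) to the
    # end time (the worker's register time). If the passive worker registered
    # before the current stime was recorded, the window is inverted
    # (end < start) and the call fails, as on the first turn in this bug.
    if register_time < stime:
        raise ChangelogHistoryError(
            "end time %d is earlier than start time %d" % (register_time, stime))
    return list(range(stime, register_time + 1))  # stand-in for changelog segments

# First turn after promotion: the worker registered (t=100) before the
# replica's stime (t=150) -> history fails and the worker restarts.
try:
    consume_history(stime=150, register_time=100)
except ChangelogHistoryError as exc:
    print("first attempt fails:", exc)

# Second turn: the worker re-registers after stime (t=200 > 150) -> succeeds.
print(consume_history(stime=150, register_time=200)[:3])
```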
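The NTP triage comment above also states that the history start time is computed as the max within a replica set and the min across the distribute set, which is why clock skew between replica peers breaks the crawl. A minimal sketch of that aggregation rule, using a hypothetical helper `volume_stime` and made-up values rather than the geo-rep source:

```python
# Hypothetical sketch of the stime aggregation rule quoted above:
# maximum within each replica set, minimum across distribute subvolumes.

def volume_stime(replica_sets):
    # replica_sets: one list of per-brick stimes for each replica set
    per_subvol = [max(stimes) for stimes in replica_sets]  # max within a replica
    return min(per_subvol)                                  # min across distribute

# Two distribute subvolumes, each a replica pair; a ~2h clock skew between
# replica peers (as between georep3 and georep4) skews the max() picked for
# that subvolume and hence the window handed to the history crawl.
print(volume_stime([[100, 100 + 7200], [95, 96]]))  # -> 96
```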
    [root@georep2 scripts]# grep -i "ChangelogException" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
    [root@georep3 scripts]# grep -i "No such file or directory" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
    [root@georep4 scripts]# grep -i "No such file or directory" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
    [root@georep4 scripts]#
    [root@georep4 scripts]#

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html