Bug 1236546 - [geo-rep]: killing brick from replica pair makes geo-rep session faulty with Traceback "ChangelogException"
Summary: [geo-rep]: killing brick from replica pair makes geo-rep session faulty with Traceback "ChangelogException"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.1
Assignee: Kotresh HR
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1236554 1239044 1247882 1251815
 
Reported: 2015-06-29 12:01 UTC by Rahul Hinduja
Modified: 2015-10-05 07:15 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.7.1-12
Doc Type: Bug Fix
Doc Text:
Previously, both ACTIVE and PASSIVE geo-replication workers registered with changelog at almost the same time. When a PASSIVE worker became ACTIVE, the start and end times passed to the history API were the current stime and the register_time respectively. The register_time could therefore be less than the stime, in which case the history API failed. As a consequence, a passive worker that became active died on its first attempt. With this fix, a passive worker that becomes active no longer dies on its first attempt.
Clone Of:
Clones: 1239044
Environment:
Last Closed: 2015-10-05 07:15:35 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHSA-2015:1845
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Gluster Storage 3.1 update
Last Updated: 2015-10-05 11:06:22 UTC

Description Rahul Hinduja 2015-06-29 12:01:43 UTC
Description of problem:
=======================

When the active bricks were killed while making sure that the corresponding replica bricks were online, the geo-rep session went Faulty for all the bricks in that subvolume. The geo-rep log reported the following exceptions:

[2015-06-29 20:49:29.336266] I [master(/rhs/brick2/b2):1123:crawl] _GMaster: starting history crawl... turns: 1, stime: (1435581529, 0)
[2015-06-29 20:49:29.337650] E [repce(agent):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-06-29 20:49:29.339186] E [repce(/rhs/brick2/b2):207:__call__] RepceClient: call 22632:140213611554560:1435591169.34 (history) failed on peer with ChangelogException
[2015-06-29 20:49:29.339458] E [resource(/rhs/brick2/b2):1447:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory

With use_meta_volume enabled, the other brick should acquire the lock and become ACTIVE; instead, the whole subvolume becomes Faulty.
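
For context, here is a minimal sketch (not the actual syncdaemon code) of the ACTIVE/PASSIVE election idea behind use_meta_volume: each worker of a replica set tries to take an exclusive lock on a per-subvolume file in the shared meta volume, and whichever worker holds the lock runs ACTIVE while the others stay PASSIVE. The lock-file path below is hypothetical.

import errno
import fcntl
import os

LOCK_FILE = "/var/run/gluster/shared_storage/geo-rep/subvol-0.lock"  # hypothetical path

def try_become_active(path=LOCK_FILE):
    """Return an open fd if this worker won the lock (ACTIVE), else None (PASSIVE)."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd            # lock held: this worker is ACTIVE
    except (IOError, OSError) as e:
        if e.errno in (errno.EAGAIN, errno.EACCES):
            os.close(fd)
            return None      # another replica worker holds the lock: stay PASSIVE
        raise

If the ACTIVE worker's brick process dies, its fd is closed and the lock is released, so the PASSIVE worker's next attempt should succeed and it should take over; the behaviour observed above is that the subvolume instead goes Faulty.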


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.1-5.el6rhs.x86_64


How reproducible:
=================

Tried once; will update the result after the next attempt.


Steps Carried:
==============

1. Create Master and Slave Cluster
2. Create and Start Master volume (4x2) from four nodes (node1..node4)
3. Create and Start slave volume (2x2)
4. Create Meta volume (1x3) (node1..node3)
5. Create geo-rep session between master and slave volume
6. Set the config use_meta_volume to true
7. Start the geo-rep session
8. Mount the volume on Fuse
9. Start creating data from fuse client
10. While data creation is in progress, kill a few active bricks (kill -9 <pid>), making sure that the corresponding replica brick stays UP (see the sketch after this list).
11. Check the geo-rep status and log.
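
A rough helper for step 10, as a sketch only: it sends SIGKILL to a brick PID but refuses to act unless the caller has already confirmed (for example from `gluster volume status`) that the replica pair brick is online. The replica_is_online flag is supplied by the caller; it is not a real gluster API.

import os
import signal

def kill_brick(brick_pid, replica_is_online):
    """Equivalent of `kill -9 <pid>`, guarded so only one brick of a replica pair goes down."""
    if not replica_is_online:
        raise RuntimeError("refusing to kill brick: replica pair brick is not online")
    os.kill(brick_pid, signal.SIGKILL)

# Example (made-up PID): kill_brick(22632, replica_is_online=True)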

Comment 5 Kotresh HR 2015-07-01 09:22:15 UTC
Not a Valid Setup. Hence the issue.

As per the admin doc, the gluster cluster nodes should be kept in NTP synchronization so that they maintain the same time. This setup is not in sync.

[root@georep3 htime]# date
Wed Jul  1 20:05:08 IST 2015
[root@georep4 htime]# date
Wed Jul  1 22:41:52 IST 2015

From the above, georep4 is ~2 hours ahead of georep3.

When the bricks on georep3, where the geo-rep workers were ACTIVE, were killed,
the worker on georep4 did become ACTIVE, but the history API failed on georep4.
This is because the start time is the maximum within a replica set and the minimum across the distribute set, which here would be georep3's time, so the history call will obviously fail.

Please retest with all nodes set to the same time.
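
To illustrate the aggregation rule stated above (an illustration only, not the geo-rep source): the effective start time is the maximum of the per-brick stimes within each replica set and the minimum of those values across the distribute subvolumes. With a ~2 hour clock skew between georep3 and georep4, the resulting start time can fall outside the range covered by the newly ACTIVE brick's changelogs, and the history API fails.

def aggregate_stime(per_subvol_replica_stimes):
    """per_subvol_replica_stimes: list (one entry per distribute subvolume)
    of lists of per-brick stimes (epoch seconds) within that replica set."""
    per_subvol = [max(replica_stimes) for replica_stimes in per_subvol_replica_stimes]
    return min(per_subvol)

# Example with two 2-way replicated subvolumes (made-up values):
# max([100, 105]) = 105, max([102, 101]) = 102, min(105, 102) = 102
print(aggregate_stime([[100, 105], [102, 101]]))   # -> 102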

Comment 6 Rahul Hinduja 2015-07-01 11:25:54 UTC
Even with NTP configured, the systems in sync, and the same timezone, killing the active bricks still makes the passive brick Faulty as well, with the history crawl failing.

[2015-07-01 15:31:06.146286] I [master(/rhs/brick1/b1):1123:crawl] _GMaster: starting history crawl... turns: 1, stime: (1435744752, 0)
[2015-07-01 15:31:06.147336] E [repce(agent):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-07-01 15:31:06.149779] 

Since it succeeds on retry, removing the blocker flag and regression keyword. Keeping the bug open to root-cause the history failure.

Comment 7 Kotresh HR 2015-07-01 12:27:24 UTC
I found the reason for the first-time failure. The register time is the end time we pass to the history API. The PASSIVE worker registers much earlier, along with the ACTIVE worker, and the start time it passes is the stime, i.e., register time < stime.

For the history API this means start time > end time, which obviously fails.

When it registers for the second time, register time > stime, and hence it passes.

There are no side effects with respect to DATA sync; it is just the worker going down and coming back up. We will fix this, but it is definitely not a BLOCKER.
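
A simplified sketch of the condition described above (names assumed for illustration; this is not the syncdaemon implementation): the history crawl asks for the window [stime, register_time]. Because a PASSIVE worker registered long before it turned ACTIVE, on its first turn register_time < stime and the request is rejected; after re-registering, register_time > stime and the crawl proceeds.

class HistoryWindowError(Exception):
    pass

def history_window(stime, register_time):
    """Return the (start, end) window the history crawl would request."""
    if register_time < stime:
        # First attempt after a PASSIVE worker turns ACTIVE (pre-fix behaviour):
        # the worker dies here and only succeeds on the next registration.
        raise HistoryWindowError("end time %d < start time %d" % (register_time, stime))
    return (stime, register_time)

# history_window(1435744752, 1435740000)  -> raises HistoryWindowError
# history_window(1435744752, 1435744900)  -> (1435744752, 1435744900)

Per the Doc Text, the fix makes the newly ACTIVE worker survive this first attempt; the exact approach is in the upstream patches linked in Comment 9.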

Comment 9 Kotresh HR 2015-08-06 06:03:48 UTC
Upstream Patch (Master):
http://review.gluster.org/11524

Upstream Patch (3.7):
http://review.gluster.org/#/c/11784/

Comment 12 Kotresh HR 2015-08-11 09:10:09 UTC
Merged in upstream (master) and upstream (3.7). Hence moving it to POST.

Comment 13 Kotresh HR 2015-08-13 07:27:42 UTC
Downstream Patch:

 https://code.engineering.redhat.com/gerrit/55051

Comment 14 Rahul Hinduja 2015-08-31 10:22:34 UTC
Verified with build: glusterfs-3.7.1-13.el7rhgs.x86_64

When an active brick goes down, the corresponding passive brick becomes active. Only the bricks that went offline are shown as Faulty, which is expected. Did not see "[Errno 2] No such file or directory" for passive bricks.

Marking this bug as verified. 

[root@georep2 scripts]# grep -i "ChangelogException" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*


[root@georep3 scripts]# grep -i "No such file or directory" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*


[root@georep4 scripts]# grep -i "No such file or directory" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.167%3Agluster%3A%2F%2F127.0.0.1%3Aslave.*
[root@georep4 scripts]# 
[root@georep4 scripts]#

Comment 17 errata-xmlrpc 2015-10-05 07:15:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html

