Bug 1292463

Summary: [geo-rep]: ChangelogException: [Errno 22] Invalid argument observed upon rebooting the ACTIVE master node
Product: [Community] GlusterFS
Reporter: Kotresh HR <khiremat>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: mainline
CC: bugs, chrisw, csaba, nlevinki, rhinduja, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1291988
Environment:
Last Closed: 2016-06-16 13:51:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1291988
Bug Blocks: 1293595

Description Kotresh HR 2015-12-17 14:21:21 UTC
+++ This bug was initially created as a clone of Bug #1291988 +++

Description of problem:
=======================

In a geo-rep setup, if an ACTIVE master node is rebooted, the following ChangelogException is observed:

[2015-12-15 12:30:39.725385] I [resource(/rhgs/brick1/master):1433:service_loop] GLUSTER: Register time: 1450162839
[2015-12-15 12:30:39.741044] I [master(/rhgs/brick1/master):549:crawlwrap] _GMaster: primary master with volume id e560607b-131b-4e35-b6b7-4e9b1e369224 ...
[2015-12-15 12:30:39.744363] I [master(/rhgs/brick1/master):558:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-12-15 12:30:39.748029] I [master(/rhgs/brick1/master):1173:crawl] _GMaster: starting history crawl... turns: 1, stime: (1450086953, 0), etime: 1450162839
[2015-12-15 12:30:39.748600] E [repce(agent):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-12-15 12:30:39.749222] E [repce(/rhgs/brick1/master):207:__call__] RepceClient: call 13048:139759777986368:1450162839.75 (history) failed on peer with ChangelogException
[2015-12-15 12:30:39.749366] E [resource(/rhgs/brick1/master):1448:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory
[2015-12-15 12:30:39.749612] I [syncdutils(/rhgs/brick1/master):220:finalize] <top>: exiting.
[2015-12-15 12:30:39.753611] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-12-15 12:30:39.753813] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-12-15 12:30:40.705105] I [monitor(monitor):229:monitor] Monitor: ------------------------------------------------------------
[2015-12-15 12:30:40.705337] I [monitor(monitor):230:monitor] Monitor: starting gsyncd worker
[2015-12-15 12:30:40.717370] I [monitor(monitor):290:monitor] Monitor: worker(/rhgs/brick1/master) died in startup phase
[2015-12-15 12:30:40.780814] I [gsyncd(/rhgs/brick2/master):652:main_i] <top>: syncing: gluster://localhost:Master_vol -> ssh://root@10.70.35.196:gluster://localhost:Slave_vol
[2015-12-15 12:30:40.781518] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-12-15 12:30:43.313604] I [master(/rhgs/brick2/master):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-12-15 12:30:43.314054] I [master(/rhgs/brick2/master):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-12-15 12:30:43.314963] I [master(/rhgs/brick2/master):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-12-15 12:30:43.315217] I [master(/rhgs/brick2/master):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-12-15 12:30:43.316255] I [master(/rhgs/brick2/master):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-12-15 12:30:43.316457] I [master(/rhgs/brick2/master):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-12-15 12:30:45.333230] I [master(/rhgs/brick2/master):1259:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/Master_vol/ssh%3A%2F%2Froot%4010.70.35.196%3Agluster%3A%2F%2F127.0.0.1%3ASlave_vol/78ce16e9d2384148a6abf384c52b8ab1/xsync
[2015-12-15 12:30:45.333684] I [resource(/rhgs/brick2/master):1433:service_loop] GLUSTER: Register time: 1450162845
[2015-12-15 12:30:45.346733] I [master(/rhgs/brick2/master):549:crawlwrap] _GMaster: primary master with volume id e560607b-131b-4e35-b6b7-4e9b1e369224 ...
[2015-12-15 12:30:45.349627] I [master(/rhgs/brick2/master):558:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-12-15 12:30:45.353549] I [master(/rhgs/brick2/master):1173:crawl] _GMaster: starting history crawl... turns: 1, stime: (1450086953, 0), etime: 1450162845
[2015-12-15 12:30:46.354907] E [repce(agent):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 57, in history_scan
    return Changes.cl_history_scan()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 89, in cl_history_scan
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 22] Invalid argument
[2015-12-15 12:30:46.355663] E [repce(/rhgs/brick2/master):207:__call__] RepceClient: call 13117:140461545350976:1450162845.35 (history_scan) failed on peer with ChangelogException
[2015-12-15 12:30:46.355832] E [resource(/rhgs/brick2/master):1448:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 22] Invalid argument
[2015-12-15 12:30:46.356125] I [syncdutils(/rhgs/brick2/master):220:finalize] <top>: exiting.
[2015-12-15 12:30:46.359954] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-12-15 12:30:46.360156] I [syncdutils(agent):220:finalize] <top>: exiting.


At the same time, the following errors were logged in changes.log:
==================================================================


[2015-12-15 07:02:21.780179] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f24942e06c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:02:21.780396] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f24942e06c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:02:21.780516] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f24942e06c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:02:21.781449] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f249429dc12] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:02:21.782716] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f249429dc12] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]


ve_vol/78ce16e9d2384148a6abf384c52b8ab1/.history/.processing/CHANGELOG.1450086954 to processed directory
[2015-12-15 07:29:57.229711] I [gf-changelog.c:542:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /rhgs/brick2/master [notify filter: 1]
[2015-12-15 07:29:57.230785] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f49899686c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f49899085df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f498993a811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:29:57.230955] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f49899686c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f49899085df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f498993a811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:29:57.232027] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f4989925c12] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f49899085df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f498993a811] ) 0-mem-pool: invalid argument [Invalid argument]
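
These "invalid argument" lines come from mem_get0() in libglusterfs: the backtraces show _gf_msg() asking mem_get0() for a log buffer from its context's logbuf_pool, and the master xlator that libgfchangelog creates internally never allocated that pool, so mem_get0() is handed a NULL pointer and fails with EINVAL. That EINVAL is what eventually surfaces as "ChangelogException: [Errno 22] Invalid argument" in the geo-rep worker. A paraphrased sketch of the guard (illustrative, not the verbatim mem-pool.c source) is below:

#include <string.h>
#include <errno.h>
#include "mem-pool.h"
#include "logging.h"
#include "libglusterfs-messages.h"

/* Paraphrased guard (not verbatim glusterfs source): mem_get0() rejects a
 * NULL pool and logs "invalid argument", which is the message repeated in
 * the changes.log excerpt above. */
void *
mem_get0 (struct mem_pool *mem_pool)
{
        void *ptr = NULL;

        if (!mem_pool) {
                /* shows up as "0-mem-pool: invalid argument [Invalid argument]" */
                gf_msg_callingfn ("mem-pool", GF_LOG_ERROR, EINVAL,
                                  LG_MSG_INVALID_ARG, "invalid argument");
                return NULL;
        }

        /* normal path: hand out a zeroed object from the pool */
        ptr = mem_get (mem_pool);
        if (ptr)
                memset (ptr, 0, mem_pool->padded_sizeof_type);

        return ptr;
}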


Version-Release number of selected component (if applicable):
=============================================================



How reproducible:
=================

The Traceback was observed only on one setup and only once, while the [Invalid argument] error is reproducible with every reboot.


Steps to Reproduce:
===================
1. Create 2-node master and slave clusters
2. Create and start a 2x2 master volume
3. Create and start a 2x2 slave volume
4. Set up geo-rep between the master and slave volumes
5. Reboot the ACTIVE master node

Actual results:
===============

Once the node is back up, the worker goes faulty for some time with the traceback and [Invalid argument] errors.


Expected results:
=================

While the node is down, the bricks on the peer node should become ACTIVE. Once the node comes back, its bricks should become PASSIVE without going faulty or reporting errors.

Comment 1 Vijay Bellur 2015-12-17 14:26:16 UTC
REVIEW: http://review.gluster.org/12997 (libgfchangelog: Allocate logbuf_pool in master xlator) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 2 Vijay Bellur 2015-12-22 09:33:17 UTC
COMMIT: http://review.gluster.org/12997 committed in master by Venky Shankar (vshankar) 
------
commit d73f5c9eae0f5c8796598fa1930844c15c2c37d4
Author: Kotresh HR <khiremat>
Date:   Tue Dec 15 19:55:05 2015 +0530

    libgfchangelog: Allocate logbuf_pool in master xlator
    
    The master xlator needs to allocate 'logbuf_pool'
    else 'gf_msg' fails with EINVAL.
    
    Change-Id: I6b2d3450250de7e77126d12b75b0dbc4db414bfb
    BUG: 1292463
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/12997
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
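
The fix itself is small. As a minimal sketch of what the commit amounts to (illustrative only, not the verbatim patch from review 12997; the helper name and pool size here are made up), the glusterfs_ctx_t that libgfchangelog builds for its internal master xlator must also get the logbuf_pool that a normal glusterfs process allocates at startup, so gf_msg() has somewhere to take log buffers from:

#include "glusterfs.h"
#include "logging.h"
#include "mem-pool.h"

/* Illustrative sketch: allocate the log-buffer pool on the context used by
 * the libgfchangelog master xlator. Without it, gf_msg() -> mem_get0(NULL)
 * fails with EINVAL (see the changes.log excerpt in the description). The
 * helper name and the element count (256) are illustrative, not taken from
 * the actual patch. */
static int
changelog_init_logbuf_pool (glusterfs_ctx_t *ctx)
{
        ctx->logbuf_pool = mem_pool_new (log_buf_t, 256);
        if (!ctx->logbuf_pool)
                return -1;

        return 0;
}

With that pool in place, gf_msg() in the changelog library no longer fails with EINVAL, so the history crawl after a reboot does not abort with [Errno 22] Invalid argument.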

Comment 3 Niels de Vos 2016-06-16 13:51:17 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user