1291988 – [geo-rep]: ChangelogException: [Errno 22] Invalid argument observed upon rebooting the ACTIVE master node

Bug 1291988 - [geo-rep]: ChangelogException: [Errno 22] Invalid argument observed upon rebooting the ACTIVE master node

Summary: [geo-rep]: ChangelogException: [Errno 22] Invalid argument observed upon rebo...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	geo-replication
Sub Component:
Version:	rhgs-3.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.1.3
Assignee:	Kotresh HR
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1292463 1293595 1299184
TreeView+	depends on / blocked

Reported:	2015-12-16 06:52 UTC by Rahul Hinduja
Modified:	2016-06-23 04:59 UTC (History)
CC List:	6 users (show)
Fixed In Version:	glusterfs-3.7.9-1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1292463 (view as bug list)
Environment:
Last Closed:	2016-06-23 04:59:16 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:1240	0	normal	SHIPPED_LIVE	Red Hat Gluster Storage 3.1 Update 3	2016-06-23 08:51:28 UTC

Description Rahul Hinduja 2015-12-16 06:52:48 UTC

Description of problem:
=======================

In a geo-rep setup if a ACTIVE master is rebooted observed ChangelogException as:

[2015-12-15 12:30:39.725385] I [resource(/rhgs/brick1/master):1433:service_loop] GLUSTER: Register time: 1450162839
[2015-12-15 12:30:39.741044] I [master(/rhgs/brick1/master):549:crawlwrap] _GMaster: primary master with volume id e560607b-131b-4e35-b6b7-4e9b1e369224 ...
[2015-12-15 12:30:39.744363] I [master(/rhgs/brick1/master):558:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-12-15 12:30:39.748029] I [master(/rhgs/brick1/master):1173:crawl] _GMaster: starting history crawl... turns: 1, stime: (1450086953, 0), etime: 1450162839
[2015-12-15 12:30:39.748600] E [repce(agent):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-12-15 12:30:39.749222] E [repce(/rhgs/brick1/master):207:__call__] RepceClient: call 13048:139759777986368:1450162839.75 (history) failed on peer with ChangelogException
[2015-12-15 12:30:39.749366] E [resource(/rhgs/brick1/master):1448:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory
[2015-12-15 12:30:39.749612] I [syncdutils(/rhgs/brick1/master):220:finalize] <top>: exiting.
[2015-12-15 12:30:39.753611] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-12-15 12:30:39.753813] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-12-15 12:30:40.705105] I [monitor(monitor):229:monitor] Monitor: ------------------------------------------------------------
[2015-12-15 12:30:40.705337] I [monitor(monitor):230:monitor] Monitor: starting gsyncd worker
[2015-12-15 12:30:40.717370] I [monitor(monitor):290:monitor] Monitor: worker(/rhgs/brick1/master) died in startup phase
[2015-12-15 12:30:40.780814] I [gsyncd(/rhgs/brick2/master):652:main_i] <top>: syncing: gluster://localhost:Master_vol -> ssh://root.35.196:gluster://localhost:Slave_vol
[2015-12-15 12:30:40.781518] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-12-15 12:30:43.313604] I [master(/rhgs/brick2/master):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-12-15 12:30:43.314054] I [master(/rhgs/brick2/master):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-12-15 12:30:43.314963] I [master(/rhgs/brick2/master):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-12-15 12:30:43.315217] I [master(/rhgs/brick2/master):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-12-15 12:30:43.316255] I [master(/rhgs/brick2/master):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-12-15 12:30:43.316457] I [master(/rhgs/brick2/master):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-12-15 12:30:45.333230] I [master(/rhgs/brick2/master):1259:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/Master_vol/ssh%3A%2F%2Froot%4010.70.35.196%3Agluster%3A%2F%2F127.0.0.1%3ASla
ve_vol/78ce16e9d2384148a6abf384c52b8ab1/xsync
[2015-12-15 12:30:45.333684] I [resource(/rhgs/brick2/master):1433:service_loop] GLUSTER: Register time: 1450162845
[2015-12-15 12:30:45.346733] I [master(/rhgs/brick2/master):549:crawlwrap] _GMaster: primary master with volume id e560607b-131b-4e35-b6b7-4e9b1e369224 ...
[2015-12-15 12:30:45.349627] I [master(/rhgs/brick2/master):558:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-12-15 12:30:45.353549] I [master(/rhgs/brick2/master):1173:crawl] _GMaster: starting history crawl... turns: 1, stime: (1450086953, 0), etime: 1450162845
[2015-12-15 12:30:46.354907] E [repce(agent):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 57, in history_scan
    return Changes.cl_history_scan()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 89, in cl_history_scan
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 22] Invalid argument
[2015-12-15 12:30:46.355663] E [repce(/rhgs/brick2/master):207:__call__] RepceClient: call 13117:140461545350976:1450162845.35 (history_scan) failed on peer with ChangelogException
[2015-12-15 12:30:46.355832] E [resource(/rhgs/brick2/master):1448:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 22] Invalid argument
[2015-12-15 12:30:46.356125] I [syncdutils(/rhgs/brick2/master):220:finalize] <top>: exiting.
[2015-12-15 12:30:46.359954] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-12-15 12:30:46.360156] I [syncdutils(agent):220:finalize] <top>: exiting.


At the same time the errors logged in the changes.log
=====================================================


[2015-12-15 07:02:21.780179] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f24942e06c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs.so.0(mem_get
0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:02:21.780396] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f24942e06c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs.so.0(mem_get
0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:02:21.780516] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f24942e06c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs.so.0(mem_get
0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:02:21.781449] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f249429dc12] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs
.so.0(mem_get0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:02:21.782716] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f249429dc12] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f24942805df] -->/lib64/libglusterfs
.so.0(mem_get0+0x81) [0x7f24942b2811] ) 0-mem-pool: invalid argument [Invalid argument]


ve_vol/78ce16e9d2384148a6abf384c52b8ab1/.history/.processing/CHANGELOG.1450086954 to processed directory
[2015-12-15 07:29:57.229711] I [gf-changelog.c:542:gf_changelog_register_generic] 0-gfchangelog: Registering brick: /rhgs/brick2/master [notify filter: 1]
[2015-12-15 07:29:57.230785] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f49899686c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f49899085df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f498993a811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:29:57.230955] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(+0x876c2) [0x7f49899686c2] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f49899085df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f498993a811] ) 0-mem-pool: invalid argument [Invalid argument]
[2015-12-15 07:29:57.232027] E [mem-pool.c:417:mem_get0] (-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f4989925c12] -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f49899085df] -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f498993a811] ) 0-mem-pool: invalid argument [Invalid argument]


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.5-11.el7rhgs.x86_64


How reproducible:
=================

Observed Traceback only on one setup and only once while able to reproduce invalid argument error with every reboot


Steps to Reproduce:
===================
1. Create 2 node master and slave clusters
2. Create and start 2x2 master volume
3. Create and start 2x2 slave volume
4. Setup geo-rep between master and slave volume
5. Reboot master ACTIVE node

Actual results:
===============

Once the node is up the Worker goes faulty for sometime with tracebak and errors [Invalid argument]


Expected results:
=================

Once the node is down, the bricks on peer node should be ACTIVE. Once the node comes back, the bricks should be PASSIVE without going to faulty or reporting error.

Comment 2 Kotresh HR 2016-01-13 07:20:58 UTC

Merged in upstream:

Master branch:
http://review.gluster.org/12997

3.7 branch:
http://review.gluster.org/13064

Hence moving it to POST

Comment 4 Aravinda VK 2016-03-23 06:24:42 UTC

Patch for this bug is available in rhgs-3.1.3 branch as part of rebase from upstream release-3.7.9.

Comment 6 Rahul Hinduja 2016-04-18 16:56:52 UTC

Verified with build: glusterfs-3.7.9-1

Rebooted all ACTIVE nodes and as expected, passive bricks became active. No changelogException for [Errno 22] Invalid argument observed. 

Moving this bug to verified state.

Comment 8 errata-xmlrpc 2016-06-23 04:59:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

Note You need to log in before you can comment on or make changes to this bug.