1022582 – dist-geo-rep: Worker process crashing because of "Invalid Argument" error in slave

Bug 1022582 - dist-geo-rep: Worker process crashing because of "Invalid Argument" error in slave

Summary: dist-geo-rep: Worker process crashing because of "Invalid Argument" error in ...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	geo-replication
Sub Component:
Version:	2.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Ajeet Jha
QA Contact:	M S Vishwanath Bhat
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-10-23 14:58 UTC by M S Vishwanath Bhat
Modified:	2016-06-01 01:56 UTC (History)
CC List:	9 users (show)
Fixed In Version:	glusterfs-3.4.0.38rhs-1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-11-27 15:43:46 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Logs from master node (1.26 MB, text/x-log) 2013-10-23 14:58 UTC, M S Vishwanath Bhat	no flags	Details
Log from slave node (496.48 KB, text/x-log) 2013-10-23 14:59 UTC, M S Vishwanath Bhat	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2013:1769	0	normal	SHIPPED_LIVE	Red Hat Storage 2.1 enhancement and bug fix update #1	2013-11-27 20:17:39 UTC

Description M S Vishwanath Bhat 2013-10-23 14:58:49 UTC

Created attachment 815447 [details]
Logs from master node

Description of problem:
When I start geo-rep process and copy some files on the master, sessions in two of the nodes went into the faulty state. Worker in both of those machines are crashing because of a Errno 22 in slave.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.36rhs-1.el6rhs.x86_64


How reproducible:
Hit one time out of as many tries

Steps to Reproduce:
1. Create and start a geo-rep session between 2*2 dist-rep master and slave volumes.
2. Now cp -r /etc/ <master_mount_point>
3. Run geo-rep status of geo-rep status detail

Actual results:
# gluster v geo master hornet::slave status detail
 
                                        MASTER: master  SLAVE: hornet::slave
 
NODE                         HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING   
--------------------------------------------------------------------------------------------------------------------
spitfire.blr.redhat.com      faulty    N/A         N/A            N/A              N/A              N/A               
typhoon.blr.redhat.com       Stable    02:20:19    0              0                0Bytes           0                 
mustang.blr.redhat.com       Stable    02:20:19    0              0                0Bytes           0                 
harrier.blr.redhat.com       faulty    N/A         N/A            N/A              N/A              N/A               



Expected results:
Status should not go into the 'faulty' state.

Additional info:


Logs in master node

[2013-10-23 20:20:43.561261] I [master(/rhs/bricks/brick2):345:crawlwrap] _GMaster: crawl interval: 3 seconds
[2013-10-23 20:20:51.155394] E [repce(/rhs/bricks/brick2):188:__call__] RepceClient: call 31994:139915356534528:1382539850.02 (meta_ops) failed on peer with OSError
[2013-10-23 20:20:51.156507] E [syncdutils(/rhs/bricks/brick2):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 530, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1074, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 369, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 799, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 760, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 740, in process_change
    self.slave.server.meta_ops(meta_entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
OSError: [Errno 22] Invalid argument: '.gfid/1da8224d-aa34-433b-8abf-b07a13e5cfd2'
[2013-10-23 20:20:51.159435] I [syncdutils(/rhs/bricks/brick2):159:finalize] <top>: exiting.
[2013-10-23 20:21:01.268341] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-10-23 20:21:01.269103] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-10-23 20:21:01.531034] I [gsyncd(/rhs/bricks/brick2):520:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@hornet:gluster://localhost:slave
[2013-10-23 20:21:04.483725] I [master(/rhs/bricks/brick2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-10-23 20:21:04.487170] I [master(/rhs/bricks/brick2):57:gmaster_builder] <top>: setting up changelog change detection mode
[2013-10-23 20:21:04.490206] I [master(/rhs/bricks/brick2):835:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.194%3Agluster%3A%2F%2F127.0.0.1%3Aslave/68fa5cc90f61530aea097cdc78c2b376/xsync
[2013-10-23 20:21:04.659682] I [master(/rhs/bricks/brick2):335:crawlwrap] _GMaster: primary master with volume id c2ba3dc1-58f2-4dad-93af-d08c249923d2 ...



Logs from slave node


[2013-10-23 20:21:33.999643] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
[2013-10-23 20:21:34.35] I [syncdutils(slave):159:finalize] <top>: exiting.
[2013-10-23 20:21:45.735359] I [gsyncd(slave):520:main_i] <top>: syncing: gluster://localhost:slave
[2013-10-23 20:21:46.827519] I [resource(slave):642:service_loop] GLUSTER: slave listening
[2013-10-23 20:21:50.195013] E [repce(slave):103:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 99, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 545, in meta_ops
    errno_wrap(os.chmod, [go, mode], [ENOENT])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 383, in errno_wrap
    return call(*arg)
OSError: [Errno 22] Invalid argument: '.gfid/0e3d546c-ca85-4913-8e9f-e3ed822fcf46'

Comment 1 M S Vishwanath Bhat 2013-10-23 14:59:24 UTC

Created attachment 815448 [details]
Log from slave node

Comment 3 M S Vishwanath Bhat 2013-10-24 09:27:55 UTC

It's a regression. I just tried with 35rhs build. But it's working there. Only 36rhs has this issue.

Comment 6 Amar Tumballi 2013-10-25 11:50:20 UTC

https://code.engineering.redhat.com/gerrit/#/c/14551/

Comment 7 M S Vishwanath Bhat 2013-11-02 11:24:35 UTC

Issue is not being reproduced with the glusterfs-3.4.0.38rhs-1.el6rhs.x86_64 build. I followed the same steps I mentioned earlier in the bug description and issue is not hit. Moving to verified.

Comment 9 errata-xmlrpc 2013-11-27 15:43:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html

Note You need to log in before you can comment on or make changes to this bug.