Bug 1022582

Summary: dist-geo-rep: Worker process crashing because of "Invalid Argument" error in slave
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: M S Vishwanath Bhat <vbhat>
Component: geo-replication
Assignee: Ajeet Jha <ajha>
Status: CLOSED ERRATA
QA Contact: M S Vishwanath Bhat <vbhat>
Severity: urgent
Priority: high
Version: 2.1
CC: aavati, ajha, amarts, csaba, grajaiya, mzywusko, nsathyan, vagarwal, vshankar
Keywords: ZStream
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.4.0.38rhs-1
Doc Type: Bug Fix
Last Closed: 2013-11-27 15:43:46 UTC
Type: Bug
Attachments:
  Logs from master node
  Log from slave node

Description M S Vishwanath Bhat 2013-10-23 14:58:49 UTC
Created attachment 815447 [details]
Logs from master node

Description of problem:
When I started the geo-rep session and copied some files onto the master, the sessions on two of the nodes went into the faulty state. The workers on both of those machines are crashing because of an Errno 22 (EINVAL) raised on the slave.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.36rhs-1.el6rhs.x86_64


How reproducible:
Hit once out of many tries.

Steps to Reproduce:
1. Create and start a geo-rep session between 2*2 dist-rep master and slave volumes.
2. Now cp -r /etc/ <master_mount_point>
3. Run geo-rep status or geo-rep status detail

Actual results:
# gluster v geo master hornet::slave status detail
 
                                        MASTER: master  SLAVE: hornet::slave
 
NODE                         HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING   
--------------------------------------------------------------------------------------------------------------------
spitfire.blr.redhat.com      faulty    N/A         N/A            N/A              N/A              N/A               
typhoon.blr.redhat.com       Stable    02:20:19    0              0                0Bytes           0                 
mustang.blr.redhat.com       Stable    02:20:19    0              0                0Bytes           0                 
harrier.blr.redhat.com       faulty    N/A         N/A            N/A              N/A              N/A               



Expected results:
Status should not go into the 'faulty' state.

Additional info:


Logs in master node

[2013-10-23 20:20:43.561261] I [master(/rhs/bricks/brick2):345:crawlwrap] _GMaster: crawl interval: 3 seconds
[2013-10-23 20:20:51.155394] E [repce(/rhs/bricks/brick2):188:__call__] RepceClient: call 31994:139915356534528:1382539850.02 (meta_ops) failed on peer with OSError
[2013-10-23 20:20:51.156507] E [syncdutils(/rhs/bricks/brick2):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 530, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1074, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 369, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 799, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 760, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 740, in process_change
    self.slave.server.meta_ops(meta_entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
OSError: [Errno 22] Invalid argument: '.gfid/1da8224d-aa34-433b-8abf-b07a13e5cfd2'
[2013-10-23 20:20:51.159435] I [syncdutils(/rhs/bricks/brick2):159:finalize] <top>: exiting.
[2013-10-23 20:21:01.268341] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-10-23 20:21:01.269103] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-10-23 20:21:01.531034] I [gsyncd(/rhs/bricks/brick2):520:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@hornet:gluster://localhost:slave
[2013-10-23 20:21:04.483725] I [master(/rhs/bricks/brick2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-10-23 20:21:04.487170] I [master(/rhs/bricks/brick2):57:gmaster_builder] <top>: setting up changelog change detection mode
[2013-10-23 20:21:04.490206] I [master(/rhs/bricks/brick2):835:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.194%3Agluster%3A%2F%2F127.0.0.1%3Aslave/68fa5cc90f61530aea097cdc78c2b376/xsync
[2013-10-23 20:21:04.659682] I [master(/rhs/bricks/brick2):335:crawlwrap] _GMaster: primary master with volume id c2ba3dc1-58f2-4dad-93af-d08c249923d2 ...
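For context on how the slave-side OSError surfaces in the master's worker: repce ships the remote exception back over the RPC channel and re-raises it locally (the `raise res` frame in repce.py in the traceback above). A minimal sketch of that pattern, simplified and not the actual gsyncd wire protocol (`server_dispatch`, `client_call`, and `chmod_on_gfid` are illustrative names):

```python
import errno
import pickle

def server_dispatch(func, *args):
    """Slave side: run the requested op; on failure, ship the
    exception object back instead of a result."""
    try:
        return ('ok', pickle.dumps(func(*args)))
    except OSError as e:
        return ('err', pickle.dumps(e))

def client_call(func, *args):
    """Master side: unpickle the payload and re-raise a remote
    failure locally, mirroring repce.py's `raise res`."""
    status, payload = server_dispatch(func, *args)
    res = pickle.loads(payload)
    if status == 'err':
        raise res
    return res

def chmod_on_gfid(path, mode):
    # Stand-in for the failing os.chmod on the aux-gfid path.
    raise OSError(errno.EINVAL, 'Invalid argument', path)

caught = None
try:
    client_call(chmod_on_gfid, '.gfid/<uuid>', 0o644)
except OSError as e:
    caught = e  # the master's worker sees the remote EINVAL
```

This is why the master log shows the failure as "failed on peer with OSError": the exception originates on the slave but is re-raised inside the master's worker, which then exits.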



Logs from slave node


[2013-10-23 20:21:33.999643] I [repce(slave):78:service_loop] RepceServer: terminating on reaching EOF.
[2013-10-23 20:21:34.35] I [syncdutils(slave):159:finalize] <top>: exiting.
[2013-10-23 20:21:45.735359] I [gsyncd(slave):520:main_i] <top>: syncing: gluster://localhost:slave
[2013-10-23 20:21:46.827519] I [resource(slave):642:service_loop] GLUSTER: slave listening
[2013-10-23 20:21:50.195013] E [repce(slave):103:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 99, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 545, in meta_ops
    errno_wrap(os.chmod, [go, mode], [ENOENT])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 383, in errno_wrap
    return call(*arg)
OSError: [Errno 22] Invalid argument: '.gfid/0e3d546c-ca85-4913-8e9f-e3ed822fcf46'
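The slave-side failure escapes because errno_wrap only tolerates the errnos it is passed (here just ENOENT, per the `errno_wrap(os.chmod, [go, mode], [ENOENT])` frame above); EINVAL is not whitelisted, so the OSError propagates back to the master and kills the worker. A minimal sketch of that whitelist behaviour, simplified and not the actual syncdutils code:

```python
import errno

def errno_wrap(call, arg=[], tolerated=[]):
    """Invoke call(*arg); swallow only whitelisted errnos,
    re-raise everything else (simplified errno_wrap sketch)."""
    try:
        return call(*arg)
    except OSError as e:
        if e.errno in tolerated:
            return  # tolerated errno: treated as a no-op
        raise

def chmod_missing(path, mode):
    raise OSError(errno.ENOENT, 'No such file or directory', path)

def chmod_invalid(path, mode):
    raise OSError(errno.EINVAL, 'Invalid argument', path)

# ENOENT is whitelisted, so it is silently tolerated:
tolerated_result = errno_wrap(chmod_missing,
                              ['.gfid/<uuid>', 0o644], [errno.ENOENT])

# EINVAL is not, so it escapes -- this is the crash seen above:
escaped = None
try:
    errno_wrap(chmod_invalid, ['.gfid/<uuid>', 0o644], [errno.ENOENT])
except OSError as e:
    escaped = e
```

Under this reading, a chmod on the virtual `.gfid/<uuid>` path returning EINVAL is enough to fault the whole session, since nothing between errno_wrap and the worker's top level handles it.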

Comment 1 M S Vishwanath Bhat 2013-10-23 14:59:24 UTC
Created attachment 815448 [details]
Log from slave node

Comment 3 M S Vishwanath Bhat 2013-10-24 09:27:55 UTC
It's a regression. I tried with the 35rhs build and it works there; only 36rhs has this issue.

Comment 7 M S Vishwanath Bhat 2013-11-02 11:24:35 UTC
The issue is not reproducible with the glusterfs-3.4.0.38rhs-1.el6rhs.x86_64 build. I followed the same steps mentioned earlier in the bug description and the issue is not hit. Moving to verified.

Comment 9 errata-xmlrpc 2013-11-27 15:43:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html