Bug 1065631

Summary: dist-geo-rep: gsyncd on one of the nodes crashed with "OSError: [Errno 2] No such file or directory"
Product: [Community] GlusterFS Reporter: M S Vishwanath Bhat <vbhat>
Component: geo-replication Assignee: bugs <bugs>
Status: CLOSED EOL QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: pre-release CC: avishwan, bugs, gluster-bugs, mzywusko, vbhat, vshankar
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-22 15:40:20 UTC Type: Bug
Attachments:
Description Flags
geo-rep logs from one Active node none
gsync log from another "Active" node none

Description M S Vishwanath Bhat 2014-02-15 09:34:35 UTC
Created attachment 863569 [details]
geo-rep logs from one Active node

Description of problem:
While doing geo-replication, one of the gsyncd processes on the master crashed with an OSError. Later, a few file deletes were not synced to the slave, and I believe the crash is the reason.

Version-Release number of selected component (if applicable):
[root@mustang ~]# gluster --version
glusterfs 3.5.0beta3 built on Feb 15 2014 13:24:29
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:
Once out of one try

Steps to Reproduce:
1. Create and start geo-rep from a 2*2 distributed-replicated master volume to a 2*2 distributed-replicated slave volume.
2. Create some files on the master (for example, untar the Linux kernel source).
3. Wait for the crash to happen in one of the nodes.

Actual results:

[2014-02-15 14:01:47.166320] W [master(/rhs/bricks/brick2):877:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1392452987
[2014-02-15 14:02:56.410913] E [syncdutils(/rhs/bricks/brick2):240:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/local/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/local/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 542, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 1176, in service_loop
    g1.crawlwrap(oneshot=True)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 467, in crawlwrap
    self.crawl()
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 1137, in crawl
    self.upd_stime(item[1][1], item[1][0])
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 884, in upd_stime
    self.sendmark(path, stime)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 658, in sendmark
    self.set_slave_xtime(path, mark)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 152, in set_slave_xtime
    self.slave.server.set_stime(path, self.uuid, mark)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 1163, in <lambda>
    slave.server.set_stime = types.MethodType(lambda _self, path, uuid, mark: brickserver.set_stime(path, uuid + '.' + gconf.slave_id, mark), slave.server)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 299, in ff
    return f(*a)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 496, in set_stime
    Xattr.lsetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'stime']), struct.pack('!II', *mark))
  File "/usr/local/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 66, in lsetxattr
    cls.raise_oserr()
  File "/usr/local/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
[2014-02-15 14:02:56.412710] I [syncdutils(/rhs/bricks/brick2):192:finalize] <top>: exiting.
[2014-02-15 14:02:56.428134] I [monitor(monitor):81:set_state] Monitor: new state: faulty
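
For readers of the traceback above: the worker dies because lsetxattr() on the stime path returns ENOENT and nothing between libcxattr.py and main() catches the resulting OSError. The following is only a minimal sketch of how such an error could be tolerated, assuming the path may legitimately disappear between the crawl and the stime update. It is not the gsyncd code or an actual fix; set_stime_tolerant, pack_stime and missing_path_setter are hypothetical names introduced for illustration. Only the '!II' packing is taken from the traceback.

```python
import errno
import os
import struct


def pack_stime(mark):
    # Same layout as in the traceback: two unsigned 32-bit integers
    # (seconds, nanoseconds) in network byte order.
    return struct.pack('!II', *mark)


def set_stime_tolerant(setxattr, path, key, mark):
    """Hypothetical wrapper: report a vanished path instead of crashing.

    `setxattr` is any callable taking (path, key, value); it stands in
    for the real xattr setter, which is not reproduced here.
    """
    try:
        setxattr(path, key, pack_stime(mark))
    except OSError as e:
        if e.errno == errno.ENOENT:
            # The path no longer exists; let the caller decide what to do.
            return False
        raise  # any other error is still fatal
    return True


def missing_path_setter(path, key, value):
    # Stand-in that fails the same way as the lsetxattr call in the log.
    raise OSError(errno.ENOENT, os.strerror(errno.ENOENT))
```

With the stand-in setter, set_stime_tolerant() returns False rather than propagating the exception, so a caller could log the missing path and carry on. Whether skipping the stime update is actually safe for geo-rep is an open question that this sketch does not answer.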

Expected results:
There should be no crash.

Additional info:


I saw this crash twice on one node and once on another node. The workers recovered by themselves, but after they came back online I did some deletes on the master which didn't get synced to the slave. I guess the crash might be the reason. I will do some more checks and log a separate bug if I think the missing deletes are caused by a different issue.

Will upload those files.

Comment 1 M S Vishwanath Bhat 2014-02-15 09:38:29 UTC
Created attachment 863570 [details]
gsync log from another "Active" node

Comment 2 Venky Shankar 2014-02-17 06:17:18 UTC
MS,

Anything in geo-rep mount logs?

Please attach those too.

Comment 3 Venky Shankar 2014-02-19 03:40:12 UTC
MS,

From the logs, geo-replication is running in xsync mode as the main crawl loop.
It seems that a brick was killed (or crashed), which caused the geo-replication daemon to fall back to xsync mode. This mode cannot handle deletes (or renames), which is why the deletes were not propagated to the slave.
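
To illustrate why a crawl-based mode misses deletes, here is a deliberately simplified model. It is not gsyncd's implementation: the dict-based "volumes" and the crawl_sync and changelog_sync helpers are assumptions made purely for illustration. A crawl only sees what currently exists on the master, so an entry that has been removed never shows up in it, whereas a changelog records the removal explicitly.

```python
def crawl_sync(master, slave):
    """Crawl-style sync: copy everything that currently exists on master."""
    for path, data in master.items():
        slave[path] = data


def changelog_sync(changelog, master, slave):
    """Changelog-style sync: replay recorded operations in order."""
    for op, path in changelog:
        if op == "CREATE":
            # Skip entries that were created and already removed again.
            if path in master:
                slave[path] = master[path]
        elif op == "UNLINK":
            slave.pop(path, None)


master = {"a": b"1", "b": b"2"}
slave = {}

crawl_sync(master, slave)   # initial sync copies both files
del master["b"]             # delete on master
crawl_sync(master, slave)   # the crawl never visits "b", so slave keeps it
stale_after_crawl = "b" in slave

changelog_sync([("UNLINK", "b")], master, slave)  # replaying the delete removes it
stale_after_changelog = "b" in slave
```

In this model the stale file survives any number of crawls and is only removed once the recorded UNLINK is replayed.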

For the OSError backtrace, I would need the geo-rep mount logs.

Comment 4 M S Vishwanath Bhat 2014-03-03 06:10:37 UTC
The deletes were done once all data had been synced, and at that point all the gsyncd processes were up.


The test case is very simple and reproducible every time. I don't have the setup now, but you should be able to reproduce it in the dev environment itself.

Comment 5 Kaleb KEITHLEY 2015-10-22 15:40:20 UTC
The "pre-release" version is ambiguous and about to be removed as a choice.

If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.