Bug 1065631 - dist-geo-rep: gsyncd in one of the node crashed with "OSError: [Errno 2] No such file or directory"
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: pre-release
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-02-15 09:34 UTC by M S Vishwanath Bhat
Modified: 2016-06-01 01:57 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-22 15:40:20 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
geo-rep logs from one Active node (7.88 MB, text/x-log)
2014-02-15 09:34 UTC, M S Vishwanath Bhat
gsync log from another "Active" node (3.25 MB, text/x-log)
2014-02-15 09:38 UTC, M S Vishwanath Bhat

Description M S Vishwanath Bhat 2014-02-15 09:34:35 UTC
Created attachment 863569
geo-rep logs from one Active node

Description of problem:
While geo-replication was running, one of the gsyncd processes on the master crashed with an OSError. Later, a few file deletes were not synced to the slave, and I believe this crash is the reason.

Version-Release number of selected component (if applicable):
[root@mustang ~]# gluster --version
glusterfs 3.5.0beta3 built on Feb 15 2014 13:24:29
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:
Once out of one try

Steps to Reproduce:
1. Create and start geo-rep from a 2*2 distributed-replicated master volume to a 2*2 distributed-replicated slave volume.
2. Now create some files (untar the Linux kernel).
3. Wait for the crash to happen in one of the nodes.

Actual results:

[2014-02-15 14:01:47.166320] W [master(/rhs/bricks/brick2):877:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1392452987
[2014-02-15 14:02:56.410913] E [syncdutils(/rhs/bricks/brick2):240:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/local/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/local/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 542, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 1176, in service_loop
    g1.crawlwrap(oneshot=True)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 467, in crawlwrap
    self.crawl()
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 1137, in crawl
    self.upd_stime(item[1][1], item[1][0])
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 884, in upd_stime
    self.sendmark(path, stime)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 658, in sendmark
    self.set_slave_xtime(path, mark)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 152, in set_slave_xtime
    self.slave.server.set_stime(path, self.uuid, mark)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 1163, in <lambda>
    slave.server.set_stime = types.MethodType(lambda _self, path, uuid, mark: brickserver.set_stime(path, uuid + '.' + gconf.slave_id, mark), slave.server)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 299, in ff
    return f(*a)
  File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 496, in set_stime
    Xattr.lsetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'stime']), struct.pack('!II', *mark))
  File "/usr/local/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 66, in lsetxattr
    cls.raise_oserr()
  File "/usr/local/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
[2014-02-15 14:02:56.412710] I [syncdutils(/rhs/bricks/brick2):192:finalize] <top>: exiting.
[2014-02-15 14:02:56.428134] I [monitor(monitor):81:set_state] Monitor: new state: faulty

Expected results:
There should be no crash.
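
For illustration only, here is a minimal sketch of how a caller could tolerate the ENOENT seen in the traceback instead of letting it bring down the worker. This is not the gsyncd fix: `set_stime_tolerant` and its `setxattr` callable parameter are hypothetical names, and the mark is assumed to be a (seconds, nanoseconds) pair, which matches the `struct.pack('!II', *mark)` call in the traceback. Whether ignoring a vanished path is actually the right behaviour here is an open question.

```python
import errno
import os
import struct


def pack_stime(mark):
    """Pack a two-integer mark as big-endian unsigned 32-bit values,
    mirroring struct.pack('!II', *mark) from the traceback."""
    return struct.pack('!II', *mark)


def set_stime_tolerant(setxattr, path, key, mark):
    """Hypothetical wrapper around a setxattr-like callable.

    Returns True if the xattr was set, False if the path no longer
    exists (ENOENT). Any other OSError is re-raised unchanged.
    """
    try:
        setxattr(path, key, pack_stime(mark))
    except OSError as e:
        if e.errno == errno.ENOENT:
            return False
        raise
    return True


def vanished(path, key, value):
    """Stand-in for an lsetxattr call on a path that has been removed."""
    raise OSError(errno.ENOENT, os.strerror(errno.ENOENT))


# The ENOENT is reported as a False return value rather than a crash.
result = set_stime_tolerant(vanished, '/no/such/path', 'example.stime', (1392452987, 0))
```

Passing the setter as a callable keeps the sketch independent of any particular xattr binding.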

Additional info:


I saw this crash twice on one node and once on another node. The gsyncd processes recovered by themselves, but after they came back online I did some deletes on the master which did not get synced to the slave. I guess the crash might be the reason. I will do some more checks and log a separate bug if I think the missing deletes are caused by a different bug.

Will upload those files.

Comment 1 M S Vishwanath Bhat 2014-02-15 09:38:29 UTC
Created attachment 863570
gsync log from another "Active" node

Comment 2 Venky Shankar 2014-02-17 06:17:18 UTC
MS,

Anything in geo-rep mount logs?

Please attach those too.

Comment 3 Venky Shankar 2014-02-19 03:40:12 UTC
MS,

From the logs, geo-replication is running with xsync as the main crawl loop.
It seems that a brick was killed (or crashed), which caused the geo-replication daemon to start running in xsync mode. This mode cannot handle deletes (or renames), which is why the deletes were not propagated to the slave.

For the OSError backtrace, I would need the geo-rep mount logs.
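
To make the point about xsync and deletes concrete, here is a toy model (not gsyncd code). It assumes a crawl-style sync only visits entries that currently exist on the master, while a changelog-style sync replays recorded operations. The function names, the dict-based "volumes", and the operation names are all made up for the illustration.

```python
def crawl_sync(master, slave):
    """Crawl-style sync: copy every entry that exists on the master now.

    Entries present only on the slave are never visited, so a file
    deleted on the master is never removed from the slave.
    """
    for path, data in master.items():
        slave[path] = data


def changelog_sync(changelog, slave):
    """Changelog-style sync: replay recorded operations, including deletes."""
    for op, path, data in changelog:
        if op == "CREATE":
            slave[path] = data
        elif op == "UNLINK":
            slave.pop(path, None)


master = {"a": b"1", "b": b"2"}
slave = {}

crawl_sync(master, slave)          # initial sync: slave now has "a" and "b"
del master["b"]                    # delete on the master
crawl_sync(master, slave)          # the crawl sees only "a"; "b" stays on the slave
stale_after_crawl = sorted(slave)  # ['a', 'b']

changelog_sync([("UNLINK", "b", None)], slave)  # replaying the delete removes "b"
after_changelog = sorted(slave)    # ['a']
```

In this model the only way the slave learns about the delete is from a recorded operation, which is the gap described above.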

Comment 4 M S Vishwanath Bhat 2014-03-03 06:10:37 UTC
The deletes were done once all data was synced, and at that point all the gsyncd processes were up.


The test case is very simple and reproducible every time. I don't have the setup now. You should be able to reproduce it in the dev environment itself.

Comment 5 Kaleb KEITHLEY 2015-10-22 15:40:20 UTC
pre-release version is ambiguous and about to be removed as a choice.

If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.
