1181870 – Geo-replication fails with OSError: [Errno 16] Device or resource busy

Bug 1181870 - Geo-replication fails with OSError: [Errno 16] Device or resource busy

Summary: Geo-replication fails with OSError: [Errno 16] Device or resource busy

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	geo-replication
Sub Component:
Version:	3.6.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1258379
TreeView+	depends on / blocked

Reported:	2015-01-14 00:30 UTC by Jordan Tomkinson
Modified:	2016-08-19 11:39 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:
Clones:	1258379 (view as bug list)
Environment:
Last Closed:	2016-08-19 11:39:51 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Jordan Tomkinson 2015-01-14 00:30:37 UTC

Description of problem:

Brand new 2 brick replica volume, geo-replication set up before any data went into the volume. rsync'd about 40GB of data from a range of small (1k) to large (800MB) files.

I waited for geo-replication to finish and then ran an 'rsync --update' to catch some files that had changed on the source, geo-replication went into Faulty mode.

Error log from gluster brick server:

[2015-01-14 01:05:41.227730] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-14 01:05:41.227941] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-14 01:05:41.329799] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-14 01:05:41.330574] I [gsyncd(/export/sitedata):633:main_i] <top>: syncing: gluster://localhost:ds0 -> ssh://root@gaige:gluster://localhost:geo-repl
[2015-01-14 01:05:45.691020] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up xsync change detection mode
[2015-01-14 01:05:45.691421] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.692125] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up changelog change detection mode
[2015-01-14 01:05:45.692292] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.692955] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-01-14 01:05:45.693108] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.703636] I [master(/export/sitedata):1301:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/ds0/ssh%3A%2F%2Froot%4010.42.0.8%3Agluster%3A%2F%2F127.0.0.1%3Ageo-repl/669c18441eb6dec3d3bb2994485b6206/xsync
[2015-01-14 01:05:45.703836] I [resource(/export/sitedata):1318:service_loop] GLUSTER: Register time: 1421157945
[2015-01-14 01:05:45.721772] I [master(/export/sitedata):480:crawlwrap] _GMaster: primary master with volume id 88bb6b5d-cc89-47cf-a74f-1a9c17f194d7 ...
[2015-01-14 01:05:45.731245] I [master(/export/sitedata):491:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-01-14 01:05:45.747462] I [master(/export/sitedata):1216:crawl] _GMaster: starting history crawl... turns: 1, stime: (1421156284, 0)
[2015-01-14 01:05:45.749725] I [master(/export/sitedata):1245:crawl] _GMaster: slave's time: (1421156284, 0)
[2015-01-14 01:05:48.813282] E [repce(/export/sitedata):207:__call__] RepceClient: call 20573:139902371477312:1421157947.51 (entry_ops) failed on peer with OSError
[2015-01-14 01:05:48.813823] E [syncdutils(/export/sitedata):270:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 1325, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 529, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1256, in crawl
    self.process(changes)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 946, in process
    self.process_change(change, done, retry)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 910, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy
[2015-01-14 01:05:48.815470] I [syncdutils(/export/sitedata):214:finalize] <top>: exiting.
[2015-01-14 01:05:48.818737] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-14 01:05:48.819194] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-14 01:05:49.695061] I [monitor(monitor):222:monitor] Monitor: worker(/export/sitedata) died in startup phase


Error log from geo-replication slave:

[2015-01-14 01:07:07.787431] I [gsyncd(slave):633:main_i] <top>: syncing: gluster://localhost:geo-repl
[2015-01-14 01:07:08.919622] I [resource(slave):761:service_loop] GLUSTER: slave listening
[2015-01-14 01:07:11.168715] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 658, in entry_ops
    errno_wrap(os.rename, [entry, en], [ENOENT, EEXIST])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
OSError: [Errno 16] Device or resource busy
[2015-01-14 01:07:11.176639] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-14 01:07:11.177026] I [syncdutils(slave):214:finalize] <top>: exiting.



Version-Release number of selected component (if applicable):
glusterfs-server 3.6.1-ubuntu1~trusty1 from the Ubuntu launchpad PPA


How reproducible:
Rsync some data, change some files from the source and rsync --update


Steps to Reproduce:
1. rsync files from an external location into gluster
2. change some files in the external location, rsync --update
3. geo-replication failed

Actual results:

geo-replication fails

Expected results:

geo-replication works

Additional info:

This was a fresh gluster 3.6.1 installation, no upgrade was performed and the volume was empty when geo-replication was setup.

Comment 1 Aravinda VK 2015-01-15 06:24:16 UTC

This error comes if Rebalance is happening in Slave nodes. DHT layer holds lock during Rebalance, hence rename fails with this error. Since Geo-rep monitor restarts the worker dies, this error logged again. Once the Rebalance complete, Geo-rep become normal and continue syncing data from Master to Slave.

We will enhance Geo-rep to handle this situation smoothly.

Comment 2 Aravinda VK 2016-08-19 11:39:51 UTC

GlusterFS-3.6 is nearing its End-Of-Life, only important security bugs still make a chance on getting fixed. mainline bug is available BZ 1258379. If this needs to get fixed in 3.7 or 3.8 this bug should get cloned.

Note You need to log in before you can comment on or make changes to this bug.