Bug 1181870

Summary:	Geo-replication fails with OSError: [Errno 16] Device or resource busy
Product:	[Community] GlusterFS	Reporter:	Jordan Tomkinson <jordan.tomkinson>
Component:	geo-replication	Assignee:	bugs <bugs>
Status:	CLOSED NEXTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	3.6.1	CC:	avishwan, bugs
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	1258379 (view as bug list)		Environment:
Last Closed:	2016-08-19 11:39:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1258379

Description Jordan Tomkinson 2015-01-14 00:30:37 UTC

Description of problem:

Brand new 2 brick replica volume, geo-replication set up before any data went into the volume. rsync'd about 40GB of data from a range of small (1k) to large (800MB) files.

I waited for geo-replication to finish and then ran an 'rsync --update' to catch some files that had changed on the source, geo-replication went into Faulty mode.

Error log from gluster brick server:

[2015-01-14 01:05:41.227730] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-14 01:05:41.227941] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-14 01:05:41.329799] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-14 01:05:41.330574] I [gsyncd(/export/sitedata):633:main_i] <top>: syncing: gluster://localhost:ds0 -> ssh://root@gaige:gluster://localhost:geo-repl
[2015-01-14 01:05:45.691020] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up xsync change detection mode
[2015-01-14 01:05:45.691421] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.692125] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up changelog change detection mode
[2015-01-14 01:05:45.692292] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.692955] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-01-14 01:05:45.693108] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.703636] I [master(/export/sitedata):1301:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/ds0/ssh%3A%2F%2Froot%4010.42.0.8%3Agluster%3A%2F%2F127.0.0.1%3Ageo-repl/669c18441eb6dec3d3bb2994485b6206/xsync
[2015-01-14 01:05:45.703836] I [resource(/export/sitedata):1318:service_loop] GLUSTER: Register time: 1421157945
[2015-01-14 01:05:45.721772] I [master(/export/sitedata):480:crawlwrap] _GMaster: primary master with volume id 88bb6b5d-cc89-47cf-a74f-1a9c17f194d7 ...
[2015-01-14 01:05:45.731245] I [master(/export/sitedata):491:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-01-14 01:05:45.747462] I [master(/export/sitedata):1216:crawl] _GMaster: starting history crawl... turns: 1, stime: (1421156284, 0)
[2015-01-14 01:05:45.749725] I [master(/export/sitedata):1245:crawl] _GMaster: slave's time: (1421156284, 0)
[2015-01-14 01:05:48.813282] E [repce(/export/sitedata):207:__call__] RepceClient: call 20573:139902371477312:1421157947.51 (entry_ops) failed on peer with OSError
[2015-01-14 01:05:48.813823] E [syncdutils(/export/sitedata):270:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 1325, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 529, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1256, in crawl
    self.process(changes)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 946, in process
    self.process_change(change, done, retry)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 910, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy
[2015-01-14 01:05:48.815470] I [syncdutils(/export/sitedata):214:finalize] <top>: exiting.
[2015-01-14 01:05:48.818737] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-14 01:05:48.819194] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-14 01:05:49.695061] I [monitor(monitor):222:monitor] Monitor: worker(/export/sitedata) died in startup phase


Error log from geo-replication slave:

[2015-01-14 01:07:07.787431] I [gsyncd(slave):633:main_i] <top>: syncing: gluster://localhost:geo-repl
[2015-01-14 01:07:08.919622] I [resource(slave):761:service_loop] GLUSTER: slave listening
[2015-01-14 01:07:11.168715] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 658, in entry_ops
    errno_wrap(os.rename, [entry, en], [ENOENT, EEXIST])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
OSError: [Errno 16] Device or resource busy
[2015-01-14 01:07:11.176639] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-14 01:07:11.177026] I [syncdutils(slave):214:finalize] <top>: exiting.



Version-Release number of selected component (if applicable):
glusterfs-server 3.6.1-ubuntu1~trusty1 from the Ubuntu launchpad PPA


How reproducible:
Rsync some data, change some files from the source and rsync --update


Steps to Reproduce:
1. rsync files from an external location into gluster
2. change some files in the external location, rsync --update
3. geo-replication failed

Actual results:

geo-replication fails

Expected results:

geo-replication works

Additional info:

This was a fresh gluster 3.6.1 installation, no upgrade was performed and the volume was empty when geo-replication was setup.

Comment 1 Aravinda VK 2015-01-15 06:24:16 UTC

This error comes if Rebalance is happening in Slave nodes. DHT layer holds lock during Rebalance, hence rename fails with this error. Since Geo-rep monitor restarts the worker dies, this error logged again. Once the Rebalance complete, Geo-rep become normal and continue syncing data from Master to Slave.

We will enhance Geo-rep to handle this situation smoothly.

Comment 2 Aravinda VK 2016-08-19 11:39:51 UTC

GlusterFS-3.6 is nearing its End-Of-Life, only important security bugs still make a chance on getting fixed. mainline bug is available BZ 1258379. If this needs to get fixed in 3.7 or 3.8 this bug should get cloned.