1258379 – Geo-replication fails with OSError: [Errno 16] Device or resource busy

Bug 1258379 - Geo-replication fails with OSError: [Errno 16] Device or resource busy

Summary: Geo-replication fails with OSError: [Errno 16] Device or resource busy

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	geo-replication
Sub Component:
Version:	mainline
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Aravinda VK
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1181870
Blocks:
TreeView+	depends on / blocked

Reported:	2015-08-31 09:00 UTC by Aravinda VK
Modified:	2018-08-29 03:37 UTC (History)
CC List:	3 users (show)
Fixed In Version:	glusterfs-4.1.3 (or later)
Clone Of:	1181870
Environment:
Last Closed:	2018-08-29 03:37:27 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Aravinda VK 2015-08-31 09:00:16 UTC

+++ This bug was initially created as a clone of Bug #1181870 +++

Description of problem:

Brand new 2 brick replica volume, geo-replication set up before any data went into the volume. rsync'd about 40GB of data from a range of small (1k) to large (800MB) files.

I waited for geo-replication to finish and then ran an 'rsync --update' to catch some files that had changed on the source, geo-replication went into Faulty mode.

Error log from gluster brick server:

[2015-01-14 01:05:41.227730] I [monitor(monitor):163:monitor] Monitor: ------------------------------------------------------------
[2015-01-14 01:05:41.227941] I [monitor(monitor):164:monitor] Monitor: starting gsyncd worker
[2015-01-14 01:05:41.329799] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...
[2015-01-14 01:05:41.330574] I [gsyncd(/export/sitedata):633:main_i] <top>: syncing: gluster://localhost:ds0 -> ssh://root@gaige:gluster://localhost:geo-repl
[2015-01-14 01:05:45.691020] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up xsync change detection mode
[2015-01-14 01:05:45.691421] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.692125] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up changelog change detection mode
[2015-01-14 01:05:45.692292] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.692955] I [master(/export/sitedata):75:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-01-14 01:05:45.693108] I [master(/export/sitedata):403:__init__] _GMaster: using 'rsync' as the sync engine
[2015-01-14 01:05:45.703636] I [master(/export/sitedata):1301:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/ds0/ssh%3A%2F%2Froot%4010.42.0.8%3Agluster%3A%2F%2F127.0.0.1%3Ageo-repl/669c18441eb6dec3d3bb2994485b6206/xsync
[2015-01-14 01:05:45.703836] I [resource(/export/sitedata):1318:service_loop] GLUSTER: Register time: 1421157945
[2015-01-14 01:05:45.721772] I [master(/export/sitedata):480:crawlwrap] _GMaster: primary master with volume id 88bb6b5d-cc89-47cf-a74f-1a9c17f194d7 ...
[2015-01-14 01:05:45.731245] I [master(/export/sitedata):491:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-01-14 01:05:45.747462] I [master(/export/sitedata):1216:crawl] _GMaster: starting history crawl... turns: 1, stime: (1421156284, 0)
[2015-01-14 01:05:45.749725] I [master(/export/sitedata):1245:crawl] _GMaster: slave's time: (1421156284, 0)
[2015-01-14 01:05:48.813282] E [repce(/export/sitedata):207:__call__] RepceClient: call 20573:139902371477312:1421157947.51 (entry_ops) failed on peer with OSError
[2015-01-14 01:05:48.813823] E [syncdutils(/export/sitedata):270:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 1325, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 529, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1256, in crawl
    self.process(changes)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 946, in process
    self.process_change(change, done, retry)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 910, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy
[2015-01-14 01:05:48.815470] I [syncdutils(/export/sitedata):214:finalize] <top>: exiting.
[2015-01-14 01:05:48.818737] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-14 01:05:48.819194] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-01-14 01:05:49.695061] I [monitor(monitor):222:monitor] Monitor: worker(/export/sitedata) died in startup phase


Error log from geo-replication slave:

[2015-01-14 01:07:07.787431] I [gsyncd(slave):633:main_i] <top>: syncing: gluster://localhost:geo-repl
[2015-01-14 01:07:08.919622] I [resource(slave):761:service_loop] GLUSTER: slave listening
[2015-01-14 01:07:11.168715] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 658, in entry_ops
    errno_wrap(os.rename, [entry, en], [ENOENT, EEXIST])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
OSError: [Errno 16] Device or resource busy
[2015-01-14 01:07:11.176639] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-01-14 01:07:11.177026] I [syncdutils(slave):214:finalize] <top>: exiting.



Version-Release number of selected component (if applicable):
glusterfs-server 3.6.1-ubuntu1~trusty1 from the Ubuntu launchpad PPA


How reproducible:
Rsync some data, change some files from the source and rsync --update


Steps to Reproduce:
1. rsync files from an external location into gluster
2. change some files in the external location, rsync --update
3. geo-replication failed

Actual results:

geo-replication fails

Expected results:

geo-replication works

Additional info:

This was a fresh gluster 3.6.1 installation, no upgrade was performed and the volume was empty when geo-replication was setup.

--- Additional comment from Aravinda VK on 2015-01-15 01:24:16 EST ---

This error comes if Rebalance is happening in Slave nodes. DHT layer holds lock during Rebalance, hence rename fails with this error. Since Geo-rep monitor restarts the worker dies, this error logged again. Once the Rebalance complete, Geo-rep become normal and continue syncing data from Master to Slave.

We will enhance Geo-rep to handle this situation smoothly.

Comment 1 Aravinda VK 2016-01-13 06:22:25 UTC

Upstream Patch is sent http://review.gluster.org/12060

Comment 2 Mike McCune 2016-03-28 22:17:27 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 3 Amar Tumballi 2018-08-29 03:37:27 UTC

This update is done in bulk based on the state of the patch and the time since last activity. If the issue is still seen, please reopen the bug.

Note You need to log in before you can comment on or make changes to this bug.