Bug 1427870 - [geo-rep]: Worker crashes with [Errno 16] Device or resource busy: '.gfid/00000000-0000-0000-0000-000000000001/dir.166' while renaming directories
Summary: [geo-rep]: Worker crashes with [Errno 16] Device or resource busy: '.gfid/000...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Kotresh HR
QA Contact: Rochelle
URL:
Whiteboard:
Depends On: 1441927
Blocks: 1385589 1417147
 
Reported: 2017-03-01 12:46 UTC by Rahul Hinduja
Modified: 2017-09-21 04:57 UTC (History)
CC: 5 users

Fixed In Version: glusterfs-3.8.4-23
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1434018
Environment:
Last Closed: 2017-09-21 04:33:25 UTC
Embargoed:




Links:
System: Red Hat Product Errata
ID: RHBA-2017:2774
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: glusterfs bug fix and enhancement update
Last Updated: 2017-09-21 08:16:29 UTC

Description Rahul Hinduja 2017-03-01 12:46:44 UTC
Description of problem:
=======================

While renaming directories in a loop, the geo-replication worker crashes with the following traceback:

Master:
=======

[2017-03-01 07:34:23.844472] E [master(/rhs/brick3/b5):785:log_failures] _GMaster: ENTRY FAILED: ({'stat': {'atime': 1488353577.9969134, 'gid': 0, 'mtime': 1488353577.9
969134, 'mode': 16877, 'uid': 0}, 'entry1': '.gfid/00000000-0000-0000-0000-000000000001/rename_dir.124', 'gfid': 'a9adc254-3ec0-402d-945d-f1dcddbe411d', 'link': None, '
entry': '.gfid/00000000-0000-0000-0000-000000000001/dir.124', 'op': 'RENAME'}, 2)
[2017-03-01 07:34:28.105679] E [repce(/rhs/brick3/b5):207:__call__] RepceClient: call 21221:140592415500096:1488353664.61 (entry_ops) failed on peer with OSError
[2017-03-01 07:34:28.109591] E [syncdutils(/rhs/brick3/b5):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 573, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1136, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1111, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 994, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 935, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy: '.gfid/00000000-0000-0000-0000-000000000001/dir.166'
[2017-03-01 07:34:28.117834] I [syncdutils(/rhs/brick3/b5):237:finalize] <top>: exiting.
[2017-03-01 07:34:28.138552] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-03-01 07:34:28.141488] I [syncdutils(agent):237:finalize] <top>: exiting.
[2017-03-01 07:34:36.280246] E [master(/rhs/brick1/b1):785:log_failures] _GMaster: ENTRY FAILED: ({'stat': {'atime': 1488353579.1139069, 'gid': 0, 'mtime': 1488353579.1139069, 'mode': 16877, 'uid': 0}, 'entry1': '.gfid/00000000-0000-0000-0000-000000000001/rename_dir.135', 'gfid': 'e15667ad-e647-4253-a84e-0a0c6143e730', 'link': None, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/dir.135', 'op': 'RENAME'}, 2)
 
Slave:
======

[2017-03-01 07:20:33.380264] I [resource(slave):932:service_loop] GLUSTER: slave listening
[2017-03-01 07:34:28.50796] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 766, in entry_ops
    st = lstat(entry)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 512, in lstat
    return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
OSError: [Errno 16] Device or resource busy: '.gfid/00000000-0000-0000-0000-000000000001/dir.166'
[2017-03-01 07:34:28.146219] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-03-01 07:34:28.147622] I [syncdutils(slave):237:finalize] <top>: exiting.
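
The failing frame on both ends is gsyncd's errno_wrap() helper in syncdutils.py, which the traceback shows being called as errno_wrap(os.lstat, [e], [ENOENT], [ESTALE]): ENOENT is tolerated and ESTALE is retried, but every other errno, including the EBUSY returned here for the still-busy directory handle on the aux-gfid mount, propagates straight into the worker and kills it. A minimal Python sketch of that pattern follows; it is a paraphrase for illustration, not the exact upstream implementation (in particular the return convention for ignored errnos is simplified):

import errno
import os
import time


def errno_wrap(call, args=(), ignore_errnos=(), retry_errnos=()):
    # Illustrative paraphrase of the helper seen in the traceback: tolerate
    # some errnos, retry others briefly, and re-raise everything else.
    retries = 0
    while True:
        try:
            return call(*args)
        except OSError as ex:
            if ex.errno in ignore_errnos:
                return None          # simplified; gsyncd uses its own convention
            if ex.errno not in retry_errnos:
                raise                # EBUSY lands here and takes the worker down
            retries += 1
            if retries > 10:
                raise
            time.sleep(0.25)


# The failing call, mirroring resource.py entry_ops() -> lstat():
# st = errno_wrap(os.lstat,
#                 ['.gfid/00000000-0000-0000-0000-000000000001/dir.166'],
#                 [errno.ENOENT], [errno.ESTALE])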



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-15.el7rhgs.x86_64


How reproducible:
=================

Always

Steps to Reproduce:
===================
This was seen on a non-root fanout setup, but it should also reproduce on a normal setup. The exact steps carried out were:

1. Create Master (2 nodes) and Slave Cluster (4 nodes)
2. Create and Start Master and 2 Slave Volumes (Each 2x2)
3. Create mount-broker geo-rep session between master and 2 slave volumes
4. Mount the Master and Slave Volume (NFS and Fuse)
5. Create directories on the master and rename them in a loop from multiple clients:
From one client: for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
From second client: for i in {1..1000}; do mv dir.$i rename_dir.$i; done
From third client: for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done

Actual results:
===============

Multiple worker crashes are seen during the renames.

Comment 5 Kotresh HR 2017-03-20 14:56:54 UTC
Upstream Patch:

https://review.gluster.org/16924
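
For illustration only (the authoritative change is the Gerrit review above): one way the slave-side lstat could absorb a transient EBUSY instead of letting it escape into entry_ops() is to retry it the same way ESTALE is retried. The helper name below is hypothetical and not taken from the patch:

import errno
import os
import time


def lstat_tolerating_ebusy(path, attempts=10, delay=0.25):
    # Hypothetical sketch: treat ENOENT as "entry already gone", retry
    # ESTALE/EBUSY a few times, and re-raise anything else.
    for attempt in range(attempts):
        try:
            return os.lstat(path)
        except OSError as ex:
            if ex.errno == errno.ENOENT:
                return None
            if ex.errno in (errno.ESTALE, errno.EBUSY) and attempt < attempts - 1:
                time.sleep(delay)    # directory handle still busy; try again
                continue
            raise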

Comment 9 Rahul Hinduja 2017-07-16 15:28:34 UTC
Verified with build: glusterfs-geo-replication-3.8.4-32.el7rhgs.x86_64

The use case mentioned in the description was carried out with the following data sets:

Set 1:
------

#client 1
for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
#client 2
mkdir dir.{1..1000}
for i in {1..1000}; do mv dir.$i rename_dir.$i; done
#client 3
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done
#client 4
for i in {1..1999}; do mkdir rochelle.$i ; sleep 1 ; mv rochelle.$i allan.$i ; done


Set 2:
------

#client 1
for i in {1..1999}; do mkdir volks.$i ; sleep 1 ; mv volks.$i weagan.$i ; done
#client 2
touch Sun{1..1000}
for i in {1..1000}; do mv Sun.$i Moon.$i; done
#client 3
for i in {1..500}; do mkdir Flash.$i ; mv Flash.$i Red.$i ; done
#client 4
for i in {1..1999}; do touch brother.$i ; sleep 1 ; mv brother.$i sister.$i ; done


No worker crash was seen; moving this bug to the Verified state.

Comment 11 errata-xmlrpc 2017-09-21 04:33:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

Comment 12 errata-xmlrpc 2017-09-21 04:57:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

