Bug 1385589 - [geo-rep]: Worker crashes seen while renaming directories in loop
Summary: [geo-rep]: Worker crashes seen while renaming directories in loop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Aravinda VK
QA Contact: Rochelle
URL:
Whiteboard:
Depends On: 1427870
Blocks: 1396062 1399090 1399092 1417147
 
Reported: 2016-10-17 11:09 UTC by Rahul Hinduja
Modified: 2017-09-21 04:54 UTC (History)
CC List: 8 users

Fixed In Version: glusterfs-3.8.4-6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1396062 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:28:23 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Rahul Hinduja 2016-10-17 11:09:51 UTC
Description of problem:
=======================

While testing creation and renaming of directories in a loop, multiple worker crashes were seen, as follows:

[root@dhcp37-177 Master]# grep -ri "OSError: " *
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/00000000-0000-0000-0000-000000000001/nfs_dir.426'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/22'
[root@dhcp37-177 Master]# 

Master:
=======
Crash 1: [Errno 2] No such file or directory:
=============================================

[2016-10-16 17:35:06.867371] E [syncdutils(/rhs/brick2/b4):289:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'


Crash 2: [Errno 21] Is a directory
==================================

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'


These crashes are propagated from the slave as follows:
===========================================

[2016-10-16 17:31:06.800229] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
[2016-10-16 17:31:06.825957] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:31:06.826287] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:31:18.37847] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:31:23.391367] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:35:06.864521] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
[2016-10-16 17:35:06.884804] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:35:06.885364] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:35:17.967597] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:35:23.303258] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:46:21.666467] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
[2016-10-16 17:46:21.687004] I [repce(slave):92:service_loop] RepceServer: terminating on


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64


How reproducible:
=================
Always


Steps to Reproduce:
===================
This was seen on a non-root fanout setup, but it should also be reproducible on a normal setup. The exact steps carried out are listed below:

1. Create a master cluster (2 nodes) and a slave cluster (4 nodes)
2. Create and start the master volume and 2 slave volumes (each 2x2)
3. Create a mount-broker geo-rep session between the master and the 2 slave volumes
4. Mount the master and slave volumes (NFS and FUSE)
5. Create directories on the master and rename them in a loop:
for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
mkdir dir.{1..1000}
for i in {1..1000}; do mv dir.$i rename_dir.$i; done
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done

Actual results:
===============

Worker crashes seen with Errno 2 (ENOENT) and Errno 21 (EISDIR).

Comment 3 Aravinda VK 2016-10-17 11:55:22 UTC
Crash 1: [Errno 2] No such file or directory:
This looks like two workers trying to unlink the same entry at the same time (as part of rename handling, during changelog reprocessing).

Solution: Handle the ENOENT and ESTALE errors during unlink.
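
A minimal sketch of the kind of guard this implies on the slave side is shown below. It only illustrates tolerating ENOENT/ESTALE during unlink; the helper name is hypothetical and this is not the actual patch:

import errno
import os

def unlink_if_exists(path):
    # Unlink path, but treat ENOENT/ESTALE as success: another worker
    # (or an earlier changelog replay of the same rename) may already
    # have removed the entry.
    try:
        os.unlink(path)
    except OSError as e:
        if e.errno not in (errno.ENOENT, errno.ESTALE):
            raise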


Crash 2: [Errno 21] Is a directory:
The "Is a directory" issue is fixed upstream:
http://review.gluster.org/15132

Related BZ https://bugzilla.redhat.com/show_bug.cgi?id=1365694#c3
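
For the Crash 2 case, a hedged sketch of an EISDIR-tolerant removal is shown below. It is based only on the error codes in the tracebacks above, not on the upstream patch (see http://review.gluster.org/15132 for the actual fix); the helper name and the rmdir fallback are assumptions:

import errno
import os

def remove_slave_entry(path):
    # Remove an entry on the slave, tolerating races where the path has
    # already been removed (ENOENT/ESTALE) or has since been replaced by
    # a directory (EISDIR).
    try:
        os.unlink(path)
    except OSError as e:
        if e.errno == errno.EISDIR:
            try:
                # The name now refers to a directory; fall back to rmdir.
                os.rmdir(path)
            except OSError as e2:
                # A non-empty or already-removed directory is left for a
                # later changelog replay to reconcile.
                if e2.errno not in (errno.ENOENT, errno.ENOTEMPTY):
                    raise
        elif e.errno not in (errno.ENOENT, errno.ESTALE):
            raise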

Comment 5 Aravinda VK 2016-11-17 11:42:20 UTC
Upstream Patches:
http://review.gluster.org/15132
http://review.gluster.org/15868

Comment 8 Aravinda VK 2016-11-28 09:58:04 UTC
Release 3.8 Patch: http://review.gluster.org/15939
Release 3.9 patch: http://review.gluster.org/15940

Comment 17 Atin Mukherjee 2017-03-06 13:17:58 UTC
Given that verification of this BZ is blocked on BZ 1427870, and considering that neither of these BZs is a release blocker, all stakeholders, as part of the blocker bug triage and the rhgs-3.2.0 bug status check exercise, agreed to drop this bug from the 3.2.0 release and to take up the verification of this BZ as well as BZ 1427870 in rhgs-3.3.0. With that, resetting the flags.

Comment 20 Rahul Hinduja 2017-07-16 15:29:21 UTC
Verified with build: glusterfs-geo-replication-3.8.4-32.el7rhgs.x86_64

The use case mentioned in the description was carried out with the following data sets:

Set 1:
------

#client 1
for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
#client 2
mkdir dir.{1..1000}
for i in {1..1000}; do mv dir.$i rename_dir.$i; done
#client 3
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done
#client 4
for i in {1..1999}; do mkdir rochelle.$i ; sleep 1 ; mv rochelle.$i allan.$i ; done


Set 2:
------

#client 1
for i in {1..1999}; do mkdir volks.$i ; sleep 1 ; mv volks.$i weagan.$i ; done
#client 2
touch Sun.{1..1000}
for i in {1..1000}; do mv Sun.$i Moon.$i; done
#client 3
for i in {1..500}; do mkdir Flash.$i ; mv Flash.$i Red.$i ; done
#client 4
for i in {1..1999}; do touch brother.$i ; sleep 1 ; mv brother.$i sister.$i ; done


No worker crash was seen; moving this bug to the verified state.

Comment 22 errata-xmlrpc 2017-09-21 04:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774


