Bug 1396062 - [geo-rep]: Worker crashes seen while renaming directories in loop
Summary: [geo-rep]: Worker crashes seen while renaming directories in loop
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Aravinda VK
QA Contact:
URL:
Whiteboard:
Depends On: 1385589
Blocks: 1399090 1399092
 
Reported: 2016-11-17 11:39 UTC by Aravinda VK
Modified: 2017-03-06 17:34 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.10.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1385589
: 1399090 1399092 (view as bug list)
Environment:
Last Closed: 2017-03-06 17:34:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Aravinda VK 2016-11-17 11:39:31 UTC
+++ This bug was initially created as a clone of Bug #1385589 +++

Description of problem:
=======================

While testing the creation and renaming of directories in a loop, multiple crashes were observed, as follows:

[root@dhcp37-177 Master]# grep -ri "OSError: " *
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/00000000-0000-0000-0000-000000000001/nfs_dir.426'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/22'
[root@dhcp37-177 Master]# 

Master:
=======
Crash 1: [Errno 2] No such file or directory:
=============================================

[2016-10-16 17:35:06.867371] E [syncdutils(/rhs/brick2/b4):289:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'


Crash 2: [Errno 21] Is a directory
==================================

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'


These crashes are propagated from the slave as follows:
=======================================================

[2016-10-16 17:31:06.800229] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
[2016-10-16 17:31:06.825957] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:31:06.826287] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:31:18.37847] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:31:23.391367] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:35:06.864521] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
[2016-10-16 17:35:06.884804] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:35:06.885364] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:35:17.967597] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:35:23.303258] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:46:21.666467] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
[2016-10-16 17:46:21.687004] I [repce(slave):92:service_loop] RepceServer: terminating on


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64


How reproducible:
=================
Always


Steps to Reproduce:
===================
This was seen on a non-root fanout setup, but it should also be reproducible on a normal setup. The exact steps carried out were:

1. Create Master (2 nodes) and Slave Cluster (4 nodes)
2. Create and Start Master and 2 Slave Volumes (Each 2x2)
3. Create mount-broker geo-rep session between master and 2 slave volumes
4. Mount the Master and Slave Volume (NFS and Fuse)
5. Create directories on the master mount and rename them in a loop:
for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done    # create, wait 1s, then rename
for i in {1..1000}; do mv dir.$i rename_dir.$i; done              # rename pre-existing dir.N directories
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done          # create and rename back to back

Actual results:
===============

Worker crashes seen with Errno 2 (ENOENT) and Errno 21 (EISDIR)


Master:
=======

[root@dhcp37-58 ~]# gluster v info 
 
Volume Name: Master
Type: Distributed-Replicate
Volume ID: a4dc4c5c-95d7-4c71-ad52-3bbe70fc7240
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.58:/rhs/brick1/b1
Brick2: 10.70.37.177:/rhs/brick1/b2
Brick3: 10.70.37.58:/rhs/brick2/b3
Brick4: 10.70.37.177:/rhs/brick2/b4
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off
cluster.enable-shared-storage: enable
 
Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: cb7be148-8b85-43a2-837b-bb9d7de41a20
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.177:/var/lib/glusterd/ss_brick
Brick2: dhcp37-58.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.enable-shared-storage: enable
[root@dhcp37-58 ~]# 


Slave:
======

[root@dhcp37-214 ~]# gluster v info 
 
Volume Name: Slave1
Type: Distributed-Replicate
Volume ID: 928051ec-0177-4d13-b1cc-71d7783bfd95
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.214:/rhs/brick1/b1
Brick2: 10.70.37.62:/rhs/brick1/b2
Brick3: 10.70.37.214:/rhs/brick2/b3
Brick4: 10.70.37.62:/rhs/brick2/b4
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
cluster.enable-shared-storage: enable
 
Volume Name: Slave2
Type: Distributed-Replicate
Volume ID: 72c1006b-135f-4641-b2a1-a10a5a1ac12b
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.178:/rhs/brick1/b1
Brick2: 10.70.37.59:/rhs/brick1/b2
Brick3: 10.70.37.178:/rhs/brick2/b3
Brick4: 10.70.37.59:/rhs/brick2/b4
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
cluster.enable-shared-storage: enable
[root@dhcp37-214 ~]#

--- Additional comment from Aravinda VK on 2016-10-17 07:55:22 EDT ---

Crash 1: [Errno 2] No such file or directory:
This looks like two workers attempting the unlink at the same time (as part of rename handling, during changelog reprocessing).

Solution: Handle the ENOENT and ESTALE errors during unlink.
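
For illustration, a minimal Python sketch of that handling; the safe_unlink helper name is hypothetical and this is not the actual patch (the committed change is in Comment 2):

import errno
import os

def safe_unlink(path):
    """Unlink path, treating 'already gone' errors as success."""
    try:
        os.unlink(path)
    except OSError as e:
        # A parallel worker, or reprocessing of the same changelog,
        # may have removed the entry first; ignore ENOENT/ESTALE.
        if e.errno not in (errno.ENOENT, errno.ESTALE):
            raise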


Crash 2: [Errno 21] Is a directory:
"Is a Directory" issue is fixed in upstream
http://review.gluster.org/15132 

Related BZ https://bugzilla.redhat.com/show_bug.cgi?id=1365694#c3
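
For context, os.unlink() raises OSError with Errno 21 (EISDIR) when the path is a directory. The snippet below only illustrates that distinction, using a hypothetical remove_entry helper; it is not the change made in review 15132:

import errno
import os

def remove_entry(path):
    """Remove an entry that may be either a file or an (empty) directory."""
    try:
        os.unlink(path)
    except OSError as e:
        if e.errno == errno.EISDIR:
            # unlink() cannot remove directories; use rmdir() instead.
            os.rmdir(path)
        else:
            raise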

Comment 1 Worker Ant 2016-11-17 11:41:09 UTC
REVIEW: http://review.gluster.org/15868 (geo-rep: Handle ENOENT during unlink) posted (#1) for review on master by Aravinda VK (avishwan)

Comment 2 Worker Ant 2016-11-22 19:33:41 UTC
COMMIT: http://review.gluster.org/15868 committed in master by Vijay Bellur (vbellur) 
------
commit ecd6da0a754f21909dbbd8189228f5a27a15df3e
Author: Aravinda VK <avishwan>
Date:   Thu Nov 17 17:07:36 2016 +0530

    geo-rep: Handle ENOENT during unlink
    
    Do not raise traceback if a file/dir not exists during
    unlink or rmdir
    
    BUG: 1396062
    Change-Id: Idd43ca1fa6ae6056c3cd493f0e2f151880a3968c
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/15868
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 3 Shyamsundar 2017-03-06 17:34:44 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/

