Bug 1335538

Summary: [geo-rep]: recurrsive_rmdir causing "Transport endpoint not connected" followed by "Structure needs cleaning"
Product: [Community] GlusterFS Reporter: Aravinda VK <avishwan>
Component: geo-replicationAssignee: Aravinda VK <avishwan>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: high    
Version: mainlineCC: avishwan, bugs, chrisw, csaba, khiremat, nlevinki, rhinduja, sarumuga, smohan, storage-qa-internal
Target Milestone: ---Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-4.1.3 (or later) Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1261300 Environment:
Last Closed: 2018-08-29 03:36:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1261300    
Bug Blocks:    

Description Aravinda VK 2016-05-12 12:56:58 UTC
+++ This bug was initially created as a clone of Bug #1261300 +++

Description of problem:
=======================

Observing "OSError: [Errno 107] Transport endpoint is not connected" and "OSError: [Errno 117] Structure needs cleaning:" traceback on the slave logs when issued "rm -rf" from master

[2015-09-08 19:01:09.604484] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 685, in entry_ops
    [], [ENOTEMPTY, ESTALE, ENODATA])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in recursive_rmdir
    recursive_rmdir(gfid, entry, fullname)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 667, in recursive_rmdir
    errno_wrap(os.rmdir, [path], [ENOENT, ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/4752e6d2-4bef-4319-a535-2fbf1fea55e7/level04/hardlink_to_files'
[2015-09-08 19:01:09.621318] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-09-08 19:01:09.621871] I [syncdutils(slave):220:finalize] <top>: exiting.
[2015-09-08 19:01:20.188294] I [gsyncd(slave):649:main_i] <top>: syncing: gluster://localhost:slave
[2015-09-08 19:01:21.213108] I [gsyncd(slave):649:main_i] <top>: syncing: gluster://localhost:slave
[2015-09-08 19:01:21.355303] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2015-09-08 19:01:22.369762] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2015-09-08 19:01:36.427643] W [syncdutils(slave):486:errno_wrap] <top>: reached maximum retries (['b7e28dc5-f77a-45ff-b5ed-38fc596359a7', '.gfid/7fad9921-77c4-4806-8ed1-048df17c2fb7/level44', '.gfid/7fad9921-77c4-4806-8ed1-048df17c2fb7/level44'])...[Errno 39] Directory not empty: '.gfid/7fad9921-77c4-4806-8ed1-048df17c2fb7/level44'
[2015-09-08 19:01:36.428356] W [resource(slave):692:entry_ops] <top>: Recursive remove b7e28dc5-f77a-45ff-b5ed-38fc596359a7 => .gfid/7fad9921-77c4-4806-8ed1-048df17c2fb7/level44failed: Directory not empty
[2015-09-08 19:01:37.557725] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 685, in entry_ops
    [], [ENOTEMPTY, ESTALE, ENODATA])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 659, in recursive_rmdir
    EISDIR])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 117] Structure needs cleaning: '.gfid/b5a0188e-f6fa-4c67-81dd-66b4f68a330b/level14/symlink_to_files'
[2015-09-08 19:01:37.568291] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.

Eventually with retrial all the files are removed from the slave. 

Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.1-14.el7rhgs.x86_64


How reproducible:
=================

Always


Steps to Reproduce:
===================
1. Create master and slave cluster
2. Create geo-rep session between master and slave volume
3. Execute following fops on master and verify the sync on slave 
create, chmod, chown, chgrp, symlink, hardlink, truncate, rename, followed by remove {rm -rf }

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-09-09 03:25:53 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs‑3.1.z' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Rahul Hinduja on 2015-09-09 05:33:43 EDT ---

sosreports are at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1261300/

--- Additional comment from Sakshi on 2015-09-14 00:36:17 EDT ---

Reason for Structure needs cleaning error is that the lookup selfheal in DHT is not setting errors properly at few places. Fix sent here
http://review.gluster.org/#/c/12165/

--- Additional comment from Rahul Hinduja on 2016-01-04 06:29:24 EST ---

For records, hitting this consistently with rm -rf * on build: glusterfs-3.7.5-13.el7rhgs.x86_64

[2016-01-04 11:06:34.928106] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2016-01-04 11:06:47.999924] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 685, in entry_ops
    [], [ENOTEMPTY, ESTALE, ENODATA])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in recursive_rmdir
    recursive_rmdir(gfid, entry, fullname)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 667, in recursive_rmdir
    errno_wrap(os.rmdir, [path], [ENOENT, ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/1d9fc177-fc1e-411f-8ea9-b6e1234575af/level11/symlink_to_files'

--- Additional comment from Rahul Hinduja on 2016-04-14 06:51:30 EDT ---

For records, hitting this consistently with rm -rf * on build: glusterfs-3.7.9-1.el7rhgs.x86_64

[2016-04-14 10:34:21.478831] W [resource(slave):721:entry_ops] <top>: Recursive remove ec3be6a9-d725-4a09-8a4c-ef98a84ee39f => .gfid/00000000-0000-0000-0000-000000000001/thread1failed: Directory not empty
[2016-04-14 10:34:26.432831] W [syncdutils(slave):486:errno_wrap] <top>: reached maximum retries (['ec3be6a9-d725-4a09-8a4c-ef98a84ee39f', '.gfid/00000000-0000-0000-0000-000000000001/thread1', '.gfid/00000000-0000-0000-0000-000000000001/thread1'])...[Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread1/level07/level17/level27/level37'
[2016-04-14 10:34:26.433208] W [resource(slave):721:entry_ops] <top>: Recursive remove ec3be6a9-d725-4a09-8a4c-ef98a84ee39f => .gfid/00000000-0000-0000-0000-000000000001/thread1failed: Directory not empty
[2016-04-14 10:35:03.264699] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 714, in entry_ops
    [], [ENOTEMPTY, ESTALE, ENODATA])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 666, in recursive_rmdir
    recursive_rmdir(gfid, entry, fullname)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 671, in recursive_rmdir
    errno_wrap(os.rmdir, [path], [ENOENT, ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/cd8f7393-5078-4776-a24f-3e8eb69e953f/level37/symlink_to_files'
[2016-04-14 10:35:03.274899] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-04-14 10:35:03.275094] I [syncdutils(slave):220:finalize] <top>: exiting.
[2016-04-14 10:35:08.326251] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 714, in entry_ops
    [], [ENOTEMPTY, ESTALE, ENODATA])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 671, in recursive_rmdir
    errno_wrap(os.rmdir, [path], [ENOENT, ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/af8c23fb-dfd2-4c76-8c74-45813518acb0/symlink_to_files'
[2016-04-14 10:35:08.335688] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-04-14 10:35:08.335922] I [syncdutils(slave):220:finalize] <top>: exiting.
[2016-04-14 10:35:14.135530] I [gsyncd(slave):653:main_i] <top>: syncing: gluster://localhost:slave
[2016-04-14 10:35:15.292807] I [resource(slave):902:service_loop] GLUSTER: slave listening
[2016-04-14 10:35:19.146552] I [gsyncd(slave):653:main_i] <top>: syncing: gluster://localhost:slave
[2016-04-14 10:35:20.289783] I [resource(slave):902:service_loop] GLUSTER: slave listening
[2016-04-14 10:36:16.786941] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 714, in entry_ops
    [], [ENOTEMPTY, ESTALE, ENODATA])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 671, in recursive_rmdir
    errno_wrap(os.rmdir, [path], [ENOENT, ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/8ca37ba9-5d4f-40a1-94df-aed09a2bd36a/symlink_to_files'

Comment 1 Vijay Bellur 2016-05-12 13:04:31 UTC
REVIEW: http://review.gluster.org/14307 (geo-rep: Add Retry in case of ENOTCONN during rmdir) posted (#1) for review on master by Aravinda VK (avishwan)

Comment 2 Vijay Bellur 2016-05-13 02:55:46 UTC
REVIEW: http://review.gluster.org/14307 (geo-rep: Add Retry in case of ENOTCONN during rmdir) posted (#2) for review on master by Aravinda VK (avishwan)

Comment 3 Amar Tumballi 2018-08-29 03:36:01 UTC
This update is done in bulk based on the state of the patch and the time since last activity. If the issue is still seen, please reopen the bug.