Bug 1338051 - ENOTCONN error during parallel rmdir
Summary: ENOTCONN error during parallel rmdir
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On: 1336381 1339446
Blocks:
 
Reported: 2016-05-21 03:42 UTC by Ravishankar N
Modified: 2016-06-16 14:07 UTC
CC List: 1 user

Fixed In Version: glusterfs-3.8rc2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1336381
Environment:
Last Closed: 2016-06-16 14:07:54 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Ravishankar N 2016-05-21 03:42:19 UTC
+++ This bug was initially created as a clone of Bug #1336381 +++

Description of problem:

Reported by Sakshi Bansal (sabansal)

Parallel rmdir from multiple clients results in the application receiving "Transport endpoint is not connected" (ENOTCONN) errors even though there were no network disconnects.
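
For illustration, the error surfaces to the application as an ENOTCONN errno; with the rm -rf in the reproducer below, the failure would look something like this (hypothetical output, path chosen to match the script):

-------------------------
rm: cannot remove 'foo/bar/gee': Transport endpoint is not connected
-------------------------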


Steps to Reproduce:
1. Create a 1x2 replica volume and FUSE-mount it from 2 clients.
2. Run the script below from both clients:

-------------------------
#!/bin/bash
# Repeatedly create and delete the same directory tree; running this
# loop from two clients at once reproduces the parallel-rmdir race.

dir=$(dirname "$(readlink -f "$0")")
echo "Script in $dir"
while :
do
        # $1 is an optional suffix for the top-level directory name.
        mkdir -p "foo$1"/bar/gee
        mkdir -p "foo$1"/bar/gne
        mkdir -p "foo$1"/lna/gme
        rm -rf "foo$1"
done
-------------------------
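
Usage note (an inference from the script, not stated in the original report): since the bug is triggered by rmdir races on the same directory, run the script with no argument (or the same argument) on both clients so that both loops operate on the same "foo" tree, e.g. ./reproducer.sh from each client's mount point (script name hypothetical).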

--- Additional comment from Vijay Bellur on 2016-05-16 06:19:36 EDT ---

REVIEW: http://review.gluster.org/14358 (cluster/afr: Return correct op_errno in pre-op) posted (#1) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Vijay Bellur on 2016-05-18 05:27:03 EDT ---

REVIEW: http://review.gluster.org/14358 (cluster/afr: Check for required number of entrylks) posted (#2) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Vijay Bellur on 2016-05-20 07:28:18 EDT ---

REVIEW: http://review.gluster.org/14358 (cluster/afr: Check for required number of entrylks) posted (#3) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Vijay Bellur on 2016-05-20 11:19:42 EDT ---

REVIEW: http://review.gluster.org/14358 (cluster/afr: Check for required number of entrylks) posted (#4) for review on master by Ravishankar N (ravishankar)

Comment 1 Vijay Bellur 2016-05-21 03:44:14 UTC
REVIEW: http://review.gluster.org/14461 (cluster/afr: Check for required number of entrylks) posted (#1) for review on release-3.8 by Ravishankar N (ravishankar)

Comment 2 Vijay Bellur 2016-05-24 05:08:26 UTC
REVIEW: http://review.gluster.org/14461 (cluster/afr: Check for required number of entrylks) posted (#2) for review on release-3.8 by Ravishankar N (ravishankar)

Comment 3 Vijay Bellur 2016-05-24 05:31:25 UTC
REVIEW: http://review.gluster.org/14461 (cluster/afr: Check for required number of entrylks) posted (#3) for review on release-3.8 by Ravishankar N (ravishankar)

Comment 4 Vijay Bellur 2016-05-24 11:20:01 UTC
COMMIT: http://review.gluster.org/14461 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit 0c295ad2fddccea39d7fc5b402c2cd197f0825ca
Author: Ravishankar N <ravishankar>
Date:   Wed May 18 14:37:46 2016 +0530

    cluster/afr: Check for required number of entrylks
    
    Backport of http://review.gluster.org/#/c/14358/
    
    Problem:
    Parallel rmdir operations on the same directory result in ENOTCONN
    messages even though there was no network disconnect.
    
    During the blocking entry-lock phase of rmdir, AFR takes two sets of
    locks on all its children: one on (parent dir, name of the dir to be
    deleted), the other a full lock on the dir being deleted. We proceed
    to the pre-op stage even if only a single lock (but not all the
    needed locks) was obtained, only to fail it with ENOTCONN because
    afr_locked_nodes_get() returns zero nodes in afr_changelog_pre_op().
    
    Fix:
    After we get replies for all blocking lock requests, if we don't have
    the minimum number of locks to carry out the FOP, unlock and fail the
    FOP. The op_errno will be that of the last failed reply we got, i.e.
    whatever is set in afr_lock_cbk().
    
    Change-Id: Ibef25e65b468ebb5ea6ae1f5121a5f1201072293
    BUG: 1338051
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/14461
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.com>
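
For readers following the fix: below is a minimal sketch of the check the commit message describes, using hypothetical names rather than the actual AFR internals (the real change is at http://review.gluster.org/14358).

-------------------------
/* Hypothetical sketch, not the actual AFR code: once replies for all
 * blocking entrylk requests have arrived, enter pre-op only if enough
 * locks were obtained; otherwise fail the FOP up front. */
static int
entrylk_check_sketch (int locks_acquired, int locks_required,
                      int last_lock_errno)
{
        if (locks_acquired < locks_required) {
                /* The caller unlocks the partially acquired set and
                 * fails the FOP with the errno of the last failed
                 * lock reply (what afr_lock_cbk() recorded), instead
                 * of reaching afr_changelog_pre_op() with zero locked
                 * nodes and failing with ENOTCONN. */
                return -last_lock_errno;
        }
        return 0;       /* enough locks held; proceed to pre-op */
}
-------------------------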

Comment 5 Niels de Vos 2016-06-16 14:07:54 UTC
This bug is being closed because a release that should address the reported issue is now available. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

