Bug 1401814

Summary: [Arbiter] Directory lookup failed with 11 (EAGAIN), leading to rebalance failure

Product: Red Hat Gluster Storage
Component: replicate
Version: rhgs-3.1
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Karan Sandha <ksandha>
Assignee: Pranith Kumar K <pkarampu>
QA Contact: Karan Sandha <ksandha>
CC: amukherj, ravishankar, rcyriac, rhinduja, rhs-bugs, storage-qa-internal
Target Release: RHGS 3.2.0
Hardware: All
OS: Linux
Fixed In Version: glusterfs-3.8.4-8
Type: Bug
Last Closed: 2017-03-23 05:54:31 UTC
Bug Blocks: 1351528

Description Karan Sandha 2016-12-06 07:50:00 UTC
Description of problem:
After adding a subvolume, directory lookups fail with 11 (EAGAIN), which leads to rebalance failure.
An issue with similar steps was filed earlier for permission-denied warnings; this might be a repercussion of that issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1399504

Version-Release number of selected component (if applicable):

[root@dhcp47-175 ~]# rpm -qa | grep gluster
glusterfs-fuse-3.8.4-6.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-6.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-5.el7rhgs.x86_64
glusterfs-libs-3.8.4-6.el7rhgs.x86_64
glusterfs-3.8.4-6.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-6.el7rhgs.x86_64
glusterfs-api-3.8.4-6.el7rhgs.x86_64
glusterfs-server-3.8.4-6.el7rhgs.x86_64
python-gluster-3.8.4-6.el7rhgs.noarch
vdsm-gluster-4.17.33-1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-3.8.4-6.el7rhgs.x86_64

How reproducible:
2/2
Logs are placed at rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

Steps to Reproduce:
1. Create a 1x(2+1) arbiter volume.
2. Create 1000 directories and 1000 files.
3. Replace a data brick of the volume.
4. Let the heals complete; check with gluster volume heal <volname> info.
5. Add 3 bricks at once to form a 2x(2+1) volume.
6. Start a rebalance of the volume with gluster volume rebalance <volname> start (a command sketch follows below).
7. Observed output: the rebalance fails.
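
For reference, a minimal command-line sketch of the steps above, assuming three nodes n1, n2 and n3, placeholder brick paths under /bricks, and a client mount at /mnt/testvol (these hostnames and paths are illustrative, not the exact ones used in this report):

# 1. Create and mount a 1x(2+1) arbiter volume (third brick of the set is the arbiter)
gluster volume create testvol replica 3 arbiter 1 \
    n1:/bricks/brick0/testvol_brick0 n2:/bricks/brick0/testvol_brick1 n3:/bricks/brick0/testvol_brick2
gluster volume start testvol
mkdir -p /mnt/testvol && mount -t glusterfs n1:/testvol /mnt/testvol

# 2. Create 1000 directories and 1000 files
for i in $(seq 1 1000); do mkdir /mnt/testvol/dir$i; touch /mnt/testvol/file$i; done

# 3. Replace one data brick (old and new brick paths are placeholders)
gluster volume replace-brick testvol \
    n1:/bricks/brick0/testvol_brick0 n1:/bricks/brick1/testvol_brick0 commit force

# 4. Wait until heals complete
gluster volume heal testvol info

# 5. Add three bricks at once to form 2x(2+1)
gluster volume add-brick testvol \
    n1:/bricks/brick2/testvol_brick0 n2:/bricks/brick2/testvol_brick1 n3:/bricks/brick2/testvol_brick2

# 6. Start the rebalance and watch its status
gluster volume rebalance testvol start
gluster volume rebalance testvol status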



Actual results:
Two different outputs are shown below:
[root@dhcp47-197 ~]# gluster v status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.142:/bricks/brick1/testvol_b
rick0                                       49153     0          Y       10271
Brick 10.70.47.197:/bricks/brick0/testvol_b
rick1                                       49152     0          Y       10163
Brick 10.70.47.175:/bricks/brick0/testvol_b
rick2                                       49152     0          Y       9876 
Brick 10.70.46.142:/bricks/brick0/testvol_b
rick0                                       49152     0          Y       10407
Brick 10.70.47.197:/bricks/brick1/testvol_b
rick1                                       49153     0          Y       10441
Brick 10.70.47.175:/bricks/brick1/testvol_b
rick2                                       49153     0          Y       10095
Self-heal Daemon on localhost               N/A       N/A        Y       10461
Self-heal Daemon on dhcp46-142.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       10427
Self-heal Daemon on 10.70.47.175            N/A       N/A        Y       10115
 
Task Status of Volume testvol
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 80843ccb-0391-4ad6-af5b-4662d8fced43
Status               : failed              
 
******************************************************************************
[root@dhcp47-175 ~]# gluster volume rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0           135             0               failed        0:0:38
       dhcp46-142.lab.eng.blr.redhat.com              492        0Bytes          1001           163             0               failed        0:0:37
                            10.70.47.197                0        0Bytes             0             1             0               failed        0:0:0
volume rebalance: testvol: success
[root@dhcp47-175 ~]# 


Expected results:
The rebalance should complete without failure, and no errors should be reported.


Additional info:

[root@dhcp47-175 ~]# getfattr -d -m . -e hex  /bricks/brick?/*//dir998
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick0/testvol_brick2/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.testvol-client-0=0x000000000000000000000000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

# file: bricks/brick1/testvol_brick2/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe

#############################################################
[root@dhcp46-142 ~]# getfattr -d -m . -e hex  /bricks/brick?/*//dir998
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick0/testvol_brick0/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe

# file: bricks/brick1/testvol_brick0/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

#############################################################

[root@dhcp47-197 ~]# getfattr -d -m . -e hex  /bricks/brick?/*//dir998
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick0/testvol_brick1/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.testvol-client-0=0x000000000000000000000000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

# file: bricks/brick1/testvol_brick1/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
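
For reference, the same layout check can be run in one pass on each node with a loop over the brick paths (a convenience sketch; the glob assumes the brick layout shown in this report):

for d in /bricks/brick?/testvol_brick*/dir998; do
    getfattr -d -m . -e hex "$d"
done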

Comment 2 Ravishankar N 2016-12-07 08:20:35 UTC
I was able to recreate the issue downstream consistently. After applying https://code.engineering.redhat.com/gerrit/#/c/92316/ (sent against BZ 1393694), I was not able to hit it any more.

Comment 9 errata-xmlrpc 2017-03-23 05:54:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html