Bug 1401814 - [Arbiter] Directory lookup failed with 11(EAGAIN) leading to rebalance failure
Summary: [Arbiter] Directory lookup failed with 11(EAGAIN) leading to rebalance failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.1
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Pranith Kumar K
QA Contact: Karan Sandha
URL:
Whiteboard:
Depends On:
Blocks: 1351528
Reported: 2016-12-06 07:50 UTC by Karan Sandha
Modified: 2017-03-23 05:54 UTC

Fixed In Version: glusterfs-3.8.4-8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 05:54:31 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHSA-2017:0486
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update
Last Updated: 2017-03-23 09:18:45 UTC

Description Karan Sandha 2016-12-06 07:50:00 UTC
Description of problem:
After adding a subvolume, directory lookups fail with EAGAIN (errno 11), which leads to rebalance failure.
I filed an issue earlier, using similar steps, that showed permission denied warnings. I think this bug might be a repercussion of that issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1399504

Version-Release number of selected component (if applicable):

[root@dhcp47-175 ~]# rpm -qa | grep gluster
glusterfs-fuse-3.8.4-6.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-6.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-5.el7rhgs.x86_64
glusterfs-libs-3.8.4-6.el7rhgs.x86_64
glusterfs-3.8.4-6.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-6.el7rhgs.x86_64
glusterfs-api-3.8.4-6.el7rhgs.x86_64
glusterfs-server-3.8.4-6.el7rhgs.x86_64
python-gluster-3.8.4-6.el7rhgs.noarch
vdsm-gluster-4.17.33-1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-3.8.4-6.el7rhgs.x86_64

How reproducible:
2/2
Logs are placed at rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

Steps to Reproduce:
1. Create a 1x(2+1) arbiter volume.
2. Create 1000 directories and 1000 files.
3. Replace a data brick of the volume.
4. Wait for the heals to complete (check with: gluster volume heal <volname> info).
5. Add 3 bricks at once to form a 2x(2+1) volume.
6. Start a rebalance of the volume with: gluster volume rebalance <volname> start
7. Result: the rebalance fails.
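
The steps above could be scripted roughly as follows. This is a dry-run sketch, not the reporter's actual procedure: it only builds the command lines and prints them. The function name, host names, brick paths, mount point, and the fileN naming are hypothetical placeholders (only the dirN naming appears in this report), and the exact add-brick form for an arbiter volume is an assumption that may differ between releases.

```python
import shlex


def repro_commands(vol, bricks, old_brick, new_brick, extra_bricks, mnt):
    """Build (but do not run) the command lines for the reproduction steps.

    Every argument is a caller-supplied placeholder; nothing is executed here.
    """
    q = shlex.quote
    server = bricks[0].split(":", 1)[0]  # host part of the first brick
    return [
        # Step 1: 1x(2+1) arbiter volume; the last brick of the set is the arbiter.
        "gluster volume create %s replica 3 arbiter 1 %s"
        % (q(vol), " ".join(q(b) for b in bricks)),
        "gluster volume start %s" % q(vol),
        "mount -t glusterfs %s %s" % (q(server + ":/" + vol), q(mnt)),
        # Step 2: 1000 directories and 1000 files on the mount point.
        "for i in $(seq 1 1000); do mkdir %s/dir$i; touch %s/file$i; done"
        % (q(mnt), q(mnt)),
        # Step 3: replace one data brick.
        "gluster volume replace-brick %s %s %s commit force"
        % (q(vol), q(old_brick), q(new_brick)),
        # Step 4: check that heals have completed.
        "gluster volume heal %s info" % q(vol),
        # Step 5: add three bricks at once (assumed form, see lead-in).
        "gluster volume add-brick %s %s"
        % (q(vol), " ".join(q(b) for b in extra_bricks)),
        # Step 6: start the rebalance, then inspect its status.
        "gluster volume rebalance %s start" % q(vol),
        "gluster volume rebalance %s status" % q(vol),
    ]


if __name__ == "__main__":
    for line in repro_commands(
        "testvol",
        ["host1:/bricks/b0", "host2:/bricks/b1", "host3:/bricks/b2"],
        "host1:/bricks/b0",
        "host1:/bricks/b0_new",
        ["host1:/bricks/c0", "host2:/bricks/c1", "host3:/bricks/c2"],
        "/mnt/testvol",
    ):
        print(line)
```

Printing the commands instead of running them keeps the sketch safe to try on a machine without a Gluster cluster; each line can be reviewed and pasted on a test node by hand.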



Actual results:
Two different outputs are shown:
[root@dhcp47-197 ~]# gluster v status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.142:/bricks/brick1/testvol_b
rick0                                       49153     0          Y       10271
Brick 10.70.47.197:/bricks/brick0/testvol_b
rick1                                       49152     0          Y       10163
Brick 10.70.47.175:/bricks/brick0/testvol_b
rick2                                       49152     0          Y       9876 
Brick 10.70.46.142:/bricks/brick0/testvol_b
rick0                                       49152     0          Y       10407
Brick 10.70.47.197:/bricks/brick1/testvol_b
rick1                                       49153     0          Y       10441
Brick 10.70.47.175:/bricks/brick1/testvol_b
rick2                                       49153     0          Y       10095
Self-heal Daemon on localhost               N/A       N/A        Y       10461
Self-heal Daemon on dhcp46-142.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       10427
Self-heal Daemon on 10.70.47.175            N/A       N/A        Y       10115
 
Task Status of Volume testvol
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 80843ccb-0391-4ad6-af5b-4662d8fced43
Status               : failed              
 
******************************************************************************
[root@dhcp47-175 ~]# gluster volume rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0           135             0               failed        0:0:38
       dhcp46-142.lab.eng.blr.redhat.com              492        0Bytes          1001           163             0               failed        0:0:37
                            10.70.47.197                0        0Bytes             0             1             0               failed        0:0:0
volume rebalance: testvol: success
[root@dhcp47-175 ~]# 
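
Note that the command itself reports "success" while every node row reports "failed". A small parser along the following lines can total the per-node counters from the status table above. It is a sketch that assumes the eight-column row layout shown in this report and a single-word status; the function name is hypothetical, and multi-word statuses such as "in progress" are deliberately not handled.

```python
def parse_rebalance_status(text):
    """Parse the node rows of `gluster volume rebalance <vol> status` output."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Node rows have exactly 8 columns, a numeric second column, and an
        # h:m:s run time in the last one; header and ruler lines do not match.
        if len(parts) == 8 and parts[1].isdigit() and parts[7].count(":") == 2:
            rows.append({
                "node": parts[0],
                "rebalanced": int(parts[1]),
                "size": parts[2],
                "scanned": int(parts[3]),
                "failures": int(parts[4]),
                "skipped": int(parts[5]),
                "status": parts[6],
                "run_time": parts[7],
            })
    return rows


# The three node rows copied from the output in this report.
SAMPLE = """\
localhost                0        0Bytes             0           135             0               failed        0:0:38
dhcp46-142.lab.eng.blr.redhat.com              492        0Bytes          1001           163             0               failed        0:0:37
10.70.47.197                0        0Bytes             0             1             0               failed        0:0:0
volume rebalance: testvol: success
"""

if __name__ == "__main__":
    rows = parse_rebalance_status(SAMPLE)
    print("total failures:", sum(r["failures"] for r in rows))
    print("failed nodes:", [r["node"] for r in rows if r["status"] == "failed"])
```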


Expected results:
The rebalance should complete without failures.
No errors should be reported.


Additional info:

[root@dhcp47-175 ~]# getfattr -d -m . -e hex  /bricks/brick?/*//dir998
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick0/testvol_brick2/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.testvol-client-0=0x000000000000000000000000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

# file: bricks/brick1/testvol_brick2/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe

#############################################################
[root@dhcp46-142 ~]# getfattr -d -m . -e hex  /bricks/brick?/*//dir998
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick0/testvol_brick0/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe

# file: bricks/brick1/testvol_brick0/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

#############################################################

[root@dhcp47-197 ~]# getfattr -d -m . -e hex  /bricks/brick?/*//dir998
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick0/testvol_brick1/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.testvol-client-0=0x000000000000000000000000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

# file: bricks/brick1/testvol_brick1/dir998
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x3098ee30bfac4cc6a512087008811ae1
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
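
To read the trusted.glusterfs.dht values above, the following sketch decodes the hex string into a hash range and checks whether a set of ranges covers the whole 32-bit hash space. It assumes the value is four big-endian 32-bit words with the last two holding the start and stop of the range; the first two words are not interpreted here. The function names are hypothetical.

```python
import struct


def dht_range(hex_value):
    """Return (start, stop) from a hex-encoded trusted.glusterfs.dht value."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(raw))
    # Assumed layout: four big-endian uint32 words; the last two are the range.
    _word0, _word1, start, stop = struct.unpack(">4I", raw)
    return start, stop


def covers_hash_space(ranges):
    """True if the ranges tile 0x00000000..0xffffffff with no gap or overlap."""
    expected = 0
    for start, stop in sorted(ranges):
        if start != expected or stop < start:
            return False
        expected = stop + 1
    return expected == 0x100000000


if __name__ == "__main__":
    # The two distinct values that appear for dir998 in the output above.
    values = [
        "0x00000001000000007fffffffffffffff",
        "0x0000000100000000000000007ffffffe",
    ]
    ranges = [dht_range(v) for v in values]
    for start, stop in ranges:
        print("0x%08x - 0x%08x" % (start, stop))
    print("covers full hash space:", covers_hash_space(ranges))
```

Under that assumption, the two values on dir998 decode to 0x7fffffff-0xffffffff and 0x00000000-0x7ffffffe, which together cover the hash space without a gap or overlap.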

Comment 2 Ravishankar N 2016-12-07 08:20:35 UTC
I was able to recreate the issue downstream consistently. After applying https://code.engineering.redhat.com/gerrit/#/c/92316/ (sent against BZ 1393694),
I was not able to hit it any more.

Comment 9 errata-xmlrpc 2017-03-23 05:54:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

