Bug 1385605

Summary: fuse mount point not accessible
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Karan Sandha <ksandha>
Component: rpc
Assignee: Raghavendra Talur <rtalur>
Status: CLOSED ERRATA
QA Contact: Karan Sandha <ksandha>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: aloganat, amukherj, asrivast, bkunal, ccalhoun, hamiller, ksandha, nchilaka, olim, omasek, pgurusid, pkarampu, rabhat, rcyriac, rgowdapp, rhinduja, rhs-bugs, rjoseph, rnalakka, rtalur, sanandpa, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1386626 (view as bug list)
Environment:
Last Closed: 2017-03-23 06:11:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1351528, 1386626, 1388323, 1392906, 1397267, 1398930, 1401534, 1408949, 1474007

Description Karan Sandha 2016-10-17 12:06:52 UTC
Description of problem:
The fuse mount point becomes inaccessible when it is accessed.

Version-Release number of selected component (if applicable):
[root@dhcp46-231 gluster]# rpm -qa | grep gluster
gluster-nagios-addons-0.2.7-1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-server-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-api-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-cli-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.el7rhgs.noarch
glusterfs-libs-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-fuse-3.8.4-2.26.git0a405a4.el7rhgs.x86_64


How reproducible: Hit it once.
Logs placed at rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

Steps performed (a minimal shell sketch of these steps follows below):
1. Create a 1x3 arbiter volume named mdcache.
2. Mount the volume at /mnt on two different clients.
3. Replace brick0 with a new brick; check heal info and wait for the heal to complete.
4. touch files{1..10000} from the first client.
5. Replace brick2 (the arbiter) with a new brick while simultaneously creating newfiles{1..10000} on the mount point from the second client.
6. When that completes, run echo 1234 > each of newfiles{1..10000} from the first client, using script.sh (placed with the log files).
7. Check gluster volume heal mdcache info.
8. The / directory of the brick and one more file still need to be healed (see the heal info and getfattr output below).
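A minimal shell sketch of the flow above; server names, brick paths, and the rewrite loop are illustrative placeholders, not the actual script.sh shipped with the logs:

# create and start a 1x3 arbiter volume (2 data bricks + 1 arbiter)
gluster volume create mdcache replica 3 arbiter 1 \
    server1:/bricks/brick1/mdcache server2:/bricks/brick0/mdcache server3:/bricks/brick1/mdcache
gluster volume start mdcache

# mount on both clients
mount -t glusterfs server1:/mdcache /mnt

# replace a data brick, then wait until heal info shows no pending entries
gluster volume replace-brick mdcache server2:/bricks/brick0/mdcache \
    server2:/bricks/brick0_new/mdcache commit force
gluster volume heal mdcache info

# client 1: create files; client 2: create newfiles while the arbiter brick is replaced
touch /mnt/files{1..10000}
touch /mnt/newfiles{1..10000}

# client 1: rewrite the new files, then re-check the heal state
for i in {1..10000}; do echo 1234 > /mnt/newfiles$i; done
gluster volume heal mdcache info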

 [root@dhcp46-231 gluster]# gluster volume heal mdcache info 
Brick dhcp46-231.lab.eng.blr.redhat.com:/bricks/brick1/mdcache
/ - Possibly undergoing heal

/newfiles0 
Status: Connected
Number of entries: 2

Brick dhcp46-50.lab.eng.blr.redhat.com:/bricks/brick0/mdcache
Status: Connected
Number of entries: 0

Brick dhcp47-111.lab.eng.blr.redhat.com:/bricks/brick1/mdcache
/ - Possibly undergoing heal

/newfiles0 
Status: Connected
Number of entries: 2


##################################################################
[root@dhcp46-231 gluster]# getfattr -d -m . -e hex /bricks/brick1/mdcache/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/mdcache/
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.mdcache-client-1=0x000000000000000000000008
trusted.afr.mdcache-client-2=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x5a5aa31b79d84f458641f7c032141e53

*****************************
[root@dhcp47-111 gluster]#  getfattr -d -m . -e hex /bricks/brick1/mdcache/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/mdcache/
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.mdcache-client-1=0x000000000000000000000008
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x5a5aa31b79d84f458641f7c032141e53
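For reference, a trusted.afr.<volname>-client-N changelog value packs three big-endian 32-bit counters (pending data, metadata, and entry operations), so the 0x...0008 on trusted.afr.mdcache-client-1 above indicates 8 pending entry operations against that brick. A quick way to split such a value by hand (the variable name is illustrative):

# split an AFR changelog xattr value into its data/metadata/entry counters
xattr=000000000000000000000008      # value without the 0x prefix
echo "data:     $((16#${xattr:0:8}))"
echo "metadata: $((16#${xattr:8:8}))"
echo "entry:    $((16#${xattr:16:8}))"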

Actual results:

1) Hangs were observed on the mount points.
2) Heals could not complete.
3) "Transport endpoint is not connected" errors were observed in the client and brick logs.
4) Multiple blocked locks were observed in the brick statedumps.
5) The mount point is not accessible.
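For context on (4), a sketch of how brick statedumps like these are usually captured (volume name and PID are placeholders):

# dump the state of all brick processes of the volume; files land under /var/run/gluster by default
gluster volume statedump mdcache

# or send SIGUSR1 to an individual glusterfsd/glusterfs process
kill -USR1 <brick-or-fuse-pid>

# blocked inodelk/entrylk entries then show up in the lock sections of the dumps
grep -i blocked /var/run/gluster/*.dump.*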


Expected results:
No hangs should be observed.
There should be no pending heals.

Additional info:

Comment 2 Raghavendra G 2016-10-18 04:56:23 UTC
Karan,

Can you attach brick and client log files?

regards,
Raghavendra

Comment 5 Poornima G 2016-10-20 07:03:34 UTC
Also, has this test case been tried on a 3.2 build without the md-cache options?

Comment 7 Karan Sandha 2016-10-24 06:58:50 UTC
Poornima,

Yes, I tried on a build without the md-cache options, but I wasn't able to hit it.

Thanks & regards
Karan Sandha

Comment 11 Pranith Kumar K 2016-10-25 11:14:40 UTC
*** Bug 1388414 has been marked as a duplicate of this bug. ***

Comment 15 Nag Pavan Chilakam 2016-11-15 09:39:40 UTC
I hit this case in my systemic testing, where one brick of a replica pair is down.
However, the client sees both bricks as down despite one of them being up.
Hence, if we try to cat a file that lives on that brick we get a "transport endpoint is not connected" error, and if we try to write to a file on that brick we get EIO.

Version: 3.8.4-5
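A quick way to cross-check which brick connections the FUSE client believes are down is to grep the mount log for connect/disconnect messages from the protocol/client translator; the log file name depends on the mount point, so /var/log/glusterfs/mnt.log below is a placeholder for a /mnt mount:

# client-N names map to the bricks in "gluster volume info" order
grep -iE "connected to|disconnected from" /var/log/glusterfs/mnt.log | tail -n 20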

Comment 16 Nag Pavan Chilakam 2016-11-15 10:03:33 UTC
The sosreport of the client is available on rhsqe-repo at /var/www/html/sosreports/nchilaka/bug.1385605

[root@dhcp35-191 ~]# gluster v info
Volume Name: sysvol
Type: Distributed-Replicate
Volume ID: b1ef4d84-0614-4d5d-9e2e-b19183996e43
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.191:/rhs/brick1/sysvol
Brick2: 10.70.37.108:/rhs/brick1/sysvol
Brick3: 10.70.35.3:/rhs/brick1/sysvol
Brick4: 10.70.37.66:/rhs/brick1/sysvol
Brick5: 10.70.35.191:/rhs/brick2/sysvol
Brick6: 10.70.37.108:/rhs/brick2/sysvol
Brick7: 10.70.35.3:/rhs/brick2/sysvol
Brick8: 10.70.37.66:/rhs/brick2/sysvol
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.stat-prefetch: on
performance.cache-invalidation: on
cluster.shd-max-threads: 10
features.cache-invalidation-timeout: 400
features.cache-invalidation: on
performance.md-cache-timeout: 300
features.uss: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp35-191 ~]# gluster v status
Status of volume: sysvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.191:/rhs/brick1/sysvol       N/A       N/A        N       N/A  
Brick 10.70.37.108:/rhs/brick1/sysvol       49152     0          Y       27848
Brick 10.70.35.3:/rhs/brick1/sysvol         N/A       N/A        N       N/A  
Brick 10.70.37.66:/rhs/brick1/sysvol        49152     0          Y       28853
Brick 10.70.35.191:/rhs/brick2/sysvol       49153     0          Y       18344
Brick 10.70.37.108:/rhs/brick2/sysvol       N/A       N/A        N       N/A  
Brick 10.70.35.3:/rhs/brick2/sysvol         49153     0          Y       11727
Brick 10.70.37.66:/rhs/brick2/sysvol        N/A       N/A        N       N/A  
Snapshot Daemon on localhost                49154     0          Y       18461
Self-heal Daemon on localhost               N/A       N/A        Y       18364
Quota Daemon on localhost                   N/A       N/A        Y       18410
Snapshot Daemon on 10.70.35.3               49154     0          Y       11826
Self-heal Daemon on 10.70.35.3              N/A       N/A        Y       11747
Quota Daemon on 10.70.35.3                  N/A       N/A        Y       11779
Snapshot Daemon on 10.70.37.66              49154     0          Y       28970
Self-heal Daemon on 10.70.37.66             N/A       N/A        Y       28892
Quota Daemon on 10.70.37.66                 N/A       N/A        Y       28923
Snapshot Daemon on 10.70.37.108             49154     0          Y       27965
Self-heal Daemon on 10.70.37.108            N/A       N/A        Y       27887
Quota Daemon on 10.70.37.108                N/A       N/A        Y       27918
 
Task Status of Volume sysvol
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp35-191 ~]#
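For reference, the md-cache/invalidation options visible in "Options Reconfigured" above are the ones typically enabled for this feature with gluster volume set; a sketch against the same volume name:

gluster volume set sysvol features.cache-invalidation on
gluster volume set sysvol features.cache-invalidation-timeout 400
gluster volume set sysvol performance.stat-prefetch on
gluster volume set sysvol performance.cache-invalidation on
gluster volume set sysvol performance.md-cache-timeout 300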

Comment 17 Jiffin 2016-11-16 12:46:11 UTC
*** Bug 1392906 has been marked as a duplicate of this bug. ***

Comment 18 Raghavendra Talur 2016-11-23 12:49:40 UTC
Patch posted upstream at http://review.gluster.org/#/c/15916

Comment 19 rjoseph 2016-12-05 14:22:09 UTC
Upstream master      : http://review.gluster.org/15916
Upstream release-3.8 : http://review.gluster.org/16025
Upstream release-3.9 : http://review.gluster.org/16026

Downstream : https://code.engineering.redhat.com/gerrit/92095

Comment 23 errata-xmlrpc 2017-03-23 06:11:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

Comment 40 Bipin Kunal 2017-06-19 07:11:08 UTC
Thanks Nag for the update.

@Rejy : do we need hotfix flag set on this bug?

Comment 44 Raghavendra G 2017-12-01 06:35:21 UTC
*** Bug 1429145 has been marked as a duplicate of this bug. ***