Bug 1197588

Summary: DHT+rebalance: RHS 3.0.4: gfid failed to heal after brick removal though dir is visible on existing brick.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Amit Chaurasia <achauras>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED WORKSFORME
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: urgent
Priority: unspecified
Version: rhgs-3.0
CC: mzywusko, nbalacha, rhs-bugs, sankarshan, spalai
Hardware: x86_64
OS: Linux
Whiteboard: dht-gfid-heal, dht-rca-unknown
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-08-29 11:30:39 UTC

Description Amit Chaurasia 2015-03-02 07:13:42 UTC
Description of problem:
The gfid of the folder is not healed. As a result, although the folder is visible on the backend, it is not accessible from the mount point.


Version-Release number of selected component (if applicable):
[root@dht-rhs-20 nfs_mnt1]# rpm -qa | grep -i gluster
gluster-nagios-common-0.1.4-1.el6rhs.noarch
vdsm-gluster-4.14.7.3-1.el6rhs.noarch
glusterfs-3.6.0.47-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.47-1.el6rhs.x86_64
gluster-nagios-addons-0.1.14-1.el6rhs.x86_64
samba-glusterfs-3.6.509-169.4.el6rhs.x86_64
glusterfs-api-3.6.0.47-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.47-1.el6rhs.x86_64
glusterfs-server-3.6.0.47-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.47-1.el6rhs.x86_64
glusterfs-libs-3.6.0.47-1.el6rhs.x86_64
glusterfs-cli-3.6.0.47-1.el6rhs.x86_64
[root@dht-rhs-20 nfs_mnt1]# 



How reproducible:
First observation.


Steps to Reproduce:
1. Create a folder.
2. Remove the trusted.gfid xattr from the folder on the brick it hashes to (see the command sketch after this list).
3. Remove the folder from that brick (backend).
4. On the FUSE mount point, perform a lookup.
5. The folder is now healed and accessible again.
6. Remove a brick.
7. After the brick removal, the folder is visible on the backend brick but is not accessible from the mount point.
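
A rough command-level sketch of steps 1-5, using the mount and brick paths that appear in the outputs below (the directory names themselves are illustrative):

# On the client: create a directory through the FUSE mount.
mkdir -p /fuse_mnt1/amit-dir/longernamedir1

# On the server holding the hashed copy: strip the gfid xattr from the
# directory on the brick, then remove the directory from the brick.
setfattr -x trusted.gfid /rhs/brick1/gv0/amit-dir/longernamedir1
rm -rf /rhs/brick1/gv0/amit-dir/longernamedir1

# Back on the client: a lookup from the mount point triggers the
# directory self-heal, after which the directory is accessible again.
ls -l /fuse_mnt1/amit-dir/longernamedir1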

Actual results:
1. The folder is healed even though it was removed from the hashed sub-volume. This does not appear to be the expected behaviour; folders should be healed only when deleted from a non-hashed sub-volume.

2. Even when the folder is healed, its gfid is not. As a result, the folder is accessible on the backend but not through the volume.

On Fuse mount point:

[root@dht-rhs-20 fuse_mnt1]# ll
total 0
drwxr-xr-x 3 root root 27 Mar  2 04:15 amit-dir
drwxr-xr-x 3 root root 50 Mar  2 04:36 new-amit-dir
[root@dht-rhs-20 fuse_mnt1]# cd new-amit-dir/
[root@dht-rhs-20 new-amit-dir]# ls -ltrhR longernamedir1/ | grep -v "/" | grep -v '^$' | grep -v "total"  |wc -l
536
[root@dht-rhs-20 new-amit-dir]# cd ../amit-dir/
[root@dht-rhs-20 amit-dir]# ls -ltrhR longernamedir1/ | grep -v "/" | grep -v '^$' | grep -v "total"  |wc -l
ls: cannot access longernamedir1/: No data available
0
[root@dht-rhs-20 amit-dir]# 
[root@dht-rhs-20 amit-dir]# ll
ls: cannot access longernamedir1: No data available
total 0
?????????? ? ? ? ?            ? longernamedir1
[root@dht-rhs-20 amit-dir]# 
[root@dht-rhs-20 amit-dir]# 
[root@dht-rhs-20 amit-dir]# pwd
/fuse_mnt1/amit-dir
[root@dht-rhs-20 amit-dir]# ls -ltrh
ls: cannot access longernamedir1: No data available
total 0
?????????? ? ? ? ?            ? longernamedir1
================
On NFS mount point:

[root@dht-rhs-20 nfs_mnt1]# ls -ltrh
total 0
drwxr-xr-x 3 root root 27 Mar  2 04:15 amit-dir
drwxr-xr-x 3 root root 50 Mar  2 04:36 new-amit-dir
[root@dht-rhs-20 nfs_mnt1]# ls -ltrh amit-dir/longernamedir1/
ls: cannot access amit-dir/longernamedir1/: Remote I/O error
[root@dht-rhs-20 nfs_mnt1]# 
[root@dht-rhs-20 nfs_mnt1]# 
[root@dht-rhs-20 nfs_mnt1]# ls -ltrh amit-dir/longernamedir1/
ls: cannot access amit-dir/longernamedir1/: Remote I/O error
[root@dht-rhs-20 nfs_mnt1]# 
[root@dht-rhs-20 nfs_mnt1]# 
====================

On backend:

[root@dht-rhs-19 ~]# ls -ltrh /rhs/brick1/gv0/amit-dir/longernamedir1/
total 8.0K
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile8
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile7
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile4
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile3
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile11
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile43
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile41
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile38
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile37
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile36
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile35
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile33


On the removed brick:
[root@dht-rhs-19 ~]# ls -ltrh /rhs/brick2/gv0/amit-dir/longernamedir1/
total 8.0K
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile9
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile6
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile5
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile2
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile12
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile10
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile1
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile0
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile49
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile48
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile47
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile46
-rw-r--r-- 2 root root    0 Mar  2 03:37 newfile45

=========

trusted.gfid missing:

[root@dht-rhs-19 ~]# getfattr -n trusted.gfid /rhs/brick1/gv0/amit-dir/longernamedir1/
/rhs/brick1/gv0/amit-dir/longernamedir1/: trusted.gfid: No such attribute
[root@dht-rhs-19 ~]# 
[root@dht-rhs-19 ~]# getfattr -d -m . -e hex /rhs/brick1/gv0/amit-dir/longernamedir1/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/gv0/amit-dir/longernamedir1/
trusted.afr.gv0-client-0=0x000000000000000000000000
trusted.afr.gv0-client-1=0x000000000000000000000000

[root@dht-rhs-19 ~]# 
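
For comparison, a directory with an intact gfid would normally show trusted.gfid (and typically a trusted.glusterfs.dht layout xattr) in the same dump. Roughly, with purely illustrative values not taken from this system:

# file: rhs/brick1/gv0/amit-dir/longernamedir1/
trusted.afr.gv0-client-0=0x000000000000000000000000
trusted.afr.gv0-client-1=0x000000000000000000000000
trusted.gfid=0x1b2c3d4e5f60718293a4b5c6d7e8f901
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe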

========================
Before removing the brick:

[root@dht-rhs-19 ~]# gluster v info
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: f6213ca3-4130-454c-96bd-4bdf2690b5dd
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.98:/rhs/brick1/gv0
Brick2: 10.70.47.99:/rhs/brick1/gv0
Brick3: 10.70.47.98:/rhs/brick2/gv0
Brick4: 10.70.47.99:/rhs/brick2/gv0
Options Reconfigured:
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
[root@dht-rhs-19 ~]# 

==================================

Brick removal:

[root@dht-rhs-19 ~]# gluster v remove-brick gv0 replica 2 10.70.47.98:/rhs/brick2/gv0 10.70.47.99:/rhs/brick2/gv0 start
volume remove-brick start: success
ID: 1702cff0-18a2-48b4-ad1f-f08787a333e4
[root@dht-rhs-19 ~]# gluster v remove-brick gv0 replica 2 10.70.47.98:/rhs/brick2/gv0 10.70.47.99:/rhs/brick2/gv0 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              100      805Bytes           242             0             0          in progress              10.00
                             10.70.47.99                0        0Bytes          1004             0             0            completed               7.00


===============

[root@dht-rhs-19 ~]# gluster v remove-brick gv0 replica 2 10.70.47.98:/rhs/brick2/gv0 10.70.47.99:/rhs/brick2/gv0 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              261      805Bytes          1134             0             0            completed              28.00
                             10.70.47.99                0        0Bytes          1004             0             0            completed               7.00
[root@dht-rhs-19 ~]# 
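
The remove-brick was then committed before the volume info below was taken. The commit itself is not in this capture; it would take the form:

gluster v remove-brick gv0 replica 2 10.70.47.98:/rhs/brick2/gv0 10.70.47.99:/rhs/brick2/gv0 commit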

================================

After removal of the brick:
[root@dht-rhs-19 ~]# gluster v info
 
Volume Name: gv0
Type: Replicate
Volume ID: f6213ca3-4130-454c-96bd-4bdf2690b5dd
Status: Started
Snap Volume: no
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.47.98:/rhs/brick1/gv0
Brick2: 10.70.47.99:/rhs/brick1/gv0
Options Reconfigured:
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
=========================================



Expected results:
1. Ideally, the folder should not have been healed once it was deleted from the hashed sub-volume.
2. If it is healed anyway, the gfid should also have been healed after the brick removal.
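
One manual recovery sometimes used in this situation is to copy the gfid back by hand. A sketch only, assuming the removed brick still carries the directory's trusted.gfid (not verified above):

# On the removed brick, read the surviving gfid, if any:
getfattr -n trusted.gfid -e hex /rhs/brick2/gv0/amit-dir/longernamedir1

# On each remaining brick, set that same value by hand (the value here
# is illustrative), then repeat the lookup from the mount point:
setfattr -n trusted.gfid -v 0x1b2c3d4e5f60718293a4b5c6d7e8f901 /rhs/brick1/gv0/amit-dir/longernamedir1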



Additional info:

Relevant entries from the FUSE mount log file:
[root@dht-rhs-20 nfs_mnt1]# tail -f /var/log/glusterfs/fuse_mnt1.log 
[2015-03-01 23:12:16.150741] E [dht-helper.c:1364:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.6.0.47/xlator/cluster/distribute.so(dht_readdirp_cbk+0x30d) [0x7f4623b5cded] (-->/usr/lib64/glusterfs/3.6.0.47/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f4623b33e9e] (-->/usr/lib64/glusterfs/3.6.0.47/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7f4623b36142]))) 5-gv0-dht: invalid argument: inode
[2015-03-01 23:12:39.982701] E [afr-self-heal-common.c:1629:afr_sh_common_lookup_cbk] 5-gv0-replicate-0: Missing Gfids for /amit-dir/longernamedir1
[2015-03-01 23:12:39.984242] E [afr-self-heal-common.c:2872:afr_log_self_heal_completion_status] 5-gv0-replicate-0:  gfid or missing entry self heal  failed,   on /amit-dir/longernamedir1
[2015-03-01 23:12:39.984338] W [fuse-bridge.c:481:fuse_entry_cbk] 0-glusterfs-fuse: 6910: LOOKUP() /amit-dir/longernamedir1 => -1 (No data available)
[2015-03-01 23:12:43.441852] E [afr-self-heal-common.c:1629:afr_sh_common_lookup_cbk] 5-gv0-replicate-0: Missing Gfids for /amit-dir/longernamedir1
[2015-03-01 23:12:43.443114] E [afr-self-heal-common.c:2872:afr_log_self_heal_completion_status] 5-gv0-replicate-0:  gfid or missing entry self heal  failed,   on /amit-dir/longernamedir1
[2015-03-01 23:12:43.443246] W [fuse-bridge.c:481:fuse_entry_cbk] 0-glusterfs-fuse: 6915: LOOKUP() /amit-dir/longernamedir1 => -1 (No data available)
[2015-03-01 23:12:47.103695] E [afr-self-heal-common.c:1629:afr_sh_common_lookup_cbk] 5-gv0-replicate-0: Missing Gfids for /amit-dir/longernamedir1
[2015-03-01 23:12:47.104970] E [afr-self-heal-common.c:2872:afr_log_self_heal_completion_status] 5-gv0-replicate-0:  gfid or missing entry self heal  failed,   on /amit-dir/longernamedir1
[2015-03-01 23:12:47.105079] W [fuse-bridge.c:481:fuse_entry_cbk] 0-glusterfs-fuse: 6922: LOOKUP() /amit-dir/longernamedir1 => -1 (No data available)

Comment 1 Amit Chaurasia 2015-03-02 07:23:37 UTC
Logs and sosreports are copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1197588/

Comment 5 Nithya Balachandran 2016-08-29 11:30:39 UTC
Closing this BZ as per comment #3. Please open a new BZ if this is seen again.