Bug 1426128
| Summary: | Unable to delete a nested directory path on a snapshot restored (tiered) volume | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Sweta Anandpara <sanandpa> |
| Component: | replicate | Assignee: | Ravishankar N <ravishankar> |
| Status: | CLOSED WONTFIX | QA Contact: | Nag Pavan Chilakam <nchilaka> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | rhgs-3.2 | CC: | amukherj, bmohanra, pkarampu, ravishankar, rhs-bugs, storage-qa-internal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Known Issue | |
| Doc Text: |
In a replicate volume, if a gluster volume snapshot is taken while a file create is in progress, the file may be present on one brick of the replica but not on the other in the snapshotted volume. As a result, when this snapshot is restored and `rm -rf` is executed on a directory from the mount, it may fail with ENOTEMPTY.
Workaround:
If `rm -rf dir` fails with ENOTEMPTY but `ls` of the directory shows no entries, check the backend bricks of the replica to verify whether files exist on some bricks but not on the others. Perform a `stat` of each such file name from the mount so that it is healed to all bricks of the replica. The subsequent `rm -rf dir` should then succeed.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-11-13 04:26:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1351530 | ||
|
Description
Sweta Anandpara
2017-02-23 09:24:39 UTC
```
[qe@rhsqe-repo 1426128]$ pwd
/home/repo/sosreports/1426128
[qe@rhsqe-repo 1426128]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1426128]$ ll
total 339168
-rwxr-xr-x. 1 qe qe 58412064 Feb 23 14:55 sosreport-dhcp46-218.lab.eng.blr.redhat.com-20170222175628.tar.xz
-rwxr-xr-x. 1 qe qe 48999220 Feb 23 14:55 sosreport-dhcp46-221.lab.eng.blr.redhat.com-20170222175638.tar.xz
-rwxr-xr-x. 1 qe qe 47173564 Feb 23 14:55 sosreport-dhcp46-222.lab.eng.blr.redhat.com-20170222175652.tar.xz
-rwxr-xr-x. 1 qe qe 61076304 Feb 23 14:55 sosreport-dhcp46-239.lab.eng.blr.redhat.com-20170222175607.tar.xz
-rwxr-xr-x. 1 qe qe 61326704 Feb 23 14:55 sosreport-dhcp46-240.lab.eng.blr.redhat.com-20170222175612.tar.xz
-rwxr-xr-x. 1 qe qe 70310468 Feb 23 14:55 sosreport-dhcp46-242.lab.eng.blr.redhat.com-20170222175620.tar.xz
```

Some additional information: To confirm the snapshot-vs-create race that could have happened as described in the BZ, I also had a look at the xattrs of file/T-file and the corresponding parent directories on the bricks of the replica. There were no AFR xattrs on them, and there were no entry-self-heal log messages in the mount or shd logs.

Pranith,

1) I'm wondering whether we should document this as a known issue with the workaround that we need to stat the file from the mount if rmdir fails with ENOTEMPTY. This would involve comparing the directory on the bricks of the replica subvol to find the list of missing files in the first place.

2) Disabling optimistic-change-log for entry transactions could solve the issue, since the presence of the dirty xattr on the parent directory can trigger a heal (conservative merge). But this is not exposed via the CLI, and even if it were, toggling it every time we take a snapshot does not seem practical.

Edited the doc text slightly for the release notes.

What's the plan on addressing this bug?
Are we even going to address this known issue in the coming future?
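The diagnosis step described above, comparing a directory's entries across the bricks of a replica to find files that exist on some bricks but not on the others, can be sketched as follows. This is an illustrative helper, not part of the original report; the brick paths in the usage comment are hypothetical, and in practice the comparison would run on each brick host against the same relative directory path.

```python
import os

def missing_entries(brick_dirs):
    """Given the same directory path as seen on each brick of a replica,
    return {brick_dir: entries absent on that brick but present on at
    least one other brick}. Bricks with nothing missing are omitted."""
    listings = {b: set(os.listdir(b)) for b in brick_dirs}
    union = set().union(*listings.values())
    return {b: union - seen for b, seen in listings.items() if union - seen}

# Hypothetical usage for a directory "dir" on a two-brick replica:
# missing_entries([
#     "/bricks/brick0/vol/dir",
#     "/bricks/brick1/vol/dir",
# ])
# Any entry reported here should then be stat'ed from the mount so that
# AFR heals it to all bricks before `rm -rf dir` is retried.
```

The set-union approach generalizes beyond replica 2: an entry is flagged for a brick whenever any other brick in the replica set has it.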