Bug 1282390

Summary: Data Tiering: delete command (rm -rf) does not delete the linkto (hashed) files of files under migration; possible split-brain observed and possible disk wastage
Product: [Community] GlusterFS
Reporter: Nithya Balachandran <nbalacha>
Component: tiering
Assignee: Mohammed Rafi KC <rkavunga>
Status: CLOSED CURRENTRELEASE
QA Contact: bugs <bugs>
Severity: urgent
Docs Contact:
Priority: urgent
Version: mainline
CC: asrivast, bugs, dlambrig, nchilaka, rkavunga, sankarshan
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1276227
Environment:
Last Closed: 2016-06-16 13:44:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1282388
Bug Blocks: 1276227

Description Nithya Balachandran 2015-11-16 09:53:01 UTC
+++ This bug was initially created as a clone of Bug #1276227 +++

Description of problem:
========================
On a tiered volume that has files under migration, issuing an rm -rf deletes all the files, including the ones under migration, but leaves their link-to files (on the hashed subvol) behind.
These link-to files are later converted to regular files and occupy disk space unnecessarily.

Since we are deleting the original (cached) file, there is no point in keeping the hashed file any longer. Any locks held there need to be removed as well.

Version-Release number of selected component (if applicable):
============================================================
glusterfs-server-3.7.5-0.3.el7rhgs.x86_64


How reproducible:
==================
Very easy; reproducible every time.

Steps to Reproduce:
====================
1. Create, start and mount a tiered volume (a condensed command sketch is included after step 8 below).
2. Create about 20 files that take a while to get promoted/demoted, so make each file at least 800 MB.
3. Let the demote cycle start.
4. Once the demote cycle starts, the files can be seen being demoted in the cached and hashed subvols, as below (see file is.7):

[root@zod glusterfs]# ll /rhs/brick*/rosa*/
/rhs/brick1/rosa/:
total 1876672
-rw-r--r--. 2 root root 614400000 Oct 29 12:00 is.1
-rw-r--r--. 2 root root 614400000 Oct 29 12:00 is.3
---------T. 2 root root 614400000 Oct 29 12:06 is.7
-rw-r--r--. 2 root root 614400000 Oct 28 19:32 new.14

/rhs/brick2/rosa/:
total 1800000
-rw-r--r--. 2 root root 614400000 Oct 29 12:00 is.2
-rw-r--r--. 2 root root 614400000 Oct 29 12:01 is.4
-rw-r--r--. 2 root root 614400000 Oct 29 12:01 is.6

/rhs/brick6/rosa_hot/:
total 9388992
-rw-r--r--. 2 root root 614400000 Oct 29 12:02 is.10
-rw-r--r--. 2 root root 398327808 Oct 29 12:04 is.22
-rw-r-Sr-T. 2 root root 614400000 Oct 29 12:01 is.7


5. From the fuse mount, before all files are demoted, issue an rm -rf to delete all files.
6. All files are deleted except the ones that were under migration.
7. If you check the backend brick immediately, you can see that the file left behind is a link-to file.
After a few seconds this link-to file is converted to a normal read-write file, as below:


[root@zod glusterfs]# ll /rhs/brick*/rosa*/
/rhs/brick1/rosa/:
total 582400
---------T. 2 root root 614400000 Oct 29 12:07 is.7

== after a few seconds ==
[root@zod glusterfs]# 
[root@zod glusterfs]# ll /rhs/brick*/rosa*/
/rhs/brick1/rosa/:
total 600000
-rw-r--r--. 2 root root 614400000 Oct 29 12:01 is.7


8. If you monitor the client fuse logs, a possible split-brain is reported:
[2015-10-29 11:41:18.567156] W [MSGID: 114031] [client-rpc-fops.c:1569:client3_3_fstat_cbk] 0-rosa-client-2: remote operation failed [No such file or directory]
[2015-10-29 11:41:18.571387] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-rosa-replicate-1: Unreadable subvolume -1 found with event generation 2 for gfid 360ed98c-d031-4631-a1fc-0fface82400f. (Possible split-brain)
[2015-10-29 11:41:18.575262] E [MSGID: 109040] [dht-helper.c:1020:dht_migration_complete_check_task] 0-rosa-cold-dht: (null): failed to lookup the file on rosa-cold-dht [Stale file handle]
[2015-10-29 11:41:18.578245] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-rosa-replicate-1: Unreadable subvolume -1 found with event generation 2 for gfid 360ed98c-d031-4631-a1fc-0fface82400f. (Possible split-brain)
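
For reference, here is a condensed, hedged command sketch of the reproduction. The hostnames, brick paths, mount point and volume layout are inferred from the output above and are placeholders only; the attach-tier syntax shown is the 3.7-era CLI and may differ on other releases.

# create and start a 2x2 cold tier, then attach a replicated hot tier
gluster volume create rosa replica 2 zod:/rhs/brick1/rosa yarrow:/rhs/brick1/rosa \
    zod:/rhs/brick2/rosa yarrow:/rhs/brick2/rosa
gluster volume start rosa
gluster volume attach-tier rosa replica 2 zod:/rhs/brick6/rosa_hot yarrow:/rhs/brick6/rosa_hot \
    zod:/rhs/brick7/rosa_hot yarrow:/rhs/brick7/rosa_hot
mount -t glusterfs zod:/rosa /mnt/rosa

# about 20 files of at least 800 MB each, large enough for demotion to still be in flight
for i in $(seq 1 20); do dd if=/dev/urandom of=/mnt/rosa/is.$i bs=1M count=800; done

# once the demote cycle has started, delete everything from the mount
rm -rf /mnt/rosa/*

# check the backend bricks for surviving linkto (T-mode) files
ls -l /rhs/brick*/rosa*/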



Actual results:
==============
1) The linkto file is converted to a regular file.
2) Disk space is wasted as a result.
3) A possible split-brain is reported.
4) Later, the replicas show different bit-rot versions (bitrot was not enabled):

[root@zod glusterfs]# getfattr -d -m . -e hex /rhs/brick*/rosa*/*
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/rosa/is.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000562f7f97000aa25b
trusted.gfid=0x6db6cae40a784af38da9af842243ffe8
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x00000000249f00000000000000000001
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
trusted.tier-gfid.linkto=0x726f73612d686f742d64687400



replica:
[root@yarrow glusterfs]# getfattr -d -m . -e hex /rhs/brick*/rosa*/*
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/rosa/is.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000562f7f9a0003bc6d
trusted.gfid=0x6db6cae40a784af38da9af842243ffe8
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x00000000249f00000000000000000001
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
trusted.tier-gfid.linkto=0x726f73612d686f742d64687400
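
Note: the trusted.tier-gfid.linkto value is simply a NUL-terminated ASCII subvolume name. Decoding the hex (shown here with xxd; any hex decoder works) confirms that the leftover file on the cold brick still points back at the hot tier:

# decode the linkto xattr value (the trailing 00 is the NUL terminator)
echo 726f73612d686f742d646874 | xxd -r -p ; echo
rosa-hot-dht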




Expected results:
===================
None of the above issues should be seen.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-10-29 03:04:10 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from nchilaka on 2015-10-29 03:59:56 EDT ---

The following are the xattrs during deletion of the files:
[root@zod glusterfs]# head -n 853 /heels.log |tail -n 100 
/rhs/brick1/rosa/:
total 0

/rhs/brick2/rosa/:
total 510080
---------T. 2 root root 614400000 Oct 29 13:16 heaven.3

/rhs/brick6/rosa_hot/:
total 0

/rhs/brick7/rosa_hot/:
total 0
# file: rhs/brick2/rosa/heaven.3
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x0200000000000000562f7f97000d37e6
trusted.gfid=0x644b07152673448f8b29cb3e43940f13
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
trusted.tier-gfid.linkto=0x726f73612d686f742d64687400

/rhs/brick1/rosa/:
total 0

/rhs/brick2/rosa/:
total 568960
---------T. 2 root root 614400000 Oct 29 13:16 heaven.3

/rhs/brick6/rosa_hot/:
total 0

/rhs/brick7/rosa_hot/:
total 0
# file: rhs/brick2/rosa/heaven.3
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x0200000000000000562f7f97000d37e6
trusted.gfid=0x644b07152673448f8b29cb3e43940f13
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
trusted.tier-gfid.linkto=0x726f73612d686f742d64687400

/rhs/brick1/rosa/:
total 0

/rhs/brick2/rosa/:
total 600000
-rw-r--r--. 2 root root 614400000 Oct 29 13:13 heaven.3

/rhs/brick6/rosa_hot/:
total 0

/rhs/brick7/rosa_hot/:
total 0
# file: rhs/brick2/rosa/heaven.3
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000562f7f97000d37e6
trusted.gfid=0x644b07152673448f8b29cb3e43940f13
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
trusted.tier-gfid.linkto=0x726f73612d686f742d64687400

/rhs/brick1/rosa/:
total 0

/rhs/brick2/rosa/:
total 600000
-rw-r--r--. 2 root root 614400000 Oct 29 13:13 heaven.3

/rhs/brick6/rosa_hot/:
total 0

/rhs/brick7/rosa_hot/:
total 0
# file: rhs/brick2/rosa/heaven.3
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000562f7f97000d37e6
trusted.gfid=0x644b07152673448f8b29cb3e43940f13
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
trusted.tier-gfid.linkto=0x726f73612d686f742d64687400

/rhs/brick1/rosa/:
total 0

/rhs/brick2/rosa/:
total 600000
-rw-r--r--. 2 root root 614400000 Oct 29 13:13 heaven.3

/rhs/brick6/rosa_hot/:
total 0

/rhs/brick7/rosa_hot/:
total 0
# file: rhs/brick2/rosa/heaven.3
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000562f7f97000d37e6
trusted.gfid=0x644b07152673448f8b29cb3e43940f13
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
trusted.tier-gfid.linkto=0x726f73612d686f742d64687400

[root@zod glusterfs]#

--- Additional comment from nchilaka on 2015-10-29 04:00:21 EDT ---

tier logs:
===========
[2015-10-29 07:46:23.327109] I [MSGID: 109038] [tier.c:476:tier_migrate_using_query_file] 0-rosa-tier-dht: Tier 0 src_subvol rosa-hot-dht file heaven.3
[2015-10-29 07:46:23.328847] I [dht-rebalance.c:1103:dht_migrate_file] 0-rosa-tier-dht: /heaven.3: attempting to move from rosa-hot-dht to rosa-cold-dht
[2015-10-29 07:46:44.142458] W [dht-rebalance.c:1247:dht_migrate_file] 0-rosa-tier-dht: /heaven.3: failed to fsync on rosa-cold-dht (Structure needs cleaning)
[2015-10-29 07:46:44.144700] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-rosa-client-7: remote operation failed. Path: <gfid:644b0715-2673-448f-8b29-cb3e43940f13> (644b0715-2673-448f-8b29-cb3e43940f13) [No such file or directory]
[2015-10-29 07:46:44.144923] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-rosa-client-6: remote operation failed. Path: <gfid:644b0715-2673-448f-8b29-cb3e43940f13> (644b0715-2673-448f-8b29-cb3e43940f13) [No such file or directory]
[2015-10-29 07:46:44.145032] W [MSGID: 109023] [dht-rebalance.c:1317:dht_migrate_file] 0-rosa-tier-dht: Migrate file failed:/heaven.3: failed to get xattr from rosa-hot-dht (No such file or directory)
[2015-10-29 07:46:44.145091] E [MSGID: 108008] [afr-transaction.c:1975:afr_transaction] 0-rosa-replicate-2: Failing FSETATTR on gfid 644b0715-2673-448f-8b29-cb3e43940f13: split-brain observed. [Input/output error]
[2015-10-29 07:46:44.145470] W [MSGID: 109023] [dht-rebalance.c:1356:dht_migrate_file] 0-rosa-tier-dht: Migrate file failed:/heaven.3: failed to perform setattr on rosa-hot-dht  [Input/output error]
[2015-10-29 07:46:44.146381] E [MSGID: 109037] [tier.c:492:tier_migrate_using_query_file] 0-rosa-tier-dht: ERROR -28 in current migration heaven.3 /heaven.3

[2015-10-29 07:46:44.150682] E [MSGID: 109037] [tier.c:442:tier_migrate_using_query_file] 0-rosa-tier-dht: ERROR in current lookup

[2015-10-29 07:46:44.153524] E [MSGID: 109037] [tier.c:442:tier_migrate_using_query_file] 0-rosa-tier-dht: ERROR in current lookup

[2015-10-29 07:46:44.153656] E [MSGID: 109037] [tier.c:1446:tier_start] 0-rosa-tier-dht: Demotion failed
[2015-10-29 07:48:00.161457] I [MSGID: 109038] [tier.c:1010:tier_build_migration_qfile] 0-rosa-tier-dht: Failed to remove /var/run/gluster/rosa-tier-dht/demotequeryfile-rosa-tier-dht
^C

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-11-03 10:12:02 EST ---

Since this bug has been approved for the z-stream release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.1.z+', and has been marked for RHGS 3.1 Update 2 release through the Internal Whiteboard entry of '3.1.2', the Target Release is being automatically set to 'RHGS 3.1.2'

--- Additional comment from Nithya Balachandran on 2015-11-10 06:51:14 EST ---

This is reproducible during a demotion:

Analysis:

When a file is being demoted, the hashed subvolume (hot tier) contains the data file. As the hashed_subvol == cached_subvol, DHT sends an unlink only to the cached subvol. 

When a file is being migrated, fds are opened on both the source and destination files. The migration proceeds even after the unlink from the client because the src file fd is still open. Since the dst linkto file was never unlinked, it is converted into a regular data file once the data copy completes.
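
To make the fd-based reasoning concrete, here is a minimal shell illustration (using a hypothetical /tmp path, not GlusterFS code) of the POSIX behaviour the migration relies on: a file unlinked from the namespace remains fully usable through any fd opened before the unlink, which is why the data copy keeps running after the client's rm -rf.

exec 3> /tmp/demo-src      # open an fd, as the migrator holds one on the source file
rm -f /tmp/demo-src        # unlink the name, as the client's rm -rf does
echo "still writable" >&3  # I/O through the already-open fd still succeeds
ls -l /proc/$$/fd/3        # the target shows up as ".../tmp/demo-src (deleted)"
exec 3>&-                  # only now is the inode actually freed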

Comment 1 Vijay Bellur 2015-11-16 12:30:41 UTC
REVIEW: http://review.gluster.org/12586 (cluster/tier: Unlink during file migration) posted (#1) for review on master by N Balachandran (nbalacha)

Comment 2 Vijay Bellur 2015-11-17 17:28:16 UTC
REVIEW: http://review.gluster.org/12586 (cluster/tier: Unlink during file migration) posted (#2) for review on master by N Balachandran (nbalacha)

Comment 3 Vijay Bellur 2015-11-30 14:40:04 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#1) for review on master by mohammed rafi  kc (rkavunga)

Comment 4 Vijay Bellur 2015-12-09 14:49:17 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#2) for review on master by mohammed rafi  kc (rkavunga)

Comment 5 Vijay Bellur 2015-12-10 02:07:59 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#3) for review on master by Dan Lambright (dlambrig)

Comment 6 Vijay Bellur 2015-12-12 09:20:59 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#4) for review on master by mohammed rafi  kc (rkavunga)

Comment 7 Vijay Bellur 2015-12-12 11:33:47 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#5) for review on master by mohammed rafi  kc (rkavunga)

Comment 8 Vijay Bellur 2015-12-12 12:52:09 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#6) for review on master by mohammed rafi  kc (rkavunga)

Comment 9 Vijay Bellur 2015-12-14 12:23:50 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#7) for review on master by soumya k (skoduri)

Comment 10 Vijay Bellur 2015-12-15 07:32:17 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#8) for review on master by mohammed rafi  kc (rkavunga)

Comment 11 Vijay Bellur 2015-12-16 09:56:27 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#9) for review on master by mohammed rafi  kc (rkavunga)

Comment 12 Vijay Bellur 2015-12-16 13:35:00 UTC
REVIEW: http://review.gluster.org/12829 (tier:unlink during migration) posted (#10) for review on master by mohammed rafi  kc (rkavunga)

Comment 13 Vijay Bellur 2015-12-16 20:45:08 UTC
COMMIT: http://review.gluster.org/12829 committed in master by Dan Lambright (dlambrig) 
------
commit b5de382afa8c5777e455c7a376fc4f1f01d782d1
Author: Mohammed Rafi KC <rkavunga>
Date:   Mon Nov 30 19:02:54 2015 +0530

    tier:unlink during migration
    
    files deleted during promotion were not deleting as the
    files are moving from hashed to non-hashed.
    
    On deleting a file that is undergoing promotion,
    the unlink call is not sent to the dst file as the
    hashed subvol == cached subvol. This causes
    the file to reappear once the migration is complete.
    
    This patch also fixes a problem with stale linkfile
    deleting.
    
    Change-Id: I4b02a498218c9d8eeaa4556fa4219e91e7fa71e5
    BUG: 1282390
    Signed-off-by: Mohammed Rafi KC <rkavunga>
    Reviewed-on: http://review.gluster.org/12829
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Dan Lambright <dlambrig>
    Tested-by: Dan Lambright <dlambrig>
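
As a hedged verification sketch against a build containing this fix (reusing the placeholder volume, bricks and mount point from the reproduction sketch above), repeating the rm -rf during an active demote cycle should leave nothing behind on the bricks:

rm -rf /mnt/rosa/*
sleep 30                                        # allow any in-flight migrations to wind down
find /rhs/brick*/rosa*/ -type f -perm 1000 -ls  # T-mode linkto files: should print nothing
getfattr -d -m trusted.tier-gfid.linkto -e hex /rhs/brick*/rosa*/* 2>/dev/null  # should print nothing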

Comment 14 Niels de Vos 2016-06-16 13:44:42 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user