Bug 1296795

Summary: Good files are not promoted in a tiered volume when bitrot is enabled
Product: [Community] GlusterFS Reporter: Kotresh HR <khiremat>
Component: tiering Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE QA Contact: bugs <bugs>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.7.7 CC: avishwan, bugs, josferna, khiremat, knarra, nbalacha, nchilaka, rcyriac, rhs-bugs, rkavunga, sankarshan, vshankar
Target Milestone: --- Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.7.7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1294786 Environment:
Last Closed: 2016-04-19 07:52:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1288490, 1294786    
Bug Blocks:    

Description Kotresh HR 2016-01-08 04:57:39 UTC
+++ This bug was initially created as a clone of Bug #1294786 +++

+++ This bug was initially created as a clone of Bug #1288490 +++

Description of problem:
On a tiered volume with both cold and hot tiers configured as dist-rep and bitrot enabled: when one subvolume of the cold tier has a good copy of a file and the other has a bad one, doing I/O on the mount point does not promote the file to the hot tier.

This also wastes space: files that are marked bad stay in the hot tier until they are recovered manually.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-9.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a tiered volume with both hot and cold tiers as dist-rep (2*2) volume.
2. Mount the volume using fuse
3. Create some files in the volume.
4. Corrupt the file on one of the subvolumes, then do I/O on it from the mount point (the other subvolume still holds a good copy).

Actual results:
Doing I/O on the good copy of the file does not promote the file.

Expected results:
Since there is a good copy of the file in another subvolume, I/O on the file should promote it.


RCA:

The tier process runs on both nodes of a replica set. When a file is ready to migrate, to avoid migrating both copies of the replica set at the same time, a virtual getxattr called "node id" is used: the tier process requests a node id from the child subvolume. If the getxattr returns the node id of the running process, that process migrates the file; otherwise it simply skips it, assuming the file will be migrated from the other replica.

Now suppose a file is marked bad on node N1 and the good copy is on N2. All reads/writes then go to N2, so once the file gets hot, N2 picks it for promotion. Before migrating, it requests the node id, but AFR always winds that request to the first up child and, on success, winds back with that child's answer. If the first up child is N1, the request returns N1's node id, and the tier process on N2 skips the file, assuming N1 will migrate it.

But on node N1, since no ops are wound to that brick (its copy is bad), the file is never listed for promotion/demotion, so N1 never migrates it either. The file therefore never gets migrated.

--- Additional comment from Nithya Balachandran on 2015-12-29 23:28:41 EST ---

Moving this to bitrot.

--- Additional comment from Venky Shankar on 2015-12-30 00:48:10 EST ---

(In reply to Mohammed Rafi KC from comment #12)
> RCA:
> 
> Tier process will be running on both replica set on each node, so when a
> file is ready to migrate, to avoid migrating both copy from replica set at
> same time, we use a virtual getxattr called "node id". From tier process we
> request a node id to the child subvolume. If the getxattr return the node id
> for the running process it will migrate the file, otherwise it will simple
> skip assuming the file will be migrated from the other replica set.
> 
> Now when a file is marked as bad in Node say N1 and the good copy for the
> same was in say N2, Then all the reads/write will got to N2, Once the file
> got heated N2 will pick the file to promote, before migrating we request
> node id, the afr will always send that node id request to first up child
> process and will wind back with that node id if it was returned with
> successful. Here in this case let us say first up child was N1 and will get
> node id for N1, then tier process running on N2 will skip the file assuming
> that N1 will migrate the file.
> 
> But from node N1, Since there is no ops winded to that brick as it was a bad
> file, the file will not be listed for promotion/demotion , hence it wont
> migrate the file. So the file will never get migrated.

Returning EIO for node-uuid getxattr() should cause AFR to query other replicas.

--- Additional comment from Venky Shankar on 2015-12-30 00:53:56 EST ---

(In reply to Mohammed Rafi KC from comment #12)

[ignore comment #14, submitted too quickly]

Returning EIO for node-uuid getxattr() should cause AFR to query other replicas. If not, then that looks like the fix for this.

But OTOH, it looks like geo-replication might currently have similar side effects: the changelog on one of the replicas would not record the data operation, and if that node happens to be the "active" node (which synchronizes data), such objects would be skipped for replication to the slave.

--- Additional comment from Vijay Bellur on 2015-12-30 05:26:26 EST ---

REVIEW: http://review.gluster.org/13116 (features/bitrot: Fail node-uuid getxattr if file is marked bad) posted (#1) for review on master by Kotresh HR (khiremat)

--- Additional comment from Vijay Bellur on 2015-12-31 02:20:04 EST ---

REVIEW: http://review.gluster.org/13116 (features/bitrot: Fail node-uuid getxattr if file is marked bad) posted (#2) for review on master by Kotresh HR (khiremat)

--- Additional comment from Vijay Bellur on 2016-01-02 04:22:25 EST ---

REVIEW: http://review.gluster.org/13116 (features/bitrot: Fail node-uuid getxattr if file is marked bad) posted (#3) for review on master by Kotresh HR (khiremat)

--- Additional comment from Vijay Bellur on 2016-01-04 03:43:21 EST ---

REVIEW: http://review.gluster.org/13116 (features/bitrot: Fail node-uuid getxattr if file is marked bad) posted (#4) for review on master by Kotresh HR (khiremat)

--- Additional comment from Vijay Bellur on 2016-01-05 04:10:30 EST ---

REVIEW: http://review.gluster.org/13116 (features/bitrot: Fail node-uuid getxattr if file is marked bad) posted (#5) for review on master by Kotresh HR (khiremat)

--- Additional comment from Vijay Bellur on 2016-01-07 11:23:47 EST ---

COMMIT: http://review.gluster.org/13116 committed in master by Venky Shankar (vshankar) 
------
commit b9d2a383a265f1552d6bad0a22c92f4e7204dd85
Author: Kotresh HR <khiremat>
Date:   Wed Dec 30 15:25:30 2015 +0530

    features/bitrot: Fail node-uuid getxattr if file is marked bad
    
    If the xattr is node-uuid and the inode is marked bad, fail getxattr
    and fgetxattr with EIO. Returning EIO results in AFR choosing the
    correct node-uuid corresponding to the subvolume where the good copy
    of the file resides.
    
    Change-Id: I45a42ca38f8322d2b10f3c4c48dc504521162b42
    BUG: 1294786
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/13116
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>

Comment 1 Vijay Bellur 2016-01-08 05:12:19 UTC
REVIEW: http://review.gluster.org/13194 (features/bitrot: Fail node-uuid getxattr if file is marked bad) posted (#1) for review on release-3.7 by Kotresh HR (khiremat)

Comment 2 Vijay Bellur 2016-01-11 03:56:09 UTC
REVIEW: http://review.gluster.org/13194 (features/bitrot: Fail node-uuid getxattr if file is marked bad) posted (#2) for review on release-3.7 by Venky Shankar (vshankar)

Comment 3 Vijay Bellur 2016-01-25 19:04:08 UTC
REVIEW: http://review.gluster.org/13194 (features/bitrot: Fail node-uuid getxattr if file is marked bad) posted (#3) for review on release-3.7 by Venky Shankar (vshankar)

Comment 4 Vijay Bellur 2016-01-27 11:46:06 UTC
COMMIT: http://review.gluster.org/13194 committed in release-3.7 by Venky Shankar (vshankar) 
------
commit 62dd323759fe2e9f45980835d97567ad8a4c371a
Author: Kotresh HR <khiremat>
Date:   Wed Dec 30 15:25:30 2015 +0530

    features/bitrot: Fail node-uuid getxattr if file is marked bad
    
    If the xattr is node-uuid and the inode is marked bad, fail getxattr
    and fgetxattr with EIO. Returning EIO results in AFR choosing the
    correct node-uuid corresponding to the subvolume where the good copy
    of the file resides.
    
    BUG: 1296795
    Change-Id: I3f8dc807794f9a82867807e7c4c73ded6c64fd8a
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/13116
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
    Reviewed-on: http://review.gluster.org/13194
    Tested-by: Venky Shankar <vshankar>
    Smoke: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra Bhat <raghavendra>
    CentOS-regression: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>

Comment 5 Kaushal 2016-04-19 07:52:37 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.7, please open a new bug report.

glusterfs-3.7.7 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-February/025292.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user