Bug 1288490
Summary: | Good files do not get promoted in a tiered volume when bitrot is enabled | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra>
Component: | bitrot | Assignee: | Bug Updates Notification Mailing List <rhs-bugs>
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.1 | CC: | avishwan, byarlaga, josferna, khiremat, knarra, nbalacha, nchilaka, rcyriac, rhs-bugs, rkavunga, sankarshan, storage-qa-internal, vshankar
Target Milestone: | --- | Keywords: | ZStream
Target Release: | RHGS 3.1.2 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.7.5-15 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1294786 (view as bug list) | Environment: |
Last Closed: | 2016-03-01 06:00:42 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1294786, 1296795 | |
Description
RamaKasturi
2015-12-04 11:43:03 UTC
sos reports can be found in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1288490/

Once the bad copy is recovered in the volume by deleting the file and its gfid from the backend and triggering a self-heal on the volume, doing I/O from the mount point promotes the file (see the sketch following the RCA below).

Should be fixed with this downstream patch:
https://code.engineering.redhat.com/gerrit/#/c/64239/

I have tested this and the bug still exists, so I am moving it back to ASSIGNED. Below are the steps I followed to verify:

1) Created a replicate volume.
2) Enabled bitrot on the volume.
3) Mounted the volume using FUSE and created some files.
4) Edited a file directly on one brick of the replica pair so that it gets marked as a corrupted file.
5) Ran the attach-tier command.
6) Started doing I/O from the mount point on the corrupted file.

Since a good copy of the file is present on the other brick of the replica pair, doing I/O on the file should have promoted it.

gluster volume info:

```
[root@rhs-client33 h2]# gluster vol info vol_rep

Volume Name: vol_rep
Type: Tier
Volume ID: 938581fa-3480-4fd2-9073-57eaecd78e50
Status: Started
Number of Bricks: 4
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 2 = 2
Brick1: rhs-client33.lab.eng.blr.redhat.com:/bricks/brick4/h2
Brick2: rhs-client24.lab.eng.blr.redhat.com:/bricks/brick4/h1
Cold Tier:
Cold Tier Type : Replicate
Number of Bricks: 1 x 2 = 2
Brick3: rhs-client24.lab.eng.blr.redhat.com:/bricks/brick3/b1
Brick4: rhs-client33.lab.eng.blr.redhat.com:/bricks/brick3/b2
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.scrub-freq: hourly
features.scrub: Active
features.bitrot: on
performance.readdir-ahead: on
```

Sure, will look into it.

RCA:

A tier process runs on each node of a replica set, so when a file is ready to migrate, we use a virtual getxattr called "node id" to avoid migrating both copies of the replica set at the same time. The tier process requests a node id from its child subvolume. If the getxattr returns the node id of the running process, it migrates the file; otherwise it simply skips the file, assuming it will be migrated from the other node of the replica set.

Now, when a file is marked as bad on one node, say N1, and the good copy is on, say, N2, all reads/writes go to N2. Once the file gets heated, N2 picks the file for promotion. Before migrating, the tier process requests the node id; AFR always winds that node-id request to the first up child and unwinds with that node id if the call succeeds. In this case, say the first up child is N1: the request returns N1's node id, so the tier process running on N2 skips the file, assuming N1 will migrate it.

But on node N1, since no ops are wound to that brick (the file is bad there), the file is never listed for promotion/demotion, so N1 will not migrate it either. As a result, the file never gets migrated.

Moving this to bitrot.
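To make the RCA above concrete, here is a minimal shell sketch of the migrate-or-skip decision, assuming the virtual "node id" xattr is trusted.glusterfs.node-uuid; the mount path and file name are hypothetical:

```
# Minimal sketch of the tier migrate-or-skip decision described in the RCA.
# Assumes the "node id" virtual xattr is trusted.glusterfs.node-uuid;
# /mnt/vol_rep and file1 are hypothetical.

MOUNT=/mnt/vol_rep
FILE="$MOUNT/file1"

# UUID of the node this (hypothetical) tier process runs on.
MY_UUID=$(gluster system:: uuid get | awk '{print $2}')

# Ask which node "owns" migration of this file. AFR winds this request
# to its first up child, which is the crux of the bug: if that child
# holds the bad copy, the returned uuid never matches the node that
# actually sees the I/O, so neither node claims the migration.
OWNER_UUID=$(getfattr --only-values -n trusted.glusterfs.node-uuid "$FILE" 2>/dev/null)

if [ "$OWNER_UUID" = "$MY_UUID" ]; then
    echo "this node migrates $FILE"
else
    echo "skipping $FILE, assuming node $OWNER_UUID will migrate it"
fi
```

The EIO suggestion in the replies below would make this getxattr fail on the bad child, so that AFR falls through to a good replica.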
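And here is a hedged sketch of the recovery procedure mentioned near the top of this report (delete the bad copy plus its gfid hard link from the brick backend, then trigger a heal); the brick path is taken from the volume info above, file1 is hypothetical, and the commands must run as root on the node hosting the bad brick:

```
# Hedged sketch: remove the bad copy and its .glusterfs gfid hard link
# from the brick backend, then trigger a self-heal. file1 is hypothetical.

BRICK=/bricks/brick3/b2
F="$BRICK/file1"

# trusted.gfid is stored as a hex xattr on the backend file; strip the
# leading 0x and re-insert dashes to build the .glusterfs path.
HEX=$(getfattr --only-values -e hex -n trusted.gfid "$F" | cut -c3-)
UUID=$(echo "$HEX" | sed 's/^\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)/\1-\2-\3-\4-/')

rm -f "$F" "$BRICK/.glusterfs/${UUID:0:2}/${UUID:2:2}/$UUID"

# Trigger a full heal so the good replica is copied back.
gluster volume heal vol_rep full
```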
(In reply to Mohammed Rafi KC from comment #12)
> RCA: [...]

Returning EIO for the node-uuid getxattr() should cause AFR to query the other replicas.

(In reply to Mohammed Rafi KC from comment #12)
> RCA: [...]

[ignore comment #14, submitted too quickly]

Returning EIO for the node-uuid getxattr() should cause AFR to query the other replicas. If not, then that looks like a fix for this. But on the other hand, it looks like geo-replication might currently have a similar side effect: the changelog on one of the replicas would not record the data operation, and if that node happens to be the "active" node (the one that synchronizes the data), such objects would be skipped for replication to the slave.

Upstream Patch: http://review.gluster.org/13116
Downstream Patch: https://code.engineering.redhat.com/gerrit/#/c/64733/

Verified and works fine with build glusterfs-3.7.5-15.el7rhgs.x86_64. When a file is marked as bad on one brick of a replica pair, doing I/O from the mount point promotes the good file to the hot tier when bitrot is enabled on the volume.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html
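For reference, the verification described above can be approximated with a sketch like the following; the volume name and brick paths come from the volume info earlier in the report, file1 is hypothetical, and each backend command must run on the node that hosts the brick:

```
# Hedged sketch of the verification flow; file1 is hypothetical.

MOUNT=/mnt/vol_rep

# Corrupt one replica copy directly on the brick backend (never via the
# mount) so the scrubber flags it as bad.
echo "garbage" >> /bricks/brick3/b2/file1

# Wait for the next scrubber run (features.scrub-freq is 'hourly' in the
# volume options above) to mark the copy as corrupted.

# Heat the file from the mount point; the good copy on the other replica
# should be picked for promotion.
for i in $(seq 1 20); do
    cat "$MOUNT/file1" > /dev/null
done

# With the fix, the file should appear on the hot tier bricks, e.g.:
ls -l /bricks/brick4/h2/file1
```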