Bug 1283507 - File which is marked as bad gets promoted to hot tier
Summary: File which is marked as bad gets promoted to hot tier
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kotresh HR
QA Contact: krishnaram Karthick
URL:
Whiteboard: tier-interops
Depends On:
Blocks: 1268895 1314168
 
Reported: 2015-11-19 07:59 UTC by RamaKasturi
Modified: 2018-11-08 18:36 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Corrupted files can be identified for promotion and promoted to hot tier storage. In rare circumstances, corruption can be missed by the BitRot scrubber. This can happen in two ways:
1. A file is corrupted before its checksum is created, so that the checksum matches the corrupted file, and the BitRot scrubber does not mark the file as corrupted.
2. A checksum is created for a healthy file, the file becomes corrupted, and the corrupted file is not compared to its checksum before being identified for promotion and promoted to the hot tier, where a new (corrupted) checksum is created.
When tiering is in use, these unidentified corrupted files can be 'heated' and selected for promotion to the hot tier. If a corrupted file is migrated to the hot tier, and the hot tier is not replicated, the corrupted file cannot be accessed or migrated back to the cold tier.
Clone Of:
: 1314168 (view as bug list)
Environment:
Last Closed: 2018-11-08 18:36:34 UTC
Target Upstream Version:



Description RamaKasturi 2015-11-19 07:59:11 UTC
Description of problem:
A file that is corrupted and marked as bad by the scrubber gets promoted to the hot tier.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-6.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a tiered volume and enable bitrot.
2. FUSE mount the volume and create some data.
3. Schedule the scrubber frequency as hourly.
4. Edit the file from the backend so that the scrubber can identify it as a bad file.
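
The steps above can be sketched as a dry-run helper that only builds and prints the commands, without executing anything. This is a rough sketch and not taken from the report: the volume name, server, brick paths and mount point are all hypothetical, and the exact tier-attach syntax may differ between glusterfs releases.

```python
import shlex


def repro_commands(vol, server, cold_bricks, hot_bricks, mountpoint):
    """Return the reproduction steps as argv lists (nothing is executed)."""
    return [
        # Step 1: create a volume, attach a hot tier, enable bitrot.
        ["gluster", "volume", "create", vol, *cold_bricks],
        ["gluster", "volume", "start", vol],
        ["gluster", "volume", "attach-tier", vol, *hot_bricks],
        ["gluster", "volume", "bitrot", vol, "enable"],
        # Step 2: FUSE mount the volume (data is then created under mountpoint).
        ["mount", "-t", "glusterfs", f"{server}:/{vol}", mountpoint],
        # Step 3: run the scrubber hourly.
        ["gluster", "volume", "bitrot", vol, "scrub-frequency", "hourly"],
        # Step 4 is manual: modify the file directly under a brick path,
        # bypassing the mount, so the stored signature no longer matches.
    ]


# All names below are placeholders chosen for illustration only.
cmds = repro_commands(
    vol="tiervol",
    server="server1",
    cold_bricks=["server1:/bricks/cold1", "server2:/bricks/cold2"],
    hot_bricks=["server1:/bricks/hot1", "server2:/bricks/hot2"],
    mountpoint="/mnt/tiervol",
)
for argv in cmds:
    print(shlex.join(argv))
```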

Actual results:
File which is marked as bad gets promoted to hot tier.

Expected results:
File which is marked as bad should not get promoted to hot tier.

Additional info:

Comment 2 RamaKasturi 2015-11-19 09:44:03 UTC
sos reports can be found at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1283507/

Comment 3 Joseph Elwin Fernandes 2015-11-24 12:57:19 UTC
1) IMHO Bad files should never be promoted. And if they are in the hot tier they should be demoted. Venky your thoughts on this.

2) As far as the bug is concerned the reason a bad file is not getting promoted is that bit rot will not serve a bad file to any clients. Tiering migrator is a client by itself. This should be true for files that are in the hot tier, that the bad files are not getting demoted. Now this should be the bug. Bad files shouldn't get the prime storage.

Comment 4 Joseph Elwin Fernandes 2015-11-24 13:15:49 UTC
3) Oh ya .. CTR shouldn't heat up bad files :)

Comment 5 Venky Shankar 2015-11-24 13:26:32 UTC
(In reply to Joseph Elwin Fernandes from comment #3)
> 1) IMHO Bad files should never be promoted. And if they are in the hot tier
> they should be demoted. Venky your thoughts on this.

Well, once an object is marked bad, any access is disallowed. So, promoting/demoting should be out of the question. Furthermore, promoting/demoting a bad object might screw up things - once the corrupted object gets migrated, it's signed again, but now with bad data. So, one needs to be extra careful here.

> 
> 2) As far as the bug is concerned the reason a bad file is not getting
> promoted is that bit rot will not serve a bad file to any clients. Tiering
> migrator is a client by itself. This should be true for files that are in
> the hot tier, that the bad files are not getting demoted. Now this should be
> the bug. Bad files shouldn't get the prime storage.

The bug description says "file get promoted". It shouldn't get migrated at all in either direction. You're correct in saying that corrupted objects shouldn't get prime storage, but you'd need to recover them anyway. So, there's less point in moving such objects to the cold tier and then initiate recovery - they can and _should_ be recovered from where ever it lives now given that there's a replica to recover from.

Coming to the bug - did the file get migrated by any chance before it was scanned by scrubber?

Comment 6 Joseph Elwin Fernandes 2015-11-24 13:41:07 UTC
(In reply to Venky Shankar from comment #5)
> The bug description says "file get promoted". It shouldn't get migrated at
> all in either direction. You're correct in saying that corrupted objects
> shouldn't get prime storage, but you'd need to recover them anyway. So,
> there's less point in moving such objects to the cold tier and then initiate
> recovery - they can and _should_ be recovered from where ever it lives now
> given that there's a replica to recover from.

OOPS! missed that. Valid point on recovering and then moving. 
> 
> Coming to the bug - did the file get migrated by any chance before it was
> scanned by scrubber?

Comment 7 Joseph Elwin Fernandes 2015-11-24 13:43:46 UTC
> Coming to the bug - did the file get migrated by any chance before it was
> scanned by scrubber?

Is it possible that a file is picked for migration and, during the migration, the scrubber marks it bad?

Comment 8 RamaKasturi 2015-11-24 13:48:06 UTC
(In reply to Venky Shankar from comment #5)
> Coming to the bug - did the file get migrated by any chance before it was
> scanned by scrubber?

This might be one of the reasons why the file was promoted.

Comment 9 RamaKasturi 2015-11-24 13:50:29 UTC
(In reply to Joseph Elwin Fernandes from comment #6)
> > Coming to the bug - did the file get migrated by any chance before it was
> > scanned by scrubber?

No, there is no chance, because there was no I/O going on at the mount point, nor was the file accessed.

Comment 10 Venky Shankar 2015-11-24 13:56:29 UTC
(In reply to RamaKasturi from comment #9)
> No, there is no chance because there was no I/O going on the mount point nor
> the file was accessed.

Then how did it become a candidate for migration? If the file got migrated, then it needs to be accessed or modified -- which is disallowed when a file is marked corrupted.

So, either the file was not marked bad at all or it got migrated before scrubber could perform integrity checks.

Comment 11 RamaKasturi 2015-11-24 14:01:51 UTC
(In reply to Venky Shankar from comment #10)
> Then how did it become a candidate for migration? If the file got migrated,
> then it needs to be accessed or modified -- which is disallowed when a file
> is marked corrupted.
> 
> So, either the file was not marked bad at all or it got migrated before
> scrubber could perform integrity checks.

As per my understanding, when the scrubber scanned the files (BZ 1283505), the files got heated up, and this is how the file was moved to the hot tier.

Comment 12 Venky Shankar 2015-11-24 15:09:11 UTC
(In reply to RamaKasturi from comment #11)
> As per my understanding, When scrubber scanned the files (BZ 1283505), files
> got heated up and this is how the file was moved to hot tier.

Which build is this? There was a patch to avoid migrating files when the scrubber acts on them.

Comment 13 Venky Shankar 2015-11-24 16:31:21 UTC
(In reply to Joseph Elwin Fernandes from comment #7)
> > Coming to the bug - did the file get migrated by any chance before it was
> > scanned by scrubber?
> 
> Is it possible, that a file is picked for migration and during the migration
> the scrubber marks it bad ?

If this happens, then the next I/O on the corrupted objects would result in EIO. In case the object is marked corrupted just after the migration is done (with the source file just about to be purged), then a corrupted object is migrated. This is due to the asynchronous nature of signing and scrubbing.
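
The two orderings described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not GlusterFS code: `Obj`, `migrate` and the chunked copy are hypothetical stand-ins, and the only behaviour taken from this thread is that I/O on an object marked bad fails with EIO.

```python
import errno


class Obj:
    """Toy object: some bytes plus the 'bad' flag set by the scrubber."""

    def __init__(self, data):
        self.data = data
        self.bad = False

    def read(self):
        if self.bad:
            # Access to an object marked bad is refused.
            raise OSError(errno.EIO, "object marked bad")
        return self.data


def migrate(src, mark_bad_at_chunk=None, chunk=4):
    """Copy src chunk by chunk; optionally let the scrubber mark it bad mid-copy."""
    copied = bytearray()
    for i in range(0, len(src.data), chunk):
        if mark_bad_at_chunk is not None and i // chunk == mark_bad_at_chunk:
            src.bad = True  # scrubber finishes while migration is in flight
        copied += src.read()[i:i + chunk]
    return Obj(bytes(copied))


# Ordering 1: marked bad during migration, so the next read fails with EIO.
src = Obj(b"corrupted-bytes")
try:
    migrate(src, mark_bad_at_chunk=1)
    mid_migration = "migrated"
except OSError as e:
    mid_migration = errno.errorcode[e.errno]

# Ordering 2: marked bad only after the copy completed.
src2 = Obj(b"corrupted-bytes")
dst2 = migrate(src2)
src2.bad = True  # too late: the destination already holds the corrupted bytes
after_migration = (dst2.data, dst2.bad)

print(mid_migration, after_migration)
```

In the second ordering the destination copy carries the corrupted data but no bad marker, which is the window the comment attributes to asynchronous signing and scrubbing.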

Comment 14 Joseph Elwin Fernandes 2015-11-24 17:56:19 UTC
Apart from the scrubber heating the file, which is fixed in the next build, the migration of bad files is a known issue due to the asynchronous nature of signing and scrubbing, as Venky has pointed out. We need to document this in the tiering interop section.

Comment 15 Venky Shankar 2015-11-25 04:12:21 UTC
(In reply to Joseph Elwin Fernandes from comment #14)
> Apart from the scrubber heating the file which is fixed in the next build.
> The migration of Bad files is a known issue due to the asynchronous nature
> of signing and scrubbing, as Venky has pointed out. We need to document this
> in tiering interop section.

Just to be more specific, there are a couple of cases here (jotting them down for ease of documentation):

1. Object is corrupted before it could be signed: In this case, the corrupted object is signed and gets migrated upon I/O. There's no way to identify corruption for this set of objects.

2. Object is signed (but not scrubbed) and corruption happens thereafter: In this case, as of now, integrity checking is not done on the fly and the object would get migrated (and signed again in the hot tier). This can be fixed and is being tracked in bz #1227672.

But, #1 is the gap where corruption can sneak in.

Comment 20 Bhaskarakiran 2016-02-08 09:19:13 UTC
The doc text doesn't seem to be correct. Please check with Venky and make the needed changes.

Comment 21 Joseph Elwin Fernandes 2016-02-09 05:45:23 UTC
A file will not be marked healthy. This bug speaks about how corruption can sneak in and go undetected.

As Venky pointed out, these are the scenarios.

1. Object is corrupted before it could be signed: In this case, the corrupted object is signed and gets migrated upon I/O. There's no way to identify corruption for this set of objects.

2. Object is signed (but not scrubbed) and corruption happens thereafter: In this case, as of now, integrity checking is not done on the fly and the object would get migrated (and signed again in the hot tier). This can be fixed and is being tracked in bz #1227672.

==============================================================================
Just for understanding: BitRot works like this (explained in a very simple way).
STEP 1: A file is signed asynchronously, using an md5 checksum, by the signer whenever data is written to the file.

STEP 2: The scrubber then wakes up on a schedule, regenerates the md5 checksum from the file data, and compares it with the previously stored checksum (from STEP 1, i.e. the signer). If the checksums don't match, the file is marked as BAD/CORRUPTED.

So the above scenarios explain the cases where corruption can go undetected by both the signer and the scrubber.
==============================================================================
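
The sign/scrub cycle and the two undetected scenarios can be sketched as a small simulation. This is an illustrative model only: the `Brick` class and its methods are hypothetical, and SHA-256 from Python's `hashlib` is used where the explanation above says md5, since the choice of hash does not change the logic.

```python
import hashlib


class Brick:
    """Toy stand-in for a brick: file data plus bitrot metadata."""

    def __init__(self):
        self.data = {}       # path -> bytes on disk
        self.signature = {}  # path -> digest stored by the signer
        self.bad = set()     # paths the scrubber has marked corrupted

    def sign(self, path):
        # STEP 1: the signer stores a checksum of whatever is on disk now.
        self.signature[path] = hashlib.sha256(self.data[path]).hexdigest()

    def scrub(self, path):
        # STEP 2: the scrubber recomputes the checksum and compares it.
        current = hashlib.sha256(self.data[path]).hexdigest()
        if current != self.signature[path]:
            self.bad.add(path)
        return path not in self.bad  # True means "looks healthy"


# Detected case: signed while healthy, corrupted later, scrubbed in place.
cold = Brick()
cold.data["f"] = b"good"
cold.sign("f")
cold.data["f"] = b"rot!"
detected = not cold.scrub("f")

# Scenario 1: corrupted before signing, so the signature matches the bad data.
s1 = Brick()
s1.data["f"] = b"rot!"
s1.sign("f")
scenario1_passes = s1.scrub("f")

# Scenario 2: signed, corrupted, then migrated before any scrub ran.
s2_cold, s2_hot = Brick(), Brick()
s2_cold.data["f"] = b"good"
s2_cold.sign("f")
s2_cold.data["f"] = b"rot!"
s2_hot.data["f"] = s2_cold.data["f"]  # migration copies the corrupted bytes
s2_hot.sign("f")                      # hot tier signs the corrupted data afresh
scenario2_passes = s2_hot.scrub("f")

print(detected, scenario1_passes, scenario2_passes)
```

In both scenarios the scrubber compares corrupted data against a signature that was itself computed from corrupted data, so the comparison succeeds.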

Comment 22 Bhaskarakiran 2016-02-09 09:43:17 UTC
Laura, 

Doc text and known issue documentation need to be updated with the https://bugzilla.redhat.com/show_bug.cgi?id=1283507#c21

Comment 26 Kotresh HR 2016-04-18 04:19:56 UTC
Posted Upstream:
http://review.gluster.org/#/c/13969/

Comment 28 hari gowtham 2018-11-08 18:36:34 UTC
As tier is not being actively developed, I'm closing this bug. Feel free to open it if necessary.

