Bug 1381194

Summary: [Bitrot]: 'Total scrubbed files' count off-the-mark, by the number of corruptions detected
Product: Red Hat Gluster Storage    Reporter: Sweta Anandpara <sanandpa>
Component: bitrot    Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED DEFERRED    QA Contact: Sweta Anandpara <sanandpa>
Severity: medium    Docs Contact:
Priority: low
Version: rhgs-3.2    CC: amukherj, atumball, rhs-bugs, sanandpa, storage-qa-internal
Target Milestone: ---    Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2018-10-11 10:15:37 UTC    Type: Bug
Regression: ---    Mount Type: ---
Documentation: ---    CRM:
Verified Versions:    Category: ---
oVirt Team: ---    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---    Target Upstream Version:

Description Sweta Anandpara 2016-10-03 11:37:51 UTC
Description of problem:
=======================
If there are 10 files and 2 corruptions are detected, the field 'Total number of scrubbed files' shows 8 instead of 10. Though not consistently reproducible, I was able to see this behaviour in 2 of my 4 tries. There may be a particular path in the signing-verifying-scrubbing flow where the 'scrubbed files' count is not incremented when an error/corruption is detected.


Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2


How reproducible:
=================
50%


Steps to Reproduce:
==================
1. Have a 4-node cluster with (say) a 2 x 3 volume 'ozone'.
2. Create 4 files and wait until signing takes place.
3. Corrupt a file at the backend and run the scrubber on demand (see the sketch after this list).
4. Verify the scrub status output.
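
A minimal sketch of these steps in shell, assuming the volume 'ozone' from this bug; the client mount point /mnt/ozone is a placeholder, and /bricks/brick1/ozone is one of the brick paths shown further below:

# Enable bitrot on the volume (skip if it is already enabled)
gluster volume bitrot ozone enable

# Create a few files from a client mount; the signer signs them a short
# while after they are closed (look for trusted.bit-rot.signature)
for i in 1 2 3 4; do dd if=/dev/urandom of=/mnt/ozone/test$i bs=1M count=1; done
getfattr -d -m . -e hex /bricks/brick1/ozone/*

# Corrupt one file directly on a brick, bypassing the client
echo garbage >> /bricks/brick1/ozone/test3

# Trigger an on-demand scrub and check the per-node counters
gluster volume bitrot ozone scrub ondemand
gluster volume bitrot ozone scrub status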

Actual results:
==============
Corruption is detected. The node on which the corruption is detected shows 'Number of Scrubbed files' as '3'.


Expected results:
================
The field 'Number of Scrubbed files' should show the value '4'.


Additional info:
================

[root@dhcp35-104 ozone]# gluster v bitrot ozone scrub ondemand
volume bitrot: success
[root@dhcp35-104 ozone]# gluster volume bitrot ozone scrub status

Volume name : ozone

State of scrub: Active (In Progress)

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: localhost

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2016-10-03 10:55:44

Duration of last scrub (D:M:H:M:S): 0:0:0:2

Error count: 0


=========================================================

Node: dhcp35-101.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2016-10-03 10:55:45

Duration of last scrub (D:M:H:M:S): 0:0:0:4

Error count: 0


=========================================================

Node: 10.70.35.100

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2016-10-03 10:55:44

Duration of last scrub (D:M:H:M:S): 0:0:0:3

Error count: 1

Corrupted object's [GFID]:

5793c862-a24c-41ab-b708-64de6da40ba0


=========================================================

Node: dhcp35-115.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2016-10-03 10:55:45

Duration of last scrub (D:M:H:M:S): 0:0:0:3

Error count: 0

=========================================================

[root@dhcp35-104 ozone]# 
[root@dhcp35-104 ozone]# 
[root@dhcp35-104 ozone]# 
[root@dhcp35-104 ozone]# gluster volume bitrot ozone scrub status

Volume name : ozone

State of scrub: Active (Idle)

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: localhost

Number of Scrubbed files: 1

Number of Skipped files: 0

Last completed scrub time: 2016-10-03 10:55:55

Duration of last scrub (D:M:H:M:S): 0:0:0:2

Error count: 0


=========================================================

Node: dhcp35-101.lab.eng.blr.redhat.com

Number of Scrubbed files: 3

Number of Skipped files: 0

Last completed scrub time: 2016-10-03 10:55:57

Duration of last scrub (D:M:H:M:S): 0:0:0:4

Error count: 0


=========================================================

Node: dhcp35-115.lab.eng.blr.redhat.com

Number of Scrubbed files: 4

Number of Skipped files: 0

Last completed scrub time: 2016-10-03 10:55:57

Duration of last scrub (D:M:H:M:S): 0:0:0:3

Error count: 0


=========================================================

Node: 10.70.35.100

Number of Scrubbed files: 3               >>>>>>>>>>>> THIS COUNT SHOULD HAVE BEEN '4' AND NOT '3'  >>>>>>>>>>>>>>>>>>>>

Number of Skipped files: 0

Last completed scrub time: 2016-10-03 10:55:56

Duration of last scrub (D:M:H:M:S): 0:0:0:3

Error count: 1

Corrupted object's [GFID]:

5793c862-a24c-41ab-b708-64de6da40ba0

=========================================================

[root@dhcp35-104 ozone]# gluster v bitrot ozone scrub-frequency minute
volume bitrot: success
[root@dhcp35-104 ozone]#
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# gluster v info ozone
 
Volume Name: ozone
Type: Distributed-Replicate
Volume ID: a9540783-ab5d-4b58-9c69-e81dbe986559
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.115:/bricks/brick1/ozone
Brick2: 10.70.35.100:/bricks/brick1/ozone
Brick3: 10.70.35.101:/bricks/brick1/ozone
Brick4: 10.70.35.115:/bricks/brick2/ozone
Brick5: 10.70.35.100:/bricks/brick2/ozone
Brick6: 10.70.35.104:/bricks/brick2/ozone
Options Reconfigured:
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.scrub-freq: minute
auto-delete: disable
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# ls /bricks/brick1/ozone
test1  test2  test5
[root@dhcp35-100 ozone]# ls /bricks/brick2/ozone
test3
[root@dhcp35-100 ozone]# getfattr -d -m . -e hex /bricks/brick1/ozone/*
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/ozone/test1
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.bad-file=0x3100
trusted.bit-rot.signature=0x0102000000000000008304374c403af0879931a6be4dbf25ea735955b67e6fd2ba5bed527cd1836af3
trusted.bit-rot.version=0x020000000000000057f232d10009e251
trusted.gfid=0xa9384c2593fd412da8b1442d0f3c16c4

# file: bricks/brick1/ozone/test2
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.signature=0x0102000000000000008304374c403af0879931a6be4dbf25ea735955b67e6fd2ba5bed527cd1836af3
trusted.bit-rot.version=0x020000000000000057f232d10009e251
trusted.gfid=0x9b0ba6524b6d4a4497916b0d42a4efae

# file: bricks/brick1/ozone/test5
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.signature=0x0102000000000000008304374c403af0879931a6be4dbf25ea735955b67e6fd2ba5bed527cd1836af3
trusted.bit-rot.version=0x020000000000000057f232d10009e251
trusted.gfid=0x2ea33905a051451aae74d4ddbb28726e

[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# getfattr -d -m . -e hex /bricks/brick2/ozone/*
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/ozone/test3
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.bad-file=0x3100
trusted.bit-rot.signature=0x0102000000000000008304374c403af0879931a6be4dbf25ea735955b67e6fd2ba5bed527cd1836af3
trusted.bit-rot.version=0x020000000000000057f232d1000a5a36
trusted.gfid=0x5793c862a24c41abb70864de6da40ba0

[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# gluster peer status
Number of Peers: 3

Hostname: dhcp35-115.lab.eng.blr.redhat.com
Uuid: 6ac165c0-317f-42ad-8262-953995171dbb
State: Peer in Cluster (Connected)

Hostname: 10.70.35.104
Uuid: 10335359-1c70-42b2-bcce-6215a973678d
State: Peer in Cluster (Connected)

Hostname: dhcp35-101.lab.eng.blr.redhat.com
Uuid: a3bd23b9-f70a-47f5-9c95-7a271f5f1e18
State: Peer in Cluster (Connected)
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# rpm -qa| grep gluster
glusterfs-fuse-3.8.4-2.el6rhs.x86_64
glusterfs-ganesha-3.8.4-2.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
glusterfs-libs-3.8.4-2.el6rhs.x86_64
glusterfs-devel-3.8.4-2.el6rhs.x86_64
glusterfs-rdma-3.8.4-2.el6rhs.x86_64
glusterfs-api-3.8.4-2.el6rhs.x86_64
glusterfs-api-devel-3.8.4-2.el6rhs.x86_64
glusterfs-debuginfo-3.8.4-1.el6rhs.x86_64
glusterfs-client-xlators-3.8.4-2.el6rhs.x86_64
glusterfs-cli-3.8.4-2.el6rhs.x86_64
glusterfs-geo-replication-3.8.4-2.el6rhs.x86_64
glusterfs-3.8.4-2.el6rhs.x86_64
gluster-nagios-addons-0.2.8-1.el6rhs.x86_64
nfs-ganesha-gluster-2.3.1-8.el6rhs.x86_64
python-gluster-3.8.4-2.el6rhs.noarch
glusterfs-events-3.8.4-2.el6rhs.x86_64
vdsm-gluster-4.16.30-1.5.el6rhs.noarch
glusterfs-server-3.8.4-2.el6rhs.x86_64
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]# 
[root@dhcp35-100 ozone]#

Comment 2 Sweta Anandpara 2016-10-03 11:40:35 UTC
Will update this space with the exact sequence of steps to reproduce it.

Comment 3 Kotresh HR 2016-10-24 10:59:45 UTC
Analysis:

This is not actually a bug. A little more understanding of the scrub status counters helps here.

Number of Scrubbed files:
   The number of files the scrubber actually scrubbed in this scrub cycle. Scrubbing means calculating the checksum in memory, comparing it with the on-disk signature, and verifying whether the file is corrupted. Files that are already marked bad, and files that have not yet been signed, are not counted.

Number of Skipped files:
   The number of files for which signing had not happened by the time of scrubbing. It does not include bad files.

With the above definitions in mind, let's analyse the possible outputs. Say
4 files are present, as in this bug.

1. One file is corrupted. Assume signing is done for all files.

    > On the first scrub run after the file is corrupted:
          Number of scrubbed files: 4
          Number of skipped files: 0
          Error: 1
    > On the second scrub run, while the file is still corrupted:
          Number of scrubbed files: 3
          Number of skipped files: 0
          Error: 1

    In the second run the scrubber did not actually scrub the bad file, hence
    the count of 3 is correct.

2. Assume the corrupted file has one hardlink.
   > On the first scrub run after the file is corrupted:
          Number of scrubbed files: 3
          Number of skipped files: 0
          Error: 1
     This is because, while scrubbing the hardlink, the scrubber finds it already bad and does not scrub it.
   > On the second and further runs, while the file is still corrupted:
          Number of scrubbed files: 2
          Number of skipped files: 0
          Error: 1
      The error count stays at 1 because hardlinks are not counted separately.


Please test this again with the above understanding, and close this as not a bug if the counts match the definitions above.
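
For reference, a minimal shell sketch of the counting rule described above, run against a single brick. This is only an illustration of the definitions, not the actual scrubber implementation; it models a scrub cycle after a file has already been marked bad, the brick path is a placeholder, and the xattr names are the ones visible in the getfattr output earlier in this bug:

# Count files on one brick the way the definitions above describe the counters
scrubbed=0; skipped=0; errors=0
for f in /bricks/brick1/ozone/*; do
    if getfattr -n trusted.bit-rot.bad-file "$f" >/dev/null 2>&1; then
        # Already marked bad: not re-scrubbed, only reflected in the error count
        errors=$((errors + 1))
        continue
    fi
    if ! getfattr -n trusted.bit-rot.signature "$f" >/dev/null 2>&1; then
        # Not yet signed: cannot be verified, so it counts as skipped
        skipped=$((skipped + 1))
        continue
    fi
    # Signed and not marked bad: the checksum gets recomputed and compared,
    # so the file counts as scrubbed
    scrubbed=$((scrubbed + 1))
done
echo "Scrubbed: $scrubbed  Skipped: $skipped  Errors: $errors"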

Comment 4 Sweta Anandpara 2016-10-24 11:30:49 UTC
Partly agree with comment 3. 

The real meaning of 'scrub' may well be what is described. It just does not seem intuitive from the end-user perspective, nor simple enough for the end user to understand.

If 'Number of scrubbed files' is anything less than the total number of files, it gives the impression that one or more files were skipped. The user would then expect that count to show up under 'skipped files', which it does not, and would be misled into thinking that a file was neither scrubbed nor skipped.

Either way, it would be less of a hassle if the 'scrubbed files' count were synonymous with the count of total files considered.

Comment 6 Sweta Anandpara 2016-11-21 07:36:55 UTC
Additional testing was done to verify whether the behaviour has been the same in past releases. The very first scrub cycle shows the correct count of 'files scrubbed'. Every subsequent cycle shows the number of files scrubbed reduced by the number of corruptions detected (on that particular node).

This behaviour, despite being present earlier, has come to the fore now that a 'scrub ondemand' option is available.

Comment 11 Amar Tumballi 2018-10-11 10:15:37 UTC
Marking as deferred, for two reasons:

1. This bug has not been a blocker in the last 2 years.
2. BitRot is not a focus right now, and we haven't faced any customer issues on this component.

Reopen if anything changes, or if the component becomes important!
Reopen if anything changes, or component becomes important!