Description of problem:
=======================
If there are 10 files and 2 corruptions are detected, the field 'Total number of scrubbed files' shows 8 instead of 10. Though not consistently reproducible, this behaviour was seen in 2 out of 4 tries. There might be a particular path of signing-verifying-scrubbing in which the 'scrubbed files' count is not incremented when an error/corruption is detected.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2

How reproducible:
=================
50%

Steps to Reproduce:
===================
1. Have a 4-node cluster with (say) a 2x3 volume 'ozone'.
2. Create 4 files and wait until signing takes place.
3. Corrupt a file at the backend and run the scrubber on demand (a command sketch follows the session log below).
4. Verify the scrub status output.

Actual results:
===============
Corruption is detected. The node on which the corruption is detected shows 'Number of Scrubbed files' as '3'.

Expected results:
=================
The field 'Number of Scrubbed files' should show the value '4'.

Additional info:
================
[root@dhcp35-104 ozone]# gluster v bitrot ozone scrub ondemand
volume bitrot: success
[root@dhcp35-104 ozone]# gluster volume bitrot ozone scrub status

Volume name : ozone
State of scrub: Active (In Progress)
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log

=========================================================
Node: localhost
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2016-10-03 10:55:44
Duration of last scrub (D:M:H:M:S): 0:0:0:2
Error count: 0

=========================================================
Node: dhcp35-101.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2016-10-03 10:55:45
Duration of last scrub (D:M:H:M:S): 0:0:0:4
Error count: 0

=========================================================
Node: 10.70.35.100
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2016-10-03 10:55:44
Duration of last scrub (D:M:H:M:S): 0:0:0:3
Error count: 1
Corrupted object's [GFID]: 5793c862-a24c-41ab-b708-64de6da40ba0

=========================================================
Node: dhcp35-115.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2016-10-03 10:55:45
Duration of last scrub (D:M:H:M:S): 0:0:0:3
Error count: 0
=========================================================

[root@dhcp35-104 ozone]# gluster volume bitrot ozone scrub status

Volume name : ozone
State of scrub: Active (Idle)
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log

=========================================================
Node: localhost
Number of Scrubbed files: 1
Number of Skipped files: 0
Last completed scrub time: 2016-10-03 10:55:55
Duration of last scrub (D:M:H:M:S): 0:0:0:2
Error count: 0

=========================================================
Node: dhcp35-101.lab.eng.blr.redhat.com
Number of Scrubbed files: 3
Number of Skipped files: 0
Last completed scrub time: 2016-10-03 10:55:57
Duration of last scrub (D:M:H:M:S): 0:0:0:4
Error count: 0

=========================================================
Node: dhcp35-115.lab.eng.blr.redhat.com
Number of Scrubbed files: 4
Number of Skipped files: 0
Last completed scrub time: 2016-10-03 10:55:57
Duration of last scrub (D:M:H:M:S): 0:0:0:3
Error count: 0

=========================================================
Node: 10.70.35.100
Number of Scrubbed files: 3    >>>>>>>>>>>> THIS COUNT SHOULD HAVE BEEN '4' AND NOT '3' >>>>>>>>>>>>>>>>>>>>
Number of Skipped files: 0
Last completed scrub time: 2016-10-03 10:55:56
Duration of last scrub (D:M:H:M:S): 0:0:0:3
Error count: 1
Corrupted object's [GFID]: 5793c862-a24c-41ab-b708-64de6da40ba0
=========================================================

[root@dhcp35-104 ozone]# gluster v bitrot ozone scrub-frequency minute
volume bitrot: success
[root@dhcp35-104 ozone]#

[root@dhcp35-100 ozone]# gluster v info ozone

Volume Name: ozone
Type: Distributed-Replicate
Volume ID: a9540783-ab5d-4b58-9c69-e81dbe986559
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.115:/bricks/brick1/ozone
Brick2: 10.70.35.100:/bricks/brick1/ozone
Brick3: 10.70.35.101:/bricks/brick1/ozone
Brick4: 10.70.35.115:/bricks/brick2/ozone
Brick5: 10.70.35.100:/bricks/brick2/ozone
Brick6: 10.70.35.104:/bricks/brick2/ozone
Options Reconfigured:
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.scrub-freq: minute
auto-delete: disable

[root@dhcp35-100 ozone]# ls /bricks/brick1/ozone
test1  test2  test5
[root@dhcp35-100 ozone]# ls /bricks/brick2/ozone
test3
[root@dhcp35-100 ozone]# getfattr -d -m . -e hex /bricks/brick1/ozone/*
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/ozone/test1
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.bad-file=0x3100
trusted.bit-rot.signature=0x0102000000000000008304374c403af0879931a6be4dbf25ea735955b67e6fd2ba5bed527cd1836af3
trusted.bit-rot.version=0x020000000000000057f232d10009e251
trusted.gfid=0xa9384c2593fd412da8b1442d0f3c16c4

# file: bricks/brick1/ozone/test2
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.signature=0x0102000000000000008304374c403af0879931a6be4dbf25ea735955b67e6fd2ba5bed527cd1836af3
trusted.bit-rot.version=0x020000000000000057f232d10009e251
trusted.gfid=0x9b0ba6524b6d4a4497916b0d42a4efae

# file: bricks/brick1/ozone/test5
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.signature=0x0102000000000000008304374c403af0879931a6be4dbf25ea735955b67e6fd2ba5bed527cd1836af3
trusted.bit-rot.version=0x020000000000000057f232d10009e251
trusted.gfid=0x2ea33905a051451aae74d4ddbb28726e

[root@dhcp35-100 ozone]# getfattr -d -m . -e hex /bricks/brick2/ozone/*
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/ozone/test3
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.bad-file=0x3100
trusted.bit-rot.signature=0x0102000000000000008304374c403af0879931a6be4dbf25ea735955b67e6fd2ba5bed527cd1836af3
trusted.bit-rot.version=0x020000000000000057f232d1000a5a36
trusted.gfid=0x5793c862a24c41abb70864de6da40ba0

[root@dhcp35-100 ozone]# gluster peer status
Number of Peers: 3

Hostname: dhcp35-115.lab.eng.blr.redhat.com
Uuid: 6ac165c0-317f-42ad-8262-953995171dbb
State: Peer in Cluster (Connected)

Hostname: 10.70.35.104
Uuid: 10335359-1c70-42b2-bcce-6215a973678d
State: Peer in Cluster (Connected)

Hostname: dhcp35-101.lab.eng.blr.redhat.com
Uuid: a3bd23b9-f70a-47f5-9c95-7a271f5f1e18
State: Peer in Cluster (Connected)

[root@dhcp35-100 ozone]# rpm -qa | grep gluster
glusterfs-fuse-3.8.4-2.el6rhs.x86_64
glusterfs-ganesha-3.8.4-2.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
glusterfs-libs-3.8.4-2.el6rhs.x86_64
glusterfs-devel-3.8.4-2.el6rhs.x86_64
glusterfs-rdma-3.8.4-2.el6rhs.x86_64
glusterfs-api-3.8.4-2.el6rhs.x86_64
glusterfs-api-devel-3.8.4-2.el6rhs.x86_64
glusterfs-debuginfo-3.8.4-1.el6rhs.x86_64
glusterfs-client-xlators-3.8.4-2.el6rhs.x86_64
glusterfs-cli-3.8.4-2.el6rhs.x86_64
glusterfs-geo-replication-3.8.4-2.el6rhs.x86_64
glusterfs-3.8.4-2.el6rhs.x86_64
gluster-nagios-addons-0.2.8-1.el6rhs.x86_64
nfs-ganesha-gluster-2.3.1-8.el6rhs.x86_64
python-gluster-3.8.4-2.el6rhs.noarch
glusterfs-events-3.8.4-2.el6rhs.x86_64
vdsm-gluster-4.16.30-1.5.el6rhs.noarch
glusterfs-server-3.8.4-2.el6rhs.x86_64
[root@dhcp35-100 ozone]#
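Command sketch for steps 2-4 above. The volume name, brick path, and gluster commands are taken from the session log; the mount point (/mnt/ozone), the 120-second wait for signing, and the exact way of corrupting a file at the backend are assumptions, not a confirmed procedure.

# 2. Create the test files on the client mount and wait for the bitrot daemon
#    to sign them (signing is asynchronous; the wait time is an assumption).
for i in 1 2 3 5; do dd if=/dev/urandom of=/mnt/ozone/test$i bs=1M count=1; done
sleep 120

# 3. Corrupt one file directly on a brick (bypassing the client), then trigger
#    an on-demand scrub.
echo "corruption" >> /bricks/brick2/ozone/test3
gluster volume bitrot ozone scrub ondemand

# 4. Check the per-node counts once the scrub cycle goes idle.
gluster volume bitrot ozone scrub status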
Will update this space with the exact sequence of steps needed to reproduce it.
Analysis: This is not actually a bug. A little more understanding of scrub status helps here.

Number of Scrubbed files: the number of files the scrubber actually scrubbed in the scrub cycle. Scrubbing means computing the checksum in memory, comparing it with the on-disk signature, and verifying whether the file is corrupted. A file is not counted here if it is already marked bad, or if it has not yet been signed.

Number of Skipped files: the number of files for which signing had not yet happened at the time of scrubbing. Bad files are not counted here either.

With the above in mind, consider the possible outputs. Say 4 files are present, as in this bug.

1. One file is corrupted, and signing has been done for all files.
   > On the first scrub run after the corruption:
     Number of scrubbed files: 4
     Number of skipped files: 0
     Error count: 1
   > On the second run, while the file is still corrupted:
     Number of scrubbed files: 3
     Number of skipped files: 0
     Error count: 1
   In the second run the scrubber does not actually scrub the bad file, hence the count of 3 is correct.

2. The corrupted file has one hardlink.
   > On the first scrub run after the corruption:
     Number of scrubbed files: 3
     Number of skipped files: 0
     Error count: 1
   While scrubbing the hardlink, the scrubber finds the file already marked bad and does not scrub it.
   > On the second and further runs, while the file is still corrupted:
     Number of scrubbed files: 2
     Number of skipped files: 0
     Error count: 1
   The error count stays at 1 because hardlinks are not counted separately.

Please test this again with the above understanding and close this as not a bug if the counts match these definitions.
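One way to see which files the next cycle will leave out of the scrubbed count is to look for the bad-file marker on the bricks. A minimal sketch, assuming the brick paths from the session log above; the xattr name is the one visible in the getfattr output.

# Files already carrying trusted.bit-rot.bad-file are treated as bad and are
# neither scrubbed nor skipped on subsequent cycles, so they drop out of the
# 'Number of Scrubbed files' total.
for f in /bricks/brick1/ozone/* /bricks/brick2/ozone/*; do
    if getfattr -n trusted.bit-rot.bad-file "$f" >/dev/null 2>&1; then
        echo "marked bad, will not be re-scrubbed: $f"
    fi
done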
Partly agree with comment 3. The real meaning of 'scrub' may well be what is described there, but it is not intuitive from the end user's perspective, nor simple enough for the end user to grasp. If 'Number of scrubbed files' is anything less than the total number of files, it gives the impression that some file(s) were skipped; the user then expects that count to appear under 'skipped files', which it does not. That misleads one into thinking a file was neither scrubbed nor skipped. Either way, it would be less confusing if the 'scrubbed files' count were synonymous with the count of all files considered.
Additional testing was done to verify whether this behaviour was the same in past releases. The very first scrub cycle shows the correct count of files scrubbed; every subsequent cycle shows the scrubbed-files count reduced by the number of corruptions detected on that particular node. Although this behaviour existed before, it has come to the fore now that a 'scrub ondemand' option is available.
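A quick way to observe the pattern described above. The commands and field names are from the session log; the sleep duration and the grep filter are assumptions about timing and output format.

# First on-demand cycle: the counts still include the file on which the
# corruption is detected.
gluster volume bitrot ozone scrub ondemand
sleep 60   # wait for the cycle to go idle; the exact time is an assumption
gluster volume bitrot ozone scrub status | grep -E 'Node:|Scrubbed files|Error count'

# Second cycle: on the node holding the bad file, the scrubbed count drops by
# the number of files already marked bad.
gluster volume bitrot ozone scrub ondemand
sleep 60
gluster volume bitrot ozone scrub status | grep -E 'Node:|Scrubbed files|Error count'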
Marking as deferred, for a few reasons: this bug has not been a blocker in the last 2 years, BitRot is not a focus right now, and we have not faced any customer issues on this component. Reopen if anything changes or the component becomes important.