Description of problem:
When the user runs "gluster volume bitrot <vol_name> scrub status", the values for Number of Scrubbed files, Number of Unsigned files, Last completed scrub time and Duration of last scrub are shown as zero.

[root@rhs-client24 b7]# gluster vol bitrot vol1 scrub status
Volume name : vol1
State of scrub: Active
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node name: localhost
Number of Scrubbed files: 0
Number of Unsigned files: 0
Last completed scrub time: 0
Duration of last scrub: 0
Error count: 1
Corrupted object's:
522a6019-bde9-4963-b1d6-bbdaf27c788a
=========================================================
Node name: 10.70.36.57
Number of Scrubbed files: 0
Number of Unsigned files: 0
Last completed scrub time: 0
Duration of last scrub: 0
Error count: 0
=========================================================

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-7.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install the latest build.
2. Create a volume and enable bitrot on it.
3. Run "gluster volume bitrot <vol_name> scrub status".

Actual results:
The values for Number of Scrubbed files, Number of Unsigned files, Last completed scrub time and Duration of last scrub are shown as zero.

Expected results:
Actual values should be shown for Number of Scrubbed files, Number of Unsigned files, Last completed scrub time and Duration of last scrub, not zeros.

Additional info:
http://review.gluster.org/#/c/12776/ is the upstream patch for review.
downstream patch available: https://code.engineering.redhat.com/gerrit/63789
Hi Gaurav,

   I see some issues related to the bug. Following are my observations.

1) There are only three files in my volume, yet the number of scrubbed files is always more than the total number of files in the volume.

2) I am not sure when there would be files which are unsigned. I always see that the files in my system are signed, and I do not see why we have the field Total Number of unsigned files.

3) On one of my nodes the scrubber finished running at 05:01:12 on 2015-12-22, but scrub status shows the last completed scrub as 2015-12-20.

4) There is always an inconsistency in the scrub status output between the two nodes which form the volume.

output from scrub status:
========================

[root@rhs-client2 ~]# gluster vol bitrot vol_ec scrub status
Volume name : vol_ec
State of scrub: Active
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node name: localhost
Number of Scrubbed files: 275
Number of Unsigned files: 0
Last completed scrub time: 2015-12-20 12:57:22
Duration of last scrub: 1:18:2:37
Error count: 0
=========================================================
Node name: 10.70.36.62
Number of Scrubbed files: 45
Number of Unsigned files: 0
Last completed scrub time: 2015-12-22 05:23:32
Duration of last scrub: 0:5:1:7
Error count: 0
=========================================================

Thanks
kasturi
Hi Gaurav,

    As per what you said, a correct value for Total no. of Unsigned files will get displayed only when the scrubber sees that a file is unsigned. But the scrubber does not do anything with files that do not even have a signature on them. Should this field be renamed to 'Total number of unscrubbed files'?

Thanks
kasturi
Moving this bug back because of the below issues.

1) The number of scrubbed files is always more than the number of files present in the system. Whenever the scrubber runs it just appends the number of files scrubbed to the existing count, because of which the number always gets doubled.

2) Duration of last scrub is shown as "Duration of last scrub: 16791:7:21:12"; not sure what 16791 means here.

3) From the log below it is clearly seen that the last scrub happened at 09:46, but scrub status shows 'Duration of last scrub as 2015-12-22 07:21:12'.

Log file:
=========

[2015-12-22 09:46:53.681552] I [MSGID: 118044] [bit-rot-scrub.c:654:br_fsscanner_log_time] 0-vol_rep-bit-rot-0: Scrubbing "/bricks/brick8/b2" started at 2015-12-22 09:46:53
[2015-12-22 09:46:57.693594] I [MSGID: 118045] [bit-rot-scrub.c:659:br_fsscanner_log_time] 0-vol_rep-bit-rot-0: Scrubbing "/bricks/brick8/b2" finished at 2015-12-22 09:46:57
[2015-12-22 09:46:57.693621] I [MSGID: 118038] [bit-rot-scrub.c:918:br_fsscan_activate] 0-vol_rep-bit-rot-0: Scrubbing for /bricks/brick8/b2 rescheduled to run at 2015-12-22 10:46:57

scrub status output:
===================

=========================================================
Node name: localhost
Number of Scrubbed files: 12
Number of Unsigned files: 0
Last completed scrub time: 2015-12-22 07:21:12
Duration of last scrub: 16791:7:21:12
Error count: 0
=========================================================
Node name: rhs-client2.lab.eng.blr.redhat.com
Number of Scrubbed files: 3
Number of Unsigned files: 0
Last completed scrub time: 2015-12-22 10:01:44
Duration of last scrub: 16791:10:1:44
Error count: 0
=========================================================

4) I see that the unsigned file count is always shown as zero. Is this meant for unsigned files or unscrubbed files?
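A note on the 16791 figure in issue 2): 2015-12-22 is exactly 16791 days after the Unix epoch (1970-01-01), which suggests the "Duration of last scrub" field is being formatted from an absolute epoch timestamp rather than from the difference between the scrub's finish and start times. The minimal C sketch below reproduces the symptom; the timestamps and the print_dhms() helper are hypothetical illustrations, not the actual bit-rot-scrub.c code.

#include <stdio.h>
#include <time.h>

/* Format a seconds value as days:hours:minutes:seconds, mirroring the
 * layout of the "Duration of last scrub" field. */
static void print_dhms(const char *label, time_t seconds)
{
    printf("%s: %ld:%ld:%ld:%ld\n", label,
           (long)(seconds / 86400),          /* days    */
           (long)((seconds % 86400) / 3600), /* hours   */
           (long)((seconds % 3600) / 60),    /* minutes */
           (long)(seconds % 60));            /* seconds */
}

int main(void)
{
    /* Hypothetical timestamps for one scrub run (epoch seconds). */
    time_t scrub_start  = 1450768868; /* 2015-12-22 07:21:08 UTC */
    time_t scrub_finish = 1450768872; /* 2015-12-22 07:21:12 UTC */

    /* Buggy pattern: formatting the absolute finish time as a duration
     * prints 16791 days since the epoch plus the time of day. */
    print_dhms("Duration (absolute timestamp, buggy)", scrub_finish);

    /* Correct pattern: format the finish - start difference instead. */
    print_dhms("Duration (finish - start)", scrub_finish - scrub_start);
    return 0;
}

Run against the values above, the first line prints "16791:7:21:12", matching the scrub status output, while the second prints the true four-second duration, "0:0:0:4".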
(In reply to RamaKasturi from comment #6)
> Moving this bug back because of the below issues.
> 
> 1) The number of scrubbed files is always more than the number of files
> present in the system. Whenever the scrubber runs it just appends the
> number of files scrubbed to the existing count, because of which the
> number always gets doubled.
> 
> 2) Duration of last scrub is shown as "Duration of last scrub:
> 16791:7:21:12"; not sure what 16791 means here.
> 
> 3) From the log below it is clearly seen that the last scrub happened at
> 09:46, but scrub status shows 'Duration of last scrub as 2015-12-22
> 07:21:12'.

Replace "Duration of last scrub" with "Last completed scrub time".

> 
> Log file:
> =========
> 
> [2015-12-22 09:46:53.681552] I [MSGID: 118044]
> [bit-rot-scrub.c:654:br_fsscanner_log_time] 0-vol_rep-bit-rot-0: Scrubbing
> "/bricks/brick8/b2" started at 2015-12-22 09:46:53
> [2015-12-22 09:46:57.693594] I [MSGID: 118045]
> [bit-rot-scrub.c:659:br_fsscanner_log_time] 0-vol_rep-bit-rot-0: Scrubbing
> "/bricks/brick8/b2" finished at 2015-12-22 09:46:57
> [2015-12-22 09:46:57.693621] I [MSGID: 118038]
> [bit-rot-scrub.c:918:br_fsscan_activate] 0-vol_rep-bit-rot-0: Scrubbing for
> /bricks/brick8/b2 rescheduled to run at 2015-12-22 10:46:57
> 
> scrub status output:
> ===================
> 
> =========================================================
> Node name: localhost
> Number of Scrubbed files: 12
> Number of Unsigned files: 0
> Last completed scrub time: 2015-12-22 07:21:12
> Duration of last scrub: 16791:7:21:12
> Error count: 0
> =========================================================
> Node name: rhs-client2.lab.eng.blr.redhat.com
> Number of Scrubbed files: 3
> Number of Unsigned files: 0
> Last completed scrub time: 2015-12-22 10:01:44
> Duration of last scrub: 16791:10:1:44
> Error count: 0
> =========================================================
> 
> 4) I see that the unsigned file count is always shown as zero. Is this
> meant for unsigned files or unscrubbed files?
I do see an inconsistency in the duration of the scrub between the nodes which form the volume, even though both scrubbers start running at the same time.

[root@rhs-client2 ~]# gluster vol bitrot vol_ec scrub status
Volume name : vol_ec
State of scrub: Active
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node name: localhost
Number of Scrubbed files: 0
Number of Unsigned files: 0
Last completed scrub time: 2015-12-22 12:34:31
Duration of last scrub: 0:1:0:4
Error count: 0
=========================================================
Node name: 10.70.36.62
Number of Scrubbed files: 0
Number of Unsigned files: 0
Last completed scrub time: 2015-12-22 12:34:32
Duration of last scrub: 0:0:0:16
Error count: 0
=========================================================
(In reply to RamaKasturi from comment #7)
> (In reply to RamaKasturi from comment #6)
> > Moving this bug back because of the below issues.
> > 
> > 1) The number of scrubbed files is always more than the number of files
> > present in the system. Whenever the scrubber runs it just appends the
> > number of files scrubbed to the existing count, because of which the
> > number always gets doubled.
> > 
> > 2) Duration of last scrub is shown as "Duration of last scrub:
> > 16791:7:21:12"; not sure what 16791 means here.

And instead of the duration of the last scrub, it prints the value of the last completed scrub time.

> > 
> > 3) From the log below it is clearly seen that the last scrub happened at
> > 09:46, but scrub status shows 'Duration of last scrub as 2015-12-22
> > 07:21:12'.
> 
> Replace "Duration of last scrub" with "Last completed scrub time".
> 
> > Log file:
> > =========
> > 
> > [2015-12-22 09:46:53.681552] I [MSGID: 118044]
> > [bit-rot-scrub.c:654:br_fsscanner_log_time] 0-vol_rep-bit-rot-0: Scrubbing
> > "/bricks/brick8/b2" started at 2015-12-22 09:46:53
> > [2015-12-22 09:46:57.693594] I [MSGID: 118045]
> > [bit-rot-scrub.c:659:br_fsscanner_log_time] 0-vol_rep-bit-rot-0: Scrubbing
> > "/bricks/brick8/b2" finished at 2015-12-22 09:46:57
> > [2015-12-22 09:46:57.693621] I [MSGID: 118038]
> > [bit-rot-scrub.c:918:br_fsscan_activate] 0-vol_rep-bit-rot-0: Scrubbing for
> > /bricks/brick8/b2 rescheduled to run at 2015-12-22 10:46:57
> > 
> > scrub status output:
> > ===================
> > 
> > =========================================================
> > Node name: localhost
> > Number of Scrubbed files: 12
> > Number of Unsigned files: 0
> > Last completed scrub time: 2015-12-22 07:21:12
> > Duration of last scrub: 16791:7:21:12
> > Error count: 0
> > =========================================================
> > Node name: rhs-client2.lab.eng.blr.redhat.com
> > Number of Scrubbed files: 3
> > Number of Unsigned files: 0
> > Last completed scrub time: 2015-12-22 10:01:44
> > Duration of last scrub: 16791:10:1:44
> > Error count: 0
> > =========================================================
> > 
> > 4) I see that the unsigned file count is always shown as zero. Is this
> > meant for unsigned files or unscrubbed files?
(In reply to RamaKasturi from comment #8)
> I do see an inconsistency in the duration of the scrub between the nodes
> which form the volume, even though both scrubbers start running at the
> same time.
> 
> [root@rhs-client2 ~]# gluster vol bitrot vol_ec scrub status
> Volume name : vol_ec
> State of scrub: Active
> Scrub impact: lazy
> Scrub frequency: hourly
> Bitrot error log location: /var/log/glusterfs/bitd.log
> Scrubber error log location: /var/log/glusterfs/scrub.log
> =========================================================
> Node name: localhost
> Number of Scrubbed files: 0
> Number of Unsigned files: 0
> Last completed scrub time: 2015-12-22 12:34:31
> Duration of last scrub: 0:1:0:4
> Error count: 0
> =========================================================
> Node name: 10.70.36.62
> Number of Scrubbed files: 0
> Number of Unsigned files: 0
> Last completed scrub time: 2015-12-22 12:34:32
> Duration of last scrub: 0:0:0:16
> Error count: 0
> =========================================================

The scrubber runs on a per-node basis, not on a per-volume basis. So if one node of the volume holds more files and another node holds fewer, you can see this behaviour: because there are fewer files on the second node, its scrubber finishes earlier.
Hi Gaurav,

   I have just three files in my volume and no more than that. I think scrubbing three files which are hardly a few bytes each should not take so long.

Thanks
kasturi.
Gaurav,

There are a couple of things that need fixing.

1. The way we reset the scrub statistics is incorrect[1] when there is more than one brick for a volume on a node. The values should be reset by the "first" scanner thread.

2. The way the scrubber updates the finish time is also a bit incorrect[2]. This is a bit hard to get right atm.

Moreover, I think we should move the statistics part to a separate C source file so as to not pollute the scrubber code with (probably complex) statistics updates.

[1]: https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c#L678
[2]: https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c#L621
(In reply to Venky Shankar from comment #12)
> Gaurav,
> 
> There are a couple of things that need fixing.
> 
> 1. The way we reset the scrub statistics is incorrect[1] when there is
> more than one brick for a volume on a node. The values should be reset by
> the "first" scanner thread.
> 
> 2. The way the scrubber updates the finish time is also a bit
> incorrect[2]. This is a bit hard to get right atm.
> 
> Moreover, I think we should move the statistics part to a separate C
> source file so as to not pollute the scrubber code with (probably complex)
> statistics updates.
> 
> [1]:
> https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/
> src/bitd/bit-rot-scrub.c#L678
> [2]:
> https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/
> src/bitd/bit-rot-scrub.c#L621

Yes, that's true. In the current scrubber code multiple threads do the crawling/scrubbing, so we need to come up with a solution where the scrubbed file count, signed file count, scrub time and other values are reset just after the signing/scrubbing threads start. Currently the values are reset based on the total subvolume child count. This might require modifying the existing scrubbing code or some other mechanism. Meanwhile, we can display whichever scrubber statistics values are important and accurate. We can have a separate C source file for the scrubber statistics.
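To make the "reset by the first scanner thread" idea from the two comments above concrete, here is a minimal C sketch. It assumes a scheduler that bumps a per-volume run id once per scrub run; the structure and function names are hypothetical illustrations, not the actual bit-rot-scrub.c code.

#include <pthread.h>

/* Hypothetical per-volume scrub statistics. */
struct scrub_stats {
    pthread_mutex_t lock;
    unsigned long   stats_run_id;   /* run the counters below belong to */
    unsigned long   scrubbed_files;
    unsigned long   unsigned_files;
};

/* Each per-brick scanner thread calls this when its crawl begins, passing
 * the current run id. Only the first thread of a given run resets the
 * counters, so a node with several bricks neither wipes a run's totals
 * midway through nor keeps accumulating counts across runs (the "always
 * doubled" symptom from comment #6). */
static void scrub_stats_begin(struct scrub_stats *stats, unsigned long run_id)
{
    pthread_mutex_lock(&stats->lock);
    if (stats->stats_run_id != run_id) {
        stats->stats_run_id   = run_id; /* first scanner of this run */
        stats->scrubbed_files = 0;
        stats->unsigned_files = 0;
    }
    pthread_mutex_unlock(&stats->lock);
}

/* Scanner threads then bump the counters under the same lock. */
static void scrub_stats_add_scrubbed(struct scrub_stats *stats, unsigned long n)
{
    pthread_mutex_lock(&stats->lock);
    stats->scrubbed_files += n;
    pthread_mutex_unlock(&stats->lock);
}

This sidesteps the current reset-per-subvolume-child-count scheme: the decision to reset depends only on whether the counters already belong to the current run, not on how many bricks the node hosts.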
[removing 3.1.2 tracker]

Review link: http://review.gluster.org/#/c/13105/

As discussed with Alok, the "scrub status" command would provide the list of corrupted objects (and the count). Other statistics would be added later, as currently they are not accurate (the above patch masks out these stats while displaying).
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/64852/
Verified and works fine with build glusterfs-3.7.5-17.el7rhgs.x86_64. The following values are displayed when the user executes "gluster vol bitrot <vol_name> scrub status".

[root@rhs-client38 b2]# gluster vol bitrot vol_rep scrub status
Volume name : vol_rep
State of scrub: Active
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Error count: 0
=========================================================
Node: rhs-client2.lab.eng.blr.redhat.com
Error count: 2
Corrupted object's [GFID]:
ddf706fd-580c-4f62-8ef3-38d55ab9e9f4
87912742-20c0-434f-93d2-e4f09ccf1f12
=========================================================

The values for "Number of Scrubbed files, Number of Unsigned files, Last completed scrub time and Duration of last scrub" are not shown in the scrub status output. Marking this verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html