Bug 1285226 - Masking the wrong values in Bitrot status command
Summary: Masking the wrong values in Bitrot status command
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: bitrot
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.2
Assignee: Bug Updates Notification Mailing List
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks: 1260783 1299737 1329211 1332072
 
Reported: 2015-11-25 09:15 UTC by RamaKasturi
Modified: 2016-09-17 14:22 UTC
CC: 9 users

Fixed In Version: glusterfs-3.7.5-17
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1299737
Environment:
Last Closed: 2016-03-01 05:57:36 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2016:0193 (SHIPPED_LIVE): Red Hat Gluster Storage 3.1 update 2, last updated 2016-03-01 10:20:36 UTC

Description RamaKasturi 2015-11-25 09:15:31 UTC
Description of problem:
When the user runs "gluster volume bitrot <vol_name> scrub status", the values for Number of Scrubbed files, Number of Unsigned files, Last completed scrub time, and Duration of last scrub are shown as zero.

[root@rhs-client24 b7]# gluster vol bitrot vol1 scrub status

Volume name : vol1

State of scrub: Active

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node name: localhost

Number of Scrubbed files: 0

Number of Unsigned files: 0

Last completed scrub time: 0

Duration of last scrub: 0

Error count: 1

Corrupted object's:

522a6019-bde9-4963-b1d6-bbdaf27c788a


=========================================================

Node name: 10.70.36.57

Number of Scrubbed files: 0

Number of Unsigned files: 0

Last completed scrub time: 0

Duration of last scrub: 0

Error count: 0

=========================================================


Version-Release number of selected component (if applicable):
glusterfs-3.7.5-7.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install the latest build
2. Create a volume and enable bitrot on it.
3. Run the command gluster volume bitrot <vol_name> scrub status

Actual results:
Values for Number of Scrubbed files, Number of Unsigned files, Last completed scrub time, and Duration of last scrub are shown as zero.

Expected results:
Actual values should be shown for Number of Scrubbed files, Number of Unsigned files, Last completed scrub time, and Duration of last scrub, not zeros.

Additional info:

Comment 2 Raghavendra Bhat 2015-12-04 07:21:54 UTC
http://review.gluster.org/#/c/12776/ is the upstream patch for review.

Comment 3 Gaurav Kumar Garg 2015-12-15 06:36:52 UTC
Downstream patch available: https://code.engineering.redhat.com/gerrit/63789

Comment 4 RamaKasturi 2015-12-22 05:35:01 UTC
Hi Gaurav,

    I see some issues related to the bug. Following are my observations.

    1) There are only three files in my volume, yet the number of scrubbed files is always more than the total number of files in the volume.

    2) I am not sure when there would be unsigned files. I always see that the files on my system are signed, so I do not see why we have the field Total Number of unsigned files.

    3) On one of my nodes the scrubber finished running at 05:01:12 on 2015-12-22, but scrub status shows the last completed scrub as 2015-12-20.

    4) There is always an inconsistency between the scrub status output of the two nodes that form the volume.

output from scrub status:
========================

[root@rhs-client2 ~]# gluster vol bitrot vol_ec scrub status

Volume name : vol_ec

State of scrub: Active

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node name: localhost

Number of Scrubbed files: 275

Number of Unsigned files: 0

Last completed scrub time: 2015-12-20 12:57:22

Duration of last scrub: 1:18:2:37

Error count: 0


=========================================================

Node name: 10.70.36.62

Number of Scrubbed files: 45

Number of Unsigned files: 0

Last completed scrub time: 2015-12-22 05:23:32

Duration of last scrub: 0:5:1:7

Error count: 0

=========================================================

Thanks
kasturi

Comment 5 RamaKasturi 2015-12-22 06:39:36 UTC
Hi Gaurav,

   As per what you said, a correct value for Total number of Unsigned files will be displayed only when the scrubber sees that a file is unsigned. But the scrubber does not do anything with files that do not even have a signature on them. Should this field be renamed to 'Total number of unscrubbed files'?

Thanks
kasturi

Comment 6 RamaKasturi 2015-12-22 10:07:52 UTC
Moving this bug back because of the issues below.

1) The number of scrubbed files is always more than the number of files present on the system. Whenever the scrubber runs, it simply adds the newly scrubbed file count to the existing count, so the number keeps doubling.

2) Duration of last scrub is shown as "Duration of last scrub: 16791:7:21:12"; I am not sure what 16791 means here (see the note after point 4 below).

3) From the log below it is clearly seen that the last scrub happened at 09:46, but scrub status shows 'Duration of last scrub' as 2015-12-22 07:21:12.

Log file:
=========

[2015-12-22 09:46:53.681552] I [MSGID: 118044] [bit-rot-scrub.c:654:br_fsscanner_log_time] 0-vol_rep-bit-rot-0: Scrubbing "/bricks/brick8/b2" started 
at 2015-12-22 09:46:53
[2015-12-22 09:46:57.693594] I [MSGID: 118045] [bit-rot-scrub.c:659:br_fsscanner_log_time] 0-vol_rep-bit-rot-0: Scrubbing "/bricks/brick8/b2" finished
 at 2015-12-22 09:46:57
[2015-12-22 09:46:57.693621] I [MSGID: 118038] [bit-rot-scrub.c:918:br_fsscan_activate] 0-vol_rep-bit-rot-0: Scrubbing for /bricks/brick8/b2 rescheduled to run at 2015-12-22 10:46:57

scrub status output:
===================

=========================================================

Node name: localhost

Number of Scrubbed files: 12

Number of Unsigned files: 0

Last completed scrub time: 2015-12-22 07:21:12

Duration of last scrub: 16791:7:21:12

Error count: 0


=========================================================

Node name: rhs-client2.lab.eng.blr.redhat.com

Number of Scrubbed files: 3

Number of Unsigned files: 0

Last completed scrub time: 2015-12-22 10:01:44

Duration of last scrub: 16791:10:1:44

Error count: 0

=========================================================

4) I see that the unsigned file count is always shown as zero. Is this field meant for unsigned files or unscrubbed files?
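
A possible explanation for the 16791 in point 2: 2015-12-22 is exactly day 16791 after the Unix epoch, and the trailing 7:21:12 matches the reported "Last completed scrub time" of 07:21:12. So the duration field appears to be decomposing an absolute epoch timestamp (finish time minus a scrub start time of zero) into days:hours:minutes:seconds instead of a true elapsed time. A minimal C sketch that reproduces the arithmetic; this is illustrative only, not the actual scrubber code:

#include <stdio.h>
#include <time.h>

/* Decompose a span of seconds into the days:hours:minutes:seconds
 * format used by "Duration of last scrub". */
static void print_duration(time_t span)
{
        long days = span / 86400;
        long hrs  = (span % 86400) / 3600;
        long mins = (span % 3600) / 60;
        long secs = span % 60;
        printf("%ld:%ld:%ld:%ld\n", days, hrs, mins, secs);
}

int main(void)
{
        /* Epoch seconds for 2015-12-22 07:21:12 UTC. If the recorded
         * scrub start time is zero, duration = finish - 0, i.e. the
         * absolute timestamp itself. */
        time_t finish = 1450768872;
        print_duration(finish); /* prints 16791:7:21:12 */
        return 0;
}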

Comment 7 RamaKasturi 2015-12-22 10:30:15 UTC
(In reply to RamaKasturi from comment #6)
> Moving this bug back because of the issues below.
> 
> 1) The number of scrubbed files is always more than the number of files
> present on the system. Whenever the scrubber runs, it simply adds the newly
> scrubbed file count to the existing count, so the number keeps doubling.
> 
> 2) Duration of last scrub is shown as "Duration of last scrub:
> 16791:7:21:12"; I am not sure what 16791 means here.
> 
> 3) From the log below it is clearly seen that the last scrub happened at
> 09:46, but scrub status shows 'Duration of last scrub' as 2015-12-22
> 07:21:12.

Replace 'Duration of last scrub' with 'Last completed scrub time' in point 3 above.

Comment 8 RamaKasturi 2015-12-22 12:59:34 UTC
I do see an inconsistency in the scrub duration between the nodes that form the volume, even though both scrubbers start running at the same time.

[root@rhs-client2 ~]# gluster vol bitrot vol_ec scrub status

Volume name : vol_ec

State of scrub: Active

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node name: localhost

Number of Scrubbed files: 0

Number of Unsigned files: 0

Last completed scrub time: 2015-12-22 12:34:31

Duration of last scrub: 0:1:0:4

Error count: 0


=========================================================

Node name: 10.70.36.62

Number of Scrubbed files: 0

Number of Unsigned files: 0

Last completed scrub time: 2015-12-22 12:34:32

Duration of last scrub: 0:0:0:16

Error count: 0

=========================================================

Comment 9 RamaKasturi 2015-12-23 07:35:27 UTC
(In reply to RamaKasturi from comment #7)
> (In reply to RamaKasturi from comment #6)
> > Moving this bug back because of the issues below.
> > 
> > 1) The number of scrubbed files is always more than the number of files
> > present on the system. Whenever the scrubber runs, it simply adds the
> > newly scrubbed file count to the existing count, so the number keeps
> > doubling.
> > 
> > 2) Duration of last scrub is shown as "Duration of last scrub:
> > 16791:7:21:12"; I am not sure what 16791 means here.
And instead of the duration of the last scrub, it prints the value of the last completed scrub time.

Comment 10 Gaurav Kumar Garg 2015-12-23 09:34:42 UTC
(In reply to RamaKasturi from comment #8)
> I do see an inconsistency in the scrub duration between the nodes that
> form the volume, even though both scrubbers start running at the same
> time.

The scrubber runs on a per-node basis, not per volume. So if one node of the volume has more files and the other node has fewer files, you can see this behaviour: the scrubber on the node with fewer files finishes earlier.

Comment 11 RamaKasturi 2015-12-23 09:38:49 UTC
Hi Gaurav,

    I have just three files in my volume, nothing more. I think scrubbing three files of hardly a few bytes each should not take so long.

Thanks
kasturi.

Comment 12 Venky Shankar 2015-12-28 05:42:51 UTC
Gaurav,

There are a couple of things that need fixing.

1. The way we reset the scrub statistics is incorrect [1] when there is more than one brick of a volume on a node. The values should be reset by the "first" scanner thread.

2. The way the scrubber updates the finish time is also a bit incorrect [2]. This is hard to get right at the moment.

Moreover, I think we should move the statistics part to a separate C source file so as not to pollute the scrubber code with (probably complex) statistics updates.

[1]: https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c#L678
[2]: https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c#L621
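
For (1), the intent is that the per-volume statistics be reset exactly once per scrub run, no matter how many per-brick scanner threads start together. A rough C sketch of the shape of that fix; the structure and function names here are hypothetical, not the actual bit-rot-scrub.c code:

#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-volume scrub statistics. */
struct br_scrub_stats {
        uint64_t        scrubbed_files;
        uint64_t        unsigned_files;
        time_t          scrub_start;
        int             reset_done;   /* guards the once-per-run reset */
        pthread_mutex_t lock;
};

/* Every per-brick scanner thread calls this when a scrub run starts;
 * only the first caller actually resets the counters. The last thread
 * to finish the run would clear reset_done for the next run. */
static void br_scrub_stats_maybe_reset(struct br_scrub_stats *stats)
{
        pthread_mutex_lock(&stats->lock);
        if (!stats->reset_done) {
                stats->scrubbed_files = 0;
                stats->unsigned_files = 0;
                stats->scrub_start    = time(NULL);
                stats->reset_done     = 1;
        }
        pthread_mutex_unlock(&stats->lock);
}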

Comment 13 Gaurav Kumar Garg 2015-12-29 09:49:31 UTC
(In reply to Venky Shankar from comment #12)
> Gaurav,
> 
> There are a couple of things that need fixing.
> 
> 1. The way we reset the scrub statistics is incorrect [1] when there is
> more than one brick of a volume on a node. The values should be reset by
> the "first" scanner thread.
> 
> 2. The way the scrubber updates the finish time is also a bit incorrect
> [2]. This is hard to get right at the moment.
> 
> Moreover, I think we should move the statistics part to a separate C
> source file so as not to pollute the scrubber code with (probably
> complex) statistics updates.
> 
> [1]: https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c#L678
> [2]: https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c#L621

Yes, that's true. In the current scrubber code multiple threads do the crawling/scrubbing, so we need a solution where the scrubbed file count, signed file count, scrub time, and similar values are reset just after the signing/scrubbing threads start. Currently the code resets these values based on the total subvolume child count. This might require modifying the existing scrubbing code or some other mechanism. Meanwhile, we can display whichever scrubber statistics are important and accurate.

We can have a separate C source file for these scrubber statistics.

Comment 14 Venky Shankar 2016-01-04 11:55:55 UTC
[removing 3.1.2 tracker]

Review link: http://review.gluster.org/#/c/13105/

As discussed with Alok, the "scrub status" command will provide the list of corrupted objects (and their count). Other statistics will be added later, as they are currently not accurate (the above patch masks these stats out of the display).
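
Schematically, the masking means the CLI prints only the per-node fields that are currently trustworthy (the error count and, when it is non-zero, the corrupted object GFIDs) and skips the statistics entirely. A hypothetical C sketch of that display logic, with invented structure and function names; the actual change is in the review linked above:

#include <stdio.h>

/* Hypothetical flattened per-node status record. */
struct scrub_node_status {
        const char  *node;
        unsigned     error_count;
        const char **corrupted_gfids;   /* NULL-terminated list */
        /* scrub counts and timestamps exist internally but are
         * masked from the output until they are accurate */
};

static void print_node_status(const struct scrub_node_status *s)
{
        printf("Node: %s\n", s->node);
        printf("Error count: %u\n", s->error_count);
        if (s->error_count > 0 && s->corrupted_gfids) {
                printf("Corrupted object's [GFID]:\n");
                for (const char **g = s->corrupted_gfids; *g; g++)
                        printf("%s\n", *g);
        }
}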

Comment 20 Gaurav Kumar Garg 2016-01-20 07:03:03 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/64852/

Comment 21 RamaKasturi 2016-01-21 12:14:30 UTC
Verified and works fine with build glusterfs-3.7.5-17.el7rhgs.x86_64.

The following values are displayed when the user executes "gluster vol bitrot <vol_name> scrub status".

[root@rhs-client38 b2]# gluster vol bitrot vol_rep scrub status

Volume name : vol_rep

State of scrub: Active

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: localhost

Error count: 0


=========================================================

Node: rhs-client2.lab.eng.blr.redhat.com

Error count: 2

Corrupted object's [GFID]:

ddf706fd-580c-4f62-8ef3-38d55ab9e9f4

87912742-20c0-434f-93d2-e4f09ccf1f12

=========================================================

The other values ("Number of Scrubbed files", "Number of Unsigned files", "Last completed scrub time" and "Duration of last scrub") are not shown in the scrub status output.

Marking this as verified.

Comment 23 errata-xmlrpc 2016-03-01 05:57:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html

