Description of problem:
========================
In a sharded volume, where every file is split into multiple shards, the scrubber runs and validates every file (and its shards), but instead of incrementing the count once for every file, it increments it once for every shard. The same gets reflected in the 'files scrubbed' and 'files skipped' fields of the scrub status output, which is misleading to the user, as the numbers there are far higher than the total number of files created. (A rough way to cross-check where the inflated numbers come from is sketched after the logs below.)

Version-Release number of selected component (if applicable):
===========================================================
3.7.9-4

How reproducible:
=================
Always

Steps to Reproduce:
=====================
1. Have a dist-rep volume, and enable sharding.
2. Create 100 1MB files and validate the scrub status output after the scrubber has run.
3. Create 5 4GB files and wait for the next scrub run.
4. Validate the scrub status output after the scrubber has finished running.

Actual results:
================
'files scrubbed' and 'files skipped' show numbers far higher than the total number of files created.

Expected results:
=================
All the fields should be in line with the data actually created.

Additional info:
==================
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# rpm -qa | grep gluster
glusterfs-client-xlators-3.7.9-4.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.7.9-4.el7rhgs.x86_64
glusterfs-api-3.7.9-4.el7rhgs.x86_64
gluster-nagios-addons-0.2.7-1.el7rhgs.x86_64
python-gluster-3.7.5-19.el7rhgs.noarch
glusterfs-3.7.9-4.el7rhgs.x86_64
glusterfs-cli-3.7.9-4.el7rhgs.x86_64
glusterfs-server-3.7.9-4.el7rhgs.x86_64
glusterfs-fuse-3.7.9-4.el7rhgs.x86_64
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.35.85
Uuid: c9550322-c0ef-45e6-ad20-f38658a5ce54
State: Peer in Cluster (Connected)

Hostname: 10.70.35.137
Uuid: 35426000-dad1-416f-b145-f25049f5036e
State: Peer in Cluster (Connected)

Hostname: 10.70.35.13
Uuid: a756f3da-7896-4970-a77d-4829e603f773
State: Peer in Cluster (Connected)
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# gluster v info

Volume Name: ozone
Type: Distributed-Replicate
Volume ID: d79e220b-acde-4d13-b9d5-f37ec741c117
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.70.35.210:/bricks/brick1/ozone
Brick2: 10.70.35.85:/bricks/brick1/ozone
Brick3: 10.70.35.137:/bricks/brick1/ozone
Brick4: 10.70.35.210:/bricks/brick2/ozone
Brick5: 10.70.35.85:/bricks/brick2/ozone
Brick6: 10.70.35.137:/bricks/brick2/ozone
Brick7: 10.70.35.210:/bricks/brick3/ozone
Brick8: 10.70.35.85:/bricks/brick3/ozone
Brick9: 10.70.35.137:/bricks/brick3/ozone
Options Reconfigured:
features.shard: on
features.scrub-throttle: normal
features.scrub-freq: hourly
features.scrub: Active
features.bitrot: on
performance.readdir-ahead: on
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# gluster v status
Status of volume: ozone
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.210:/bricks/brick1/ozone     49152     0          Y       3255
Brick 10.70.35.85:/bricks/brick1/ozone      49152     0          Y       15549
Brick 10.70.35.137:/bricks/brick1/ozone     49152     0          Y       32158
Brick 10.70.35.210:/bricks/brick2/ozone     49153     0          Y       3261
Brick 10.70.35.85:/bricks/brick2/ozone      49153     0          Y       15557
Brick 10.70.35.137:/bricks/brick2/ozone     49153     0          Y       32164
Brick 10.70.35.210:/bricks/brick3/ozone     49154     0          Y       3270
Brick 10.70.35.85:/bricks/brick3/ozone      49154     0          Y       15564
Brick 10.70.35.137:/bricks/brick3/ozone     49154     0          Y       32171
NFS Server on localhost                     2049      0          Y       24614
Self-heal Daemon on localhost               N/A       N/A        Y       3248
Bitrot Daemon on localhost                  N/A       N/A        Y       8545
Scrubber Daemon on localhost                N/A       N/A        Y       8551
NFS Server on 10.70.35.13                   2049      0          Y       6082
Self-heal Daemon on 10.70.35.13             N/A       N/A        Y       21680
Bitrot Daemon on 10.70.35.13                N/A       N/A        N       N/A
Scrubber Daemon on 10.70.35.13              N/A       N/A        N       N/A
NFS Server on 10.70.35.85                   2049      0          Y       9515
Self-heal Daemon on 10.70.35.85             N/A       N/A        Y       15542
Bitrot Daemon on 10.70.35.85                N/A       N/A        Y       18642
Scrubber Daemon on 10.70.35.85              N/A       N/A        Y       18648
NFS Server on 10.70.35.137                  2049      0          Y       26213
Self-heal Daemon on 10.70.35.137            N/A       N/A        Y       32153
Bitrot Daemon on 10.70.35.137               N/A       N/A        Y       2919
Scrubber Daemon on 10.70.35.137             N/A       N/A        Y       2925

Task Status of Volume ozone
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# gluster v bitrot ozone scrub status

Volume name : ozone
State of scrub: Active
Scrub impact: normal
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 4930
Number of Skipped files: 0
Last completed scrub time: 2016-05-19 07:40:18
Duration of last scrub (D:M:H:M:S): 0:0:30:35
Error count: 1
Corrupted object's [GFID]: 2be8fc38-db5e-464b-b741-616377994cc8
=========================================================
Node: 10.70.35.85
Number of Scrubbed files: 5139
Number of Skipped files: 0
Last completed scrub time: 2016-05-19 08:49:49
Duration of last scrub (D:M:H:M:S): 0:0:29:39
Error count: 1
Corrupted object's [GFID]: ce5e7a94-cba6-4e65-a7bb-82b1ec396eef
=========================================================
Node: 10.70.35.137
Number of Scrubbed files: 5138
Number of Skipped files: 0
Last completed scrub time: 2016-05-19 09:02:46
Duration of last scrub (D:M:H:M:S): 0:0:31:57
Error count: 0
=========================================================
[root@dhcp35-210 ~]#

============= CLIENT LOGS ==============
[root@dhcp35-30 ~]#
[root@dhcp35-30 ~]# cd /mnt/ozone
[root@dhcp35-30 ozone]# df -k .
Filesystem           1K-blocks     Used Available Use% Mounted on
10.70.35.137:/ozone   62553600 21098496  41455104  34% /mnt/ozone
[root@dhcp35-30 ozone]#
[root@dhcp35-30 ozone]#
[root@dhcp35-30 ozone]# ls -a
.  ..  1m_files  4g_files  .trashcan
[root@dhcp35-30 ozone]#
[root@dhcp35-30 ozone]#
[root@dhcp35-30 ozone]# ls -l 1m_files/ | wc -l
21
[root@dhcp35-30 ozone]# ls -l 4g_files/ | wc -l
6
[root@dhcp35-30 ozone]#
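For anyone reading along, here is a rough, illustrative way to see where the inflated counters come from on a setup like the one above. This is not part of the original report: the brick path, the mount point and the 4MB shard block size used in the comments are assumptions, so adjust them to the volume actually being checked.

# Assumed paths for this setup; adjust as needed.
BRICK=/bricks/brick1/ozone      # one brick of the sharded volume
MNT=/mnt/ozone                  # FUSE mount of the same volume

# Confirm the shard block size in effect. Assuming 4MB, a single 4GB file is
# stored as 1 base file plus ~1023 chunks under the brick's hidden .shard
# directory.
gluster volume get ozone features.shard-block-size

# User-visible files on the mount -- what 'Number of Scrubbed files' should track.
find "$MNT" -path "$MNT/.trashcan" -prune -o -type f -print | wc -l

# Shard chunks stored on this brick -- what the 3.7.9-4 scrubber was additionally
# counting, one increment per chunk.
ls "$BRICK/.shard" | wc -l

# With the bug, the per-node count reported below lands near
# (base files + shard chunks) on that node's bricks, which is far larger than
# the number of user-visible files.
gluster volume bitrot ozone scrub status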
Upstream Patches:

http://review.gluster.org/#/c/14927/ (master)
http://review.gluster.org/#/c/14958/ (3.7)
http://review.gluster.org/#/c/14959/ (3.8)
(In reply to Kotresh HR from comment #4)
> Upstream Patches
>
> http://review.gluster.org/#/c/14927/ (master)
> http://review.gluster.org/#/c/14958/ (3.7)
> http://review.gluster.org/#/c/14959/ (3.8)

The fix is available in rhgs-3.2.0 as a rebase to GlusterFS 3.8.4.
Tested and verified this on the build glusterfs-3.8.4-3.

Had a 4-node setup with bitrot and sharding enabled on a 2x2 volume, as well as on an arbiter volume. Created files and observed the scrub status output. Did end up hitting bz 1378466 and waited it out. Eventually the right number of files gets updated in the #scrubbedFiles and #skippedFiles fields.

Moving this bugzilla to verified in 3.2. Detailed logs are pasted below (a rough cross-check is sketched after the logs).

[root@dhcp35-101 fd]# gluster peer status
Number of Peers: 3

Hostname: 10.70.35.100
Uuid: fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5
State: Peer in Cluster (Connected)

Hostname: 10.70.35.104
Uuid: 10335359-1c70-42b2-bcce-6215a973678d
State: Peer in Cluster (Connected)

Hostname: dhcp35-115.lab.eng.blr.redhat.com
Uuid: 6ac165c0-317f-42ad-8262-953995171dbb
State: Peer in Cluster (Connected)
[root@dhcp35-101 fd]# rpm -qa | grep gluster
python-gluster-3.8.4-3.el6rhs.noarch
glusterfs-rdma-3.8.4-3.el6rhs.x86_64
glusterfs-api-3.8.4-3.el6rhs.x86_64
glusterfs-server-3.8.4-3.el6rhs.x86_64
glusterfs-ganesha-3.8.4-3.el6rhs.x86_64
gluster-nagios-addons-0.2.8-1.el6rhs.x86_64
glusterfs-libs-3.8.4-3.el6rhs.x86_64
glusterfs-fuse-3.8.4-3.el6rhs.x86_64
glusterfs-geo-replication-3.8.4-3.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
vdsm-gluster-4.16.30-1.5.el6rhs.noarch
glusterfs-3.8.4-3.el6rhs.x86_64
glusterfs-cli-3.8.4-3.el6rhs.x86_64
glusterfs-devel-3.8.4-3.el6rhs.x86_64
glusterfs-events-3.8.4-3.el6rhs.x86_64
glusterfs-client-xlators-3.8.4-3.el6rhs.x86_64
glusterfs-api-devel-3.8.4-3.el6rhs.x86_64
nfs-ganesha-gluster-2.3.1-8.el6rhs.x86_64
glusterfs-debuginfo-3.8.4-2.el6rhs.x86_64
[root@dhcp35-101 fd]# gluster v info

Volume Name: nash
Type: Distributed-Replicate
Volume ID: d9c962de-5e4a-4fa9-a9c4-89b6803e543f
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.115:/bricks/brick1/nash0
Brick2: 10.70.35.100:/bricks/brick1/nash1
Brick3: 10.70.35.101:/bricks/brick1/nash2
Brick4: 10.70.35.104:/bricks/brick1/nash3
Options Reconfigured:
features.shard: on
features.scrub-freq: hourly
features.scrub: Active
features.bitrot: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
auto-delete: disable

Volume Name: ozone
Type: Distributed-Replicate
Volume ID: 630022dd-1f6c-423e-bad6-22fb16f9fbcf
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.115:/bricks/brick1/ozone
Brick2: 10.70.35.100:/bricks/brick1/ozone
Brick3: 10.70.35.101:/bricks/brick1/ozone (arbiter)
Brick4: 10.70.35.115:/bricks/brick2/ozone4
Brick5: 10.70.35.100:/bricks/brick2/ozone5
Brick6: 10.70.35.101:/bricks/brick2/ozone6 (arbiter)
Options Reconfigured:
features.scrub-freq: hourly
features.shard: on
features.scrub: Active
features.bitrot: on
features.expiry-time: 20
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
auto-delete: disable
[root@dhcp35-101 fd]#
[root@dhcp35-101 fd]# gluster v bitrot nash scrub status

Volume name : nash
State of scrub: Active (Idle)
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 4
Number of Skipped files: 0
Last completed scrub time: 2016-11-11 08:17:09
Duration of last scrub (D:M:H:M:S): 0:0:0:24
Error count: 0
=========================================================
Node: 10.70.35.100
Number of Scrubbed files: 1
Number of Skipped files: 0
Last completed scrub time: 2016-11-11 08:17:15
Duration of last scrub (D:M:H:M:S): 0:0:0:30
Error count: 0
=========================================================
Node: dhcp35-115.lab.eng.blr.redhat.com
Number of Scrubbed files: 1
Number of Skipped files: 0
Last completed scrub time: 2016-11-11 08:17:15
Duration of last scrub (D:M:H:M:S): 0:0:0:30
Error count: 0
=========================================================
Node: 10.70.35.104
Number of Scrubbed files: 4
Number of Skipped files: 0
Last completed scrub time: 2016-11-11 08:17:09
Duration of last scrub (D:M:H:M:S): 0:0:0:23
Error count: 0
=========================================================
[root@dhcp35-101 fd]#
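For completeness, here is a rough cross-check that can be run per node on the fixed build. It is illustrative only: the brick path glob below is an assumption based on the brick names in the logs above, and the match is approximate (linkto and other internal entries may skew it slightly).

# Count regular files on this node's nash brick(s), ignoring gluster-internal
# directories. With the fix, this node's 'Number of Scrubbed files' plus
# 'Number of Skipped files' should roughly track this figure rather than the
# number of shard chunks under .shard.
for b in /bricks/brick1/nash*; do
    find "$b" -path "$b/.shard" -prune \
           -o -path "$b/.glusterfs" -prune \
           -o -path "$b/.trashcan" -prune \
           -o -type f -print
done | wc -l

# Pull just the per-node counters out of the status output for comparison.
gluster volume bitrot nash scrub status | grep -E 'Node:|Scrubbed files|Skipped files'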
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html