Description of problem:
========================
In a sharded volume, where every file is split into multiple shards, the scrubber runs and validates every file (and its shards), but instead of incrementing the count once for every file, it increments it once for every shard. The same gets reflected in the 'files scrubbed' and 'files skipped' fields of the scrub status output, which is misleading to the user, as the numbers there are far higher than the total number of files created. (A rough way to cross-check where the inflated numbers come from is sketched after the logs below.)

Version-Release number of selected component (if applicable):
===========================================================
3.7.9-4

How reproducible:
=================
Always

Steps to Reproduce:
=====================
1. Have a dist-rep volume, and enable sharding.
2. Create 100 1MB files and validate the scrub status output after the scrubber has run.
3. Create 5 4GB files and wait for the next scrub run.
4. Validate the scrub status output after the scrubber has finished running.

Actual results:
================
'files scrubbed' and 'files skipped' show numbers far higher than the total number of files created.

Expected results:
=================
All the fields should be in line with the data actually created.

Additional info:
==================
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# rpm -qa | grep gluster
glusterfs-client-xlators-3.7.9-4.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.7.9-4.el7rhgs.x86_64
glusterfs-api-3.7.9-4.el7rhgs.x86_64
gluster-nagios-addons-0.2.7-1.el7rhgs.x86_64
python-gluster-3.7.5-19.el7rhgs.noarch
glusterfs-3.7.9-4.el7rhgs.x86_64
glusterfs-cli-3.7.9-4.el7rhgs.x86_64
glusterfs-server-3.7.9-4.el7rhgs.x86_64
glusterfs-fuse-3.7.9-4.el7rhgs.x86_64
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.35.85
Uuid: c9550322-c0ef-45e6-ad20-f38658a5ce54
State: Peer in Cluster (Connected)

Hostname: 10.70.35.137
Uuid: 35426000-dad1-416f-b145-f25049f5036e
State: Peer in Cluster (Connected)

Hostname: 10.70.35.13
Uuid: a756f3da-7896-4970-a77d-4829e603f773
State: Peer in Cluster (Connected)
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# gluster v info

Volume Name: ozone
Type: Distributed-Replicate
Volume ID: d79e220b-acde-4d13-b9d5-f37ec741c117
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.70.35.210:/bricks/brick1/ozone
Brick2: 10.70.35.85:/bricks/brick1/ozone
Brick3: 10.70.35.137:/bricks/brick1/ozone
Brick4: 10.70.35.210:/bricks/brick2/ozone
Brick5: 10.70.35.85:/bricks/brick2/ozone
Brick6: 10.70.35.137:/bricks/brick2/ozone
Brick7: 10.70.35.210:/bricks/brick3/ozone
Brick8: 10.70.35.85:/bricks/brick3/ozone
Brick9: 10.70.35.137:/bricks/brick3/ozone
Options Reconfigured:
features.shard: on
features.scrub-throttle: normal
features.scrub-freq: hourly
features.scrub: Active
features.bitrot: on
performance.readdir-ahead: on
[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# gluster v status
Status of volume: ozone
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.210:/bricks/brick1/ozone     49152     0          Y       3255
Brick 10.70.35.85:/bricks/brick1/ozone      49152     0          Y       15549
Brick 10.70.35.137:/bricks/brick1/ozone     49152     0          Y       32158
Brick 10.70.35.210:/bricks/brick2/ozone     49153     0          Y       3261
Brick 10.70.35.85:/bricks/brick2/ozone      49153     0          Y       15557
Brick 10.70.35.137:/bricks/brick2/ozone     49153     0          Y       32164
Brick 10.70.35.210:/bricks/brick3/ozone     49154     0          Y       3270
Brick 10.70.35.85:/bricks/brick3/ozone      49154     0          Y       15564
Brick 10.70.35.137:/bricks/brick3/ozone     49154     0          Y       32171
NFS Server on localhost                     2049      0          Y       24614
Self-heal Daemon on localhost               N/A       N/A        Y       3248
Bitrot Daemon on localhost                  N/A       N/A        Y       8545
Scrubber Daemon on localhost                N/A       N/A        Y       8551
NFS Server on 10.70.35.13                   2049      0          Y       6082
Self-heal Daemon on 10.70.35.13             N/A       N/A        Y       21680
Bitrot Daemon on 10.70.35.13                N/A       N/A        N       N/A
Scrubber Daemon on 10.70.35.13              N/A       N/A        N       N/A
NFS Server on 10.70.35.85                   2049      0          Y       9515
Self-heal Daemon on 10.70.35.85             N/A       N/A        Y       15542
Bitrot Daemon on 10.70.35.85                N/A       N/A        Y       18642
Scrubber Daemon on 10.70.35.85              N/A       N/A        Y       18648
NFS Server on 10.70.35.137                  2049      0          Y       26213
Self-heal Daemon on 10.70.35.137            N/A       N/A        Y       32153
Bitrot Daemon on 10.70.35.137               N/A       N/A        Y       2919
Scrubber Daemon on 10.70.35.137             N/A       N/A        Y       2925

Task Status of Volume ozone
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-210 ~]#
[root@dhcp35-210 ~]# gluster v bitrot ozone scrub status

Volume name : ozone
State of scrub: Active
Scrub impact: normal
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 4930
Number of Skipped files: 0
Last completed scrub time: 2016-05-19 07:40:18
Duration of last scrub (D:M:H:M:S): 0:0:30:35
Error count: 1
Corrupted object's [GFID]: 2be8fc38-db5e-464b-b741-616377994cc8
=========================================================
Node: 10.70.35.85
Number of Scrubbed files: 5139
Number of Skipped files: 0
Last completed scrub time: 2016-05-19 08:49:49
Duration of last scrub (D:M:H:M:S): 0:0:29:39
Error count: 1
Corrupted object's [GFID]: ce5e7a94-cba6-4e65-a7bb-82b1ec396eef
=========================================================
Node: 10.70.35.137
Number of Scrubbed files: 5138
Number of Skipped files: 0
Last completed scrub time: 2016-05-19 09:02:46
Duration of last scrub (D:M:H:M:S): 0:0:31:57
Error count: 0
=========================================================
[root@dhcp35-210 ~]#

============= CLIENT LOGS ==============
[root@dhcp35-30 ~]#
[root@dhcp35-30 ~]# cd /mnt/ozone
[root@dhcp35-30 ozone]# df -k .
Filesystem           1K-blocks     Used Available Use% Mounted on
10.70.35.137:/ozone   62553600 21098496  41455104  34% /mnt/ozone
[root@dhcp35-30 ozone]#
[root@dhcp35-30 ozone]#
[root@dhcp35-30 ozone]# ls -a
.  ..  1m_files  4g_files  .trashcan
[root@dhcp35-30 ozone]#
[root@dhcp35-30 ozone]#
[root@dhcp35-30 ozone]# ls -l 1m_files/ | wc -l
21
[root@dhcp35-30 ozone]# ls -l 4g_files/ | wc -l
6
[root@dhcp35-30 ozone]#
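For anyone reading along, here is a rough, illustrative way to see where the inflated counters come from on a setup like the one above. This is not part of the original report: the brick path, the mount point and the 4MB shard block size used in the comments are assumptions, so adjust them to the volume actually being checked.

# Assumed paths for this setup; adjust as needed.
BRICK=/bricks/brick1/ozone      # one brick of the sharded volume
MNT=/mnt/ozone                  # FUSE mount of the same volume

# Confirm the shard block size in effect. Assuming 4MB, a single 4GB file is
# stored as 1 base file plus ~1023 chunks under the brick's hidden .shard
# directory.
gluster volume get ozone features.shard-block-size

# User-visible files on the mount -- what 'Number of Scrubbed files' should track.
find "$MNT" -path "$MNT/.trashcan" -prune -o -type f -print | wc -l

# Shard chunks stored on this brick -- what the 3.7.9-4 scrubber was additionally
# counting, one increment per chunk.
ls "$BRICK/.shard" | wc -l

# With the bug, the per-node count reported below lands near
# (base files + shard chunks) on that node's bricks, which is far larger than
# the number of user-visible files.
gluster volume bitrot ozone scrub status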
Upstream Patches:

http://review.gluster.org/#/c/14927/ (master)
http://review.gluster.org/#/c/14958/ (3.7)
http://review.gluster.org/#/c/14959/ (3.8)
(In reply to Kotresh HR from comment #4)
> Upstream Patches
>
> http://review.gluster.org/#/c/14927/ (master)
> http://review.gluster.org/#/c/14958/ (3.7)
> http://review.gluster.org/#/c/14959/ (3.8)

The fix is available in rhgs-3.2.0 as a rebase to GlusterFS 3.8.4.
Tested and verified this on the build glusterfs-3.8.4-3.

Had a 4-node setup with bitrot and sharding enabled on a 2x2 volume, as well as on an arbiter volume. Created files and observed the scrub status output. Did end up hitting bz 1378466 and waited it out. Eventually the right number of files gets updated in the #scrubbedFiles and #skippedFiles fields.

Moving this bugzilla to verified in 3.2. Detailed logs are pasted below (a rough cross-check is sketched after the logs).

[root@dhcp35-101 fd]# gluster peer status
Number of Peers: 3

Hostname: 10.70.35.100
Uuid: fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5
State: Peer in Cluster (Connected)

Hostname: 10.70.35.104
Uuid: 10335359-1c70-42b2-bcce-6215a973678d
State: Peer in Cluster (Connected)

Hostname: dhcp35-115.lab.eng.blr.redhat.com
Uuid: 6ac165c0-317f-42ad-8262-953995171dbb
State: Peer in Cluster (Connected)
[root@dhcp35-101 fd]# rpm -qa | grep gluster
python-gluster-3.8.4-3.el6rhs.noarch
glusterfs-rdma-3.8.4-3.el6rhs.x86_64
glusterfs-api-3.8.4-3.el6rhs.x86_64
glusterfs-server-3.8.4-3.el6rhs.x86_64
glusterfs-ganesha-3.8.4-3.el6rhs.x86_64
gluster-nagios-addons-0.2.8-1.el6rhs.x86_64
glusterfs-libs-3.8.4-3.el6rhs.x86_64
glusterfs-fuse-3.8.4-3.el6rhs.x86_64
glusterfs-geo-replication-3.8.4-3.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
vdsm-gluster-4.16.30-1.5.el6rhs.noarch
glusterfs-3.8.4-3.el6rhs.x86_64
glusterfs-cli-3.8.4-3.el6rhs.x86_64
glusterfs-devel-3.8.4-3.el6rhs.x86_64
glusterfs-events-3.8.4-3.el6rhs.x86_64
glusterfs-client-xlators-3.8.4-3.el6rhs.x86_64
glusterfs-api-devel-3.8.4-3.el6rhs.x86_64
nfs-ganesha-gluster-2.3.1-8.el6rhs.x86_64
glusterfs-debuginfo-3.8.4-2.el6rhs.x86_64
[root@dhcp35-101 fd]# gluster v info

Volume Name: nash
Type: Distributed-Replicate
Volume ID: d9c962de-5e4a-4fa9-a9c4-89b6803e543f
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.115:/bricks/brick1/nash0
Brick2: 10.70.35.100:/bricks/brick1/nash1
Brick3: 10.70.35.101:/bricks/brick1/nash2
Brick4: 10.70.35.104:/bricks/brick1/nash3
Options Reconfigured:
features.shard: on
features.scrub-freq: hourly
features.scrub: Active
features.bitrot: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
auto-delete: disable

Volume Name: ozone
Type: Distributed-Replicate
Volume ID: 630022dd-1f6c-423e-bad6-22fb16f9fbcf
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.115:/bricks/brick1/ozone
Brick2: 10.70.35.100:/bricks/brick1/ozone
Brick3: 10.70.35.101:/bricks/brick1/ozone (arbiter)
Brick4: 10.70.35.115:/bricks/brick2/ozone4
Brick5: 10.70.35.100:/bricks/brick2/ozone5
Brick6: 10.70.35.101:/bricks/brick2/ozone6 (arbiter)
Options Reconfigured:
features.scrub-freq: hourly
features.shard: on
features.scrub: Active
features.bitrot: on
features.expiry-time: 20
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
auto-delete: disable
[root@dhcp35-101 fd]#
[root@dhcp35-101 fd]# gluster v bitrot nash scrub status

Volume name : nash
State of scrub: Active (Idle)
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 4
Number of Skipped files: 0
Last completed scrub time: 2016-11-11 08:17:09
Duration of last scrub (D:M:H:M:S): 0:0:0:24
Error count: 0
=========================================================
Node: 10.70.35.100
Number of Scrubbed files: 1
Number of Skipped files: 0
Last completed scrub time: 2016-11-11 08:17:15
Duration of last scrub (D:M:H:M:S): 0:0:0:30
Error count: 0
=========================================================
Node: dhcp35-115.lab.eng.blr.redhat.com
Number of Scrubbed files: 1
Number of Skipped files: 0
Last completed scrub time: 2016-11-11 08:17:15
Duration of last scrub (D:M:H:M:S): 0:0:0:30
Error count: 0
=========================================================
Node: 10.70.35.104
Number of Scrubbed files: 4
Number of Skipped files: 0
Last completed scrub time: 2016-11-11 08:17:09
Duration of last scrub (D:M:H:M:S): 0:0:0:23
Error count: 0
=========================================================
[root@dhcp35-101 fd]#
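For completeness, here is a rough cross-check that can be run per node on the fixed build. It is illustrative only: the brick path glob below is an assumption based on the brick names in the logs above, and the match is approximate (linkto and other internal entries may skew it slightly).

# Count regular files on this node's nash brick(s), ignoring gluster-internal
# directories. With the fix, this node's 'Number of Scrubbed files' plus
# 'Number of Skipped files' should roughly track this figure rather than the
# number of shard chunks under .shard.
for b in /bricks/brick1/nash*; do
    find "$b" -path "$b/.shard" -prune \
           -o -path "$b/.glusterfs" -prune \
           -o -path "$b/.trashcan" -prune \
           -o -type f -print
done | wc -l

# Pull just the per-node counters out of the status output for comparison.
gluster volume bitrot nash scrub status | grep -E 'Node:|Scrubbed files|Skipped files'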
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html