Description of problem:
=======================
If sticky-bit (DHT link-to, i.e. 'T') files are present in a distributed-replicate volume, they get reported as 'skipped' in the scrub status output. Since the actual data file is scrubbed anyway, the scrubber should ideally be ignoring all T files.

Version-Release number of selected component (if applicable):
==========================================================
3.7.9-7

How reproducible:
================
Reporting the first occurrence.

Steps to Reproduce:
===================
1. Have a 4-node cluster with an n*2 distributed-replicate volume.
2. Create files from the mountpoint.
3. Create a scenario in which sticky-bit files get created (say, kill a brick process and do creates).
4. Validate the scrub status output.

Actual results:
================
Step 4 shows a non-zero count in the 'Number of Skipped files' field of the scrub status output.

Expected results:
==================
Step 4 should not report any files as skipped, since there is no open fd while the scrubber is doing its run.

Additional info:
=================
[root@dhcp46-187 ~]# rpm -qa | grep gluster
glusterfs-cli-3.7.9-7.el7rhgs.x86_64
glusterfs-client-xlators-3.7.9-7.el7rhgs.x86_64
gluster-nagios-addons-0.2.7-1.el7rhgs.x86_64
glusterfs-fuse-3.7.9-7.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-7.el7rhgs.x86_64
glusterfs-libs-3.7.9-7.el7rhgs.x86_64
glusterfs-api-3.7.9-7.el7rhgs.x86_64
glusterfs-3.7.9-7.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-server-3.7.9-7.el7rhgs.x86_64

[root@dhcp46-187 ~]# gluster pool list
UUID                                    Hostname                                State
d8339859-b7e5-4683-9e53-00e34a3d090d    dhcp47-188.lab.eng.blr.redhat.com       Connected
1bb3d70d-dbb0-4dd7-9a4d-ae33564ef226    10.70.46.215                            Connected
34a7a230-1513-4244-92b6-47fd17cd7f37    10.70.46.193                            Connected
60b85677-44a0-413f-9200-7516c9b88006    localhost                               Connected

[root@dhcp46-187 ~]# gluster v list
disp
distrep2
distrep3
gluster_shared_storage
mm

[root@dhcp46-187 ~]# gluster v info distrep2

Volume Name: distrep2
Type: Distributed-Replicate
Volume ID: a40e89f0-02dd-4fa7-8687-afe0f092ae80
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/brick/brick1/distrep2
Brick2: 10.70.46.215:/brick/brick1/distrep2
Brick3: 10.70.46.187:/brick/brick1/distrep2
Brick4: 10.70.46.193:/brick/brick1/distrep2
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.readdir-ahead: on
features.bitrot: on
features.scrub: Active
features.scrub-freq: hourly

[root@dhcp46-187 ~]# gluster v bitrot distrep2 scrub status

Volume name : distrep2
State of scrub: Active
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log

=========================================================
Node: localhost
Number of Scrubbed files: 9
Number of Skipped files: 1
Last completed scrub time: 2016-06-03 10:21:46
Duration of last scrub (D:M:H:M:S): 0:0:0:43
Error count: 0

=========================================================
Node: 10.70.46.193
Number of Scrubbed files: 9
Number of Skipped files: 1
Last completed scrub time: 2016-06-03 10:21:45
Duration of last scrub (D:M:H:M:S): 0:0:0:42
Error count: 0

=========================================================
Node: dhcp47-188.lab.eng.blr.redhat.com
Number of Scrubbed files: 13
Number of Skipped files: 0
Last completed scrub time: 2016-06-03 10:21:50
Duration of last scrub (D:M:H:M:S): 0:0:0:46
Error count: 0

=========================================================
Node: 10.70.46.215
Number of Scrubbed files: 13
Number of Skipped files: 0
Last completed scrub time: 2016-06-03 10:21:49
Duration of last scrub (D:M:H:M:S): 0:0:0:46
Error count: 0
=========================================================

[root@dhcp46-187 ~]# cd /brick/brick1/distrep2/dir1/dir2/dir3/dir4/dir5/
[root@dhcp46-187 dir5]# ls -l
total 32
---------T. 2 root root 20 Jun  3 11:59 file1_ln
-rw-r--r--. 2 root root 22 Jun  3 11:25 test1
-rw-r--r--. 2 root root 22 Jun  3 11:25 test2
-rw-r--r--. 2 root root 22 Jun  3 11:25 test4
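For reference, a quick way to confirm from the brick backend that a file like file1_ln above is a DHT link-to entry (and not a regular file with an open fd) is to check its mode and its trusted.glusterfs.dht.linkto xattr. A minimal sketch, assuming the brick path shown in the ls output above:

# Sketch only: confirm that the sticky-bit file seen above is a DHT link-to file.
# The path and file name are taken from the ls output above; adjust as needed.
BRICK_FILE=/brick/brick1/distrep2/dir1/dir2/dir3/dir4/dir5/file1_ln

# A link-to file has mode ---------T (sticky bit set, no access bits) and
# carries the trusted.glusterfs.dht.linkto xattr naming the subvolume that
# holds the actual data; these are the files the scrubber should ignore.
stat -c '%A %n' "$BRICK_FILE"
getfattr --absolute-names -e text -n trusted.glusterfs.dht.linkto "$BRICK_FILE"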
Upstream Patch: http://review.gluster.org/14903 (master)
Upstream Patches:
http://review.gluster.org/14903 (master)
http://review.gluster.org/14982 (3.7)
http://review.gluster.org/14983 (3.8)
As mentioned in comment 4, the fix is already available in RHGS 3.2.0 as part of the rebase to GlusterFS 3.8.4.
Tested and verified this on the build glusterfs-3.8.4-3.el7rhgs.x86_64.

Had a 4-node setup with a 14*2 distributed-replicate volume and bitrot enabled on it. Mounted the volume via FUSE and created files, then killed bricks and moved files around so that link (T) files got created. Monitored the scrub status output at various intervals, and it correctly reported the 'files scrubbed' and 'files skipped' counts. Also corrupted one of the files from the backend (one for which a link file was present) and validated that it was correctly detected as corrupted and healed as expected.

Moving this BZ to Verified in 3.2. Detailed logs are attached.
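A rough outline of the kind of checks described above, using only commands already shown in this report; the volume name matches the earlier output, but the brick path and file name below are placeholders, not the exact ones used in the test run:

# Sketch only: outline of the verification flow described above.
VOL=distrep2

# 1. Monitor the scrubbed/skipped counters across scrub runs.
gluster volume bitrot $VOL scrub status

# 2. Corrupt a data file directly on a brick (never through the mount),
#    picking one that also has a link-to (T) file in the volume.
#    The path below is a placeholder.
echo "garbage" >> /brick/brick1/$VOL/dir1/somefile

# 3. After the next scrub run, the corruption should be reflected in the
#    scrub status output (error count), and the file should be recoverable
#    from the good replica.
gluster volume bitrot $VOL scrub status
gluster volume heal $VOL info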
Created attachment 1217926 [details] Server and client logs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html