Description of problem:
=======================
In a bitrot enabled volume, if the scrub process is already in the middle of its run and we trigger a 'gluster volume bitrot <volname> scrub ondemand', it should not result in resetting the scrub values. Ondemand scrubbing, if executed, should not hamper/affect an in-progress run. In a customer environment with a large data set, this would result in an unnecessary overhead.

Version-Release number of selected component (if applicable):
==============================================================
3.8.4-2

How reproducible:
================
Always

Steps to Reproduce:
===================
1. In a 4 node cluster, create a replica 3 volume 'ozone'
2. Enable bitrot and set scrub-frequency to a minute
3. Execute 'gluster volume bitrot ozone scrub status' to check whether the state is 'Active (In progress)'
4. If it is, trigger 'gluster volume bitrot ozone scrub ondemand' and immediately check the output of scrub status
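For reference, the steps consolidate into the following command sequence (a sketch: node names and brick paths are placeholders, the volume create/start lines are assumptions, and the 'minute' frequency matches the features.scrub-freq value shown in the volume info below):

# gluster volume create ozone replica 3 node1:/bricks/brick0/ozone node2:/bricks/brick0/ozone node3:/bricks/brick0/ozone
# gluster volume start ozone
# gluster volume bitrot ozone enable
# gluster volume bitrot ozone scrub-frequency minute
# gluster volume bitrot ozone scrub status      <-- wait for 'State of scrub: Active (In progress)'
# gluster volume bitrot ozone scrub ondemand
# gluster volume bitrot ozone scrub status      <-- scrubbed/skipped/error counters must NOT be reset to 0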
Actual results:
===============
Step 4 scrub status output shows all the values reset to '0' and the scrub started afresh.

Expected results:
=================
Ondemand scrubbing should not hamper the already progressing run.

Additional info:
================
[root@dhcp46-218 ~]# rpm -qa | grep gluster
glusterfs-debuginfo-3.8.4-1.el7rhgs.x86_64
glusterfs-fuse-3.8.4-2.el7rhgs.x86_64
glusterfs-cli-3.8.4-2.el7rhgs.x86_64
glusterfs-events-3.8.4-2.el7rhgs.x86_64
glusterfs-devel-3.8.4-2.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-2.el7rhgs.x86_64
glusterfs-3.8.4-2.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-2.el7rhgs.x86_64
python-gluster-3.8.4-2.el7rhgs.noarch
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64
glusterfs-server-3.8.4-2.el7rhgs.x86_64
nfs-ganesha-gluster-2.3.1-8.el7rhgs.x86_64
glusterfs-libs-3.8.4-2.el7rhgs.x86_64
glusterfs-api-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64
glusterfs-rdma-3.8.4-2.el7rhgs.x86_64

[root@dhcp46-218 ~]# gluster v info

Volume Name: repthree
Type: Replicate
Volume ID: aa8f3095-5a69-4d0a-80d9-6182c3de3cb4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.46.239:/bricks/brick0/repthree1
Brick2: 10.70.46.240:/bricks/brick0/repthree2
Brick3: 10.70.46.242:/bricks/brick0/repthree3
Options Reconfigured:
performance.stat-prefetch: off
features.scrub-freq: minute
features.scrub: Active
features.bitrot: on
transport.address-family: inet
performance.readdir-ahead: on
cluster.enable-shared-storage: disable

[root@dhcp46-218 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp46-239.lab.eng.blr.redhat.com
Uuid: ed362eb3-421c-4a25-ad0e-82ef157ea328
State: Peer in Cluster (Connected)

Hostname: 10.70.46.240
Uuid: 72c4f894-61f7-433e-a546-4ad2d7f0a176
State: Peer in Cluster (Connected)

Hostname: 10.70.46.242
Uuid: 1e8967ae-51b2-4c27-907e-a22a83107fd0
State: Peer in Cluster (Connected)

[root@dhcp46-218 ~]# gluster v bitrot repthree scrub status

Volume name : repthree
State of scrub: Active (Idle)
Scrub impact: lazy
Scrub frequency: minute
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: dhcp46-239.lab.eng.blr.redhat.com
Number of Scrubbed files: 24
Number of Skipped files: 0
Last completed scrub time: 2016-10-25 05:02:34
Duration of last scrub (D:M:H:M:S): 0:0:0:48
Error count: 0
=========================================================
Node: 10.70.46.240
Number of Scrubbed files: 24
Number of Skipped files: 0
Last completed scrub time: 2016-10-25 05:02:33
Duration of last scrub (D:M:H:M:S): 0:0:0:48
Error count: 0
=========================================================
Node: 10.70.46.242
Number of Scrubbed files: 24
Number of Skipped files: 0
Last completed scrub time: 2016-10-25 05:02:34
Duration of last scrub (D:M:H:M:S): 0:0:0:48
Error count: 0
=========================================================
[root@dhcp46-218 ~]#
On one of my successive runs, I do see the below error message:

[root@dhcp46-218 brick0]# gluster v bitrot repthree scrub ondemand
Bitrot command failed : Commit failed on dhcp46-239.lab.eng.blr.redhat.com. Error: Scrubber is in Pause/Inactive/Running state
Commit failed on 10.70.46.240. Error: Scrubber is in Pause/Inactive/Running state
Commit failed on 10.70.46.242. Error: Scrubber is in Pause/Inactive/Running state
[root@dhcp46-218 brick0]#

This is exactly how we would expect scrub ondemand to fail when the scrub process is already running. The above log confirms that the check IS present. It may just not be at the right place in the code path, which would explain the scrub-values-being-reset-to-0 behaviour seen earlier.
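To illustrate the suspected ordering problem, here is a minimal C sketch. This is NOT the actual glusterfs source; every type and function name is hypothetical. The point it makes: the "already running" guard must execute before the per-run counters are reset, otherwise an ondemand trigger that arrives mid-run wipes the statistics even though the command itself is rejected.

#include <errno.h>

/* Hypothetical types -- not from the glusterfs tree. */
typedef enum {
        SCRUB_IDLE,
        SCRUB_RUNNING,
        SCRUB_PAUSED,
        SCRUB_INACTIVE,
} scrub_state_t;

typedef struct {
        scrub_state_t state;
        unsigned long scrubbed_files;
        unsigned long skipped_files;
        unsigned long error_count;
} scrubber_t;

static int
scrub_ondemand (scrubber_t *scrub)
{
        /* Guard FIRST: reject ondemand while the scrubber is in
         * Pause/Inactive/Running state, mirroring the CLI error
         * above. Nothing is touched on the rejected path. */
        if (scrub->state != SCRUB_IDLE)
                return -EBUSY;

        /* Reset the per-run statistics only once we know a fresh
         * run really starts. Doing this reset before (or without)
         * the guard is precisely the counters-reset-to-0 behaviour
         * reported in this bug. */
        scrub->scrubbed_files = 0;
        scrub->skipped_files  = 0;
        scrub->error_count    = 0;

        scrub->state = SCRUB_RUNNING;
        return 0;   /* the actual run would be kicked off here */
}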
Based on the discussion with Kotresh, providing devel ack.
I tested this multiple times and could not reproduce it. Please re-test and let me know if it's reproducible. If yes, please upload the logs or share the machine details for debugging.
I have been unable to reproduce this; multiple tries over the past two days have been in vain. Closing this BZ. Will reopen if I happen to hit it again.