Bug 1388298

Summary: [Bitrot]: Scrub ondemand should be a no-op if scrubber is already running
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Sweta Anandpara <sanandpa>
Component: bitrot    Assignee: Kotresh HR <khiremat>
Status: CLOSED WORKSFORME QA Contact: Sweta Anandpara <sanandpa>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.2    CC: amukherj, rhinduja, rhs-bugs, sanandpa, storage-qa-internal
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-16 06:21:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sweta Anandpara 2016-10-25 05:03:44 UTC
Description of problem:
=======================
In a bitrot-enabled volume, if the scrub process is already in the middle of its run and we trigger 'gluster volume bitrot <volname> scrub ondemand', it should not reset the scrub values. On-demand scrubbing, if executed, should not hamper/affect an in-progress run.
In a customer environment with a large data set, this would result in unnecessary overhead.
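
For context, a minimal sketch of the two commands involved (the volume name 'ozone' is taken from the reproduction steps below; confirm the scrubber is mid-run before issuing the on-demand trigger):

gluster volume bitrot ozone scrub status      # confirm 'State of scrub: Active (In Progress)'
gluster volume bitrot ozone scrub ondemand    # expected: no-op, per-node counters must not reset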


Version-Release number of selected component (if applicable):
==============================================================
3.8.4-2


How reproducible:
================
Always


Steps to Reproduce:
===================
1. In a 4 node cluster, create a replica 3 volume 'ozone'
2. Enable bitrot and set scrub-frequency to a minute
3. Execute 'gluster volume bitrot ozone scrub status' to check whether the scrubber state is 'Active (In Progress)'.
4. If it is, trigger 'gluster volume bitrot ozone scrub ondemand' and immediately check the output of scrub status (a command sketch follows these steps).
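
A rough command sequence for the steps above (a sketch; host names and brick paths are illustrative, and the 'minute' scrub-frequency value is a test setting that may not be accepted on all builds):

gluster volume create ozone replica 3 node1:/bricks/brick0/ozone node2:/bricks/brick0/ozone node3:/bricks/brick0/ozone
gluster volume start ozone
gluster volume bitrot ozone enable
gluster volume bitrot ozone scrub-frequency minute
gluster volume bitrot ozone scrub status        # wait for 'State of scrub: Active (In Progress)'
gluster volume bitrot ozone scrub ondemand
gluster volume bitrot ozone scrub status        # counters should NOT be reset to 0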

Actual results:
===============
The scrub status output in step 4 shows that all values are reset to '0' and the scrub is started afresh.


Expected results:
==================
On-demand scrubbing should not hamper the already-progressing run.


Additional info:
=================

[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# rpm -qa | grep gluster
glusterfs-debuginfo-3.8.4-1.el7rhgs.x86_64
glusterfs-fuse-3.8.4-2.el7rhgs.x86_64
glusterfs-cli-3.8.4-2.el7rhgs.x86_64
glusterfs-events-3.8.4-2.el7rhgs.x86_64
glusterfs-devel-3.8.4-2.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-2.el7rhgs.x86_64
glusterfs-3.8.4-2.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-2.el7rhgs.x86_64
python-gluster-3.8.4-2.el7rhgs.noarch
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64
glusterfs-server-3.8.4-2.el7rhgs.x86_64
nfs-ganesha-gluster-2.3.1-8.el7rhgs.x86_64
glusterfs-libs-3.8.4-2.el7rhgs.x86_64
glusterfs-api-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64
glusterfs-rdma-3.8.4-2.el7rhgs.x86_64
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# gluster v info
 
Volume Name: repthree
Type: Replicate
Volume ID: aa8f3095-5a69-4d0a-80d9-6182c3de3cb4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.46.239:/bricks/brick0/repthree1
Brick2: 10.70.46.240:/bricks/brick0/repthree2
Brick3: 10.70.46.242:/bricks/brick0/repthree3
Options Reconfigured:
performance.stat-prefetch: off
features.scrub-freq: minute
features.scrub: Active
features.bitrot: on
transport.address-family: inet
performance.readdir-ahead: on
cluster.enable-shared-storage: disable
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp46-239.lab.eng.blr.redhat.com
Uuid: ed362eb3-421c-4a25-ad0e-82ef157ea328
State: Peer in Cluster (Connected)

Hostname: 10.70.46.240
Uuid: 72c4f894-61f7-433e-a546-4ad2d7f0a176
State: Peer in Cluster (Connected)

Hostname: 10.70.46.242
Uuid: 1e8967ae-51b2-4c27-907e-a22a83107fd0
State: Peer in Cluster (Connected)
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# 
[root@dhcp46-218 ~]# gluster v bitrot repthree scrub status

Volume name : repthree

State of scrub: Active (Idle)

Scrub impact: lazy

Scrub frequency: minute

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: dhcp46-239.lab.eng.blr.redhat.com

Number of Scrubbed files: 24

Number of Skipped files: 0

Last completed scrub time: 2016-10-25 05:02:34

Duration of last scrub (D:M:H:M:S): 0:0:0:48

Error count: 0


=========================================================

Node: 10.70.46.240

Number of Scrubbed files: 24

Number of Skipped files: 0

Last completed scrub time: 2016-10-25 05:02:33

Duration of last scrub (D:M:H:M:S): 0:0:0:48

Error count: 0


=========================================================

Node: 10.70.46.242

Number of Scrubbed files: 24

Number of Skipped files: 0

Last completed scrub time: 2016-10-25 05:02:34

Duration of last scrub (D:M:H:M:S): 0:0:0:48

Error count: 0

=========================================================

[root@dhcp46-218 ~]#

Comment 2 Sweta Anandpara 2016-10-25 05:34:55 UTC
On one of my subsequent runs, I do see the error message below:

[root@dhcp46-218 brick0]# gluster v bitrot repthree scrub ondemand
Bitrot command failed : Commit failed on dhcp46-239.lab.eng.blr.redhat.com. Error: Scrubber is in Pause/Inactive/Running state
Commit failed on 10.70.46.240. Error: Scrubber is in Pause/Inactive/Running state
Commit failed on 10.70.46.242. Error: Scrubber is in Pause/Inactive/Running state
[root@dhcp46-218 brick0]# 

This is how we would expect scrub ondemand to fail when the scrub process is already running.
The above log confirms that the check IS present. It might not be in the right place, which would explain the scrub-values-being-reset-to-0 behaviour seen earlier.

Comment 3 Atin Mukherjee 2016-11-07 13:16:23 UTC
Based on the discussion with Kotresh, providing devel ack.

Comment 6 Kotresh HR 2016-11-15 10:17:42 UTC
I tested this multiple times and it is not reproducible. Please re-test and let me know if it is still reproducible. If yes, please upload the logs or share the machine details for debugging.

Comment 7 Sweta Anandpara 2016-11-16 06:21:38 UTC
I have been unable to reproduce this. Multiple tries over the past two days have been in vain. Moving this BZ to closure; will reopen if I hit it again.