Bug 1454596
Summary: [Bitrot]: Inconsistency seen with 'scrub ondemand' - fails to trigger scrub

Product: [Red Hat Storage] Red Hat Gluster Storage
Component: bitrot
Version: rhgs-3.3
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Sweta Anandpara <sanandpa>
Assignee: Kotresh HR <khiremat>
QA Contact: Sweta Anandpara <sanandpa>
Docs Contact:
CC: amukherj, khiremat, rhinduja, rhs-bugs, sanandpa, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-29
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1461845 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:45:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417151, 1461845, 1462080, 1462127
Description
Sweta Anandpara
2017-05-23 06:53:43 UTC
[qe@rhsqe-repo 1454596]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ pwd
/home/repo/sosreports/1454596
[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ ll
total 144772
-rwxr-xr-x. 1 qe qe 39345624 May 23 12:17 sosreport-sysreg-prod-20170523021643_dhcp47_165.tar.xz
-rwxr-xr-x. 1 qe qe 38025452 May 23 12:17 sosreport-sysreg-prod-20170523021649_dhcp47_164.tar.xz
-rwxr-xr-x. 1 qe qe 35709660 May 23 12:17 sosreport-sysreg-prod-20170523021653_dhcp47_162.tar.xz
-rwxr-xr-x. 1 qe qe 35158712 May 23 12:17 sosreport-sysreg-prod-20170523021659_dhcp47_157.tar.xz
[qe@rhsqe-repo 1454596]$

CLI logs of the node where this took place:

[root@dhcp47-164 ~]# gluster v bitrot ozone scrub ondemand
volume bitrot: success
[root@dhcp47-164 ~]# gluster v bitrot ozone scrub status

Volume name : ozone
State of scrub: Active (Idle)
Scrub impact: aggressive
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-165.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-157.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-162.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================

[root@dhcp47-164 ~]# gluster v bitrot ozone scrub status

Volume name : ozone
State of scrub: Active (Idle)
Scrub impact: aggressive
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-162.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-157.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-165.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================

[root@dhcp47-164 ~]# vim /var/log/glusterfs/scrub.log
[root@dhcp47-164 ~]# vim /var/log/glusterfs/bitd.log
[root@dhcp47-164 ~]#

Pasted below is a snippet of glusterd logs:

[2017-05-23 06:01:33.109798] I [MSGID: 106567] [glusterd-svc-mgmt.c:196:glusterd_svc_start] 0-management: Starting scrub service
[2017-05-23 06:01:42.387551] I [MSGID: 106499] [glusterd-handler.c:4369:__glusterd_handle_status_volume] 0-management: Received status volume req for volume ozone
[2017-05-23 06:02:13.908230] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:02:21.799744] W [MSGID: 106387] [glusterd-op-sm.c:5093:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2017-05-23 06:02:24.606063] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:02:33.604024] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:03:11.761096] I [MSGID: 106488] [glusterd-handler.c:1539:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2017-05-23 06:04:07.142209] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:06:44.300978] I [MSGID: 106062] [glusterd-volume-ops.c:2617:glusterd_op_start_volume] 0-management: Global dict not present.

Sosreports have not been collected (in the past) for the multiple times this issue has been hit in automation, as the automation framework does not have that capability built in yet. I have triggered another run of the bitrot scripts; I will monitor it and manually log in to the VMs as soon as the run ends to collect the sosreports, and will update this BZ with the details once the run is complete. Keeping the need-info on me until I upload the sosreports.

I realise that I have hit this issue manually as well, and the sosreports of that run have been copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>. The cmd_history log should show the time when 'scrub ondemand' was run, and the corresponding gluster logs of that time can be analysed.

[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ pwd
/home/repo/sosreports/1454596
[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ ll
total 144772
-rwxr-xr-x. 1 qe qe 39345624 May 23 12:17 sosreport-sysreg-prod-20170523021643_dhcp47_165.tar.xz
-rwxr-xr-x. 1 qe qe 38025452 May 23 12:17 sosreport-sysreg-prod-20170523021649_dhcp47_164.tar.xz
-rwxr-xr-x. 1 qe qe 35709660 May 23 12:17 sosreport-sysreg-prod-20170523021653_dhcp47_162.tar.xz
-rwxr-xr-x. 1 qe qe 35158712 May 23 12:17 sosreport-sysreg-prod-20170523021659_dhcp47_157.tar.xz
[qe@rhsqe-repo 1454596]$

Comment 2 already has the sosreports' location pasted there :) Comment 3 gives the node (whose cmd_history has to be checked) where the failure occurred. Comment 4 gives the glusterd logs, from which we can get the _time_ in and around which the debugging has to be done.
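To make the comparison described above concrete, here is a minimal, illustrative sketch (not part of the original report): the log paths are the ones shown in this bug, and the timestamp to search for is whatever cmd_history.log records for the ondemand trigger.

```sh
# On the node where 'scrub ondemand' was issued (dhcp47-164 in this report):
# find when the command was run, then inspect the scrubber/bitd logs around it.
grep 'scrub ondemand' /var/log/glusterfs/cmd_history.log
# e.g. if the trigger was at 2017-05-23 06:02, look at the surrounding entries:
grep '2017-05-23 06:0' /var/log/glusterfs/scrub.log /var/log/glusterfs/bitd.log
```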
The automation run results are out, and the sosreports of all the nodes after the entire run are copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1454596/new/

By comparing the ondemand failure timings with the scrubber logs, it looks like this is not a bug. The failures occur because the scrubber is either still coming up or is in the Pause/Inactive/Running state; please see the logs below for more details. If there are instances where the CLI says the command succeeded but scrubbing did not happen, we need the exact timings to compare against. I could not match the timings from [1] to the scrubber logs, but all of these failures are around the time the scrubber is initializing. Can we run the automation test with a sleep of 5 seconds added after the scrubber is started and see whether you still hit the issue?

[1] http://jenkins-rhs.lab.eng.blr.redhat.com:8080/job/auto-RHGS_Downstream_Regression_Bitrot_RHEL7_4_RHGS_3_8_brew/ws/distaf_tests_bitrot_glusterfs_10.log

SOSREPORT OF Comment 9: The logs clearly say that the scrubber is in the Pause/Inactive/Running state.

root@dhcp-0-80# find . -name cmd_history.log | xargs grep ondemand
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:07:32.502807] : v bitrot distrep scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:08:08.001503] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:09:48.556031] : v bitrot distrep scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:10:46.224744] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:40:54.856145] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:47:41.087389] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:47:52.579467] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:48:02.778093] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 06:02:21.804275] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021643/var/log/glusterfs/cmd_history.log:[2017-05-23 05:39:23.302703] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state

SOSREPORT OF Comment 11: Both times it failed, the scrubber was still coming up. Please compare the timings below with the scrubber logs.
root@dhcp-0-80# grep ondemand cmd_history.log
[2017-05-31 09:22:05.788918] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 09:41:23.383408] : volume bitrot testvol scrub ondemand : FAILED :
[2017-05-31 10:08:30.273021] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 10:27:37.480013] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 10:33:27.136219] : volume bitrot testvol scrub ondemand : FAILED :
[2017-05-31 11:32:58.767932] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 19:04:15.373722] : volume bitrot testvol scrub ondemand : SUCCESS

[2017-05-31 09:41:21.593591] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/lib/glusterd/scrub/run/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/0fc8642d18f8a025d5a9cc1aa4344112.socket --global-timer-wheel)
[2017-05-31 09:41:21.607443] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-31 09:41:21.607481] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-05-31 09:41:21.607499] I [glusterfsd-mgmt.c:2140:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2017-05-31 09:41:25.610427] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-testvol-bit-rot-0: Scaling up scrubbers [0 => 8]
[2017-05-31 09:41:25.610531] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-testvol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: aggressive]
[2017-05-31 09:41:25.610554] I [MSGID: 118031] [bit-rot.c:2010:init] 0-testvol-bit-rot-0: bit-rot xlator loaded in "SCRUBBER" mode
[2017-05-31 09:41:25.610847] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-0: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.611139] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-31 09:41:25.613826] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-4: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.614104] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-0: changing port to 49152 (from 0)
[2017-05-31 09:41:25.616723] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-8: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.619118] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-4: changing port to 49153 (from 0)
..
..
[2017-05-31 10:33:24.700462] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/lib/glusterd/scrub/run/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/0fc8642d18f8a025d5a9cc1aa4344112.socket --global-timer-wheel)
[2017-05-31 10:33:24.710902] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-31 10:33:24.710974] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-05-31 10:33:24.710997] I [glusterfsd-mgmt.c:2140:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2017-05-31 10:33:28.706414] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-testvol-bit-rot-0: Scaling up scrubbers [0 => 8]
[2017-05-31 10:33:28.707260] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-testvol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: aggressive]
[2017-05-31 10:33:28.707333] I [MSGID: 118031] [bit-rot.c:2010:init] 0-testvol-bit-rot-0: bit-rot xlator loaded in "SCRUBBER" mode
[2017-05-31 10:33:28.707609] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-31 10:33:28.707890] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-0: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.711210] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-4: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.711649] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-0: changing port to 49155 (from 0)
[2017-05-31 10:33:28.714554] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-8: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.717268] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-4: changing port to 49156 (from 0)
Final graph:
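(As an aside on the 5-second-sleep suggestion above: a rough, illustrative alternative would be to wait until the scrub daemon is actually up before issuing the ondemand trigger, rather than sleeping for a fixed interval. This is only a sketch, not the automation framework's actual helper; the pid-file path is taken from the scrubber command line in the log above, and 'testvol' is the volume used in these runs.)

```sh
# Wait (up to ~30s) for the scrub daemon to come up, then trigger the scrub.
PIDFILE=/var/lib/glusterd/scrub/run/scrub.pid
for i in $(seq 1 30); do
    [ -s "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null && break
    sleep 1
done
gluster volume bitrot testvol scrub ondemand
```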
We are going about this analysis the wrong way, at the outset itself.

Firstly, we should not be checking the cmd_history logs only for the times the command _failed_. The command reports /success/ in the CLI, and hence every trigger of bitrot ondemand scrub has to be analysed in the cmd_history logs.
Secondly, are you requesting that a delay of 5 seconds be added after triggering scrub ondemand, before checking scrub status? If so, that is already present in the logs: the scrub status is checked repeatedly, throughout the entire duration of that test case. The timeline from the automation log is as follows:
1:23:00 Bitrot is enabled.
1:25:32 Scrub throttle is set to aggressive
1:25:34 File contents are corrupted, after verifying that the file is signed
1:25:35 Scrub status is checked before triggering scrub ondemand
1:25:37 Scrub ondemand is triggered
1:25:39 Scrub status is checked after triggering scrub ondemand
1:26:41 ANOTHER TIME SCRUB STATUS IS CHECKED
1:36:45 ANOTHER TIME SCRUB STATUS IS CHECKED
>>>>> There is a gap of more than a minute between the trigger of scrub ondemand and the scrub-status check, and then a further gap of 10 minutes. Both times, the state of scrub is reported as 'Active (Idle)' and 'Last completed scrub time' as 'Scrubber pending to complete'.
Taken together, these two observations imply that the scrubber was never triggered.
Please find the supporting logs below.
2017-05-31 01:23:00,079 INFO run "gluster volume bitrot testvol enable" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:23:00,080 INFO run "gluster volume bitrot testvol enable" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is
volume bitrot: success
2017-05-31 01:25:32,945 INFO run "gluster volume bitrot testvol scrub-throttle aggressive" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:32,945 INFO run "gluster volume bitrot testvol scrub-throttle aggressive" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is
volume bitrot: success
2017-05-31 01:25:34,448 INFO run Corrupting the file /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt from server 10.70.40.128
2017-05-31 01:25:34,449 INFO run Executing echo test > /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt on 10.70.40.128
2017-05-31 01:25:34,463 INFO run "echo test > /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt" on 10.70.40.128: RETCODE is 0
2017-05-31 01:25:35,081 INFO trigger_scrubber_on_demand BITROT SCRUB STATUS Before Triggering scrub on demand:
2017-05-31 01:25:35,082 INFO run Executing gluster volume bitrot testvol scrub status on rhsauto053.lab.eng.blr.redhat.com
2017-05-31 01:25:36,587 INFO run "gluster volume bitrot testvol scrub status" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:37,983 INFO run "gluster volume bitrot testvol scrub ondemand" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:37,983 INFO run "gluster volume bitrot testvol scrub ondemand" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is
volume bitrot: success
2017-05-31 01:25:37,984 INFO trigger_scrubber_on_demand BITROT SCRUB STATUS After Triggering scrub on demand:
2017-05-31 01:25:37,984 INFO run Executing gluster volume bitrot testvol scrub status on rhsauto053.lab.eng.blr.redhat.com
2017-05-31 01:25:39,337 INFO run "gluster volume bitrot testvol scrub status" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:26:39,371 INFO verify_scrubbing BITROT SCRUB STATUS Before starting the validation of scrubbing a bad file
2017-05-31 01:26:39,371 INFO run Executing gluster volume bitrot testvol scrub status on 10.70.40.128
2017-05-31 01:26:41,013 INFO run "gluster volume bitrot testvol scrub status" on 10.70.40.128: RETCODE is 0
2017-05-31 01:36:42,553 INFO run Executing gluster volume bitrot testvol scrub status on 10.70.40.128
2017-05-31 01:36:45,469 INFO run "gluster volume bitrot testvol scrub status" on 10.70.40.128: RETCODE is 0
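For illustration only, here is a sketch of the kind of post-trigger check the timeline above performs, expressed as a polling loop. The volume name 'testvol' and the 'Scrubber pending to complete' string are taken from the logs in this report; the loop bounds and sleep interval are assumptions.

```sh
# Poll scrub status after 'scrub ondemand'; if every check still reports
# "Scrubber pending to complete" after the timeout, the scrub never ran,
# which is the inconsistency this bug describes.
for i in $(seq 1 60); do
    out=$(gluster volume bitrot testvol scrub status)
    echo "$out" | grep -q 'Scrubber pending to complete' || break
    sleep 10
done
echo "$out"
```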
After sitting together and debugging the logs, it was agreed that Kotresh will add a few extra log messages at certain places and trigger another run, which would in turn help in RCA'ing the issue.

Upstream patches:
https://review.gluster.org/17552 (master)
https://review.gluster.org/17553 (3.10)
https://review.gluster.org/17555 (3.11)

Downstream patch: https://code.engineering.redhat.com/gerrit/109279

Tested and verified this on the build glusterfs-3.8.4-38, after adding the necessary time delays in automation after every glusterd restart and bitrot enable. I do not see the issue mentioned in the description. One scrub test, 'verify_scrubber_when_glusterd_down', did fail, and its error logs look similar; however, it will have to be run manually to ascertain the real issue. With all but one of the test cases passing in the latest run, moving this bug to verified in RHGS 3.3.0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774