Description of problem:
=======================
In a 4/6-node cluster, for any kind of bitrot-enabled volume, there have been times when the command 'gluster volume bitrot <volname> scrub ondemand' was executed but failed to trigger the scrubber process to start scrubbing. The command 'gluster volume bitrot <volname> scrub status', which should ideally show the progress of the scrub run per node, continues to display 'Scrubber pending to complete' for every node, with the overall state 'Active (Idle)' - proving that the 'scrub ondemand' command turned out to be a no-op.

Have hit this multiple times in automation and once while testing manually. The scrub logs do show that scrub ondemand was called, followed by 'No change in volfile, continuing' messages.

Version-Release number of selected component (if applicable):
============================================================
3.8.4-25

How reproducible:
=================
Multiple times

Steps to Reproduce:
===================
These might not be sure-shot ways to reproduce it, but these are the general steps that have been executed whenever this has been hit.
1. Have a bitrot-enabled volume with data
2. Disable bitrot, then enable bitrot again
3. Trigger scrub ondemand

Additional info:
================
[2017-05-23 06:10:45.513449] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand] 0-ozone-bit-rot-0: Ondemand Scrubbing scheduled to run at 2017-05-23 06:10:46
[2017-05-23 06:10:45.605562] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-05-23 06:10:46.161784] I [glusterfsd-mgmt.c:1780:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-05-23 06:10:46.840056] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time] 0-ozone-bit-rot-0: Scrubbing started at 2017-05-23 06:10:46
[2017-05-23 06:10:48.083396] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-05-23 06:10:48.644978] I [glusterfsd-mgmt.c:1780:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing

[root@dhcp47-164 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp47-165.lab.eng.blr.redhat.com
Uuid: 834d66eb-fb65-4ea3-949a-e7cb4c198f2b
State: Peer in Cluster (Connected)

Hostname: dhcp47-162.lab.eng.blr.redhat.com
Uuid: 95491d39-d83a-4053-b1d5-682ca7290bd2
State: Peer in Cluster (Connected)

Hostname: dhcp47-157.lab.eng.blr.redhat.com
Uuid: d0955c85-94d0-41ba-aea8-1ffde3575ea5
State: Peer in Cluster (Connected)

[root@dhcp47-164 ~]# rpm -qa | grep gluster
glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64
glusterfs-libs-3.8.4-25.el7rhgs.x86_64
glusterfs-fuse-3.8.4-25.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-4.el7.x86_64
glusterfs-events-3.8.4-25.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-25.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-rdma-3.8.4-25.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-25.el7rhgs.x86_64
glusterfs-3.8.4-25.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-cli-3.8.4-25.el7rhgs.x86_64
glusterfs-server-3.8.4-25.el7rhgs.x86_64
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-api-3.8.4-25.el7rhgs.x86_64

[root@dhcp47-164 ~]# gluster v list
distrep
ozone

[root@dhcp47-164 ~]# gluster v status
Status of volume: distrep
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.165:/bricks/brick1/distrep_0             49152     0          Y       7697
Brick 10.70.47.164:/bricks/brick1/distrep_1             49153     0          Y       2021
Brick 10.70.47.162:/bricks/brick1/distrep_2             49153     0          Y       628
Brick 10.70.47.157:/bricks/brick1/distrep_3             49153     0          Y       31735
Self-heal Daemon on localhost                           N/A       N/A        Y       2041
Bitrot Daemon on localhost                              N/A       N/A        Y       2528
Scrubber Daemon on localhost                            N/A       N/A        Y       2538
Self-heal Daemon on dhcp47-165.lab.eng.blr.redhat.com   N/A       N/A        Y       7785
Bitrot Daemon on dhcp47-165.lab.eng.blr.redhat.com      N/A       N/A        Y       16837
Scrubber Daemon on dhcp47-165.lab.eng.blr.redhat.com    N/A       N/A        Y       16901
Self-heal Daemon on dhcp47-162.lab.eng.blr.redhat.com   N/A       N/A        Y       648
Bitrot Daemon on dhcp47-162.lab.eng.blr.redhat.com      N/A       N/A        Y       1350
Scrubber Daemon on dhcp47-162.lab.eng.blr.redhat.com    N/A       N/A        Y       1360
Self-heal Daemon on dhcp47-157.lab.eng.blr.redhat.com   N/A       N/A        Y       31762
Bitrot Daemon on dhcp47-157.lab.eng.blr.redhat.com      N/A       N/A        Y       32487
Scrubber Daemon on dhcp47-157.lab.eng.blr.redhat.com    N/A       N/A        Y       32505

Task Status of Volume distrep
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: ozone
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.165:/bricks/brick0/ozone_0               49153     0          Y       12918
Brick 10.70.47.164:/bricks/brick0/ozone_1               49152     0          Y       32008
Brick 10.70.47.162:/bricks/brick0/ozone_2               49152     0          Y       31242
Brick 10.70.47.157:/bricks/brick0/ozone_3               49152     0          Y       30037
Self-heal Daemon on localhost                           N/A       N/A        Y       2041
Bitrot Daemon on localhost                              N/A       N/A        Y       2528
Scrubber Daemon on localhost                            N/A       N/A        Y       2538
Self-heal Daemon on dhcp47-162.lab.eng.blr.redhat.com   N/A       N/A        Y       648
Bitrot Daemon on dhcp47-162.lab.eng.blr.redhat.com      N/A       N/A        Y       1350
Scrubber Daemon on dhcp47-162.lab.eng.blr.redhat.com    N/A       N/A        Y       1360
Self-heal Daemon on dhcp47-165.lab.eng.blr.redhat.com   N/A       N/A        Y       7785
Bitrot Daemon on dhcp47-165.lab.eng.blr.redhat.com      N/A       N/A        Y       16837
Scrubber Daemon on dhcp47-165.lab.eng.blr.redhat.com    N/A       N/A        Y       16901
Self-heal Daemon on dhcp47-157.lab.eng.blr.redhat.com   N/A       N/A        Y       31762
Bitrot Daemon on dhcp47-157.lab.eng.blr.redhat.com      N/A       N/A        Y       32487
Scrubber Daemon on dhcp47-157.lab.eng.blr.redhat.com    N/A       N/A        Y       32505

Task Status of Volume ozone
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp47-164 ~]# gluster v info

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: 71537fad-fa85-4dac-b534-dd6edceba4e9
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.165:/bricks/brick1/distrep_0
Brick2: 10.70.47.164:/bricks/brick1/distrep_1
Brick3: 10.70.47.162:/bricks/brick1/distrep_2
Brick4: 10.70.47.157:/bricks/brick1/distrep_3
Options Reconfigured:
features.scrub: Active
features.bitrot: on
transport.address-family: inet
nfs.disable: on

Volume Name: ozone
Type: Distributed-Replicate
Volume ID: aba2693d-b771-4ef5-a0df-d0a2c8f77f9e
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.165:/bricks/brick0/ozone_0
Brick2: 10.70.47.164:/bricks/brick0/ozone_1
Brick3: 10.70.47.162:/bricks/brick0/ozone_2
Brick4: 10.70.47.157:/bricks/brick0/ozone_3
Options Reconfigured:
features.scrub-throttle: aggressive
features.scrub-freq: hourly
storage.batch-fsync-delay-usec: 0
nfs.disable: on
transport.address-family: inet
server.allow-insecure: on
performance.cache-samba-metadata: on
performance.nl-cache: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.parallel-readdir: on
features.bitrot: on
features.scrub: Active
[root@dhcp47-164 ~]#
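The reproduction steps from the description can be sketched as a small driver script. This is a hedged sketch only: the volume name 'testvol' is a placeholder, and the script falls back to printing the commands when the gluster CLI is not installed, so the sequence can be inspected on any host.

```python
#!/usr/bin/env python3
"""Sketch of the reproduction steps; 'testvol' is a placeholder volume name."""
import shutil
import subprocess
import time

VOLUME = "testvol"  # assumption: substitute the real volume name

def gluster(*args, dry_run=not shutil.which("gluster")):
    """Run 'gluster volume bitrot <VOLUME> <args>'; dry-run when CLI absent."""
    cmd = ["gluster", "volume", "bitrot", VOLUME, *args]
    if dry_run:
        print("would run:", " ".join(cmd))
        return ""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

gluster("disable")            # step 2: disable bitrot ...
gluster("enable")             # ... then re-enable it
time.sleep(5)                 # proposed delay: let the scrubber finish initializing
gluster("scrub", "ondemand")  # step 3: trigger the on-demand scrub
print(gluster("scrub", "status"))
```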
[qe@rhsqe-repo 1454596]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1454596]$ pwd
/home/repo/sosreports/1454596
[qe@rhsqe-repo 1454596]$ ll
total 144772
-rwxr-xr-x. 1 qe qe 39345624 May 23 12:17 sosreport-sysreg-prod-20170523021643_dhcp47_165.tar.xz
-rwxr-xr-x. 1 qe qe 38025452 May 23 12:17 sosreport-sysreg-prod-20170523021649_dhcp47_164.tar.xz
-rwxr-xr-x. 1 qe qe 35709660 May 23 12:17 sosreport-sysreg-prod-20170523021653_dhcp47_162.tar.xz
-rwxr-xr-x. 1 qe qe 35158712 May 23 12:17 sosreport-sysreg-prod-20170523021659_dhcp47_157.tar.xz
[qe@rhsqe-repo 1454596]$
CLI logs of the node where this took place:

[root@dhcp47-164 ~]# gluster v bitrot ozone scrub ondemand
volume bitrot: success
[root@dhcp47-164 ~]# gluster v bitrot ozone scrub status

Volume name : ozone
State of scrub: Active (Idle)
Scrub impact: aggressive
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log

=========================================================
Node: localhost
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-165.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-157.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-162.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================

[root@dhcp47-164 ~]# gluster v bitrot ozone scrub status

Volume name : ozone
State of scrub: Active (Idle)
Scrub impact: aggressive
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log

=========================================================
Node: localhost
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-162.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-157.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-165.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================

[root@dhcp47-164 ~]# vim /var/log/glusterfs/scrub.log
[root@dhcp47-164 ~]# vim /var/log/glusterfs/bitd.log
[root@dhcp47-164 ~]#
Pasted below is a snippet of the glusterd logs:

[2017-05-23 06:01:33.109798] I [MSGID: 106567] [glusterd-svc-mgmt.c:196:glusterd_svc_start] 0-management: Starting scrub service
[2017-05-23 06:01:42.387551] I [MSGID: 106499] [glusterd-handler.c:4369:__glusterd_handle_status_volume] 0-management: Received status volume req for volume ozone
[2017-05-23 06:02:13.908230] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:02:21.799744] W [MSGID: 106387] [glusterd-op-sm.c:5093:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2017-05-23 06:02:24.606063] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:02:33.604024] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:03:11.761096] I [MSGID: 106488] [glusterd-handler.c:1539:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2017-05-23 06:04:07.142209] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:06:44.300978] I [MSGID: 106062] [glusterd-volume-ops.c:2617:glusterd_op_start_volume] 0-management: Global dict not present.
Sosreports have not been collected (in the past) for the multiple times this issue has been hit in automation, as the automation framework does not yet have that capability built in. I have triggered another run of the bitrot scripts; I'll monitor it and manually log in to the VMs as soon as the run ends to collect the sosreports. Will update this BZ with the details once the run is complete. Keeping the needinfo on me until I upload the sosreports.
I realize that I have hit this issue manually as well, and the sosreports of that run have been copied to:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>

The cmd_history log should show the time when 'scrub ondemand' was run, and the corresponding gluster logs of that time can be analysed.

[qe@rhsqe-repo 1454596]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1454596]$ pwd
/home/repo/sosreports/1454596
[qe@rhsqe-repo 1454596]$ ll
total 144772
-rwxr-xr-x. 1 qe qe 39345624 May 23 12:17 sosreport-sysreg-prod-20170523021643_dhcp47_165.tar.xz
-rwxr-xr-x. 1 qe qe 38025452 May 23 12:17 sosreport-sysreg-prod-20170523021649_dhcp47_164.tar.xz
-rwxr-xr-x. 1 qe qe 35709660 May 23 12:17 sosreport-sysreg-prod-20170523021653_dhcp47_162.tar.xz
-rwxr-xr-x. 1 qe qe 35158712 May 23 12:17 sosreport-sysreg-prod-20170523021659_dhcp47_157.tar.xz
[qe@rhsqe-repo 1454596]$
Comment 2 already has the sosreports' location pasted there :) Comment 3 gives the node (whose cmd_history has to be checked) where the failure occurred. Comment 4 gives the glusterd logs, from which we can get the _time_ in and around which the debugging has to be done.
The automation run results are out, and the sosreports of all the nodes after the entire run are copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1454596/new/
Comparing the ondemand failure timings with the scrubber logs, this does not look like a bug. The failures occur because the scrubber is either still coming up or is already in Pause/Inactive/Running state. Please look at the logs below for more details.

But if there are instances where the CLI says the command succeeded yet scrubbing didn't happen, we need the exact timings to compare against. I was not able to match the timings from [1] to the scrubber logs, but all these failures are around the time the scrubber is initializing. Can we run the automation test with a sleep of 5 seconds added after the scrubber is started, and see if you still hit the issue?

[1] http://jenkins-rhs.lab.eng.blr.redhat.com:8080/job/auto-RHGS_Downstream_Regression_Bitrot_RHEL7_4_RHGS_3_8_brew/ws/distaf_tests_bitrot_glusterfs_10.log

SOSREPORT OF Comment 9: The logs clearly say the scrubber is in Pause/Inactive/Running state.

root@dhcp-0-80# find . -name cmd_history.log | xargs grep ondemand
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:07:32.502807] : v bitrot distrep scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:08:08.001503] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:09:48.556031] : v bitrot distrep scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:10:46.224744] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:40:54.856145] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:47:41.087389] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:47:52.579467] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:48:02.778093] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 06:02:21.804275] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021643/var/log/glusterfs/cmd_history.log:[2017-05-23 05:39:23.302703] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state

SOSREPORT OF Comment 11: Both times it failed, the scrubber was still coming up. Please compare the timings below with the scrubber logs.

root@dhcp-0-80# grep ondemand cmd_history.log
[2017-05-31 09:22:05.788918] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 09:41:23.383408] : volume bitrot testvol scrub ondemand : FAILED :
[2017-05-31 10:08:30.273021] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 10:27:37.480013] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 10:33:27.136219] : volume bitrot testvol scrub ondemand : FAILED :
[2017-05-31 11:32:58.767932] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 19:04:15.373722] : volume bitrot testvol scrub ondemand : SUCCESS

[2017-05-31 09:41:21.593591] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/lib/glusterd/scrub/run/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/0fc8642d18f8a025d5a9cc1aa4344112.socket --global-timer-wheel)
[2017-05-31 09:41:21.607443] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-31 09:41:21.607481] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-05-31 09:41:21.607499] I [glusterfsd-mgmt.c:2140:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2017-05-31 09:41:25.610427] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-testvol-bit-rot-0: Scaling up scrubbers [0 => 8]
[2017-05-31 09:41:25.610531] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-testvol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: aggressive]
[2017-05-31 09:41:25.610554] I [MSGID: 118031] [bit-rot.c:2010:init] 0-testvol-bit-rot-0: bit-rot xlator loaded in "SCRUBBER" mode
[2017-05-31 09:41:25.610847] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-0: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.611139] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-31 09:41:25.613826] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-4: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.614104] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-0: changing port to 49152 (from 0)
[2017-05-31 09:41:25.616723] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-8: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.619118] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-4: changing port to 49153 (from 0)
..
..
..
[2017-05-31 10:33:24.700462] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/lib/glusterd/scrub/run/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/0fc8642d18f8a025d5a9cc1aa4344112.socket --global-timer-wheel)
[2017-05-31 10:33:24.710902] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-31 10:33:24.710974] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-05-31 10:33:24.710997] I [glusterfsd-mgmt.c:2140:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2017-05-31 10:33:28.706414] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-testvol-bit-rot-0: Scaling up scrubbers [0 => 8]
[2017-05-31 10:33:28.707260] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-testvol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: aggressive]
[2017-05-31 10:33:28.707333] I [MSGID: 118031] [bit-rot.c:2010:init] 0-testvol-bit-rot-0: bit-rot xlator loaded in "SCRUBBER" mode
[2017-05-31 10:33:28.707609] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-31 10:33:28.707890] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-0: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.711210] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-4: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.711649] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-0: changing port to 49155 (from 0)
[2017-05-31 10:33:28.714554] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-8: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.717268] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-4: changing port to 49156 (from 0)
Final graph:
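The timing comparison above can be expressed as a quick script. The timestamps are copied from the cmd_history and scrubber logs in this comment; the 5-second initialization window is an assumption, chosen to match the proposed sleep.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Scrubber process start times, taken from the scrub.log entries above.
scrubber_starts = [
    datetime.strptime("2017-05-31 09:41:21.593591", FMT),
    datetime.strptime("2017-05-31 10:33:24.700462", FMT),
]

# The two FAILED 'scrub ondemand' attempts from cmd_history.log above.
failed_ondemand = [
    datetime.strptime("2017-05-31 09:41:23.383408", FMT),
    datetime.strptime("2017-05-31 10:33:27.136219", FMT),
]

INIT_WINDOW = timedelta(seconds=5)  # assumption: rough scrubber init window

def during_init(ts):
    """True if ts falls within INIT_WINDOW after some scrubber start."""
    return any(timedelta(0) <= ts - start <= INIT_WINDOW
               for start in scrubber_starts)

for ts in failed_ondemand:
    print(ts, "within init window:", during_init(ts))
# Both failures land 1.8s and 2.4s after a scrubber (re)start, i.e. while
# the daemon was still initializing.
```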
We are going about it the wrong way at the outset itself.

Firstly, we do not need to check the cmd_history logs only for the times that it _failed_. The command does report /success/ in the CLI, and hence every trigger of bitrot ondemand scrub has to be analysed in cmd_history.log.

Secondly, are you requesting to add a delay of 5 seconds after triggering a scrub ondemand, and then checking scrub status? If yes, that is already present in the logs themselves. The scrub status is checked repeatedly, during the entire duration of that test case:

1:23:00 Bitrot is enabled
1:25:32 Scrub throttle is set to aggressive
1:25:34 File contents are corrupted, after verifying that the file is signed
1:25:35 Scrub status is checked before triggering scrub ondemand
1:25:37 Scrub ondemand is triggered
1:25:39 Scrub status is checked after triggering scrub ondemand
1:26:41 ANOTHER TIME SCRUB STATUS IS CHECKED
1:36:45 ANOTHER TIME SCRUB STATUS IS CHECKED

>>>>> There is a gap of more than a minute between the trigger of scrub ondemand and a scrub-status check, and then a further gap of 10 minutes. Both times, 'State of scrub' is 'Active (Idle)' and 'Last completed scrub time' is 'Scrubber pending to complete'. Together, these imply that the scrubber was never triggered. Please find the supporting logs below.
2017-05-31 01:23:00,079 INFO run "gluster volume bitrot testvol enable" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:23:00,080 INFO run "gluster volume bitrot testvol enable" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is volume bitrot: success
2017-05-31 01:25:32,945 INFO run "gluster volume bitrot testvol scrub-throttle aggressive" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:32,945 INFO run "gluster volume bitrot testvol scrub-throttle aggressive" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is volume bitrot: success
2017-05-31 01:25:34,448 INFO run Corrupting the file /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt from server 10.70.40.128
2017-05-31 01:25:34,449 INFO run Executing echo test > /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt on 10.70.40.128
2017-05-31 01:25:34,463 INFO run "echo test > /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt" on 10.70.40.128: RETCODE is 0
2017-05-31 01:25:35,081 INFO trigger_scrubber_on_demand BITROT SCRUB STATUS Before Triggering scrub on demand:
2017-05-31 01:25:35,082 INFO run Executing gluster volume bitrot testvol scrub status on rhsauto053.lab.eng.blr.redhat.com
2017-05-31 01:25:36,587 INFO run "gluster volume bitrot testvol scrub status" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:37,983 INFO run "gluster volume bitrot testvol scrub ondemand" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:37,983 INFO run "gluster volume bitrot testvol scrub ondemand" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is volume bitrot: success
2017-05-31 01:25:37,984 INFO trigger_scrubber_on_demand BITROT SCRUB STATUS After Triggering scrub on demand:
2017-05-31 01:25:37,984 INFO run Executing gluster volume bitrot testvol scrub status on rhsauto053.lab.eng.blr.redhat.com
2017-05-31 01:25:39,337 INFO run "gluster volume bitrot testvol scrub status" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:26:39,371 INFO verify_scrubbing BITROT SCRUB STATUS Before starting the validation of scrubbing a bad file
2017-05-31 01:26:39,371 INFO run Executing gluster volume bitrot testvol scrub status on 10.70.40.128
2017-05-31 01:26:41,013 INFO run "gluster volume bitrot testvol scrub status" on 10.70.40.128: RETCODE is 0
2017-05-31 01:36:42,553 INFO run Executing gluster volume bitrot testvol scrub status on 10.70.40.128
2017-05-31 01:36:45,469 INFO run "gluster volume bitrot testvol scrub status" on 10.70.40.128: RETCODE is 0
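Whatever the root cause turns out to be, the automation check described above can be made more robust with a generic poll-until-complete helper. This is a sketch only: the status-fetching callable is a placeholder for whatever runs and parses the real 'gluster volume bitrot <vol> scrub status' output.

```python
import time

def wait_for_scrub_completion(get_status, timeout=600, interval=10):
    """Poll a scrub-status callable until it stops reporting the scrub as
    pending, or until timeout (seconds) expires.

    get_status is a placeholder: it should return the 'Last completed scrub
    time' field from 'scrub status' output as a string.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() != "Scrubber pending to complete.":
            return True   # a scrub run has completed at least once
        time.sleep(interval)
    return False          # still pending: the scrub was likely never triggered

# Example with a fake status source that flips after two polls:
responses = iter(["Scrubber pending to complete.",
                  "Scrubber pending to complete.",
                  "2017-05-31 01:27:10"])
print(wait_for_scrub_completion(lambda: next(responses), timeout=5, interval=0))
```

A helper like this distinguishes "scrubber slow to start" (status flips within the timeout) from "scrub never triggered" (status stays pending for the full window), which is exactly the ambiguity in the status checks above.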
After sitting together and debugging the logs, it was agreed that Kotresh will add a few extra log messages at certain places and trigger another run, which would in turn help in root-causing the issue.
Upstream patch: https://review.gluster.org/17552
Upstream Patches:
https://review.gluster.org/17552 (master)
https://review.gluster.org/17553 (3.10)
https://review.gluster.org/17555 (3.11)

Downstream Patch:
https://code.engineering.redhat.com/gerrit/109279
Tested and verified this on the build glusterfs-3.8.4-38. Added the necessary time delays in automation after every glusterd restart and bitrot enable. I no longer see the issue mentioned in the description. One scrub test, 'verify_scrubber_when_glusterd_down', did fail, and its error logs look similar; however, it will have to be run manually to ascertain the real issue. With all but one of the test cases passing in the latest run, moving this bug to verified in RHGS 3.3.0.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774