Bug 1454596
Summary: [Bitrot]: Inconsistency seen with 'scrub ondemand' - fails to trigger scrub

Product: [Red Hat Storage] Red Hat Gluster Storage
Component: bitrot
Version: rhgs-3.3
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Sweta Anandpara <sanandpa>
Assignee: Kotresh HR <khiremat>
QA Contact: Sweta Anandpara <sanandpa>
Docs Contact:
CC: amukherj, khiremat, rhinduja, rhs-bugs, sanandpa, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-29
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1461845 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:45:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417151, 1461845, 1462080, 1462127
Description
Sweta Anandpara
2017-05-23 06:53:43 UTC
[qe@rhsqe-repo 1454596]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ pwd
/home/repo/sosreports/1454596
[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ ll
total 144772
-rwxr-xr-x. 1 qe qe 39345624 May 23 12:17 sosreport-sysreg-prod-20170523021643_dhcp47_165.tar.xz
-rwxr-xr-x. 1 qe qe 38025452 May 23 12:17 sosreport-sysreg-prod-20170523021649_dhcp47_164.tar.xz
-rwxr-xr-x. 1 qe qe 35709660 May 23 12:17 sosreport-sysreg-prod-20170523021653_dhcp47_162.tar.xz
-rwxr-xr-x. 1 qe qe 35158712 May 23 12:17 sosreport-sysreg-prod-20170523021659_dhcp47_157.tar.xz
[qe@rhsqe-repo 1454596]$

CLI logs of the node where this took place:

[root@dhcp47-164 ~]# gluster v bitrot ozone scrub ondemand
volume bitrot: success
[root@dhcp47-164 ~]# gluster v bitrot ozone scrub status

Volume name : ozone
State of scrub: Active (Idle)
Scrub impact: aggressive
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-165.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-157.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-162.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================

[root@dhcp47-164 ~]# gluster v bitrot ozone scrub status

Volume name : ozone
State of scrub: Active (Idle)
Scrub impact: aggressive
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-162.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-157.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: dhcp47-165.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: Scrubber pending to complete.
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================

[root@dhcp47-164 ~]# vim /var/log/glusterfs/scrub.log
[root@dhcp47-164 ~]# vim /var/log/glusterfs/bitd.log
[root@dhcp47-164 ~]#

Pasted below is a snippet of glusterd logs:

[2017-05-23 06:01:33.109798] I [MSGID: 106567] [glusterd-svc-mgmt.c:196:glusterd_svc_start] 0-management: Starting scrub service
[2017-05-23 06:01:42.387551] I [MSGID: 106499] [glusterd-handler.c:4369:__glusterd_handle_status_volume] 0-management: Received status volume req for volume ozone
[2017-05-23 06:02:13.908230] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:02:21.799744] W [MSGID: 106387] [glusterd-op-sm.c:5093:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2017-05-23 06:02:24.606063] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:02:33.604024] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:03:11.761096] I [MSGID: 106488] [glusterd-handler.c:1539:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2017-05-23 06:04:07.142209] E [MSGID: 106062] [glusterd-utils.c:9434:glusterd_volume_bitrot_scrub_use_rsp_dict] 0-management: failed to get count value
[2017-05-23 06:06:44.300978] I [MSGID: 106062] [glusterd-volume-ops.c:2617:glusterd_op_start_volume] 0-management: Global dict not present.

Sosreports have not been collected (in the past) for the multiple times this issue has been hit in automation, as the automation framework does not have that capability built in yet. I have triggered another run of the bitrot scripts; I will monitor it and manually log in to the VMs as soon as the run ends to collect the sosreports, and will update this BZ with the details once the run is complete. Keeping the need-info on me until I upload the sosreports.

I realise that I have hit this issue manually as well, and the sosreports of that run have been copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>. The cmd_history log should show the time when 'scrub ondemand' was run, and the corresponding gluster logs of that time can be analysed.

[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ pwd
/home/repo/sosreports/1454596
[qe@rhsqe-repo 1454596]$
[qe@rhsqe-repo 1454596]$ ll
total 144772
-rwxr-xr-x. 1 qe qe 39345624 May 23 12:17 sosreport-sysreg-prod-20170523021643_dhcp47_165.tar.xz
-rwxr-xr-x. 1 qe qe 38025452 May 23 12:17 sosreport-sysreg-prod-20170523021649_dhcp47_164.tar.xz
-rwxr-xr-x. 1 qe qe 35709660 May 23 12:17 sosreport-sysreg-prod-20170523021653_dhcp47_162.tar.xz
-rwxr-xr-x. 1 qe qe 35158712 May 23 12:17 sosreport-sysreg-prod-20170523021659_dhcp47_157.tar.xz
[qe@rhsqe-repo 1454596]$

Comment 2 already has the sosreports' location pasted there :) Comment 3 gives the node (whose cmd_history has to be checked) where the failure occurred. Comment 4 gives the glusterd logs, from which we can get the _time_ in and around which the debugging has to be done.
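To make the comparison described above concrete, here is a minimal, illustrative sketch (not part of the original report): the log paths are the ones shown in this bug, and the timestamp to search for is whatever cmd_history.log records for the ondemand trigger.

```sh
# On the node where 'scrub ondemand' was issued (dhcp47-164 in this report):
# find when the command was run, then inspect the scrubber/bitd logs around it.
grep 'scrub ondemand' /var/log/glusterfs/cmd_history.log
# e.g. if the trigger was at 2017-05-23 06:02, look at the surrounding entries:
grep '2017-05-23 06:0' /var/log/glusterfs/scrub.log /var/log/glusterfs/bitd.log
```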
The automation run results are out, and the sosreports of all the nodes after the entire run are copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1454596/new/

By comparing the ondemand failure timings with the scrubber logs, it looks like this is not a bug. The failures occur because the scrubber is either still coming up or is in the Pause/Inactive/Running state; please see the logs below for more details. If there are instances where the CLI says the command succeeded but scrubbing did not happen, we need the exact timings to compare against. I could not match the timings from [1] to the scrubber logs, but all of these failures are around the time the scrubber is initializing. Can we run the automation test with a sleep of 5 seconds added after the scrubber is started and see whether you still hit the issue?

[1] http://jenkins-rhs.lab.eng.blr.redhat.com:8080/job/auto-RHGS_Downstream_Regression_Bitrot_RHEL7_4_RHGS_3_8_brew/ws/distaf_tests_bitrot_glusterfs_10.log

SOSREPORT OF Comment 9: The logs clearly say that the scrubber is in the Pause/Inactive/Running state.

root@dhcp-0-80# find . -name cmd_history.log | xargs grep ondemand
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:07:32.502807] : v bitrot distrep scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:08:08.001503] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:09:48.556031] : v bitrot distrep scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021659/var/log/glusterfs/cmd_history.log:[2017-05-23 06:10:46.224744] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:40:54.856145] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:47:41.087389] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:47:52.579467] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 05:48:02.778093] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state
./sosreport-sysreg-prod-20170523021649/var/log/glusterfs/cmd_history.log:[2017-05-23 06:02:21.804275] : v bitrot ozone scrub ondemand : SUCCESS
./sosreport-sysreg-prod-20170523021643/var/log/glusterfs/cmd_history.log:[2017-05-23 05:39:23.302703] : v bitrot ozone scrub ondemand : FAILED : Scrubber is in Pause/Inactive/Running state

SOSREPORT OF Comment 11: Both times it failed, the scrubber was still coming up. Please compare the timings below with the scrubber logs.
root@dhcp-0-80# grep ondemand cmd_history.log
[2017-05-31 09:22:05.788918] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 09:41:23.383408] : volume bitrot testvol scrub ondemand : FAILED :
[2017-05-31 10:08:30.273021] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 10:27:37.480013] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 10:33:27.136219] : volume bitrot testvol scrub ondemand : FAILED :
[2017-05-31 11:32:58.767932] : volume bitrot testvol scrub ondemand : SUCCESS
[2017-05-31 19:04:15.373722] : volume bitrot testvol scrub ondemand : SUCCESS

[2017-05-31 09:41:21.593591] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/lib/glusterd/scrub/run/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/0fc8642d18f8a025d5a9cc1aa4344112.socket --global-timer-wheel)
[2017-05-31 09:41:21.607443] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-31 09:41:21.607481] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-05-31 09:41:21.607499] I [glusterfsd-mgmt.c:2140:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2017-05-31 09:41:25.610427] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-testvol-bit-rot-0: Scaling up scrubbers [0 => 8]
[2017-05-31 09:41:25.610531] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-testvol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: aggressive]
[2017-05-31 09:41:25.610554] I [MSGID: 118031] [bit-rot.c:2010:init] 0-testvol-bit-rot-0: bit-rot xlator loaded in "SCRUBBER" mode
[2017-05-31 09:41:25.610847] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-0: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.611139] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-31 09:41:25.613826] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-4: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.614104] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-0: changing port to 49152 (from 0)
[2017-05-31 09:41:25.616723] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-8: parent translators are ready, attempting connect on transport
[2017-05-31 09:41:25.619118] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-4: changing port to 49153 (from 0)
..
..
[2017-05-31 10:33:24.700462] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/lib/glusterd/scrub/run/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/0fc8642d18f8a025d5a9cc1aa4344112.socket --global-timer-wheel)
[2017-05-31 10:33:24.710902] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-31 10:33:24.710974] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-05-31 10:33:24.710997] I [glusterfsd-mgmt.c:2140:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2017-05-31 10:33:28.706414] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-testvol-bit-rot-0: Scaling up scrubbers [0 => 8]
[2017-05-31 10:33:28.707260] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-testvol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: aggressive]
[2017-05-31 10:33:28.707333] I [MSGID: 118031] [bit-rot.c:2010:init] 0-testvol-bit-rot-0: bit-rot xlator loaded in "SCRUBBER" mode
[2017-05-31 10:33:28.707609] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-31 10:33:28.707890] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-0: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.711210] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-4: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.711649] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-0: changing port to 49155 (from 0)
[2017-05-31 10:33:28.714554] I [MSGID: 114020] [client.c:2356:notify] 0-testvol-client-8: parent translators are ready, attempting connect on transport
[2017-05-31 10:33:28.717268] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-testvol-client-4: changing port to 49156 (from 0)
Final graph:
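(As an aside on the 5-second-sleep suggestion above: a rough, illustrative alternative would be to wait until the scrub daemon is actually up before issuing the ondemand trigger, rather than sleeping for a fixed interval. This is only a sketch, not the automation framework's actual helper; the pid-file path is taken from the scrubber command line in the log above, and 'testvol' is the volume used in these runs.)

```sh
# Wait (up to ~30s) for the scrub daemon to come up, then trigger the scrub.
PIDFILE=/var/lib/glusterd/scrub/run/scrub.pid
for i in $(seq 1 30); do
    [ -s "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null && break
    sleep 1
done
gluster volume bitrot testvol scrub ondemand
```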
We are going about this analysis the wrong way, at the outset itself.

Firstly, we should not be checking the cmd_history logs only for the times the command _failed_. The command reports /success/ in the CLI, and hence every trigger of bitrot ondemand scrub has to be analysed in the cmd_history logs.
Secondly, are you requesting that a delay of 5 seconds be added after triggering scrub ondemand, before checking scrub status? If so, that is already present in the logs: the scrub status is checked repeatedly, throughout the entire duration of that test case. The timeline from the automation log is as follows:
1:23:00 Bitrot is enabled.
1:25:32 Scrub throttle is set to aggressive
1:25:34 File contents are corrupted, after verifying that the file is signed
1:25:35 Scrub status is checked before triggering scrub ondemand
1:25:37 Scrub ondemand is triggered
1:25:39 Scrub status is checked after triggering scrub ondemand
1:26:41 ANOTHER TIME SCRUB STATUS IS CHECKED
1:36:45 ANOTHER TIME SCRUB STATUS IS CHECKED
>>>>> There is a gap of more than a minute between the trigger of scrub ondemand and the scrub-status check, and then a further gap of 10 minutes. Both times, the state of scrub is reported as 'Active (Idle)' and 'Last completed scrub time' as 'Scrubber pending to complete'.
Taken together, these two observations imply that the scrubber was never triggered.
Please find the supporting logs below.
2017-05-31 01:23:00,079 INFO run "gluster volume bitrot testvol enable" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:23:00,080 INFO run "gluster volume bitrot testvol enable" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is
volume bitrot: success
2017-05-31 01:25:32,945 INFO run "gluster volume bitrot testvol scrub-throttle aggressive" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:32,945 INFO run "gluster volume bitrot testvol scrub-throttle aggressive" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is
volume bitrot: success
2017-05-31 01:25:34,448 INFO run Corrupting the file /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt from server 10.70.40.128
2017-05-31 01:25:34,449 INFO run Executing echo test > /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt on 10.70.40.128
2017-05-31 01:25:34,463 INFO run "echo test > /bricks/brick0/testvol_brick0/glusterfs_file_0_1496208180.txt" on 10.70.40.128: RETCODE is 0
2017-05-31 01:25:35,081 INFO trigger_scrubber_on_demand BITROT SCRUB STATUS Before Triggering scrub on demand:
2017-05-31 01:25:35,082 INFO run Executing gluster volume bitrot testvol scrub status on rhsauto053.lab.eng.blr.redhat.com
2017-05-31 01:25:36,587 INFO run "gluster volume bitrot testvol scrub status" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:37,983 INFO run "gluster volume bitrot testvol scrub ondemand" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:25:37,983 INFO run "gluster volume bitrot testvol scrub ondemand" on rhsauto053.lab.eng.blr.redhat.com: STDOUT is
volume bitrot: success
2017-05-31 01:25:37,984 INFO trigger_scrubber_on_demand BITROT SCRUB STATUS After Triggering scrub on demand:
2017-05-31 01:25:37,984 INFO run Executing gluster volume bitrot testvol scrub status on rhsauto053.lab.eng.blr.redhat.com
2017-05-31 01:25:39,337 INFO run "gluster volume bitrot testvol scrub status" on rhsauto053.lab.eng.blr.redhat.com: RETCODE is 0
2017-05-31 01:26:39,371 INFO verify_scrubbing BITROT SCRUB STATUS Before starting the validation of scrubbing a bad file
2017-05-31 01:26:39,371 INFO run Executing gluster volume bitrot testvol scrub status on 10.70.40.128
2017-05-31 01:26:41,013 INFO run "gluster volume bitrot testvol scrub status" on 10.70.40.128: RETCODE is 0
2017-05-31 01:36:42,553 INFO run Executing gluster volume bitrot testvol scrub status on 10.70.40.128
2017-05-31 01:36:45,469 INFO run "gluster volume bitrot testvol scrub status" on 10.70.40.128: RETCODE is 0
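For illustration only, here is a sketch of the kind of post-trigger check the timeline above performs, expressed as a polling loop. The volume name 'testvol' and the 'Scrubber pending to complete' string are taken from the logs in this report; the loop bounds and sleep interval are assumptions.

```sh
# Poll scrub status after 'scrub ondemand'; if every check still reports
# "Scrubber pending to complete" after the timeout, the scrub never ran,
# which is the inconsistency this bug describes.
for i in $(seq 1 60); do
    out=$(gluster volume bitrot testvol scrub status)
    echo "$out" | grep -q 'Scrubber pending to complete' || break
    sleep 10
done
echo "$out"
```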
After sitting together and debugging the logs, it was agreed that Kotresh will add a few extra log messages at certain places and trigger another run, which would in turn help in RCA'ing the issue.

Upstream patches:
https://review.gluster.org/17552 (master)
https://review.gluster.org/17553 (3.10)
https://review.gluster.org/17555 (3.11)

Downstream patch: https://code.engineering.redhat.com/gerrit/109279

Tested and verified this on the build glusterfs-3.8.4-38, after adding the necessary time delays in automation after every glusterd restart and bitrot enable. I do not see the issue mentioned in the description. One scrub test, 'verify_scrubber_when_glusterd_down', did fail, and its error logs look similar; however, it will have to be run manually to ascertain the real issue. With all but one of the test cases passing in the latest run, moving this bug to verified in RHGS 3.3.0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774