Description of problem:
On a three-node cluster with server quorum enabled on a replicated volume: performed add-brick, stopped glusterd on one node, then started rebalance on the volume:

gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

Rebalance start succeeds even though quorum is not met.

Version-Release number of selected component (if applicable):
glusterfs-server-6.0-15.el7rhgs.x86_64

How reproducible:
2/2

Steps to Reproduce:
1. On a three-node cluster, create a 1x3 replicate volume.
2. Set "cluster.server-quorum-type" to server and set the quorum ratio to 90.
3. Perform add-brick (3 bricks).
4. Stop glusterd on one node.
5. Perform rebalance start.

Actual results:
gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

Rebalance start is successful when quorum is not met.

Expected results:
Rebalance start should not succeed when quorum is not met.

Additional info:

#### gluster vol info

[root@dhcp35-11 ~]# gluster vol info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: c9822762-7dac-47bd-8645-9cfee3d02b00
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.11:/bricks/brick4/testvol
Brick2: 10.70.35.7:/bricks/brick4/testvol
Brick3: 10.70.35.73:/bricks/brick4/testvol
Brick4: 10.70.35.73:/bricks/brick4/ht
Brick5: 10.70.35.11:/bricks/brick4/ht
Brick6: 10.70.35.7:/bricks/brick4/ht
Options Reconfigured:
cluster.server-quorum-type: server
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 90

#### gluster vol status

gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.11:/bricks/brick4/testvol    49152     0          Y       11039
Brick 10.70.35.7:/bricks/brick4/testvol     49152     0          Y       27266
Brick 10.70.35.73:/bricks/brick4/testvol    49152     0          Y       10746
Brick 10.70.35.73:/bricks/brick4/ht         49153     0          Y       11028
Brick 10.70.35.11:/bricks/brick4/ht         49153     0          Y       11338
Brick 10.70.35.7:/bricks/brick4/ht          49153     0          Y       27551
Self-heal Daemon on localhost               N/A       N/A        Y       11363
Self-heal Daemon on 10.70.35.73             N/A       N/A        Y       11053
Self-heal Daemon on dhcp35-7.lab.eng.blr.redhat.com
                                            N/A       N/A        Y       27577

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

#### After stopping glusterd on one node, volume status

[root@dhcp35-11 ~]# gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.11:/bricks/brick4/testvol    N/A       N/A        N       N/A
Brick 10.70.35.7:/bricks/brick4/testvol     N/A       N/A        N       N/A
Brick 10.70.35.11:/bricks/brick4/ht         N/A       N/A        N       N/A
Brick 10.70.35.7:/bricks/brick4/ht          N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       11363
Self-heal Daemon on dhcp35-7.lab.eng.blr.redhat.com
                                            N/A       N/A        Y       27577

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

[root@dhcp35-11 ~]# gluster vol rebalance testvol status
                           Node  Rebalanced-files    size    scanned  failures  skipped   status  run time in h:m:s
                      ---------       -----------  ------  ---------  --------  -------  -------  -----------------
dhcp35-7.lab.eng.blr.redhat.com                 0  0Bytes          0         0        0   failed            0:00:00
                      localhost                 0  0Bytes          0         0        0   failed            0:00:00
volume rebalance: testvol: success

#### glusterd log after stopping glusterd on one of the nodes

[2019-10-10 09:19:00.361314] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <10.70.35.73> (<53117ee2-5182-42c6-8c74-26f43b075a0c>), in state <Peer in Cluster>, has disconnected from glusterd.
[2019-10-10 09:19:00.361553] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24f6a) [0x7fe6a4b4df6a] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2f790) [0x7fe6a4b58790] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf3883) [0x7fe6a4c1c883] ) 0-management: Lock for vol testvol not held
[2019-10-10 09:19:00.361570] W [MSGID: 106117] [glusterd-handler.c:6542:__glusterd_peer_rpc_notify] 0-management: Lock not released for testvol
[2019-10-10 09:19:00.361607] C [MSGID: 106002] [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume testvol.
Stopping local bricks.
[2019-10-10 09:19:00.361825] I [MSGID: 106542] [glusterd-utils.c:8775:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 11039
[2019-10-10 09:19:01.362068] I [socket.c:871:__socket_shutdown] 0-management: intentional socket shutdown(16)
[2019-10-10 09:19:01.362680] I [MSGID: 106542] [glusterd-utils.c:8775:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 11338
[2019-10-10 09:19:02.362982] I [socket.c:871:__socket_shutdown] 0-management: intentional socket shutdown(20)
[2019-10-10 09:19:02.363239] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /bricks/brick4/testvol on port 49152
[2019-10-10 09:19:02.368590] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /bricks/brick4/ht on port 49153
[2019-10-10 09:19:02.375567] I [MSGID: 106499] [glusterd-handler.c:4502:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol
[2019-10-10 09:19:25.717254] I [MSGID: 106539] [glusterd-utils.c:12461:glusterd_generate_and_set_task_id] 0-management: Generated task-id 86cfc8b1-1e24-4244-b8e0-6941f4684234 for key rebalance-id
[2019-10-10 09:19:30.751060] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-10-10 09:19:30.751284] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:35.761694] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:35.767505] I [MSGID: 106172] [glusterd-handshake.c:1085:__server_event_notify] 0-glusterd: received defrag status updated
[2019-10-10 09:19:35.773243] I [MSGID: 106007] [glusterd-rebalance.c:153:__glusterd_defrag_notify] 0-management: Rebalance process for volume testvol has disconnected.
[2019-10-10 09:19:39.436119] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:39.436978] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:31:36.682991] I [MSGID: 106488] [glusterd-handler.c:1564:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2019-10-10 09:31:36.684006] I [MSGID: 106488] [glusterd-handler.c:1564:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
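The quorum arithmetic behind this report can be sketched as follows (a minimal illustration, not glusterd's actual implementation): with cluster.server-quorum-ratio set to 90, stopping glusterd on one of three nodes leaves 2/3 (about 66%) of the peers active, which is below the configured ratio, so server quorum is lost and rebalance start should be rejected.

```python
# Minimal sketch of a server-quorum check (illustrative only;
# not glusterd's actual code or API).
def server_quorum_met(active_peers: int, total_peers: int,
                      ratio_percent: float) -> bool:
    """Quorum is met when the fraction of reachable peers
    meets or exceeds the configured quorum ratio."""
    return (active_peers / total_peers) * 100 >= ratio_percent

# Three-node cluster, cluster.server-quorum-ratio set to 90:
print(server_quorum_met(3, 3, 90))  # True  - all glusterds up
print(server_quorum_met(2, 3, 90))  # False - one glusterd stopped (2/3 < 90%)
```

With these numbers, any single-node outage on a three-node cluster drops the active ratio below 90, which is why the rebalance start in the log above should have been refused.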
Pushed https://review.gluster.org/#/c/glusterfs/+/23536 upstream to address this issue.
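The fix presumably gates the rebalance-start path on a server-quorum validation, the way other volume operations are already gated. A hypothetical sketch of that gating logic, assuming illustrative names (these are not glusterd's actual functions or error strings):

```python
# Hypothetical sketch of rejecting rebalance start when server quorum
# is lost. Names and the error message are illustrative only.
class QuorumNotMetError(Exception):
    pass

def validate_rebalance_start(active_peers: int, total_peers: int,
                             quorum_type: str, ratio_percent: float) -> None:
    """Fail the operation up front when server quorum is enabled
    and the active-peer ratio is below the configured threshold."""
    if quorum_type != "server":
        return  # server-quorum enforcement not enabled for this volume
    if (active_peers / total_peers) * 100 < ratio_percent:
        raise QuorumNotMetError("volume rebalance: start failed: quorum not met")

# With one of three glusterds stopped and a ratio of 90, start is rejected:
try:
    validate_rebalance_start(2, 3, "server", 90)
except QuorumNotMetError as err:
    print(err)
```

The point of checking before generating a task-id is that the CLI then reports a clean failure instead of the "success" followed by per-node "failed" rebalance status seen in the report above.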
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3249
Hi Konda: How did you solve the problem? I get the same "failed to get index from rsp dict" log messages when I run a rebalance command. I need your help, thanks.