Bug 1760261 - rebalance start is succeeding when quorum is not met
Summary: rebalance start is succeeding when quorum is not met
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.5
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.5.0
Assignee: Sanju
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On:
Blocks: 1696809 1760467
 
Reported: 2019-10-10 09:35 UTC by Bala Konda Reddy M
Modified: 2020-12-27 07:42 UTC (History)
8 users

Fixed In Version: glusterfs-6.0-17
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1760467 (view as bug list)
Environment:
Last Closed: 2019-10-30 12:23:03 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2019:3249 0 None None None 2019-10-30 12:23:26 UTC

Description Bala Konda Reddy M 2019-10-10 09:35:40 UTC
Description of problem:
On a three-node cluster with server quorum enabled on a replicated volume, I performed an add-brick, stopped glusterd on one node, and then started rebalance on the volume.

gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

Rebalance start is succeeding when quorum is not met.


Version-Release number of selected component (if applicable):
glusterfs-server-6.0-15.el7rhgs.x86_64

How reproducible:
2/2

Steps to Reproduce:
1. On a three-node cluster, create a 1x3 replicate volume.
2. Set "cluster.server-quorum-type" to server and set "cluster.server-quorum-ratio" to 90.
3. Perform add-brick (3 bricks).
4. Stop glusterd on one node.
5. Run rebalance start on the volume.
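For reference, a minimal sketch of the server-quorum arithmetic under these settings. The 2-of-3 active-peer count follows from step 4; treating quorum as "active percentage >= ratio" is an assumption about how glusterd tallies it, made for illustration:

```shell
#!/usr/bin/env bash
# With cluster.server-quorum-ratio set to 90, stopping glusterd on one of
# three nodes leaves 2/3 of peers active (about 66%), below the ratio.
active=2 total=3 ratio=90
if (( active * 100 >= ratio * total )); then
  echo "quorum met"
else
  echo "quorum not met"   # 200 < 270, so volume ops should be refused
fi
```

So with these settings, any quorum-gated volume operation, rebalance start included, should be rejected on this cluster.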

Actual results:

 gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

Rebalance start succeeds even though quorum is not met.

Expected results:
Rebalance start should fail when quorum is not met.


Additional info:

#### gluster vol info
[root@dhcp35-11 ~]# gluster vol info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: c9822762-7dac-47bd-8645-9cfee3d02b00
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.11:/bricks/brick4/testvol
Brick2: 10.70.35.7:/bricks/brick4/testvol
Brick3: 10.70.35.73:/bricks/brick4/testvol
Brick4: 10.70.35.73:/bricks/brick4/ht
Brick5: 10.70.35.11:/bricks/brick4/ht
Brick6: 10.70.35.7:/bricks/brick4/ht
Options Reconfigured:
cluster.server-quorum-type: server
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 90



#### gluster vol status 

 gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.11:/bricks/brick4/testvol    49152     0          Y       11039
Brick 10.70.35.7:/bricks/brick4/testvol     49152     0          Y       27266
Brick 10.70.35.73:/bricks/brick4/testvol    49152     0          Y       10746
Brick 10.70.35.73:/bricks/brick4/ht         49153     0          Y       11028
Brick 10.70.35.11:/bricks/brick4/ht         49153     0          Y       11338
Brick 10.70.35.7:/bricks/brick4/ht          49153     0          Y       27551
Self-heal Daemon on localhost               N/A       N/A        Y       11363
Self-heal Daemon on 10.70.35.73             N/A       N/A        Y       11053
Self-heal Daemon on dhcp35-7.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       27577
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

#### gluster vol status after stopping glusterd on one node

[root@dhcp35-11 ~]# gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.11:/bricks/brick4/testvol    N/A       N/A        N       N/A  
Brick 10.70.35.7:/bricks/brick4/testvol     N/A       N/A        N       N/A  
Brick 10.70.35.11:/bricks/brick4/ht         N/A       N/A        N       N/A  
Brick 10.70.35.7:/bricks/brick4/ht          N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       11363
Self-heal Daemon on dhcp35-7.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       27577
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks
 


 gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234
[root@dhcp35-11 ~]# gluster vol rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
         dhcp35-7.lab.eng.blr.redhat.com                0        0Bytes             0             0             0               failed        0:00:00
                               localhost                0        0Bytes             0             0             0               failed        0:00:00
volume rebalance: testvol: success


#### glusterd log after stopping glusterd on one of the nodes
[2019-10-10 09:19:00.361314] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <10.70.35.73> (<53117ee2-5182-42c6-8c74-26f43b075a0c>), in state <Peer in Cluster>, has disconnected from glusterd.
[2019-10-10 09:19:00.361553] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24f6a) [0x7fe6a4b4df6a] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2f790) [0x7fe6a4b58790] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf3883) [0x7fe6a4c1c883] ) 0-management: Lock for vol testvol not held
[2019-10-10 09:19:00.361570] W [MSGID: 106117] [glusterd-handler.c:6542:__glusterd_peer_rpc_notify] 0-management: Lock not released for testvol
[2019-10-10 09:19:00.361607] C [MSGID: 106002] [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume testvol. Stopping local bricks.
[2019-10-10 09:19:00.361825] I [MSGID: 106542] [glusterd-utils.c:8775:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 11039
[2019-10-10 09:19:01.362068] I [socket.c:871:__socket_shutdown] 0-management: intentional socket shutdown(16)
[2019-10-10 09:19:01.362680] I [MSGID: 106542] [glusterd-utils.c:8775:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 11338
[2019-10-10 09:19:02.362982] I [socket.c:871:__socket_shutdown] 0-management: intentional socket shutdown(20)
[2019-10-10 09:19:02.363239] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /bricks/brick4/testvol on port 49152
[2019-10-10 09:19:02.368590] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /bricks/brick4/ht on port 49153
[2019-10-10 09:19:02.375567] I [MSGID: 106499] [glusterd-handler.c:4502:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol
[2019-10-10 09:19:25.717254] I [MSGID: 106539] [glusterd-utils.c:12461:glusterd_generate_and_set_task_id] 0-management: Generated task-id 86cfc8b1-1e24-4244-b8e0-6941f4684234 for key rebalance-id
[2019-10-10 09:19:30.751060] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-10-10 09:19:30.751284] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:35.761694] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:35.767505] I [MSGID: 106172] [glusterd-handshake.c:1085:__server_event_notify] 0-glusterd: received defrag status updated
[2019-10-10 09:19:35.773243] I [MSGID: 106007] [glusterd-rebalance.c:153:__glusterd_defrag_notify] 0-management: Rebalance process for volume testvol has disconnected.
[2019-10-10 09:19:39.436119] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:39.436978] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:31:36.682991] I [MSGID: 106488] [glusterd-handler.c:1564:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2019-10-10 09:31:36.684006] I [MSGID: 106488] [glusterd-handler.c:1564:__glusterd_handle_cli_get_volume] 0-management: Received get vol req

Comment 5 Sanju 2019-10-10 15:24:20 UTC
Pushed https://review.gluster.org/#/c/glusterfs/+/23536 upstream to address this issue.
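The upstream patch gates rebalance start on the same server-quorum validation applied to other volume operations. A hypothetical sketch of that guard follows; the function name, inputs, and failure message are illustrative and not glusterd's actual code (see the review link above for the real change):

```shell
#!/usr/bin/env bash
# Hypothetical guard: refuse rebalance start unless server quorum is met.
# active/total/ratio stand in for glusterd's peer state; names are made up.
rebalance_start() {
  local active=$1 total=$2 ratio=$3
  if (( active * 100 < ratio * total )); then
    echo "volume rebalance: failed: Quorum not met. Volume operation not allowed."
    return 1
  fi
  echo "volume rebalance: success"
}

rebalance_start 2 3 90 || true  # 2 of 3 peers at ratio 90: refused
rebalance_start 3 3 90          # all peers up: allowed
```

With such a guard in place, the reproduction steps above should yield a failure at step 5 instead of the spurious success reported here.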

Comment 13 errata-xmlrpc 2019-10-30 12:23:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249

Comment 14 james-tang 2020-12-27 07:42:22 UTC
Hi Konda,
How did you solve the problem? I am getting the log message "failed to get index from rsp dict" when I run a rebalance command.
I need your help!
Thanks

