Bug 1760467 - rebalance start is succeeding when quorum is not met
Summary: rebalance start is succeeding when quorum is not met
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On: 1760261
Blocks:
 
Reported: 2019-10-10 15:20 UTC by Sanju
Modified: 2020-01-09 12:45 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1760261
Environment:
Last Closed: 2019-10-11 00:21:22 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
System ID: Gluster.org Gerrit 23536
Private: 0
Priority: None
Status: Merged
Summary: glusterd: rebalance start should fail when quorum is not met
Last Updated: 2019-10-11 00:21:20 UTC

Description Sanju 2019-10-10 15:20:25 UTC
+++ This bug was initially created as a clone of Bug #1760261 +++

Description of problem:
On a three-node cluster with server quorum enabled on a replicated volume, an add-brick was performed, glusterd was stopped on one node, and rebalance was then started on the volume.

gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

Rebalance start is succeeding when quorum is not met.


Version-Release number of selected component (if applicable):
glusterfs-server-6.0-15.el7rhgs.x86_64

How reproducible:
2/2

Steps to Reproduce:
1. On a three-node cluster, create a 1x3 replicate volume.
2. Set "cluster.server-quorum-type" to server and set the quorum ratio to 90.
3. Perform add-brick (3 bricks).
4. Stop glusterd on one node.
5. Run rebalance start (a consolidated shell sketch of these steps follows the list).
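
A minimal shell sketch of the reproduction, assuming three peers are already in the trusted storage pool; the node names and brick paths below are placeholders, not the ones from this report:

# Create and start a 1x3 replicate volume (placeholder hosts/paths).
gluster volume create testvol replica 3 \
    node1:/bricks/brick1/testvol node2:/bricks/brick1/testvol node3:/bricks/brick1/testvol
gluster volume start testvol

# Enable server-side quorum and raise the cluster-wide ratio to 90%.
gluster volume set testvol cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 90

# Expand to 2x3 by adding a second replica set of 3 bricks.
gluster volume add-brick testvol \
    node1:/bricks/brick2/testvol node2:/bricks/brick2/testvol node3:/bricks/brick2/testvol

# Break server quorum: stop glusterd on one node. With a 90% ratio,
# 2 of 3 running glusterds (~67%) is below the threshold.
systemctl stop glusterd        # run this on node3

# Bug: on the affected build this still succeeds.
gluster volume rebalance testvol start

With a 90% ratio on a three-node pool, losing any single glusterd leaves only about 67% of the servers up, so server quorum is lost and the local bricks are stopped, as the glusterd log later in this report confirms.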

Actual results:

 gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

Rebalance start succeeds even though quorum is not met.

Expected results:
Rebalance start should fail when quorum is not met.
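
A hedged verification sketch for a build that includes the fix from Gerrit change 23536; the exact error wording is an assumption and may differ between versions:

# With glusterd still stopped on one of the three nodes:
gluster volume rebalance testvol start
# Expected on a fixed build: the command is rejected with a quorum-related
# error (message text is version-dependent) and no rebalance task-id is created.
gluster volume status testvol
# Expected: "Task Status of Volume testvol" still reports no active volume tasks.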


Additional info:

#### gluster vol info
[root@dhcp35-11 ~]# gluster vol info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: c9822762-7dac-47bd-8645-9cfee3d02b00
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.11:/bricks/brick4/testvol
Brick2: 10.70.35.7:/bricks/brick4/testvol
Brick3: 10.70.35.73:/bricks/brick4/testvol
Brick4: 10.70.35.73:/bricks/brick4/ht
Brick5: 10.70.35.11:/bricks/brick4/ht
Brick6: 10.70.35.7:/bricks/brick4/ht
Options Reconfigured:
cluster.server-quorum-type: server
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 90



#### gluster vol status 

 gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.11:/bricks/brick4/testvol    49152     0          Y       11039
Brick 10.70.35.7:/bricks/brick4/testvol     49152     0          Y       27266
Brick 10.70.35.73:/bricks/brick4/testvol    49152     0          Y       10746
Brick 10.70.35.73:/bricks/brick4/ht         49153     0          Y       11028
Brick 10.70.35.11:/bricks/brick4/ht         49153     0          Y       11338
Brick 10.70.35.7:/bricks/brick4/ht          49153     0          Y       27551
Self-heal Daemon on localhost               N/A       N/A        Y       11363
Self-heal Daemon on 10.70.35.73             N/A       N/A        Y       11053
Self-heal Daemon on dhcp35-7.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       27577
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

#### After stopping glusterd on one node volume status

[root@dhcp35-11 ~]# gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.11:/bricks/brick4/testvol    N/A       N/A        N       N/A  
Brick 10.70.35.7:/bricks/brick4/testvol     N/A       N/A        N       N/A  
Brick 10.70.35.11:/bricks/brick4/ht         N/A       N/A        N       N/A  
Brick 10.70.35.7:/bricks/brick4/ht          N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       11363
Self-heal Daemon on dhcp35-7.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       27577
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks
 


 gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234
[root@dhcp35-11 ~]# gluster vol rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
         dhcp35-7.lab.eng.blr.redhat.com                0        0Bytes             0             0             0               failed        0:00:00
                               localhost                0        0Bytes             0             0             0               failed        0:00:00
volume rebalance: testvol: success
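
The rebalance daemons report "failed" here, consistent with the local bricks having been stopped when server quorum was lost. A small sketch, run on one of the surviving nodes, of how the quorum configuration and peer state can be confirmed at this point:

# Confirm the configured server-quorum settings for the volume.
gluster volume get testvol cluster.server-quorum-type
gluster volume get testvol cluster.server-quorum-ratio

# Confirm that one peer is disconnected (2 of 3 glusterds up, below the 90% ratio).
gluster peer status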


### glusterd log after stopping glusterd on one of the nodes
[2019-10-10 09:19:00.361314] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <10.70.35.73> (<53117ee2-5182-42c6-8c74-26f43b075a0c>), in state <Peer in Cluster>, has disconnected from glusterd.
[2019-10-10 09:19:00.361553] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24f6a) [0x7fe6a4b4df6a] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2f790) [0x7fe6a4b58790] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf3883) [0x7fe6a4c1c883] ) 0-management: Lock for vol testvol not held
[2019-10-10 09:19:00.361570] W [MSGID: 106117] [glusterd-handler.c:6542:__glusterd_peer_rpc_notify] 0-management: Lock not released for testvol
[2019-10-10 09:19:00.361607] C [MSGID: 106002] [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume testvol. Stopping local bricks.
[2019-10-10 09:19:00.361825] I [MSGID: 106542] [glusterd-utils.c:8775:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 11039
[2019-10-10 09:19:01.362068] I [socket.c:871:__socket_shutdown] 0-management: intentional socket shutdown(16)
[2019-10-10 09:19:01.362680] I [MSGID: 106542] [glusterd-utils.c:8775:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 11338
[2019-10-10 09:19:02.362982] I [socket.c:871:__socket_shutdown] 0-management: intentional socket shutdown(20)
[2019-10-10 09:19:02.363239] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /bricks/brick4/testvol on port 49152
[2019-10-10 09:19:02.368590] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /bricks/brick4/ht on port 49153
[2019-10-10 09:19:02.375567] I [MSGID: 106499] [glusterd-handler.c:4502:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol
[2019-10-10 09:19:25.717254] I [MSGID: 106539] [glusterd-utils.c:12461:glusterd_generate_and_set_task_id] 0-management: Generated task-id 86cfc8b1-1e24-4244-b8e0-6941f4684234 for key rebalance-id
[2019-10-10 09:19:30.751060] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-10-10 09:19:30.751284] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:35.761694] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:35.767505] I [MSGID: 106172] [glusterd-handshake.c:1085:__server_event_notify] 0-glusterd: received defrag status updated
[2019-10-10 09:19:35.773243] I [MSGID: 106007] [glusterd-rebalance.c:153:__glusterd_defrag_notify] 0-management: Rebalance process for volume testvol has disconnected.
[2019-10-10 09:19:39.436119] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:19:39.436978] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2019-10-10 09:31:36.682991] I [MSGID: 106488] [glusterd-handler.c:1564:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2019-10-10 09:31:36.684006] I [MSGID: 106488] [glusterd-handler.c:1564:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
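
A quick way to pull the relevant entries out of the glusterd log (the path below is the usual default location and is an assumption for this setup):

# Shows the "Server quorum lost ... Stopping local bricks" message followed by
# the rebalance task-id generation that should not have happened.
grep -E "Server quorum lost|rebalance-id" /var/log/glusterfs/glusterd.log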

--- Additional comment from RHEL Product and Program Management on 2019-10-10 15:06:22 IST ---

This bug is automatically being proposed for the next minor release of Red Hat Gluster Storage by setting the release flag 'rhgs-3.5.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Bala Konda Reddy M on 2019-10-10 15:16:01 IST ---

Setup is in same state for further debugging.

Ip: 10.70.35.11 
credentials: root/1

Regards,
Bala

Comment 1 Worker Ant 2019-10-10 15:23:40 UTC
REVIEW: https://review.gluster.org/23536 (glusterd: rebalance start should fail when quorum is not met) posted (#1) for review on master by Sanju Rakonde

Comment 2 Worker Ant 2019-10-11 00:21:22 UTC
REVIEW: https://review.gluster.org/23536 (glusterd: rebalance start should fail when quorum is not met) merged (#1) on master by Sanju Rakonde

