Description of problem:
On a three node cluster (N1, N2, N3), create a volume using bricks from only two nodes (N1 and N2). While the volume create is in progress, restart glusterd on the third node. The volume create actually succeeds, but the CLI reports the error below saying the create failed, peer status shows the third node in Peer Rejected state, and `gluster vol list | wc -l` differs across the cluster.

###############################################################
gluster vol create testvol6 replica 2 10.70.37.213:/bricks/brick1/test6 10.70.37.75:/bricks/brick1/test6 10.70.37.213:/bricks/brick2/test7 10.70.37.75:/bricks/brick2/test7
Replica 2 volumes are prone to split-brain. Use Arbiter or Replica 3 to avoid this. Do you still want to continue?
 (y/n) y
volume create: testvol6: failed: Commit failed on dhcp37-94.lab.eng.blr.redhat.com. Please check log file for details.
################################################################
[root@dhcp37-213 ~]# gluster peer status
Number of Peers: 2

Hostname: dhcp37-75.lab.eng.blr.redhat.com
Uuid: d8e9a211-1a54-4288-b790-ea13a603c93b
State: Peer in Cluster (Connected)

Hostname: dhcp37-94.lab.eng.blr.redhat.com
Uuid: 56bb1e66-264b-4b2a-97d3-9fd8125ec57a
State: Peer Rejected (Connected)
##################################################################

Version-Release number of selected component (if applicable):
3.12.2-18.1

How reproducible:
1/1

Steps to Reproduce:
1. Form a cluster with three nodes
2. Create a volume with bricks from only two nodes
3. While the create is in progress, restart glusterd on node N3 (the two operations must run at the same time)

Actual results:
Volume create reports a commit failure, yet the volume is created. gluster peer status shows Peer Rejected for the third node, and a volume count mismatch is seen on the third node.

Expected results:
Volume create should succeed and sync with the third node once it is back up. Peer status should be in Connected state.

Additional info:
1. Volume count mismatch - not at all a surprise in this case, as glusterd doesn't have any transaction rollback mechanism. So on N1 & N2 the volume create went through, but on N3 it didn't, as glusterd was restarted during the commit phase.
2. Peer going into Rejected state - which peer was showing Rejected? Did we check the glusterd log of the node that was rejected?
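To make the first point concrete, here is a minimal sketch (plain Python, not glusterd code; the Node class and method names are hypothetical) of why a per-node commit with no rollback leaves the cluster diverged when one peer's glusterd restarts mid-commit:

```python
# Hypothetical model: each node holds its own volume list (standing in for
# glusterd's persisted volume configuration). A node whose glusterd is down
# fails its commit, but earlier commits on other nodes are NOT undone.

class Node:
    def __init__(self, name):
        self.name = name
        self.volumes = []   # local view of the cluster's volumes
        self.up = True      # whether glusterd is running on this node

    def commit(self, volume):
        if not self.up:
            raise ConnectionError(f"{self.name}: glusterd not running")
        self.volumes.append(volume)

def create_volume(nodes, volume):
    """Commit the new volume on every node in turn, with no rollback:
    a failure on one node leaves the already-committed nodes as they are."""
    failed = []
    for node in nodes:
        try:
            node.commit(volume)
        except ConnectionError:
            failed.append(node.name)  # analogous to "Commit failed on <host>"
    return failed

n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
n3.up = False  # glusterd restarted on N3 during the commit phase

failed = create_volume([n1, n2, n3], "testvol6")
counts = [len(n.volumes) for n in (n1, n2, n3)]
print(failed)   # ['N3']
print(counts)   # [1, 1, 0] -> the volume count mismatch seen in the report
```

The mismatch persists until the third node resyncs its configuration from a peer, which matches the observed `gluster vol list | wc -l` discrepancy.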
It seems to me that the root cause of this is the same as that of BZ 1635136 & 1637459. If so, can you please update the bug status with the patch link?
upstream patch: https://review.gluster.org/#/c/glusterfs/+/21336/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3827