During glusterd handshake glusterd received a volume dictionary from peer end to compare the own volume dictionary data.If the options are differ it sets the key to recognize volume options are changed and call import syntask to delete/start the volume.In brick_mux environment while number of volumes are high(5k) the dict api in function glusterd_compare_friend_volume takes time because the function glusterd_handle_friend_req saves all peer volume data in a single dictionary. Due to time taken by the function glusterd_handle_friend RPC requests receives a call_bail from a peer end gluster(CLI) won't be able to show volume status. Reproducer 1) Setup 5100 distributed volumes 3x1 2) Enable brick_mux 3) Start all the volumes 4) Kill all gluster processes on 3rd node 5) Run a loop to update volume option on a 1st node for i in {1..5100}; do gluster v set vol$i performance.open-behind off; done 6) Start the glusterd process on the 3rd node Wait to finish handshake and check there should not be any call_bail message in the logs
The upstream patch link is under review process https://github.com/gluster/glusterfs/issues/1613
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1462