Description of problem:
VM installation paused during a rebalance that was started after adding one subvolume to a 2x3 volume.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Have an RHV-RHGS setup with 3 RHGS nodes and 2 virt cluster.
2. Create a 2x3 volume and create the data domain using this volume.
3. Create one app VM (say vm1) using the created data domain.
4. While step 3 is in progress, add one more subvolume to convert the volume from 2x3 to 3x3.
5. Create one more app VM (say vm2) using the same data domain.
6. Trigger the rebalance now. // During this step the VM went into a paused state.
Actual results:
VM installation paused while doing rebalance after adding bricks.
Expected results:
The VM should not go into a paused state.
Additional info:
I will provide the sos reports from all the nodes (RHGS servers + hosts [clients]) to debug the issue.
Created attachment 1239021
Screenshot of the issue.
Took a look at the setup provided by Byreddy.
So the VM pause is due to WRITEV failing with 'Transport endpoint not connected' and 'Read-only file system' errors, respectively, on the two clients.
At one point, shortly after the process switched to the new graph (2x3 -> 3x3), the two clients were momentarily disconnected from their volfile servers as well as from about 6 bricks, 2 from each replica set. This led to client quorum loss, and eventually writes failed.
It turns out the bricks disconnected from the clients because of a server quorum loss among the glusterds, which caused the glusterds to kill their local bricks.
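To illustrate why losing 2 bricks from each replica set is enough to fail writes, here is a minimal sketch of a majority-based client quorum check. It assumes the volume uses a "more than half of the replica set must be reachable" rule, which is a simplification that holds for odd replica counts such as 3. The function name has_client_quorum and the per-subvolume counts are hypothetical and not taken from the setup itself.

```python
def has_client_quorum(bricks_up: int, replica_count: int) -> bool:
    """Assumed majority rule: more than half of the replica set's bricks are reachable."""
    return 2 * bricks_up > replica_count


# 3x3 volume after add-brick: three replica sets of three bricks each.
replica_count = 3

# Hypothetical snapshot of the disconnect window: 2 bricks of each replica set
# are unreachable, so only 1 brick per set is still connected.
bricks_up_per_subvolume = [1, 1, 1]

writable = [has_client_quorum(up, replica_count) for up in bricks_up_per_subvolume]
print(writable)
```

Under this assumed rule every replica set is below quorum during the disconnect window, which is consistent with the write failures seen on the clients.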
See the following logs:
[2017-01-10 07:07:12.607436] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume Dis-Rep1. Stopping local bricks.
[2017-01-10 07:07:12.620233] I [MSGID: 106493] [glusterd-rpc-ops.c:478:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 8cd39ea1-cf36-4dda-aab2-5117f0610387, host: dhcp42-105.lab.eng.blr.redhat.com, port: 0
[2017-01-10 07:07:12.690597] C [MSGID: 106003] [glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume Dis-Rep1. Starting local bricks.
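For anyone wanting to check how long such a window lasted, the following is a rough sketch that pulls "Server quorum lost/regained" events out of glusterd log lines and measures the gap between them. The regular expression is written only against the two lines quoted above and may not match other glusterd versions; the helper name quorum_events is hypothetical.

```python
import re
from datetime import datetime

# Pattern assumed from the two glusterd log lines quoted in this bug.
QUORUM_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"[A-Z] \[MSGID: \d+\] .*?"
    r"Server quorum (?P<event>lost|regained) for volume (?P<volume>\S+?)\."
)


def quorum_events(lines):
    """Yield (timestamp, event, volume) for each server-quorum transition found."""
    for line in lines:
        match = QUORUM_LINE.match(line)
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
            yield ts, match["event"], match["volume"]


sample = [
    "[2017-01-10 07:07:12.607436] C [MSGID: 106002] "
    "[glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: "
    "Server quorum lost for volume Dis-Rep1. Stopping local bricks.",
    "[2017-01-10 07:07:12.690597] C [MSGID: 106003] "
    "[glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: "
    "Server quorum regained for volume Dis-Rep1. Starting local bricks.",
]

events = list(quorum_events(sample))
gap_seconds = (events[1][0] - events[0][0]).total_seconds()
print(events[0][1], "->", events[1][1], "after", gap_seconds, "seconds")
```

On the two lines above the gap is well under a second, which matches the description of the disconnect as momentary.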
Discussed the same with Byreddy, and he confirmed that he had restarted the glusterds on all servers in quick succession.
Based on the discussion, this behavior is expected. Hence closing the bug as NOTABUG after coming to an agreement on the same with Byreddy.