Bug 1411646 - [RHV-RHGS]: VM installation paused while doing rebalance after adding bricks.
Summary: [RHV-RHGS]: VM installation paused while doing rebalance after adding bricks.
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact: Nag Pavan Chilakam
Depends On:
Reported: 2017-01-10 08:26 UTC by Byreddy
Modified: 2017-02-01 14:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-01-10 10:36:23 UTC
Target Upstream Version:

Attachments (Terms of Use)
Screen shot of an issue. (311.94 KB, image/png)
2017-01-10 09:29 UTC, Byreddy

Description Byreddy 2017-01-10 08:26:51 UTC
Description of problem:
VM installation paused during rebalance after adding one subvolume to a 2x3 volume (converting it to 3x3).

Version-Release number of selected component (if applicable):

How reproducible:
One time

Steps to Reproduce:
1. Have an RHV-RHGS setup with 3 RHGS nodes and 2 virt cluster hosts.
2. Create a 2x3 volume and create the data domain using this volume.
3. Create one app VM (say vm1) using the created data domain.
4. While step 3 is in progress, add one more subvolume to convert the volume from 2x3 to 3x3.
5. Create one more app VM (say vm2) using the same data domain.
6. Trigger the rebalance. // During this step the VM went into a paused state.
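For reference, the gluster-side portion of the steps above can be sketched with the standard CLI. The volume name is taken from the logs later in this bug; server hostnames and brick paths are illustrative, not from the report:

```shell
# Step 2: create a 2x3 (distributed-replicate) volume.
# Brick order matters: the first 3 bricks form replica set 1, the next 3 form replica set 2.
gluster volume create Dis-Rep1 replica 3 \
    server{1..3}:/bricks/b1 server{1..3}:/bricks/b2
gluster volume start Dis-Rep1

# Step 4: add one more replica-3 subvolume, converting 2x3 -> 3x3
gluster volume add-brick Dis-Rep1 replica 3 \
    server{1..3}:/bricks/b3

# Step 6: trigger the rebalance and watch its progress
gluster volume rebalance Dis-Rep1 start
gluster volume rebalance Dis-Rep1 status
```

This is a sketch of the CLI sequence only; it assumes a working trusted storage pool and does not cover the RHV-side data-domain and VM creation steps.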

Actual results:
VM installation paused while doing rebalance after adding bricks

Expected results:
The VM should not go into a paused state.

Additional info:

I will provide the sos reports from all the nodes (RHGS servers + hosts [clients]) to debug the issue.

Comment 4 Byreddy 2017-01-10 09:29:16 UTC
Created attachment 1239021 [details]
Screen shot of an issue.

Comment 5 Krutika Dhananjay 2017-01-10 10:36:23 UTC
Took a look at the setup provided by Byreddy.

The VM pause is due to WRITEV failing with 'Transport endpoint not connected' and 'Read-only file system' errors, respectively, on the two clients.

At one point, shortly after the client processes switched to the new graph (2x3 -> 3x3), the two clients momentarily lost their connections to their volfile servers as well as to six bricks (two from each replica set). This led to loss of client quorum, and writes eventually failed.

It turns out the bricks disconnected from the clients because server quorum was lost among the glusterds, which caused each glusterd to kill its local bricks.
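The server-quorum arithmetic behind this can be illustrated with a small sketch. It assumes glusterd's default behavior of requiring strictly more than 50% of the trusted pool to be reachable; this is an illustration, not gluster's actual code:

```shell
#!/bin/sh
# Illustrative server-quorum arithmetic for the 3-node trusted pool in this
# report, assuming the default ">50% of peers" threshold. Not gluster source.
TOTAL=3          # glusterd peers in the trusted pool
RATIO=50         # percent; quorum needs strictly more than this

quorate() {      # $1 = number of reachable glusterds
    [ $(( $1 * 100 )) -gt $(( TOTAL * RATIO )) ]
}

quorate 3 && echo "3 up: quorum held"
quorate 2 && echo "2 up: quorum held"   # one glusterd restarting at a time is fine
quorate 1 || echo "1 up: quorum lost -> glusterd stops local bricks"
```

So restarting glusterds on all three servers in quick succession can momentarily leave only one reachable peer, dropping below quorum and stopping bricks, exactly as the logs below show. On a live system the relevant settings can be read with `gluster volume get all cluster.server-quorum-ratio` and `gluster volume get Dis-Rep1 cluster.server-quorum-type`.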

See the following logs:

[2017-01-10 07:07:12.607436] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume Dis-Rep1. Stopping local bricks.
[2017-01-10 07:07:12.620233] I [MSGID: 106493] [glusterd-rpc-ops.c:478:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 8cd39ea1-cf36-4dda-aab2-5117f0610387, host: dhcp42-105.lab.eng.blr.redhat.com, port: 0
[2017-01-10 07:07:12.690597] C [MSGID: 106003] [glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume Dis-Rep1. Starting local bricks.

Discussed this with Byreddy; he confirmed that he had restarted the glusterds on all servers in quick succession.

Based on that discussion, this behavior is expected. Hence closing the bug as NOTABUG, with Byreddy's agreement.
