Bug 1243542

Summary: [RHEV-RHGS] App VMs paused due to IO error caused by split-brain, after initiating remove-brick operation
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: distribute
Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: asriram, asrivast, divya, nbalacha, ravishankar, rcyriac, rgowdapp, ssampat, vagarwal
Target Milestone: ---
Keywords: Regression, ZStream
Target Release: RHGS 3.1.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.1-12
Doc Type: Bug Fix
Doc Text:
Previously, when a distribute leg of a distributed-replicate Gluster volume hosting VM images was removed using the `remove-brick start` Gluster CLI command, the VMs went into a paused state. With this fix, the VMs remain running while the remove-brick operation migrates the data.
Story Points: ---
Clone Of:
Clones: 1244165 (view as bug list)
Environment: RHEL 6.7 as hypervisor, RHEVM 3.5.4, RHGS 3.1 nightly build (based on RHEL 7.1)
Last Closed: 2015-10-05 07:20:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1216951, 1244165, 1245202, 1245934, 1251815    
Attachments:
  Fuse mount log from the hypervisor (flags: none)
  sosreport from NODE1 - dhcp37-211 (flags: none)
  sosreport from NODE2 (flags: none)

Description SATHEESARAN 2015-07-15 18:08:03 UTC
Description of problem:
------------------------
On an 8x2 distributed-replicate volume, initiated a remove-brick operation with data migration. After a few minutes, all the application VMs whose disk images reside on that gluster volume went into a paused state.

Split-brain error messages were noticed in the FUSE mount log.

Version:
--------
RHEL 6.7 as hypervisor
RHGS 3.1 based on RHEL 7.1

How reproducible:
-----------------
Tried only once

Steps to Reproduce:
-------------------
1. Create a 2x2 distributed-replicate volume
2. Use this gluster volume as the 'Data Domain' for RHEV
3. Create a few App VMs and install an OS on them
4. Remove the bricks on which the disk images of the App VMs reside (see the command sketch after this list)
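
For reference, a minimal command sketch of steps 1 and 4 (the volume name, host names and brick paths here are placeholders, not the ones used in this setup):

# create a 2x2 distributed-replicate volume and start it
gluster volume create vmstore replica 2 \
    node1:/rhgs/brick1/vmstore node2:/rhgs/brick1/vmstore \
    node1:/rhgs/brick2/vmstore node2:/rhgs/brick2/vmstore
gluster volume start vmstore

# remove one replica pair (a distribute leg) with data migration
gluster volume remove-brick vmstore \
    node1:/rhgs/brick2/vmstore node2:/rhgs/brick2/vmstore start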

Actual results:
----------------
App VMs went into a **paused** state

Expected results:
-----------------
App VMs should remain healthy and keep running

Comment 1 SATHEESARAN 2015-07-15 18:17:43 UTC
The following error messages are seen in the FUSE mount logs:


[2015-07-15 17:49:42.709088] E [MSGID: 114031] [client-rpc-fops.c:1673:client3_3_finodelk_cbk] 6-vol1-client-0: remote operation failed [Transport endpoint is not connected]

[2015-07-15 17:49:42.710849] W [MSGID: 114031] [client-rpc-fops.c:1028:client3_3_fsync_cbk] 6-vol1-client-0: remote operation failed [Transport endpoint is not connected]
[2015-07-15 17:49:42.710874] W [MSGID: 108035] [afr-transaction.c:1614:afr_changelog_fsync_cbk] 6-vol1-replicate-0: fsync(b7d21675-6fd8-472a-b7d9-71d7436c614d) failed on subvolume vol1-client-0. Transaction was WRITE [Transport endpoint is not connected]
[2015-07-15 17:49:42.710897] W [MSGID: 108001] [afr-transaction.c:686:afr_handle_quorum] 6-vol1-replicate-0: b7d21675-6fd8-472a-b7d9-71d7436c614d: Failing WRITE as quorum is not met

[2015-07-15 18:12:15.544061] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 12-vol1-replicate-5: Failing WRITE on gfid b7d21675-6fd8-472a-b7d9-71d7436c614d: split-brain observed. [Input/output error]
[2015-07-15 18:12:15.737906] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293197: WRITE => -1 (Input/output error)
[2015-07-15 18:12:17.022070] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-5: remote operation failed. Path: /c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b (d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:17.022073] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-4: remote operation failed. Path: /c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b (d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:22.952290] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293304: WRITE => -1 (Input/output error)
[2015-07-15 18:12:22.952550] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293306: WRITE => -1 (Input/output error)
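
For reference, files that AFR considers to be in split-brain can typically be listed from any of the RHGS nodes with the standard heal command (vol1 matches the volume name in the logs above):

# list files currently flagged as split-brain on the volume
gluster volume heal vol1 info split-brain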

Comment 3 SATHEESARAN 2015-07-15 18:28:59 UTC
Created attachment 1052404 [details]
Fuse mount log from the hypervisor

This is the fuse mount log from the hypervisor

Comment 4 SATHEESARAN 2015-07-15 18:31:09 UTC
Created attachment 1052406 [details]
sosreport from NODE1 - dhcp37-211

sosreport from RHGS node1 - dhcp37-211

Comment 5 SATHEESARAN 2015-07-15 18:36:06 UTC
Created attachment 1052407 [details]
sosreport from NODE2

This is the sosreport from NODE2 - dhcp37-66

Comment 6 SATHEESARAN 2015-07-16 09:04:33 UTC
I retested this issue and I am seeing it consistently.

This was the new test performed:

1. Created a 2x2 distributed-replicate volume and used it to back the VM image store (Data Domain) in RHEV

2. While installing the OS in the App VM, initiated 'remove-brick start' for the brick which originally contained the VM disk image (a sketch for identifying that brick follows this list)

3. Observed split-brain error messages in the FUSE mount log

4. App VM went into PAUSED state
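
For reference, the bricks backing a given VM disk image can be identified from the hypervisor's FUSE mount via the pathinfo virtual xattr (a sketch; the mount point and image path below are placeholders):

# print the backend brick paths that hold this file
getfattr -n trusted.glusterfs.pathinfo /mnt/vol1/images/<image-uuid>/<disk-file>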

This looks like a regression and I could reproduce the issue consistently every time.

Marking this bug as a BLOCKER for the RHGS 3.1 release.

Comment 7 SATHEESARAN 2015-07-16 10:46:19 UTC
This issue is a regression, so adding the 'Regression' keyword.

Comment 12 monti lawrence 2015-07-24 16:58:43 UTC
The doc text has been edited. Please sign off for it to be included in Known Issues.

Comment 16 SATHEESARAN 2015-08-03 13:53:17 UTC
*** Bug 1235496 has been marked as a duplicate of this bug. ***

Comment 20 SATHEESARAN 2015-09-02 08:32:11 UTC
Tested with glusterfs-3.7.1-13.el7rhgs using the following steps:

1. Configured a 2x2 gluster volume as a Data Domain for RHEV
2. Created a few App VMs and started OS installation on them
3. Initiated remove-brick on the brick which contained the VM disk image

No issues were found and the OS installation completed successfully.
The App VMs remained healthy after removing the brick on which the VM disk image resided.
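
For reference, a remove-brick data migration of this kind is normally monitored and then finalized along these lines (a sketch; the volume and brick names are placeholders):

# watch the migration progress of the remove-brick operation
gluster volume remove-brick vmstore \
    node1:/rhgs/brick2/vmstore node2:/rhgs/brick2/vmstore status

# once the status shows 'completed', detach the bricks permanently
gluster volume remove-brick vmstore \
    node1:/rhgs/brick2/vmstore node2:/rhgs/brick2/vmstore commit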

Marking this bug as VERIFIED

Comment 22 errata-xmlrpc 2015-10-05 07:20:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html