Bug 1243542 - [RHEV-RHGS] App VMs paused due to IO error caused by split-brain, after initiating remove-brick operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.1
Assignee: Ravishankar N
QA Contact: SATHEESARAN
URL:
Whiteboard:
Duplicates: 1235496
Depends On:
Blocks: 1216951 1244165 1245202 1245934 1251815
 
Reported: 2015-07-15 18:08 UTC by SATHEESARAN
Modified: 2015-10-05 07:20 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.7.1-12
Doc Type: Bug Fix
Doc Text:
Previously, when a distribute leg of a dist-rep Gluster volume that hosts VM images was removed using the `remove-brick start` Gluster CLI, the VMs went into a paused state. With the fix, they do not go into a paused state.
Clone Of:
Clones: 1244165
Environment:
RHEL 6.7 as hypervisor, RHEVM 3.5.4, RHGS 3.1 nightly build (based on RHEL 7.1)
Last Closed: 2015-10-05 07:20:21 UTC
Embargoed:


Attachments
Fuse mount log from the hypervisor (479.14 KB, text/plain)
2015-07-15 18:28 UTC, SATHEESARAN
sosreport from NODE1 - dhcp37-211 (6.14 MB, application/x-xz)
2015-07-15 18:31 UTC, SATHEESARAN
sosreport from NODE2 (6.88 MB, application/x-xz)
2015-07-15 18:36 UTC, SATHEESARAN


Links
Red Hat Product Errata RHSA-2015:1845 (SHIPPED_LIVE): Moderate: Red Hat Gluster Storage 3.1 update, last updated 2015-10-05 11:06:22 UTC

Description SATHEESARAN 2015-07-15 18:08:03 UTC
Description of problem:
------------------------
With an 8x2 distributed-replicate volume, initiated a remove-brick operation with data migration. After a few minutes, all the application VMs with their disk images on that Gluster volume went into a paused state.

Noticed split-brain error messages in the fuse mount log.

Version:
--------
RHEL 6.7 as hypervisor
RHGS 3.1 based on RHEL 7.1

How reproducible:
-----------------
Tried only once

Steps to Reproduce:
-------------------
1. Create a 2x2 distributed-replicate volume
2. Use this Gluster volume as the 'Data Domain' for RHEV
3. Create a few App VMs and install an OS on them
4. Remove the bricks where the disk images of the App VMs reside (see the command sketch below)
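
A minimal command sketch of the setup and the brick removal, assuming hypothetical node names (node1, node2) and brick paths under /rhgs; the volume name vol1 matches the one seen in the fuse mount logs below:

# 2x2 distributed-replicate volume: two replica pairs spread across two nodes (hypothetical paths)
gluster volume create vol1 replica 2 \
    node1:/rhgs/brick1/b1 node2:/rhgs/brick1/b1 \
    node1:/rhgs/brick2/b2 node2:/rhgs/brick2/b2
gluster volume set vol1 group virt    # optional: virt option group commonly applied to VM image stores (not stated in this report)
gluster volume start vol1
# RHEV mounts the data domain on the hypervisor over FUSE; a manual equivalent with a hypothetical mount point:
mount -t glusterfs node1:/vol1 /mnt/vmstore
# Step 4: remove one replica pair (the one holding the VM disk images) with data migration
gluster volume remove-brick vol1 node1:/rhgs/brick1/b1 node2:/rhgs/brick1/b1 start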

Actual results:
----------------
App VMs went into a **paused** state

Expected results:
-----------------
App VMs should remain healthy

Comment 1 SATHEESARAN 2015-07-15 18:17:43 UTC
The following error messages are seen in the fuse mount logs:


[2015-07-15 17:49:42.709088] E [MSGID: 114031] [client-rpc-fops.c:1673:client3_3_finodelk_cbk] 6-vol1-client-0: remote operation failed [Transport endpoint is not connected]

[2015-07-15 17:49:42.710849] W [MSGID: 114031] [client-rpc-fops.c:1028:client3_3_fsync_cbk] 6-vol1-client-0: remote operation failed [Transport endpoint is not connected]
[2015-07-15 17:49:42.710874] W [MSGID: 108035] [afr-transaction.c:1614:afr_changelog_fsync_cbk] 6-vol1-replicate-0: fsync(b7d21675-6fd8-472a-b7d9-71d7436c614d) failed on subvolume vol1-client-0. Transaction was WRITE [Transport endpoint is not connected]
[2015-07-15 17:49:42.710897] W [MSGID: 108001] [afr-transaction.c:686:afr_handle_quorum] 6-vol1-replicate-0: b7d21675-6fd8-472a-b7d9-71d7436c614d: Failing WRITE as quorum is not met

[2015-07-15 18:12:15.544061] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 12-vol1-replicate-5: Failing WRITE on gfid b7d21675-6fd8-472a-b7d9-71d7436c614d: split-brain observed. [Input/output error]
[2015-07-15 18:12:15.737906] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293197: WRITE => -1 (Input/output error)
[2015-07-15 18:12:17.022070] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-5: remote operation failed. Path: /c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b (d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:17.022073] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-4: remote operation failed. Path: /c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b (d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:22.952290] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293304: WRITE => -1 (Input/output error)
[2015-07-15 18:12:22.952550] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293306: WRITE => -1 (Input/output error)
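
For reference, the split-brain and pending-heal state reported in these messages can be inspected from any RHGS node with the heal CLI; vol1 is the volume name visible in the translator names above:

gluster volume heal vol1 info split-brain    # entries currently flagged as split-brain, per brick
gluster volume heal vol1 info                # all entries still pending heal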

Comment 3 SATHEESARAN 2015-07-15 18:28:59 UTC
Created attachment 1052404 [details]
Fuse mount log from the hypervisor

This is the fuse mount log from the hypervisor

Comment 4 SATHEESARAN 2015-07-15 18:31:09 UTC
Created attachment 1052406 [details]
sosreport from NODE1 - dhcp37-211

sosreport from RHGS node1 - dhcp37-211

Comment 5 SATHEESARAN 2015-07-15 18:36:06 UTC
Created attachment 1052407 [details]
sosreport from NODE2

This is the sosreport from NODE2 - dhcp37-66

Comment 6 SATHEESARAN 2015-07-16 09:04:33 UTC
I retested this issue and I am seeing it consistently.

This was the new test taken up:
1. Created a 2x2 distributed-replicate volume and used it to back the VM image store in RHEV

2. While installing the OS in the App VM, initiated 'remove-brick start' for the brick which originally contained the VM disk image (see the sketch after this list)

3. Observed split-brain error messages in the fuse mount log

4. App VM went into PAUSED state
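
A sketch of the remove-brick operation from step 2, using the same hypothetical node and brick names as in the earlier sketch; the migration can be watched while OS installation is running in the App VM:

gluster volume remove-brick vol1 node1:/rhgs/brick1/b1 node2:/rhgs/brick1/b1 start     # begins data migration off this pair
gluster volume remove-brick vol1 node1:/rhgs/brick1/b1 node2:/rhgs/brick1/b1 status    # shows files rebalanced, failures, run time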

The issue looks like a regression, and I could reproduce it consistently every time.

Marking this bug as a BLOCKER for the RHGS 3.1 release.

Comment 7 SATHEESARAN 2015-07-16 10:46:19 UTC
This issue is a regression; adding the 'REGRESSION' keyword.

Comment 12 monti lawrence 2015-07-24 16:58:43 UTC
The doc text has been edited. Please sign off for it to be included in Known Issues.

Comment 16 SATHEESARAN 2015-08-03 13:53:17 UTC
*** Bug 1235496 has been marked as a duplicate of this bug. ***

Comment 20 SATHEESARAN 2015-09-02 08:32:11 UTC
Tested with glusterfs-3.7.1-13.el7rhgs with the following steps:

1. Configured a 2x2 Gluster volume as a Data Domain for RHEV
2. Created a few App VMs and started OS installation on them
3. Initiated remove-brick on the brick which contained the VM disk images

No issues were found, and the OS installation completed successfully.
The App VMs remained healthy after removing the brick on which their disk images were stored.
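
A sketch of how the post-operation state can be confirmed, assuming the same hypothetical brick names as in the earlier sketches:

gluster volume remove-brick vol1 node1:/rhgs/brick1/b1 node2:/rhgs/brick1/b1 status    # migration should report "completed"
gluster volume remove-brick vol1 node1:/rhgs/brick1/b1 node2:/rhgs/brick1/b1 commit    # finalise the removal
gluster volume heal vol1 info split-brain                                              # should list no entries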

Marking this bug as VERIFIED

Comment 22 errata-xmlrpc 2015-10-05 07:20:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html

