Description of problem:
In a 2x2 distributed-replicate volume, when one of the bricks in a replica pair was brought down by killing the glusterfsd process, the hosted VM moved to the paused state. Even after bringing the brick back up, we were unable to resume the VM.

Version-Release number of selected component (if applicable):
# rpm -qa | grep glus
glusterfs-fuse-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-devel-3.3.0rhsvirt1-6.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-6.el6rhs.x86_64

How reproducible:
Consistently

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume and use it as a storage domain in RHEV-M
2. Create VMs on the storage domain
3. Bring down a brick in one of the replica pairs

Actual results:
VM paused due to storage errors

Expected results:
VM should remain up.

Additional info:
Volume Name: dist-replica
Type: Distributed-Replicate
Volume ID: 39e0c10c-12d8-4484-b21d-a3be0cd0b7aa
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Brick3: rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Brick4: rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Options Reconfigured:
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
storage.linux-aio: disable
cluster.eager-lock: enable

# glustershd log on the server whose brick was brought down
[2012-10-08 11:25:25.678833] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-3: Connected to 10.70.36.68:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:25.678856] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:25.678907] I [afr-common.c:3631:afr_notify] 0-dist-replica-replicate-1: Subvolume 'dist-replica-client-3' came back up; going online.
[2012-10-08 11:25:25.679457] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-3: Server lk version = 1
[2012-10-08 11:25:25.689072] E [afr-self-heal-data.c:1311:afr_sh_data_open_cbk] 0-dist-replica-replicate-1: open of <gfid:aec237f3-6779-4117-b2ac-349cbdb2256a> failed on child dist-replica-client-2 (Transport endpoint is not connected)
[2012-10-08 11:25:25.699814] E [afr-self-heal-data.c:1311:afr_sh_data_open_cbk] 0-dist-replica-replicate-1: open of <gfid:b2c333bf-ca8c-4cf7-8388-534fb6035f2d> failed on child dist-replica-client-2 (Transport endpoint is not connected)
[2012-10-08 11:25:27.685167] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.685537] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-0: Connected to 10.70.36.60:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.685579] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.685659] I [afr-common.c:3631:afr_notify] 0-dist-replica-replicate-0: Subvolume 'dist-replica-client-0' came back up; going online.
[2012-10-08 11:25:27.686275] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-0: Server lk version = 1
[2012-10-08 11:25:27.690884] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-1: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.691223] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-1: Connected to 10.70.36.61:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.691251] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.691861] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-1: Server lk version = 1
[2012-10-08 11:25:27.695234] I [client-handshake.c:1614:select_server_supported_programs] 0-pure-replica-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.695467] I [client-handshake.c:1411:client_setvolume_cbk] 0-pure-replica-client-0: Connected to 10.70.36.67:24009, attached to remote volume '/pure-replica1'.
[2012-10-08 11:25:27.695492] I [client-handshake.c:1423:client_setvolume_cbk] 0-pure-replica-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.696043] I [client-handshake.c:453:client_set_lk_version_cbk] 0-pure-replica-client-0: Server lk version = 1
[2012-10-08 11:25:27.699351] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.699678] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-2: Connected to 10.70.36.67:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.699712] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.700306] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-2: Server lk version = 1
[2012-10-08 11:35:25.803699] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:aec237f3-6779-4117-b2ac-349cbdb2256a>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-08 11:37:26.233494] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-08 11:37:27.256012] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-08 11:37:27.257859] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-10-08 11:43:12.961746] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
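For reference, the reproduction steps can be sketched with the gluster CLI (hostnames and brick paths are taken from the volume info above; attaching the volume as a RHEV-M storage domain and creating the VMs is done from the RHEV-M UI and is not shown, and the exact brick PID must be read from the status output):

```shell
# Create and start the 2x2 distributed-replicate volume
gluster volume create dist-replica replica 2 \
    rhs-client36.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client37.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client43.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
gluster volume start dist-replica

# On one server of a replica pair, look up the brick's glusterfsd
# PID from the per-brick status listing, then kill the process
gluster volume status dist-replica
kill -KILL <brick-pid>

# Later, bring the downed brick back up
gluster volume start dist-replica force
```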
Created attachment 623321: SOS report of the server
Anush,
Could you attach the sos-report from the other brick, as well as from the client machine?

Pranith
http://review.gluster.org/4130
Patch is under review.
CHANGE: http://review.gluster.org/4310 (cluster/afr: Pre-op should be undone for non-piggyback post-op) merged in master by Anand Avati (avati)
The bug is fixed based on code inspection; this may not be the complete fix, so please feel free to re-open the bug if the issue is observed again. The debugging infrastructure needed to diagnose split-brains is going to be committed as part of bug 864666.
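Until that debugging support lands, the pending-operation counters that AFR uses to detect split-brain can be inspected by hand: each brick stores trusted.afr.<volume>-client-<n> extended attributes (readable with `getfattr -d -m trusted.afr -e hex <file-on-brick>`), whose 12-byte value packs three big-endian 32-bit counters for pending data, metadata, and entry operations; non-zero data counters pointing at each other on both bricks of a pair indicate a possible data split-brain. A minimal bash sketch of decoding one such value (the hex string below is a made-up example, not taken from this report):

```shell
# Hypothetical value of e.g. trusted.afr.dist-replica-client-2,
# as printed by getfattr -e hex on a brick backend file
val=0x000000040000000000000000

# Strip the 0x prefix and slice the 24 hex digits into three
# big-endian 32-bit counters: data, metadata, entry
hex=${val#0x}
data=$((16#${hex:0:8}))       # pending data operations
metadata=$((16#${hex:8:8}))   # pending metadata operations
entry=$((16#${hex:16:8}))     # pending entry operations

echo "data=$data metadata=$metadata entry=$entry"
# → data=4 metadata=0 entry=0
```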
Verified that the issue is not reproducible on RHS 2.1.
Server: glusterfs-server-3.4.0.33rhs-1.el6rhs.x86_64
RHEVM:
RHEVM 3.2: SF 20.1 (3.2.3-0.42.el6ev)
RHEVM 3.3: IS13 - rhevm-3.3.0-0.19.master.el6ev
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html