Bug 863939

Summary: [RHEV-RHS] VM paused when one of the bricks in the replica pair was brought down
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterfs
Version: 2.0
Reporter: Anush Shetty <ashetty>
Assignee: Pranith Kumar K <pkarampu>
QA Contact: Rejy M Cyriac <rcyriac>
CC: grajaiya, pkarampu, rfortier, rhs-bugs, shaines, surs, vbellur
Status: CLOSED ERRATA
Severity: unspecified
Priority: medium
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.4.0.33rhs-1.el6rhs
Doc Type: Bug Fix
Environment: virt rhev integration
Type: Bug
Last Closed: 2013-09-23 22:33:29 UTC
Attachments: SOS report of the server

Description Anush Shetty 2012-10-08 06:41:09 UTC
Description of problem: In a 2x2 distributed-replicate volume, when one of the bricks in a replica pair was brought down by killing its glusterfsd process, the VM hosted on the volume moved to the paused state. Even after bringing the brick back up, we were unable to bring the VM back up.


Version-Release number of selected component (if applicable):
# rpm -qa | grep glus
glusterfs-fuse-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-devel-3.3.0rhsvirt1-6.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-6.el6rhs.x86_64


How reproducible: Consistently


Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume and use it as a storage domain in RHEV-M
2. Create VMs on the storage domain
3. Bring down a brick in one of the replica pairs by killing its glusterfsd process (see the sketch below)
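
For reference, a minimal CLI sketch of step 3, using the volume name from Additional info below; the RHEV-M storage-domain setup and VM creation are done through RHEV-M and are omitted, and <glusterfsd-pid> is a placeholder, not a value from this report:

# gluster volume status dist-replica
  (note the PID of the glusterfsd process serving the target brick)
# kill -KILL <glusterfsd-pid>
  (the brick goes offline; per this report, the hosted VM is paused)
# gluster volume start dist-replica force
  (later, restarts the killed brick process so self-heal can catch up)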
  
Actual results:

VM paused due to storage errors

Expected results:

The VM should remain up.


Additional info:


Volume Name: dist-replica
Type: Distributed-Replicate
Volume ID: 39e0c10c-12d8-4484-b21d-a3be0cd0b7aa
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Brick3: rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Brick4: rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Options Reconfigured:
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
storage.linux-aio: disable
cluster.eager-lock: enable
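
(For reference, a sketch of one way the reconfigured options above would be applied on this volume; whether they were set individually or via a tuning profile is not stated in this report:)

# gluster volume set dist-replica performance.quick-read disable
# gluster volume set dist-replica performance.io-cache disable
# gluster volume set dist-replica performance.stat-prefetch disable
# gluster volume set dist-replica performance.read-ahead disable
# gluster volume set dist-replica storage.linux-aio disable
# gluster volume set dist-replica cluster.eager-lock enable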

# glustershd log from the server hosting the brick that was brought down
[2012-10-08 11:25:25.678833] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-3: Connected to 10.70.36.68:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:25.678856] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:25.678907] I [afr-common.c:3631:afr_notify] 0-dist-replica-replicate-1: Subvolume 'dist-replica-client-3' came back up; going online.
[2012-10-08 11:25:25.679457] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-3: Server lk version = 1
[2012-10-08 11:25:25.689072] E [afr-self-heal-data.c:1311:afr_sh_data_open_cbk] 0-dist-replica-replicate-1: open of <gfid:aec237f3-6779-4117-b2ac-349cbdb2256a> failed on child dist-replica-client-2 (Transport endpoint is not connected)
[2012-10-08 11:25:25.699814] E [afr-self-heal-data.c:1311:afr_sh_data_open_cbk] 0-dist-replica-replicate-1: open of <gfid:b2c333bf-ca8c-4cf7-8388-534fb6035f2d> failed on child dist-replica-client-2 (Transport endpoint is not connected)
[2012-10-08 11:25:27.685167] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.685537] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-0: Connected to 10.70.36.60:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.685579] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.685659] I [afr-common.c:3631:afr_notify] 0-dist-replica-replicate-0: Subvolume 'dist-replica-client-0' came back up; going online.
[2012-10-08 11:25:27.686275] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-0: Server lk version = 1
[2012-10-08 11:25:27.690884] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-1: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.691223] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-1: Connected to 10.70.36.61:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.691251] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.691861] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-1: Server lk version = 1
[2012-10-08 11:25:27.695234] I [client-handshake.c:1614:select_server_supported_programs] 0-pure-replica-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.695467] I [client-handshake.c:1411:client_setvolume_cbk] 0-pure-replica-client-0: Connected to 10.70.36.67:24009, attached to remote volume '/pure-replica1'.
[2012-10-08 11:25:27.695492] I [client-handshake.c:1423:client_setvolume_cbk] 0-pure-replica-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.696043] I [client-handshake.c:453:client_set_lk_version_cbk] 0-pure-replica-client-0: Server lk version = 1
[2012-10-08 11:25:27.699351] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.699678] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-2: Connected to 10.70.36.67:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.699712] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.700306] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-2: Server lk version = 1
[2012-10-08 11:35:25.803699] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:aec237f3-6779-4117-b2ac-349cbdb2256a>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-08 11:37:26.233494] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-08 11:37:27.256012] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-08 11:37:27.257859] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-10-08 11:43:12.961746] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
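
The "possible split-brain" message above refers to the AFR changelog extended attributes on the bricks of replicate-1 (clients 2 and 3). As a sketch, assuming the standard .glusterfs gfid path layout on the backend (the path is constructed here, not taken from the report), the pending counters for the affected file could be inspected with:

# getfattr -d -m . -e hex /dist-replica1/.glusterfs/ae/c2/aec237f3-6779-4117-b2ac-349cbdb2256a
  (run on each brick of the replica pair; non-zero trusted.afr.dist-replica-client-2 and
   trusted.afr.dist-replica-client-3 counters on both bricks, each blaming the other,
   would indicate a data split-brain)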

Comment 2 Anush Shetty 2012-10-08 07:33:04 UTC
Created attachment 623321 [details]
SOS report of the server

Comment 3 Pranith Kumar K 2012-10-13 10:33:28 UTC
Anush,
    Could you attach the sos-report from the other brick as well as from the client machine?

Pranith

Comment 5 Vijay Bellur 2013-01-08 09:44:46 UTC
http://review.gluster.org/4130

Comment 6 Pranith Kumar K 2013-01-08 11:02:57 UTC
Patch is under review.

Comment 7 Vijay Bellur 2013-01-17 07:11:34 UTC
CHANGE: http://review.gluster.org/4310 (cluster/afr: Pre-op should be undone for non-piggyback post-op) merged in master by Anand Avati (avati)

Comment 8 Pranith Kumar K 2013-01-22 11:04:50 UTC
The bug is fixed based on code inspection; this may not be the complete fix. Please feel free to re-open the bug if the issue is observed again. The debugging infrastructure needed to diagnose split-brains is going to be committed as part of bug 864666.

Comment 9 Rejy M Cyriac 2013-09-11 11:28:43 UTC
Verified that the issue is not reproducible.

RHS 2.1 Server - glusterfs-server-3.4.0.33rhs-1.el6rhs.x86_64

RHEVM 3.2: SF 20.1 (3.2.3-0.42.el6ev)
RHEVM 3.3: IS13 - rhevm-3.3.0-0.19.master.el6ev

Comment 10 Scott Haines 2013-09-23 22:33:29 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html