Bug 863939 - [RHEV-RHS] VM paused when one of the bricks in the replica pair was brought down
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Assigned To: Pranith Kumar K
QA Contact: Rejy M Cyriac
Depends On:
Blocks:
 
Reported: 2012-10-08 02:41 EDT by Anush Shetty
Modified: 2013-09-23 18:33 EDT
CC List: 7 users

See Also:
Fixed In Version: glusterfs-3.4.0.33rhs-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment: virt rhev integration
Last Closed: 2013-09-23 18:33:29 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
SOS report of the server (1.87 MB, application/x-xz)
2012-10-08 03:33 EDT, Anush Shetty

Description Anush Shetty 2012-10-08 02:41:09 EDT
Description of problem: In a 2x2 distributed-replicate volume, when one of the bricks in a replica pair was brought down by killing its glusterfsd process, the VM hosted on the volume moved to the paused state. Even after bringing the brick back up, we were unable to resume the VM.
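
For reference, a minimal sketch (not captured from the original setup) of bringing the killed brick back and checking pending self-heal; the volume name is the dist-replica volume listed under Additional info:

# gluster volume status dist-replica        (confirm which brick process shows as offline)
# gluster volume start dist-replica force   (respawn the killed glusterfsd for that brick)
# gluster volume heal dist-replica info     (list entries still pending self-heal)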


Version-Release number of selected component (if applicable):
# rpm -qa | grep glus
glusterfs-fuse-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-devel-3.3.0rhsvirt1-6.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-6.el6rhs.x86_64


How reproducible: Consistently


Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume and use it as a storage domain in RHEV-M
2. Create VMs on the storage domain
3. Bring down a brick in one of the replica pairs by killing its glusterfsd process (a rough command sketch follows below)
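
A hedged sketch of the setup and fault injection, using the brick hosts and paths listed under Additional info; <pid> is a placeholder for whatever PID gluster volume status reports for the chosen brick:

# gluster volume create dist-replica replica 2 \
    rhs-client36.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client37.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client43.lab.eng.blr.redhat.com:/dist-replica1 \
    rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
# gluster volume start dist-replica
(add the volume as a storage domain in RHEV-M and create VMs on it)
# gluster volume status dist-replica        (note the PID of the target brick's glusterfsd)
# kill -KILL <pid>                          (brings that brick down)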
  
Actual results:

The VM was paused due to storage errors.

Expected results:

The VM should continue running when one brick of the replica pair is down.


Additional info:


Volume Name: dist-replica
Type: Distributed-Replicate
Volume ID: 39e0c10c-12d8-4484-b21d-a3be0cd0b7aa
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client36.lab.eng.blr.redhat.com:/dist-replica1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/dist-replica1
Brick3: rhs-client43.lab.eng.blr.redhat.com:/dist-replica1
Brick4: rhs-client44.lab.eng.blr.redhat.com:/dist-replica1
Options Reconfigured:
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
storage.linux-aio: disable
cluster.eager-lock: enable
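
For completeness, the reconfigured options above correspond to gluster volume set calls of the following form (a sketch, not commands captured from the setup):

# gluster volume set dist-replica performance.quick-read disable
# gluster volume set dist-replica performance.io-cache disable
# gluster volume set dist-replica performance.stat-prefetch disable
# gluster volume set dist-replica performance.read-ahead disable
# gluster volume set dist-replica storage.linux-aio disable
# gluster volume set dist-replica cluster.eager-lock enable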

# glustershd log from the server hosting the brick that was brought down
[2012-10-08 11:25:25.678833] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-3: Connected to 10.70.36.68:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:25.678856] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:25.678907] I [afr-common.c:3631:afr_notify] 0-dist-replica-replicate-1: Subvolume 'dist-replica-client-3' came back up; going online.
[2012-10-08 11:25:25.679457] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-3: Server lk version = 1
[2012-10-08 11:25:25.689072] E [afr-self-heal-data.c:1311:afr_sh_data_open_cbk] 0-dist-replica-replicate-1: open of <gfid:aec237f3-6779-4117-b2ac-349cbdb2256a> failed on child dist-replica-client-2 (Transport endpoint is not connected)
[2012-10-08 11:25:25.699814] E [afr-self-heal-data.c:1311:afr_sh_data_open_cbk] 0-dist-replica-replicate-1: open of <gfid:b2c333bf-ca8c-4cf7-8388-534fb6035f2d> failed on child dist-replica-client-2 (Transport endpoint is not connected)
[2012-10-08 11:25:27.685167] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.685537] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-0: Connected to 10.70.36.60:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.685579] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.685659] I [afr-common.c:3631:afr_notify] 0-dist-replica-replicate-0: Subvolume 'dist-replica-client-0' came back up; going online.
[2012-10-08 11:25:27.686275] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-0: Server lk version = 1
[2012-10-08 11:25:27.690884] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-1: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.691223] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-1: Connected to 10.70.36.61:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.691251] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.691861] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-1: Server lk version = 1
[2012-10-08 11:25:27.695234] I [client-handshake.c:1614:select_server_supported_programs] 0-pure-replica-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.695467] I [client-handshake.c:1411:client_setvolume_cbk] 0-pure-replica-client-0: Connected to 10.70.36.67:24009, attached to remote volume '/pure-replica1'.
[2012-10-08 11:25:27.695492] I [client-handshake.c:1423:client_setvolume_cbk] 0-pure-replica-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.696043] I [client-handshake.c:453:client_set_lk_version_cbk] 0-pure-replica-client-0: Server lk version = 1
[2012-10-08 11:25:27.699351] I [client-handshake.c:1614:select_server_supported_programs] 0-dist-replica-client-2: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-08 11:25:27.699678] I [client-handshake.c:1411:client_setvolume_cbk] 0-dist-replica-client-2: Connected to 10.70.36.67:24010, attached to remote volume '/dist-replica1'.
[2012-10-08 11:25:27.699712] I [client-handshake.c:1423:client_setvolume_cbk] 0-dist-replica-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-08 11:25:27.700306] I [client-handshake.c:453:client_set_lk_version_cbk] 0-dist-replica-client-2: Server lk version = 1
[2012-10-08 11:35:25.803699] E [afr-self-heal-data.c:763:afr_sh_data_fxattrop_fstat_done] 0-dist-replica-replicate-1: Unable to self-heal contents of '<gfid:aec237f3-6779-4117-b2ac-349cbdb2256a>' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2012-10-08 11:37:26.233494] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-08 11:37:27.256012] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-10-08 11:37:27.257859] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-10-08 11:43:12.961746] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
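
The "possible split-brain" message above names gfid aec237f3-6779-4117-b2ac-349cbdb2256a. A hedged sketch of inspecting the pending AFR changelog xattrs for that file on each brick of the affected replica pair (the path follows the standard .glusterfs gfid layout; the exact trusted.afr.* key names depend on the volume's client indices):

# getfattr -d -m . -e hex \
    /dist-replica1/.glusterfs/ae/c2/aec237f3-6779-4117-b2ac-349cbdb2256a
(non-zero trusted.afr.dist-replica-client-* data counters on both bricks of the pair indicate that each copy blames the other, i.e. a data split-brain)
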
Comment 2 Anush Shetty 2012-10-08 03:33:04 EDT
Created attachment 623321
SOS report of the server
Comment 3 Pranith Kumar K 2012-10-13 06:33:28 EDT
Anush,
    Could you attach the sos-report from the other brick as well as from the client machine?

Pranith
Comment 5 Vijay Bellur 2013-01-08 04:44:46 EST
http://review.gluster.org/4130
Comment 6 Pranith Kumar K 2013-01-08 06:02:57 EST
Patch is under review.
Comment 7 Vijay Bellur 2013-01-17 02:11:34 EST
CHANGE: http://review.gluster.org/4310 (cluster/afr: Pre-op should be undone for non-piggyback post-op) merged in master by Anand Avati (avati@redhat.com)
Comment 8 Pranith Kumar K 2013-01-22 06:04:50 EST
The bug is considered fixed based on code inspection; this may not be the complete fix, so please feel free to re-open the bug if the issue is observed again. The debugging infrastructure needed to diagnose split-brains will be committed as part of bug 864666.
Comment 9 Rejy M Cyriac 2013-09-11 07:28:43 EDT
Verified that the issue is not reproducible.

RHS 2.1 Server - glusterfs-server-3.4.0.33rhs-1.el6rhs.x86_64

RHEVM:

RHEVM 3.2:
SF 20.1 (3.2.3-0.42.el6ev) 

RHEVM 3.3:
IS13 - rhevm-3.3.0-0.19.master.el6ev
Comment 10 Scott Haines 2013-09-23 18:33:29 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
