Bug 973183 - Network down and up on one brick causes self-healing to stop working until glusterd restart
Summary: Network down and up on one brick causes self-healing to stop working until glusterd restart
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.4.0-beta
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Anuradha
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-06-11 12:36 UTC by Marcin Garski
Modified: 2016-09-20 02:00 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-09-01 06:35:27 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Marcin Garski 2013-06-11 12:36:31 UTC
Description of problem:
Two nodes on CentOS 6.4:
1. stgnode01 (eth0: 192.168.13.51, eth1: 10.10.0.11)
2. stgnode02 (eth0: 192.168.13.52, eth1: 10.10.0.12)
with a replicated GlusterFS volume on eth1 (the nodes are connected by an Ethernet crossover cable) and CTDB (configured on eth0) providing the virtual IP 192.168.13.21.
On a test node I mounted the GlusterFS volume over NFS via 192.168.13.21; CTDB was pointing the virtual IP at stgnode02.
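
(For reference, such an NFS mount would look roughly like the line below; the mount point /mnt/datavol is illustrative, not taken from the report. Gluster's built-in NFS server speaks NFSv3.)

#######################################################
mount -t nfs -o vers=3 192.168.13.21:/datavol /mnt/datavol
#######################################################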

When I copy a big file from the test node to the mounted NFS share and unplug the Ethernet crossover cable, the copy freezes and after some time unfreezes. I then replug the crossover cable between the two nodes while the file is still being copied.

When the copy finishes, the problem appears :) On stgnode02 the file has the proper size; on stgnode01 there is only part of the file (it looks like it stopped growing when the cable was unplugged).
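
The freeze-then-unfreeze during the copy most likely corresponds to the client waiting out network.ping-timeout (42 seconds by default) before declaring the disconnected brick down and continuing with the remaining replica. If the stall matters, the timeout can be lowered, e.g.:

#######################################################
gluster volume set datavol network.ping-timeout 10
#######################################################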

#######################################################
gluster vol heal datavol info
Gathering Heal info on volume datavol has been successful

Brick stgnode01:/bricks/data
Number of entries: 1
/test.iso

Brick stgnode02:/bricks/data
Number of entries: 1
/test.iso
#######################################################
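
Since the same entry shows up on both bricks, it is worth checking whether AFR already classifies it as split-brain rather than merely pending heal; a sketch, assuming the 3.4-era heal-info subcommands:

#######################################################
gluster volume heal datavol info split-brain
gluster volume heal datavol info heal-failed
#######################################################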

The heal process still has not finished after 12 hours; getfattr shows:

#######################################################
[root@stgnode01 glusterfs]# getfattr -d -m '.*' -e hex /bricks/data/test.iso
# file: bricks/data/test.iso
trusted.afr.datavol-client-0=0x000000010000000000000000
trusted.afr.datavol-client-1=0x000000010000000000000000
trusted.gfid=0x6ed4b51f90634fce993c097b8d77e4fe


[root@stgnode02 data]# getfattr -d -m '.*' -e hex /bricks/data/test.iso
# file: bricks/data/test.iso
trusted.afr.datavol-client-0=0x00009a230000000300000000
trusted.afr.datavol-client-1=0x000000000000000000000000
trusted.gfid=0x6ed4b51f90634fce993c097b8d77e4fe
#######################################################
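
Each trusted.afr.<volume>-client-<N> value packs three big-endian 32-bit counters: pending data, metadata and entry operations held against client N's brick. A minimal bash sketch to decode one of the values above (paste the hex by hand, leading 0x stripped):

#######################################################
hex=00009a230000000300000000   # trusted.afr.datavol-client-0 on stgnode02
printf 'data=%d metadata=%d entry=%d\n' \
    $((16#${hex:0:8})) $((16#${hex:8:8})) $((16#${hex:16:8}))
# prints: data=39459 metadata=3 entry=0
# i.e. stgnode02 blames stgnode01's brick (client-0) for 39459 data ops
#######################################################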

Procedure that helps:
1. service glusterd stop
2. pkill glusterd
3. service glusterd start

After a few seconds, "gluster volume heal datavol info" shows 0 entries, the file size of test.iso is identical on both nodes, and getfattr shows:
trusted.afr.datavol-client-0=0x000000000000000000000000
trusted.afr.datavol-client-1=0x000000000000000000000000
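
A less disruptive step worth trying before the glusterd restart (whether it would have helped in this case is unverified) is to trigger the self-heal daemon explicitly:

#######################################################
gluster volume heal datavol        # heal the indexed entries only
gluster volume heal datavol full   # full crawl of the volume
#######################################################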

Version-Release number of selected component (if applicable):
glusterfs-3.4.0-0.5.beta2.el6.x86_64

The GlusterFS volume was created as:
gluster vol create datavol replica 2 stgnode01:/bricks/data stgnode02:/bricks/data
gluster vol set datavol nfs.port 2049
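
(To sanity-check the result, the volume definition and the built-in NFS server can be inspected as below; output omitted.)

#######################################################
gluster volume info datavol
gluster volume status datavol nfs
#######################################################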

Comment 1 Pranith Kumar K 2014-07-13 07:22:05 UTC
Anuradha,
    I think this bug should already be fixed in 3.5.x and upstream. Could you please verify, and close the bug if it is indeed working?

Pranith

Comment 2 Niels de Vos 2015-05-17 22:00:17 UTC
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained, at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify if newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

