Description of problem:
With gluster-3.5.1beta2, an add-brick on a mounted volume followed by immediate access to that volume gives transient/permanent failures for a few seconds. The bug was found during investigation of https://bugzilla.redhat.com/show_bug.cgi?id=1110262

Version-Release number of selected component (if applicable):
gluster-3.5.1beta2

How reproducible:
Always

Steps to Reproduce:
1. Run the attached script as './bug.sh 00777'

Actual results:

## Start of run ons jun 25 13:03:43 CEST 2014
Setting up gluster
Creating testvol
Adding brick
Brick added
0.009 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Transport endpoint is not connected
0.014 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Transport endpoint is not connected
...
2.372 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Transport endpoint is not connected
2.378 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Transport endpoint is not connected
2.386 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Stale file handle
2.393 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Stale file handle
...
3.005 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Stale file handle
3.011 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Stale file handle
Indicator found (self-heal executed)
3.057 # rmdir /mnt/gluster/heal # rmdir: failed to remove ‘/mnt/gluster/heal’: Permission denied
755 /data/disk1
755 /data/disk1/gluster
777 /data/disk1/gluster/test
755 /data/disk1/gluster/test/dir1
755 /data/disk1/gluster/test/dir1/dir2
755 /data/disk1/gluster/test/dir1/dir2/dir3
755 /data/disk2
755 /data/disk2/gluster
755 /data/disk2/gluster/heal
## End of run ons jun 25 13:04:07 CEST 2014

Expected results:

## Start of run ons jun 25 13:03:43 CEST 2014
Setting up gluster
Creating testvol
Adding brick
Brick added
Indicator found (self-heal executed)
755 /data/disk1
755 /data/disk1/gluster
777 /data/disk1/gluster/test
755 /data/disk1/gluster/test/dir1
755 /data/disk1/gluster/test/dir1/dir2
755 /data/disk1/gluster/test/dir1/dir2/dir3
755 /data/disk2
755 /data/disk2/gluster
## End of run ons jun 25 13:04:07 CEST 2014

Additional info:
Created attachment 912056 [details]
Script to trigger bug
Root Cause Analysis:
===================
When a new brick is added, the DHT in the new graph waits to hear from all of its children before any I/O can happen on it. In this case, here is the sequence of events seen:

1. glusterd sends a volfile change notification to the mount.
2. The mount creates a new graph.
3. On receiving PARENT_UP, the client translators attempt a portmap query on their respective remote subvolumes.
4. glusterd sends a failure status (rsp.ret = -1) for the portmap query by the new client translator, since the brick process associated with the new brick has not yet done a portmap signin with its glusterd. (Note that the portmap query succeeds on the already existing bricks, causing them to send a CHILD_UP notification all the way up to DHT.)
5. This causes the second client translator to perceive that the last child is down and to send a CHILD_DOWN notification.
6. After this, DHT believes it has heard from all of its subvolumes and notifies its parents and ancestors accordingly.
7. This causes file operations to be passed on to the new graph.
8. These file operations keep failing with ENOTCONN for 3 seconds.
9. After 3 seconds, the client attempts a reconnect, by which time the glusterd on the remote subvolume is aware of the new glusterfsd's port number; from this point on, the handshake succeeds and fops stop failing with ENOTCONN.
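Until this is fixed inside glusterd, the race described above can be sidestepped from a provisioning script by polling until the new brick reports itself online (i.e. its glusterfsd has done the portmap signin) before issuing fops on the mount. The sketch below is only illustrative: the volume and brick names are placeholders, and it assumes the tabular 'gluster volume status' output where the Online column ('Y'/'N') is the second-to-last field of a brick line.

```shell
#!/bin/sh
# wait_until CMD [ARGS...]: poll CMD until it succeeds, up to ~10 seconds;
# return 1 on timeout.
wait_until() {
    tries=50
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep 0.2
    done
    return 1
}

# brick_online VOL BRICK: true once glusterd reports the brick as online,
# i.e. after its glusterfsd has performed the portmap signin. Assumes the
# Online flag is the second-to-last column of 'gluster volume status' output.
brick_online() {
    gluster volume status "$1" "$2" | awk '$(NF-1) == "Y" { found = 1 } END { exit !found }'
}

# Hypothetical usage after add-brick (names made up):
# gluster volume add-brick testvol server2:/data/disk2/gluster
# wait_until brick_online testvol server2:/data/disk2/gluster
# rmdir /mnt/gluster/heal/indicator   # no longer races with the new graph
```

This only narrows the window on the node the script runs on; as noted below, a client could still query a glusterd that has not yet committed the volfile change.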
Current head (v3.5qa2-725-g72d96e2) has fixed the 'Stale file handle' errors; the 'Transport endpoint is not connected' errors are still present (and seem to persist for 3.3 seconds instead of 2.3 seconds).
Is the current behavior in master, where file-system operations on the fuse-fs fail with 'Transport endpoint is not connected' for approximately 3.2 seconds after an add-brick is done, considered acceptable? Or would it make sense to add something like NFS's soft/hard, retrans and timeo logic, and if so, could somebody give some hints on a good way to do it (client.c?)
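Pending a decision on client-side retry logic, a caller can approximate NFS soft-mount semantics from a script. The sketch below is an assumption of how such a wrapper could look, not anything gluster provides; the function name and the example timings are made up.

```shell
#!/bin/sh
# retry_fop TIMEO RETRANS CMD [ARGS...]: re-issue CMD up to RETRANS times,
# sleeping TIMEO seconds between attempts. Roughly NFS soft-mount behaviour:
# the operation finally fails if the server never becomes reachable.
retry_fop() {
    timeo=$1
    retrans=$2
    shift 2
    i=0
    while :; do
        "$@" && return 0
        i=$((i + 1))
        [ "$i" -ge "$retrans" ] && return 1
        sleep "$timeo"
    done
}

# Hypothetical usage, sized to outlast the ~3 s ENOTCONN window seen above:
# retry_fop 1 5 rmdir /mnt/gluster/heal/indicator
```

A hard-mount equivalent would simply loop forever; the soft variant above at least bounds the delay, at the cost of surfacing the error if the window is longer than TIMEO * RETRANS.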
Krutika, the root cause in your comment #2 is pretty well explained.

Do I understand correctly that *a* fix would be to have glusterd wait to send the change notification to the mount until the (new) bricks have signed in at the glusterd portmapper? (Probably stated much simpler than it actually is.)

Is there an ETA to have this fixed? Can this get included in 3.5.3, which hopefully sees a beta during next week?
(In reply to Niels de Vos from comment #5)
> Krutika, the root cause in your comment #2 is pretty well explained.
>
> Do I understand correctly that *a* fix would be to have glusterd wait with
> sending the change notification to the mount, until the (new) bricks have
> signed-in at the glusterd-portmapper? (Probably stated much simpler than it
> actually is.)
>
> Is there an ETA to have this fixed? Can this get included in 3.5.3 that
> hopefully sees a beta during next week?

Hi Niels,

That was one solution that Pranith and I had discussed with Kaushal. We concluded that this is an intrusive change and the current infrastructure in glusterd doesn't allow it to be implemented easily.

For example, consider a 2-node cluster and an add-brick that adds a second brick to a volume from the second node, with the first node being the originator. During the commit op, glusterd on node-1 first changes its volfile and notifies the clients connected to it. Before node-2 performs its own commit op, which is what starts the glusterfsd process for the newly added brick, client(s) that have node-1 as their volfile-server could end up querying glusterd on node-2 for brick-2's port number and fail for one of the following reasons:

a) the glusterd on node-2 has launched the glusterfsd associated with brick-2, but the brick process hasn't performed a portmap signin yet; or
b) the glusterd on node-2 hasn't even changed the volfile for the volume yet, and as a result does not even recognise that there is a brick-2 for the volume.

Both of the above cases need to be handled by glusterd. It should also avoid waiting indefinitely to hear from the bricks before notifying the clients, in case a brick went down (due to a crash or disconnect) without ever doing a pmap signin.

Getting these two problems fixed in glusterd is not a trivial task and cannot be delivered as part of 3.5.3. :(

-Krutika
Thanks for the clear explanation! Moving this to glusterfs-3.5.4 for now.
No patches that would resolve this bug were submitted in time for the glusterfs-3.5.6 release. This bug report has been moved to the glusterfs-3.5.7 release for tracking; submitting patches/backports is very much appreciated.
This bug is being closed because the 3.5 release series is marked End-Of-Life. There will be no further updates to this version. If you are still facing this issue in a more current release, please open a new bug against a version that still receives bugfixes.