Description of problem:
With gluster-3.5.1beta2, an add-brick on a mounted volume followed by immediate access to that volume gives transient/permanent failures for a few seconds. The bug was found during investigation of https://bugzilla.redhat.com/show_bug.cgi?id=1110262

Version-Release number of selected component (if applicable):
gluster-3.5.1beta2

How reproducible:
Always

Steps to Reproduce:
1. Run the attached script as './bug.sh 00777'

Actual results:

## Start of run ons jun 25 13:03:43 CEST 2014
Setting up gluster
Creating testvol
Adding brick
Brick added
0.009 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Transport endpoint is not connected
0.014 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Transport endpoint is not connected
...
2.372 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Transport endpoint is not connected
2.378 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Transport endpoint is not connected
2.386 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Stale file handle
2.393 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Stale file handle
...
3.005 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Stale file handle
3.011 # rmdir /mnt/gluster/heal/indica # rmdir: failed to remove ‘/mnt/gluster/heal/indicator’: Stale file handle
Indicator found (self-heal executed)
3.057 # rmdir /mnt/gluster/heal # rmdir: failed to remove ‘/mnt/gluster/heal’: Permission denied
755 /data/disk1
755 /data/disk1/gluster
777 /data/disk1/gluster/test
755 /data/disk1/gluster/test/dir1
755 /data/disk1/gluster/test/dir1/dir2
755 /data/disk1/gluster/test/dir1/dir2/dir3
755 /data/disk2
755 /data/disk2/gluster
755 /data/disk2/gluster/heal
## End of run ons jun 25 13:04:07 CEST 2014

Expected results:

## Start of run ons jun 25 13:03:43 CEST 2014
Setting up gluster
Creating testvol
Adding brick
Brick added
Indicator found (self-heal executed)
755 /data/disk1
755 /data/disk1/gluster
777 /data/disk1/gluster/test
755 /data/disk1/gluster/test/dir1
755 /data/disk1/gluster/test/dir1/dir2
755 /data/disk1/gluster/test/dir1/dir2/dir3
755 /data/disk2
755 /data/disk2/gluster
## End of run ons jun 25 13:04:07 CEST 2014

Additional info:
Created attachment 912056 [details]
Script to trigger bug
Root Cause Analysis:
===================
When a new brick is added, the DHT in the new graph waits to hear from all of its children before any I/O can happen on it. In this case, here is the sequence of events seen:

1. glusterd sends a volfile change notification to the mount.
2. The mount creates a new graph.
3. On receiving PARENT_UP, the client translators attempt a portmap query on their respective remote subvolumes.
4. glusterd sends a failure status (rsp.ret = -1) for the portmap query by the new client translator, since the brick process associated with the new brick has not yet done a portmap signin with its glusterd. (Note that the portmap query succeeds on the already existing bricks, causing them to send a CHILD_UP notification all the way up to DHT.)
5. This causes the second client translator to perceive that the last child is down and to send a CHILD_DOWN notification.
6. After this, DHT believes it has heard from all of its subvolumes and notifies its parents and ancestors accordingly.
7. This causes file operations to be passed on to the new graph.
8. These file operations keep failing with ENOTCONN for 3 seconds.
9. After 3 seconds, the client attempts a reconnect, by which time the glusterd on the remote subvolume is aware of the new glusterfsd's port number; from this point on, the handshake succeeds and fops stop failing with ENOTCONN.
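Until this is fixed inside glusterd, the race described above can be sidestepped from a provisioning script by polling until the new brick reports itself online (i.e. its glusterfsd has done the portmap signin) before issuing fops on the mount. The sketch below is only illustrative: the volume and brick names are placeholders, and it assumes the tabular 'gluster volume status' output where the Online column ('Y'/'N') is the second-to-last field of a brick line.

```shell
#!/bin/sh
# wait_until CMD [ARGS...]: poll CMD until it succeeds, up to ~10 seconds;
# return 1 on timeout.
wait_until() {
    tries=50
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep 0.2
    done
    return 1
}

# brick_online VOL BRICK: true once glusterd reports the brick as online,
# i.e. after its glusterfsd has performed the portmap signin. Assumes the
# Online flag is the second-to-last column of 'gluster volume status' output.
brick_online() {
    gluster volume status "$1" "$2" | awk '$(NF-1) == "Y" { found = 1 } END { exit !found }'
}

# Hypothetical usage after add-brick (names made up):
# gluster volume add-brick testvol server2:/data/disk2/gluster
# wait_until brick_online testvol server2:/data/disk2/gluster
# rmdir /mnt/gluster/heal/indicator   # no longer races with the new graph
```

This only narrows the window on the node the script runs on; as noted below, a client could still query a glusterd that has not yet committed the volfile change.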
Current head (v3.5qa2-725-g72d96e2) has fixed the 'Stale file handle' errors; the 'Transport endpoint is not connected' errors are still present (and seem to persist for 3.3 seconds instead of 2.3 seconds).
Is the current behavior in master, where file-system operations on the fuse-fs fail with 'Transport endpoint is not connected' for approximately 3.2 seconds after an add-brick is done, considered acceptable? Or would it make sense to add something like NFS's soft/hard, retrans and timeo logic, and if so, could somebody give some hints on a good way to do it (client.c?)
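Pending a decision on client-side retry logic, a caller can approximate NFS soft-mount semantics from a script. The sketch below is an assumption of how such a wrapper could look, not anything gluster provides; the function name and the example timings are made up.

```shell
#!/bin/sh
# retry_fop TIMEO RETRANS CMD [ARGS...]: re-issue CMD up to RETRANS times,
# sleeping TIMEO seconds between attempts. Roughly NFS soft-mount behaviour:
# the operation finally fails if the server never becomes reachable.
retry_fop() {
    timeo=$1
    retrans=$2
    shift 2
    i=0
    while :; do
        "$@" && return 0
        i=$((i + 1))
        [ "$i" -ge "$retrans" ] && return 1
        sleep "$timeo"
    done
}

# Hypothetical usage, sized to outlast the ~3 s ENOTCONN window seen above:
# retry_fop 1 5 rmdir /mnt/gluster/heal/indicator
```

A hard-mount equivalent would simply loop forever; the soft variant above at least bounds the delay, at the cost of surfacing the error if the window is longer than TIMEO * RETRANS.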
Krutika, the root cause in your comment #2 is pretty well explained.

Do I understand correctly that *a* fix would be to have glusterd wait to send the change notification to the mount until the (new) bricks have signed in at the glusterd portmapper? (Probably stated much simpler than it actually is.)

Is there an ETA to have this fixed? Can this get included in 3.5.3, which hopefully sees a beta during next week?
(In reply to Niels de Vos from comment #5)
> Krutika, the root cause in your comment #2 is pretty well explained.
>
> Do I understand correctly that *a* fix would be to have glusterd wait with
> sending the change notification to the mount, until the (new) bricks have
> signed-in at the glusterd-portmapper? (Probably stated much simpler than it
> actually is.)
>
> Is there an ETA to have this fixed? Can this get included in 3.5.3 that
> hopefully sees a beta during next week?

Hi Niels,

That was one solution that Pranith and I had discussed with Kaushal. We concluded that this is an intrusive change and the current infrastructure in glusterd doesn't allow it to be implemented easily.

For example, consider a 2-node cluster and an add-brick that adds a second brick to a volume from the second node, with the first node being the originator. During the commit op, glusterd on node-1 first changes its volfile and notifies the clients connected to it. Before node-2 performs its own commit op, which is what starts the glusterfsd process for the newly added brick, client(s) that have node-1 as their volfile-server could end up querying glusterd on node-2 for brick-2's port number and fail for one of the following reasons:

a) the glusterd on node-2 has launched the glusterfsd associated with brick-2, but the brick process hasn't performed a portmap signin yet; or
b) the glusterd on node-2 hasn't even changed the volfile for the volume yet, and as a result does not even recognise that there is a brick-2 for the volume.

Both of the above cases need to be handled by glusterd. It should also avoid waiting indefinitely to hear from the bricks before notifying the clients, in case a brick went down (due to a crash or disconnect) without ever doing a pmap signin.

Getting these two problems fixed in glusterd is not a trivial task and cannot be delivered as part of 3.5.3. :(

-Krutika
Thanks for the clear explanation! Moving this to glusterfs-3.5.4 for now.
No patches that would resolve this bug were submitted in time for the glusterfs-3.5.6 release. This bug report has been moved to the glusterfs-3.5.7 release for tracking; submitting patches/backports is very much appreciated.
This bug is being closed because the 3.5 release series is marked End-Of-Life. There will be no further updates to this version. If you are still facing this issue in a more current release, please open a new bug against a version that still receives bugfixes.