Bug 762650 (GLUSTER-918)

Summary: AFR write fails when subvolumes' state is swapped
Product: [Community] GlusterFS
Reporter: Pavan Vilas Sondur <pavan>
Component: replicate
Assignee: Anand Avati <aavati>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: urgent
Version: mainline
CC: amarts, chrisw, gluster-bugs, rabhat, rahulcs
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Story Points: ---
Type: ---
Regression: RTP
Mount Type: ---
Documentation: DNR
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description Pavan Vilas Sondur 2010-05-11 04:27:09 UTC
If a file is opened while one subvolume is down, and then, before a write on that fd arrives, the 'offline' subvolume comes up and the 'online' subvolume goes offline, the subsequent write fails. Here's a small program to reproduce this:

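#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int
main (int argc, char *argv[])
{
        int         fd  = -1;
        ssize_t     ret = -1;
        const char *str = "hello, world\n";

        /* O_CREAT requires an explicit mode argument */
        fd = open ("/mnt/test", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                /* open returns -1 on failure; 0 is a valid descriptor */
                perror ("open failed");
                return 1;
        }

        /* window in which to swap the subvolumes' up/down state */
        sleep (40);

        ret = write (fd, str, strlen (str));
        if (ret < 0)
                perror ("write failed");
        else
                printf ("wrote %zd bytes\n", ret);

        close (fd);

        return (ret < 0) ? 1 : 0;
}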


Relevant logs:


[2010-05-11 09:48:58] T [socket.c:581:__socket_proto_state_machine] localhost-2: read (Transport endpoint is not connected) in state 1 (127.0.0.1:2002)
[2010-05-11 09:48:58] T [fuse-bridge.c:1966:fuse_write] glusterfs-fuse: 48: WRITE (0x875038, size=13, offset=0)
[2010-05-11 09:48:58] T [afr-open.c:480:afr_up_down_flush] repl: doing up/down flush on fd=0x875038

[2010-05-11 09:48:58] D [afr-transaction.c:1263:afr_lock_rec] repl: unable to lock on even one child
[2010-05-11 09:48:58] D [afr-transaction.c:1263:afr_lock_rec] repl: unable to lock on even one child
^^^^^^^^^^^^^^^^^^^^^^^^^

[2010-05-11 09:48:58] W [fuse-bridge.c:1921:fuse_writev_cbk] glusterfs-fuse: 48: WRITE => -1 (Resource temporarily unavailable)
[2010-05-11 09:48:58] T [client-protocol.c:5883:protocol_client_cleanup] localhost-2: cleaning up state in transport object 0x86e168
[2010-05-11 09:48:58] T [fuse-bridge.c:2064:fuse_release] glusterfs-fuse: 49: RELEASE 0x8753c8 (FLUSH implied)
[2010-05-11 09:48:58] T [client-protocol.c:1695:client_flush] localhost-1: (2342924): failed to get fd ctx. EBADFD
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2010-05-11 09:48:58] W [fuse-bridge.c:1182:fuse_err_cbk] glusterfs-fuse: 49: FLUSH() ERR => -1 (File descriptor in bad state)
[2010-05-11 09:48:58] T [fuse-bridge.c:2005:fuse_flush] glusterfs-fuse: 50: FLUSH 0x875038
[2010-05-11 09:48:58] T [client-protocol.c:1695:client_flush] localhost-1: (2342929): failed to get fd ctx. EBADFD
[2010-05-11 09:48:58] W [fuse-bridge.c:1182:fuse_err_cbk] glusterfs-fuse: 50: FLUSH() ERR => -1 (File descriptor in bad state)
[2010-05-11 09:48:58] T [fuse-bridge.c:2064:fuse_release] glusterfs-fuse: 51: RELEASE 0x875038


This bug can serve as a placeholder for bugs of this type: open, then a subvolume down/up swap, then a FOP on the fd.
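
To make the failure window concrete, here is a minimal sketch in C. It is not GlusterFS code: `replica_fd`, `CHILD_COUNT`, `sim_open`, and `sim_write` are hypothetical names. The model assumes only that an fd is opened on whichever children are up at open time, and that a write needs at least one child that is both up now and holds that open. The EAGAIN return mirrors the "Resource temporarily unavailable" seen in the log above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define CHILD_COUNT 2   /* hypothetical: a two-way replica */

/* Hypothetical per-fd state: which children saw the open. */
struct replica_fd {
        bool opened_on[CHILD_COUNT];
};

/* Record the open only on children that are up right now. */
static int
sim_open (struct replica_fd *fd, const bool child_up[CHILD_COUNT])
{
        int i      = 0;
        int opened = 0;

        assert (fd != NULL);

        for (i = 0; i < CHILD_COUNT; i++) {
                fd->opened_on[i] = child_up[i];
                if (child_up[i])
                        opened++;
        }

        return opened ? 0 : -ENOTCONN;
}

/* A write can proceed only on a child that is up AND holds the open. */
static int
sim_write (const struct replica_fd *fd, const bool child_up[CHILD_COUNT])
{
        int i = 0;

        assert (fd != NULL);

        for (i = 0; i < CHILD_COUNT; i++) {
                if (child_up[i] && fd->opened_on[i])
                        return 0;
        }

        return -EAGAIN;
}
```

In this model, opening with child 0 up and child 1 down, then swapping their states, leaves no child that is both up and holding the open, so `sim_write` returns -EAGAIN.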

Comment 1 Amar Tumballi 2010-10-05 09:51:37 UTC
Avati, can you confirm?

Comment 2 Anand Avati 2010-10-28 03:37:41 UTC
PATCH: http://patches.gluster.com/patch/5590 in master (replicate: fix hang/missing frame during locking)

Comment 3 Anand Avati 2010-10-28 03:37:45 UTC
PATCH: http://patches.gluster.com/patch/5591 in master (replicate: attempt re-open of files before performing openfd selfheal)
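
Going only by the patch title, the idea of re-opening a file on children that missed the original open can be sketched as below. This is an assumption-laden illustration, not the contents of the patch: `replica_fd`, `CHILD_COUNT`, `sim_reopen_missing`, and `sim_write_with_reopen` are hypothetical names, and setting a flag stands in for a real open on that child.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define CHILD_COUNT 2   /* hypothetical: a two-way replica */

/* Hypothetical per-fd state: which children hold the open. */
struct replica_fd {
        bool opened_on[CHILD_COUNT];
};

/* Re-open on every child that is up but never saw the open.
 * Returns the number of children re-opened. */
static int
sim_reopen_missing (struct replica_fd *fd, const bool child_up[CHILD_COUNT])
{
        int i        = 0;
        int reopened = 0;

        assert (fd != NULL);

        for (i = 0; i < CHILD_COUNT; i++) {
                if (child_up[i] && !fd->opened_on[i]) {
                        fd->opened_on[i] = true; /* stand-in for a real open */
                        reopened++;
                }
        }

        return reopened;
}

/* Attempt the re-open first, then look for a usable child. */
static int
sim_write_with_reopen (struct replica_fd *fd,
                       const bool child_up[CHILD_COUNT])
{
        int i = 0;

        sim_reopen_missing (fd, child_up);

        for (i = 0; i < CHILD_COUNT; i++) {
                if (child_up[i] && fd->opened_on[i])
                        return 0;
        }

        return -EAGAIN;
}
```

With the open held only on child 0 and only child 1 up, the re-open step gives child 1 the open and the write can proceed. If every child is down, the write still fails with -EAGAIN.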

Comment 4 Amar Tumballi 2011-02-15 04:47:47 UTC
I am not sure what to add here, but I guess we need to mention this feature to users. Avati/Vijay, any update on this? Divya, please ask Vijay/Avati whether this bug needs documentation at all.

Comment 5 Amar Tumballi 2011-04-13 06:22:23 UTC
DNR - the bug is fixed, and users should not see any issues.