Bug 762650 (GLUSTER-918)

Summary: AFR write fails when subvolumes' state is swapped
Product: [Community] GlusterFS
Component: replicate
Reporter: Pavan Vilas Sondur <pavan>
Assignee: Anand Avati <aavati>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: urgent
Version: mainline
CC: amarts, chrisw, gluster-bugs, rabhat, rahulcs
Hardware: All
OS: Linux
Doc Type: Bug Fix
Regression: RTP
Documentation: DNR

Description Pavan Vilas Sondur 2010-05-11 04:27:09 UTC

Consider a file opened while one subvolume is down. If, before a write on that fd arrives, the 'offline' subvolume comes up and the 'online' subvolume goes offline, the subsequent write fails. Here's a small program to reproduce this:

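#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int
main (int argc, char *argv[])
{
        int fd = -1;
        ssize_t ret = -1;
        const char *str = "hello, world\n";

        /* O_CREAT requires an explicit mode argument */
        fd = open ("/mnt/test", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                /* open() returns -1 on failure, not 0 */
                perror ("open failed");
                return 1;
        }

        /* window in which to swap the subvolumes' up/down state */
        sleep (40);

        ret = write (fd, str, strlen (str));
        if (ret < 0)
                perror ("write failed");
        else
                printf ("wrote %zd bytes\n", ret);

        close (fd);
        return (ret < 0) ? 1 : 0;
}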


Relevant logs:


[2010-05-11 09:48:58] T [socket.c:581:__socket_proto_state_machine] localhost-2: read (Transport endpoint is not connected) in state 1 (127.0.0.1:2002)
[2010-05-11 09:48:58] T [fuse-bridge.c:1966:fuse_write] glusterfs-fuse: 48: WRITE (0x875038, size=13, offset=0)
[2010-05-11 09:48:58] T [afr-open.c:480:afr_up_down_flush] repl: doing up/down flush on fd=0x875038

[2010-05-11 09:48:58] D [afr-transaction.c:1263:afr_lock_rec] repl: unable to lock on even one child
[2010-05-11 09:48:58] D [afr-transaction.c:1263:afr_lock_rec] repl: unable to lock on even one child
^^^^^^^^^^^^^^^^^^^^^^^^^

[2010-05-11 09:48:58] W [fuse-bridge.c:1921:fuse_writev_cbk] glusterfs-fuse: 48: WRITE => -1 (Resource temporarily unavailable)
[2010-05-11 09:48:58] T [client-protocol.c:5883:protocol_client_cleanup] localhost-2: cleaning up state in transport object 0x86e168
[2010-05-11 09:48:58] T [fuse-bridge.c:2064:fuse_release] glusterfs-fuse: 49: RELEASE 0x8753c8 (FLUSH implied)
[2010-05-11 09:48:58] T [client-protocol.c:1695:client_flush] localhost-1: (2342924): failed to get fd ctx. EBADFD
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2010-05-11 09:48:58] W [fuse-bridge.c:1182:fuse_err_cbk] glusterfs-fuse: 49: FLUSH() ERR => -1 (File descriptor in bad state)
[2010-05-11 09:48:58] T [fuse-bridge.c:2005:fuse_flush] glusterfs-fuse: 50: FLUSH 0x875038
[2010-05-11 09:48:58] T [client-protocol.c:1695:client_flush] localhost-1: (2342929): failed to get fd ctx. EBADFD
[2010-05-11 09:48:58] W [fuse-bridge.c:1182:fuse_err_cbk] glusterfs-fuse: 50: FLUSH() ERR => -1 (File descriptor in bad state)
[2010-05-11 09:48:58] T [fuse-bridge.c:2064:fuse_release] glusterfs-fuse: 51: RELEASE 0x875038


This bug can serve as a placeholder for bugs of this type: open, then subvolume down/up, then a FOP on the already-open fd.
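
For the wider class of bugs mentioned above, here is a rough sketch of a reproducer that issues several FOPs on the fd after the delay, instead of only a write. The helper name fops_after_delay and the choice of FOPs (write, fstat, ftruncate, fsync, close) are assumptions made for illustration; they are not taken from the GlusterFS sources or test suite.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of GlusterFS): open `path`, wait `delay`
 * seconds so the subvolumes' up/down state can be swapped by hand, then
 * issue several FOPs on the same fd.
 * Returns the number of FOPs that failed, or -1 if open() itself failed.
 */
int
fops_after_delay (const char *path, unsigned int delay)
{
        const char *str = "hello, world\n";
        struct stat st;
        int failed = 0;
        int fd = open (path, O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror ("open");
                return -1;
        }

        /* swap the subvolumes' state during this window */
        sleep (delay);

        if (write (fd, str, strlen (str)) < 0) {
                perror ("write");
                failed++;
        }
        if (fstat (fd, &st) < 0) {
                perror ("fstat");
                failed++;
        }
        if (ftruncate (fd, 0) < 0) {
                perror ("ftruncate");
                failed++;
        }
        if (fsync (fd) < 0) {
                perror ("fsync");
                failed++;
        }
        if (close (fd) < 0) {
                perror ("close");
                failed++;
        }

        return failed;
}
```

Calling fops_after_delay ("/mnt/test", 40) from a main() mirrors the original program; a non-zero return value means at least one FOP on the open fd failed after the swap, and the perror output shows which one.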

Comment 1 Amar Tumballi 2010-10-05 09:51:37 UTC

Avati, can you confirm?

Comment 2 Anand Avati 2010-10-28 03:37:41 UTC

PATCH: http://patches.gluster.com/patch/5590 in master (replicate: fix hang/missing frame during locking)

Comment 3 Anand Avati 2010-10-28 03:37:45 UTC

PATCH: http://patches.gluster.com/patch/5591 in master (replicate: attempt re-open of files before performing openfd selfheal)

Comment 4 Amar Tumballi 2011-02-15 04:47:47 UTC

I am not sure what to add here, but I guess we need to mention this feature to users. Avati/Vijay, any update on this? Divya, please ask Vijay/Avati whether we need any documentation at all for this bug.

Comment 5 Amar Tumballi 2011-04-13 06:22:23 UTC

DNR - the bug is fixed, and users see no issues.