| Summary: | AFR write fails when subvolumes' state is swapped | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Pavan Vilas Sondur <pavan> |
| Component: | replicate | Assignee: | Anand Avati <aavati> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | mainline | CC: | amarts, chrisw, gluster-bugs, rabhat, rahulcs |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | RTP | Mount Type: | --- |
| Documentation: | DNR | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |

Avati, can you confirm?

PATCH: http://patches.gluster.com/patch/5590 in master (replicate: fix hang/missing frame during locking)

PATCH: http://patches.gluster.com/patch/5591 in master (replicate: attempt re-open of files before performing openfd selfheal)

I am not sure what to add here, but I guess we need to mention this feature to the user. Avati/Vijay, any update on this?

Divya, please ask Vijay/Avati whether we need documentation at all for this bug.

DNR: the bug is fixed, and the user sees no issues.

When a subvolume is down and an open call comes in, and then, before a write on the fd arrives, the 'offline' subvolume comes up and the 'online' subvolume goes offline, the subsequent write fails. Here's a small program to reproduce this:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    int
    main (int argc, char *argv[])
    {
            int         fd  = -1;
            ssize_t     ret = -1;
            const char *str = "hello, world\n";

            /* Open while one subvolume is down. O_CREAT requires a mode. */
            fd = open ("/mnt/test", O_RDWR | O_CREAT, 0644);
            if (fd < 0) {   /* open() returns -1 on failure, not 0 */
                    perror ("open failed");
                    return 1;
            }

            /* Window in which to swap the subvolumes' up/down state. */
            sleep (40);

            ret = write (fd, str, strlen (str));
            if (ret < 0)
                    perror ("write failed");
            else
                    printf ("wrote %zd bytes\n", ret);

            close (fd);
            return (ret < 0) ? 1 : 0;
    }

Relevant logs:

    [2010-05-11 09:48:58] T [socket.c:581:__socket_proto_state_machine] localhost-2: read (Transport endpoint is not connected) in state 1 (127.0.0.1:2002)
    [2010-05-11 09:48:58] T [fuse-bridge.c:1966:fuse_write] glusterfs-fuse: 48: WRITE (0x875038, size=13, offset=0)
    [2010-05-11 09:48:58] T [afr-open.c:480:afr_up_down_flush] repl: doing up/down flush on fd=0x875038
    [2010-05-11 09:48:58] D [afr-transaction.c:1263:afr_lock_rec] repl: unable to lock on even one child
    [2010-05-11 09:48:58] D [afr-transaction.c:1263:afr_lock_rec] repl: unable to lock on even one child
                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^
    [2010-05-11 09:48:58] W [fuse-bridge.c:1921:fuse_writev_cbk] glusterfs-fuse: 48: WRITE => -1 (Resource temporarily unavailable)
    [2010-05-11 09:48:58] T [client-protocol.c:5883:protocol_client_cleanup] localhost-2: cleaning up state in transport object 0x86e168
    [2010-05-11 09:48:58] T [fuse-bridge.c:2064:fuse_release] glusterfs-fuse: 49: RELEASE 0x8753c8 (FLUSH implied)
    [2010-05-11 09:48:58] T [client-protocol.c:1695:client_flush] localhost-1: (2342924): failed to get fd ctx. EBADFD
                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    [2010-05-11 09:48:58] W [fuse-bridge.c:1182:fuse_err_cbk] glusterfs-fuse: 49: FLUSH() ERR => -1 (File descriptor in bad state)
    [2010-05-11 09:48:58] T [fuse-bridge.c:2005:fuse_flush] glusterfs-fuse: 50: FLUSH 0x875038
    [2010-05-11 09:48:58] T [client-protocol.c:1695:client_flush] localhost-1: (2342929): failed to get fd ctx. EBADFD
    [2010-05-11 09:48:58] W [fuse-bridge.c:1182:fuse_err_cbk] glusterfs-fuse: 50: FLUSH() ERR => -1 (File descriptor in bad state)
    [2010-05-11 09:48:58] T [fuse-bridge.c:2064:fuse_release] glusterfs-fuse: 51: RELEASE 0x875038

This can be a placeholder for bugs of this type: open - subvolume down/up - FOP
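
The open / subvolume swap / write sequence from this report can be illustrated with a toy model. This is only a sketch and not GlusterFS code: `struct toy_fd`, `toy_open`, `toy_set_up`, `toy_write` and `toy_swap_scenario` are hypothetical names invented here. The model assumes that a write can only go to a subvolume that is both up and holds an open fd, and that a disconnect drops the open state. The `reopen_first` flag loosely mimics the idea named in the title of patch 5591 (attempt re-open of files before continuing); it is not the actual implementation of that patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define CHILD_COUNT 2   /* two replicas, as in the report (assumption) */

/* Hypothetical stand-in for a replicated fd: per-subvolume flags. */
struct toy_fd {
        bool up[CHILD_COUNT];      /* subvolume currently reachable? */
        bool opened[CHILD_COUNT];  /* fd opened on this subvolume?   */
};

/* open() only reaches the subvolumes that are up at that moment. */
static void
toy_open (struct toy_fd *fd, const bool up[CHILD_COUNT])
{
        for (int i = 0; i < CHILD_COUNT; i++) {
                fd->up[i]     = up[i];
                fd->opened[i] = up[i];
        }
}

/* Change a subvolume's state; going down drops its open state
 * (an assumption of this model). */
static void
toy_set_up (struct toy_fd *fd, int child, bool up)
{
        assert (child >= 0 && child < CHILD_COUNT);
        fd->up[child] = up;
        if (!up)
                fd->opened[child] = false;
}

/* Returns the number of subvolumes written to, or -EAGAIN if none
 * could take the write. With reopen_first set, subvolumes that are
 * up but lack an open fd are re-opened before writing. */
static int
toy_write (struct toy_fd *fd, bool reopen_first)
{
        int written = 0;

        for (int i = 0; i < CHILD_COUNT; i++) {
                if (!fd->up[i])
                        continue;
                if (!fd->opened[i] && reopen_first)
                        fd->opened[i] = true;
                if (fd->opened[i])
                        written++;
        }
        return (written > 0) ? written : -EAGAIN;
}

/* The sequence from this report: open with child 1 down, then
 * child 1 comes up and child 0 goes down, then write. */
static int
toy_swap_scenario (bool reopen_first)
{
        struct toy_fd fd;
        const bool    at_open[CHILD_COUNT] = { true, false };

        toy_open (&fd, at_open);
        toy_set_up (&fd, 1, true);
        toy_set_up (&fd, 0, false);
        return toy_write (&fd, reopen_first);
}
```

In this model `toy_swap_scenario (false)` returns `-EAGAIN`, which corresponds to the `Resource temporarily unavailable` error in the log, because the only subvolume that is up never saw the open. `toy_swap_scenario (true)` returns 1, since the surviving subvolume is re-opened before the write.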