Bug 761974 (GLUSTER-242) - If any of the writes fail, write-behind should not wait till the fd is closed for reporting errors
Summary: If any of the writes fail, write-behind should not wait till the fd is closed...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-242
Product: GlusterFS
Classification: Community
Component: write-behind
Version: 2.0.6
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-09-02 09:57 UTC by Raghavendra G
Modified: 2009-12-02 02:28 UTC
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
client volume file (1.08 KB, application/octet-stream)
2009-09-02 17:18 UTC, Brian Hirt
server volume file (375 bytes, application/octet-stream)
2009-09-02 17:19 UTC, Brian Hirt
Gluster client log (15.08 KB, application/octet-stream)
2009-09-03 11:18 UTC, Brian Hirt
output from a run (450 bytes, application/octet-stream)
2009-09-03 11:20 UTC, Brian Hirt

Description Raghavendra G 2009-09-02 09:57:42 UTC
Errors during writes to the server, if any, are currently communicated to the application only when the fd is closed. Instead, write-behind should return the error on the next write issued by the application.
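
For illustration, the expected application-visible behavior would be roughly the following (a minimal sketch; the mount path is hypothetical, and EIO stands in for whatever error the failed background write actually produced):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gluster/testfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* write #1 is acknowledged immediately by write-behind and flushed
     * to the server in the background; suppose that flush fails */
    write(fd, "a\n", 2);

    /* write #2 should then fail right away with the stored error
     * (e.g. errno == EIO), instead of the error being deferred to close() */
    if (write(fd, "b\n", 2) < 0)
        fprintf(stderr, "write failed as expected: errno = %d\n", errno);

    if (close(fd) < 0)
        perror("close");    /* where the error is reported today */

    return 0;
}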

Comment 1 Vijay Bellur 2009-09-02 16:42:17 UTC
Comments from Brian Hirt on the gluster-devel mailing list:

I've also discovered that this problem is not isolated to the write-behind module. While some errors are being sent back to the writer, there is still data corruption in the file created on Gluster; Gluster still reports success to the writer when writes have failed. I have a simple program that writes 1, 2, 3, 4 ... N to a file at a rate of 100 lines per second. Whenever the writer gets an error back from write(), it waits a second, reopens the file, and continues writing. While the writer is running, I restart the gluster nodes one by one. Once this is done, I stop the writer and check the file for corruption.

One interesting observation I have made is that when restarting the gluster servers, sometimes errno EBADFD is returned and sometimes ENOTCONN. When errno is ENOTCONN (107 on Ubuntu 9.04) the file is not corrupted; when errno is EBADFD (77 on Ubuntu 9.04) there is file corruption. These statements are based on a limited number of test runs, but they have always held true for me.

Some sample output of some tests:

bhirt@ubuntu:~/gluster-tests$ rm -f /unify/m/test1.2009-09-02 && ./write-numbers /unify/m/test1.2009-09-02
problems writing to fd, reopening logfile (errno = 77) in one second
^C
bhirt@ubuntu:~/gluster-tests$ ./check-numbers /unify/m/test1.2009-09-02
169 <> 480
bhirt@ubuntu:~/gluster-tests$ rm -f /unify/m/test1.2009-09-02 && ./write-numbers /unify/m/test1.2009-09-02
problems writing to fd, reopening logfile (errno = 107) in one second
^C

bhirt@ubuntu:~/gluster-tests$ ./check-numbers /unify/m/test1.2009-09-02
OK

The programs I use to test this are:

bhirt@ubuntu:~/gluster-tests$ cat write-numbers.c check-numbers
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define BUFSIZE        65536

/* write 100 entries per second */
#define WRITE_DELAY    1000000 / 100

int open_testfile(char *testfile)
{
    int fd = open(testfile, O_WRONLY | O_CREAT | O_APPEND, 0666);

    if (fd < 0) {
        perror("open");
        exit(2);
    }

    return fd;
}

void usage(char *s)
{
    fprintf(stderr, "\nusage: %s testfile\n\n", s);
}

int main(int argc, char **argv)
{
    char buf[BUFSIZE];
    int  logfd;
    int  nread;
    int  counter = 0;

    if (argc != 2) {
        usage(argv[0]);
        exit(1);
    }

    logfd = open_testfile(argv[1]);

    /* loop endlessly */
    for (;;) {
        snprintf(buf, sizeof(buf), "%d\n", counter);
        nread = strnlen(buf, sizeof(buf));

        /* write data */
        int nwrite = write(logfd, buf, nread);

        if (nwrite == nread) {
            counter++;
            usleep(WRITE_DELAY);
        } else {
            /* restarted gluster nodes give this error in 2.0.6 */
            if (errno == EBADFD || errno == ENOTCONN) {
                /* wait a second before re-opening the file */
                fprintf(stderr, "problems writing to fd, reopening logfile (errno = %d) in one second\n", errno);
                sleep(1);

                /* close the stale fd and reopen the log file; counter is
                 * not incremented, so the failed entry is written again */
                close(logfd);
                logfd = open_testfile(argv[1]);
            } else {
                perror("write");
                exit(2);
            }
        }
    }
}

#!/usr/bin/perl

use strict;
use warnings;

my $i=0;

while (<>) { die "$i <> $_" if $i++ != $_; }
print STDERR "OK\n";

Here is the client log file from one of the tests I ran:

[2009-09-02 09:59:23] E [saved-frames.c:165:saved_frames_unwind] remote1: forced unwinding frame type(1) op(FINODELK)
[2009-09-02 09:59:23] N [client-protocol.c:6246:notify] remote1: disconnected
[2009-09-02 09:59:23] E [socket.c:745:socket_connect_finish] remote1: connection to 10.0.1.31:6996 failed (Connection refused)
[2009-09-02 09:59:26] N [client-protocol.c:5559:client_setvolume_cbk] remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:30] E [saved-frames.c:165:saved_frames_unwind] remote2: forced unwinding frame type(1) op(WRITE)
[2009-09-02 09:59:30] W [fuse-bridge.c:1534:fuse_writev_cbk] glusterfs-fuse: 153358: WRITE => -1 (Transport endpoint is not connected)
[2009-09-02 09:59:30] N [client-protocol.c:6246:notify] remote2: disconnected
[2009-09-02 09:59:30] E [socket.c:745:socket_connect_finish] remote2: connection to 10.0.1.32:6996 failed (Connection refused)
[2009-09-02 09:59:33] N [client-protocol.c:5559:client_setvolume_cbk] remote2: Connected to 10.0.1.32:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:34] N [client-protocol.c:5559:client_setvolume_cbk] remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:37] E [saved-frames.c:165:saved_frames_unwind] remote1: forced unwinding frame type(1) op(FINODELK)
[2009-09-02 09:59:37] W [fuse-bridge.c:1534:fuse_writev_cbk] glusterfs-fuse: 153923: WRITE => -1 (File descriptor in bad state)
[2009-09-02 09:59:37] N [client-protocol.c:6246:notify] remote1: disconnected
[2009-09-02 09:59:40] N [client-protocol.c:5559:client_setvolume_cbk] remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:41] N [client-protocol.c:5559:client_setvolume_cbk] remote2: Connected to 10.0.1.32:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:44] N [client-protocol.c:5559:client_setvolume_cbk] remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:51] W [fuse-bridge.c:882:fuse_err_cbk] glusterfs-fuse: 155106: FLUSH() ERR => -1 (File descriptor in bad state)
[2009-09-02 09:59:51] W [fuse-bridge.c:882:fuse_err_cbk] glusterfs-fuse: 155108: FLUSH() ERR => -1 (File descriptor in bad state)

Comment 2 Brian Hirt 2009-09-02 17:18:48 UTC
Created attachment 61 [details]
client volume file

Client volume file that I used for the tests in the quoted email, using version 2.0.6.

Comment 3 Brian Hirt 2009-09-02 17:19:43 UTC
Created attachment 62 [details]
server volume file

The server volume file used on gfsnode1, gfsnode2, gfsnode3, and gfsnode4.

Comment 4 Raghavendra G 2009-09-03 01:23:50 UTC
Brian,

Can you attach the entire client log file? If possible, can you run glusterfs with the "DEBUG" log level?

regards,
Raghavendra

Comment 5 Brian Hirt 2009-09-03 11:18:10 UTC
Created attachment 63 [details]
Gluster client log

Client log file while write-numbers.c is running. The first two errors are ENOTCONN, the third error is EBADFD.

Comment 6 Brian Hirt 2009-09-03 11:20:35 UTC
Created attachment 64 [details]
output from a run

Output from the run that generated the client log file.

Comment 7 Raghavendra G 2009-09-04 01:26:52 UTC
With write-behind removed from the configuration, success is not returned for writes that failed. The situation above is caused by replicate in the setup; if replicate is removed from the test setup, the issue is not observed. As an example, consider replicate over 2 nodes and the following sequence of operations.

1. Start writing on the gluster mount; data is replicated to both nodes.
2. Stop node1. The application still receives success on writes, but data is written only to node2.
3. Restart node1. The application still receives success on writes, but data is still not written to node1, since the file is no longer open on node1. Also note that self-heal will not happen from node2 to node1, since replicate does not yet support self-heal on open fds.
4. Stop node2. The application receives either ENOTCONN or EBADFD, depending on the child from which replicate received the last reply for the write: the subvolume corresponding to node1 returns EBADFD and that of node2 returns ENOTCONN.
5. Sensing that writes are failing, the application reopens the file. Note that node2 is still down; if it were up, the data on node1 would be synced from node2.
6. Now writes happen only on node1. Note that the file on node1 is missing some writes that happened only on node2, and node2 will now miss the future writes, leading to a split-brain situation.
7. Bring node2 back up. If an open were to happen now, with both nodes up, replicate identifies this as a split-brain situation, and manual intervention is needed to identify the "more correct" version of the file and remove the other; replicate then copies the more correct version of the file to the other node.

Hence the issue here is not writes failing: writes are happening, but there is a split-brain situation because of the way the servers were restarted.
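
As a generic POSIX-level precaution (not part of any fix here, and independent of replicate), an application that cannot tolerate silently lost writes can follow each write with fsync() and check both return values; a minimal sketch, with a hypothetical checked_write() helper:

#include <unistd.h>

/* write len bytes and force them to the server before trusting them;
 * returns 0 on success, -1 on any short write or flush failure */
int checked_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n < 0 || (size_t)n != len)
        return -1;
    if (fsync(fd) != 0)     /* surfaces errors a buffered write may hide */
        return -1;
    return 0;
}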

Comment 8 Anand Avati 2009-09-08 07:23:10 UTC
PATCH: http://patches.gluster.com/patch/1212 in master (performance/write-behind: communicate back the errors in writes to servers, in the next immediate writes from application.)

Comment 9 Anand Avati 2009-09-08 07:23:46 UTC
PATCH: http://patches.gluster.com/patch/1211 in release-2.0 (performance/write-behind: communicate back the errors in writes to servers in the next immediate writes from application.)

Comment 10 Anand Avati 2009-12-01 10:45:58 UTC
PATCH: http://patches.gluster.com/patch/2457 in master (stat-prefetch: fix unsafe reference to @local in sp_lookup_cbk)

Comment 11 Anand Avati 2009-12-01 23:28:10 UTC
PATCH: http://patches.gluster.com/patch/2472 in release-2.0 (stat-prefetch: fix unsafe reference to @local in sp_lookup_cbk.)

