Bug 1234891 - gf_store_save_value() fflush() error-checking bug, leading to corruption of glusterd.info when filesystem is full
Summary: gf_store_save_value() fflush() error-checking bug, leading to corruption of g...
Keywords:
Status: CLOSED DUPLICATE of bug 1253148
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
unspecified
unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-06-23 13:02 UTC by Tero Marttila
Modified: 2018-11-19 08:49 UTC (History)
6 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-11-19 08:49:46 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Tero Marttila 2015-06-23 13:02:03 UTC
Description of problem:

When the host filesystem containing the glusterd workdir files (e.g. /var/lib/glusterd/) is full, this can lead to situations where files such as glusterd.info or peers/* etc are replaced with empty zero-length files. See Bug 1207534 and [1] for another example of this. I experienced this with glusterfs-server 3.4.2-1ubuntu1 running on Ubuntu 14.04.

Digging through the related code, I suspect this may be caused by the use of feof() in following error libglusterfs/src/store.c gf_store_save_value() code path which is intended to handle exactly these kinds of write errors:

        ret = fprintf (fp, "%s=%s\n", key, value);
        if (ret < 0) {
                ...
        }

        ret = fflush (fp);
        if (feof (fp)) {
                ...
        }

        ret = 0;

This code appears to be present in both glusterfs 3.4.2 and mainline git.

Based on my understanding, an error condition such as ENOSPC in fflush() will NOT cause feof() to return nonzero.

Steps to Reproduce:

Using the following example test program:

    int main (int argc, char **argv)
    {
        int ret;

        const char *path = argv[1];

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

        FILE *fp = fdopen(fd, "a+");

        ret = fprintf(fp, "foo=bar\n");

        fprintf(stderr, "fprintf = %d: %s\n", ret, strerror(errno));

        ret = fflush(fp);

        fprintf(stderr, "fflush = %d: %s\n", ret, strerror(errno));

        ret = feof(fp);

        fprintf(stderr, "feof = %d\n", ret);

        fclose(fp);
        close(fd);

        return ret;
    }

Running this on a "full" ext4 filesystem:

	$ touch mnt/test/test2 && ls -l mnt/test/test2
	-rw-r--r-- 1 xxx staff 0 Jun 23 15:52 mnt/test/test2
	$ echo foo > mnt/test/test2
	-bash: echo: write error: No space left on device
	$ rm mnt/test/test2 

Gives the following:

	$ ./test mnt/test/test2
	fprintf = 8: Success
	fflush = -1: No space left on device
	feof = 0

Actual results:

I believe that this behavior of fflush()/feof() would cause gf_store_save_value() to ignore the write error, causing the store file to be replaced by an empty file as per the bug symptoms.

Expected results:

gf_store_save_value() should handle the -1 return code from fflush() by returning an error, causing the tmpfile to be unlinked and the existing file to be left intact.

[1] https://www.mail-archive.com/users@ovirt.org/msg25215.html

Comment 1 Tero Marttila 2015-06-23 13:04:14 UTC
Additional clarification that the test case indeed creates an empty file on disk:

$ cat mnt/test/test2 
$ ls -l mnt/test/test2 
-rw-r--r-- 1 xxx staff 0 Jun 23 15:53 mnt/test/test2

Comment 2 Niels de Vos 2015-06-30 12:22:08 UTC
I'm moving this to the 'mainline' version for now. 3.4.x is EOL and will not get any updates anymore.

Comment 3 Tero Marttila 2016-01-11 10:12:06 UTC
This is presumeably a dup of #1253148, based on the committed fix. Haven't tested it.

Comment 4 Vijay Bellur 2018-11-19 08:49:46 UTC

*** This bug has been marked as a duplicate of bug 1253148 ***


Note You need to log in before you can comment on or make changes to this bug.