Bug 1234891

Summary: gf_store_save_value() fflush() error-checking bug, leading to corruption of glusterd.info when filesystem is full
Product: [Community] GlusterFS Reporter: Tero Marttila <tero.marttila>
Component: glusterdAssignee: bugs <bugs>
Status: CLOSED DUPLICATE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: mainlineCC: amukherj, bugs, kaushal, ndevos, smohan, vbellur
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-11-19 08:49:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Tero Marttila 2015-06-23 13:02:03 UTC
Description of problem:

When the host filesystem containing the glusterd workdir files (e.g. /var/lib/glusterd/) is full, this can lead to situations where files such as glusterd.info or peers/* etc are replaced with empty zero-length files. See Bug 1207534 and [1] for another example of this. I experienced this with glusterfs-server 3.4.2-1ubuntu1 running on Ubuntu 14.04.

Digging through the related code, I suspect this may be caused by the use of feof() in following error libglusterfs/src/store.c gf_store_save_value() code path which is intended to handle exactly these kinds of write errors:

        ret = fprintf (fp, "%s=%s\n", key, value);
        if (ret < 0) {
                ...
        }

        ret = fflush (fp);
        if (feof (fp)) {
                ...
        }

        ret = 0;

This code appears to be present in both glusterfs 3.4.2 and mainline git.

Based on my understanding, an error condition such as ENOSPC in fflush() will NOT cause feof() to return nonzero.

Steps to Reproduce:

Using the following example test program:

    int main (int argc, char **argv)
    {
        int ret;

        const char *path = argv[1];

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

        FILE *fp = fdopen(fd, "a+");

        ret = fprintf(fp, "foo=bar\n");

        fprintf(stderr, "fprintf = %d: %s\n", ret, strerror(errno));

        ret = fflush(fp);

        fprintf(stderr, "fflush = %d: %s\n", ret, strerror(errno));

        ret = feof(fp);

        fprintf(stderr, "feof = %d\n", ret);

        fclose(fp);
        close(fd);

        return ret;
    }

Running this on a "full" ext4 filesystem:

	$ touch mnt/test/test2 && ls -l mnt/test/test2
	-rw-r--r-- 1 xxx staff 0 Jun 23 15:52 mnt/test/test2
	$ echo foo > mnt/test/test2
	-bash: echo: write error: No space left on device
	$ rm mnt/test/test2 

Gives the following:

	$ ./test mnt/test/test2
	fprintf = 8: Success
	fflush = -1: No space left on device
	feof = 0

Actual results:

I believe that this behavior of fflush()/feof() would cause gf_store_save_value() to ignore the write error, causing the store file to be replaced by an empty file as per the bug symptoms.

Expected results:

gf_store_save_value() should handle the -1 return code from fflush() by returning an error, causing the tmpfile to be unlinked and the existing file to be left intact.

[1] https://www.mail-archive.com/users@ovirt.org/msg25215.html

Comment 1 Tero Marttila 2015-06-23 13:04:14 UTC
Additional clarification that the test case indeed creates an empty file on disk:

$ cat mnt/test/test2 
$ ls -l mnt/test/test2 
-rw-r--r-- 1 xxx staff 0 Jun 23 15:53 mnt/test/test2

Comment 2 Niels de Vos 2015-06-30 12:22:08 UTC
I'm moving this to the 'mainline' version for now. 3.4.x is EOL and will not get any updates anymore.

Comment 3 Tero Marttila 2016-01-11 10:12:06 UTC
This is presumeably a dup of #1253148, based on the committed fix. Haven't tested it.

Comment 4 Vijay Bellur 2018-11-19 08:49:46 UTC

*** This bug has been marked as a duplicate of bug 1253148 ***