Bug 1058663 - Sporadic SIGBUS with mmap() on a sparse file created with open(), seek(), write()
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: write-behind
Version: mainline
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
Assignee: Niels de Vos
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1058405 1071191
 
Reported: 2014-01-28 10:11 UTC by Niels de Vos
Modified: 2014-07-11 19:17 UTC

Fixed In Version: glusterfs-3.5.1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1058405
: 1071191 (view as bug list)
Environment:
Last Closed: 2014-07-11 19:17:40 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Test program to test and verify write-behind with open(), seek(), write() and mmap() (2.17 KB, text/plain)
2014-01-28 10:11 UTC, Niels de Vos

Description Niels de Vos 2014-01-28 10:11:28 UTC
Created attachment 856464
Test program to test and verify write-behind with open(), seek(), write() and mmap()

+++ This bug was initially created as a clone of Bug #1058405 +++

Description of problem:
A program that calls mmap() on a newly created sparse file may receive a
SIGBUS signal. If SIGBUS is not handled, the program is terminated with a
bus error.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44rhs-1.el6rhs.x86_64

How reproducible:
Easily, as long as the test is run in a loop

Steps to Reproduce:
1. compile the attached program bug-1058405.c:
   gcc -o bug-1058405 bug-1058405.c
2. mount a volume (single brick is sufficient)
3. run: ./bug-1058405 /path/to/mount/some-file

Actual results:
The bug-1058405 program exits with 0 if the error did not occur, and with 1
if it did.

Additional info:
The write-behind translator is used on the client-side (glusterfs-fuse) to
optimize writing to the bricks. Multiple small, subsequent (re)writes are
combined into bigger writes, which are more efficiently sent over the network.

When a sparse file is created with open(), seek(), write(), a bug in the
write-behind translator can leave the final write() cached. That write() may
not be sent to the server until write-behind deems this necessary.

SIGBUS is a signal that can occur with mmap() when the mmap'd area of a file
is located after the end of the file. For example, the following will trigger
a SIGBUS:

    Legend:
    [ = start of file
    _ = unallocated space
    # = allocated bytes in the file
    ] = end of file

    [################]________
     |              | |      |
     '- byte 0      | |      '- byte 39
                    | '- byte 32
                    '- byte 31

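* open() the file, it is 32 bytes big (bytes 0-31)
* mmap() the file, but use a size of 40 bytes (bytes 0-39)
* read from the memory area returned by mmap()
* reading up to byte 31 is expected to work flawlessly
* reading after byte 31 should trigger a SIGBUS (in practice the kernel works
  in whole pages, so the signal is raised only for pages that lie entirely
  past the end of the file)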
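
The steps above can be tried on a local filesystem with a small sketch like
the one below. It is not the attached test program. The scratch path and the
helper names (probe_byte(), demo_sigbus_past_eof()) are made up for this
example, and it assumes a Linux/POSIX system. Because the kernel maps whole
pages, the sketch maps two pages and probes the first byte of the second page
instead of byte 32.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_jmp;

/* SIGBUS handler: jump back to the sigsetjmp() call in probe_byte(). */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(sigbus_jmp, 1);
}

/* Read one byte at addr. Returns 0 if the read worked, 1 on SIGBUS. */
static int probe_byte(const volatile char *addr)
{
    struct sigaction sa, old;
    int got_sigbus;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(sigbus_jmp, 1) == 0) {
        char c = *addr;          /* volatile read, may raise SIGBUS */
        (void)c;
        got_sigbus = 0;
    } else {
        got_sigbus = 1;          /* arrived here through on_sigbus() */
    }

    sigaction(SIGBUS, &old, NULL);
    return got_sigbus;
}

/*
 * Create a 32-byte file, map two pages of it, then probe byte 31 (inside
 * the file) and the first byte of the second page (entirely past EOF).
 * Returns 0 if only the second probe raised SIGBUS, 1 if the behaviour
 * differs, -1 if the setup failed.
 */
static int demo_sigbus_past_eof(void)
{
    char path[] = "/tmp/sigbus-demo-XXXXXX";   /* hypothetical scratch file */
    char data[32];
    long page = sysconf(_SC_PAGESIZE);
    int fd, in_file, past_eof;
    size_t len;
    char *map;

    if (page <= 0)
        return -1;
    len = 2 * (size_t)page;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                              /* file vanishes on close() */

    memset(data, '#', sizeof(data));
    if (write(fd, data, sizeof(data)) != (ssize_t)sizeof(data)) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    in_file = probe_byte(map + 31);
    past_eof = probe_byte(map + page);

    munmap(map, len);
    close(fd);
    return (in_file == 0 && past_eof == 1) ? 0 : 1;
}
```

Calling demo_sigbus_past_eof() from a main() and using its return value as
the exit status gives a quick check of how the local kernel behaves.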

In the case of creating a file with open(), seek(), write(), the file looks
like this:

    [_______________#]

Creating a sparse file this way is fairly common. However, the
write-behind translator can cache the last write. Normally all outstanding
writes are flushed when a read is done on an area cached by the translator.
Unfortunately, the write-behind translator did not contain logic to track
writes that extend a file after a seek() past the end-of-file. Normal
writes that extend the file correctly mark the written range as
outstanding, and reading causes the outstanding data to be flushed.

In the case of open(), seek(), write(), the range that was skipped in the
seek() would not have been marked as outstanding. Reading from this range does
not trigger the outstanding writes to be flushed. The brick that receives the
read() (translated over the network from mmap()) does not know that the file
has been extended, and returns -EINVAL. This error gets transported back from
the brick to the glusterfs-fuse client, and translated by the Linux kernel/VFS
into SIGBUS triggered by mmap().
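
The sequence described above can be sketched as follows. This is a minimal
illustration and not the attached bug-1058405.c, which is not reproduced in
this report. The function name sparse_mmap_check() and the LAST_BYTE constant
are made up for the example, and lseek() is used for the seek() step. On an
affected glusterfs-fuse mount the function would be expected to return 1
sporadically. On a local filesystem it should return 0.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LAST_BYTE 31    /* offset of the single written byte */

static sigjmp_buf repro_jmp;

/* SIGBUS handler: jump back into sparse_mmap_check(). */
static void repro_on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(repro_jmp, 1);
}

/*
 * open(), lseek(), write() a sparse file, mmap() it and read byte 0.
 * Returns 0 if the read worked, 1 if it raised SIGBUS, -1 if a system
 * call failed.
 */
static int sparse_mmap_check(const char *path)
{
    struct sigaction sa, old;
    volatile int result = -1;
    char *map;
    int fd;

    fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0)
        return -1;

    /* Seek past EOF and write one byte: bytes 0-30 become a hole. */
    if (lseek(fd, LAST_BYTE, SEEK_SET) != LAST_BYTE ||
        write(fd, "#", 1) != 1) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, LAST_BYTE + 1, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = repro_on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(repro_jmp, 1) == 0) {
        /* Read from the hole, the range skipped by lseek(). */
        char c = *(volatile char *)map;
        (void)c;
        result = 0;
    } else {
        result = 1;     /* the read raised SIGBUS */
    }

    sigaction(SIGBUS, &old, NULL);
    munmap(map, LAST_BYTE + 1);
    close(fd);
    return result;
}
```

Since the failure is sporadic, calling sparse_mmap_check() in a loop with a
path on the mounted volume is more likely to show it than a single call.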

Workaround:
The write-behind translator has special handling for the truncate()
system call. Using open(), seek(), write() is an alternative to calling
truncate(). truncate() is more elegant in any case and will not trigger a
SIGBUS. It is recommended to create sparse files with truncate().
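
A minimal sketch of the workaround, assuming the file is created first and
then extended by path with truncate(). The helper name
create_sparse_with_truncate() is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or empty) the file at path and extend it to size bytes with
 * truncate(), so the whole file is one hole.
 * Returns an open read/write file descriptor, or -1 on failure.
 */
static int create_sparse_with_truncate(const char *path, off_t size)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);

    if (fd < 0)
        return -1;

    if (truncate(path, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned descriptor can then be passed to mmap() with a length of up to
size bytes.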

Comment 1 Anand Avati 2014-01-28 11:15:52 UTC
REVIEW: http://review.gluster.org/6835 (write-behind: track filesize when doing extending writes) posted (#1) for review on master by Niels de Vos (ndevos)

Comment 2 Anand Avati 2014-02-04 10:42:19 UTC
REVIEW: http://review.gluster.org/6835 (write-behind: track filesize when doing extending writes) posted (#2) for review on master by Niels de Vos (ndevos)

Comment 3 Anand Avati 2014-02-14 14:27:49 UTC
REVIEW: http://review.gluster.org/6835 (write-behind: track filesize when doing extending writes) posted (#3) for review on master by Niels de Vos (ndevos)

Comment 4 Anand Avati 2014-02-17 16:08:23 UTC
REVIEW: http://review.gluster.org/6835 (write-behind: track filesize when doing extending writes) posted (#4) for review on master by Niels de Vos (ndevos)

Comment 5 Anand Avati 2014-02-28 05:56:55 UTC
COMMIT: http://review.gluster.org/6835 committed in master by Anand Avati (avati) 
------
commit b0515e2a4a08b657ef7e9715fb8c6222c700e78c
Author: Niels de Vos <ndevos>
Date:   Tue Jan 28 10:06:13 2014 +0100

    write-behind: track filesize when doing extending writes
    
    A program that calls mmap() on a newly created sparse file, may receive
    a SIGBUS signal. If SIGBUS is not handled, a segmentation fault will
    occur and the program will exit.
    
    A bug in the write-behind translator can cause the creation of a sparse
    file created with open(), seek(), write() to be cached. The last write()
    may not be sent to the server, until write-behind deems this necessary.
    
    * open(.., O_TRUNC, ...)/creat() the file, it is 0 bytes big
    * seek() into the file, use offset 31
    * write() 1 byte to the file
    * the range from byte 0-30 are unwritten so called 'sparse'
    
    The following illustration tries to capture this:
    
        Legend:
        [ = start of file
        _ = unallocated/unwritten bytes
        # = allocated bytes in the file
        ] = end of file
    
        [_______________#]
         |              |
         '- byte 0      '- byte 31
    
    Without this change, reading from byte 0-30 will return an error, and
    reading the same area through an mmap()'d pointer will trigger a SIGBUS.
    Reading from this range did not trigger the outstanding write() to be
    flushed. The brick that receives the read() (translated over the network
    from mmap()) does not know that the file has been extended, and returns
    -EINVAL. This error gets transported back from the brick to the
    glusterfs-fuse client, and translated by the Linux kernel/VFS into
    SIGBUS triggered by mmap().
    
    In order to solve this, a new attribute to the wb_inode structure is
    introduced; the current size of the file. All FOPs that can modify the
    size, are expected to update wb_inode->size. This makes it possible for
    extending writes with an offset bigger than EOF to mark the unwritten
    area as modified/pending.
    
    Change-Id: If5ba6646732e6be26568541ea9b12852a5d0b988
    BUG: 1058663
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: http://review.gluster.org/6835
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: Anand Avati <avati>
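
The idea in the commit message above can be modelled roughly as below. This
is a toy model and not the GlusterFS code: only the notion of a size
attribute on the inode comes from the commit message. The struct, the
function names and the single pending range are simplifications made up for
this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the per-inode write-behind state. */
struct toy_wb_inode {
    size_t size;         /* current file size as known to the cache */
    bool   has_pending;  /* true once a write has been cached */
    size_t pend_start;   /* first pending byte (inclusive) */
    size_t pend_end;     /* end of the pending range (exclusive) */
};

/*
 * Cache a write of len bytes at offset off. With track_size set, a write
 * that starts past EOF also marks the skipped hole as pending.
 */
static void toy_write(struct toy_wb_inode *ino, size_t off, size_t len,
                      bool track_size)
{
    size_t start = off;
    size_t end = off + len;

    if (track_size && off > ino->size)
        start = ino->size;               /* include the hole before off */

    if (!ino->has_pending) {
        ino->pend_start = start;
        ino->pend_end = end;
        ino->has_pending = true;
    } else {
        /* Merge into one range (simplification for this sketch). */
        if (start < ino->pend_start)
            ino->pend_start = start;
        if (end > ino->pend_end)
            ino->pend_end = end;
    }

    if (track_size && end > ino->size)
        ino->size = end;                 /* the write extended the file */
}

/* A read must flush first if it overlaps the pending range. */
static bool toy_read_needs_flush(const struct toy_wb_inode *ino,
                                 size_t off, size_t len)
{
    return ino->has_pending &&
           off < ino->pend_end && off + len > ino->pend_start;
}
```

In this model, a 1-byte write at offset 31 into an empty file followed by a
read of bytes 0-30 does not require a flush when track_size is false, and
does require one when track_size is true.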

