Bug 1140250

Summary: Unexpected results from using posix_fallocate with nfs target
Product: Red Hat Enterprise Linux 7
Component: glibc
Version: 7.1
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Keywords: Patch
Reporter: John Ferlan <jferlan>
Assignee: Martin Sebor <msebor>
QA Contact: Michael Petlan <mpetlan>
CC: ashankar, codonell, eblake, fweimer, mcermak, mnewsome, mpetlan, msebor, pfrankli
Doc Type: Bug Fix
Doc Text:
Cause: The posix_fallocate emulation in glibc contained an optimization which performed backing store allocation in increments of the file system block size. Some network file systems report a misleading block size.
Consequence: posix_fallocate unexpectedly creates a sparse file, instead of allocating contiguous backing store.
Fix: The posix_fallocate emulation in glibc now uses a reduced increment for backing store allocation (at most 4096 bytes).
Result: Applications such as libvirt no longer create sparse files when calling the glibc posix_fallocate function.
Last Closed: 2016-11-03 08:21:57 UTC
Type: Bug
Bug Blocks: 1077068, 1297579, 1313485

Description John Ferlan 2014-09-10 15:03:50 UTC
Initially filed as a libvirt bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1077068

After some patches were posted and further discussion, I was asked to file a bug against glibc upstream. All the details of the issue can be found there:

https://sourceware.org/bugzilla/show_bug.cgi?id=17322

Comment 3 John Ferlan 2014-12-09 15:01:13 UTC
Since it had been a while since I last looked at this 'issue' and the libvirt bug is still on my 7.1 list, I figured I'd revisit it to see whether I had missed or misunderstood something.

Looking at the results again, I can see the posix_fallocate code apparently writing ten 1 MiB buffers to the file, which seems to be what's desired; however, how tools interpret that data afterwards seems to be the issue here. For example, 'du -h' shows a different size than 'ls -al' for the resulting file:

# du -h /home/nfs_pool/target/test_vol2
44K	/home/nfs_pool/target/test_vol2
# ls -al /home/nfs_pool/target/test_vol2
-rw-------. 1 root root 10485760 Dec  9 07:23 test_vol2
#

The following is the output of a program that essentially calls stat and statfs on the target file and prints the various fields.

For the posix_fallocate'd file, the following is displayed:

# /home/jferlan/exmplCode/stat /home/nfs_pool/target/test_vol2
Open '/home/nfs_pool/target/test_vol2' for stat
stat st_blocks=88 st_blksize=1048576 st_size=10485760
statfs f_bsize=1048576 f_blocks=137675 f_bfree=95786 f_bavail=88770
lseek end=10485760
#
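
For readers who want to reproduce the output above, here is a minimal sketch of such a helper in C. It is not the reporter's program, which is not included in this bug; the function name print_alloc_info is made up for illustration, and the lseek line is omitted. It only uses stat(2) and statfs(2) as found on Linux.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/vfs.h>

/* Hypothetical helper (not the reporter's code): print the stat and
   statfs fields quoted in this comment for PATH.
   Returns 0 on success, -1 if either call fails. */
int print_alloc_info(const char *path)
{
    struct stat st;
    struct statfs fs;

    if (stat(path, &st) != 0 || statfs(path, &fs) != 0) {
        perror(path);
        return -1;
    }

    /* st_blocks counts 512-byte units; st_blksize is only an I/O hint. */
    printf("stat st_blocks=%lld st_blksize=%ld st_size=%lld\n",
           (long long) st.st_blocks, (long) st.st_blksize,
           (long long) st.st_size);
    printf("statfs f_bsize=%ld f_blocks=%llu f_bfree=%llu f_bavail=%llu\n",
           (long) fs.f_bsize,
           (unsigned long long) fs.f_blocks,
           (unsigned long long) fs.f_bfree,
           (unsigned long long) fs.f_bavail);
    return 0;
}
```

Calling print_alloc_info("/home/nfs_pool/target/test_vol2") would print two lines in the same shape as the first two data lines above.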

If I change the creation of the file to a different method that writes 1 MiB buffers to the file in a loop using write instead of calling posix_fallocate, I get the following afterwards:

# /home/jferlan/exmplCode/stat /home/nfs_pool/target/test_vol2
Open '/home/nfs_pool/target/test_vol2' for stat
stat st_blocks=20480 st_blksize=1048576 st_size=10485760
statfs f_bsize=1048576 f_blocks=137675 f_bfree=95798 f_bavail=88782
lseek end=10485760
#
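
The write-based creation method is not shown in this bug, so the following is only a sketch of what such a loop might look like. The name fill_with_zeros and its parameters are assumptions made for illustration; the loop simply writes zero-filled buffers and handles short writes.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical sketch: write COUNT zero-filled buffers of BUFSIZE bytes
   each to FD, so that every byte of the file is actually written.
   Returns 0 on success, -1 on allocation or write failure. */
int fill_with_zeros(int fd, size_t bufsize, int count)
{
    char *buf = calloc(1, bufsize);
    if (buf == NULL)
        return -1;

    for (int i = 0; i < count; i++) {
        size_t done = 0;
        while (done < bufsize) {
            /* write() may return a short count; continue from there. */
            ssize_t n = write(fd, buf + done, bufsize - done);
            if (n <= 0) {
                free(buf);
                return -1;
            }
            done += (size_t) n;
        }
    }

    free(buf);
    return 0;
}
```

With bufsize set to 1048576 and count set to 10, this would produce a 10485760-byte file like the one measured above.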


The "st_blocks" field is what du and libvirt use to determine the allocated size of the file.  From the man page:

       The st_blocks field indicates the number of blocks allocated to the
       file, 512-byte units.  (This may be smaller than st_size/512 when the
       file has holes.)

That naturally explains "the math" of "st_blocks * 512" used to determine the size of the file (e.g. 88 * 512 / 1024 = 44 KiB, which matches the 44K reported by du).
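
The arithmetic can be spelled out as a small C sketch. The function names are made up for illustration; the only fact relied on is the man page statement that st_blocks is counted in 512-byte units. With the values from this comment, 88 blocks give 45056 bytes (44 KiB), while 20480 blocks give the full 10485760 bytes.

```c
#include <stdbool.h>

/* Bytes of backing store implied by st_blocks, which stat(2) reports
   in 512-byte units regardless of st_blksize. */
long long allocated_bytes(long long st_blocks)
{
    return st_blocks * 512LL;
}

/* Illustrative check: a file looks sparse when fewer bytes are
   allocated than its apparent size (st_size). */
bool looks_sparse(long long st_blocks, long long st_size)
{
    return allocated_bytes(st_blocks) < st_size;
}
```

By this check the posix_fallocate'd file (st_blocks=88) looks sparse and the file created with write (st_blocks=20480) does not.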

I'm still not sure why or how st_blocks gets set to 88, but whatever does that is the root cause of this problem.  Even the value 88 is a bit odd: I could understand 80, since 80 * 512 * 256 = 10485760, but 88 looks like an "off by 1 error".

Another data point: 1048576 (e.g. wsize) / 256 = 4096, where 4096 is the block size of the mount point's parent directories (in my case /home and /home/nfs_pool) as well as of the 'nfs-export' directory, even though the 'target' directory reports a block size of 1048576:

(statfs is my simple code that just calls statfs and displays data)

# ./statfs /home
statfs f_bsize=4096 f_blocks=35244766 f_bfree=24517830 f_bavail=22721734
# ./statfs /home/nfs_pool
statfs f_bsize=4096 f_blocks=35244766 f_bfree=24517862 f_bavail=22721766
# ./statfs /home/nfs_pool/nfs-export
statfs f_bsize=4096 f_blocks=35244766 f_bfree=24517740 f_bavail=22721644
# ./statfs /home/nfs_pool/target
statfs f_bsize=1048576 f_blocks=137675 f_bfree=95773 f_bavail=88757
#
# ls -al /home/nfs_pool
total 10260
drwxr-xr-x.  4 root root     4096 Jul 31 17:12 .
drwxr-xr-x. 26 root root     4096 Dec  1 14:22 ..
drwxr-xr-x.  2 root root     4096 Dec  9 07:23 nfs-export
drwxr-xr-x.  2 root root     4096 Dec  9 07:23 target
#
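
One possible reading of these numbers, offered as a sketch rather than a diagnosis: assume the emulation touches one byte per increment (the Doc Text only says allocation happens "in increments of the file system block size"), and assume the server really allocates in 4096-byte units. The function modeled_st_blocks below is a hypothetical model, not glibc code.

```c
/* Hypothetical model, not glibc code: estimate st_blocks (512-byte
   units) after one byte is written every INCREMENT bytes across LEN
   bytes, on a file system whose real allocation unit is UNIT bytes.
   Assumes UNIT is a multiple of 512. */
long long modeled_st_blocks(long long len, long long increment,
                            long long unit)
{
    long long touched;

    if (increment <= unit)
        /* Every allocation unit receives at least one byte. */
        touched = (len + unit - 1) / unit;
    else
        /* Each write lands in a different unit; the rest stay holes. */
        touched = (len + increment - 1) / increment;

    return touched * (unit / 512);
}
```

With len=10485760 and unit=4096, an increment of 1048576 (the f_bsize reported for the NFS mount) yields 80, and an increment of 4096 yields 20480, the value seen for the fully written file. The model does not account for the extra 8 units in the observed 88.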

Comment 5 Florian Weimer 2015-06-02 13:50:16 UTC
I proposed an upstream patch addressing this (among other things): https://sourceware.org/ml/libc-alpha/2015-05/msg00321.html

Comment 6 Florian Weimer 2015-06-05 09:53:36 UTC
Fix has been committed upstream: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7fe9e2e089f4990b7d18d0798f591ab276b15f2b

Comment 15 Michael Petlan 2016-08-04 13:46:01 UTC
Reproduced on s390x, aarch64 and x86_64. Not reproduced on ppc64; ppc64le unknown.

The new glibc-2.17-156.el7 passed on all architectures.

VERIFIED.

Comment 17 errata-xmlrpc 2016-11-03 08:21:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2573.html