Bug 722185 - posix_fadvise needs to be stateful to be useful at avoiding filesystem cache pollution
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.0
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: beta
Target Release: ---
Assigned To: Rik van Riel
QA Contact: Red Hat Kernel QE team
Keywords: FutureFeature, Improvement
Depends On:
Blocks:
Reported: 2011-07-14 10:42 EDT by Eric Blake
Modified: 2016-07-02 17:30 EDT
CC List: 21 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: 634653
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments: None
Comment 1 Eric Blake 2011-07-14 10:46:58 EDT
libvirt will gladly take advantage of posix_fadvise if this is fixed, but if this is not fixed in a timely manner, libvirt has a working solution based on O_DIRECT; so this is not technically a blocker for bug 634653.
Comment 3 Eric Blake 2011-07-14 11:05:16 EDT
Also, it would be handy if there were some file in sysfs that exists if the kernel supports sane advice handling, and is not present in older kernels, so that libvirt can make a runtime decision of whether the posix_fadvise call will actually be acted on, vs. needing to use the O_DIRECT fallback.
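The run-time decision described above could look roughly like the following sketch. The marker path is purely hypothetical (no such sysfs file is known to exist; it stands in for whatever the kernel might export), and the helper names are illustrative only, not libvirt code.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* HYPOTHETICAL marker path: not a real kernel interface. */
#define FADVISE_MARKER "/sys/kernel/mm/fadvise_stateful"

/* True if the marker file named by `path` exists. */
static bool advice_is_stateful(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Open `file` for a one-shot sequential write, picking the cache-avoidance
 * strategy at run time. *used_direct tells the caller whether the O_DIRECT
 * alignment rules apply to subsequent writes. */
static int open_for_bulk_write(const char *file, const char *marker,
                               bool *used_direct)
{
    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    int fd;

    *used_direct = !advice_is_stateful(marker);
    if (*used_direct)
        flags |= O_DIRECT;      /* fallback: bypass the page cache */

    fd = open(file, flags, 0600);
    if (fd >= 0 && !*used_direct) {
        /* One hint per call; a length of 0 means "to end of file". */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
    }
    return fd;
}
```

A caller would pass FADVISE_MARKER as `marker`; on a kernel without the marker, the O_DIRECT path is taken and writes must be suitably aligned.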
Comment 4 Laszlo Ersek 2011-07-14 12:24:19 EDT
(In reply to comment #0)

> On 07/13/2011 09:06 AM, Eric Blake wrote:

>> POSIX_FADV_NOREUSE and POSIX_FADV_SEQUENTIAL would both be stateful and
>> inheritable - libvirt opens an fd for writing, marks it with both advice
>> tags (that is, the fd will be used in sequential order and that it should
>> be a one-shot operation rather than left in the cache)

Also from bug 634653 comment 11:

> would an appropriate use of posix_fadvise with
> POSIX_FADV_SEQUENTIAL|POSIX_FADV_NOREUSE work to avoid some cache pressure,

I don't think those hints are meant to be OR'd together. SUSv4 says [1]:

        The advice to be applied to the data is specified by the /advice/
        parameter and may be one of the following values: [...]

The Linux manual says [2]:

        Permissible values for /advice/ include: [...]

I searched the POSIX bug db [3] for "posix_fadvise" and nothing turned up.

I didn't look at glibc, but the syscall (fadvise64_64() in "mm/fadvise.c")
seems to do simple switches on "advice" too.

Looking at the SUSv4 rationale [4], I'd say POSIX_FADV_NOREUSE alone is the hint to pass, even though the language below seems to refer to reading:

      POSIX_FADV_NOREUSE tells the system that caching the specified data is
      not optimal. For file I/O, the transfer should go directly to the user
      buffer instead of being cached internally by the implementation. To
      portably perform direct disk I/O on all systems, the application must
      perform its I/O transfers according to the following rules:

         1. The user buffer should be aligned according to the
            {POSIX_REC_XFER_ALIGN} pathconf() variable.

         2. The number of bytes transferred in an I/O operation should be a
            multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.

         3. The offset into the file at the start of an I/O operation should be
            a multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.

         4. The application should ensure that all threads which open a given
            file specify POSIX_FADV_NOREUSE to be sure that there is no
            unexpected interaction between threads using buffered I/O and
            threads using direct I/O to the same file.

      In some cases, a user buffer must be properly aligned in order to be
      transferred directly to/from the device. The {POSIX_REC_XFER_ALIGN}
      pathconf() variable tells the application the proper alignment.

(I can already see Eric's next mail to austin-group-l about making the hints OR-able :))

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_fadvise.html
[2] http://www.kernel.org/doc/man-pages/online/pages/man2/posix_fadvise.2.html
[3] http://austingroupbugs.net/view_all_bug_page.php
[4] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html#tag_22_02_08_01
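As an illustration of rules 1 to 3 in the rationale quoted above, the following sketch queries the two pathconf() variables and allocates a buffer that satisfies them. The 4096-byte fallback and the helper names are assumptions made for the example; they do not come from the standard.

```c
#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <unistd.h>

/* Query a pathconf() variable, using `fallback` (an assumed value) when the
 * implementation reports no limit or does not support the variable. */
static long limit_or(const char *path, int name, long fallback)
{
    long v = pathconf(path, name);
    return v > 0 ? v : fallback;
}

static int is_pow2(long v)
{
    return v > 0 && (v & (v - 1)) == 0;
}

/* Allocate a buffer for direct I/O on `path`: aligned to
 * {POSIX_REC_XFER_ALIGN} and sized as `units` multiples of
 * {POSIX_ALLOC_SIZE_MIN}. Stores the size in *len; returns NULL on failure. */
static void *alloc_io_buffer(const char *path, size_t units, size_t *len)
{
    long align = limit_or(path, _PC_REC_XFER_ALIGN, 4096);
    long unit  = limit_or(path, _PC_ALLOC_SIZE_MIN, 4096);
    void *buf = NULL;

    if (units == 0)
        return NULL;
    /* posix_memalign() needs a power of two that is at least sizeof(void *). */
    if (!is_pow2(align) || (size_t)align < sizeof(void *))
        align = 4096;

    *len = units * (size_t)unit;
    if (posix_memalign(&buf, (size_t)align, *len) != 0)
        return NULL;
    return buf;
}
```

File offsets would then be advanced in multiples of the same unit, and the buffer released with free().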
Comment 5 Eric Blake 2011-07-14 12:31:15 EDT
(In reply to comment #4)
> (I can already see Eric's next mail to austin-group-l about making the hints
> OR-able :))

Nope - in this case, existing practice is that you must use multiple posix_fadvise calls to attach multiple advice hints.  Anyone that hasn't already defined the FADV flags as bitwise-distinct would be broken if POSIX changed to require them to be OR-able.
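A minimal sketch of that existing practice, one posix_fadvise call per hint. The function name is illustrative, and whether the kernel remembers both hints is exactly what this bug is about.

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stddef.h>

/* Attach both hints to `fd`, one call per hint, because the FADV values are
 * not required to be bitwise-distinct and must not be OR'd together.
 * Returns 0 on success or the first error number reported. */
static int advise_streaming(int fd)
{
    static const int hints[] = { POSIX_FADV_SEQUENTIAL, POSIX_FADV_NOREUSE };
    size_t i;

    for (i = 0; i < sizeof(hints) / sizeof(hints[0]); i++) {
        /* offset 0, length 0: the hint covers the whole file */
        int rc = posix_fadvise(fd, 0, 0, hints[i]);
        if (rc != 0)
            return rc;  /* posix_fadvise returns the error; it does not set errno */
    }
    return 0;
}
```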
Comment 6 Rik van Riel 2011-07-14 14:03:16 EDT
Agreed.  While the flags are not OR-able, the kernel is supposed to keep track of everything you want when you make multiple fadvise calls.  

If a call contradicts what you asked for earlier, the kernel undoes the earlier request.

If they go hand in hand (e.g. FADV_NOREUSE together with something else), the kernel should just keep both in mind.
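The intended semantics can be modelled in user space as follows. This is only a sketch of the behavior described in this comment, not kernel code: the struct, the function names, and the choice that POSIX_FADV_NORMAL resets everything are assumptions.

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical per-fd advice state (not a kernel structure). */
struct advice_state {
    int  pattern;   /* POSIX_FADV_NORMAL, _SEQUENTIAL or _RANDOM */
    bool noreuse;   /* set by POSIX_FADV_NOREUSE */
};

static void advice_init(struct advice_state *s)
{
    s->pattern = POSIX_FADV_NORMAL;
    s->noreuse = false;
}

/* Fold one fadvise call into the remembered state. */
static void advice_apply(struct advice_state *s, int advice)
{
    switch (advice) {
    case POSIX_FADV_NORMAL:
        advice_init(s);         /* assumption: NORMAL clears all hints */
        break;
    case POSIX_FADV_SEQUENTIAL:
    case POSIX_FADV_RANDOM:
        s->pattern = advice;    /* contradicts and replaces the earlier pattern */
        break;
    case POSIX_FADV_NOREUSE:
        s->noreuse = true;      /* compatible with any pattern: keep both */
        break;
    default:
        break;                  /* WILLNEED/DONTNEED act once; nothing to remember */
    }
}
```

With this model, SEQUENTIAL followed by NOREUSE leaves both hints in effect, while a later RANDOM replaces only the access pattern.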
Comment 8 Eric Blake 2012-08-24 18:51:44 EDT
I'm mentioning this bug in my upcoming Linux Plumbers Conference talk; I will add any feedback I get from that talk back to this bug.
http://summit.linuxplumbersconf.org/lpc-2012/meeting/33/lpc2012-ref-improved-virt-disk-handling/
Comment 10 Peter Portante 2013-05-08 13:28:22 EDT
See also the work to address a similar cache pollution problem of writes for OpenStack Swift in https://bugzilla.redhat.com/show_bug.cgi?id=957821.
Comment 11 Larry Woodman 2013-05-09 11:09:01 EDT

This problem has been fixed upstream by commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732, so it is also in the latest RHEL7 kernel.  As Peter indicated above, I backported this into RHEL6 to fix BZ957821. Please grab the latest RHEL7 kernel, retest, and update this BZ accordingly.

Larry Woodman
Comment 12 Eric Blake 2013-05-09 11:17:50 EDT
(In reply to comment #11)
> 
> This problem has been fixed upstream by commit
> 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 so it is also in the latest RHEL7

That commit improves behavior, but is not easily discoverable.  This BZ asked for two things - sane behavior, and a way to easily discover if the behavior is sane.  That's because applications are still forced to fall back to O_DIRECT if they cannot prove the behavior will be sane.
Comment 13 RHEL Product and Program Management 2014-03-24 01:54:58 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 14 Eric Sandeen 2014-06-30 21:22:31 EDT
Hm, given this from Rik in the original comment:

> Feel free to clone the BZ against the kernel and assign it to me.

Shouldn't this be Rik's (mm) bug, not on the fs list?  I'm going to try that, because obviously we're not getting to it.
