Bug 722185 - posix_fadvise needs to be stateful to be useful at avoiding filesystem cache pollution
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.0
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: beta
Target Release: ---
Assignee: fs-maint
QA Contact: Filesystem QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-07-14 14:42 UTC by Eric Blake
Modified: 2019-08-14 16:49 UTC (History)
CC List: 24 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of: 634653
Environment:
Last Closed: 2019-08-14 16:49:02 UTC
Target Upstream Version:
Embargoed:



Comment 1 Eric Blake 2011-07-14 14:46:58 UTC
libvirt will gladly take advantage of posix_fadvise if this is fixed, but if this is not fixed in a timely manner, libvirt has a working solution based on O_DIRECT; so this is not technically a blocker for bug 634653.

Comment 3 Eric Blake 2011-07-14 15:05:16 UTC
Also, it would be handy if there were some file in sysfs that exists if the kernel supports sane advice handling, and is not present in older kernels, so that libvirt can make a runtime decision of whether the posix_fadvise call will actually be acted on, vs. needing to use the O_DIRECT fallback.
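
To illustrate the runtime decision requested above, here is a minimal sketch in Python. The marker path is purely hypothetical: this report only asks for "some file in sysfs" and never names one, so HYPOTHETICAL_FADVISE_MARKER and both helper functions are assumptions made for illustration, not an existing kernel or libvirt interface.

```python
import os

# Hypothetical marker path: the bug only asks for "some file in sysfs";
# no such file is named in this report, so this path is an assumption.
HYPOTHETICAL_FADVISE_MARKER = "/sys/kernel/mm/fadvise_stateful"


def choose_io_strategy(marker_path=HYPOTHETICAL_FADVISE_MARKER):
    """Return "fadvise" if the marker file exists, else "o_direct"."""
    return "fadvise" if os.path.exists(marker_path) else "o_direct"


def open_for_streaming_write(path, marker_path=HYPOTHETICAL_FADVISE_MARKER):
    """Open path for writing, using whichever strategy the probe selects."""
    flags = os.O_WRONLY | os.O_CREAT
    strategy = choose_io_strategy(marker_path)
    if strategy == "o_direct":
        # O_DIRECT bypasses the page cache; the caller must then use
        # aligned buffers, offsets, and lengths for every transfer.
        return os.open(path, flags | os.O_DIRECT, 0o600), strategy
    fd = os.open(path, flags, 0o600)
    # Advise over the whole file (a length of 0 means "to end of file").
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
    return fd, strategy
```

The point of the probe is that the application never has to guess: without the marker it takes the O_DIRECT path it already trusts, and with the marker it can rely on the advice being honored.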

Comment 4 Laszlo Ersek 2011-07-14 16:24:19 UTC
(In reply to comment #0)

> On 07/13/2011 09:06 AM, Eric Blake wrote:

>> POSIX_FADV_NOREUSE and POSIX_FADV_SEQUENTIAL would both be stateful and
>> inheritable - libvirt opens an fd for writing, marks it with both advice
>> tags (that is, the fd will be used in sequential order and that it should
>> be a one-shot operation rather than left in the cache)

Also from bug 634653 comment 11:

> would an appropriate use of posix_fadvise with
> POSIX_FADV_SEQUENTIAL|POSIX_FADV_NOREUSE work to avoid some cache pressure,

I don't think those hints are meant to be OR'd together. SUSv4 says [1]:

        The advice to be applied to the data is specified by the /advice/
        parameter and may be one of the following values: [...]

The Linux manual says [2]:

        Permissible values for /advice/ include: [...]

I searched the POSIX bug db [3] for "posix_fadvise" and nothing turned up.

I didn't look at glibc, but the syscall (fadvise64_64() in "mm/fadvise.c")
seems to do simple switches on "advice" too.

Looking at the SUSv4 rationale [4], I'd say POSIX_FADV_NOREUSE alone is the hint to pass, even though the language below seems to refer to reading:

      POSIX_FADV_NOREUSE tells the system that caching the specified data is
      not optimal. For file I/O, the transfer should go directly to the user
      buffer instead of being cached internally by the implementation. To
      portably perform direct disk I/O on all systems, the application must
      perform its I/O transfers according to the following rules:

         1. The user buffer should be aligned according to the
            {POSIX_REC_XFER_ALIGN} pathconf() variable.

         2. The number of bytes transferred in an I/O operation should be a
            multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.

         3. The offset into the file at the start of an I/O operation should be
            a multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.

         4. The application should ensure that all threads which open a given
            file specify POSIX_FADV_NOREUSE to be sure that there is no
            unexpected interaction between threads using buffered I/O and
            threads using direct I/O to the same file.

      In some cases, a user buffer must be properly aligned in order to be
      transferred directly to/from the device. The {POSIX_REC_XFER_ALIGN}
      pathconf() variable tells the application the proper alignment.

(I can already see Eric's next mail to austin-group-l about making the hints OR-able :))

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_fadvise.html
[2] http://www.kernel.org/doc/man-pages/online/pages/man2/posix_fadvise.2.html
[3] http://austingroupbugs.net/view_all_bug_page.php
[4] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html#tag_22_02_08_01
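
As a rough illustration of rules 2 and 3 in the rationale quoted above, the following Python sketch widens an (offset, length) pair to multiples of the allocation unit. The helper names and the 4096-byte fallback are assumptions for illustration only; rule 1 (buffer alignment) and rule 4 are not covered here.

```python
import os


def align_down(value, alignment):
    """Round value down to the nearest multiple of alignment."""
    return value - (value % alignment)


def align_up(value, alignment):
    """Round value up to the nearest multiple of alignment."""
    return align_down(value + alignment - 1, alignment)


def query_pathconf(path, name, fallback=4096):
    """Look up a pathconf() variable, using fallback if it is unavailable.

    The fallback of 4096 is an assumption for illustration, not a value
    taken from the standard.
    """
    if name not in os.pathconf_names:
        return fallback
    try:
        value = os.pathconf(path, name)
    except OSError:
        return fallback
    return value if value > 0 else fallback


def aligned_span(path, offset, length):
    """Widen (offset, length) to multiples of the allocation unit."""
    unit = query_pathconf(path, "PC_ALLOC_SIZE_MIN")
    start = align_down(offset, unit)
    end = align_up(offset + length, unit)
    return start, end - start
```

Widening rather than shrinking the span keeps the requested bytes covered; an application doing real direct I/O would still need a suitably aligned user buffer.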

Comment 5 Eric Blake 2011-07-14 16:31:15 UTC
(In reply to comment #4)
> (I can already see Eric's next mail to austin-group-l about making the hints
> OR-able :))

Nope - in this case, existing practice is that you must use multiple posix_fadvise calls to attach multiple advice hints.  Anyone that hasn't already defined the FADV flags as bitwise-distinct would be broken if POSIX changed to require them to be OR-able.

Comment 6 Rik van Riel 2011-07-14 18:03:16 UTC
Agreed.  While the flags are not OR-able, the kernel is supposed to keep track of everything you want when you make multiple fadvise calls.  

If a call contradicts what you asked for earlier, the kernel undoes the earlier request.

If they go hand in hand (e.g. FADV_NOREUSE together with something else), the kernel should just keep both in mind.
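
The one-hint-per-call practice described in comments 5 and 6 might look like the sketch below, written in Python for brevity. The helper names are hypothetical, and whether the kernel actually remembers both hints is exactly the question this bug raises; the code only shows the calling convention.

```python
import os


def apply_advice(fd, hints, offset=0, length=0):
    """Issue one posix_fadvise() call per hint; hints are never OR'd."""
    for hint in hints:
        os.posix_fadvise(fd, offset, length, hint)
    return len(hints)


def open_sequential_one_shot(path):
    """Open path for writing and attach both hints from the request."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        apply_advice(fd, [os.POSIX_FADV_SEQUENTIAL, os.POSIX_FADV_NOREUSE])
    except OSError:
        # Do not leak the descriptor if the kernel rejects a hint.
        os.close(fd)
        raise
    return fd
```

Passing the hints as a list keeps each advice value intact, which matters because the FADV constants are not guaranteed to be bitwise-distinct.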

Comment 8 Eric Blake 2012-08-24 22:51:44 UTC
I'm mentioning this bug in my upcoming Linux Plumbers Conference talk; I will add any feedback I get from that talk back to this bug.
http://summit.linuxplumbersconf.org/lpc-2012/meeting/33/lpc2012-ref-improved-virt-disk-handling/

Comment 10 Peter Portante 2013-05-08 17:28:22 UTC
See also the work to address a similar cache pollution problem of writes for OpenStack Swift in https://bugzilla.redhat.com/show_bug.cgi?id=957821.

Comment 11 Larry Woodman 2013-05-09 15:09:01 UTC

This problem has been fixed upstream by commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732, so it is also in the latest RHEL7 kernel.  As Peter indicated above, I backported this into RHEL6 to fix BZ957821. Please grab the latest RHEL7 kernel, retest, and update this BZ accordingly.

Larry Woodman

Comment 12 Eric Blake 2013-05-09 15:17:50 UTC
(In reply to comment #11)
> 
> This problem has been fixed upstream by commit
> 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 so it is also in the latest RHEL7

That commit improves behavior, but is not easily discoverable.  This BZ asked for two things - sane behavior, and a way to easily discover if the behavior is sane.  That's because applications are still forced to fall back to O_DIRECT if they cannot prove the behavior will be sane.

Comment 13 RHEL Program Management 2014-03-24 05:54:58 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 14 Eric Sandeen 2014-07-01 01:22:31 UTC
Hm, given this from Rik in the original comment:

> Feel free to clone the BZ against the kernel and assign it to me.

Shouldn't this be Rik's (mm) bug, not on the fs list?  I'm going to try that, because obviously we're not getting to it.

Comment 23 Eric Sandeen 2019-08-14 16:49:02 UTC
Let's just put this bug into the state that reflects reality; at this point in RHEL7 I can't imagine that we'll be finally addressing this request after 8 years of not doing so.

