| Summary: | posix_fadvise needs to be stateful to be useful at avoiding filesystem cache pollution | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Eric Blake <eblake> |
| Component: | kernel | Assignee: | fs-maint |
| kernel sub component: | VFS | QA Contact: | Filesystem QE <fs-qe> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | low | ||
| Priority: | low | CC: | amit.shah, berrange, bsettle, chuhu, danken, dhowells, eblake, esandeen, hhuang, kbenoit, knoel, lersek, lwoodman, mkenneth, mszeredi, pportant, quintela, Rhev-m-bugs, sdenham, swhiteho, tburke, virt-maint, xzhou, yoyzhang |
| Version: | 7.0 | Keywords: | FutureFeature, Improvement |
| Target Milestone: | beta | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Enhancement | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 634653 | Environment: | |
| Last Closed: | 2019-08-14 16:49:02 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Comment 1
Eric Blake
2011-07-14 14:46:58 UTC
Also, it would be handy if there were some file in sysfs that exists if the kernel supports sane advice handling, and is not present in older kernels, so that libvirt can make a runtime decision of whether the posix_fadvise call will actually be acted on, vs. needing to use the O_DIRECT fallback.

(In reply to comment #0)
> On 07/13/2011 09:06 AM, Eric Blake wrote:
>> POSIX_FADV_NOREUSE and POSIX_FADV_SEQUENTIAL would both be stateful and
>> inheritable - libvirt opens an fd for writing, marks it with both advice
>> tags (that is, the fd will be used in sequential order and that it should
>> be a one-shot operation rather than left in the cache)

Also from bug 634653 comment 11:
> would an appropriate use of posix_fadvise with
> POSIX_FADV_SEQUENTIAL|POSIX_FADV_NOREUSE work to avoid some cache pressure,

I don't think those hints are meant to be OR'd together. SUSv4 says [1]:

    The advice to be applied to the data is specified by the /advice/ parameter and may be one of the following values: [...]

The Linux manual says [2]:

    Permissible values for /advice/ include: [...]

I searched the POSIX bug db [3] for "posix_fadvise" and nothing turned up. I didn't look at glibc, but the syscall (fadvise64_64() in "mm/fadvise.c") seems to do simple switches on "advice" too.

Looking at the SUSv4 rationale [4], I'd say POSIX_FADV_NOREUSE alone is the hint to pass, even though the language below seems to refer to reading:

    POSIX_FADV_NOREUSE tells the system that caching the specified data is not optimal. For file I/O, the transfer should go directly to the user buffer instead of being cached internally by the implementation. To portably perform direct disk I/O on all systems, the application must perform its I/O transfers according to the following rules:

    1. The user buffer should be aligned according to the {POSIX_REC_XFER_ALIGN} pathconf() variable.
    2. The number of bytes transferred in an I/O operation should be a multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.
    3. The offset into the file at the start of an I/O operation should be a multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.
    4. The application should ensure that all threads which open a given file specify POSIX_FADV_NOREUSE to be sure that there is no unexpected interaction between threads using buffered I/O and threads using direct I/O to the same file.

    In some cases, a user buffer must be properly aligned in order to be transferred directly to/from the device. The {POSIX_REC_XFER_ALIGN} pathconf() variable tells the application the proper alignment.

(I can already see Eric's next mail to austin-group-l about making the hints OR-able :))

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_fadvise.html
[2] http://www.kernel.org/doc/man-pages/online/pages/man2/posix_fadvise.2.html
[3] http://austingroupbugs.net/view_all_bug_page.php
[4] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html#tag_22_02_08_01

(In reply to comment #4)
> (I can already see Eric's next mail to austin-group-l about making the hints
> OR-able :))

Nope - in this case, existing practice is that you must use multiple posix_fadvise calls to attach multiple advice hints. Anyone that hasn't already defined the FADV flags as bitwise-distinct would be broken if POSIX changed to require them to be OR-able.

Agreed. While the flags are not OR-able, the kernel is supposed to keep track of everything you want when you make multiple fadvise calls. If a call contradicts what you asked for earlier, the kernel undoes the earlier request. If they go hand in hand (e.g. FADV_NOREUSE together with something else), the kernel should just keep both in mind.

I'm mentioning this bug in my upcoming Linux Plumbers Conference talk; I will add any feedback I get from that talk back to this bug.
http://summit.linuxplumbersconf.org/lpc-2012/meeting/33/lpc2012-ref-improved-virt-disk-handling/

See also the work to address a similar cache pollution problem of writes for OpenStack Swift in https://bugzilla.redhat.com/show_bug.cgi?id=957821.

This problem has been fixed upstream by commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732, so it is also in the latest RHEL7 kernel. As Peter indicated above, I backported this into RHEL6 to fix BZ957821. Please grab the latest RHEL7 kernel, retest, and update this BZ accordingly.

Larry Woodman

(In reply to comment #11)
> > This problem has been fixed upstream by commit
> 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 so it is also in the latest RHEL7

That commit improves behavior, but is not easily discoverable. This BZ asked for two things - sane behavior, and a way to easily discover if the behavior is sane. That's because applications are still forced to fall back to O_DIRECT if they cannot prove the behavior will be sane.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Hm, given this from Rik in the original comment:
> Feel free to clone the BZ against the kernel and assign it to me.
Shouldn't this be Rik's (mm) bug, not on the fs list? I'm going to try that, because obviously we're not getting to it.
Let's just put this bug into the state that reflects reality; at this point in RHEL7 I can't imagine that we'll be finally addressing this request after 8 years of not doing so. |
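
For readers following the discussion above about the advice values not being OR-able, the sketch below shows what "multiple posix_fadvise calls" could look like from an application. It is only an illustration in Python using the standard `os.posix_fadvise` wrapper; the helper names `advise_streaming`, `write_all`, and `write_one_shot` are hypothetical and do not come from libvirt or from this bug. Note that a successful call only means the kernel accepted the hint. Whether the hint is remembered and acted on is exactly what this bug asks about, and the sketch has no way to detect that.

```python
import os


def advise_streaming(fd):
    """Attach sequential and no-reuse advice using one call per hint.

    The advice values are plain enumerators, not bit flags, so they are
    passed in separate calls instead of being OR'd together.
    Returns the names of the hints the OS accepted (illustrative only).
    """
    applied = []
    if not hasattr(os, "posix_fadvise"):
        return applied  # platform has no posix_fadvise at all
    for name in ("POSIX_FADV_SEQUENTIAL", "POSIX_FADV_NOREUSE"):
        advice = getattr(os, name, None)
        if advice is None:
            continue
        try:
            # offset 0, length 0 means "from the start to the end of the file"
            os.posix_fadvise(fd, 0, 0, advice)
        except OSError:
            continue  # hint rejected for this file; carry on without it
        applied.append(name)
    return applied


def write_all(fd, data):
    """Write all of data to fd, looping over short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]
    return len(data)


def write_one_shot(path, chunks):
    """Write chunks sequentially to path after attaching both hints."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        applied = advise_streaming(fd)
        total = 0
        for chunk in chunks:
            total += write_all(fd, chunk)
    finally:
        os.close(fd)
    return total, applied
```

A caller might use it as `total, applied = write_one_shot("/tmp/out.img", [b"abc", b"defg"])`. An empty `applied` list only tells the caller that the hints were unavailable or rejected; a non-empty list does not prove the cache behavior is sane, which is why the thread still treats O_DIRECT as the fallback.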