Bug 1103240 - readahead behavior change - possible performance regressions
Summary: readahead behavior change - possible performance regressions
Keywords:
Status: CLOSED DUPLICATE of bug 1062288
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Larry Woodman
QA Contact: Jiri Jaburek
URL:
Whiteboard:
Depends On:
Blocks: 1296180 1394638 1404314
 
Reported: 2014-05-30 14:09 UTC by Jiri Jaburek
Modified: 2016-12-14 03:37 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-17 15:43:08 UTC
Target Upstream Version:
Embargoed:
lwoodman: needinfo+



Description Jiri Jaburek 2014-05-30 14:09:10 UTC
Description of problem:

As discussed in bug 862177 and bug 1062288, newer kernels (RHEL 7, likely RHEL 6, and the latest upstream) contain the following change:

http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/mm/readahead.c?id=47db8bd21e9914006b3bd2b9e90e0aaf2c04cbd2

While this change may be beneficial for some NUMA systems, it completely changes the readahead(2) behavior on what appears to be all systems (at least on the x86 architecture) by limiting the readahead size.


# cat readahead.c 
#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    off_t fsize;

    (void)argc;
    fd = open(argv[1], O_RDONLY);

    fsize = lseek(fd, 0, SEEK_END);   /* determine the file size */
    lseek(fd, 0, SEEK_SET);

    readahead(fd, 0, fsize);          /* ask the kernel to cache the entire file */

    return 0;
}

# gcc readahead.c -Wall -Wextra -o readahead
# dd if=/dev/zero of=bigfile bs=500M count=1


kernel-3.10.0-98.el7:

# echo 3 > /proc/sys/vm/drop_caches; free -m; ./readahead bigfile; free -m
             total       used       free     shared    buffers     cached
Mem:          1841        135       1705          8          0         19
-/+ buffers/cache:        116       1725
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:          1841        635       1205          8          0        519
-/+ buffers/cache:        115       1725
Swap:            0          0          0

kernel-3.10.0-99.el7:

# echo 3 > /proc/sys/vm/drop_caches; free -m; ./readahead bigfile; free -m
             total       used       free     shared    buffers     cached
Mem:          1841        136       1704          8          0         19
-/+ buffers/cache:        116       1724
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:          1841        138       1702          8          0         21
-/+ buffers/cache:        116       1724
Swap:            0          0          0


This is a regular qemu-kvm based virtual machine,

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2
node 0 size: 2047 MB
node 0 free: 1201 MB
node distances:
node   0 
  0:  10 

but the effect can be observed on several "real" amd64 machines as well.


The problem is that readahead(2) doesn't do what the manpage claims;
    "readahead() blocks until the specified data has been read."
is not true - on newer kernels, it has an artificial limit, which seems to be 2MB (512 * 4K pages?).

Unlike read(2), readahead(2) has no way of telling the userspace application how much data has been read, so - unless the application takes kernel internals into account - it has no way of reliably caching the entire file (or at least a >2MB portion).

This change can negatively affect the performance of applications that rely on readahead(2) working as documented, e.g. reading 'count' bytes when there is enough system memory.
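
As an illustration, here is a minimal workaround sketch, assuming the per-call limit of roughly 2MB observed above: instead of a single call, issue readahead(2) once per 2MB window of the file. The RA_CHUNK value is an assumption taken from the observed behavior (it is not an exported kernel constant), and the prefetched pages remain subject to normal page-cache reclaim, so this is still only a hint, not a guarantee.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define RA_CHUNK (2L * 1024 * 1024)   /* assumed per-call readahead cap */

int main(int argc, char **argv)
{
    int fd;
    off_t fsize, off;

    (void)argc;
    fd = open(argv[1], O_RDONLY);

    fsize = lseek(fd, 0, SEEK_END);
    lseek(fd, 0, SEEK_SET);

    /* issue one readahead(2) call per 2MB window instead of a single call */
    for (off = 0; off < fsize; off += RA_CHUNK)
        readahead(fd, off, RA_CHUNK);

    return 0;
}

With the same 500M bigfile, comparing free -m before and after this version should show most of the file cached (given enough free memory), in contrast to the ~2MB seen with the single call above.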


Is upstream aware of this limitation? I have found some concerns, e.g. https://lkml.org/lkml/2014/1/10/122, but these seem to have been left unaddressed.


Version-Release number of selected component (if applicable):
kernel-3.10.0-123.el7

How reproducible:
always

Actual results:
readahead(2) reads less than 'count' bytes

Expected results:
readahead(2) reads exactly 'count' bytes

Additional info:
This bug may also affect RHEL6 (as the original bug was cloned to RHEL6).

Comment 2 Larry Woodman 2014-09-22 17:42:23 UTC
Did anyone ever verify this was a real problem or is it just speculation???

Larry Woodman

Comment 5 Rafael Aquini 2014-10-03 21:33:35 UTC
I went through the numbers our performance team came up with while doing their regression tests for rhel-6 and rhel-7, and I haven't found any change that could be directly linked to the change mentioned in comment #0.

Other than setting a hard ceiling of 2MB for any issued readahead, which might be seen as trouble for certain users/use cases, there seems to be no other measurable loss here. OTOH, the tangible gain after the change is that readahead works for NUMA layouts where some CPUs are within a memoryless node.

A bug similar to this one was reported upstream, and it led to the following discussion thread:
https://lkml.org/lkml/2014/7/3/416

Comment 6 Rafael Aquini 2014-10-03 21:38:32 UTC
Upstream bug mentioned at comment #5: https://bugzilla.kernel.org/show_bug.cgi?id=79111

Comment 7 Rafael Aquini 2014-10-03 21:56:36 UTC
A couple of notes based on upstream discussion and actual code placements:

(In reply to Jiri Jaburek from comment #0)
> The problem is that readahead(2) doesn't do what the manpage claims;
>     "readahead() blocks until the specified data has been read."
> is not true - on newer kernels, it has an artificial limit, which seems to
> be 2MB (512 * 4K pages?).
>

This is actually a misguided expectation created by the man page; the code readahead(2) relies on has never made such a guarantee. There seems to be a patch upstream to fix that man-page misunderstanding: http://www.spinics.net/lists/linux-mm/msg70517.html

 

> Actual results:
> readahead(2) reads less than 'count' bytes
> 
> Expected results:
> readahead(2) reads exactly 'count' bytes
> 

There has never been a guarantee that readahead(2) would bring "exactly 'count' bytes" into the page cache. The big change here, on the other hand, is that before the aforementioned commit, code paths relying on force_page_cache_readahead() would try to read ahead as much as the total amount of free and reclaimable file memory in a given node would allow. After the change, force_page_cache_readahead() itself is capped at a 2MB maximum effort per call, and that is why you are seeing such values reported for cached memory after the readahead(2) calls in your test case.
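
For reference, the per-call cap can also be checked directly with mincore(2) instead of being inferred from free -m. The following sketch (an illustration, not a definitive test) maps the file, issues a single readahead(2) as in the original test case, and counts how many pages ended up resident in the page cache; with the cap described above, roughly 512 pages (2MB of 4K pages) would be expected for a large file on the capped kernels.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    off_t fsize;
    long pgsz;
    size_t npages, resident = 0, i;
    unsigned char *vec;
    void *map;

    (void)argc;
    fd = open(argv[1], O_RDONLY);
    fsize = lseek(fd, 0, SEEK_END);

    readahead(fd, 0, fsize);          /* single call, as in the original reproducer */

    pgsz = sysconf(_SC_PAGESIZE);
    npages = (fsize + pgsz - 1) / pgsz;
    vec = malloc(npages);
    map = mmap(NULL, fsize, PROT_READ, MAP_SHARED, fd, 0);

    /* mincore(2) reports page-cache residency without faulting pages in */
    if (map != MAP_FAILED && vec != NULL && mincore(map, fsize, vec) == 0)
        for (i = 0; i < npages; i++)
            resident += vec[i] & 1;

    printf("%zu of %zu pages resident\n", resident, npages);
    return 0;
}

Run after `echo 3 > /proc/sys/vm/drop_caches`, this should report on the order of 512 resident pages on the capped kernels, versus nearly all pages on the older ones.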

Regards,
-- Rafael

Comment 13 Larry Woodman 2016-11-17 15:43:08 UTC

*** This bug has been marked as a duplicate of bug 1062288 ***

