Bug 1103240

Summary: readahead behavior change - possible performance regressions
Product: Red Hat Enterprise Linux 7
Component: kernel
Sub Component: Memory Management
Version: 7.0
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Reporter: Jiri Jaburek <jjaburek>
Assignee: Larry Woodman <lwoodman>
QA Contact: Jiri Jaburek <jjaburek>
CC: aquini, ebenes, loberman, lwang, lwoodman, riel, sbest
Flags: lwoodman: needinfo+
Target Milestone: rc
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-11-17 15:43:08 UTC
Bug Blocks: 1296180, 1394638, 1404314

Description Jiri Jaburek 2014-05-30 14:09:10 UTC
Description of problem:

Discussed in bug 862177 and bug 1062288, newer kernels (RHEL7, likely RHEL6, latest upstream) contain the following change:

http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/mm/readahead.c?id=47db8bd21e9914006b3bd2b9e90e0aaf2c04cbd2

While this change may be beneficial for some NUMA systems, it completely changes the readahead(2) behavior for what seem to be all systems (x86 arch at least) by limiting the readahead size.


# cat readahead.c 
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    off_t fsize;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* determine the file size, then rewind */
    fsize = lseek(fd, 0, SEEK_END);
    lseek(fd, 0, SEEK_SET);

    /* ask the kernel to populate the page cache with the whole file */
    readahead(fd, 0, fsize);

    close(fd);
    return 0;
}

# gcc readahead.c -Wall -Wextra -o readahead
# dd if=/dev/zero of=bigfile bs=500M count=1


kernel-3.10.0-98.el7:

# echo 3 > /proc/sys/vm/drop_caches; free -m; ./readahead bigfile; free -m
             total       used       free     shared    buffers     cached
Mem:          1841        135       1705          8          0         19
-/+ buffers/cache:        116       1725
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:          1841        635       1205          8          0        519
-/+ buffers/cache:        115       1725
Swap:            0          0          0

kernel-3.10.0-99.el7:

# echo 3 > /proc/sys/vm/drop_caches; free -m; ./readahead bigfile; free -m
             total       used       free     shared    buffers     cached
Mem:          1841        136       1704          8          0         19
-/+ buffers/cache:        116       1724
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:          1841        138       1702          8          0         21
-/+ buffers/cache:        116       1724
Swap:            0          0          0


This is a regular qemu-kvm based virtual machine,

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2
node 0 size: 2047 MB
node 0 free: 1201 MB
node distances:
node   0 
  0:  10 

but the effect can be observed on several "real" amd64 machines as well.


The problem is that readahead(2) doesn't do what the manpage claims;
    "readahead() blocks until the specified data has been read."
is not true - on newer kernels there is an artificial limit, which appears to be 2MB (512 4K pages?).

Unlike read(2), readahead(2) has no way of telling the userspace application how much data was actually read, so - unless the application takes kernel internals into account - it has no way of reliably caching the entire file (or any portion larger than 2MB).
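As a sketch of a possible userspace workaround (not something proposed in this report): an application that wants a whole file cached could issue readahead(2) in chunks at or below the observed cap, instead of one large call. The 2MB chunk size and the helper name `readahead_full` are assumptions based on the limit observed above, not a documented interface.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of any API): issue readahead(2) in
 * chunks no larger than the ~2MB per-call cap observed on newer
 * kernels, so a request for the whole file is not silently truncated
 * to a single capped call. */
static int readahead_full(int fd, off_t fsize)
{
    const off_t chunk = 2 * 1024 * 1024;  /* observed per-call limit */
    off_t off;

    for (off = 0; off < fsize; off += chunk) {
        off_t len = (fsize - off < chunk) ? fsize - off : chunk;
        if (readahead(fd, off, (size_t)len) != 0)
            return -1;  /* e.g. EBADF, or fd not capable of readahead */
    }
    return 0;
}
```

Even chunked, readahead(2) remains advisory: the kernel may still evict the pages under memory pressure before the application reads them.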

This change can negatively affect the performance of applications that rely on readahead(2) working as documented, e.g. reading 'count' bytes when there is enough system memory.


Is upstream aware of this limitation? I have found some concerns, e.g. https://lkml.org/lkml/2014/1/10/122, but they seem to have been left unaddressed.


Version-Release number of selected component (if applicable):
kernel-3.10.0-123.el7

How reproducible:
always

Actual results:
readahead(2) reads less than 'count' bytes

Expected results:
readahead(2) reads exactly 'count' bytes

Additional info:
This bug may also affect RHEL6 (as the original bug was cloned to RHEL6).

Comment 2 Larry Woodman 2014-09-22 17:42:23 UTC
Did anyone ever verify this was a real problem or is it just speculation???

Larry Woodman

Comment 5 Rafael Aquini 2014-10-03 21:33:35 UTC
I went through the numbers our performance team came up with while doing their regression tests for rhel-6 and rhel-7, and I haven't found any change that could be directly linked to the change mentioned in comment #0.

Other than setting a hard ceiling of 2MB on any issued readahead, which might be seen as trouble for certain users/use cases, there seems to be no other measurable loss here. OTOH, the tangible gain from the change is that readahead works on NUMA layouts where some CPUs sit within a memoryless node.

A bug similar to this one was reported upstream, which led to the following discussion thread:
https://lkml.org/lkml/2014/7/3/416

Comment 6 Rafael Aquini 2014-10-03 21:38:32 UTC
Upstream bug mentioned at comment #5: https://bugzilla.kernel.org/show_bug.cgi?id=79111

Comment 7 Rafael Aquini 2014-10-03 21:56:36 UTC
A couple of notes based on upstream discussion and actual code placements:

(In reply to Jiri Jaburek from comment #0)
> The problem is that readahead(2) doesn't do what the manpage claims;
>     "readahead() blocks until the specified data has been read."
> is not true - on newer kernels, it has an artificial limit, which seems to
> be 2MB (512 * 4K pages?).
>

This is actually a misguided man-page expectation: the code readahead(2) relies on has never made such a guarantee. There is a patch upstream to fix the man-page wording: http://www.spinics.net/lists/linux-mm/msg70517.html


> Actual results:
> readahead(2) reads less than 'count' bytes
> 
> Expected results:
> readahead(2) reads exactly 'count' bytes
> 

There has never been a guarantee that readahead(2) would bring exactly 'count' bytes into the page cache. The big change here, rather, is that before the aforementioned commit, codepaths relying on force_page_cache_readahead() would try to read ahead as much as the total amount of free and reclaimable file memory on a given node allowed. After the change, force_page_cache_readahead() itself is capped at a 2MB maximum per call, and that is why you see such values reported for cached memory after the readahead(2) calls in your testcase.
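The cap can also be observed directly by counting how many of the file's pages are actually resident after a readahead(2) call. The sketch below (not from this report; the helper name `bytes_in_cache` is hypothetical) maps the file and queries mincore(2), whose per-page vector has the low bit set for resident pages.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: return how many bytes of fd's first fsize
 * bytes are currently resident in the page cache, by mapping the
 * file and asking mincore(2) which pages are present. Returns -1
 * on error. */
static off_t bytes_in_cache(int fd, off_t fsize)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t npages = (size_t)((fsize + psz - 1) / psz);
    unsigned char *vec = malloc(npages);
    void *map = mmap(NULL, (size_t)fsize, PROT_READ, MAP_SHARED, fd, 0);
    off_t resident = 0;

    if (vec == NULL || map == MAP_FAILED ||
        mincore(map, (size_t)fsize, vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < npages; i++)
        if (vec[i] & 1)            /* low bit set => page is resident */
            resident += psz;
    munmap(map, (size_t)fsize);
    free(vec);
    return resident;
}
```

Running this against the 500MB test file right after a single readahead(2) call should show roughly the 2MB figure discussed above (modulo pages that were already cached for other reasons).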

Regards,
-- Rafael

Comment 13 Larry Woodman 2016-11-17 15:43:08 UTC

*** This bug has been marked as a duplicate of bug 1062288 ***