Bug 1187940
Summary: | Regression: Loading memory mapped files does not use the optimal sized (large) I/O any more in kernel 2.6.32-504.3.3.el6.x86_64 | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | loberman <loberman> | ||||
Component: | kernel | Assignee: | Rafael Aquini <aquini> | ||||
kernel sub component: | Memory Management | QA Contact: | Li Wang <liwan> | ||||
Status: | CLOSED ERRATA | Docs Contact: | |||||
Severity: | urgent | ||||||
Priority: | urgent | CC: | aquini, bubrown, coughlan, dhoward, hannsj_uhl, jjaburek, jweiner, lilu, loberman, lwoodman, mmilgram, riel, yanwang | ||||
Version: | 6.6 | Keywords: | Regression, ZStream | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | kernel-2.6.32-556.el6 | Doc Type: | Bug Fix | ||||
Doc Text: |
Cause: RHEL-6.6 has incorporated a patch to fix a readahead() failure condition where max_sane_readahead() would return zero (0) on a CPU whose NUMA node has no local memory (Bug 862177). That patch forced a hard ceiling of 2MB for readahead(), as it changed max_sane_readahead() to fix Bug 862177
Consequence: The aforementioned change, despite fixing that corner case, turned out to show itself detrimental to other ordinary workloads that memory map big files and rely on readahead() or posix_fadvise(WILLNEED) syscalls to get most of the file populating system's cache because it hard capped all system readahead() calls to only 2MB, regardless how much memory is potentially available for caching.
Fix: Bring back the old behavior of max_sane_readahead() where we used to consider NR_INACTIVE_FILE and NR_FREE_PAGES pages to derive a sensible / adujstable readahead upper limit, while keeping the 2 mB ceiling scheme introduced as a fix for bug 862177, when we stumble across CPU on a memoryless NUMA node.
Result: Introducing the described fix, the described issue with regard to readahead degraded performance is no longer observed.
|
Story Points: | --- | ||||
Clone Of: | |||||||
: | 1215755 (view as bug list) | Environment: | |||||
Last Closed: | 2015-07-22 08:42:04 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1215755, 1351353 | ||||||
Attachments: |
|
Description
loberman
2015-01-31 15:42:09 UTC
Test program used to reproduce. #include <stdio.h> #include <unistd.h> #include <string.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> /* mmap() is defined in this header */ #include <fcntl.h> int err_quit(char *msg) { printf(msg); return 0; } int main (int argc, char *argv[]) { int fdin; char *src; char dst[262144]; struct stat statbuf; int mode = 0x0777; unsigned long i; long ret; if (argc != 2) err_quit ("usage: a.out <fromfile> <tofile>\n"); /* open the input file */ if ((fdin = open (argv[1], O_RDONLY)) < 0) {printf("can't open %s for reading\n", argv[1]); return 0; } ret=posix_fadvise(fdin,0,0,POSIX_FADV_WILLNEED); printf("ret = %ld\n",ret); /* find size of input file */ if (fstat (fdin,&statbuf) < 0) {printf ("fstat error\n"); return 0; } printf("Size of input file = %ld\n",statbuf.st_size); /* mmap the input file */ if ((src = mmap (0, statbuf.st_size, PROT_READ, MAP_SHARED, fdin, 0)) == (caddr_t) -1) {printf ("mmap error for input"); return 0; } /* this copies the input file to the data buffer using 256k blocks*/ for (i=0;i< statbuf.st_size-262144; i+=262144) { memcpy (dst, src+i, 262144); } return 0; } /* main */ Note: The customer wants the default read_ahead_kb to be small 32, and control the size when necessary via posix_fadvise(). Customers note that is the key issue here To reiterate, what we want is readahead to be small (we use 32KB in production), but leverage fadvise if we need to load a big file. In the past that worked since I/O size was only limited by max_sectors_kb. In 504 it does not work anymore, since readahead also limits I/O sizes. Created attachment 986545 [details]
systemtap trace of memcpy
Systemtap Traced memcpy from the beginning and attaching here.
Will do that same on the earlier kernel and compare.
From perf top, its clear memcpy was needed to be traced.
+ 12.31% mmapexe libc-2.12.so [.] memcpy
+ 6.22% init [kernel.kallsyms] [k] intel_idle
+ 4.10% init [lpfc] [k] lpfc_sli_intr_handler
+ 2.77% mmapexe [kernel.kallsyms] [k] page_fault
+ 1.10% mmapexe [kernel.kallsyms] [k] get_page_from_freelist
+ 1.08% mmapexe [kernel.kallsyms] [k] find_busiest_group
+ 0.94% init [kernel.kallsyms] [k] _spin_unlock_irqrestore
Still trying to figure out where the change happened in 504.
Keep in mind that I have the same glibc when I reproduce this.
systemtap probe of memcpy via the mmapexe test program mmapexe(6567) memcpy 0 0xffffffff812985c0 : __memcpy+0x0/0x120 [kernel] 0xffffffffa034eac7 [lpfc] 0xffffffff8137ee90 : scsi_done+0x0/0x60 [kernel] (inexact) 0xffffffffa0351f93 [lpfc] (inexact) 0xffffffff813883b9 : scsi_setup_fs_cmnd+0x79/0xe0 [kernel] (inexact) 0xffffffffa026bb11 [sd_mod] (inexact) 0xffffffff810838fb : round_jiffies_up+0x1b/0x20 [kernel] (inexact) 0xffffffff812786b2 : blk_add_timer+0x72/0xc0 [kernel] (inexact) 0xffffffff8137f0e5 : scsi_dispatch_cmd+0xe5/0x310 [kernel] (inexact) 0xffffffff81387aae : scsi_request_fn+0x5be/0x750 [kernel] (inexact) 0xffffffff8108748d : del_timer+0x7d/0xe0 [kernel] (inexact) 0xffffffff81124320 : sync_page_killable+0x0/0x40 [kernel] (inexact) 0xffffffff8126f5d2 : __generic_unplug_device+0x32/0x40 [kernel] (inexact) 0xffffffff8126f60e : generic_unplug_device+0x2e/0x50 [kernel] (inexact) 0xffffffff8126b454 : blk_unplug+0x34/0x70 [kernel] (inexact) 0xffffffff8126b4a2 : blk_backing_dev_unplug+0x12/0x20 [kernel] (inexact) 0xffffffff811c32de : block_sync_page+0x3e/0x50 [kernel] (inexact) 0xffffffff81124308 : sync_page+0x38/0x50 [kernel] (inexact) 0xffffffff8112432e : sync_page_killable+0xe/0x40 [kernel] (inexact) 0xffffffff8152ac3f : __wait_on_bit+0x5f/0x90 [kernel] (inexact) Leads to an ftrace here mmapexe-10855 [016] 19358.275076: kernel_read <-prepare_binprm mmapexe-10855 [016] 19358.275076: vfs_read <-kernel_read .. mmapexe-10855 [016] 19358.275077: do_sync_read <-vfs_read mmapexe-10855 [016] 19358.275077: generic_file_aio_read <-do_sync_read mmapexe-10855 [016] 19358.275078: generic_segment_checks <-generic_file_aio_read mmapexe-10855 [016] 19358.275078: _cond_resched <-generic_file_aio_read mmapexe-10855 [016] 19358.275078: find_get_page <-generic_file_aio_read mmapexe-10855 [016] 19358.275078: page_cache_sync_readahead <-generic_file_aio_read mmapexe-10855 [016] 19358.275078: ondemand_readahead <-page_cache_sync_readahead mmapexe-10855 [016] 19358.275079: get_init_ra_size <-ondemand_readahead mmapexe-10855 [016] 19358.275079: ra_submit <-ondemand_readahead mmapexe-10855 [016] 19358.275079: __do_page_cache_readahead <-ra_submit On Rafael Aquini's suggestion I reverted 147f390ac0714f36728e95afb328b68ed234b15a Applied this patch and built a test kernel. Behaviour is now back to the 6/5 kernel diff -Nurp linux-2.6.32-504.3.3.el6.orig/mm/readahead.c linux-2.6.32-504.3.3.el6/mm/readahead.c --- linux-2.6.32-504.3.3.el6.orig/mm/readahead.c 2014-12-12 15:29:35.000000000 -0500 +++ linux-2.6.32-504.3.3.el6/mm/readahead.c 2015-02-03 11:05:15.103030796 -0500 @@ -228,14 +228,14 @@ int force_page_cache_readahead(struct ad return ret; } -#define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE) /* * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a * sensible upper limit. */ unsigned long max_sane_readahead(unsigned long nr) { - return min(nr, MAX_READAHEAD); + return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE) + + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2); } /* Tom, we should aign this to kernel, its likely not storage Laurence, Would you mind testing the patch I attached here? I think it might be the way to go without reverting that commit from RHEL (and re-introduce the issue it was meant to address). Thanks in advance! -- Rafael Yes Sir.Will test it first thing tomorrow. Hello Rafael Looks like it's working. [root@veritas queue]# cat read_ahead_kb 32 [root@veritas queue]# echo 3 > /proc/sys/vm/drop_caches [root@veritas queue]# collectl -sD -oT -i1 | grep -e "#" -e sdg # DISK STATISTICS (/sec) # <---------reads---------><---------writes---------><--------averages--------> Pct #Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util 21:49:11 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:12 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:13 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:14 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:15 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:16 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:17 sdg 23568 183 53 445 0 0 0 0 444 130 100 3 20 21:49:18 sdg 162304 320 317 512 0 0 0 0 512 143 418 3 99 21:49:19 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 21:49:20 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 21:49:21 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 21:49:22 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 21:49:23 sdg 163840 213 320 512 0 0 0 0 512 125 447 3 99 21:49:24 sdg 18944 0 37 512 0 0 0 0 512 19 442 2 9 21:49:25 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 [oberman@veritas netezza]$ time ./mmapexe ./xaa ret = 0 Size of input file = 1048576000 real 0m6.329s user 0m0.243s sys 0m0.260s Rafael, Customer validated the fix you provided me last night as well. Most recent comment: On 2015-02-04 12:33:29, Antonov, Nick commented: "Laurence, I verified that using kernel-2.6.32-504.3.3.el6.bz1187940.x86_64.rpm the performance is good." Thanks, so perhaps we need to get it upstream ?? Thanks for all your help Laurence This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release. Patch(es) available on kernel-2.6.32-556.el6 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1272.html |