Red Hat Bugzilla – Bug 1187940
Regression: Loading memory mapped files does not use the optimal sized (large) I/O any more in kernel 2.6.32-504.3.3.el6.x86_64
Last modified: 2016-07-13 03:59:04 EDT
Description of problem: Loading memory mapped files does not use the optimal sized (large) I/O any more in kernel 2.6.32-504.3.3.el6.x86_64. When loading memory mapped buffers the I/O size is constrained by the /sys/block/sdX/queue/read_ahead_kb value in the new kernel, which generates performance issues. While in kernel 2.6.32-431.29.2.el6.x86_64 loading memory mapped buffers results in the largest I/O possible (only constrained by /sys/block/sdX/queue/max_sectors_kb and not impacted by read_ahead_kb in any way. Version-Release number of selected component (if applicable): RHEL 6.6 2.6.32-504.3.3.el6.x86_64 How reproducible: Easily reproducible in the Red Hat lab, supporting data provided below. Steps to Reproduce: 1. Boot into the 504 kernel 2. Run the C test program that uses mmap() and posix_fadvise() 3. Measure the time taken and monitor I/O activity and the regression (change in behaviour) is clear Actual results: System limits the optimal I/O size by the size of the /sys/block/sdX/queue/read_ahead_kb value. Expected results: Behaves same as the prior kernel where the optimal I/O size is not constrained by /sys/block/sdX/queue/read_ahead_kb value. Additional info: queue]# uname -a 2.6.32-431.20.3.el6.x86_64 #1 SMP Fri Jun 6 18:30:54 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux # cd /sys/block/sdg # cd queue # cat read_ahead_kb 128 # echo 32 > read_ahead_kb # cat read_ahead_kb 32 # echo 3 > /proc/sys/vm/drop_caches $ time ./mmapexe ./xaa ret = 0 Size of input file = 1048576000 real 0m6.333s user 0m0.231s sys 0m0.269s # DISK STATISTICS (/sec) # <---------reads---------><---------writes---------><--------averages--------> Pct #Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util 10:18:12 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 10:18:13 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 10:18:14 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 10:18:15 sdg 54284 235 118 460 0 0 0 0 460 136 197 3 40 10:18:16 sdg 163840 320 320 512 0 0 0 0 512 143 448 3 99 10:18:17 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 10:18:18 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 10:18:19 sdg 163328 320 319 512 0 0 0 0 512 143 447 3 99 10:18:20 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 10:18:21 sdg 151040 155 295 512 12 1 2 6 508 110 445 3 90 10:18:22 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 10:18:23 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 Note the reads Size at 512, this happens even with the read_ahead_kb=32 Now switch to the 6.6 kernel and clearly see the regression (changed behaviour) here in the newer kernel I will be tracing and looking at figuring out what is contributing here but Develepment engineering will likely know already, hence the BZ. 2.6.32-504.3.3.el6.x86_64 #1 SMP Fri Dec 12 16:05:43 EST 2014 x86_64 x86_64 x86_64 GNU/Linux # cd /sys/block/sdg/queue # cat read_ahead_kb 128 # echo 32 > read_ahead_kb # cat read_ahead_kb 32 echo 3 > /proc/sys/vm/drop_caches Below we see the 504 kernel limits the ability to perform I/O at optimal sizes. This is viewed as a regression by the customer and is preventing them from moving to the 6.6 kernel and is a critical issue because they need the security fixes in the 6.6 zstream kernel $ time ./mmapexe ./xaa ret = 0 Size of input file = 1048576000 real 0m17.402s ~= 3 times slower user 0m0.304s sys 0m0.852s # DISK STATISTICS (/sec) # <---------reads---------><---------writes---------><--------averages--------> Pct #Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util 10:29:43 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 10:29:44 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 10:29:45 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 10:29:46 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 10:29:47 sdg 13036 2 350 37 0 0 0 0 37 2 1 0 23 10:29:48 sdg 56384 0 1762 32 0 0 0 0 32 1 1 0 99 10:29:49 sdg 59584 0 1862 32 0 0 0 0 32 1 1 0 99 10:29:50 sdg 60384 0 1887 32 0 0 0 0 32 1 1 0 99 10:29:51 sdg 61696 0 1928 32 0 0 0 0 32 1 1 0 99 10:29:52 sdg 59392 0 1856 32 0 0 0 0 32 1 1 0 99 10:29:53 sdg 62816 0 1963 32 0 0 0 0 32 1 0 0 99 10:29:54 sdg 61632 0 1926 32 0 0 0 0 32 1 1 0 99 10:29:55 sdg 58752 0 1836 32 0 0 0 0 32 1 1 0 99 10:29:56 sdg 58976 0 1843 32 0 0 0 0 32 1 1 0 99 10:29:57 sdg 58592 0 1831 32 0 0 0 0 32 1 1 0 99 10:29:58 sdg 57856 0 1808 32 0 0 0 0 32 1 1 0 99 10:29:59 sdg 57920 0 1810 32 0 0 0 0 32 1 1 0 99 10:30:00 sdg 56960 0 1780 32 0 0 0 0 32 1 1 0 99 10:30:01 sdg 57856 0 1808 32 0 0 0 0 32 1 1 0 99 10:30:02 sdg 57664 0 1802 32 0 0 0 0 32 1 1 0 99 10:30:03 sdg 57952 0 1811 32 0 0 0 0 32 1 1 0 99 10:30:04 sdg 58720 0 1835 32 0 0 0 0 32 1 1 0 99 10:30:05 sdg 7616 0 238 32 0 0 0 0 32 1 1 0 14 10:30:06 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0
Test program used to reproduce. #include <stdio.h> #include <unistd.h> #include <string.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> /* mmap() is defined in this header */ #include <fcntl.h> int err_quit(char *msg) { printf(msg); return 0; } int main (int argc, char *argv[]) { int fdin; char *src; char dst[262144]; struct stat statbuf; int mode = 0x0777; unsigned long i; long ret; if (argc != 2) err_quit ("usage: a.out <fromfile> <tofile>\n"); /* open the input file */ if ((fdin = open (argv[1], O_RDONLY)) < 0) {printf("can't open %s for reading\n", argv[1]); return 0; } ret=posix_fadvise(fdin,0,0,POSIX_FADV_WILLNEED); printf("ret = %ld\n",ret); /* find size of input file */ if (fstat (fdin,&statbuf) < 0) {printf ("fstat error\n"); return 0; } printf("Size of input file = %ld\n",statbuf.st_size); /* mmap the input file */ if ((src = mmap (0, statbuf.st_size, PROT_READ, MAP_SHARED, fdin, 0)) == (caddr_t) -1) {printf ("mmap error for input"); return 0; } /* this copies the input file to the data buffer using 256k blocks*/ for (i=0;i< statbuf.st_size-262144; i+=262144) { memcpy (dst, src+i, 262144); } return 0; } /* main */
Note: The customer wants the default read_ahead_kb to be small 32, and control the size when necessary via posix_fadvise(). Customers note that is the key issue here To reiterate, what we want is readahead to be small (we use 32KB in production), but leverage fadvise if we need to load a big file. In the past that worked since I/O size was only limited by max_sectors_kb. In 504 it does not work anymore, since readahead also limits I/O sizes.
Created attachment 986545 [details] systemtap trace of memcpy Systemtap Traced memcpy from the beginning and attaching here. Will do that same on the earlier kernel and compare. From perf top, its clear memcpy was needed to be traced. + 12.31% mmapexe libc-2.12.so [.] memcpy + 6.22% init [kernel.kallsyms] [k] intel_idle + 4.10% init [lpfc] [k] lpfc_sli_intr_handler + 2.77% mmapexe [kernel.kallsyms] [k] page_fault + 1.10% mmapexe [kernel.kallsyms] [k] get_page_from_freelist + 1.08% mmapexe [kernel.kallsyms] [k] find_busiest_group + 0.94% init [kernel.kallsyms] [k] _spin_unlock_irqrestore Still trying to figure out where the change happened in 504. Keep in mind that I have the same glibc when I reproduce this.
systemtap probe of memcpy via the mmapexe test program mmapexe(6567) memcpy 0 0xffffffff812985c0 : __memcpy+0x0/0x120 [kernel] 0xffffffffa034eac7 [lpfc] 0xffffffff8137ee90 : scsi_done+0x0/0x60 [kernel] (inexact) 0xffffffffa0351f93 [lpfc] (inexact) 0xffffffff813883b9 : scsi_setup_fs_cmnd+0x79/0xe0 [kernel] (inexact) 0xffffffffa026bb11 [sd_mod] (inexact) 0xffffffff810838fb : round_jiffies_up+0x1b/0x20 [kernel] (inexact) 0xffffffff812786b2 : blk_add_timer+0x72/0xc0 [kernel] (inexact) 0xffffffff8137f0e5 : scsi_dispatch_cmd+0xe5/0x310 [kernel] (inexact) 0xffffffff81387aae : scsi_request_fn+0x5be/0x750 [kernel] (inexact) 0xffffffff8108748d : del_timer+0x7d/0xe0 [kernel] (inexact) 0xffffffff81124320 : sync_page_killable+0x0/0x40 [kernel] (inexact) 0xffffffff8126f5d2 : __generic_unplug_device+0x32/0x40 [kernel] (inexact) 0xffffffff8126f60e : generic_unplug_device+0x2e/0x50 [kernel] (inexact) 0xffffffff8126b454 : blk_unplug+0x34/0x70 [kernel] (inexact) 0xffffffff8126b4a2 : blk_backing_dev_unplug+0x12/0x20 [kernel] (inexact) 0xffffffff811c32de : block_sync_page+0x3e/0x50 [kernel] (inexact) 0xffffffff81124308 : sync_page+0x38/0x50 [kernel] (inexact) 0xffffffff8112432e : sync_page_killable+0xe/0x40 [kernel] (inexact) 0xffffffff8152ac3f : __wait_on_bit+0x5f/0x90 [kernel] (inexact) Leads to an ftrace here mmapexe-10855 [016] 19358.275076: kernel_read <-prepare_binprm mmapexe-10855 [016] 19358.275076: vfs_read <-kernel_read .. mmapexe-10855 [016] 19358.275077: do_sync_read <-vfs_read mmapexe-10855 [016] 19358.275077: generic_file_aio_read <-do_sync_read mmapexe-10855 [016] 19358.275078: generic_segment_checks <-generic_file_aio_read mmapexe-10855 [016] 19358.275078: _cond_resched <-generic_file_aio_read mmapexe-10855 [016] 19358.275078: find_get_page <-generic_file_aio_read mmapexe-10855 [016] 19358.275078: page_cache_sync_readahead <-generic_file_aio_read mmapexe-10855 [016] 19358.275078: ondemand_readahead <-page_cache_sync_readahead mmapexe-10855 [016] 19358.275079: get_init_ra_size <-ondemand_readahead mmapexe-10855 [016] 19358.275079: ra_submit <-ondemand_readahead mmapexe-10855 [016] 19358.275079: __do_page_cache_readahead <-ra_submit
On Rafael Aquini's suggestion I reverted 147f390ac0714f36728e95afb328b68ed234b15a Applied this patch and built a test kernel. Behaviour is now back to the 6/5 kernel diff -Nurp linux-2.6.32-504.3.3.el6.orig/mm/readahead.c linux-2.6.32-504.3.3.el6/mm/readahead.c --- linux-2.6.32-504.3.3.el6.orig/mm/readahead.c 2014-12-12 15:29:35.000000000 -0500 +++ linux-2.6.32-504.3.3.el6/mm/readahead.c 2015-02-03 11:05:15.103030796 -0500 @@ -228,14 +228,14 @@ int force_page_cache_readahead(struct ad return ret; } -#define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE) /* * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a * sensible upper limit. */ unsigned long max_sane_readahead(unsigned long nr) { - return min(nr, MAX_READAHEAD); + return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE) + + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2); } /* Tom, we should aign this to kernel, its likely not storage
Laurence, Would you mind testing the patch I attached here? I think it might be the way to go without reverting that commit from RHEL (and re-introduce the issue it was meant to address). Thanks in advance! -- Rafael
Yes Sir.Will test it first thing tomorrow.
Hello Rafael Looks like it's working. [root@veritas queue]# cat read_ahead_kb 32 [root@veritas queue]# echo 3 > /proc/sys/vm/drop_caches [root@veritas queue]# collectl -sD -oT -i1 | grep -e "#" -e sdg # DISK STATISTICS (/sec) # <---------reads---------><---------writes---------><--------averages--------> Pct #Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util 21:49:11 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:12 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:13 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:14 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:15 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:16 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 21:49:17 sdg 23568 183 53 445 0 0 0 0 444 130 100 3 20 21:49:18 sdg 162304 320 317 512 0 0 0 0 512 143 418 3 99 21:49:19 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 21:49:20 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 21:49:21 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 21:49:22 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99 21:49:23 sdg 163840 213 320 512 0 0 0 0 512 125 447 3 99 21:49:24 sdg 18944 0 37 512 0 0 0 0 512 19 442 2 9 21:49:25 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0 [oberman@veritas netezza]$ time ./mmapexe ./xaa ret = 0 Size of input file = 1048576000 real 0m6.329s user 0m0.243s sys 0m0.260s
Rafael, Customer validated the fix you provided me last night as well. Most recent comment: On 2015-02-04 12:33:29, Antonov, Nick commented: "Laurence, I verified that using kernel-2.6.32-504.3.3.el6.bz1187940.x86_64.rpm the performance is good." Thanks, so perhaps we need to get it upstream ?? Thanks for all your help Laurence
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
Patch(es) available on kernel-2.6.32-556.el6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1272.html