Bug 1187940

Summary: Regression: Loading memory mapped files does not use the optimal sized (large) I/O any more in kernel 2.6.32-504.3.3.el6.x86_64
Product: Red Hat Enterprise Linux 6 Reporter: loberman <loberman>
Component: kernel Assignee: Rafael Aquini <aquini>
kernel sub component: Memory Management QA Contact: Li Wang <liwan>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aquini, bubrown, coughlan, dhoward, hannsj_uhl, jjaburek, jweiner, lilu, loberman, lwoodman, mmilgram, riel, yanwang
Version: 6.6 Keywords: Regression, ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.32-556.el6 Doc Type: Bug Fix
Doc Text:
Cause: RHEL-6.6 incorporated a patch to fix a readahead() failure condition where max_sane_readahead() would return zero (0) on a CPU whose NUMA node has no local memory (Bug 862177). That patch changed max_sane_readahead() to enforce a hard 2MB ceiling on readahead().
Consequence: Despite fixing that corner case, the change turned out to be detrimental to ordinary workloads that memory map big files and rely on readahead() or posix_fadvise(WILLNEED) syscalls to populate most of the file into the system's cache, because it capped all readahead() calls at 2MB regardless of how much memory is available for caching.
Fix: Bring back the old behavior of max_sane_readahead(), which derived a sensible, adjustable readahead upper limit from the NR_INACTIVE_FILE and NR_FREE_PAGES page counts, while keeping the 2MB ceiling introduced for Bug 862177 for the case where we stumble across a CPU on a memoryless NUMA node.
Result: With the described fix, the readahead performance degradation is no longer observed.
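For illustration only, a hedged sketch of a max_sane_readahead() matching the behavior described above (this is not the actual shipped patch; node_present_pages() is used here as an assumed way to detect a memoryless node):

/* Sketch only, not the shipped RHEL patch: restore the limit derived from
 * NR_INACTIVE_FILE + NR_FREE_PAGES, but keep the 2MB ceiling when the
 * local NUMA node has no memory of its own. */
#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)    /* 512 pages == 2MB */

unsigned long max_sane_readahead(unsigned long nr)
{
	int nid = numa_node_id();
	unsigned long local_pages = node_page_state(nid, NR_INACTIVE_FILE)
				  + node_page_state(nid, NR_FREE_PAGES);

	if (node_present_pages(nid))		/* node has local memory */
		return min(nr, local_pages / 2);

	return min(nr, MAX_READAHEAD);		/* memoryless node: cap at 2MB */
}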
Story Points: ---
Clone Of:
: 1215755 (view as bug list) Environment:
Last Closed: 2015-07-22 08:42:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1215755, 1351353    
Attachments:
Description Flags
systemtap trace of memcpy none

Description loberman 2015-01-31 15:42:09 UTC
Description of problem:
Loading memory mapped files no longer uses the optimal (large) I/O size in kernel 2.6.32-504.3.3.el6.x86_64.
When loading memory mapped buffers, the I/O size is constrained by the /sys/block/sdX/queue/read_ahead_kb value in the new kernel, which causes performance issues.
In kernel 2.6.32-431.29.2.el6.x86_64, loading memory mapped buffers results in the largest I/O possible (constrained only by /sys/block/sdX/queue/max_sectors_kb and not affected by read_ahead_kb in any way).

Version-Release number of selected component (if applicable):
RHEL 6.6 2.6.32-504.3.3.el6.x86_64

How reproducible:
Easily reproducible in the Red Hat lab; supporting data is provided below.

Steps to Reproduce:
1. Boot into the 504 kernel
2. Run the C test program that uses mmap() and posix_fadvise()
3. Measure the time taken and monitor I/O activity; the regression (change in behaviour) is clear

Actual results:
The system limits the optimal I/O size to the /sys/block/sdX/queue/read_ahead_kb value.

Expected results:
Behaves the same as the prior kernel, where the optimal I/O size is not constrained by the /sys/block/sdX/queue/read_ahead_kb value.

Additional info:
queue]# uname -a
2.6.32-431.20.3.el6.x86_64 #1 SMP Fri Jun 6 18:30:54 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

# cd /sys/block/sdg
# cd queue
# cat read_ahead_kb
128
# echo 32 > read_ahead_kb
# cat read_ahead_kb
32

# echo 3 > /proc/sys/vm/drop_caches 

$ time ./mmapexe ./xaa
ret = 0
Size of input file = 1048576000

real	0m6.333s
user	0m0.231s
sys	0m0.269s

# DISK STATISTICS (/sec)
#                   <---------reads---------><---------writes---------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
10:18:12 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
10:18:13 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
10:18:14 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
10:18:15 sdg          54284    235  118  460       0      0    0    0     460   136   197      3   40
10:18:16 sdg         163840    320  320  512       0      0    0    0     512   143   448      3   99
10:18:17 sdg         163840    320  320  512       0      0    0    0     512   143   447      3   99
10:18:18 sdg         163840    320  320  512       0      0    0    0     512   143   447      3   99
10:18:19 sdg         163328    320  319  512       0      0    0    0     512   143   447      3   99
10:18:20 sdg         163840    320  320  512       0      0    0    0     512   143   447      3   99
10:18:21 sdg         151040    155  295  512      12      1    2    6     508   110   445      3   90
10:18:22 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
10:18:23 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0

Note the reads Size column at 512; this happens even with read_ahead_kb=32.

Now switch to the 6.6 kernel, where the regression (changed behaviour) in the newer kernel is clear.
I will be tracing and trying to figure out what is contributing here, but Development engineering will likely know already, hence the BZ.

2.6.32-504.3.3.el6.x86_64 #1 SMP Fri Dec 12 16:05:43 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

# cd /sys/block/sdg/queue
# cat read_ahead_kb
128
# echo 32 > read_ahead_kb
# cat read_ahead_kb
32

echo 3 > /proc/sys/vm/drop_caches

Below we see that the 504 kernel limits the ability to perform I/O at optimal sizes.
This is viewed as a regression by the customer and is preventing them from moving to the 6.6 kernel. It is a critical issue because they need the security fixes in the 6.6 zstream kernel.

$ time ./mmapexe ./xaa
ret = 0
Size of input file = 1048576000

real	0m17.402s                         ~= 3 times slower
user	0m0.304s
sys	0m0.852s

# DISK STATISTICS (/sec)
#                   <---------reads---------><---------writes---------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
10:29:43 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
10:29:44 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
10:29:45 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
10:29:46 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
10:29:47 sdg          13036      2  350   37       0      0    0    0      37     2     1      0   23
10:29:48 sdg          56384      0 1762   32       0      0    0    0      32     1     1      0   99
10:29:49 sdg          59584      0 1862   32       0      0    0    0      32     1     1      0   99
10:29:50 sdg          60384      0 1887   32       0      0    0    0      32     1     1      0   99
10:29:51 sdg          61696      0 1928   32       0      0    0    0      32     1     1      0   99
10:29:52 sdg          59392      0 1856   32       0      0    0    0      32     1     1      0   99
10:29:53 sdg          62816      0 1963   32       0      0    0    0      32     1     0      0   99
10:29:54 sdg          61632      0 1926   32       0      0    0    0      32     1     1      0   99
10:29:55 sdg          58752      0 1836   32       0      0    0    0      32     1     1      0   99
10:29:56 sdg          58976      0 1843   32       0      0    0    0      32     1     1      0   99
10:29:57 sdg          58592      0 1831   32       0      0    0    0      32     1     1      0   99
10:29:58 sdg          57856      0 1808   32       0      0    0    0      32     1     1      0   99
10:29:59 sdg          57920      0 1810   32       0      0    0    0      32     1     1      0   99
10:30:00 sdg          56960      0 1780   32       0      0    0    0      32     1     1      0   99
10:30:01 sdg          57856      0 1808   32       0      0    0    0      32     1     1      0   99
10:30:02 sdg          57664      0 1802   32       0      0    0    0      32     1     1      0   99
10:30:03 sdg          57952      0 1811   32       0      0    0    0      32     1     1      0   99
10:30:04 sdg          58720      0 1835   32       0      0    0    0      32     1     1      0   99
10:30:05 sdg           7616      0  238   32       0      0    0    0      32     1     1      0   14
10:30:06 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0

Comment 2 loberman 2015-01-31 16:04:06 UTC
Test program used to reproduce.

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h> /* mmap() is defined in this header */
#include <fcntl.h>

int err_quit(char *msg)
{
    printf("%s", msg);
    return 0;
}

int main (int argc, char *argv[])
{
 int fdin; 
 char *src; 
 char dst[262144];
 struct stat statbuf;
 int mode = 0x0777;
 unsigned long i;
 long ret;

 if (argc != 2)
   {err_quit ("usage: mmapexe <fromfile>\n");
    return 0;
   }

 /* open the input file */
 if ((fdin = open (argv[1], O_RDONLY)) < 0)
   {printf("can't open %s for reading\n", argv[1]);
    return 0;
   }

 ret=posix_fadvise(fdin,0,0,POSIX_FADV_WILLNEED);
 printf("ret = %ld\n",ret);

 /* find size of input file */
 if (fstat (fdin,&statbuf) < 0)
   {printf ("fstat error\n");
    return 0;
   }
 printf("Size of input file = %ld\n",statbuf.st_size);

 /* mmap the input file */
 if ((src = mmap (0, statbuf.st_size, PROT_READ, MAP_SHARED, fdin, 0))
   == MAP_FAILED)
   {printf ("mmap error for input");
    return 0;
   }

 /* this copies the input file to the data buffer using 256k blocks*/
 
 for (i=0;i< statbuf.st_size-262144; i+=262144) {
 	memcpy (dst, src+i, 262144);
 }
 return 0;

} /* main */
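
For reference, a hedged build-and-run example (the exact commands are not recorded in this bug; the input file name and 1048576000-byte size are taken from the output above):

# assumed build and test-file setup; place the file on a filesystem backed by the device under test
gcc -O2 -o mmapexe mmapexe.c
dd if=/dev/zero of=./xaa bs=1M count=1000     # 1000 MiB = 1048576000 bytes
echo 3 > /proc/sys/vm/drop_caches
time ./mmapexe ./xaa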

Comment 3 loberman 2015-01-31 16:35:14 UTC
Note:

The customer wants the default read_ahead_kb to be small (32), and to control the size when necessary via posix_fadvise().

The customer notes that this is the key issue here:

To reiterate, what we want is readahead to be small (we use 32KB in production), but leverage fadvise if we need to load a big file. In the past that worked since I/O size was only limited by max_sectors_kb. In 504 it does not work anymore, since readahead also limits I/O sizes.
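
For context, a simplified sketch (not verbatim RHEL 6 source) of why fadvise no longer helps in the 504 kernel: POSIX_FADV_WILLNEED funnels into force_page_cache_readahead(), whose request size is clamped by max_sane_readahead(), and in 504 that clamp is a hard 512 pages (2MB), so the bulk of the file is left to ordinary on-demand readahead limited by read_ahead_kb.

/* Simplified sketch of the 2.6.32-504 behaviour; not verbatim kernel source. */

/* mm/fadvise.c: POSIX_FADV_WILLNEED asks for readahead of the advised range */
case POSIX_FADV_WILLNEED:
	nrpages = end_index - start_index + 1;
	force_page_cache_readahead(mapping, file, start_index, nrpages);
	break;

/* mm/readahead.c in 504: every readahead request is capped at 2MB */
#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)	/* 512 pages == 2MB */

unsigned long max_sane_readahead(unsigned long nr)
{
	return min(nr, MAX_READAHEAD);
}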

Comment 4 loberman 2015-01-31 17:24:50 UTC
Created attachment 986545 [details]
systemtap trace of memcpy

SystemTap traced memcpy from the beginning; attaching the trace here.
Will do the same on the earlier kernel and compare.

From perf top, it's clear that memcpy needed to be traced.

+  12.31%         mmapexe  libc-2.12.so        [.] memcpy
+   6.22%            init  [kernel.kallsyms]   [k] intel_idle
+   4.10%            init  [lpfc]              [k] lpfc_sli_intr_handler
+   2.77%         mmapexe  [kernel.kallsyms]   [k] page_fault
+   1.10%         mmapexe  [kernel.kallsyms]   [k] get_page_from_freelist
+   1.08%         mmapexe  [kernel.kallsyms]   [k] find_busiest_group
+   0.94%            init  [kernel.kallsyms]   [k] _spin_unlock_irqrestore

Still trying to figure out where the change happened in 504.

Keep in mind that I have the same glibc when I reproduce this.

Comment 5 loberman 2015-01-31 20:54:43 UTC
systemtap probe of memcpy via the mmapexe test program

mmapexe(6567) memcpy 0
 0xffffffff812985c0 : __memcpy+0x0/0x120 [kernel]
 0xffffffffa034eac7 [lpfc]
 0xffffffff8137ee90 : scsi_done+0x0/0x60 [kernel] (inexact)
 0xffffffffa0351f93 [lpfc] (inexact)
 0xffffffff813883b9 : scsi_setup_fs_cmnd+0x79/0xe0 [kernel] (inexact)
 0xffffffffa026bb11 [sd_mod] (inexact)
 0xffffffff810838fb : round_jiffies_up+0x1b/0x20 [kernel] (inexact)
 0xffffffff812786b2 : blk_add_timer+0x72/0xc0 [kernel] (inexact)
 0xffffffff8137f0e5 : scsi_dispatch_cmd+0xe5/0x310 [kernel] (inexact)
 0xffffffff81387aae : scsi_request_fn+0x5be/0x750 [kernel] (inexact)
 0xffffffff8108748d : del_timer+0x7d/0xe0 [kernel] (inexact)
 0xffffffff81124320 : sync_page_killable+0x0/0x40 [kernel] (inexact)
 0xffffffff8126f5d2 : __generic_unplug_device+0x32/0x40 [kernel] (inexact)
 0xffffffff8126f60e : generic_unplug_device+0x2e/0x50 [kernel] (inexact)
 0xffffffff8126b454 : blk_unplug+0x34/0x70 [kernel] (inexact)
 0xffffffff8126b4a2 : blk_backing_dev_unplug+0x12/0x20 [kernel] (inexact)
 0xffffffff811c32de : block_sync_page+0x3e/0x50 [kernel] (inexact)
 0xffffffff81124308 : sync_page+0x38/0x50 [kernel] (inexact)
 0xffffffff8112432e : sync_page_killable+0xe/0x40 [kernel] (inexact)
 0xffffffff8152ac3f : __wait_on_bit+0x5f/0x90 [kernel] (inexact)

Leads to an ftrace here

mmapexe-10855 [016] 19358.275076: kernel_read <-prepare_binprm
mmapexe-10855 [016] 19358.275076: vfs_read <-kernel_read
..
mmapexe-10855 [016] 19358.275077: do_sync_read <-vfs_read
mmapexe-10855 [016] 19358.275077: generic_file_aio_read <-do_sync_read
mmapexe-10855 [016] 19358.275078: generic_segment_checks <-generic_file_aio_read
mmapexe-10855 [016] 19358.275078: _cond_resched <-generic_file_aio_read
mmapexe-10855 [016] 19358.275078: find_get_page <-generic_file_aio_read
mmapexe-10855 [016] 19358.275078: page_cache_sync_readahead <-generic_file_aio_read
mmapexe-10855 [016] 19358.275078: ondemand_readahead <-page_cache_sync_readahead
mmapexe-10855 [016] 19358.275079: get_init_ra_size <-ondemand_readahead
mmapexe-10855 [016] 19358.275079: ra_submit <-ondemand_readahead
mmapexe-10855 [016] 19358.275079: __do_page_cache_readahead <-ra_submit
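
For context, a hedged sketch (not verbatim 2.6.32 source) of how the traced path sizes its I/O: ondemand_readahead() starts from the per-device read_ahead_kb value (ra->ra_pages) and clamps it through max_sane_readahead(), which is why, once the WILLNEED readahead is exhausted, the remaining page-fault reads come out at the 32KB read_ahead_kb setting.

/* Sketch only; function names as seen in the ftrace above. */
static unsigned long ondemand_readahead(struct address_space *mapping,
			struct file_ra_state *ra, struct file *filp,
			bool hit_readahead_marker, pgoff_t offset,
			unsigned long req_size)
{
	/* ra->ra_pages reflects /sys/block/sdX/queue/read_ahead_kb */
	unsigned long max = max_sane_readahead(ra->ra_pages);

	/* ... sequentiality heuristics elided ... */

	ra->start = offset;
	ra->size = get_init_ra_size(req_size, max);	/* never exceeds max */
	ra->async_size = ra->size;

	return ra_submit(ra, mapping, filp);
}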

Comment 6 loberman 2015-02-03 17:30:26 UTC
On Rafael Aquini's suggestion, I reverted commit 147f390ac0714f36728e95afb328b68ed234b15a.

Applied this patch and built a test kernel.
Behaviour is now back to that of the 6.5 kernel.

diff -Nurp linux-2.6.32-504.3.3.el6.orig/mm/readahead.c linux-2.6.32-504.3.3.el6/mm/readahead.c
--- linux-2.6.32-504.3.3.el6.orig/mm/readahead.c	2014-12-12 15:29:35.000000000 -0500
+++ linux-2.6.32-504.3.3.el6/mm/readahead.c	2015-02-03 11:05:15.103030796 -0500
@@ -228,14 +228,14 @@ int force_page_cache_readahead(struct ad
 	return ret;
 }
 
-#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
 /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, MAX_READAHEAD);
+        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
+                + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
 /*

Tom, we should assign this to kernel; it's likely not storage.

Comment 8 Rafael Aquini 2015-02-04 01:09:35 UTC
Laurence,

Would you mind testing the patch I attached here? I think it might be the way to go without reverting that commit from RHEL (which would re-introduce the issue it was meant to address).

Thanks in advance!
-- Rafael

Comment 9 loberman 2015-02-04 01:33:22 UTC
Yes Sir. Will test it first thing tomorrow.

Comment 10 loberman 2015-02-04 02:51:02 UTC
Hello Rafael
Looks like it's working.

[root@veritas queue]# cat read_ahead_kb
32

[root@veritas queue]# echo 3 > /proc/sys/vm/drop_caches 

[root@veritas queue]# collectl -sD -oT -i1 | grep -e "#" -e sdg
# DISK STATISTICS (/sec)
#                   <---------reads---------><---------writes---------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
21:49:11 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
21:49:12 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
21:49:13 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
21:49:14 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
21:49:15 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
21:49:16 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0
21:49:17 sdg          23568    183   53  445       0      0    0    0     444   130   100      3   20
21:49:18 sdg         162304    320  317  512       0      0    0    0     512   143   418      3   99
21:49:19 sdg         163840    320  320  512       0      0    0    0     512   143   447      3   99
21:49:20 sdg         163840    320  320  512       0      0    0    0     512   143   447      3   99
21:49:21 sdg         163840    320  320  512       0      0    0    0     512   143   447      3   99
21:49:22 sdg         163840    320  320  512       0      0    0    0     512   143   447      3   99
21:49:23 sdg         163840    213  320  512       0      0    0    0     512   125   447      3   99
21:49:24 sdg          18944      0   37  512       0      0    0    0     512    19   442      2    9
21:49:25 sdg              0      0    0    0       0      0    0    0       0     0     0      0    0

[oberman@veritas netezza]$ time ./mmapexe ./xaa
ret = 0
Size of input file = 1048576000

real	0m6.329s
user	0m0.243s
sys	0m0.260s

Comment 11 loberman 2015-02-04 17:37:40 UTC
Rafael,

Customer validated the fix you provided me last night as well.

Most recent comment: On 2015-02-04 12:33:29, Antonov, Nick commented:
"Laurence,

I verified that using kernel-2.6.32-504.3.3.el6.bz1187940.x86_64.rpm the performance is good."

Thanks, so perhaps we need to get it upstream ??

Thanks for all your help

Laurence

Comment 21 RHEL Program Management 2015-04-22 14:59:47 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 23 Kurt Stutsman 2015-04-27 16:14:59 UTC
Patch(es) available on kernel-2.6.32-556.el6

Comment 32 errata-xmlrpc 2015-07-22 08:42:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1272.html