Bug 604779

Summary:

Page out activity when there is no current VM load

Product:

Red Hat Enterprise Linux 5

Reporter:

Bill Braswell <bbraswel>

Component:

kernel

Assignee:

Larry Woodman <lwoodman>

Status:

CLOSED ERRATA

QA Contact:

Zhouping Liu <zliu>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.3

CC:

bhamrick, jbrier, jwest, nobody+295318, qcai, tao, zliu

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-01-13 21:37:53 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
the script test.sh creates the log	none
logfile when no patch	none
malloc_over.c program	none

Description Bill Braswell 2010-06-16 17:50:11 UTC

They start up multiple java programs.  Then there are several very large file copy operations each around 50 megabytes.  The system ends up with a large amount of memory tied up in pagecache.  About 30 seconds after the file copies end, there is a short burst of page out activity even though there is close to 3 gigabytes of free memory, well above min_free_kbytes, and the system itself is effectively idle.

The customer is wanting to know why the system is paging out their java applications when there is so much free memory?  And why it is not recovering memory from pagecache? 

After the burst, there seems to be very little paging out.  As expected, there is some page in activity afterwards.  But I see no evidence of thrashing.

kernel version: 2.6.18-194.el5


Sample of vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0  4    292 2892956   5240 2015592    0    0    76 30864 1111 7318  0  1 65 34  0
0  6    292 2918864   5236 1989920    0    0    13 29668 1135 6728  0  1 63 36  0
0  3    292 2943704   5232 1964840    0    0     8 32676 1115 6896  1  1 64 34  0
0  4    292 2977148   5368 1930016    0    0   414 27416 1186 7690  0  2 70 27  0
0  2    292 3011456   5320 1896364    0    0    62 37368 1176 6074  0  1 80 19  0
0  1    292 3043528   5252 1864336    0    0     9 27117 1182 3601  0  1 81 18  0
0  1    292 3045772   5256 1861936    0    0    49    20 1021  871  0  0 87 13  0
0  1    292 3048656   5264 1859264    0    0    44    12 1040  927  0  0 87 13  0
0  1    292 3053276   5284 1854808    0    0     9   223 1031  891  0  0 87 13  0
0  1    292 3056436   5312 1851708    0    0     9    79 1045  919  1  0 86 13  0
1  0    292 3060572   5260 1847752    0    0    13   100 1033  922  0  1 86 13  0
0  2  73372 3108212   5256 1847068    0 73080     8 73094 1262 1017  0  3 86 11  0
0  1 171428 3210032   5256 1845956    0 98056     4 98056 1471 1069  0  2 84 14  0
0  5 267604 3305248   5280 1845768   32 96164    51 96348 1509 1125  0  2 77 21  0
0  3 332984 3390068   5348 1846744   64 65376   259 65472 1508 1467  0  2 67 31  0
0  1 389804 3449460   5372 1846912    0 56820    98 57574 1468 1603  0  2 79 19  0
0  1 403136 3463360   5372 1846832    0 13332     0 13332 1080 1167  0  1 87 12  0
0  1 410360 3471196   5376 1846896    0 7224     4  7240 1069 1102  0  0 87 12  0
1  1 414076 3475660   5360 1846568    0 3716    32  4000 1045 1045  0  1 86 13  0
0  3 490996 3531708   5304 1844736   32 76932    58 76988 1287 1160  1  3 82 14  0
0  3 578964 3622892   5328 1844612  224 87968   229 88110 1449 1158  0  2 63 35  0

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0  3 674496 3717112   5348 1844072  512 95532   524 95540 1630 1344  0  2 75 23  0
1  2 740268 3801612   5424 1842796  344 65812   527 65812 1976 1981  1  3 73 23  0
0  2 743212 3804776   5640 1843516  844 2944  1784  3162 1345 1197  0  0 93  7  0
0  0 743212 3803744   5780 1843908  504    0   640    12 1060  990  0  0 98  2  0
0  0 743196 3802240   5944 1844900   64    0   982    56 1113 1351  0  1 96  3  0
0  0 743196 3801752   6020 1845028  128    0   324     0 1039  933  0  0 99  1  0
0  0 743196 3801628   6100 1845096   64    0   136    72 1046  929  0  0 99  1  0
0  0 743196 3802088   6164 1845056   32    0    88   772 1037  938  0  0 98  1  0
0  0 743196 3801696   6172 1845108   96    0    96    68 1034  908  1  0 99  0  0
0  0 743196 3801956   6188 1845184    0    0     0    43 1011  878  0  0 100  0  0
0  0 743196 3801956   6188 1845216    0    0     0     0 1028  872  0  0 100  0  0
2  0 743196 3801928   6244 1845260  256    0   424    32 1040 1053  0  0 98  1  0
0  1 743196 3801532   6256 1845392  128    0   132   216 1037  890  0  0 99  1  0
0  0 743196 3800316   6288 1846476  224    0   832    12 1063  919  0  0 98  1  0
0  0 743196 3799448   6308 1846516   64    0    66    64 1036 1222  0  1 99  1  0
0  0 743196 3799856   6308 1846652    0    0     0     0 1006  871  0  0 100  0  0
0  0 743196 3800008   6308 1846632   32    0    32     0 1027  925  0  0 100  0  0
0  0 743196 3799568   6332 1846676    0    0    20   844 1031  929  0  0 99  1  0
0  0 743196 3799568   6332 1846744   96    0    96     0 1027  908  1  0 99  0  0
0  0 743196 3799968   6352 1846772    0    0     0    87 1017  890  0  0 99  1  0
0  0 743196 3799872   6352 1846848    0    0     0     0 1026  883  0  0 100  0  0

Comment 2 Issue Tracker 2010-07-16 00:52:29 UTC

Event posted on 06-16-2010 03:46pm EDT by jbrier

"Has there been any attempt to reproduce this on an ext3 filesystem?  If
it is reproducible on ext3, then there would be no need to keep Symantec
involved."

Ya, I pushed back on this in the email, they are working on it.


"Is there a measurable performance problem here?  Or is this trying to
interpret some unusual system statistics?"

"In the testing over the weekend, the parameter suggested by Symantec to
change was modified.  From what I read, the system responded as expected. 
I/O wait time dropped.  It wasn't until a higher load was placed on the
system that is showed any of the signs.  Is that load something that would
be expected in a normal production environment?"

They kind of touched on this previously

======
However, throughout the testing I
did observe that the CPU IO wait % was a lot lower than before the
throttling was set. It tended to hold around 25% during copy, with the 
CPU
over 50% idle. Even during the swap dumps, IO wait% remained below 50%.
This is in contrast to the 60% IO wait seen without the throttling during
a copy, and 90% IO wait following a swap dump. Given this, I'll put the
question to our apps team to see if they can measure the response 
times of
the retrieval process within the Java apps, and whether we're at a point
where we have alleviated the issue of the retrievals timing out.
===


Which I followed up on. Here is a description of the actual problem,
without complaining about cache usage, or swap usage:

===
As for the actual problem. There are 2 types of actions occurring on the 
hosts. We are performing a "retrieval" operation, where a client is 
connecting to our Tomcat/Java processes and reading a file from our VxFS 
filesystems. I have been told we get up to 90,000 requests a day. The 
second is a "load" operation, where we are copying 50mb data files from
a 
share to the VxFS filesystems. This loading occurs on demand as the files

are made available. This usually occurs in batches.

We were able to perform both of these operations from a single SuSE host 
with 4gb of RAM for several years without noticing any performance issues.

After we migrated to the new Redhat 5.5 servers with all new hardware and

8gb RAM we started to experience timeouts within the retrieval process 
during or soon after a load process commenced. A timeout means that a 
client requested a data file, and the Tomcat was unable to fulfill the 
request within a period.

When looking at the server when these retrieval timeouts were occurring we

noticed that large amounts of data were being instantly dumped to swap, 
and the server was reporting high CPU IO wait %, in the range of 50-90%, 
and this high IO wait was persisting for long periods, effectively 
rendering our retrieval app non-operational.
===

Bill, I'll wait for your update after the out of band email thread with
Larry Woodman from today. It looks like we figured out some good stuff
about one of the NUMA nodes being very low, which caused page reclaiming
even though other NUMA nodes were not that low.

John Brier



This event sent from IssueTracker by jwest 
 issue 865263

Comment 3 Issue Tracker 2010-07-16 00:52:33 UTC

Event posted on 06-16-2010 05:52pm EDT by bbraswel

The system is acting as if it is low on memory, because from a NUMA
(Non-Uniform Memory Architecture) stand point, it is.  In NUMA, memory of
each node is managed semi-independently of the other nodes.

Once page reclamation begins the system continues to reclaim pages until
all nodes are no longer memory starved.  In the data we have looked at the
free memory on node 0 is well below min_free_kbytes.  The system will
continue to preform page recovery until free memory on node 0 is above
min_free_kbytes.  Even at the expense of other nodes.

This can cause swapping on systems where the overall total free count is
huge.

While doing memory recovery, it will check each node to see if free memory
is above the minimum.  It will continue until all nodes are.  The problem
is, it continues to do the recovery operations on all node, even those
that are not low on memory.  This can cause pages to be swapped out that
should not have been.

We are looking at a patch from upstream that checks a node before
performing recovery no that node.  Since all DMA memory does reside in
node 0, and that is the node having memory starvation issues, we would
still like to see if it can be reproduced on ext3.  This could eliminate
vxfs as being a concern here.


Bill


This event sent from IssueTracker by jwest 
 issue 865263

Comment 4 Issue Tracker 2010-07-16 00:52:35 UTC

Event posted on 06-21-2010 02:01pm EDT by cmarcant

Hello.  John is out of the office for the next week so I am watching this
case for him...

The patch that I saw from Larry's e-mail looks like it was already
backported to RHEL5 from what I can see/tell.  Line numbers appear to
match up to a current RHEL5 source tree, and luckily it's basically a 1
line patch (plus comments) so it's not too hard to drop into place as
appropriate if you're working off a different tree for some reason.

Is that likely our best plan of action?  Do you want me to just hand build
a kernel at my desk with that patch applied and have the customer try it? 
Would you rather push a build through brew and point me to that package?

The customer is currently working on securing more disk space for a test
on ext3, but in the meantime is a bit anxious to see an update from our
end and where we are at in regards to this issue.  Any preference for how
we go about having them test this patch?

Chris


This event sent from IssueTracker by jwest 
 issue 865263

Comment 5 Issue Tracker 2010-07-16 00:52:36 UTC

Event posted on 06-22-2010 10:39am EDT by cmarcant

Hmmmm.  I tried to build a kernel here at my desk last night with the
following patch (applied against the -194 tree):

[chris@virt5-64 2.6.18]$ cat SOURCES/linux-kernel-test.patch
diff -up linux-2.6.18.x86_64/mm/vmscan.c.ORIG
linux-2.6.18.x86_64/mm/vmscan.c
--- linux-2.6.18.x86_64/mm/vmscan.c.ORIG        2010-06-21
19:03:43.000000000 -0400
+++ linux-2.6.18.x86_64/mm/vmscan.c     2010-06-21 19:06:20.000000000
-0400
@@ -1221,7 +1221,13 @@ scan:
                        temp_priority[i] = priority;
                        sc.nr_scanned = 0;
                        note_zone_scanning_priority(zone, priority);
-                       shrink_zone(priority, zone, &sc);
+                       /*
+                        * We put equal pressure on every zone, unless
one
+                        * zone has way too many pages free already.
+                        */
+                       if (!zone_watermark_ok(zone, order,
+                                       8*zone->pages_high, end_zone, 0))
+                               shrink_zone(priority, zone, &sc);
                        reclaim_state->reclaimed_slab = 0;
                        nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
                                                lru_pages);
[chris@virt5-64 2.6.18]$

Unfortunately, apparently a little over 3G isn't enough space to properly
build a kernel and I ran out of space on the device, so the build failed. 
Arg.  I'm trying to see if I can get this going on another system in the
meantime.


This event sent from IssueTracker by jwest 
 issue 865263

Comment 6 Issue Tracker 2010-07-16 00:52:38 UTC

Event posted on 06-22-2010 05:27pm EDT by bbraswel

Chris's patch is finished.  I have not been able to test the patch as of
yet.  It is available at:

  https://brewweb.devel.redhat.com/taskinfo?taskID=2540532

Please note, this patch will not prevent page recovery when there is a
high overall total free memory.  What it does, is prevent page recovery on
NUMA nodes that have enough free memory.  But there will still be page
recovery operations on NUMA nodes that have a low free page count.


Bill

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by jwest 
 issue 865263

Comment 7 Issue Tracker 2010-07-16 00:52:40 UTC

Event posted on 06-24-2010 04:07pm EDT by cmarcant

File uploaded: Flook_06_24_10-ext3.tar.gz

This event sent from IssueTracker by jwest 
 issue 865263
it_file 803043

Comment 8 Issue Tracker 2010-07-16 00:52:41 UTC

Event posted on 06-25-2010 04:05pm EDT by cmarcant

File uploaded: Flook_06_25_10-kernel.tar.gz

This event sent from IssueTracker by jwest 
 issue 865263
it_file 807513

Comment 9 Issue Tracker 2010-07-16 00:52:43 UTC

Event posted on 06-25-2010 04:05pm EDT by cmarcant

One more update from the customer today.  He has tested the patched kernel
with VXFS:

<snip>

I did some testing on the new test kernel. I still observed some swapping
activity in the order of 150mb, but no big dumps of 1gb or more.
Flook_06_25_10-kernel.tar.gz is attached. This testing was performed on 2
vxfs filesystems with write_throttle at the default of 0 (off).

14:24 - Start collect
14:26 - Start copy
14:32 - stop copy
14:34 - start (2 mins in 140mb dump)
14:40 - stop copy (+80mb swap)
14:43 - start copy
14:49 - stop copy (120mb swap)
14:51 - start  copy
14:56 - stop copy (+70mb swap)
14:58 - start copy
15:04 - stop copy
15:11 - stop collect.

I will run some more testing over the weekend. Let me know if you want any
tests under specific conditions (eg, using ext3, different kernel
parameters, using write_throttle, etc).
Michael Benn

<snip>

I've attached this tarball of data as well.

I'm also in the process of asking the customer if this has positively
impacted the timeouts they have previously complained about.

Is there anything else you want me to have them test, implement, or
answer?

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jwest 
 issue 865263

Comment 10 Issue Tracker 2010-07-16 00:52:45 UTC

Event posted on 06-30-2010 06:30pm EDT by jbrier

File uploaded: Flook_06_25_10-kernel.tar.gz

This event sent from IssueTracker by jwest 
 issue 865263
it_file 822003

Comment 11 Issue Tracker 2010-07-16 00:52:46 UTC

Event posted on 06-30-2010 06:30pm EDT by jbrier

File uploaded: Flook_06_27_10-kern_ext3.tar.gz

This event sent from IssueTracker by jwest 
 issue 865263
it_file 822013

Comment 12 Issue Tracker 2010-07-16 00:52:48 UTC

Event posted on 06-30-2010 06:35pm EDT by jbrier

The customer got some good results with the test kernel:

===
I did some testing on the new test kernel. I still observed some swapping
activity in the order of 150mb, but no big dumps of 1gb or more.
Flook_06_25_10-kernel.tar.gz is attached. This testing was performed on 2
vxfs filesystems with write_throttle at the default of 0 (off).

14:24 - Start collect
14:26 - Start copy
14:32 - stop copy
14:34 - start (2 mins in 140mb dump)
14:40 - stop copy (+80mb swap)
14:43 - start copy
14:49 - stop copy (120mb swap)
14:51 - start  copy
14:56 - stop copy (+70mb swap)
14:58 - start copy
15:04 - stop copy
15:11 - stop collect.

I will run some more testing over the weekend. Let me know if you want any
tests under specific conditions (eg, using ext3, different kernel
parameters, using write_throttle, etc).
Michael Benn
Attachment uploadedFlook_06_25_10-kernel.tar.gz
===

===
I have attached a round of testing using the new kernel, with copies
occurring on the ext3 filesystems (on Veritas volumes).
Flook_06_27_10-kern_ext3.tar.gz

11:07 - Start collection
11:09 - Start copy
11:15 - Stop copy  (230 mb dump)
11:18 - Start copy  
11:24 - Stop copy  (+100mb)
11:26 - Start copy
11:31 - Stop copy  (230 mb dump)
11:38 - Stop collection

Michael Benn
===

The customer would like to know if "the new kernel is behaving properly
?"

It seems they already personally saw better behavior for their situation,
do you agree?

Also, since we are seeing good results, can we get this patch in a
supported hotfix kernel?

The customer is working to see if these increases are enough to fix the
problem:

"I am organising some more detailed testing with our apps team to confirm
if the recent changes are enough to alleviate the issue"

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jwest 
 issue 865263

Comment 13 RHEL Program Management 2010-07-16 13:19:21 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 14 Chris Marcantonio 2010-07-16 17:02:19 UTC

Arg, I can't publish a comment from IT into BZ on this case (I think it has something to do with permissions based on the queue it was opened in), so I'll have to post directly into Bugzilla.

The patch that Jeremy pushed through in Comment 5 looks complete and correct to me.  It is the exact patch I used in my Brew build that I provided to the customer.  I snagged this patch out of an e-mail thread with Larry Woodman where he made reference to it.  By the sound of it, this *is* from upstream, though there isn't a specific commit referenced, so unfortunately I can't provide any more insight on that front.

I'll attach the e-mail here as well for completeness...

Comment 18 Jarod Wilson 2010-07-23 15:29:20 UTC

in kernel-2.6.18-208.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 22 John Brier 2010-09-24 17:47:07 UTC

Hi, I called the customer today and discussed the state of the issue.

What they have observed is that despite seeing some minmal swapping the application is no longer seeing IO blocking since implementing:

* patched kernel
* vxfs tuning (write throttle)
* kernel vm tuning

So what we need to do now is work to get this patched kernel included in an upcoming errata.

Comment 24 Zhouping Liu 2010-11-18 09:16:56 UTC

Created attachment 461244 [details]
the script test.sh creates the log

Comment 25 Zhouping Liu 2010-11-18 09:20:09 UTC

Created attachment 461245 [details]
logfile when no patch

Comment 26 Zhouping Liu 2010-11-18 09:22:54 UTC

Created attachment 461246 [details]
malloc_over.c program

Comment 29 John Brier 2010-11-30 15:27:33 UTC

I'm removing (closed) sales force ticket 00318498 form this Bug because it is the old migrated Issue Tracker ticket which was only used as a bridge from production support to SEG and from SEG to BZ.. now we have the main customer facing migrated Oracle CRM ticket (now salesforce 00329728) attached to this BZ..

Comment 31 errata-xmlrpc 2011-01-13 21:37:53 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html