They start up multiple Java programs. Then there are several very large file copy operations, each around 50 megabytes. The system ends up with a large amount of memory tied up in pagecache. About 30 seconds after the file copies end, there is a short burst of page-out activity, even though there is close to 3 gigabytes of free memory (well above min_free_kbytes) and the system itself is effectively idle.

The customer wants to know why the system is paging out their Java applications when there is so much free memory, and why it is not recovering memory from pagecache. After the burst, there is very little paging out. As expected, there is some page-in activity afterwards, but I see no evidence of thrashing.

kernel version: 2.6.18-194.el5

Sample of vmstat:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  4    292 2892956  5240 2015592    0     0    76 30864 1111 7318  0  1 65 34  0
 0  6    292 2918864  5236 1989920    0     0    13 29668 1135 6728  0  1 63 36  0
 0  3    292 2943704  5232 1964840    0     0     8 32676 1115 6896  1  1 64 34  0
 0  4    292 2977148  5368 1930016    0     0   414 27416 1186 7690  0  2 70 27  0
 0  2    292 3011456  5320 1896364    0     0    62 37368 1176 6074  0  1 80 19  0
 0  1    292 3043528  5252 1864336    0     0     9 27117 1182 3601  0  1 81 18  0
 0  1    292 3045772  5256 1861936    0     0    49    20 1021  871  0  0 87 13  0
 0  1    292 3048656  5264 1859264    0     0    44    12 1040  927  0  0 87 13  0
 0  1    292 3053276  5284 1854808    0     0     9   223 1031  891  0  0 87 13  0
 0  1    292 3056436  5312 1851708    0     0     9    79 1045  919  1  0 86 13  0
 1  0    292 3060572  5260 1847752    0     0    13   100 1033  922  0  1 86 13  0
 0  2  73372 3108212  5256 1847068    0 73080     8 73094 1262 1017  0  3 86 11  0
 0  1 171428 3210032  5256 1845956    0 98056     4 98056 1471 1069  0  2 84 14  0
 0  5 267604 3305248  5280 1845768   32 96164    51 96348 1509 1125  0  2 77 21  0
 0  3 332984 3390068  5348 1846744   64 65376   259 65472 1508 1467  0  2 67 31  0
 0  1 389804 3449460  5372 1846912    0 56820    98 57574 1468 1603  0  2 79 19  0
 0  1 403136 3463360  5372 1846832    0 13332     0 13332 1080 1167  0  1 87 12  0
 0  1 410360 3471196  5376 1846896    0  7224     4  7240 1069 1102  0  0 87 12  0
 1  1 414076 3475660  5360 1846568    0  3716    32  4000 1045 1045  0  1 86 13  0
 0  3 490996 3531708  5304 1844736   32 76932    58 76988 1287 1160  1  3 82 14  0
 0  3 578964 3622892  5328 1844612  224 87968   229 88110 1449 1158  0  2 63 35  0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  3 674496 3717112  5348 1844072  512 95532   524 95540 1630 1344  0  2 75 23  0
 1  2 740268 3801612  5424 1842796  344 65812   527 65812 1976 1981  1  3 73 23  0
 0  2 743212 3804776  5640 1843516  844  2944  1784  3162 1345 1197  0  0 93  7  0
 0  0 743212 3803744  5780 1843908  504     0   640    12 1060  990  0  0 98  2  0
 0  0 743196 3802240  5944 1844900   64     0   982    56 1113 1351  0  1 96  3  0
 0  0 743196 3801752  6020 1845028  128     0   324     0 1039  933  0  0 99  1  0
 0  0 743196 3801628  6100 1845096   64     0   136    72 1046  929  0  0 99  1  0
 0  0 743196 3802088  6164 1845056   32     0    88   772 1037  938  0  0 98  1  0
 0  0 743196 3801696  6172 1845108   96     0    96    68 1034  908  1  0 99  0  0
 0  0 743196 3801956  6188 1845184    0     0     0    43 1011  878  0  0 100 0  0
 0  0 743196 3801956  6188 1845216    0     0     0     0 1028  872  0  0 100 0  0
 2  0 743196 3801928  6244 1845260  256     0   424    32 1040 1053  0  0 98  1  0
 0  1 743196 3801532  6256 1845392  128     0   132   216 1037  890  0  0 99  1  0
 0  0 743196 3800316  6288 1846476  224     0   832    12 1063  919  0  0 98  1  0
 0  0 743196 3799448  6308 1846516   64     0    66    64 1036 1222  0  1 99  1  0
 0  0 743196 3799856  6308 1846652    0     0     0     0 1006  871  0  0 100 0  0
 0  0 743196 3800008  6308 1846632   32     0    32     0 1027  925  0  0 100 0  0
 0  0 743196 3799568  6332 1846676    0     0    20   844 1031  929  0  0 99  1  0
 0  0 743196 3799568  6332 1846744   96     0    96     0 1027  908  1  0 99  0  0
 0  0 743196 3799968  6352 1846772    0     0     0    87 1017  890  0  0 99  1  0
 0  0 743196 3799872  6352 1846848    0     0     0     0 1026  883  0  0 100 0  0
Event posted on 06-16-2010 03:46pm EDT by jbrier

"Has there been any attempt to reproduce this on an ext3 filesystem? If it is reproducible on ext3, then there would be no need to keep Symantec involved."

Yes, I pushed back on this in the email; they are working on it.

"Is there a measurable performance problem here? Or is this trying to interpret some unusual system statistics?"

"In the testing over the weekend, the parameter suggested by Symantec was modified. From what I read, the system responded as expected: I/O wait time dropped. It wasn't until a higher load was placed on the system that it showed any of the signs. Is that load something that would be expected in a normal production environment?"

They touched on this previously:

======
However, throughout the testing I did observe that the CPU I/O wait % was a lot lower than before the throttling was set. It tended to hold around 25% during a copy, with the CPU over 50% idle. Even during the swap dumps, I/O wait % remained below 50%. This is in contrast to the 60% I/O wait seen without the throttling during a copy, and 90% I/O wait following a swap dump.

Given this, I'll put the question to our apps team to see if they can measure the response times of the retrieval process within the Java apps, and whether we're at a point where we have alleviated the issue of the retrievals timing out.
======

Which I followed up on. Here is a description of the actual problem, without complaining about cache usage or swap usage:

===
As for the actual problem: there are two types of actions occurring on the hosts. The first is a "retrieval" operation, where a client connects to our Tomcat/Java processes and reads a file from our VxFS filesystems; I have been told we get up to 90,000 requests a day. The second is a "load" operation, where we copy 50 MB data files from a share to the VxFS filesystems. This loading occurs on demand as the files are made available, usually in batches.

We were able to perform both of these operations from a single SuSE host with 4 GB of RAM for several years without noticing any performance issues. After we migrated to the new Red Hat 5.5 servers, with all new hardware and 8 GB of RAM, we started to experience timeouts within the retrieval process during or soon after a load process commenced. A timeout means that a client requested a data file and Tomcat was unable to fulfill the request within a set period. When looking at the server while these retrieval timeouts were occurring, we noticed that large amounts of data were being instantly dumped to swap, and the server was reporting high CPU I/O wait % in the range of 50-90%. This high I/O wait persisted for long periods, effectively rendering our retrieval app non-operational.
===

Bill, I'll wait for your update after the out-of-band email thread with Larry Woodman from today. It looks like we figured out some good stuff about one of the NUMA nodes being very low on free memory, which caused page reclaiming even though the other NUMA nodes were not that low.

John Brier

This event sent from IssueTracker by jwest
issue 865263
Event posted on 06-16-2010 05:52pm EDT by bbraswel

The system is acting as if it is low on memory because, from a NUMA (Non-Uniform Memory Access) standpoint, it is. In NUMA, the memory of each node is managed semi-independently of the other nodes. Once page reclamation begins, the system continues to reclaim pages until no node is memory starved. In the data we have looked at, the free memory on node 0 is well below min_free_kbytes. The system will continue to perform page recovery until free memory on node 0 is above min_free_kbytes, even at the expense of other nodes. This can cause swapping on systems where the overall total free count is huge.

While doing memory recovery, the kernel checks each node to see whether its free memory is above the minimum, and it continues until all nodes are. The problem is that it performs the recovery operations on all nodes, even those that are not low on memory. This can cause pages to be swapped out that should not have been. We are looking at a patch from upstream that checks a node before performing recovery on that node.

Since all DMA memory does reside in node 0, and that is the node having memory-starvation issues, we would still like to see if this can be reproduced on ext3. That could eliminate VxFS as a concern here.

Bill

This event sent from IssueTracker by jwest
issue 865263
Event posted on 06-21-2010 02:01pm EDT by cmarcant

Hello. John is out of the office for the next week, so I am watching this case for him.

The patch that I saw from Larry's e-mail looks like it was already backported to RHEL5 from what I can see/tell. Line numbers appear to match up to a current RHEL5 source tree, and luckily it's basically a 1-line patch (plus comments), so it's not too hard to drop into place as appropriate if you're working off a different tree for some reason.

Is that likely our best plan of action? Do you want me to just hand-build a kernel at my desk with that patch applied and have the customer try it? Would you rather push a build through brew and point me to that package?

The customer is currently working on securing more disk space for a test on ext3, but in the meantime is a bit anxious to see an update from our end and where we are at in regards to this issue. Any preference for how we go about having them test this patch?

Chris

This event sent from IssueTracker by jwest
issue 865263
Event posted on 06-22-2010 10:39am EDT by cmarcant

Hmmmm. I tried to build a kernel here at my desk last night with the following patch (applied against the -194 tree):

[chris@virt5-64 2.6.18]$ cat SOURCES/linux-kernel-test.patch
diff -up linux-2.6.18.x86_64/mm/vmscan.c.ORIG linux-2.6.18.x86_64/mm/vmscan.c
--- linux-2.6.18.x86_64/mm/vmscan.c.ORIG	2010-06-21 19:03:43.000000000 -0400
+++ linux-2.6.18.x86_64/mm/vmscan.c	2010-06-21 19:06:20.000000000 -0400
@@ -1221,7 +1221,13 @@ scan:
 		temp_priority[i] = priority;
 		sc.nr_scanned = 0;
 		note_zone_scanning_priority(zone, priority);
-		shrink_zone(priority, zone, &sc);
+		/*
+		 * We put equal pressure on every zone, unless one
+		 * zone has way too many pages free already.
+		 */
+		if (!zone_watermark_ok(zone, order,
+					8*zone->pages_high, end_zone, 0))
+			shrink_zone(priority, zone, &sc);
 		reclaim_state->reclaimed_slab = 0;
 		nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
[chris@virt5-64 2.6.18]$

Unfortunately, apparently a little over 3G isn't enough space to properly build a kernel and I ran out of space on the device, so the build failed. Arg. I'm trying to see if I can get this going on another system in the meantime.

This event sent from IssueTracker by jwest
issue 865263
Event posted on 06-22-2010 05:27pm EDT by bbraswel

Chris's patch build is finished. I have not been able to test the patch as of yet. It is available at:

https://brewweb.devel.redhat.com/taskinfo?taskID=2540532

Please note, this patch will not prevent page recovery when overall total free memory is high. What it does is prevent page recovery on NUMA nodes that have enough free memory; there will still be page recovery operations on NUMA nodes that have a low free page count.

Bill

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by jwest
issue 865263
Event posted on 06-24-2010 04:07pm EDT by cmarcant File uploaded: Flook_06_24_10-ext3.tar.gz This event sent from IssueTracker by jwest issue 865263 it_file 803043
Event posted on 06-25-2010 04:05pm EDT by cmarcant File uploaded: Flook_06_25_10-kernel.tar.gz This event sent from IssueTracker by jwest issue 865263 it_file 807513
Event posted on 06-25-2010 04:05pm EDT by cmarcant

One more update from the customer today. He has tested the patched kernel with VxFS:

<snip>
I did some testing on the new test kernel. I still observed some swapping activity, in the order of 150 MB, but no big dumps of 1 GB or more. Flook_06_25_10-kernel.tar.gz is attached. This testing was performed on 2 vxfs filesystems with write_throttle at the default of 0 (off).

14:24 - Start collect
14:26 - Start copy
14:32 - Stop copy
14:34 - Start (2 mins in, 140 MB dump)
14:40 - Stop copy (+80 MB swap)
14:43 - Start copy
14:49 - Stop copy (120 MB swap)
14:51 - Start copy
14:56 - Stop copy (+70 MB swap)
14:58 - Start copy
15:04 - Stop copy
15:11 - Stop collect

I will run some more testing over the weekend. Let me know if you want any tests under specific conditions (eg, using ext3, different kernel parameters, using write_throttle, etc).

Michael Benn
<snip>

I've attached this tarball of data as well. I'm also in the process of asking the customer if this has positively impacted the timeouts they have previously complained about. Is there anything else you want me to have them test, implement, or answer?

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jwest
issue 865263
Event posted on 06-30-2010 06:30pm EDT by jbrier File uploaded: Flook_06_25_10-kernel.tar.gz This event sent from IssueTracker by jwest issue 865263 it_file 822003
Event posted on 06-30-2010 06:30pm EDT by jbrier File uploaded: Flook_06_27_10-kern_ext3.tar.gz This event sent from IssueTracker by jwest issue 865263 it_file 822013
Event posted on 06-30-2010 06:35pm EDT by jbrier

The customer got some good results with the test kernel:

===
I did some testing on the new test kernel. I still observed some swapping activity, in the order of 150 MB, but no big dumps of 1 GB or more. Flook_06_25_10-kernel.tar.gz is attached. This testing was performed on 2 vxfs filesystems with write_throttle at the default of 0 (off).

14:24 - Start collect
14:26 - Start copy
14:32 - Stop copy
14:34 - Start (2 mins in, 140 MB dump)
14:40 - Stop copy (+80 MB swap)
14:43 - Start copy
14:49 - Stop copy (120 MB swap)
14:51 - Start copy
14:56 - Stop copy (+70 MB swap)
14:58 - Start copy
15:04 - Stop copy
15:11 - Stop collect

I will run some more testing over the weekend. Let me know if you want any tests under specific conditions (eg, using ext3, different kernel parameters, using write_throttle, etc).

Michael Benn

Attachment uploaded: Flook_06_25_10-kernel.tar.gz
===

===
I have attached a round of testing using the new kernel, with copies occurring on the ext3 filesystems (on Veritas volumes). Flook_06_27_10-kern_ext3.tar.gz

11:07 - Start collection
11:09 - Start copy
11:15 - Stop copy (230 MB dump)
11:18 - Start copy
11:24 - Stop copy (+100 MB)
11:26 - Start copy
11:31 - Stop copy (230 MB dump)
11:38 - Stop collection

Michael Benn
===

The customer would like to know if "the new kernel is behaving properly?" It seems they have already seen better behavior for their situation; do you agree? Also, since we are seeing good results, can we get this patch into a supported hotfix kernel? The customer is working to confirm that these improvements are enough to fix the problem: "I am organising some more detailed testing with our apps team to confirm if the recent changes are enough to alleviate the issue"

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jwest
issue 865263
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Arg, I can't publish a comment from IT into BZ on this case (I think it has something to do with permissions based on the queue it was opened in), so I'll have to post directly into Bugzilla. The patch that Jeremy pushed through in Comment 5 looks complete and correct to me. It is the exact patch I used in my Brew build that I provided to the customer. I snagged this patch out of an e-mail thread with Larry Woodman where he made reference to it. By the sound of it, this *is* from upstream, though there isn't a specific commit referenced, so unfortunately I can't provide any more insight on that front. I'll attach the e-mail here as well for completeness...
in kernel-2.6.18-208.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Hi, I called the customer today and discussed the state of the issue. What they have observed is that, despite seeing some minimal swapping, the application is no longer seeing IO blocking since implementing:

* patched kernel
* vxfs tuning (write throttle)
* kernel vm tuning

So what we need to do now is work to get this patched kernel included in an upcoming errata.
Created attachment 461244 [details] the script test.sh creates the log
Created attachment 461245 [details] logfile when no patch
Created attachment 461246 [details] malloc_over.c program
I'm removing (closed) Salesforce ticket 00318498 from this bug because it is the old migrated Issue Tracker ticket, which was only used as a bridge from production support to SEG and from SEG to BZ. Now we have the main customer-facing migrated Oracle CRM ticket (now Salesforce 00329728) attached to this BZ.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html