Bug 604779

Summary: Page out activity when there is no current VM load

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Target Milestone | rc |
| Reporter | Bill Braswell <bbraswel> |
| Assignee | Larry Woodman <lwoodman> |
| QA Contact | Zhouping Liu <zliu> |
| CC | bhamrick, jbrier, jwest, nobody+295318, qcai, tao, zliu |
| Doc Type | Bug Fix |
| Last Closed | 2011-01-13 21:37:53 UTC |
Description

Bill Braswell, 2010-06-16 17:50:11 UTC
Event posted on 06-16-2010 03:46pm EDT by jbrier

"Has there been any attempt to reproduce this on an ext3 filesystem? If it is reproducible on ext3, then there would be no need to keep Symantec involved."

Yes, I pushed back on this in the email; they are working on it.

"Is there a measurable performance problem here? Or is this trying to interpret some unusual system statistics?"

"In the testing over the weekend, the parameter suggested by Symantec was modified. From what I read, the system responded as expected: I/O wait time dropped. It wasn't until a higher load was placed on the system that it showed any of the signs. Is that load something that would be expected in a normal production environment?"

They touched on this previously:

===
However, throughout the testing I did observe that the CPU IO wait % was a lot lower than before the throttling was set. It tended to hold around 25% during a copy, with the CPU over 50% idle. Even during the swap dumps, IO wait % remained below 50%. This is in contrast to the 60% IO wait seen without the throttling during a copy, and 90% IO wait following a swap dump. Given this, I'll put the question to our apps team to see if they can measure the response times of the retrieval process within the Java apps, and whether we're at a point where we have alleviated the issue of the retrievals timing out.
===

Which I followed up on. Here is a description of the actual problem, without complaining about cache usage or swap usage:

===
As for the actual problem: there are two types of actions occurring on the hosts. The first is a "retrieval" operation, where a client connects to our Tomcat/Java processes and reads a file from our VxFS filesystems; I have been told we get up to 90,000 requests a day. The second is a "load" operation, where we copy 50 MB data files from a share to the VxFS filesystems. Loading occurs on demand as the files are made available, usually in batches. We were able to perform both of these operations from a single SuSE host with 4 GB of RAM for several years without noticing any performance issues. After we migrated to the new Red Hat 5.5 servers with all new hardware and 8 GB of RAM, we started to experience timeouts within the retrieval process during, or soon after, a load process. A timeout means that a client requested a data file and Tomcat was unable to fulfill the request within a set period. When looking at the server while these retrieval timeouts were occurring, we noticed that large amounts of data were being instantly dumped to swap, and the server was reporting high CPU IO wait % in the range of 50-90%. This high IO wait persisted for long periods, effectively rendering our retrieval app non-operational.
===

Bill, I'll wait for your update after the out-of-band email thread with Larry Woodman from today. It looks like we figured out some good stuff about one of the NUMA nodes being very low on memory, which caused page reclaiming even though the other NUMA nodes were not that low.

John Brier

This event sent from IssueTracker by jwest, issue 865263

Event posted on 06-16-2010 05:52pm EDT by bbraswel

The system is acting as if it is low on memory because, from a NUMA (Non-Uniform Memory Architecture) standpoint, it is. Under NUMA, each node's memory is managed semi-independently of the other nodes. Once page reclamation begins, the system continues to reclaim pages until no node is memory starved.

In the data we have looked at, free memory on node 0 is well below min_free_kbytes. The system will continue to perform page reclaim until free memory on node 0 is above min_free_kbytes, even at the expense of the other nodes. This can cause swapping on systems where the overall free total is huge. While reclaiming, the kernel checks each node to see whether its free memory is above the minimum, and keeps going until every node is. The problem is that it continues the reclaim operations on all nodes, even those that are not low on memory, which can push out pages that should not have been swapped. We are looking at a patch from upstream that checks a node before performing reclaim on that node. Since all DMA memory resides in node 0, and that is the node having memory starvation issues, we would still like to see if the problem can be reproduced on ext3; that would eliminate VxFS as a concern here.

Bill

This event sent from IssueTracker by jwest, issue 865263
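[Editor's note: the per-node starvation Bill describes can be checked from userspace. A minimal sketch, assuming a NUMA-aware kernel that exposes the standard sysfs node directories and /proc/zoneinfo; these commands are not from the ticket.]

```sh
#!/bin/bash
# Compare each NUMA node's free memory against the global
# vm.min_free_kbytes reserve to spot a starved node.

echo "min_free_kbytes: $(cat /proc/sys/vm/min_free_kbytes) kB"

# One meminfo file per node under sysfs; field 4 is the kB value.
for node in /sys/devices/system/node/node*; do
    free_kb=$(awk '/MemFree/ {print $4}' "$node/meminfo")
    echo "$(basename "$node"): ${free_kb} kB free"
done

# Per-zone free pages and the min/low/high watermarks kswapd enforces.
grep -A4 '^Node' /proc/zoneinfo
```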
Event posted on 06-21-2010 02:01pm EDT by cmarcant

Hello. John is out of the office for the next week, so I am watching this case for him. The patch that I saw in Larry's e-mail looks like it was already backported to RHEL 5 from what I can tell: line numbers match up with a current RHEL 5 source tree, and luckily it is basically a one-line patch (plus comments), so it is not hard to drop into place if you are working off a different tree for some reason. Is that likely our best plan of action? Do you want me to hand-build a kernel at my desk with that patch applied and have the customer try it, or would you rather push a build through Brew and point me to that package? The customer is currently working on securing more disk space for a test on ext3, but in the meantime is a bit anxious to see an update from our end on where we are with this issue. Any preference for how we go about having them test this patch?

Chris

This event sent from IssueTracker by jwest, issue 865263

Event posted on 06-22-2010 10:39am EDT by cmarcant

Hmmmm. I tried to build a kernel here at my desk last night with the following patch (applied against the -194 tree):

```
[chris@virt5-64 2.6.18]$ cat SOURCES/linux-kernel-test.patch
diff -up linux-2.6.18.x86_64/mm/vmscan.c.ORIG linux-2.6.18.x86_64/mm/vmscan.c
--- linux-2.6.18.x86_64/mm/vmscan.c.ORIG	2010-06-21 19:03:43.000000000 -0400
+++ linux-2.6.18.x86_64/mm/vmscan.c	2010-06-21 19:06:20.000000000 -0400
@@ -1221,7 +1221,13 @@ scan:
 		temp_priority[i] = priority;
 		sc.nr_scanned = 0;
 		note_zone_scanning_priority(zone, priority);
-		shrink_zone(priority, zone, &sc);
+		/*
+		 * We put equal pressure on every zone, unless one
+		 * zone has way too many pages free already.
+		 */
+		if (!zone_watermark_ok(zone, order,
+				8*zone->pages_high, end_zone, 0))
+			shrink_zone(priority, zone, &sc);
 		reclaim_state->reclaimed_slab = 0;
 		nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 					lru_pages);
```

Unfortunately, a little over 3 GB apparently isn't enough space to build a kernel, and the build failed when I ran out of space on the device. Arg. I'm trying to see if I can get this going on another system in the meantime.

This event sent from IssueTracker by jwest, issue 865263
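[Editor's note: a hedged sketch of the scratch-build workflow Chris describes. The SRPM name matches the -194 tree he mentions, and the SOURCES/linux-kernel-test.patch file name suggests he used the spec's test-patch placeholder, but the exact commands, paths, and spec file name below are assumptions based on the standard RHEL 5 kernel build process, not commands recorded in the ticket.]

```sh
# Hedged sketch: rebuild the RHEL 5 kernel with the vmscan patch.
# Assumes the kernel-2.6.18-194.el5 source RPM and ample free space
# (the desk build above failed with only ~3 GB available).

rpm -ivh kernel-2.6.18-194.el5.src.rpm
cd /usr/src/redhat/SPECS            # default rpmbuild topdir on RHEL 5

# Overwrite the spec's empty test-patch placeholder with the fix.
cp /path/to/vmscan-numa-fix.patch ../SOURCES/linux-kernel-test.patch

rpmbuild -bb --target=$(uname -m) kernel*.spec   # spec name varies by release
```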
Event posted on 06-22-2010 05:27pm EDT by bbraswel

Chris's patch build is finished; I have not been able to test it yet. It is available at: https://brewweb.devel.redhat.com/taskinfo?taskID=2540532

Please note, this patch will not prevent page reclaim outright when overall free memory is high. What it does is prevent reclaim on NUMA nodes that have enough free memory; there will still be reclaim activity on NUMA nodes with a low free page count.

Bill

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by jwest, issue 865263

Event posted on 06-24-2010 04:07pm EDT by cmarcant

File uploaded: Flook_06_24_10-ext3.tar.gz

This event sent from IssueTracker by jwest, issue 865263 (it_file 803043)

Event posted on 06-25-2010 04:05pm EDT by cmarcant

File uploaded: Flook_06_25_10-kernel.tar.gz

This event sent from IssueTracker by jwest, issue 865263 (it_file 807513)

Event posted on 06-25-2010 04:05pm EDT by cmarcant

One more update from the customer today. He has tested the patched kernel with VxFS:

<snip>
I did some testing on the new test kernel. I still observed some swapping activity, on the order of 150 MB, but no big dumps of 1 GB or more. Flook_06_25_10-kernel.tar.gz is attached. This testing was performed on 2 VxFS filesystems with write_throttle at the default of 0 (off).

- 14:24 - Start collect
- 14:26 - Start copy
- 14:32 - Stop copy
- 14:34 - Start copy (2 minutes in, 140 MB dump)
- 14:40 - Stop copy (+80 MB swap)
- 14:43 - Start copy
- 14:49 - Stop copy (120 MB swap)
- 14:51 - Start copy
- 14:56 - Stop copy (+70 MB swap)
- 14:58 - Start copy
- 15:04 - Stop copy
- 15:11 - Stop collect

I will run some more testing over the weekend. Let me know if you want any tests under specific conditions (e.g., using ext3, different kernel parameters, using write_throttle, etc.).

Michael Benn
<snip>

I've attached this tarball of data as well. I'm also in the process of asking the customer whether this has positively impacted the timeouts they previously complained about. Is there anything else you want me to have them test, implement, or answer?

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jwest, issue 865263
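[Editor's note: the ticket never shows the "collect" tooling behind these timelines. A minimal stand-in, assuming nothing beyond procps and /proc/meminfo; the script name and 10-second interval are hypothetical.]

```sh
#!/bin/bash
# collect.sh (hypothetical): log swap usage alongside vmstat so the
# swap "dumps" seen during each copy window show up in one timeline.

vmstat -n 10 >> vmstat.log &        # si/so columns expose swap traffic

while sleep 10; do
    awk -v ts="$(date '+%H:%M:%S')" \
        '/SwapTotal/ {t=$2} /SwapFree/ {f=$2}
         END {printf "%s swap_used_kb=%d\n", ts, t - f}' /proc/meminfo
done >> swap.log
```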
Event posted on 06-30-2010 06:30pm EDT by jbrier

File uploaded: Flook_06_25_10-kernel.tar.gz

This event sent from IssueTracker by jwest, issue 865263 (it_file 822003)

Event posted on 06-30-2010 06:30pm EDT by jbrier

File uploaded: Flook_06_27_10-kern_ext3.tar.gz

This event sent from IssueTracker by jwest, issue 865263 (it_file 822013)

Event posted on 06-30-2010 06:35pm EDT by jbrier

The customer got some good results with the test kernel (the 06-25 VxFS results are quoted in full above), and followed up with a round of testing on ext3:

===
I have attached a round of testing using the new kernel, with copies occurring on the ext3 filesystems (on Veritas volumes). Flook_06_27_10-kern_ext3.tar.gz

- 11:07 - Start collection
- 11:09 - Start copy
- 11:15 - Stop copy (230 MB dump)
- 11:18 - Start copy
- 11:24 - Stop copy (+100 MB)
- 11:26 - Start copy
- 11:31 - Stop copy (230 MB dump)
- 11:38 - Stop collection

Michael Benn
===

The customer would like to know whether "the new kernel is behaving properly?" They have already seen better behavior for their situation themselves; do you agree? Also, since we are seeing good results, can we get this patch into a supported hotfix kernel? The customer is working to confirm whether these improvements are enough to fix the problem: "I am organising some more detailed testing with our apps team to confirm if the recent changes are enough to alleviate the issue."

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jwest, issue 865263

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Arg, I can't publish a comment from IT into BZ on this case (I think it has something to do with permissions based on the queue it was opened in), so I'll have to post directly into Bugzilla. The patch that Jeremy pushed through in Comment 5 looks complete and correct to me; it is the exact patch I used in my Brew build that I provided to the customer. I snagged this patch out of an e-mail thread with Larry Woodman where he made reference to it. By the sound of it, this *is* from upstream, though there isn't a specific commit referenced, so unfortunately I can't provide any more insight on that front. I'll attach the e-mail here as well for completeness.

in kernel-2.6.18-208.el5

You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Hi, I called the customer today and discussed the state of the issue. What they have observed is that, despite some minimal swapping, the application is no longer seeing IO blocking since implementing:

* patched kernel
* vxfs tuning (write throttle)
* kernel vm tuning

So what we need to do now is work to get this patched kernel included in an upcoming errata.
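[Editor's note: the ticket records this three-part fix but not the tunable values used. For illustration only, a hedged example of the two tuning pieces; the write_throttle value, mount point, and sysctl settings below are placeholders, not the customer's actual settings.]

```sh
# VxFS write throttling (0 = off, the default tested earlier).
# A non-zero value caps dirty pages per file, smoothing the large
# flushes seen during load copies. The value here is illustrative.
vxtunefs -o write_throttle=16384 /vxfs/data

# Example VM tuning (values illustrative): flush dirty pages earlier
# and keep a larger free-memory reserve, per the min_free_kbytes
# discussion above.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20
sysctl -w vm.min_free_kbytes=65536
```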
Created attachment 461244 [details]: test.sh, the script that creates the log
Created attachment 461245 [details]: log file captured without the patch applied
Created attachment 461246 [details]: the malloc_over.c test program
I'm removing the (closed) Salesforce ticket 00318498 from this bug because it is the old migrated Issue Tracker ticket, which was only used as a bridge from production support to SEG and from SEG to BZ. We now have the main customer-facing migrated Oracle CRM ticket (now Salesforce 00329728) attached to this BZ.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html