Bug 145950
| Field | Value |
|---|---|
| Summary | high loads / high iowait / up 100% cpu time for kscand on oracle box |
| Product | Red Hat Enterprise Linux 3 |
| Component | kernel |
| Version | 3.0 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | medium |
| Reporter | gman |
| Assignee | Larry Woodman <lwoodman> |
| QA Contact | Brian Brock <bbrock> |
| CC | acjohnso, andre, bill.irwin, dff, greg.marsden, jonathan, peterm, petrides, redhat, rick.beldin, riel, rtaylor, tao, tkincaid, van.okamura |
| Fixed In Version | RHSA-2005-663 |
| Doc Type | Bug Fix |
| Last Closed | 2005-09-28 14:43:32 UTC |
| Bug Blocks | 156320 |
Description
gman
2005-01-24 11:38:31 UTC
Created attachment 110182 [details]
output of top prior to peak. last one is post peak.
Created attachment 110183 [details]
graph of cpu usage
Here is a CPU usage graph. I read /proc/stat and let RRD do the graphing.
Created attachment 110184 [details]
graph of load
Same as above, except the data is load from /usr/bin/uptime.
gman, can you get me several AltSysrq-M and AltSysrq-W outputs when you see kscand in this state (eating up lots of CPU time)? I have tried to reproduce this problem internally and never even see kscand in a top output! Alternatively, you can profile the kernel and get even more debug data for me, but that does require a reboot. I'll send you the instructions to do this if you can reboot. Thanks for your help, Larry Woodman

The DBA decided to move back to 2.4.20-20.EL and increased shmmax to 4GB. That seemed to help, but the swapping issue may be coming back after 8 days of uptime. I can't get you the info until I can get them to decide to move to the newer kernel again.

OK, I can't really make any progress on this issue if they are no longer running the kernel that this problem was reported on. I have not seen any other problem reports on this issue, so I need to rely on you for help debugging it. Larry Woodman

We just rebooted back into 2.4.21-27.0.1.ELsmp. We saw the spikes right away after the reboot. I'll have AltSysrq-M and AltSysrq-W outputs sometime today. I'll see about adding profiling in; can you send instructions on how to do this?

Created attachment 110982 [details]
outputs of AltSysrq-W and AltSysrq-M
Here are the outputs of AltSysrq-W and AltSysrq-M. The first one was taken right at the
large peak that we see in the graphs. The rest are from when kscand was making the
system unresponsive, post peak.
Created attachment 111072 [details]
more sysreq-m and sysreq-w
Ran sysrq every minute for 1+ hr. The peak was some time between 14:45 and
15:00, before it all started over again.
Found the problem. The DBA requested settings that caused the whole problem. I think if we had left limits.conf as-is from the default install, we would have been fine.

--- limits.conf.old 2005-04-05 21:03:37.000000000 -0700
+++ limits.conf.new 2005-04-19 19:49:05.000000000 -0700
@@ -40,12 +40,19 @@
 #ftp hard nproc 0
 #@student - maxlogins 4
-oracle soft nofile 4096
+
+oracle soft nofile 8192
 oracle hard nofile 65535
-oracle soft rss 8192
-oracle hard rss 65535
+# oracle soft rss 8192
+# oracle hard rss 65535
+oracle soft rss unlimited
+oracle hard rss unlimited
+oracle soft stack unlimited
+oracle hard stack unlimited
+oracle soft nproc 32768
+oracle hard nproc 65535
+oracle soft memlock unlimited
+oracle hard memlock unlimited

OK, here is what's going on with this bug: the system has millions of pages that are mostly active and mapped into hundreds or even thousands of user address spaces. When kscand runs and decides it needs to scan the active list to age pages, it takes the zone_lru lock and walks every page in the active list. For each page it walks the pte_chain, mapping the highmem PTEs into a kernel virtual window to test and clear the referenced bit. Since this is an N x M algorithm, where N is millions of pages and M is thousands of processes, this can result in billions of iterations of that mapping, testing, clearing, and unmapping while holding the zone_lru lock. While this lock is held, practically all other operations within that memory zone stall. In order to fix this problem we must make kscand more scalable than it currently is. It must periodically release the zone_lru lock so other work may progress. Larry Woodman

We are running AS 3 kernel version 2.4.21-27.0.2.1.3.ELsmp and are running into the same problem. We have cyclical CPU spikes every 5 minutes (load average jumps to between 60 and 80). kscand has consumed over 6 hours of CPU over the last 8 days.

root 12 1 3 May03 ? 06:53:15 [kscand]
root 27642 8815 0 16:05 pts/9 00:00:00 grep kscand

Another voice from the wilderness: I'm having the same problem with RHEL3 after installing Update 4, including kernel 2.4.21-27.0.4.EL: periodically sluggish performance, with kscand having used 7:57 hours out of 41:00 hours of uptime.

It appears that 2.6 addresses this by changing how much it has to lock. From RHEL4 2.6.9 vmscan.c:

/*
 * zone->lru_lock is heavily contented. We relieve it by quickly privatising
 * a batch of pages and working on them outside the lock. Any pages which were
 * not freed will be added back to the LRU.
 *
 * shrink_cache() adds the number of pages reclaimed to sc->nr_reclaimed
 *
 * For pagecache intensive workloads, the first loop here is the hottest spot
 * in the kernel (apart from the copy_*_user functions).
 */

If I understand this comment (and the write-up on page 178 of Understanding the Linux Virtual Memory Manager by Mel Gorman), the idea is that a 'block' of pages is removed from the list, thus freeing the list up more quickly. I'm not sure how well this strategy would translate to the 2.4.21 kernel. In 2.6 this is called out of kswapd, but in 2.4.21 it is done from a different thread, kscand, which doesn't exist in 2.6. It also appears that the pagevec structure is a 2.6ism, which means that something would have to be 'invented' for 2.4. The code is sufficiently different between 2.4 and 2.6 that it would appear new sections would have to be written. The options seem to be:

- do something 'similar' to 2.6, where the lock is held just long enough to move pages quickly out of the way for later processing
- find other places to unlock for short periods
- perhaps 'checkpoint' the scan operation by saving state 'somehow', releasing the lock, and then making kscand runnable again to pick up where it left off

Just thinking aloud... I welcome all flames...
Some customers have indicated that this behavior (long kscand run times) was introduced somewhere between U2 and U4, but we don't have data to back that up.

The typical reason you see this problem on a larger system running Oracle is that you are not using hugepages. Without hugepages, the system treats System V shared memory pages as simple mapped file pages. Since the SGA is very large and active, kscand consumes lots of CPU time deactivating active pages, and since they are mapped by lots of Oracle processes, this activity is not very scalable. We are working on making kscand more efficient in RHEL3, but there will never be any substitute for using hugepages when running Oracle. Please try this and let me know if it works out OK for you. Larry Woodman

We spoke with Oracle about moving to hugepages and they indicated that we should expect a 10-15% performance hit from going to this architecture. This represents a fairly significant performance loss for us. Our database environment has only 8GB of RAM on each Oracle RAC cluster member, and our SGA is currently only set to 1.7GB, which is below the minimum requirement for hugememory pages. We are generally using less than half the available RAM on our servers.

That doesn't make sense; hugepages give Oracle a performance boost, not a loss! Hugepages allow the entire SGA to be mapped by the TLB because the page size increases by a factor of roughly 1000. As for the 1.7GB SGA being smaller than the requirement for hugepages, that's an Oracle limitation and not a kernel limitation. Either way, we are working on kscand improvements, so I do expect to make this better, but it will never be as good as using hugepages, because hugepages eliminate the kernel's involvement entirely. Larry Woodman

I would hazard a guess that someone has confused the hugemem kernel with the hugepages tunable. There is some performance penalty with the hugemem kernel, but use of hugepages does not depend on the hugemem kernel.

Has this issue been addressed in U6? Just looking for an update. Thanks.

Al, the answer is no.

Larry, could you please clarify why this bug is still in NEEDINFO? What information are you still waiting for?
The answer is yes: we added a new tunable, "/proc/sys/vm/kscand_work_percent".
This tunable defaults to 100, but if one insists on running large Oracle systems
without hugepages, it should be lowered to 10 or so. This will
prevent kscand from holding the zone LRU list lock for long periods,
thereby allowing other processes to get the lock and run.
Larry Woodman
Ernie, the patch tracking file for this bug is
>>>1036.lwoodman.kscand-work-percnt.patch
Support for the "kscand_work_percent" tunable was committed to the RHEL3 U6 patch pool on 15-Jul-2005 (in kernel version 2.4.21-32-12.EL).

PeterM/DonF/TomK, please run this bugzilla through the ack process ASAP so that I can add it to the RHEL3 U6 advisory.

(In reply to comment #21)
> >>>1036.lwoodman.kscand-work-percnt.patch

Can we get this patch posted as an attachment, please? Thanks, Rick

Created attachment 117781 [details]
kscand_work_percent sysctl patch committed to RHEL3 U6
Rick, this is the patch that was committed. But we'd rather you test
the RHEL3 U6 beta kernel (2.4.21-34.EL) than something you build manually.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html

*** Bug 169547 has been marked as a duplicate of this bug. ***

I have a clean install of RHEL3 U5 and I would like to apply this patch. Unfortunately, RHEL3 U6 and RHEL4 are not options for me because of the limitations of my "hardware" platform, namely VMware ESX Server. When applying the patch, I got some errors in sysctl because I lacked oom_kill_limit (/* int: limit on concurrent OOM kills */). I am wondering if there is another patch that I need to apply first, or if this patch will work without the OOM functionality. (I'm going to try it anyway, but I'm a newbie, so I don't really know how to test it beyond the obvious, e.g. Linux no longer boots up.)

Alter the patch to remove the OOM_KILL_LIMIT line and adjust the line counts:

--------------------------------------------------------------------------------
--- linux-2.4.21/include/linux/sysctl.h.orig
+++ linux-2.4.21/include/linux/sysctl.h
@@ -160,5 +160,6 @@ enum
 	VM_STACK_DEFER_THRESHOLD=26, /* int: softirq-defer threshold */
 	VM_SKIP_MAPPED_PAGES=27,/* int: don't reclaim pages w/active mappings */
+	VM_KSCAND_WORK_PERCENT=29, /* int: % of work on each kscand iteration */
 };