Red Hat Bugzilla – Bug 145950
high loads / high iowait / up 100% cpu time for kscand on oracle box
Last modified: 2011-11-24 22:33:56 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
Description of problem:
We have el3u3 installed on a dl380 w/ two 3.06GHz cpus and 8GB ram,
attached to a 1TB FC-attached raid array. This box is being used as
one of our oracle 8 boxes.
We had some problems w/ swapping too much when running
kernel-smp-2.4.21-20.EL. We then upgraded to
kernel-smp-2.4.21-27.0.1.EL, which seems to have solved that swapping problem.
I'm seeing the load slowly rise to 20-40 over a 45-90 minute
interval, then dive down to below 1. The busier oracle is, the faster it
happens. When the load does go down to 1, system cpu usage also drops
to about 2%/cpu. While it rises, it stays at about 8% cpu usage. When
the spike hits its peak, you can see kscand take up to 100% of all
the cpus. When it's not kscand, iowait gets really high instead.
Then the cycle begins again.
The system becomes more and more unresponsive the closer it gets to
the peak time. We've made some changes to
bdflush/pagecache/inactive_clean_percent, but nothing has helped or
made any difference.
Also swap is at 25% and still growing.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
This has been ongoing since we first booted up w/ the kernel 4 days ago.
Created attachment 110182 [details]
output of top prior to peak. last one is post peak.
Created attachment 110183 [details]
graph of cpu usage
Here is a cpu usage graph. I read in /proc/stat and let rrd do the graphing
Created attachment 110184 [details]
graph of load
same as above except for uptime via /usr/bin/uptime
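For anyone wanting to reproduce the cpu graph, here is a minimal Python sketch of how the samples could be derived from /proc/stat. It assumes the aggregate "cpu" line lists user, nice, system and idle jiffies in that order; the function names and the sample numbers are made up for illustration, and the rrd side is not shown.

```python
def parse_cpu_line(line):
    """Return the jiffy counters from an aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("not an aggregate cpu line")
    return [int(x) for x in fields[1:]]


def busy_percent(before, after, idle_index=3):
    """Percentage of non-idle time between two samples.

    idle_index=3 assumes the field order user, nice, system, idle.
    """
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * (total - deltas[idle_index]) / total


# Made-up samples standing in for two reads of the first line of /proc/stat
first = parse_cpu_line("cpu  1000 0 500 8500")
second = parse_cpu_line("cpu  1100 0 700 9200")
print(busy_percent(first, second))
```

The counters are cumulative, so only the difference between two reads is meaningful; that is why the sketch works on deltas rather than raw values.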
gman, can you get me several AltSysrq-M and AltSysrq-W outputs when
you see kscand in this state (eating up lots of cpu time)? I have
tried to reproduce this problem internally and never even see kscand
in top output!
Alternatively, you can profile the kernel and get even more debug data
for me but that does require a reboot. I'll send you the instructions
to do this if you can reboot.
Thanks for your help, Larry Woodman
The dba decided to move back to 2.4.21-20.EL and increased shmmax to
4GB. That seems to help, but the swapping issue may be coming back after
8 days of uptime. I can't get you the info until I can get them to
decide to move to the newer kernel again.
OK, I can't really make any progress on this issue if they are no
longer running the kernel that this problem was reported on. I have
not seen any other problem reports on this issue so I need to rely on
you for help debugging this.
We just rebooted back into 2.4.21-27.0.1.ELsmp. We saw the spikes right away
after the reboot. I'll have AltSysrq-M and AltSysrq-W outputs sometime today.
I'll see about adding profiling in, can you send instructions on how to do this?
Created attachment 110982 [details]
outputs of AltSysrq-W and AltSysrq-M
Here are the outputs of AltSysrq-W and AltSysrq-M. The first one was right at the
large peak that we see in the graphs. The rest are from when kscand was making the
system unresponsive post peak.
Created attachment 111072 [details]
more sysreq-m and sysreq-w
Ran sysrq every minute for 1+ hr. The peak was some time between 14:45 and
15:00, before it started all over again.
Found the problem. The dba requested settings that caused the whole problem. I think
if we had left limits.conf as it is in the default install, we would have been fine.
--- limits.conf.old 2005-04-05 21:03:37.000000000 -0700
+++ limits.conf.new 2005-04-19 19:49:05.000000000 -0700
@@ -40,12 +40,19 @@
#ftp hard nproc 0
#@student - maxlogins 4
-oracle soft nofile 4096
+oracle soft nofile 8192
oracle hard nofile 65535
-oracle soft rss 8192
-oracle hard rss 65535
+# oracle soft rss 8192
+# oracle hard rss 65535
+oracle soft rss unlimited
+oracle hard rss unlimited
+oracle soft stack unlimited
+oracle hard stack unlimited
+oracle soft nproc 32768
+oracle hard nproc 65535
+oracle soft memlock unlimited
+oracle hard memlock unlimited
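To check what a given limits.conf actually sets for one account, something like the following Python sketch could be used. It assumes the usual four-column layout (domain, type, item, value) seen in the diff above; the function name and the sample text are illustrative only, and it does not handle wildcard or group entries.

```python
def limits_for(text, domain):
    """Collect (type, item) -> value for one domain from limits.conf-style text."""
    found = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) == 4 and fields[0] == domain:
            _, kind, item, value = fields
            found[(kind, item)] = value
    return found


# Sample text modelled on the entries in the diff above
sample = """
#@student - maxlogins 4
oracle soft nofile 8192
oracle hard nofile 65535
# oracle soft rss 8192
oracle soft rss unlimited
oracle hard memlock unlimited
"""
settings = limits_for(sample, "oracle")
print(settings[("soft", "rss")])
```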
OK, here is what's going on with this bug:
The system has millions of pages that are mostly active and mapped into hundreds
or even thousands of user address spaces. When kscand runs and decides it needs
to scan the active list to age pages, it takes the zone_lru lock and walks every
page in the active list. For each page it walks the pte_chain, mapping the
highmem ptes into a kernel virtual window to test and clear the reference bit.
Since this is an N x M algorithm where N is millions of pages and M is thousands
of processes, this can result in billions of iterations of that mapping,
testing, clearing and unmapping while holding the zone_lru lock. While this
lock is held, practically all other operations within that memory zone stall.
In order to fix this problem we must make kscand more scalable than it currently
is. It must periodically release the zone_lru lock so other work may progress.
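To make the scale and the proposed fix concrete, here is a rough userspace sketch in Python, not kernel code. The page and mapping counts are illustrative guesses at "millions" and "thousands", and scan_in_batches is a hypothetical name; it only shows the idea of dropping the lock between batches so that waiters can get in.

```python
import threading


def scan_in_batches(pages, lock, batch_size, visit):
    """Walk `pages`, dropping `lock` after every `batch_size` pages.

    Returns the number of times the lock was acquired.
    """
    acquisitions = 0
    for start in range(0, len(pages), batch_size):
        with lock:  # held only for one batch, not the whole list
            acquisitions += 1
            for page in pages[start:start + batch_size]:
                visit(page)
        # the lock is free here, so other threads waiting on it can run
    return acquisitions


# Worst-case work implied by the N x M description (illustrative numbers)
n_pages = 2000000     # "millions of pages"
m_mappings = 1000     # "thousands of processes"
print(n_pages * m_mappings)  # pte visits in one full scan

seen = []
lru_lock = threading.Lock()
print(scan_in_batches(list(range(10)), lru_lock, 4, seen.append))
```

A real list can change while the lock is dropped, so an actual implementation has to cope with pages being moved or freed between batches; the sketch ignores that.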
We are running AS 3 kernel version 2.4.21-22.214.171.124.3.ELsmp, and are running into
the same problem. We have cyclical CPU spikes every 5 minutes (load average
jumps to between 60 & 80).
kscand has consumed over 6 hours CPU over last 8 days.
root 12 1 3 May03 ? 06:53:15 [kscand]
root 27642 8815 0 16:05 pts/9 00:00:00 grep kscand
Another voice from the wilderness: I'm having the same problem with RHEL3 after
installing update 4 including kernel 2.4.21-27.0.4.EL: periodically sluggish
performance with kscand having used 7:57 hours out of 41:00 hours of uptime.
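To put the two kscand figures above in proportion, here is a small Python sketch that turns them into a share of uptime. It assumes "7:57 hours" means 7 hours 57 minutes and approximates "8 days" as 192 hours; the helper names are made up. By this reckoning the second report works out to about 19% of wall-clock time. On a multi-cpu box the share is relative to one cpu.

```python
def to_seconds(clock):
    """Convert 'HH:MM' or 'HH:MM:SS' to seconds."""
    parts = [int(p) for p in clock.split(":")]
    if len(parts) == 2:
        parts.append(0)
    if len(parts) != 3:
        raise ValueError("expected HH:MM or HH:MM:SS")
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def cpu_share(cpu_time, uptime):
    """Fraction of wall-clock uptime spent as cpu time by one process."""
    return to_seconds(cpu_time) / to_seconds(uptime)


# 7:57 hours of kscand cpu time in 41:00 hours of uptime
print(round(100 * cpu_share("7:57", "41:00"), 1))

# 06:53:15 of cpu time over roughly 8 days (approximated as 192 hours)
print(round(100 * cpu_share("06:53:15", "192:00"), 1))
```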
It appears that 2.6 addresses this by changing how much it has to
lock. From RHEL4 2.6.9 vmscan.c:
* zone->lru_lock is heavily contented. We relieve it by quickly privatising
* a batch of pages and working on them outside the lock. Any pages which were
* not freed will be added back to the LRU.
* shrink_cache() adds the number of pages reclaimed to sc->nr_reclaimed
* For pagecache intensive workloads, the first loop here is the hottest spot
* in the kernel (apart from the copy_*_user functions).
If I understand this comment (and the write-up on page 178 of Understanding the
Linux Virtual Memory Manager by Mel Gorman), the idea is that a 'block' of
pages is removed from the list under the lock and processed outside it, thus
freeing the list up more quickly.
I'm not sure how well this strategy would translate to the 2.4.21 kernel. In
2.6 this is being called out of kswapd, but in 2.4.21 this is being done from a
different thread, kscand, which doesn't exist in 2.6. It also appears that the
pagevec structure is a 2.6ism, which means that something would have to be
'invented' for 2.4. The code is sufficiently different between 2.4 and 2.6 that it
would appear that new sections would have to be written.
Just thinking aloud... The options seem to be:
- do something 'similar' to 2.6, where lock is held long enough to move them
quickly out of the way for later processing
- find other places to unlock for short periods
- perhaps 'checkpoint' the scan operation by saving state 'somehow', releasing
the lock and then making kscand runnable again, to pick up where it was before
Just thinking aloud... I welcome all flames...
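For what it's worth, the first option (privatising a batch, as the 2.6 comment describes) can be sketched in userspace Python like this. It is only an illustration of the locking pattern, not kernel code: shrink_batch, try_free and the deque standing in for the LRU list are all made-up names.

```python
import threading
from collections import deque


def shrink_batch(lru, lock, batch_size, try_free):
    """Detach up to `batch_size` pages under the lock, process them outside it.

    Pages that `try_free` rejects are put back on the list.
    Returns the number of pages freed.
    """
    with lock:  # short critical section: only list manipulation
        batch = [lru.popleft() for _ in range(min(batch_size, len(lru)))]

    kept = [page for page in batch if not try_free(page)]  # no lock held

    with lock:  # second short critical section to return the survivors
        lru.extend(kept)
    return len(batch) - len(kept)


lru = deque(range(8))
freed = shrink_batch(lru, threading.Lock(), 4, lambda page: page % 2 == 0)
print(freed, list(lru))
```

The expensive per-page work happens between the two locked sections, so the lock hold time depends on the batch size rather than on the length of the whole list.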
Some customers have indicated that this behavior (long kscand run times) was
introduced somewhere between U2 and U4, but we don't have data to back that up.
The typical reason you see this problem on a larger system running Oracle is
that you are not using hugepages. Without hugepages the system treats System V
shared memory pages as simple mapped file pages. Since the SGA is very large
and active, kscand consumes lots of cpu time deactivating active pages, and since
they are mapped by lots of Oracle processes this activity is not very scalable.
We are working on making kscand more efficient in RHEL3 but there will never be
any substitute for using hugepages when running Oracle. Please try this and let
me know if it works out OK for you.
We spoke with Oracle about moving to hugepages and they indicated that we should
expect a 10-15% performance hit from going to this architecture. This
represents a fairly significant performance loss for us.
Our database environment has only 8GB of RAM on each Oracle RAC cluster member,
and our SGA is currently only set to 1.7GB, which is below the minimum
requirement for hugememory pages. We are generally using less than half the
available RAM on our servers.
That doesn't make sense; hugepages give Oracle a performance boost, not a loss!
Hugepages allow the entire SGA to be mapped into the TLB because the page size
increases by a factor of roughly 1000. As far as the 1.7GB SGA being smaller than
the requirement for hugepages, that's an Oracle limitation and not a kernel limitation.
Either way, we are working on kscand improvements so I do expect to make this
better, but it will never be as good as using hugepages because that eliminates
the kernel's involvement entirely.
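A back-of-the-envelope Python sketch of why the page size matters for an SGA of this size. The page sizes are assumptions, not taken from this report: 4 KiB for normal pages and 2 MiB for huge pages, as on x86 with PAE (4 MiB without), so the exact ratio depends on the architecture and paging mode. The 1741 MiB figure approximates the 1.7GB SGA mentioned above.

```python
def pages_needed(region_bytes, page_bytes):
    """Number of pages (and so mappings) needed to cover a region."""
    return -(-region_bytes // page_bytes)  # ceiling division


KIB = 1024
MIB = 1024 * KIB

sga = 1741 * MIB  # roughly the 1.7GB SGA mentioned above
small = pages_needed(sga, 4 * KIB)   # assumed normal page size
huge = pages_needed(sga, 2 * MIB)    # assumed huge page size
print(small, huge, small // huge)
```

With normal pages every one of those entries exists per mapping process and is a candidate for aging by kscand, which is where the N x M cost described earlier comes from.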
I would hazard a guess that someone has confused hugemem kernel with the
hugepages tunable. There is some performance penalty with the hugemem kernel,
but use of hugepages does not depend on hugemem kernel.
Has this issue been addressed in U6? Just looking for an update. Thanks.
Al, the answer is no.
Larry, could you please clarify why this bug is still in NEEDINFO? What
information are you still waiting for?
The answer is yes, we added a new tunable "/proc/sys/vm/kscand_work_percent".
This tunable defaults to 100 but if one insists on running large Oracle systems
without hugepages, this tunable should be lowered to 10 or so. This will
prevent kscand from holding the zone lru list lock for long periods of time,
thereby allowing other processes to get the lock and run.
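A minimal Python sketch of reading and lowering the tunable. Only the path and the values 100 and 10 come from the comment above; the helper names are made up, and the 1 to 100 range check is an assumption based on the value being a percentage. The demonstration writes to a scratch file so it can run anywhere.

```python
import os
import tempfile


def read_tunable(path):
    """Read an integer tunable from a /proc/sys-style file."""
    with open(path) as handle:
        return int(handle.read().strip())


def write_tunable(path, value):
    """Write an integer percentage; 1..100 is an assumed valid range."""
    if not 1 <= value <= 100:
        raise ValueError("expected a percentage between 1 and 100")
    with open(path, "w") as handle:
        handle.write("%d\n" % value)


# Demonstrated on a scratch file; on an affected system the path would be
# /proc/sys/vm/kscand_work_percent and writing it needs root.
scratch = os.path.join(tempfile.mkdtemp(), "kscand_work_percent")
write_tunable(scratch, 100)  # the default mentioned above
write_tunable(scratch, 10)   # the suggested lower value
print(read_tunable(scratch))
```

A value written this way would not survive a reboot, so it would also have to be set from a boot-time configuration file to be permanent.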
Ernie, the patch tracking file for this bug is
Support for the "kscand_work_percent" tunable was committed to the
RHEL3 U6 patch pool on 15-Jul-2005 (in kernel version 2.4.21-32-12.EL).
PeterM/DonF/TomK, please run this bugzilla through the ack process ASAP
so that I can add it to the RHEL3 U6 advisory.
(In reply to comment #21)
Can we get this patch posted as an attachment please?
Created attachment 117781 [details]
kscand_work_percent sysctl patch committed to RHEL3 U6
Rick, this is the patch that was committed. But we'd rather you test
the RHEL3 U6 beta kernel (2.4.21-34.EL) than something you build manually.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
*** Bug 169547 has been marked as a duplicate of this bug. ***
I have a clean install of RHEL3 U5 and I would like to apply this patch.
Unfortunately, RHEL3 U6 and RHEL4 are not options for me because of the
limitations of my "hardware" platform, namely VMware ESX Server.
When applying the patch, I got some errors in sysctl because I lacked the
oom_kill_limit entry (/* int: limit on concurrent OOM kills */).
I am wondering if there is another patch that I need to apply first, or if this
patch will work without the OOM functionality. (I'm going to try it anyway, but
I'm a newbie, so I don't really know how to test it beyond the obvious, such as
Linux no longer booting up.)
Alter the patch to remove the OOM_KILL_LIMIT line and adjust the line counts:
@@ -160,5 +160,6 @@ enum
VM_STACK_DEFER_THRESHOLD=26, /* int: softirq-defer threshold */
VM_SKIP_MAPPED_PAGES=27,/* int: don't reclaim pages w/active mappings */
+ VM_KSCAND_WORK_PERCENT=29, /* int: % of work on each kscand iteration */