Bug 145950

Summary: high loads / high iowait / up 100% cpu time for kscand on oracle box
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Severity: high
Priority: medium
Status: CLOSED ERRATA
Reporter: gman
Assignee: Larry Woodman <lwoodman>
QA Contact: Brian Brock <bbrock>
CC: acjohnso, andre, bill.irwin, dff, greg.marsden, jonathan, peterm, petrides, redhat, rick.beldin, riel, rtaylor, tao, tkincaid, van.okamura
Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Last Closed: 2005-09-28 14:43:32 UTC
Bug Blocks: 156320
Attachments:
- output of top prior to peak. last one is post peak.
- graph of cpu usage
- graph of load
- outputs of AltSysrq-W and AltSysrq-M
- more sysreq-m and sysreq-w
- kscand_work_percent sysctl patch committed to RHEL3 U6

Description gman 2005-01-24 11:38:31 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
Gecko/20041020 Galeon/1.3.18

Description of problem:
We have RHEL3 U3 installed on a DL380 with two 3.06GHz CPUs and 8GB RAM,
attached to a 1TB FC-attached RAID array. This box is being used as
one of our Oracle 8 boxes.

We had some problems with swapping too much when running
kernel-smp-2.4.21-20.EL. We then upgraded to
kernel-smp-2.4.21-27.0.1.EL, which seemed to solve that swapping problem.

I'm seeing the load slowly rise to 20-40 over a 45-90 minute
interval, then dive down to below 1. The busier Oracle is, the faster it
happens. When the load does go down to 1, system CPU usage also drops
to about 2% per CPU. While the load rises, it stays at about 8% CPU usage.
When the spike hits its peak, you can see kscand take up to 100% of all
the CPUs; iowait also gets really high if it's not kscand. Then the cycle
begins again.


The system becomes more and more unresponsive the closer it gets to
the peak time. We've made some changes to
bdflush/pagecache/inactive_clean_percent, but nothing has helped or
made any difference.

Also, swap is at 25% and still growing.


Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-27.0.1.EL

How reproducible:
Didn't try

Steps to Reproduce:
This has been ongoing since we first booted up with the kernel 4 days ago.

Additional info:

Comment 1 gman 2005-01-25 06:15:59 UTC
Created attachment 110182 [details]
output of top prior to peak. last one is post peak.

output of top prior to peak. last one is post peak.

Comment 2 gman 2005-01-25 06:25:54 UTC
Created attachment 110183 [details]
graph of cpu usage

Here is a cpu usage graph. I read in /proc/stat and let rrd do the graphing

Comment 3 gman 2005-01-25 06:28:15 UTC
Created attachment 110184 [details]
graph of load

same as above except for uptime via /usr/bin/uptime

Comment 4 Larry Woodman 2005-01-26 14:28:58 UTC
gman, can you get me several AltSysrq-M and AltSysrq-W outputs when
you see kscand in this state (eating up lots of CPU time)?  I have
tried to reproduce this problem internally and have never even seen
kscand in a top output!

Alternatively, you can profile the kernel and get even more debug data
for me but that does require a reboot.  I'll send you the instructions
to do this if you can reboot.

Thanks for your help, Larry Woodman


Comment 5 gman 2005-02-04 18:18:45 UTC
The DBA decided to move back to 2.4.21-20.EL and increased shmmax to
4GB. That seemed to help, but the swapping issue may be coming back after
8 days of uptime. I can't get you the info until I can get them to
decide to move to the newer kernel again.

Comment 6 Larry Woodman 2005-02-07 15:13:12 UTC
OK, I can't really make any progress on this issue if they are no
longer running the kernel that this problem was reported on.  I have
not seen any other problem reports on this issue, so I need to rely on
you for help debugging this.

Larry Woodman


Comment 7 gman 2005-02-09 19:57:37 UTC
We just rebooted back into 2.4.21-27.0.1.ELsmp. We saw the spikes right away
after the reboot. I'll have AltSysrq-M and AltSysrq-W outputs sometime today.
I'll see about adding profiling in; can you send instructions on how to do this?

Comment 8 gman 2005-02-11 18:42:12 UTC
Created attachment 110982 [details]
outputs of AltSysrq-W and AltSysrq-M

Here are the outputs of AltSysrq-W and AltSysrq-M. The first one was right at
the large peak that we see in the graphs. The rest are from when kscand was
making the system unresponsive post-peak.

Comment 9 gman 2005-02-15 00:46:31 UTC
Created attachment 111072 [details]
more sysreq-m and sysreq-w

Ran sysrq every minute for 1+ hr. The peak was some time between 14:45 and
15:00, before it started all over again.

Comment 10 gman 2005-04-20 02:53:54 UTC
Found the problem: the DBA requested settings that caused the whole problem. I think
if we had left limits.conf as-is from the default install, we would have been fine.

--- limits.conf.old     2005-04-05 21:03:37.000000000 -0700
+++ limits.conf.new     2005-04-19 19:49:05.000000000 -0700
@@ -40,12 +40,19 @@
 #ftp             hard    nproc           0
 #@student        -       maxlogins       4
 

-oracle       soft    nofile  4096
+
+oracle       soft    nofile  8192
 oracle       hard    nofile  65535

-oracle       soft    rss     8192
-oracle       hard    rss     65535
+# oracle       soft    rss     8192
+# oracle       hard    rss     65535
+oracle       soft    rss     unlimited
+oracle       hard    rss     unlimited
+oracle       soft    stack     unlimited
+oracle       hard    stack     unlimited
+oracle       soft    nproc     32768
+oracle       hard    nproc     65535
+oracle       soft    memlock     unlimited
+oracle       hard    memlock     unlimited

Comment 11 Larry Woodman 2005-05-11 20:48:33 UTC
OK, here is what's going on with this bug:

The system has millions of pages that are mostly active and mapped into hundreds
or even thousands of user address spaces.  When kscand runs and decides it needs
to scan the active list to age pages, it takes the zone_lru lock and walks every
page on the active list.  For each page it walks the pte_chain, mapping the
highmem ptes into a kernel virtual window to test and clear the referenced bit.
Since this is an N x M algorithm, where N is millions of pages and M is thousands
of processes, this can result in billions of iterations of that mapping,
testing, clearing, and unmapping while holding the zone_lru lock.  While this
lock is held, practically all other operations within that memory zone stall.
To fix this problem we must make kscand more scalable than it currently
is.  It must periodically release the zone_lru lock so other work can progress.

Larry Woodman
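
[Editorial note: to make the fix described above concrete, here is a minimal
userspace sketch (pthreads, not the actual RHEL3 kernel patch) of the pattern:
do a bounded slice of the active-list scan per lock hold, then release the lock
so page faults and reclaim in the same zone can proceed. The toy struct zone,
fake_page_aging() and the 10% work fraction are illustrative stand-ins only.]

#include <pthread.h>
#include <stdio.h>

#define ACTIVE_PAGES        1000000L
#define KSCAND_WORK_PERCENT 10   /* cf. the later /proc/sys/vm/kscand_work_percent tunable */

struct zone {
        pthread_mutex_t lru_lock;     /* stands in for the zone_lru lock */
        long            active_pages; /* pages left to scan on the active list */
};

/* stand-in for walking one page's pte_chain and clearing referenced bits */
static void fake_page_aging(void) { }

static void scan_active_list(struct zone *z)
{
        while (z->active_pages > 0) {
                pthread_mutex_lock(&z->lru_lock);

                /* only do a fraction of the remaining work per lock hold */
                long batch = z->active_pages * KSCAND_WORK_PERCENT / 100;
                if (batch < 1)
                        batch = 1;

                for (long i = 0; i < batch && z->active_pages > 0; i++) {
                        fake_page_aging();
                        z->active_pages--;
                }

                /* drop the lock so other work in the zone stops stalling */
                pthread_mutex_unlock(&z->lru_lock);
        }
}

int main(void)
{
        struct zone z = { PTHREAD_MUTEX_INITIALIZER, ACTIVE_PAGES };

        scan_active_list(&z);
        printf("scan complete, %ld active pages left\n", z.active_pages);
        return 0;
}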

Comment 12 Robert Taylor 2005-05-11 23:32:53 UTC
We are running AS 3 kernel version 2.4.21-27.0.2.1.3.ELsmp, and are running into
the same problem.  We have cyclical CPU spikes every 5 minutes (the load average
jumps to between 60 and 80).

kscand has consumed over 6 hours of CPU over the last 8 days.
root        12     1  3 May03 ?        06:53:15 [kscand]
root     27642  8815  0 16:05 pts/9    00:00:00 grep kscand


Comment 13 Bucks vs Bytes Inc 2005-05-18 16:49:35 UTC
Another voice from the wilderness: I'm having the same problem with RHEL3 after
installing Update 4, including kernel 2.4.21-27.0.4.EL: periodically sluggish
performance, with kscand having used 7:57 hours out of 41:00 hours of uptime.

Comment 14 Rick Beldin 2005-05-23 00:52:29 UTC
It appears that 2.6 appears to address this by changing how much they have to
lock.  From RHEL4 2.6.9 vmscan.c: 

/*
 * zone->lru_lock is heavily contended.  We relieve it by quickly privatising
 * a batch of pages and working on them outside the lock.  Any pages which were
 * not freed will be added back to the LRU.
 *
 * shrink_cache() adds the number of pages reclaimed to sc->nr_reclaimed
 *
 * For pagecache intensive workloads, the first loop here is the hottest spot
 * in the kernel (apart from the copy_*_user functions).
 */

If I understand this comment (and the write-up on page 178 of Understanding the
Linux Virtual Memory Manager by Mel Gorman), the idea is that a 'block' of
pages is removed from the list, thus freeing the list up more quickly.

I'm not sure how well this strategy would translate to the 2.4.21 kernel.  In
2.6 this is being called out of kswapd, but in 2.4.21 this is being done from a
different thread, kscand, which doesn't exist in 2.6.  It also appears that the
pagevec structure is a 2.6ism, which means that something would have to be
'invented' for 2.4.  The code is sufficiently different between 2.4 and 2.6 that
it would appear new sections would have to be written.

Just thinking aloud...  The options seem to be: 

- do something 'similar' to 2.6, where the lock is held only long enough to move
pages quickly out of the way for later processing

- find other places to unlock for short periods

- perhaps 'checkpoint' the scan operation by saving state 'somehow', releasing
the lock and then making kscand runnable again, to pick up where it was before

Just thinking aloud... I welcome all flames... 

Some customers have indicated that this behavior (long kscand run times) was
introduced somewhere between U2 and U4, but we don't have data to back that up.
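
[Editorial note: to make the 2.6 idea above concrete, here is a minimal
userspace sketch (again not kernel code) of the privatise-a-batch pattern:
hold the list lock only long enough to detach a small batch onto a private
list, do the per-page work with the lock dropped, and splice anything not
freed back afterwards. The toy struct page/struct lru, the BATCH size and
try_to_free_page() are illustrative stand-ins, not the 2.6 pagevec code.]

#include <pthread.h>
#include <stddef.h>

#define BATCH 32                       /* pages privatised per lock hold */

struct page { struct page *next; };

struct lru {
        pthread_mutex_t lock;          /* the heavily contended list lock */
        struct page    *head;
};

/* stand-in for the real reclaim work; returns 1 if the page was freed */
static int try_to_free_page(struct page *p) { (void)p; return 0; }

static void shrink_list(struct lru *lru, int nr_to_scan)
{
        while (nr_to_scan > 0) {
                struct page *batch = NULL, *keep = NULL, *p, *next;
                int i;

                /* 1. briefly hold the lock, just to detach up to BATCH pages */
                pthread_mutex_lock(&lru->lock);
                for (i = 0; i < BATCH && nr_to_scan > 0 && lru->head; i++) {
                        p = lru->head;
                        lru->head = p->next;
                        p->next = batch;
                        batch = p;
                        nr_to_scan--;
                }
                pthread_mutex_unlock(&lru->lock);

                if (!batch)
                        break;

                /* 2. do the expensive per-page work with the lock dropped */
                for (p = batch; p; p = next) {
                        next = p->next;
                        if (!try_to_free_page(p)) {   /* not freed: keep it */
                                p->next = keep;
                                keep = p;
                        }
                }

                /* 3. splice the surviving pages back onto the lru */
                if (keep) {
                        pthread_mutex_lock(&lru->lock);
                        for (p = keep; p; p = next) {
                                next = p->next;
                                p->next = lru->head;
                                lru->head = p;
                        }
                        pthread_mutex_unlock(&lru->lock);
                }
        }
}

int main(void)
{
        struct page pages[128];
        struct lru lru = { PTHREAD_MUTEX_INITIALIZER, NULL };
        int i;

        for (i = 0; i < 128; i++) {    /* build a toy active list */
                pages[i].next = lru.head;
                lru.head = &pages[i];
        }
        shrink_list(&lru, 128);
        return 0;
}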




Comment 15 Larry Woodman 2005-06-17 14:19:54 UTC
The typical reason you see this problem on a larger system running Oracle is
that you are not using hugepages.  Without hugepages the system treats System V
shared memory pages as simple mapped file pages.  Since the SGA is very large
and active, kscand consumes lots of CPU time deactivating active pages, and since
they are mapped by lots of Oracle processes this activity is not very scalable.
We are working on making kscand more efficient in RHEL3, but there will never be
any substitute for using hugepages when running Oracle.  Please try this and let
me know if it works out OK for you.

Larry Woodman
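
[Editorial note: for anyone wanting to try this, a rough configuration sketch
follows. Verify the tunable name against your kernel: to my knowledge RHEL3's
2.4.21 kernels size the hugepage pool in megabytes via vm.hugetlb_pool, while
2.6-based kernels (RHEL4) use vm.nr_hugepages (a page count); the 2048MB pool
below is just an example sized for a ~1.7GB SGA.]

# /etc/sysctl.conf -- reserve a hugepage pool large enough for the SGA
vm.hugetlb_pool = 2048      # MB on RHEL3; 2.6 kernels use vm.nr_hugepages instead

# apply and verify the pool was actually reserved
sysctl -p
grep -i huge /proc/meminfo

# /etc/security/limits.conf -- let the oracle user lock shared memory
oracle    soft    memlock    unlimited
oracle    hard    memlock    unlimited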


Comment 16 Robert Taylor 2005-06-17 16:47:12 UTC
We spoke with Oracle about moving to hugepages and they indicated that we should
expect a 10-15% performance hit from going to this architecture.  This
represents a fairly significant performance loss for us.

Our database environment has only 8GB of RAM on each Oracle RAC cluster member,
and our SGA is currently only set to 1.7GB, which is below the minimum
requirement for huge memory pages.  We are generally using less than half the
available RAM on our servers.

Comment 17 Larry Woodman 2005-06-17 17:39:52 UTC
That doesn't make sense; hugepages give Oracle a performance boost, not a loss!
Hugepages allow the entire SGA to be mapped into the TLB because the page size
increases by roughly a factor of 1000.  As for the 1.7GB SGA being smaller than
the requirement for hugepages, that's an Oracle limitation and not a kernel
limitation.

Either way, we are working on kscand improvements, so I do expect to make this
better, but it will never be as good as using hugepages because that eliminates
the kernel's involvement entirely.

Larry Woodman
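
[Editorial note: to put rough numbers on that (illustrative arithmetic, assuming
4KB base pages and the 2MB/4MB huge page sizes used on i686): a 1.7GB SGA is
roughly 445,000 4KB pages, but only about 870 huge pages at 2MB (or about 435 at
4MB), so it fits in far fewer TLB entries, and hugetlb pages are not on the
active list for kscand to scan at all.]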


Comment 18 Rick Beldin 2005-06-20 11:57:48 UTC
I would hazard a guess that someone has confused the hugemem kernel with the
hugepages tunable.  There is some performance penalty with the hugemem kernel,
but use of hugepages does not depend on the hugemem kernel.

Comment 19 AJ Johnson 2005-08-15 19:38:58 UTC
Has this issue been addressed in U6?  Just looking for an update.  Thanks.

Comment 20 Ernie Petrides 2005-08-15 20:14:07 UTC
Al, the answer is no.

Larry, could you please clarify why this bug is still in NEEDINFO?  What
information are you still waiting for?


Comment 21 Larry Woodman 2005-08-15 20:44:59 UTC
The answer is yes, we added a new tunable, "/proc/sys/vm/kscand_work_percent".
This tunable defaults to 100, but if one insists on running large Oracle systems
without hugepages, it should be lowered to 10 or so.  This will prevent kscand
from holding the zone lru list lock for long periods of time, thereby allowing
other processes to get the lock and run.

Larry Woodman

Ernie, the patch tracking file for this bug is

>>>1036.lwoodman.kscand-work-percnt.patch
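
[Editorial note: for reference, this is how the tunable would be applied on a
kernel that has it; the default of 100 and the suggested value of 10 come from
the comment above, and the vm.kscand_work_percent sysctl name simply mirrors
the /proc path.]

# check the current value (defaults to 100)
cat /proc/sys/vm/kscand_work_percent

# lower it at runtime on a large Oracle system without hugepages
echo 10 > /proc/sys/vm/kscand_work_percent

# or persist it across reboots via /etc/sysctl.conf
vm.kscand_work_percent = 10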

Comment 22 Ernie Petrides 2005-08-15 21:48:54 UTC
Support for the "kscand_work_percent" tunable was committed to the
RHEL3 U6 patch pool on 15-Jul-2005 (in kernel version 2.4.21-32-12.EL).

PeterM/DonF/TomK, please run this bugzilla through the ack process ASAP
so that I can add it to the RHEL3 U6 advisory.


Comment 23 Rick Beldin 2005-08-15 22:09:20 UTC
(In reply to comment #21)

> >>>1036.lwoodman.kscand-work-percnt.patch

Can we get this patch posted as an attachment please? 

Thanks,

Rick

Comment 24 Ernie Petrides 2005-08-16 01:09:58 UTC
Created attachment 117781 [details]
kscand_work_percent sysctl patch committed to RHEL3 U6

Rick, this is the patch that was committed.  But we'd rather you test
the RHEL3 U6 beta kernel (2.4.21-34.EL) than something you build manually.

Comment 28 Red Hat Bugzilla 2005-09-28 14:43:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html


Comment 29 Ernie Petrides 2005-10-07 06:09:19 UTC
*** Bug 169547 has been marked as a duplicate of this bug. ***

Comment 30 Jonathan Liu 2006-04-06 15:44:53 UTC
I have a clean install of RHEL3 U5 and I would like to apply this patch.

Unfortunately, RHEL3 U6 and RHEL4 are not options for me because of the
limitations of my "hardware" platform, namely VMware ESX Server.

When applying the patch, I got some errors in the sysctl code because I lacked
oom_kill_limit (/* int: limit on concurrent OOM kills */).

I am wondering if there is another patch that I need to apply first, or if this
patch will work without the OOM functionality.  (I'm going to try it anyway, but
I'm a newbie, so I don't really know how to test it beyond the obvious, like
Linux no longer booting up.)

Comment 31 Larry Woodman 2006-04-06 16:00:56 UTC
Alter the patch to remove the OOM_KILL_LIMIT line and adjust the line counts:

--------------------------------------------------------------------------------
--- linux-2.4.21/include/linux/sysctl.h.orig
+++ linux-2.4.21/include/linux/sysctl.h
@@ -160,5 +160,6 @@ enum
 	VM_STACK_DEFER_THRESHOLD=26, /* int: softirq-defer threshold */
 	VM_SKIP_MAPPED_PAGES=27,/* int: don't reclaim pages w/active mappings */
+	VM_KSCAND_WORK_PERCENT=29, /* int: % of work on each kscand iteration */
 };