Bug 100680

Summary:	kscand eating lots of cpu
Product:	[Retired] Red Hat Linux	Reporter:	Janne Pikkarainen <jabapi>
Component:	kernel	Assignee:	Dave Jones <davej>
Status:	CLOSED ERRATA	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	7.3	CC:	alex, bastian.zacher, bbaetz, blouin, carl, david, davidy, djfoxy, dmichaud, pfrields, riel, scorpionlab, smoyse, steveh, yusufg
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-01-05 03:26:02 UTC	Type:	---

Description Janne Pikkarainen 2003-07-24 09:45:08 UTC

Description of problem:

First of all, even though our server is running Red Hat 7.3, I think this very
same problem could be reproduced with any current version, so please keep on
reading. 

One of our mail servers (IBM xSeries 350, 2x700 MHz Pentium III, 5 GB RAM) has
been behaving quite strangely after a memory upgrade: a kernel process called
kscand is continuously eating up tens of percent of precious cpu time.

Here's a snippet from the top command:

---
CPU0 states: 11.0% user, 73.0% system,  8.0% nice, 15.0% idle
CPU1 states: 12.0% user, 67.0% system,  8.0% nice, 20.0% idle
Mem:  4801720K av, 4791056K used,   10664K free,       0K shrd,   42720K buff
Swap: 2096252K av,   79612K used, 2016640K free                 4558552K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
    6 root      20   0     0    0     0 RW   49.5  0.0 19599m kscand
---

Before we pumped up the server's physical memory beyond the 4 GB limit,
everything was ok. After the upgrade I assumed that kscand was supposed to
behave like this, but according to Google this may not be the case:

http://groups.google.fi/groups?q=kscand+cpu&hl=fi&lr=&ie=UTF-8&oe=UTF-8&selm=11b6e551.0307080633.76d14e05%40posting.google.com&rnum=3

Should someone calm down kscand a bit? Or should someone calm me down instead? :-)


Version-Release number of selected component (if applicable):

Linux xxxxxxxxxxx 2.4.20-13.7bigmem #1 SMP Mon May 12 11:27:47 EDT 2003 i686 unknown

The Linux distribution is Red Hat 7.3 with all the up2date patches applied. I
think this same bug could be reproduced with any current version, though.

Thank you in advance!


Additional info:

I will gladly provide any additional information you may need.

Comment 1 David Yerger 2003-08-06 15:11:38 UTC

I'd like to throw in a "Me Too" - I'm running 2.4.20-19.7 on RH 7.2,
having between 20-50% cpu being used by kscand when system is being moderately
used, running Intersystems' Caché, after memory upgrade to 2GB from 1GB.

Half of that is allocated as shared memory for Caché.

Was about to say that no swap is being used, but I see now that there is some
being used. (atop says swin is 0, swout is ~ 10)

Comment 2 David Yerger 2003-08-30 00:20:37 UTC

This *may* have been fixed for me with the 2.4.20-20.7 kernel. Doing the same
work, kscand appears to stay mostly < 10%; the only difference is that almost
nobody else is using the machine at all, just my database running an integrity
check and me on ssh.

Comment 3 Janne Pikkarainen 2003-08-30 17:10:26 UTC

Ever since my post I've also noticed the following things:

- kscand cpu usage jumps WAY higher (>70%) whenever it's time for daily
or full backups with dump (the big partition uses ext3 with data=ordered as
its filesystem)

- During normal usage (outside backups) kscand uses anywhere between 10 and 50
percent of cpu.

Recently we also had to reboot the server... or actually it rebooted itself
after a power outage, and since then it's been running a kernel without >4GB
support, so nowadays it's using only 4GB of memory. kscand is still using high
amounts of cpu, so I think that rules out PAE in this case... I wonder what is
making this thing so resource-consuming. I will install a kernel that we
already know behaves well on this server (2.4.20 with ptrace-patch and
grsecurity) and tell you if there is any difference from this RH kernel.

Comment 4 Simon Karpen 2003-09-06 12:15:42 UTC

I can confirm that this bug exists in kernel 2.4.20-19 on both Redhat 7.3 and
Redhat 8, and appears to be fixed on both platforms in kernel 2.4.20-20. 

I can also confirm that it only appears to be an issue if HighMem is in use (as
in, machines with more than 896MB of RAM)
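
One way to check the HighMem observation on a given box is to look at the HighTotal line that highmem-enabled 32-bit kernels report in /proc/meminfo. The following is only a minimal sketch: the function names and the sample text are hypothetical, and the sample values are not taken from any machine in this report.

```python
import re

def highmem_total_kb(meminfo_text):
    """Return the HighTotal value (kB) from /proc/meminfo-style text, or 0 if absent."""
    match = re.search(r"^HighTotal:\s+(\d+)\s*kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else 0

def uses_highmem(meminfo_text):
    """True when the kernel reports a non-zero HighMem zone."""
    return highmem_total_kb(meminfo_text) > 0

# Hypothetical sample text for illustration only.
SAMPLE = """MemTotal:      2064508 kB
HighTotal:     1179072 kB
LowTotal:       885436 kB
"""

print(uses_highmem(SAMPLE))
```

On a live system the same functions could be fed open("/proc/meminfo").read() instead of the sample string.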

Comment 5 Simon Karpen 2003-09-07 14:04:01 UTC

As an addendum to my comment on this bug:

- The Bug is not completely fixed in 2.4.20-20, but it is definitely less pronounced
- My backups, rsyncs, etc. are now taking a normal amount of time, but kscand
CPU usage is still excessive.

Comment 6 Seth Vidal 2003-09-10 16:11:29 UTC

I can confirm this on 7.3 with 2.4.20-18.7 kernel. We've got some scheduled
downtime coming up to upgrade this system. We'll be adding a processor and
bumping to a newer kernel. We'll see if they resolve this problem.

Thanks

Comment 7 Ricardo Marcelo 2003-09-10 18:21:44 UTC

Hi kscand problem friends,
-------------------------------
using RedHat 8.0 with 
kernel 2.4.20.8smp, 
system is a dual Athlon MP2000.
-------------------------------
 
After a complete crash, the system was sent for repair (motherboard circuit
problems). When it came back, the memory had been changed (both the modules
and the layout: before it was two 521Mb DDR, now it is 2x256Mb + 1x512Mb), and
so had the cpus and of course the motherboard (AMD760/768).
lm_sensors shows ALARM in the VCore 1 and 2, -5V and fan3 fields. I'm not sure,
but something wrong is happening; the BIOS screen shows everything as fine...
After looking through this bug forum I recognized the same problems occurring
here: crash after crash, always happening with my very intensive users. So I
started to run some tests. The first was a waste_cpu fortran program... nothing
happened. After finding this forum I wrote a waste_io fortran program, and the
problem happened!
The kernel crash dump EIP report shows kscand...

BTW, the swap was created as a single 2048Mb partition; should I split it?

Comment 8 Ricardo Marcelo 2003-09-10 21:05:08 UTC

Please let me know if a set of different EIP reports (oops logs) would be useful.
I have at least seven...
As I said, they differ, but almost all concern the memory map; just one
reports a "BUG in smp.c".

Regards,
Ricardo.

Comment 9 David Yerger 2003-09-10 21:08:43 UTC

FWIW, mine is a single processor Athlon system, and IO-APIC is turned off in the
stock RedHat Athlon kernel AFAIK.

Comment 10 david 2003-09-13 03:50:04 UTC

Yes we have also observed a server with dual Xeon 2.4 GHz, 4 gigs of ram, Mylex
raid array running Redhat 7.2 with kernel 2.4.20-20.7smp have kscand use a large
amount of cpu resources and start using the swap file when there was still free
memory.  After the server has been up for a while, the performance steadily
decreases and the server keeps running slower and slower.

Reverting back to 2.4.18-27.7.xsmp the server runs normally without any memory
issues or excessive kscand/kswapd usage.

Comment 11 david 2003-09-13 03:58:18 UTC

Here's another one.

http://linux.derkeiler.com/Mailing-Lists/RedHat/2003-07/0053.html

Comment 12 Janne Pikkarainen 2003-09-13 10:50:45 UTC

Meanwhile I've tried kernel 2.6.0-test4, my own grsecurity-patched
vanilla-2.4.20 kernel and RH 2.4.18-27.7 - they all seem to resolve this problem.
Booting back into 2.4.20-13.7bigmem (which I mentioned in my original post)
brings the problem right back. kscand usage is excessively high during backup
operation and backup also takes about five times longer than with other kernels.

Comment 13 Andrew 2003-09-17 01:23:42 UTC

Anyone know of any fixes for this issue? Same issue is happening on a dual 
Xeon 2.4ghz, 6gigs ram... Kernel 2.4.20-20.7bigmem.

Comment 14 Steven Moyse 2003-09-17 02:37:35 UTC

I now have three customers with this issue, ranging from a HP 8xPIII with
8Gb RAM to an IBM 2xP4ht with 4Gb RAM. 
In all cases kscand has consumed somewhere between 25-50% of uptime in CPU time.
The only reliable solution is to roll back the kernel to 2.4.18-27 which is what
I have done in all three cases.

Comment 15 Yusuf Goolamabbas 2003-09-21 12:31:19 UTC

Could people who are seeing kswapd/kscand issues with 2.4.20-xx kernels on RH 7.3
try the test scripts mentioned in bug 104562
(https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=104652)
and see if they get similar behaviour, and whether it goes away with the
2.4.18-xx kernels?

It would be nice to get this and the bugs marked with (VM) in summary fixed for
Cambridge

Comment 16 Seth Vidal 2003-09-23 14:44:40 UTC

It doesn't seem like this bug and 104562 are related. When I asked sct about
this on one of the many lists he said he didn't think they were related. Add to
that, my inode_cache doesn't appear to be rising, and something is just weird.

I updated to 2.4.20-20.9 on 7.3 and I see the same behavior as on any of the
2.4.20's.

I'm going to be putting a second processor in this machine to see if this server
has just suddenly become loaded more heavily, though I doubt this will have the
desired effect. If it doesn't, I'm going to try some of the other suggestions in
this and other bugs about changing the bdflush settings.

Any other good test advice for a 7.3 box?
Thanks

Comment 17 Janne Pikkarainen 2003-09-23 21:27:24 UTC

This time running 2.4.20-18.7bigmem, and kscand is still a wild boy. I know, I 
should try a more recent kernel but I'd like to mention one thing I've 
observed: buffered mem stays at relatively low level whilst running the buggy 
kernels. I think that normally our server likes to have 200-300 megabytes of 
buffered mem, but nowadays it looks like this: 
 
--- 
            total       used       free     shared    buffers     cached 
Mem:       4801720    4790768      10952          0      28224    4565276 
-/+ buffers/cache:     197268    4604452 
Swap:      2096252      53160    2043092 
--- 
 
I don't know if that's a coincidence or a lead to how to track down this el 
bastardo bug :-), but still it seems weird to me that there's almost 10x  
difference in buffer usage between different kernels. I'll check after next 
reboot what happens with a working kernel, does the bufmem usage jump up or 
not.
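
The "almost 10x" figure above can be checked with a little arithmetic on the free output. This is only a rough sketch: the helper name is hypothetical, and the 250 MB "typical" value is an assumed midpoint of the 200-300 megabyte range mentioned in the comment, not a measured number.

```python
def buffers_kb(free_output):
    """Return the 'buffers' column (kB) from the Mem: line of old-style free output."""
    for line in free_output.splitlines():
        fields = line.split()
        # Columns: Mem: total used free shared buffers cached
        if fields and fields[0] == "Mem:":
            return int(fields[5])
    raise ValueError("no 'Mem:' line found")

# The free output quoted in the comment above.
FREE_OUTPUT = """\
            total       used       free     shared    buffers     cached
Mem:       4801720    4790768      10952          0      28224    4565276
-/+ buffers/cache:     197268    4604452
Swap:      2096252      53160    2043092
"""

TYPICAL_BUFFERS_MB = 250  # assumption: midpoint of the 200-300 MB range
ratio = TYPICAL_BUFFERS_MB * 1024 / buffers_kb(FREE_OUTPUT)
print("buffers: %d kB, roughly %.1fx below the assumed typical level"
      % (buffers_kb(FREE_OUTPUT), ratio))
```

With the assumed midpoint the ratio comes out at about 9x, which is consistent with the "almost 10x" estimate.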

Comment 18 Yusuf Goolamabbas 2003-09-24 04:54:02 UTC

Seth, have you tried playing with /proc/sys/vm/{min,max}-readahead? From what I
can determine, the values changed between 2.4.18-xx and 2.4.20-xx. It used to be
3,127 in 18-xx and now it's 3,31 in 20-xx.

I've seen read speeds doubled if I modify this to 31,127 for 20-xx kernels (it
doesn't make the kscand/kswapd issue go away but seems to lessen it).

Hopefully Arjan,Rik can have some test kernels for us to try
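
For anyone wanting to try the readahead values mentioned above, here is a minimal sketch of reading and setting the two tunables. The helper names are hypothetical, and the example deliberately runs against a scratch directory rather than the live /proc tree; on a real 2.4 system the default vm_dir would be used and root privileges would be required.

```python
import os
import tempfile

def read_readahead(vm_dir="/proc/sys/vm"):
    """Return (min-readahead, max-readahead) as integers."""
    values = []
    for name in ("min-readahead", "max-readahead"):
        with open(os.path.join(vm_dir, name)) as handle:
            values.append(int(handle.read().strip()))
    return tuple(values)

def write_readahead(min_value, max_value, vm_dir="/proc/sys/vm"):
    """Write both tunables, refusing an inverted pair."""
    if min_value > max_value:
        raise ValueError("min-readahead must not exceed max-readahead")
    for name, value in (("min-readahead", min_value),
                        ("max-readahead", max_value)):
        with open(os.path.join(vm_dir, name), "w") as handle:
            handle.write("%d\n" % value)

# Demonstration against a scratch directory, using the 31,127 pair from above.
scratch = tempfile.mkdtemp()
write_readahead(31, 127, vm_dir=scratch)
print(read_readahead(vm_dir=scratch))
```

Values written under /proc/sys are normally picked up by the running kernel without a reboot, but they are not persistent across reboots.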

Comment 19 Seth Vidal 2003-09-24 04:59:48 UTC

I have to profess a certain amount of ignorance about these options. What do
they end up affecting - and what was the basis for them being changed in the
first place? 

I can give them a try. Are they dynamically updated? If I change them right now
will it affect the system immediately or will I need to reboot? I am distinctly
ignorant about the vm system changes that have gone in in the not-so-distant past
and would appreciate pointers on where to read up on this info.

Thanks

Comment 20 Rik van Riel 2003-09-24 14:46:06 UTC

My recommendation to davej would be to simply take most of the VM fixes from Taroon.

Comment 21 Seth Vidal 2003-09-24 16:05:16 UTC

couple of comments:
max-readahead has not improved the situation

suggestion from riel on irc was - rebuild 2.4.20-20.9 with rmap-updates patch
included.

Trying to do that now but hitting some snags when moving that patch around.

Comment 22 Seth Vidal 2003-09-25 01:40:01 UTC

ok got it patched.
I had to include:
rmap-updates
inode-patches
akmp-elevator patch

You can grab the patched kernels or src.rpm from here:
http://linux.duke.edu/~skvidal/RPMS/
http://linux.duke.edu/~skvidal/RPMS/kernel-2.4.20-20.7.duke.1.src.rpm
http://linux.duke.edu/~skvidal/RPMS/kernel-smp-2.4.20-20.7.duke.1.i686.rpm
http://linux.duke.edu/~skvidal/RPMS/kernel-bigmem-2.4.20-20.7.duke.1.i686.rpm

I've tested them on one machine so I know they work but I don't yet know if this
solves the problem.

Comment 23 Seth Vidal 2003-09-26 04:34:10 UTC

I'm testing this on a production machine now. I'll let you know more as I see it.

But one thing of note, so far, kscand is not sitting pegged at the top of the
cpu utilization chart.

Maybe a good sign, maybe too early to tell.

Comment 24 Seth Vidal 2003-09-26 20:35:17 UTC

so the system is running 'better' It's not completely great but kscand doesn't
stay pegged at the top. So this problem, for me, might just be overload on this
system.

I'll keep watching it - but in 2 weeks I'm adding a second processor to this
system and I'll take another look then to see if the problem goes away entirely.

If the others of you could test the kernels I posted I'd love to know your
experiences too.

Thanks

Comment 25 Seth Vidal 2003-09-29 13:57:38 UTC

System still running. It's better than it was, but not great. If any of the rest
of the people affected by this problem could test the kernel rpms I posted I'd
appreciate it.

I'd like to see some other folks' experiences to find out whether some of my
problems with this system are just load-related and not also kernel-related.

thanks

Comment 26 Peter J. DiVerde 2003-09-29 17:54:36 UTC

Your patches seemed to improve the situation here (P4 3.06 with HT on, 1 gig of
RAM, under very heavy CPU and I/O load). We've been torture testing 2 machines
for the past week only to see kscand riding high on cpu usage followed by kswapd
after enough time had elapsed. This, of course, grinds the machines into the
ground after a few hours.

We've been running with a P4 compiled smp version of 2.4.20-20.7.duke.1 for 26
hours now - kscand has hardly used any CPU time (12:14) and kswapd (2:10) even
less. Definitely an improvement.

Comment 27 Yusuf Goolamabbas 2003-09-30 03:13:31 UTC

We ran it on a webserver (apache/mod_perl). Swap usage was around 500MB; a 2G
swap partition is defined and physical ram on the box is 2G. kscand shows up
regularly at 7-8%, and kswapd is no longer visible via top. The ops guys feel
it's better than before; they are going to run more tests.

davej, your diary mentions some VM hacking you have done on RH 7.x/8 kernels.
Any chance you can make those test kernels available?

Comment 28 Gunther Schadow 2003-10-23 02:40:19 UTC

We are suffering too. I tried Seth Vidal's kernel and it might be a tad better 
but not good. Our server is a 4 CPU Xeon (ProLiant) with 20 GB of memory. We 
are running Oracle on it and it's not doing well at all. The system's 
performance degrades within an hour or two at which point the system is 
consumed by kscand and kswapd activity.

There is no reason for our system to thrash as it has enough memory. When I do 
ps -ax -o rss and sum up those numbers I get to roughly 1 GB of total memory 
used by processes, add to that 1.5 GB more for shared memory and it still 
isn't enough to warrant swapping. Vmstat shows 19GB in the "cache" category 
and perhaps 20 MB free. Vmstat says it's not a lot of swapping, but kscand and 
kswapd's activity consume 1.5 CPUs alone. They regularly bring the entire 
system to a halt for a second or so every 10 seconds. IO performance degrades 
very badly when that happens.
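
The RSS tally described above can be reproduced with a few lines of script. This is only a sketch: the function names are hypothetical and the sample input is made up for illustration. Note also that RSS counts shared pages once per process that maps them, so the sum tends to overstate real usage rather than understate it.

```python
import subprocess

def sum_rss_kb(ps_output):
    """Sum the numeric lines of `ps -ax -o rss` output (values are in kB)."""
    total = 0
    for line in ps_output.splitlines():
        field = line.strip()
        if field.isdigit():  # skips the "RSS" header and any blank lines
            total += int(field)
    return total

def current_rss_total_kb():
    """Run ps on the local machine and return the summed RSS in kB."""
    result = subprocess.run(["ps", "-ax", "-o", "rss"],
                            capture_output=True, text=True, check=True)
    return sum_rss_kb(result.stdout)

# Hypothetical sample output, for illustration only.
SAMPLE = "  RSS\n 1024\n 2048\n    0\n"
print(sum_rss_kb(SAMPLE))
```

Calling current_rss_total_kb() on the affected server would give the figure to compare against the "cache" column reported by vmstat.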

I suspect there is some serious internal memory leak that is aggravated by lots 
of memory. The more memory you install the more gets lost and the more has to 
be managed and scanned by kscand.

When I stop Oracle, it is getting a bit better, but not good. Right now I have 
4 tar |gzip process-pairs going (they do not use much memory) and kscand 
consumes 40% CPU.

My worry is this: when will this be resolved? This problem must have been 
lingering for many months now and yet it seems like few folks even know about 
this. Is this known to be resolved even in RHat Enterprise 3?

Thanks

Comment 29 Bastian Zacher 2003-11-03 08:00:00 UTC

I can confirm this, too. Our production system (compaq, 2gb ram) also
suffers from kscand. It's not as bad as described above, but the
daemon is mostly at the top of "top".
I have already booted the patched "bigmem" kernel posted a few comments back;
kscand is now quieter but not calm.
After analyzing the machine, I determined that the swap is set to 1gb.
Since the first installation of RH (originally with 512MB installed) the
memory has been upgraded to 2gb.
The swap should be 2xRAM, so the swap is too small for this configuration.
Maybe this is related to the problem described?

Can anybody confirm my guess?

Thanks
bastian

Comment 30 dmichaud 2003-11-04 13:36:15 UTC

That's not it Bastian.. I have a system with 1GB RAM and 2GB swap, and
I'm still seeing kscand eat up a lot of cpu with a bigmem enabled
kernel (it's enabled to ease memory expansion).

Is anyone working on this bug? My fear is that it's falling by the
wayside because it's in the 7.3 project, even though it is also current
for newer distro releases.

Comment 31 Seth Vidal 2003-11-04 18:50:18 UTC

I'd be kinda curious what the results would be of rebuilding the errata
AS 2.1 kernel on 7.3 - how much stuff would be different or bad, but
more importantly whether this problem would still be present.

Comment 32 Jean Blouin 2003-11-05 23:22:52 UTC

We are suffering from this also. About every 10 seconds kscand seems to
request cpu time; the usage seen in top is low every time, but it seems
to affect the system performance.

Running RH 8.0 kernel 2.4.20-19.8smp

Comment 33 Seth Vidal 2003-11-18 14:12:25 UTC

Additional comment: I've been running the kernels I posted above for a
while now and the problem isn't gone. I still get randomly high
spikes of kscand activity and the load will jump from < 1 to 12-15+ in
seconds.

The system will appear to stall during this time (and the nfs clients
it is serving will stall too) and then it will recover and move along
again.

Any additional ideas of where to check?

Comment 34 Brian Lowrance 2003-11-21 16:01:08 UTC

   6 root      15   0     0    0     0 SW    1.1  0.0 531:59 kscand

Dual Xeon 1GB Ram...has been up for almost 16 days.  Finally decided 
to see what was causing a crash every once in a while--and I think 
this is it.  Bugfix would be nice :)
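
To compare reports like the one above, the accumulated TIME that top shows for kscand can be expressed as a share of uptime. This is a rough sketch under stated assumptions: the function names are hypothetical, TIME is assumed to be either minutes:seconds (as in "531:59") or whole minutes with an "m" suffix (as in "19599m"), and the uptime is rounded to the "almost 16 days" quoted here.

```python
def top_time_to_minutes(value):
    """Convert a top TIME field such as '531:59' or '19599m' to minutes."""
    if value.endswith("m"):
        return float(value[:-1])
    minutes, seconds = value.split(":")
    return int(minutes) + int(seconds) / 60.0

def cpu_share_of_uptime(top_time, uptime_days):
    """Fraction of wall-clock uptime spent in the process (one CPU's worth)."""
    return top_time_to_minutes(top_time) / (uptime_days * 24 * 60)

share = cpu_share_of_uptime("531:59", 16)
print("%.1f%%" % (share * 100))  # prints 2.3%
```

By this measure the machine above averages a little over 2% of one CPU in kscand, well below the 25-50% share reported in comment 14, although an average says nothing about short spikes.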

Comment 35 David Yerger 2003-11-21 16:30:05 UTC

I gave up and upgraded to RH9, kernel 2.4.20-20.9.XFS1.3.1 (from SGI's
site), top shows that kscand has been replaced with kscand/HighMem and
kscandd/LowMem, and I don't see either of them going over 1% even with
the CPU 50% loaded.  I doubt it's anything SGI did, I'm only running
their kernel for the XFS support.

Comment 36 David Yerger 2003-11-21 16:32:28 UTC

Sorry, kscand/HighMem and kscand/Normal not LowMem.

Comment 37 Seth Vidal 2003-12-16 20:57:18 UTC

Interesting.
Could we be tipped off what is ON_QA? What changes did you make?

Comment 38 Dave Jones 2003-12-17 14:25:41 UTC

You can grab what we have so far at..
http://people.redhat.com/davej/rhl-errata/

Comment 39 Seth Vidal 2003-12-17 23:17:33 UTC

Oddity noted once new kernel was installed:

I got this error from sar(sa1):

Cannot append data to that file

Nothing else a problem so far, rebooting onto a production server
tonight but this is what I have so far.

I'm not sure this problem is related but it is the only change on the
machine and then this error.

Very odd.

Comment 40 Seth Vidal 2003-12-18 16:03:08 UTC

ok, Been running on a production machine since last night around
midnight. Load is better than before and the loads aren't spiking
around as much as they were. These are all good things.

Additionally kscand isn't staying pegged at the top.
I'll post more as I see it.

oh and disregard my last post, seems like a fluke on the one test machine.

Comment 41 Joris Crul 2003-12-25 13:15:25 UTC

Does anyone have the errata which was located at
http://people.redhat.com/davej/rhl-errata/ ?
The site is offline at the moment, and I really need the updates
so I can test whether they work with our server.

Comment 42 Seth Vidal 2003-12-25 14:18:15 UTC

They've been released as errata.
https://rhn.redhat.com/errata/RHBA-2003-394.html