Bug 100680
Summary: | kscand eating lots of cpu | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Janne Pikkarainen <jabapi> |
Component: | kernel | Assignee: | Dave Jones <davej> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 7.3 | CC: | alex, bastian.zacher, bbaetz, blouin, carl, david, davidy, djfoxy, dmichaud, pfrields, riel, scorpionlab, smoyse, steveh, yusufg |
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2004-01-05 03:26:02 UTC | Type: | --- |
Description
Janne Pikkarainen
2003-07-24 09:45:08 UTC
I'd like to throw in a "Me Too" - I'm running 2.4.20-19.7 on RH 7.2, with between 20-50% cpu being used by kscand when the system is moderately used, running Intersystems' Caché, after a memory upgrade to 2GB from 1GB. Half of that is allocated as shared memory for Caché. I was about to say that no swap is being used, but I see now that there is some being used. (atop says swin is 0, swout is ~ 10)

This *may* have been fixed for me with the 2.4.20-20.7 kernel - doing the same work, it appears that kscand is mostly staying < 10%. The only difference is that almost nobody else is using the machine at all, just my database running an integrity check and me on ssh.

Ever since my post I've also noticed the following things:
- kscand cpu usage jumps WAY higher (>70%) whenever it's time for daily or full backups with dump (the big partition is using ext3 with data=ordered as filesystem)
- During normal usage (outside backups) kscand spends anything between 10-50 percent of cpu.

Recently we also had to reboot the server... or actually it rebooted itself after a power outage, and after that it's been using a kernel without >4GB support, so nowadays it's using only 4GB of memory. kscand is still using high amounts of cpu, so I think that takes PAE out of the question in this case... I wonder what is making this thing so resource-consuming. I will install a kernel we already know to be well-behaved on this server (2.4.20 with ptrace-patch and grsecurity) and tell you if there is any difference to this RH-kernel.

I can confirm that this bug exists in kernel 2.4.20-19 on both Redhat 7.3 and Redhat 8, and appears to be fixed on both platforms in kernel 2.4.20-20. I can also confirm that it only appears to be an issue if HighMem is in use (as in, machines with more than 896MB of RAM)

As an addendum to my comment on this bug:
- The bug is not completely fixed in 2.4.20-20, but it is definitely less pronounced
- My backups, rsyncs, etc. are now taking a normal amount of time, but kscand CPU usage is still excessive.

I can confirm this on 7.3 with the 2.4.20-18.7 kernel. We've got some scheduled downtime coming up to upgrade this system. We'll be adding a processor and bumping to a newer kernel. We'll see if they resolve this problem. Thanks

Hi kscand problem friends,
-------------------------------
using RedHat 8.0 with kernel 2.4.20.8smp, system is an Athlon MP2000 dual.
-------------------------------
After a complete crash, the system was sent for repair (motherboard circuit problems). When it came back, the memory had been changed (both the modules and the layout: before it was two 521Mb DDR and now it is 2x256Mb + 1x512Mb), the CPUs had been changed, and of course the motherboard (AMD760/768). lm_sensors shows ALARM in the VCore 1 and 2, -5V and fan3 fields. I'm not sure, but something wrong is happening; the BIOS screen shows it as fine... After reading through this bug I recognize the same problems occurring on my system. Crash after crash, always happening with my very intensive users. So I started running some tests. The first was a waste_cpu fortran program... nothing happened. After finding this bug I wrote a waste_io fortran program and the problem happened! The kernel crash dump EIP report shows kscand... BTW, the swap was created as a single 2048Mb partition, should I split it? Please let me know if a group of different EIP (oops logs) could be useful. I have at least seven... As I said, they differ, but almost all concern the memory map; just one reports a "BUG in smp.c". regards, Ricardo.

FWIW, mine is a single processor Athlon system, and IO-APIC is turned off in the stock RedHat Athlon kernel AFAIK.

Yes, we have also observed a server with dual Xeon 2.4 GHz, 4 gigs of ram, and a Mylex raid array, running Redhat 7.2 with kernel 2.4.20-20.7smp, have kscand use a large amount of cpu resources and start using the swap file when there was still free memory.
After the server has been up for a while, the performance steadily decreases and the server keeps running slower and slower. After reverting to 2.4.18-27.7.xsmp the server runs normally without any memory issues or excessive kscand/kswapd usage.

Here's another one. http://linux.derkeiler.com/Mailing-Lists/RedHat/2003-07/0053.html

Meanwhile I've tried kernel 2.6.0-test4, my own grsecurity-patched vanilla-2.4.20 kernel and RH 2.4.18-27.7 - they all seem to resolve this problem. Booting back into 2.4.20-13.7bigmem (which I mentioned in my original post) brings the problem right back. kscand usage is excessively high during the backup operation, and the backup also takes about five times longer than with other kernels.

Anyone know of any fixes for this issue? The same issue is happening on a dual Xeon 2.4ghz, 6gigs ram... Kernel 2.4.20-20.7bigmem.

I now have three customers with this issue, ranging from a HP 8xPIII with 8Gb RAM to an IBM 2xP4ht with 4Gb RAM. In all cases kscand has consumed somewhere between 25-50% of uptime in CPU time. The only reliable solution is to roll back the kernel to 2.4.18-27, which is what I have done in all three cases.

Could people seeing kswapd/kscand issues with 2.4.20-xx kernels on RH 7.3 try the test scripts mentioned in bug 104562 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=104652) and see whether they get similar behaviour, and whether it goes away with the 2.4.18-xx kernels? It would be nice to get this and the bugs marked with (VM) in summary fixed for Cambridge.

It doesn't seem like this bug and 104562 are related. When I asked sct about this on one of the many lists he said he didn't think they were related. Add to that, my inode_cache doesn't appear to be rising, and something is just weird. I updated to 2.4.20-20.9 on 7.3 and I see the same behavior as on any of the 2.4.20's.
I'm going to be putting a second processor in this machine to see if this server has just suddenly become more heavily loaded, though I doubt this will have the desired effect. If it doesn't, I'm going to try some of the other suggestions in this and other bugs about changing the bdflush settings. Any other good test advice for a 7.3 box? Thanks

This time running 2.4.20-18.7bigmem, and kscand is still a wild boy. I know, I should try a more recent kernel, but I'd like to mention one thing I've observed: buffered mem stays at a relatively low level whilst running the buggy kernels. I think that normally our server likes to have 200-300 megabytes of buffered mem, but nowadays it looks like this:
---
             total       used       free     shared    buffers     cached
Mem:       4801720    4790768      10952          0      28224    4565276
-/+ buffers/cache:     197268    4604452
Swap:      2096252      53160    2043092
---
I don't know if that's a coincidence or a lead on how to track down this el bastardo bug :-), but still it seems weird to me that there's almost a 10x difference in buffer usage between different kernels. After the next reboot I'll check what happens with a working kernel, whether the bufmem usage jumps up or not.

Seth, have you tried playing with /proc/sys/vm/{min,max}-readahead? From what I can determine, the values changed between 2.4.18-xx and 2.4.20-xx. It used to be 3,127 in 18-xx and now it's 3,31 in 20-xx. I've seen read speeds double if I modify this to 31,127 for 20-xx kernels. (It doesn't make the kscand/kswapd issue go away but seems to lessen it.) Hopefully Arjan and Rik can have some test kernels for us to try.

I have to profess a certain amount of ignorance about these options. What do they end up affecting - and what was the basis for them being changed in the first place? I can give them a try, are they dynamically updated? If I change them right now will it affect the system immediately or will I need to reboot?
I am distinctly ignorant about the vm system changes that have gone in over the not-so-distant past and would appreciate pointers on where to read up on this info. Thanks

My recommendation to davej would be to simply take most of the VM fixes from Taroon.

A couple of comments: max-readahead has not improved the situation. The suggestion from riel on irc was to rebuild 2.4.20-20.9 with the rmap-updates patch included. Trying to do that now but hitting some snags when moving that patch around.

ok, got it patched. I had to include:
- rmap-updates
- inode-patches
- akmp-elevator patch

You can grab the patched kernels or src.rpm from here: http://linux.duke.edu/~skvidal/RPMS/
http://linux.duke.edu/~skvidal/RPMS/kernel-2.4.20-20.7.duke.1.src.rpm
http://linux.duke.edu/~skvidal/RPMS/kernel-smp-2.4.20-20.7.duke.1.i686.rpm
http://linux.duke.edu/~skvidal/RPMS/kernel-bigmem-2.4.20-20.7.duke.1.i686.rpm
I've tested them on one machine so I know they work, but I don't yet know if this solves the problem. I'm testing this on a production machine now. I'll let you know more as I see it. But one thing of note: so far, kscand is not sitting pegged at the top of the cpu utilization chart. Maybe a good sign, maybe too early to tell.

So the system is running 'better'. It's not completely great, but kscand doesn't stay pegged at the top. So this problem, for me, might just be overload on this system. I'll keep watching it - but in 2 weeks I'm adding a second processor to this system and I'll take another look then to see if the problem goes away entirely. If the others of you could test the kernels I posted I'd love to know your experiences too. Thanks

System still running. It's better than it was, but not great. If any of the rest of the people affected by this problem could test the kernel rpms I posted I'd appreciate it. I'd like to see some other folks' experiences to see if some of my problems with this system are just load-related and not also kernel-related.
thanks

Your patches seemed to improve the situation here (P4 3.06 with HT on, 1 gig of RAM, under very heavy CPU and I/O load). We've been torture testing 2 machines for the past week only to see kscand riding high on cpu usage, followed by kswapd after enough time had elapsed. This, of course, grinds the machines into the ground after a few hours. We've been running with a P4 compiled smp version of 2.4.20-20.7.duke.1 for 26 hours now - kscand has hardly used any CPU time (12:14) and kswapd (2:10) even less. Definitely an improvement.

We ran it on a webserver (apache/mod_perl). Swap usage was around 500MB. A 2G swap partition was defined and physical ram on the box is 2G. kscand shows up regularly at 7-8%; kswapd is no longer visible via top. The ops guys feel it's better than before. They are going to run more tests.

davej, your diary mentions some VM hacking you have done on RH 7.x/8 kernels. Any chance you can make those test kernels available?

We are suffering too. I tried Seth Vidal's kernel and it might be a tad better, but not good. Our server is a 4 CPU Xeon (ProLiant) with 20 GB of memory. We are running Oracle on it and it's not doing well at all. The system's performance degrades within an hour or two, at which point the system is consumed by kscand and kswapd activity. There is no reason for our system to thrash, as it has enough memory. When I do ps -ax -o rss and sum up those numbers I get roughly 1 GB of total memory used by processes; add to that 1.5 GB more for shared memory and it still isn't enough to warrant swapping. Vmstat shows 19GB in the "cache" category and perhaps 20 MB free. Vmstat says there is not a lot of swapping, but kscand and kswapd's activity consumes 1.5 CPUs alone. They regularly bring the entire system to a halt for a second or so every 10 seconds. IO performance degrades very badly when that happens. I suspect there is some serious internal memory leak that is aggravated by lots of memory.
The more memory you install, the more gets lost and the more has to be managed and scanned by kscand. When I stop Oracle it gets a bit better, but not good. Right now I have 4 tar | gzip process-pairs going (they do not use much memory) and kscand consumes 40% CPU. My worry is this: when will this be resolved? This problem must have been lingering for many months now and yet it seems like few folks even know about it. Is this known to be resolved even in RHat Enterprise 3? Thanks

I can confirm, too. Our production system (compaq, 2gb ram) also suffers from kscand. It's not as bad as described above, but the daemon is mostly at the top of "top". I already booted the patched "bigmem" kernel posted a few comments back; kscand is now quieter but not calm. After analyzing the machine, I determined that the swap is set to 1gb: since the first installation of RH (originally installed with 512MB) the memory was upgraded to 2gb. The swap should be 2xRAM, so the swap is too small for this configuration. Maybe this is associated with the problem described? Can anybody confirm my guess? Thanks bastian

That's not it Bastian.. I have a system with 1GB RAM and 2GB swap, and I'm still seeing kscand eat up a lot of cpu with a bigmem enabled kernel (it's enabled to ease memory expansion).

Is anyone working on this bug? My fear is that it's falling by the wayside because it's in the 7.3 project, even though it is also current for newer distro releases.

I'd be kinda curious what the results would be of rebuilding the errata AS 2.1 kernel on 7.3 - how much stuff would be different or bad, but more importantly whether this problem is present.

We are suffering from this also. It seems that about every 10 seconds kscand requests cpu. CPU usage looks low in top each time, but it seems to affect system performance. Running RH 8.0 kernel 2.4.20-19.8smp

Additional comment: I've been running the kernels I posted above for a while now and the problem isn't gone.
I still get randomly high spikes of kscand activity and the load will jump from < 1 to 12-15+ in seconds. The system will appear to stall during this time (and the nfs clients it is serving will stall too) and then it will recover and move along again. Any additional ideas of where to check?

6 root 15 0 0 0 0 SW 1.1 0.0 531:59 kscand
Dual Xeon 1GB Ram... has been up for almost 16 days. Finally decided to see what was causing a crash every once in a while - and I think this is it. Bugfix would be nice :)

I gave up and upgraded to RH9, kernel 2.4.20-20.9.XFS1.3.1 (from SGI's site). top shows that kscand has been replaced with kscand/HighMem and kscand/LowMem, and I don't see either of them going over 1% even with the CPU 50% loaded. I doubt it's anything SGI did, I'm only running their kernel for the XFS support.

Sorry, kscand/HighMem and kscand/Normal, not LowMem.

Interesting. Could we be tipped off what is ON_QA? What changes did you make?

You can grab what we have so far at.. http://people.redhat.com/davej/rhl-errata/

Oddity noted once the new kernel was installed: I got this error from sar(sa1): Cannot append data to that file. Nothing else is a problem so far. I'm rebooting onto a production server tonight, but this is what I have so far. I'm not sure this problem is related, but the kernel is the only change on the machine and then this error appeared. Very odd.

ok, been running on a production machine since last night around midnight. Load is better than before and the loads aren't spiking around as much as they were. These are all good things. Additionally, kscand isn't staying pegged at the top. I'll post more as I see it.

oh, and disregard my last post, seems like a fluke on the one test machine.

Does anyone have the errata which were located at http://people.redhat.com/davej/rhl-errata/ ? The site is offline at the moment and I really need the updates so I can test whether they work with our server.

They've been released as errata. https://rhn.redhat.com/errata/RHBA-2003-394.html
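
For anyone comparing machines: several comments above quote kscand's accumulated CPU time from top next to the machine's uptime (531:59 over almost 16 days, 12:14 over 26 hours). A rough sketch of turning such a pair into a share of uptime is below. It assumes the TIME column reads as minutes:seconds (or hours:minutes:seconds), which is how the figures in this thread appear to be formatted; the function names are made up for illustration and are not part of any tool mentioned here.

```python
def cpu_time_seconds(time_field: str) -> int:
    """Convert a top-style TIME value such as '531:59' into seconds.

    Assumption: fields are colon-separated whole numbers, least
    significant last (MM:SS or HH:MM:SS). Fractional seconds, as
    printed by newer versions of top, are not handled.
    """
    total = 0
    for part in time_field.strip().split(":"):
        total = total * 60 + int(part)
    return total


def share_of_uptime(time_field: str, uptime_seconds: float) -> float:
    """Return accumulated CPU time as a fraction of wall-clock uptime.

    The result is relative to one CPU; on an SMP box divide by the
    number of CPUs to get a share of total capacity.
    """
    if uptime_seconds <= 0:
        raise ValueError("uptime must be positive")
    return cpu_time_seconds(time_field) / uptime_seconds


if __name__ == "__main__":
    # Figures quoted in the thread; uptimes are the approximate values given.
    samples = [
        ("kscand, ~16 days up", "531:59", 16 * 86400),
        ("kscand, 26 hours up (patched kernel)", "12:14", 26 * 3600),
    ]
    for label, time_field, uptime in samples:
        pct = 100 * share_of_uptime(time_field, uptime)
        print(f"{label}: {pct:.1f}% of one CPU")
```

With the numbers above this comes out at roughly 2.3% and 0.8% of one CPU. Keep in mind it is an average over the whole uptime, so it hides the short spikes and stalls that most reports here describe.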