Bug 154441
| Field | Value |
|---|---|
| Summary | bdflush kupdated kjournald kswapd taking all CPU resources; iowait is 100%; lot's of processes are in schedule_timeout function |
| Product | Red Hat Enterprise Linux 3 |
| Component | kernel |
| Version | 3.0 |
| Hardware | i386 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | medium |
| Priority | medium |
| Reporter | Georgi Hristov <ghristov> |
| Assignee | Larry Woodman <lwoodman> |
| QA Contact | Brian Brock <bbrock> |
| CC | jbaron, johan.lithander, keith-brautigam, petrides, riel, sct |
| Doc Type | Bug Fix |
| Last Closed | 2007-10-19 19:04:38 UTC |

Description
Georgi Hristov
2005-04-11 18:15:26 UTC
[root@drtsut10 mm]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/cciss/c0d0p3      8256824   1685996   6151404  22% /
/dev/cciss/c0d0p1        98747     14960     78688  16% /boot
/dev/cciss/c0d0p2      8256824     32844   7804556   1% /home
none                   3986896         0   3986896   0% /dev/shm
/dev/cciss/c0d0p5       518000     16444    475244   4% /tmp
/dev/cciss/c0d0p6     48853752     32828  46339268   1% /u01
/dev/cciss/c1d0p1    705594876 453898524 215854136  68% /u02
/dev/cciss/c1d1p1    705594876 335905260 333847400  51% /u03
/dev/cciss/c2d0p1    705594876 473578096 196174564  71% /u04
/dev/cciss/c2d1p1    705594876 335905260 333847400  51% /u05

Created attachment 112971 [details]
SysRq Show-CPUs.txt
Created attachment 112972 [details]
SysRq Show-Memory.txt
Created attachment 112973 [details]
SysRq Show-State.txt
Created attachment 112974 [details]
Process listing [ps xwo pid,command,wchan, ps -ef, ps aufxmww]
Created attachment 112975 [details]
Output from iostat
Please note the inf values, the nan values, and the large await values. Where
%util has a value at all, it is very high.
Created attachment 112980 [details]
Script that creates the I/O
This is a simple bash script that uses dd to create the I/O.
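
The attached script itself is not reproduced in this report. As a rough illustration of the kind of dd-driven write load it describes, here is a minimal Python sketch. The function names, the block size, and the block count are assumptions made for this example and are not taken from the attachment.

```python
import subprocess
import time


def build_dd_command(target_path, block_size="1M", count=1024):
    """Return the argv list for a dd run that writes zeros to target_path."""
    return [
        "dd",
        "if=/dev/zero",          # source of zero bytes
        "of=" + target_path,     # file to create on the filesystem under test
        "bs=" + block_size,      # size of each write
        "count=" + str(count),   # number of blocks to write
    ]


def timed_write(target_path, block_size="1M", count=1024):
    """Run dd once and return the elapsed wall-clock time in seconds."""
    start = time.time()
    subprocess.check_call(build_dd_command(target_path, block_size, count))
    return time.time() - start
```

Calling `timed_write` with a path on one of the large filesystems (for example a scratch file under /u02, a hypothetical choice) and dividing the bytes written by the returned time gives a crude throughput figure.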
Created attachment 112981 [details]
Output from the dd-fs-throughput.sh
This is the output of the dd-fs-throughput.sh script. The script has not
completed yet, but the results should give you a pretty good idea of the
throughput.
Created attachment 112982 [details]
System report produced by /usr/sbin/sysreport
Looks to be either the VM or ext3 - nice backtraces.

N/M. The system is just waiting on IO and not actually hanging in anything; all the tasks are in D state. There is a known performance issue with RHEL3 and 3ware cards in certain RAID modes; that problem is being tracked down.

Larry, do you think this case could benefit from the highmem bounce buffering improvement patch you created a while ago?

*** This bug has been marked as a duplicate of 121434 ***

I just read bug 121434 and I don't think that this bug is a duplicate. They do have similarities; however, there are too many differences. I use two HP Smart Array 6402 controllers and two HP MSA30-DB disk enclosures. All disks are high-end Ultra320 SCSI disks. I don't think there is a problem with the hardware. I don't think it is a problem with the driver.

Created attachment 113109 [details]
Output from dmesg
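
An earlier comment observes that all tasks are in D state (uninterruptible sleep, typically waiting on I/O). A minimal sketch of how such tasks could be picked out of a process listing follows. It assumes the text comes from `ps -eo pid,stat,wchan,comm`; the function name is made up for this example.

```python
def uninterruptible_tasks(ps_output):
    """Return (pid, wchan, command) tuples for tasks whose state starts with D.

    ps_output is assumed to be the text printed by
    `ps -eo pid,stat,wchan,comm`, including its header row.
    """
    tasks = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        pid, stat, wchan, command = fields
        if stat.startswith("D"):
            tasks.append((int(pid), wchan, command.strip()))
    return tasks
```

Grouping the result by the wchan column shows which kernel function the blocked tasks are sleeping in, which is the kind of information the SysRq Show-State attachment provides in more detail.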
TEST

Hi,

It seems we face the same kind of kswapd excessive CPU usage issue on our HP DL360G4 machine. We run RHEL 3 U3; the kernel is 2.4.21-20.ELsmp. The machine has 4 GB RAM. It gtar-ed a massive NFS server set of files into /dev/null. That went well in the early minutes of the transfer, but ran into trouble quite quickly. I've killed the tar, but kswapd is still running at 100%. Here is the top output of my machine:

19:11:31 up 2 days, 23:38, 2 users, load average: 1.90, 1.40, 1.19
71 processes: 67 sleeping, 4 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system     irq  softirq  iowait    idle
           total    0.0%    0.0%   95.6%    0.0%     0.0%    0.0%  304.0%
           cpu00    0.0%    0.0%   32.1%    0.0%     0.0%    0.0%   67.8%
           cpu01    0.0%    0.0%   16.7%    0.0%     0.0%    0.0%   83.2%
           cpu02    0.0%    0.0%   31.3%    0.0%     0.0%    0.0%   68.6%
           cpu03    0.0%    0.0%   15.3%    0.0%     0.0%    0.0%   84.6%
Mem:  4031032k av, 4006332k used,   24700k free,       0k shrd,   89500k buff
                   2657584k actv,  520684k in_d,   78228k in_c
Swap: 2044072k av,    6776k used, 2037296k free                 3240448k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
   11 root      25   0     0    0     0 SW   95.4  0.0  3433m   2 kswapd
    6 root      15   0     0    0     0 SW    0.1  0.0  16:00   2 keventd
    1 root      15   0   512  476   452 S     0.0  0.0   0:08   3 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0

We also faced the same kind of issue with HP XW4100 boxes when IO activity was high (diff -r on local disks).

Bye.

I have another instance of this: kswapd excessive CPU usage

HP DL560G1 with 4 GB RAM and (1) 2GB SWAP and (4) Xeon 2.8GHz
HP 6402 controller to HP MSA500G2 SmartArray
RHES3 with 2.4.21-20.ELsmp and latest COMPAQ patches

I cp a tree from one LV on the array to another on the array and after about 10 minutes, kswapd hits the "top" and reboot is the only way out. I can still get in (just) to do shutdown -r now, and non-shell access (Oracle Apps/SQL*Net) performs just fine. I dropped pagecache to "2 30 50" but kswapd still rises to the top.

Mike

up2date fixed this problem

Mike, which package fixed the issue? You can't say that up2date fixed it. Did you have the same issue as me, or one that just looks alike?

Mike, which package fixed the issue? You can't just say up2date fixes it. Which of the packages that up2date installed fixed the problem? Specify the old and new versions. Did you have the same issue as me, or one that just looks alike?

Please try a later RHEL3 kernel. The RHEL3-U3 kernel had a bug that prevented kswapd from reclaiming inodes, so lowmem became totally consumed and many of the system daemons chewed up lots of CPU time trying to recover from this. The fix for this bug was included in RHEL3-U4 (kernel-2.4.21-27.EL); please try the latest RHEL3-U5 kernel and let me know if it fixes the problem.

Larry Woodman

I am having the same problem on a server running RHEL AS 3 that has a 1.5TB ext3 filesystem and 4 1.5TB reiserfs filesystems, 2 GB of RAM and 2 hyperthreaded Xeon CPUs. When doing any intensive disk IO on the ext3 filesystem (cp, rsync, etc.), kswapd begins to use about 25% of the CPU. Further, about 50% of the CPU power is consumed under the system portion of top. During this state no swap is used and the system has plenty of free RAM. Eventually (overnight), 100% of the system resources are used. At that point text is logged to the console saying that an out-of-memory state exists and that pids are being killed. The only way to recover is to kill the power.

This does not happen when disk IO is done on the reiserfs filesystems. Further, this condition did not exist in the past. Because the data on the ext3 filesystem is not used as much, I'm not sure when the problem cropped up. It seems to have been within the last few weeks.

The server is up2date on everything as of today (2005-05-04). I am currently running kernel 2.4.21-27.0.4.ELsmp. However, I have also tried the non-smp version of the same kernel and the original AS 3 kernel (2.4.21-4.ELsmp). The same problem results with both kernels.

Please let me know if there is any more information needed from me. This is a major problem for us, as our data cannot be removed from the ext3 filesystem and it cannot really be accessed either.

Keith Brautigam

p.s. I am willing to try new kernels etc. if needed. Thanks!

I have compiled and installed the stock 2.4.3 kernel, which has allowed me to read and write to and from my ext3 partition without kswapd activating and consuming all my CPU resources in the system category. I used the .config file from the kernel-2.4.21-i686-smp that is included with the kernel-source package when compiling the new kernel. I said 'no' to all of the new modules.

In the past, when the machine became unusable, a message was printed to the console warning that PIDs were being killed because lowmem = 0. With the new kernel I have not been able to reproduce this problem. When I have a chance I will reboot with the most current Red Hat kernel and see if the problem occurs with the same partition (now formatted as reiserfs), just to make sure it was not the module for my RAID card that made the difference rather than something related to having an ext3 filesystem.

Keith

This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.
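
Several comments in this thread point at lowmem exhaustion (the inode reclaim bug in the RHEL3-U3 kernel, and the console message about lowmem = 0). One way to watch for that condition is to track the LowTotal and LowFree fields of /proc/meminfo, which highmem-enabled kernels report in kB. The following is a minimal sketch under that assumption; the function names are made up for this example, and the presence of these fields on any given system is not confirmed by the report.

```python
def parse_meminfo(text):
    """Parse '/proc/meminfo'-style text into a dict of {field: value in kB}.

    Only lines of the form 'Name:   12345 kB' are kept; anything else
    (for example the byte-count summary table printed by 2.4 kernels)
    is ignored.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            values[name.strip()] = int(parts[0])
    return values


def low_free_fraction(values):
    """Return LowFree / LowTotal, or None if either field is missing."""
    total = values.get("LowTotal")
    free = values.get("LowFree")
    if not total or free is None:
        return None
    return float(free) / total
```

Reading /proc/meminfo periodically during a heavy copy and logging `low_free_fraction(parse_meminfo(text))` would show whether free lowmem shrinks toward zero as kswapd's CPU usage climbs.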