Red Hat Bugzilla – Bug 117902
kswapd consumes a large amount of CPU for an extended period of time
Last modified: 2007-11-30 17:06:54 EST
Description of problem:
On an active RHAS 2.1 system, kswapd can begin consuming a large amount of CPU for an extended period of time (hours). It stays at or near the top of a "top" listing the entire time, essentially consuming an entire CPU (on a multiprocessor system) throughout the interval. It appears that the required eliciting condition is an active system on which all memory is either allocated or has been used for buffers/cache (leaving only a very small amount of genuinely "free" memory). On a system where this was occurring, memory stats were as follows:

# free
             total       used       free     shared    buffers     cached
Mem:       3924400    3914832       9568    1203884     125956    1394120
-/+ buffers/cache:     2394756    1529644
Swap:      6289320     755704    5533616

# cat /proc/swaps
Filename        Type            Size     Used    Priority
/dev/sda6       partition       2096440  755704  -1
/dev/sda7       partition       2096440  0       -2
/dev/sda8       partition       2096440  0       -3

Note that this bug appears to be the same as bug 58406, which was opened against RH 7.2 but then closed because it was reportedly fixed in RH 7.3. Apparently whatever fix appeared in RH 7.3 was never backported to RHAS 2.1. Since RHAS 2.1 is in use by enterprise customers and will be supported for many years, it would seem that a fix needs to be backported.

Also, this bug appears to be related to bug 117460 (previously filed by us), in that they both manifest themselves on a RHAS 2.1 system on which all memory is in use (specifically, both bugs are showing up on an RHAS 2.1 system running Oracle, though we've seen the same behavior on a non-Oracle system as well). They are also related in the sense that they are both instances of bugs that were addressed in previous RH errata, but those fixes were not rolled into RHAS 2.1.

I'm marking this as high priority (like bug 117460) because it can directly impact the stability of a production system; since RHAS 2.1 is intended for enterprise use, this is clearly a problem.

Version-Release number of selected component (if applicable):
kswapd / kernel-smp-2.4.9-e.38

How reproducible:
See above. Unfortunately we've discovered no systematic way to reproduce the bug. It seems to require a system on which a fairly large amount of memory is allocated, and the bulk of the remainder is allocated for buffers/cache, leaving only a small amount of free RAM; basically, RHAS 2.1 doesn't deal with this condition very well.

Actual results:
kswapd consumes large amounts of CPU for extended periods of time.

Expected results:
kswapd consumes negligible amounts of CPU for extended periods of time.

Additional info:
While it's primarily kswapd that consumes CPU during these intervals, krefilld also occasionally shows up at the top of the "top" listing for minutes at a time (even above kswapd, while this is happening).
Please get /proc/slabinfo, top, and "AltSysrq M" outputs when the system is in this condition. Larry Woodman
I'll get you the /proc/slabinfo contents the next time kswapd goes crazy. Alt-Sysrq-M is a problem though, because this is a production database server and it cannot go down, ever--in fact we have kernel.sysrq disabled on there. Is there some non-invasive way to get the information you're looking for? Also, what specifically do you want from top...just the standard display? Other than the memory stats (which I showed you with the "free" output) I'm not sure what you might be looking for. If it's kswapd's CPU time, here's the current ps output for it:

root        10  0.3  0.0     0    0 ?  SW   Feb27  52:53 [kswapd]
I was able to get kswapd going on another database server (not production) using the method from bug 58406 of dumping a large filesystem. Not sure if that's useful to you or not, but it can't hurt while we're waiting for the production server to run into the problem again. Here are the results from that test:

# top
 9:09pm up 8 days, 5:58, 3 users, load average: 1.75, 1.74, 1.21
153 processes: 152 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 1.0% user, 33.0% system, 0.0% nice, 65.0% idle
CPU1 states: 0.0% user, 33.0% system, 0.0% nice, 66.0% idle
CPU2 states: 0.0% user, 14.0% system, 0.0% nice, 85.0% idle
CPU3 states: 1.0% user, 5.0% system, 0.0% nice, 93.0% idle
Mem: 3924400K av, 3916792K used, 7608K free, 0K shrd, 627528K buff
Swap: 6289320K av, 11360K used, 6277960K free   3098000K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
   10 root      16   0     0    0     0 SW   33.0  0.0   2:49 kswapd
22905 root      16   0  1012  480   264 D    14.0  0.0   0:06 dump
22903 root      15   0  1012  480   264 S    13.0  0.0   0:06 dump
   13 root      39   0     0    0     0 SW   12.0  0.0   0:22 bdflush
22904 root      15   0  1012  480   264 S     8.0  0.0   0:06 dump
22902 root      15   0  2108  512   264 S     3.0  0.0   0:02 dump
21556 root      15   0  1096 1096   824 R     2.0  0.0   0:05 top
    1 root      15   0   116   68    68 S     0.0  0.0   0:03 init
    2 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    3 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd

# cat /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache 80 80 244 5 5 1 : 252 126
nfs_read_data 0 0 384 0 0 1 : 124 62
nfs_inode_cache 2 34 224 2 2 1 : 252 126
nfs_write_data 0 0 384 0 0 1 : 124 62
nfs_page 0 0 96 0 0 1 : 252 126
ip_fib_hash 15 226 32 2 2 1 : 252 126
journal_head 27222 27222 48 349 349 1 : 252 126
revoke_table 6 253 12 1 1 1 : 252 126
revoke_record 0 0 32 0 0 1 : 252 126
clip_arp_cache 0 0 128 0 0 1 : 252 126
ip_mrt_cache 0 0 96 0 0 1 : 252 126
tcp_tw_bucket 30 30 128 1 1 1 : 252 126
tcp_bind_bucket 15 226 32 2 2 1 : 252 126
tcp_open_request 1 40 96 1 1 1 : 252 126
inet_peer_cache 0 0 64 0 0 1 : 252 126
ip_dst_cache 12 100 192 5 5 1 : 252 126
arp_cache 3 90 128 3 3 1 : 252 126
blkdev_requests 38400 38680 96 967 967 1 : 252 126
kioctx 0 0 96 0 0 1 : 252 126
kiocb 0 0 96 0 0 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
file lock cache 183 252 92 6 6 1 : 252 126
async poll table 0 0 140 0 0 1 : 252 126
fasync cache 0 0 16 0 0 1 : 252 126
uid_cache 5 226 32 2 2 1 : 252 126
skbuff_head_cache 1273 2040 160 85 85 1 : 252 126
sock 105 135 1312 45 45 1 : 60 30
sigqueue 58 58 132 2 2 1 : 252 126
kiobuf 0 0 8768 0 0 4 : 0 0
cdev_cache 18 236 64 4 4 1 : 252 126
bdev_cache 10 118 64 2 2 1 : 252 126
mnt_cache 19 118 64 2 2 1 : 252 126
inode_cache 52828 59013 448 6557 6557 1 : 124 62
dentry_cache 559 1950 128 65 65 1 : 252 126
dquot 0 0 128 0 0 1 : 252 126
filp 1097 1240 96 31 31 1 : 252 126
names_cache 3 3 4096 3 3 1 : 60 30
buffer_head 622070 657880 96 16444 16447 1 : 252 126
mm_struct 260 260 192 13 13 1 : 252 126
vm_area_struct 1525 2006 64 34 34 1 : 252 126
fs_cache 297 354 64 6 6 1 : 252 126
files_cache 117 117 416 13 13 1 : 124 62
signal_act 102 102 1312 34 34 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 4 4 65536 4 4 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 18 18 32768 18 18 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 0 0 16384 0 0 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 17 17 8192 17 17 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 44 44 4096 44 44 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 860 918 2048 459 459 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 252 252 1024 63 63 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 304 304 512 38 38 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 390 390 256 26 26 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 1044 1170 128 39 39 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 434 885 64 15 15 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 1525 4068 32 36 36 1 : 252 126

# free
             total       used       free     shared    buffers     cached
Mem:       3924400    3915392       9008          0     660120    3077868
-/+ buffers/cache:      177404    3746996
Swap:      6289320      11360    6277960

# cat /proc/swaps
Filename        Type            Size     Used   Priority
/dev/sda6       partition       2096440  10544  -1
/dev/sda7       partition       2096440  816    -2
/dev/sda8       partition       2096440  0      -3
We really need the AltSysrq-M output to debug this problem. Can you try "echo m > /proc/sysrq-trigger" and "dmesg" ? That should get the same results as the console keyboard. Larry
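Something along these lines should capture everything in one pass (the output file names are just examples; if the write to /proc/sysrq-trigger is rejected because kernel.sysrq is disabled, it may be necessary to set "sysctl -w kernel.sysrq=1" temporarily and then set it back):

  # echo m > /proc/sysrq-trigger
  # dmesg > /tmp/sysrq-m.txt
  # cat /proc/slabinfo > /tmp/slabinfo.txt
  # top -b -n 1 > /tmp/top.txt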
I know of another report of the same problem which says "this problem doesn't reproduce when RAM is 2GB rather than 4GB." (A borderline problem?)
Perhaps, but I still need that AltSysrq-M output to determine that. Any luck getting it yet? Larry
Akira: I can easily trigger kswapd via the dump test on a machine with 2GB of RAM, so I don't think it's restricted to machines with 4GB.

Larry: I've got the output you wanted from the other machine, after running another dump to trigger kswapd. I'm not sure if the dump test is actually giving you what you need, though--let me know (since this bug was already reported and fixed previously, I assume you're just looking for some kind of verification anyway). The production database server has been having the problem once a day, so with any luck I'll be able to capture the output you want from it fairly soon as well.

Here's the output from the dump test (a 0-level dump of a filesystem to /dev/null):

--> /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache 80 80 244 5 5 1 : 252 126
nfs_read_data 0 0 384 0 0 1 : 124 62
nfs_inode_cache 2 34 224 2 2 1 : 252 126
nfs_write_data 0 0 384 0 0 1 : 124 62
nfs_page 0 0 96 0 0 1 : 252 126
ip_fib_hash 15 226 32 2 2 1 : 252 126
journal_head 20106 31824 48 338 408 1 : 252 126
revoke_table 6 253 12 1 1 1 : 252 126
revoke_record 0 0 32 0 0 1 : 252 126
clip_arp_cache 0 0 128 0 0 1 : 252 126
ip_mrt_cache 0 0 96 0 0 1 : 252 126
tcp_tw_bucket 1 30 128 1 1 1 : 252 126
tcp_bind_bucket 14 226 32 2 2 1 : 252 126
tcp_open_request 1 40 96 1 1 1 : 252 126
inet_peer_cache 2 59 64 1 1 1 : 252 126
ip_dst_cache 44 140 192 7 7 1 : 252 126
arp_cache 3 90 128 3 3 1 : 252 126
blkdev_requests 38400 38680 96 967 967 1 : 252 126
kioctx 0 0 96 0 0 1 : 252 126
kiocb 0 0 96 0 0 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
file lock cache 126 252 92 6 6 1 : 252 126
async poll table 0 0 140 0 0 1 : 252 126
fasync cache 0 0 16 0 0 1 : 252 126
uid_cache 5 226 32 2 2 1 : 252 126
skbuff_head_cache 1171 2016 160 84 84 1 : 252 126
sock 131 147 1312 49 49 1 : 60 30
sigqueue 82 87 132 3 3 1 : 252 126
kiobuf 0 0 8768 0 0 4 : 0 0
cdev_cache 18 236 64 4 4 1 : 252 126
bdev_cache 10 118 64 2 2 1 : 252 126
mnt_cache 19 118 64 2 2 1 : 252 126
inode_cache 52948 59166 448 6574 6574 1 : 124 62
dentry_cache 687 2160 128 72 72 1 : 252 126
dquot 0 0 128 0 0 1 : 252 126
filp 1097 1240 96 31 31 1 : 252 126
names_cache 3 3 4096 3 3 1 : 60 30
buffer_head 913370 914000 96 22850 22850 1 : 252 126
mm_struct 280 280 192 14 14 1 : 252 126
vm_area_struct 1703 2006 64 34 34 1 : 252 126
fs_cache 290 413 64 7 7 1 : 252 126
files_cache 113 153 416 17 17 1 : 124 62
signal_act 103 105 1312 35 35 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 4 4 65536 4 4 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 18 18 32768 18 18 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 0 0 16384 0 0 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 17 17 8192 17 17 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 67 67 4096 67 67 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 874 928 2048 464 464 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 260 260 1024 65 65 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 304 304 512 38 38 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 546 675 256 45 45 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 1170 1170 128 39 39 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 685 944 64 16 16 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 1592 4859 32 43 43 1 : 252 126

--> top
12:17pm up 10 days, 21:05, 2 users, load average: 4.45, 1.78, 0.66
155 processes: 151 sleeping, 4 running, 0 zombie, 0 stopped
CPU0 states: 0.5% user, 21.3% system, 0.0% nice, 77.1% idle
CPU1 states: 0.3% user, 12.0% system, 0.0% nice, 87.1% idle
CPU2 states: 1.2% user, 34.0% system, 0.0% nice, 64.1% idle
CPU3 states: 0.0% user, 22.0% system, 0.0% nice, 77.4% idle
Mem: 3924400K av, 3918784K used, 5616K free, 0K shrd, 658120K buff
Swap: 6289320K av, 11284K used, 6278036K free   3087656K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
   10 root      15   0     0    0     0 SW   21.5  0.0   5:07 kswapd
 8043 root      16   0  1008  476   264 R    20.9  0.0   0:17 dump
 8041 root      15   0  1008  476   264 R    20.1  0.0   0:17 dump
 8042 root      15   0  1008  476   264 R    17.2  0.0   0:16 dump
   13 root      39   0     0    0     0 SW    4.9  0.0   0:31 bdflush
 8040 root      15   0  2104  508   264 S     3.9  0.0   0:04 dump
  377 root      15   0     0    0     0 DW    2.3  0.0   0:24 kjournald
 8021 root      15   0  1092 1092   824 R     0.3  0.0   0:00 top
    1 root      15   0   116   68    68 S     0.0  0.0   0:03 init
    2 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    3 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd

--> AltSysrq M
SysRq : Show Memory
Mem-info:
Free pages:       19892kB ( 2040kB HighMem)
( Active: 444607, inactive_dirty: 412756, inactive_clean: 66514, free: 4973 (638 1276 1914) )
179*4kB 9*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1300kB
active: 1424, inactive_dirty: 145, inactive_clean: 0, free: 325 (128 256 384)
3768*4kB 60*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 16560kB
active: 125235, inactive_dirty: 31299, inactive_clean: 0, free: 4149 (255 510 765)
2*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2040kB
active: 317948, inactive_dirty: 381312, inactive_clean: 66514, free: 510 (255 510 765)
Swap cache: add 2295856, delete 2294373, find 1121400/1121926
Page cache size: 775110
Buffer mem: 148767
Ramdisk pages: 0
Free swap:       6278036kB
1048576 pages of RAM
770028 pages of HIGHMEM
67476 reserved pages
713490 pages shared
1484 pages swap cached
22 pages in page table cache
33274 pages in slab cache
Buffer memory:   595068kB
   CLEAN: 518648 buffers, 2074496 kbyte, 236 used (last=517543), 2 locked, 0 protected, 0 dirty
  LOCKED: 23366 buffers, 93464 kbyte, 23366 used (last=23366), 12084 locked, 0 protected, 0 dirty
Ok, we just saw the kswapd issue occur naturally on the production database server. This wasn't a severe instance...we've had it go for hours in the past, and this was just for a few minutes. But it may give you what you need.

BTW, there are suggestions in the Oracle support forums about mitigating this problem by changing the cache settings via /proc/sys/vm/pagecache (for instance, setting it to "2 10 30"). Does this make sense as a workaround?

Here's the output you requested:

--> /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache 80 80 244 5 5 1 : 252 126
nfs_read_data 110 110 384 11 11 1 : 124 62
nfs_inode_cache 185 306 224 18 18 1 : 252 126
nfs_write_data 30 50 384 5 5 1 : 124 62
nfs_page 171 240 96 5 6 1 : 252 126
ip_fib_hash 17 339 32 3 3 1 : 252 126
journal_head 338 1560 48 20 20 1 : 252 126
revoke_table 6 253 12 1 1 1 : 252 126
revoke_record 0 0 32 0 0 1 : 252 126
clip_arp_cache 0 0 128 0 0 1 : 252 126
ip_mrt_cache 0 0 96 0 0 1 : 252 126
tcp_tw_bucket 4 60 128 2 2 1 : 252 126
tcp_bind_bucket 19 452 32 4 4 1 : 252 126
tcp_open_request 3 40 96 1 1 1 : 252 126
inet_peer_cache 0 0 64 0 0 1 : 252 126
ip_dst_cache 95 380 192 19 19 1 : 252 126
arp_cache 4 60 128 2 2 1 : 252 126
blkdev_requests 38400 38680 96 967 967 1 : 252 126
kioctx 0 0 96 0 0 1 : 252 126
kiocb 0 0 96 0 0 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
file lock cache 175 378 92 9 9 1 : 252 126
async poll table 0 0 140 0 0 1 : 252 126
fasync cache 0 0 16 0 0 1 : 252 126
uid_cache 6 226 32 2 2 1 : 252 126
skbuff_head_cache 1271 1896 160 79 79 1 : 252 126
sock 295 366 1312 122 122 1 : 60 30
sigqueue 4 58 132 2 2 1 : 252 126
kiobuf 0 0 8768 0 0 4 : 0 0
cdev_cache 234 236 64 4 4 1 : 252 126
bdev_cache 10 118 64 2 2 1 : 252 126
mnt_cache 19 177 64 3 3 1 : 252 126
inode_cache 999 3447 448 383 383 1 : 124 62
dentry_cache 828 2580 128 86 86 1 : 252 126
dquot 0 0 128 0 0 1 : 252 126
filp 5622 5840 96 146 146 1 : 252 126
names_cache 5 5 4096 5 5 1 : 60 30
buffer_head 48484 196240 96 4906 4906 1 : 252 126
mm_struct 344 660 192 33 33 1 : 252 126
vm_area_struct 9748 10266 64 174 174 1 : 252 126
fs_cache 346 708 64 12 12 1 : 252 126
files_cache 344 468 416 52 52 1 : 124 62
signal_act 302 336 1312 112 112 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 4 4 65536 4 4 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 18 18 32768 18 18 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 0 0 16384 0 0 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 17 17 8192 17 17 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 44 45 4096 44 45 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 960 960 2048 480 480 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 280 280 1024 70 70 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 288 288 512 36 36 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 330 330 256 22 22 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 1200 1200 128 40 40 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 705 1003 64 17 17 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 1359 5537 32 49 49 1 : 252 126

--> top
1:58pm up 13 days, 23:13, 2 users, load average: 1.86, 0.80, 0.54
324 processes: 321 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states: 3.3% user, 41.4% system, 0.0% nice, 54.3% idle
CPU1 states: 12.0% user, 14.0% system, 0.0% nice, 73.5% idle
CPU2 states: 11.5% user, 22.3% system, 0.0% nice, 65.3% idle
CPU3 states: 10.2% user, 30.2% system, 0.0% nice, 59.1% idle
Mem: 3924400K av, 3919436K used, 4964K free, 1189344K shrd, 98792K buff
Swap: 6289320K av, 683788K used, 5605532K free   1531032K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
   10 root      20   0     0    0     0 RW   64.1  0.0  94:08 kswapd
27055 oracle    18   0 1108M 1.1G 1101M R    44.9 28.7   9:06 oracle
29762 root      39   0  1208 1208   824 R    14.7  0.0   1:17 top
30322 oracle    15   0 1052M 1.0G 1044M S     7.6 27.3   4:16 oracle
 3935 oracle    15   0  207M 206M  200M S     6.2  5.3   0:05 oracle
26997 oracle    15   0 1144M 1.1G 1133M S     5.6 29.6  13:57 oracle
26706 oracle    15   0 1134M 1.1G 1126M S     3.2 29.4  11:24 oracle
 8748 oracle    16   0  171M 169M  169M S     2.3  4.4   0:29 oracle
27207 oracle    15   0 1111M 1.1G 1109M S     2.2 28.9   8:10 oracle
 8746 oracle    15   0  171M 169M  169M S     1.6  4.4   0:29 oracle

--> AltSysrq M
SysRq : Show Memory
Mem-info:
Free pages:        5148kB ( 2344kB HighMem)
( Active: 533299, inactive_dirty: 158519, inactive_clean: 173836, free: 1287 (638 1276 1914) )
1*4kB 1*8kB 13*16kB 3*32kB 9*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1020kB
active: 1865, inactive_dirty: 985, inactive_clean: 0, free: 255 (128 256 384)
15*4kB 2*8kB 73*16kB 15*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1788kB
active: 110216, inactive_dirty: 73392, inactive_clean: 0, free: 447 (255 510 765)
58*4kB 10*8kB 1*16kB 9*32kB 5*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 2344kB
active: 421218, inactive_dirty: 84142, inactive_clean: 173836, free: 586 (255 510 765)
Swap cache: add 1141819, delete 980856, find 2998592/3011923
Page cache size: 840946
Buffer mem: 24708
count_ramdisk_pages: pagemap_lru_lock locked
Ramdisk pages: 0
Free swap:       5605532kB
1048576 pages of RAM
770028 pages of HIGHMEM
67476 reserved pages
14755969 pages shared
160943 pages swap cached
37 pages in page table cache
5709597 pages in slab cache
Buffer memory:    98832kB
   CLEAN: 25621 buffers, 102106 kbyte, 151 used (last=25593), 0 locked, 0 protected, 0 dirty
   DIRTY: 70 buffers, 280 kbyte, 70 used (last=70), 0 locked, 0 protected, 39 dirty
Ah, I see the problem here; I missed the comment about the "dump test". This is causing the normal zone to be consumed with buffermem pages! Since buffermem pages must come from lowmem (from the DMA and/or normal zone), and because buffermem pages have no mapping (they are not linked on an inode) in AS2.1, excessive buffermem usage causes kswapd to work very hard without accomplishing enough. What is this "dump test" program, where can I get it, and do you really need to be running it as part of your production environment? I will look into making kswapd reclaim buffermem pages more appropriately and aggressively under these conditions. Larry Woodman
Larry, no offense, but I think you may need to read the case notes a bit more closely. I'm wondering what you missed (I now understand why you weren't answering some questions...), so I'll just repeat the most important points:

1) This bug appears to be a duplicate of bug 58406. It seems worthwhile for you to find out what fix was put in for kswapd between RH 7.2 and RH 7.3 that led Arjan to close that case, and consider backporting that fix to RHAS 2.1 (he specifically mentions some improvements in kernel 2.4.9-21, and then says it was completely fixed in RH 7.3).

2) The "dump test" I'm talking about is a 0-level dump of a filesystem to /dev/null (e.g., "dump -0f /dev/null /bigfs"). This methodology was suggested by bug 58406--it's not something I came up with or was doing before I reported this bug, and it's certainly not anything we do in production.

3) I only ran the dump test on a non-production database server, in order to try to get some useful output for you to look at while we were waiting for the production database server to exhibit the behavior. That output is in comment #7.

4) I was finally able to capture an instance of the problem on the production database server; that output is in comment #8. Again, I did NOT run the dump test to trigger the instance of the problem in comment #8; it happened naturally, as part of the regular operation of this server (which is completely dedicated to running Oracle).

Those are the main points. As I mentioned, I've seen a suggestion to change the settings in /proc/sys/vm/pagecache (and later also /proc/sys/vm/buffermem) as a workaround, though I'm still trying to determine reasonable values. I received a recommendation from the admin of an Oracle site that was running into the same bug to set them as follows:

vm.buffermem = 0 4 5
vm.pagecache = 2 10 20

He claimed this resolved the problem on their database server (although clearly it would be better to just fix the bug in the kernel itself).
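For anyone else who wants to experiment with that workaround, here is a minimal sketch of how the settings can be applied on the fly (it assumes the stock AS2.1 /proc/sys/vm/buffermem and /proc/sys/vm/pagecache tunables are present, and I can't vouch for these particular values on other workloads):

  # echo "0 4 5" > /proc/sys/vm/buffermem
  # echo "2 10 20" > /proc/sys/vm/pagecache

The same two settings can also go in /etc/sysctl.conf (as vm.buffermem and vm.pagecache) so they persist across reboots.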
John, I did read this case. However, once I read that 7.2 and 7.3 were being compared and that some 7.3 fix was missing from 2.4.9-21, I discounted most of it and focused on the actual problem you were seeing, because the VM code in 7.3 is so different from 7.2 that it's impossible to bring changes from 7.3 backward. AS2.1 is based on the same VM system that 7.2 is based on. From 7.3 on, *everything* changed in the VM, so much so that backporting anything to do with page reclamation is nearly impossible. And BTW, the "fix" that took place in the 7.3 kernel referenced in bug 58406 is pretty much a replacement of the VM system!

The problem you are having without the dump test running is that the number of pages in the slabcache is totally wrong ("5709597 pages in slab cache"), and kswapd uses that number to determine how hard it should work. The problem you are having with the dump test running is that most of lowmem is consumed in buffermem, and kswapd is not aggressive enough about reclaiming buffermem pages in AS2.1. I am building you a test kernel that fixes both of these problems.

The patch to fix the wrong slabcache page count is:

--- linux/mm/slab.c.orig
+++ linux/mm/slab.c
@@ -494,7 +494,8 @@ static inline void * kmem_getpages (kmem
 	 */
 	flags |= cachep->gfpflags;
 	addr = (void*) __get_free_pages(flags, cachep->gfporder);
-	slabpages += 1 << cachep->gfporder;
+	if (addr)
+		slabpages += 1 << cachep->gfporder;
 	/* Assume that now we have the pages no one else can legally
 	 * messes with the 'struct page's.
 	 * However vm_scan() might try to test the structure to see if

The patch to fix the excessive number of buffermem pages is:

--- linux/mm/vmscan.c.orig
+++ linux/mm/vmscan.c
@@ -639,10 +639,11 @@ dirty_page_rescan:
 		 *
 		 * 1) we avoid a writeout for that page if its dirty.
 		 * 2) if its a buffercache page, and not a pagecache
-		 *    one, we skip it since we cannot move it to the
-		 *    inactive clean list --- we have to free it.
+		 *    one, we skip(unless its a lowmem zone and buffermem
+		 *    is over maxpercent) it since we cannot move it to
+		 *    the inactive clean list --- we have to free it.
 		 */
-		if (zone_free_plenty(page->zone)) {
+		if (zone_free_plenty(page->zone) && !buffermem_over_max()) {
 			/* here be dragons! do not change this or it breaks */
 			if (!page->mapping || page_dirty(page)) {
 				list_del(page_lru);

--- linux/include/linux/pagemap.h.orig
+++ linux/include/linux/pagemap.h
@@ -35,6 +35,14 @@
 extern void page_steal_zone(struct zone_struct *, int);
 extern buffer_mem_t page_cache;
 
+static inline int buffermem_over_max(void)
+{
+	int buffermem = atomic_read(&buffermem_pages);
+	int limit = max_low_pfn * buffer_mem.max_percent / 100;
+
+	return buffermem > limit;
+}
+
 static inline int pagecache_over_max(void)
 {
 	int pagecache = atomic_read(&page_cache_size) - swapper_space.nrpages;
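As a rough cross-check of that bogus counter, the true number of slab pages can be estimated from /proc/slabinfo. This is just a sketch assuming the slabinfo 1.1 layout shown above; it counts fields from the end of each line (total slabs times pages per slab), so cache names containing spaces don't throw it off:

  # awk 'NR > 1 { pages += $(NF-4) * $(NF-3) } END { print pages, "pages in slab caches (approx)" }' /proc/slabinfo

In the AltSysrq-M output above, "5709597 pages in slab cache" is more than the machine's "1048576 pages of RAM", so that counter is clearly garbage accounting rather than real slab growth.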
Just for reporting purposes, we are also experiencing this same issue. We have a quad-processor Compaq server with 4GB of memory and are experiencing the same results (kswapd taking up large amounts of CPU and memory). Larry Grillo lg34@dcx.com
OK Guys, I made an attempt to fix this problem. When you get a chance please try out the appropriate kernel in: http://people.redhat.com/~lwoodman/.for_bug117902/ Larry Woodman
Actually you can easily reproduce/test this yourself: just run a "dump -0f /dev/null /usr" (or some other filesystem) repeatedly in one window, and run top in another window. I can't install this kernel on our production systems, but I did put it on a test system (2GB, 2 CPUs) and run the dump test there. Under the e.38 kernel kswapd consumes about 17-25% CPU during the dump test, but with the e.39.1 kernel that you provided kswapd consumes about 6-10% CPU. It also seems to be at the top of the top listing less frequently. So this seems like an improvement (but since I don't know what the expected behavior is, I don't know if this is back to what would be considered normal). One caveat though. In my first round of tests with the e.39.1 kernel, I stopped the dump test after 3 dumps or so, and free memory was listed as about 20MB in top. But even with the dumps stopped and no other major activity on the system (this system is idle except for system processes), kswapd continued to consume around 3% CPU...and it never stopped. I let it run for 30 minutes before rebooting the system to revert to the e.38 kernel, and throughout that period kswapd was continually consuming 3% CPU. Not only that, but free memory *stayed* at 20MB throughout that entire interval...so whatever kswapd was doing, it wasn't actually reclaiming any memory. However, I was unable to reproduce this behavior after subsequent reboots and tests. Actually, after reverting and testing e.38 I did see similar behavior, but in that case kswapd only continued running for one or two minutes (again, at 3% CPU) after the dump test was stopped.
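To spell out the test procedure I've been using (the filesystem path and the sample counts are just examples), I run this in one window:

  # while true; do dump -0f /dev/null /usr; done

and record what kswapd/krefilld are doing in another, something like:

  # top -b -d 10 -n 360 | egrep 'kswapd|krefilld' > /var/tmp/kswapd-during-dump.txt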
Ok.. We've installed it on our production server (since that'll be the real test - risky - but confident). The changes will take effect on our next reboot, which will be later this evening (unless the system crashes earlier). I appreciate all your help!! Larry Grillo lg34@dcx.com
any feedback from the production server?
Not sure what Larry (Grillo) has seen so far, but we're planning to install the test kernel on our production database server this weekend during our maintenance window. Based on our historic CPU data for the system, it looks like the behavior is normal for the first week after a reboot and the problems with kswapd don't show up until at least the second week, so it'll take some time before we can say whether or not it appears to have resolved the kswapd issue (and others...).
John, we are still waiting for feedback on whether these changes fixed the excessive kswapd usage problem you reported. Thanks, Larry Woodman
As I said above, it'll take at least two weeks to really see if it's doing anything--and this is only the start of week two for us. Our historic CPU data show that the behavior doesn't normally kick in until the end of the second week. If it's still behaving normally by the end of *next* week (4/9/2004), we'll have a reasonable indication that the kswapd issue is fixed by this kernel. We've only seen minor spikes in system CPU so far, and I'm not sure if those were due to kswapd. As I mentioned in bug 117460, though, it doesn't seem to have fixed the ENOBUFS problems.
Sorry it's taken so long to reply back. However, I have pasted our results below. At the time of writing this update, our server has been running on the new kernel for 7 days, 14:12. It seems like kswapd (although running better) is still running for very long periods of time. Check out the info below. Let me know what you think.. Thanks !!
-----------------------------------------------------
355 processes: 354 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
CPU1 states: 1.0% user, 5.0% system, 0.0% nice, 92.0% idle
CPU2 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
CPU3 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
Mem: 4112276K av, 3945928K used, 166348K free, 328K shrd, 551992K buff
Swap: 2097096K av, 1153768K used, 943328K free   1998780K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
30762 pvcsadm   15   0  1264 1260   828 R     7.3  0.0   0:00 top
    1 root      15   0   508  460   460 S     0.0  0.0   0:04 init
    2 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    3 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    5 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:03 ksoftirqd_CPU0
    7 root      34  19     0    0     0 SWN   0.0  0.0   0:01 ksoftirqd_CPU1
    8 root      34  19     0    0     0 SWN   0.0  0.0   0:02 ksoftirqd_CPU2
    9 root      34  19     0    0     0 SWN   0.0  0.0   0:03 ksoftirqd_CPU3
   10 root      15   0     0    0     0 SW    0.0  0.0  74:10 kswapd
   11 root      15   0     0    0     0 SW    0.0  0.0   0:00 kreclaimd
   12 root      15   0     0    0     0 SW    0.0  0.0   4:06 krefilld
   13 root      15   0     0    0     0 SW    0.0  0.0   0:01 bdflush
   14 root      15   0     0    0     0 SW    0.0  0.0   0:06 kupdated
------------------------------------------------------------------
Take Care, Larry Grillo lg34@dcx.com
Larry, is there another problem other than kswapd running for 74 minutes out of a week? In AS2.1, kswapd is quite CPU intensive because it walks the page tables of all processes looking for mapped pages to reclaim. Basically kswapd has been running for about 1/136th of the time, which is less than 1% of one CPU. This is not unusual for AS2.1 when there is memory pressure. Larry Woodman
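To put a number on that (just arithmetic, using the 74:10 of accumulated CPU time and the 7 days, 14:12 of uptime from the listing above):

  $ echo 'scale=2; (74*60+10)*100 / (((7*24+14)*60+12)*60)' | bc
  .67

So kswapd has averaged roughly 0.7% of one CPU since boot.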
Larry G: In case it helps you track it down, we see a few fingerprints for this issue: 1) system CPU totals much higher than normal (our standard workload is almost entirely user CPU), which you can verify with sar -u; 2) kswapd staying at or near the top of a top listing for extended periods. We're now monitoring the latter via "top -b -n 20160 -d 30 > /var/tmp/topout" so we'll have a record if it happens when nobody's watching. Larry W: looking at our own systems, 74 minutes in a week does seem to be a fairly high total for kswapd. The total CPU for kswapd is often a good indicator that a system is having the problem, but the real point is whether or not kswapd accrued a large amount of CPU in one extended burst. We did have one minor burst this morning, in which kswapd accrued 1 minute of CPU time over a 25 minute stretch (averaging about 4% CPU in the top listing, which makes sense). Still watching to see if it'll hit a dramatic spike as it used to on the e.38 kernel, though.
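For the record, pulling the kswapd samples back out of that capture is nothing fancy. Something like this counts how many 30-second samples had kswapd above 20% CPU (it assumes %CPU is the 9th field of this top's process lines, as in the listings above; adjust the field number if your layout differs):

  # grep ' kswapd$' /var/tmp/topout | awk '$9 > 20 { n++ } END { print n+0, "samples with kswapd above 20% CPU" }'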
Larry, please see Service Request #315520.
Ok, it's 4/9/2004 and we've not run into any extended CPU hits because of kswapd; the biggest kswapd-related hit remains the 25-minute stretch I described in comment 22, which was nothing to worry about. So that's at least one week into the period when we were seeing problems before--and we *were* seeing the ENOBUFS errors (as per bug 117460), which normally accompany the kswapd problems, so it seems likely that if kswapd were still misbehaving we'd have seen it do so this week. So, it looks to me like this test kernel (e.39.1) may have fixed the kswapd issue.
Our DBA raised an alarm with me about krefilld recently, saying that he's seeing it hit the top of a "top" listing several times a day on our production database server (running the e.40.8smp kernel). The spikes seem to be fairly short (2-3 minutes), but total CPU time for krefilld seems somewhat high--141:56 over 11 days of uptime. At least, higher than it was in the past, and in the ballpark that kswapd used to reach. I don't really think this is cause for alarm since the system is fairly busy and I haven't seen any of the dramatic spikes that we used to see with kswapd, but I thought I'd mention it.
We've been trying a 40.8 kernel downloaded from Jason Baron's people website with good results. We are running Red Hat Cluster Manager on a heavily loaded machine doing lots of I/O. Kswapd was going berserk, consuming hours of CPU time (the machine being used is a Compaq DL380 G3, dual Xeon 3.06GHz, 9GB of memory). Kswapd went from consuming hours, sometimes tens of hours, of CPU time under heavy loads to values in the minutes. Back on e27, cluster manager was experiencing false failovers, we were getting ENOMEMs from reads and writes, and response time to interactive commands was poor. All this, and the performance of our load processes (high I/O levels) was bad as kswapd and the loads got into tight loops looking for memory. I'm pleased to say all this has changed with 40.8; the system is behaving under the same loads in a very civilized manner. I'm wondering why the patch isn't being included in the regular errata releases.
These changes will be included in U5. A preview of U5 is available at: http://people.redhat.com/~jbaron/.private/u5/2.4.9-e.41.12/ As always, any feedback that you can provide on that kernel is much appreciated.
When pushing the system with high I/O traffic I'm seeing krefilld consume between 50 and 60% CPU (monitoring via "top") for several minutes at a time. Considering that this task is running at priority 25, little else is getting done on the system. This behavior with krefilld started somewhere between kernel builds 2.4.9-e.24smp and 2.4.9-e.41smp. When I was running kernel 2.4.9-e.24smp, the system exhibited the behavior mentioned above of kswapd going wild and consuming 77 to 89% CPU for extended periods of time (up to several minutes). After upgrading to 2.4.9-e.41smp the behavior of kswapd got much better, but the krefilld behavior described above started showing up. Recently I've upgraded to 2.4.9-e.41.8smp (I can't find Jason's 2.4.9-e.41.12 mentioned above) and kswapd seems to be running just fine, but krefilld is still a heavy hitter. Is anyone looking into this? Thanks, Jim
krefilld does virtual page scanning to try to free pages when we are low on memory. RHEL 2.1 does not have a reverse mapping from pages to virtual addresses, so this can be very time-consuming; it is a fundamental limitation of RHEL 2.1. The latest beta kernel is at: http://people.redhat.com/~jbaron/.private/u5/2.4.9-e.47.5.3/
Thanks Jason, I tried the 2.4.9-e.47.5.3 kernel and krefilld is behaving much better in this version as is kswapd. I only noticed krefilld kicking in when free memory was going below 5MB and ended when free was over 5MB ( usually no more than a min of run time ). I also noticed the %CPU of krefilld now isn't going above 70% which is a good thing since it's running at priority 25. Jim
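In case it helps anyone else watch for this, here's the sort of loop I've been using to correlate krefilld activity with free memory (just a sketch; it assumes /proc/meminfo reports a MemFree: line and that ps supports -eo comm,time, and the log path is only an example):

  #!/bin/sh
  # Log free memory plus kswapd/krefilld cumulative CPU time every 10 seconds.
  while true; do
      free_kb=`awk '/^MemFree:/ { print $2 }' /proc/meminfo`
      times=`ps -eo comm,time | egrep '^(kswapd|krefilld)' | tr '\n' ' '`
      echo "`date '+%Y-%m-%d %H:%M:%S'` free=${free_kb}kB  ${times}"
      sleep 10
  done >> /var/tmp/krefilld-watch.log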
I take it 2.4.9-e.47.5.3 is whatever patchset you're using applied to the base e.47 kernel...do you have it for e.48? Since there's a major security fix in that kernel. Also, what's the timeline for releasing this fix officially (which I guess actually means: when is U5 due out)?
Hot off the presses is the U5 beta respun with the security fixes, e.49; please find it at the usual place: http://people.redhat.com/~jbaron/.private/u5/ U5 was on schedule to ship August 18th, but the security issues have pushed that date out a bit now, maybe ~1 week later...
James, please retest with the newer kernel.
I tested with the e.49 kernel and got the same results as I did with the e.47.5.3 kernel. Kswapd seemed to be well behaved and krefilld behaved as mentioned above. So it looks good, at least from my standpoint. Thanks Jim
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-437.html
Can somebody please tell me whether I can compile a new kernel from kernel-source-2.4.9-e3smp to solve this problem? Must I get a new RPM-format kernel, or should I build a new kernel from a tarball of the kernel source? And if the latter, what would I need to do to build a kernel from tarball source that fixes this annoying kswapd bug?
You really need to download a new version of the AS2.1 kernel. I believe we are up to 2.4.9-e54 or something like that...
Kernels e.49 and beyond have the fix. IIRC, the current released version is e.59. You can download it from the errata section of Red Hat Network.
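To answer the question a few comments up: there should be no need to rebuild anything from source. If the box is registered with RHN, something like "up2date -f kernel-smp" ought to pull the current errata kernel (I believe -f is needed because kernel packages are skipped by default). Alternatively, download the kernel-smp RPM from the errata page and install it alongside the running kernel, for example:

  # rpm -ivh kernel-smp-2.4.9-e.59.i686.rpm

then reboot into it. The e.59/i686 file name is just an example; substitute whatever version and architecture apply. Installing with -i rather than -U keeps the old kernel available as a fallback.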
I'm running 2.4.9-e.59summit on an IBM x445 running Oracle. I'm seeing krefilld just sit at the top of top; it's consumed 487.49 minutes of CPU time, and the machine has only been up for 20 days. The really strange thing is that only half of the available memory is being used: there's 8GB used and 8GB free. Is there anything in the works to fix krefilld?