From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041001 Fuckraccoon/0.10.1

Description of problem:
Hi, this is simply a re-open of https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=89226 as it _also_ and _still_ applies to RHEL3 kernels.

[root@scalix root]# cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  1977856000 617140224 1360715776 0 111403008 362024960
Swap: 534634496 532480 534102016
MemTotal:      1931500 kB
MemFree:       1328824 kB
MemShared:           0 kB
Buffers:        108792 kB
Cached:         353020 kB
SwapCached:        520 kB
Active:         190836 kB
ActiveAnon:      19332 kB
ActiveCache:    171504 kB
Inact_dirty:    283300 kB
Inact_laundry:   58996 kB
Inact_clean:     26736 kB
Inact_target:   111972 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      1931500 kB
LowFree:       1328824 kB
SwapTotal:      522104 kB
SwapFree:       521584 kB

[root@scalix root]# uptime
 19:22:08 up 23:21, 2 users, load average: 0.00, 0.00, 0.00
[root@scalix root]# rpm -q kernel
kernel-2.4.21-20.EL
[root@scalix root]#

Why on earth does the kernel swap when 1.3 GB of memory is _available_?

Guys, you've been ignoring this bug for over a year now. I've supplied vm_anon_lru for the latest RHEL3 kernel in #89226, which prevents rmap from braindead swapping and makes it release cache instead. There's only one bug vs. rmap: if one application allocates more RAM than the system has, it OOM-kills the application instead of starting to swap. This has something to do with active/inactive. As I am _not_ a VM guru, I am unable to fix this 100% on my own. I really cry for help now: Rik, Larry, please take a deep look into this issue and fix this long-outstanding bug. I am here to test whatever you come up with (I offered this some months ago via private email too, you may remember) ... Maybe a look at vm_mapped_ratio in 2.4 mainline / 2.4-aa may help here too.

ciao, Marc

Version-Release number of selected component (if applicable): kernel-source-2.4.21-20.EL

How reproducible: Always

Steps to Reproduce:
1. run updatedb and/or start some applications
2. watch cache growing and growing; see the VM swapping while the cache keeps growing without ever stopping
3.

Actual Results: Silly swapping instead of releasing cache.

Expected Results: Release cache first and only swap once it is almost empty. Or a knob which controls when and how to swap.

Additional info:
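The "cache grows while swap fills" pattern described in the steps above can be made visible with a small sketch of my own (not part of the original report), sampling /proc/meminfo:

```shell
# Sketch (editor's, not from the report): print the cache and swap figures;
# run it under `watch -n 5` or in a loop to watch cache grow while swap is consumed.
awk '/^(Cached|SwapCached|SwapFree):/ { printf "%-12s %10d kB\n", $1, $2 }' /proc/meminfo
```

If Cached keeps climbing while SwapFree keeps dropping, you are seeing the behavior this bug describes.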
Marc, quite a few of the issues you mentioned have been fixed already, in kernel 2.4.21-25.EL. Please retest with the RHEL3 Update 4 beta kernel and let us know if things are still misbehaving. Chances are there are still some corner cases left, but we can only change things very slowly in RHEL, otherwise we end up introducing regressions...
Hi Rik, hmm, before I submitted this bugzilla entry, I searched the redhat ftp and only found 20.EL. I also did up2date -u and nothing newer was available. Funny :) ... Now searching google I found 2 hits. Downloading it currently and retest things. Thank you. ciao, Marc
I suspect they're in the beta channel, which is why up2date won't see them by default.
Hi Rik, ok, for me nothing has changed in any direction. It still prefers to swap braindeadly. I guess this is a never-ending story with rmap ;( ciao, Marc
Hi again, a fix for all these issues is really vm_anon_lru set to 0 (in 2.4-aa, SLES and, since yesterday, also 2.4 mainline). With this I can see the cache shrinking, and only when it's almost empty does the VM start to swap. The only problem, as I wrote above, is that with vm_anon_lru at 0 (meaning the feature is enabled) and a program wanting to allocate more memory than MemTotal, the process is killed via OOM instead of using the available swap space. For you it might be trivial to fix vm_anon_lru up to work properly with rmap, but for me it's not possible, as I don't understand the VM. It's still a mystery to me :( If you have any interest in fixing this, send me whatever you may create and I'll test it. http://linux.bkbits.net:8080/linux-2.4/cset@1.1540?nav=index.html|ChangeSet@-4d ciao, Marc
Created attachment 107488 [details] experimental patch to page_anon Marc, I suspect this patch (by Larry Woodman) can be considered the -rmap equivalent of the upstream patch you quoted. The main reason we have not added it into RHEL3 yet is that we do not know whether it will cause regressions and are still testing it to make sure it does the right thing. If you feel like testing it, please do. Your test results can help us decide whether or not the patch is safe to add to RHEL3.
Hey Rik, coolio =) ... Recompiling now. Thanks alot. ciao, Marc
Hi again, sorry, my fingers and brain were too fast. I already tested this patch; it's this one: http://linuxvm.bkbits.net:8080/linux-2.4-rmap/gnupatch@40928fe12mmfR1jtsu8yWoGP2Yn3ig and it did not help at all either. Well, it's almost that one. So should I test only the one you attached? I doubt that it'll make a difference, but anyway, I'll test. ciao, Marc
The patch didn't use to help, because other parts of the VM were still buggy. However, now that Larry has fixed a lot of other parts of the VM, I suspect the patch might actually work as advertised...
Marc, please try 2.4.21-26.EL first. If that doesn't work for you, apply the page_anon patch, and if that still doesn't work, get us several AltSysrq-M outputs so we can debug the problem you are having. Larry Woodman
Hi Larry, ok, but sorry, where do I get it? I found .25 on google but it seems .26 is too new ... ciao, Marc
bleh, I mean, it isn't available in the RHEL3 beta channels. Maybe .26 was a typo and you meant .25? ;) ciao, Marc
Marc, I copied the i686 smp and hugemem kernels here: >>>http://people.redhat.com/~lwoodman/RHEL3/ Larry
Hi Larry, cool. Thanks. Trying now ... May I have the source rpm too? ciao, Marc
OK Marc, the src.rpm is at the same location. Larry
Hi Larry, thanks a lot :) ... Ok, just for the record: I've tested -25.EL with page-anon.patch for the past few days and I have to say that lots of the braindamage from earlier kernels is gone. The whole system is a lot smoother while doing heavy things (I/O, swapping and such) and the VM really starts to free cache before starting to swap. Note: this is -25.EL with page-anon (without page-anon it's still kind of braindamaged (the VM, I mean)). It's still not 100% perfect, but you are almost there. Thanks for all the work you've done on the VM. I am now playing with -26.EL; I'll give it a shot too and let you know how things are there. Please stay tuned. ciao, Marc
Hi again, ok, I diff'ed -25.EL and -26.EL to find differences and the only difference is in fs/binfmt_elf.c so I doubt anything will change with .26 from .25 in VM direction. ciao, Marc
You are right Marc, there are no VM diffs between .25 and .26! As far as the page_anon patch is concerned, it is not and will not be part of RHEL3-U4 because it hasn't received the testing it needs. Your input will help determine if it's the correct thing to do for U5. Please try experimenting with the /proc/sys/vm/pagecache max percent (the third value) without the page_anon patch and see if it helps your system. The default value in U4 is 30%; increasing it will cause more swapping and decreasing it will cause less. Obviously, without the page_anon patch we will reactivate mapped pagecache pages as though they were anonymous, and that will cause inconsistencies. Larry Woodman
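Larry's suggestion can be tried from the shell; this is a sketch with illustrative values only (the three fields are the min, borrow and max percentages of memory used for pagecache; it requires root and applies only to RHEL3 kernels, where /proc/sys/vm/pagecache exists):

```shell
# Illustrative sketch: show the current pagecache percentages, then lower
# the max (third field) from the U4 default of "1 15 30" to 20%.
cat /proc/sys/vm/pagecache
echo "1 15 20" > /proc/sys/vm/pagecache
```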
Hi Larry, well, my experience (we talked about it via private email some time ago) is that /proc/sys/vm/pagecache is a pseudo tweak tunable for me as it does not change anything, well, almost not anything. Remember when I said that pagecache = 1 10 10 is the only way to go to fix up some of the braindamage of the VM? :) I've played with pagecache again and it still does not make a difference. I tried for example 1 15 16 or 1 15 20 or 1 5 10 and such but with all these different values the VM starts to swap very very soon. I can easily trigger it on my p4 1gb ram: start vmware with winxp (256MB ram), start Quake3 and use a small map -> voila: ~20mb in swap where cache is ~700 megabyte and growing and growing and swap usage is also growing on a per minute basis gaming quake3. Example: I play quake3 for 15 minutes and I am 150-200mb in swap. For my usage it seems page_anon is the only way to go for now. _With_ page_anon patch applied and default pagecache 1 15 30 (U4) I can easily start vmware with winxp inside, start quake3, play it, even start complete KDE, Mozilla Firefox, xchat and whatnot and I am not in swap, instead I see cache shrinkage (as I expected to see :) I use that RHEL3 kernel on my desktop (beside the scalix groupware machine where it was bought for) as it's easier to trash^Wtest the VM there. You don't seem to like page_anon ;) so I assume it's time now for some sysrq-M things? :) Another note: It _seems to me_ that if there is a program that allocates alot of ram at once (like vmware does, like quake3 does), the VM is too slow to catch that up and starts to swap as there isn't lots of free memory available. ciao, Marc
Marc, first of all, I wrote the page_anon patch, so yes, I do like it. However, we need to make sure that it does have a benefit and does not have negative side effects that over-shadow the positive ones. What the system does without the page_anon patch is move pagecache pages to the active anon list when they are re-activated, if they are mapped. This means that applications that mmap lots of file pages and push the system into heavy page reclamation will swap rather than reclaiming pagecache pages. With the page_anon patch, those mapped pagecache pages will be reactivated back onto the active cache list rather than the active anon list, and this will prevent the swapping. However, the pages of critical mapped files such as libc, libX, etc. will also be reactivated to the active cache list, and that has the potential of causing an overall performance degradation. This is what needs to be tested, and if I don't find any degradation I will make sure it goes into U5. Evidently vmware mmap()s lots of file pages, and that is why it runs so much better with the page_anon patch. If this

Larry Woodman
Hi Larry, I am very sorry, but I had, by accident, the vm_anon_lru patch I supplied in #89226 applied to the kernel source; I was using it enabled and disabled it when there was almost no free memory (page_anon was applied too). Now, with vm_anon_lru disabled entirely and only page_anon in use, the system is still swapping like an idiot. I am able to get a well-working kernel when the following is done:
1. boot with vm_anon_lru set to 0 (feature enabled)
2. fill up memory (i.e. start X, kmail, firefox, vmware etc.)
3. when there is no swap usage yet, set vm_anon_lru to 1 (feature disabled)
Now the kernel takes a very long time to start swapping. Overall: page_anon alone does nothing for me. The pagecache tweak does nothing for me either. ciao, Marc
Marc, pardon my ignorance about this, but I don't know what vm_anon_lru patch you are talking about. There is no vm_anon_lru variable in RHEL3; it was removed as part of the rmap changes. Can you attach that patch so I can see it? Also, please explain the workload you are running. Larry
Hi Larry, there has never been vm_anon_lru in RHEL because it was only just merged in 2.4.29-pre1 ;) ... I told you about vm_anon_lru at the very beginning of this bug report. vm_anon_lru comes from the 2.4-aa tree from Andrea Arcangeli and it is in all SLES kernels. ciao, Marc
Hi again, we just got a report from a customer that they also see this kind of braindamage with RHEL3 kernels. 8GB memtotal, 2GB swap partition which is 100% _used_ with ~6GB _free_ memory. The machine is running Oracle. ciao, Marc
That (swapping with lots of free memory) is typically caused by lowmem exhaustion. Lowmem is constrained to ~1GB on smp kernels and ~4GB on hugemem kernels, and it can be exhausted by the kernel before highmem (the remaining ~5GB on an smp kernel), so the system can swap out lowmem in order to reclaim it. Please ask the customer to open up a bug and get me an AltSysrq-M and a /proc/slabinfo output when his system is in that state. Larry Woodman
Hi Larry, I've asked the customer to do so and he will fill in the needed things in this bug since it fits in here perfectly. My personal opinion: I've searched RH bugzilla and found countless bug reports of the same thing I've reported here; tons of users supplied meminfo dumps, slabinfo, process listings and whatnot, and it did not help, or at least it did not help much, to get rid of the problems so many customers have. I doubt anything our customer provides here will help to fix things, b/c it'll be almost the same things tons of others already supplied to bugzilla. Anyway, I may be silly or dumb in VM-related things, but vm_anon_lru might really help you|us to get rid of the problems. There's only one problem left (I described it at the beginning of this bug). Did you take a look at that one, or should I attach it here so you can have a look? ciao, Marc
Marc, adding the vm_anon_lru patch to -rmap would result in the system being completely unable to swap out any anonymous memory, leading to an OOM kill whenever the anonymous memory is using up all of a memory zone. This is because rmap (and upstream 2.6) walk the LRU lists to find freeable pages, and never walk the process page tables.
Hi Rik, that's what I said in the very first beginning (ok, not so detailed as you now but ... ;) Anyway, I thought I share my experience with it so you and Larry might get an idea what might be wrong. ciao, Marc
Marc, sorry, but I can't seem to find any bugs that show 8GB total, 6GB free and the 2GB swapfile totally used. Like I said, this is indicative of lowmem exhaustion, but it is very unusual. I'd really appreciate getting an AltSysrq-M from a system in such a strange state; I've never seen something that odd. Thanks, Larry
Created attachment 107985 [details] better zone balancing
Hi Larry, Rik, sorry, but my customer still has not done the things I asked for, so we all have to wait now :( Beside that, I now have a perfectly working VM, and I mean perfectly :) (for my workload) ... It's awesome. The only things I've incorporated into the current RHEL3 -26.EL kernel are:
1. page_anon
2. the patch attached above
3. pagecache set to 1 10 10
and, for interactivity (plain RHEL3 kernels are just slow like a dog when there is CPU load):
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=103064&action=view
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=103065&action=view
I want to introduce a "tuning" boot parameter, or a /proc value, to change the behaviour. Imagine: you boot with it and the VM is tuned to use the above; if you omit the parameter, everything stays at the current -26.EL defaults. The parameter name is just pulled out of my &$&($ ;) ... We can for sure name it differently. What do you think? If you agree, I'll cook something up. That way page_anon can be safely integrated, even in U4, defaulting to off, and if you boot with the special parameter (or tune a /proc value) page_anon becomes active along with the other things. ciao, Marc
Marc, the page_anon change has already been committed into what will become RHEL3-U5. However, the fixup_freespace patch that you included looks questionable. The only caller of fixup_freespace() (__alloc_pages) never makes the call unless direct_reclaim is set. This makes the check for direct_reclaim inside the fixup_freespace() routine a big no-op anyway!

__alloc_pages():

        if (direct_reclaim) {
                for (;;) {
                        zone_t *z = *(zone++);
                        if (!z)
                                break;
                        if (!z->size)
                                continue;
                        if (z->free_pages < z->pages_min)
                                fixup_freespace(z, direct_reclaim);
                }
        }

Can you run the pre-RHEL3-U5 kernel with just the anon_page change and see if this alone corrects the problem you are seeing? Thanks, Larry Woodman
Hi Larry, where do I find that pre-RHEL3-U5 kernel? Or in other words, what's the version number of it? ciao, Marc
My enterprise clients are experiencing this bug sporadically. I am not sure where this bug was OFFICIALLY fixed. Can you please let me know in which RHEL3 update this was officially fixed? I thought it was 2.4.21-27.0.4, but that is just a guess. Thank you, LDB
This bug has not been fixed in any officially released RHEL3 kernel, which is why this bugzilla report is still in ASSIGNED state.
We also see this on our RHE 3.5 server, where we have 1 GB of physical memory, of which there is always 650+ MB of buffers/cache (i.e. plenty of memory available), but swap usage goes up to 75-125 MB throughout the day with no decrease in available physical memory. For us, it seems the culprits are background processes such as spamd from spamassassin and clamd from clamav, which do not spike throughout the day but seem to consume large amounts of swap space. We do not see this on our RHE 4.x servers, or on older 2.2.x kernels. Restarting spamd and clamd frees up 95% of the used swap space, but it shouldn't get to this level in the first place. If you need us to get any data, just let me know. Thanks. Rob
Hi, we have exactly the same bug on RH3 U5. When will it be solved?
Does anyone know if RH3 Beta U6 solves this bug in swap space being consumed prematurely? Rob
I have no record of this being fixed in U6 (which is why it's still in ASSIGNED state).
Ernie, thanks for the followup. Do you know if this will be fixed soon? Does Red Hat need more data? For an enterprise OS to have a large kernel memory problem where swap space gets used heavily when plenty of cached buffer memory is available seems to be a serious issue, especially for those of us who invest lots of $$ into the OS and use it extensively on many production servers. Watching loads rise to 1+ because of swap usage when a GB or more of memory is free is not the most comforting thing. Rob
I don't know. Larry, please follow up on Rob's questions.
This bug is not a black and white issue, where it is fixed or not. Performance is a continuum with many shades of grey, and I believe that the situation has been improved spectacularly in U5 and U6. However, having said that I realise that in some situations things still do not behave quite as they should. I have a question for Rob, in response to comment #36: is there significant swapin IO during the day, or is the data that was swapped out also in the swap cache, and is there no swap IO happening? If the data that's in swap is also resident in memory, there should not be a performance degradation (and the problem is mostly cosmetic). If there is a performance degradation, we should analyze it and fix it.
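Rik's distinction — swap that is merely mirrored in the swap cache versus swap that causes real IO — can be checked from /proc/meminfo. This is my own sketch, not a command from the thread:

```shell
# Compare swap actually used against SwapCached (data on swap that is still
# resident in RAM). If the difference is small, the swap usage is mostly cosmetic.
awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} /^SwapCached:/ {c=$2}
     END { printf "swap used: %d kB, of which %d kB still cached in RAM\n", t-f, c }' /proc/meminfo
```

A large used-but-not-cached remainder, together with nonzero si/so columns in vmstat, would indicate real swap IO worth investigating.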
Snapshot of what is currently happening:

drum:~# free
             total       used       free     shared    buffers     cached
Mem:       1023612     998676      24936          0     130340     510708
-/+ buffers/cache:      357628     665984
Swap:      2096472      90936    2005536

drum:~# vmstat
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0  90936  24936 130340 510716    1    2     7     2    3     3  3  1  5  8

drum:~# cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  1048178688 1022152704 26025984 0 134098944 579231744
Swap: 2146787328 93118464 2053668864
MemTotal:      1023612 kB
MemFree:         25416 kB
MemShared:           0 kB
Buffers:        130956 kB
Cached:         502904 kB
SwapCached:      62752 kB
Active:         705200 kB
ActiveAnon:     245264 kB
ActiveCache:    459936 kB
Inact_dirty:    133676 kB
Inact_laundry:   21576 kB
Inact_clean:     18508 kB
Inact_target:   175792 kB
HighTotal:      129216 kB
HighFree:         2160 kB
LowTotal:       894396 kB
LowFree:         23256 kB
SwapTotal:     2096472 kB
SwapFree:      2005536 kB
CommitLimit:   2608276 kB
Committed_AS:   549848 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

The "si" and "so" blocks under vmstat never go down, only up. Even if I restart almost every single program, I can get "Swap used" down to 1 MB or less, but si and so still show blocks being swapped. If you need me to get more specific info, please let me know what commands to run, and I'll do so. We'll notice sometimes that the load average will sit around 1 to 1.5, with nothing running really, and stay that way for a few hours, and then drop back to zero. The system doesn't seem sluggish during these times, but it's hard to monitor actual system performance when you're not sure what is driving the load, or eating up swap. Rob
Hey Rob, we have this problem as well, and I was wondering if I could trouble you to post /proc/slabinfo as well? I am just curious from a comparison perspective. Thank you, LDB
as requested:

drum:~# cat /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache            96     96    244    6    6    1 : 1008  252
ip_conntrack        2055   2890    384  287  289    1 :  496  124
ip_fib_hash           95    224     32    2    2    1 : 1008  252
ext3_xattr             0      0     44    0    0    1 : 1008  252
journal_head        1839   9548     48   52  124    1 : 1008  252
revoke_table           6    500     12    2    2    1 : 1008  252
revoke_record        448    448     32    4    4    1 : 1008  252
clip_arp_cache         0      0    256    0    0    1 : 1008  252
ip_mrt_cache           0      0    128    0    0    1 : 1008  252
tcp_tw_bucket       1115   1500    128   38   50    1 : 1008  252
tcp_bind_bucket      846   1008     32    8    9    1 : 1008  252
tcp_open_request    1110   1110    128   37   37    1 : 1008  252
inet_peer_cache      116    116     64    2    2    1 : 1008  252
secpath_cache          0      0    128    0    0    1 : 1008  252
xfrm_dst_cache         0      0    256    0    0    1 : 1008  252
ip_dst_cache        1682   2325    256  117  155    1 : 1008  252
arp_cache            255    360    256   17   24    1 : 1008  252
flow_cache             0      0    128    0    0    1 : 1008  252
blkdev_requests     4096   4110    128  137  137    1 : 1008  252
kioctx                 0      0    128    0    0    1 : 1008  252
kiocb                  0      0    128    0    0    1 : 1008  252
dnotify_cache          0      0     20    0    0    1 : 1008  252
file_lock_cache      360    360     96    9    9    1 : 1008  252
async_poll_table       0      0    140    0    0    1 : 1008  252
fasync_cache           0      0     16    0    0    1 : 1008  252
uid_cache            260   1120     32    4   10    1 : 1008  252
skbuff_head_cache   1203   3634    168   64  158    1 : 1008  252
sock                 458    718   1408  229  359    1 :  240   60
sigqueue            1008   1015    132   35   35    1 : 1008  252
kiobuf                 0      0    128    0    0    1 : 1008  252
cdev_cache            13    116     64    2    2    1 : 1008  252
bdev_cache           116    116     64    2    2    1 : 1008  252
mnt_cache             18    116     64    2    2    1 : 1008  252
inode_cache        73128  88592    512 12592 12656  1 :  496  124
dentry_cache       76813 152400    128 5072 5080    1 : 1008  252
dquot                750    750    128   25   25    1 : 1008  252
filp                7652   7680    128  256  256    1 : 1008  252
names_cache          120    136   4096  120  136    1 :  240   60
buffer_head        83231 108395    108 2928 3097    1 : 1008  252
mm_struct            585    640    384   59   64    1 :  496  124
vm_area_struct      6855  21672     68  205  387    1 : 1008  252
fs_cache             845   1102     64   16   19    1 : 1008  252
files_cache          586    616    512   84   88    1 :  496  124
signal_cache         621   1160     64   19   20    1 : 1008  252
sighand_cache        341    422   1408  171  211    1 :  240   60
pte_chain           7636  19470    128  437  649    1 : 1008  252
pae_pgd              845   1160     64   15   20    1 : 1008  252
size-131072(DMA)       0      0 131072    0    0   32 :    0    0
size-131072            0      0 131072    0    0   32 :    0    0
size-65536(DMA)        0      0  65536    0    0   16 :    0    0
size-65536             0      0  65536    0    0   16 :    0    0
size-32768(DMA)        0      0  32768    0    0    8 :    0    0
size-32768             0      1  32768    0    1    8 :    0    0
size-16384(DMA)        0      0  16384    0    0    4 :    0    0
size-16384            21     44  16384   21   44    4 :    0    0
size-8192(DMA)         0      0   8192    0    0    2 :    0    0
size-8192              4     86   8192    4   86    2 :    0    0
size-4096(DMA)         0      0   4096    0    0    1 :  240   60
size-4096            396    456   4096  396  456    1 :  240   60
size-2048(DMA)         0      0   2048    0    0    1 :  240   60
size-2048            349    616   2048  196  308    1 :  240   60
size-1024(DMA)         0      0   1024    0    0    1 :  496  124
size-1024            554    748   1024  158  187    1 :  496  124
size-512(DMA)          0      0    512    0    0    1 :  496  124
size-512             688    688    512   86   86    1 :  496  124
size-256(DMA)          0      0    256    0    0    1 : 1008  252
size-256            1065   1065    256   71   71    1 : 1008  252
size-128(DMA)          0      0    128    0    0    1 : 1008  252
size-128            3098   4140    128  133  138    1 : 1008  252
size-64(DMA)           0      0    128    0    0    1 : 1008  252
size-64             6540  12990    128  256  433    1 : 1008  252
size-32(DMA)           0      0     64    0    0    1 : 1008  252
size-32             4634  12006     64  194  207    1 : 1008  252
You may want to check out "vmstat 5". The first line of vmstat is simply the average number of swapins/outs a second over the system lifetime, and does not reflect the current state of the system.
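To put that advice into a one-liner (my own sketch, assuming the standard procps `vmstat`), take two samples and keep only the second, since the first report line is the since-boot average rather than current activity:

```shell
# First line of vmstat output is the since-boot average; the second sample
# covers the most recent 5-second interval, so keep only that one.
vmstat 5 2 | tail -n 1
```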
Rik, thanks for the "new" command I learned today. Here is the output:

drum:~# vmstat 5
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0  90936  30744 131852 503244    1    2     7     2    3     3  3  1  5  8
 0  0  90936  30724 131860 503244    0    0     0    50  142    36  0  0 98  2
 0  0  90936  32068 131880 503252    0    0     0   174  149  1265  7  2 88  2
 0  1  90936  32560 131908 503320    0    0    13   138  188    88  0  0 97  3
 0  0  90936  32600 131920 503420    0    0    10   214  322   159  1  0 95  4
 0  0  90936  32068 131932 503572    0    0    28   108  227    73  0  0 94  6
 0  0  90936  31508 131952 503600    0    0     5   247  266   135  1  1 92  7
 1  0  90936  31708 131964 503604    0    0     2    31  157    41  0  0 99  1
 0  0  90936  31052 131980 503644    0    0     6   194  235   129  2  1 93  5
 0  0  90936  31048 132004 503816    0    0    34    88  230    71  0  0 92  7

So it appears that although free lists 90 MB of swap used and meminfo shows swap cached, actual swap-in/out is zero... showing a "cosmetic" memory usage, it seems (I think). I'll try and capture data when the load stays sustained around 1.
Just an addendum to my last post. I watched "vmstat 2" over the last few minutes, and I see si and so spikes of 2 to 10 blocks every 15-20 cycles, which I assume is some swapping occurring. Nothing destroying the server, but with 600+ MB of free memory, no swap should be utilized I would think... Rob
I suspect these are coming from the highmem zone. We can check to see if we happen to have any patches in the current RHEL3 tree that might disturb the balancing of allocations between the highmem zone and the low memory zones. On the other hand, one small spike every 15-20 seconds should not hurt performance, when you have several hundred kB/second in filesystem IO. It means that far less than 1% of your disk IO is swap IO. Earlier RHEL3 kernels used to have actual performance problems, because of which the severity of this bug was "high". If nobody has actual performance problems any more, I suspect we can drop the severity of this bug to "normal", or even "low".
Rik: When this happens to our servers it is quite impactful. I do not think that dropping the severity level on an enterprise distribution is wise at this juncture. This is only because we have potential financial impacts if this happens on our servers. Moreover, I would like it solved because some of our customers using RHEL 3 MIGHT not have the luxury of upgrading to RHEL 4 easily. The 2.6 kernel introduces something called "swappiness". I know RH backports many 2.6 functionalities in its RHEL distributions. Is there a similar parameter to vm.swappiness in RHEL 3 that can be user adjusted? Thank you, Lawrence Bowie
Lawrence, I'll second your opinion on this being severe, as the data I posted earlier was at a load of 0.05 or so. We have had this swap issue drive load to over 1 sustained over a few hours on a system that is very lightly loaded at all times, so it can impact performance. If I can catch it doing this again, I'll post more stats during an "impact event". Rob
RHEL3 has the /proc/sys/vm/pagecache. You can reduce the percentage to which the page cache (within a memory zone) needs to be reduced before swapping can start to 15% with the following command: # echo "2 10 15" > /proc/sys/vm/pagecache This will also set the cache memory target to 10%, and the minimum to 2%. This may reduce, or even eliminate, swapping on systems with a very cache heavy workload. Rob, stats during an "impact event" could be very helpful in tracking down the problem, as well as establishing its severity (to determine the magnitude by which things might need to be adjusted).
ok, another rhel 3 server, nothing running out of the ordinary, only connection is ssh by me, load sitting around 1 for no reason, lots of swap space taken. Details below:

w
 09:19:44 up 3 days, 16:18, 1 user, load average: 1.09, 0.75, 0.56
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT

 09:22:34 up 3 days, 16:21, 1 user, load average: 0.64, 0.76, 0.59
84 processes: 82 sleeping, 2 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
             total   0.5%    0.0%    1.5%   0.0%     0.0%    0.5%   97.5%
Mem:   510160k av,  281284k used,  228876k free,  0k shrd,  60200k buff
       236268k actv,  18372k in_d,  368k in_c
Swap: 1052216k av,  155332k used,  896884k free             64704k cached

vmstat 1
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0 155332 230488  58820  62976    2    3    41    77   28    58  5  1 92  1
 0  0 155332 230488  58820  62976    0    0     0     0  113    23  1  0 99  0
 0  0 155332 230488  58844  62976    0    0     0   132  129    52  0  0 97  3
 0  0 155332 230488  58844  62976    0    0     0     0  121    22  0  0 100 0
 0  0 155332 230488  58844  62980    0    0     0     0  304    86  6  0 94  0
 0  0 155332 230480  58844  62980    0    0     0     0  151    59  2  0 98  0
 0  0 155332 230360  58844  62980    0    0     0     0  154    73  2  1 97  0
 0  0 155332 230352  58860  62980    0    0     0   340  170    68  0  0 99  1
 0  0 155332 230352  58860  62988    0    0     0     0  139    51  0  0 100 0
 1  0 155332 225196  58860  62988    0    0     0     0  147    56 37  5 58  0
 0  0 155332 230368  58860  63064    0    0     0     0  212    68 34  1 65  0
 0  0 155332 230364  58860  63064    0    0     0     0  137    46  2  0 98  0
 0  0 155332 230364  58880  63068    0    0     0   512  149    65  0  0 94  6
 0  0 155332 230364  58880  63068    0    0     0     0  130    31  0  0 100 0
 0  0 155332 230372  58880  63068    0    0     0     0  118    30  0  0 100 0
 0  0 155332 230372  58880  63108    0    0    40     0  146    28  0  0 99  1
 0  0 155332 230372  58880  63108    0    0     0     0  127    21  0  0 100 0
 0  0 155332 230372  58884  63112    0    0     0   224  150    38  0  0 86 14
 0  0 155332 230372  58884  63112    0    0     0     0  115    27  2  0 98  0
 0  0 155332 230372  58884  63112    0    0     0     0  132    30  0  0 100 0
 0  0 155332 230372  58884  63112    0    0     0     0  114    20  0  0 100 0
 0  0 155332 230372  58884  63112    0    0     0     0  115    19  0  0 100 0
 0  0 155332 230372  58904  63112    0    0     0   248  238    56  0  1 96  3
 0  0 155332 230372  58904  63112    0    0     0     0  126    25  2  0 98  0
 0  0 155332 230372  58904  63116    0    0     0     0  111    21  0  0 100 0
 0  0 155332 230372  58904  63116    0    0     0     0  180    52  1  0 99  0
 0  0 155332 230372  58904  63120    0    0     0     0  246    95  0  0 100 0
 0  0 155332 230372  58932  63124    0    0     0   328  212    88  6  1 79 14
 0  0 155332 230372  58932  63124    0    0     0     0  131    34  0  0 100 0
 1  0 155332 228232  58932  63124    0    0     0     0  118    27 16  0 84  0
 0  0 155332 230244  58932  63132    0    0     0     0  277   109 15  2 83  0
 0  0 155332 230244  58932  63132    0    0     0     0  115    17  0  0 100 0

free
             total       used       free     shared    buffers     cached
Mem:        510160     279892     230268          0      58964      63132
-/+ buffers/cache:      157796     352364
Swap:      1052216     155332     896884

cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  522403840 286609408 235794432 0 60424192 183779328
Swap: 1077469184 159059968 918409216
MemTotal:       510160 kB
MemFree:        230268 kB
MemShared:           0 kB
Buffers:         59008 kB
Cached:          63216 kB
SwapCached:     116256 kB
Active:         234208 kB
ActiveAnon:     136348 kB
ActiveCache:     97860 kB
Inact_dirty:     18084 kB
Inact_laundry:    5732 kB
Inact_clean:       368 kB
Inact_target:    51676 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510160 kB
LowFree:        230268 kB
SwapTotal:     1052216 kB
SwapFree:       896884 kB
CommitLimit:   1307296 kB
Committed_AS:   400036 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

cat /proc/slabinfo
slabinfo - version: 1.1
kmem_cache            72    105    108    3    3    1
ip_conntrack        1622   1790    384  167  179    1
ip_fib_hash           15    112     32    1    1    1
urb_priv               0      0     64    0    0    1
ext3_xattr             0      0     44    0    0    1
journal_head        1084   1771     48   17   23    1
revoke_table           7    250     12    1    1    1
revoke_record          0    112     32    0    1    1
clip_arp_cache         0      0    256    0    0    1
ip_mrt_cache           0      0    128    0    0    1
tcp_tw_bucket          5     90    128    1    3    1
tcp_bind_bucket       16    112     32    1    1    1
tcp_open_request       0     90    128    0    3    1
inet_peer_cache        4     58     64    1    1    1
secpath_cache          0      0    128    0    0    1
xfrm_dst_cache         0      0    256    0    0    1
ip_dst_cache         102    225    256    7   15    1
arp_cache              3     15    256    1    1    1
flow_cache             0      0    128    0    0    1
blkdev_requests     2976   3000    128  100  100    1
kioctx                 0      0    128    0    0    1
kiocb                  0      0    128    0    0    1
dnotify_cache          0      0     20    0    0    1
file_lock_cache        2     41     92    1    1    1
async_poll_table       0      0    140    0    0    1
fasync_cache           0      0     16    0    0    1
uid_cache              7    112     32    1    1    1
skbuff_head_cache    191    276    168   12   12    1
sock                  52    138   1408   28   69    1
sigqueue               0     29    132    0    1    1
kiobuf                 0      0    128    0    0    1
cdev_cache            14    116     64    2    2    1
bdev_cache             9     58     64    1    1    1
mnt_cache             19     58     64    1    1    1
inode_cache         2930   2940    512  419  420    1
dentry_cache        5759   5790    128  193  193    1
dquot                 30     90    128    3    3    1
filp                 791    810    128   27   27    1
names_cache            0      5   4096    0    5    1
buffer_head        47442  47484    104 1319 1319    1
mm_struct             57    120    384    9   12    1
vm_area_struct      3058  11368     68   61  203    1
fs_cache              56    174     64    2    3    1
files_cache           57    119    512   10   17    1
signal_cache          80    174     64    2    3    1
sighand_cache         69    132   1408   37   66    1
pte_chain           7019  10380    128  235  346    1
size-131072(DMA)       0      0 131072    0    0   32
size-131072            0      0 131072    0    0   32
size-65536(DMA)        0      0  65536    0    0   16
size-65536             0      0  65536    0    0   16
size-32768(DMA)        0      0  32768    0    0    8
size-32768             0      1  32768    0    1    8
size-16384(DMA)        0      0  16384    0    0    4
size-16384            21     22  16384   21   22    4
size-8192(DMA)         0      0   8192    0    0    2
size-8192              4     13   8192    4   13    2
size-4096(DMA)         0      0   4096    0    0    1
size-4096             41     53   4096   41   53    1
size-2048(DMA)         0      0   2048    0    0    1
size-2048             84    172   2048   44   86    1
size-1024(DMA)         0      0   1024    0    0    1
size-1024             64    112   1024   16   28    1
size-512(DMA)          0      0    512    0    0    1
size-512              77     88    512   10   11    1
size-256(DMA)          0      0    256    0    0    1
size-256              54     75    256    4    5    1
size-128(DMA)          4     30    128    1    1    1
size-128             814    900    128   30   30    1
size-64(DMA)           0      0    128    0    0    1
size-64              230    300    128   10   10    1
size-32(DMA)          68    116     64    2    2    1
size-32              462    638     64   11   11    1

Thoughts?
- no processes are running or waiting on IO
- the CPU is over 90% idle
- there is no swapin or swapout IO
- of the 150MB that got swapped out, 110MB is also resident in memory (see the SwapCached line in /proc/meminfo)
- the other 40MB that got swapped out is probably not needed, otherwise it would have been swapped in
- almost half of memory is free

I wonder if this means your server periodically runs into a load spike. Say, the server program that runs on this system (a mail server?) periodically processes a whole bunch of email at once in a lot of processes, driving the load average up for a short period of time. It would be useful to capture that exact moment, not the aftermath; the system appears mostly idle in comment #55.
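The SwapCached arithmetic above (150MB swapped out, 110MB of it still resident, roughly 40MB living only on disk) can be checked mechanically. The following is my own sketch, not part of this bug report; it only assumes the standard SwapTotal/SwapFree/SwapCached fields of /proc/meminfo:

```shell
#!/bin/sh
# Sketch (not from the report): split "used swap" into the part that is
# also cached in RAM (SwapCached) and the part that lives only on disk.
# Argument: a meminfo-style file; defaults to /proc/meminfo.
swapcalc() {
    awk '
    /^SwapTotal:/  { total  = $2 }
    /^SwapFree:/   { free   = $2 }
    /^SwapCached:/ { cached = $2 }
    END {
        used = total - free
        printf "swap used:      %7d kB\n", used
        printf "also in RAM:    %7d kB (SwapCached)\n", cached
        printf "disk-only swap: %7d kB\n", used - cached
    }' "$1"
}

f="${1:-/proc/meminfo}"
[ -r "$f" ] && swapcalc "$f" || true
```

Fed the numbers from the /proc/meminfo dump above (SwapTotal 1052216 kB, SwapFree 896884 kB, SwapCached 116256 kB), it reports 155332 kB of swap used but only 39076 kB that exists exclusively on disk, which matches the "other 40MB" in the observation list.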
Rik,

That's my point, though. One hour later and the load is still around 1, with nothing running and everything idle. It's like the "cosmetic" swap being used is artificially driving the load:

top d 2

 10:10:48  up 3 days, 17:10,  1 user,  load average: 1.36, 1.38, 1.16
81 processes: 80 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    1.0%    0.0%    0.0%   0.0%     0.0%    0.0%   99.0%
Mem:   510160k av,  318248k used,  191912k free,      0k shrd,  76316k buff
                    267868k actv,   17716k in_d,    368k in_c
Swap: 1052216k av,  155324k used,  896892k free                 82908k cached

---

Nothing is really running, no mail spikes, no web spikes, only 1 low-traffic account on this server, and the load has hovered around 1 for 3+ hours now. Is there anything else you want me to capture right now?

This is not a temporary load spike (we manage around 100 Red Hat servers, so we know what load spikes look like), but a seemingly "fake" load elevation for hours, with the only oddity being the high level of swap usage being shown, whether it's being actively used or not. The server still seems quite responsive, but load averages stay around 1. Only our RHEL 3.x servers with 2.4.x kernels exhibit this behavior.

Rob
If you press the "i" key in top, are there still no running or blocked processes showing up?
Non-idle processes show:

 PID  USER  PRI  NI  SIZE  RSS   SHARE STAT %CPU %MEM  TIME CPU COMMAND
1820  root   15   0  1892  1836   1492 R     0.0  0.3  0:00   0 sshd
9862  root   15   0  1012  1012    784 R     0.0  0.1  0:00   0 top

---

So, there doesn't appear to be anything driving the load. I checked /var/log/messages, maillog, etc., plus lsof, port checks, and so on, and found nothing that should cause any load at all. Now the load mysteriously drops off after 3 hours of being around 1, and nothing has changed memory- or program-wise:

 10:27:08  up 3 days, 17:26,  1 user,  load average: 0.01, 0.26, 0.68

Non-idle processes still show the same two, top and sshd.

        total:       used:       free:  shared:   buffers:    cached:
Mem:  522403840   336461824   185942016        0   82403328  210784256
Swap: 1077469184  159010816   918458368
MemTotal:       510160 kB
MemFree:        181584 kB
MemShared:           0 kB
Buffers:         80472 kB
Cached:          89148 kB
SwapCached:     116696 kB
Active:         277120 kB
ActiveAnon:     136220 kB
ActiveCache:    140900 kB
Inact_dirty:     18620 kB
Inact_laundry:    9552 kB
Inact_clean:       368 kB
Inact_target:    61132 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510160 kB
LowFree:        181584 kB
SwapTotal:     1052216 kB
SwapFree:       896932 kB
CommitLimit:   1307296 kB
Committed_AS:   399588 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

There has to be something in the memory management of the 2.4.x kernel that causes this condition, although it seems to be quite hard to pin down... hopefully others can provide more data to back it up. I'll try to capture more data when it happens again on any of our RHEL 3 servers.
Since there are no processes waiting on memory, or in blocked state waiting for anything else, I suspect this is something else. Do you by any chance have mysql or another threaded program running on this server? Maybe there is a thread waiting on a lock (in D state - but not visible by default in top)?
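The hunch about threads stuck in D state can be checked without top. A hedged one-liner, my own suggestion rather than anything from this report; it uses only standard procps `ps` format specifiers:

```shell
# List any task in uninterruptible sleep (state D), which plain top
# output lumps in with "sleeping" processes.  The WCHAN column shows
# the kernel function the task is blocked in, when available.
ps -eo pid,stat,wchan,comm | awk 'NR == 1 || $2 ~ /^D/'
```

An empty result (header only) would support the idea that nothing is actually blocked; any mysqld or other thread sitting in D for minutes at a time would be a strong lead.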
Rik,

We have many things running on this server, such as MySQL, Apache, spamassassin, clamav, etc. The MySQL process list showed nothing running when checked earlier. Also, we see this behavior on almost all of our RHEL 3 servers: large amounts of swap not being released, and the load hovering around 1 with nothing running, for a few hours at a time, at random. None of our RHEL 4 servers or older RH 6/7 servers exhibit this behavior.

I am by no means an expert, nor do I understand why this happens, but we have been running Red Hat Linux servers since 1998, and out of 100+ servers this RHEL 3 release is the only one that does unexplained things with swap and load. I can only tell you what we see and provide data as requested, as we are more of an "end user" of Red Hat than developers/programmers.

Rob
Rob, sorry, but I don't see what the actual problem is here. There appear to be a couple of systems that swapped at one time when you don't think they should have. Is this correct? Can you catch the system swapping when you don't think it should be? I don't see any swap activity in any of the vmstat or top outputs you have attached, although there is evidence that the system did swap earlier.

Sorry I didn't see the activity on this bug late last week; I was out.

Thanks, Larry Woodman
Larry,

Almost all of our RHEL 3 servers seem to hold large amounts of swap space, even machines with very little on them and 2 GB of RAM (with most of it free or sitting in buffers/cache). I know this has been said to be "cosmetic only", but many of our RHEL 3 servers will experience a few hours of the load sitting around 1, with no high-usage processes, no MySQL activity, nothing that should drive the load except for the large swap usage reported by "free". Then the load suddenly drops, although nothing has changed on the server.

It seems that the RHEL 3 kernel uses swap instead of free memory for programs such as spamassassin, or when running an slocate update, etc. Whether it sustains this swap usage or it is just reflected in the "free" command is tough to tell, but as mentioned above, we'll see load issues on only our RHEL 3 servers that cannot be explained except for the swap space being "used". RHEL 4 and older RH 6.x servers do not exhibit this behavior.

Rob
Hi Larry, we are having the same issue. We have RHEL 3 U5 with Oracle 9i RAC. We have many people from the Oracle RAC team involved in this issue.
I was just talking to the Oracle folks about this issue. There appears to be some confusion about what is going on here. Can someone please reproduce this problem and, *while* the system is swapping rather than reclaiming pagecache memory, get me the following:

1.) "vmstat 1" output
2.) AltSysrq-M output
3.) AltSysrq-P and W outputs
4.) /proc/cpuinfo output
5.) /proc/meminfo output
6.) "top 1" output
7.) a "ps aux" output

I would do all of this myself, but I cannot reproduce the problem you are describing internally here at Red Hat.

Thanks, Larry Woodman
Larry, is there an easy way to get AltSysrq data remotely? We are not local to the server, but we believe we can reproduce the swapping by running a disk-intensive script. Rob
Created attachment 119064 [details]
info requested during swapping

Not the best example, but it shows some blocks swapping in and out while a CGI script writes a few thousand files to the server, with 800+ MB of RAM free. I can try to get better data with our backup script running...
To remotely generate an AltSysrq-M, as root:

# echo m > /proc/sysrq-trigger

This works if the system is not dead; if it is dead, then only a serial cable can help. In general, for any alt-sysrq-XXX, just do:

# echo XXX > /proc/sysrq-trigger
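Combining the sysrq-trigger trick above with the diagnostics list requested earlier, a capture helper might look like this. It is my own sketch, untested on RHEL 3, and not an official script; the m/p/w keys and the file list come from the comments above, everything else is an assumption:

```shell
#!/bin/sh
# Hypothetical capture helper (a sketch, not from this bug report):
# gather the requested diagnostics while the machine is swapping.
out="${OUTDIR:-/tmp/swapdebug}"
mkdir -p "$out"

# SysRq dumps land in the kernel log; they need root and a writable
# /proc/sysrq-trigger, so skip them quietly otherwise.
if [ -w /proc/sysrq-trigger ]; then
    for key in m p w; do          # AltSysrq-M, -P and -W
        echo "$key" > /proc/sysrq-trigger
    done
    dmesg > "$out/sysrq.txt" 2>/dev/null
fi

command -v vmstat >/dev/null 2>&1 && vmstat 1 3 > "$out/vmstat.txt"
cat /proc/meminfo > "$out/meminfo.txt" 2>/dev/null
cat /proc/cpuinfo > "$out/cpuinfo.txt" 2>/dev/null
ps aux            > "$out/ps.txt"      2>/dev/null
echo "diagnostics written to $out"
```

Run it as root during the swap storm and attach the resulting directory; the point is to capture everything in one pass while the behavior is actually happening, not afterwards.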
This bug is filed against RHEL 3, which is in its maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission-critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.