After I've been using Red Hat 9 for a while, all the available memory gets used up, partly by applications and partly by the cache - which in itself is a good thing, as it speeds up the system. The problem is that once all memory is in use, Linux seems to prefer swapping over releasing memory from the cache. The system (X especially) appears sluggish when this happens. Here's the output from free on my system as of right now:

             total       used       free     shared    buffers     cached
Mem:       1289520    1276304      13216          0      28984     484872
-/+ buffers/cache:      762448     527072
Swap:      2040212      66512    1973700

It's using swap, even though around 0.5GB of memory is being used by the cache. According to the Red Hat 9 manual, the parameter /proc/sys/vm/pagecache should let me control how much memory is used for caching the filesystem - is this correct? If so, I can't adjust it, as it is not available in the /proc structure here. If that parameter cannot alleviate the problem, is there some other solution that can be used? This is pretty frustrating, as my system has 1.2GB of memory in total.

Version-Release number of selected component (if applicable): kernel-2.4.20-9
How reproducible: Always
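For reference, a quick way to check which VM tunables a particular kernel build actually exposes (whether /proc/sys/vm/pagecache is present depends on the exact kernel revision, which would explain it being documented but missing here) is simply to list them; both commands are standard:

    ls /proc/sys/vm/
    /sbin/sysctl -a | grep '^vm\.'

If pagecache shows up in that list, it can be read and written like any other entry under /proc/sys/vm.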
This is a fundamental bug in the VM/cache design: a single process writing heavily to a filesystem or disk can purge other processes from DRAM as the cache aggressively consumes all of real memory. The result is a paging frenzy that can be perceived as a system lockup, with the disk busy and response times of 10-30 minutes from pressing Enter in an xterm shell to getting the shell prompt back. Active processes that have been purged from memory remain stuck on the paging queue, with pages being stolen as rapidly as they are faulted back in. There are no fairness or priority controls to stop pages from being stolen at a rate that prevents execution. The problem actually gets significantly worse when more DRAM is added to the system. Normal tasks, like initializing a database, restoring compressed backups, and other write-intensive jobs, effectively crash the machine, leaving it locked into a paging frenzy that will not end in any reasonable time period without power cycling the machine.
Is there some way to make the kernel not use up 98% of available memory for disk caching? If more free memory were left available for applications, the problem would be less significant. My experience as an end user running X is that the system gets, as you write, bogged down over time when it runs into these race conditions (large, resource-consuming applications take a lot longer to start after the system has gobbled up the available memory for caching). Is anybody looking into this other than me posting here?
First of all, try the errata kernel; it has some minor VM bugs fixed that could cause the wrong page to be swapped out. In addition, massive writes shouldn't evict all memory anymore; the 2.4.20 rmap VM has code to prevent that.
Ran up2date today; I'm on kernel 2.4.20-13.9 now. It still seems to do a lot of swapping. Here's some output from free:

             total       used       free     shared    buffers     cached
Mem:       1289496    1265892      23604          0      98460     376552
-/+ buffers/cache:      790880     498616
Swap:      2040212      83644    1956568

After starting and stopping an application (JBoss app server) a couple of times, free looks like this:

             total       used       free     shared    buffers     cached
Mem:       1289496    1270864      18632          0      99376     377240
-/+ buffers/cache:      794248     495248
Swap:      2040212     115208    1925004

The cached value looks more or less unchanged, swapping has increased by around 30MB, and it will continue to rise as I start/stop the application some more. Here's free after starting OpenOffice, The GIMP and Evolution as well:

             total       used       free     shared    buffers     cached
Mem:       1289496    1277144      12352          0     101532     346948
-/+ buffers/cache:      828664     460832
Swap:      2040212     133152    1907060

It's releasing cache now, but swap still rises as cache falls. It seems to me the kernel is keeping too little memory free for application startup overhead, so a race condition occurs where the kernel cannot free memory from the cache fast enough to satisfy the applications' needs.
I'm also seeing this on a purely desktop system: 256MB RAM, 2 swap partitions on two separate disks. Doing anything that involves disk load is enough to kill the system's responsiveness for some time; installing an RPM is a nasty one for some reason. When this happens I can see, for instance, Nautilus redrawing the screen a line at a time with constant disk activity throughout. This can happen when simply switching desktops, but often happens when logging out (but not in). It has seriously hurt the interactive performance of this system, which is really annoying :( Here's the output of free about 2 minutes after the last swap frenzy:

[mike@excalibur Downloads]$ free -m
             total       used       free     shared    buffers     cached
Mem:           249        243          5          0         14         92
-/+ buffers/cache:        136        112
Swap:          847        183        663

I'm using more swap than buffers! Does anybody know when this problem might be fixed? I don't run any particularly disk-heavy programs, just the usual desktop apps.
rh_9 has several performance problems:

o memory management in 2.4.20-13.x is not good enough
  http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=90868
o there is a general bug with UTF-8
  http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=69900
o all X programs use Xft and the RENDER extension is not accelerated
  http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=89754

For me, avoiding UTF-8 made the system lighter. For example, slocate.cron no longer disturbs my X programs. Setting a blank LANG is a provisional fix:

# cat /etc/sysconfig/i18n
LANG=
SUPPORTED="en_US.UTF-8:en_US:en:gl_ES.UTF-8:gl_ES:gl:es_ES.UTF-8:es_ES:es"
SYSFONT="latarcyrheb-sun16"
I had another one of these failures last night, which is a good example of why the current VM/disk cache design is just plain WRONG.

The production machine is a 64MB PII 333MHz box acting as a Linux router and also providing a Red Hat mirror served by TUX for http/ftp plus rsyncd. Crond runs the normal scripts for log management and the like, plus mrtg to provide some graphs of router performance and load. The machine is installed as a minimal RH9 install, plus named, mrtg, tux, and rsyncd. There is no X or GUI desktop system installed.

The filesystem buffers and cache normally take well over half the real memory at nearly all times. This triggers paging when the RSS total of the DNS + rsyncd + tux + crond + perl processes exceeds about 20MB on this system. In nearly all cases, the choice to page out an active process's working set is the choice to do 2 I/Os, including the read to fault it back in. This choice should NEVER be made in favor of tightly holding on to disk buffer cache or filesystem cache memory of questionable value. This choice should NEVER be made for a low priority task. This choice should NEVER be made AT ALL until the aggregate RSS approaches the real memory size, since the cost to recover a file into the cache is roughly the same or less in real time and disk load. On single disk systems, the extra cost to seek to the swap area may be significantly higher.

With active downloads from the server, disk latency rises significantly, causing paging latency to rise to the point that, with paging delays, the completion of the mrtg/perl task exceeds 5 minutes. As a result additional crond tasks, including multiple mrtgs, stack up in the run queue, increasing the aggregate RSS and causing more paging. This continues for another 20 minutes until we have a half dozen mrtg tasks running and the machine is devoting 70% of its I/O load to paging without managing to complete the first mrtg perl task that triggered the meltdown. Crond was shut down, and the machine ran another 5 hours without completing any of the mrtg tasks, with response times to a CR in the ssh session remaining in the several-minute range, and the time to complete a "ps -laxf" about 10 minutes. Finally, killing all the mrtg perl tasks took another 10 minutes before they managed to complete and the system was responsive again. They NEVER finished .... without intervention, swap would have been exceeded regardless of how much was allocated, and processes would have started dying due to allocation failures. It took quite some time for the filesystem cache to dwindle down to 6MB, even with this crippling paging I/O load.

In theory, using "free" memory to cache the filesystem is a good thing. But somebody really screwed up here by insisting that caching the filesystem in the majority of DRAM (files that are very likely NEVER to be used again in the near term) is somehow much more important than memory for active running processes. In practice there does not need to be any "free" DRAM, and what SHOULD happen is that ANY page fault allocating DRAM should take it from the disk cache and/or buffer pool, down to a relatively small tunable percentage of real memory. The disk cache and filesystem cache are there to minimize I/O, not to provoke I/O in the form of extensive VM paging. By hogging DRAM, the current cache management makes ANY Linux server system unpredictably unstable in production.
At all times, a server's resources MUST scale linearly with load, or the drop in efficiency will trigger resource queue stackup with significant hysteresis that is very likely to be non-recoverable as long as requests keep entering the queues. To manage this effectively, priority MUST be given directly to active tasks for all resources, such that the tasks can complete without stacking up queues and increasing the new workload of the system. Linux violates this significantly in a number of areas where previous UNIX systems do not. All kernel resource management algorithms MUST, if at all possible, become more efficient under load, and seldom, if ever, trigger more work than would otherwise be required compared to a sequential batch execution engine.

To do this properly, write-behind disk caching should NEVER be scheduled ahead of any read request in the queues. There are processes waiting for the reads, and NONE waiting for the writes (at least until write-behinds fill memory, at which point they get triggered and complete in pairs with reads). Where at all possible, disk queue scheduling should be priority driven, based on the priority of the task that invokes the I/O .... right down to the allocation of pre-I/O resources such as disk buffers and cache space. The filesystem designs must promote EFFECTIVE aggregation of disk I/O under load to actually reduce per-request disk queue latencies and maximize disk throughput. Increased disk seeks under load must be offset by increased utilization per seek and the corresponding rotational loss.

Access to DRAM for RSS MUST be priority driven and fair-share distributed. Processes with inferior priority MUST NOT be allowed to consume memory and other resources such that high priority and otherwise interactive tasks are always on the short end of the stick and unable to use their high priority status to complete quicker. There are huge performance costs for flushing DRAM caches .... Linux needs to work hard at minimizing the cache footprint of the kernel, and at minimizing low priority context switches to tasks that may do little more than evict very active higher priority tasks from L1/L2 cache, if not out of real DRAM too.

Lastly, the design of all cron scheduled tasks should include a serialization lock to prevent multiple instances from stacking up in the run queues and in memory; see the sketch below.

John Bass
Owner/DMS Design
Performance by Design
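As a concrete illustration of that last point, here is a minimal sketch of a serialization lock for a cron job, using a mkdir-based lock so a new run exits immediately if the previous one is still going instead of stacking up behind it. The lock path and the mrtg invocation are placeholders, not the actual production configuration:

    #!/bin/sh
    # hypothetical wrapper script for the mrtg cron entry
    LOCKDIR=/var/lock/mrtg-cron.lock        # placeholder lock path
    if mkdir "$LOCKDIR" 2>/dev/null; then
        # we got the lock; make sure it is released again on any exit
        trap 'rmdir "$LOCKDIR"' EXIT
        /usr/bin/mrtg /home/mrtg/cwx.cfg    # placeholder for the real job
    else
        # a previous run is still active (or still stuck paging); skip this interval
        exit 0
    fi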
The Red Hat Linux 9 kernel should only swap out process pages if the active list for cache pages is less than 15% of (active cache + active anon). I would appreciate it if somebody with a misbehaving VM could show me the contents of /proc/meminfo so I've got a better idea of exactly how things are going wrong.
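Since grabbing a snapshot interactively can be difficult once a box is already thrashing, one option is to start a lightweight logger ahead of time and read the file back afterwards. All of the commands below are standard; the log path is just an example:

    # append a timestamped copy of /proc/meminfo (plus one vmstat sample) every minute
    nohup sh -c 'while true; do date; cat /proc/meminfo; vmstat 1 2 | tail -1; sleep 60; done' \
        >> /var/tmp/meminfo.log 2>&1 &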
Hi Rik, catching your /proc/meminfo file might be a bit problematic as it can take 20-30 minutes just to log into a VM trashing machine, and probably the same period to capture the file to a disk that already has a queue service time in the seconds. I think my last post was pretty clear about the 64mb PII-333mhz set and the work load that trashed it. The assertion that paging occurs only when active cache + active anon is down to 15% of mem can be verfied and explored other ways .... consider the vmstat data on the same 64MB PII-333mhz machine under normal use with a "vmstat 30" trace running: procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 1 0 0 12720 12224 8636 11984 0 0 0 5 134 11 0 1 99 0 0 0 12720 12224 8636 11984 0 0 0 0 141 11 0 0 100 0 0 0 12720 12224 8636 11984 0 0 0 0 173 12 0 1 99 0 0 0 12720 12224 8636 11984 1 0 1 0 150 15 0 0 99 0 0 0 12720 12224 8636 11984 1 0 1 0 130 14 0 0 100 0 0 0 12716 14456 6380 12144 1 83 25 156 158 35 20 3 77 0 0 0 12716 14456 6380 12144 0 0 0 14 159 12 0 1 99 0 0 0 12716 14456 6380 12144 3 0 3 0 189 17 0 1 99 0 0 0 12716 14456 6408 12188 4 0 6 2 262 19 0 2 98 0 0 0 12716 14456 6408 12188 0 0 0 1 176 13 0 2 98 2 0 0 12716 14428 6408 12188 1 0 1 0 169 14 0 1 99 0 0 0 12716 14052 6432 12216 0 0 0 7 185 14 0 2 98 0 0 0 12716 14052 6432 12216 0 0 0 5 192 12 0 1 98 0 0 0 12716 14052 6432 12248 3 0 4 0 160 28 0 1 99 0 0 0 12716 14052 6432 12248 1 0 1 0 180 14 0 1 99 0 0 0 12752 17400 4412 11004 16 82 32 154 192 47 21 3 77 0 0 0 12752 17400 4412 11004 1 0 1 14 139 13 0 0 100 0 0 0 12752 17400 4412 11004 0 0 0 0 159 13 0 1 99 0 0 0 12752 17400 4420 11004 1 0 1 0 171 12 0 0 100 0 0 0 12752 17368 4436 11164 3 0 8 1 162 19 0 1 99 0 0 0 12752 17344 4436 11164 1 0 1 0 140 13 0 0 100 0 0 0 12752 17344 4436 11164 0 0 0 0 130 12 0 1 99 0 0 0 12752 17340 4436 11164 0 0 0 0 148 12 0 0 100 0 0 0 12752 16752 4488 11388 0 0 8 7 129 15 0 1 99 0 0 0 12752 16752 4504 11388 0 0 0 6 181 12 0 1 99 0 0 0 12748 17132 4688 10852 1 63 9 125 203 34 21 3 76 0 0 0 12748 17132 4692 10964 1 0 5 14 152 19 0 1 99 0 0 0 12748 17132 4692 10964 0 0 0 0 170 12 0 1 99 0 0 0 12748 17132 4692 10964 0 0 0 0 129 12 0 0 100 0 0 0 12748 17132 4700 10964 0 0 0 1 132 14 0 0 100 0 0 0 12748 17132 4700 10964 0 0 0 0 138 13 0 0 100 0 0 0 12748 17132 4700 10964 0 0 0 0 133 11 0 0 100 0 0 0 12748 17132 4700 10964 1 0 1 0 141 14 0 1 99 0 0 0 12748 17132 4700 10964 0 0 0 0 129 11 0 0 100 0 0 0 12748 17132 4700 10964 0 0 0 0 127 11 0 0 100 0 0 0 12728 16648 4948 11004 0 22 7 96 155 36 21 3 76 0 0 0 12728 16648 4948 11004 0 0 0 16 132 12 0 0 100 0 0 0 12728 16648 4948 11004 0 0 0 0 162 12 0 1 99 0 0 0 12728 16648 4948 11004 0 0 0 0 136 14 0 0 100 0 0 0 12728 16648 4948 11004 0 0 0 0 138 14 0 0 99 0 0 0 12728 16648 4948 11004 0 0 0 0 152 12 0 1 99 0 0 0 12728 16648 4948 11004 0 0 0 0 140 12 0 1 99 0 0 0 12728 16648 4948 11004 0 0 0 0 140 12 0 1 99 0 0 0 12728 16648 4948 11004 0 0 0 0 135 12 0 0 100 0 0 0 12728 16648 4948 11028 0 0 1 0 167 16 0 1 99 0 0 0 12760 17416 5076 10324 0 157 15 234 223 36 20 3 76 0 0 0 12760 17416 5076 10324 1 0 1 14 268 32 0 2 98 0 0 0 12760 17248 5124 10420 0 0 3 7 328 20 0 4 96 0 0 0 12760 17248 5124 10420 0 0 0 5 213 15 0 1 98 0 0 0 12760 17248 5132 10420 0 0 0 1 183 18 0 1 99 0 0 0 12760 17248 5132 10420 0 0 0 0 167 13 0 1 99 0 0 0 12760 17248 5132 10420 0 0 0 0 151 12 0 1 99 0 0 0 12760 17248 5132 10420 0 0 0 0 147 11 0 1 99 0 0 0 12760 17248 5132 10420 0 0 0 0 139 14 0 0 100 0 0 0 12760 17248 5132 10420 0 0 0 0 140 13 0 0 99 0 0 0 12760 16948 
5372 10556   0  40    8  108  148   33 21  2 77
 0  0  0  12760  16948   5372  10556    0    0     0    14  134    11  0  0 100

We can clearly see the impact mrtg has on the system every 5 minutes (every 10 samples), where it nearly always forces page-outs ... since the numbers are normalized to per-second figures, a 40 block/sec average over a 30 second quantum implies 1,200 blocks, or 1.2MB, was flushed to swap. The 157 block/sec number implies that 4,710 blocks, or 4.7MB, was flushed to swap .... all the while the filesystem cache is around 10MB and the buffer cache above 5MB, which for this machine is certainly well above the 15% figure. One doesn't need to look very hard to see this, or to provoke it. Consider the normal state for this machine:

# free
             total       used       free     shared    buffers     cached
Mem:         61412      60680        732          0       3172      30404
-/+ buffers/cache:       27104      34308
Swap:       200772      12840     187932

# vmstat 30
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 0  0  0  12840    804   2888  30496    6   16    13     5   13    21  7 11 13
 0  0  0  12840    804   2748  30500    7    1     8     6  155    16  0  1 99
 0  0  0  12840   1616   2864  29112    4    1    18     9  143    20  0  1 98
 0  0  0  12840   1136   2824  29296   21    0    26     6  193    45  0  1 99
 0  0  0  12840   1128   2736  29384    1    0     1     0  142    13  0  1 99
 0  0  0  12840   1128   2736  29384    1    0     1     0  141    15  0  0 100
 0  0  0  12840   1128   2736  29384    0    0     0     0  131    13  0  0 100
 0  0  0  12840   1092   2736  29384    2    0     2     0  147    18  0  0 99
 0  0  0  12840   6804   2456  17872    6   49   212    85  169    67 19  2 79
 0  0  0  12840  11520   2700  19136    5    1    37    40  154    30  2  1 97
 0  0  0  12840  11520   2704  19136    3    0     3    14  167    23  0  1 99
 0  0  0  12840  11520   2704  19136    0    0     0     0  147    13  0  1 99
 0  0  0  12840  11380   2708  19196    7    0     9     0  154    22  0  1 99
 1  0  0  12648    772   3536  29064    8   38   381   293  232   118  1  3 96
 1  0  0  12648    672   2040  32632   34   55  3883  3745  435   397  2 18 80

Note the transition in free .... which is an indication of the working set size that caused the impulse. The typical number I see for filesystem/buffer cache on this machine is frequently well above 25MB, and depends largely on the amount of downloads in the recent past. And as you can see, we have already started significant page-in and page-out traffic with the cache consuming over half of real memory. This is WRONG. Bind certainly has the largest VM allocation of all the processes, but remains trimmed to a fairly small working set:

# ps -laxf
F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY   TIME COMMAND
1    25  1318     1  25   0 35836 1896 rt_sig S    ?     1:32 /usr/sbin/named -u named
1     0 12356     1  15   0  1440   36 pipe_w S    ?     0:00 CROND
4   512 12361 12356  15   0  8772 1564 lock_p D    ?     0:03  \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/cwx.cfg --logging /va
4   512 12629 12356  15   0  5736    4 pipe_w S    ?     0:00  \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg

as compared with the mrtg cron task that triggers the periodic paging. Now, as I noted in the longer post, the problem isn't fatal until there is significant disk traffic from other applications, such as active TUX/rsyncd file serving, which radically impacts the paging rate.
Here, for example, is a snapshot of vmstat during the meltdown the other day, which happens to still be in an active window on my desktop:

# vmstat 60
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 7 30  3 182764    528   1568   6384    7   18    11     3    7    20  7 12 38
 0 30  0 159168    636   1728   5060  385  279   406   293  281   372  0  2 98
 5 26  0 153432    532   1716   5108  368  306   401   312  283   377  0  2 98
 6 24  0 159576    520   1424   5656  389  463   449   470  323   357  0  2 98
 1 30  1 160256    556   1508   6052  425  413   487   425  369   403  1  2 97
 5 25  2 158912    548   1416   5296  458  372   503   379  302   373  0  2 98
 7 33  2 165512    520   1520   6116  400  482   472   495  341   384  1  2 97
 0 30  0 164280    520   1492   6164  437  461   503   485  324   429  3  2 95
 5 24  0 164352    652   1660   6092  343  361   389   381  300   349  3  2 95
26  4  0 165420    768   1504   5560  363  380   402   393  329   351  1  2 96
 0 29  0 160300    876   1508   4600  377  290   396   298  297   366  0  1 98
 8 20  0 166380    640   1544   4676  371  413   390   418  326   326  0  2 98
22  8  0 163624    524   1344   4652  394  324   420   330  304   365  1  1 98
 0 29  0 168948    584   1364   4640  396  415   419   418  309   330  0  2 98
 3 27  0 160072    552   1356   5268  368  317   405   323  292   351  1  2 98
23  6  1 153780    576   1588   6124  357  335   404   353  303   349  1  1 98

The 15% figure here is a percentage of total real memory ... after subtracting the memory consumed by the kernel for other reasons and by core processes which activate frequently (bind, cron, etc.), that 15% is a much higher REAL percentage of the memory actually usable by processes. Here the machine has stacked up a little over a half dozen mrtg/perl/sendmail tasks, plus it has multiple active tux/rsync clients driving a base I/O load which actively impacts the paging rate, and the filesystem caching is actively contributing to the paging rate.

As said in the first post, I frequently see archival operations and rpm updates drive the cache percentage high and trigger substantial paging .... critical meltdowns in the past have been invoked not by mrtg on this machine, but by network rpm updates, in particular rpm --rebuilddb. As this machine is a mirror server, it frequently sees large sustained file accesses during mirror updates and Red Hat net installs. The server has a twin with 512MB of DRAM which, while harder to provoke into VM thrashing, does do so at times, but typically manages to recover on its own. That machine, also a mirror server, frequently has over 300MB in filesystem cache, and starts paging just as easily.
OK - staging a workload to demonstrate active paging to disk with high cache values .... I simply tar'ed /var/ftp/pub/mirrors to /dev/null to create filesystem I/O:
# tar cf /dev/null /var/ftp/pub/mirrors # vmstat 30& # while cat /proc/meminfo > do > sleep 10 done wait a few minutes for cron to start mrtg and we get total: used: free: shared: buffers: cached: Mem: 62885888 62119936 765952 0 12951552 24379392 Swap: 205590528 13516800 192073728 MemTotal: 61412 kB MemFree: 748 kB MemShared: 0 kB Buffers: 12648 kB Cached: 16592 kB SwapCached: 7216 kB Active: 30248 kB ActiveAnon: 9712 kB ActiveCache: 20536 kB Inact_dirty: 2740 kB Inact_laundry: 3348 kB Inact_clean: 692 kB Inact_target: 7404 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 61412 kB LowFree: 748 kB SwapTotal: 200772 kB SwapFree: 187572 kB 1 0 0 13200 764 12920 16276 2 7 384 141 526 356 7 15 77 total: used: free: shared: buffers: cached: Mem: 62885888 62095360 790528 0 13824000 22847488 Swap: 205590528 13516800 192073728 MemTotal: 61412 kB MemFree: 772 kB MemShared: 0 kB Buffers: 13500 kB Cached: 15124 kB SwapCached: 7188 kB Active: 30488 kB ActiveAnon: 10036 kB ActiveCache: 20452 kB Inact_dirty: 2836 kB Inact_laundry: 2352 kB Inact_clean: 1052 kB Inact_target: 7344 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 61412 kB LowFree: 772 kB SwapTotal: 200772 kB SwapFree: 187572 kB total: used: free: shared: buffers: cached: Mem: 62885888 56987648 5898240 0 8478720 14479360 Swap: 205590528 13975552 191614976 MemTotal: 61412 kB MemFree: 5760 kB MemShared: 0 kB Buffers: 8280 kB Cached: 7340 kB SwapCached: 6800 kB Active: 26084 kB ActiveAnon: 16936 kB ActiveCache: 9148 kB Inact_dirty: 3560 kB Inact_laundry: 3356 kB Inact_clean: 620 kB Inact_target: 6724 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 61412 kB LowFree: 5760 kB SwapTotal: 200772 kB SwapFree: 187124 kB 0 4 2 13452 4608 9080 11516 14 97 388 191 477 191 24 12 64 total: used: free: shared: buffers: cached: Mem: 62885888 52834304 10051584 0 9420800 19005440 Swap: 205590528 13639680 191950848 MemTotal: 61412 kB MemFree: 9816 kB MemShared: 0 kB Buffers: 9200 kB Cached: 11644 kB SwapCached: 6916 kB Active: 22704 kB ActiveAnon: 8976 kB ActiveCache: 13728 kB Inact_dirty: 3688 kB Inact_laundry: 2788 kB Inact_clean: 496 kB Inact_target: 5932 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 61412 kB LowFree: 9816 kB SwapTotal: 200772 kB SwapFree: 187452 kB total: used: free: shared: buffers: cached: Mem: 62885888 56217600 6668288 0 10932224 20115456 Swap: 205590528 13512704 192077824 MemTotal: 61412 kB MemFree: 6512 kB MemShared: 0 kB Buffers: 10676 kB Cached: 12520 kB SwapCached: 7124 kB Active: 24840 kB ActiveAnon: 8756 kB ActiveCache: 16084 kB Inact_dirty: 2920 kB Inact_laundry: 3300 kB Inact_clean: 492 kB Inact_target: 6308 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 61412 kB LowFree: 6512 kB SwapTotal: 200772 kB SwapFree: 187576 kB total: used: free: shared: buffers: cached: Mem: 62885888 61509632 1376256 0 13250560 20168704 Swap: 205590528 13496320 192094208 MemTotal: 61412 kB MemFree: 1344 kB MemShared: 0 kB Buffers: 12940 kB Cached: 12584 kB SwapCached: 7112 kB Active: 26312 kB ActiveAnon: 8532 kB ActiveCache: 17780 kB Inact_dirty: 2164 kB Inact_laundry: 4260 kB Inact_clean: 492 kB Inact_target: 6644 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 61412 kB LowFree: 1344 kB SwapTotal: 200772 kB SwapFree: 187592 kB 0 0 0 13180 848 13932 13024 16 26 279 106 500 349 7 15 78 total: used: free: shared: buffers: cached: Mem: 62885888 61857792 1028096 0 14266368 20537344 Swap: 205590528 13496320 192094208 MemTotal: 61412 kB MemFree: 1004 kB MemShared: 0 kB Buffers: 13932 kB Cached: 13024 kB SwapCached: 7032 kB Active: 28016 kB ActiveAnon: 8836 kB ActiveCache: 19180 kB 
Inact_dirty: 2192 kB Inact_laundry: 3536 kB Inact_clean: 836 kB Inact_target: 6916 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 61412 kB LowFree: 1004 kB SwapTotal: 200772 kB SwapFree: 187592 kB hope that helps
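For reference, here is the command sequence from the transcript above written out as a runnable snippet (the shell continuation prompts got folded together in the paste); the tar is assumed to run in the background or in a separate shell:

    # generate sustained filesystem read I/O
    tar cf /dev/null /var/ftp/pub/mirrors &
    # log memory behaviour while it runs
    vmstat 30 &
    while cat /proc/meminfo
    do
        sleep 10
    done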
Here's the output from my /proc/meminfo. There's some free memory in the dump (180MB) because I just closed an application, but this is after working for a couple of hours during which the cache never went below approximately 300MB - and there has been plenty of swapping activity.

        total:       used:      free:   shared:  buffers:    cached:
Mem:  1320443904 1136025600 184418304         0  34238464  498208768
Swap: 2089177088  220901376 1868275712
MemTotal:      1289496 kB
MemFree:        180096 kB
MemShared:           0 kB
Buffers:         33436 kB
Cached:         317408 kB
SwapCached:     169124 kB
Active:         941812 kB
ActiveAnon:     655812 kB
ActiveCache:    286000 kB
Inact_dirty:      2316 kB
Inact_laundry:   78156 kB
Inact_clean:     13832 kB
Inact_target:   207220 kB
HighTotal:      393200 kB
HighFree:        44488 kB
LowTotal:       896296 kB
LowFree:        135608 kB
SwapTotal:     2040212 kB
SwapFree:      1824488 kB
Here is the 64MB router/mirror server doing an rsync mirror update with 36MB tied up in buffers and cache, and the system is paging heavily with 10-20 second command response times.

# vmstat 30
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 1 19  3  99428    644   5136  10856    5   14    12     9   12    22  6  9  3
 3 22  0 100320    644   5096   9492  229  102   302  1020 1396  1093  4 15 81
 5 20  2 101736    532   4272  11048  240  238   354   917 1066   829  4 11 85

# cat /proc/mem*
        total:      used:    free:  shared:  buffers:    cached:
Mem:  62885888   62193664   692224        0   4874240   29519872
Swap: 205590528 102973440 102617088
MemTotal:        61412 kB
MemFree:           676 kB
MemShared:           0 kB
Buffers:          4760 kB
Cached:          11208 kB
SwapCached:      17620 kB
Active:          31456 kB
ActiveAnon:      23832 kB
ActiveCache:      7624 kB
Inact_dirty:      3632 kB
Inact_laundry:    3008 kB
Inact_clean:       724 kB
Inact_target:     7764 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:        61412 kB
LowFree:           676 kB
SwapTotal:      200772 kB
SwapFree:       100212 kB
Couple more notes on the previous post .... the rsync mirror update triggered another mrtg stackup from agressive paging due to excessive cache/buffer use. It will be interesting to see if this one recovers, or dies from congestive paging failure. In any case, having the vast majority of memory tied up in buffers and cache while paging to death is just plain WRONG. vmstat 30 procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 0 12 0 89712 668 3204 10332 5 14 12 9 13 22 6 9 3 4 10 0 90120 644 3032 10708 332 124 394 426 810 648 1 6 92 2 13 1 89992 692 3000 11140 298 107 418 539 863 777 6 7 87 4 9 0 84552 664 3056 10636 349 109 391 493 818 704 3 7 90 [root@cwx mirrors]# cat /proc/mem* total: used: free: shared: buffers: cached: Mem: 62885888 62238720 647168 0 3371008 32702464 Swap: 205590528 80367616 125222912 MemTotal: 61412 kB MemFree: 632 kB MemShared: 0 kB Buffers: 3292 kB Cached: 9188 kB SwapCached: 22748 kB Active: 33456 kB ActiveAnon: 25788 kB ActiveCache: 7668 kB Inact_dirty: 3000 kB Inact_laundry: 2264 kB Inact_clean: 564 kB Inact_target: 7856 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 61412 kB LowFree: 632 kB SwapTotal: 200772 kB SwapFree: 122288 kB [root@cwx mirrors]# ps -laxf F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND 4 0 1 0 15 0 1372 452 schedu S ? 0:08 init 1 0 2 1 15 0 0 0 contex SW ? 6:09 [keventd] 1 0 3 1 15 0 0 0 schedu SW ? 0:00 [kapmd] 1 0 4 1 34 19 0 0 ksofti SWN ? 0:02 [ksoftirqd_CPU0] 1 0 9 1 15 0 0 0 bdflus SW ? 0:09 [bdflush] 1 0 5 1 15 0 0 0 schedu SW ? 3:10 [kswapd] 1 0 6 1 15 0 0 0 schedu SW ? 0:00 [kscand/DMA] 1 0 7 1 15 0 0 0 schedu SW ? 0:00 [kscand/Normal] 1 0 8 1 15 0 0 0 schedu SW ? 0:00 [kscand/HighMem] 1 0 10 1 15 0 0 0 schedu SW ? 0:00 [kupdated] 1 0 11 1 25 0 0 0 md_thr SW ? 0:00 [mdrecoveryd] 1 0 15 1 15 0 0 0 end SW ? 1:55 [kjournald] 1 0 73 1 25 0 0 0 end SW ? 0:00 [khubd] 1 0 647 1 15 0 0 0 end SW ? 0:00 [kjournald] 1 0 648 1 15 0 0 0 end SW ? 0:01 [kjournald] 1 0 2361 1 15 0 1452 284 schedu S ? 1:30 syslogd -m 0 5 0 2365 1 15 0 1380 140 do_sys S ? 0:20 klogd -x 5 32 2383 1 17 0 1644 232 schedu S ? 0:00 portmap 5 29 2402 1 25 0 1616 320 schedu S ? 0:00 rpc.statd 5 0 2439 1 24 0 1368 176 schedu S ? 0:00 /usr/sbin/apmd -p 10 -w 5 -W -P /etc/sysconfig/apm-scripts/apmsc 5 0 2525 1 16 0 3516 168 schedu S ? 0:13 /usr/sbin/sshd 5 0 13317 2525 16 0 6760 0 schedu SW ? 0:00 \_ /usr/sbin/sshd 5 510 13319 13317 15 0 6800 0 schedu SW ? 0:00 | \_ /usr/sbin/sshd 0 510 13320 13319 18 0 4316 0 schedu SW pts/0 0:00 | \_ -bash 1 0 30392 2525 15 0 6896 4 schedu S ? 0:10 \_ /usr/sbin/sshd 4 0 30394 30392 15 0 4400 0 wait4 SW pts/1 0:02 | \_ -bash 0 0 19323 30394 21 0 4100 0 wait4 SW pts/1 0:00 | \_ su - 4 0 19324 19323 15 0 4360 0 wait4 SW pts/1 0:00 | \_ -bash 0 0 27601 19324 15 0 4124 0 wait4 SW pts/1 0:00 | \_ sh fast 0 0 27602 27601 23 0 4172 0 wait4 SW pts/1 0:00 | \_ sh xx g 4 0 27606 27602 15 0 4572 0 schedu SW pts/1 0:05 | \_ rsync -v -a --delete --stats --bwlim 5 0 27689 27606 15 0 4572 864 lock_p D pts/1 0:52 | \_ rsync -v -a --delete --stats --b 1 0 17887 2525 15 0 6900 296 schedu S ? 1:08 \_ /usr/sbin/sshd 4 0 17906 17887 15 0 4364 604 wait4 S pts/2 0:04 | \_ -bash 4 0 28108 17906 15 0 3224 1276 - R pts/2 0:00 | \_ ps -laxf 5 0 20129 2525 20 0 6764 0 schedu SW ? 0:00 \_ /usr/sbin/sshd 5 513 20131 20129 15 0 6840 0 schedu SW ? 0:00 \_ /usr/sbin/sshd 0 513 20132 20131 16 0 4332 0 schedu SW pts/3 0:00 \_ -bash 5 0 2539 1 15 0 2064 228 schedu S ? 
0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid 5 38 2555 1 15 0 2400 2396 schedu SL ? 0:22 ntpd -U ntp 5 0 2578 1 15 0 5956 516 schedu S ? 0:34 sendmail: rejecting connections on daemon MTA: load average: 17 1 51 2587 1 15 0 5744 108 pause S ? 0:01 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue 1 0 2597 1 15 0 1420 12 schedu S ? 0:00 gpm -t ps/2 -m /dev/mouse 5 0 2928 1 15 0 1640 240 pause S ? 0:00 [TUX date] 1 0 2929 1 15 0 0 0 schedu SW ? 0:04 [TUX logger] 1 0 2930 1 25 0 1648 104 wait4 S ? 0:00 [TUX manager] 5 99 2931 2930 15 0 1648 160 end S ? 4:24 \_ [TUX worker 0] 1 99 2932 2931 15 0 1648 164 end S ? 1:05 \_ [async IO 0/1] 1 99 2933 2931 15 0 1648 164 end S ? 0:20 \_ [async IO 0/2] 1 99 2934 2931 15 0 1648 164 end S ? 0:07 \_ [async IO 0/3] 1 99 2935 2931 15 0 1648 164 end S ? 0:01 \_ [async IO 0/4] 1 99 2936 2931 15 0 1648 164 end S ? 0:00 \_ [async IO 0/5] 1 99 2937 2931 15 0 1648 164 end S ? 0:00 \_ [async IO 0/6] 1 99 2938 2931 15 0 1648 164 end S ? 0:00 \_ [async IO 0/7] 1 99 2939 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/8] 1 99 2940 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/9] 1 99 2941 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/10] 1 99 2942 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/11] 1 99 2943 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/12] 1 99 2944 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/13] 1 99 2945 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/14] 1 99 2946 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/15] 1 99 2947 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/16] 1 99 2948 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/17] 1 99 2949 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/18] 1 99 2950 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/19] 1 99 2951 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/20] 1 99 2952 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/21] 1 99 2953 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/22] 1 99 2954 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/23] 1 99 2955 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/24] 1 99 2956 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/25] 1 99 2957 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/26] 1 99 2958 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/27] 1 99 2959 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/28] 1 99 2960 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/29] 1 99 2961 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/30] 1 99 2962 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/31] 1 99 2963 2931 25 0 1648 164 end S ? 0:00 \_ [async IO 0/32] 4 0 2968 1 21 0 1360 124 schedu S tty3 0:00 /sbin/mingetty tty3 4 0 2969 1 21 0 1360 0 schedu SW tty4 0:00 /sbin/mingetty tty4 4 0 2970 1 21 0 1360 0 schedu SW tty5 0:00 /sbin/mingetty tty5 4 0 2971 1 21 0 1360 0 schedu SW tty6 0:00 /sbin/mingetty tty6 1 0 2973 1 15 0 0 0 end SW ? 0:11 [kjournald] 1 0 27219 1 15 0 0 0 end SW ? 0:00 [kjournald] 1 0 27262 1 25 0 3592 0 schedu SW ? 0:00 rpc.rquotad 5 0 27266 1 15 0 0 0 schedu SW ? 0:00 [nfsd] 1 0 27267 1 15 0 0 0 schedu SW ? 0:00 [nfsd] 1 0 27268 1 15 0 0 0 schedu SW ? 0:00 [nfsd] 1 0 27274 1 25 0 0 0 schedu SW ? 0:00 [lockd] 1 0 27275 1 25 0 0 0 end SW ? 0:00 [rpciod] 1 0 27269 1 15 0 0 0 schedu SW ? 0:00 [nfsd] 1 0 27270 1 15 0 0 0 schedu SW ? 0:00 [nfsd] 1 0 27271 1 15 0 0 0 schedu SW ? 0:00 [nfsd] 1 0 27272 1 15 0 0 0 schedu SW ? 0:00 [nfsd] 1 0 27273 1 15 0 0 0 schedu SW ? 0:00 [nfsd] 1 0 27281 1 25 0 1644 0 schedu SW ? 0:00 rpc.mountd 1 25 1318 1 25 0 37400 1332 rt_sig S ? 13:54 /usr/sbin/named -u named 1 0 12674 1 15 0 1428 40 schedu S ? 
0:01 crond 1 0 27886 12674 15 0 1440 0 pipe_w SW ? 0:00 \_ CROND 4 512 27891 27886 15 0 8792 1088 - R ? 0:02 | \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/cwx.cfg --logging 4 512 27966 27886 15 0 5732 0 pipe_w SW ? 0:00 | \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 1 0 27914 12674 15 0 1440 0 pipe_w SW ? 0:00 \_ CROND 4 512 27917 27914 15 0 8792 1052 lock_p D ? 0:02 | \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/cwx.cfg --logging 4 512 27987 27914 15 0 5736 0 pipe_w SW ? 0:00 | \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 1 0 27942 12674 15 0 1440 0 pipe_w SW ? 0:00 \_ CROND 4 512 27946 27942 15 0 8792 1100 lock_p D ? 0:02 | \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/cwx.cfg --logging 4 512 28010 27942 16 0 5728 0 pipe_w SW ? 0:00 | \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 1 0 28018 12674 19 0 1436 0 pipe_w SW ? 0:00 \_ CROND 4 0 28022 28018 15 0 8524 920 lock_p D ? 0:02 | \_ /usr/bin/perl /usr/bin/mrtg /etc/mrtg/mrtg.cfg 1 0 28020 12674 15 0 1440 0 pipe_w SW ? 0:00 \_ CROND 4 512 28024 28020 15 0 8576 916 lock_p D ? 0:02 | \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/users.cfg --loggi 4 512 28056 28020 15 0 5732 0 pipe_w SW ? 0:00 | \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 1 0 28021 12674 15 0 1440 0 pipe_w SW ? 0:00 \_ CROND 4 512 28025 28021 15 0 8796 2100 pipe_w S ? 0:02 | \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/cwx.cfg --logging 4 512 28054 28021 15 0 5732 0 pipe_w SW ? 0:00 | \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 1 0 28048 12674 18 0 1436 0 pipe_w SW ? 0:00 \_ CROND 4 0 28051 28048 15 0 8532 1100 lock_p D ? 0:01 | \_ /usr/bin/perl /usr/bin/mrtg /etc/mrtg/mrtg.cfg 1 0 28049 12674 15 0 1440 0 pipe_w SW ? 0:00 \_ CROND 4 512 28052 28049 15 0 8580 1136 lock_p D ? 0:02 | \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/users.cfg --loggi 4 512 28061 28049 15 0 5736 0 pipe_w SW ? 0:00 | \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 1 0 28050 12674 15 0 1440 0 pipe_w SW ? 0:00 \_ CROND 4 512 28053 28050 15 0 8748 2044 lock_p D ? 0:02 | \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/cwx.cfg --logging 4 512 28068 28050 19 0 5736 0 pipe_w SW ? 0:00 | \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 1 0 28080 12674 15 0 1436 360 wait4 S ? 0:00 \_ CROND 4 512 28106 28080 15 0 5740 2364 end D ? 0:00 | \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 1 0 28081 12674 15 0 1440 336 pipe_w S ? 0:00 \_ CROND 4 512 28086 28081 15 0 8772 6880 schedu S ? 0:01 \_ /usr/bin/perl /usr/bin/mrtg /home/mrtg/cwx.cfg --logging 4 512 28107 28081 15 0 5732 2320 pipe_w S ? 0:00 \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem mrtg 4 0 18814 1 20 0 1356 0 schedu SW tty1 0:00 /sbin/mingetty tty1 4 0 18816 1 21 0 1356 0 schedu SW tty2 0:00 /sbin/mingetty tty2
>> The problem when all memory is used, is that Linux seems to prefer swapping
>> over releasing memory from the cache. The system (X especially) appears
>> sluggish when this happens.

I started seeing this too after the kernel-smp-2.4.18-27.7.x -> kernel-smp-2.4.20-18.7 update (RH73). After some probing I found this difference:

chrismcc]$ uname -a ; cat /proc/sys/vm/bdflush
Linux eeyore 2.4.18-27.7.xsmp #1 SMP Fri Mar 14 05:52:30 EST 2003 i686 unknown
30      500     0       0       2560    15360   60      20      0

chrismcc]$ uname -a ; cat /proc/sys/vm/bdflush
Linux piglet 2.4.20-18.7smp #1 SMP Thu May 29 07:49:23 EDT 2003 i686 unknown
30      500     0       0       500     3000    60      20      0

As a test I did:

/sbin/sysctl -w vm.bdflush="30 500 0 0 2560 15360 60 20 0"

[chrismcc@kanga chrismcc]$ uname -a ; cat /proc/sys/vm/bdflush
Linux kanga 2.4.20-18.7smp #1 SMP Thu May 29 07:49:23 EDT 2003 i686 unknown
30      500     0       0       2560    15360   60      20      0

And... tada, all was well again. Could this be the cause of the above problems? Or something different?
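If that setting does help, it can be made to survive a reboot in the usual way: the sysctl -w line applies it immediately, and an entry in /etc/sysctl.conf is re-applied by the init scripts at boot (standard behaviour on RH 7.3/9):

    /sbin/sysctl -w vm.bdflush="30 500 0 0 2560 15360 60 20 0"
    echo 'vm.bdflush = 30 500 0 0 2560 15360 60 20 0' >> /etc/sysctl.conf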
This behaviour is close to making Red Hat 9 unusable as a platform for a squid proxy. I am seeing Squid's constant disk reads and writes cause the buffer and cache sizes to grow slowly over time and cause large parts of squid to be swapped out. Squid is installed on a server with 1GB of RAM. The max VSZ I have seen so far is around 550MB, just around half of the system's RAM. With half the system's RAM supposedly "available", I don't consider it unreasonable to expect that the active application not have half its pages swapped out, and not have to access swap constantly to get those pages back.

The following is output from "vmstat 1 2000":

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 1  0  0  79768   9348 341788 476640    0    1     6    23    7    13  0  1 24
 0  0  0  79776   9320 341560 476880    8    8     8     8  628   244  0  3 97
 0  0  0  79776   9208 341560 477008    0    0    16     0  575   245  2  3 95
 0  0  0  79784   9072 341352 477344    8    8     8     8  575   170  1  2 97
 0  0  0  79784   8988 341352 477416   12    0    12     0  438   158  1  1 98
 0  0  0  79904   8988 341344 477432    0  132     0   380  410   169  0  2 98
 0  0  0  79904   8972 341344 477512    8    0    80     0  405   189  0  1 99
 0  0  0  79904   8884 341360 477588    8    0     8     0  850   299  0  6 94
 0  0  0  79904   8776 341488 477664   40    0    52     0  972   542  2 10 88
 0  0  0  79936   8636 341600 477736   56   72    56    80 1297   574  3  9 88
 0  0  1  79936   8672 341820 477564   24    0    24   912  949   480  3  6 91
 0  0  0  79944   8668 341836 477580    0    8    12   944 1034   469  1  3 96
 0  0  0  79944   8672 341836 477580    0    0     0     0  316    94  1  3 96
 0  0  0  79960   8676 341672 477748    0  128     0   204  343   109  0  1 99
 0  0  0  79960   8692 341680 477736    4    0    28     0  741   333  1  3 95
 0  0  0  79960   8708 341532 477880    4    0    12    28  630   278  1  4 95
 0  0  0  79960   8708 341624 477784    4    0     4   444  291   152  0  1 99
 0  0  0  79960   8708 341624 477784    0    0     0     0  357   155  0  2 98
 0  0  0  79960   8708 341628 477780    0    0     8     0  542   242  1  1 98
 0  0  0  79960   8708 341628 477768   12    0    16    16 1254   438  2  8 89

The following is /proc/meminfo:

        total:      used:    free:  shared:   buffers:    cached:
Mem:  1054982144 1045995520  8986624        0 349908992  562728960
Swap:  534601728   82034688 452567040
MemTotal:      1030256 kB
MemFree:          8776 kB
MemShared:           0 kB
Buffers:        341708 kB
Cached:         477592 kB
SwapCached:      71948 kB
Active:         757492 kB
ActiveAnon:     108844 kB
ActiveCache:    648648 kB
Inact_dirty:        84 kB
Inact_laundry:  151652 kB
Inact_clean:     21916 kB
Inact_target:   186228 kB
HighTotal:      131008 kB
HighFree:         1076 kB
LowTotal:       899248 kB
LowFree:          7700 kB
SwapTotal:      522072 kB
SwapFree:       441960 kB

The output of "ps -eo pid,user,args,vsz,rss | grep squid" is:

31695 root     /usr/local/squid        5588    552
31697 squid    (squid)               522940 108052
31699 squid    (unlinkd)               1344      8

I have tried many different values for /proc/sys/vm/bdflush and /proc/sys/vm/kswapd. I can only seem to slow it down by making bdflush run much more often, but the "leak" is still there.
I forgot to mention the kernel version in use: 2.4.20-18.9smp
I will be backporting the latest -rmap updates to this kernel
I have performed some tests with:

vm.bdflush = 30 500 0 0 2560 15360 60 20 0

as proposed. The system appears to swap less aggressively than with the default settings of:

vm.bdflush = 30 500 0 0 500 3000 60

But, as seen from the tests below (with the proposed bdflush setting), where I've been launching a bunch of applications to deplete my 1.2GB of memory, a situation occurs where both Swap and Cache are rising and the system becomes non-responsive / sluggish. The test snapshots were taken using "free -s1"; I've cut'n'pasted to make things easier to read:

test mem snapshot 1:
 Free   Cached  swap used
16968   539068  58008
16064   539444  58052
16064   539632  58088
16064   539712  58088
16296   540084  58132

Test mem snapshot 2 (a couple of minutes after snapshot 1):
 Free   Cached  swap used
17712   547380  58544
11548   549976  58588
12916   551820  58632
12448   552660  58668
10848   554228  58712
10096   554516  58772
11352   549580  58832
11352   549848  58892

meminfo (a little after snapshot 2):

        total:       used:     free:  shared:   buffers:    cached:
Mem:  1320443904 1303855104  16588800        0 104771584  628047872
Swap: 2089177088   61177856 2027999232
MemTotal:      1289496 kB
MemFree:         16200 kB
MemShared:           0 kB
Buffers:        102316 kB
Cached:         553584 kB
SwapCached:      59744 kB
Active:         960188 kB
ActiveAnon:     479156 kB
ActiveCache:    481032 kB
Inact_dirty:        48 kB
Inact_laundry:  184728 kB
Inact_clean:     25988 kB
Inact_target:   234188 kB
HighTotal:      393200 kB
HighFree:         1024 kB
LowTotal:       896296 kB
LowFree:         15176 kB
SwapTotal:     2040212 kB
SwapFree:      1980468 kB

So, a little better with the proposed settings, but the race condition is still evident (and noteworthy). Btw, I'm using the RH9 kernel 2.4.20-18.9.
Did the latest errata kernel (kernel-2.4.20-19.7) address any issues from this bug?
In theory, yes.

--cut--
* Sat Jul 12 2003 Rik van Riel <riel>
- upgrade to latest -rmap to fix #89226, #90668, etc.
--end--

We will see.
2.4.20-19 does appear to bring some improvement. I've noticed on my workstation machines that less swap is now in use. For my squid server it initially looked OK: as cache and buffers were rising and squid grew, free would dip down to around 8000K, then some cache would be freed and free would be back up to 12000K. This lasted for about 20 minutes and then it started eating into swap again. While the rate of increase in swap usage seems noticeably slower, the kernel still seems very eager to swap. And now it appears to be a little more "bursty" about it, in that it tends to swap out in 1MB chunks early on and then do continuous swap-ins until the next chunk it writes out (sorry, I forgot to grab a vmstat capture of this happening but will grab one if you think it's needed).

[blocke@komodo blocke]$ uname -a
Linux komodo.newpaltz.edu 2.4.20-19.9smp #1 SMP Tue Jul 15 17:04:18 EDT 2003 i686 i686 i386 GNU/Linux

[blocke@komodo blocke]$ free
             total       used       free     shared    buffers     cached
Mem:       1030248    1021516       8732          0     274520     472864
-/+ buffers/cache:      274132     756116
Swap:       522072      90516     431556

[blocke@komodo blocke]$ vmstat
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 1  0  0  90516   8600 274544 472972    1    4    16   108  213   152  0  1 98

[blocke@komodo blocke]$ ps -eo pid,user,args,vsz | grep squid
 1700 root     /usr/local/squid        5592
 1702 squid    (squid)               614780
 1704 squid    (unlinkd)               1344
 2818 blocke   grep squid              3576

This may very well be me misunderstanding the amount of memory that squid needs, but I still believe the problem is that the kernel doesn't like giving up cache and buffers when it probably should. (User error or kernel problem?) So in summary: I've seen some improvement on desktop/workstation workloads and a very minor improvement in the squid case.
I am running a production Cyrus IMAP server on an SMP Redhat 2.4.20-19.7.x kernel, and have been experiencing these same problems. The system has 6GB RAM and everything is fine until all free memory becomes used for filesystem cache. Once this happens, the VM appears to prefer swapping to reclaiming filesystem cache pages. Kswapd starts consuming a large amount of CPU time, and the load average jumps dramatically (1min. loadavg of 30-40 on a system that usually has a loadavg of 3-4 during its busiest times). Naturally, everything becomes extremely sluggish. Here is a sample output from "vmstat 5" when this situation is occurring: r b w swpd free buff cache si so bi bo in cs us sy id 1 3 2 73012 11712 222168 4948160 0 1 81 10 53 96 7 27 66 1 2 0 73028 10508 223760 4948220 0 3 897 1880 1963 2471 19 63 18 3 15 5 73072 10552 225284 4946696 0 10 362 2612 2211 2243 22 65 13 17 0 1 73160 12672 225900 4943812 1 22 590 2908 2822 3013 23 49 28 26 0 3 73912 13644 222472 4946896 0 179 626 9333 7817 3856 33 65 2 10 3 2 74008 12036 223616 4947184 6 22 735 1862 3113 3403 36 64 0 18 0 3 74272 11024 215976 4955500 3 82 790 2757 2180 2591 41 58 0 22 8 3 74240 11168 216184 4955956 0 8 473 1843 2300 2576 26 73 1 6 4 2 74252 10808 214828 4955836 5 24 808 2976 2875 3261 41 59 0 12 3 2 74324 12608 207488 4961412 1 55 366 2170 2095 2488 22 77 1 22 8 6 74324 10832 208636 4961360 1 2 422 1191 2001 2169 27 72 0 11 12 4 74828 10692 208568 4961264 0 116 579 6286 5889 3216 36 62 2 14 4 3 74956 11192 208416 4960296 6 27 532 1864 3033 2125 24 76 0 16 10 3 75268 10712 210336 4958820 0 62 350 4056 4143 2207 25 74 1 21 7 4 75556 10616 212176 4956108 0 71 1020 5248 5562 3337 24 75 1 21 19 5 75652 10960 215536 4954084 6 26 623 5127 3362 3212 33 67 0 20 8 6 75712 12744 215940 4951696 10 33 695 4195 2677 2840 26 74 1 21 7 2 75720 11300 217988 4950940 3 10 689 4242 3293 3740 38 62 0 13 13 3 75760 10932 221180 4946608 0 8 647 3536 2593 2893 34 66 0 19 13 3 75768 10876 223072 4944824 6 5 287 2181 2158 2360 21 79 1 14 8 5 75768 11156 225752 4941700 6 7 968 4850 3730 4495 31 68 0 I have reverted to running the 2.4.18-18.7.x SMP kernel and am not experiencing these problems. 
Here's some output from "vmstat 5" when the system was even busier than above (this output will only show 4GB RAM, as I compiled this kernel before I had the full 6GB, and therefore didn't enable PAE): procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 0 0 0 0 10740 383308 2628568 0 0 578 2728 3315 4361 31 6 63 4 0 0 0 10804 383572 2628176 0 0 155 2076 2947 3718 15 5 80 4 1 3 0 10696 384068 2627920 0 0 420 2189 3978 4993 36 5 58 1 1 0 0 19460 384556 2621132 0 0 587 2898 2837 3500 26 7 67 0 0 0 0 16940 384972 2624160 0 0 463 2055 3411 4259 24 7 68 2 0 0 0 15560 385440 2626144 0 0 387 2703 3186 3885 22 9 69 1 0 0 0 11380 385152 2627020 0 0 521 4151 3859 5448 26 8 66 1 2 0 0 10988 385604 2625608 0 0 568 2163 2984 3888 34 4 62 1 1 0 0 10648 386224 2625080 0 0 1379 2239 3085 4090 18 6 76 0 0 0 0 10692 386480 2626148 0 0 369 1105 2667 3458 26 7 67 3 2 0 0 10880 387136 2622072 0 0 316 3516 3594 4498 26 7 67 2 2 1 0 10680 387684 2621396 0 0 317 7511 3633 4692 31 7 62 1 0 0 0 10744 385524 2623248 0 0 1843 2921 4476 5306 33 7 60 0 0 0 0 10712 386028 2621188 0 0 2853 2749 4020 4248 25 7 68 0 0 0 0 11156 386684 2619760 0 0 539 3435 3287 4593 35 6 59 0 0 0 0 13300 387380 2616384 0 0 519 3978 4265 4917 30 7 63 5 0 0 0 12176 383456 2617320 0 0 198 2043 2536 3310 30 5 65 1 2 0 0 11328 383976 2618960 0 0 1198 2037 3024 4148 22 5 73 0 0 0 0 10716 384456 2614516 0 0 484 2512 2836 4154 27 5 68 1 0 1 0 10704 384808 2614580 0 0 237 2068 3106 4081 41 5 54 2 0 1 0 10636 385180 2613144 0 0 335 2064 2806 3819 24 5 70 It's difficult for me to reboot between these kernels, as this is a production system; but if there's any other data I could capture, that would assist in analysis of this problem, I'll try to do so.
For me: 2.4.20-19.7smp helped a little, but it still started swapping, just not as hard. That was on a MySQL DB server, several web servers, and a devel server, all RH7.3. Good news: the devel server was reinstalled with the RHEL3 beta, and the problem is gone :)
We run 2.4.20-19.9 without swap and we basically see the same issue. When memory gets to around the 60% used state (by "used" I mean total minus free minus cache), processes go into what looks like a spinlock craze trying to get memory. This is despite 800MB of RAM (on a 2GB machine) being listed as cache (which to me means "available"). If swap is enabled, it just goes into swap hell. Our application's activity on the system is similar to squid's. It's almost as if the kernel *must* keep about 40% of memory available for cache.
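For anyone wanting to track that same "used" figure over time, a small awk one-liner over /proc/meminfo computes it (here as total minus free minus buffers minus cache, roughly the -/+ buffers/cache line from free; drop the Buffers term to match the definition above exactly). Nothing in it is specific to this bug:

    awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} /^Buffers:/ {b=$2} /^Cached:/ {c=$2}
         END {u=t-f-b-c; printf "app used: %d kB (%.0f%% of RAM)\n", u, 100*u/t}' /proc/meminfo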
Been working for a couple of hours... Found myself spending more and more time waiting while switching between applications. This is what the different memory stats are right now (running with the default vm parameters for RHL9, latest kernel update):

free:
             total       used       free     shared    buffers     cached
Mem:       1289496    1280104       9392          0      24396     779656
-/+ buffers/cache:      476052     813444
Swap:      2040212     285496    1754716

vmstat:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 2  0  0 285700   9324  24404 779656    3   30    46   171  186  1206  6  8 86

/proc/meminfo:
        total:       used:    free:  shared:  buffers:    cached:
Mem:  1320443904 1310916608  9527296        0  24989696  987979776
Swap: 2089177088  292765696 1796411392
MemTotal:      1289496 kB
MemFree:          9304 kB
MemShared:           0 kB
Buffers:         24404 kB
Cached:         779656 kB
SwapCached:     185168 kB
Active:         968668 kB
ActiveAnon:     350752 kB
ActiveCache:    617916 kB
Inact_dirty:      5200 kB
Inact_laundry:  187108 kB
Inact_clean:     28196 kB
Inact_target:   237832 kB
HighTotal:      393200 kB
HighFree:         1276 kB
LowTotal:       896296 kB
LowFree:          8028 kB
SwapTotal:     2040212 kB
SwapFree:      1754308 kB
Any news regarding this bug?
Two things:

1) could you try: echo "1 5" > /proc/sys/vm/pagecache
   ... to make sure the kernel really evicts most of the page cache before swapping

2) Davej, could you add the inode reclaim fixes into the 2.4.20-* kernels?
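For anyone trying this, the sequence below shows the idea. Note that /proc/sys/vm/pagecache only exists on kernels using the rmap VM (such as Red Hat's), and the number of fields it takes can differ between builds, so check the current contents first and keep the same number of fields; the trailing value in the commented three-field form is only an illustrative max, not a recommendation:

    # see whether the tunable exists and how many fields this build uses
    cat /proc/sys/vm/pagecache
    # two-field form, as suggested above
    echo "1 5" > /proc/sys/vm/pagecache
    # three-field form (min borrow max), if the cat above showed three values:
    # echo "1 5 100" > /proc/sys/vm/pagecache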
I have this same problem on about 5 production servers. I have been wrestling with it for the last couple of months and only now finally found this bug thread. I'm on this version now:

Linux chi.actarg.com 2.4.20-20.9 #1 Mon Aug 18 11:45:58 EDT 2003 i686 i686 i386

and am still having problems. I did a:

grep -r zzyzx /remote_nfs_volume/*

and watched the cache with "free". If left to run, the NFS caching will grow continually, swapping out just about every process on the machine. The apps become very sluggish and the cache does not seem to be released easily. Unfortunately, I upgraded the whole network to RH9 before understanding there was a problem, so now I'm crippled across the network. I tried

echo 1 5 > /proc/sys/vm/pagetable_cache

(I'm assuming that's what is meant in #28) but the system still seems to prefer swapping out processes over releasing cache. Is a fix for this in the works?
> Davej, could you add the inode reclaim fixes into the 2.4.20-* kernels?

Not without spending a significant amount of time untangling the various VM-related patches in that tree. It's based on an older rmap version, with various updates (some of which may or may not be in later rmaps). At a guess it's at least a day or so of work. I don't have time to do this anytime soon, so don't hold your breath for it..
I'm holding my breath for something :) Is there something I can do as a workaround? For example, is there a way to limit the amount of memory the kernel uses for caching? That way, I could keep the memory more available for processes.
Created attachment 96045 [details]
none

Is there a fix for this yet - is it an issue in AS3. We see this activity on our servers to the point that they become unusable. The last messages on the console show kswapd as the top process.
I'd appreciate it if you could post a screen of top(1) and a screen of vmstat 1 during a trouble period, so we can debug what's happening with the RHEL3 kernel, as well as the exact version number of the kernel you are using.
> is it an issue in AS3

I think there should be a '?' in there. I am migrating to RHEL3 (from 7.3) and am NOT seeing this anymore on 2.4.21-4.ELsmp.

> We see this activity on our servers to the point that they become unusable

On RH7.3 you might try this: /sbin/swapoff -a (works for me)
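A slightly gentler variant of the same workaround, if running without swap permanently isn't acceptable, is to cycle swap, which forces everything that was swapped out back into RAM. It is only safe if what is currently in swap fits into free memory, so check free first:

    /sbin/swapoff -a && /sbin/swapon -a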
Created attachment 96201 [details] Tarball of bug-reproducing example code
The block device cache is causing kswapd thrashing, usually bringing the system to a halt. This problem has been reproduced on kernels as recent as 2.4.21-4EL.

In our application we deal with large (multi-GB) files on multi-CPU 4GB platforms (mostly 2.4.7-10). While handling these files, the block device cache allocates all remaining available memory (3.5G) up to the 4G physical limit. Once the block device cache has pegged the physical memory limit, it doesn't seem to manage its allocation of that memory well enough to prevent unnecessary page-swapping. Ultimately, thrashing takes over and the SYSTEM COMES TO A HALT.

After the application closes all files and exits, the cache maintains its allocation of this memory until either: 1) the file is removed, or 2) somebody requests more memory. In the former case, used memory (top, /proc/meminfo) drops instantly to the amount used by all processes (sum of ps use). In the latter, memory use remains pegged and swapping typically remains a problem. There doesn't appear to be a timeout on the cache's allocation. THIS IS BROKEN.

This problem is most noticeable when the (cached) files causing the problem are on a local disk. Below is an example of a pseudo-idle system (only running 'du') which is affected by the thrashing problem. Both CPUs are 99% system, kswapd is at 99.9%, the load average exceeds 4 and is growing, and virtually all memory is consumed, although only 717,140K is reported to be used by "all" processes (using a sum of 'ps -aux' memory use).

 5:31pm  up 53 days, 11:28, 19 users,  load average: 4.64, 3.14, 2.14
160 processes: 157 sleeping, 2 running, 0 zombie, 1 stopped
CPU0 states:  0.1% user, 99.0% system,  0.0% nice,  0.2% idle
CPU1 states:  0.1% user, 99.2% system,  0.0% nice,  0.0% idle
Mem:  3928460K av, 3828808K used,   99652K free,      0K shrd,   26148K buff
Swap: 4194224K av,  696384K used, 3497840K free                 2715008K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
    5 root      17   0     0    0     0 RW   99.9  0.0 218:52 kswapd

I have seen situations where the load average exceeds 12.0 (!), and others on a 4-CPU 64-bit 6GB machine (running 2.4.21[-4.EL]) where all four CPUs are at 100% system, and page-swapping.

THIS PROBLEM IS READILY REPRODUCIBLE. I have a test program (fst) which can reproduce the problem, along with an additional memory reclamation program (reclaim). A tarball of these has been attached. fst can be used to generate large files (with seek behavior typical of our application, as seeking seems to aggravate the problem). When using fst (on a 4GB system), specify 'num_blks' to be 2,000,000 to 4,000,000, with mode = 1 (seek-updating enabled):

fst 3000000 fst.out 1

This will create a file with 3,000,000 blocks of random size between 1-2048 bytes. Midway through creating fst.out, the block device cache should have allocated all of memory. If thrashing doesn't immediately occur you can run multiple fst's to aggravate the problem.

reclaim can be used to illustrate that, with fst still running (and pegged), it is possible to manually reclaim/free the memory used by the block device cache, thereby eliminating the issues with kswapd, bdflush, kupdated, etc. But since fst is still running, memory usage creeps back up, as expected.

This seems to be a fairly fundamental and substantial problem. Over time, rogue memory use by the block device cache simply creeps up and up toward the physical limit. And it becomes a problem more and more readily. Can anyone provide a means to mitigate or eliminate this problem?
We've toyed with altering parameters to bdflush and the like, with no success.
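For anyone without the attachment handy, a rough shell approximation of the access pattern fst generates is sketched below: grow a file with random-sized appends plus occasional in-place seek-updates until the page cache has eaten all free memory. This is not the attached program, just an illustration; the file name, block count and seek range are made up:

    #!/bin/bash
    # fst-like workload sketch: random-sized appends with periodic seek-and-rewrite
    OUT=fst-like.out
    : > "$OUT"
    i=0
    while [ "$i" -lt 3000000 ]; do
        # append a record of 1-2048 bytes
        dd if=/dev/zero bs=$((RANDOM % 2048 + 1)) count=1 2>/dev/null >> "$OUT"
        # every 100 records, seek back and rewrite an earlier 512-byte block in place
        if [ $((i % 100)) -eq 0 ]; then
            dd if=/dev/zero of="$OUT" bs=512 count=1 seek=$((RANDOM % 1000)) \
               conv=notrunc 2>/dev/null
        fi
        i=$((i + 1))
    done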
Chris, thank you for your test program. I'll be visiting family over the next week, but once I return I'll run it and I'll try to improve the VM's behaviour when faced with your test program.
I am also seeing the same problem after just running up2date. Although up2date is only upgrading me to the following (others of you are on higher versions, maybe because you're running EE; or else do I need to update manually?):

Linux localhost.localdomain 2.4.20-8 #1 Thu Mar 13 17:18:24 EST 2003 i686 athlon i386 GNU/Linux

I am running RH9 Workstation with 768MB RAM. Very frustrating, because my system is so slow now once I top off the RAM.
Could this be related? http://marc.theaimsgroup.com/?l=linux-kernel&m=107368165419559&w=2
This problem has been shown to be eliminated in (at least) Red Hat's 2.4.20-24.7 or later (available as an RPM from updates.redhat.com), and in (at least) 2.4.23 from kernel.org.
I've upgraded to Fedora and have been running with it for a while. I haven't noticed the problem for some time now. I'm solely using Fedora as a desktop environment (as I did RH9) for my development tasks, which is where I first stumbled upon the issue. My kernel is 2.4.22-1.2149.nptl. So, for me it's either solved or reduced beyond the point of notice.
I have been able to reproduce this on a fresh install of Red Hat 9 using the latest Red Hat release of the 2.4.20-28.9smp kernel. If I blast NFS reads/writes to a single NFS mount point, I can reproduce this in under 3 minutes on a Dell 1750 with 4GB RAM and a 2GB swap partition. I am getting ready to try this with the stock 2.4.24 kernel and Andrea's 2.4.23aa1 patch.
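In case it helps anyone else reproduce it, a crude version of that kind of NFS blast can be done with nothing but dd and vmstat; the mount point, sizes and log path below are examples only:

    # log memory behaviour in the background
    vmstat 5 > /var/tmp/vmstat-nfs.log 2>&1 &
    # hammer a single NFS mount with large sequential writes and reads
    while true; do
        dd if=/dev/zero of=/mnt/nfstest/bigfile bs=1024k count=4096
        dd if=/mnt/nfstest/bigfile of=/dev/null bs=1024k
    done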
I have been able to reproduce this issue with the 2.4.24smp kernel on Red Hat 9 on a Dell 450 workstation, with different behavior than I experienced on the 2.4.20-xx kernels. To sum it up, 2.4.24smp behaves much better with respect to this caching issue. I was not able to get this behavior on the 2.6.2smp kernel, using an identically configured Dell 450. I have been able to bring the 2.4.20-xx kernels to their knees in less than 15 minutes, using the same scenarios below.

The results listed below are based on 24 hours of testing, and the system is still running well. The swap space consumption does still grow over time, however not at the previous rates. The amount of swap space consumed is dramatically lower than the amount consumed by the past 2.4.20-xx kernels using these same tests.

I loaded X and 3 bash sessions, plus the normal run-of-the-mill daemons, which are mostly idle. I hacked together a little C code to generate a large file (100GB) over NFS, sequentially in 512-byte increments, rather than using dd (a sketch of this kind of writer appears after this comment). This caused the system to start consuming swap space, however it takes much longer to reach this state using 2.4.24 than the 2.4.20-xx kernels. Here is a clip of top from the first scenario:

11:36:57 up 22:51, 6 users, load average: 0.45, 0.53, 0.51
93 processes: 92 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 0.2% user 6.2% system 0.0% nice 0.0% iowait 93.0% idle
CPU1 states: 0.0% user 1.1% system 0.0% nice 0.0% iowait 98.3% idle
CPU2 states: 5.3% user 25.0% system 0.0% nice 0.0% iowait 69.0% idle
CPU3 states: 0.0% user 0.1% system 0.0% nice 0.0% iowait 99.4% idle
Mem: 2069312k av, 2019632k used, 49680k free, 0k shrd, 49292k buff
108740k active, 1781024k inactive
Swap: 2096440k av, 1104k used, 2095336k free 1771100k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
6766 jim 17 0 576 576 504 S 30.9 0.0 5:18 2 cacheSmasher
1052 root 9 0 0 0 0 SW 2.9 0.0 2:56 0 rpciod
7 root 9 0 0 0 0 SW 1.1 0.0 0:28 1 kswapd
3540 root 9 0 1136 1136 876 S 0.3 0.0 3:51 0 top
3022 root 9 -1 140M 12M 4180 S < 0.1 0.6 2:33 1 X
6933 root 10 0 1128 1128 876 R 0.1 0.0 0:02 0 top
1 root 9 0 464 464 416 S 0.0 0.0 0:04 2 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd_CPU0
4 root 18 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd_CPU1
5 root 19 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd_CPU2
6 root 18 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd_CPU3
8 root 9 0 0 0 0 SW 0.0 0.0 0:00 2 bdflush
9 root 9 0 0 0 0 SW 0.0 0.0 0:01 2

Next I loaded Mozilla and the swap space increased by 700k, which actually is not so bad. However, I do not have enough apps and daemons loaded to consume even half of the 2GB of RAM in the system.
12:54:35 up 1 day, 9 min, 6 users, load average: 0.49, 0.75, 0.80
99 processes: 97 sleeping, 2 running, 0 zombie, 0 stopped
CPU0 states: 3.1% user 14.4% system 0.0% nice 0.0% iowait 81.3% idle
CPU1 states: 3.3% user 16.3% system 0.0% nice 0.0% iowait 79.2% idle
CPU2 states: 0.0% user 4.1% system 0.0% nice 0.0% iowait 95.3% idle
CPU3 states: 0.2% user 0.2% system 0.0% nice 0.0% iowait 99.1% idle
Mem: 2069312k av, 2019268k used, 50044k free, 0k shrd, 37384k buff
113240k active, 1827660k inactive
Swap: 2096440k av, 1832k used, 2094608k free 1790920k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
6766 jim 16 0 576 576 504 R 31.8 0.0 31:09 0 cacheSmasher
1052 root 9 0 0 0 0 SW 2.7 0.0 6:16 3 rpciod
3022 root 12 -1 147M 18M 4600 S < 1.7 0.9 3:20 3 X
6939 jim 9 0 52020 50M 14984 S 1.1 2.5 0:24 1 mozilla-bin
3130 root 9 0 21712 21M 19288 S 0.9 1.0 2:24 0 kdeinit
3110 root 9 0 16888 16M 15600 S 0.5 0.8 5:30 0 kdeinit
3171 root 12 0 19600 19M 17652 S 0.5 0.9 1:20 0 kdeinit

In the third scenario I loaded some apps which log locally and consume between 20-90MB of RAM, then slowly grow over time. The swap space moved only up to around 3200k; in the previous kernel it would have spiked very fast. Even though this is running much better and consuming very little swap space, I would not expect anything to be in swap with this much memory available and so little actual memory being consumed by apps.

13:33:42 up 1 day, 48 min, 15 users, load average: 2.17, 1.90, 1.42
193 processes: 190 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states: 18.4% user 18.1% system 13.0% nice 0.0% iowait 63.4% idle
CPU1 states: 17.1% user 15.4% system 13.0% nice 0.0% iowait 66.3% idle
CPU2 states: 18.2% user 8.3% system 14.4% nice 0.0% iowait 72.5% idle
CPU3 states: 11.1% user 9.4% system 8.0% nice 0.0% iowait 78.4% idle
Mem: 2069312k av, 2018996k used, 50316k free, 0k shrd, 37768k buff
114676k active, 1853104k inactive
Swap: 2096440k av, 3240k used, 2093200k free 1458848k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
7662 jim 19 9 83552 16M 10344 R N 50.7 0.8 6:16 2 iomon
6766 root 13 0 576 576 504 R 45.6 0.0 46:45 2 cacheSmasher
1052 root 9 0 0 0 0 SW 4.5 0.0 8:20 3 rpciod
7543 jim 9 0 69444 67M 1812 S 3.1 3.3 1:09 3 logandgrow
3540 root 9 0 1308 1308 960 S 2.5 0.0 4:57 0 top
6933 root 11 0 1312 1312 960 R 2.5 0.0 1:08 1 top
7535 jim 9 0 12920 12M 1028 S 2.5 0.6 1:24 2 logandgrow
7545 jim 9 0 86640 84M 1812 S 1.5 4.1 1:16 0 logandgrow
7570 jim 9 0 78692 76M 1812 S 1.5 3.8 0:33 3 logandgrow

Now I will try to throttle the system over the holiday weekend, to see how stable it is with high usage over a long period of time. If I produce results which are contrary to these, I will post them next week.
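Here is a minimal sketch of that kind of sequential 512-byte writer (hypothetical, not the original code; the output path and ~100GB target below are illustrative). Point the path at an NFS mount to reproduce the scenario.

/*
 * Hypothetical sketch of a sequential 512-byte writer -- not the original
 * code.  The output path and target size are illustrative.
 */
#define _FILE_OFFSET_BITS 64                 /* large-file support on 32-bit */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/mnt/nfs/bigfile";   /* assumed path */
    long long target = 100LL * 1024 * 1024 * 1024;                  /* ~100GB */
    char buf[512];
    memset(buf, 'x', sizeof(buf));

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* write the file sequentially, 512 bytes at a time */
    for (long long written = 0; written < target; written += (long long)sizeof(buf)) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            break;
        }
    }
    close(fd);
    return 0;
}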
Additional note: The swap space used just reached 348,452k, so I terminated the heavy NFS I/O. Unlike 2.4.20-xx, cache is now being freed for my apps as they grow, though the swap usage is still slowly growing.
I can confirm that the swap space does free up slowly over time using the 2.4.24smp kernel, as does the cache space. The system is currently maintaining 40-50MB of free memory, where in the past it averaged between 4-10MB free. To sum it up again, the caching seems much better at this point.
I tried the updated kernel from Red Hat, 2.4.20-30.9smp, for my Red Hat 9, but it still eats all RAM for caching :( So the problem exists with the updated kernel too. Or am I wrong? Could anybody say how to fix this problem, or when Red Hat will fix it? I use 5 Red Hat 9 systems as servers.
The kernel(s) "eat all ram for caching" by design. The issue described here has been that in older (than 2.4.20-20-ish RedHat) kernels actually have difficultly managing low-mem situation such that kswapd et al get a lot of time (even though it doesn't actually swap pages in the end). Later kernels still give all free memory to the block device cache (why shouldn't it?), don't have weird swapping issues, and the BDC gives memory back when needed.
I would recommend rolling your own 2.4.24 (kernel.org) or newer kernel; the caching in it works fine. I haven't seen any indication of a backport being planned for the 2.4.20 series kernels.
So why doesn't Red Hat build a >2.4.20 kernel RPM for Red Hat 9? Could somebody explain this to me? :)
I have come across this same problem, but even when I set vm.pagecache to 2 10 20, I still get more than 20% of memory used as cache. Here is /proc/meminfo:

[grma@shane 59] ~ > cat /proc/meminfo
total: used: free: shared: buffers: cached:
Mem: 525836288 445038592 80797696 0 31404032 187092992
Swap: 2146787328 30904320 2115883008
MemTotal: 513512 kB
MemFree: 78904 kB
MemShared: 0 kB
Buffers: 30668 kB
Cached: 181156 kB
SwapCached: 1552 kB
Active: 347540 kB
ActiveAnon: 204324 kB
ActiveCache: 143216 kB
Inact_dirty: 32452 kB
Inact_laundry: 27108 kB
Inact_clean: 5700 kB
Inact_target: 82560 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 513512 kB
LowFree: 78904 kB
SwapTotal: 2096472 kB
SwapFree: 2066292 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB

My system seems to grind to a halt due to there being no memory available to the applications, because it is all being used for cache. Any ideas what I can do to prevent this?
Can someone please explain why Red Hat seems to be doing nothing about a MAJOR issue with their RHEL 3 flagship product? I have been doing some research into this, and it is definitely a problem that people are coming across. It appears that there is no workaround, so where is the updated kernel? This problem has been around for a long time now. Where is the support that we pay £000's for??
OK, apparently it is still an issue in RHEL3. Larry, this bug may have some useful info for you...
I think all Red Hat 9 users are waiting for a fix for this bug too (not only RHEL users). If the latest 2.4.x kernels don't have this problem, then the Red Hat team should build a new kernel RPM.
I certainly still see this behavior on RHEL3.0 with 2.4.21-9.0.3.ELsmp. Same data as everybody else has been posting for over a year now on this thread, so I won't bother. For me, it's mysql4.0.x that triggers the swapping, but RH doesn't support mysql4, so my trouble ticket got rejected when I complained about it. But now, after months of searching around, I see this thread, and realize that it's the kernel, not mysql, that has the trouble. Redhat, this is not a problem that only a few people see. Please, can you recommend a workaround until RHEL3 has a kernel with better memory management?
Hi Loren, ok, Red Hat won't want to hear this, but you have 2 possible choices:
1. Use an -aa VM enabled kernel (e.g. mainline 2.4)
2. Use kernel 2.6.6*
where (1) isn't really a recommended workaround at all :) Anyway, I've tried to fix this bug over the past days w/o much success yet, but I won't give up. Any hints, comments or suggestions really appreciated. ciao, Marc
I turned off swap on one of my machines and it's running fine so far, without the occasional delays I'd usually see when si/so would kick in. I can do this because I have 4GB of RAM and I know I am not going to use more than 3.2GB of it. Today's Slashdot carries a story on this issue: http://developers.slashdot.org/article.pl?sid=04/04/30/1238250&mode=thread&tid=106&tid=185. From what I gather in the user comments, 2.6.6 helps by offering a "swappiness" param that you can set to zero (or really low) to discourage paging application memory to disk in favor of reclaiming cache. But for 2.4.x, many folks report success with "swapoff -a".
Hi Loren, well, "swapoff -a" isn't a solution, it's a workaround at all and in question whether a good one or not. Either we should make /proc/sys/vm/pagecache to behave correctly or introduce something like /proc/sys/vm/swappiness which works correctly ;) and *imho* the default of "1 15 100" of pagecache is wrong at all. I had good experience with "1 10 10" or even lower values. Even a _real_ working drop behind (on/off via sysctl) would make some sense. ciao, Marc
Can someone explain which kernel we have to use on RH9 to fix this problem?
RHL9 is EOL, but it should be possible to run the Taroon (RHEL3) kernel with it. The .src.rpms for that kernel are available from ftp.redhat.com.
and the binary from ftp://ftp.redhat.de/pub/SAP/RHEL3/certified/kernel-2.4.21-9.0.1.EL
... and that kernel has the same problems as mentioned above. Anyway, for all of you experiencing the above problems, try setting pagecache to 1 10 10 (echo 1 10 10 > /proc/sys/vm/pagecache); it will at least work better than before. ciao, Marc
There is no /proc/sys/vm/pagecache file. Did the filename change?

[root@synstd2 vm]# ll
total 0
-rw-r--r-- 1 root root 0 mai 11 11:19 bdflush
-rw-r--r-- 1 root root 0 mai 11 11:19 kswapd
-rw-r--r-- 1 root root 0 mai 11 11:19 max_map_count
-rw-r--r-- 1 root root 0 mai 11 11:19 max-readahead
-rw-r--r-- 1 root root 0 mai 11 11:19 min-readahead
-rw-r--r-- 1 root root 0 mai 11 11:19 overcommit_memory
-rw-r--r-- 1 root root 0 mai 11 11:19 page-cluster
-rw-r--r-- 1 root root 0 mai 11 11:19 pagetable_cache

kernel: 2.4.20-31.9
Hi Arns, well, either that kernel does not have RMAP (unlikely ;) or the tunable is not yet added. It was added in rmap 15c. You can check whether you have rmap or not by "ls -lsa mm/rmap.c" in your kernel source directory. Maybe there's an update for your redhat to get a newer kernel?! Dunno. ciao, Marc
Arns is right; I'm running Red Hat's kernel (kernel-2.4.20-31.9.i686.rpm, via up2date just before the end-of-life). I installed their source rpm, too, and /usr/src/linux-2.4/mm/rmap.c exists. /proc/sys/vm/pagecache does not. Additionally, this bug is marked as an Athlon bug. I get the same results on an Intel PIII.
This bug is now pretty much worthless. It has become a discussion forum with a mishmash of several different issues. Some points:
- Red Hat 9 is _DEAD_. No one with the power to fix your "issues" cares about it anymore.
- RHEL 3, while based off of RH 9, is not the same, and there are patches in the RHEL 3 kernel tree that were never in RH 9 (from what I see).
- The VM (rmap patches, etc.) has changed over time, so earlier issues mentioned in this bug have been fixed for some people and behaviours may have changed for others.
If you are having problems with the RH 9 VM, the fix is simple... Stop using RH 9. Fedora Core 1's kernel "fixed" all the issues I had with squid (mentioned above). If you are using RHEL 3 and you are seeing VM issues, then file a separate bug report with detailed information. Chances are it is a completely different bug that has similar symptoms based on your workload! I'd strongly suggest closing this bug as it is pure noise. </rant>
Bruce, I have 6 RH9 stations deployed around the world and I'm about to deploy between 150 and 200 Linux stations in the next months... so I will follow your advice: "...the fix is simple... Stop using RH 9..". We're going to switch to *something other than Red Hat*. No more noise. Arns.
Bruce, the problem is better, but certainly not "FIXED" in FC1, which I have been running since it came out. The system still aggressively tends to purge active process memory via page-outs every time there is a burst of filesystem I/O. This is fundamentally wrong from two views:
1) Large semi-idle processes like X, kdeinit, and other interactive processes get their memory stolen, and have to page back in under degraded I/O conditions to respond to keystrokes with high latency.
2) It takes two I/Os to page a process's memory out and back in, but only a single I/O to re-read evicted filesystem cache data ... paging out once there is filesystem I/O activity is a pure mistake, as it creates additional disk I/O rather than saving it.
While the fixes improve this to some degree, the problem is certainly NOT fixed ... notice the burst of I/O invoking page-outs below on a relatively idle FC1 system with significant memory:

[jbass@fastbox jbass]$ vmstat 5
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy wa id
0 0 58780 24764 124636 55048 0 0 4 6 13 8 3 2 0 29
0 0 58780 24752 124636 55048 0 0 0 8 101 225 2 1 0 97
0 0 58780 24744 124640 55048 0 0 0 2 129 296 2 1 0 97
0 0 58780 24740 124644 55048 0 0 0 2 101 224 1 1 0 98
0 0 58780 24736 124648 55048 0 0 0 9 108 241 2 1 0 98
0 0 58780 24724 124656 55052 0 0 1 16 163 435 5 1 0 93
0 0 58780 24728 124656 55052 0 0 0 2 108 328 3 1 0 96
3 0 58780 24456 124672 55200 0 0 32 2 107 251 2 0 0 98
1 0 58780 23972 125088 55200 0 0 79 14 377 1013 67 17 0 16
2 0 58780 23452 125520 55200 0 0 79 9 311 581 51 22 0 27
1 0 58780 22104 126556 55228 0 0 202 23 530 1015 55 6 0 40
2 0 58780 21908 126712 55228 0 0 23 4 148 352 54 9 0 37
2 0 58780 21840 126744 55228 6 0 9 4 150 353 47 43 0 10
2 0 58780 21576 126764 55228 0 0 1 4 110 243 37 59 0 4
2 0 58780 16612 126776 55228 0 0 0 4 118 269 54 35 0 10
2 0 58780 13864 126796 55228 0 0 0 10 102 252 28 70 0 2
2 0 58780 10216 126816 55228 0 0 2 8 107 232 31 68 0 2
2 0 58780 4256 125248 55164 0 0 0 4 103 270 45 50 0 5
2 0 58780 4232 124288 55164 0 0 0 2 203 502 13 85 0 2
1 0 58780 4192 121352 55164 6 0 9 10 115 403 21 73 0 6
0 1 58800 9504 119180 47524 0 10 214 31 139 446 49 11 0 40
1 0 58800 4816 119196 46672 0 0 254 198 168 410 50 7 0 44
1 0 58896 9344 119740 46632 0 0 142 146 201 469 58 16 0 26
2 0 58896 7904 120160 46632 0 0 80 107 158 248 61 3 0 36
1 1 58896 4636 119848 46620 2 0 301 70 216 553 56 10 0 35
0 1 58932 10524 119444 46592 0 22 460 174 236 706 25 12 0 63
1 1 58932 8288 121640 46592 0 0 436 95 212 720 18 6 0 76
0 1 58932 6176 123692 46592 0 0 407 98 250 789 17 8 0 75
0 1 58932 4664 125196 46592 0 0 297 309 220 578 16 5 0 79
2 0 58932 3872 126344 46284 0 31 384 262 217 618 17 5 0 78
2 1 59040 10376 127944 46404 0 0 438 199 259 759 21 7 0 72
1 0 59040 8668 129652 46404 0 0 338 135 189 575 32 6 0 62
1 0 59040 7364 129876 46768 0 0 110 370 253 716 64 7 0 29
1 0 59144 22512 129152 46516 0 4 102 305 225 503 48 10 0 42
1 0 59144 21952 129672 46516 0 0 91 594 162 376 52 5 0 43
1 0 59144 21444 130116 46516 0 0 82 393 138 323 52 7 0 41

[jbass@fastbox jbass]$ free
total used free shared buffers cached
Mem: 384472 375608 8864 0 105608 46036
-/+ buffers/cache: 223964 160508
Swap: 923696 60096 863600
There is "The Fedora Legacy Project" http://www.fedoralegacy.org/ with bugzilla for issues like this. I've opened bugzilla issue there: http://bugzilla.fedora.us/show_bug.cgi?id=1797 Probably it'll be the place to continue bug discussion. I'd ask those who had successful results (namely Jim Laverty, Christopher McCrory, Chris Petersen, Erik Reuter, and Marc-Christian Petersen) with 2.4.24 and 2.6.6 kernels to put their comments in "Fedora Legacy" bug that would help package proper kernels for RH9.
I just switched jobs, so I will add my comments in the next week or so. I'm in the process of updating my e-mail address everywhere. I have Fedora Core 2 w/2.6.7 running here, so I will post results based on that also.
Hi Jim, 2.6.* results aren't interesting and aren't related to this bug report. Anyway, I've found out how to fix 2.4-rmap's silly behaviour of swapping everything out like hell. It's basically a 3-line change. I'll cook up a patch for the latest RHEL3 kernel with a /proc value to turn that feature on/off. ciao, Marc
Hi Marc-Christian, would you be so kind as to explain things for me, as an RH9 (2.4.20-31.9) prisoner? Would your patch be useful for that older kernel as well? I've opened the bug at FL bugzilla: https://bugzilla.fedora.us/show_bug.cgi?id=1797 Thanks, m.
Marc, I cross posted between the RH 9 and FC1 instances of this bug (issue). A patch sounds good and very useful for the masses, nice work. Jim
Am I correct in thinking that someone might have a patch for this problem? If so, could it be posted ASAP as I have machines that I am going to have to rebuild as SUSE machines unless they get fixed quickly.
Created attachment 103053 [details] Fix braindead swapping
Created attachment 103054 [details] Fix braindead swapping
ARGS, I thought I had already done it, but I was wrong :-( Sorry. Well, I don't care if you use Red Hat or SuSE (I use Debian ;) but here we go. I've attached some patches (01_vm-anon-lru.patch is the one which fixes the braindead swapping), but there is a lot more: updated VM documentation (every knob is documented in Documentation/sysctl/vm.txt), VM tweaks in /proc/sys/vm, and a bonus for desktop users to get non-sluggish desktop behaviour: an O(1) "desktop" boot parameter which changes max-timeslice, min-timeslice and child-penalty (also changeable at runtime via /proc/sys/kernel/sched*). Also, vm.pagecache now defaults to 1 5 10 (1 15 100 is silly). These patches have to be applied in numbering order against a 2.4.21-15-0.3.EL kernel (maybe they'll apply to something different also, dunno). These patches fix all of the problems reported here for _ME_ and _my_ customers. That's all I cared about. If they fix your problems and others' as well, I'm glad :-) P.S.: Yes, the vm knobs are taken from 2.4-AA. ciao, Marc
Created attachment 103055 [details] 02 - vm.vm_cache_scan_ratio
Created attachment 103056 [details] 03 - vm.vm_passes
Created attachment 103057 [details] 04 - vm.vm_gfp_debug
Created attachment 103058 [details] 05 - vm.vm_vfs_scan_ratio
Created attachment 103059 [details] 06 - Remove old and obsolete VM documentation
Created attachment 103061 [details] 07 - Update VM docu to Documentation/sysctl/vm.txt
Created attachment 103062 [details] 08 - just reorder 1 variable in mm/vmscan.c
Created attachment 103063 [details] 09 - vm.pagecache - Change '1 15 100' to '1 5 10'
Created attachment 103064 [details] 10 - O(1) scheduler: Introduce sysctl knobs for max-timeslice, min-timeslice and child-penalty (Part 1)
Created attachment 103065 [details] O(1) scheduler: Introduce 'desktop' boot parameter (lowered max-timeslice) (Part 2)
Okay, 11 patches are up. That's all. Without vm.vm_anon_lru, this machine, which is up now:

root@christian:[/] # w
11:52:34 up 15 days, 23:43, 18 users, load average: 0.26, 0.18, 0.17

used to go in swap after 1-2 days using almost all of swap available. Now, take a look yourself :p (NOTE: 2.4-WOLK /proc/meminfo output, not 2.4-REDHAT, not 2.6*)

total: used: free: shared: buffers: cached:
Mem: 527556608 516202496 11354112 0 118779904 183275520
Swap: 139788288 16384 139771904
MemTotal: 515192 kB
MemFree: 11088 kB
MemUsed: 504104 kB
Buffers: 115996 kB
Cached: 178964 kB
SwapCached: 16 kB
Active: 236772 kB
ActiveAnon: 1820 kB
ActiveCache: 234952 kB
Inactive: 58480 kB
Inact_dirty: 43660 kB
Inact_laundry: 7736 kB
Inact_clean: 7084 kB
Inact_target: 59048 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 515192 kB
LowFree: 11088 kB
SwapTotal: 136512 kB
SwapFree: 136496 kB
SwapUsed: 16 kB
VmallocTotal: 516028 kB
VmallocUsed: 22192 kB
VmallocChunk: 493836 kB

Have fun. We had it =) ciao, Marc
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/