Red Hat Bugzilla – Bug 52340
2.4.7-2enterprise kernel crippled under heavy I/O
Last modified: 2007-04-18 12:36:13 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.3)
Description of problem:
machine: Compaq ProLiant DL360 w/4GB mem, dual 36GB SCSI drives
Red Hat 7.1 + errata updates, kernel-enterprise-2.4.7-2.i686.rpm
Under heavy I/O (Apache and a custom C++ module which do lots of mmap and
munmap calls over large data sets - 7GB total), the machine slows to a
crawl. The problem persists even after live traffic to the machine ceases.
A top listing shows both CPUs at 100% system time. Any commands (ps, uname,
whatever) take minutes to return results.
The same setup on RH 6.2 with 2.4.3-ac3 works fine. Please let me know
what information may be useful to debugging this problem (no oops yet), and
other kernels to try; I'm looking at 2.4.8-ac9 right now.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Load large data sets into memory
2. Query against those sets
3. Watch it suffocate after a few minutes
More on how to reproduce:
Probably the best way to reproduce the environment would be to write a
C++ class with a method that maps a file, touches all the memory in that
file, and then returns. Add another method that takes a number and returns
that same number. Then run that code through SWIG and write a mod_perl
interface that calls the first method, then calls the second method in a
loop (10 times should be good), then Storable::freeze's an array of results
and prints it to stdout.
This ought to emulate the kinds of things we do there. You might actually
have that method take a number and return a string (literal) instead of a
number, just to exercise SWIG a little more.
Note that the file must be large; maybe 50% greater than physical RAM.
I am seeing this situation as well. I see that the following processes take a
lot of cpu time:
We (Red Hat) should try to fix this before next release.
With 4GB of memory, you might compare with the SMP kernel. Compaq machines
with 4GB tend to use the enterprise kernel because they have memory holes
that push some addresses over the 4GB mark, making that memory unaddressable
without the enterprise kernel. But there's enough overhead from the PAE
support (3-level page tables) that you might find it faster to lose the
128MB of memory (I think that's the normal hole size) and use the SMP kernel.
Could you please "cat /proc/<pid>/maps | wc -l" for the processes doing
the mmap/munmap and post the results?
A vanilla 2.4.8-ac9 (w/o PAE enabled) seems to handle this load effectively.
We can provide a 10-second output of 'strace -tt'; would this help? The output
would be substantial, however.
I think I've solved half the mystery, and improved the other half a bit ;)
Are you using an aic7xxx SCSI card for the disks?
The DL360 has the Compaq 'Integrated Smart Array Controller', which uses its
own driver rather than aic7xxx.
I have tested the latest 2.4.7-2.9enterprise kernel after fixing my SCSI
errors. It works better; however, the box still gets slow under high disk
loads. Running an iozone in one console will make spawning of new processes
extremely lagged. The 'w' command will hang for about 2-3 minutes. Logins as
well. Already-running processes (top, vmstat in my case) will continue to run.
Box has 8GB of RAM.
With the 2.4.7-2.12enterprise kernel, the pauses are less random. It seems that
while the iozone process is writing, reads are delayed. A good example is
the 'w' command. I ran this prior to starting iozone. After iozone was
started, 'w' still ran well. However, if I tried executing 'who' or 'last',
there was a 30-60 second delay. After it did run, these commands executed
normally from then on. This will happen with any command that has not been
executed prior to the write phase.
This read/write behavior involves a tradeoff between latency and throughput.
In this case we favor throughput; favoring latency can cause thrashing
and not help latency as much as you might expect.
Of course, if something is in cache it doesn't hit the I/O queue, and
so the latency issue doesn't show up.
I think the "crippled" part is fixed by now, although we will continue
to work on improvements.
Note, you can tune the maximum time reads wait for writes with the
/sbin/elvtune tool. Making reads have a lower latency will improve interactive
performance at the cost of raw throughput (and thus benchmark numbers).
Also, if this is a PERC or similar device: such a controller can easily have up
to 100 megabytes of I/O "in flight", and it then takes a while to complete new
read requests.
email@example.com: does the 2.4.7-6 kernel (available via up2date and probably
rawhide) fix your problem ?
We tried the new kernel. It works fine until the filesystem cache fills up all
the memory. Here is a before and after 'readprofile'. We have Oracle running
while ftp'ing in files to the box. The before profile is while the files ftping
in are filling up the FS cache. The after is once this cache fills and the
machine starts behaving badly.
Here are forwarded test results from my Oracle DBA on the box. When he talks
about Oracle connection tests, that is just connecting as many Oracle clients
as possible to the server to eat RAM. Is this enough information for you?
I mounted /fstest (a large filesystem through LVM, with LVM compiled into the
kernel so it could get profiled; tried with and without LVM with no noticeable
difference in the results below) and ran the test.
While the system is eating up highmem (when the system was caching the files
getting ftped in), the profile looked like this.
291730 default_idle 4558.2812
1948 blk_get_queue 24.3500
2341 pci_get_interrupt_pin 18.2891
3887 isapnp_set_port 13.4965
3280 file_read_actor 12.8125
4149 bounce_end_io_read 11.2745
910 vgacon_build_attr 5.6875
928 si_swapinfo 4.8333
251 generic_unplug_device 3.9219
91 __free_pages 2.8438
125 system_call 2.2321
396 __wake_up 2.0625
56 deactivate_page 1.7500
103 fget 1.6094
66 init_buffer_head 1.0312
696 vgacon_do_font_op 1.0116
297 end_buffer_io_async 0.9770
121 vgacon_invert_region 0.9453
1737 __make_request 0.9279
90 __run_task_queue 0.8036
But once we ate all of highmem for FS cache, the profile was like this. (NOTE:
I reset the profiler here to capture the true activity under memory shortage)
113753 default_idle 1777.3906
2429 zone_free_shortage 37.9531
73036 do_page_launder 32.3741
200 pci_get_interrupt_pin 1.5625
239 si_swapinfo 1.2448
55 blk_get_queue 0.6875
215 create_bounce 0.5599
107 __wake_up 0.5573
26 fget 0.4062
17 bdfind 0.3542
10 __free_pages 0.3125
17 system_call 0.3036
99 ext2_find_entry 0.2812
4 get_fast_time 0.2500
330 schedule 0.2083
51 isapnp_set_port 0.1771
51 prune_icache 0.1678
192204 total 0.1675
21 handle_IRQ_event 0.1458
16 cpu_idle 0.1429
It is a very simple test; have Oracle running, then FTP in some files to /fstest
(I didn't even try the Oracle connection tests this time).
I have seen no response at all to Arjan's suggestion to use elvtune.
In my experience, it's very useful and worth trying.
/sbin/elvtune -r 4096 -w 8192 /dev/sdd
(or whatever device you are using) will halve the read and write
latency numbers allowed. Please try it and report results.
Sorry 'bout that. Thought elvtune was just for interactive vs. speed
balancing. I made the change you suggested to ALL the disks on the system (all
the disks in the LVM). We also got rid of Oracle on this run.
We started downloading files to the box. It stayed at a nice level 6MB/sec
until the buffer cache filled. The buffer cache is currently at 7.8G out of 8G.
Then the FTP rates start plummeting, to a current 1MB/sec and dropping (we're
ftping into a 12-way stripe, all SCSI disks, which should easily be able to
accommodate 6MB/sec; 4 disks on a 2.2 kernel handle the same load). The system
can't catch up after this until we kill the ftp process.
Load average is 5. kswapd, kupdated, and kreclaimd are noticeably high in CPU
util, along with some random processes like ntp????? (the tar the FTP server is
running is also chewing some CPU as well).
readprofile gives this (reset after the FS cache filled up):
376618 default_idle 5884.6562
4565 zone_free_shortage 71.3281
134155 do_page_launder 59.4659
368 __generic_copy_to_user 5.7500
583 si_swapinfo 3.0365
34 system_call 0.6071
54 __generic_copy_from_user 0.5625
90 __wake_up 0.4688
179 create_bounce 0.4661
518426 total 0.4087
24 fget 0.3750
82 csum_partial 0.3534
28 blk_get_queue 0.3500
28 handle_IRQ_event 0.1944
260 schedule 0.1641
10 sock_wfree 0.1562
19 skb_release_data 0.1484
22 __find_get_page 0.1250
17 kfree 0.1181
13 cpu_idle 0.1161
The problem vanishes in 2.4.10pre4 with elvtune -r 4096 -w 8192 for all the
disks. The FTP rates stay constant once free memory is at 5MB of 8GB, and we
even fired up Oracle and made 300 instantaneous connections (not possible
previously). Too bad that 2.4.10pre4 falls over under any kind of real load...
kernel 2.4.7-2.19enterprise (reiserfs quota patches added)
kupdated appears to lock up ('D' wait in ps listing) under heavy I/O.
I'm running exim delivering to a reiserfs partition that is exported under NFS to two other machines.
This is not deployed yet so everything's usually quiet.
On the NFS server I sent an email to an alias that expands to 100 users.
I ran a delivery process and both the delivery process and kupdated went into 'D' wait. I waited several minutes. I tried running 'lilo' to switch
back to the old kernel. That went into 'D' wait also. I tried shutdown -r now. That went into 'D' wait. Finally I gave up and hit the power button.
Should I try 2.4.7-6 (or -10, or whatever Rawhide's on now)?
Created attachment 34185 [details]
The oops recorded before the crash
Created attachment 34186 [details]
Output from sysrq+P
Created attachment 34187 [details]
The whole collection of output from the 2.4.9-0.18 crash
Arjan gave us 2.4.9-0.18 to test, to determine if this has been fixed. It takes
about 3-4 days, but it does crash. Before the crash, it oopsed with the oops
attached above.
After the crash, I gathered sysrq outputs in the following files.
The .2 files were a second round I took, since it seemed to generate different
output.
Then upon taking the sysrqM snapshot, the box became unresponsive even to sysrq
actions. The portion of sysrqM I got is in the attached files.
After taking the sysrqM snap, I repeatedly received the messages contained in
the attachment until a reboot.
I tried to attach each individual file, but Bugzilla kept rejecting me. They
should all be included in the attached tar.gz file.
Created attachment 34188 [details]
vmstat, uptime, and /proc/meminfo prior to the oops
The other difference between the server that crashes and the one that runs
wonderfully is the swap size. The good box runs on 2GB of swap, while the one
that is crashing has 9GB. Don't know if that could even be a factor. Will try
to disable swap on the crashing box to see if it makes a difference.
My most recent bug reports from today have been placed into a different bug,
number 54700, since this bug no longer relates to the original bug.
This bug is fixed and is no longer a problem for us.
jbusch: does the 2.4.9-13 errata kernel fix this for you?