Red Hat Bugzilla – Bug 52340
2.4.7-2enterprise kernel crippled under heavy I/O
Last modified: 2007-04-18 12:36:13 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.3)
Description of problem:
machine: Compaq Proliant DL360 w/4GB mem, dual 36GB SCSI drives
RedHat 7.1 + errata updates, kernel-enterprise-2.4.7-2.i686.rpm from
Under heavy I/O (Apache and a custom C++ module which do lots of mmap and
munmap calls over large data sets - 7GB total), the machine slows to a
crawl. The problem persists even after live traffic to the machine ceases.
A top listing shows both CPUs at 100% system. Any commands (ps, uname,
whatever) take minutes to return results.
The same setup on RH 6.2 with 2.4.3-ac3 works fine. Please let me know
what information may be useful to debugging this problem (no oops yet), and
other kernels to try; I'm looking at 2.4.8-ac9 right now.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Load large data sets into memory
2. Query against those sets
3. Watch it suffocate after a few minutes
more on how to reproduce:
probably the best way to reproduce the environment would be to write a
C++ class that has a method on it that goes and maps a file, touches
all the memory in that file, and then returns. add another method
that takes a number and returns that same number. then run that code
through swig and write a mod_perl interface to call the first method
then call the second method in a loop (10 times should be good), then
storable::freeze an array of results and print it to stdout.
this ought to emulate the kinds of things we do there. you might
actually have that method take a number and return a string (literal)
instead of a number, just to exercise swig a little more.
Note that the file must be large; maybe 50% greater than physical RAM.
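A rough sketch of the two methods described above, for anyone who wants a quick stand-in before building the full reproduction. It is written in Python rather than the C++/swig/mod_perl stack from the report, so it only illustrates the map, touch-every-page, unmap pattern; the names touch_mapped_file and echo are made up for illustration, and the demo file is tiny where the real one should exceed physical RAM.

```python
import mmap
import os
import tempfile


def touch_mapped_file(path, page_size=mmap.PAGESIZE):
    """Map a file read-only, read one byte from every page, then unmap it.

    Returns the number of pages touched.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0  # an empty file cannot be mapped
    touched = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for offset in range(0, size, page_size):
                _ = m[offset]  # reading one byte faults the page in
                touched += 1
    # Leaving the inner with-block unmaps the file.
    return touched


def echo(n):
    """The trivial second method: return the argument unchanged."""
    return n


if __name__ == "__main__":
    # Small demo file only; the report calls for a file larger than RAM.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (4 * mmap.PAGESIZE))
    try:
        pages = touch_mapped_file(tmp.name)
        results = [echo(i) for i in range(10)]  # "10 times should be good"
        print(pages, results)
    finally:
        os.unlink(tmp.name)
```

Running several of these concurrently against a file about 50% larger than physical RAM is probably the closest approximation to the load described, though it does not exercise swig or mod_perl at all.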
I am seeing this situation as well. I see that the following processes take a
lot of cpu time:
We (Red Hat) should try to fix this before next release.
With 4GB of memory, you might compare with the SMP kernel; Compaq machines
with 4GB tend to use the enterprise kernel because they have memory holes
that create addresses over the 4GB mark that are therefore unaddressable
without the enterprise kernel, but there's enough overhead from the PAE
support (3-level page tables) that you might find it faster to lose the
128MB of memory (I think that's the normal hole size) and use the smp
Could you please "cat /proc/<pid>/maps | wc -l" for the processes doing
the mmap/munmap and post the results?
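If it is easier to collect this for many processes at once, the same count can be scripted. This is only a sketch: it assumes one mapping per line in /proc/<pid>/maps (which is what the wc -l above relies on), and the helper names count_mappings and maps_path_for are made up for illustration.

```python
import os


def maps_path_for(pid):
    """Build the /proc path that lists a process's memory mappings."""
    return "/proc/%d/maps" % pid


def count_mappings(maps_path):
    """Count lines in a maps-style file; each line is one mapping."""
    with open(maps_path, "r") as f:
        return sum(1 for _ in f)


if __name__ == "__main__":
    # Demo on the current process; substitute the PIDs of the
    # processes doing the mmap/munmap calls.
    path = maps_path_for(os.getpid())
    if os.path.exists(path):  # only present on systems with /proc
        print(os.getpid(), count_mappings(path))
```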
A vanilla 2.4.8-ac9 (w/o PAE enabled) seems to handle this load effectively.
We can provide a 10-second output of 'strace -tt'; would this help? The output
would be substantial, however.
I think I've solved half the mystery, and improved the other half a bit ;)
Are you using an aic7xxx scsi card for the disks?
The DL360 has the Compaq 'Integrated Smart Array Controller' which uses the
I have tested the latest 2.4.7-2.9enterprise kernel after fixing my scsi
errors. Works better, however the box still seems to get slow under high disk
loads. Running an iozone in one console will make spawning of new processes
extremely slow. The 'w' command will hang for about 2-3 minutes. Logins as
well. Already running processes (top, vmstat in my case) will continue to run
Box has 8GB of RAM.
With the 2.4.7-2.12enterprise kernel, the pauses are less random. It seems that
while the iozone process is writing, reads are delayed. A good example is
the 'w' command. I ran this prior to starting iozone. After iozone was
started, 'w' still ran well. However, if I tried executing 'who' or 'last',
there was a 30-60 second delay. After it did run, these commands executed
normally from then on. This will happen with any command that has not been
executed prior to the write phase.
This read/write thing involves tradeoff between latency and throughput.
In this case we favor throughput; favoring latency can cause thrashing
and not help latency as much as you might expect.
Of course, if something is in cache it doesn't hit the I/O queue, and
so the latency issue doesn't show up.
I think the "crippled" part is fixed by now, although we will continue
to work on improvements.
Note, you can tune the maximum time reads wait for writes with the
/sbin/elvtune tool. Making reads have a lower latency will improve interactive
performance at the cost of raw throughput (and thus benchmark numbers).
Also, if this is a Perc or similar device, it can easily have up to
100 megabytes of I/O "in flight" in the controller, and it then takes a while to
complete new read requests.
firstname.lastname@example.org: does the 2.4.7-6 kernel (available via up2date and probably
rawhide) fix your problem?
We tried the new kernel. It works fine until the filesystem cache fills up all
the memory. Here is a before and after 'readprofile'. We have Oracle running
while ftp'ing in files to the box. The before profile was taken while the files
being ftp'ed in were filling up the FS cache. The after profile was taken once
this cache had filled and the machine started behaving badly.
Here are forwarded test results from my Oracle DBA on the box. When he talks
about Oracle connection tests, that is just connecting as many Oracle clients
as possible to the server to eat RAM. Is this enough information for you?
I mounted /fstest (a large filesystem through LVM, with lvm compiled into the
kernel so it could get profiled; I tried with and without LVM, with no noticeable
difference in the results below) and ran the test.
While the system is eating up high mem (when the system was caching the files
being ftp'ed in), the profile looked like this:
291730 default_idle 4558.2812
1948 blk_get_queue 24.3500
2341 pci_get_interrupt_pin 18.2891
3887 isapnp_set_port 13.4965
3280 file_read_actor 12.8125
4149 bounce_end_io_read 11.2745
910 vgacon_build_attr 5.6875
928 si_swapinfo 4.8333
251 generic_unplug_device 3.9219
91 __free_pages 2.8438
125 system_call 2.2321
396 __wake_up 2.0625
56 deactivate_page 1.7500
103 fget 1.6094
66 init_buffer_head 1.0312
696 vgacon_do_font_op 1.0116
297 end_buffer_io_async 0.9770
121 vgacon_invert_region 0.9453
1737 __make_request 0.9279
90 __run_task_queue 0.8036
But once we ate all of highmem for FS cache, the profile was like this. (NOTE:
I reset the profiler here to capture the true activity under memory shortage)
113753 default_idle 1777.3906
2429 zone_free_shortage 37.9531
73036 do_page_launder 32.3741
200 pci_get_interrupt_pin 1.5625
239 si_swapinfo 1.2448
55 blk_get_queue 0.6875
215 create_bounce 0.5599
107 __wake_up 0.5573
26 fget 0.4062
17 bdfind 0.3542
10 __free_pages 0.3125
17 system_call 0.3036
99 ext2_find_entry 0.2812
4 get_fast_time 0.2500
330 schedule 0.2083
51 isapnp_set_port 0.1771
51 prune_icache 0.1678
192204 total 0.1675
21 handle_IRQ_event 0.1458
16 cpu_idle 0.1429
It is a very simple test; have Oracle running, then FTP in some files to /fstest
(I didn't even try the Oracle connection tests this time).
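To put a number on how much of the non-idle time the page-laundering path takes in the second profile, a small script along these lines may help. It is a sketch under assumptions: it treats each row as "ticks, symbol, normalized load", takes the "total" row as the sum of all ticks, and counts default_idle as idle time; the function names are invented here, and only a few rows from the profile above are embedded.

```python
def parse_readprofile(text):
    """Parse 'ticks symbol load' rows into a dict of symbol -> ticks."""
    ticks = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not a three-column row
        try:
            ticks[parts[1]] = int(parts[0])
        except ValueError:
            continue  # first column was not a tick count
    return ticks


def busy_share(ticks, symbol, idle="default_idle"):
    """Fraction of non-idle ticks attributed to one symbol."""
    busy = ticks["total"] - ticks.get(idle, 0)
    return ticks[symbol] / busy if busy > 0 else 0.0


# A few rows copied from the profile taken after highmem filled up.
AFTER = """\
113753 default_idle 1777.3906
2429 zone_free_shortage 37.9531
73036 do_page_launder 32.3741
192204 total 0.1675
"""

if __name__ == "__main__":
    t = parse_readprofile(AFTER)
    share = busy_share(t, "do_page_launder")
    print("do_page_launder share of non-idle ticks: %.3f" % share)
```

On these rows do_page_launder accounts for roughly 93% of the non-idle ticks, which matches the impression from the raw listing.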
I have seen no response at all to Arjan's suggestion to use elvtune.
In my experience, it's very useful and worth trying.
/sbin/elvtune -r 4096 -w 8192 /dev/sdd
(or whatever device you are using) will halve the read and write
latency numbers allowed. Please try it and report results.
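For a box with many disks, the same settings can be generated per device. The sketch below only builds and prints the command lines; it does not run them. The -r and -w values are the ones suggested above, while the helper name elvtune_commands and the second device name are assumptions for illustration.

```python
import shlex


def elvtune_commands(devices, read_latency=4096, write_latency=8192):
    """Build one elvtune argument list per block device."""
    return [
        ["/sbin/elvtune", "-r", str(read_latency),
         "-w", str(write_latency), dev]
        for dev in devices
    ]


if __name__ == "__main__":
    # Hypothetical device list; use the disks actually in the system.
    for cmd in elvtune_commands(["/dev/sdd", "/dev/sde"]):
        print(" ".join(shlex.quote(c) for c in cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be reviewed and then pasted into a root shell.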
Sorry about that. I thought elvtune was just for balancing interactivity vs.
speed. I made the change you suggested to ALL the disks on the system (all
the disks in the LVM). We also got rid of Oracle on this run.
We started downloading files to the box. It stayed at a steady 6MB/sec
until the buffer cache filled. The buffer cache is currently at 7.8G out of 8G.
Then the ftp rates start plummeting, to a current 1MB/sec and dropping (we're
ftping into a 12-way stripe, all scsi disks, which should easily be able to
accommodate 6MB/sec; 4 disks on a 2.2 kernel handle the same load). The system
can't catch up after this until we kill the ftp process.
Load average is 5. kswapd, kupdated, and kreclaimd are noticeably high in CPU
util, along with some random processes like ntp????? The tar the ftp server is
running is also chewing some cpu.
readprofile gives this (reset after the FS cache filled up):
376618 default_idle 5884.6562
4565 zone_free_shortage 71.3281
134155 do_page_launder 59.4659
368 __generic_copy_to_user 5.7500
583 si_swapinfo 3.0365
34 system_call 0.6071
54 __generic_copy_from_user 0.5625
90 __wake_up 0.4688
179 create_bounce 0.4661
518426 total 0.4087
24 fget 0.3750
82 csum_partial 0.3534
28 blk_get_queue 0.3500
28 handle_IRQ_event 0.1944
260 schedule 0.1641
10 sock_wfree 0.1562
19 skb_release_data 0.1484
22 __find_get_page 0.1250
17 kfree 0.1181
13 cpu_idle 0.1161
The problem vanishes in 2.4.10pre4 with elvtune -r 4096 -w 8192 for all the
disks. The FTP rates stay constant once freemem is at 5MB of 8GB, and we even
fired up oracle and made 300 instantaneous connections (not possible previously).
Too bad that 2.4.10pre4 falls over under any kind of real load....
kernel 2.4.7-2.19enterprise (reiserfs quota patches added)
kupdated appears to lock up ('D' wait in ps listing) under heavy I/O.
I'm running exim delivering to a reiserfs partition that is exported under NFS to two other machines.
This is not deployed yet so everything's usually quiet.
On the NFS server I sent an email to an alias that expands to 100 users.
I ran a delivery process and both the delivery process and kupdated went into 'D' wait. I waited several minutes. I tried running 'lilo' to switch
back to the old kernel. That went into 'D' wait also. I tried shutdown -r now. That went into 'D' wait. Finally I gave up and hit the power button.
Should I try 2.4.7-6 (or -10, or whatever Rawhide's on now)?
Created attachment 34185 [details]
The oops recorded before the crash
Created attachment 34186 [details]
Output from sysrq+P
Created attachment 34187 [details]
The whole collection of output from the 2.4.9-0.18 crash
Arjan gave us 2.4.9-0.18 to test to determine if this has been fixed. It takes
about 3-4 days, but it does crash. Before the crash, it oopsed with the oops
After the crash, I gathered sysrq outputs in the following files.
The .2 files were a second round I took since it seemed to generate different
Then upon taking the sysrqM snapshot, the box became useless even to sysrq
actions. The portion of sysrqM I got is in
After taking the sysrqM snap, I repeatedly received the messages contained in
until a reboot.
I tried to attach each individual file, but bugzilla kept rejecting me. They
should be all included in a tar.gz file attached.
Created attachment 34188 [details]
vmstat, uptime, and /proc/meminfo prior to the oops
The other difference between the server that crashes and the one that runs
wonderfully is the swap size. The good box runs on 2GB of swap, while the one
that is crashing has 9GB. Don't know if that could even be a factor. Will try
to disable swap on the crashing box to see if it makes a difference.
My most recent bug reports from today have been placed into a different bug,
number 54700, since this bug no longer relates to the original bug.
This bug is fixed and is no longer a problem for us.
jbusch: does the 2.4.9-13 errata kernel fix this for you?