Bug 52340 - 2.4.7-2enterprise kernel crippled under heavy I/O
Summary: 2.4.7-2enterprise kernel crippled under heavy I/O
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Public Beta
Classification: Retired
Component: kernel
Version: roswell
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brock Organ
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2001-08-22 22:23 UTC by Jeff Busch
Modified: 2007-04-18 16:36 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2001-12-06 17:11:26 UTC
Embargoed:


Attachments
The oops recorded before the crash (1.07 KB, text/plain)
2001-10-16 11:00 UTC, Need Real Name
Output from sysrq+P (652 bytes, text/plain)
2001-10-16 11:01 UTC, Need Real Name
The whole collection of output from the 2.4.9-0.18 crash (72.75 KB, application/tar)
2001-10-16 11:12 UTC, Need Real Name
vmstat, uptime, and /proc/meminfo prior to the oops (4.94 KB, text/plain)
2001-10-16 11:22 UTC, Need Real Name

Description Jeff Busch 2001-08-22 22:23:47 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.3)
Gecko/20010801

Description of problem:
machine:  Compaq Proliant DL360 w/4GB mem, dual 36GB SCSI drives
OS:
    Red Hat 7.1 + errata updates, kernel-enterprise-2.4.7-2.i686.rpm from
'Roswell 2'

Under heavy I/O (Apache plus a custom C++ module which makes lots of mmap and
munmap calls over large data sets - 7GB total), the machine slows to a
crawl. The problem persists even after live traffic to the machine ceases.
A top listing shows both CPUs at 100% system time. Any command (ps, uname,
whatever) takes minutes to return results.

The same setup on RH 6.2 with 2.4.3-ac3 works fine. Please let me know
what information may be useful for debugging this problem (no oops yet), and
what other kernels to try; I'm looking at 2.4.8-ac9 right now.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.  Load large data sets into memory
2.  Query against those sets
3.  Watch it suffocate after a few minutes
	

Additional info:

Comment 1 Jeff Busch 2001-08-23 04:53:54 UTC
More on how to reproduce:

Probably the best way to reproduce the environment would be to write a
C++ class with a method that maps a file, touches all the memory in that
file, and then returns. Add another method that takes a number and returns
that same number. Then run that code through SWIG and write a mod_perl
interface that calls the first method, then calls the second method in a
loop (10 times should be good), then Storable::freeze an array of results
and print it to stdout.

This ought to emulate the kinds of things we do there. You might actually
have that second method take a number and return a string (literal) instead
of a number, just to exercise SWIG a little more.

Note that the file must be large; maybe 50% greater than physical RAM.
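
A minimal sketch of the mapping class described in this comment (class and
method names, error handling, and the read-only mapping are illustrative
assumptions, not the reporter's actual module):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

class Mapper {
public:
    // Map the file, fault in every page, then unmap -- the mmap/munmap
    // pattern over data sets larger than RAM that triggers the slowdown.
    long touch_file(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        char *base = (char *) mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        if (base == MAP_FAILED) return -1;
        long sum = 0;
        long page = sysconf(_SC_PAGESIZE);
        for (off_t off = 0; off < st.st_size; off += page)
            sum += base[off];           // touch one byte per page
        munmap(base, st.st_size);
        return sum;
    }

    // Trivial second method for the SWIG/mod_perl call loop.
    int echo(int n) { return n; }
};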

Comment 2 Michael E Brown 2001-08-23 16:24:29 UTC
I am seeing this situation as well. The following processes take a lot of
CPU time:
 kreclaimd
 bdflush
 kupdated


Comment 3 Glen Foster 2001-08-24 15:54:20 UTC
We (Red Hat) should try to fix this before the next release.

Comment 4 Michael K. Johnson 2001-08-26 14:48:00 UTC
With 4GB of memory, you might compare with the SMP kernel; Compaq machines
with 4GB tend to use the enterprise kernel because they have memory holes
that create addresses over the 4GB mark, which are therefore unaddressable
without the enterprise kernel. But there's enough overhead from the PAE
support (3-level page tables) that you might find it faster to lose the
128MB of memory (I think that's the normal hole size) and use the SMP
kernel instead.

Could you please "cat /proc/<pid>/maps | wc -l" for the processes doing
the mmap/munmap and post the results?
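
For example, something like the following (assuming the mmap'ing processes
are Apache children; adjust the process name as needed):

  for pid in $(pidof httpd); do wc -l /proc/$pid/maps; done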

Comment 5 Jeff Busch 2001-08-27 04:16:32 UTC
A vanilla 2.4.8-ac9 (w/o PAE enabled) seems to handle this load effectively.

We can provide a 10-second output of 'strace -tt'; would this help?  The output
would be substantial, however.

Comment 6 Arjan van de Ven 2001-08-27 08:15:40 UTC
I think I've solved half the mystery, and improved the other half a bit ;)
Are you using an aic7xxx SCSI card for the disks?

Comment 7 Jeff Busch 2001-08-27 14:27:39 UTC
The DL360 has the Compaq 'Integrated Smart Array Controller' which uses the
'cpqarray' driver.

Comment 8 Need Real Name 2001-08-27 16:06:27 UTC
I have tested the latest 2.4.7-2.9enterprise kernel after fixing my SCSI
errors. It works better; however, the box still seems to get slow under high
disk loads. Running an iozone in one console makes spawning new processes
extremely laggy: the 'w' command will hang for about 2-3 minutes, and logins
as well. Already-running processes (top and vmstat in my case) continue to
run fine, though.

Box has 8GB of RAM.

Comment 9 Need Real Name 2001-08-27 18:20:39 UTC
With the 2.4.7-2.12enterprise kernel, the pauses are less random. It seems
that while the iozone process is writing, reads are delayed. A good example
is the 'w' command. I ran it prior to starting iozone; after iozone was
started, 'w' still ran well. However, if I tried executing 'who' or 'last',
there was a 30-60 second delay. Once such a command did run, it executed
normally from then on. This happens with any command that was not executed
prior to the write phase.

Comment 10 Michael K. Johnson 2001-08-30 20:49:40 UTC
This read/write behavior involves a tradeoff between latency and throughput.
In this case we favor throughput; favoring latency can cause thrashing
and not help latency as much as you might expect.

Of course, if something is in cache it doesn't hit the I/O queue, and
so the latency issue doesn't show up.

I think the "crippled" part is fixed by now, although we will continue
to work on improvements.

Comment 11 Arjan van de Ven 2001-08-30 20:53:43 UTC
Note: you can tune the maximum time reads wait for writes with the
/sbin/elvtune tool. Giving reads a lower latency will improve interactive
performance at the cost of raw throughput (and thus benchmark numbers).

Also, if this is a PERC or similar device: such a controller can easily have
up to 100 megabytes of I/O "in flight", and it then takes a while to complete
new read requests.
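
For reference, invoking elvtune with only the device argument reports the
current latency settings, so you can record the defaults before experimenting
(device name is illustrative):

  /sbin/elvtune /dev/sda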

Comment 12 Arjan van de Ven 2001-09-03 17:44:46 UTC
jbusch: does the 2.4.7-6 kernel (available via up2date and probably
rawhide) fix your problem?

Comment 13 Need Real Name 2001-09-03 19:03:28 UTC
We tried the new kernel. It works fine until the filesystem cache fills up all
the memory. Here are before and after 'readprofile' outputs. We have Oracle
running while ftp'ing files to the box. The "before" profile is from while the
incoming files are filling up the FS cache; the "after" is from once this
cache fills and the machine starts behaving badly.

Here are forwarded test results from my Oracle DBA on the box. When he talks
about Oracle connection tests, that is just connecting as many Oracle clients
as possible to the server to eat RAM. Is this enough information for you?

FORWARDED MESSAGE

I mounted /fstest (a large filesystem through LVM, with LVM compiled into the
kernel so it could get profiled; I tried with and without LVM, with no
noticeable difference in the results below) and ran the test.

While the system was eating up high mem (i.e. while it was caching the files
being ftp'd in), the profile looked like this.
 
291730 default_idle                             4558.2812
  1948 blk_get_queue                             24.3500
  2341 pci_get_interrupt_pin                     18.2891
  3887 isapnp_set_port                           13.4965
  3280 file_read_actor                           12.8125
  4149 bounce_end_io_read                        11.2745
   910 vgacon_build_attr                          5.6875
   928 si_swapinfo                                4.8333
   251 generic_unplug_device                      3.9219
    91 __free_pages                               2.8438
   125 system_call                                2.2321
   396 __wake_up                                  2.0625
    56 deactivate_page                            1.7500
   103 fget                                       1.6094
    66 init_buffer_head                           1.0312
   696 vgacon_do_font_op                          1.0116
   297 end_buffer_io_async                        0.9770
   121 vgacon_invert_region                       0.9453
  1737 __make_request                             0.9279
    90 __run_task_queue                           0.8036
 
But once we ate all of highmem for the FS cache, the profile looked like this.
(NOTE: I reset the profiler here to capture the true activity under memory
shortage.)
 
113753 default_idle                             1777.3906
  2429 zone_free_shortage                        37.9531
 73036 do_page_launder                           32.3741
   200 pci_get_interrupt_pin                      1.5625
   239 si_swapinfo                                1.2448
    55 blk_get_queue                              0.6875
   215 create_bounce                              0.5599
   107 __wake_up                                  0.5573
    26 fget                                       0.4062
    17 bdfind                                     0.3542
    10 __free_pages                               0.3125
    17 system_call                                0.3036
    99 ext2_find_entry                            0.2812
     4 get_fast_time                              0.2500
   330 schedule                                   0.2083
    51 isapnp_set_port                            0.1771
    51 prune_icache                               0.1678
192204 total                                      0.1675
    21 handle_IRQ_event                           0.1458
    16 cpu_idle                                   0.1429
 
It is a very simple test: have Oracle running, then FTP some files to /fstest
(I didn't even try the Oracle connection tests this time).
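
For anyone repeating the test: the profiles above were presumably gathered
roughly like this (the kernel must be booted with a profile=N argument for
readprofile to work; the System.map path is illustrative):

  readprofile -r                      # reset the profiling counters
  # ... run the ftp load until the FS cache fills ...
  readprofile -m /boot/System.map | sort -nr -k3 | head -20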

Comment 14 Michael K. Johnson 2001-09-03 22:35:19 UTC
I have seen no response at all to Arjan's suggestion to use elvtune.
In my experience, it's very useful and worth trying.
/sbin/elvtune -r 4096 -w 8192 /dev/sdd
(or whatever device you are using) will halve the allowed read and write
latency numbers. Please try it and report the results.

Comment 15 Need Real Name 2001-09-04 05:04:43 UTC
Sorry about that; I thought elvtune was just for balancing interactivity
against raw speed. I made the change you suggested on ALL the disks in the
system (all the disks in the LVM). We also got rid of Oracle for this run.

We started downloading files to the box. It stayed at a nice level 6MB/sec
until the buffer cache filled; the buffer cache is currently at 7.8G out of
8G. Then the ftp rates start plummeting, to a current 1MB/sec and dropping
(we're ftp'ing into a 12-way stripe, all SCSI disks, which should easily
accommodate 6MB/sec; 4 disks on a 2.2 kernel handle the same load). The
system can't catch up after this until we kill the ftp process.

Load average is 5. kswapd, kupdated, and kreclaimd are noticeably high in CPU
utilization, along with some random processes like ntp?????. (The tar process
the ftp server is running is also chewing some CPU.)

readprofile gives this (reset after the FS cache filled up):

376618 default_idle                             5884.6562
  4565 zone_free_shortage                        71.3281
134155 do_page_launder                           59.4659
   368 __generic_copy_to_user                     5.7500
   583 si_swapinfo                                3.0365
    34 system_call                                0.6071
    54 __generic_copy_from_user                   0.5625
    90 __wake_up                                  0.4688
   179 create_bounce                              0.4661
518426 total                                      0.4087
    24 fget                                       0.3750
    82 csum_partial                               0.3534
    28 blk_get_queue                              0.3500
    28 handle_IRQ_event                           0.1944
   260 schedule                                   0.1641
    10 sock_wfree                                 0.1562
    19 skb_release_data                           0.1484
    22 __find_get_page                            0.1250
    17 kfree                                      0.1181
    13 cpu_idle                                   0.1161

Comment 16 Need Real Name 2001-09-06 14:36:39 UTC
The problem vanishes in 2.4.10pre4 with elvtune -r 4096 -w 8192 for all the
disks. The FTP rates stay constant once free memory is down to 5MB of the
8GB, and we even fired up Oracle and made 300 instantaneous connections (not
possible previously).



Comment 17 Arjan van de Ven 2001-09-06 14:39:27 UTC
Too bad that 2.4.10pre4 falls over under any kind of real load....


Comment 18 John Dalbec 2001-09-26 15:51:38 UTC
Possibly related:
kernel 2.4.7-2.19enterprise (reiserfs quota patches added)
kupdated appears to lock up ('D' wait in a ps listing) under heavy I/O.
I'm running exim delivering to a reiserfs partition that is exported via NFS
to two other machines. This is not deployed yet, so everything's usually
quiet. On the NFS server I sent an email to an alias that expands to 100
users. I ran a delivery process, and both the delivery process and kupdated
went into 'D' wait. I waited several minutes. I tried running 'lilo' to
switch back to the old kernel; that went into 'D' wait also. I tried
'shutdown -r now'; that went into 'D' wait too. Finally I gave up and hit the
power button.
Should I try 2.4.7-6 (or -10, or whatever Rawhide is on now)?


Comment 19 Need Real Name 2001-10-16 11:00:08 UTC
Created attachment 34185 [details]
The oops recorded before the crash

Comment 20 Need Real Name 2001-10-16 11:01:56 UTC
Created attachment 34186 [details]
Output from sysrq+P

Comment 21 Need Real Name 2001-10-16 11:12:20 UTC
Created attachment 34187 [details]
The whole collection of output from the 2.4.9-0.18 crash

Comment 22 Need Real Name 2001-10-16 11:13:10 UTC
Arjan gave us 2.4.9-0.18 to test to determine whether this has been fixed. It
takes about 3-4 days, but it does crash. Before the crash, it oopsed with the
oops recorded in preOops.
After the crash, I gathered sysrq outputs in the following files:
sysrqP.1, sysrqT.1.gz
sysrqP.2, sysrqT.2.gz
The .2 files were a second round I took, since it seemed to generate
different information.

Then, upon taking the sysrqM snapshot, the box became unresponsive even to
sysrq actions. The portion of sysrqM I did get is in sysrqM.
After taking the sysrqM snap, I repeatedly received the messages contained in
sysrqCrash until a reboot.

I tried to attach each individual file, but Bugzilla kept rejecting me. They
should all be included in the attached tar.gz file.
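
For anyone gathering the same data: the sysrq dumps above are produced from
the console with Alt+SysRq+P (registers), Alt+SysRq+T (task list), and
Alt+SysRq+M (memory info), after enabling magic SysRq:

  echo 1 > /proc/sys/kernel/sysrq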


Comment 23 Need Real Name 2001-10-16 11:22:07 UTC
Created attachment 34188 [details]
vmstat, uptime, and /proc/meminfo prior to the oops

Comment 24 Need Real Name 2001-10-16 15:52:38 UTC
The other difference between the server that crashes and the one that runs
wonderfully is the swap size. The good box runs with 2GB of swap, while the
one that is crashing has 9GB. I don't know if that could even be a factor. I
will try disabling swap on the crashing box to see if it makes a difference.

Comment 25 Need Real Name 2001-10-16 16:43:10 UTC
My most recent reports from today have been placed into a different bug,
number 54700, since those symptoms no longer relate to the original bug here.

Comment 26 Need Real Name 2001-10-25 15:27:34 UTC
This bug is fixed and is no longer a problem for us.

Comment 27 Michael K. Johnson 2001-12-06 17:11:21 UTC
jbusch: does the 2.4.9-13 errata kernel fix this for you?

