Bug 121434 - Extremely high iowait with 3Ware array and moderate disk activity
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: high, Severity: high
Assigned To: Tom Coughlan
http://www.webhostingtalk.com/archive...
RHEL3U7NAK
Duplicates: 130357
Depends On:
Blocks:
Reported: 2004-04-21 11:28 EDT by Aleksander Adamowski
Modified: 2007-11-30 17:07 EST (History)
CC List: 60 users

Doc Type: Bug Fix
Last Closed: 2006-08-02 16:55:19 EDT

Attachments
dmesg file from the affected machine (15.84 KB, text/plain)
2004-04-21 11:30 EDT, Aleksander Adamowski
Detailed description of an array 3Ware 8506-4 (15.17 KB, text/html)
2004-06-09 10:57 EDT, Aleksander Adamowski
Detailed description of an array at a 3Ware 8506-8 controller (20.68 KB, text/html)
2004-06-09 11:15 EDT, Aleksander Adamowski
scsi dump on idle system (31.94 KB, text/plain)
2004-06-14 09:21 EDT, Aleksander Adamowski
scsi dump on system with high iowait values (78.09 KB, text/plain)
2004-06-14 09:27 EDT, Aleksander Adamowski
output from "(date; ps axfm; date)" (10.34 KB, text/plain)
2004-06-15 07:21 EDT, Aleksander Adamowski
scsi dump initiated at 2004-06-15 12:37:14 (62.38 KB, text/plain)
2004-06-15 07:30 EDT, Aleksander Adamowski
tiobench results, first instance (3.43 KB, text/plain)
2004-06-15 07:31 EDT, Aleksander Adamowski
tiobench results, second instance (3.42 KB, text/plain)
2004-06-15 07:34 EDT, Aleksander Adamowski
tiobench results, third instance (64 threads) (2.12 KB, text/plain)
2004-06-15 07:34 EDT, Aleksander Adamowski
SCSI dump made today under non-artificial high iowait condition (37.04 KB, text/plain)
2004-06-17 06:19 EDT, Aleksander Adamowski
Magic SysRq dump (alt-sysrq-t) (97.85 KB, text/plain)
2004-06-22 05:20 EDT, Aleksander Adamowski
readprofile kernel profiling data during high iowaits state (16.42 KB, text/plain)
2004-07-16 10:58 EDT, Aleksander Adamowski
readprofile kernel profiling data during normal state (131 bytes, text/plain)
2004-07-16 11:00 EDT, Aleksander Adamowski
readprofile kernel profiling data during high iowaits state, after resetting profiling data (5.62 KB, text/plain)
2004-07-16 11:42 EDT, Aleksander Adamowski
Patch to prevent wakeup_kswapd() from blocking when it shouldn't. (898 bytes, patch)
2004-09-01 12:04 EDT, Tom Coughlan
This patch was all wrong (removed) (4.43 KB, patch)
2004-09-06 01:18 EDT, Pasi Pirhonen
reduce swapping during excessive pagecache use (1.98 KB, patch)
2004-09-24 18:41 EDT, Tom Coughlan
top output (1.04 KB, text/plain)
2004-10-21 11:38 EDT, HR
output logs (62.53 KB, text/plain)
2005-03-07 02:42 EST, Pasi Sjöholm
3Ware RAID1 versus RAID5 (155.83 KB, application/octet-stream)
2005-03-11 11:40 EST, Joseph Salisbury
RAW data for RAID1 versus RAID5 comparison (8.38 KB, text/plain)
2005-03-11 11:43 EST, Joseph Salisbury
3Ware RAID versus Software RAID (157.12 KB, application/pdf)
2005-03-15 10:26 EST, Joseph Salisbury
rstatd capture showing interrupt/context stalling (11.60 KB, image/gif)
2005-04-08 20:55 EDT, Ian Davis
Activating write cache seems to help a lot (826.49 KB, application/pdf)
2005-11-22 09:16 EST, David Denis

Description Red Hat Bugzilla 2004-04-21 11:28:54 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040115

Description of problem:
The machine is a dual PIV Xeon 2.4 GHz server with a 3Ware 8506-8
Serial ATA controller which holds a RAID5 array of eight 120 GB Seagate disks.

During relatively moderate disk activity (a TAR backup over SSH, some
Tivoli Storage Manager backups) the system starts spending almost all
CPU time in iowait. The system is extremely unresponsive: I
sometimes wait almost a minute for an SSH login, and over 10 seconds
for simple filesystem operations such as "ls" or copying a single
small 1 KB file.

There was a thread on WebHostingTalk where people claim this occurs
only on Redhat Enterprise 3:

http://www.webhostingtalk.com/archive/thread/229306-2.html
http://www.webhostingtalk.com/archive/thread/243144-1.html

My experience would support this observation, as I have no such
problem on a Fedora Core 1 box with 3Ware 8506-4 controller, which is
under an order of magnitude higher I/O load. 

Example output from "top":
 17:28:07  up 7 days,  2:55,  2 users,  load average: 6.57, 6.40, 5.62
56 processes: 55 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0.0%    0.0%    0.2%   0.0%     0.0%   99.5%    0.0%
           cpu00    0.0%    0.0%    0.0%   0.0%     0.0%  100.0%    0.0%
           cpu01    0.1%    0.0%    0.5%   0.1%     0.0%   99.0%    0.0%
Mem:  2061612k av, 2039624k used,   21988k free,       0k shrd,  677324k buff
                    775128k actv, 1129528k in_d,   28576k in_c
Swap: 2088408k av,       0k used, 2088408k free                 1252528k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
  612 root      24   0 27696  27M  4784 S     0.1  1.3 127:37   0 dsmserv
16509 root      15   0  1164 1164   900 R     0.1  0.0   0:00   1 top
    1 root      15   0   492  492   440 S     0.0  0.0   1:48   1 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
    4 root      15   0     0    0     0 SW    0.0  0.0   1:37   1 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
    9 root      15   0     0    0     0 DW    0.0  0.0   1:08   0 bdflush
    7 root      15   0     0    0     0 SW    0.0  0.0  27:45   0 kswapd
    8 root      15   0     0    0     0 SW    0.0  0.0   0:04   1 kscand
   10 root      15   0     0    0     0 DW    0.0  0.0   3:58   1 kupdated
   11 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd
   17 root      25   0     0    0     0 SW    0.0  0.0   0:00   1 scsi_eh_0
   20 root      15   0     0    0     0 SW    0.0  0.0   0:08   0 kjournald
  122 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 kjournald
  123 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 kjournald
  124 root      15   0     0    0     0 DW    0.0  0.0  14:06   0 kjournald
  125 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kjournald
  126 root      15   0     0    0     0 SW    0.0  0.0   0:02   1 kjournald
  127 root      15   0     0    0     0 DW    0.0  0.0   0:33   0 kjournald
  439 root      15   0   616  616   512 S     0.0  0.0   0:00   0 syslogd
  443 root      24   0   452  452   400 S     0.0  0.0   0:00   1 klogd
  453 root      15   0   444  444   384 S     0.0  0.0   1:08   0 irqbalance
  502 root      15   0  1512 1512  1268 S     0.0  0.0   0:00   1 sshd
  516 root      25   0   824  824   708 S     0.0  0.0   0:00   1 xinetd
  531 ntp       15   0  2572 2572  2204 S     0.0  0.1   0:16   1 ntpd
  584 root      15   0  1520 1520  1204 S     0.0  0.0   0:12   1 master

Version-Release number of selected component (if applicable):
kernel-2.4.21-9.0.1.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Launch a moderately I/O intensive task, like doing a
bzip2-compressed TAR backup of another machine tunneled over SSH
(that's only a couple hundred kb/s)


Actual Results:  Over time, the CPUs spend more and more time in iowait
until they spend 99.9% of their time in iowait.


Expected Results:  iowait shouldn't be so high and the system should be
responsive under such load. Most far less powerful workstation machines
easily handle such I/O loads.

Additional info:
Comment 1 Red Hat Bugzilla 2004-04-21 11:30:41 EDT
Created attachment 99604 [details]
dmesg file from the affected machine
Comment 2 Red Hat Bugzilla 2004-04-21 12:06:39 EDT
When the task that generates I/O traffic finishes, the kernel remains
in the invalid state, spending almost all CPU time in iowait. I'm just
observing this - the machine has been idle for over half an hour, but
the iowaits for both CPUs are at 98-99%.

When I run the "sync" command to flush dirty buffers, it seems to
never finish.
Comment 3 Red Hat Bugzilla 2004-04-21 12:10:06 EDT
Upgrading the 3Ware controller's kernel module (3w-xxxx) to the latest
version (v1.02.00.037) and the controller's firmware doesn't help a bit.
Also, a similar controller with exactly the same firmware and driver
works fine on Fedora Core 1.
Comment 5 Red Hat Bugzilla 2004-04-21 14:33:52 EDT
It could be an IO elevator problem.  Tom, could you please help
Aleksander try some elvtune settings to narrow down the problem?
Comment 6 Red Hat Bugzilla 2004-04-21 19:15:06 EDT
Adding Doug to cc: list on the chance that this is related to the
SCSI affine queue patch, which was disabled in RHEL3 U2.  -ernie
Comment 7 Red Hat Bugzilla 2004-04-22 06:58:47 EDT
# elvtune /dev/sda

/dev/sda elevator ID            1
        read_latency:           512
        write_latency:          16384
        max_bomb_segments:      4


I'll try aggressively lowering read latency and moderately lowering
write latency.

BTW, on the Fedora 1 box (where the problem hasn't been observed) I've
played with many different elevator settings and never seen high iowaits.
Comment 8 Red Hat Bugzilla 2004-04-22 07:02:58 EDT
After some hours the iowaits drop to 0% (until I launch another task
which generates I/O traffic).
Comment 9 Red Hat Bugzilla 2004-04-22 08:21:33 EDT
I've tuned down the elevators and the bug doesn't seem to exhibit
itself anymore:

# elvtune /dev/sda 

/dev/sda elevator ID            1
        read_latency:           32
        write_latency:          8192
        max_bomb_segments:      4
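
The elvtune adjustments tried in this thread can be scripted so that the same values are applied consistently across test runs. The following is a minimal sketch in Python, not a recommended tuning: it assumes elvtune lives at /sbin/elvtune and uses only the -r, -w and -b options that appear in this report. The helper names elvtune_command and apply_elvtune are hypothetical.

```python
import subprocess


def elvtune_command(device, read_latency, write_latency,
                    max_bomb_segments=None, elvtune="/sbin/elvtune"):
    """Build the elvtune argument list for one block device.

    Only the -r, -w and -b options seen in this bug report are used.
    The default binary path is an assumption.
    """
    if read_latency < 0 or write_latency < 0:
        raise ValueError("latencies must be non-negative")
    cmd = [elvtune, "-r", str(read_latency), "-w", str(write_latency)]
    if max_bomb_segments is not None:
        cmd += ["-b", str(max_bomb_segments)]
    cmd.append(device)
    return cmd


def apply_elvtune(device, read_latency, write_latency, max_bomb_segments=None):
    """Run elvtune (needs root on a 2.4 kernel); raises on a non-zero exit."""
    subprocess.run(
        elvtune_command(device, read_latency, write_latency, max_bomb_segments),
        check=True,
    )


if __name__ == "__main__":
    # Print, rather than run, the setting reported in comment 9.
    print(" ".join(elvtune_command("/dev/sda", 32, 8192, 4)))
```

Building the argument list separately from running it makes it possible to log or review the exact command before touching a production machine.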
Comment 10 Red Hat Bugzilla 2004-04-22 08:40:58 EDT
Tom, Stephen, would it be worth changing the default elevator settings
a bit for U3 so the worst latency problems are fixed ?
Comment 12 Red Hat Bugzilla 2004-04-28 10:53:27 EDT
I have the same problems on 2 servers with this controller. Installing
kernel-smp-unsupported-2.4.21-4.EL.i686 and using /sbin/elvtune -r
512 -w 16384 -b 4 /dev/sda decreases iowait from 90% to avg 40%. So it
looks better with these settings.
Comment 13 Red Hat Bugzilla 2004-05-01 18:33:20 EDT
It's giving me high iowaits again when a backup is in progress;

My settings:
-r 32 -w 8192

Tuning down to -r 32 -w 4096 or even -r 32 -w 2048 doesn't help much
(at least not immediately - I'm doing this right at this moment).
Comment 14 Red Hat Bugzilla 2004-05-01 18:34:52 EDT
I'm having incredible latencies, although I've reduced elvtune
settings to -r 32 -w 2048. I have been waiting for my login shell to
start, then for execution of "ls" in my homedir for over a minute!
Comment 15 Red Hat Bugzilla 2004-05-01 19:04:42 EDT
The 3ware driver does queuing internally. Tom, it might be useful to
reduce the amount of queuing the 3ware driver does, since its TCQ
depth really seems unreasonably deep...
Comment 16 Red Hat Bugzilla 2004-05-01 19:13:31 EDT
At -r 32 -w 2048 iowaits gradually drop to about 50%, which still
doesn't seem normal. Latencies are visibly high: I wait several seconds
for simple filesystem operations.

There's only one I/O intensive job, and it's a backup over the network
(so the disk I/O isn't that high, since backup goes over encrypted SSL
connection on 100 megabit ethernet, giving about 600 kb/second of data
transfer).
Comment 17 Red Hat Bugzilla 2004-05-05 18:03:26 EDT
We have experienced the same problem with the 3ware 8505-8 Serial ATA 
card and RHEL3.

Watching the system performance with top when transferring any large 
files (500Mb+) or running Oracle 10g causes iowait to hover above 95%.

Changing the elevator to /sbin/elvtune -r 32 -w 8192 -b 4 /dev/sda
has taken the iowait averages from 95%-99% to the high 80%'s.

I discussed this issue (at length) with RedHat, Oracle, and 3ware.  
The general consensus is that it is a problem with the RedHat 
EL kernel (aka 2.4.21-9.0.1.ELsmp).  Other versions may also 
demonstrate the problem but this is the one that we used for this 
test.

We installed 2.4.26 from www.kernel.org and the problem vanished.

Example top with 2.4.26 smp kernel while Oracle 10g is importing a 
10gb file.

 14:51:44  up  2:24,  4 users,  load average: 2.05, 1.39, 1.67
122 processes: 119 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   55.3%    0.0%    6.5%   0.0%     0.0%    0.0%   38.1%
           cpu00   51.6%    0.0%    9.8%   0.0%     0.0%    0.0%   38.5%
           cpu01   59.0%    0.0%    3.2%   0.0%     0.0%    0.0%   37.7%
Mem:   903940k av,  897448k used,    6492k free,       0k shrd,    1356k buff
       462568k active,             385516k inactive
Swap: 8185108k av,  150428k used, 8034680k free                  809988k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 4344 oracle    17   0  6716 6716  4444 R    29.9  0.7   1:59   1 imp
 4346 oracle    15   0 94308  90M 89976 R    24.9 10.3   9:30   1 oracle
11007 oracle    15   0  1184 1184   888 R     2.4  0.1   0:00   0 top
 2827 oracle     9   0 42652  18M 17872 D     1.6  2.0   1:00   1 oracle
 2825 oracle     9   0 41888  38M 38852 D     0.4  4.3   0:51   1 oracle
    1 root       8   0   472  444   424 S     0.0  0.0   0:09   0 init
    2 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 keventd
    3 root      19  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd_CPU
    4 root      18  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd_CPU
    5 root       9   0     0    0     0 SW    0.0  0.0   0:52   0 kswapd
    6 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush
    7 root       9   0     0    0     0 SW    0.0  0.0   0:04   1 kupdated
    9 root       9   0     0    0     0 SW    0.0  0.0   0:00   1 scsi_eh_0
   10 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 khubd
   15 root       9   0     0    0     0 SW    0.0  0.0   0:09   1 kjournald
  537 root       9   0     0    0     0 SW    0.0  0.0   0:00   1 kjournald
Comment 18 Red Hat Bugzilla 2004-05-06 16:39:47 EDT
I have not been able to recreate this problem with either
2.4.21-9.0.1.ELsmp or 2.4.21-15.ELsmp, although the 3Ware controller
I'm using is parallel IDE with 4 disks.  I tried running multiple cp's
of 4 GB files from drive to drive.  I could get I/O wait to go up to
70% but no higher, and the system stayed responsive.  I also
tried tarring files to and from the system over ssh with almost no I/O
wait at all (maybe 10% once in a while).
Comment 19 Red Hat Bugzilla 2004-05-06 19:43:15 EDT
Hmmm....a few comments.  First, I sometimes forget that the fix we did
for bz #104633 isn't in the 9.0.?.EL kernels.  However, the 12.EL and
later kernel in the RHEL3 Beta channel has the fix and it very well
may solve this problem (although elv tuning might make things better
on the 12.EL and later kernels also).

As to the readings you might get with a stock 2.4.26 kernel: iowait
patches aren't in the upstream kernel.org 2.4 kernel, so it will
always read 0 if I remember correctly.  I/O wait time used to show up
as idle time and would definitely leave you thinking totally
differently about the performance.  The only way to truly compare
performance between the two is to time the total time taken
to perform the same operation and then check elapsed time vs. user &
sys time between the two environments.
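
The comparison described above (elapsed time versus user and sys time for the same operation) could be sketched as follows. This is only an illustration in Python, assuming the workload can be started as a single command; the function names time_command and waiting_time are hypothetical, and the result says nothing about where the waiting time is spent.

```python
import os
import subprocess
import time


def time_command(argv):
    """Run argv and return (elapsed, user, system) in seconds.

    elapsed is wall-clock time; user and system are the CPU seconds
    charged to the child process, taken from os.times() deltas.
    """
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, check=True)
    elapsed = time.monotonic() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return elapsed, user, system


def waiting_time(elapsed, user, system):
    """Wall-clock time not accounted for by CPU work (I/O wait, sleeping)."""
    return max(0.0, elapsed - (user + system))


# Example use (the command is only a placeholder for the real workload):
#   elapsed, user, system = time_command(["tar", "cf", "/tmp/test.tar", "/etc"])
#   print(elapsed, user, system, waiting_time(elapsed, user, system))
```

Running the same command under each kernel and comparing the three numbers sidesteps the fact that the two kernels account for iowait differently in top.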

If the people reporting problems in this thread can confirm whether or
not the 12.EL kernel in the Beta channel solves your problems I would
appreciate it.
Comment 20 Red Hat Bugzilla 2004-05-07 15:21:13 EDT
I am having very similar problems with a 3ware 8506-4LP with 4 SATA 
WD1200 drives on a dual 3GHz Xeon system configured in Raid5 array.

I had (at Jesse Keating's (PogoLinux) suggestion) tried the noAffine2 
test Kernel which Doug had posted (which I presume is similar to the 
beta 12.EL).  This did (as for others) improve the write.c benchmark 
included in the original bug (#104633).   It did not however 
significantly improve my latency problems.

I then tried setting the 3ware driver (via its web interface) to 
favor background tasks.  This did help significantly, but the 
responsiveness was worse than a similar machine equipped with a single 
IDE drive, which achieved 2.5x faster write performance and comparable 
read performance.

I tried the beta 14.ELsmp kernel, which gave similar results 
(presumably as expected) to the noAffine2 test kernel (from a 
latency perspective).

I then tried a self configured (though not correctly as I don't see 
all my RAM) generic 2.4.26 kernel (from kernel.org source).

It appears the iowait statistics in the 2.4.26 kernel are always zero 
as Doug states.  Hence the CPU shows a much greater idle time.  
However the generic 2.4.26 kernel DRAMATICALLY improves the latency 
problems. 

With the 2.4.26 kernel I returned the 3ware configuration to its 
centerpoint between background / IO performance.  Performance is 
still very good.  

Conclusion: something other than the affine problem is going on 
in the Enterprise WS-smp 2.4.21-14.ELsmp kernel with this HW 
configuration.

Another interesting observation while running the bonnie++ 
benchmark: the profound drop in responsiveness with the 2.4.21 
kernels appears predominantly during the writing and rewriting phases 
and not the reading.  The benchmark results are similar for write 
with all the kernels.  Incidentally, there is a similar 10% improvement in 
read rate for the noaffine2, 12.EL and 2.4.26 kernels over the production 
2.4.21 kernel.

Another observation: the 2.4.26 kernel is using an older 3ware driver 
than the 14.ELsmp.

Looking forward to a solution in the Enterprise kernel.
Comment 21 Red Hat Bugzilla 2004-05-10 11:57:22 EDT
Tried kernel 2.4.22-1.2188.nptlsmp (Fedora Core 1) again at Jesse 
Keating's suggestion. This kernel also had greatly improved 
responsiveness over the production WS Enterprise kernel.

My apologies: in my previous posting I stated that 2.4.26 was using an 
older 3ware driver, but the opposite is true.  In fact, all the 
versions that work have a driver several revisions newer than that used 
in both the production and beta WS Enterprise releases.

2.4.22-1.2188.nptlsmp = 1.02.00.036
2.4.26 = 1.02.00.037
2.4.21-9.03 and 2.4.21 EL14 = 1.02.00.033

There are notes in rev 36 of this driver related to sleeping problems 
with the driver.  I am recompiling a production kernel with the 037 
driver and will post results.

regards

Keith Roberts
Comment 22 Red Hat Bugzilla 2004-05-10 18:41:58 EDT
Tried the 2.4.21-9.03 kernel with the rev 37 driver.  It didn't fix the 
problem.

This is beyond my debugging skills.

regards

Keith Roberts
Comment 23 Red Hat Bugzilla 2004-05-13 03:11:37 EDT
We are also seeing this issue of very high iowait times with our dual
3Ware 7506-8 system.  The slightest bit of writing will send the
iowait times through the roof.  The elvtune tuning above has no noticeable
effect.  We haven't tried the beta kernel yet as this is a production
machine.  Are there any signs of that kernel being particularly unstable?
Comment 24 Red Hat Bugzilla 2004-05-13 19:40:53 EDT
I am running the Fedora Core 1 kernel (2.4.22-1.2188.nptlsmp) to get around 
the problem (as mentioned above).  I have not seen any particular 
problems with this kernel but I would not say I am heavily stressing 
it.  The beta (EL14) kernel did not fix the problem for me.

Note, as stated, the drop in iowait is partially fictitious as the iowait 
statistics appear to be zero in this kernel.  However, the latency was 
greatly improved.
Comment 25 Red Hat Bugzilla 2004-05-14 12:47:07 EDT
Not really a surprise but I tried the new production core 2.4.21-15 
(smp) with no improvement.  

I will be sticking with the Fedora Core 1 kernel for now, which is working very 
well.
Comment 26 Red Hat Bugzilla 2004-05-17 15:12:40 EDT
We have installed 2.4.21-15.ELsmp on 3 different systems and are
continuing to see outrageously high iowait times (or if you prefer,
extremely slow read and write performance)

Our primary application is Oracle 10g but we are seeing problems when
using rsync to move large files (aka database backup files) from one
server to another.

We have RHEL 3.0 EL installed on:
  Dual Xeon 2.0ghz with hyperthread and 3ware 8506 SATA RAID
  Dual Xeon 2.0ghz with hyperthread, 3gb RAM, IDE (no RAID)
  P4 3.0ghz with hyperthread, 2gb RAM, SATA (no RAID)
  P4 3.0ghz with hyperthread, 2gb RAM, IDE (no RAID)
  P4 3.0ghz with hyperthread, 3gb RAM, SATA w/Intel RAID
  P4 3.0ghz with hyperthread, 3gb RAM, SATA w/Intel RAID
  P4 3.0ghz with hyperthread, 3gb RAM, 3ware 7506 IDE RAID

One P4 motherboard was from Intel, the other from ASUS.

All of these systems run MUCH SLOWER than the Redhat 7.2 w/Oracle 8i
that they were intended to replace.

We have tested all of these with and without all of the updates that
were made available from RedHat last week.

Since this problem became apparent on a dual Xeon system that had been
running flawlessly under RedHat 7.2, and we have seen the problem on two other
systems with a few different HD configurations, I am inclined to
believe that it is a problem with the RHEL 3.0 kernel.

We are in the process of installing Fedora Core 1 and SUSE 9.1 and will
post the results.

LaVar
Comment 27 Red Hat Bugzilla 2004-05-17 17:17:41 EDT
Every comment except the last one mentions the 3ware adapter, so we
have been working on the assumption that the problem is related to the
3ware driver's interaction with the kernel.  Has anyone who reported a
problem on 3ware also seen the problem on other storage configurations? 

LaVar (and others), if you have seen this on non-3ware configurations,
please post the dmesg output that shows those devices being configured.

Most of the comments above also indicate that the problem is system
unresponsiveness during I/O, not necessarily that the I/O throughput
is bad.  In fact, one comment (#20) says that the bonnie++ benchmark
results are similar for write with all the kernels.  

Is this true for everyone?  I'm trying to sort out whether we have two problem
reports here or one: a specific problem with system unresponsiveness
when 3ware I/O is happening, or a more general I/O throughput problem
on storage in general.

We have been trying to reproduce the 3ware/kernel problem, but so far
we have not succeeded. We obtained a SATA controller from 3ware, in
addition to the IDE one that Bill mentioned in #18. We are continuing
to investigate, and will try testing some larger configurations.



Comment 28 Red Hat Bugzilla 2004-05-17 18:59:21 EDT
Our bonnie++ numbers for our RAID5 array aren't blazing, but they're
not terrible either.  During this test iowait was pegged and all other
processes were starved. (Poor mysql!)  


I just ran bonnie++ tests on two of our systems, both running RHEL3,
one with a 3ware RAID, the other straight IDE.  Both systems are using
LVM.  On the 3ware system, IOWait was pretty much pegged at 100%.. or
97% or so throughout the whole test.  Load hovered around 8.  On the
IDE system, it bounced freely from 0 to 100 and everything in between.
 Load hovered around 4.

3ware RAID system:


Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x4.develooper.co 4G 17016  51 20257  14 11106   5 20232  51 35321  10  63.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1378  58 +++++ +++ +++++ +++  2156  89 +++++ +++   585  10
x4.dev,4G,17016,51,20257,14,11106,5,20232,51,35321,10,63.7,0,16,1378,58,+++++,+++,+++++,+++,2156,89,+++++,+++,585,10


IDE System:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x3.develooper.co 4G 27045  90 36402  39 18741  12 28103  77 49397  23 133.6   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2446  98 +++++ +++ +++++ +++  2570  99 +++++ +++  6347  97
x3.dev,4G,27045,90,36402,39,18741,12,28103,77,49397,23,133.6,0,16,2446,98,+++++,+++,+++++,+++,2570,99,+++++,+++,6347,97
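
The two CSV summary lines above can be compared field by field. The sketch below, in Python, is only an illustration: the column names in FIELDS are made up here to follow the order of the table headings, not taken from bonnie++ itself, and runs of "+" are treated as "not measured" on the assumption that bonnie++ prints them when a test finishes too quickly to time reliably.

```python
# Illustrative column names, in the order of the bonnie++ 1.03 table headings.
FIELDS = [
    "machine", "size",
    "out_char_kps", "out_char_cpu", "out_block_kps", "out_block_cpu",
    "rewrite_kps", "rewrite_cpu",
    "in_char_kps", "in_char_cpu", "in_block_kps", "in_block_cpu",
    "seeks_ps", "seeks_cpu", "files",
    "seq_create_ps", "seq_create_cpu", "seq_read_ps", "seq_read_cpu",
    "seq_delete_ps", "seq_delete_cpu",
    "rand_create_ps", "rand_create_cpu", "rand_read_ps", "rand_read_cpu",
    "rand_delete_ps", "rand_delete_cpu",
]

# The two summary lines exactly as posted in this comment.
RAID_CSV = ("x4.dev,4G,17016,51,20257,14,11106,5,20232,51,35321,10,63.7,0,"
            "16,1378,58,+++++,+++,+++++,+++,2156,89,+++++,+++,585,10")
IDE_CSV = ("x3.dev,4G,27045,90,36402,39,18741,12,28103,77,49397,23,133.6,0,"
           "16,2446,98,+++++,+++,+++++,+++,2570,99,+++++,+++,6347,97")


def parse_bonnie_csv(line):
    """Parse one CSV summary line into a dict; '+' runs become None."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    row = {}
    for name, raw in zip(FIELDS, values):
        if name in ("machine", "size"):
            row[name] = raw
        elif set(raw) == {"+"}:
            row[name] = None  # not measured
        else:
            row[name] = float(raw)
    return row


def relative_rates(system, baseline, keys):
    """Return system/baseline for each key where both values were measured."""
    ratios = {}
    for key in keys:
        a, b = system[key], baseline[key]
        if a is not None and b:
            ratios[key] = a / b
    return ratios


if __name__ == "__main__":
    raid = parse_bonnie_csv(RAID_CSV)
    ide = parse_bonnie_csv(IDE_CSV)
    keys = ["out_block_kps", "rewrite_kps", "in_block_kps",
            "seeks_ps", "rand_delete_ps"]
    for key, ratio in relative_rates(raid, ide, keys).items():
        print("%s: 3ware/IDE = %.2f" % (key, ratio))
```

On these two lines the 3ware system reaches a little over half of the IDE system's block write rate and under a tenth of its random delete rate, which is consistent with the write-heavy phases being the ones that hurt.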



Comment 29 Red Hat Bugzilla 2004-05-18 06:29:07 EDT
Aleksander,

        What is the wattage of the power supply used?
        What file system are you using?

Thanks
-Padmanaban
Comment 30 Red Hat Bugzilla 2004-05-18 09:43:13 EDT
All filesystems are EXT3.

The 650Watt power supply is provided by the Hudson-3 Server Chassis,
which has 2 hot-swappable power supplies.
Comment 31 Red Hat Bugzilla 2004-05-18 17:16:38 EDT
I believe I've finally recreated this in the lab. 

My config is 4 SATA drives in a raid 5 and OS Installed on top of LVM.
I'm running bonnie++ and trying to login at the console can take so
long that it times out. 

Is everyone who is seeing this running RAID 5 and LVM?  Or just one or
the other?

I'll try swapping in the Fedora kernel and see if the problem goes
away like everyone here has reported.
Comment 32 Red Hat Bugzilla 2004-05-19 04:03:14 EDT
The system where I've seen this is running on plain hardware RAID5. No
LVM involved. Hope it helps.
Comment 33 Red Hat Bugzilla 2004-05-20 15:15:07 EDT
Tom, on the SATA 8000 series controller both the -9 and -15 kernels show
horrible latency on interactive commands when the drive is configured
as RAID 5.  The fc1 2.4.22-1.2115 kernel runs fine with no noticeable
latency in doing interactive tasks.

Still can't reproduce on the 7000 series controller, which uses
parallel IDE hard drives.  Just to be clear, this is in an older dual
500 PIII.

Comment 34 Red Hat Bugzilla 2004-05-21 08:47:45 EDT
My config is 4 SATA HW Raid 5 but NO LVM
Comment 35 Red Hat Bugzilla 2004-05-21 09:13:09 EDT
If someone doesn't mind installing and compiling a kernel from a
bitkeeper source repo, then they could help in the debugging of this I
think.  I have a kernel repo at
bk://linux-scsi.bkbits.net/rhel3-scsi-test that is our complete
2.4.21-15.EL kernel plus a series of changes I've already made.  One
of the changes that I made is that 'echo "scsi dump 2" >
/proc/scsi/scsi' now works and will produce a *huge* dump that
includes the mid layer scsi host state, the scsi device state, the
complete list of outstanding and free scsi commands for each device,
and a list of the requests in the block layer queue for each device. 
That dump would allow us to see if there is something wrong in merging
of requests or something similar to that in this case.  If you are
already familiar with bitkeeper, then the process is pretty simple:

bk pull bk://linux-scsi.bkbits.net/rhel3-scsi-test /usr/src/scsi-test
cd /usr/src/scsi-test
cp configs/kernel-2.4.21-<arch and option>.config .config
make oldconfig
make dep
make modules modules_install install

Reboot into the test kernel, get the dump when the system is
experiencing extremely high iowait, and post the dump results here.  If
you have any problems with the kernel, please let me know about that
as well (for instance, if my fix for the dumping code doesn't work; it
used to oops and I think I have that fixed now, but it hasn't been
tested under extremely heavy I/O load).  There is also one other
patch already present in that repo that specifically addresses a
performance issue related to typical oracle type workloads by
increasing the merging of adjacent requests.  That could help in this
case by reducing the total number of requests sent to the 3Ware
controller in order to perform the same amount of work.

One other useful bit of data would be to find out if people are having
this problem with any specific chunk size in the RAID5 arrays.  For
example, it might be that this only happens when the RAID5 array is
created with a large chunk/stripe size and our requests are routinely
smaller than the size of a single chunk or something like that. 
That's one of the reasons I would like to see a scsi dump.  It will
show what size of command we are sending and comparing that to the
chunk size of the array might be enlightening.
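
To make the chunk-size comparison concrete: on RAID5 a write that does not cover a whole stripe generally forces the array to read old data or parity before it can update parity, so small requests against a large stripe are the expensive case. The sketch below, in Python, only does the arithmetic. The 64 KiB chunk size in the usage note is an assumed example value, not one taken from the reported arrays, and how the 3ware firmware actually handles partial stripes is not known here.

```python
def stripe_data_bytes(chunk_bytes, n_disks):
    """Data bytes per RAID5 stripe: one chunk on each disk except parity."""
    if n_disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be positive")
    return chunk_bytes * (n_disks - 1)


def classify_write(offset, length, chunk_bytes, n_disks):
    """Return (full_stripes, partial_stripes) touched by one write.

    offset and length are in bytes from the start of the array.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length > 0")
    stripe = stripe_data_bytes(chunk_bytes, n_disks)
    first = offset // stripe
    last = (offset + length - 1) // stripe
    full = 0
    partial = 0
    for index in range(first, last + 1):
        start = max(offset, index * stripe)
        end = min(offset + length, (index + 1) * stripe)
        if end - start == stripe:
            full += 1
        else:
            partial += 1
    return full, partial
```

For example, with eight disks and an assumed 64 KiB chunk, a stripe holds 448 KiB of data, so classify_write(0, 4096, 65536, 8) reports one partial stripe and no full ones. Feeding the request sizes from a scsi dump through such a function would show how often the array is handed partial-stripe writes.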
Comment 36 Red Hat Bugzilla 2004-05-21 10:57:14 EDT
Following up on Rik's suggestion, I tried reducing the queue depth of
the 3Ware driver:

can_queue from 254 to 30
cmd_per_lun from 254 to 4

An initial test did not show any dramatic improvement. It will take
more time to tell if there was any change at all. Instead of doing
this now, we will try Doug's debug kernel (with the stock 3ware
driver) to see if that points out a more specific culprit.
Comment 37 Red Hat Bugzilla 2004-05-21 12:05:01 EDT
As requested, I compiled and ran the test_scsi kernel.

This kernel does exhibit the problem.

I ran:
echo "scsi dump 2" > /proc/scsi/scsi

Assuming that /proc/scsi/scsi is where I should see the log output, 
all I get is:

Attached devices: 
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: 3ware    Model: Logical Disk 0   Rev: 1.0 
  Type:   Direct-Access                    ANSI SCSI revision: ffffffff

Am I looking in the right place?
Comment 38 Red Hat Bugzilla 2004-05-21 12:12:11 EDT
Look in /var/log/messages.
Comment 39 Red Hat Bugzilla 2004-05-21 12:29:05 EDT
Tom is correct.  They should be in /var/log/messages or the output of
dmesg.  This is because the dump is done as kernel log messages.  If
you want to actually see them when you run the command, then do dmesg
-n 8 before running the command and it will spam your terminal to death ;-)
Comment 40 Red Hat Bugzilla 2004-05-24 05:46:12 EDT
Wouldn't it put the system into an infinite loop if the log file is on
the same device that scsi commands are dumped for?

If so, one would have to temporarily forward those messages to another
syslog host.

Before I do this on my system, I'd like to know: what facility and
priority will those messages have?
Comment 41 Red Hat Bugzilla 2004-05-24 08:20:47 EDT
No.  The dump is done as kernel printk()s.  They show up in your
syslog messages.  And it dumps the commands while holding a lock on
the device it's dumping messages about, so it sees a snapshot in time,
not a running command list.

facility: kernel, level: info.

By the way, I do have one dump from one of these systems now.  The
dump looked fairly sane, but I do think that the 254 or so commands
they give per logical drive is insane.
Comment 42 Red Hat Bugzilla 2004-05-24 16:27:46 EDT
BTW, I'm seeing the same problem on a stock Dell 2450 using the 
internal RAID controller (PERC 3/Si).  Low levels of disk activity 
are causing very high iowait.  Here's some base information, if this 
helps:

uname -a 
Linux sca05 2.4.21-9.0.3.ELsmp #1 SMP Tue Apr 20 19:49:13 EDT 2004 i686 i686 i386 GNU/Linux

lsmod
Module                  Size  Used by    Tainted: P
ide-tape               52464   0  (autoclean)
ide-cd                 34016   0  (autoclean)
cdrom                  32576   0  (autoclean) [ide-cd]
sg                     37228   0  (autoclean)
lp                      9124   0  (autoclean)
parport                38816   0  (autoclean) [lp]
autofs                 13620   0  (autoclean) (unused)
nfs                    95344   5  (autoclean)
lockd                  58992   1  (autoclean) [nfs]
sunrpc                 88444   1  (autoclean) [nfs lockd]
e100                   58468   1
floppy                 57488   0  (autoclean)
microcode               5056   0  (autoclean)
loop                   12696   0  (autoclean)
ext3                   89960   7
jbd                    55060   7  [ext3]
lvm-mod                64800  15
aacraid                34116   2
sd_mod                 13456   4
scsi_mod              111784   3  [sg aacraid sd_mod]

Comment 43 Red Hat Bugzilla 2004-05-27 03:57:01 EDT
I've been contacted by 3Ware's (well, actually AMCC's) Technical
Support Engineer, and he has supplied me with findings made by another
anonymous customer who actually did some kernel debugging in RHES 3.0
(and 3Ware's Linux driver engineer agrees with his findings).

He wrote:

"I think I found the "culprit" of the RedHat ES 3.0 performance
problem and I would like to know 3ware's opinion "before"
contacting RedHat.

The "main" problem is, as my first intuition suggested, in this patch:

 # several small tweaks not worth their own patch
 Patch10020: linux-2.4.18-smallpatches.patch

I did my analysis on the latest publicly available RedHat
kernel, i.e. "kernel-2.4.21-9.0.3.EL.src.rpm", from
ftp://updates.redhat.com/enterprise/3ES/en/os/SRPMS.

After unpacking the source RPM it is possible to see
in "/usr/src/redhat/SOURCES/linux-2.4.18-smallpatches.patch",
at lines 517-518, this nice change to the base kernel 2.4.21:

    517  #define DISABLE_CLUSTERING 0
    518 -#define ENABLE_CLUSTERING 1
    519 +#define ENABLE_CLUSTERING 0

that has a "very nice" result on the file

   "/usr/src/linux-2.4/drivers/scsi/hosts.h"

     35 #define SG_NONE 0
     36 #define SG_ALL 0xff
     37
     38 #define DISABLE_CLUSTERING 0
     39 #define ENABLE_CLUSTERING 0

As you can see the C defines "DISABLE_CLUSTERING" and "ENABLE_CLUSTERING"
are now both "zero". In the original source code they are (as one
can expect from the names):

     35 #define SG_NONE 0
     36 #define SG_ALL 0xff
     37
     38 #define DISABLE_CLUSTERING 0
     39 #define ENABLE_CLUSTERING 1

"zero" and "one", that, if I interpret correctly, means respectively
"disable scatter gathering" and "enable scatter gathering".

Now in the 3ware driver source code (as in release 1.02.00.037),
in the 3w-xxxx.h file, we have this reference to "ENABLE_CLUSTERING":

    563         unchecked_isa_dma : 0,                          \
    564         use_clustering : ENABLE_CLUSTERING,             \
    565         use_new_eh_code : 1,                            \

    596         unchecked_isa_dma : 0,                          \
    597         use_clustering : ENABLE_CLUSTERING,             \
    598         use_new_eh_code : 1,                            \

As you can picture, the result is that, in the Scsi_Host_Template
initialization, the ENABLE_CLUSTERING constant is "zero" instead
of "one", as was "probably" the original 3ware programmer's
intention.

I did a very simple test with the redhat kernel (untouched)
changing the 3w-xxxx.h source code in this way:

    563         unchecked_isa_dma : 0,                          \
    564         use_clustering : 1                ,             \
    565         use_new_eh_code : 1,                            \

    596         unchecked_isa_dma : 0,                          \
    597         use_clustering : 1                ,             \
    598         use_new_eh_code : 1,                            \

i.e. I explicitly used "one" instead of the offending
"ENABLE_CLUSTERING" redhat macro.

Performance increased immediately from 50Mb/sec to 90Mb/sec.

Unfortunately there is another RedHat patch:

   Patch7030: linux-2.4.21-scsi-affine-queue.patch

that seems to influence 3ware driver performance.

The patch is rather complex and seems to change the logic
of the producer/consumer relation for the generic scsi driver.

If I understand correctly, from the source code comments, the
patch tries to optimize the SCSI generic driver for SMP systems.

It is possible (I did some tests and I would again like
to hear your opinion) to get rid of this problem in two ways.

The trivial one is to delete the patch: it seems to be used
by one and only one driver: qla2200 (QLogic ISP2x00) in the
file /usr/src/linux-2.4/drivers/addon/qla2200/qla2x00.c.

I followed this path and performance increased to 150Mb/sec.

Then I tried another way: I left the redhat kernel untouched
again and I modified the min/max-readahead parameters, picturing
that increasing the values would influence the new producer/consumer
driver logic.

Bingo! The new values:

      echo 8192 > /proc/sys/vm/max-readahead
      echo 2048 > /proc/sys/vm/min-readahead

increased performance again to 154-158Mb/sec. There even seems
to be a slight additional performance gain.

My conclusions are:

  a) there is a bug in the redhat "Patch10020" for ES 3.0;
     we should contact RedHat to understand why the macro
     was changed from one to zero;
     in the meantime we can use a "patched" 3ware include file;

  b) we may "increase" the min/max-readahead parameters from
     512/128 to 8192/2048 or we may contact RedHat to understand
     what is going on in the "Patch7030".


My questions to 3ware are:

  a)  is there any error in my analysis?
  b)  do you share my conclusions?
  c)  do you have other suggestions?"
Comment 44 Red Hat Bugzilla 2004-05-27 13:02:09 EDT
Aleksander:  Thank you for posting this.  This may in fact be the problem.

So, first, the latest RHEL3 U2 kernel source already has patch 7030
removed.  So, I've added a change to my working tree to re-enable
clustering in general.  I would be interested if someone could run my
current working kernel tree, which has the fixed scsi dump code, and
with the clustering support enabled and the machine under heavy load,
get me a dump of all the outstanding scsi commands.  I would like to
analyze whether or not the clustering code is making a significant
change in the makeup of the scsi commands going out to the 3ware driver.

Of course, with patch 7030 already removed, the changes to the min/max
readahead values shouldn't be needed, but if someone wants to run a
few tests with my current working kernel tree and some different
values for this, I would welcome the input.  However, keep in mind
that even though higher readahead values will increase streaming
performance, it's a trade off between streaming performance and
interactive performance.  So, basically, you don't want to make the
readahead numbers so high that the readahead operations start to
interfere with other tasks being performed at the same time.  So, some
additional useful data besides just the straight performance numbers
would be the results of running time /etc/cron.daily/slocate.cron
which would be performing a large number of uncached disk accesses all
over the disk at the same time.  The readahead might make streaming
performance better, but it's also likely to reduce the performance of
the slocate cron job, and finding a good balance of reasonable
streaming performance and reasonable slocate performance would be ideal.
Comment 45 Red Hat Bugzilla 2004-05-28 15:39:18 EDT
I pulled the latest version of the scsi_test kernel.  This did not
have the error mentioned above fixed.  I then edited the hosts.h
file to:

#define DISABLE_CLUSTERING 0
#define ENABLE_CLUSTERING 1

I then re-issued (having built this kernel earlier)
make modules modules_install install

I don't know if this is sufficient to rebuild all the modules
affected.  I had not issued the make dep, assuming (quite probably
incorrectly) that if I did not change the config this would be
unnecessary.  Since nothing changed, I am concerned that I failed to
rebuild the kernel (though I saw the .o dates of several modules change).

Anyway, assuming the change was implemented, the result was no
significant improvement in responsiveness once I booted from the
kernel.  This is a production machine, so I have used up my chance for
reboots this week.  Hence I couldn't try rebuilding the kernel from a
clean start.

When the system was in very big trouble it was interesting to see a 
much abbreviated log.

May 28 15:15:09 chips1 kernel: Dump of scsi host parameters:
May 28 15:15:09 chips1 kernel: (scsi0) Failed 0 Busy 254 Active254
May 28 15:15:09 chips1 kernel: (scsi0) Blocked 0 (Timer Active 0) Self Blocked 0
May 28 15:15:09 chips1 kernel: Dump of scsi device and command parameters:
May 28 15:15:09 chips1 kernel: (scsi0:0:0:0) Busy 254 Active 254 OnLine 1 Blocked 0 (Timer Active 0)
May 28 15:15:09 chips1 kernel: (cnt) ( kdev sect nsect cnsect stat use_sg) (retries allowed flags) (timo/cmd timo int_timo) (cmd[0] sense[2] result)
May 28 15:15:09 chips1 kernel: (  0) ( 08:0e 31966040  256    8    1 32) (0 5 0x00) (6000    0    0) 0x2a 0x00 0x00000000
May 28 15:15:09 chips1 kernel: (  1) ( 08:0e 31942224  256    8    1 32) (0 5 0x00) (6000    0    0) 0x2a 0x00 0x00000000

One thing that is very apparent is that this is not an instant 
problem.  It really feels that a very large queue of operations has 
to build up and then suddenly performance becomes dramatically 
worse.  It is as though there is a limit on the queue size and rather 
than slow up the offending process(es) everybody requesting disk 
becomes locked out.
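
For anyone comparing several of these dumps, the per-device queue depth can be pulled out of the log with a few lines of script. This is only a rough sketch: the function name is made up for illustration, and the pattern is based solely on the "(scsi0:0:0:0) Busy ... Active ..." line quoted above, so it may need adjusting for other dump formats.

```python
import re

# Matches the per-device summary line of the dump, e.g.
# "(scsi0:0:0:0) Busy 254 Active 254 OnLine 1 Blocked 0 (Timer Active 0)".
# The host-level "(scsi0) ..." line has no colons and is deliberately skipped.
DEVICE_LINE = re.compile(
    r"\(scsi(?P<host>\d+):(?P<channel>\d+):(?P<id>\d+):(?P<lun>\d+)\)\s+"
    r"Busy\s*(?P<busy>\d+)\s+Active\s*(?P<active>\d+)"
)

def device_queue_depths(log_text):
    """Return {(host, channel, id, lun): (busy, active)} for each device line."""
    depths = {}
    for match in DEVICE_LINE.finditer(log_text):
        key = tuple(int(match.group(name))
                    for name in ("host", "channel", "id", "lun"))
        depths[key] = (int(match.group("busy")), int(match.group("active")))
    return depths

# Two lines copied from the dump above.
sample = (
    "May 28 15:15:09 chips1 kernel: (scsi0) Failed 0 Busy 254 Active254\n"
    "May 28 15:15:09 chips1 kernel: (scsi0:0:0:0) Busy 254 Active 254 "
    "OnLine 1 Blocked 0 (Timer Active 0)\n"
)
print(device_queue_depths(sample))
```

Running it against a dump taken on an idle system and one taken under load would show directly whether the 254-command limit is being hit.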
Comment 46 Red Hat Bugzilla 2004-05-28 16:28:59 EDT
Thanks for trying that Keith.  I'm getting similar results here.  Just
changing the ENABLE_CLUSTERING define back to 1 isn't making any
significant difference.  I'm investigating a few alternative
possibilities here to see if I can work something up that would solve
the problem in a different manner.
Comment 47 Red Hat Bugzilla 2004-06-08 05:01:51 EDT
The problem definitely is in the RHEL kernel.

Recently I switched a kernel from Fedora Core 1's to RHEL's on a
similar machine (3Ware-based hw RAID5 array) because it was crashing
with kernel panics all the time (bug 123332).

With RHEL's 2.4.21-15.ELsmp kernel the panics have apparently stopped,
but the problem with high iowait values arose instantly. Storage
subsystem performance dropped so significantly that I had to disable
the virus scanning engine on that server (it's a mail server) and do
several other last-resort fs tweaks (disabling atime on all
filesystems; tweaking bdflush, elvtune, readahead parameters; screening
the system from inbound virus and spam traffic using a primary SMTP MX
that does the filtering and runs Fedora Core 1) to get the performance
back to an almost acceptable level - the mail system responds to clients
just before they time out most of the time...

Unfortunately I cannot build a test kernel as I have no RHEL system
that would be able to perform a build - those 2 systems that I have do
suffer from high iowait problems and running a kernel build would
likely kill them for several hours.
Comment 48 Red Hat Bugzilla 2004-06-09 10:57:36 EDT
Created attachment 100990 [details]
Detailed description of an array 3Ware 8506-4 

This is the "Details" page from the 3Ware 3DM array management page.

As visible, chunk size is 64 K.
Comment 49 Red Hat Bugzilla 2004-06-09 10:59:50 EDT
Doug, I've pulled your test kernel with Bitkeeper by running "bk clone
bk://linux-scsi.bkbits.net/rhel3-scsi-test
/usr/local/src/linux-rhel-test" and when I build it using
kernel-2.4.21-i686-smp.config, I get this compilation error:

gcc -D__KERNEL__ -I/usr/local/src/linux-rhel-test/include -Wall
-Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing
-fno-common  -Wno-unused -fomit-frame-pointer -pipe -freorder-blocks
-mpreferred-stack-boundary=2 -march=i686 -DMODULE -DMODVERSIONS
-include /usr/local/src/linux-rhel-test/include/linux/modversions.h 
-nostdinc -iwithprefix include -DKBUILD_BASENAME=i2c_ali1535  -c -o
i2c-ali1535.o i2c-ali1535.c
i2c-ali1535.c:675:6: missing terminating " character
i2c-ali1535.c:676:89: missing terminating " character
i2c-ali1535.c:691:1: unterminated argument list invoking macro
"MODULE_AUTHOR"
i2c-ali1535.c:674: error: syntax error at end of input
make[2]: *** [i2c-ali1535.o] Error 1
make[2]: Leaving directory `/usr/local/src/linux-rhel-test/drivers/i2c'
make[1]: *** [_modsubdir_i2c] Error 2
make[1]: Leaving directory `/usr/local/src/linux-rhel-test/drivers'
make: *** [_mod_drivers] Error 2
Comment 50 Red Hat Bugzilla 2004-06-09 11:15:34 EDT
Created attachment 100991 [details]
Detailed description of an array at a 3Ware 8506-8 controller

64 K chunks, too. Both arrays exhibit the same iowait problem under RHEL
kernel.
Comment 51 Red Hat Bugzilla 2004-06-09 11:42:38 EDT
Aleksander: regarding the compilation error, I suspect you have gcc
3.3 or later installed on the system you are compiling on?  The error
in question is due to the fact that the file uses a multi-line string
constant, which gcc-3.2 warns about but works with anyway, whereas
the latest gcc refuses to compile it.  I suspect that there are a
*number* of places in the kernel source that are going to be unhappy
with the latest gcc.  We also have another bug report that I think is
the same thing, 124450, and they are working on getting me some
detailed information related to the problem, so if switching around
gcc compilers and stuff is a hassle, we should be able to work it out
without you having to go through the trouble.  If that changes and I
need you to try a kernel (or if we think we have the issue fixed and
just need test confirmation), then I would think we can compile a set
of test RPMs at that point and make them available.
Comment 52 Red Hat Bugzilla 2004-06-14 04:01:49 EDT
I was building the kernel on Fedora Core 1.

Now I've built one on RHEL. SCSI dumps work fine (I can see that this
is a one time dump that is triggered by 'echo "scsi dump 2" >
/proc/scsi/scsi', not a continuous one, right?).

Now I'm trying to reproduce the high iowait condition and then I'll
post a dump from normal system state and high iowait state for comparison.
Comment 53 Red Hat Bugzilla 2004-06-14 09:21:03 EDT
Created attachment 101102 [details]
scsi dump on idle system

This dump has been made when the system is almost completely idle and iowait
values are around 0.0%.
Comment 54 Red Hat Bugzilla 2004-06-14 09:27:54 EDT
Created attachment 101103 [details]
scsi dump on system with high iowait values

This dump however may not be representative of the problem covered in this bug,
as I've put the system into high iowaits very artificially, by running 2 backup
sessions from remote machines and bonnie++ benchmark at the same time.

The system didn't exhibit one behaviour that's characteristic of the discussed
problem: iowait values dropped instantly after I've stopped the bonnie++ task,
while they usually remain high for some time when the problem occurs
spontaneously.

When the system goes into high iowaits state spontaneously, I'll generate a
scsi dump and attach it here. The dump may show very different data then. I
cannot reliably put the system into that state by manual intervention, so
you'll have to wait if current dump doesn't show anything suspicious.
Comment 55 Red Hat Bugzilla 2004-06-14 14:32:31 EDT
Aleksander:  This is very helpful information, especially the
difference in how quickly the artificial and the typical iowait
problems go away.  That leads me to think that the actual problem here
is quite possibly similar to the apache "thundering herd" wakeup on
connect problem that they solved some time back.  Whenever the request
queue gets filled up, any additional read or write requests are placed
on a wait queue.  That wait queue is woken up whenever a request is freed.
Obviously, if you have a large number of processes waiting on requests
to be freed, that wake up will end up scheduling all those processes
to run, but only 1 will actually get the free request struct.  So you
spend a lot of CPU time waking processes up just to put them back on
the wait queue.  The worse the problem is (aka, the more processes you
have on the wait queue), the longer it will take to clear up.  A good
check for this would be to run ps axfm on a machine under a real
iowait load condition and see just how many processes are stuck in a D
state.
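
A quick way to count those processes is to filter the ps output on the STAT column. The following is only a sketch: the helper name is hypothetical, the sample text is invented (it is not taken from the attachments in this bug), and it assumes the plain "PID TTY STAT TIME COMMAND" column layout, so other ps options may need a different column index.

```python
def count_uninterruptible(ps_output):
    """Return (pid, command) pairs whose STAT column starts with 'D'.

    Expects a header line followed by rows laid out as
    PID TTY STAT TIME COMMAND, separated by whitespace.
    """
    blocked = []
    for line in ps_output.splitlines()[1:]:      # skip the header row
        fields = line.split(None, 4)             # PID TTY STAT TIME COMMAND
        if len(fields) == 5 and fields[2].startswith("D"):
            blocked.append((fields[0], fields[4].strip()))
    return blocked

# Hypothetical sample output, made up for illustration only.
sample = """\
  PID TTY      STAT   TIME COMMAND
    1 ?        S      0:04 init
   12 ?        DW     0:31 [kjournald]
  845 ?        D      0:02 tiobench
  901 pts/0    R      0:00 ps ax
"""
for pid, command in count_uninterruptible(sample):
    print(pid, command)
```

Taking a few such counts some seconds apart would show whether the number of D-state processes stays high after the load is removed.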

A second problem here is that if only a few request structs are free,
and the process that is the first to get there needs more than there
are free, then even though it gets some requests onto the queue, it
still has more to go and goes back on the wakeup list.  The next time
a wake up happens, this process may not be the one that gets the free
request structs, so we can essentially end up giving a few request
structs here and there to processes, but not enough to let any process
actually finish and get itself off the wait queue for a significant
period of time.  Again, this would greatly exacerbate the problem
since what we need under high disk load like this is to be able to
actually get some of those processes done and off the wait queue
completely.
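
To put rough numbers on the wake-up cost in this theory, here is a toy model, not kernel code. It assumes each sleeping process needs exactly one request struct and that each freed struct satisfies exactly one sleeper; the function name and parameters are invented for the illustration.

```python
def total_wakeups(waiters, wake_all):
    """Count process wakeups needed to drain `waiters` sleeping processes
    when each freed request struct satisfies exactly one of them."""
    wakeups = 0
    while waiters > 0:
        # One request struct is freed: either every sleeper is woken and
        # races for it, or only a single sleeper is woken.
        wakeups += waiters if wake_all else 1
        waiters -= 1          # exactly one process obtains the struct
    return wakeups

for n in (8, 64, 254):
    print(n, total_wakeups(n, wake_all=True), total_wakeups(n, wake_all=False))
```

In this model waking everyone costs n(n+1)/2 wakeups, so 64 waiters need 2080 wakeups where waking one process at a time needs 64. That quadratic growth would match the observation that the worse the backlog is, the longer it takes to clear.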

Anyway, that's the theory I've got after reading your updates.  It's a
hard one to test, but an artificial load that might demonstrate it
better would be a disk exerciser that uses lots of threads for reading
and writing like tiobench.  Several simultaneous tiobench runs with
lots of threads each should duplicate the problem and also show a
large lag in clearing the problem out if I'm right about what's
causing the slowdown.
Comment 57 Red Hat Bugzilla 2004-06-15 07:14:30 EDT
OK, I've run 3 tiobench instances, two of them with default settings
(8 threads), one with more threads (64 threads).

I'm starting to attach results.
Comment 58 Red Hat Bugzilla 2004-06-15 07:21:57 EDT
Created attachment 101137 [details]
output from "(date; ps axfm; date)"

Indeed, tiobench and other processes are mostly in the "D" state.
Comment 59 Red Hat Bugzilla 2004-06-15 07:30:18 EDT
Created attachment 101138 [details]
scsi dump initiated at 2004-06-15 12:37:14

Note that timestamps in the log are offset and dump starts at 12:39:10.

This is because the previous dump (executed when iowaits were at 99%, but the
system was still quite responsive) was still going through syslog
when I initiated the second one.

I hope that this is OK - AFAIU dumped data is a result of point-in-time
snapshot, right?
Comment 60 Red Hat Bugzilla 2004-06-15 07:31:29 EDT
Created attachment 101139 [details]
tiobench results, first instance
Comment 61 Red Hat Bugzilla 2004-06-15 07:34:19 EDT
Created attachment 101140 [details]
tiobench results, second instance
Comment 62 Red Hat Bugzilla 2004-06-15 07:34:53 EDT
Created attachment 101141 [details]
tiobench results, third instance (64 threads)
Comment 63 Red Hat Bugzilla 2004-06-15 08:05:50 EDT
Yes, dumps are a snapshot in time, so hitting it a second time before
the first one finished is perfectly fine (unless syslog just chokes on
so much data ;-)

I'll review what you've posted (Thanks!) and see what I can come up
with in terms of any possible fixes.
Comment 65 Red Hat Bugzilla 2004-06-16 13:53:41 EDT
I'm not seeing a whole lot of processes in the D state.  Not enough to
create a thundering herd problem anyway.  I've asked Stephen Tweedie
to take a quick look at this and see if he thinks this might be an
ext3 bottleneck of some sort relating to journal writes, etc.

If someone has a non-production system that they can test with, then
they could try mounting the filesystem as ext2 instead of ext3 and see
if that makes a difference on the problem.  I don't really know much
about tuning ext3 filesystem journal sizes, but either increasing or
decreasing the journal size on problem filesystems might help as well.
Comment 66 Red Hat Bugzilla 2004-06-17 06:14:17 EDT
This might be related: I can also see the following message in kernel
error logs on each boot-up:

"kernel: PCI: Unable to handle 64-bit address space for"

There's nothing after the "for"...
Comment 67 Red Hat Bugzilla 2004-06-17 06:19:19 EDT
Created attachment 101213 [details]
SCSI dump made today under non-artificial high iowait condition

I've hit the high iowait problem today when running "du -x --max-depth=1 . |
sort -n" in the mountpoint of the whole filesystem (the biggest one on that
machine).

It took quite a while to complete the "du" run and iowaits were at 98%; system
responsiveness was sluggish.
Comment 68 Red Hat Bugzilla 2004-06-17 06:25:54 EDT
I'm jumping on this thread hoping my tests may be of help.

I'm the author of the post to 3ware reported in comment #43.

This post was sent to 3ware hoping they might help us understand why
an 8506 controller with a 4x250Gb Maxtor MaxLineII raid5 array had
very poor performance under RH EL 3.0: 11Mb/sec. The problem
"actually" was in the ENABLE_CLUSTERING patch. After resetting it to the
right value (one), performance jumped back to 150Mb/sec.

After that I found the problem in the affine patch, a patch I was
delighted to see deleted in the U2 Kernel release.

Unfortunately the story is not over. I can confirm, from our tests,
that our system periodically "hangs" for short periods when it is under
an I/O-bound process.

This happens with ext2, ext3, JFS, ReiserFS, XFS ... it even happens
when we create new filesystems. It is very easy to see the problem
arise when a "dd" is started on a raw device.

These short periodic "locks" of the system arise if and only if
a process is "writing" to the devices. They do not show up at all
if the process simply "reads" from the device.

At first I was convinced this was a RH ES problem, but, after dozens
of tests, I think I found this behaviour, more or less, under
other kernels: straight 2.4.26, 2.6.5 and even in Fedora Core 1/2
kernels.

I'm starting to think the problem is either inside the 8506 3ware
device driver or inside the 8506 firmware.

I cannot prove this and I still hope it is a kernel issue that can
be corrected.

Hope I was of some help, Regards, G. Vitillaro.
Comment 69 Red Hat Bugzilla 2004-06-17 10:51:36 EDT
The problem is difficult and it seems that there are in fact different
problems that when combined together give the symptoms we've observed.

Let's try to prepare a list of potential causes (not that they exclude
each other, but they are separate):

=== Hardware

Poor performance of the 3Ware controller in HW RAID5 setups can be
explained by the poor performance of its CPU.

That's why some people set up _Linux's software RAID5_ on 3Ware
controllers instead of doing RAID5 in the hardware, because they get
much better performance if their machine's CPU is reasonably fast.

Look here:
http://ask.slashdot.org/article.pl?sid=04/06/16/1658250&mode=thread&tid=137&tid=198#9445640

=== Software

There are several possible software causes in the Linux kernel as I
understand:

1) The CLUSTERING patch that does this to the hosts.h header:

#define DISABLE_CLUSTERING 0
#define ENABLE_CLUSTERING 0

has been tested and does not influence this issue (high iowaits) as
Keith Roberts reported in comment #45, but backing it off has resulted
in significant throughput increases for some testers (Giuseppe
Vitillaro, comment #68).

2) Removal of the infamous Patch7030:
linux-2.4.21-scsi-affine-queue.patch has offered some performance
improvements on 3Ware controllers for some testers (Keith Roberts,
comment #20; Giuseppe Vitillaro, comment #68), but didn't affect the
high iowait issue much either.

3) Some yet unidentified issue with RHEL kernel as compared with
Fedora kernels causing the "high iowait and low system responsiveness,
high I/O latency" issue that continues to exist.

A system that had been running Fedora kernels
(vmlinuz-2.4.22-1.2188.nptlsmp, 2.4.22-1.2190.nptlsmp) didn't have the
problem. Installing RHEL kernel (vmlinuz-2.4.21-15.ELsmp) on that
system immediately brought high iowait problems.
Comment 70 Red Hat Bugzilla 2004-06-17 11:18:32 EDT
Be careful about "visible" iowait.

I don't have a log of all of my tests, but I'm pretty sure
that in many of the kernel versions I tested (this includes Fedora),
the iowait is not "visible", but the "hanging" problem is still
there, at least on our machine.

The problem presents itself in this way:

1) start a "dd if=/dev/zero of=/dev/sda[n] bs=1M" on
   a raw unused partition of your /dev/sda array;

2) "cd" into /usr/lib and start an "ls -l"

the "ls -l" command waits for dozens of seconds, even more
on some occasions, the first time you scan a "large" uncached
directory.

I know this is a very rough way to test a system, but it is the
only fast way we found to check if the problem is still there.

We are about to test SW RAID5 on this "non production"
machine. If the performance is in the same range as
"native HW RAID" and the "hanging" problem disappears,
we will have a good indication that the problem arises in
the RAID5 3ware firmware, won't we?

Regards, G. Vitillaro.
Comment 71 Red Hat Bugzilla 2004-06-17 16:04:59 EDT
Doug, regarding your earlier comment, there's really nothing in the
logs so far here that would give me enough information to either blame
or eliminate ext3 as a factor here.  But the effect of batching of
journal writes is really more likely to show up as a latency effect
under severe load, not an IO bandwidth effect.

One thing that might help would be an "alt-sysrq-t" dump of process
state during the bad performance, as that will show exactly which
processes are waiting where.  But if there's underlying bad IO
performance at the driver level, that is still quite likely to show up
in the sysrq-t log as lots of processes stuck in the filesystem.

Really, trying one of the existing reproducers on an identical
configuration except with ext3 mounted as ext2 instead is the best way
to eliminate that from the problem.
Comment 73 Red Hat Bugzilla 2004-06-18 07:58:01 EDT
BTW I've noticed that the problem is especially visible when doing
recursive permission operations on trees with a large number of small files.

E.g. doing "chgrp -R groupname /some/big/directory" will progress
unusually slowly, especially if there are some other moderately I/O
intensive processes running on the system.

So commands making small reads seem to be affected the most.
Comment 74 Red Hat Bugzilla 2004-06-18 07:58:41 EDT
Ehm, I meant "small reads and writes".
Comment 75 Red Hat Bugzilla 2004-06-18 08:02:15 EDT
Stephen, does doing such a magic SysRq dump affect the state of the
system? Some magic SysRq commands leave the system in barely usable
state (like emergency remount R/O), I'd like to be sure that this one
will not affect the operation of the server.
Comment 76 Red Hat Bugzilla 2004-06-18 10:07:54 EDT
Aleksander, alt-sysrq-t only emits its output to the kernel log, and
has no other impact.  The only side-effect you might notice is the
time the kernel takes to dump the information --- if you have serial
console set up, for example, then you might see a short stall as the
kernel dumps all the output over a slow connection. 
Comment 77 Red Hat Bugzilla 2004-06-18 12:43:22 EDT
"doing "chgrp -R groupname /some/big/directory" will progress
unusually slowly, especially if there are some other moderately I/O
intensive processes running on the system."

That's an unfortunate consequence of basic IO scheduling.  If your
scheduler is "fair" with respect to the IOs it knows about, then
there's a basic problem --- writes are generated asynchronously by
applications (for most filesystem modifications, the app doesn't need
to wait until the IO hits disk --- only fsync/O_SYNC forces that.) 
But for reads, the application needs to wait for the IO to complete
before the data is available.

So for reads, you end up with the application submitting a single
read, waiting for it, submitting another, etc --- there's one IO in
the queue at a time.  (Readahead helps to some extent by making those
IOs as large as usefully possible.)  But for writes, an application
can generate huge numbers of IOs at once.

If you mix the two types of load, then yes, the reads progress slowly,
because each single small new read gets queued behind all the other
writes in the system.

The 2.6 kernel has an "anticipatory scheduler" which keeps the queue
artificially idle after a read is satisfied, to allow the reading
process to submit another one in short order and get a bit more of the
queue to itself at once.   It's not really feasible to back-port that
to 2.4.
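
The effect of mixing the two loads can be shown with a toy model of a strict FIFO queue. This is a simplified sketch, not how the 2.4 elevator actually orders requests: it assumes every request costs one time unit, that the writer refills its backlog before each read is submitted, and that the reader has a single read outstanding at a time. The function name and parameters are made up for the example.

```python
from collections import deque

def read_latencies(num_reads, write_backlog, service_time=1):
    """Latency of each synchronous read on a FIFO queue that a writer
    keeps topped up with `write_backlog` asynchronous writes."""
    queue = deque()
    clock = 0
    latencies = []
    for _ in range(num_reads):
        # The writer never waits for completion, so it refills the queue
        # before the reader gets a chance to submit.
        while len(queue) < write_backlog:
            queue.append("W")
        queue.append("R")                 # the reader's single outstanding IO
        submitted = clock
        while queue:
            request = queue.popleft()     # strict FIFO dispatch
            clock += service_time
            if request == "R":
                break                     # reader wakes up and may submit again
        latencies.append(clock - submitted)
    return latencies

print(read_latencies(3, write_backlog=0))
print(read_latencies(3, write_backlog=254))
```

With no competing writes each read completes in one time unit; with a backlog of 254 writes (the per-device queue depth mentioned earlier in this bug) every read in the model waits 255 units, which is the "queued behind all the other writes" effect in miniature.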

Comment 78 Red Hat Bugzilla 2004-06-22 05:20:03 EDT
Created attachment 101325 [details]
Magic SysRq dump (alt-sysrq-t)
Comment 79 Red Hat Bugzilla 2004-06-22 05:28:16 EDT
During that dump, the processes that were I/O intensive and caused
high iowaits were: ssh (the client, it was tar-archiving a remote
machine, redirecting the compressed archive to a local file), ls (it
was hanging at the time of the dump), find (it was searching for
.tar.bz2 files on all filesystems).
Comment 80 Red Hat Bugzilla 2004-06-22 09:57:14 EDT
That sysrq-t output looks very much like journal wait issues.
Amongst other things, syslogd is writing the log files in sync mode
(calling fsync after every log message) which appears to then be
forcing journal flushes that can delay things.  But, in general, this
really looks like an ext3 journal flush causing long latency type
problem.  Stephen?
Comment 83 Red Hat Bugzilla 2004-06-22 17:52:38 EDT
In the sysrq-t trace here, we've got one task doing a "checkpoint". 
That's when the journal is full, so we're basically doing the "sync"
to flush out any metadata that's attached to the old transactions in
the journal prior to deleting those transactions.

Going to the first comment: "I have no such problem on a Fedora Core 1
box with 3Ware 8506-4 controller, which is under an order of magnitude
higher I/O load."

Now, that code should be basically identical in FC1 and RHEL3.  I will
go have a check to see if there are any differences we've missed, though.

But bear this in mind --- CPUs are faster than disks.  If you have an
IO-intensive workload, then the processes doing IO are *necessarily*
going to spend the bulk of their time in D state.  And if you're doing
a lot of journal writes, then yes, slow IO can often be expected to
manifest itself as lots of processes waiting on the journal --- merely
because that's where the writes are physically scheduled.

So seeing processes blocked in the journal does not imply _cause_
here, especially since the exact same journal code is apparently
working fine on FC1.  The journal is definitely waiting on IO in the
specific sysrq-t snapshot here, but the jury is still out as to
whether that's a cause or an effect.

Comment 86 Red Hat Bugzilla 2004-06-23 05:16:15 EDT
If the code really is identical in RHEL and FC, then the problem is
most probably somewhere else.

I've tried (see comment #47) switching to RHEL kernel on a machine
running FC 1 distribution.

Switching from Fedora to RHEL kernel (built on a RHEL machine from
source pulled from Doug's test BK repository at
bk://linux-scsi.bkbits.net/rhel3-scsi-test) introduces the high iowait
and high latencies problem immediately.

This results in _visible_ performance degradation (abnormally high
latencies of disk operations), and I am aware that the iowait on
Fedora kernel is always shown as having 0% as reported by 'top'.
Comment 88 Red Hat Bugzilla 2004-06-24 16:11:45 EDT
I am having very similar problems, but on a desktop workstation (Dell
Precision 360), not a server.  Basically any intensive I/O, like
tarring a large file or running find on the entire disk, will make the
system almost completely lock up for tens of seconds or more.  I once
tried to run sysreport and had to abort it after half an hour because
my computer was basically unresponsive to interactive use.  I have
also seen high iowait percentages and many processes in the D state,
but I just think they are a symptom, not the problem, they just happen
to be processes that are running at that time, usually things like
kswapd, find, tar, even X (which is why interactive response is so bad)!

I have read through most of this report, and the only similarities I
identified were ext3 journaling, an SMP kernel (hyperthreading PIV
CPU), and a SATA controller card.  I tried booting the non-SMP kernel
and even booted with all filesystems mounted as ext2, but still saw
the same problem.  Could this be a hardware or kernel driver problem?
I have an Intel Corp. 82801EB disk controller, but my one and only
IDE hard drive is connected to the Ultra ATA 100 interface, not the
SATA.  Or is this something internal to the kernel and I/O scheduling?
Comment 89 Red Hat Bugzilla 2004-07-06 10:39:32 EDT
Is there any more info I could provide to help fixing this bug?
Comment 90 Red Hat Bugzilla 2004-07-06 11:01:12 EDT
I hope my comment will not mislead the effort to
solve this problem.

As I noted in comment #70, we switched our machine
from HW RAID5 to Linux SW RAID5 using RH ES 3.0 U2.

We are using four Maxtor MaxLineII 250GB drives attached to a 3ware
Escalade 8506-8.

We used bonnie++ (bonnie++ -n0 -r512 -s20480 -f -b -u0 on a 512MB
memory configuration) to evaluate the performance of an ext3 filesystem:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ulisse          20G           65419  66 46467  41           159405  49 142.8   1
ulisse,20G,,,65419,66,46467,41,,,159405,49,142.8,1,,,,,,,,,,,,,

As you may note, the bandwidth seems preserved:

63MB/sec write
45MB/sec rewrite
155MB/sec read
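
The figures above appear to be the block-level K/sec columns of the
bonnie++ CSV line divided by 1024. A minimal Python sketch of that
conversion, assuming the bonnie++ 1.03 CSV field order (name, size, then
value/%CPU pairs for per-char write, block write, rewrite, per-char
read, block read, seeks); the helper name is hypothetical:

```python
def bonnie_block_rates(csv_line):
    """Return block write/rewrite/read rates in MB/sec from a bonnie++ CSV line.

    Assumed field order (bonnie++ 1.03): name, size, putc, %cpu, write, %cpu,
    rewrite, %cpu, getc, %cpu, read, %cpu, seeks, %cpu, ...
    Empty fields correspond to tests that were skipped (here: per-char, via -f).
    """
    fields = csv_line.strip().split(",")
    columns = {"write": 4, "rewrite": 6, "read": 10}
    # bonnie++ reports K/sec; divide by 1024 to get MB/sec.
    return {name: float(fields[idx]) / 1024.0 for name, idx in columns.items()}


line = "ulisse,20G,,,65419,66,46467,41,,,159405,49,142.8,1,,,,,,,,,,,,,"
for name, rate in bonnie_block_rates(line).items():
    print(f"{name}: {rate:.1f} MB/sec")
```

The whole-number parts of the results (63, 45, 155) match the figures
quoted in the comment.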

The disks are identical, with a base controller/single-disk bandwidth
in the 35-40MB/sec range. This is almost the same result we obtained
using the HW RAID5 and in line with the expected performance
(SW RAID seems definitely better for write/rewrite bandwidth).

It seems to me that, as may be expected, this is obtained using
CPU cycles. The CPU use seems to double going from HW RAID5 to
SW RAID5.

But ... the periodic locking problem seems to be gone:
the machine (SMP dual-processor Intel Pentium Xeon 2.80GHz)
now runs smoothly.

I know this cannot be a conclusion, but if the behaviour can
be duplicated and analyzed, maybe there is hope of identifying
the origin of the problem.

Regards, G. Vitillaro.


Comment 91 Red Hat Bugzilla 2004-07-06 13:34:43 EDT
Dear Giuseppe 

Could you please post your bonnie++ results for quad drive HW raid 
configuration.  Your software RAID numbers are far (3x) superior to 
my HW results posted above though you mention they are comparable to 
your HW results.  Could you also confirm with ext3 whether you were 
running defaults or had a higher performance journal mode set.

I notice you used a 8506-8 compared to my 8506-4.  My understanding 
was that for given number of drives it should be comparable.

regards Keith Roberts
Comment 92 Red Hat Bugzilla 2004-07-06 14:00:22 EDT
Sure, but I only have a copy of the "best" test I obtained
with HW RAID and RH ES.

The HW configuration is the same, apart from using HW RAID
on the same disks with the 8506.

You are right: the best results with HW RAID (from my logs)
is with JFS. Ext3 is rather disappointing for write and rewrite
(read is in the same range).

The kernel was Red Hat ES 3.0 2.4.21-9.0.3 (U1 I believe), the
driver was 3ware 1.02.00.037, with vm.max-readahead = 8192 and
vm.min-readahead = 2048, and with the driver recompiled with
ENABLE_CLUSTERING patched to 1. The higher readahead values
were needed because the 7030 "scsi-affine" patch was still
in our kernel at that time (Fri May  7 10:38:19 2004).

The bonnie++ command and the machine memory were the same (512MB):
"bonnie++ -n 0 -r 512 -s 20480 -f -b" and the results, on a JFS
filesystem with 4096-byte blocks, were:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nimitz          20G           52443  17 44000  15           161393  31 113.9   0
nimitz,20G,,,52443,17,44000,15,,,161393,31,113.9,0,,,,,,,,,,,,,

As you can see, the read performance is slightly better, the
write/rewrite performance is worse and the CPU usage is more
than one half, compared with SW RAID.

In both cases the "read" bandwidth is in the expected range, i.e.
near 150-160MB/sec, for a quad disk configuration.
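
For a side-by-side view, the block-level numbers from the two bonnie++
runs posted in this thread can be expressed as HW/SW ratios. This is
only a rough sketch: the SW RAID5 run was on ext3 and the HW RAID5 run
on JFS, so the comparison is not like-for-like, and the function name is
hypothetical.

```python
# (K/sec, %CPU) copied from the two bonnie++ runs posted in this thread.
# SW RAID5 run: ext3 on "ulisse". HW RAID5 run: JFS on "nimitz".
SW_RAID = {"write": (65419, 66), "rewrite": (46467, 41), "read": (159405, 49)}
HW_RAID = {"write": (52443, 17), "rewrite": (44000, 15), "read": (161393, 31)}


def hw_over_sw(hw, sw):
    """Return {phase: (throughput ratio, %CPU ratio)} of HW relative to SW."""
    return {
        phase: (hw[phase][0] / sw[phase][0], hw[phase][1] / sw[phase][1])
        for phase in sw
    }


for phase, (rate, cpu) in hw_over_sw(HW_RAID, SW_RAID).items():
    print(f"{phase:8s} throughput x{rate:.2f}  cpu x{cpu:.2f}")
```

A throughput ratio below 1 means the HW RAID run was slower in that
phase; a CPU ratio below 1 means it used less CPU.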

We were rather happy with these results "before" we noticed the
"locking" problem. Then, after the discussion on this thread,
we "switched" to SW RAID, and we are going to stay with the SW configuration
(the machine is in production now) until the whole thing is cleared up.

Hope this helps, Giuseppe.
Comment 93 Red Hat Bugzilla 2004-07-06 14:04:12 EDT
Just another note: we tried ext2 "once" (i.e. ext3 without
journal) and we didn't notice any performance change, but I'm
going from memory. I haven't logged these tests. Sorry.

Regards, G. Vitillaro.
Comment 94 Red Hat Bugzilla 2004-07-07 04:26:41 EDT
Giuseppe, I think this is a bit off-topic for this bug. You're talking
about comparing the performance of various filesystem types, and
about the performance of HW RAID5 vs. SW RAID5.

This bug is about a different thing: using the *same filesystem*
(ext3), on the *same machine and storage subsystem*, Fedora kernel
performs well, while RHEL kernel gives terrible latencies visible on
the client side and high iowait times visible on the server side
during those periods.

As to the comparison of RAID performance, hardware vs. software, it's
a well known thing especially with 3Ware controllers.

Some people deliberately setup software RAID5 on 3Ware controllers and
treat them as "dumb" controllers:
http://ask.slashdot.org/article.pl?sid=04/06/16/1658250&mode=thread&tid=137&tid=198#9445640
Comment 95 Red Hat Bugzilla 2004-07-08 20:15:09 EDT
I am having the same issues on a live production server.  It is an
Intel mainboard with a 3Ware 7006-2 RAID Controller (IDE/Parallel). 
We are running this on RHEL3 2.4.21-4.ELsmp With Dual XEON Processors.
We are using 2 Maxtor 120gb IDE drives.

dmesg:
SCSI subsystem driver Revision: 1.00
3ware Storage Controller device driver for Linux v1.02.00.033.
scsi0 : Found a 3ware Storage Controller at 0x7000, IRQ: 54, P-chip: 1.3
scsi0 : 3ware Storage Controller
Starting timer : 0 0
3w-xxxx: scsi0: AEN: WARNING: Unclean shutdown detected: Unit #0.
blk: queue c2ecc218, I/O limit 4095Mb (mask 0xffffffff)
  Vendor: 3ware     Model: Logical Disk 0    Rev: 1.0
  Type:   Direct-Access                      ANSI SCSI revision: 00
Starting timer : 0 0
blk: queue c2ecc018, I/O limit 4095Mb (mask 0xffffffff)
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: 240119680 512-byte hdwr sectors (122941 MB)

Please let me know when a resolution is expected or what quick fix
could be used in the mean time.
Comment 96 Red Hat Bugzilla 2004-07-09 04:36:03 EDT
Bob, if you can, please try using Fedora Core 1's kernel for a while.
You can download kernel-smp from Fedora here:
http://download.fedora.redhat.com/pub/fedora/linux/core/1/i386/os/Fedora/RPMS/

Don't change any other options and post your results here (does Fedora
kernel really work better).
Comment 97 Red Hat Bugzilla 2004-07-09 15:49:01 EDT
I have installed that kernel and initially it seemed to be ok.  I
opened two tar sessions of the entire hard drive to really test it. 
iowait remains at 0% and my load average went up to 6.8.  My question
is this, tho:  If my iowait is at 0% on all 4 processors and all 4 of
those processors are at least 90% idle, how is it possible that my
load average is climbing?  I will continue to run this Fedora kernel
on our server and see what happens under normal load.  
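
One likely part of the answer: on Linux the load average counts tasks
in uninterruptible sleep (D state, typically waiting on disk) as well as
runnable ones, so it can climb while the CPUs sit idle. A rough
floating-point sketch of the 1-minute figure, assuming the usual
5-second sampling with exponential decay; the function name and the task
count are illustrative, not measured on this server:

```python
import math


def simulate_load(active_tasks, minutes, window_s=60.0, tick_s=5.0):
    """Approximate a load average after `minutes` of constant activity.

    `active_tasks` is the number of tasks that are runnable OR in
    uninterruptible (D) sleep at each sample. Starts from a load of 0.
    """
    decay = math.exp(-tick_s / window_s)
    load = 0.0
    for _ in range(int(minutes * 60 / tick_s)):
        # Exponentially damped moving average, sampled every tick_s seconds.
        load = load * decay + active_tasks * (1.0 - decay)
    return load


# Seven tasks stuck waiting on disk, zero CPU in use:
for minutes in (1, 5, 10):
    print(minutes, round(simulate_load(7, minutes), 2))
```

Under these assumptions the figure approaches the number of blocked
tasks even though none of them is consuming CPU time.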

Also, I have a ticket open with Red Hat Support.  I opened it before I
found this bug.  Today I sent them all of the information related to
my server that they requested.  Anyone who has access to those tickets
can look at ticket 346246.  This information was given to them before
I changed the kernel.  I was running the .15 EL3 kernel at the time
and had high I/O activity to reproduce the load.
Comment 98 Red Hat Bugzilla 2004-07-09 16:52:38 EDT
Bob: Aside from your ticket, we now have a number of tickets all
echoing the same basic sentiment in terms of a problem.  I'm currently
working on it on a test machine internally.  Currently, my prime
suspect is that the problem is a combination of things.  1)  The RHEL
kernels will flush a larger number of dirty pages in a single pass of
kflushd than upstream kernels will in order to improve our ability
to flush out swap pages under heavy memory pressure and 2) the I/O
elevator doesn't take this into account and can allow a huge number of
writes to get put in front of reads on the request queue under a
couple different scenarios resulting in the disk request queue
becoming clogged with writes ahead of reads and generating extremely
high latencies on the read requests, which in turn causes programs to
get stuck waiting for requested read data to be returned, increasing
the overall load average, decreasing responsiveness, etc.  That's the
theory anyway, we'll see how the possible code solutions work, or
don't as the case may be.
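
To illustrate the theory only (this is a toy model, not the RHEL
elevator code): assume one disk with a fixed service time per request
and a strictly FIFO request queue. A single read queued behind a large
flush batch then waits for every write ahead of it. The function name,
the 5 ms service time and the batch sizes are all made-up values.

```python
from collections import deque


def read_latency_ms(writes_ahead, service_ms=5.0, reads_jump_queue=False):
    """Time until one read completes when `writes_ahead` writes were queued first.

    Toy model: single disk, fixed service time, no merging or seeking.
    """
    queue = deque(["W"] * writes_ahead)
    if reads_jump_queue:
        queue.appendleft("R")  # hypothetical policy: the read is served first
    else:
        queue.append("R")      # plain FIFO: the read waits behind every write
    elapsed = 0.0
    while queue:
        request = queue.popleft()
        elapsed += service_ms
        if request == "R":
            return elapsed
    raise AssertionError("read was never dispatched")


for batch in (32, 256, 2048):
    fifo = read_latency_ms(batch)
    prio = read_latency_ms(batch, reads_jump_queue=True)
    print(f"{batch:5d} writes ahead: FIFO {fifo:8.0f} ms, read-first {prio:.0f} ms")
```

In this model the read latency grows linearly with the size of the write
batch, which is the kind of stall the theory above would predict for
programs waiting on read data.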

I'm currently waiting on my test machine to finish compiling a
relatively complete set of performance runs so that I have valid
baseline data against which to judge the effectiveness of any changes
I make (I thought the people on this bugzilla would appreciate knowing
just where I'm at, hence the status update format of this last bit).
Comment 99 Red Hat Bugzilla 2004-07-09 17:03:38 EDT
Thanks Doug!
Let me know if you have any patches that you would like me to try on
our server.  I only have physical access to it on Wednesdays and
Fridays but do have remote access 24x7.
Comment 100 Red Hat Bugzilla 2004-07-09 17:22:49 EDT
It seems I am still having similar issues using the Fedora kernel.  I
was working with my named.conf file and it loaded almost instantly in
pico.  When I returned to the file, it took approx 20 seconds to open
the same file in pico.  At this point my load is at 3.15 which seems
unreasonably high considering our server has a higher mail load
between 9-5 and it is now after 5.  Approx 30 mins ago my load was at .33.
Comment 101 Red Hat Bugzilla 2004-07-09 17:32:19 EDT
Can you run top for a few minutes during this high load average time
and tell me which processes come to the top of the listing and show as
being in either an R or D state?  I'm curious what's causing the load,
whether it's something like mail server processes or kernel processes
such as bdflushd, kflushd, or kjournald.
Comment 102 Red Hat Bugzilla 2004-07-09 17:38:44 EDT
Doug,

I just now went into top but my system load is only at .7 at the
moment.  Something strange was happening tho.  All 4 processors were
showing 0% idle as well as every other column (user, system, iowait,
etc).... every 5 or 10 seconds, irq, softirq, and iowait would show
approx 33% on all four processors AS WELL AS under total, which
mathematically wouldnt make any sense.  The system is running and
responding fine but this was something odd I just observed.

I know typically spamd, qmail, and other processes related to mail pop
up at the top during high system load averages however I have yet to
see a process use more than 4% cpu during this high load averages.

I am leaving the data center at this point to travel back to my home
in PA.  I will check this thread in approx 5 hours as well as check my
system load at that point.  As soon as I see the load average peak
above 4 again, I will get the information I can out of top.
Comment 103 Red Hat Bugzilla 2004-07-12 09:03:11 EDT
For some time, I've been logging output from "ps axfmu" when there
were high iowait peaks.

Have a look at the processes spending their time in the D state. The
following has been filtered using:
awk '{ if ($8 ~ /D/) { print ; } }'
so that only processes in the "D" state are shown (the username column
has been removed):

 PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
 198 0.0 0.0 0 0 ? DW Jun04 29:16 [kjournald]
 23944 2.5 0.9 21896 9920 ? D Jul02 146:07 \_ fam
 25431 0.0 0.1 7736 1232 ? D Jun22 0:01 \_ /usr/bin/python
/usr/local/mailman/bin/qrunner --runner=RetryRunner:0:1 -s
 3385 0.5 0.1 4384 1768 ? DN 14:59 0:01 \_ /usr/lib/courier/bin/imapd
Maildir
 3698 0.5 0.1 4120 1604 ? DN 15:00 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 19863 0.0 0.1 5552 1876 ? DN 10:32 0:06 | \_
/usr/lib/courier/bin/imapd Maildir
 24551 0.0 0.1 3692 1056 ? DN 14:23 0:00 \_
/usr/lib/courier/bin/couriertls -server -tcpd
/usr/lib/courier/libexec/courier/imaplogin /usr
 27507 0.0 0.0 3696 952 ? DN 14:32 0:00 \_
/usr/lib/courier/bin/couriertls -server -tcpd
/usr/lib/courier/libexec/courier/imaplogin /usr
 27508 0.0 0.1 3868 1100 ? DN 14:32 0:00 | \_
/usr/lib/courier/bin/imapd Maildir
 498 0.0 0.1 4256 1412 ? DN 14:50 0:00 | \_ /usr/lib/courier/bin/imapd
Maildir
 3687 2.4 1.1 13936 11328 ? DN 15:00 0:03 | \_
/usr/lib/courier/bin/imapd Maildir
 4210 99.9 0.1 3104 1364 ? DN 15:02 0:02 | \_ submit esmtp dns;
SOKRATES (softdnserr [::ffff:192.168.254.79]) AUTH: LOGIN askwarska, TL
 4202 0.0 0.1 3096 1360 ? DN 15:02 0:00 \_ submit esmtp dns;
[10.0.10.5] (softdnserr [::ffff:192.168.254.79])
 18250 0.1 1.1 29380 11800 ? DN Jul05 1:48 \_ /usr/sbin/httpd
 23543 0.1 1.0 25820 11048 ? DN 14:19 0:03 \_ /usr/sbin/httpd
 4220 0.0 0.3 21176 3716 ? DN 15:02 0:00 \_ /usr/sbin/httpd
 2947 0.9 1.7 49312 17860 ? D 14:58 0:02 \_ /usr/bin/spamd -d -c -a -m5 -H
 2948 0.9 1.7 49312 18256 ? D 14:58 0:02 \_ /usr/bin/spamd -d -c -a -m5 -H
 2949 0.9 1.7 49312 17876 ? D 14:58 0:02 \_ /usr/bin/spamd -d -c -a -m5 -H
 3026 0.2 1.5 41228 16068 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 3132 0.2 1.6 41116 17468 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 3507 0.3 1.2 41404 12604 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 3703 0.4 1.4 41116 14992 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 3717 0.1 1.3 40984 14088 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 3718 0.5 1.4 41380 14876 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 3762 0.3 1.5 41116 15624 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H

Another one:

 PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
 25064 1.6 0.2 6420 2448 ? DN 15:27 0:04 \_ /usr/lib/courier/bin/imapd
Maildir
 25428 1.9 0.3 6424 3100 ? DN 15:28 0:04 \_ /usr/lib/courier/bin/imapd
Maildir
 25730 2.0 0.3 6424 3440 ? DN 15:29 0:03 \_ /usr/lib/courier/bin/imapd
Maildir
 26216 0.3 0.1 4112 1524 ? DN 15:32 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 26220 0.6 0.1 3844 1160 ? DN 15:32 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 26291 0.0 0.0 3768 612 ? DN 15:32 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 26276 0.0 0.0 3772 616 ? DN 15:32 0:00 | \_
/usr/lib/courier/bin/imapd Maildir
 26277 0.0 0.0 3756 612 ? DN 15:32 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 6655 0.0 0.6 240020 6416 ? D Jun21 0:15 \_ /usr/sbin/slapd -f
/etc/openldap/slapd_bdb.conf -u ldap -h ldap://0.0.0.0:389/ ldaps://0.
 25681 0.3 1.8 40396 19064 ? D 15:29 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 25690 0.3 1.8 40528 18784 ? D 15:29 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 25927 0.3 1.3 40816 13796 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 25942 0.3 1.3 40684 13812 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 25955 0.4 1.3 40816 13804 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 25978 0.3 1.3 40948 14268 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 25984 0.4 1.3 41076 14028 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H

Another one:

 PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
 195 0.0 0.0 0 0 ? DW Jun04 4:20 [kjournald]
 198 0.0 0.0 0 0 ? DW Jun04 22:57 [kjournald]
 25483 0.0 0.1 3856 1276 ? DN 09:57 0:01 \_ /usr/lib/courier/bin/imapd
Maildir
 11438 0.4 0.2 5324 2740 ? DN 09:09 0:24 | \_
/usr/lib/courier/bin/imapd Maildir
 21598 48.6 0.4 19372 4700 ? DN 09:45 22:48 | \_
/usr/lib/courier/bin/imapd Maildir
 3955 32.0 0.0 3780 616 ? DN 10:31 0:17 | \_
/usr/lib/courier/bin/imapd Maildir


Yet another one:

 PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
 198 0.0 0.0 0 0 ? DW Jun04 20:56 [kjournald]
 3091 0.6 0.2 5172 2584 ? D 11:24 0:00 | \_ /usr/bin/python -S
/usr/local/mailman/cron/gate_news
 14102 0.9 0.2 19808 2216 ? D Jun08 171:53 \_ /usr/bin/python
/usr/local/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
 14136 0.0 0.1 8064 1212 ? D Jun08 3:19 \_ /usr/bin/python
/usr/local/mailman/bin/qrunner --runner=CommandRunner:0:1 -s
 3008 0.9 0.1 3492 1776 ? DN 11:24 0:00 | \_ submit esmtp dns;
krzysztof ([::ffff:195.94.219.146]) AUTH: LOGIN pedryc, TLS: SSLv2,128b
 1877 0.4 0.1 4916 1276 ? DN 11:20 0:01 \_ /usr/lib/courier/bin/imapd
Maildir
 2688 2.9 0.1 4120 1464 ? DN 11:23 0:03 \_ /usr/lib/courier/bin/imapd
Maildir
 2961 1.3 0.1 4116 1624 ? DN 11:24 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 3006 0.3 0.0 3764 624 ? DN 11:24 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 31962 0.6 0.4 7604 4464 ? DN 09:21 0:46 | \_
/usr/lib/courier/bin/imapd Maildir
 9747 0.2 0.5 8632 5892 ? DN 09:59 0:12 | \_
/usr/lib/courier/bin/imapd Maildir
 10029 0.1 0.1 4124 1524 ? DN 10:00 0:05 | \_
/usr/lib/courier/bin/imapd Maildir
 10172 0.1 0.2 5576 2892 ? DN 10:00 0:06 | \_
/usr/lib/courier/bin/imapd Maildir
 10180 0.1 0.1 3876 1272 ? DN 10:00 0:05 | \_
/usr/lib/courier/bin/imapd Maildir
 28608 0.1 0.0 1596 400 ? DN Jun16 7:32 syslogd -m 0
 2945 0.1 0.9 40664 9592 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 2954 0.2 0.9 40664 9528 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 2957 0.1 0.8 40664 8816 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 3085 0.7 0.9 40796 10168 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 3086 0.7 1.0 40796 10444 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H

And yet another one:

 PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
 198 0.0 0.0 0 0 ? DW Jun04 24:05 [kjournald]
 14290 0.0 0.1 3184 1056 ? DN Jun22 0:11 \_
/usr/lib/courier/libexec/courier/courierd
 26291 0.0 0.0 1560 428 ? DN 13:03 0:00 | \_ ./courierlocal
 27326 0.0 0.0 3768 624 ? DN 13:07 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 29624 0.5 0.1 4312 1756 ? DN 11:21 0:34 | \_
/usr/lib/courier/bin/imapd Maildir
 3108 0.0 1.1 33668 11728 ? DN Jun22 1:34 \_ /usr/sbin/httpd
 14981 0.1 0.0 1596 580 ? DN Jun22 1:32 syslogd -m 0
 26326 0.3 1.9 40432 19764 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 26336 0.3 1.8 40280 19112 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 26339 0.3 1.8 40168 19364 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 26343 0.3 1.8 40168 19296 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 26347 0.3 1.8 40168 19236 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 27323 0.1 0.4 39772 4136 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 27324 0.0 0.4 39772 4172 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 27325 0.1 0.4 39772 4192 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 27333 0.0 0.4 39772 4236 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
 27337 0.1 0.4 39772 4256 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H


As can be seen, kjournald in state D usually accompanies the high
iowait condition, but not always.
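
Snapshots like the ones above can be tallied per command to see which
programs are blocked most often. A minimal Python sketch, assuming the
column layout shown in this comment (username column removed, STAT as
the seventh field); the function name is hypothetical, and rows whose
command was wrapped onto the next line are simply skipped:

```python
from collections import Counter

TREE_MARKS = {"\\_", "|"}  # forest-drawing tokens printed before the command


def d_state_commands(ps_lines, stat_col=6):
    """Count commands whose STAT field starts with 'D' (uninterruptible sleep)."""
    counts = Counter()
    for line in ps_lines:
        fields = line.split()
        # Skip headers, blank lines and wrapped continuation lines.
        if len(fields) < stat_col + 4 or not fields[0].isdigit():
            continue
        if not fields[stat_col].startswith("D"):
            continue
        # COMMAND starts three fields after STAT (STAT, START, TIME, COMMAND).
        command = [f for f in fields[stat_col + 3:] if f not in TREE_MARKS]
        if command:
            counts[command[0]] += 1
    return counts


sample = [
    r" PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND",
    r" 198 0.0 0.0 0 0 ? DW Jun04 29:16 [kjournald]",
    r" 23944 2.5 0.9 21896 9920 ? D Jul02 146:07 \_ fam",
    r" 2947 0.9 1.7 49312 17860 ? D 14:58 0:02 \_ /usr/bin/spamd -d -c -a -m5 -H",
    r" 2948 0.9 1.7 49312 18256 ? D 14:58 0:02 \_ /usr/bin/spamd -d -c -a -m5 -H",
    r" 18250 0.1 1.1 29380 11800 ? DN Jul05 1:48 \_ /usr/sbin/httpd",
]
for command, count in d_state_commands(sample).most_common():
    print(count, command)
```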
Comment 104 Red Hat Bugzilla 2004-07-12 10:48:20 EDT
I have been noticing on my server that the processes listed at the top
of the top command generally arent using ANY resources at all.  I
honestly havent looked for the D state but I will keep an eye open
from here on out.
Comment 105 Red Hat Bugzilla 2004-07-12 11:02:26 EDT
I'm not a newb, but nowhere near a developer..... but if kjournald is
accompanying the high loads, is it possible this scenario only happens
under ext3?  Is anyone having this condition under ext2?

Just a thought.
Comment 106 Red Hat Bugzilla 2004-07-12 11:08:23 EDT
In comment #100 you said:

"It seems I am still having similar issues using the Fedora kernel.  I
was working with my named.conf file and it loaded almost instantly in
pico.  When I returned to the file, it took approx 20 seconds to open
the same file in pico.  At this point my load is at 3.15 which seems
unreasonably high considering our server has a higher mail load
between 9-5 and it is now after 5.  Approx 30 mins ago my load was at .33."

We had exactly the same problem, under Fedora Core1/2 and the
last 2.6.5 plain kernel, for "both" ext2 and ext3 (and other
filesystems too), as I already reported in comment #68.

G. Vitillaro.
Comment 107 Red Hat Bugzilla 2004-07-12 15:26:53 EDT
Not only are we having those same issues, but everything seems to act
differently every time we load top.  I have yet to see any more
than 0% total iowait, but this isn't possible as we are only using
approx 2% of CPU and our load averages are nearing or going over
4.0 .... I am assuming the Fedora kernel just doesn't report iowait?
My load average seems to be slightly lower under the Fedora kernel but
it is acting weird.  Does Fedora not report iowait?
Comment 108 Red Hat Bugzilla 2004-07-13 10:59:30 EDT
I believe FC1 does not report iowait.  FC2 does, as it is based on 2.6
and iowait is upstream in 2.6 already.  RHEL3 also has iowait backported.
Comment 109 Red Hat Bugzilla 2004-07-14 18:57:04 EDT
I just received this response from Red Hat Support:

Dear Sir,

It seems that when you experience an issue with your server, that's
the time you're running a lot of applications or a peak on the load.
You can check 'free -m' during this time, and you'll notice that you
only left with few RAM and swap. This is the reason why even in fedora
you have the same issue.

You can add another RAM on your system, and see if this will improve the
performance of your system.

Regards,
Leah

I can not accept this as a solution but is there any truth to this at
all?  The server we are running now replaced another server with less
power and about half the RAM (and no 3ware card).  We are under no
more load than our old server yet our load average is increasing to
insane amounts.
Comment 110 Red Hat Bugzilla 2004-07-14 19:24:26 EDT
This is not a RAM issue.

We first saw this problem when we upgraded from a system running under
7.2 Redhat to RHEL 3.0.  Same hardware, same ram, same applications,
same database, etc.  Performance was so bad that I responded to this
thread after getting the run-around from Redhat support.  Redhat told
us that the 3ware cards were not supported on RHEL 3.0 and I should
purchase a RAID array from a company that was on the compatible list,
such as, Dell or Compaq.  FYI: Oracle support was not any better.

I have tested several systems and have seen these same excessively
slow performance issues.  All of the systems that we have used are
SMP, either dual Xeon or P4 with hyper-threading, using 3ware (IDE and
SATA) or onboard SATA, have 3 or 4 Gb RAM, and use Oracle as the
primary application.

Although the Fedora Core 1 kernel took us from completely unacceptable
performance to tolerable performance, our solution was to abandon
Redhat in favor of other Linux suppliers.

The primary application that we run is the Oracle standard edition
database.  Once I discovered what needed to be done to get Oracle to
install on other distributions, our problem was solved.

I started using Red Hat several years ago and I have been very
disappointed on the lack of support and the incredible delay between
reporting a verifiable problem until resolution.  For instance, this
thread has been open for 11 weeks and it is not yet solved.  BTW, we
considered this a production halting problem.

If Redhat ever fixes this issue, I will probably use the licenses that
we purchased.  However, I would prefer a refund at this point.

LaVar
Comment 111 Red Hat Bugzilla 2004-07-14 19:44:46 EDT
I am happy to see that others feel the same way that I do.  I was fine
with this situation as long as I felt that it was being worked on.  As
soon as Red Hat Support gave me that last response, my view of Red Hat
went from "ok" to "horrible" ... If you are honest with me and
tell me that you are working on a problem, then I can deal with
that..... when you lie to me and tell me it's something else,
especially when that solution requires me to spend needless money on
memory that isn't even going to help the problem, then it's a new story
altogether.  We have already begun to look at other solutions.
Hopefully a solution to this bug is found before we migrate to someone
else.
Comment 112 Red Hat Bugzilla 2004-07-14 19:51:15 EDT
As I went back to check on the compatibility issue I was a bit
perplexed.  At first we had a Promise Technologies card in our server
but had difficulties getting it to work with Red Hat.  As a result we
consulted Red Hat's HCL and from there we found that this 3Ware card
was our best option.  At this point, those 3Ware cards no longer show
on their HCL.  Is it true that Red Hat is pulling support for a
product that it once said worked or did I misread something?
Comment 113 Red Hat Bugzilla 2004-07-14 22:39:16 EDT
No need to get upset at a busy or tired support person ;)

Engineering is looking into this problem and we are trying to get it
fixed.
Comment 114 Red Hat Bugzilla 2004-07-15 23:42:38 EDT
AFAIK 3Ware should still appear on the HCL 'though you'll need to use
the "complete list" tab as they're not officially certified.
Comment 115 Red Hat Bugzilla 2004-07-16 09:06:28 EDT
Another listing of processes in the D state, notice 3 kjournald
instances (this one from today):

 PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
 192 0.0 0.0 0 0 ? DW 11:17 0:05 [kjournald]
 193 0.0 0.0 0 0 ? DW 11:17 0:00 [kjournald]
 195 0.0 0.0 0 0 ? DW 11:17 0:04 [kjournald]
 2973 0.4 0.0 1592 580 ? D 11:17 1:04 syslogd -m 0
 3175 0.0 0.1 4680 1804 ? D 11:17 0:05 \_
/usr/lib/courier/libexec/courier/courierd
 28665 0.7 0.1 5448 1600 ? D 14:54 0:04 \_ /usr/lib/courier/bin/imapd
Maildir
 29140 0.8 0.2 5444 2460 ? D 14:56 0:04 \_ /usr/lib/courier/bin/imapd
Maildir
 17308 0.0 0.2 5660 2096 ? D 12:03 0:03 | \_
/usr/lib/courier/bin/imapd Maildir
 6061 0.0 0.0 3980 856 ? D 13:34 0:00 | \_ /usr/lib/courier/bin/imapd
Maildir
 21268 0.0 0.2 5732 2748 ? D 14:30 0:01 | \_
/usr/lib/courier/bin/imapd Maildir
 21269 0.1 0.1 4636 1620 ? D 14:30 0:02 | \_
/usr/lib/courier/bin/imapd Maildir
 4089 8.5 1.5 20832 15600 ? D 11:19 19:22 \_ fam
 29645 0.2 3.5 62620 36340 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 29646 0.2 3.5 62620 36476 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 29647 0.1 3.4 62620 35516 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 29703 0.2 3.2 62620 33184 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 29705 0.2 3.2 62620 33420 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 30371 0.2 1.7 63032 18496 ? D 15:02 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 30374 0.1 1.7 63032 17704 ? D 15:02 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 30375 0.2 1.7 63032 17972 ? D 15:02 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 30479 0.2 1.8 63032 18624 ? D 15:03 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 30507 0.4 1.6 62480 17320 ? D 15:03 0:00 \_ /usr/bin/spamd -d -c -a
-m5 -H
 3513 0.1 0.7 26424 7876 ? D 11:17 0:16 \_ /usr/sbin/httpd
 3515 0.1 0.7 26672 7944 ? D 11:17 0:18 \_ /usr/sbin/httpd
 3517 0.1 0.6 26436 6820 ? D 11:17 0:16 \_ /usr/sbin/httpd
 3522 0.1 0.7 26344 7692 ? D 11:17 0:15 \_ /usr/sbin/httpd
 12859 0.1 0.8 26852 8648 ? D 11:43 0:14 \_ /usr/sbin/httpd
 12900 0.1 0.4 27256 4916 ? D 11:43 0:14 \_ /usr/sbin/httpd
 12922 0.1 0.7 26504 8132 ? D 11:43 0:13 \_ /usr/sbin/httpd
 13748 0.1 0.5 26804 5968 ? D 11:47 0:12 \_ /usr/sbin/httpd
 13940 0.1 0.6 26208 7152 ? D 11:48 0:12 \_ /usr/sbin/httpd
 32199 0.0 0.7 26380 8196 ? D 13:08 0:06 \_ /usr/sbin/httpd
 31017 0.0 0.0 1584 588 ? D 15:05 0:00 \_ crond
 31018 0.0 0.0 1584 588 ? D 15:05 0:00 \_ crond
 31019 0.0 0.0 1584 588 ? D 15:05 0:00 \_ crond
 3611 0.0 0.1 7692 1144 ? D 11:17 0:00 \_ /usr/bin/python
/usr/local/mailman/bin/qrunner --runner=RetryRunner:0:1 -s
 13736 0.0 0.0 3960 984 ? D 14:10 0:00 /usr/lib/courier/bin/imapd Maildir
Comment 116 Red Hat Bugzilla 2004-07-16 09:08:02 EDT
And another one, notice lack of kjournald instances, and a fam instance:

 PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
 23944 3.4 1.8 24176 18620 ? D Jul02 646:49 \_ fam
 10494 0.0 0.1 4360 1664 ? DN 13:10 0:09 \_ /usr/lib/courier/bin/imapd
Maildir
 24579 1.3 0.2 5532 2636 ? DN 15:59 0:05 \_ /usr/lib/courier/bin/imapd
Maildir
 25375 0.8 0.2 5544 2596 ? DN 16:02 0:01 \_ /usr/lib/courier/bin/imapd
Maildir
 25590 0.7 0.1 4120 1480 ? DN 16:03 0:01 \_ /usr/lib/courier/bin/imapd
Maildir
 25648 0.0 0.1 3832 1108 ? DN 16:03 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 25905 0.8 0.1 4128 1636 ? DN 16:05 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 26131 0.0 0.1 3856 1160 ? DN 16:06 0:00 \_ /usr/lib/courier/bin/imapd
Maildir
 11120 0.0 0.1 3984 1320 ? DN 15:09 0:01 | \_
/usr/lib/courier/bin/imapd Maildir
 15232 0.0 0.1 4468 1768 ? DN 15:25 0:00 | \_
/usr/lib/courier/bin/imapd Maildir
 17768 0.1 0.1 4384 1676 ? DN 15:35 0:03 | \_
/usr/lib/courier/bin/imapd Maildir
 17769 0.2 0.1 3924 1344 ? DN 15:35 0:04 | \_
/usr/lib/courier/bin/imapd Maildir
 19324 0.1 0.1 4932 1796 ? DN 15:40 0:02 | \_
/usr/lib/courier/bin/imapd Maildir
 25918 0.4 0.2 4700 2092 ? DN 16:05 0:00 | \_
/usr/lib/courier/bin/imapd Maildir
 26073 0.7 0.1 4364 1732 ? DN 16:05 0:00 | \_
/usr/lib/courier/bin/imapd Maildir
 19503 0.1 1.1 30368 11508 ? DN 09:46 0:39 \_ /usr/sbin/httpd
 21794 0.1 1.0 27192 10324 ? DN 13:52 0:13 \_ /usr/sbin/httpd
 31399 0.1 1.0 26804 10856 ? DN 14:23 0:08 \_ /usr/sbin/httpd
 31415 0.1 0.9 26804 9440 ? DN 14:23 0:09 \_ /usr/sbin/httpd
 31426 0.1 0.9 26992 9916 ? DN 14:23 0:10 \_ /usr/sbin/httpd
 23730 0.2 0.7 63784 8112 ? D 15:55 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
 23731 0.2 0.7 63784 8164 ? D 15:55 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
 23755 0.2 0.7 62728 7988 ? D 15:55 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
 23812 0.1 0.7 62464 8008 ? D 15:56 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
 23826 0.2 0.7 62596 8020 ? D 15:56 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
 28492 0.0 0.1 4300 1708 ? DN 14:14 0:05 /usr/lib/courier/bin/imapd
Maildir
 29943 0.0 0.1 3856 1264 ? DN 14:17 0:00 /usr/lib/courier/bin/imapd
Maildir
 29944 0.0 0.1 3852 1108 ? DN 14:17 0:00 /usr/lib/courier/bin/imapd
Maildir
 29945 0.0 0.1 3852 1232 ? DN 14:17 0:00 /usr/lib/courier/bin/imapd
Maildir
 29946 0.0 0.1 3868 1212 ? DN 14:17 0:00 /usr/lib/courier/bin/imapd
Maildir
 31515 0.0 0.2 5312 2652 ? DN 14:23 0:06 /usr/lib/courier/bin/imapd
Maildir
 31516 0.0 0.4 7172 4344 ? DN 14:23 0:05 /usr/lib/courier/bin/imapd
Maildir
 31517 0.0 0.1 4080 1512 ? DN 14:23 0:06 /usr/lib/courier/bin/imapd
Maildir
 23847 0.0 0.0 4604 752 ? DN 15:56 0:00 spamc -s 524288
Comment 117 Red Hat Bugzilla 2004-07-16 09:29:40 EDT
BTW, I want to direct your attention to bug 124450 which seems closely
related.
Comment 118 Red Hat Bugzilla 2004-07-16 10:58:37 EDT
Created attachment 101966 [details]
readprofile kernel profiling data during high iowaits state

This is extracted from another RHEL system, running Doug's test kernel and
suffering from high iowaits. At the moment the readprofile was captured, the
following processes were in the D state:

USER	   PID %CPU %MEM   VSZ	RSS TTY      STAT START   TIME COMMAND
root	     9	0.0  0.0     0	  0 ?	     DW   15:26   0:00 [bdflush]
root	    10	0.1  0.0     0	  0 ?	     DW   15:26   0:04 [kupdated]
root	    20	0.0  0.0     0	  0 ?	     DW   15:26   0:00 [kjournald]
root	   127	0.0  0.0     0	  0 ?	     DW   15:26   0:00 [kjournald]
root	   129	0.0  0.0     0	  0 ?	     DW   15:26   0:00 [kjournald]
root	   130	0.0  0.0     0	  0 ?	     DW   15:26   0:00 [kjournald]
root	   442	0.0  0.0  1580	292 ?	     D	  15:26   0:00 syslogd -m 0
root	   478	0.0  0.0  7276	112 ?	     D	  15:26   0:00	    \_ 3dmd
root	  2114	0.0  0.0  3328 1140 ?	     D	  16:35   0:00		    \_ perl .fishsrv.pl 53d95f350b45b5ade77f9119d03764e5
root	   961	0.0  1.2 258384 26508 ?      D	  15:27   0:00	\_ ./dsmserv QUIET
root	   987	0.0  1.2 258384 26508 ?      D	  15:27   0:00	\_ ./dsmserv QUIET
root	   997	0.0  1.2 258384 26508 ?      D	  15:27   0:00	\_ ./dsmserv QUIET
Comment 119 Red Hat Bugzilla 2004-07-16 11:00:51 EDT
Created attachment 101967 [details]
readprofile kernel profiling data during normal state

For comparison, here's data from readprofile captured during normal system
state (a few minutes earlier).
Comment 120 Red Hat Bugzilla 2004-07-16 11:42:50 EDT
Created attachment 101969 [details]
readprofile kernel profiling data during high iowaits state, after resetting profiling data

This readprofile data has been captured during high iowait system state, but
the profiling counters were reset 2 minutes earlier. Processes in the D state
at the moment readprofile was captured were:

USER	   PID %CPU %MEM   VSZ	RSS TTY      STAT START   TIME COMMAND
root	   127	0.0  0.0     0	  0 ?	     DW   15:26   0:05 [kjournald]
root	   130	0.0  0.0     0	  0 ?	     DW   15:26   0:00 [kjournald]
root	   442	0.0  0.0  1580	304 ?	     D	  15:26   0:00 syslogd -m 0
root	  2056	0.1  0.0  4036	728 pts/1    D	  16:33   0:04	|	       |   \_ top
root	  1073	0.0  1.2 260760 25100 ?      D	  15:27   0:00	\_ ./dsmserv QUIET
root	  1097	0.0  1.2 260760 25100 ?      D	  15:27   0:01	\_ ./dsmserv QUIET
root	  1100	0.0  1.2 260760 25100 ?      D	  15:27   0:00	\_ ./dsmserv QUIET
root	  2051	0.2  1.2 260760 25100 ?      D	  16:32   0:08	|   \_ ./dsmserv QUIET
Comment 121 Red Hat Bugzilla 2004-07-16 12:59:27 EDT
Doug, could you apply kernel security fixes (RHSA-2004:360-05 - kernel
nfs server, RHSA-2004:255-10 - signal handler crash and others) to
your rhel3-scsi-test BitKeeper repository?

I'd prefer running a secure kernel, even if for testing...
Comment 122 Red Hat Bugzilla 2004-07-16 13:33:57 EDT
Yeah, I'm actually treating the two bugs (this and 124450) as dups of
each other, I just haven't marked them as dups in bugzilla so just in
case that they do turn out to be two different bugs then they can
still be tracked separately.

A quick (Hah!  Yeah right!) status update.  First, let me state
concisely what I think the problem is at this point.

One, the iowait issue is really a red herring, especially as compared
to Fedora or upstream 2.4 kernels which don't even have iowait stats.
 The iowait numbers are nothing more than a symptom of the problem, so
there's nothing wrong with iowait per se other than it's telling us
that the disks are going *really* slow.
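To make the "symptom, not cause" point concrete, here is a minimal sketch of how an iowait percentage can be derived from two samples of the aggregate `cpu` line in /proc/stat. It assumes a kernel that exposes an iowait column as the fifth value on that line (user, nice, system, idle, iowait), as 2.6 kernels do; the function name `iowait_percent` is made up for this illustration.

```python
def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two 'cpu' lines of /proc/stat.

    Assumes the column order user, nice, system, idle, iowait, ...
    """
    def counters(line):
        parts = line.split()
        if not parts or parts[0] != "cpu" or len(parts) < 6:
            raise ValueError("expected an aggregate 'cpu' line with an iowait column")
        return [int(value) for value in parts[1:]]

    deltas = [new - old for old, new in zip(counters(before), counters(after))]
    total = sum(deltas)
    if total == 0:
        return 0.0
    # Index 4 is iowait under the assumed column order.
    return 100.0 * deltas[4] / total


if __name__ == "__main__":
    sample_a = "cpu 100 0 50 800 50 0 0"
    sample_b = "cpu 110 0 55 820 115 0 0"
    print(iowait_percent(sample_a, sample_b))
```

The number only says how much otherwise idle CPU time had I/O outstanding; it says nothing about why the I/O is slow, which is the point made above.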

Two, the whole "it took 20 seconds to open file <blah>" issue is a
maximum latency problem.  I've been running a lot of tests, and so far
they show that the elevator does the right thing in controlling
latency in general, and the only time that latency gets this far out
of whack is when the entire elevator is running that far behind, it's
not a case of a single starved command and that means that the basic
elevator operation is OK, it's just severely overloaded at the times
when latency goes through the roof.

Three, the core basic problem is that total I/O throughput to the
disks is just going to utter crap.  When this happens, it clogs up the
elevator and causes high latencies, and it is one of those self
worsening problems in that when it happens, it prolongs the very
conditions that cause it to happen and therefore feeds upon itself.

Four, the problem is kernel version agnostic (mostly).  By this I mean
that the problem can occur on RHEL kernels, upstream 2.4 kernels,
upstream 2.6 kernels, Fedora kernels, etc.  By changing which kernel
is in use, you can change how often this problem happens, but my
reading of the comments above basically says that most people that
tried a Fedora or upstream kernel and thought the problem had gone
away later came back and said "well, no it didn't, but it took longer
to show up".

Five, the problem *is* hardware dependent.  My test box that I built
to replicate this problem has two different 4 drive RAID0 arrays
(using software RAID0).  One of them exhibits the problem *very*
clearly, while the other one exhibits the problem somewhat, but
manages to still do OK.  The problem does *not* appear to be related
to the controller, or to the bus (aka, IDE, SCSI, Fiber Channel), but
instead is related to the individual hard disks.  Different models
from the same manufacturer have very different behavior patterns under
this load condition.  One of my RAID0 arrays is made of 4 36GB Seagate
SCSI disks, and these 4 drives go from best performance of around 120
to 130 MByte/sec under single process loads to 50 MByte/sec under 2
process loads to 25-30 MByte/sec under 16 process loads.  The other
RAID0 array is 4 18GB Seagate Fiber Channel disks and they go from 90
MByte/sec single process to 3-5 MByte/sec under 2 process and 16
process loads.  Furthermore, inspection of drive specific settings in
the drive firmware has revealed some hints at the problem.  The 18GB
Seagate drives default to only having 3 cache zones for the on drive
cache memory.  Obviously, if you only have 3 cache zones and you are
getting lots of random I/O, the chance that any two random I/Os will
fall into the same zone is very small.  Increasing that number of
cache zones to 16 didn't help the overall drive performance much (to
be expected, these are slightly older drives with only a limited
amount of on drive cache, so even though I increased the number of
zones, each zone is now smaller and so the likelihood of hitting a
cache zone with random I/O is again fairly small).  However,
increasing the number of cache zones did have one interesting effect.
 While the drive array used to perform at a constant 3 to 5 MByte/sec
under the 2 and 16 process load tests, it now has spikes of up to 25
MByte/sec with 16 cache zones.  Unfortunately, the spikes are usually
very short lived and the performance again drops back down to the low
range.

So, here's what I think is going on at this point, based upon all the
information above.  Given the right set of conditions, the particular
I/O pattern being sent to the drive from the linux kernel is producing
absolute worst case performance numbers from the actual physical hard
disks.  Just how bad the performance gets depends on the brand and
model of disk in use.  From what I can tell, this doesn't have
anything to do with any problems in the SCSI stack, IDE stack, or
other drivers.  Instead, the problem is higher up in the kernel where
we are actually originating the read and write requests (such as the
filesystem, the swapper, bdflush, etc.)

Why is this happening on RHEL more so than on other kernels?  The best
answer I can give to this right now is that the VM in RHEL is tuned
for certain types of performance, included in that tuning is changing
the default number of pages that kswapd is allowed to flush in a
single pass of looking for freeable memory (we increased the limit so
that under high memory pressure swapping would happen quicker).  This
sort of change results in the I/O pattern that the VM sends to the
disks being different.  I think there are a number of places in the
kernel where we have tweaked things that would have a ripple down
effect on the I/O pattern we generate.  The result of all those
changes taken together is that now, some devices display very poor
performance numbers under certain conditions quicker on RHEL than they
do on upstream kernels, however that doesn't mean that upstream
kernels are immune to the problem as they can degrade to the same
place as well.

What can be done about the problem?  Several things.  First, you can
check your drives to see if they are configured suboptimally for
server mode operation.  Having too few cache segments or too small of
a cache buffer will contribute to the problem (as will just generally
poor drive firmware).

Second, we are investigating kernel changes and tuning options that
might help the problem.  One of the major issues contributing to this
problem is file fragmentation (assuming that the files are large
enough that they need more than one fragment).  This particular
problem is one that very much feeds upon itself.  The more the file is
fragmented, the more you have small reads when trying to read the
file.  However, it also means you have to read more filesystem meta
data in order to know where the blocks are on disk.  So not only do
you get more small reads for the file itself, but the filesystem code
has to issue more small reads as well.  So, file fragmentation is a
major problem.  In Fedora there is a program to check file
fragmentation (called filefrag).  When I run the 16 process
performance test, all 16 processes write to their own files but all at
the same time.  The ext3 filesystem code simply grabs the next
available block for each file when it issues a write.  The result is
that when you look at the disk layout, those 16 files may be stored on
disk something like this:

disk blocks ---->

0011589999beefff

In this case, the blocks for each file are intermixed with each other.
 When process 1 then tries to read from its file, it's unable to
create any reasonable sized I/O operations to the disk because the
file is made up of 1 little chunk here, another there, etc.  Watching
the output of vmstat 1 during the 16 process performance test shows
the effects of this very clearly.  The 16 processes start out writing
with each other, then they all switch to reading from their files.  At
the very early stages of their reading, when they are roughly in sync
with each other and process 0 is reading the blocks that it has that
are right beside the blocks for process 1's file, the read rate is
actually decent.  However, as random processes start falling behind or
getting ahead of the other processes and the reads are no longer close
to each other on disk, the performance quickly degrades to the very
low range I mentioned earlier.  Typically, a 256MB file owned by one
of these processes might be comprised of as many as 17,000 separate
chunks.  As a test, I wrote the 16 different files one at a time
instead.  In that case, they were all comprised of just 3 fragments
each.  Then I started all 16 processes reading from the 16 different
files.  The maximum throughput on this array with only a single file
was roughly 70MByte/sec.  With 16 unfragmented files, instead of
getting the horrible 3 to 5 MByte/sec rate, I managed to get about 65
MByte/sec as the throughput rate.  So, this basically demonstrates how
bad fragmentation of files on the filesystem can kill your performance.
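The interleaving described above can be sketched with a toy model. This is an illustration only, assuming an allocator that always hands out the next free block and writers that take turns one block at a time; it is not the ext3 allocator, and the function names are made up for the example.

```python
def allocate_interleaved(num_files, blocks_per_file):
    """Writers take turns; each write simply gets the next free block."""
    layout = []  # layout[i] is the id of the file that owns disk block i
    for _ in range(blocks_per_file):
        for file_id in range(num_files):
            layout.append(file_id)
    return layout


def allocate_sequential(num_files, blocks_per_file):
    """Files are written one at a time, so each gets one contiguous run."""
    layout = []
    for file_id in range(num_files):
        layout.extend([file_id] * blocks_per_file)
    return layout


def count_fragments(layout, file_id):
    """Number of contiguous runs of blocks that belong to file_id."""
    fragments = 0
    previous_owner = None
    for owner in layout:
        if owner == file_id and previous_owner != file_id:
            fragments += 1
        previous_owner = owner
    return fragments


if __name__ == "__main__":
    interleaved = allocate_interleaved(16, 1000)
    sequential = allocate_sequential(16, 1000)
    print("interleaved:", count_fragments(interleaved, 0), "fragments")
    print("sequential: ", count_fragments(sequential, 0), "fragments")
```

In this idealized model every block of an interleaved file is its own fragment, while a file written alone is a single run. Real filesystems land somewhere in between (the test above saw about 17,000 chunks versus 3), but the direction of the effect is the same.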

One possible way to reduce fragmentation for now would be to rewrite
the files that are fragmented.  For example, Aleksander, since your
machine that's having such a problem is a mail server, you may find
that stopping the mail server long enough to simply rewrite all of the
mail spool files and IMAP mail folders may restore a significant
amount of your performance.  However, this isn't a permanent solution
since changes to the files over time will reintroduce fragmentation. 
Solving this particular problem is very difficult.  Obviously, we
can't know the future and guessing at how large a mail spool file may
get in the next month or two is impossible.  In any case, Stephen
Tweedie is working on back porting the
linux-2.6.5-ext3-reservations.patch file to our 2.4 kernels.  This
patch should help in those cases where a file is written out in a
single go (things like the IMAP server rewriting the folder or spool
file after a commit command would benefit from this).  It doesn't help
so much with lots of small, independent writes to a file (such as when
individual emails are delivered).  The patch uses a "we just saw two
writes to this file, so start allocating more than 1 block at a time"
type algorithm to make large writes require fewer filesystem
fragments.  However, when a process opens the files, does a small
write, closes the file, then sometime later another process does the
same thing, we don't have any reliable way to predict that more writes
are coming soon and therefore don't try to reserve the larger block
chunks (at least that's my understanding of the ext3-reservations
patch, but since Stephen is the one actually working on it he would
have to give the authoritative answer).  This patch is requiring
significant work to be backported, but I expect this is going to make
the single largest impact on performance (although in order to see the
impact, you may have to rewrite the fragmented files on your disk as
this patch helps to avoid fragmentation issues, but it doesn't clean
up fragmentation that's already on the disk).
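The "two writes seen, so reserve more than one block" idea can be sketched as a toy allocator. This is only a model of the heuristic as described above, not the actual ext3-reservations patch: the class `ToyAllocator`, the window size of 8 blocks, and the rule that closing a file forgets its history are all assumptions made for the illustration (unused reserved blocks are simply abandoned on close).

```python
class ToyAllocator:
    """Next-free-block allocator with a per-file reservation window."""

    def __init__(self, window):
        self.window = window          # blocks reserved once a file looks "busy"
        self.next_free = 0
        self.writes_since_open = {}   # file id -> writes seen since open
        self.reserved = {}            # file id -> reserved, unused block numbers
        self.blocks = {}              # file id -> block numbers in write order

    def write_block(self, file_id):
        seen = self.writes_since_open.get(file_id, 0) + 1
        self.writes_since_open[file_id] = seen
        pool = self.reserved.setdefault(file_id, [])
        if not pool:
            # First write gets a single block; later writes reserve a window.
            size = self.window if seen >= 2 else 1
            pool.extend(range(self.next_free, self.next_free + size))
            self.next_free += size
        block = pool.pop(0)
        self.blocks.setdefault(file_id, []).append(block)
        return block

    def close(self, file_id):
        # Closing forgets the write history and drops any reservation.
        self.writes_since_open.pop(file_id, None)
        self.reserved.pop(file_id, None)


def fragments(block_numbers):
    """Count runs of consecutive block numbers."""
    runs, previous = 0, None
    for block in block_numbers:
        if previous is None or block != previous + 1:
            runs += 1
        previous = block
    return runs


def streaming_writers(window, num_files=4, blocks_per_file=64):
    """Several files written in one go, concurrently."""
    alloc = ToyAllocator(window)
    for _ in range(blocks_per_file):
        for file_id in range(num_files):
            alloc.write_block(file_id)
    return fragments(alloc.blocks[0])


def small_deliveries(window, num_files=2, deliveries=10):
    """Open, write one block, close: the mail-delivery pattern."""
    alloc = ToyAllocator(window)
    for _ in range(deliveries):
        for file_id in range(num_files):
            alloc.write_block(file_id)
            alloc.close(file_id)
    return fragments(alloc.blocks[0])


if __name__ == "__main__":
    print("streaming, no reservation:", streaming_writers(window=1))
    print("streaming, window of 8:   ", streaming_writers(window=8))
    print("small deliveries, window 8:", small_deliveries(window=8))
```

In this model the reservation window cuts fragmentation sharply for files written in a single go, but does nothing for the open, small write, close pattern, which matches the limitation described above.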

Expectations for any fix.  One thing I need to make clear is that this
isn't *just* a kernel problem.  It *is* hardware dependent.  A
realistic goal for the RHEL kernels would be to get them back to being
in the same performance range as upstream kernels.  If we can do
better than that, then we happily will, but we can't guarantee to do so.

Part of the answer to this problem, unfortunately, may be a simple
"I'm sorry, but your hard disks suck rocks."  Let me explain this a
little bit.  I know at least one of the people commenting in this bug
was referring to the servers being used for Oracle applications.  I
can't remember what type of disks this person is using, but that is in
fact an important consideration when discussing Oracle workloads. 
There are trade offs present when trying to decide whether to use an
IDE RAID controller like the 3Ware controllers + IDE disks vs.
software RAID and fast SCSI disks vs. fast hardware RAID controllers
using fast SCSI/Fiber Channel disks.  So let me go over some of those
trade offs.

First, whether you are using hardware RAID or software RAID, there are
three different metrics that typically matter under different
conditions.  The first is the typical one, capacity.  How large is the
array when all the disks are put together.  For big file server type
applications where you have a bunch of static files and they don't
change much, this is the primary metric of concern.  Then there is
sequential I/O throughput.  This is usually of concern also on large
file servers that don't change much.  The reason for that is because
only big machines with files that don't change much have a very good
chance of keeping the file data sequential.  If the files change a lot
(such as mail spool files), then it's likely that they won't be too
sequential.  The third is random I/O throughput, or I/O ops per
second. This is most important on things like Oracle workloads where
the data is almost guaranteed not to be sequential.

Now, disk array setups can almost be split into these clean categories:

                                     Capacity  Bandwidth  I/O ops/sec
                                               / $ spent
IDE RAID using huge IDE drives       Best      Mediocre   Abysmal
IDE RAID using lots of small drives  Good      Good       Mediocre
Medium price SCSI + software RAID    Good      V. Good    Good
High price SCSI + software RAID      Mediocre  V. Good    V. Good/Best
High price SCSI + hardware RAID      Mediocre  Good       V. Good/Best

When it comes to random I/O patterns, which is basically what we are
facing in this particular bugzilla, the single most important metric
is the I/O ops per second.  Three factors go into determining what a
RAID array can do in terms of I/O ops per second.  Those are the
rotational speed of the disk (the higher the RPMs, the shorter the
heads have to wait for the data to spin around under them, so the more
total I/O ops it can complete per second), the seek time of the heads
(if you have 64 I/O ops to complete, and each one is in a different
place on the disk, then how fast you can get from spot to spot is a
major contributor to how many ops you can complete), and the total
number of hard disks in the array (the more disks you have, the more
total ops you can complete across the overall array).  Obviously,
since the first type of array in the list uses the fewest number of
disks, and those disks typically have some of the longest seek times
and slowest rotational speeds, that array is truly horrible at random
I/O.  If you have a big web server that's doing nothing but serving
out large, static files, then it's a great type of RAID array.  For an
Oracle setup, it will *never* perform to anyone's reasonable
satisfaction.  A busy mail server falls somewhere in between those
two, since it has a reasonable number of fairly large, static files in
the form of mailboxes with lots of saved messages, but also has lots
of random I/O in terms of new mail messages arriving and being sent. 
The two software RAID types that use SCSI disks will perform quite
well, but they do so at the expense of host CPU power.  If you need
that CPU power to actually do other work, then you may need to step up
to the hardware RAID and SCSI disk subsystem.  An analysis of your
data access patterns prior to purchasing one type of array over
another, with these trade offs and performance guide lines in mind, is
the best way to ensure that the array and the system perform up to par
when finally set up.
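The three factors above can be combined into a back-of-the-envelope estimate. This is a rough sketch under simplifying assumptions: each random op costs one average seek plus half a revolution, ops spread evenly over all disks, and caching, queueing and RAID-level write penalties are ignored. The drive figures in the example are illustrative, not measurements of any drive mentioned in this bug.

```python
def estimate_random_iops(rpm, avg_seek_ms, num_disks):
    """Rough upper bound on random I/O ops per second for an array."""
    # On average the platter has to turn half a revolution before the
    # requested sector passes under the head.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    service_time_ms = avg_seek_ms + rotational_latency_ms
    per_disk_iops = 1000.0 / service_time_ms
    return num_disks * per_disk_iops


if __name__ == "__main__":
    # Illustrative figures only.
    few_big_ide = estimate_random_iops(rpm=7200, avg_seek_ms=8.5, num_disks=4)
    many_fast_scsi = estimate_random_iops(rpm=15000, avg_seek_ms=3.5, num_disks=8)
    print(f"4 x 7200 rpm disks:  ~{few_big_ide:.0f} ops/sec")
    print(f"8 x 15000 rpm disks: ~{many_fast_scsi:.0f} ops/sec")
```

Even this crude model shows why a handful of large, slow disks cannot keep up with a random workload no matter how much capacity or sequential bandwidth they offer.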

So, having said all that, I'm not saying that anyone on this bugzilla
has a RAID array that's unsuitable for their purposes.  I don't know
enough about IDE drive model numbers to be able to tell really big,
but slow, IDE disks from the ones that actually perform well and I
also can't necessarily remember how many of the people on this bugzilla
are even using IDE RAID arrays versus something else.  The reason I
bring this up is just to make sure that people have a rough guideline
in mind for the performance characteristics of their particular drive
setup. That way, when we say "Here, we have this kernel to test" and
you come back and say "It helped, but it only got me to <blah>
performance", then you will know based upon this description whether
or not the improvement you see should be in line with the optimal
performance characteristics of your own hardware setup.  If you happen
to have just a few, really big IDE drives on an IDE RAID array, and
your workload is mainly random I/O, then we are going to try and solve
what problems exist, but your particular problem may just be that you
have an array with a low I/O ops per second rating and you may not
have the same bug/problem that we are tracking in this bugzilla, hence
my comment that for some of you the answer may at least in part be "My
disks suck rocks."

So, that's the basic status update as of today.  Sorry this was so
long, but it's not a simple task to get through all the interrelated
issues.

Final note: Aleksander, regarding the bk tree, I haven't pushed my
latest stuff to the public repo because it's full of all kinds of
different test patches, backouts, and other similar stuff.  It really
isn't suitable for use on anything other than a pure test machine at
this point, I've simply done too much test hackery while working on
this problem.  You're probably best off either running the 15.0.3
kernel for now, or there *may* be a later kernel in the Beta channel
on RHN (maybe .17.EL or something like that).  If that's there, it
will have the security fixes in it as well as all of our planned
changes for the RHEL3 U3 update.
Comment 123 Red Hat Bugzilla 2004-07-16 14:52:41 EDT
I thought that your explanation sounded very plausible until you
started blaming the IO problems exclusively on the disks and I don't
believe that this explanation covers the significant performance
degradation seen.  If this was simply a problem of poor disk choice
given the application, I shouldn't be able to switch to a kernel.org
kernel or reinstall a different distribution to make the problem "get
much better".

Our slow performance problems started when we upgraded from Redhat 7.2
to RHEL 3.0.  Same hardware (Intel SE7500CW2 dual 2.4 GHz Xeon w/HT),
same disks and controller (3-Ware 7506-8 RAID 10), same RAM (3 GB),
etc.  (aka Nothing changed except the OS)

Our subjective performance level dropped from "we are very happy" to
"oh my God, why did we upgrade to RHEL 3.0"

So, we went out and purchased another server...

ASUS P4P800 SE w/ P4 3.4 GHz w/HT, 4 GB RAM, 3-Ware 8506-8 SATA RAID
10.  We tested with Maxtor SATA and Western Digital SATA disks (at
different times, of course).  I can't tell you that the poor
performance was equal to the Xeon system, but it was very poor
compared to our  7.2 system reference point.

Thinking that the problem was the 3Ware card(s), we stopped using RAID
and plugged the disks directly into the motherboard.  IDE on the Xeon
system, SATA on the P4 system.  After yet another reinstall, we
continued to see significant performance problems. 

Based on comments from Bug #124450, it also appears that there is a
significant difference between RHEL 2.1 and RHEL 3.0.

If I was “guaranteed” that purchasing 10000 RPM SCSI drives would
solve the problem, I would.  However, I am not willing to take the
risk of wasting a lot of cash and time only to find out that the
problem exists elsewhere when we see that a different distribution of
Linux, on the same hardware, gives us “good enough” performance.

I would agree that this is very likely to be a hardware compatibility
problem with your current Kernel configuration because if this was
happening for all of the Redhat EL customers, this problem could not
have been allowed to continue for almost 3 months.
 
LaVar
Comment 124 Red Hat Bugzilla 2004-07-16 16:01:57 EDT
LaVar,

You and I are in total agreement.  You aren't saying anything I didn't
already say, even if my meaning and intent wasn't totally clear.  Yes,
performance on some subset of hardware available with RHEL3 is bad. 
Yes, we are trying to fix that.  By saying that I think the problem is
with the disks I wasn't intending to imply fault, only that the disks
are the actual hardware we are having the problem with.  There is
evidently a subset of disks out there that the particular I/O pattern
we are generating basically makes the disk performance fall off of a
cliff.  However, empirical data also tells us there is another subset
of disks out there that deals with our I/O pattern just fine.  The
difference between a set of disks falling off the cliff and a set that
are doing OK but are being used in a way for which they simply aren't
going to perform well is something that I think I can spot rather
easily.  However, experience dictates that out of all the people which
will eventually read this bug report, not all of them will have the
background to tell the difference between the two.  So a good portion
of my last entry was intended to provide enough background to head off
the inevitable false "me too" entries that come from people confusing
the two situations.  Only time will tell how well it will work.

And you are absolutely correct that if this was universal across all
RHEL3 installations that this would have been a stop ship issue and
RHEL3 would never have gone out the door.
Comment 125 Red Hat Bugzilla 2004-07-16 21:10:31 EDT
I realise this is a redhat forum, but I believe that I can contribute
valuable information to this problem by saying I don't BELIEVE this is
a problem with any Red Hat Linux modifications...as I'm running Debian
Linux 3.0 with a vanilla (downloaded from kernel.org and self
compiled) 2.6.7 kernel.  If you aren't interested in my contributions
as I'm not running Red Hat, that's fine, tell me and I won't post
further - if you are interested, let me know and feel free to ask
further questions.
First off my hardware:
Asus CUR-DLS Motherboard (ServerWorks Chipset)
Pentium 3/1GHz Coppermine (only one CPU installed)
640MB ECC RAM
3Com 3C996B-TX 64 bit PCI Gig NIC
3Ware Escalade 7810 RAID Controller
7xWD WD2000JB Hard Drives
9GB Seagate SCSI Hard Drive (boot volume)
I'm experiencing very similar problems to the original poster.  I have
a 3Ware Escalade 7810 RAID Controller with 7 WD2000JB hard drives
running in a RAID 5 config.  They are formatted as a 1.2TB XFS volume.  
This is just my storage array - I'm the only user, and the problem can
be seen with as little file IO as serving a single MP3 via Samba.  The
machine will sit at about 1% cpu utilization all fine, then all of a
sudden IOWAIT will spike to near 100% and the streaming file from
Samba will freeze for about 10 seconds.  It will then go back down as
suddenly as it started, and everything will be normal for a while
longer until it occurs again.
I had the same issue with the 2.6.0 kernel, so it's at least in the
2.6.x line...I've never used 2.4 on this box.
Personally, I'm thinking the issue is specific to the 3Ware RAID
controllers, as brief testing with the SCSI Boot drive does not appear
to result in the system near-freezing, in spite of the IOWAIT number
getting quite high at times.
Perhaps one of the strangest things is when this problem is occurring
and I execute "ps amx|grep D" I see the following output:
PID TTY      STAT   TIME COMMAND 
  2207 ?        -      0:00 /usr/sbin/smbd -D 
  2208 ?        -      0:00 /usr/sbin/nmbd -D 
  2235 ?        -      0:53 /usr/sbin/smbd -D 
     - -        D      0:53 - 
  3910 pts/2    -      0:00 grep D 
Notice the process in the D state has no PID, no TTY, and no command?
 That process only exists while the problem is occurring.
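A possible explanation, offered tentatively: with the "m" option ps prints one row per thread beneath each process, and those thread rows show "-" for PID, TTY and command, so the D row is probably a thread of smbd 2235 rather than a nameless process. A minimal sketch for listing uninterruptible tasks directly from /proc follows. It assumes a 2.6-style /proc with per-thread `task` directories; the function names are made up for the example.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the text of a /proc/<pid>/stat file."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def uninterruptible_tasks(proc_root="/proc"):
    """Yield (process id, thread id, command) for every task in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return
    for entry in entries:
        if not entry.isdigit():
            continue
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            thread_ids = os.listdir(task_dir)
        except OSError:
            continue  # process exited, or no per-thread directory
        for thread_id in thread_ids:
            try:
                with open(os.path.join(task_dir, thread_id, "stat")) as handle:
                    tid, command, state = parse_stat(handle.read())
            except (OSError, ValueError):
                continue
            if state == "D":
                yield int(entry), tid, command


if __name__ == "__main__":
    for pid, tid, command in uninterruptible_tasks():
        print(f"pid={pid} tid={tid} {command}")
```

Running this in a loop while the stall is happening would show which process owns the blocked thread, which a plain `grep D` over ps output cannot.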
I would consider myself an intermediate Linux user - I know enough to
break my system well, but it takes me a few hours of reading Google to
fix it :-)
Even though I'm no expert user, I would be more than happy to aid in
any way possible.  And if you want me to stay out of this thread cause
I'm not running Red Hat, I understand :-)  Thanks!
Comment 126 Red Hat Bugzilla 2004-07-16 21:19:16 EDT
Doug,
  Thanks for the long explanation.  I'm not sure I agree 100% with it
only being related to disk model.  We have multiple systems with the
same disks, and the problem only shows up on the system with the 3ware
RAID.  You also didn't mention the word "write" once.  Some of the
issues may be related to the slower writing on the RAID5 disks, but
the elevator should be handling that.  Our problem is that the system
can be killed by a simple 'cat largeunfragmentedfile'.  No Oracle or
IMAP needed.
  What tool were you using to muck with the disk firmware to check and
change the number of zones?

-R 

Comment 127 Red Hat Bugzilla 2004-07-16 21:25:55 EDT
Er.. that's 
 cat largeunfragmentedfile > copyoffile
Comment 128 Red Hat Bugzilla 2004-07-17 01:07:39 EDT
Although I agree that disks could play a part on performance, I
disagree that they could play a role in such a drastic performance
difference.  The other thing that really makes me wonder about that is
the roller coaster effect that our server seems to be having.  Under
very similar process load our average will go from .1 up to well above
6.0 and then drop from there back down to .1 .... through all of this,
our mail connections were very close in numbers (mail is the primary
function of this server, although web is also in the mix).  Even
measuring in accordance with time of day (ie 5pm is a very busy point
as we host primarily business customers and they are all getting their
last-minute emails out).  Take 5pm for 4 days straight.  First
day, load will be high in that time frame, second, it will be very low.
 I would think that if this was disk performance we would see this
across the board.  My feeling is that the drives are handling the load
just as well under every scenario.  It also doesn't explain why this is
a "3ware thread" ... Granted, there are other controllers showing this
issue but why does 3ware seem to be the lead runner?  We're also
looking at a very broad range of disks under those 3ware cards.  At
this point in our scenario we have moved our 3ware controller and 2
maxtor drives to another server.  We are now running Dual Athlon as
opposed to Dual Xeon.  We have seen an increase in performance,
however, iowait load is still very high for what the server is
performing.  During the transition from one server to the other, I saw
our iowait load skyrocket while the NIC was unplugged.  There was no
way that any mail or web could be processing, I wasn't doing anything
except running top while waiting for an answer from another technician
in my company, and I noticed the load average go up to 2.3 ... 
Remember that I am now in a dual athlon so 2.0 is max load as opposed
to the 4.0 for the dual xeon.

Another question I really have is whether or not this is related to
ext3 or not?  Is anyone experiencing this problem under ext2?
Comment 129 Red Hat Bugzilla 2004-07-20 13:24:08 EDT
Doug, one note:

In the comment #122 you've written about "static files in the form of
mailboxes with lots of saved messages", probably referring to my server.

It doesn't use the mbox format. It uses maildir format for mailboxes,
and most of the I/O load comes from the IMAP server (Courier IMAP).
Each mail folder has a corresponding filesystem directory (a couple of
helper directories, actually), each directory containing messages as
separate files (some of folders contain over 70000 of them).
Additionally, fam is being used.
Comment 130 Red Hat Bugzilla 2004-07-21 18:11:40 EDT
I just got a new machine with a 9500S 3ware card (12 Maxtor SATA drives). There is 
definitely something wrong with the 3ware driver/hardware: while I run mke2fs to a drive 
(or drives in the raid case) *all* read operations completely block until mke2fs finishes, 
even to *other* drives in the controller. I haven't had the time to test with 
tiozone, bonnie++, etc. yet but I'll try them tomorrow morning. 

It seems to me that it's a 3ware specific problem and has nothing to do with the more 
general latency problem in RHEL but i could be wrong. I have other machines here with 
different raid controllers and only the 3ware card shows this problem. 

The machine won't be in production for some time so i'll be happy to run whatever test 
you want.
Comment 131 Red Hat Bugzilla 2004-07-22 20:11:38 EDT
I'd like to add my $.02 worth here. I have been dealing with this
quite a bit and some of my observations could be of interest to some.

I have seen this with RHEL kernel 12.EL in both i686 and x86_64. I can
say that in x86_64 it is much worse.

I have seen this behavior with 3-Ware 8506-8 and 9500-12 cards,
Adaptec 2010S and 2200S U320 RAID cards as well as LSI's Megaraid SCSI
320 RAID card. The behavior exists in both hardware RAID and software
raid (linux md). Both show the problem but it is not as bad in
software RAID mode. I tried both raid0 and raid5. I was able to make
something "passable" using the Adaptec 2010S in jbod mode and running
a software raid5 under x86_64 booting with the noapic option and
running irqbalanced. Luckily boot, root and swap were on another
storage device. This problem is really severe when the system binaries
and swap exist on the storage device with the lag issue. (sorry to
oversimplify) A previous poster mentioned not being able to login, not
surprising when the login binary and all the files in /etc are on the
same affected storage device.

I agree with another previous poster who mentioned never seeing the
issue in RH versions previous to RHEL3, it is true as I have used and
tested all of these hardware devices under pre-RHEL releases with no
issues. I do not feel it is Redhat's fault though. I prefer to see
this as an "improvise, adapt and overcome" issue that we all,
including Redhat face and will overcome together.

I disagree with Doug Ledford on the "disk drive" issue. Doug is right
about certain data streams being unfriendly with specific drives and
firmware versions from time to time. In the case of 3-Ware, Adaptec
and LSI raid cards you have to remember that the OS never really has
direct interface to the drives themselves. The raid controller, even
in jbod mode, is between the drives and OS. Unlike a regular SCSI
(non-RAID) adapter, with a RAID adapter the drives are internal
resources to the RAID card and the OS never really writes directly to
them. So far I have personally seen these issues, as I said above,
affecting multiple models of 3-Ware, Adaptec and LSI raid controllers
and on those controllers have been Western Digital SATA, IBM/Hitachi
SATA, IBM/Hitachi SCSI and Seagate SCSI drives. Far too many variables
to support the "certain data / certain drives" theory.

And as I said above, whatever the cause it is much worse under x86_64
than it is under i686. I am seeing really solid performance in i686
with 3Ware 9500-12 with RHEL3 and 2.6.7 and 3Ware's 2.6 drivers.
Again, just my personal experiences. I am by no means suggesting that
the solution is to abandon RHEL's 2.4 kernels.

Jeff
Comment 132 Red Hat Bugzilla 2004-07-22 21:14:18 EDT
I am using a 9500-12 under x86_64 and you are quite right: the problem makes the
machine completely unusable when the OS is on the same controller. Do you see the
problem in writes only or in reads as well?

Interesting comment about noapic; I am going to give it a try tomorrow to see if it helps.

Comment 133 Red Hat Bugzilla 2004-07-23 13:54:06 EDT
I recompiled the 3w-9xxx.o driver (x86_64) with the "use_clustering :
1," change and am booting with 'noapic pci=noapic noacpi' boot args. I
have run iozone read tests as well as pushing around some 6GB files
(cp, cat, dd) and the system is much more responsive. I cannot speak
to the use_clustering modification's effect on overall system stability
and production worthiness. It is however a big change. I am able to log in
to the box, do long listing on directories where the files are being
pushed around, etc. This was not possible before. I am still seeing
the high io_wait stuff (high 90% on a processor) but the system itself
is more responsive and doesn't lock up.

The high io_wait does occur during reads and writes. A dd of an 8GB
file from a ext3fs to /dev/null hits 97% io_wait on a processor.

I don't see it as a fix but it may shed light in a direction.

Jeff
Comment 134 Red Hat Bugzilla 2004-07-23 18:38:54 EDT
I booted with 'noapic pci=noapic noacpi' (2.4.21-15.0.3.ELsmp x86_64) and my 
hdparm -tT /dev/sda speed jumped from 15 MB/sec to 56 MB/sec. 
Recompiling the 3w-9xxx.o with use_clustering gave a further increase to 82 MB/sec.
With the latest 2.6 kernel rpm from Arjan's page I get ~50 MB/sec without the boot
options and ~80 MB/sec with them.

Bonnie under 2.4.21-15.0.3 (2.6 is a bit better at rewrite and random seeks) gives me:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
tb01             2G           12665   6  3782   1           44529   9 379.2   0
tb01,2G,,,12665,6,3782,1,,,44529,9,379.2,0,,,,,,,,,,,,,



 
Comment 135 Red Hat Bugzilla 2004-07-23 18:59:11 EDT
I'm waiting for a reading from 3-Ware's linux-dev people regarding the
"use clustering" modification and this whole thread in general. I am
seeing results similar to Kostas's and it seems *workable*, but I don't
know how trustworthy it is. How is having the 3w-9xxx driver with "use
clustering" set to 1/ON going to play when the rest of the OS
references the definition which is set to 0/OFF?

I'd like an opinion that would give me cause to go from being
cautiously optimistic to happy.

Jeff
Comment 136 Red Hat Bugzilla 2004-07-23 19:25:48 EDT
Jeff, 

I don't think that having use_clustering set to 1 will cause a problem. From a quick look at
the kernel sources it seems that ENABLE_CLUSTERING is only there for modules to request
"clustering" (whatever that means).

There are modules (e.g. drivers/usb/storage/scsiglue.c) that have
use_clustering: TRUE, and they work fine. (I'll try to rebuild a machine with a different RAID
card to see if it makes any performance difference.)

In any case performance is still awful; at the moment I can't even imagine using the
machine in a production environment.
Comment 137 Red Hat Bugzilla 2004-07-23 20:02:04 EDT
I am getting interesting results with iommu=force

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
tb01.hep.ph.ic.a 2G           55038  25  9635   4           85784  20 552.4   1
tb01.hep.ph.ic.ac.uk,2G,,,55038,25,9635,4,,,85784,20,552.4,1,,,,,,,,,,,,,
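
The comma-separated line that bonnie prints under each table is easier to compare across runs than the table itself. The following is a minimal sketch, assuming the field order of bonnie++ 1.03 CSV output (machine, size, then K/sec and %CPU pairs for per-char output, block output, rewrite, per-char input, block input, and seeks); the field names are made up for illustration. It compares the line from comment 134 with the iommu=force line above. The two runs may differ in more than the iommu option, so the ratios are only a rough indication.

```python
# Assumed field order for the first 14 columns of a bonnie++ 1.03 CSV line.
# The names are illustrative, not part of bonnie++ itself.
FIELDS = [
    "machine", "size",
    "out_chr_kps", "out_chr_cpu",
    "out_block_kps", "out_block_cpu",
    "rewrite_kps", "rewrite_cpu",
    "in_chr_kps", "in_chr_cpu",
    "in_block_kps", "in_block_cpu",
    "seeks_ps", "seeks_cpu",
]


def parse_bonnie_csv(line):
    """Turn one bonnie CSV line into a dict; skipped tests become None."""
    values = line.strip().split(",")[:len(FIELDS)]
    record = {}
    for name, raw in zip(FIELDS, values):
        if name in ("machine", "size"):
            record[name] = raw
        else:
            record[name] = float(raw) if raw else None
    return record


before = parse_bonnie_csv("tb01,2G,,,12665,6,3782,1,,,44529,9,379.2,0,,,,,,,,,,,,,")
after = parse_bonnie_csv(
    "tb01.hep.ph.ic.ac.uk,2G,,,55038,25,9635,4,,,85784,20,552.4,1,,,,,,,,,,,,,"
)

# Print how much each measured figure changed between the two runs.
for key in ("out_block_kps", "rewrite_kps", "in_block_kps", "seeks_ps"):
    print(f"{key}: {after[key] / before[key]:.2f}x")
```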

#dmesg
....
Checking aperture...
CPU 0: aperture @ 1e40000000 size 131072 KB
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
Mapping aperture over 65536 KB of RAM @ 4000000
....
EXT3 FS 2.4-0.9.19, 19 August 2002 on sd(8,8), internal journal
EXT3-fs: mounted filesystem with writeback data mode.
 I/O error: dev 08:00, sector 4278190072
 I/O error: dev 08:00, sector 4278190072
 I/O error: dev 08:00, sector 4278190072

I think the problem that Jeff and I have might be x86_64 specific.
Comment 138 Red Hat Bugzilla 2004-07-27 19:23:41 EDT
Interesting Data Point!

What is different in Mandrake 10's 2.4.25-2mdk which RHEL3 lacks or
has added?  I will post numbers later.

Using identical hardware and identical SAN storage arrays accessed:

RHEL 3: 2.4.21-4.0.1ELsmp i686
Mdk 10: 2.4.25-2mdk smp i586

RHEL3 has stacked latency as bad as 150s on high loads.
Mdk10 has stacked latency as bad as 10s on high loads.

Both cases use the unmodified bdflush settings from boot.

With bdflush changed to 1 500 0 0 1000 3000 15 20 0:

RHEL3 has stacked latency as bad as 30s.
Mdk10 has stacked latency as bad as 2s.
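
To make the nine bdflush numbers above easier to read, here is a minimal sketch that labels them. The field names follow the layout commonly documented for vanilla 2.4 kernels (fs/buffer.c); whether the RHEL3 kernel uses exactly the same layout is an assumption, as is the idea that interval and age_buffer are in jiffies with HZ=100.

```python
# Assumed field layout of /proc/sys/vm/bdflush on a 2.4 kernel.
BDFLUSH_FIELDS = [
    "nfract",               # % of buffer cache dirty before bdflush is activated
    "ndirty",               # max dirty blocks written per wake cycle
    "unused2",
    "unused3",
    "interval",             # delay between kupdate flushes (assumed jiffies)
    "age_buffer",           # age before a dirty buffer is flushed (assumed jiffies)
    "nfract_sync",          # % dirty before writers are throttled synchronously
    "nfract_stop_bdflush",  # % dirty at which bdflush stops
    "unused8",
]


def parse_bdflush(text, hz=100):
    """Label the nine bdflush values and convert the two timers to seconds."""
    values = [int(v) for v in text.split()]
    if len(values) != len(BDFLUSH_FIELDS):
        raise ValueError("expected %d values" % len(BDFLUSH_FIELDS))
    params = dict(zip(BDFLUSH_FIELDS, values))
    params["interval_seconds"] = params["interval"] / hz
    params["age_buffer_seconds"] = params["age_buffer"] / hz
    return params


tuned = parse_bdflush("1 500 0 0 1000 3000 15 20 0")
for name in ("nfract", "nfract_sync", "interval_seconds", "age_buffer_seconds"):
    print(name, tuned[name])
```

Under those assumptions, the tuned setting starts background flushing at 1% dirty buffers and throttles writers at 15%.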

More data to come ...
Comment 139 Red Hat Bugzilla 2004-07-27 19:50:25 EDT
Is that with the same vm options?
 
I am running tiobench at the moment with different values for 
inactive_clean_percent, bdflush, pagecache but it hasn't finished yet.
Comment 140 Red Hat Bugzilla 2004-07-31 21:39:00 EDT
Not quite the same as above, but it is happening to me.  I have a Compaq
ML350 G3, dual 2.8GHz, 642 RAID controller, 3x72GB SCSI hard drives,
and have all the problems that have been stated on this forum (high IO
wait times: create a 10GB file with dd and the system nearly stops, 30sec
for an ls on a small directory).  I am now halting putting this server
into production pending the outcome of this.  The latest 2.4.21-15El3
doesn't help me at all.

I am going to post my own trouble ticket as it doesn't fully match the
above, but wanted to let you know it happens on SCSI disks without 3ware
RAID. This was using HP's latest EL3 drivers (7.1).
Comment 141 Red Hat Bugzilla 2004-08-10 13:05:56 EDT
Has anyone solved this problem?
Comment 142 Red Hat Bugzilla 2004-08-13 13:02:58 EDT
Could this possibly be related to bug #109420?  I have similar
experiences on my desktop, which is showing horrible
latency with the latest kernel:  2.4.21-15.0.4.EL

Maybe there are two problems, one raid related and another problem
either with the kernel scheduler or the IO elevator.
Comment 143 Red Hat Bugzilla 2004-08-17 10:39:33 EDT
I have a server with dual 2.8 GHz Xeon, and two 3ware 9508 eight port
SATA cards, running the 2.4.21-4.ELsmp kernel, with all the same
problems. Once in a while IOwait goes high and the machine becomes
completely unresponsive. Since the machine is an NFS server this makes work impossible.
These 9508 cards employ a new 3ware driver (3w-9xxx) rather than the
standard 3w-xxxx.

I have two further machines with the older 7504 cards and ATA drives,
one running 2.4.21-15.EL (also affected) and the other one running
Redhat 8.0 (kernel 2.4.20-20.8smp); the latter machine is working fine.
Comment 144 Red Hat Bugzilla 2004-08-19 19:30:21 EDT
We're seeing very similar problems with a Dell PowerEdge 700 with a
CERC SATA card.  Others are too: see bug 129545.  We also have an
active Red Hat support ticket on this problem: ticket 354372.  And
there's a similar post to one of the Dell support forums: 

http://forums.us.dell.com/supportforums/board/message?board.id=pes_hardrive&message.id=15850


Comment 145 Red Hat Bugzilla 2004-08-24 12:42:51 EDT
I am running a Xeon 2.6 with a 3ware 8506-12 and 12 Maxtor Diamond Plus
250 GB hard disks using LVM and Reiser filesystems.  We haven't noticed
any issues till today when we started doing some disk intensive stuff.
Here is the output of iostat, which shows the same iowait people are
seeing in this thread and low throughput.  Interestingly
enough, we're actually running Linux Crux with a 2.6.5 SMP kernel.

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00    7.69   82.38    9.93

Device:      rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await    svctm    %util
hda            0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sda            0.00  1319.00    16.00    98.50   120.00 11340.00    60.00  5670.00   100.09     8.10    67.62     8.71    99.70
sdb            0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
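
The sda row above can be cross-checked with a little arithmetic. This is a minimal sketch using the usual relationships between iostat columns (average request size is sectors per second divided by requests per second, and utilization is requests per second times service time); iostat's own calculation may differ in detail, and the function name is made up for illustration.

```python
def derived_iostat(r_per_s, w_per_s, rsec_per_s, wsec_per_s, svctm_ms):
    """Recompute avgrq-sz, %util and write throughput from the raw columns."""
    requests_per_s = r_per_s + w_per_s
    avg_request_sectors = (rsec_per_s + wsec_per_s) / requests_per_s
    # Each request keeps the device busy for svctm_ms milliseconds;
    # 1000 ms of busy time per second of wall time is 100% utilization.
    util_percent = requests_per_s * svctm_ms / 1000.0 * 100.0
    # Sectors are 512 bytes.
    write_mib_per_s = wsec_per_s * 512 / (1024 * 1024)
    return {
        "avg_request_sectors": avg_request_sectors,
        "util_percent": util_percent,
        "write_mib_per_s": write_mib_per_s,
    }


# Values copied from the sda row.
sda = derived_iostat(16.00, 98.50, 120.00, 11340.00, 8.71)
for name, value in sda.items():
    print(f"{name}: {value:.2f}")
```

The recomputed figures agree with the reported avgrq-sz and %util columns: the array is essentially 100% busy while writing only about 5.5 MiB/s in requests of roughly 50 KiB each.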


Any resolution to this problem will probably help me on my platform as
well.

Thanks

Jason
Comment 147 Red Hat Bugzilla 2004-08-27 15:27:23 EDT
I am anxiously awaiting the solution!  I have this problem on an
x86_64 system with the following configuration:
motherboard: Arima HDAMA, Dual 100/1000, 8x Memory Slots
CPU: (2) Opteron 240, 1.4Ghz
System Disk: 80GB 7200 RPM IDE
RAID: 8 @ 200GB 7200 RPM SATA Drives
Memory: 3GB Registered ECC DDR 333, PC2700, 6x 512MB
3-Ware 8506-8, 8-port SATA RAID Card
CDROM
Linux Redhat 3 U2 x86_64 AS 
Comment 148 Red Hat Bugzilla 2004-08-27 16:21:29 EDT
We are all waiting on a solution.  Unfortunately we haven't had as much
as an update in over a month.  I'm tired of having to answer to my
managers as to when one of our vendors is going to have a solution.
Unfortunately their view of Red Hat has gone down, as this bug has
been known for a long time.  Hopefully we have a solution soon so I
can again try to get my company to resell Red Hat Linux.
Comment 149 Red Hat Bugzilla 2004-08-31 18:05:10 EDT
Why is this bug's severity and priority only 'normal'?  I have to
drive a 1/2 hour to the office EVERY TIME THE SYSTEM HANGS.  This is
sometimes during the work week but is often on my weekends. Also, this
is our Email and DNS server that has this problem.  There is a chance
that we could lose valuable email and data with every hang.  

Please increase the severity and priority and get this resolved.   
Comment 150 Red Hat Bugzilla 2004-09-01 00:37:24 EDT
I completely agree, only my trips are 4 1/2 hour drives.  I have 
already had to make that trip on an emergency basis because our 
COMPLETE ISP was down as a result of this bug.  Thankfully we have 
remote RPC units, however, when the iowait problem creates other 
issues it requires a little more effort.  It's a shame that we had to 
pay for this kind of service.  Other linux distros either have very 
little of this problem or none of this problem and they are free.  
The impression my company had with Red Hat was that since we are 
paying for a product it would be well supported, which we are 
learning is not true.  The worst thing is not the fact that there is
no solution; it's that Red Hat isn't even taking the time to update
us on what the status is.

------- Additional Comment #122 From Doug Ledford on 2004-07-16 13:33 -------

This was the last status report we had from Red Hat.  The rest has only been more issues.
Comment 151 Red Hat Bugzilla 2004-09-01 12:00:57 EDT
This bugzilla has become an accumulation of several different
problems.  There are multiple people at Red Hat working on these
problems as a high priority. It is an unusually complex issue,
involving multiple components at several layers in the system.  It is
also organizationally complex because there are several bugzillas as
well as reports through our formal support channels.  I will try to
post the status of these efforts more frequently to this BZ in the future.

That said, we do have a promising avenue ready for you to test.  Larry
Woodman identified the following bug in wakeup_kswapd() 

In wakeup_kswapd() we have:

        /* ... and check if we need to wait on it */
        if ((free_low(ALL_ZONES) > (kswapd_minfree / 2)) && !kswapd_overloaded)
                schedule();
.....

where free_low() returns an int that can be a large negative number and
kswapd_minfree is an unsigned int, so the if statement converts the
negative free_low() result to a large unsigned value.  This results in
processes sleeping when they should not.
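
The effect of that conversion can be reproduced outside the kernel. The following is a minimal sketch in Python that uses ctypes.c_uint32 to imitate what C does when an int is compared with an unsigned int. The function names and the sample numbers are made up for illustration, and the signed variant only shows what the comparison would yield in signed arithmetic; it is not the attached patch.

```python
import ctypes


def buggy_should_wait(free_low, kswapd_minfree, kswapd_overloaded):
    """Mimic the C comparison: the signed left side is converted to unsigned."""
    lhs = ctypes.c_uint32(free_low).value            # a negative int wraps around
    rhs = ctypes.c_uint32(kswapd_minfree).value // 2
    return lhs > rhs and not kswapd_overloaded


def signed_should_wait(free_low, kswapd_minfree, kswapd_overloaded):
    """The same test evaluated with ordinary signed arithmetic."""
    return free_low > kswapd_minfree // 2 and not kswapd_overloaded


# Hypothetical values: free_low() is far below zero, so no wait is needed.
print(ctypes.c_uint32(-5000).value)
print(buggy_should_wait(-5000, 1024, False))
print(signed_should_wait(-5000, 1024, False))
```

With a negative free_low value the unsigned comparison decides to wait (and so calls schedule()), while the signed comparison does not.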

I will attach the patch here. Later today I will post the location of
test kernels you can download.

This patch has resolved an I/O problem that we reproduced.  It is
almost certain that this will not solve all of the different problems
mentioned in this BZ.  Please test this if you are able and let us
know your results. Thanks.

Comment 152 Red Hat Bugzilla 2004-09-01 12:04:57 EDT
Created attachment 103341 [details]
Patch to prevent wakeup_kswapd() from blocking when it shouldn't.
Comment 153 Red Hat Bugzilla 2004-09-01 16:33:41 EDT
The kernels with the kswapd fix described above are located at:

http://people.redhat.com/~lwoodman/.RHEL3/

Larry Woodman


Comment 154 Red Hat Bugzilla 2004-09-01 16:37:52 EDT
Thanks Larry,
Can you also provide an SMP x86_64 version of this patched kernel?
Comment 155 Red Hat Bugzilla 2004-09-01 20:36:48 EDT
The kernels for x86_64, ia32e, and ia64 are now on:

http://people.redhat.com/~coughlan/.rhel3-u2-kswapd-fix/

Tom
Comment 157 Red Hat Bugzilla 2004-09-02 14:15:48 EDT
After installing the SMP x86_64 test kernel provided by Tom, I ran a
couple of tests and watched with top.

the test:
    tar and zip the contents of one of the RAID partitions to /tmp;
    at the same time, run a find on another large filesystem that has
    lots of files.

the result:
    iowait did reach a max of 99% on both processors but did not
    stay at that level very long. When the tar and find completed, the
    iowait returned to 0.0%.
    3rd time through: the iowait peaked at near 100% on both processors
    and held that for approx. 15 seconds (the find command was halted
    and the tar command had completed). It eventually did return to normal.
    4th time: both processes lagged periodically, but the iowait didn't
    max out until both the find and the tar were completed. It held a steady
    99.9% iowait on both processors for approx. 15 seconds before it eased
    back down to under 1%.

This is definitely a big improvement, but should it ever reach those
high levels and hold them for more than a moment?  

Comment 158 Red Hat Bugzilla 2004-09-02 15:46:38 EDT
OK, first of all, iowait time being up around 100% is not an
indicator of a problem; it's normal on an IO-bound system.  The system
takes timer interrupts every 10ms (1/100 sec) and determines what the
system is doing by where the PC was when the interrupt occurred: user
code, system code, or idle loop.  That 10ms time slice is charged to
whatever was happening at the time.  The idle loop is actually split
into 2 categories: idle and iowait.  If the system was in the idle
loop and there was at least one IO operation outstanding, that 10ms
slice is charged as iowait instead of idle.  So, if you have a single
program running that does nothing but read() calls, the system will
pretty much show up as 100% iowait because there was an IO outstanding
when every interrupt occurred.  The reason you never saw this before
is that the splitting of the idle slice into idle and iowait is new
in RHEL3.
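
The accounting rule described above can be sketched as a small simulation. This is a toy model, not kernel code: each sample stands for one 10ms timer tick, and the state names and sample workloads are made up for illustration.

```python
from collections import Counter


def charge_ticks(samples):
    """Charge each timer tick to user, system, idle or iowait.

    Each sample is (state, outstanding_io): where the CPU was when the
    timer interrupt fired, and how many IO operations were in flight.
    """
    totals = Counter()
    for state, outstanding_io in samples:
        if state == "idle" and outstanding_io > 0:
            totals["iowait"] += 1   # idle, but something is waiting on the disk
        else:
            totals[state] += 1
    return totals


def percentages(totals):
    """Convert tick counts to percentages of all ticks."""
    ticks = sum(totals.values())
    return {state: 100.0 * count / ticks for state, count in totals.items()}


# One program blocked in read() for a full second: the CPU is always idle
# with one IO outstanding, so every tick is charged to iowait.
reader_only = [("idle", 1)] * 100
print(percentages(charge_ticks(reader_only)))

# A mixed second: some user and system time, some true idle.
mixed = [("user", 0)] * 20 + [("system", 1)] * 10 + [("idle", 1)] * 50 + [("idle", 0)] * 20
print(percentages(charge_ticks(mixed)))
```

In the first case the CPU does no work at all, yet the report shows 100% iowait, which is the point being made about an IO-bound system.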

Second, I think what is going on with this poor system performance
during disk IO activity is that when a process does lots of disk IO
it will eventually run the system low enough on memory that
__alloc_pages() will call wakeup_kswapd() because the
free_inactive_clean count falls below the low watermark (this is the
normal steady state that the system enters under load).  Since
wakeup_kswapd() has the bug described above, the process will block and
context switch to kswapd, which will free a page, wake the process
back up, and context switch back to that process.  This context
switching between user processes and kswapd (and keventd) is causing
the device queues to be plugged and unplugged much more frequently
than when there is no rapid context switching occurring.  This causes
the disk IO elevator algorithm to fail much more frequently and smaller
IO operations to be started.
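
Why frequent unplugging leads to smaller IO operations can be illustrated with a toy model. The sketch below is not the kernel's elevator: it only merges a new request onto the end of the previous one when the sectors are contiguous, and it sends everything to the device each time the queue is "unplugged". The function name and the request sizes are made up for illustration.

```python
def dispatch(requests, unplug_every):
    """Merge contiguous requests while plugged; flush every N arrivals.

    requests: list of (start_sector, sector_count) in arrival order.
    Returns the list of requests that actually reach the device.
    """
    dispatched = []
    pending = []
    for arrival, (start, count) in enumerate(requests, 1):
        if pending and pending[-1][0] + pending[-1][1] == start:
            # Contiguous with the last pending request: grow it.
            pending[-1] = (pending[-1][0], pending[-1][1] + count)
        else:
            pending.append((start, count))
        if arrival % unplug_every == 0:
            # Queue unplugged: whatever has been built so far goes out.
            dispatched.extend(pending)
            pending = []
    dispatched.extend(pending)
    return dispatched


# 64 sequential 4KB writes (8 sectors each).
sequential = [(i * 8, 8) for i in range(64)]

for unplug_every in (64, 8, 2):
    sent = dispatch(sequential, unplug_every)
    print(unplug_every, len(sent), sent[0][1])
```

The same 64 writes reach the device as one large request when the queue stays plugged, and as 32 small ones when it is unplugged after every second arrival.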

Larry Woodman
Comment 159 Red Hat Bugzilla 2004-09-02 16:56:16 EDT
*** Bug 130357 has been marked as a duplicate of this bug. ***
Comment 160 Red Hat Bugzilla 2004-09-04 21:45:31 EDT
Well, the system hung again last night with the test kernel provided.
So that didn't solve the problem.

Please post any other solutions.
Comment 161 Red Hat Bugzilla 2004-09-06 01:18:26 EDT
Created attachment 103499 [details]
This patch was all wroog (removed)

Someone has been kind enough to fix the behaviour already in the 2.6-series 3w-9xxx
driver. I just merged some parts of it so that twa_scsi_queue() no longer
returns zero (0) every time.
Comment 162 Red Hat Bugzilla 2004-09-06 01:22:30 EDT
This was supposed to be _above_ the attachment. I am not good with
Bugzilla, so I was foolish enough to try to attach before saving ...


For what it's worth, I have been struggling with this 3ware issue lately too.
I own a 9500S-8 and have 8x250GB SATA disks attached to it. For
me x86-64 worked much better than i386, so when I did some
testing under i386, I started to look at the code more carefully.

What I did was apply the above kswapd patch to the 2.4.21-20 series
RHEL3 kernel and make the attached patch by looking at the code currently in
the 2.6-series kernel (namely 2.6.8.1).

I am not an expert on kernel internals, but as I see it, the function

twa_scsi_queue()

returns '0' even when the queue is full. This does not seem right to me, so
I modified the code.

This system has now been running for two hours; it is receiving about
40MB/s sustained as an NFS server and uses the CPU cycles left over for
compiling a kernel at nice +20.

Before, I could not get it to run even 15 minutes before I triggered the '100%
iowait and system very much boned' symptoms.

Is the kernel scsi_queue code trying to requeue requests like crazy, with
3w-9xxx returning '0' every time even when the queue is full, or is it just
the kswapd patch which is curing the damn thing?

The driver I use is from the 3ware '902-bundle':

3ware 9000 Storage Controller device driver for Linux v2.24.00.011fw.
3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfc002000, IRQ: 21.
3w-9xxx: scsi0: Firmware FE9X 2.02.00.012, BIOS BE9X 2.02.01.037, Ports: 8.

The box is a normal dual-PIII (Intel server board STL2 or something).
The box has a 64bit/66MHz PCI bus, but PC-133 SDRAM isn't
exactly flying when we talk about speeds like 100MB/s for I/O.




Comment 163 Red Hat Bugzilla 2004-09-07 12:06:09 EDT
Comment on attachment 103499 [details]
This patch was all wroog (removed)

This was all wrong and has been removed. By accident, it wasn't even enabled in
the kernel I was running when I submitted it. Later I also realized
that it wasn't actually working for the 2.4 series as expected.
Comment 164 Red Hat Bugzilla 2004-09-08 10:57:19 EDT
Am I accurate in saying that this issue has been resolved in the linux
kernel version 2.6.x?  

I am getting advice to drop Redhat in favor of Fedora Core 2, but if
it is just a question of installing the new 2.6 kernel, why not just
do that?  

Redhat folks, is there a dev version of the 2.6 kernel available in
RPM format? Has that been tested for this bug?  I am also seeing that
I would need to get a new version of modutils called
module-init-tools.  I am just hoping that installing these new
modutils does not cause problems with the 2.4 kernel, in case a patch
is ever released.  Experiences?  Troubles?  Install Fedora?
Comment 165 Red Hat Bugzilla 2004-09-08 11:06:48 EDT
Michael, we are working on fixing the issue in the RHEL3 kernel. If
you need to have a supported configuration (eg. 3rd party apps) you
will have to be somewhat patient; it's easy to fix a bug, but doing so
without introducing any new bugs isn't ;)))

However, if you want to try the 2.6 kernel on RHEL3 or FC1, you will
also need some additional RPMs.  You will be able to find the kernel
2.6 RPMs and the needed RPM upgrades on:

http://people.redhat.com/arjanv/2.5/
Comment 166 Red Hat Bugzilla 2004-09-08 16:09:10 EDT
Re: comment 164

The "modutils" package available at
http://people.redhat.com/arjanv/2.5/ also has module-init-tools, so if
you upgrade to that modutils package, modules will work under both 2.4
and 2.6.

Once you have the new modutils package installed, the main issue to
watch out for (in my experience anyway) is that you may need to change
your XF86Config to use /dev/input/mice as your mouse device. (This may
be an issue if you're using a PS/2 mouse. If you're using a USB mouse
then you almost certainly won't need to change your XF86Config.)

I hope this helps...
Comment 167 Red Hat Bugzilla 2004-09-09 12:44:09 EDT
"Rik van Riel on 2004-09-08 11:06" said that we need to be "somewhat
patient".  It has been almost 5 months, how much longer do we need to
wait?

Why can't we get a specific update on the progress of fixing this bug?

Since Redhat was built upon the concept of an open source community,
wouldn't it be better to include all of the interested parties with
details of actions that you are taking so that we could contribute to
our shared goal of fixing this problem?

Who can we contact to give this problem the attention that it deserves?

LaVar
Comment 168 Red Hat Bugzilla 2004-09-09 12:52:11 EDT
LaVar,

the problem is that there isn't a single bug underlying this problem,
but rather several interactions between various subsystems, each of
which needs subtle changes.

The support people are coordinating the testing of fixes with various
customers, but this happens inside the support system, not bugzilla.
Bugzilla is mostly a system to gather the info that engineering needs
to fix a bug; what you want is probably support's tracking system,
Issue Tracker.

If you want regular status updates from the people who coordinate
things, please open a ticket with the support people:

https://www.redhat.com/apps/support/
Comment 169 Red Hat Bugzilla 2004-09-09 13:54:12 EDT
I opened another request with support today.  I started using bugzilla
in an attempt to get this problem resolved.  Following are the results
of my two previous interactions with support, where my requests about
this issue were quickly closed:

"The 3Ware 8 hundred series is as of the moment unsupported. The 7500s
work but this line still does not have certified drivers inside the
kernel."

and

"We do not cover the recompilation of the kernel because we don't want
you to use a kernel that we did not release."

I will post results of any progress made relating to this problem.

LaVar
Comment 170 Red Hat Bugzilla 2004-09-09 15:22:02 EDT
I can't seem to get the latest Fedora Core 1 smp  x86_64 kernel to
work.  It keeps panicking.  The output from the panic is very cryptic
and I just can't have the server down long enough to figure out why.  I
also tried to build a straight linux 2.4 kernel with no success
(the kernel panics when it reaches the insmod portion of boot).

Getting another kernel working in the meantime may be an off-topic
thing, but if anyone can assist me, I would appreciate it.  As it stands,
this weekend (Sept 11 & 12) is my only window of opportunity to
stabilize our system.  If there is no patch or kernel to at least keep
the system from freezing, I will need to rebuild the server with
another distribution. I REALLY would rather not do that, but I am
running out of options.

 
Comment 173 Red Hat Bugzilla 2004-09-09 17:25:30 EDT
I have been focusing on identifying the parts of this problem that may
be specific to the 3w-xxxx driver or hardware. There are a number of
other people focusing on the VM subsystem, and block layer tuning.  

I have reproduced a situation on the 3ware where vmstat shows that
there is little I/O occurring, but I/O wait is 100%. In my case, the
system remains reasonably responsive and does not hang, but there may
be enough here to figure out what is wrong.  If not, I intend to
continue to vary the workload and configuration until the problem is
reproduced and a solution is found.
Comment 174 Red Hat Bugzilla 2004-09-09 23:32:24 EDT
I've been mostly looking at whether there is something wrong with the
3w-9xxx driver (it's amazingly similar to the 3w-xxxx driver, though).

I can easily reproduce a situation where the I/O wait is 100%, but
there is still some 200-500kB/s of disk I/O left. No 'hangs' per se, but the
system really isn't responsive.

I have been adding a lot of debug hooks to the 3ware driver to see if it fails:
no. Then I looked at the mid layer, and it seems to be queueing at least
some stuff as things go forward. SCSI logging doesn't seem to give
any hint AFAIK, but I am NOT a kernel expert and neither do I claim to be
one.

I thought I had nailed it with the high-level mods. I actually took out
the 'enterprise code hook' from kupdated (fs/buffer.c), as in

do_io_postprocessing();

It made the problem harder to trigger, and after also tuning the bdflush
parameters 'as in the vanilla 2.4 tree', I wasn't able to trigger it
locally any more. After adding load from NFS clients I eventually triggered the
situation.

So is it possible that the upper levels are just messing up the I/O
scheduling somehow? It sure seems like that's the case, as the I/O is
still flowing all the time, but very, very, very slowly. Stopping all
the processes causing the high I/O and syncing buffers releases
the I/O wait situation.

Without the mods, running tiobench.pl (0.3.3) in a loop usually triggers it
in the 8-thread writing phase. When you just run that in a loop it's most
commonly triggered around loop 8-10. Then the system recovers when
tiobench syncs and starts over. So mostly it seems to be related to
'several parallel threads competing for disk I/O'.

The same happened to the box when there were boxes competing for I/O as
NFS clients.

I did briefly (for a few hours) try to trigger it on ext2 and didn't
succeed, so now I am backing out the mods and going back to ext2 to see
if I can trigger it there at all.


Comment 175 Red Hat Bugzilla 2004-09-10 08:40:03 EDT
With ext2 it was much harder to trigger anything that 'starves the io
scheduler'. There were some noticeable 'hangs for 10 secs or so', but
generally with ext2 the performance is quite stable: 40MB/s writes and
70-100MB/s reads.

Then I unplugged the system and put it on a dual-P4/Xeon (HP ProLiant).
There the write performance is 'some max 10MB/s' regardless of what the
tunings are. Reading is much faster than on the dual-PIII, as expected,
in the near-180MB/s range, but the writing just hangs immediately when there
is a considerable amount of writing.

Just few thoughts for today ....
Comment 176 Red Hat Bugzilla 2004-09-10 11:36:51 EDT
I am not sure if this means anything, but my backups finished in half
the time of normal (this includes a full backup of our largest RAID
partition on the troubled system).  The current kernel is
2.4.21-20.ELsmp with the 3ware driver from version set 7.7.1.  This
was for the 8506 series RAID controller.  I compiled the driver and
just replaced the one in the
/lib/modules/2.4.21-20.ELsmp/kernel/drivers/scsi directory. 

It could be a fluke.  
Comment 177 Red Hat Bugzilla 2004-09-13 11:15:04 EDT
I moved my home server from a Compaq XP1000 (UP Alpha EV67-667 
Processor) to a new dual 3.2GHz Xeon box last week and I just took 
out the 3ware 7506 + 4 WD250 drives of the Alpha box and plugged it 
into the Xeon box. Well, I'm suddenly seeing the iowait problem on 
the Xeon box as well and performance is terrible. The Raid5 array was 
running without any problems on the Alpha under kernel 2.6.5 and I 
never had high latencies, bonnie reported write 45MB/s read 120 MB/s. 
On the Xeon its <10MB/s for write and about 30MB/s read, impossible 
to ssh to the machine while bonnie is running. Kernel is 2.6.9-rc1, 
dist. is FC2. So this problem is in no way connected to the
general 3ware performance, as someone suggested. It really appears to
be a (nasty) bug.
Comment 178 Red Hat Bugzilla 2004-09-13 11:21:22 EDT
Oh, and I forgot to mention that I found it quite strange that (even
though I turned off swap for the test, so really no active swap space)
kswapd appeared to wake up frequently and used CPU time (1%). Maybe
not relevant, but I wanted to mention it.
Comment 179 Red Hat Bugzilla 2004-09-21 16:12:58 EDT
Is anyone else on this bug seeing very strange 'iostat' behavior
related to this bug?  Do you see 'nan's show up in the output of
something like

iostat -x /dev/sda 1 86400 

?
Comment 180 Red Hat Bugzilla 2004-09-23 10:50:41 EDT
I just wanted to report that I have tried the patch posted on
September 1st, which I applied to the RHEL3-U3 kernel
(2.4.21-20.ELsmp), and although it hasn't completely fixed the
latencies that I have noticed, it has made a noticeable improvement.

Have any more bugs been found which are contributing to this
problem?  Any more patches ready to be posted and tested?
Comment 181 Red Hat Bugzilla 2004-09-24 18:37:13 EDT
Larry Woodman located a problem that causes heavy swapping when there
is a large file system load (e.g. when large files are copied). When
the pagecache fills up to the point of memory reclamation, the system
incorrectly swaps out inactive dirty anonymous pages even though the
pagecache is over /proc/sys/vm/pagecache.maxpercent. Bugzilla 132155.

The fix is to add code in launder_page() to reactivate anonymous pages
if the pagecache is over maxpercent.  Several people report that this
patch significantly reduces or eliminates the swapping when copying
large files around. The patch is attached.
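
The decision described above can be sketched as follows. This is only an illustration of the stated rule, not the attached patch: the function name, its arguments and the return values are hypothetical, and the real launder_page() handles many more cases.

```python
def launder_decision(page_is_anonymous, pagecache_pages, total_pages, maxpercent):
    """Decide what to do with an inactive dirty page during reclaim.

    Returns "reactivate" for an anonymous page while the pagecache is over
    its configured maximum share of memory, so that pagecache pages are
    reclaimed instead of anonymous memory being swapped out.
    """
    pagecache_over_limit = 100 * pagecache_pages > maxpercent * total_pages
    if page_is_anonymous and pagecache_over_limit:
        return "reactivate"
    return "launder"


# Hypothetical machine with 1000 pages and a 80% pagecache limit.
print(launder_decision(True, 900, 1000, 80))    # pagecache at 90%
print(launder_decision(True, 500, 1000, 80))    # pagecache at 50%
print(launder_decision(False, 900, 1000, 80))   # a pagecache page
```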
Comment 182 Red Hat Bugzilla 2004-09-24 18:41:06 EDT
Created attachment 104296 [details]
reduce swapping during excessive pagecache use
Comment 183 Red Hat Bugzilla 2004-09-24 19:00:49 EDT
We have done some performance tests to determine what
portion of the I/O performance problems reported in this
Bugzilla may be specific to the 3ware. As mentioned earlier,
there are likely to be several problems here.  The goal of
the following is to isolate the impact of just one of them.

Thanks to Joe Salisbury <jts@redhat.com> for much of this work.

###########################################################
Summary of testing performed against the 3Ware 8506 adapter
###########################################################
Some testing was performed using the tiobench benchmark.
The tests were performed on a system with a hyperthreaded Xeon CPU
running at 2GHz.  The storage sub-system initially consisted
of a 3Ware 8506 adapter, which was attached to three SATA disks.
The three disks were configured as RAID 5.  The 3Ware
adapter had write cache enabled and used a 64K stripe size (the
default).  Rawio was used to read and write to the RAID 5 device. This
removes factors related to the VM and the filesystem.

The 3Ware adapter achieved "reasonable" results while
performing sequential reads and writes with one
thread. Random writes perform very poorly, even considering
the RAID 5.

1 Thread

Write          42.730 MB/s
Random Write    5.609 MB/s
Read           32.620 MB/s
Random Read    28.153 MB/s

As threads are added, the I/O pattern becomes random, and we
see the overall performance rapidly become limited by the random
performance.

2 Threads

Write         14.437 MB/s
Random Write   5.185 MB/s
Read          32.997 MB/s
Random Read   29.456 MB/s

16 Threads

Write         4.495 MB/s
Random Write  3.467 MB/s
Read         32.146 MB/s
Random Read  30.630 MB/s

Note how sequential write performance has degraded by about 90%
between 1 and 16 threads, while reads are not impacted.

For comparison, a megaraid adapter was substituted for the
3ware in the same system, using the same three disks.  The
storage was also RAID 5, using the default parameters.

With one thread the megaraid showed poor sequential write performance,
but the sequential read was better than the 3Ware adapter.

1 Thread

Write          11.606 MB/s
Random Write    9.695 MB/s
Read           68.686 MB/s
Random Read    39.123 MB/s

For the megaraid, though, the sequential write performance
actually improves slightly when more threads are added,
rather than falling off a cliff like the 3ware does.

2 Threads

Write          14.962 MB/s
Random Write    8.788 MB/s
Read           49.184 MB/s
Random Read    38.827 MB/s

16 Threads

Write         19.676 MB/s
Random Write   5.563 MB/s
Read          51.950 MB/s
Random Read   40.797 MB/s

So, although the "base" write performance for the megaraid is
low, it does not fall off a cliff like the 3ware does.  It
may be this cliff that is responsible for some of the
performance problems people are experiencing.  We are
looking at the cause of this 3ware problem, as well as the
more general problems that may involve the VM and
filesystem.
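
For anyone who wants the "cliff" as a number, the sequential write figures above can be compared with a few lines of Python. The throughput values are copied from the tables in this comment; the helper name percent_change is just an illustrative choice.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Sequential write throughput in MB/s, keyed by thread count,
# taken from the tiobench tables above.
sequential_write = {
    "3ware 8506": {1: 42.730, 2: 14.437, 16: 4.495},
    "megaraid": {1: 11.606, 2: 14.962, 16: 19.676},
}

for adapter, by_threads in sequential_write.items():
    change = percent_change(by_threads[1], by_threads[16])
    print(f"{adapter}: {change:+.1f}% sequential write, 1 -> 16 threads")
```

The sign of the result is the interesting part: the 3ware number is strongly negative while the megaraid number is positive, even though the megaraid starts from a lower single-thread figure.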
Comment 185 Red Hat Bugzilla 2004-09-27 11:22:40 EDT
I rebuilt our server using Fedora Core 2 x86_64 about 2 weeks ago.
The system is a dual processor AMD Opteron with a 3Ware 8506 RAID
controller.
The system has been up for 15 days without incident and appears healthy.
The Fedora Core 2 distribution uses the 2.6 kernel -- which seems to
have resolved the problems described in this bug.
Comment 186 Red Hat Bugzilla 2004-09-28 13:48:11 EDT
Note from support to LaVar:

I have consulted the senior engineers with this.

They are still working on this bug. They have addressed most of the
issues in the RHEL U3 kernel, which you are advised to use.

Then keep your system updated with any bug fixes or patches with up2date.

You can get the ISO images from this site. Log in at the RHN site first,
then go to this link:

https://rhn.redhat.com/network/software/channels/downloads.pxt?cid=1187

Download the ISO images for U3 and burn them to disc. Use these to
install, and get the updates for it.

#####

LaVar comments:

I have done as support requested: reinstalled the OS and installed
all of the updates available using up2date.  The performance has
improved, as indicated in another posting, but it is not acceptable and
it is not yet even close to the performance of the non-RHEL
distributions (i.e. Fedora Core 2, SUSE 9.1, and Red Hat 7.1) that I
have tested.

LaVar
Comment 187 Red Hat Bugzilla 2004-09-28 14:15:53 EDT
LaVar,

there are more improvements forthcoming in U4.
Comment 190 Red Hat Bugzilla 2004-10-06 04:36:18 EDT
Doug, is the ext3 reservations patch still being backported to 2.4?

I believe that internal file fragmentation is in fact one of the
significant factors contributing to the issue.

I can see the problem slowly escalating on a system throughout a
period of over 2 months, where there are many files which are small,
but larger than block size, and there are usually many files written
to disk in parallel (in this case fragmentation is certain for almost
each file).
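
The effect described here can be illustrated with a toy block allocator in Python. This is only a sketch under simplifying assumptions, not ext3's real allocator: files are written round-robin, each write grabs the next free blocks, and `window` stands in for a per-file reservation of contiguous blocks. All names are hypothetical.

```python
def allocate(num_files, blocks_per_file, window=1):
    """Hand out blocks to files written in parallel (round-robin).

    Each turn, a file takes up to `window` contiguous free blocks.
    window=1 models no reservation; larger values model a reservation.
    """
    next_free = 0
    layout = {f: [] for f in range(num_files)}
    remaining = {f: blocks_per_file for f in range(num_files)}
    while any(remaining.values()):
        for f in range(num_files):
            take = min(window, remaining[f])
            layout[f].extend(range(next_free, next_free + take))
            next_free += take
            remaining[f] -= take
    return layout


def count_extents(blocks):
    """Number of contiguous runs in an ordered list of block numbers."""
    if not blocks:
        return 0
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


for window in (1, 4, 8):
    layout = allocate(num_files=4, blocks_per_file=8, window=window)
    print(f"window={window}: file 0 is in {count_extents(layout[0])} extent(s)")
```

In this model, four files written in parallel with no reservation end up with every block of a file separated from the next, which is the "fragmentation is certain for almost each file" case; a window as large as the file keeps it in one piece.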
Comment 191 Red Hat Bugzilla 2004-10-10 21:43:25 EDT
We're severely affected by this same issue.  I/O seems to 'bank up'
even under moderate load, both reads and writes, causing an effective
hang of the system for inordinate lengths of time.  Are we likely to
see a fix for this critical issue any time soon?   
Comment 192 Red Hat Bugzilla 2004-10-12 20:51:02 EDT
Information that may be of use.  The problem affects all systems we have
tested with RHEL3.  This includes IBM Xeon with SCSI, Dell HT P4 with
SATA and Generic P4 with IDE devices.  The problem is solved by
reverting to, for example, RH9 Update kernel 2.4.20-31.9, though this
is of course not a real solution but rather a proof of the problem. 
Writes on the EL kernel will see iowait use 100% CPU time, where the
RH9 kernel will show close to 0% iowait most of the time, with high
CPU idle.  The problem has only become evident in production as it
grows exponentially with utilisation, eventually crashing the system.
 This really must be treated as a critical issue.
Comment 194 Red Hat Bugzilla 2004-10-12 20:59:35 EDT
Tramada, the RHL9 kernel doesn't measure iowait.  That is the reason
the utilities show zero iowait with RHL9.  The system is still waiting
for IO, though ...

Now, IO performance in RHEL3 falling off under load _is_ a problem,
which is being addressed. We have to be careful to stick to the
real problem though, and ignore the cosmetics.
Comment 195 Red Hat Bugzilla 2004-10-12 21:11:03 EDT
Thanks for the info on RH9 kernel Rik; of course it doesn't really
matter what utilities show, but the 0% idle time on the EL kernel can
be both seen and felt (i.e. cosmetic and functional).  Looking forward
to an update.  Thanks again.
Comment 196 Red Hat Bugzilla 2004-10-13 11:20:15 EDT
As mentioned earlier, we expect that a number of the I/O performance
problems will be addressed in the U4 kernel.  You can get a preview of
this kernel for testing purposes at:

http://people.redhat.com/coughlan/RHEL3-perf-test/

This is an experimental kernel intended for testing purposes only.  It
has not been through QA or beta test, and must not be used in production. 

What to expect:

- This kernel includes the VM fixes discussed earlier, plus a few
more.  As noted earlier, this has produced a dramatic improvement for
most workloads, though some specific performance problems remain.

- The superbh feature in RHEL 3 causes I/O size to be limited to 32K
(BZ 131391). This can reduce performance for raw io, and some Oracle
configurations.  If you run one of these workloads and have a
performance problem, please run a test after doing: "echo 2 >
/proc/sys/fs/superbh-behavior". Let us know the results. Please note
that the superbh-behavior sysctl is not currently planned for U4. It
is provided in this experimental kernel for testing purposes only.  

- The poor 3ware RAID 5 random write performance reported in comment
183 still exists. This is largely a hardware limitation. Differences
between RHEL 3 and other kernels may be related to smaller I/O size in
RHEL 3.  Unfortunately, the superbh change described above does not
improve this for 3ware. This is still being investigated.

- The 2.6 ext3 reservations patch mentioned in comment 122 has not
been ported to 2.4.  This is still under investigation for a future
update.

- As stated earlier, performance for some workloads may improve with a
larger value for /proc/sys/vm/max-readahead, and lower elvtune values.
 
If you have a performance problem with this kernel please 1) describe
the test or workload type (filesystem or raw, random or sequential),
2) which HBA you are using, and how the storage is configured, and 3)
how you measured the performance.
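
When reporting results, it can help to record which tunables were in effect for each run. The following Python sketch reads the two /proc files named above and tolerates their absence (for example, superbh-behavior only exists on the experimental kernel). The function name and the default path list are illustrative assumptions; elvtune settings are not covered because they are read with the elvtune command rather than from a file.

```python
from pathlib import Path

# Paths taken from the text above; either may be missing on a given kernel.
DEFAULT_TUNABLES = [
    "/proc/sys/vm/max-readahead",
    "/proc/sys/fs/superbh-behavior",
]


def snapshot_tunables(paths=DEFAULT_TUNABLES):
    """Return {path: current value as text, or None if unreadable}."""
    values = {}
    for path in paths:
        try:
            values[path] = Path(path).read_text().strip()
        except OSError:
            # File does not exist or cannot be read on this system.
            values[path] = None
    return values


for path, value in snapshot_tunables().items():
    print(f"{path} = {value}")
```

Pasting this output next to the benchmark numbers makes it easier to compare two reports that were taken with different settings.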

Comment 197 Red Hat Bugzilla 2004-10-13 13:47:20 EDT
... and 

- no sources

I personally can't even test if I cannot merge some other patches in,
so a binary is just useless ...

Comment 198 Red Hat Bugzilla 2004-10-13 15:53:08 EDT
Okay, the source is there now.

http://people.redhat.com/coughlan/RHEL3-perf-test/SRPMS/
Comment 199 Red Hat Bugzilla 2004-10-13 22:06:05 EDT
Thanks for the update Tom.  It doesn't look to have improved things
much though, using top we're still seeing 0% CPU idle during disk
activity on all tested systems, where RH9 kernel shows ~80-100% idle.  

Tests were:

- dd from /dev/zero to file on ext3 FS
- dd from /dev/zero to raw device
- mkfs.ext3 raw device
- cp file on ext3 FS, from deviceA to deviceB

Tested configurations:

- Non-HT 1.6GHz P4 with PATA 
- 1.133Ghz P3 with SCSI
Comment 200 Red Hat Bugzilla 2004-10-14 06:12:29 EDT
"Tramada, the RHL9 kernel doesn't measure iowait.  That is the reason
the utilities show zero iowait with RHL9.  The system is still waiting
for IO, though ..."

To explain this a bit more --- iowait time *IS* CPU-idle time.  It's
just that the RHEL3 kernel is accounting CPU-idle time when the disk
is busy differently from that when the disk is idle too.

If the disk is busy but the CPU idle, RH9 will show the CPU as idle. 
But RHEL3 will show it as iowait.

The difference is only in the accounting.

In other words, if you have a busy disk test which keeps the disk
occupied most of the time but doesn't load the CPU much, then you
expect to see 0% idle and high iowait on RHEL3, but 0% iowait and high
idle on RH9.

So "0% CPU idle during disk activity on all tested systems, where RH9
kernel shows ~80-100% idle" is just showing expected behaviour.


iowait/idle time are really not much use for debugging disk IO
performance problems.  It's *far* better to use something like "iostat
-x", which can show real disk response times to the requests in
progress, or even just to time the speed of a particular application
using the disk.  All iowait shows you is that the disk is busy; it
doesn't tell you anything about how fast it is working.
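
The accounting difference can be written down as a small Python sketch. It is a model of the explanation above, not of the kernel's actual counters: the function and argument names are hypothetical, and the tick counts in the example are made up.

```python
def cpu_percentages(busy, idle_disk_quiet, idle_disk_busy, reports_iowait):
    """Split the same CPU ticks under the two accounting schemes.

    busy            -- ticks spent running code
    idle_disk_quiet -- idle ticks with no I/O outstanding
    idle_disk_busy  -- idle ticks while I/O was outstanding
    reports_iowait  -- True for RHEL3-style accounting, False for RH9-style
    """
    total = busy + idle_disk_quiet + idle_disk_busy
    if reports_iowait:
        idle, iowait = idle_disk_quiet, idle_disk_busy
    else:
        # Older accounting folds "idle while the disk is busy" into idle.
        idle, iowait = idle_disk_quiet + idle_disk_busy, 0
    return {
        "busy": 100.0 * busy / total,
        "idle": 100.0 * idle / total,
        "iowait": 100.0 * iowait / total,
    }


# Same disk-bound workload, two ways of reporting it.
sample = dict(busy=5, idle_disk_quiet=0, idle_disk_busy=95)
print("RHEL3-style:", cpu_percentages(**sample, reports_iowait=True))
print("RH9-style:  ", cpu_percentages(**sample, reports_iowait=False))
```

Both lines describe the same machine doing the same work; only the label on the idle time changes, which is why the percentages alone say nothing about how fast the disk is.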
Comment 201 Red Hat Bugzilla 2004-10-14 07:27:15 EDT
Thanks Stephen, will try to report back with some useful information
soon.  For now, maybe worth reporting throughput, measured when
writing direct (dd from /dev/zero, bs=1024k) to raw device, showed
about 25-30% lower on the EL3-perf-test kernel vs RH9 kernel, with
kernel.org 2.4.27 matching RH9 kernel performance.
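
For anyone wanting to repeat this kind of measurement in a scriptable way, here is a rough Python equivalent of a dd-style sequential write with 1 MiB blocks. It is a sketch, not the procedure used for the numbers above: the function name is made up, and it should only be pointed at a scratch file, since writing to a raw device destroys its contents.

```python
import os
import time


def timed_sequential_write(path, num_blocks, block_size=1024 * 1024):
    """Write num_blocks zero-filled blocks to path; return MiB/s.

    The data is fsync'ed before the clock stops so that the figure
    reflects the disk rather than only the page cache.
    """
    block = b"\0" * block_size
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(num_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
    return (num_blocks * block_size) / (1024 * 1024) / elapsed
```

A call such as timed_sequential_write("/tmp/scratch.bin", 512) writes 512 MiB; running the same call on each kernel gives directly comparable throughput figures.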
Comment 202 Red Hat Bugzilla 2004-10-14 18:03:26 EDT
There are some logs from this later test kernel at

http://blahee.no-ip.org/iowait/

Short version: The published test kernel does not seem to change the
behaviour for ext3.

The problem is still very easily triggered with ext3: out of nowhere
the write performance drops to the 'MB/s range' and stays there until
the sequential write stops. The kernel has XFS from 2.4.28-pre3
merged in just to do reference testing and to see that XFS doesn't
suffer the same symptoms with the 3w-9xxx driver (which is patched in
too). The 3w-9xxx driver has clustering fixed 'as 1' to make that
happen, and TCQ limited to 16/LUN even though 8/LUN seems to be
optimal.

tiobench-0.3.3 is used as it's just a very easy way to trigger this
behaviour very fast. Pushing data over NFS (Gbit) sometimes makes it
much harder to trigger, but it triggers the behaviour sooner or later
too.

The hardware is a dual PIII (Intel STL2) w/ 512MB memory to make
benchmarking easier (limited memory), and a 3ware 9500S-8 w/ 8x250GB
SATA disks (not all the same brand anymore, as the Maxtors are failing
faster than they can be replaced :). Speeds of 'hundred MB/s' begin to
be biased by the fact that an SDRAM-based setup already has a clear
memory bandwidth limitation at those speeds, but that's the machine I
can afford for a new NFS server at home, and so that is the machine I
test the setup on now.

The recent talk about iowait percentage isn't relevant. I do clearly
understand what it's all about even tho it has been confusing for a
lot of people. I remember explaining many times for several people
that 'even if you do have near 100% iowait, it's NOT a problem'.


I don't know, but it seems that ext3 is doing something with the
journal/other metadata that somehow blocks other write access. I
really don't know, and don't know how to verify it either.

I am not much concerned about the random I/O speed now, as the
sequential writes are sometimes down to MB/s for a long time.

The reason I was bitching about the sources was that I wanted to see
which of those 'small patches lying all around' were merged, and to
patch in XFS, as that is my baseline FS for this case - I know it's
working. Having the 3w-9xxx in the kernel tree isn't a bad thing either
(which is the 902 level from 3ware anyway for me).

I did run some tests with the RHEL4 beta too, about two weeks ago, but
the 2.6-series kernel is just so slow - even with 'elevator=deadline'.
I could not find any effective tunables to get it even near 2.4-series
I/O performance (speed wise).

Comment 203 Red Hat Bugzilla 2004-10-21 11:29:32 EDT
We're experiencing poor disk performance on a fully up2dated RHEL AS
3.0 on a HP NetServer LT6000 with a NetRAID (SCSI) controller and 3
disks in RAID-5 (megaraid driver during test, not megaraid2). We were
testing the kernel-smp-2.4.21-21.ELperftestonly2.i686.rpm posted
above, and while fiddling around with various tests without seeing
any improvement, we ended up with disk corruption during some pax
tests. "Never mind that," we thought, continuing with more pax'ing.
Suddenly the load rose to 8.13 and the server froze. I'm trying to
post the last update we got from "top i c", where the bdflush in DW
state may be of interest.

elv parms were default (2048/8192) and readahead ditto (3/31).

The server was doing next to nothing other than two pax jobs and some
network activity at the time it froze.

The server remained responsive to ping, but impossible to log into and
all my remote login terms froze. Powersave had turned off the console,
and nothing could wake it. We're not running apmd, but we do run gpm
so we can wake the console with the mouse.

The RAID is giving us 20-30MB/s seq trans (tested with 'hdparm -t' and
large 'dd's), which is less than half of an old RH7.1 running on
exactly the same type of server. Performance happily drops to a few
megs if we actually *use* the server for anything. That really blows
for a SCSI RAID-5.

Tests along the timeline RH7-RH8-RH9-FC-RHELAS2.1-RHELAS3.0 reveal an
amazingly steady decline in disk performance (on these servers at least).
Comment 204 Red Hat Bugzilla 2004-10-21 11:38:24 EDT
Created attachment 105592 [details]
top output
Comment 206 Red Hat Bugzilla 2004-10-21 12:32:51 EDT
From Red Hat support:

After sorting out our IO performance tickets, several customers have
found the test kernel posted in comment #196 resolved their issues. We
also directed another set of IO performance tickets (performance
dropped and/or system locked up when moving/copying large numbers of
small files) to try out the test kernel posted in bugzilla #132639
comment #77 and got good news from that group too.

These two kernels may not be the "cure-all" solutions for all the RHEL
IO performance issues but they do help.

For Red Hat internal reference: the newest good news is from IT#51878.
Comment 207 Red Hat Bugzilla 2004-10-21 15:58:39 EDT
Re: comment #203

> Tests along the timeline RH7-RH8-RH9-FC-RHELAS2.1-RHELAS3.0 reveal an
> amazingly steady decline in disk performance (on these servers at
> least).

Hmmm... the actual timeline is closer to
RH7.0-RH7.1-RH7.2-RHELAS2.1-RH7.3-RH8-RH9-RHELAS3.0-FC1-FC2. (If
you're comparing them with fully up2date kernels, the timeline is
probably more like
RH7.0-RHELAS2.1-RH7.1-RH7.2-RH7.3-RH8-RH9-RHELAS3.0-FC1-FC2.)

So, if your "timeline" reflects a downward slope in disk performance,
then it's actually been jumping up and down from release to release.
Comment 208 Red Hat Bugzilla 2004-10-22 04:13:54 EDT
Re #207: Yeah, FC slipped in there and wasn't supposed to. I guess I
should think about EL/AS as a completely separate branch with a
vmsub/iosub that just doesn't work (properly) on the server I refer to
in #203. It doesn't matter.

Re #206: Did you read my #203, where I reported a lockup during two
pax archiving jobs (copying lots and lots of files of all kinds of
sizes) with the kernel you refer to?

I found it disturbing that it looked like the kernel locked up when
bdflush woke up to flush some dirty buffers or whatnot. The D state
means it was most likely waiting for disk IO, or am I way off?
Of course, the top output isn't exactly real time, so maybe I'm just
blowing smoke. The kernel still did lock up, though.

Was my error in timelining more interesting than my report of a kernel
lockup?
Comment 209 Red Hat Bugzilla 2004-10-29 17:42:08 EDT
HR,

If you are able to reproduce the hang that you saw in #203, it would
be most helpful to get some alt-sysrq output from it. To do this,
enable the serial console by adding something like this to the kernel
line in grub.conf:

console=tty0 console=ttyS0,115200n8

While we are at it, let's turn on the watchdog timer by adding:

nmi_watchdog=1

Connect to the serial line and capture the output.

Then echo 1 > /proc/sys/kernel/sysrq

When the system locks up type alt-sysrq-m and alt-sysrq-t. 
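
Before trying to reproduce the hang, it is worth checking that the boot parameters above actually took effect. A small Python sketch of that check follows; the function name is made up, and the required list simply repeats the parameters given in this comment.

```python
# Boot parameters requested in this comment.
REQUIRED_PARAMS = ["console=tty0", "console=ttyS0,115200n8", "nmi_watchdog=1"]


def missing_boot_params(cmdline, required=REQUIRED_PARAMS):
    """Return the required parameters that are absent from cmdline.

    cmdline is the kernel command line as a single string, for example
    the contents of /proc/cmdline.
    """
    present = cmdline.split()
    return [param for param in required if param not in present]


example = "ro root=LABEL=/ console=tty0 console=ttyS0,115200n8"
print(missing_boot_params(example))
```

On the running system the input would come from /proc/cmdline; an empty list means all three parameters are active.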

Thanks,

Tom

Comment 210 Red Hat Bugzilla 2004-11-03 19:24:24 EST
I have been experiencing rather bad performance using EL3 with ext3 
and mysql. I've created a dual boot, one with all ext3 and the other 
with all ext2 filesystems. All else is the same. A rather standard 
dell system with ide drives, no raid, no fancy graphics sound etc.

My test environment is a mysql db with one program connected locally 
and 12 clients over the network (on a local lan).

The only variable here is ext2 vs. ext3. 

The performance is rather bad with ext3, measured by determining how
much data I can send the local process that is heavily feeding mysql
(the 12 remote clients access data and do only a single query every
10 seconds).

For numbers, I see about 50% iowait with ext3 and none with ext2. 
When the ext3 50% iowait occurs, throughput drops by a factor of 
about 3. I realize these numbers are just eyeball estimates, but I 
have gone back between the ext2/ext3 systems a few times and this is 
the only change and performance is very bad with the ext3 systems 
while quite good with the ext2 filesystem.

Sorry if this has already been mentioned, as this is a very long 
bugzilla log and I have not read it entirely.

Eric
Comment 211 Red Hat Bugzilla 2004-11-04 10:40:06 EST
Eric,

If you have not already seen these basic guidelines for ext3 tuning,
they may provide some help:

http://www.redhat.com/support/wpapers/redhat/ext3/tuning.html

The noatime suggestion may be relevant to your situation. 

Tom
Comment 212 Red Hat Bugzilla 2004-11-05 17:21:22 EST
FYI, the perftest kernel posted in comment #196 does not report
correct process memory usage, which was reported in bug #137927.  In
that bug report, Tom mentions that this is even seen in the
2.4.21-23.ELsmp (RHEL 3 U4 beta) kernel.  Does that kernel have any
more patches or fixes that address this IO problem?  I tried the
perftest kernel above, but didn't see any improvement.  One important
note, the IO degradation appears to get worse over time.  I mentioned
before that I am having problems on a desktop system, which get worse
the longer I keep the system up and keep myself logged in with
processes like mozilla getting older and growing in size.  I hope the
U4 kernel will fix or improve the IO performance.
Comment 213 Red Hat Bugzilla 2004-11-15 18:41:13 EST
Re #203

I have been experiencing very similar lock ups using RH9, RAID and
ext3. The machine can still be pinged, but nothing else works. When
trawling through the logs, I found that something is causing the load
to go through the roof. You can see this in the maillog of all places,
because sendmail temporarily stops working when the load goes over
about 50.

When the RAID disks are being hammered, the I/O is under 10 Mb/s, and
it often freezes up for a period of 30s-1 minute. Occasionally (once a
day under heavy load) the entire machine locks up, (similar to report
#203), and a reboot is necessary.
Comment 214 Red Hat Bugzilla 2004-11-15 18:56:07 EST
By the way, here is something which appears in dmesg when the system
freezes up. Not all the freeze ups are fatal (ie require a reboot),
but a good proportion are:

Unable to handle kernel NULL pointer dereference at virtual address
00000080
 printing eip:
c012c897
*pde = 00000000
Oops: 0000
eeprom w83781d i2c-proc i2c-i801 i2c-core iptable_filter ip_tables
autofs nfs lockd sunrpc e1000 e100 sr_mod ide-scsi ide-cd cdrom
3w-xxxx sd_mod scsi_mod loo
CPU:    0
EIP:    0060:[<c012c897>]    Not tainted
EFLAGS: 00010202

EIP is at access_process_vm [kernel] 0x27 (2.4.20-8smp)
eax: 00000000   ebx: eeba8280   ecx: d4e66000   edx: c3160000
esi: 00000000   edi: c3160000   ebp: c3160000   esp: e141def0
ds: 0068   es: 0068   ss: 0068
Process ps (pid: 18992, stackpage=e141d000)
Stack: c015f946 c5687400 e141df10 00000202 00000001 00000000 e141df84
e609cd80 
       e31b000c 00000202 00000000 c3160000 00000000 00000500 000001f0
eeba8280 
       00000000 c3160000 d4e66000 c017a2b9 d4e66000 bffffbe0 c3160000
0000000d 
Call Trace:   [<c015f946>] link_path_walk [kernel] 0x656 (0xe141def0))
[<c017a2b9>] proc_pid_cmdline [kernel] 0x69 (0xe141df3c))
[<c017a6e7>] proc_info_read [kernel] 0x77 (0xe141df6c))
[<c0152457>] sys_read [kernel] 0x97 (0xe141df94))
[<c01517e2>] sys_open [kernel] 0xa2 (0xe141dfa8))
[<c01098cf>] system_call [kernel] 0x33 (0xe141dfc0))


Code: f6 80 80 00 00 00 01 74 2e 81 7c 24 30 40 a2 33 c0 74 24 f0
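
When comparing several oops reports like this one, it can be handy to pull just the function names out of the call trace. The Python sketch below does that for the 2.4-style trace format shown above; the regular expression is an assumption based on these lines only and may need adjusting for other kernels.

```python
import re

# Matches entries such as:
#   [<c017a2b9>] proc_pid_cmdline [kernel] 0x69 (0xe141df3c))
TRACE_ENTRY = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\S+)\s+\[([^\]]+)\]\s+0x([0-9a-f]+)"
)


def parse_call_trace(text):
    """Return one dict per call trace entry found in text."""
    entries = []
    for address, function, module, offset in TRACE_ENTRY.findall(text):
        entries.append({
            "address": address,
            "function": function,
            "module": module,
            "offset": int(offset, 16),
        })
    return entries


sample = """Call Trace:   [<c015f946>] link_path_walk [kernel] 0x656 (0xe141def0))
[<c017a2b9>] proc_pid_cmdline [kernel] 0x69 (0xe141df3c))
[<c017a6e7>] proc_info_read [kernel] 0x77 (0xe141df6c))"""

for entry in parse_call_trace(sample):
    print(entry["function"], hex(entry["offset"]))
```

Two reports whose function lists match are likely the same crash, which helps when deciding whether to file a new bugzilla.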
Comment 215 Red Hat Bugzilla 2004-11-15 18:57:10 EST
By the way, here is something which appears in dmesg when the system
freezes up. Not all the freeze ups are fatal (ie require a reboot),
but a good proportion are:

Unable to handle kernel NULL pointer dereference at virtual address
00000080
 printing eip:
c012c897
*pde = 00000000
Oops: 0000
eeprom w83781d i2c-proc i2c-i801 i2c-core iptable_filter ip_tables
autofs nfs lockd sunrpc e1000 e100 sr_mod ide-scsi ide-cd cdrom
3w-xxxx sd_mod scsi_mod loo
CPU:    0
EIP:    0060:[<c012c897>]    Not tainted
EFLAGS: 00010202

EIP is at access_process_vm [kernel] 0x27 (2.4.20-8smp)
eax: 00000000   ebx: eeba8280   ecx: d4e66000   edx: c3160000
esi: 00000000   edi: c3160000   ebp: c3160000   esp: e141def0
ds: 0068   es: 0068   ss: 0068
Process ps (pid: 18992, stackpage=e141d000)
Stack: c015f946 c5687400 e141df10 00000202 00000001 00000000 e141df84
e609cd80 
       e31b000c 00000202 00000000 c3160000 00000000 00000500 000001f0
eeba8280 
       00000000 c3160000 d4e66000 c017a2b9 d4e66000 bffffbe0 c3160000
0000000d 
Call Trace:   [<c015f946>] link_path_walk [kernel] 0x656 (0xe141def0))
[<c017a2b9>] proc_pid_cmdline [kernel] 0x69 (0xe141df3c))
[<c017a6e7>] proc_info_read [kernel] 0x77 (0xe141df6c))
[<c0152457>] sys_read [kernel] 0x97 (0xe141df94))
[<c01517e2>] sys_open [kernel] 0xa2 (0xe141dfa8))
[<c01098cf>] system_call [kernel] 0x33 (0xe141dfc0))


Code: f6 80 80 00 00 00 01 74 2e 81 7c 24 30 40 a2 33 c0 74 24 f0
Comment 216 Red Hat Bugzilla 2004-11-16 23:42:54 EST
For comment #214/#215: this is *not* an iowait issue - the system had
actually crashed (panicked). Could you contact Red Hat support so our
front end support engineers can walk you through setting up netdump to
obtain a vmcore? At a minimum, another bugzilla would be helpful.

And for other readers of this bugzilla who will contact or have
contacted Red Hat support, *please* don't just say "we have an iowait
issue identical to bugzilla 121434". It can be very misleading and
drags everyone in the wrong direction. We have been receiving lots of
false alarm calls due to this bugzilla.
Comment 217 Red Hat Bugzilla 2004-11-17 00:09:58 EST
I would like to re-word the above comment - it is not an "iowait
performance issue", and comments #214/#215 are very much *appreciated*
since they give us a good clue - instead of a general and vague
statement such as "our system locked up due to the iowait issue
described in bugzilla 121434".
Comment 218 Red Hat Bugzilla 2004-11-19 09:53:32 EST
Using the RHEL 3 U0/1/2 kernels (with and without SMP) we have high
IOwait and the number of context switches can easily reach 20000!

We have tried on the following hardware:
* dual Pentium III with internal SCSI disks
* IBM x345 and x365 with FAStT 900 SAN

Moreover, we have observed that fsck.ext3 runs forever and we have to
reset the server!

The kernel mentioned in #196 does not make the problem go away, and a
plain 2.4.28 kernel from kernel.org lowers the IOwaits but also the
I/O performance by a factor of 10 (roughly).

Btw, as I/O traffic generator we're using IOzone <http://www.iozone.org>. 
Comment 219 Red Hat Bugzilla 2004-11-19 10:23:55 EST
For comment#218:
 
1. Was the problem reported to Red Hat support or sales rep ? If yes,
could you send your ticket number to wcheng@redhat.com ? 
2. Is the system using hyperthreading? If yes, turn it off to see how it
goes.
Comment 220 Red Hat Bugzilla 2004-11-19 16:49:48 EST
Just thought I'd mention that we've been seeing similar problems with
NFS and local disk IO. I have tried to narrow down the possible
problems, and have started a new Bug Report, #139937.
Comment 221 Red Hat Bugzilla 2004-11-23 18:47:52 EST
We saw this problem on a brand new DELL PowerEdge 2650 using a 
Megaraid based perc card.  We've since taken the server offline until 
we can come up with a solution.

Does anyone have a list of commands that will reproduce this problem
100% of the time?

Thanks
-Ron
Comment 222 Red Hat Bugzilla 2004-11-30 17:31:33 EST
Hi,

Am having the exact same problem here.. during bonnie++ benchmarks, we
get extremely high iowait, and 99% of the time we end up getting
kernel dumps..
Comment 223 Red Hat Bugzilla 2004-12-02 17:36:13 EST
Encountering this problem while running the LS-DYNA benchmarks.

Setup:
SE7520JR2 server with single SATA drive. ICH5R controller. Very high 
IOWAITs, application only gets a small percentage of CPU time, as all 
time is spent waiting for IO.

Anything new on this issue?
Comment 224 Red Hat Bugzilla 2004-12-03 10:02:58 EST
Peter, and others,

Please review Larry's comment #158 and Stephen's comment #200
regarding high iowait. iowait time is idle time while there is I/O
outstanding. This by itself does not indicate a performance problem.
"iostat -x" statistics, as Stephen suggested, would be more helpful.

The U4 kernel, shipping later this month, has the improvements
described in comment #196 plus a few more minor fixes. We are working
on a fix for U5 that will increase the 32K limit on raw io. We are
continuing to investigate the remaining problem reports.  It is likely
that there are still multiple problems mixed together here. Some may
be adapter/driver specific, others may be higher in the stack. To
ensure that your particular problem gets addressed, you should file a
detailed problem report through the Red Hat support organization. If
that is not an option for you, then I suggest that you open a new
bugzilla with a very specific problem description, including
configuration, workload, and performance data. 

Tom


Comment 225 Red Hat Bugzilla 2004-12-03 10:16:22 EST
Re. Comment #222 From James,

Please provide detailed information on your kernel dumps. Preferably
through Red Hat support, or in a new BZ. Crashes should be addressed
separately from the I/O performance mega-bugzilla-from-hell. ;^{

Tom
Comment 226 Red Hat Bugzilla 2004-12-03 10:51:12 EST
Re comment 224

As the topic of this bug is specifically the 3Ware array and I/O
problems: is Red Hat testing with this hardware, and does the U4
kernel fix the problems?  Are there issues with the U4 kernel and
3ware that Red Hat is aware of?  From the comment it appears it was a
general fix.
Comment 228 Red Hat Bugzilla 2004-12-13 05:39:12 EST
To followup on comment #218, the problem with hanging the server
(using iozone or sometimes even just fsck) turned out to be caused by
the IBM/Engenio RDAC failover driver, not the kernel proper.

There is still a problem related to the i/o size, but that is more of
a performance issue.
Comment 229 Red Hat Bugzilla 2004-12-14 08:54:14 EST
I don't think this is a Red Hat specific problem.

I was also having this problem with an older 3ware 7500-4LP PATA card
with a two-disk RAID-1 (RHEL 3). All you need is one
I/O-intensive process like "cp /dev/zero test", and all reading
from the RAID disk is blocked down to almost zero bytes per second.

I also have an 8506-4LP SATA card with RAID-1 (Red Hat 9), a 7500-4LP
PATA card with RAID-1 (Red Hat 9), and a couple of 9500S 4-port SATA
cards with RAID-5 and RAID-1 (Whitebox Linux & RHEL & Debian).

All of those machines have the same problem, with the distribution's
own kernel or with a vanilla kernel. I have tried elevator tuning with
different parameters, but it did not help.

I don't know if 3ware's card is rubbish or what... but it would be 
nice to get this fixed. =) 3ware's support is blaming Linux kernel 
but when I asked for specific information they were not able to give 
it.

Comment 230 Red Hat Bugzilla 2004-12-15 04:18:58 EST
In response to comment #229, we have a set of 10 identical servers 
with 3Ware cards in, half running RHEL3, half Win 2K3.  The Windows 
boxes have no problems at all, and easily outperform the RHEL boxes 
in terms of disk I/O.

Our experience with this card has shown that we get worst performance 
when attempting to access hundreds of small files - e.g. performing a 
recursive grep in the qmail queue folder will cause the IOWait to hit 
99.9% and then the server freezes (no FTP access, no HTTP access 
(i.e. ports open but non-responsive), terminal becomes unresponsive) 
for, minimally, 10 seconds.  I have seen one of these 3Ware servers 
hang like this for just over 6 minutes, seemingly totally down, and 
then return to normal.

We are using RAID 1 on a 3Ware Escalade 7506.
Comment 231 Red Hat Bugzilla 2004-12-21 14:27:41 EST
In response to comment #229 (and #230), we are seeing this exact same
behaviour with all of our RHEL3 systems, even after updating to all of
the packages released yesterday and today (including
kernel-2.4.21-27.ELsmp).  However, I am not convinced the problem is
limited to 3ware cards, or even scsi.  We see unusual I/O blocking
with single IDE and SCSI disks, as well as with our 7506 and 9500S-12
3ware cards.

We have opened Bug #139937.  Please take a quick look at the
information in that report.  I would greatly appreciate anyone who can
verify what we think we are seeing.
Comment 232 Red Hat Bugzilla 2005-01-05 07:35:33 EST
We also have this high iowait problem introduced with RHEL 3. Our 
server is a dual XEON 2.8GHz with an ICP/Vortex RAID Controller. It 
was running perfect with RedHat 7.3.4 / Oracle 8.1.7

We installed RHEL 3 (Taroon) prior to upgrade to Oracle 9.2 and since 
then we have problems. We are observing the same high iowait 
reading/writing to our NetApp Filer via NFS Vers 3 or locally on the 
ICP/Vortex RAID. It doesn't matter whether it's the Oracle DB 
reading/writing, an 'rsync' or just a simple 'cp'.

Kernel version currently running is 2.4.21-27.ELsmp
Comment 233 Red Hat Bugzilla 2005-01-09 14:12:13 EST
Is there a time that we can look forward to the U4 kernel being
released as production, or, is it already?  I saw above that it is in
beta testing.  Just was curious if there was an update on that.
Comment 234 Red Hat Bugzilla 2005-01-09 16:19:01 EST
RHEL3 U4 was released a few weeks ago.
Comment 235 Red Hat Bugzilla 2005-01-18 10:39:44 EST
Try using the noatime mount option, seemed to considerably lower 
iowait for us.
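
For anyone who wants to try this, a minimal sketch follows. It assumes an ext3 filesystem mounted at /home (a hypothetical mount point, adjust to yours). The helper add_noatime is made up for illustration and only rewrites the options field of one fstab-style line; the live remount needs root and is shown as a comment.

```shell
#!/bin/sh
# add_noatime: hypothetical helper; appends noatime to the 4th (options)
# field of one fstab-style line unless it is already present.
add_noatime() {
    printf '%s\n' "$1" | awk '{
        n = split($4, opts, ",")
        found = 0
        for (i = 1; i <= n; i++) if (opts[i] == "noatime") found = 1
        if (!found) $4 = $4 ",noatime"
        print
    }'
}

# Example fstab entry (device and mount point are assumptions).
add_noatime "/dev/sda3 /home ext3 defaults 1 2"

# To apply to a live filesystem without editing fstab (as root):
#   mount -o remount,noatime /home
```

The remount takes effect immediately; the fstab change only makes it survive a reboot.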
Comment 236 Red Hat Bugzilla 2005-01-21 10:38:32 EST
We are using RHEL for an Ensim server. The server currently holds 200 sites and is 
EXTREMELY slow due to iowait. We ran a test with a 3ware 7600-2 with a 200 gig 
drive and a Mylex SCSI RAID controller with 10k rpm drives. SCSI was a little bit faster, but 
it made almost no difference for us. The system is really slow. After that we tried the 2.4.29 
kernel from kernel.org and the difference was like day and night: the system was fast compared to 
the RH kernel.

Why is this thread still active? It was opened at 2004-04-21 11:28. RHEL is 
a paid version that lacks performance. It would be nice to have more details from 
Red Hat, or something to test such as a new beta kernel, just to see that somebody is doing 
something to fix this issue.

thanks and keep working ;-)
Comment 237 Red Hat Bugzilla 2005-01-21 14:56:15 EST
Hello,

I have a very similar problem. I just tried to move a user's home 
directory from one filesystem to another, and the iowait is at 
90% to 100% and the machine is locking up.

This system is a new system that I just installed last week. It's a 
Dell 2650 using dual 3GHz Xeon processors. The internal disks are 
using Hardware Raid 5 with a Perc3 D/I disk controller. One of the 
larger partitions that I was copying to is on the internal drives.

In addition, I have a PowerVault 220s consisting of 7 Disk drives and 
using an Adaptec 39320A SCSI Controller. The 7 disks are configured 
using lvm into a 1TB volume which is striped.

My kernel version is: 2.4.21-27.0.1.ELsmp, i686, Red Hat Enterprise 
Linux/ES.

So, I have: /export/homes (Internal 2650 disks)
            /export/local (1TB Power Vault disks)

I tried doing a simple move of a user's home directory from
/export/local to /export/homes, and the iowait state is
100% and the system is hanging.

Has anyone solved this one?

Comment 238 Red Hat Bugzilla 2005-01-24 04:04:33 EST
I'm having the same issue here. The performance seems to be worse on 
a RAID5 logical disk than on a RAID1 logical disk. A 'cp /dev/zero 
testfile' is enough to make the machine respond very slowly. 
Sometimes you have to wait ~1 minute for a remote ssh login session 
to respond, sometimes it connects instantly. TCP/IP servers are not 
responding or are very slow.
Performance is a little bit better with kernel U4 release. 

o.s: RHEL 3, kernel 2.4.21-27.0.1
controller: lsi megaraid 320-1 (latest firmware 1L37)
2x36GB disks in raid 1, 3x36GB disks in raid 5. 


Comment 239 Red Hat Bugzilla 2005-01-25 19:43:37 EST
we are experiencing the same issue with a customer's box. FWIW, 
here's what helped and what did not:

1. elvtune -r 32 -w 4096 -b 4 - this helped in the sense that the 
problem still occurs but with less severity, i.e. iostat -x /dev/sda, 
iowait in top and throughput tests all show less degradation over 
time, although the peak 'io freeze-ups' are just as bad.

2. perftest2 kernel *made it worse*. So bad, in fact, that we had to 
reboot it back into 15.0.3-smp. iostat and iowait numbers showed the 
box pegged against the wall.
Comment 241 Red Hat Bugzilla 2005-02-11 15:09:44 EST
Message to  James Wade.
- How long does the system lock up when you see the problem?  Does it
lock up for a few seconds, several minutes?
- How many files are in the directory - average size?

Additional questions to James,  Detlev, Jonathan, and others:

- How much memory is on the system?
- What type of filesystem - ext2, ext3, etc.?
- Are there any applications running on the box during the lock up -
database, web server, etc.?
- Can you send a ps -eaf, iostat and vmstat output when the lock ups
occur?
- Also, if you have console access to the machine an alt sysrq m and
alt sysrq t would be helpful - See Comment 195 for instructions.
- What is your kernel version if not listed already?
- Any messages in the /var/log/messages file?
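
The data requested above can be gathered in one pass with a small script such as the sketch below. The function name capture_io_state, the default output directory and the sample count are assumptions for illustration; each tool is skipped if it is not installed on the box.

```shell
#!/bin/sh
# capture_io_state [OUTDIR] [COUNT]: hypothetical helper that snapshots
# process and I/O state into a directory so it can be attached to the bug.
capture_io_state() {
    outdir=${1:-/tmp/iowait-capture}
    count=${2:-5}
    mkdir -p "$outdir" || return 1
    date > "$outdir/timestamp.txt"
    uname -r > "$outdir/kernel.txt"
    # Each collector runs only if the tool exists on this system.
    if command -v ps >/dev/null 2>&1; then
        ps -eaf > "$outdir/ps.txt" 2>&1
    fi
    if command -v vmstat >/dev/null 2>&1; then
        vmstat 1 "$count" > "$outdir/vmstat.txt" 2>&1
    fi
    if command -v iostat >/dev/null 2>&1; then
        iostat -x 1 "$count" > "$outdir/iostat.txt" 2>&1
    fi
    echo "$outdir"
}

# Example, run from the console while the system is stalled:
#   capture_io_state /root/stall-1 10
```

Writing the output to a disk that is not behind the stalled controller, if one is available, makes it more likely the capture itself completes.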
Comment 243 Red Hat Bugzilla 2005-02-11 16:56:05 EST
>- How much memory is on the system?

2GB

- What type of filesystem - ext2, ext3, etc.?

ext3

- Are there any applications running on the box during the lock up -
database, web server, etc.?

apache (openwebmail), dovecot (imap-server), sendmail.. 

- Can you send a ps -eaf, iostat and vmstat output when the lock ups
occur?

- What is your kernel version if not listed already?

Now I'm running the vanilla 2.4.28 kernel because its responsiveness
is better than the RH kernel's, but it's still bad.

Normally loads go up when the disk array has lots of writing to do, so
something is blocking there. Reading is fine until you have to write
a huge amount of data to the disk.

- Any messages in the /var/log/messages file?

Nope
Comment 244 Red Hat Bugzilla 2005-02-11 18:00:42 EST
Hi Pasi,

Thanks for the update. 

Is it possible to capture a ps -eaf, iostat, vmstat and maybe top
output when the problems occur?  It would also be very helpful to get
an alt sysrq m and alt sysrq t from the console as shown in comment 195.

Also, what arch are you running - x86, x86_64, etc.?

Thanks
Comment 245 Red Hat Bugzilla 2005-03-07 02:42:27 EST
Created attachment 111733
output logs
Comment 246 Red Hat Bugzilla 2005-03-07 02:48:45 EST
Hi Joseph,

I sent ps, iostat, vmstat and top outputs in a previous message but
sysrq-info will have to wait before I have time to do it.

I'm running x86.

Comment 247 Red Hat Bugzilla 2005-03-07 11:42:07 EST
Hi Pasi,

Was this data generated when a lock up occurred?  The iostat data
shows only 1.37M/s of writes at most.  The top and vmstat data also
indicate that the system is mostly idle.  It would be helpful to get
this same data when a lock up occurs.  This can be done on the console
if other logins are frozen.  How long does the lock up last?

Also, it looks like you have one device on the system, which is sda. 
What is the physical make-up of this device?  Is it raid0, raid1 or
raid5 using the 3Ware card?  How many actual spindles make up the
array?  Is the device used for the heavy IO the same device as your
root disk?  Can you run a df -k?

It would also be helpful if you were running one of the Red Hat
supported kernels versus an upstream kernel.

Regards,

Joe
Comment 248 Red Hat Bugzilla 2005-03-07 13:39:06 EST
>Was this data generated when a lock up occurred?  

Yes.

>The iostat data shows only 1.37M/s of writes at most.  The top and
>vmstat data also indicate that the system is mostly idle. 

Yes, it is not very much, I know. I think that part of the
problem is that when there is some sort of stream of data being written to
the disk, the system won't let other processes access (read) data
from the disk, and that's why the system is "locking up".

It might be that the Linux kernel's I/O elevator does not
cooperate well with 3ware's card because the card has its own elevator...?

The maximum write performance I get is 38MB/sec, so writing only
1.37MB/sec should not be a problem, but it is. Maximum read
performance is 75MB/sec. I have tested these with bonnie++.

>It would be helpful to get this same data when a lock up occurs. 
>This can be done on the console if other logins are frozen.  How long
>does the lock up last?

There is no complete lock-out unless I make the system write to disk with
something like "dd if=/dev/zero of=tmpfile", and even then I can log in
with ssh if I just wait a couple of minutes. 

The problem is that the system is very unresponsive, for example when someone
tries to access his/her mailbox via the imap daemon, when loads go to somewhere
between 8 and 12, even though the system might have only 4MB/sec of writing going
on. 

I don't know if the Linux kernel's scheduler should act differently
when the system has a 3ware card...

>Also, it looks like you have one device on the system, which is sda. 
>What is the physical make-up of this device?  Is it raid0, raid1 or
>raid5 using the 3Ware card?  How many actual spindles make up the
>array?  Is the device used for the heavy IO the same device as your
>root disk?  Can you run a df -k?

Yep, the disk-array includes also the root disk.

raid-5, three seagate barracuda 7200rpm sata-disks

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             30233896   2924736  25773348  11% /
/dev/sda3            131033184   9402556 114974500   8% /home
none                   1034296         0   1034296   0% /dev/shm
/dev/sda2            131033184   4434700 119942356   4% /var/spool/mail

>It would also be helpful if you were running one of the Red Hat
>supported kernels versus an upstream kernel.

RedHat's kernels have the same problems but before the update 4 of
RHEL3 the performance was worse with RH-kernels.

Comment 250 Red Hat Bugzilla 2005-03-09 13:39:14 EST
Hi Pasi,

Do you have a system where you can separate the root partition from
the RAID5 device?

I have a system that does not exhibit the system lock ups.  It is
using the 3Ware adapter with a RAID5 volume across 3 disks.  However,
my root partition is on a dedicated disk and not on the 3Ware RAID5
volume.  I'm going to set up another system with the root partition on
RAID5 and see if I can reproduce your problem.

Thanks,

Joe
Comment 251 Red Hat Bugzilla 2005-03-09 14:49:18 EST
Hi Joseph,

currently I don't have a system where I could try separating the root
partition, but maybe in the next few weeks I might have one...

Hmm, having the root partition on a dedicated disk might be some sort
of a solution, but it still wouldn't resolve the issue that when the
system is writing (like that 1.37MB/sec) to the raid-5 volume, almost
all reading is more or less blocked.

What model of 3ware RAID card do you have? This most problematic and
also most loaded system has a 9500S-4 card. Normally at daytime (maximum
concurrent) it has almost 50 webmail users and 100 imap/pop users. 

So if there are 300 incoming mails of 100kB each, the system
load will go over 15 and everything almost stops until it gets
its writing done. I haven't experienced this with normal software RAID.

85% of the time the system load hovers between 0.5 and 3, but there are no
processes that would take CPU time or do heavy disk I/O. :(

BR,

Pasi
Comment 253 Red Hat Bugzilla 2005-03-11 11:40:56 EST
Created attachment 111891
3Ware RAID1 versus RAID5
Comment 254 Red Hat Bugzilla 2005-03-11 11:43:43 EST
Created attachment 111892
RAW data for RAID1 versus RAID5 comparison
Comment 255 Red Hat Bugzilla 2005-03-11 11:48:04 EST
Hi Pasi,

I ran some comparisons of RAID5 versus RAID1 using the 3Ware adapter.
 The RAID5 volume was created across three SATA spindles.  The RAID1
volume was created on two spindles, which left one spindle free for a
hot spare.

I created two attachments.  The first is a pdf (attachment 111891),
which contains some graphs of the data.  The second (attachment 111892)
is the raw data.  

In your case, I believe that random writes is the most important data
point.  The RAID5 configuration was able to achieve 7.6Mb/s with one
thread, 1.1Mb/s with two threads and 1.1Mb/s with four threads.  

The RAID1 configuration used one less disk, but it was able to achieve
31Mb/s with 1 thread,  4.3Mb/s with two threads and 4.2Mb/s with four
threads.

In addition, note the difference in latency between raid5 and raid1. 
With four threads the latency for RAID5 is 174605 ms while the RAID1
latency is only 59045 ms, which is still high, but much better than
the RAID5 case.

If it's possible, I think it would be a good experiment for you to try
RAID1 for your data.  I also think you should separate your root
partition from your data partition.  However, this would require at
least four spindles if you want the root partition RAID1 protected,
which is suggested in a production environment.

I'm also running experiments with software RAID5.  I let the 3Ware
adapter present each of the three spindles to the OS as single drives.
I then used mdadm to create the RAID5 volume.  I will update the
bugzilla when that data is available.
Comment 256 Red Hat Bugzilla 2005-03-11 11:55:23 EST
Hi Pasi,

One more note, you may gain performance by running on RAID1 versus
RAID5.  However, you will have to sacrifice disk space in order to run
on RAID1.  So if you have two 36G drives, you will only have 36G of
storage available since the disks are mirrored.

Regards,

Joe
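
To put numbers on that trade-off, here is a small worked sketch. The function usable_gb is hypothetical and uses only the textbook formulas (RAID1 yields one disk's worth, RAID5 yields n-1 disks' worth); it ignores controller metadata and filesystem overhead.

```shell
#!/bin/sh
# usable_gb LEVEL NDISKS SIZE_GB: hypothetical helper, textbook capacity only.
usable_gb() {
    level=$1
    ndisks=$2
    size=$3
    case $level in
        0) echo $((ndisks * size)) ;;
        1) echo "$size" ;;                    # mirror: one disk's worth usable
        5) echo $(((ndisks - 1) * size)) ;;   # one disk's worth goes to parity
        *) echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

usable_gb 1 2 36   # two mirrored 36G drives   -> 36
usable_gb 5 3 36   # three 36G drives in RAID5 -> 72
```

So with 36G drives, moving from a three-disk RAID5 to a two-disk RAID1 halves the usable space while freeing one spindle for a hot spare or a separate root disk.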
Comment 257 Red Hat Bugzilla 2005-03-15 10:26:55 EST
Created attachment 112021
3Ware RAID versus Software RAID
Comment 258 Red Hat Bugzilla 2005-03-15 10:31:34 EST
Hi Pasi,

Some IO performance tests were run using an mdadm software raid5
volume.  The software RAID5 configuration was able to achieve about
11MB/s versus the 1.5MB/s for the 3Ware RAID5.  The tiotest benchmark
was used for this comparison.  The software RAID5 stripe was created
across the three SATA drives which were presented to the OS as
individual drives from the 3Ware adapter.  I created a new pdf
attachment(112021), which includes this new data for 1MB IOs.  I'm
also graphing other block sizes, which show the improvement with
software raid as well.   You may want to try experiments on your
configuration  using software raid as well as relocating the root
partition to a separate device.  Let me know if you would like a
sample script to create a software raid volume and I'll send it along.

Regards,

Joe
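
In case it helps others following along, a rough sketch of what such a script might look like is below. It is not the script offered above. The member devices /dev/sdb, /dev/sdc and /dev/sdd and the array name /dev/md0 are assumptions; substitute the single drives your 3Ware adapter exports. The run wrapper is a made-up dry-run guard: commands are only printed unless DRY_RUN=0 is set, because creating an array destroys existing data on the member disks.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}

# run: hypothetical wrapper; prints the command, executes it only if DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# create_sw_raid5 MD_DEVICE MEMBER...: build a software RAID5 array with
# mdadm from the given member disks, then put an ext3 filesystem on it.
create_sw_raid5() {
    md=$1
    shift
    run mdadm --create "$md" --level=5 --raid-devices="$#" "$@" &&
    run mkfs.ext3 "$md"
}

# Device names below are placeholders, not taken from this bug.
create_sw_raid5 /dev/md0 /dev/sdb /dev/sdc /dev/sdd
```

Run it once as-is to check the printed commands, then again with DRY_RUN=0 as root once the device names are right.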
Comment 259 Red Hat Bugzilla 2005-03-17 14:07:58 EST
Wow, thanks for compiling that Joseph. 

Those are some pretty damning numbers. I find the random writes for 3ware raid5
vs software raid5 especially disappointing considering what these cards cost. 

Any plans to send 3ware these numbers for comment? If not I'd sure like to...
Comment 260 Red Hat Bugzilla 2005-03-19 14:32:31 EST
Why would it matter if the root partition is raid5ed?
Comment 261 Red Hat Bugzilla 2005-03-21 10:34:49 EST
Processes create entries in the /proc filesystem as well as other places.  If
there is a high write latency on this device that would cause a delay in
creating these entries.  It actually doesn't matter if the device is raid5,
raid1 or a single disk.  However, if the device has a high latency during IO,
processes will appear to have a long delay when they need to access the root
partition.   So logging in for example creates a shell process like bash.  The
login would experience a long delay until /proc can be updated. 
Comment 262 Red Hat Bugzilla 2005-03-21 12:48:57 EST
/proc is actually a virtual filesystem so operations to that location may not be
the cause of the problem since there is no physical IO.  It may be that the
actual login, ls, etc. commands residing on the high latency device cause the
delays.  We will continue to investigate and update the bugzilla with new
information.
Comment 265 Red Hat Bugzilla 2005-03-29 16:58:05 EST
Restoring issue tracker ids mysteriously lost when comment #247 was added.
Comment 266 Red Hat Bugzilla 2005-04-08 20:55:07 EDT
Created attachment 112896
rstatd capture showing interrupt/context stalling


We too have a dual x86_64 AMD machine, Tyan S2882 motherboard, w/3ware 9500S-8
controller, showing this problem.  Machine is running RHEL3 w/U4,
2.4.21-27.0.2ELsmp, up2date says all patches applied (as of Apr5).

It's a terabyte RAID 1 production file server & the performance hit is most
annoying.  We frequently generate huge hardware simulation logs here.  Throw
in a few CVS checkouts & the server becomes virtually unusable.

Frequent 5 to 10 second non-responsive periods to queries (ie: 'ls' on a client
server, or hardware simulations stalling on compute servers).

Running the evil rstatd daemon, I captured something interesting.   Immediately
=before= a stall is a 100% CPU spike.	

During a stall...
  Interrupts go down to virtually zero (maybe 1% of normal).
  Page(in) (disk-cache) statistic goes to zero.
  Page(out) (disk-cache) still showing low-level activity.
  Disk activity is present through a stall (which makes sense, since page-out
is busy).
 =Network/packet traffic goes to zero=
 =Context-switching goes to zero=
  The load statistic however increases.

After a stall,
  Everything jumps back to life.
  Load statistic starts to decrease.

It's apparently interrupt related.  Might explain why disabling the APIC during
boot helps somewhat (by changing the dynamics).  Perhaps the 3ware driver uses
an "innocuous" kernel resource & inadvertently hits a spinlock?  Maybe logging
spinlock activity could narrow things down.

Just food for thought...
   Ian Davis
Comment 267 Red Hat Bugzilla 2005-04-12 08:21:03 EDT
Hi

I don't mean to be one of those people who logs a message just saying "When the 
hell is this going to be fixed", but frankly that is the gist of this message.

It is embarrassing to sell an "enterprise" solution to a client to end up with 
this kind of abysmal performance.  I have recently discovered that it is not 
feasible to perform VACUUMing on a PostgreSQL database on RHEL3 (fully patched) 
with a 3Ware RAID card (7506) because the server hangs.  The combination of 
small reads and writes that go on when VACUUMing totally stalls the server.  
(Please try it - it's one of the easiest ways to replicate this problem that I 
know of).

I do understand that the 3ware card(s) were never on the HCL, but we were 
informed by RedHat staff in several posts on this thread that a fix is 
definitely forthcoming and hence we have several servers with them in.  Is 
RedHat going to release a solution to this problem or not?  Simply having that 
question answered would help a great deal.

I would also flag that this thread has been running for almost a complete 
year.  I do not consider this to be "enterprise" support.  A lot of good money 
is being paid for an otherwise great operating system.  I would appreciate an 
ETA on this being properly fixed (or a confession that you are not going to 
bother).

Finally, I am sure that it is frustrating for you RedHat developers to receive 
negative posts asking what the hell is going on - especially when the problem 
is as complex as this one appears to be.  Apologies for that.  However, please 
consider us poor end-users who are not receiving decent feedback and have got 
heart-attack-inducingly slow servers for several hundred pounds per year - now 
THAT is frustrating.

Regards
Comment 268 Red Hat Bugzilla 2005-04-12 14:12:42 EDT
Has anyone seen these symptoms with the RHEL4 2.6.9 kernels? We're considering 
scheduling downtime to upgrade our fileserver with the 3ware card, and the 
issues discussed here, from RHEL3 to RHEL4, but it's a pointless exercise if 
the same issue is still present. 
 
Thanks, 
Doug 
Comment 269 Red Hat Bugzilla 2005-04-13 10:55:32 EDT
*** Bug 154441 has been marked as a duplicate of this bug. ***
Comment 270 Red Hat Bugzilla 2005-04-18 07:49:09 EDT
Andrew, Doug,

As you know this BZ became a catch-all for performance issues in RHEL 3. There
are a number of different problems mixed in here. When we set out to fix these
problems we were able to reproduce several of them, and we proceeded to fix
those problems in U3 and U4. Although this BZ has been open for a long time,
many of the issues reported here have been fixed. 

It has become clear that there remains a 3ware-specific problem that is still
not fixed. The problem has persisted because, until recently, we have not been
able to reproduce it. We tried a number of configurations, compared with other
kernels, asked for more details, but were not able to reproduce the problem.
Recently Joe Sailsbury continued to look for the problem by doing some 3ware
performance tests. (See recent posts, especially Comment #243, showing poor
performance, but no hangs.) Following this, and the details provided by Pasi,
Joe was able to reproduce the 3ware hangs. We are looking at the reason for
this, and the cause of the problem now. I expect we will be able to identify the
cause in a few weeks. 

Doug, From experience so far, we need to be cautious about assuming that the
problem we have reproduced is the same as yours. With that in mind, though, Joe
reports that the problem he has with 3ware on RHEL 3 does not occur on RHEL 4.
There are some pauses on RHEL 4, but they are much more in the "normal" range. 

Yes, we are still working on this bug.

Tom 

 


Comment 271 Red Hat Bugzilla 2005-04-18 10:05:36 EDT
Tom

Thank you very much for the comprehensive feedback.  Much appreciated.

Kind regards

Comment 272 Red Hat Bugzilla 2005-04-20 10:06:18 EDT
We have two low-level Linux servers for staging purposes with kernel version 
2.4.21-smp at our company, one with RHEL3 and another with SuSE 9.0 
(2.4.21-99-smp4G). Both have the same configuration: P4 2.8GHz with HT and one 
IDE disk (without any RAID controllers). I can report strange behavior: 
only the SuSE system is affected by this bug. I've succeeded in preventing 
system "hang up" using tunings listed below, but I can't totally escape the 
performance slowdown (while copying, after the first 300 MB, file 
copying slows down extremely).
Comment 273 Red Hat Bugzilla 2005-04-27 07:15:11 EDT
"Has anyone seen these symptoms with the RHEL4 2.6.9 kernels? We're considering 
scheduling downtime to upgrade our fileserver with the 3ware card, and the 
issues discussed here, from RHEL3 to RHEL4, but it's a pointless exercise if 
the same issue is still present. "

Yes. We see the same symptoms on a RHEL 4 ES machine. It doesn't look like the
problem has been solved in RHEL 4's 2.6 kernel.  

The machine is a Dual Xeon with a 3ware 7500-8 controller.
We still use an old firmware version, so we will try to upgrade the firmware in 
the near future. 
Comment 274 Red Hat Bugzilla 2005-04-29 02:45:12 EDT
I have some RHEL4 boxes I can play with to see if I hit any of the problems 
seen here.  What is currently the most reliable way to trigger it? 
 
My boxes are: 
dual Xeon EM64T (x86_64 kernel), hyperthreading on 
4GB RAM 
1x or 2x 3ware 9500-12 
Hitachi or Seagate SATA disks, 400GB each 
root partition on a LV on an md device, NOT on 3ware 
 
Right now I see what seems like poor serial write performance (1 thread) on an 
11-disk hardware RAID array (~4TB).  I'm running a 2.6.9-5.0.3.ELsmp kernel 
modified to enable the xfs filesystem (simply changing the kernel 
CONFIG_XFS_FS).  Right now my filesystems are xfs. 
 
I should be able to change various things if it would be helpful: 
 
change to the official RHEL kernel 
change to ext3 
change to mdadm RAID 
turn off hyperthreading 
other? 
 
My slow sequential-write test is: 
 
mount info: 
/dev/sda on /mnt/raid1 type xfs (rw,usrquota,grpquota) 
/mnt/raid1/scratch1 on /scratch1 type xfs 
(rw,bind,noatime,usrquota,grpquota,osyncisdsync) 
 
command: 
sync ; time dd if=/dev/zero of=/scratch1/8G-2 bs=1M count=8192 ; time sync 
 
Running iostat -x 5 in another window shows during the bulk of this test 
something like: 
 
avg-cpu:  %user   %nice    %sys %iowait   %idle 
           0.00    0.00    0.45   49.65   49.90 
 
Device:    rrqm/s   wrqm/s   r/s    w/s rsec/s    wsec/s rkB/s    wkB/s avgrq-sz avgqu-sz    await svctm  %util
sda          0.00  3826.85  0.00 106.81   0.00  31602.40  0.00 15801.20   295.86  8275.85  1123.43  9.38 100.22
 
100.22% utilization?  Hmmm...  Yes it's typical, and seems anomalous, but is 
probably irrelevant.  The request and queue sizes, average wait, and service 
times show here are also typical, as is the written kB/s. 
 
During the final sync, the queue drains slowly, and the reported await times 
go into the 100s of seconds (apparently the early requests only get serviced 
as the queue drains).  The kB/s written sits at zero during the sync, and the 
writes per second goes up around 250 (it sits at ~100 during the dd). 
 
As I say, I'd welcome test suggestions that are likely to unearth new 
information.  On my own, I plan to make my /scratch2 (an identical array on a 
second 3ware controller) into a software raid volume.  Personally, I plan to 
use xfs, but I'm willing to test ext3 if that would help others. 
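
As a sanity check on the iostat figures above, here is a small worked sketch. It assumes the usual 512-byte sector behind iostat's rsec/s and wsec/s columns; the helper names are made up. It converts wsec/s to kB/s and estimates how long the 8192 MB dd would take at that steady rate.

```shell
#!/bin/sh
# sectors_to_kb: hypothetical helper; 512-byte sectors -> kB (two sectors per kB).
sectors_to_kb() {
    awk -v s="$1" 'BEGIN { printf "%.2f\n", s * 512 / 1024 }'
}

# seconds_for_mb SIZE_MB RATE_KBS: time to write SIZE_MB at RATE_KBS,
# rounded to whole seconds.
seconds_for_mb() {
    awk -v mb="$1" -v kbs="$2" 'BEGIN { printf "%.0f\n", mb * 1024 / kbs }'
}

sectors_to_kb 31602.40          # wsec/s from the sample above -> 15801.20
seconds_for_mb 8192 15801.20    # the 8192 x 1M dd at that rate -> 531
```

The first result matches the wkB/s column, and 531 seconds is just under nine minutes for the dd alone.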
 
Comment 275 Red Hat Bugzilla 2005-04-29 03:12:01 EDT
I should add: 
 
I am not seeing interactive response problems with this simple dd/sync test, 
just low bandwidth.  ~15MB/s single-thread sequential write on an 11-spindle 
hardware RAID5 array?  Wow, that's horrible, and probably intolerable in our 
application.  That is what is motivating me to try mdadm RAID instead of 3ware 
RAID.  Any other suggestions for speeding up writes? 
 
Read performance of this 8GB file of zeros is around 100MB/s. 
 
One series of my 'iostat -x 5' outputs showed the following sequence of kB/s 
written (5-second intervals): 12620 11764 15769 15276 74726 18713 12571 12570 
12519.  In other words, a huge spike in write rate.  It's the only such spike 
I saw in an 8.5 minute dd/sync run.  The nearby records are in full: 
 
avg-cpu:  %user   %nice    %sys %iowait   %idle 
           0.05    0.00    2.25   48.22   49.47 
 
Device:    rrqm/s   wrqm/s   r/s    w/s rsec/s    wsec/s rkB/s    wkB/s avgrq-sz avgqu-sz    await svctm  %util
sda          0.00  3699.40  0.00 129.40   0.00  30552.00  0.00 15276.00   236.11  8262.12 32407.72  7.73 100.02
 
 
avg-cpu:  %user   %nice    %sys %iowait   %idle 
           0.05    0.00    6.55   27.95   65.45 
 
Device:    rrqm/s   wrqm/s   r/s    w/s rsec/s    wsec/s rkB/s    wkB/s avgrq-sz avgqu-sz    await svctm  %util
sda          0.00 18097.60  0.00 592.60   0.00 149452.80  0.00 74726.40   252.20  8224.04 14372.21  1.69 100.02
 
 
avg-cpu:  %user   %nice    %sys %iowait   %idle 
           0.00    0.00    0.65   69.68   29.66 
 
Device:    rrqm/s   wrqm/s   r/s    w/s rsec/s    wsec/s rkB/s    wkB/s avgrq-sz avgqu-sz    await svctm  %util
sda          0.00  4531.80  0.00 144.40   0.00  37427.20  0.00 18713.60   259.19  8243.19  1003.57  6.93 100.02
 
 
avg-cpu:  %user   %nice    %sys %iowait   %idle 
           0.05    0.00    0.45   64.50   35.00 
 
Device:    rrqm/s   wrqm/s   r/s    w/s rsec/s    wsec/s rkB/s    wkB/s avgrq-sz avgqu-sz    await svctm  %util
sda          0.00  3044.40  0.00  90.20   0.00  25142.60  0.00 12571.30   278.74  8246.19  1498.60 11.09 100.02
 
 
Comparisons of the relevant metrics show that during the spike: 
 
* write requests merged per second shot up (as did the other write metrics, 
including writes per second) 
* average request size stayed the same 
* average queue size stayed the same 
* average wait time shot up in the record **preceding** the spike in write 
output, and stayed relatively high during the spike -- indicating some 
long-ago-queued requests were finally serviced -- this may be most directly 
related to the cause of the spike, since it preceded it 
* service time went way down 
* cpu utilization stayed at 100% 
 
Recall from my preceding BZ entry that during the final sync, while the queue 
was slowly draining, the average wait times go way, way up, as does the # of 
writes per second -- these circumstances seem similar to the spike I saw 
during the dd. 
 
Comment 277 Red Hat Bugzilla 2005-05-11 09:37:35 EDT
The 3Ware Escalade 8506-8 only has 2MB of SRAM (static RAM).
At that small amount of SRAM with lots of RAID-5 writes, expect massive delays
as the cache "overflows" and isn't able to keep up with the incoming data writes.

The Escalade 7000/8000 series are "storage switches" with 0 wait state SRAM
_not_ "buffering controllers" with high latency, but far more DRAM.
They switch with extremely low latencies using an ASIC + SRAM, like a high-end
Layer 3 Switch.
But that also means they have a very, very small amount of "costly" SRAM,
instead of [S]DRAM (SRAM should _not_ be confused with SDRAM).

That's why RAID-0, RAID-1 and RAID-10 performance will be much better than RAID-5.
You have to tweak the kernel to stage writes so it's not waiting on the 3Ware
controller to finish XORs and writes.
The 3Ware controller will queue up a massive number of I/O operations, which is
great for RAID-0, RAID-1 and RAID-10, but not RAID-5 with its small SRAM.

Different kernels might tweak different parameters that will affect RAID-5
operations on 3Ware Escalade 7000/8000 series controllers.
Finding a good set (3Ware's site has excellent recommendations) and putting
those in your /etc/rc.d/rc.local or similar is _crucial_ for a production server
to minimize any change in defaults with any new kernel.

The 3Ware Escalade 9000 series adds DRAM buffering to the existing ASIC+SRAM design.
Although the drivers are still maturing from what I've seen.

Unless you absolutely need the maximum storage, or your primary application is
lots and lots of reads (e.g., a MySQL server for web content),
use RAID-10 on 3Ware 7000/8000 series cards.
Comment 278 Red Hat Bugzilla 2005-05-12 22:11:53 EDT
I can't benchmark it since it's a live system, but the following helped greatly
with the iowait problem under FC2:

echo 512 > /sys/block/sda/queue/nr_requests

It was suggested on LKML that nr_requests be double queue_depth. I get much more
even performance now.

I also use:
blockdev --setra 16384 /dev/sda
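
A sketch of how the nr_requests rule of thumb could be scripted is below. It assumes a 2.6 kernel with sysfs and that the device exposes device/queue_depth under its /sys/block entry. The function name and the optional sysfs-root argument are made up for illustration (the second argument just lets you try it against a scratch directory), and writing to the real sysfs needs root.

```shell
#!/bin/sh
# set_nr_requests DEV [SYSROOT]: hypothetical helper; writes 2 x queue_depth
# into the block queue's nr_requests and prints the value now in effect.
set_nr_requests() {
    dev=$1
    sysroot=${2:-/sys}
    depth_file="$sysroot/block/$dev/device/queue_depth"
    req_file="$sysroot/block/$dev/queue/nr_requests"
    if [ ! -r "$depth_file" ]; then
        echo "no queue_depth for $dev" >&2
        return 1
    fi
    depth=$(cat "$depth_file")
    echo $((depth * 2)) > "$req_file" || return 1
    cat "$req_file"
}

# Example (as root):
#   set_nr_requests sda
```

The setting does not persist across reboots, so it would have to go in rc.local or similar.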
Comment 280 Red Hat Bugzilla 2005-05-20 07:39:00 EDT
Regarding comment #259:

We are now running with the latest firmware on the 3ware controller. It did not
help at all. 

We then changed the io elevator from "anticipatory" to "deadline". When running
bonnie we have pretty good read performance (above 90000 kB/s) on intelligent
reads, but write performance still sucks (around 20000 kB/s) compared to 3ware's
own benchmarks ( http://www.3ware.com/LinuxWP_0701.pdf ).

I then tried to increase the readahead with blockdev --setra 16384. Now the
read performance sucked as well (around 3000 kB/s). I decreased the readahead to
8192, 4096, 2048 ... etc. down to 128. After each decrease I ran some bonnie tests.
Only when the ra was 128 did I get good "intelligent read" performance.
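
The readahead sweep described above could be scripted roughly as follows. The device /dev/sda is an assumption and the halving sequence is inferred from the values listed; the function name is made up, and the benchmark step is left as a comment because the exact bonnie command line is not given here.

```shell
#!/bin/sh
# ra_values START STOP: hypothetical helper; prints START, START/2, ...
# for as long as the value is still >= STOP.
ra_values() {
    ra=$1
    while [ "$ra" -ge "$2" ]; do
        echo "$ra"
        ra=$((ra / 2))
    done
}

# Sweep (as root); blockdev --setra takes a count of 512-byte sectors.
# for ra in $(ra_values 16384 128); do
#     blockdev --setra "$ra" /dev/sda
#     blockdev --getra /dev/sda
#     # run the benchmark here and record the result next to $ra
# done

ra_values 16384 128
```

Recording the result next to each readahead value makes it easy to see where read throughput recovers.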

While running bonnie, the system becomes very unresponsive.

I have also tried the suggestions in comment #264 without any change in
responsiveness.

The specs for this machine are:

Dual Xeon 2.4 GHz. 
Tyan S2723 motherboard (Intel E7501)
1024 MB RAM
3ware 7500-8 with 7 120 GB disks in hardware RAID 5 and 1 hot spare.
LVM on top of RAID5
ext3 file systems 
Upgraded from RH 9 -> RHEL 3 -> RHEL 4

We are experiencing the same behaviour (unresponsiveness) on another RHEL 4 
server. 
The specs for this one are:
Single Xeon 2.66 Ghz
2048 MB RAM
Intel SE7501CW2 motherboard
2 200 GB ATA disks in software raid 1
LVM on top of raid 1.
ext3 file systems
Fresh install of RHEL 4. (Not upgrade from RHEL 3)

The only hardware in common between the 2 machines is the E7501 chipset. From
this, it seems that there are 4 possible sources for the unresponsiveness:

1: The driver for the E7501 might be broken (I don't think so)
2: LVM is broken (possible)
3: Ext3 (I don't think so)
4: The bug is somewhere in the kernel io handling (seems possible).

I can only agree that it seems that there are 2 separate problems: a performance
problem on 3ware raid controllers, and an unresponsiveness problem, not related
to 3ware raid controllers.

Kristian Sørensen
Comment 281 Red Hat Bugzilla 2005-05-30 15:30:27 EDT
Hi

I believe we have just observed the same problem in our new server:
P4 3.0 GHz Intel- hyperthreading on.
1Gb Ram
3ware 7000 series controller (SATA) with 8x250Gb Maxtor SATA drives
RHEL 3.0 (installed via ROCKS 3.3.0 Makalu)  

I have had to rebuild the RAID hardware once (it was believed to be a disk failure). However, now when I try
to run mkfs.ext3 on the /dev/sda2 partition, the entire system freezes at about the half-way point (I/O wait at
199.9% during the entire process). The strange thing is that I can run mkfs.ext3 on /dev/sda1 successfully,
although again the I/O wait is at its maximum for this process as well. If anyone has suggestions or ideas for me
to try to fix this, I would really appreciate it.

Thanks!
Darren
Comment 286 Red Hat Bugzilla 2005-07-21 14:19:13 EDT
Hello -

(In reply to comment #256)
> Andrew, Doug,
> 
> As you know this BZ became a catch-all for performance issues in RHEL 3. ...
> It has become clear that there remains a 3ware-specific problem that is still
> not fixed. ...
> Yes, we are still working on this bug.

Is there any further progress on this?  We have the same problem (FC2 2.6.6
kernel w/SNARE on a dual opteron system with the 3ware 8506 card).  A (perhaps)
useful data point is that we have the same OS running on other dual opteron
systems but with the 9000 series cards and none of them have the problem.  At
this point, I think we're going to just upgrade the 8506 card to a 9000 series
one so that we stop having problems.

Debbie

PS  Thru our vendor a 3ware engineer has been looking at this for us, but has
not yet been able to replicate the problem (sound familiar :-).
Comment 287 Red Hat Bugzilla 2005-07-26 06:25:59 EDT
You might be interested in this test of 9 SATA RAID5 adapters:

http://www.tweakers.net/reviews/557/23

3Ware has scored absolutely the lowest in server performance, while the Areca
adapters have excelled over all the others.
Comment 288 Red Hat Bugzilla 2005-07-26 10:41:05 EDT
Here are interesting results WRT tweaking kernel 2.6 on 3Ware hardware, obtained
by Gaspar Bakos (posted here with permission from the author):

-------- Original Message --------
Subject: 	3ware + RAID5 under 2.6.*
Date: 	Mon, 25 Jul 2005 21:16:25 -0400 (EDT)
From: 	Gaspar Bakos <gbakos@cfa.harvard.edu>
Reply-To: 	gbakos@cfa.harvard.edu
To: aleksander.adamowski.redhat@altkom.pl

Dear all,

I am forwarding this report I sent to the FC and RAID lists, because at
one time you were involved in the "3ware raid under linux" issue, so
you may be interested, or have ideas.

Apologies for mass emailing.

Cheers
Gaspar

-------------

The purpose of this email is twofold:
- to share the results of the many tests I performed with
  a 3ware RAID card + RAID-5 + XFS, pushing for better file I/O,
- and to initiate some brainstorming on what parameters can be tuned for
  getting a good performance out of this hardware under 2.6.* kernels.

I started all these tests because the performance was quite poor, meaning
that the write speed was slow, the read speed was barely acceptable, and the
system load went very high (10.0) during bonnie++ tests.
My questions are marked below with "Q".

1.
There are many useful links related to the 3ware card and related anomalies.
The bugzilla page:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121434
contains some 260 comments. It is mostly 2.4 kernel and RHEL specific.

2.
A newer description of the problem can be found in the thread:
http://lkml.org/lkml/2005/4/20/110
http://openlab-debugging.web.cern.ch/openlab-debugging/raid/
by Andreas Hirstius.
There was a nasty fls() bug, which was eliminated recently; the fix improved
performance and stability.

3.
There are recommendations by 3ware, which can be
summarized in one line: "blockdev --setra 16384".
http://www.3ware.com/reference/techlibrary.asp
"Maximum Performance for Linux Kernel 2.6 Combined with XFS File System",
which actually leads to a PDF that has a different title:
"Benchmarking the 9000 controller with linux 2.6".


Q: Any other useful links?

Briefly, the hardware setup I use
=================================
- Tyan S2882 Thunder K8S Pro motherboard
- Dual AMD opteron CPUs
- 4 GB RAM
- 3ware 9500-8S 8 port serial ATA controller
- 8 x 300GB ST3300831AS SATA Seagate disks in hardware RAID-5
More details at the end of this email.

OS/setup
=======
- Redhat FC3, first with 2.6.9-1.667smp kernel, then with all the upgrades,
  and finally a self-compiled 2.6.12.3 x86_64 kernel
- XFS filesystem
- Raid strip size = 64k, write-cache enabled
Kernel config attached.

==========================================================================
Tuneable parameters
====================
1. Kernel itself. I tried 2.6.9-1.667smp, 2.6.11-1.14_FC3smp, and 2.6.12.3
(self-compiled)

	1.a Kernel config (NUMA system, etc.)

2. Raid setup on the card.
	- Write-cache enabled? (I use "YES")
	- Raid strip size
	- firmware, bios, etc. on the card
	- staggered spinup (I use "YES", but the drives may not support it.
	  I always "warm up" the unit before the tests.)

3. 3ware driver version
- 3w-9xxx_2.26.02.002 the older version in the kernels
- 3w-9xxx_2.26.03.015fw from the 3ware website, containing the firmware as
  well.

4. Run-time kernel parameters (my device is /dev/sde):

	4.a
		/sys/class/scsi_host/host6/
		cmd_per_lun
		can_queue

	4.b
		/sys/block/sde/queue/, e.g.
		iosched            max_sectors_kb  read_ahead_kb
		max_hw_sectors_kb  nr_requests     scheduler

	4.c
		/sys/block/sde/device/ e.g.
		queue_depth

	4.d Other params from the 2.4 kernel, if they have an alternative in
		2.6:

		/proc/sys/vm/max-readahead

	Q: Anything else?

5. blockdev --setra
	This possibly belongs with the points mentioned under 4.

6. For non-raw IO (i.e. anything other than dd), the XFS filesystem parameters.

7. Q: Any crucial parameter I am missing?
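
When comparing runs it helps to record the run-time values from item 4 together with each result. The helper below is a rough sketch that prints the sysfs files named in 4.a to 4.c. The function name show_tunables and the optional third argument (an alternative sysfs root, useful for dry runs) are assumptions made for illustration; files that a given kernel or driver does not expose are reported as not available.

```shell
# Print "path = value" for the tunables listed under 4.a, 4.b and 4.c.
# $1 = block device name (e.g. sde), $2 = SCSI host (e.g. host6),
# $3 = optional sysfs root, defaults to /sys.
show_tunables() {
    dev=$1; host=$2; root=${3:-/sys}
    for f in \
        "class/scsi_host/$host/cmd_per_lun" \
        "class/scsi_host/$host/can_queue" \
        "block/$dev/queue/nr_requests" \
        "block/$dev/queue/read_ahead_kb" \
        "block/$dev/queue/max_sectors_kb" \
        "block/$dev/queue/max_hw_sectors_kb" \
        "block/$dev/queue/scheduler" \
        "block/$dev/device/queue_depth"
    do
        if [ -r "$root/$f" ]; then
            echo "$f = $(cat "$root/$f")"
        else
            echo "$f = (not available)"
        fi
    done
}

# Usage: show_tunables sde host6
```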

==========================================================================
Tests
=====
I changed the following during the tests. It is not an orthogonal set of
parameters, and I did not try everything with every combination.

- kernel
- raid strip size: 64K and 256K
- 3ware driver and firmware
- /sys/block/sde/queue/nr_requests
- blockdev --setra xxx /dev/sde
- XFS filesystem parameters

I used 5 bonnie++ commands to test not only simple IO, but also combined
filesystem performance:

MOUNT=/mnt/3w1/un0
SIZE=20480
echo "Bonnie test for IO performance"
sync; time bonnie++ -m cfhat5 -n 0   -u 0 -r 4092 -s $SIZE -f -b -d $MOUNT
echo "Testing with zero size files"
sync; time bonnie++ -m cfhat5 -n 50:0:0:50 -u 0 -r 4092 -s 0 -b -d $MOUNT
echo "Testing with tiny files"
sync; time bonnie++ -m cfhat5 -n 20:10:1:20 -u 0 -r 4092 -s 0 -b -d $MOUNT
echo "Testing with 100Kb to 1Mb files"
sync; time bonnie++ -m cfhat5 -n 10:1000000:100000:10 -u 0 -r 4092 -s 0 -b -d $MOUNT
echo "Testing with 16Mb size files"
sync; time bonnie++ -m cfhat5 -n 1:17000000:17000000:10 -u 0 -r 4092 -s 0 -b -d $MOUNT

==========================================================================
System information during the tests
===================================
This is just to make sure the system is behaving OK, and to catch some
errors. Done only outside the recorded tests, so as not to affect the
results.

1. top, or cat /proc/loadavg
to see the load

2. iostat, iostat -x

3. vmstat

4. ps -eaf
If the system behaves strangely, as if locked up.

Q: Anything else recommended that can be useful to check healthy system
behaviour?
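
The "wa" figure in top and %iowait in iostat come from the CPU counters in /proc/stat, so the same number can be obtained without either tool by sampling the aggregate "cpu" line twice. The sketch below assumes the usual field order after the label (user, nice, system, idle, iowait, irq, softirq, steal); the function name iowait_pct is made up for illustration.

```shell
# Percentage of CPU time spent in iowait between two "cpu ..." lines
# sampled from /proc/stat. Fields after the label are, in order:
# user nice system idle iowait irq softirq steal (later fields are ignored).
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) prev[i] = $i; next }
        {
            last = (NF < 9) ? NF : 9
            total = 0
            for (i = 2; i <= last; i++) total += $i - prev[i]
            if (total > 0) printf "%.1f\n", 100 * ($6 - prev[6]) / total
        }'
}

# Usage on a live system:
#   s1=$(grep "^cpu " /proc/stat); sleep 5; s2=$(grep "^cpu " /proc/stat)
#   iowait_pct "$s1" "$s2"
```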

==========================================================================
Other testing tools?
====================

1.  iozone
mentioning an Excel table in the man page made me uncertain whether
to try it...

2. dd
for raw IO.

Q: What else?

==========================================================================
Conclusions in a nutshell
=========================
1. With any of the kernels below 2.6.12.3, on the ___ x86_64 ___
architecture, the performance is poor. The load becomes huge, the system
becomes unresponsive, and kswapd0 and kswapd1 sit at the top of the "top" output.

2. "blockdev --setra 16384" does almost nothing other than increase the
read speed from the disks, at the cost of much more CPU time. The write
and re-write speeds do not change considerably. It is not really a
solution when a system runs hw RAID on an expensive card precisely so as
to save CPU cycles for other tasks. (Otherwise we could use sw RAID-5 on JBOD,
which is simply much faster, with more CPU usage.)

3. The best I got during normal operation (no kswapd anomaly and no
unresponsive system) was about 80MB/s write, 40MB/s rewrite and 350MB/s
read. However, this was with "blockdev --setra 4092" and 43% CPU usage.
I would rather quote a more conservative 180MB/s read at setra 256 and 20% CPU.

4. Migration from a 64kB to a 256kB stripe size on a 2TB array would take
forever. I ran tests during the migration, and the performance is really bad,
regardless of how the IO priority is set in the 3ware interface:
50MB/s write, 8MB/s rewrite (!) and 12MB/s read.

As I had no data to lose yet, it was much faster to reboot, delete the
unit, create a new one with a 256kB stripe size, and initialize it.

5. The performance of the 3ware card seemed worse with the 256k strip size.
Write: 68MB/s, rewrite: 21MB/s, read: 60MB/s.

6. Changing /sys/block/sde/queue/nr_requests from 128 to 512 gives a moderate
improvement. Going to higher numbers, such as 1024, does not improve things
any further.
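
The throughput figures quoted in these conclusions are rounded from the K/sec columns that bonnie++ prints in the test results (for example 345730 K/sec block read with setra 4092, and 183637 K/sec with setra 256). As a rough aid for reading those tables, the conversion can be sketched as below. It assumes that bonnie++'s "K" means 1024 bytes, and the function name kps_to_mibps is made up for illustration.

```shell
# Convert a bonnie++ K/sec figure to whole MiB/s (assuming K = 1024 bytes).
kps_to_mibps() {
    awk -v k="$1" 'BEGIN { printf "%.0f\n", k / 1024 }'
}

# Usage: kps_to_mibps 345730
```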

==========================================================================
QUESTIONS:
=========

Q: Where is there useful information on how to tune the various /sys/*
   parameters?
   What are the recommended values for a 2TB array running on a 3ware card?
   What is the relation between these parameters?

   Notably: nr_requests, can_queue, cmd_per_lun, max-readahead, etc.

Q: Are there any benchmarks showing better (re)write performance on an
   eight-disk SATA RAID-5 with similar capacity (2TB)?

Q: (mostly to 3ware/amcc inc.) Why is the 256K strip size so inefficient
    compared to the 64k?

==========================================================================
TEST RESULTS
============

	---------------------------------------------------------------------------
	TEST 2.1
	--------
	raid strip size = 64k
	blockdev --setra 256 /dev/sde
	/sys/block/sde/queue/nr_requests = 128
	mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2

	xfs_info /mnt/3w1/un0/
	meta-data=/mnt/3w1/un0           isize=1024   agcount=32, agsize=16021136 blks
	         =                       sectsz=512
	data     =                       bsize=4096   blocks=512676288, imaxpct=25
	         =                       sunit=16     swidth=112 blks, unwritten=1
	naming   =version 2              bsize=4096
	log      =internal               bsize=4096   blocks=32768, version=2
	         =                       sectsz=512   sunit=16 blks
	realtime =none                   extsz=65536  blocks=0, rtextents=0
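
The sunit and swidth values reported by xfs_info follow from the mkfs.xfs options: su is the stripe unit in bytes, sw is the number of data disks (7 for an 8-disk RAID-5), and xfs_info reports both in filesystem blocks. The arithmetic can be sketched as below; the function name xfs_geometry is made up for illustration.

```shell
# Compute sunit/swidth in filesystem blocks, as xfs_info reports them.
# $1 = su in KiB, $2 = sw (number of data disks), $3 = block size in bytes.
xfs_geometry() {
    sunit=$(( $1 * 1024 / $3 ))
    echo "sunit=$sunit swidth=$((sunit * $2))"
}

# Usage: xfs_geometry 64 7 4096
```

With su=64k, sw=7 and 4k blocks this gives sunit=16 and swidth=112, and with su=256k it gives sunit=64 and swidth=448, matching the xfs_info output in these tests.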

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	            100/100   577   5 +++++ +++   914   5   763   6 +++++ +++    97   0
	real	24m32.187s
	user	0m0.365s
	sys	0m32.705s

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max            /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	       100:10:0/100   125   2 103182 100   824   7   127   2 84106  99    82   1
	real	49m47.104s
	user	0m0.494s
	sys	1m5.833s

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10    42   5    75   5   685  11    41   5    24   1   212   4
	real	18m29.176s
	user	0m0.240s
	sys	0m45.138s

	16Mb files:
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000     4  14     7  14   461  39     4  15     5  10   562  43

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000     3  14     7  14   522  40     4  14     6  11   493  39
	real	13m43.331s
	user	0m0.455s
	sys	1m53.656s

	-----------------------------------------------------------------------------
	TEST 2.2
	--------
	-> change inode size

	Strip size 64Kb
	blockdev --setra 256 /dev/sde
	/sys/block/sde/queue/nr_requests = 128
	mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=2k -l version=2 /dev/sde1

	meta-data=/dev/sde1              isize=2048   agcount=32, agsize=16021136 blks
	         =                       sectsz=512
	data     =                       bsize=4096   blocks=512676288, imaxpct=25
	         =                       sunit=16     swidth=112 blks, unwritten=1
	naming   =version 2              bsize=4096
	log      =internal log           bsize=4096   blocks=32768, version=2
	         =                       sectsz=512   sunit=16 blks
	realtime =none                   extsz=65536  blocks=0, rtextents=0

	Disk IO
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G 57019  97 75887  16 47033  10 35907  61 192411  22 311.6   0

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50   655   6 +++++ +++   944   5   717   6 +++++ +++   112   0
	real	10m58.033s
	user	0m0.182s
	sys	0m16.954s

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20   111   2 +++++ +++   805   7   107   2 +++++ +++   126   1
	real	9m23.056s
	user	0m0.105s
	sys	0m12.835s

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10    44   5   221  13   504   7    43   5    22   1   164   2
	real	17m25.308s
	user	0m0.207s
	sys	0m42.914s

	==> Seq. read speed increased to 3x, seq. delete decreased

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10  4  14    10  20   450  34     4  14     5   9   419  34
	real	13m24.856s
	user	0m0.483s
	sys	1m53.478s

	==> Delete speed decreased. Seq. read speed somewhat increased.
	==> No significant difference compared to smaller inode size.

	-----------------------------------------------------------------------------
	TEST 2.3
	--------
	Tests done while migrating from Stripe 64kB to Stripe 256kB.
	/sys/block/sde/queue/nr_requests = 128
	blockdev --setra 256 /dev/sde
	Extremely slow.

	Bonnie test for IO performance
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G           53072  11  8848   1           12039   1 139.3   0

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50  289   3 +++++ +++   603   3   444   4 +++++ +++    77   0
	real	17m19.235s
	user	0m0.186s
	sys	0m17.566s

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20   86   1 +++++ +++   564   5    86   1 +++++ +++    90   0
	real	12m16.227s
	user	0m0.099s
	sys	0m12.125s

	Testing with 100Kb to 1Mb files
	Delete files in random order...done.
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10  29   3    13   0   466   6    25   3    11   0   125   2
	real	41m4.151s
	user	0m0.255s
	sys	0m42.095s

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10 2   9     2   5   273  20     2   8     1   3   258  19
	real	29m20.672s
	user	0m0.469s
	sys	1m49.345s

	===> Disk IO becomes extremely slow while the array is migrating strip size

	-----------------------------------------------------------------------------
	TEST 2.4
	--------
	Tests done with 256Kb RAID strip size
	blockdev --setra 256 /dev/sde
	/sys/block/sde/queue/nr_requests = 128
	mkfs.xfs -f -b size=4k -d su=256k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1

	meta-data=/dev/sde1              isize=1024   agcount=32, agsize=16021184 blks
	         =                       sectsz=512
	data     =                       bsize=4096   blocks=512676288, imaxpct=25
	         =                       sunit=64     swidth=448 blks, unwritten=1
	naming   =version 2              bsize=4096
	log      =internal log           bsize=4096   blocks=32768, version=2
	         =                       sectsz=512   sunit=64 blks
	realtime =none                   extsz=65536  blocks=0, rtextents=0

	top - 11:54:04 up 11:31,  2 users,  load average: 8.52, 7.56, 5.07
	Tasks: 104 total,   1 running, 102 sleeping,   1 stopped,   0 zombie
	Cpu(s):  0.3% us,  4.0% sy,  0.0% ni,  0.7% id, 94.5% wa,  0.0% hi,  0.5% si
	Mem:   4010956k total,  3988284k used,    22672k free,        0k buffers
	Swap:  7823576k total,      224k used,  7823352k free,  3789640k cached
	  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
	30821 root      18   0  8312  916  776 D  5.3  0.0   1:21.60 bonnie++
	  175 root      15   0     0    0    0 D  1.3  0.0   0:16.35 kswapd1
	  176 root      15   0     0    0    0 S  1.0  0.0   0:18.38 kswapd0

	Bonnie test for IO performance
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G           68990  14 21157   5           60837   7 250.2   0
	real	27m58.805s
	user	0m1.118s
	sys	1m58.749s

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50   255   3 +++++ +++   247   2   252   3 +++++ +++    61   0
	real	23m59.997s
	user	0m0.186s
	sys	0m26.721s

	==> Much slower than 64kb size with setra=256

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20   110   3 +++++ +++   243   3   112   3 +++++ +++    77   1
	real	11m57.399s
	user	0m0.100s
	sys	0m17.356s

	==> Much slower than 64kb size with setra=256

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10    36   5    77   5   232   4    40   5    35   2    92   2
	real	18m25.701s
	user	0m0.238s
	sys	0m45.724s

	==> Somewhat slower than 64kb size with setra=256

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10     4  15     3   6   227  18     3  14     2   4   155  13
	real	20m11.168s
	user	0m0.508s
	sys	1m55.892s

	==> Somewhat slower than 64kb size with setra=256

	==> Definitely inferior to the 64kb raid strip size

	------------------------------------------------------------------------------
	TEST 2.5
	--------
	raid strip size = 256K
	Change su to 64k
	blockdev --setra 256 /dev/sde
	/sys/block/sde/queue/nr_requests = 128
	mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1

	Bonnie test for IO performance
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G           72627  15 23325   5           63101   7 272.0   0
	real	25m56.324s
	user	0m1.097s
	sys	1m57.267s

	===> General IO was slightly faster with su=64k than su=256k

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50   788   7 +++++ +++   989   6   781   7 +++++ +++    93   0
	real	12m8.633s
	user	0m0.158s
	sys	0m16.578s


	===> Filesystem is much faster with su=64k

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20   135   2 +++++ +++   818   7   133   2 +++++ +++   145   1
	real	7m51.365s
	user	0m0.091s
	sys	0m12.182s

	===> Filesystem is somewhat faster with su=64k

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10    41   5    91   5   787  12    41   5    24   1   224   4
	real	18m6.138s
	user	0m0.243s
	sys	0m42.042s

	===> For larger files, it makes almost no difference whether we use su=64k or su=256k

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10     4  14     3   6   476  34     3  11     2   5   546  40
	real	19m37.665s
	user	0m0.548s
	sys	1m49.408s

	===> For larger files, it makes almost no difference whether we use su=64k or su=256k

	------------------------------------------------------------------------------
	TEST 2.6
	---------
	Tests done with 256Kb RAID strip size
	blockdev --setra 1024 /dev/sde
	/sys/block/sde/queue/nr_requests = 128
	mkfs.xfs -f -b size=4k -d su=256k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1

	meta-data=/dev/sde1              isize=1024   agcount=32, agsize=16021184 blks
	         =                       sectsz=512
	data     =                       bsize=4096   blocks=512676288, imaxpct=25
	         =                       sunit=64     swidth=448 blks, unwritten=1
	naming   =version 2              bsize=4096
	log      =internal log           bsize=4096   blocks=32768, version=2
	         =                       sectsz=512   sunit=64 blks
	realtime =none                   extsz=65536  blocks=0, rtextents=0

	Bonnie test for IO performance
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G           68794  14 26139   6           118452  14 255.5   0
	real	22m2.101s
	user	0m1.268s
	sys	1m58.232s

	=> Speed increased compared to TEST 2.4 (setra 256). CPU % didn't increase.

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50   253   3 +++++ +++   247   2   251   3 +++++ +++    60   0
	real	24m14.398s
	user	0m0.178s
	sys	0m27.186s

	=> No change compared to 2.4

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20   112   3 +++++ +++   241   3   109   3 +++++ +++    71   1
	real	12m21.663s
	user	0m0.089s
	sys	0m17.502s

	=> No change.

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10    39   5    90   5   237   4    37   5    32   1    82   1
	real	18m47.223s
	user	0m0.260s
	sys	0m45.430s

	=> No change.

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10 4  13     6  12   215  16     4  14     5   9   171  13
	real	14m21.865s
	user	0m0.474s
	sys	1m49.301s

	==> Improved.

	------------------------------------------------------------------------------
	TEST 2.6
	--------
	Back to raid-strip = 64k
	/sys/block/sde/queue/nr_requests = 128
	mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1
	blockdev --setra 256 /dev/sde

	top - 10:51:03 up  8:06,  3 users,  load average: 9.69, 4.18, 1.63
	Tasks: 128 total,   1 running, 127 sleeping,   0 stopped,   0 zombie
	Cpu(s):  0.2% us,  5.0% sy,  0.0% ni,  5.2% id, 88.5% wa,  0.0% hi,  1.2% si
	Mem:   4010956k total,  3987456k used,    23500k free,       52k buffers
	Swap:  7823576k total,      224k used,  7823352k free,  3677224k cached

	System stays responsive despite the giant load.

	  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
	 5757 root      18   0  8308  916  776 D  6.3  0.0   0:35.69 bonnie++
	  176 root      15   0     0    0    0 D  1.3  0.0   0:05.27 kswapd0
	  175 root      15   0     0    0    0 S  1.0  0.0   0:05.64 kswapd1

	Bonnie test for IO performance
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G           65322  14 46177  10           183637  21 293.2   0
	real	15m23.264s
	user	0m1.118s
	sys	1m58.544s

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50   701   6 +++++ +++   983   5   733   6 +++++ +++   111   0
	real	10m56.735s
	user	0m0.171s
	sys	0m15.877s

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20   109   2 +++++ +++   824   7   108   2 +++++ +++   147   1
	real	8m58.359s
	user	0m0.107s
	sys	0m12.546s

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10    45   5   214  13   642   9    45   5    22   1   211   3
	real	16m59.573s
	user	0m0.230s
	sys	0m42.618s

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10     4  13    11  20   467  32     4  13     5   9   416  30
	real	13m15.243s
	user	0m0.534s
	sys	1m47.777s

	------------------------------------------------------------------------------
	TEST 2.7
	---------
	Change setra:
	blockdev --setra 4092 /dev/sde
	raid-strip = 64k
	/sys/block/sde/queue/nr_requests = 128
	mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1

	[root@cfhat5 diskio]# iostat -x /dev/sde
	Linux 2.6.12.3-GB2 (cfhat5)     07/25/2005

	avg-cpu:  %user   %nice    %sys %iowait   %idle
	           0.29    0.04    1.00    4.88   93.80
	Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
	sde          0.04 903.28 19.74 44.03 4757.48 8632.40  2378.74  4316.20   209.94     7.73  121.17   1.96  12.51

	Bonnie test for IO performance
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G           66303  13 41254   9           345730  41 274.7   0
	real	15m21.055s
	user	0m1.114s
	sys	1m57.199s

	==> Write does not change. Rewrite decreases. Read increases.

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50   624   6 +++++ +++   904   5   727   6 +++++ +++   113   0
	real	10m59.528s
	user	0m0.189s
	sys	0m16.520s

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20   111   2 +++++ +++   798   7   102   2 +++++ +++   143   1
	real	9m12.536s
	user	0m0.120s
	sys	0m12.467s

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10  46   6   323  20   686  10    43   5    30   1   207   3
	real	14m42.960s
	user	0m0.262s
	sys	0m42.090s

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10     4  14    20  40   524  38     4  13    11  21   492  35
	real	10m42.784s
	user	0m0.453s
	sys	1m51.078s

	------------------------------------------------------------------------------
	TEST 2.8
	---------
	echo 512 > /sys/block/sde/queue/nr_requests
	raid-strip = 64k
	mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1
	blockdev --setra 4092 /dev/sde

	Bonnie test for IO performance
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G           78573  16 42444   9           353894  42 284.6   0
	real	14m14.938s
	user	0m1.213s
	sys	1m55.382s

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50   623   6 +++++ +++   894   5   739   6 +++++ +++   123   0
	real	10m25.379s
	user	0m0.186s
	sys	0m16.846s

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20   107   2 +++++ +++   835   7   100   1 +++++ +++   159   1
	real	9m7.268s
	user	0m0.104s
	sys	0m12.589s

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10    47   6   324  19   697  10    44   5    35   2   232   4
	real	13m41.706s
	user	0m0.234s
	sys	0m42.614s

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10     4  14    19  38   448  32     4  13    11  21   506  36
	real	10m40.404s
	user	0m0.469s
	sys	1m51.098s

	------------------------------------------------------------------------------
	TEST 2.9
	---------
	echo 1024 > /sys/block/sde/queue/nr_requests
	raid-strip = 64k
	mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1
	blockdev --setra 4092 /dev/sde

	Bonnie test for IO performance
	Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
	                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
	Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
	cfhat5          20G           79546  16 41227   9           351637  43 285.0   0
	real	14m26.609s
	user	0m1.136s
	sys	1m57.398s

	==> No improvement
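
To put a number on that, here is a small sketch that compares the Bonnie block-I/O figures of TEST 2.8 and TEST 2.9. The figures are copied from the two result lines; the helper name `percent_change` is just for illustration.

```python
# Bonnie block-I/O figures (K/sec) copied from the TEST 2.8 and TEST 2.9 runs.
test_2_8 = {"write": 78573, "rewrite": 42444, "read": 353894}
test_2_9 = {"write": 79546, "rewrite": 41227, "read": 351637}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {name: percent_change(test_2_8[name], test_2_9[name])
           for name in test_2_8}

for name, pct in changes.items():
    print(f"{name:8s} {pct:+.1f}%")
```

All three figures move by only a few percent between nr_requests = 512 and nr_requests = 1024, which is likely within run-to-run noise for a single benchmark pass.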

	Testing with zero size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	              50/50   616   5 +++++ +++   880   5   748   6 +++++ +++   123   0
	real	10m25.469s
	user	0m0.186s
	sys	0m16.723s

	Testing with tiny files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	         20:10:1/20    99   2 +++++ +++   779   7   104   2 +++++ +++   165   1
	real	9m12.385s
	user	0m0.111s
	sys	0m12.947s

	Testing with 100Kb to 1Mb files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	10:1000000:100000/10  47   6   316  20   616   9    47   6    36   2   248   4
	real	13m22.360s
	user	0m0.231s
	sys	0m43.679s

	Testing with 16Mb size files
	Version  1.03       ------Sequential Create------ --------Random Create--------
	cfhat5              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
	files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
	1:17000000:17000000/10 3  13    16  31   386  27     4  13    11  22   558  40
	real	11m1.018s
	user	0m0.464s
	sys	1m49.534s



============================================================================
Hardware info
=============

[root@cfhat5 diskio]# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 246
stepping        : 10
cpu MHz         : 1991.008
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 3915.77
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 246
stepping        : 10
cpu MHz         : 1991.008
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 3973.12
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

-----------------------------------------------------------
[root@cfhat5 diskio]# cat /sys/class/scsi_host/host6/stats
3w-9xxx Driver version: 2.26.03.015fw
Current commands posted:      0
Max commands posted:         79
Current pending commands:     0
Max pending commands:         1
Last sgl length:              2
Max sgl length:              32
Last sector count:            0
Max sector count:           256
SCSI Host Resets:             0
AEN's:                        0

--------------------------
3ware card info

Model   9500S-8
Serial #      L19403A5100293
Firmware      FE9X 2.06.00.009
Driver        2.26.03.015fw
BIOS  BE9X 2.03.01.051
Boot Loader   BL9X 2.02.00.001
Memory Installed      112 MB
# of Ports    8
# of Units    1
# of Drives   8

Write cache enabled
Auto-spin up enabled, 2 sec between spin-up
Drives, however, probably do not support spinup.

-------------------------------
Disks:
Drive Information (Controller ID 6)
Port 	Model 	Capacity 	Serial # 	Firmware 	Unit 	Status
0 	ST3300831AS 	279.46 GB 	3NF0BZYJ 	3.02 	0 	OK
1 	ST3300831AS 	279.46 GB 	3NF0AC04 	3.01 	0 	OK
2 	ST3300831AS 	279.46 GB 	3NF0A7JE 	3.01 	0 	OK
3 	ST3300831AS 	279.46 GB 	3NF0ABT1 	3.01 	0 	OK
4 	ST3300831AS 	279.46 GB 	3NF0A63J 	3.01 	0 	OK
5 	ST3300831AS 	279.46 GB 	3NF0ACC5 	3.01 	0 	OK
6 	ST3300831AS 	279.46 GB 	3NF09FLP 	3.01 	0 	OK
7 	ST3300831AS 	279.46 GB 	3NF046WY 	3.01 	0 	OK

----------------------------------
[root@cfhat5 diskio]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0    380 3781540      0  58004    0    0  2712  3781  243   216  0  2 91  7

[root@cfhat5 diskio]# free
             total       used       free     shared    buffers     cached
Mem:       4010956     229532    3781424          0          0      58004
-/+ buffers/cache:     171528    3839428
Swap:      7823576        380    7823196
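
For anyone reading the `free` output above, the "-/+ buffers/cache" row is plain arithmetic on the "Mem:" row. A minimal sketch, with variable names chosen here for illustration:

```python
# Figures (in KiB) copied from the free output.
total, used, free_kib = 4010956, 229532, 3781424
buffers, cached = 0, 58004

# Memory used by applications, excluding buffers and page cache.
used_by_apps = used - buffers - cached
# Memory that could be handed to applications if the cache were dropped.
available = free_kib + buffers + cached

print(f"used - buffers/cache: {used_by_apps}")
print(f"free + buffers/cache: {available}")
```

Almost all of the 4 GB is free at the time of the snapshot, so the machine was close to idle when these figures were taken.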


============================================================================
Kernel config

See http://www.cfa.harvard.edu/~gbakos/diskio/

+ ------------------------------------------------------------------------ +
Dr. Gaspar A. Bakos
Hubble Fellow, Solar, Stellar and Planetary Sciences (SSP) Division
Harvard-Smithsonian Center for Astrophysics
60 Garden Street, Cambridge, MA 02138 (USA)
+ ------------------------------------------------------------------------ +
Comment 289 Red Hat Bugzilla 2005-07-26 13:05:20 EDT
The massive load under 2.6.x goes away with the following changes for me.

vm.dirty_expire_centisecs = 1000
vm.dirty_ratio = 5
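
As a rough illustration of what these two values mean, here is a small sketch. The 4 GiB memory size is hypothetical and not taken from this comment, and how `vm.dirty_ratio` is interpreted has varied between kernel versions (a percentage of total or of available memory), so treat the result as an order of magnitude only.

```python
dirty_expire_centisecs = 1000   # from the comment above
dirty_ratio = 5                 # from the comment above
ram_bytes = 4 * 1024**3         # hypothetical 4 GiB machine (assumption)

# Age after which dirty data becomes eligible for writeback.
expire_seconds = dirty_expire_centisecs / 100
# Approximate amount of dirty data at which writers start to be throttled.
dirty_limit_mib = ram_bytes * dirty_ratio / 100 / 1024**2

print(f"dirty data expires after {expire_seconds:.0f} s")
print(f"writers throttled at roughly {dirty_limit_mib:.1f} MiB of dirty data")
```

Lowering both values keeps less dirty data queued in memory, so writeback happens in smaller and more frequent bursts.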

Comment 290 Red Hat Bugzilla 2005-07-29 08:55:27 EDT
Hello,

About the Linux 2.4 driver: we found that we have a big problem with the 9.2
driver. Version 9.1.5.2 was fine.

In fact, the 9.2 firmware (FE9X 2.06.00.009) delivered with the driver is buggy
if you use more than one unit. (A hot spare counts as a unit.) We got a
new firmware from 3ware support and it solves the problem.

With some tweaking and 5 RAID 5 volumes on 3 3ware cards (capacity around 5 TB),
the iozone test gave us 160 MB/s write and 270 MB/s read.

If you want more detailed information, contact me.
  
Comment 291 Red Hat Bugzilla 2005-09-20 03:44:30 EDT
Hi,
I am experiencing a similar problem without a 3ware RAID.

I previously used EL3 Update 4, which was fine.
Now I have installed EL3 Update 5, made it an NFS server,
and am trying to check its performance.
Clients hang easily, especially with a big rsize/wsize (16KB and 32KB).
Commands like df and ls do not respond, while both the clients and
the server are quiet.

Yoshihiro Tsuchiya
Comment 292 Red Hat Bugzilla 2005-09-20 07:53:51 EDT
Yoshihiro,

Please open a separate bug report for the NFS performance problem you described
in comment 277. 

Thanks,

Tom
Comment 297 Red Hat Bugzilla 2005-10-15 01:28:13 EDT
I'm suffering with a 3ware 7500-4 (3 drives in a RAID 5 plus a hot spare) on RHEL4.
Something I noticed is that 3ware completely rewrote the Linux driver (from
1.26.00.039 to 1.26.02.001). Has anyone tried the new driver to see if it
improves the write performance?
Comment 299 Red Hat Bugzilla 2005-11-19 16:46:43 EST
I am rsyncing a directory from a small P4/SCSI RAID server (Mylex AcceleRAID 170
with 10krpm disks) to a big dual-amd64/SATA2 RAID server (3ware 9550SX with 4
SATA2 disks). Both are RAID-5. The load is huge on the dual-amd64 machine,
while it is about 0.30 on the "small" one ...

top - 22:46:15 up 56 min,  2 users,  load average: 7.57, 7.80, 7.80
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0% us,  0.0% sy,  0.0% ni,  0.0% id, 100.0% wa,  0.0% hi,  0.0% si
Cpu1  :  0.0% us,  0.0% sy,  0.0% ni,  0.0% id, 100.0% wa,  0.0% hi,  0.0% si
Mem:   3090480k total,  3071604k used,    18876k free,    13776k buffers
Swap:  3895752k total,     2656k used,  3893096k free,  2833656k cached
Comment 300 Red Hat Bugzilla 2005-11-22 09:16:31 EST
Created attachment 121349 [details]
Activating write cache seems to help a lot
Comment 301 Red Hat Bugzilla 2005-11-22 09:30:02 EST
I am also seeing this problem with RHEL 4.2 and a 3Ware 7504 card with three
drives in a hardware RAID 5 config.

Can someone from Red Hat please give us an update on this bug, as it has been
open for over 18 months now?
Comment 302 Red Hat Bugzilla 2005-11-29 11:05:21 EST
Does this have anything to do with the following issue, by any chance?

https://www.redhat.com/archives/linux-lvm/2004-February/msg00141.html
Comment 303 Red Hat Bugzilla 2005-12-06 13:22:03 EST
I’ve read this Bugzilla twice, as I find it very similar to what we experience,
even though our setup is totally different: we have RH AS 4.0 and an Emulex FC
HBA connected to a DMX3000. We see an I/O rate of 30-40 I/Os per second from a
host with MySQL doing ~4KB I/O. That is not very impressive for a high-end SAN
storage solution. (Other Solaris hosts perform up to 3000 I/Os, so there is no
problem with the DMX.)

Has anyone tried to align the file system, or is that totally unnecessary for
the host controllers mentioned here? I’m a SAN person, so this might be totally
irrelevant, but as I know this is a problem in SAN environments, it might cause
the bad performance with RAID 5 and the local controllers as well. I have tried
to get my Linux people to read this Bugzilla, as we see bad performance on a
local HP SCSI controller as well as on FC HBAs connected to SAN storage (they
say that it’s not a Linux problem, it’s your SAN storage). I don’t have a Linux
server to try this on, so I am posting it here: if it makes sense to align the
I/Os so that they don’t cross the I/O boundaries set by the SCSI or other
controller, it’s worth a try. I also believe that this isn’t only relevant to
RAID 5 devices, so aligning RAID 1 to the cache would also be a good thing to
try.

I would try file system alignment for the RAID 5 volumes to see if it has any
positive effect. If the controller cache has 32KB cache slots, the partition
could be aligned to start at sector 128 (or 64), both for RAID 1 and RAID 5,
depending on how the RAID controller handles the RAID devices and cache.

Here is an example alignment procedure that works on FC SAN.

1.	Execute “fdisk /dev/sd<x>”
2.	Type “n” to create a new partition
3.	Type “p” to create a primary partition
4.	Type “1” to create partition #1
5.	Select the defaults to use the complete disk
6.	Type “x” to get into expert mode
7.	Type “b” to specify the starting block for partitions
8.	Type “1” to select partition #1
9.	Type “128” to make partition #1 align on a 64KB boundary (if 
that “boundary” exists on the controller).
10.	Type “r” to return to main menu
11.	Type “t” to change partition type
12.	Type “1” to select partition 1
13.	Type “fb” to set type to fb 
14.	Type “w” to write label and the partition information to disk
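
The arithmetic behind step 9 can be checked with a short sketch. It assumes 512-byte sectors; the 32KB and 64KB boundaries are the example values discussed above, and the function names are invented for illustration. Sector 63, a common default start for the first partition under DOS-style partitioning, is included for comparison.

```python
SECTOR_BYTES = 512   # assumed logical sector size


def start_offset_bytes(start_sector):
    """Byte offset of a partition that starts at the given sector."""
    return start_sector * SECTOR_BYTES


def is_aligned(start_sector, boundary_bytes):
    """True if the partition start falls exactly on a boundary multiple."""
    return start_offset_bytes(start_sector) % boundary_bytes == 0


for sector in (63, 64, 128):
    offset_kib = start_offset_bytes(sector) / 1024
    print(f"sector {sector:3d}: offset {offset_kib:5.1f} KiB, "
          f"32KB aligned: {is_aligned(sector, 32 * 1024)}, "
          f"64KB aligned: {is_aligned(sector, 64 * 1024)}")
```

Under these assumptions a start at sector 128 falls on a 64KB boundary, a start at sector 64 falls on a 32KB boundary only, and a start at sector 63 falls on neither.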


It might help, or it might not.
Comment 304 Red Hat Bugzilla 2006-01-25 13:09:05 EST
Hi.

Is this normal behavior for dd? I ran the following on an ext3 file system
on an internal SCSI device. During this time the 4 CPUs go up to ~99% system time.

dd if=/dev/zero of=/opt/laban_test/fs1/file01 bs=8192 count=100000 &
dd if=/dev/zero of=/opt/laban_test/fs2/file01 bs=8192 count=100000 &
dd if=/dev/zero of=/opt/laban_test/fs3/file01 bs=8192 count=100000 &
dd if=/dev/zero of=/opt/laban_test/fs4/file01 bs=8192 count=100000 &

[root@se3108 ~]# strace -c -tt -T -p 19090
Process 19090 attached - interrupt to quit
Process 19090 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.92   14.260186         196     72870           write
  5.08    0.762567          10     72870           read
  0.00    0.000071          10         7         6 open
  0.00    0.000036          12         3           close
  0.00    0.000025          25         1           munmap
  0.00    0.000015          15         1           mmap2
  0.00    0.000010          10         1           fstat64
------ ----------- ----------- --------- --------- ----------------
100.00   15.022910                145753         6 total
[root@se3108 ~]#

Comment 305 Red Hat Bugzilla 2006-01-25 14:38:20 EST
Finally I had a chance to test 3ware's 9500S under Windows 2000 Server.
(Un)fortunately, I was able to reproduce the same kind of symptoms with it.

3GHz Pentium 4, 2*160GB SATA disks, RAID 1

1. Send a 5GB file through an SMB share to the Win2000 server.
2. The Win2000 server slowly becomes unresponsive to any other processes that are
trying to access the hard disk, until even mouse movements become unresponsive.
3. After the server has written the whole file, everything goes back to the way it was.

So it seems to me that 3ware's cards just suck and their marketing is lying
quite a bit about their cards' performance. This case is closed for me. =)

BTW, here is a benchmark: http://www.tweakers.net/ext/i.dsp/1110264565.png
Comment 306 Red Hat Bugzilla 2006-03-01 07:03:18 EST
Has this bug been resolved? We have the Enterprise 3 Update 4 kernel installed
on HP DL 360 G4 and DL 380 G4 servers. Under moderate work, iowait and load go
unnecessarily high, and performance seems very jittery. When a tar backup was
run from the disk to a USB hard drive, iowait went as high as 99% and load went
as high as 30~50, and it managed to crash the server once. If Red Hat says the
iowait numbers in top are not a problem, why did it crash the system? The OS
should not crash if the backup fails. And it's not limited to backups: we have
seen a hanging IMAP server with stale TCP/IP connections under a fairly low
number of connections. How do we solve this problem? Will the new RH ES 4 solve
these issues? It seems like this bug is not solved. We tried ES 4 on a moderate
server and noticed the same high iowaits and poor response for users connected
via remote terminal. We are going to try it on an HP DL 380 to see if there's a
difference. Any insight into this would help greatly.
Comment 307 Red Hat Bugzilla 2006-03-01 07:07:50 EST
Has this bug been resolved? We have the Enterprise 3 Update 4 kernel installed
on HP DL 360 G4 and DL 380 G4 servers. Under moderate work, iowait and load go
unnecessarily high, and performance seems very jittery. When a tar backup was
run from the disk to a USB hard drive, iowait went as high as 99% and load went
as high as 30~50, and it managed to crash the server once. If Red Hat says the
iowait numbers in top are not a problem, why did it crash the system? The OS
should not crash if the backup fails. And it's not limited to backups: we have
seen a hanging IMAP server with stale TCP/IP connections under a fairly low
number of connections. How do we solve this problem? Will the new RH ES 4 solve
these issues? It seems like this bug is not solved. We tried ES 4 on a moderate
server and noticed the same high iowaits and poor response for users connected
via remote terminal. We are going to try it on an HP DL 380 to see if there's a
difference. Any insight into this would help greatly. Note that there are no
3ware cards inside; the RAID card is a Smart Array 6i/5i.
Comment 308 Red Hat Bugzilla 2006-03-01 21:10:41 EST
This BZ has become a catch-all for many unrelated performance problems. As a
result it is impossible to update the status, because there is no status that
applies to them all. The way I propose to address this is to strictly limit 
this BZ to performance problems with the 3ware driver/adapter in RHEL 3. Other
problems need a separate BZ. As you can see, some of the 3ware problems have
been addressed in RHEL 3 updates, some have been deemed to be hardware
limitations. Others may remain.

Denzel, if you are experiencing a crash, that is clearly a different problem
than this one. You should open a separate BZ. Please use a serial console to
capture the output at the time of the crash. Also provide /var/log/messages
showing the boot messages and the messages leading up to the crash. (Or just
provide a sysreport.) Depending on the situation, we may ask you to provide a
vmcore, using netdump or disk dump. 

The "hanging IMAP server with stale TCP/IP connections" sounds like a different
problem. Please open a separate BZ.

If you are having a problem with I/O performance on a 3ware adapter, please
provide more information about how it is configured and enough information to
allow us to reproduce the problem. If you are having an I/O performance problem
with another HBA, please open a separate BZ with the information requested here. 

Comment 309 Red Hat Bugzilla 2006-03-02 06:27:03 EST
I think that the I/O performance problems are related to the following comment
from the 9.3.0.2 3ware firmware release:

Write performance and read performance balancing issue
With the 3ware 9550SX controller, you might experience slow read performance
when you have lots of write operations going to the controller at the same time.
With the current firmware, the firmware is maximizing the write performance and
gives it higher priority than read performance. A future firmware update will
rectify this issue.

No idea if this applies to all the other controllers but I suspect so. This can
explain the system unresponsiveness under write load since the reads are
starved. Some kernel tuning might help a lot in this case.

Having said that, I/O performance under RHEL4 is a lot better on the same
hardware, so the kernel isn't entirely blameless here.
