Bug 121434
Description
Red Hat Bugzilla
2004-04-21 15:28:54 UTC
Created attachment 99604 [details]
dmesg file from the affected machine
When the task that generates I/O traffic finishes, the kernel remains in an invalid state, spending almost all CPU time in iowait. I'm observing this right now - the machine has been idle for over half an hour, but the iowaits for both CPUs are at 98-99%. When I run the "sync" command to flush dirty buffers, it seems to never finish. Upgrading the 3Ware controller's kernel module (3w-xxxx) to the latest version (v1.02.00.037) and the controller's firmware doesn't help a bit. Also, a similar controller with exactly the same firmware and driver works fine on Fedora Core 1.

The site http://www.webhostingtalk.com seems dead, so here are Google caches of those threads:
http://www.google.pl/search?q=cache:E2G1lPkVRL0J:www.webhostingtalk.com/archive/thread/229306-2.html+3ware+iowait
http://www.google.pl/search?q=cache:cGLF1qYl-e4J:www.webhostingtalk.com/archive/thread/243144-1.html+3ware+iowait

It could be an I/O elevator problem. Tom, could you please help Aleksander try some elvtune settings to narrow down the problem? Adding Doug to the cc: list on the chance that this is related to the SCSI affine queue patch, which was disabled in RHEL3 U2. -ernie

# elvtune /dev/sda
/dev/sda elevator ID 1
        read_latency: 512
        write_latency: 16384
        max_bomb_segments: 4

I'll try aggressively lowering read latency and moderately lowering write latency. BTW, on the Fedora 1 box (where the problem hasn't been observed) I've played with many different elevator settings and never seen high iowaits.

After some hours the iowaits drop to 0% (until I launch another task which generates I/O traffic).

I've tuned down the elevators and the bug doesn't seem to exhibit itself anymore:

# elvtune /dev/sda
/dev/sda elevator ID 1
        read_latency: 32
        write_latency: 8192
        max_bomb_segments: 4

Tom, Stephen, would it be worth changing the default elevator settings a bit for U3 so the worst latency problems are fixed?

I have the same problems on 2 servers with this controller.
Installing kernel-smp-unsupported-2.4.21-4.EL.i686 and using "/sbin/elvtune -r 512 -w 16384 -b 4 /dev/sda" decreases iowait from 90% to an average of 40%. So it looks better with these settings.

It's giving me high iowaits again when a backup is in progress. My settings: -r 32 -w 8192. Tuning down to -r 32 -w 4096 or even -r 32 -w 2048 doesn't help much (at least not immediately - I'm doing this right at this moment).

I'm having incredible latencies, although I've reduced the elvtune settings to -r 32 -w 2048. I have been waiting for my login shell to start, then for the execution of "ls" in my homedir, for over a minute!

The 3ware driver does queuing internally. Tom, it might be useful to reduce the amount of queuing the 3ware driver does, since its TCQ depth really seems unreasonably deep...

At -r 32 -w 2048 iowaits gradually drop to about 50%, which still doesn't seem normal. Latencies are visibly high; I'm waiting several seconds for simple filesystem operations. There's only one I/O intensive job, and it's a backup over the network (so the disk I/O isn't that high, since the backup goes over an encrypted SSL connection on 100 megabit ethernet, giving about 600 KB/second of data transfer).

We have experienced the same problem with the 3ware 8505-8 Serial ATA card and RHEL3. Watching the system performance with top when transferring any large files (500 MB+) or running Oracle 10g shows iowait hovering above 95%. Changing the elevator with "/sbin/elvtune -r 32 -w 8192 -b 4 /dev/sda" has taken the iowait averages from 95-99% to the high 80%'s. I discussed this issue (at length) with RedHat, Oracle, and 3ware. The general consensus is that it is a problem with the RedHat EL kernel (aka 2.4.21-9.0.1.ELsmp). Other versions may also demonstrate the problem, but this is the one that we used for this test. We installed 2.4.26 from www.kernel.org and the problem vanished. Example top with the 2.4.26 smp kernel while Oracle 10g is importing a 10 GB file:
 14:51:44  up 2:24,  4 users,  load average: 2.05, 1.39, 1.67
122 processes: 119 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   55.3%    0.0%    6.5%   0.0%     0.0%    0.0%   38.1%
           cpu00   51.6%    0.0%    9.8%   0.0%     0.0%    0.0%   38.5%
           cpu01   59.0%    0.0%    3.2%   0.0%     0.0%    0.0%   37.7%
Mem:   903940k av,  897448k used,    6492k free,      0k shrd,   1356k buff
                    462568k active,            385516k inactive
Swap: 8185108k av,  150428k used, 8034680k free              809988k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 4344 oracle    17   0  6716 6716  4444 R    29.9  0.7   1:59   1 imp
 4346 oracle    15   0 94308  90M 89976 R    24.9 10.3   9:30   1 oracle
11007 oracle    15   0  1184 1184   888 R     2.4  0.1   0:00   0 top
 2827 oracle     9   0 42652  18M 17872 D     1.6  2.0   1:00   1 oracle
 2825 oracle     9   0 41888  38M 38852 D     0.4  4.3   0:51   1 oracle
    1 root       8   0   472  444   424 S     0.0  0.0   0:09   0 init
    2 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 keventd
    3 root      19  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd_CPU
    4 root      18  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd_CPU
    5 root       9   0     0    0     0 SW    0.0  0.0   0:52   0 kswapd
    6 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush
    7 root       9   0     0    0     0 SW    0.0  0.0   0:04   1 kupdated
    9 root       9   0     0    0     0 SW    0.0  0.0   0:00   1 scsi_eh_0
   10 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 khubd
   15 root       9   0     0    0     0 SW    0.0  0.0   0:09   1 kjournald
  537 root       9   0     0    0     0 SW    0.0  0.0   0:00   1 kjournald

I have not been able to recreate this problem with either 2.4.21-9.0.1.ELsmp or 2.4.21-15.ELsmp, although the 3Ware controller I'm using is parallel IDE with 4 disks. I tried running multiple cp's of 4 GB files from drive to drive. I could get wait I/O to go up to 70% but nothing higher, and the system stayed responsive. I also tried tarring files off of and onto the system over ssh with no I/O wait at all (maybe 10% once in a while).

Hmmm... a few comments. First, I sometimes forget that the fix we did for bz #104633 isn't in the 9.0.?.EL kernels.
However, the 12.EL and later kernels in the RHEL3 Beta channel have the fix, and it very well may solve this problem (although elv tuning might make things better on the 12.EL and later kernels also).

As to the readings you might get with a stock 2.4.26 kernel: the iowait patches aren't in the upstream kernel.org 2.4 kernel, so it will always read 0 if I remember correctly. I/O wait time used to show up as idle time and would definitely leave you thinking totally differently about the performance. The only way to truly compare performance between the two is to time the total time taken to perform the same operation and then check elapsed time vs. user & sys time between the two environments.

If the people reporting problems in this thread can confirm whether or not the 12.EL kernel in the Beta channel solves your problems, I would appreciate it.

I am having very similar problems with a 3ware 8506-4LP with 4 SATA WD1200 drives on a dual 3GHz Xeon system configured in a RAID5 array. I had (at Jesse Keating's (PogoLinux) suggestion) tried the noAffine2 test kernel which Doug had posted (which I presume is similar to the beta 12.EL). This did (as for others) improve the write.c benchmark included in the original bug (#104633). It did not, however, significantly improve my latency problems. I then tried setting the 3ware driver (via its web interface) to favor background tasks. This did help significantly, but the responsiveness was still worse than that of a similar machine equipped with a single IDE drive, which achieved 2.5x faster write performance and comparable read performance. I tried the beta 14.ELsmp kernel, which gave similar results (presumably as expected) to the noAffine2 test kernel (from a latency perspective). I then tried a self-configured (though not correctly, as I don't see all my RAM) generic 2.4.26 kernel (from kernel.org source). It appears the iowait statistics in the 2.4.26 kernel are always zero, as Doug states. Hence the CPU shows a much greater idle time.
However, the generic 2.4.26 kernel DRAMATICALLY improves the latency problems. With the 2.4.26 kernel I returned the 3ware configuration to its centerpoint between background tasks and I/O performance. Performance is still very good. Conclusion: something other than the affine problem is going on in the Enterprise WS-smp 2.4.21-14.ELsmp kernel with this HW configuration.

Another interesting observation while running the bonnie++ benchmark: the profound drop in responsiveness with the 2.4.21 kernels appears predominantly during the writing and rewriting phases, not during reading. The benchmark results for write are similar with all the kernels. Incidentally, there is a similar 10% improvement in read rate for the noAffine2, 12.EL and 2.4.26 kernels over the production 2.4.21 kernel. Another observation: the 2.4.26 kernel is using an older 3ware driver than the 14.ELsmp. Looking forward to a solution in the Enterprise kernel.

Tried kernel 2.4.22-1.2188.nptlsmp (Fedora Core 1) again at Jesse Keating's suggestion. This kernel also had greatly improved responsiveness over the production WS Enterprise kernel.

My apologies: in my previous posting I stated that 2.4.26 was using an older 3ware driver; the opposite is true. In fact, all the versions that work have a driver several revisions newer than the one used in both the production and beta WS Enterprise releases:

2.4.22-1.2188.nptlsmp = 1.02.00.036
2.4.26 = 1.02.00.037
2.4.21-9.03 + 2.4.21-EL14 = 1.02.00.033

There are notes in rev 36 of this driver related to sleeping problems with the driver. I am recompiling a production kernel with the 037 driver and will post results. regards Keith Roberts

Tried the 2.4.21-9.03 kernel with the rev 37 driver. It didn't fix the problem. This is beyond my debugging skills. regards Keith Roberts

We are also seeing this issue of very high iowait times with our dual 3Ware 7506-8 system. The slightest bit of writing will send the iowait times through the roof. The elvtune-ing above has no noticeable effect.
We haven't tried the beta kernel yet, as this is a production machine. Are there any signs of that kernel being particularly unstable?

I am running the Fedora 1 kernel (2.4.22-1.2188.nptlsmp) to get around the problem (as mentioned above). I have not seen any particular problems with this kernel, but I would not say I am heavily stressing it. The beta (EL14) kernel did not fix the problem for me. Note that, as stated, the drop in iowait is partially fictitious, as the iowait statistics appear to be zero in this kernel. However, the latency was greatly improved.

Not really a surprise, but I tried the new production kernel 2.4.21-15 (smp) with no improvement. I will stick with the Fedora 1 kernel for now, which is working very well.

We have installed 2.4.21-15.ELsmp on 3 different systems and are continuing to see outrageously high iowait times (or, if you prefer, extremely slow read and write performance). Our primary application is Oracle 10g, but we are also seeing problems when using rsync to move large files (aka database backup files) from one server to another. We have RHEL 3.0 EL installed on:

Dual Xeon 2.0ghz with hyperthread and 3ware 8506 SATA RAID
Dual Xeon 2.0ghz with hyperthread, 3gb RAM, IDE (no RAID)
P4 3.0ghz with hyperthread, 2gb RAM, SATA (no RAID)
P4 3.0ghz with hyperthread, 2gb RAM, IDE (no RAID)
P4 3.0ghz with hyperthread, 3gb RAM, SATA w/Intel RAID
P4 3.0ghz with hyperthread, 3gb RAM, SATA w/Intel RAID
P4 3.0ghz with hyperthread, 3gb RAM, 3ware 7506 IDE RAID

One P4 motherboard was from Intel, the other from ASUS. All of these systems run MUCH SLOWER than the RedHat 7.2 w/Oracle 8i systems that they were intended to replace. We have tested all of these with and without all of the updates that were made available from RedHat last week.
Since this problem became apparent on a Dual Xeon system that had been running flawlessly under RedHat 7.2, and we have seen the problem on two other systems with a few different HD configurations, I am inclined to believe that it is a problem with the RHEL 3.0 kernel. We are in the process of installing Fedora Core 1 and SUSE 9.1 and will post the results. LaVar

Every comment except the last one mentions the 3ware adapter, so we have been working on the assumption that the problem is related to the 3ware driver's interaction with the kernel. Has anyone who reported a problem on 3ware also seen the problem on other storage configurations? LaVar (and others), if you have seen this on non-3ware configurations, please post the dmesg output that shows those devices being configured.

Most of the comments above also indicate that the problem is system unresponsiveness during I/O, not necessarily that I/O throughput is bad. In fact, one comment (#20) says that the bonnie++ benchmark results are similar for write with all the kernels. Is this true for everyone? I'm trying to sort out whether we have two problem reports here or one: a specific problem with system unresponsiveness when 3ware I/O is happening, or a more general I/O throughput problem on storage in general.

We have been trying to reproduce the 3ware/kernel problem, but so far we have not succeeded. We obtained a SATA controller from 3ware, in addition to the IDE one that Bill mentioned in #18. We are continuing to investigate, and will try testing some larger configurations.

Our bonnie++ numbers for our RAID5 array aren't blazing, but they're not terrible either. During this test iowait was pegged and all other processes were starved. (Poor mysql!) I just ran bonnie++ tests on two of our systems, both running RHEL3, one with a 3ware RAID, the other straight IDE. Both systems are using LVM. On the 3ware system, iowait was pretty much pegged at 100% - or 97% or so - throughout the whole test. Load hovered around 8.
On the IDE system, it bounced freely from 0 to 100 and everything in between. Load hovered around 4.

3ware RAID system:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x4.develooper.co 4G 17016  51 20257  14 11106   5 20232  51 35321  10  63.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1378  58 +++++ +++ +++++ +++  2156  89 +++++ +++   585  10
x4.dev,4G,17016,51,20257,14,11106,5,20232,51,35321,10,63.7,0,16,1378,58,+++++,+++,+++++,+++,2156,89,+++++,+++,585,10

IDE system:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x3.develooper.co 4G 27045  90 36402  39 18741  12 28103  77 49397  23 133.6   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2446  98 +++++ +++ +++++ +++  2570  99 +++++ +++  6347  97
x3.dev,4G,27045,90,36402,39,18741,12,28103,77,49397,23,133.6,0,16,2446,98,+++++,+++,+++++,+++,2570,99,+++++,+++,6347,97

Aleksander, what is the wattage of the power supply used? What file system are you using? Thanks -Padmanaban

All filesystems are EXT3. The 650 Watt power supply is provided by the Hudson-3 Server Chassis, which has 2 hot-swappable power supplies.

I believe I've finally recreated this in the lab. My config is 4 SATA drives in a RAID5 and the OS installed on top of LVM. I'm running bonnie++, and trying to log in at the console can take so long that it times out. Is everyone who is seeing this running RAID5 and LVM? Or just one or the other?
I'll try swapping in the Fedora kernel and see if the problem goes away like everyone here has reported.

The system where I've seen this is running on plain hardware RAID5. No LVM involved. Hope it helps.

Tom, on the SATA 8000 series controller, both the -9 and -15 kernels show horrible latency on interactive commands when the drive is configured as RAID5. The FC1 2.4.22-1.2115 kernel runs fine with no noticeable latency in doing interactive tasks. Still can't reproduce on the 7000 series controller, which uses parallel IDE hard drives. Just to be clear, this is in an older dual 500 PIII.

My config is 4 SATA HW RAID5 but NO LVM.

If someone doesn't mind installing and compiling a kernel from a bitkeeper source repo, then they could help in the debugging of this, I think. I have a kernel repo at bk://linux-scsi.bkbits.net/rhel3-scsi-test that is our complete 2.4.21-15.EL kernel plus a series of changes I've already made. One of the changes is that 'echo "scsi dump 2" > /proc/scsi/scsi' now works and will produce a *huge* dump that includes the mid layer scsi host state, the scsi device state, the complete list of outstanding and free scsi commands for each device, and a list of the requests in the block layer queue for each device. That dump would allow us to see if there is something wrong in the merging of requests or something similar in this case. If you are already familiar with bitkeeper, then the process is pretty simple:

bk pull bk://linux-scsi.bkbits.net/rhel3-scsi-test /usr/src/scsi-test
cd /usr/src/scsi-test
cp configs/kernel-2.4.21-<arch and option>.config .config
make oldconfig
make dep
make modules modules_install install

Then reboot into the test kernel, get the dump while the system is experiencing extremely high iowait, and post the dump results here.
If you have any problems with the kernel, please let me know about that as well (for instance, if my fix for the dumping code doesn't work; it used to oops and I think I have that fixed now, but it hasn't been tested under extremely heavy I/O load). There is also one other patch already present in that repo that specifically addresses a performance issue related to typical Oracle-type workloads by increasing the merging of adjacent requests. That could help in this case by reducing the total number of requests sent to the 3Ware controller in order to perform the same amount of work.

One other useful bit of data would be to find out whether people are having this problem with any specific chunk size in the RAID5 arrays. For example, it might be that this only happens when the RAID5 array is created with a large chunk/stripe size and our requests are routinely smaller than the size of a single chunk, or something like that. That's one of the reasons I would like to see a scsi dump. It will show what size of command we are sending, and comparing that to the chunk size of the array might be enlightening.

Following up on Rik's suggestion, I tried reducing the queue depth of the 3Ware driver:

can_queue from 254 to 30
command_per_lun from 254 to 4

An initial test did not show any dramatic improvement. It will take more time to tell if there was any change at all. Instead of doing this now, we will try Doug's debug kernel (with the stock 3ware driver) to see if that points out a more specific culprit.

As requested, I compiled and ran the test_scsi kernel. This kernel does exhibit the problem. I ran:

echo "scsi dump 2" > /proc/scsi/scsi

Assuming that /proc/scsi/scsi is where I should see the log, all I get is:

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: 3ware    Model: Logical Disk 0   Rev: 1.0
  Type:   Direct-Access                    ANSI SCSI revision: ffffffff

Am I looking in the right place?

Look in /var/log/messages. Tom is correct.
They should be in /var/log/messages or the output of dmesg. This is because the dump is done as kernel log messages. If you want to actually see them when you run the command, then do "dmesg -n 8" before running the command and it will spam your terminal to death ;-)

Wouldn't it put the system into an infinite loop if the log file is on the same device that scsi commands are dumped for? If so, one would have to temporarily forward those messages to another syslog host. Before I do this on my system, I'd like to know: what facility and priority will those messages be?

No. The dump is done as kernel printk()s. They show up in your syslog messages. And it dumps the commands while holding a lock on the device it's dumping messages about, so it sees a snapshot in time, not a running command list. Facility: kernel, level: info.

By the way, I do have one dump from one of these systems now. The dump looked fairly sane, but I do think that the 254 or so commands they allow per logical drive is insane.

BTW, I'm seeing the same problem on a stock Dell 2450 using the internal RAID controller (PERC 3/Si). Low levels of disk activity are causing very high iowait.
Here's some base information, if this helps:

uname -a
Linux sca05 2.4.21-9.0.3.ELsmp #1 SMP Tue Apr 20 19:49:13 EDT 2004 i686 i686 i386 GNU/Linux

lsmod
Module                  Size  Used by    Tainted: P
ide-tape               52464   0 (autoclean)
ide-cd                 34016   0 (autoclean)
cdrom                  32576   0 (autoclean) [ide-cd]
sg                     37228   0 (autoclean)
lp                      9124   0 (autoclean)
parport                38816   0 (autoclean) [lp]
autofs                 13620   0 (autoclean) (unused)
nfs                    95344   5 (autoclean)
lockd                  58992   1 (autoclean) [nfs]
sunrpc                 88444   1 (autoclean) [nfs lockd]
e100                   58468   1
floppy                 57488   0 (autoclean)
microcode               5056   0 (autoclean)
loop                   12696   0 (autoclean)
ext3                   89960   7
jbd                    55060   7 [ext3]
lvm-mod                64800  15
aacraid                34116   2
sd_mod                 13456   4
scsi_mod              111784   3 [sg aacraid sd_mod]

I've been contacted by 3Ware's (well, actually AMCC's) Technical Support Engineer, and he has supplied me with findings made by another anonymous customer who actually did some kernel debugging in RHES 3.0 (and 3Ware's Linux driver engineer agrees with his findings). He wrote:

"I think I found the 'culprit' of the RedHat ES 3.0 performance problem, and I would like to know 3ware's opinion *before* contacting RedHat. The main problem is, as was my first intuition, in this patch:

# several small tweaks not worth their own patch
Patch10020: linux-2.4.18-smallpatches.patch

I did my analysis on the last publicly available RedHat kernel, i.e. kernel-2.4.21-9.0.3.EL.src.rpm, from ftp://updates.redhat.com/enterprise/3ES/en/os/SRPMS. After unpacking the source RPM it is possible to see in /usr/src/redhat/SOURCES/linux-2.4.18-smallpatches.patch, at lines 517-518, this nice change to the base 2.4.21 kernel:

517  #define DISABLE_CLUSTERING 0
518 -#define ENABLE_CLUSTERING 1
519 +#define ENABLE_CLUSTERING 0

which has a 'very nice' result in the file /usr/src/linux-2.4/drivers/scsi/hosts.h:

35 #define SG_NONE 0
36 #define SG_ALL 0xff
37
38 #define DISABLE_CLUSTERING 0
39 #define ENABLE_CLUSTERING 0

As you can see, the C defines DISABLE_CLUSTERING and ENABLE_CLUSTERING are now both zero.
In the original source code they are (as one can expect from the names):

35 #define SG_NONE 0
36 #define SG_ALL 0xff
37
38 #define DISABLE_CLUSTERING 0
39 #define ENABLE_CLUSTERING 1

zero and one, which, if I interpret correctly, mean respectively 'disable scatter gathering' and 'enable scatter gathering'. Now in the 3ware driver source code (as of release 1.02.00.037), in the 3w-xxxx.h file, we have this reference to ENABLE_CLUSTERING:

563 unchecked_isa_dma : 0, \
564 use_clustering : ENABLE_CLUSTERING, \
565 use_new_eh_code : 1, \

596 unchecked_isa_dma : 0, \
597 use_clustering : ENABLE_CLUSTERING, \
598 use_new_eh_code : 1, \

As you can picture, the result is that, in the Scsi_Host_Template initialization, the ENABLE_CLUSTERING constant is zero instead of one, which was probably the original 3ware programmer's intention. I did a very simple test with the RedHat kernel (untouched), changing the 3w-xxxx.h source code in this way:

563 unchecked_isa_dma : 0, \
564 use_clustering : 1 , \
565 use_new_eh_code : 1, \

596 unchecked_isa_dma : 0, \
597 use_clustering : 1 , \
598 use_new_eh_code : 1, \

i.e. I explicitly used 'one' instead of the offending ENABLE_CLUSTERING RedHat macro. Performance increased immediately from 50 MB/sec to 90 MB/sec.

Unfortunately there is another RedHat patch:

Patch7030: linux-2.4.21-scsi-affine-queue.patch

that seems to influence 3ware driver performance. The patch is rather complex and seems to change the logic of the producer/consumer relation for the generic scsi driver. If I understand correctly, from the source code comments, the patch tries to optimize the SCSI generic driver for SMP systems. It is possible (I did some tests and would again like to hear your opinion) to get rid of this problem in two ways. The trivial one is to delete the patch: it seems to be used by one and only one driver, qla2200 (QLogic ISP2x00), in the file /usr/src/linux-2.4/drivers/addon/qla2200/qla2x00.c.
I followed this path and performance increased to 150 MB/sec. Then I tried another way: I left the RedHat kernel untouched again and modified the min/max-readahead parameters, figuring that increasing the values would influence the new producer/consumer driver logic. Bingo! The new values:

echo 8192 > /proc/sys/vm/max-readahead
echo 2048 > /proc/sys/vm/min-readahead

increased performance again, to 154-158 MB/sec. There even seems to be a slight performance gain over the deleted-patch approach.

My conclusions are: a) there is a bug in the RedHat 'Patch10020' for ES 3.0; we should contact RedHat to understand why the macro was changed from one to zero; in the meantime we can use a patched 3ware include file; b) we may increase the min/max-readahead parameters from 512/128 to 8192/2048, or we may contact RedHat to understand what is going on in 'Patch7030'.

My questions to 3ware are: a) is there any error in my analysis? b) do you share my conclusions? c) do you have other suggestions?"

Aleksander: Thank you for posting this. This may in fact be the problem. So, first, the latest RHEL3 U2 kernel source already has patch 7030 removed. I've also added a change to my working tree to re-enable clustering in general. I would be interested if someone could run my current working kernel tree, which has the fixed scsi dump code and clustering support enabled, and, with the machine under heavy load, get me a dump of all the outstanding scsi commands. I would like to analyze whether or not the clustering code is making a significant change in the makeup of the scsi commands going out to the 3ware driver. Of course, with patch 7030 already removed, the changes to the min/max readahead values shouldn't be needed, but if someone wants to run a few tests with my current working kernel tree and some different values for this, I would welcome the input.
However, keep in mind that even though higher readahead values will increase streaming performance, it's a trade-off between streaming performance and interactive performance. So, basically, you don't want to make the readahead numbers so high that the readahead operations start to interfere with other tasks being performed at the same time. Some additional useful data, besides just the straight performance numbers, would be the results of running:

time /etc/cron.daily/slocate.cron

which performs a large number of uncached disk accesses all over the disk at the same time. The readahead might make streaming performance better, but it's also likely to reduce the performance of the slocate cron job, and finding a good balance of reasonable streaming performance and reasonable slocate performance would be ideal.

I pulled the latest version of the scsi_test kernel. This did not have the error mentioned above changed, so I edited the hosts.h file to:

#define DISABLE_CLUSTERING 0
#define ENABLE_CLUSTERING 1

I then re-issued (having built this kernel earlier):

make modules modules_install install

I don't know if this is sufficient to rebuild all the affected modules. I had not issued "make dep", assuming (quite possibly incorrectly) that if I did not change the config this would be unnecessary. I am concerned that, since nothing changed, perhaps I failed to rebuild the kernel (though I saw several modules' .o dates change). Anyway, assuming the change was implemented, the result was no significant improvement in responsiveness once I booted from the kernel. This is a production machine, so I have used up my chance for reboots this week; hence I couldn't try rebuilding the kernel from a clean start. When the system was in very big trouble it was interesting to see a much abbreviated log.
May 28 15:15:09 chips1 kernel: Dump of scsi host parameters:
May 28 15:15:09 chips1 kernel:  (scsi0) Failed 0 Busy 254 Active 254
May 28 15:15:09 chips1 kernel:  (scsi0) Blocked 0 (Timer Active 0) Self Blocked 0
May 28 15:15:09 chips1 kernel: Dump of scsi device and command parameters:
May 28 15:15:09 chips1 kernel:  (scsi0:0:0:0) Busy 254 Active 254 OnLine 1 Blocked 0 (Timer Active 0)
May 28 15:15:09 chips1 kernel:  (cnt) ( kdev sect nsect cnsect stat use_sg) (retries allowed flags) (timo/cmd timo int_timo) (cmd[0] sense [2] result)
May 28 15:15:09 chips1 kernel:  (  0) ( 08:0e 31966040 256 8 1 32) (0 5 0x00) (6000 0 0) 0x2a 0x00 0x00000000
May 28 15:15:09 chips1 kernel:  (  1) ( 08:0e 31942224 256 8 1 32) (0 5 0x00) (6000 0 0) 0x2a 0x00 0x00000000

One thing that is very apparent is that this is not an instant problem. It really feels as though a very large queue of operations has to build up, and then suddenly performance becomes dramatically worse. It is as though there is a limit on the queue size, and rather than slowing down the offending process(es), everybody requesting disk becomes locked out.

Thanks for trying that, Keith. I'm getting similar results here. Just changing the ENABLE_CLUSTERING define back to 1 isn't making any significant difference. I'm investigating a few alternative possibilities here to see if I can work something up that would solve the problem in a different manner.

The problem is definitely in the RHEL kernel. Recently I switched the kernel from Fedora Core 1's to RHEL's on a similar machine (3Ware-based hw RAID5 array) because it was crashing with kernel panics all the time (bug 123332). With RHEL's 2.4.21-15.ELsmp kernel the panics have apparently stopped; however, the problem with high iowait values instantly arose.
Storage subsystem performance dropped so significantly that I had to disable the virus scanning engine on that server (it's a mail server) and make several other last-resort fs tweaks (disabling atime on all filesystems; tweaking bdflush, elvtune and readahead parameters; screening the system from inbound virus and spam traffic using a primary SMTP MX that does the filtering and runs Fedora Core 1) to get the performance back to an almost acceptable level - the mail system responds to clients just before they time out, most of the time... Unfortunately I cannot build a test kernel, as I have no RHEL system that would be able to perform a build - the 2 systems that I have do suffer from the high iowait problems, and running a kernel build would likely kill them for several hours.

Created attachment 100990 [details]
Detailed description of an array 3Ware 8506-4
This is the "Details" page from the 3Ware 3DM array management page.
As visible, chunk size is 64 K.
Doug, I've pulled your test kernel with Bitkeeper by running "bk clone bk://linux-scsi.bkbits.net/rhel3-scsi-test /usr/local/src/linux-rhel-test", and when I build it using kernel-2.4.21-i686-smp.config, I get this compilation error:

gcc -D__KERNEL__ -I/usr/local/src/linux-rhel-test/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -Wno-unused -fomit-frame-pointer -pipe -freorder-blocks -mpreferred-stack-boundary=2 -march=i686 -DMODULE -DMODVERSIONS -include /usr/local/src/linux-rhel-test/include/linux/modversions.h -nostdinc -iwithprefix include -DKBUILD_BASENAME=i2c_ali1535 -c -o i2c-ali1535.o i2c-ali1535.c
i2c-ali1535.c:675:6: missing terminating " character
i2c-ali1535.c:676:89: missing terminating " character
i2c-ali1535.c:691:1: unterminated argument list invoking macro "MODULE_AUTHOR"
i2c-ali1535.c:674: error: syntax error at end of input
make[2]: *** [i2c-ali1535.o] Error 1
make[2]: Leaving directory `/usr/local/src/linux-rhel-test/drivers/i2c'
make[1]: *** [_modsubdir_i2c] Error 2
make[1]: Leaving directory `/usr/local/src/linux-rhel-test/drivers'
make: *** [_mod_drivers] Error 2

Created attachment 100991 [details]
Detailed description of an array at a 3Ware 8506-8 controller
64 K chunks, too. Both arrays exhibit the same iowait problem under RHEL
kernel.
Aleksander: regarding the compilation error, I suspect you have gcc 3.3 or later installed on the system you are compiling on? The error in question is due to the fact that the file uses a multi-line string constant, which gcc-3.2 warns about but works with anyway, whereas the latest gcc refuses to compile it. I suspect that there are a *number* of places in the kernel source that are going to be unhappy with the latest gcc. We also have another bug report that I think is the same thing, 124450, and they are working on getting me some detailed information related to the problem, so if switching around gcc compilers and stuff is a hassle, we should be able to work it out without you having to go through the trouble. If that changes and I need you to try a kernel (or if we think we have the issue fixed and just need test confirmation), then I would think we can compile a set of test RPMs at that point and make them available. I was building the kernel on Fedora Core 1. Now I've built one on RHEL. SCSI dumps work fine (I can see that this is a one-time dump that is triggered by 'echo "scsi dump 2" > /proc/scsi/scsi', not a continuous one, right?). Now I'm trying to reproduce the high iowait condition and then I'll post a dump from the normal system state and the high iowait state for comparison. Created attachment 101102 [details]
scsi dump on idle system
This dump has been made when the system is almost completely idle and iowait
values are around 0.0%.
Created attachment 101103 [details]
scsi dump on system with high iowait values
This dump however may not be representative of the problem covered in this bug,
as I've put the system into high iowaits very artificially, by running 2 backup
sessions from remote machines and bonnie++ benchmark at the same time.
The system didn't exhibit one behaviour that's characteristic of the discussed
problem: iowait values dropped instantly after I stopped the bonnie++ task,
while they usually remain high for some time when the problem occurs
spontaneously.
When the system goes into the high iowait state spontaneously, I'll generate a
scsi dump and attach it here. The dump may show very different data then. I
cannot reliably put the system into that state by manual intervention, so
you'll have to wait if the current dump doesn't show anything suspicious.
Aleksander: This is very helpful information, especially the difference between how the artificial and the typical iowait problems go away. That leads me to think that the actual problem here is quite possibly similar to the Apache "thundering herd" wakeup-on-connect problem that they solved some time back. Whenever the request queue gets filled up, any additional read or write requests are placed on a wait queue. That wait queue is woken up whenever a request is freed. Obviously, if you have a large number of processes waiting on requests to be freed, that wake up will end up scheduling all those processes to run, but only 1 will actually get the free request struct. So you spend a lot of CPU time waking processes up just to put them back on the wait queue. The worse the problem is (aka, the more processes you have on the wait queue), the longer it will take to clear up. A good check for this would be to run ps axfm on a machine under a real iowait load condition and see just how many processes are stuck in a D state. A second problem here is that if only a few request structs are free, and the process that is the first to get there needs more than there are free, then even though it gets some requests onto the queue, it still has more to go and goes back on the wakeup list. The next time a wake up happens, this process may not be the one that gets the free request structs, so we can essentially end up giving a few request structs here and there to processes, but not enough to let any process actually finish and get itself off the wait queue for a significant period of time. Again, this would greatly exacerbate the problem, since what we need under high disk load like this is to be able to actually get some of those processes done and off the wait queue completely. Anyway, that's the theory I've got after reading your updates.
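The D-state check suggested above can be scripted in one line; a sketch using standard procps output (ps axfm shows the same STAT column, this just asks ps for the state field directly):

```shell
# Count processes in uninterruptible sleep (state D) -- the check
# Doug suggests for spotting a "thundering herd" on the wait queue.
ps -eo stat= | awk '$1 ~ /^D/ { n++ } END { printf "%d processes in D state\n", n+0 }'
```

A handful of D-state processes during heavy I/O is normal; dozens stuck there for long stretches would support the wait-queue theory.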
It's a hard one to test, but an artificial load that might demonstrate it better would be a disk exerciser that uses lots of threads for reading and writing like tiobench. Several simultaneous tiobench runs with lots of threads each should duplicate the problem and also show a large lag in clearing the problem out if I'm right about what's causing the slowdown. OK, I've run 3 tiobench instances, two of them with default settings (8 threads), one with more threads (64 threads). I'm starting to attach results. Created attachment 101137 [details]
output from "(date; ps axfm; date)"
Indeed, tiobench and other processes are mostly in the "D" state.
Created attachment 101138 [details]
scsi dump initiated at 2004-06-15 12:37:14
Note that the timestamps in the log are offset and the dump starts at 12:39:10.
This is because the previous dump (executed when iowaits were at 99%, but the
system was still quite responsive) was still going through syslog when I
initiated the second one.
I hope that this is OK - AFAIU the dumped data is a point-in-time snapshot,
right?
Created attachment 101139 [details]
tiobench results, first instance
Created attachment 101140 [details]
tiobench results, second instance
Created attachment 101141 [details]
tiobench results, third instance (64 threads)
Yes, dumps are a snapshot in time, so hitting it a second time before the first one finished is perfectly fine (unless syslog just chokes on so much data ;-) I'll review what you've posted (Thanks!) and see what I can come up with in terms of any possible fixes. I'm not seeing a whole lot of processes in the D state. Not enough to create a thundering herd problem anyway. I've asked Stephen Tweedie to take a quick look at this and see if he thinks this might be an ext3 bottleneck of some sort relating to journal writes, etc. If someone has a non-production system that they can test with, then they could try mounting the filesystem as ext2 instead of ext3 and see if that makes a difference on the problem. I don't really know much about tuning ext3 filesystem journal sizes, but either increasing or decreasing the journal size on problem filesystems might help as well. This might be related: I can also see the following message in kernel error logs on each boot-up: "kernel: PCI: Unable to handle 64-bit address space for" There's nothing after the "for"... Created attachment 101213 [details]
SCSI dump made today under non-artificial high iowait condition
I've hit the high iowait problem today when running "du -x --max-depth=1 . |
sort -n" in the mountpoint of the whole filesystem (the biggest one on that
machine).
It took quite a while to complete the "du" run; iowaits were at 98% and system
responsiveness was sluggish.
I'm jumping on this thread hoping my tests may be of help. I'm the author of the post to 3ware reported in comment #43. That post was sent to 3ware hoping they might help us understand why an 8506 controller with a 4x250GB Maxtor MaxLineII RAID5 array had very poor performance under RH EL 3.0: 11MB/sec. The problem "actually" was in the ENABLE_CLUSTERING patch. Restoring it to the right value (one) brought performance back to 150MB/sec. After that I found the problem in the affine patch, a patch I was delighted to see deleted in the U2 kernel release. Unfortunately the story is not over. I can confirm, from our tests, that our system periodically "hangs" for short periods when it is under an I/O-bound process. This happens with ext2, ext3, JFS, ReiserFS, XFS... it even happens when we create new filesystems. It is very easy to see the problem arise when a "dd" is started on a raw device. This short periodical "lock" of the system arises if and only if a process is "writing" to the devices. It does not show at all if the process simply "reads" from the device. At first I was convinced this was a RH ES problem, but, after dozens of tests, I think I found this behaviour, more or less, under other kernels: straight 2.4.26, 2.6.5 and even the Fedora Core 1/2 kernels. I'm starting to think the problem is either inside the 8506 3ware device driver or inside the 8506 firmware. I cannot prove this and I still hope it is a kernel issue that can be corrected. Hope I was of some help, Regards, G. Vitillaro. The problem is difficult and it seems that there are in fact different problems that, when combined, give the symptoms we've observed.
Let's try to prepare a list of potential causes (not that they exclude each other, but they are separate):

=== Hardware

Poor performance of the 3Ware controller in HW RAID5 setups can be explained by the poor performance of its CPU. That's why some people set up _Linux's software RAID5_ on 3Ware controllers instead of doing RAID5 in hardware: they get much better performance if the machine's CPU is reasonably fast. Look here: http://ask.slashdot.org/article.pl?sid=04/06/16/1658250&mode=thread&tid=137&tid=198#9445640

=== Software

There are several possible software causes in the Linux kernel as I understand it:

1) The CLUSTERING patch that does this to the hosts.h header:

#define DISABLE_CLUSTERING 0
#define ENABLE_CLUSTERING 0

has been tested and does not influence this issue (high iowaits), as Keith Roberts reported in comment #45, but backing it off has resulted in significant throughput increases for some testers (Giuseppe Vitillaro, comment #68).

2) Removal of the infamous Patch7030: linux-2.4.21-scsi-affine-queue.patch has offered some performance improvements on 3Ware controllers for some testers (Keith Roberts, comment #20; Giuseppe Vitillaro, comment #68), but didn't affect the high iowait issue much either.

3) Some yet-unidentified issue in the RHEL kernel, as compared with Fedora kernels, causing the "high iowait and low system responsiveness, high I/O latency" issue that continues to exist. A system that had been running Fedora kernels (vmlinuz-2.4.22-1.2188.nptlsmp, 2.4.22-1.2190.nptlsmp) didn't have the problem. Installing the RHEL kernel (vmlinuz-2.4.21-15.ELsmp) on that system immediately brought high iowait problems.

Be careful about "visible" iowait. I don't have a log of all of my tests, but I'm pretty sure that in many kernel versions I tested (this includes Fedora), the iowait is not "visible", but the "hanging" problem is still there, at least on our machine.
The problem presents itself in this way:

1) start a "dd if=/dev/zero of=/dev/sda[n] bs=1M" on a raw unused partition of your /dev/sda array;
2) "cd" into /usr/lib and start an "ls -l".

The "ls -l" command waits for dozens of seconds, sometimes more, the first time you scan a "large" uncached directory. I know this is a very rough way to test a system, but it is the only fast way we found to check whether the problem is still there. We are about to test a SW RAID5 on this "non production" machine. If performance is in the same range as "native HW RAID" and the "hanging" problem disappears, we will have a good indication that the problem arises in the RAID5 3ware firmware. Isn't it? Regards, G. Vitillaro. Doug, regarding your earlier comment, there's really nothing in the logs so far here that would give me enough information to either blame or eliminate ext3 as a factor here. But the effect of batching of journal writes is really more likely to show up as a latency effect under severe load, not an IO bandwidth effect. One thing that might help would be an "alt-sysrq-t" dump of process state during the bad performance, as that will show exactly which processes are waiting where. But if there's underlying bad IO performance at the driver level, that is still quite likely to show up in the sysrq-t log as lots of processes stuck in the filesystem. Really, trying one of the existing reproducers on an identical configuration, except with the ext3 filesystem mounted as ext2 instead, is the best way to eliminate that from the problem. BTW I've noticed that the problem is especially visible when doing recursive permission operations on trees with a large number of small files. E.g. "chgrp -R groupname /some/big/directory" will progress unusually slowly, especially if there are some other moderately I/O intensive processes running on the system. So commands making small reads seem to be affected the most. Ehm, I meant "small reads and writes".
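The two-step reproduction above can be sketched as a script. Note this is a safe stand-in: it streams the dd writes to a scratch file rather than a raw partition (the original test used of=/dev/sda[n], which destroys data on that partition), so it demonstrates the shape of the test without the risk.

```shell
# Safe stand-in for the reproduction: sustained writes from dd, with a
# metadata-heavy read timed while the writes are in flight. On an
# affected kernel, the "ls -l" is where the multi-second stall shows up.
SCRATCH=$(mktemp)
dd if=/dev/zero of="$SCRATCH" bs=1M count=64 2>/dev/null &
DD_PID=$!
t0=$(date +%s)
ls -l /usr/lib > /dev/null        # first scan of a large directory
t1=$(date +%s)
wait "$DD_PID"
rm -f "$SCRATCH"
echo "ls -l took $((t1 - t0))s under write load"
```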
Stephen, does doing such a magic SysRq dump affect the state of the system? Some magic SysRq commands leave the system in a barely usable state (like the emergency remount R/O); I'd like to be sure that this one will not affect the operation of the server. Aleksander, alt-sysrq-t only emits its output to the kernel log, and has no other impact. The only side-effect you might notice is the time the kernel takes to dump the information --- if you have serial console set up, for example, then you might see a short stall as the kernel dumps all the output over a slow connection. "doing "chgrp -R groupname /some/big/directory" will progress unusually slowly, especially if there are some other moderately I/O intensive processes running on the system." That's an unfortunate consequence of basic IO scheduling. If your scheduler is "fair" with respect to the IOs it knows about, then there's a basic problem --- writes are generated asynchronously by applications (for most filesystem modifications, the app doesn't need to wait until the IO hits disk --- only fsync/O_SYNC forces that.) But for reads, the application needs to wait for the IO to complete before the data is available. So for reads, you end up with the application submitting a single read, waiting for it, submitting another, etc --- there's one IO in the queue at a time. (Readahead helps to some extent by making those IOs as large as usefully possible.) But for writes, an application can generate huge numbers of IOs at once. If you mix the two types of load, then yes, the reads progress slowly, because each single small new read gets queued behind all the other writes in the system. The 2.6 kernel has an "anticipatory scheduler" which keeps the queue artificially idle after a read is satisfied, to allow the reading process to submit another one in short order and get a bit more of the queue to itself at once. It's not really feasible to back-port that to 2.4. Created attachment 101325 [details]
Magic SysRq dump (alt-sysrq-t)
During that dump, the processes that were I/O intensive and caused high iowaits were: ssh (the client, it was tar-archiving a remote machine, redirecting the compressed archive to a local file), ls (it was hanging at the time of the dump), find (it was searching for .tar.bz2 files on all filesystems). That sysrq-t output looks very much like journal wait issues. Amongst other things, syslogd is writing the log files in sync mode (calling fsync after every log message), which appears to then be forcing journal flushes that can delay things. But, in general, this really looks like an ext3 journal flush causing a long-latency type of problem. Stephen? In the sysrq-t trace here, we've got one task doing a "checkpoint". That's when the journal is full, so we're basically doing the "sync" to flush out any metadata that's attached to the old transactions in the journal prior to deleting those transactions. Going to the first comment: "I have no such problem on a Fedora Core 1 box with 3Ware 8506-4 controller, which is under an order of magnitude higher I/O load." Now, that code should be basically identical in FC1 and RHEL3. I will go have a check to see if there are any differences we've missed, though. But bear this in mind --- CPUs are faster than disks. If you have an IO-intensive workload, then the processes doing IO are *necessarily* going to spend the bulk of their time in D state. And if you're doing a lot of journal writes, then yes, slow IO can often be expected to manifest itself as lots of processes waiting on the journal --- merely because that's where the writes are physically scheduled. So seeing processes blocked in the journal does not imply _cause_ here, especially since the exact same journal code is apparently working fine on FC1. The journal is definitely waiting on IO in the specific sysrq-t snapshot here, but the jury is still out as to whether that's a cause or an effect.
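For anyone else wanting to capture the same data: the alt-sysrq-t dump can also be triggered from a root shell via the standard /proc interface, without a console keyboard. A sketch, assuming magic-sysrq support is compiled in (it skips gracefully where /proc/sysrq-trigger is not writable, e.g. in containers):

```shell
# Equivalent of alt-sysrq-t from a shell session (root only).
# Output goes only to the kernel log, as Stephen notes above.
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq   # make sure magic sysrq is enabled
    echo t > /proc/sysrq-trigger      # dump all task states
    dmesg | tail -n 40                # view the captured stack traces
else
    echo "sysrq not writable here; run as root on the affected host"
fi
```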
If the code really is identical in RHEL and FC, then the problem is most probably somewhere else. I've tried (see comment #47) switching to the RHEL kernel on a machine running the FC 1 distribution. Switching from the Fedora to the RHEL kernel (built on a RHEL machine from source pulled from Doug's test BK repository at bk://linux-scsi.bkbits.net/rhel3-scsi-test) introduces the high iowait and high latencies problem immediately. This results in _visible_ performance degradation (abnormally high latencies of disk operations), and I am aware that the iowait on the Fedora kernel is always shown as 0% by 'top'. I am having very similar problems, but on a desktop workstation (Dell Precision 360), not a server. Basically any intensive I/O, like tarring a large file or running find on the entire disk, will make the system almost completely lock up for tens of seconds or more. I once tried to run sysreport and had to abort it after a half hour because my computer was basically unresponsive to interactive use. I have also seen high iowait percentages and many processes in the D state, but I just think they are a symptom, not the problem; they just happen to be processes that are running at that time, usually things like kswapd, find, tar, even X (which is why interactive response is so bad)! I have read through most of this report, and the only similarities I identified were: using ext3 journaling, an SMP kernel (hyperthreading PIV CPU), and a SATA controller card. I tried booting the non-smp kernel and even booted with all filesystems mounted as ext2, but still saw the same problem. Could this be a hardware or kernel driver problem? I have an Intel Corp. 82801EB disk controller, but my one and only IDE hard drive is connected to the Ultra ATA 100 interface, not the SATA. Or is this something internal to the kernel and I/O scheduling? Is there any more info I could provide to help fix this bug? I hope my comment will not be misleading in solving this situation.
As I noted in comment #70, we switched our machine from HW RAID5 to Linux SW RAID5 using RH ES 3.0 U2. We are using four Maxtor MaxLineII 250GB drives attached to a 3ware Escalade 8506-8. We used bonnie++ (bonnie++ -n0 -r512 -s20480 -f -b -u0 on a 512MB memory configuration) to evaluate the performance of an ext3 filesystem:

Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine   Size     K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ulisse    20G                65419  66 46467  41           159405  49 142.8   1
ulisse,20G,,,65419,66,46467,41,,,159405,49,142.8,1,,,,,,,,,,,,,

As you may note, the bandwidth seems preserved:

63MB/sec write
45MB/sec rewrite
155MB/sec read

The disks are identical, with a base controller/single-disk bandwidth in the 35-40MB/sec range. It is almost the same result we obtained using HW RAID5 and in line with the expected performance (SW RAID seems definitely better in write/rewrite bandwidth). It seems to me that, as may be expected, this is obtained using CPU cycles: CPU use seems to double going from HW RAID5 to SW RAID5. But... the periodical locking problem seems to be gone: the machine (an SMP biprocessor Intel Pentium Xeon 2.80GHz) now runs smoothly. I know this cannot be a conclusion, but if the behaviour can be duplicated and analyzed, maybe there is hope of identifying the origin of the problem. Regards, G. Vitillaro.

Dear Giuseppe, could you please post your bonnie++ results for the quad-drive HW RAID configuration? Your software RAID numbers are far (3x) superior to my HW results posted above, though you mention they are comparable to your HW results. Could you also confirm whether with ext3 you were running defaults or had a higher-performance journal mode set? I notice you used an 8506-8 compared to my 8506-4. My understanding was that for a given number of drives they should be comparable.
Regards, Keith Roberts.

Sure, but I have a copy of just the "best" test I obtained with HW RAID and RH ES. The hw configuration is the same, besides going with HW RAID on the same disks with the 8506. You are right: the best result with HW RAID (from my logs) is with JFS. Ext3 is rather disappointing for write and rewrite (read is in the same range). The kernel was RedHat ES 3.0 2.4.21-9.0.3 (U1 I believe), the driver was 3ware 1.02.00.037, vm.max-readahead = 8192, vm.min-readahead = 2048, with the driver recompiled with ENABLE_CLUSTERING patched to 1. The higher readahead values were needed because the 7030 "scsi-affine" patch was still in our kernel at that time (Fri May 7 10:38:19 2004). The bonnie++ command and the machine memory were the same (512MB): "bonnie++ -n 0 -r 512 -s 20480 -f -b" and the results, on a JFS filesystem with 4096-byte blocks, were:

Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine   Size     K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nimitz    20G                52443  17 44000  15           161393  31 113.9   0
nimitz,20G,,,52443,17,44000,15,,,161393,31,113.9,0,,,,,,,,,,,,,

As you can see, the read performance is slightly better, the write/rewrite performance is worse, and the CPU usage is less than half, compared with SW RAID. In both cases the "read" bandwidth is in the expected range, i.e. near 150-160MB/sec, for a quad-disk configuration. We were rather happy with these results "before" we noticed the "locking" problem. Then, after the discussion on this thread, we "switched" to SW RAID and we are going to live with the SW configuration (the machine is in production now), until the whole thing is cleared up. Hope this helps, Giuseppe. Just another note: we tried "once" ext2 (i.e. ext3 without journal) and we didn't note any performance change: but I'm going from memory. I haven't logged those tests. Sorry. Regards, G. Vitillaro. Giuseppe, I think this is a bit offtopic in this bug.
You're talking about comparing the performance of various filesystem types, and about the performance of HW RAID5 vs. SW RAID5. This bug is about a different thing: using the *same filesystem* (ext3), on the *same machine and storage subsystem*, the Fedora kernel performs well, while the RHEL kernel gives terrible latencies visible on the client side and high iowait times visible on the server side during those periods. As to the comparison of RAID performance, hardware vs. software, it's a well known thing, especially with 3Ware controllers. Some people deliberately set up software RAID5 on 3Ware controllers and treat them as "dumb" controllers: http://ask.slashdot.org/article.pl?sid=04/06/16/1658250&mode=thread&tid=137&tid=198#9445640

I am having the same issues on a live production server. It is an Intel mainboard with a 3Ware 7006-2 RAID controller (IDE/Parallel). We are running RHEL3 2.4.21-4.ELsmp with dual XEON processors. We are using 2 Maxtor 120GB IDE drives. dmesg:

SCSI subsystem driver Revision: 1.00
3ware Storage Controller device driver for Linux v1.02.00.033.
scsi0 : Found a 3ware Storage Controller at 0x7000, IRQ: 54, P-chip: 1.3
scsi0 : 3ware Storage Controller
Starting timer : 0 0
3w-xxxx: scsi0: AEN: WARNING: Unclean shutdown detected: Unit #0.
blk: queue c2ecc218, I/O limit 4095Mb (mask 0xffffffff)
Vendor: 3ware Model: Logical Disk 0 Rev: 1.0
Type: Direct-Access ANSI SCSI revision: 00
Starting timer : 0 0
blk: queue c2ecc018, I/O limit 4095Mb (mask 0xffffffff)
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: 240119680 512-byte hdwr sectors (122941 MB)

Please let me know when a resolution is expected or what quick fix could be used in the meantime. Bob, if you can, please try using Fedora Core 1's kernel for a while.
You can download kernel-smp from Fedora here: http://download.fedora.redhat.com/pub/fedora/linux/core/1/i386/os/Fedora/RPMS/ Don't change any other options, and post your results here (whether the Fedora kernel really works better). I have installed that kernel and initially it seemed to be OK. I opened two tar sessions of the entire hard drive to really test it. iowait remains at 0% and my load average went up to 6.8. My question is this, though: if my iowait is at 0% on all 4 processors and all 4 of those processors are at least 90% idle, how is it possible that my load average is climbing? I will continue to run this Fedora kernel on our server and see what happens under normal load. Also, I have a ticket open with Red Hat Support. I opened it before I found this bug. Today I sent them all of the information related to my server that they requested. Anyone who has access to those tickets can look at ticket 346246. This information was given to them before I changed the kernel. I was running the .15 EL3 kernel at the time and had high I/O activity to reproduce the load. Bob: Aside from your ticket, we now have a number of tickets all echoing the same basic sentiment in terms of a problem. I'm currently working on it on a test machine internally. Currently, my prime suspect is that the problem is a combination of things.
1) The RHEL kernels will flush a larger number of dirty pages in a single pass of kflushd than upstream kernels will, in order to improve our ability to flush out swap pages under heavy memory pressure, and

2) the I/O elevator doesn't take this into account and can allow a huge number of writes to be put in front of reads on the request queue under a couple of different scenarios, resulting in the disk request queue becoming clogged with writes ahead of reads and generating extremely high latencies on the read requests, which in turn causes programs to get stuck waiting for requested read data to be returned, increasing the overall load average, decreasing responsiveness, etc.

That's the theory anyway; we'll see how the possible code solutions work, or don't, as the case may be. I'm currently waiting on my test machine to finish a relatively complete set of performance runs so that I have valid baseline data against which to judge the effectiveness of any changes I make (I thought the people on this bugzilla would appreciate knowing just where I'm at, hence the status update format of this last bit). Thanks Doug! Let me know if you have any patches that you would like me to try on our server. I only have physical access to it on Wednesdays and Fridays but do have remote access 24x7. It seems I am still having similar issues using the Fedora kernel. I was working with my named.conf file and it loaded almost instantly in pico. When I returned to the file, it took approx 20 seconds to open the same file in pico. At this point my load is at 3.15, which seems unreasonably high considering our server has a higher mail load between 9-5 and it is now after 5. Approx 30 mins ago my load was at .33. Can you run top for a few minutes during this high load average time and tell me which processes come to the top of the listing and show as being in either an R or D state?
I'm curious what's causing the load, whether it's something like mail server processes or kernel processes such as bdflushd, kflushd, or kjournald. Doug, I just now went into top but my system load is only at .7 at the moment. Something strange was happening though. All 4 processors were showing 0% idle as well as every other column (user, system, iowait, etc.)... every 5 or 10 seconds, irq, softirq, and iowait would show approx 33% on all four processors AS WELL AS under total, which mathematically wouldn't make any sense. The system is running and responding fine but this was something odd I just observed. I know typically spamd, qmail, and other processes related to mail pop up at the top during high system load averages, however I have yet to see a process use more than 4% cpu during these high load averages. I am leaving the data center at this point to travel back to my home in PA. I will check this thread in approx 5 hours as well as check my system load at that point. As soon as I see the load average peak above 4 again, I will get what information I can out of top. For some time, I've been logging output from "ps axfmu" when there were high iowait peaks. Have a look at the processes spending their time in the D state. The following has been filtered using: awk '{ if ($8 ~ /D/) { print ; } }' , so that only processes in "D" state are shown (the username column has been removed):

PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
198 0.0 0.0 0 0 ? DW Jun04 29:16 [kjournald]
23944 2.5 0.9 21896 9920 ? D Jul02 146:07 \_ fam
25431 0.0 0.1 7736 1232 ? D Jun22 0:01 \_ /usr/bin/python /usr/local/mailman/bin/qrunner --runner=RetryRunner:0:1 -s
3385 0.5 0.1 4384 1768 ? DN 14:59 0:01 \_ /usr/lib/courier/bin/imapd Maildir
3698 0.5 0.1 4120 1604 ? DN 15:00 0:00 \_ /usr/lib/courier/bin/imapd Maildir
19863 0.0 0.1 5552 1876 ? DN 10:32 0:06 | \_ /usr/lib/courier/bin/imapd Maildir
24551 0.0 0.1 3692 1056 ? DN 14:23 0:00 \_ /usr/lib/courier/bin/couriertls -server -tcpd /usr/lib/courier/libexec/courier/imaplogin /usr
27507 0.0 0.0 3696 952 ? DN 14:32 0:00 \_ /usr/lib/courier/bin/couriertls -server -tcpd /usr/lib/courier/libexec/courier/imaplogin /usr
27508 0.0 0.1 3868 1100 ? DN 14:32 0:00 | \_ /usr/lib/courier/bin/imapd Maildir
498 0.0 0.1 4256 1412 ? DN 14:50 0:00 | \_ /usr/lib/courier/bin/imapd Maildir
3687 2.4 1.1 13936 11328 ? DN 15:00 0:03 | \_ /usr/lib/courier/bin/imapd Maildir
4210 99.9 0.1 3104 1364 ? DN 15:02 0:02 | \_ submit esmtp dns; SOKRATES (softdnserr [::ffff:192.168.254.79]) AUTH: LOGIN askwarska, TL
4202 0.0 0.1 3096 1360 ? DN 15:02 0:00 \_ submit esmtp dns; [10.0.10.5] (softdnserr [::ffff:192.168.254.79])
18250 0.1 1.1 29380 11800 ? DN Jul05 1:48 \_ /usr/sbin/httpd
23543 0.1 1.0 25820 11048 ? DN 14:19 0:03 \_ /usr/sbin/httpd
4220 0.0 0.3 21176 3716 ? DN 15:02 0:00 \_ /usr/sbin/httpd
2947 0.9 1.7 49312 17860 ? D 14:58 0:02 \_ /usr/bin/spamd -d -c -a -m5 -H
2948 0.9 1.7 49312 18256 ? D 14:58 0:02 \_ /usr/bin/spamd -d -c -a -m5 -H
2949 0.9 1.7 49312 17876 ? D 14:58 0:02 \_ /usr/bin/spamd -d -c -a -m5 -H
3026 0.2 1.5 41228 16068 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3132 0.2 1.6 41116 17468 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3507 0.3 1.2 41404 12604 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3703 0.4 1.4 41116 14992 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3717 0.1 1.3 40984 14088 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3718 0.5 1.4 41380 14876 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3762 0.3 1.5 41116 15624 ? D 15:00 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H

Another one:

PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
25064 1.6 0.2 6420 2448 ? DN 15:27 0:04 \_ /usr/lib/courier/bin/imapd Maildir
25428 1.9 0.3 6424 3100 ? DN 15:28 0:04 \_ /usr/lib/courier/bin/imapd Maildir
25730 2.0 0.3 6424 3440 ? DN 15:29 0:03 \_ /usr/lib/courier/bin/imapd Maildir
26216 0.3 0.1 4112 1524 ? DN 15:32 0:00 \_ /usr/lib/courier/bin/imapd Maildir
26220 0.6 0.1 3844 1160 ? DN 15:32 0:00 \_ /usr/lib/courier/bin/imapd Maildir
26291 0.0 0.0 3768 612 ? DN 15:32 0:00 \_ /usr/lib/courier/bin/imapd Maildir
26276 0.0 0.0 3772 616 ? DN 15:32 0:00 | \_ /usr/lib/courier/bin/imapd Maildir
26277 0.0 0.0 3756 612 ? DN 15:32 0:00 \_ /usr/lib/courier/bin/imapd Maildir
6655 0.0 0.6 240020 6416 ? D Jun21 0:15 \_ /usr/sbin/slapd -f /etc/openldap/slapd_bdb.conf -u ldap -h ldap://0.0.0.0:389/ ldaps://0.
25681 0.3 1.8 40396 19064 ? D 15:29 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
25690 0.3 1.8 40528 18784 ? D 15:29 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
25927 0.3 1.3 40816 13796 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
25942 0.3 1.3 40684 13812 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
25955 0.4 1.3 40816 13804 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
25978 0.3 1.3 40948 14268 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
25984 0.4 1.3 41076 14028 ? D 15:30 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H

Another one:

PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
195 0.0 0.0 0 0 ? DW Jun04 4:20 [kjournald]
198 0.0 0.0 0 0 ? DW Jun04 22:57 [kjournald]
25483 0.0 0.1 3856 1276 ? DN 09:57 0:01 \_ /usr/lib/courier/bin/imapd Maildir
11438 0.4 0.2 5324 2740 ? DN 09:09 0:24 | \_ /usr/lib/courier/bin/imapd Maildir
21598 48.6 0.4 19372 4700 ? DN 09:45 22:48 | \_ /usr/lib/courier/bin/imapd Maildir
3955 32.0 0.0 3780 616 ? DN 10:31 0:17 | \_ /usr/lib/courier/bin/imapd Maildir

Yet another one:

PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
198 0.0 0.0 0 0 ? DW Jun04 20:56 [kjournald]
3091 0.6 0.2 5172 2584 ? D 11:24 0:00 | \_ /usr/bin/python -S /usr/local/mailman/cron/gate_news
14102 0.9 0.2 19808 2216 ? D Jun08 171:53 \_ /usr/bin/python /usr/local/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
14136 0.0 0.1 8064 1212 ? D Jun08 3:19 \_ /usr/bin/python /usr/local/mailman/bin/qrunner --runner=CommandRunner:0:1 -s
3008 0.9 0.1 3492 1776 ? DN 11:24 0:00 | \_ submit esmtp dns; krzysztof ([::ffff:195.94.219.146]) AUTH: LOGIN pedryc, TLS: SSLv2,128b
1877 0.4 0.1 4916 1276 ? DN 11:20 0:01 \_ /usr/lib/courier/bin/imapd Maildir
2688 2.9 0.1 4120 1464 ? DN 11:23 0:03 \_ /usr/lib/courier/bin/imapd Maildir
2961 1.3 0.1 4116 1624 ? DN 11:24 0:00 \_ /usr/lib/courier/bin/imapd Maildir
3006 0.3 0.0 3764 624 ? DN 11:24 0:00 \_ /usr/lib/courier/bin/imapd Maildir
31962 0.6 0.4 7604 4464 ? DN 09:21 0:46 | \_ /usr/lib/courier/bin/imapd Maildir
9747 0.2 0.5 8632 5892 ? DN 09:59 0:12 | \_ /usr/lib/courier/bin/imapd Maildir
10029 0.1 0.1 4124 1524 ? DN 10:00 0:05 | \_ /usr/lib/courier/bin/imapd Maildir
10172 0.1 0.2 5576 2892 ? DN 10:00 0:06 | \_ /usr/lib/courier/bin/imapd Maildir
10180 0.1 0.1 3876 1272 ? DN 10:00 0:05 | \_ /usr/lib/courier/bin/imapd Maildir
28608 0.1 0.0 1596 400 ? DN Jun16 7:32 syslogd -m 0
2945 0.1 0.9 40664 9592 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
2954 0.2 0.9 40664 9528 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
2957 0.1 0.8 40664 8816 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3085 0.7 0.9 40796 10168 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3086 0.7 1.0 40796 10444 ? D 11:24 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H

And yet another one:

PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
198 0.0 0.0 0 0 ? DW Jun04 24:05 [kjournald]
14290 0.0 0.1 3184 1056 ? DN Jun22 0:11 \_ /usr/lib/courier/libexec/courier/courierd
26291 0.0 0.0 1560 428 ? DN 13:03 0:00 | \_ ./courierlocal
27326 0.0 0.0 3768 624 ? DN 13:07 0:00 \_ /usr/lib/courier/bin/imapd Maildir
29624 0.5 0.1 4312 1756 ? DN 11:21 0:34 | \_ /usr/lib/courier/bin/imapd Maildir
3108 0.0 1.1 33668 11728 ? DN Jun22 1:34 \_ /usr/sbin/httpd
14981 0.1 0.0 1596 580 ? DN Jun22 1:32 syslogd -m 0
26326 0.3 1.9 40432 19764 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
26336 0.3 1.8 40280 19112 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
26339 0.3 1.8 40168 19364 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
26343 0.3 1.8 40168 19296 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
26347 0.3 1.8 40168 19236 ? D 13:03 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
27323 0.1 0.4 39772 4136 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
27324 0.0 0.4 39772 4172 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
27325 0.1 0.4 39772 4192 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
27333 0.0 0.4 39772 4236 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
27337 0.1 0.4 39772 4256 ? D 13:07 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H

As can be seen, kjournald in state D usually accompanies the high iowait condition, but not always. I have been noticing on my server that the processes listed at the top of the top command generally aren't using ANY resources at all. I honestly haven't looked for the D state but I will keep an eye open from here on out. I'm not a newb, but nowhere near a developer... but if kjournald is accompanying the high loads, is it possible this scenario only happens under ext3? Is anyone having this condition under ext2? Just a thought. In comment #100 you said: "It seems I am still having similar issues using the Fedora kernel. I was working with my named.conf file and it loaded almost instantly in pico. When I returned to the file, it took approx 20 seconds to open the same file in pico. At this point my load is at 3.15 which seems unreasonably high considering our server has a higher mail load between 9-5 and it is now after 5. Approx 30 mins ago my load was at .33." We had exactly the same problem, under Fedora Core 1/2 and the latest plain 2.6.5 kernel, for "both" ext2 and ext3 (and other filesystems too), as I already reported in comment #68. G. Vitillaro. Not only are we having those same issues, but everything seems to act differently every time we load top.
I have yet to see any more than 0% total iowait, but this isn't possible as we are only using approx 2% of cpu usage and our load averages are nearing or going over 4.0 .... I am assuming the fedora kernel just doesn't report iowait? My load average seems to be slightly lower under the fedora kernel but it is acting weird. Does fedora not report iowait?

I believe FC1 does not report iowait. FC2 does, as it is based on 2.6 and iowait is upstream in 2.6 already. RHEL3 also has iowait backported.

I just received this response from Red Hat Support:

Dear Sir, It seems that when you experience an issue with your server, that's the time you're running a lot of applications or a peak on the load. You can check 'free -m' during this time, and you'll notice that you are only left with little RAM and swap. This is the reason why even in Fedora you have the same issue. You can add more RAM to your system, and see if this will improve the performance of your system. Regards, Leah

I cannot accept this as a solution, but is there any truth to this at all? The server we are running now replaced another server with less power and about half the RAM (and no 3ware card). We are under no more load than our old server, yet our load average is increasing to insane amounts. This is not a RAM issue.

We first saw this problem when we upgraded from a system running Red Hat 7.2 to RHEL 3.0. Same hardware, same ram, same applications, same database, etc. Performance was so bad that I responded to this thread after getting the run-around from Red Hat support. Red Hat told us that the 3ware cards were not supported on RHEL 3.0 and that I should purchase a RAID array from a company that was on the compatible list, such as Dell or Compaq. FYI: Oracle support was not any better. I have tested several systems and have seen these same excessively slow performance issues.
All of the systems that we have used are SMP, either dual Xeon or P4 with hyper-threading, using 3ware (IDE and SATA) or onboard SATA, have 3 or 4 Gb RAM, and use Oracle as the primary application. Although the Fedora Core 1 kernel took us from completely unacceptable performance to tolerable performance, our solution was to abandon Red Hat in favor of other Linux suppliers. The primary application that we run is the Oracle standard edition database. Once I discovered what needed to be done to get Oracle to install on other distributions, our problem was solved.

I started using Red Hat several years ago and I have been very disappointed by the lack of support and the incredible delay between reporting a verifiable problem and its resolution. For instance, this thread has been open for 11 weeks and it is not yet solved. BTW, we considered this a production-halting problem. If Red Hat ever fixes this issue, I will probably use the licenses that we purchased. However, I would prefer a refund at this point. LaVar

I am happy to see that others feel the same way that I do. I was fine with this situation as long as I felt that it was being worked on. As soon as Red Hat Support gave me that last post, my views on Red Hat went from "ok" to "horrible" ... If you are honest with me and tell me that you are working on a problem, then I can deal with that..... when you lie to me and tell me it's something else, especially when that solution requires me to spend needless money on memory that isn't even going to help the problem, then it's a new story altogether. We have already begun to look at other solutions. Hopefully a solution to this bug is found before we migrate to someone else.

As I went back to check on the compatibility issue I was a bit perplexed. At first we had a Promise Technologies card in our server but had difficulties getting it to work with Red Hat. As a result we consulted Red Hat's HCL and from there we found that this 3Ware card was our best option.
At this point, those 3Ware cards no longer show on their HCL. Is it true that Red Hat is pulling support for a product that it once said worked, or did I misread something?

No need to get upset at a busy or tired support person ;) Engineering is looking into this problem and we are trying to get it fixed. AFAIK 3Ware should still appear on the HCL, though you'll need to use the "complete list" tab as they're not officially certified.

Another listing of processes in the D state, notice 3 kjournald instances (this one from today):

PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
192 0.0 0.0 0 0 ? DW 11:17 0:05 [kjournald]
193 0.0 0.0 0 0 ? DW 11:17 0:00 [kjournald]
195 0.0 0.0 0 0 ? DW 11:17 0:04 [kjournald]
2973 0.4 0.0 1592 580 ? D 11:17 1:04 syslogd -m 0
3175 0.0 0.1 4680 1804 ? D 11:17 0:05 \_ /usr/lib/courier/libexec/courier/courierd
28665 0.7 0.1 5448 1600 ? D 14:54 0:04 \_ /usr/lib/courier/bin/imapd Maildir
29140 0.8 0.2 5444 2460 ? D 14:56 0:04 \_ /usr/lib/courier/bin/imapd Maildir
17308 0.0 0.2 5660 2096 ? D 12:03 0:03 | \_ /usr/lib/courier/bin/imapd Maildir
6061 0.0 0.0 3980 856 ? D 13:34 0:00 | \_ /usr/lib/courier/bin/imapd Maildir
21268 0.0 0.2 5732 2748 ? D 14:30 0:01 | \_ /usr/lib/courier/bin/imapd Maildir
21269 0.1 0.1 4636 1620 ? D 14:30 0:02 | \_ /usr/lib/courier/bin/imapd Maildir
4089 8.5 1.5 20832 15600 ? D 11:19 19:22 \_ fam
29645 0.2 3.5 62620 36340 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
29646 0.2 3.5 62620 36476 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
29647 0.1 3.4 62620 35516 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
29703 0.2 3.2 62620 33184 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
29705 0.2 3.2 62620 33420 ? D 14:58 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
30371 0.2 1.7 63032 18496 ? D 15:02 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
30374 0.1 1.7 63032 17704 ? D 15:02 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
30375 0.2 1.7 63032 17972 ? D 15:02 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
30479 0.2 1.8 63032 18624 ? D 15:03 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
30507 0.4 1.6 62480 17320 ? D 15:03 0:00 \_ /usr/bin/spamd -d -c -a -m5 -H
3513 0.1 0.7 26424 7876 ? D 11:17 0:16 \_ /usr/sbin/httpd
3515 0.1 0.7 26672 7944 ? D 11:17 0:18 \_ /usr/sbin/httpd
3517 0.1 0.6 26436 6820 ? D 11:17 0:16 \_ /usr/sbin/httpd
3522 0.1 0.7 26344 7692 ? D 11:17 0:15 \_ /usr/sbin/httpd
12859 0.1 0.8 26852 8648 ? D 11:43 0:14 \_ /usr/sbin/httpd
12900 0.1 0.4 27256 4916 ? D 11:43 0:14 \_ /usr/sbin/httpd
12922 0.1 0.7 26504 8132 ? D 11:43 0:13 \_ /usr/sbin/httpd
13748 0.1 0.5 26804 5968 ? D 11:47 0:12 \_ /usr/sbin/httpd
13940 0.1 0.6 26208 7152 ? D 11:48 0:12 \_ /usr/sbin/httpd
32199 0.0 0.7 26380 8196 ? D 13:08 0:06 \_ /usr/sbin/httpd
31017 0.0 0.0 1584 588 ? D 15:05 0:00 \_ crond
31018 0.0 0.0 1584 588 ? D 15:05 0:00 \_ crond
31019 0.0 0.0 1584 588 ? D 15:05 0:00 \_ crond
3611 0.0 0.1 7692 1144 ? D 11:17 0:00 \_ /usr/bin/python /usr/local/mailman/bin/qrunner --runner=RetryRunner:0:1 -s
13736 0.0 0.0 3960 984 ? D 14:10 0:00 /usr/lib/courier/bin/imapd Maildir

And another one, notice lack of kjournald instances, and a fam instance:

PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
23944 3.4 1.8 24176 18620 ? D Jul02 646:49 \_ fam
10494 0.0 0.1 4360 1664 ? DN 13:10 0:09 \_ /usr/lib/courier/bin/imapd Maildir
24579 1.3 0.2 5532 2636 ? DN 15:59 0:05 \_ /usr/lib/courier/bin/imapd Maildir
25375 0.8 0.2 5544 2596 ? DN 16:02 0:01 \_ /usr/lib/courier/bin/imapd Maildir
25590 0.7 0.1 4120 1480 ? DN 16:03 0:01 \_ /usr/lib/courier/bin/imapd Maildir
25648 0.0 0.1 3832 1108 ? DN 16:03 0:00 \_ /usr/lib/courier/bin/imapd Maildir
25905 0.8 0.1 4128 1636 ? DN 16:05 0:00 \_ /usr/lib/courier/bin/imapd Maildir
26131 0.0 0.1 3856 1160 ? DN 16:06 0:00 \_ /usr/lib/courier/bin/imapd Maildir
11120 0.0 0.1 3984 1320 ? DN 15:09 0:01 | \_ /usr/lib/courier/bin/imapd Maildir
15232 0.0 0.1 4468 1768 ? DN 15:25 0:00 | \_ /usr/lib/courier/bin/imapd Maildir
17768 0.1 0.1 4384 1676 ? DN 15:35 0:03 | \_ /usr/lib/courier/bin/imapd Maildir
17769 0.2 0.1 3924 1344 ? DN 15:35 0:04 | \_ /usr/lib/courier/bin/imapd Maildir
19324 0.1 0.1 4932 1796 ? DN 15:40 0:02 | \_ /usr/lib/courier/bin/imapd Maildir
25918 0.4 0.2 4700 2092 ? DN 16:05 0:00 | \_ /usr/lib/courier/bin/imapd Maildir
26073 0.7 0.1 4364 1732 ? DN 16:05 0:00 | \_ /usr/lib/courier/bin/imapd Maildir
19503 0.1 1.1 30368 11508 ? DN 09:46 0:39 \_ /usr/sbin/httpd
21794 0.1 1.0 27192 10324 ? DN 13:52 0:13 \_ /usr/sbin/httpd
31399 0.1 1.0 26804 10856 ? DN 14:23 0:08 \_ /usr/sbin/httpd
31415 0.1 0.9 26804 9440 ? DN 14:23 0:09 \_ /usr/sbin/httpd
31426 0.1 0.9 26992 9916 ? DN 14:23 0:10 \_ /usr/sbin/httpd
23730 0.2 0.7 63784 8112 ? D 15:55 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
23731 0.2 0.7 63784 8164 ? D 15:55 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
23755 0.2 0.7 62728 7988 ? D 15:55 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
23812 0.1 0.7 62464 8008 ? D 15:56 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
23826 0.2 0.7 62596 8020 ? D 15:56 0:01 \_ /usr/bin/spamd -d -c -a -m5 -H
28492 0.0 0.1 4300 1708 ? DN 14:14 0:05 /usr/lib/courier/bin/imapd Maildir
29943 0.0 0.1 3856 1264 ? DN 14:17 0:00 /usr/lib/courier/bin/imapd Maildir
29944 0.0 0.1 3852 1108 ? DN 14:17 0:00 /usr/lib/courier/bin/imapd Maildir
29945 0.0 0.1 3852 1232 ? DN 14:17 0:00 /usr/lib/courier/bin/imapd Maildir
29946 0.0 0.1 3868 1212 ? DN 14:17 0:00 /usr/lib/courier/bin/imapd Maildir
31515 0.0 0.2 5312 2652 ? DN 14:23 0:06 /usr/lib/courier/bin/imapd Maildir
31516 0.0 0.4 7172 4344 ? DN 14:23 0:05 /usr/lib/courier/bin/imapd Maildir
31517 0.0 0.1 4080 1512 ? DN 14:23 0:06 /usr/lib/courier/bin/imapd Maildir
23847 0.0 0.0 4604 752 ? DN 15:56 0:00 spamc -s 524288

BTW, I want to direct your attention to bug 124450 which seems closely related.

Created attachment 101966 [details]
readprofile kernel profiling data during high iowaits state
This is extracted from another RHEL system, running Doug's test kernel and
suffering from high iowaits. At the moment the readprofile was captured, the
following processes were in the D state:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 9 0.0 0.0 0 0 ? DW 15:26 0:00 [bdflush]
root 10 0.1 0.0 0 0 ? DW 15:26 0:04 [kupdated]
root 20 0.0 0.0 0 0 ? DW 15:26 0:00 [kjournald]
root 127 0.0 0.0 0 0 ? DW 15:26 0:00 [kjournald]
root 129 0.0 0.0 0 0 ? DW 15:26 0:00 [kjournald]
root 130 0.0 0.0 0 0 ? DW 15:26 0:00 [kjournald]
root 442 0.0 0.0 1580 292 ? D 15:26 0:00 syslogd -m 0
root 478 0.0 0.0 7276 112 ? D 15:26 0:00 \_ 3dmd
root 2114 0.0 0.0 3328 1140 ? D 16:35 0:00 \_ perl .fishsrv.pl 53d95f350b45b5ade77f9119d03764e5
root 961 0.0 1.2 258384 26508 ? D 15:27 0:00 \_ ./dsmserv QUIET
root 987 0.0 1.2 258384 26508 ? D 15:27 0:00 \_ ./dsmserv QUIET
root 997 0.0 1.2 258384 26508 ? D 15:27 0:00 \_ ./dsmserv QUIET
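For anyone wanting to collect comparable data, a profiling run like the attached ones can be captured roughly as follows. This is a sketch: the kernel must have been booted with the profile=2 parameter for /proc/profile to exist, the System.map path may differ per install, and the 30-second window is arbitrary.

```shell
#!/bin/sh
# Capture a kernel profile during a high-iowait episode.  Assumes the kernel
# was booted with profile=2 so /proc/profile is available.
if [ -r /proc/profile ]; then
    readprofile -r                               # reset the profiling counters
    sleep 30                                     # let the bad state accumulate samples
    # dump the busiest kernel functions, highest tick counts first
    readprofile -m /boot/System.map | sort -nr | head -20
else
    echo "kernel profiling not enabled (boot with profile=2)" >&2
fi
```

Resetting the counters just before the episode (as was done for attachment 101969) makes the numbers reflect only the bad period rather than everything since boot.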
Created attachment 101967 [details]
readprofile kernel profiling data during normal state
For comparison, here's data from readprofile captured during normal system
state (a few minutes earlier).
Created attachment 101969 [details]
readprofile kernel profiling data during high iowaits state, after resetting profiling data
This readprofile data has been captured during high iowait system state, but
the profiling counters were reset 2 minutes earlier. Processes in the D state
at the moment readprofile was captured were:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 127 0.0 0.0 0 0 ? DW 15:26 0:05 [kjournald]
root 130 0.0 0.0 0 0 ? DW 15:26 0:00 [kjournald]
root 442 0.0 0.0 1580 304 ? D 15:26 0:00 syslogd -m 0
root 2056 0.1 0.0 4036 728 pts/1 D 16:33 0:04 | \_ top
root 1073 0.0 1.2 260760 25100 ? D 15:27 0:00 \_ ./dsmserv QUIET
root 1097 0.0 1.2 260760 25100 ? D 15:27 0:01 \_ ./dsmserv QUIET
root 1100 0.0 1.2 260760 25100 ? D 15:27 0:00 \_ ./dsmserv QUIET
root 2051 0.2 1.2 260760 25100 ? D 16:32 0:08 | \_ ./dsmserv QUIET
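Listings like the ones above can be gathered with a one-liner (a sketch; the exact `ps` field list is illustrative). The wchan column shows which kernel function each process is sleeping in, which helps narrow down whether it is stuck in the filesystem, the block layer, or the driver:

```shell
# show every process currently in uninterruptible sleep (state D),
# keeping the header row, together with the kernel symbol it is blocked in
ps axo pid,stat,wchan:25,args | awk 'NR == 1 || substr($2, 1, 1) == "D"'
```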
Doug, could you apply kernel security fixes (RHSA-2004:360-05 - kernel nfs server, RHSA-2004:255-10 - signal handler crash and others) to your rhel3-scsi-test BitKeeper repository? I'd prefer running a secure kernel, even if for testing...

Yeah, I'm actually treating the two bugs (this and 124450) as dups of each other, I just haven't marked them as dups in bugzilla, so that in case they do turn out to be two different bugs they can still be tracked separately.

A quick (Hah! Yeah right!) status update. First, let me state concisely what I think the problem is at this point.

One, the iowait issue is really a red herring, especially as compared to Fedora or upstream 2.4 kernels which don't even have iowait stats. The iowait numbers are nothing more than a symptom of the problem, so there's nothing wrong with iowait per se other than it's telling us that the disks are going *really* slow.

Two, the whole "it took 20 seconds to open file <blah>" issue is a maximum latency problem. I've been running a lot of tests, and so far they show that the elevator does the right thing in controlling latency in general. The only time that latency gets this far out of whack is when the entire elevator is running that far behind; it's not a case of a single starved command. That means that the basic elevator operation is OK, it's just severely overloaded at the times when latency goes through the roof.

Three, the core basic problem is that total I/O throughput to the disks is just going to utter crap. When this happens, it clogs up the elevator and causes high latencies, and it is one of those self-worsening problems in that when it happens, it prolongs the very conditions that cause it to happen and therefore feeds upon itself.

Four, the problem is kernel version agnostic (mostly). By this I mean that the problem can occur on RHEL kernels, upstream 2.4 kernels, upstream 2.6 kernels, Fedora kernels, etc.
By changing which kernel is in use, you can change how often this problem happens, but my reading of the comments above basically says that most people who tried a Fedora or upstream kernel and thought the problem had gone away later came back and said "well, no it didn't, but it took longer to show up".

Five, the problem *is* hardware dependent. My test box that I built to replicate this problem has two different 4 drive RAID0 arrays (using software RAID0). One of them exhibits the problem *very* clearly, while the other one exhibits the problem somewhat, but manages to still do OK. The problem does *not* appear to be related to the controller, or to the bus (aka IDE, SCSI, Fiber Channel), but instead is related to the individual hard disks. Different models from the same manufacturer have very different behavior patterns under this load condition. One of my RAID0 arrays is made of 4 36GB Seagate SCSI disks, and these 4 drives go from best performance of around 120 to 130 MByte/sec under single process loads to 50 MByte/sec under 2 process loads to 25-30 MByte/sec under 16 process loads. The other RAID0 array is 4 18GB Seagate Fiber Channel disks and they go from 90 MByte/sec single process to 3-5 MByte/sec under 2 process and 16 process loads.

Furthermore, inspection of drive specific settings in the drive firmware has revealed some hints at the problem. The 18GB Seagate drives default to only having 3 cache zones for the on drive cache memory. Obviously, if you only have 3 cache zones and you are getting lots of random I/O, the chances that any two random I/Os will fall into the same zone are very small. Increasing that number of cache zones to 16 didn't help the overall drive performance much (to be expected, these are slightly older drives with only a limited amount of on drive cache, so even though I increased the number of zones, each zone is now smaller and so the likelihood of hitting a cache zone with random I/O is again fairly small).
However, increasing the number of cache zones did have one interesting effect. While the drive array used to perform at a constant 3 to 5 MByte/sec under the 2 and 16 process load tests, it now has spikes of up to 25 MByte/sec with 16 cache zones. Unfortunately, the spikes are usually very short lived and the performance again drops back down to the low range. So, here's what I think is going on at this point, based upon all the information above. Given the right set of conditions, the particular I/O pattern being sent to the drive from the linux kernel is producing absolute worst case performance numbers from the actual physical hard disks. Just how bad the performance gets depends on the brand and model of disk in use. From what I can tell, this doesn't have anything to do with any problems in the SCSI stack, IDE stack, or other drivers. Instead, the problem is higher up in the kernel where we are actually originating the read and write requests (such as the filesystem, the swapper, bdflush, etc.) Why is this happening on RHEL more so than on other kernels? The best answer I can give to this right now is that the VM in RHEL is tuned for certain types of performance, included in that tuning is changing the default number of pages that kswapd is allowed to flush in a single pass of looking for freeable memory (we increased the limit so that under high memory pressure swapping would happen quicker). This sort of change results in the I/O pattern that the VM sends to the disks being different. I think there are a number of places in the kernel where we have tweaked things that would have a ripple down effect on the I/O pattern we generate. The result of all those changes taken together is that now, some devices display very poor performance numbers under certain conditions quicker on RHEL than they do on upstream kernels, however that doesn't mean that upstream kernels are immune to the problem as they can degrade to the same place as well. 
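For readers who want to inspect this setting on their own drives, something like the following may work. This is only a sketch: the sdparm tool and the NCS (number of cache segments) field name are assumptions on my part, contemporary systems may instead ship scsiinfo, and not every drive exposes the caching mode page as writable.

```shell
#!/bin/sh
# Inspect (and optionally raise) a drive's cache-segment count via the
# SCSI caching mode page.  sdparm and the NCS field are assumptions;
# adjust the device node for your system.
DEV=${1:-/dev/sda}
if command -v sdparm >/dev/null 2>&1 && [ -e "$DEV" ]; then
    # print the caching mode page; look for the NCS (cache segments) field
    sdparm --page=ca "$DEV" || echo "could not read mode page from $DEV" >&2
    # sdparm --set=NCS=16 "$DEV"   # uncomment to try 16 cache segments
else
    echo "sdparm not installed or $DEV not present; nothing to do"
fi
```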
What can be done about the problem? Several things.

First, you can check your drives to see if they are configured suboptimally for server mode operation. Having too few cache segments or too small a cache buffer will contribute to the problem (as will just generally poor drive firmware).

Second, we are investigating kernel changes and tuning options that might help the problem. One of the major issues contributing to this problem is file fragmentation (assuming that the files are large enough that they need more than one fragment). This particular problem is one that very much feeds upon itself. The more the file is fragmented, the more you have small reads when trying to read the file. However, it also means you have to read more filesystem meta data in order to know where the blocks are on disk. So not only do you get more small reads for the file itself, but the filesystem code has to issue more small reads as well. So, file fragmentation is a major problem. In Fedora there is a program to check file fragmentation (called filefrag).

When I run the 16 process performance test, all 16 processes write to their own files, but all at the same time. The ext3 filesystem code simply grabs the next available block for each file when it issues a write. The result is that when you look at the disk layout, those 16 files may be stored on disk something like this:

    disk blocks ----> 0011589999beefff

In this case, the blocks for each file are intermixed with each other. When process 1 then tries to read from its file, it's unable to create any reasonable sized I/O operations to the disk because the file is made up of 1 little chunk here, another there, etc. Watching the output of vmstat 1 during the 16 process performance test shows the effects of this very clearly. The 16 processes start out writing with each other, then they all switch to reading from their files.
At the very early stages of their reading, when they are roughly in sync with each other and process 0 is reading the blocks that it has that are right beside the blocks for process 1's file, the read rate is actually decent. However, as random processes start falling behind or getting ahead of the other processes and the reads are no longer close to each other on disk, the performance quickly degrades to the very low range I mentioned earlier. Typically, a 256MB file owned by one of these processes might be comprised of as many as 17,000 separate chunks. As a test, I wrote the 16 different files one at a time instead. In that case, they were all comprised of just 3 fragments each. Then I started all 16 processes reading from the 16 different files. The maximum throughput on this array with only a single file was roughly 70MByte/sec. With 16 unfragmented files, instead of getting the horrible 3 to 5 MByte/sec rate, I managed to get about 65 MByte/sec as the throughput rate. So, this basically demonstrates how bad fragmentation of files on the filesystem can kill your performance. One possible way to reduce fragmentation for now would be to rewrite the files that are fragmented. For example, Aleksander, since your machine that's having such a problem is a mail server, you may find that stopping the mail server long enough to simply rewrite all of the mail spool files and IMAP mail folders may restore a significant amount of your performance. However, this isn't a permanent solution since changes to the files over time will reintroduce fragmentation. Solving this particular problem is very difficult. Obviously, we can't know the future and guessing at how large a mail spool file may get in the next month or two is impossible. In any case, Stephen Tweedie is working on back porting the linux-2.6.5-ext3-reservations.patch file to our 2.4 kernels. 
This patch should help in those cases where a file is written out in a single go (things like the IMAP server rewriting the folder or spool file after a commit command would benefit from this). It doesn't help so much with lots of small, independent writes to a file (such as when individual emails are delivered). The patch uses a "we just saw two writes to this file, so start allocating more than 1 block at a time" type algorithm to make large writes require fewer filesystem fragments. However, when a process opens the file, does a small write, closes the file, and then sometime later another process does the same thing, we don't have any reliable way to predict that more writes are coming soon and therefore don't try to reserve the larger block chunks (at least that's my understanding of the ext3-reservations patch, but since Stephen is the one actually working on it he would have to give the authoritative answer). This patch is requiring significant work to backport, but I expect this is going to make the single largest impact on performance (although in order to see the impact, you may have to rewrite the fragmented files on your disk, as this patch helps to avoid fragmentation issues but doesn't clean up fragmentation that's already on the disk).

Expectations for any fix. One thing I need to make clear is that this isn't *just* a kernel problem. It *is* hardware dependent. A realistic goal for the RHEL kernels would be to get them back to being in the same performance range as upstream kernels. If we can do better than that, then we happily will, but we can't guarantee to do so. Part of the answer to this problem, unfortunately, may be a simple "I'm sorry, but your hard disks suck rocks." Let me explain this a little bit. I know at least one of the people commenting in this bug was referring to the servers being used for Oracle applications.
I can't remember what type of disks this person is using, but that is in fact an important consideration when discussing Oracle workloads. There are trade-offs present when trying to decide whether to use an IDE RAID controller like the 3Ware controllers + IDE disks vs. software RAID and fast SCSI disks vs. fast hardware RAID controllers using fast SCSI/Fiber Channel disks. So let me go over some of those trade-offs.

First, whether you are using hardware RAID or software RAID, there are three different metrics that typically matter under different conditions. The first is the typical one, capacity: how large the array is when all the disks are put together. For big file server type applications where you have a bunch of static files and they don't change much, this is the primary metric of concern. Then there is sequential I/O throughput. This is usually of concern also on large file servers that don't change much. The reason is that only big machines with files that don't change much have a very good chance of keeping the file data sequential. If the files change a lot (such as mail spool files), then it's likely that they won't be too sequential. The third is random I/O throughput, or I/O ops per second. This is most important on things like Oracle workloads where the data is almost guaranteed not to be sequential.

Now, disk array setups can almost be split into these clean categories:

                                        Capacity   Bandwidth   I/O ops/sec / $ spent
  IDE RAID using huge IDE drives        Best       Mediocre    Abysmal
  IDE RAID using lots of small drives   Good       Good        Mediocre
  Medium price SCSI + software RAID     Good       V. Good     Good
  High price SCSI + software RAID       Mediocre   V. Good     V. Good/Best
  High price SCSI + hardware RAID       Mediocre   Good        V. Good/Best

When it comes to random I/O patterns, which is basically what we are facing in this particular bugzilla, the single most important metric is the I/O ops per second.
Three factors go into determining what a RAID array can do in terms of I/O ops per second: the rotational speed of the disk (the higher the RPMs, the shorter the heads have to wait for the data to spin around under them, so the more total I/O ops it can complete per second), the seek time of the heads (if you have 64 I/O ops to complete, and each one is in a different place on the disk, then how fast you can get from spot to spot is a major contributor to how many ops you can complete), and the total number of hard disks in the array (the more disks you have, the more total ops you can complete across the overall array).

Obviously, since the first type of array in the list uses the fewest number of disks, and those disks typically have some of the longest seek times and slowest rotational speeds, that array is truly horrible at random I/O. If you have a big web server that's doing nothing but serving out large, static files, then it's a great type of RAID array. For an Oracle setup, it will *never* perform to anyone's reasonable satisfaction. A busy mail server falls somewhere in between those two, since it has a reasonable number of fairly large, static files in the form of mailboxes with lots of saved messages, but also has lots of random I/O in terms of new mail messages arriving and being sent.

The two software RAID types that use SCSI disks will perform quite well, but they do so at the expense of host CPU power. If you need that CPU power to actually do other work, then you may need to step up to the hardware RAID and SCSI disk subsystem. An analysis of your data access patterns prior to purchasing one type of array over another, with these trade-offs and performance guidelines in mind, is the best way to ensure that the array and the system perform up to par when finally set up.

So, having said all that, I'm not saying that anyone on this bugzilla has a RAID array that's unsuitable for their purposes.
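As a back-of-the-envelope illustration of those three factors (the RPM, seek time, and disk count below are made-up example numbers, not anyone's actual hardware):

```shell
# rough random-I/O capacity estimate: ops/sec ~ 1 / (seek + rotational latency),
# summed across the disks in the array
rpm=10000 seek_ms=5 disks=4
awk -v rpm="$rpm" -v seek="$seek_ms" -v n="$disks" 'BEGIN {
    lat = 30000 / rpm               # avg rotational latency in ms (half a revolution)
    per_disk = 1000 / (seek + lat)  # random ops/sec a single disk can complete
    printf "per-disk: %.0f IOPS, %d-disk array: ~%.0f IOPS\n", per_disk, n, per_disk * n
}'
# prints: per-disk: 125 IOPS, 4-disk array: ~500 IOPS
```

Plugging in a big 5400 RPM IDE drive's numbers instead makes it obvious why the first category in the table above is "abysmal" at random I/O.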
I don't know enough about IDE drive model numbers to be able to tell really big, but slow, IDE disks from the ones that actually perform well, and I also can't necessarily remember how many of the people on this bugzilla are even using IDE RAID arrays versus something else. The reason I bring this up is just to make sure that people have a rough guideline in mind for the performance characteristics of their particular drive setup. That way, when we say "Here, we have this kernel to test" and you come back and say "It helped, but it only got me to <blah> performance", you will know based upon this description whether or not the improvement you see is in line with the optimal performance characteristics of your own hardware setup. If you happen to have just a few, really big IDE drives on an IDE RAID array, and your workload is mainly random I/O, then we are going to try and solve what problems exist, but your particular problem may just be that you have an array with a low I/O ops per second rating and you may not have the same bug/problem that we are tracking in this bugzilla, hence my comment that for some of you the answer may at least in part be "My disks suck rocks."

So, that's the basic status update as of today. Sorry this was so long, but it's not a simple task to get through all the interrelated issues.

Final note: Aleksander, regarding the bk tree, I haven't pushed my latest stuff to the public repo because it's full of all kinds of different test patches, backouts, and other similar stuff. It really isn't suitable for use on anything other than a pure test machine at this point; I've simply done too much test hackery while working on this problem. You're probably best off either running the 15.0.3 kernel for now, or there *may* be a later kernel in the Beta channel on RHN (maybe .17.EL or something like that). If that's there, it will have the security fixes in it as well as all of our planned changes for the RHEL3 U3 update.
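The rewrite-the-files workaround described earlier in this update can be scripted roughly like this. It is a sketch: the spool path is a deliberately non-real placeholder, filefrag is only available where e2fsprogs ships it, and the mail server must be stopped first so nothing appends to a file while it is being rewritten.

```shell
#!/bin/sh
# Rewrite each mailbox file so the filesystem allocates fresh (and, with luck,
# contiguous) blocks for it.  SPOOL defaults to a placeholder path; point it
# at the real spool, with the mail server stopped, before running for real.
SPOOL=${SPOOL:-/var/spool/mail.example}

rewrite_file() {
    f=$1
    tmp=$f.defrag.$$
    # copy (preserving ownership/timestamps) then rename over the original
    cp -p -- "$f" "$tmp" && mv -- "$tmp" "$f"
}

for f in "$SPOOL"/*; do
    [ -f "$f" ] || continue
    # report the extent count before the rewrite, if filefrag is available
    command -v filefrag >/dev/null 2>&1 && filefrag "$f"
    rewrite_file "$f"
done
```

As noted above, this is not permanent: changes to the files over time will reintroduce fragmentation, so it only buys time until something like the reservations patch lands.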
I thought that your explanation sounded very plausible until you started blaming the IO problems exclusively on the disks, and I don't believe that this explanation covers the significant performance degradation seen. If this was simply a problem of poor disk choice given the application, I shouldn't be able to switch to a kernel.org kernel or reinstall a different distribution to make the problem "get much better". Our slow performance problems started when we upgraded from Red Hat 7.2 to RHEL 3.0. Same hardware (Intel SE7500CW2 dual 2.4 GHz Xeon w/HT), same disks and controller (3-Ware 7506-8 RAID 10), same RAM (3 GB), etc. (i.e. nothing changed except the OS). Our subjective performance level dropped from "we are very happy" to "oh my God, why did we upgrade to RHEL 3.0". So, we went out and purchased another server... ASUS P4P800 SE w/ P4 3.4 GHz w/HT, 4 GB RAM, 3-Ware 8506-8 SATA RAID 10. We tested with Maxtor SATA and Western Digital SATA disks (at different times, of course). I can't tell you that the poor performance was equal to the Xeon system, but it was very poor compared to our 7.2 system reference point. Thinking that the problem was the 3Ware card(s), we stopped using RAID and plugged the disks directly into the motherboard. IDE on the Xeon system, SATA on the P4 system. After yet another reinstall, we continued to see significant performance problems. Based on comments from Bug #124450, it also appears that there is a significant difference between RHEL 2.1 and RHEL 3.0. If I was "guaranteed" that purchasing 10000 RPM SCSI drives would solve the problem, I would. However, I am not willing to take the risk of wasting a lot of cash and time only to find out that the problem exists elsewhere, when we see that a different distribution of Linux, on the same hardware, gives us "good enough" performance.
I would agree that this is very likely to be a hardware compatibility problem with your current kernel configuration, because if this was happening for all of the Red Hat EL customers, this problem could not have been allowed to continue for almost 3 months. LaVar LaVar, You and I are in total agreement. You aren't saying anything I didn't already say, even if my meaning and intent wasn't totally clear. Yes, performance on some subset of hardware available with RHEL3 is bad. Yes, we are trying to fix that. By saying that I think the problem is with the disks I wasn't intending to imply fault, only that the disks are the actual hardware we are having the problem with. There is evidently a subset of disks out there for which the particular I/O pattern we are generating basically makes the disk performance fall off a cliff. However, empirical data also tells us there is another subset of disks out there that deals with our I/O pattern just fine. The difference between a set of disks falling off the cliff and a set that are doing OK but are being used in a way for which they simply aren't going to perform well is something that I think I can spot rather easily. However, experience dictates that out of all the people who will eventually read this bug report, not all of them will have the background to tell the difference between the two. So a good portion of my last entry was intended to provide enough background to head off the inevitable false "me too" entries that come from people confusing the two situations. Only time will tell how well it will work. And you are absolutely correct that if this was universal across all RHEL3 installations, this would have been a stop-ship issue and RHEL3 would never have gone out the door.
I realise this is a Red Hat forum, but I believe that I can contribute valuable information to this problem by saying I don't BELIEVE this is a problem with any Red Hat Linux modifications...as I'm running Debian Linux 3.0 with a vanilla (downloaded from kernel.org and self-compiled) 2.6.7 kernel. If you aren't interested in my contributions as I'm not running Red Hat, that's fine, tell me and I won't post further - if you are interested, let me know and feel free to ask further questions. First off, my hardware:

Asus CUR-DLS Motherboard (ServerWorks Chipset)
Pentium 3/1GHz Coppermine (only one CPU installed)
640MB ECC RAM
3Com 3C996B-TX 64-bit PCI Gig NIC
3Ware Escalade 7810 RAID Controller
7x WD WD2000JB Hard Drives
9GB Seagate SCSI Hard Drive (boot volume)

I'm experiencing very similar problems to the original poster. I have a 3Ware Escalade 7810 RAID Controller with 7 WD2000JB hard drives running in a RAID 5 config. They are formatted as a 1.2TB XFS volume. This is just my storage array - I'm the only user, and the problem can be seen with as little file IO as serving a single MP3 via Samba. The machine will sit at about 1% CPU utilization all fine, then all of a sudden IOWAIT will spike to near 100% and the streaming file from Samba will freeze for about 10 seconds. It will then go back down as suddenly as it started, and everything will be normal for a while longer until it occurs again. I had the same issue with the 2.6.0 kernel, so it's at least in the 2.6.x line...I've never used 2.4 on this box. Personally, I'm thinking the issue is specific to the 3Ware RAID controllers, as brief testing with the SCSI boot drive does not appear to result in the system near-freezing, in spite of the IOWAIT number getting quite high at times. Perhaps one of the strangest things is when this problem is occurring and I execute "ps amx|grep D" I see the following output:

 PID TTY      STAT   TIME COMMAND
2207 ?        -      0:00 /usr/sbin/smbd -D
2208 ?        -      0:00 /usr/sbin/nmbd -D
2235 ?        -      0:53 /usr/sbin/smbd -D
   -    -     D      0:53 -
3910 pts/2    -      0:00 grep D

Notice the process in the D state has no PID, no TTY, and no command? That process only exists while the problem is occurring. I would consider myself an intermediate Linux user - I know enough to break my system well, but it takes me a few hours of reading Google to fix it :-) Even though I'm no expert user, I would be more than happy to aid in any way possible. And if you want me to stay out of this thread cause I'm not running Red Hat, I understand :-) Thanks! Doug, Thanks for the long explanation. I'm not sure I agree 100% with it only being related to disk model. We have multiple systems with the same disks, and the problem only shows up on the system with the 3ware RAID. You also didn't mention the word "write" once. Some of the issues may be related to the slower writing on the RAID5 disks, but the elevator should be handling that. Our problem is that the system can be killed by a simple 'cat largeunfragmentedfile'. No Oracle or IMAP needed. What tool were you using to muck with the disk firmware to check and change the number of zones? -R Er.. that's cat largeunfragmentedfile > copyoffile Although I agree that disks could play a part in performance, I disagree that they could play a role in such a drastic performance difference. The other thing that really makes me wonder about that is the roller-coaster effect that our server seems to be having. Under very similar process load our average will go from .1 up to well above 6.0 and then drop from there back down to .1 .... through all of this, our mail connections were very close in numbers (mail is the primary function of this server, although web is also in the mix). Even measuring in accordance with time of day (i.e. 5pm is a very busy point as we host primarily business customers and they are all getting their last-minute emails out)..... Take 5pm for 4 days straight. First day, load will be high in that time frame; second, it will be very low.
I would think that if this was disk performance we would see this across the board. My feeling is that the drives are handling the load just as well under every scenario. It also doesn't explain why this is a "3ware thread" ... Granted, there are other controllers showing this issue, but why does 3ware seem to be the lead runner? We're also looking at a very broad range of disks under those 3ware cards. At this point in our scenario we have moved our 3ware controller and 2 Maxtor drives to another server. We are now running dual Athlon as opposed to dual Xeon. We have seen an increase in performance; however, iowait load is still very high for what the server is performing. During the transition from one server to the other, I saw our iowait load skyrocket while the NIC was unplugged. There was no way that any mail or web could be processing, I wasn't doing anything except running top while waiting for an answer from another technician in my company, and I noticed the load average go up to 2.3 ... Remember that I am now on a dual Athlon, so 2.0 is max load as opposed to the 4.0 for the dual Xeon. Another question I really have is whether or not this is related to ext3. Is anyone experiencing this problem under ext2? Doug, one note: In comment #122 you've written about "static files in the form of mailboxes with lots of saved messages", probably referring to my server. It doesn't use the mbox format. It uses maildir format for mailboxes, and most of the I/O load comes from the IMAP server (Courier IMAP). Each mail folder has a corresponding filesystem directory (a couple of helper directories, actually), each directory containing messages as separate files (some folders contain over 70000 of them). Additionally, fam is being used. I just got a new machine with a 9500S 3ware card (12 Maxtor SATA drives).
There is definitely something wrong with the 3ware driver/hardware: while I run mke2fs on a drive (or drives, in the RAID case), *all* read operations completely block until mke2fs finishes, even to *other* drives on the controller. I haven't had the time to test with tiobench, bonnie++, etc. yet but I'll try them tomorrow morning. It seems to me that it's a 3ware-specific problem and has nothing to do with the more general latency problem in RHEL, but I could be wrong. I have other machines here with different RAID controllers and only the 3ware card shows this problem. The machine won't be in production for some time so I'll be happy to run whatever test you want. I'd like to add my $.02 worth here. I have been dealing with this quite a bit and some of my observations could be of interest to some. I have seen this with RHEL kernel 12.EL in both i686 and x86_64. I can say that in x86_64 it is much worse. I have seen this behavior with 3-Ware 8506-8 and 9500-12 cards, Adaptec 2010S and 2200S U320 RAID cards, as well as LSI's Megaraid SCSI 320 RAID card. The behavior exists in both hardware RAID and software RAID (Linux md). Both show the problem, but it is not as bad in software RAID mode. I tried both raid0 and raid5. I was able to make something "passable" using the Adaptec 2010S in jbod mode and running a software raid5 under x86_64, booting with the noapic option and running irqbalanced. Luckily boot, root and swap were on another storage device. This problem is really severe when the system binaries and swap exist on the storage device with the lag issue. (sorry to oversimplify) A previous poster mentioned not being able to log in; not surprising when the login binary and all the files in /etc are on the same affected storage device. I agree with another previous poster who mentioned never seeing the issue in RH versions previous to RHEL3; it is true, as I have used and tested all of these hardware devices under pre-RHEL releases with no issues.
I do not feel it is Red Hat's fault though. I prefer to see this as an "improvise, adapt and overcome" issue that we all, including Red Hat, face and will overcome together. I disagree with Doug Ledford on the "disk drive" issue. Doug is right about certain data streams being unfriendly with specific drives and firmware versions from time to time. In the case of 3-Ware, Adaptec and LSI RAID cards you have to remember that the OS never really has a direct interface to the drives themselves. The RAID controller, even in jbod mode, is between the drives and the OS. Unlike a regular SCSI (non-RAID) adapter, with a RAID adapter the drives are internal resources to the RAID card and the OS never really writes directly to them. So far I have personally seen these issues, as I said above, affecting multiple models of 3-Ware, Adaptec and LSI RAID controllers, and on those controllers have been Western Digital SATA, IBM/Hitachi SATA, IBM/Hitachi SCSI and Seagate SCSI drives. Far too many variables to support the "certain data / certain drives" theory. And as I said above, whatever the cause, it is much worse under x86_64 than it is under i686. I am seeing really solid performance in i686 with 3Ware 9500-12 with RHEL3 and 2.6.7 and 3Ware's 2.6 drivers. Again, just my personal experiences. I am by no means suggesting that the solution is to abandon RHEL's 2.4 kernels. Jeff I am using a 9500-12 under x86_64 and you are quite right: the problem makes the machine completely unusable when the OS is on the same controller. Do you see the problem in writes only, or in reads as well? Interesting comment about noapic; I am going to give it a try tomorrow to see if it helps. I recompiled the 3w-9xxx.o driver (x86_64) with the "use_clustering : 1," change and am booting with 'noapic pci=noapic noacpi' boot args. I have run iozone read tests as well as pushing around some 6GB files (cp, cat, dd) and the system is much more responsive.
I cannot speak to the use_clustering modification and overall system stability and production worthiness. It is however a big change. I am able to log in to the box, do long listings on directories where the files are being pushed around, etc. This was not possible before. I am still seeing the high io_wait stuff (high 90% on a processor) but the system itself is more responsive and doesn't lock up. The high io_wait does occur during reads and writes. A dd of an 8GB file from an ext3 fs to /dev/null hits 97% io_wait on a processor. I don't see it as a fix but it may shed light in a direction. Jeff I booted with 'noapic pci=noapic noacpi' (2.4.21-15.0.3.ELsmp x86_64) and my hdparm -tT /dev/sda speed jumped from 15 MB/sec to 56 MB/sec. Recompiling the 3w-9xxx.o with use_clustering gave a further increase to 82 MB/sec. With the latest 2.6 kernel rpm from Arjan's page I get ~50 MB/sec without the boot options and ~80 MB/sec with them. Bonnie under 2.4.21-15.0.3 (2.6 is a bit better at rewrite and random seeks) gives me:

Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
tb01            2G           12665   6  3782   1           44529   9 379.2   0
tb01,2G,,,12665,6,3782,1,,,44529,9,379.2,0,,,,,,,,,,,,,

I'm waiting for a reading from 3-Ware's linux-dev people regarding the "use clustering" modification and this whole thread in general. I am having similar results to Kostas and it seems *workable*, but I don't know how trustworthy it is. How is having the 3w-9xxx driver with "use clustering" set to 1/ON going to play when the rest of the OS references the definition which is set to 0/OFF? I'd like an opinion that would give me cause to go from being cautiously optimistic to happy. Jeff Jeff, I don't think that having use_clustering set to 1 will cause a problem.
From a quick look at the kernel sources it seems that ENABLE_CLUSTERING is only there for modules to request "clustering" (whatever that means). There are modules (e.g. drivers/usb/storage/scsiglue.c) that have use_clustering: TRUE, and they work fine. (I'll try to rebuild a machine with a different RAID card to see if it makes any performance difference.) In any case performance is still awful; at the moment I can't even imagine using the machine in a production environment. I am getting interesting results with iommu=force:

Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
tb01.hep.ph.ic.a 2G           55038  25  9635   4           85784  20 552.4   1
tb01.hep.ph.ic.ac.uk,2G,,,55038,25,9635,4,,,85784,20,552.4,1,,,,,,,,,,,,,

#dmesg
....
Checking aperture...
CPU 0: aperture @ 1e40000000 size 131072 KB
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
Mapping aperture over 65536 KB of RAM @ 4000000
....
EXT3 FS 2.4-0.9.19, 19 August 2002 on sd(8,8), internal journal
EXT3-fs: mounted filesystem with writeback data mode.
I/O error: dev 08:00, sector 4278190072
I/O error: dev 08:00, sector 4278190072
I/O error: dev 08:00, sector 4278190072

I think the problem that Jeff and I have might be x86_64 specific. Interesting data point! What is different in Mandrake 10's 2.4.25-2mdk that RHEL3 lacks or has added? I will post numbers later. Using identical hardware and identical SAN storage arrays accessed:

RHEL 3: 2.4.21-4.0.1ELsmp i686
Mdk 10: 2.4.25-2mdk smp i586

RHEL3 has stacked latency as bad as 150s on high loads. Mdk10 has stacked latency as bad as 10s on high loads. Both cases are unmodified bdflush from boot. With bdflush changed to "1 500 0 0 1000 3000 15 20 0": RHEL3 has stacked latency as bad as 30s. Mdk10 has stacked latency as bad as 2s. More data to come ... Is that with the same vm options?
I am running tiobench at the moment with different values for inactive_clean_percent, bdflush, pagecache, but it hasn't finished yet. Not quite the same as above, but happening to me. I have a Compaq ML350 G3, dual 2.8GHz, 642 RAID controller, 3x72Gb SCSI hard drives, and have all the problems that have been stated on this forum (high IO wait times - create a 10GB dd file and the system nearly stops; 30 sec for an ls on a small directory). I am now halting putting this server into production pending the outcome of this. The latest 2.4.21-15El3 doesn't help me at all. I am going to post my own trouble ticket as it doesn't fully match the above, but to let you know it happens on SCSI disks without 3ware RAID. This was using HP's latest EL3 drivers (7.1). Has anyone solved this problem? Could this possibly be related to bug #109420? I have similar experiences on my desktop, which is showing horrible latency with the latest kernel: 2.4.21-15.0.4.EL Maybe there are two problems, one RAID related and another problem either with the kernel scheduler or the IO elevator. I have a server with dual 2.8 GHz Xeon, and two 3ware 9508 eight-port SATA cards, running the 2.4.21-4.ELsmp kernel, with all the same problems. Once in a while high IOwait and a completely unresponsive machine. Since the machine is an NFS server this makes work impossible. These 9508 cards employ a new 3ware driver (3w-9xxx) rather than the standard 3w-xxxx. I have two further machines with the older 7504 cards and ATA drives, one running 2.4.21-15.EL (also affected) and the other one running Redhat 8.0 (kernel 2.4.20-20.8smp); the latter machine is working fine. We're seeing very similar problems with a Dell PowerEdge 700 with a CERC SATA card. Others are too: see bug 129545. We also have an active Red Hat support ticket on this problem: ticket 354372.
And there's a similar post to one of the Dell support forums: http://forums.us.dell.com/supportforums/board/message?board.id=pes_hardrive&message.id=15850 Am running a Xeon 2.6 with a 3ware 8506-12 and 12 Maxtor Diamond Plus 250 GB hard disks using LVM and Reiser filesystems. We haven't noticed any issues till today when we started doing some disk-intensive stuff. Here is the output of an iostat that shows the same iowait people are seeing in this thread, and low throughput performance. Interestingly enough, we're actually running Crux Linux with a 2.6.5 SMP kernel.

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00    7.69   82.38    9.93

Device:  rrqm/s  wrqm/s   r/s    w/s   rsec/s   wsec/s  rkB/s   wkB/s avgrq-sz avgqu-sz  await svctm  %util
hda        0.00    0.00  0.00   0.00     0.00     0.00   0.00    0.00     0.00     0.00   0.00  0.00   0.00
sda        0.00 1319.00 16.00  98.50   120.00 11340.00  60.00 5670.00   100.09     8.10  67.62  8.71  99.70
sdb        0.00    0.00  0.00   0.00     0.00     0.00   0.00    0.00     0.00     0.00   0.00  0.00   0.00

Any resolution to this problem will probably help me on my platform as well. Thanks Jason I am anxiously awaiting the solution!
I have this problem on an x86_64 system with the following configuration:

motherboard: Arima HDAMA, Dual 100/1000, 8x Memory Slots
CPU: (2) Opteron 240, 1.4GHz
System Disk: 80GB 7200 RPM IDE
RAID: 8 @ 200GB 7200 RPM SATA Drives
Memory: 3GB Registered ECC DDR 333, PC2700, 6x 512MB
3-Ware 8506-8, 8-port SATA RAID Card
CDROM
Red Hat Linux 3 U2 x86_64 AS

We are all waiting on a solution. Unfortunately we haven't had as much as an update in over a month. I'm tired of having to answer to my managers as to when one of our vendors is going to have a solution. Unfortunately their view of Red Hat has gone down as this bug has been known for a long time. Hopefully we have a solution soon so I can again try to get my company to resell Red Hat Linux. Why is this bug's severity and priority only 'normal'? I have to drive a 1/2 hour to the office EVERY TIME THE SYSTEM HANGS. This is sometimes during the work week but is often on my weekends. Also, this is our email and DNS server that has this problem. There is a chance that we could lose valuable email and data with every hang. Please increase the severity and priority and get this resolved. I completely agree, only my trips are 4 1/2 hour drives. I have already had to make that trip on an emergency basis because our COMPLETE ISP was down as a result of this bug. Thankfully we have remote RPC units; however, when the iowait problem creates other issues it requires a little more effort. It's a shame that we had to pay for this kind of service. Other Linux distros either have very little of this problem or none of this problem, and they are free. The impression my company had of Red Hat was that since we are paying for a product it would be well supported, which we are learning is not true. The worst thing is not the fact that there is no solution... it's that Red Hat isn't even taking the time to update us on what the status is.
------- Additional Comment #122 From Doug Ledford on 2004-07-16 13:33 ------- This was the last status report we had from Red Hat. The rest has only been more issues. This bugzilla has become an accumulation of several different problems. There are multiple people at Red Hat working on these problems as a high priority. It is an unusually complex issue, involving multiple components at several layers in the system. It is also organizationally complex because there are several bugzillas as well as reports through our formal support channels. I will try to post the status of these efforts more frequently to this BZ in the future. That said, we do have a promising avenue ready for you to test. Larry Woodman identified the following bug in wakeup_kswapd(). In wakeup_kswapd() we have:

/* ... and check if we need to wait on it */
if ((free_low(ALL_ZONES) > (kswapd_minfree / 2)) && !kswapd_overloaded)
        schedule();
.....

where free_low() is an int that returns a large negative number and kswapd_minfree is an unsigned int, so the if statement casts the free_low() negative to a large unsigned value. This results in processes sleeping when they should not. I will attach the patch here. Later today I will post the location of test kernels you can download. This patch has resolved an I/O problem that we reproduced. It is almost certain that this will not solve all of the different problems mentioned in this BZ. Please test this if you are able and let us know your results. Thanks. Created attachment 103341 [details]
Patch to prevent wakeup_kswapd() from blocking when it shouldn't.
The kernels with the kswapd fix described above are located in:
http://people.redhat.com/~lwoodman/.RHEL3/
Larry Woodman
Thanks Larry. Can you also provide an SMP x86_64 version of this patched kernel? The kernels for x86_64, ia32e, and ia64 are now on: http://people.redhat.com/~coughlan/.rhel3-u2-kswapd-fix/ Tom After installing the SMP x86_64 test kernel provided by Tom, I ran a couple of tests and watched with top. The test: tar and zip the contents of one of the RAID partitions to /tmp; at the same time, run a find on another large filesystem that has lots of files. The result: iowaits did reach a max of 99% on both processors but did not stay at that level very long. When the tar and find completed, the iowait returned to 0.0%. 3rd time through: the iowait peaked at near 100% on both processors and held that for approx. 15 seconds -- the find command was halted and the tar command had completed. It eventually did return to normal. 4th time: both processes lagged periodically, but the iowait didn't max until both the find and the tar were completed. It held a steady 99.9% iowait on both processors for approx. 15 seconds before it eased back down to under 1%. This is definitely a big improvement, but should it ever reach those high levels and hold them for more than a moment? OK, first of all, the iowait time being up around 100% is not an indicator of a problem; it's normal on an IO-bound system. The system takes timer interrupts every 10ms (1/100 sec) and determines what the system is doing by where the PC was when the interrupt occurred: user code, system code, idle loop; that 10ms time slice is charged with whatever was happening at the time. The idle loop is actually split into 2 categories: idle and iowait. If the system was in the idle loop and there was at least one IO operation outstanding, that 10ms slice was charged as iowait instead of idle. So, if you have a single program running that does nothing but read() calls, the system will pretty much show up as 100% iowait because there was an IO outstanding when every interrupt occurred.
The reason you never saw this before is because the splitting of the idle slice into idle and iowait is new in RHEL3. Second, I think what is going on with this poor system performance during disk IO activity is that when a process does lots of disk IO it will eventually run the system low enough on memory that __alloc_pages() will call wakeup_kswapd() because the free_inactive_clean count falls below the low watermark (this is the normal steady state that the system enters under load). Since wakeup_kswapd() has the bug described above, the process will block and context switch to kswapd, which will free a page, wake the process back up and context switch back to that process. This context switching between user processes and kswapd (and keventd) is causing the device queues to be plugged and unplugged much more frequently than when there is no rapid context switching occurring. This causes the disk IO elevator algorithm to fail much more frequently and smaller IO operations to be started. Larry Woodman *** Bug 130357 has been marked as a duplicate of this bug. *** Well.. the system hung again last night with the test kernel provided. So.. that didn't solve the problem. Please post any other solutions.. Created attachment 103499 [details]
This patch was all wrong (removed)
Someone has been kind enough to fix this behaviour already in the 2.6-series 3w-9xxx
driver. I just merged some parts of it to make twa_scsi_queue() return
something other than zero (0) every time.
This was supposed to be _above_ the attachment. I am not good with bugzilla, so I was foolish enough to try to attach before saving ... For what it's worth, I have been struggling with this 3ware issue lately too. I own a 9500S-8 and have 8x250GB SATA disks attached to it. For me, x86-64 worked much better than i386, so when I did some testing under i386, I started to look at the code more carefully. What I did was apply the above kswapd patch to the 2.4.21-20 series RHEL3 kernel and make the attached patch by looking at the code currently in the 2.6-series kernel (namely 2.6.8.1). I am not an expert with kernel internals, but as I see it, the function twa_scsi_queue() returns '0' even when the queue is full. This seemed not right to me, so I modified the code. This system has now been running two hours and it's receiving about 40MB/s sustained as an NFS server, and uses the extra CPU cycles left for compiling a kernel with +20 nice. Before, I could not get it to run even 15 mins before I triggered the '100% iowait and system very much boned' symptoms. Does the kernel scsi_queue code try to requeue requests like crazy with 3w-9xxx returning '0' every time even when the queue is full, or is it just the kswapd patch which is curing the damn thing? The driver I use is from the 3ware '902-bindle', the 3ware 9000 Storage Controller device driver for Linux v2.24.00.011fw.

3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfc002000, IRQ: 21.
3w-9xxx: scsi0: Firmware FE9X 2.02.00.012, BIOS BE9X 2.02.01.037, Ports: 8.

The box is a normal (Intel server board STL2 or something) dual-PIII. The box has a 64bit/66MHz PCI bus, but PC-133 SDRAM isn't exactly flying when we talk about speeds like 100MB/s for I/O. Comment on attachment 103499 [details]
This patch was all wrong (removed)
This was all wrong and has been removed. By accident it wasn't even enabled on
the kernel I was running when submitting. Later I realized that it
wasn't actually working for the 2.4 series as expected.
Am I accurate in saying that this issue has been resolved in the Linux kernel version 2.6.x? I am getting advice to drop Red Hat in favor of Fedora Core 2, but if it is just a question of installing the new 2.6 kernel, why not just do that? Red Hat folks, is there a dev version of the 2.6 kernel available in RPM format? Has that been tested for this bug? I am also seeing that I would need to get a new version of modutils called module-init-tools. I am just hoping that installing these new modutils does not cause problems with the 2.4 kernel -- in case a patch is ever released. Experiences? Troubles? Install Fedora? Michael, we are working on fixing the issue in the RHEL3 kernel. If you need to have a supported configuration (e.g. 3rd party apps) you will have to be somewhat patient; it's easy to fix a bug, but doing so without introducing any new bugs isn't ;))) However, if you want to try the 2.6 kernel on RHEL3 or FC1, you will also need some additional RPMs. You will be able to find the kernel 2.6 RPMs and the needed RPM upgrades at: http://people.redhat.com/arjanv/2.5/ Re: comment 164 The "modutils" package available at http://people.redhat.com/arjanv/2.5/ also has module-init-tools, so if you upgrade to that modutils package, modules will work under both 2.4 and 2.6. Once you have the new modutils package installed, the main issue to watch out for (in my experience anyway) is that you may need to change your XF86Config to use /dev/input/mice as your mouse device. (This may be an issue if you're using a PS/2 mouse. If you're using a USB mouse then you almost certainly won't need to change your XF86Config.) I hope this helps... "Rik van Riel on 2004-09-08 11:06" said that we need to be "somewhat patient". It has been almost 5 months; how much longer do we need to wait? Why can't we get a specific update on the progress of fixing this bug?
Since Red Hat was built upon the concept of an open source community, wouldn't it be better to include all of the interested parties with details of the actions that you are taking, so that we could contribute to our shared goal of fixing this problem? Who can I contact to give this problem the attention that it deserves? LaVar LaVar, the problem is that there isn't a single bug underlying this problem, but rather several interactions between various subsystems, each of which need subtle changes. The support people are coordinating the testing of fixes with various customers, but this happens inside the support system, not bugzilla. Bugzilla is mostly a system to gather the info that engineering needs to fix a bug; what you want is probably support's tracking system, Issue Tracker. If you want regular status updates from the people who coordinate things, please open a ticket with the support people: https://www.redhat.com/apps/support/ I opened another request with support today. I started using bugzilla in an attempt to get this problem resolved. Following are the results of my two previous interactions with support, where my requests about this issue were quickly closed: "The 3Ware 8 hundred series is as of the moment unsupported. The 7500s work but this line still does not have certified drivers inside the kernel." and "We do not cover the recompilation of the kernel because we don't want you to use a kernel that we did not release." I will post the results of any progress made relating to this problem. LaVar I can't seem to get the latest Fedora Core 1 smp x86_64 kernel to work. It keeps panicking. The output from the panic is very cryptic and I just can't have the server down long enough to figure out why. I also tried to build a straight Linux 2.4 kernel with no success (kernel panics when it reaches the insmod portion of boot). Getting another kernel working in the meantime may be an off-topic thing; if anyone can assist me, I would appreciate it.
As it stands, this weekend (Sept 11 & 12) is my only window of opportunity to stabilize our system. If there is no patch or kernel to at least keep the system from freezing, I will need to rebuild the server with another distribution. I REALLY would rather not do that, but I am running out of options.

I have been focusing on identifying the parts of this problem that may be specific to the 3w-xxxx driver or hardware. There are a number of other people focusing on the VM subsystem and block layer tuning. I have reproduced a situation on the 3ware where vmstat shows that there is little I/O occurring, but I/O wait is 100%. In my case the system remains reasonably responsive and does not hang, but there may be enough here to figure out what is wrong. If not, I intend to continue to vary the workload and configuration until the problem is reproduced and a solution is found.

I've been mostly looking at whether there is something wrong with the 3w-9xxx driver (it's amazingly close to the 3w-xxxx driver, though). I can easily reproduce a situation where the I/O wait is 100% but there is only some 200-500 kB/s of disk I/O left. No 'hangs' per se, but the system really isn't responsive. I've been adding a lot of debug hooks to the 3ware driver to see if it fails -- no. Then, looking at the mid layer, it seems to be queueing at least some requests as things go forward. SCSI logging doesn't seem to give any hint AFAIK, but I am NOT a kernel expert and don't claim to be one. I thought I had nailed it with the high-level mods: I actually took out the 'enterprise code hook' from kupdated (fs/buffer.c), i.e. do_io_postprocessing(). That made it harder to trigger, and after also tuning the bdflush parameters back to the vanilla 2.4 values, I wasn't able to trigger it locally anymore. But after adding load from NFS clients I eventually triggered the situation again. So is it possible that the upper levels are just messing up the I/O scheduling somehow?
It sure seems like that's the case, as the I/O is still flowing all the time, but very, very, very slowly. Stopping all the processes causing the high I/O and syncing buffers releases the I/O wait situation. Without the mods, running tiobench.pl (0.3.3) in a loop usually triggers it during the 8-thread writing phase; when you just run that in a loop it's most commonly triggered around loop 8-10. Then the system recovers when tiobench syncs and starts over. So mostly it seems to be related to several parallel threads competing for disk I/O. The same happens when there are boxes competing for I/O as NFS clients.

I did briefly (a few hours) try to trigger this on ext2 and didn't succeed, so now I am backing out the mods and going back to ext2 to see if I can trigger it there at all. With ext2 it was much harder to trigger anything that 'starves the I/O scheduler'. There were some noticeable hangs of 10 seconds or so, but generally with ext2 the performance is quite stable: 40 MB/s writes and 70-100 MB/s reads. Then I unplugged the array and put it on a dual P4/Xeon (HP ProLiant). There the write performance maxes out at some 10 MB/s regardless of the tunings. Reading is much faster than on the dual PIII, as expected, in the 180 MB/s range, but writing just hangs immediately when there is a considerable amount of it. Just a few thoughts for today ...

I am not sure if this means anything, but my backups finished in half the normal time (this includes a full backup of our largest RAID partition on the troubled system). The current kernel is 2.4.21-20.ELsmp with the 3ware driver from version set 7.7.1. This was for the 8506 series RAID controller. I compiled the driver and just replaced the one in the /lib/modules/2.4.21-20.ELsmp/kernel/drivers/scsi directory. It could be a fluke.
I moved my home server from a Compaq XP1000 (UP Alpha EV67-667 processor) to a new dual 3.2GHz Xeon box last week; I just took the 3ware 7506 + 4 WD250 drives out of the Alpha box and plugged them into the Xeon box. Well, I'm suddenly seeing the iowait problem on the Xeon box as well, and performance is terrible. The RAID 5 array was running without any problems on the Alpha under kernel 2.6.5 and I never had high latencies; bonnie reported 45 MB/s write and 120 MB/s read. On the Xeon it's <10 MB/s write and about 30 MB/s read, and it's impossible to ssh to the machine while bonnie is running. The kernel is 2.6.9-rc1, the dist is FC2. So this problem is in no way connected to general 3ware performance, as someone suggested. It really appears to be a (nasty) bug.

Oh, and I forgot to mention that I found it quite strange that (even though I turned off swap for the test, so there was really no active swap space) kswapd appeared to wake up frequently and use CPU time (1%). Maybe not relevant, but I wanted to mention it.

Is anyone else on this bug seeing very strange iostat behavior related to this bug? Do you see 'nan's show up in the output of something like iostat -x /dev/sda 1 86400 ?

I just wanted to report that I have tried the patch posted on September 1st, which I applied to the RHEL3-U3 kernel (2.4.21-20.ELsmp), and although it hasn't completely fixed the latencies I have noticed, it has made a noticeable improvement. Have any more bugs been found which are contributing to this problem? Are any more patches ready to be posted and tested?

Larry Woodman located a problem that causes heavy swapping when there is a large filesystem load (e.g. when large files are copied). When the pagecache fills up to the point of memory reclamation, the system incorrectly swaps out inactive dirty anonymous pages even though the pagecache is over /proc/sys/vm/pagecache.maxpercent. Bugzilla 132155.
The fix is to add code in launder_page() to reactivate anonymous pages if the pagecache is over maxpercent. Several people report that this patch significantly reduces or eliminates the swapping when copying large files around. The patch is attached. Created attachment 104296 [details]
reduce swapping during excessive pagecache use
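The symptom this patch addresses (anonymous pages being swapped out while a large copy fills the pagecache) can be quantified from vmstat while reproducing. A minimal sketch, assuming the classic procps `vmstat` layout where the swap-out column `so` is field 8; the helper name is my own:

```shell
# Sum the swap-out (so) column of `vmstat` output, giving the total KB
# pushed to swap over the sampling window. The first two lines of
# vmstat output are headers, so they are skipped.
total_swapout() {
    awk 'NR > 2 { sum += $8 } END { print sum + 0 }'
}

# Illustrative use while copying a large file in another terminal:
#   vmstat 1 60 | total_swapout
# A healthy kernel should report ~0 here; the pre-patch kernel reports
# large values once the pagecache passes pagecache.maxpercent.
```

With the launder_page() fix applied, the number reported during a large copy should drop to near zero.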
We have done some performance tests to determine what portion of the I/O performance problems reported in this Bugzilla may be specific to the 3ware. As mentioned earlier, there are likely to be several problems here. The goal of the following is to isolate the impact of just one of them. Thanks to Joe Salisbury <jts> for much of this work.

###########################################################
Summary of testing performed against the 3Ware 8506 adapter
###########################################################

Some testing was performed using the tiobench benchmark. The tests were performed on a system with a hyperthreaded Xeon CPU running at 2GHz. The storage sub-system initially consisted of a 3ware 8506 adapter attached to three SATA disks, configured as RAID 5. The 3ware adapter had write cache enabled and used a 64K stripe size (the default). Rawio was used to read and write to the RAID 5 device; this removes factors related to the VM and the filesystem.

The 3ware adapter achieved "reasonable" results while performing sequential reads and writes with one thread. Random writes perform very poorly, even considering the RAID 5.

Threads   Write         Random Write   Read          Random Read
1         42.730 MB/s    5.609 MB/s    32.620 MB/s   28.153 MB/s
2         14.437 MB/s    5.185 MB/s    32.997 MB/s   29.456 MB/s
16         4.495 MB/s    3.467 MB/s    32.146 MB/s   30.630 MB/s

As threads are added, the I/O pattern becomes random, and we see the overall performance rapidly become limited by the random performance. Note how the sequential write performance has degraded by 90%, while reads are not impacted.

For comparison, a megaraid adapter was substituted for the 3ware in the same system, using the same three disks. The storage was also RAID 5, using the default parameters.
With one thread the megaraid showed poor sequential write performance, but sequential read was better than the 3ware adapter. For the megaraid, though, the sequential write performance actually improves slightly when more threads are added, rather than falling off a cliff the way the 3ware does.

Threads   Write         Random Write   Read          Random Read
1         11.606 MB/s    9.695 MB/s    68.686 MB/s   39.123 MB/s
2         14.962 MB/s    8.788 MB/s    49.184 MB/s   38.827 MB/s
16        19.676 MB/s    5.563 MB/s    51.950 MB/s   40.797 MB/s

So, although the "base" write performance for the megaraid is low, it does not fall off a cliff like the 3ware does. It may be this cliff that is responsible for some of the performance problems people are experiencing. We are looking at the cause of this 3ware problem, as well as the more general problems that may involve the VM and filesystem.

I rebuilt our server using Fedora Core 2 x86_64 about 2 weeks ago. The system is a dual processor AMD Opteron with a 3Ware 8506 RAID controller. The system has been up for 15 days without incident and appears healthy. The Fedora Core 2 distribution uses the 2.6 kernel -- which seems to have resolved the problems described in this bug.

Note from support to LaVar: I have consulted the senior engineers on this. They are still working on this bug. They have addressed most of the issues with the kernel in RHEL3 U3, which you are advised to use. Then keep your system updated with any bug fixes or patches via up2date. You can get the iso images from this site; log in at the RHN site first, then go to this link: https://rhn.redhat.com/network/software/channels/downloads.pxt?cid=1187 Download the iso images for U3 and burn them to a disc. Use this to install, and get the updates for it.

#####

LaVar comments: I have done as support requested. Reinstalled the OS and installed all of the updates available using up2date.
The performance has improved as indicated in another posting, but it is not acceptable, and it is not yet even close to the performance of the non-RHEL distributions (e.g. Fedora Core 2, SUSE 9.1, and Red Hat 7.1) that I have tested. LaVar

LaVar, there are more improvements forthcoming in U4. Doug, is the ext3 reservations patch still being backported to 2.4?

I believe that internal file fragmentation is in fact one of the significant factors contributing to the issue. I can see the problem slowly escalating on a system over a period of more than 2 months, where there are many files which are small but larger than the block size, and usually many files being written to disk in parallel (in which case fragmentation is almost certain for each file).

We're severely affected by this same issue. I/O seems to 'bank up' even under moderate load, both reads and writes, causing an effective hang of the system for inordinate lengths of time. Are we likely to see a fix for this critical issue any time soon?

Information that may be of use: the problem affects all systems we have tested with RHEL3. This includes IBM Xeon with SCSI, Dell HT P4 with SATA, and generic P4 with IDE devices. The problem is solved by reverting to, for example, the RH9 update kernel 2.4.20-31.9, though this is of course not a real solution but rather a proof of the problem. Writes on the EL kernel will see iowait use 100% CPU time, where the RH9 kernel will show close to 0% iowait most of the time, with high CPU idle. The problem has only become evident in production as it grows exponentially with utilisation, eventually crashing the system. This really must be treated as a critical issue.
Tramada, the RHL9 kernel doesn't measure iowait. That is the reason the utilities show zero iowait with RHL9. The system is still waiting for I/O, though ... Now, I/O performance in RHEL3 falling off under load _is_ a problem, and it is being addressed. We have to be careful to stick to the real problem, though, and ignore the cosmetics.

Thanks for the info on the RH9 kernel, Rik; of course it doesn't really matter what the utilities show, but the 0% idle time on the EL kernel can be both seen and felt (i.e. it is cosmetic and functional). Looking forward to an update. Thanks again.

As mentioned earlier, we expect that a number of the I/O performance problems will be addressed in the U4 kernel. You can get a preview of this kernel for testing purposes at: http://people.redhat.com/coughlan/RHEL3-perf-test/ This is an experimental kernel intended for testing purposes only. It has not been through QA or beta test, and must not be used in production. What to expect:

- This kernel includes the VM fixes discussed earlier, plus a few more. As noted earlier, this has produced a dramatic improvement for most workloads, though some specific performance problems remain.

- The superbh feature in RHEL 3 causes I/O size to be limited to 32K (BZ 131391). This can reduce performance for raw io, and some Oracle configurations. If you run one of these workloads and have a performance problem, please run a test after doing: "echo 2 > /proc/sys/fs/superbh-behavior". Let us know the results. Please note that the superbh-behavior sysctl is not currently planned for U4. It is provided in this experimental kernel for testing purposes only.
- The poor 3ware RAID 5 random write performance reported in comment 183 still exists. This is largely a hardware limitation. Differences between RHEL 3 and other kernels may be related to the smaller I/O size in RHEL 3. Unfortunately, the superbh change described above does not improve this for 3ware. This is still being investigated.

- The 2.6 ext3 reservations patch mentioned in comment 122 has not been ported to 2.4. This is still under investigation for a future update.

- As stated earlier, performance for some workloads may improve with a larger value for /proc/sys/vm/max-readahead, and lower elvtune values.

If you have a performance problem with this kernel, please 1) describe the test or workload type (filesystem or raw, random or sequential), 2) say which HBA you are using and how the storage is configured, and 3) say how you measured the performance.

... and - no sources. I personally can't even test it if I cannot merge some other patches in, so a binary alone is just useless ...

Okay, the source is there now. http://people.redhat.com/coughlan/RHEL3-perf-test/SRPMS/

Thanks for the update, Tom. It doesn't look to have improved things much, though; using top we're still seeing 0% CPU idle during disk activity on all tested systems, where the RH9 kernel shows ~80-100% idle. Tests were:

- dd from /dev/zero to a file on an ext3 FS
- dd from /dev/zero to a raw device
- mkfs.ext3 on a raw device
- cp of a file on an ext3 FS, from deviceA to deviceB

Tested configurations:

- Non-HT 1.6GHz P4 with PATA
- 1.133GHz P3 with SCSI

"Tramada, the RHL9 kernel doesn't measure iowait. That is the reason the utilities show zero iowait with RHL9. The system is still waiting for IO, though ..."

To explain this a bit more --- iowait time *IS* CPU-idle time. It's just that the RHEL3 kernel accounts for CPU-idle time when the disk is busy differently from when the disk is idle too. If the disk is busy but the CPU is idle, RH9 will show the CPU as idle, but RHEL3 will show it as iowait.
The difference is only in the accounting. In other words, if you have a busy disk test which keeps the disk occupied most of the time but doesn't load the CPU much, then you should expect to see 0% idle and high iowait on RHEL3, but 0% iowait and high idle on RH9. So "0% CPU idle during disk activity on all tested systems, where RH9 kernel shows ~80-100% idle" is just showing expected behaviour. iowait/idle time is really not much use for debugging disk I/O performance problems. It's *far* better to use something like "iostat -x", which can show real disk response times for the requests in progress, or even just to time the speed of a particular application using the disk. All iowait shows you is that the disk is busy; it doesn't tell you anything about how fast it is working.

Thanks Stephen, will try to report back with some useful information soon. For now, it may be worth reporting that throughput, measured when writing direct (dd from /dev/zero, bs=1024k) to a raw device, was about 25-30% lower on the EL3-perf-test kernel vs the RH9 kernel, with kernel.org 2.4.27 matching the RH9 kernel's performance.

There are some logs from this later test kernel at http://blahee.no-ip.org/iowait/ Short version: the published test kernel seems not to have changed the behaviour for ext3 at all. The problem is still very easily triggered with ext3: out of nowhere, write performance drops to the 'MB/s range' and stays there until the sequential write stops. The kernel does have XFS from 2.4.28-pre3 merged in, just to do reference testing and confirm that XFS doesn't suffer the same symptoms with the 3w-9xxx driver (which is patched in too). The 3w-9xxx has clustering fixed 'as 1' to make that happen, and TCQ limited to 16/LUN even though 8/LUN seems to be optimal. tiobench-0.3.3 is used, as it's just a very easy way to trigger this behaviour very fast. Sometimes pushing data over NFS (Gbit) is much harder, but it triggers the behaviour sooner or later too.
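Stephen's point above deserves a concrete example: pulling the per-device `await` (average request response time, in ms) out of `iostat -x` output. A sketch that locates the column by its header name, so it isn't tied to one sysstat version's exact layout; the function name is my own:

```shell
# Print the await column for one device from `iostat -x` output,
# finding the column position from the header line rather than
# hard-coding a field number.
await_for() {
    dev="$1"
    awk -v dev="$dev" '
        /await/ { for (i = 1; i <= NF; i++) if ($i == "await") col = i }
        $1 == dev && col { print $col }
    '
}

# Illustrative use:
#   iostat -x /dev/sda 1 10 | await_for sda
```

Watching `await` climb into the hundreds of milliseconds while throughput stays low is the signature of the starvation described in this bug, and it says far more than a 100% iowait reading.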
The hardware is a dual PIII (Intel STL2) w/ 512MB memory, to make benchmarking easier (limited memory), with a 3ware 9500S-8 and 8x250GB SATA disks (not all the same brand anymore, as the Maxtors are failing faster than they can be replaced :). Speeds in the 'hundreds of MB/s' begin to be biased by the fact that an SDRAM-based setup has a clear memory bandwidth limitation at those speeds already, but that's the machine I can afford to make my new NFS server at home, so that's the machine I'm testing the setup on now.

The recent talk about the iowait percentage isn't relevant. I do clearly understand what it's all about, even though it has been confusing for a lot of people. I remember explaining many times to several people that 'even if you do have near 100% iowait, it's NOT a problem'. I don't know, but it seems that ext3 is doing something with the journal/other metadata and blocking other write access somehow. I really don't know, and I don't know how to verify that either. I am not much concerned about the random I/O speed now, as the sequential writes are sometimes down to MB/s levels for a long time. The reason I was complaining about the sources was that I'd like to see which of those 'small patches laying all around' were merged, and to patch in XFS, as that is my baseline FS for this case - I know it's working. Having the 3w-9xxx in the kernel tree isn't a bad thing either (it's the 902 level from 3ware anyway for me).

I did make some tests with the RHEL4 beta too, like two weeks ago, but the 2.6-series kernel is just so slow - even with 'elevator=deadline'. I could not find any effective tunables to make it even near the 2.4-series I/O performance (speed-wise).

We're experiencing poor disk performance on a fully up2dated RHEL AS 3.0 on an HP NetServer LT6000 with a NetRAID (SCSI) controller and 3 disks in RAID-5 (megaraid driver during the test, not megaraid2).
We were testing the kernel-smp-2.4.21-21.ELperftestonly2.i686.rpm posted above, and while fiddling around with various tests without seeing any improvement, we ended up with disk corruption during some pax tests. "Never mind that," we thought, continuing with more pax'ing. Suddenly the load rose to 8.13 and the server froze. I'm trying to post the last update we got from "top i c", where the bdflush in state DW may be of interest. elv parms were the default (2048/8192) and readahead ditto (3/31). The server was doing next to nothing other than two pax jobs and some network activity at the time it froze. The server remained responsive to ping, but it was impossible to log into, and all my remote login terms froze. Powersave had turned off the console, and nothing could wake it. We're not running apmd, but we do run gpm, so we can normally wake the console with the mouse.

The RAID is giving us 20-30 MB/s sequential transfer (tested with 'hdparm -t' and large 'dd's), which is less than half of an old RH7.1 running on exactly the same type of server. Performance happily drops to a few megs if we actually *use* the server for anything. That really blows for a SCSI RAID-5. Tests along the timeline RH7-RH8-RH9-FC-RHELAS2.1-RHELAS3.0 reveal an amazingly steady decline in disk performance (on these servers at least).

Created attachment 105592 [details]
top output
From Red Hat support: after sorting out our IO performance tickets, several customers have found that the test kernel posted in comment #196 resolved their issues. We also directed another set of IO performance tickets (performance dropped and/or the system locked up when moving/copying large amounts of small files) to try out the test kernel posted in bugzilla #132639 comment #77, and got good news from that group too. These two kernels may not be the "cure-all" solutions for all the RHEL IO performance issues, but they do help. For Red Hat internal reference: the newest good news is from IT#51878.

Re: comment #203
> Tests along the timeline RH7-RH8-RH9-FC-RHELAS2.1-RHELAS3.0 reveal an
> amazingly steady decline in disk performance (on these servers at least).

Hmmm... the actual timeline is closer to RH7.0-RH7.1-RH7.2-RHELAS2.1-RH7.3-RH8-RH9-RHELAS3.0-FC1-FC2. (If you're comparing them with fully up2date kernels, the timeline is probably more like RH7.0-RHELAS2.1-RH7.1-RH7.2-RH7.3-RH8-RH9-RHELAS3.0-FC1-FC2.) So, if your "timeline" reflects a downward slope in disk performance, then it's actually been jumping up and down from release to release.

Re #207: Yeah, FC slipped in there and wasn't supposed to. I guess I should think of EL/AS as a completely separate branch with a vmsub/iosub that just doesn't work (properly) on the server I refer to in #203. It doesn't matter.

Re: #206: Did you read my #203 where I reported a lockup during two pax archiving jobs (copying lots and lots of files of all kinds of sizes) with the kernel you refer to? I found it disturbing that it looked like the kernel locked up when bdflush woke up to flush some dirty buffers or whatnot. The D state means it was most likely waiting for disk IO, or am I way off? Of course, the top output isn't exactly real time, so maybe I'm just blowing smoke. The kernel still did lock up, though. Was my error in timelining more interesting than my report of a kernel lockup?
HR, if you are able to reproduce the hang that you saw in #203, it would be most helpful to get some alt-sysrq output from it. To do this, enable the serial console by adding something like this to the kernel line in grub.conf:

console=tty0 console=ttyS0,115200n8

While we are at it, let's turn on the watchdog timer by adding:

nmi_watchdog=1

Connect to the serial line and capture the output. Then:

echo 1 > /proc/sys/kernel/sysrq

When the system locks up, type alt-sysrq-m and alt-sysrq-t. Thanks, Tom

I have been experiencing rather bad performance using EL3 with ext3 and mysql. I've created a dual boot, one with all ext3 and the other with all ext2 filesystems. All else is the same: a rather standard Dell system with IDE drives, no RAID, no fancy graphics, sound, etc. My test environment is a mysql db with one program connected locally and 12 clients over the network (on a local LAN). The only variable here is ext2 vs. ext3. Performance is rather bad with ext3, measured by determining how much data I can send the local process that is heavily feeding mysql (the 12 remote clients access data and do only a single query every 10 seconds). For numbers, I see about 50% iowait with ext3 and none with ext2. When the ext3 50% iowait occurs, throughput drops by a factor of about 3. I realize these numbers are just eyeball estimates, but I have gone back and forth between the ext2/ext3 systems a few times, this is the only change, and performance is very bad with the ext3 systems while quite good with the ext2 filesystem. Sorry if this has already been mentioned, as this is a very long bugzilla log and I have not read it entirely. Eric

Eric, if you have not already seen these basic guidelines for ext3 tuning, they may provide some help: http://www.redhat.com/support/wpapers/redhat/ext3/tuning.html The noatime suggestion may be relevant to your situation. Tom

FYI, the perftest kernel posted in comment #196 does not report correct process memory usage, which was reported in bug #137927.
In that bug report, Tom mentions that this is even seen in the 2.4.21-23.ELsmp (RHEL 3 U4 beta) kernel. Does that kernel have any more patches or fixes that address this IO problem? I tried the perftest kernel above, but didn't see any improvement. One important note: the IO degradation appears to get worse over time. I mentioned before that I am having problems on a desktop system, which get worse the longer I keep the system up and stay logged in, with processes like mozilla getting older and growing in size. I hope the U4 kernel will fix or improve the IO performance.

Re #203: I have been experiencing very similar lockups using RH9, RAID and ext3. The machine can still be pinged, but nothing else works. When trawling through the logs, I found that something is causing the load to go through the roof. You can see this in the maillog of all places, because sendmail temporarily stops working when the load goes over about 50. When the RAID disks are being hammered, the I/O is under 10 Mb/s, and it often freezes up for a period of 30s-1 minute. Occasionally (once a day under heavy load) the entire machine locks up (similar to report #203), and a reboot is necessary. By the way, here is something which appears in dmesg when the system freezes up.
Not all the freeze-ups are fatal (i.e. require a reboot), but a good proportion are:

Unable to handle kernel NULL pointer dereference at virtual address 00000080
 printing eip:
c012c897
*pde = 00000000
Oops: 0000
eeprom w83781d i2c-proc i2c-i801 i2c-core iptable_filter ip_tables autofs nfs lockd sunrpc e1000 e100 sr_mod ide-scsi ide-cd cdrom 3w-xxxx sd_mod scsi_mod loo
CPU:    0
EIP:    0060:[<c012c897>]    Not tainted
EFLAGS: 00010202
EIP is at access_process_vm [kernel] 0x27 (2.4.20-8smp)
eax: 00000000   ebx: eeba8280   ecx: d4e66000   edx: c3160000
esi: 00000000   edi: c3160000   ebp: c3160000   esp: e141def0
ds: 0068   es: 0068   ss: 0068
Process ps (pid: 18992, stackpage=e141d000)
Stack: c015f946 c5687400 e141df10 00000202 00000001 00000000 e141df84 e609cd80
       e31b000c 00000202 00000000 c3160000 00000000 00000500 000001f0 eeba8280
       00000000 c3160000 d4e66000 c017a2b9 d4e66000 bffffbe0 c3160000 0000000d
Call Trace: [<c015f946>] link_path_walk [kernel] 0x656 (0xe141def0))
[<c017a2b9>] proc_pid_cmdline [kernel] 0x69 (0xe141df3c))
[<c017a6e7>] proc_info_read [kernel] 0x77 (0xe141df6c))
[<c0152457>] sys_read [kernel] 0x97 (0xe141df94))
[<c01517e2>] sys_open [kernel] 0xa2 (0xe141dfa8))
[<c01098cf>] system_call [kernel] 0x33 (0xe141dfc0))
Code: f6 80 80 00 00 00 01 74 2e 81 7c 24 30 40 a2 33 c0 74 24 f0

For comment #214/#215: this is *not* an iowait issue - the system had actually crashed (panicked). Could you contact Red Hat support so our front-end support engineers can walk you through setting up a netdump to obtain a vmcore? At a minimum, another bugzilla would be helpful. AND for other readers (of this bugzilla) who will contact or have contacted Red Hat support: *please* don't just say "we have an iowait issue identical to bugzilla 121434". It can be very misleading and drags everyone in the wrong direction. We have been receiving lots of false alarm calls due to this bugzilla.
I would like to re-word the above comment - it is not an "iowait performance issue", and comments #214/#215 are very much *appreciated* since they give us a good clue - as opposed to a general and vague statement such as "our system locked up due to the iowait issue described in bugzilla 121434".

Using the RHEL 3 U0/1/2 kernels (with and without SMP) we have high iowait, and the number of context switches can easily reach 20000!!!! We have tried the following hardware:

* dual Pentium III with internal SCSI disks
* IBM x345 and x365 with FAStT 900 SAN

Moreover, we have observed that fsck.ext3 runs forever and we have to reset the server! The kernel mentioned in #196 does not make the problem go away, and a plain 2.4.28 kernel from kernel.org lowers the iowaits but also the I/O performance, by a factor of 10 (roughly). Btw, as an I/O traffic generator we're using IOzone <http://www.iozone.org>.

For comment #218: 1. Was the problem reported to Red Hat support or a sales rep? If yes, could you send your ticket number to wcheng? 2. Is the system using hyperthreading? If yes, turn it off to see how it goes.

Just thought I'd mention that we've been seeing similar problems with NFS and local disk IO. I have tried to narrow down the possible problems, and have started a new bug report, #139937. We saw this problem on a brand new Dell PowerEdge 2650 using a megaraid-based PERC card. We've since taken the server offline until we can come up with a solution. Does anyone have a list of commands that will reproduce this problem 100% of the time? Thanks -Ron

Hi, I am having the exact same problem here.. during bonnie++ benchmarks we get extremely high iowait, and 99% of the time we end up getting kernel dumps.

We are encountering this problem while running the LS-DYNA benchmarks. Setup: SE7520JR2 server with a single SATA drive, ICH5R controller. Very high iowait; the application only gets a small percentage of CPU time, as all time is spent waiting for IO. Anything new on this issue?
Peter, and others, please review Larry's comment #158 and Stephen's comment #200 regarding high iowait. iowait time is idle time while there is I/O outstanding. This by itself does not indicate a performance problem. "iostat -x" statistics, as Stephen suggested, would be more helpful. The U4 kernel, shipping later this month, has the improvements described in comment #196 plus a few more minor fixes. We are working on a fix for U5 that will increase the 32K limit on raw io. We are continuing to investigate the remaining problem reports. It is likely that there are still multiple problems mixed together here. Some may be adapter/driver specific, others may be higher in the stack. To ensure that your particular problem gets addressed, you should file a detailed problem report through the Red Hat support organization. If that is not an option for you, then I suggest that you open a new bugzilla with a very specific problem description, including configuration, workload, and performance data. Tom

Re: comment #222, from James: please provide detailed information on your kernel dumps, preferably through Red Hat support or in a new BZ. Crashes should be addressed separately from the I/O performance mega-bugzilla-from-hell. ;^{ Tom

Re: comment 224: since the topic of this bug is specifically the 3Ware array and I/O problems, is Red Hat testing with this hardware, and does the U4 kernel fix the problems? Are there issues with the U4 kernel and 3ware that Red Hat is aware of? From the comment it appears it was a general fix.

To follow up on comment #218, the problem with hanging the server (using IOzone or sometimes even just fsck) turned out to be caused by the IBM/Engenio RDAC failover driver, not the kernel proper. There is still a problem related to the I/O size, but that is more of a performance issue.

I don't think this is a Red Hat specific problem. I was also having this problem with an older 3ware 7500-4LP PATA card with a two-disk RAID 1 (RHEL 3).
Only thing you need to do is have one I/O-intensive process like "cp /dev/zero test" and all reading from the RAID disk will be blocked, down to almost zero bytes per second. I also have an 8506-4LP SATA card with RAID-1 (Red Hat 9), a 7500-4LP PATA card with RAID-1 (Red Hat 9) and a couple of 9500S 4-port SATA cards with RAID-5 and RAID-1 (White Box Linux & RHEL & Debian). All of those machines have the same problem, with the distribution's own kernel or with a vanilla kernel. I have tried elevator tuning with different parameters but no help. I don't know if 3ware's card is rubbish or what... but it would be nice to get this fixed. =) 3ware's support is blaming the Linux kernel, but when I asked for specific information they were not able to give it.

In response to comment #229, we have a set of 10 identical servers with 3Ware cards in, half running RHEL3, half Win 2K3. The Windows boxes have no problems at all, and easily outperform the RHEL boxes in terms of disk I/O. Our experience with this card has shown that we get the worst performance when attempting to access hundreds of small files - e.g. performing a recursive grep in the qmail queue folder will cause the iowait to hit 99.9% and then the server freezes (no FTP access, no HTTP access (i.e. ports open but non-responsive), terminal becomes unresponsive) for, minimally, 10 seconds. I have seen one of these 3Ware servers hang like this for just over 6 minutes, seemingly totally down, and then return to normal. We are using RAID 1 on a 3Ware Escalade 7506.

In response to comment #229 (and #230), we are seeing this exact same behaviour with all of our RHEL3 systems, even after updating to all of the packages released yesterday and today (including kernel-2.4.21-27.ELsmp). However, I am not convinced the problem is limited to 3ware cards, or even SCSI. We see unusual I/O blocking with single IDE and SCSI disks, as well as with our 7506 and 9500S-12 3ware cards. We have opened Bug #139937. Please take a quick look at the information in that report.
I would greatly appreciate anyone who can verify what we think we are seeing.

We also have this high iowait problem, introduced with RHEL 3. Our server is a dual Xeon 2.8GHz with an ICP/Vortex RAID controller. It was running perfectly with RedHat 7.3.4 / Oracle 8.1.7. We installed RHEL 3 (Taroon) prior to upgrading to Oracle 9.2, and since then we have problems. We are observing the same high iowait reading/writing to our NetApp filer via NFS v3 or locally on the ICP/Vortex RAID. It doesn't matter whether it's the Oracle DB reading/writing, an 'rsync', or just a simple 'cp'. Kernel version currently running is 2.4.21-27.ELsmp.

Is there a time that we can look forward to the U4 kernel being released as production, or is it already? I saw above that it is in beta testing. Just was curious if there was an update on that.

RHEL3 U4 was released a few weeks ago.

Try using the noatime mount option; it seemed to considerably lower iowait for us.

We are using RHEL for an Ensim server. The server currently holds 200 sites and is EXTREMELY slow due to iowait. We have made a test with a 3ware 7600-2 with a 200 GB drive, and with a Mylex RAID SCSI controller with 10k rpm drives. SCSI was a little bit faster, but almost no difference for us. The system is really slow. After that we used the 2.4.29 kernel from kernel.org and it was day and night... the system speed was fast compared to the RH kernel. Why is this thread still active? This thread has been open since 2004-04-21 11:28... RHEL is a paid version with a lack of performance... It would be nice to have more details from Red Hat, or things to test like a new beta kernel, just to see that somebody is doing something to fix this issue. Thanks, and keep working ;-)

Hello, I have a very similar problem. I just tried to do a move of a user's home directory from one filesystem to another, and the iowait is at 90% to 100% and the machine is locking up. This system is a new system that I just installed last week. It's a Dell 2650 using dual 3GHz Xeon processors.
The internal disks are using hardware RAID 5 with a PERC3 D/I disk controller. One of the larger partitions that I was copying to is on the internal drives. In addition, I have a PowerVault 220S consisting of 7 disk drives on an Adaptec 39320A SCSI controller. The 7 disks are configured using LVM into a 1TB volume which is striped. My kernel version is: 2.4.21-27.0.1.ELsmp, i686, Red Hat Enterprise Linux/ES. So, I have:

/export/homes (internal 2650 disks)
/export/local (1TB PowerVault disks)

I tried doing a simple move of a user's home directory from /export/local to /export/homes... and the iowait state is 100% and the system hangs. Has anyone solved this one?

I'm having the same issue here. The performance seems to be worse on a RAID5 logical disk than on a RAID1 logical disk. A 'cp /dev/zero testfile' is enough to make the machine respond very slowly. Sometimes you have to wait ~1 minute for a remote ssh login session to respond, sometimes it connects instantly; TCP/IP servers are not responding or are very slow. Performance is a little bit better with the U4 kernel release. OS: RHEL 3, kernel 2.4.21-27.0.1. Controller: LSI MegaRAID 320-1 (latest firmware 1L37), 2x36GB disks in RAID 1, 3x36GB disks in RAID 5.

We are experiencing the same issue with a customer's box. FWIW, here's what helped and what did not:
1. elvtune -r 32 -w 4096 -b 4 - this helped in the sense that the problem still occurs but with less severity, i.e. iostat -x /dev/sda, iowait in top, as well as throughput tests show less degradation over time, although the peak 'IO freeze-ups' are just as bad.
2. The perftest2 kernel *made it worse* - so bad, in fact, that we had to reboot it back into 15.0.3-smp. iostat and iowait numbers showed the box pegged against the wall.

Message to James Wade:
- How long does the system lock up when you see the problem? Does it lock up for a few seconds, several minutes?
- How many files are in the directory - average size?
Additional questions to James, Detlev, Jonathan, and others: - How much memory is on the system? - What type of filesystem - ext2, ext3, ect? - Is there any applications running on the box during the lock up - database, web server, ect? - Can you send a ps -eaf, iostat and vmstat output when the lock ups occur? - Also, if you have console access to the machine an alt sysrq m and alt sysrq t would be helpful - See Comment 195 for instructions. - What is your kernel version if not listed already? - Any messages in the /var/log/messages file? >- How much memory is on the system?
2GB
- What type of filesystem - ext2, ext3, ect?
ext3
- Is there any applications running on the box during the lock up -
database, web server, ect?
apache (openwebmail), dovecot (imap-server), sendmail..
- Can you send a ps -eaf, iostat and vmstat output when the lock ups
occur?
- What is your kernel version if not listed already?
Now I'm running with a vanilla 2.4.28 kernel because its responsiveness
is better than the RH kernel's, but it's still quite bad.
Normally loads go up when the disk array has lots of writing to do, so
something is blocking there. Reading is fine until you have to write a
huge amount of data to the disk.
- Any messages in the /var/log/messages file?
Nope
Hi Pasi, Thanks for the update. Is it possible to capture ps -eaf, iostat, vmstat and maybe top output when the problems occur? It would also be very helpful to get an alt sysrq m and alt sysrq t from the console as shown in comment 195. Also, what arch are you running - x86, x86_64, etc.? Thanks

Created attachment 111733 [details]
output logs
Hi Joseph, I sent ps, iostat, vmstat and top outputs in a previous message, but the sysrq info will have to wait until I have time to do it. I'm running x86.

Hi Pasi, Was this data generated when a lock up occurred? The iostat data shows only 1.37M/s of writes at most. The top and vmstat data also indicate that the system is mostly idle. It would be helpful to get this same data when a lock up occurs. This can be done on the console if other logins are frozen. How long does the lock up last? Also, it looks like you have one device on the system, which is sda. What is the physical make-up of this device? Is it raid0, raid1 or raid5 using the 3Ware card? How many actual spindles make up the array? Is the device used for the heavy IO the same device as your root disk? Can you run a df -k? It would also be helpful if you were running one of the Red Hat supported kernels versus an upstream kernel. Regards, Joe

>Was this data generated when a lock up occurred?
Yes.
>The iostat data shows only 1.37M/s of writes at most. The top and
>vmstat data also indicate that the system is mostly idle.
Yes, it is not very much, I know. I think that part of the problem is that when there is some sort of stream of data being written to the disk, the system won't let other processes access (read) data from the disk, and that's why the system is "locking up". It might be that the Linux kernel's io-elevator is not co-operating very well with 3ware's card, because the card has its own elevator...? The maximum write performance I get is 38MB/sec, so it should not be a problem writing only 1.37MB/sec, but it is... and maximum read performance is 75MB/sec. I have tested these with bonnie++.
>It would be helpful to get this same data when a lock up occurs.
>This can be done on the console if other logins are frozen. How long
>does the lock up last?
There is no complete lock-up unless I make the system write to disk with something like "dd if=/dev/zero of=tmpfile", and even then I can log in with ssh if I just wait a couple of minutes. The problem is that the system is very unresponsive, e.g. if someone tries to access his/her mailbox via the imap daemon, when load goes somewhere between 8 and 12 and the system might have only 4MB/sec of writing going on. I don't know if the Linux kernel's scheduler should act differently when the system has 3ware's card...

>Also, it looks like you have one device on the system, which is sda.
>What is the physical make-up of this device? Is it raid0, raid1 or
>raid5 using the 3Ware card? How many actual spindles make up the
>array? Is the device used for the heavy IO the same device as your
>root disk? Can you run a df -k?

Yep, the disk array also includes the root disk: raid-5, three Seagate Barracuda 7200rpm SATA disks.

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             30233896   2924736  25773348  11% /
/dev/sda3            131033184   9402556 114974500   8% /home
none                   1034296         0   1034296   0% /dev/shm
/dev/sda2            131033184   4434700 119942356   4% /var/spool/mail

>It would also be helpful if you were running one of the Red Hat
>supported kernels versus an upstream kernel.

RedHat's kernels have the same problems, but before update 4 of RHEL3 the performance was worse with the RH kernels.

Hi Pasi, Do you have a system where you can separate the root partition off of the RAID5 device? I have a system that does not exhibit the system lock-ups. It is using the 3Ware adapter with a RAID5 volume across 3 disks. However, my root partition is on a dedicated disk and not on the 3Ware RAID5 volume. I'm going to set up another system with the root partition on RAID5 and see if I can reproduce your problem. Thanks, Joe

Hi Joseph, currently I don't have a system where I could try separating the root partition, but maybe in the next few weeks I might have one...
Hmm, having the root partition on a dedicated disk might be some sort of a solution, but it still wouldn't resolve the issue that if the system is writing (like that 1.37MB/sec) to the raid-5 volume, almost all reading is blocked more or less. What model of 3ware raid card do you have? This most problematic and also most loaded system has a 9500S-4 card. Normally at daytime (maximum concurrent) it has almost 50 webmail users and 100 imap/pop users. So if there are 300 incoming mails of 100kB each, the system load will go over 15 and everything almost stops before it gets its writing done. I haven't experienced this with normal software raid. 85% of the time the system load is balancing between 0.5 and 3, but there are no processes which would take time or heavy disk IO. :( BR, Pasi

Created attachment 111891 [details]
3Ware RAID1 versus RAID5
Created attachment 111892 [details]
RAW data for RAID1 versus RAID5 comparison
Hi Pasi,
I ran some comparisons of RAID5 versus RAID1 using the 3Ware adapter.
The RAID5 volume was created across three SATA spindles. The RAID1
volume was created on two spindles, which left one spindle free for a
hot spare.
I created two attachments. The first is a PDF (attachment 111891
[details]), which contains some graphs of the data. The second
attachment (112021... no, 111892) is the raw data.
In your case, I believe that random writes are the most important data
point. The RAID5 configuration was able to achieve 7.6MB/s with one
thread, 1.1MB/s with two threads and 1.1MB/s with four threads.
The RAID1 configuration used one less disk, but it was able to achieve
31MB/s with 1 thread, 4.3MB/s with two threads and 4.2MB/s with four
threads.
In addition, note the difference in latency between raid5 and raid1.
With four threads the latency for RAID5 is 174605 ms while the RAID1
latency is only 59045 ms, which is still high, but much better than
the RAID5 case.
If it's possible, I think it would be a good experiment for you to try
RAID1 for your data. I also think you should separate your root
partition from your data partition. However, this would require at
least four spindles if you want the root partition RAID1-protected,
which is suggested in a production environment.
I'm also running experiments with software RAID5. I let the 3Ware
adapter present each of the three spindles to the OS as single drives,
then used mdadm to create the RAID5 volume. I will update the
bugzilla when that data is available.
Hi Pasi, One more note, you may gain performance by running on RAID1 versus RAID5. However, you will have to sacrifice disk space in order to run on RAID1. So if you have two 36G drives, you will only have 36G of storage available since the disks are mirrored. Regards, Joe Created attachment 112021 [details]
3Ware RAID versus Software RAID
Hi Pasi, Some IO performance tests were run using an mdadm software RAID5 volume. The software RAID5 configuration was able to achieve about 11MB/s versus the 1.5MB/s for the 3Ware RAID5. The tiotest benchmark was used for this comparison. The software RAID5 stripe was created across the three SATA drives, which were presented to the OS as individual drives from the 3Ware adapter. I created a new PDF attachment (112021), which includes this new data for 1MB IOs. I'm also graphing other block sizes, which show the improvement with software raid as well. You may want to try experiments on your configuration using software raid, as well as relocating the root partition to a separate device. Let me know if you would like a sample script to create a software raid volume and I'll send it along. Regards, Joe

Wow, thanks for compiling that, Joseph. Those are some pretty damning numbers. I find the random writes for 3ware raid5 vs software raid5 especially disappointing considering what these cards cost. Any plans to send 3ware these numbers for comment? If not, I'd sure like to... Why would it matter if the root partition is raid5ed?

Processes create entries in the /proc filesystem as well as other places. If there is a high write latency on this device, that would cause a delay in creating these entries. It actually doesn't matter if the device is raid5, raid1 or a single disk. However, if the device has a high latency during IO, processes will appear to have a long delay when they need to access the root partition. So logging in, for example, creates a shell process like bash. The login would experience a long delay until /proc can be updated. /proc is actually a virtual filesystem, so operations to that location may not be the cause of the problem, since there is no physical IO. It may be that the actual login, ls etc. commands being on the high-latency device could cause the delays. We will continue to investigate and update the bugzilla with new information.
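The "sample script" Joe offers above is not attached to the bug, but a minimal dry-run sketch of that kind of mdadm setup might look like the following. The device names (/dev/sdb..sdd), the 64k chunk size, and the mount point are assumptions, not values from this thread; with DRY_RUN=1 (the default) it only prints what it would do, so review the output before re-running as root with DRY_RUN=0.

```shell
# Hedged sketch of a software-RAID5 setup script along the lines Joe
# offers above. Device names, chunk size and mount point are
# assumptions. DRY_RUN=1 (default) echoes commands instead of running
# them.
DRY_RUN=${DRY_RUN:-1}
DEVICES="/dev/sdb /dev/sdc /dev/sdd"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=64 $DEVICES
run mkfs.ext3 /dev/md0
run mkdir -p /mnt/md0
run mount /dev/md0 /mnt/md0
```

The chunk size should normally be matched to the workload; 64k mirrors the stripe size mentioned later in this thread.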
Restoring issue tracker ids mysteriously lost when comment #247 was added. Created attachment 112896 [details]
rstatd capture showing interrupt/context stalling
We too have a dual x86_64 AMD machine, Tyan S2882 motherboard, w/3ware 9500S-8
controller, showing this problem. Machine is running RHEL3 w/U4,
2.4.21-27.0.2ELsmp, up2date says all patches applied (as of Apr5).
It's a terabyte Raid 1 production file server & the performance hit is most
annoying. We frequently generate huge hardware simulation logs here. Throw
in a few CVS checkout's & the server becomes virtually unusable.
Frequent 5 to 10 second non-responsive periods to queries (ie: 'ls' on a client
server, or hardware simulations stalling on compute servers).
Running the evil rstatd daemon, I captured something interesting. Immediately
=before= a stall is a 100% CPU spike.
During a stall...
Interrupts go down to virtually zero (maybe 1% of normal).
Page(in) (disk-cache) statistic goes to zero.
Page(out) (disk-cache) still showing low-level activity.
Disk activity is present through a stall (which makes sense, since page-out
is busy).
=Network/packet traffic goes to zero=
=Context-switching goes to zero=
The load statistic however increases.
After a stall,
Everything jumps back to life.
Load statistic starts to decrease.
It's apparently interrupt related. Might explain why disabling the APIC during
boot helps somewhat (by changing the dynamics). Perhaps the 3ware driver uses
an "innocuous" kernel resource & inadvertently hits a spinlock? Maybe logging
spinlock activity could narrow things down.
Just food for thought...
Ian Davis
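Ian's observation that interrupts collapse during a stall can be spot-checked without rstatd by sampling the aggregate interrupt counter in /proc/stat; a minimal sketch (the three-sample length and one-second interval are arbitrary choices, not from Ian's setup):

```shell
# Minimal sketch: sample the aggregate interrupt counter from
# /proc/stat once per second and print the per-interval rate. During a
# stall like the ones described above, the rate should collapse toward
# zero. The sample count (3) and interval (1s) are arbitrary.
PREV=""
i=0
while [ $i -lt 3 ]; do
    CUR=$(awk '/^intr/ {print $2; exit}' /proc/stat)
    if [ -n "$PREV" ]; then
        echo "interrupts in last interval: $((CUR - PREV))"
    fi
    PREV=$CUR
    i=$((i + 1))
    sleep 1
done
```

Left running in a loop and redirected to a file, this gives a timestamp-free but much lighter-weight version of the rstatd capture.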
Hi, I don't mean to be one of those people who logs a message just saying "When the hell is this going to be fixed", but frankly that is the gist of this message. It is embarrassing to sell an "enterprise" solution to a client and end up with this kind of abysmal performance. I have recently discovered that it is not feasible to perform VACUUMing on a PostgreSQL database on RHEL3 (fully patched) with a 3Ware RAID card (7506) because the server hangs. The combination of small reads and writes that goes on when VACUUMing totally stalls the server. (Please try it - it's one of the easiest ways to replicate this problem that I know of.)

I do understand that the 3ware card(s) were never on the HCL, but we were informed by RedHat staff in several posts on this thread that a fix is definitely forthcoming, and hence we have several servers with them in. Is RedHat going to release a solution to this problem or not? Simply having that question answered would help a great deal. I would also flag that this thread has been running for almost a complete year. I do not consider this to be "enterprise" support. A lot of good money is being paid for an otherwise great operating system. I would appreciate an ETA on this being properly fixed (or a confession that you are not going to bother).

Finally, I am sure that it is frustrating for you RedHat developers to receive negative posts asking what the hell is going on - especially when the problem is as complex as this one appears to be. Apologies for that. However, please consider us poor end-users who are not receiving decent feedback and have got heart-attack-inducingly slow servers for several hundred pounds per year - now THAT is frustrating. Regards

Has anyone seen these symptoms with the RHEL4 2.6.9 kernels? We're considering scheduling downtime to upgrade our fileserver with the 3ware card, and the issues discussed here, from RHEL3 to RHEL4, but it's a pointless exercise if the same issue is still present.
Thanks, Doug

*** Bug 154441 has been marked as a duplicate of this bug. ***

Andrew, Doug,

As you know this BZ became a catch-all for performance issues in RHEL 3. There are a number of different problems mixed in here. When we set out to fix these problems we were able to reproduce several of them, and we proceeded to fix those problems in U3 and U4. Although this BZ has been open for a long time, many of the issues reported here have been fixed.

It has become clear that there remains a 3ware-specific problem that is still not fixed. The problem has persisted because, until recently, we have not been able to reproduce it. We tried a number of configurations, compared with other kernels, asked for more details, but were not able to reproduce the problem. Recently Joe Sailsbury continued to look for the problem by doing some 3ware performance tests. (See recent posts, especially comment #243, showing poor performance, but no hangs.) Following this, and the details provided by Pasi, Joe was able to reproduce the 3ware hangs. We are looking at the reason for this, and the cause of the problem, now. I expect we will be able to identify the cause in a few weeks.

Doug, From experience so far, we need to be cautious about assuming that the problem we have reproduced is the same as yours. With that in mind, though, Joe reports that the problem he has with 3ware on RHEL 3 does not occur on RHEL 4. There are some pauses on RHEL 4, but they are much more in the "normal" range. Yes, we are still working on this bug.

Tom

Tom, Thank you very much for the comprehensive feedback. Much appreciated. Kind regards

We have two low-level Linux servers for staging purposes with kernel version 2.4.21-smp at our company, one with RHEL3 and another with SuSE 9.0 (2.4.21-99-smp4G). Both have the same configuration: P4 2.8GHz with HT and one IDE disk (without any RAID controllers). I can report strange behavior on them - only the SuSE system is affected by this bug.
I've succeeded in preventing the system "hang up" using the tunings listed below, but I can't totally escape from the performance slowdown while copying (after the first 300 MB, file copying slows down extremely).

"Has anyone seen these symptoms with the RHEL4 2.6.9 kernels? We're considering scheduling downtime to upgrade our fileserver with the 3ware card, and the issues discussed here, from RHEL3 to RHEL4, but it's a pointless exercise if the same issue is still present."

Yes. We see the same symptoms on a RHEL 4 ES machine. It doesn't look like the problem has been solved in RHEL 4's 2.6 kernel. The machine is a dual Xeon with a 3ware 7500-8 controller. We still use an old firmware version, so we will try to upgrade the firmware in the near future.

I have some RHEL4 boxes I can play with to see if I hit any of the problems seen here. What is currently the most reliable way to trigger it? My boxes are:
dual Xeon EM64T (x86_64 kernel), hyperthreading on
4GB RAM
1x or 2x 3ware 9500-12
Hitachi or Seagate SATA disks, 400GB each
root partition on a LV on an md device, NOT on 3ware

Right now I see what seems like poor serial write performance (1 thread) on an 11-disk hardware RAID array (~4TB). I'm running a 2.6.9-5.0.3.ELsmp kernel modified to enable the xfs filesystem (simply changing the kernel CONFIG_XFS_FS). Right now my filesystems are xfs. I should be able to change various things if it would be helpful:
change to the official RHEL kernel
change to ext3
change to mdadm RAID
turn off hyperthreading
other?
My slow sequential-write test is:

mount info:
/dev/sda on /mnt/raid1 type xfs (rw,usrquota,grpquota)
/mnt/raid1/scratch1 on /scratch1 type xfs (rw,bind,noatime,usrquota,grpquota,osyncisdsync)

command:
sync ; time dd if=/dev/zero of=/scratch1/8G-2 bs=1M count=8192 ; time sync

Running iostat -x 5 in another window shows, during the bulk of this test, something like:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00    0.45   49.65   49.90
Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s    wsec/s  rkB/s     wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00 3826.85  0.00 106.81    0.00  31602.40   0.00  15801.20   295.86  8275.85 1123.43   9.38 100.22

100.22% utilization? Hmmm... Yes it's typical, and seems anomalous, but is probably irrelevant. The request and queue sizes, average wait, and service times shown here are also typical, as is the written kB/s. During the final sync, the queue drains slowly, and the reported await times go into the 100s of seconds (apparently the early requests only get serviced as the queue drains). The kB/s written sits at zero during the sync, and the writes per second go up to around 250 (it sits at ~100 during the dd).

As I say, I'd welcome test suggestions that are likely to unearth new information. On my own, I plan to make my /scratch2 (an identical array on a second 3ware controller) into a software raid volume. Personally, I plan to use xfs, but I'm willing to test ext3 if that would help others.

I should add: I am not seeing interactive response problems with this simple dd/sync test, just low bandwidth.

~15MB/s single-thread sequential write on an 11-spindle hardware RAID5 array? Wow, that's horrible, and probably intolerable in our application. That is what is motivating me to try mdadm RAID instead of 3ware RAID. Any other suggestions for speeding up writes? Read performance of this 8GB file of zeros is around 100MB/s.
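The dd/sync sequential-write test described above can be wrapped to report throughput directly; a sketch, where the output path and the scaled-down 64 MiB size are placeholders for illustration (the reporter wrote 8192 x 1 MiB blocks to /scratch1/8G-2):

```shell
# Sketch of the dd/sync sequential-write test above, reporting MB/s.
# OUT and COUNT are placeholders scaled down for illustration; the
# original test used count=8192 (8 GiB) on /scratch1.
OUT=${OUT:-./seqwrite.tmp}
COUNT=${COUNT:-64}                  # number of 1 MiB blocks
sync
START=$(date +%s)
dd if=/dev/zero of="$OUT" bs=1M count="$COUNT" 2>/dev/null
sync                                # include the final flush in the timing
END=$(date +%s)
ELAPSED=$((END - START))
[ "$ELAPSED" -gt 0 ] || ELAPSED=1   # avoid divide-by-zero on fast runs
echo "wrote $COUNT MiB in ${ELAPSED}s: $((COUNT / ELAPSED)) MB/s"
```

Including the trailing sync in the timing matters here, since the thread shows the queue draining slowly after dd exits.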
One series of my 'iostat -x 5' outputs showed the following sequence of kB/s written (5-second intervals): 12620 11764 15769 15276 74726 18713 12571 12570 12519. In other words, a huge spike in write rate. It's the only such spike I saw in an 8.5 minute dd/sync run. The nearby records are in full:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.05    0.00    2.25   48.22   49.47
Device:  rrqm/s   wrqm/s   r/s    w/s  rsec/s     wsec/s  rkB/s     wkB/s avgrq-sz avgqu-sz    await  svctm  %util
sda        0.00  3699.40  0.00 129.40    0.00   30552.00   0.00  15276.00   236.11  8262.12 32407.72   7.73 100.02

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.05    0.00    6.55   27.95   65.45
Device:  rrqm/s   wrqm/s   r/s    w/s  rsec/s     wsec/s  rkB/s     wkB/s avgrq-sz avgqu-sz    await  svctm  %util
sda        0.00 18097.60  0.00 592.60    0.00  149452.80   0.00  74726.40   252.20  8224.04 14372.21   1.69 100.02

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00    0.65   69.68   29.66
Device:  rrqm/s   wrqm/s   r/s    w/s  rsec/s     wsec/s  rkB/s     wkB/s avgrq-sz avgqu-sz    await  svctm  %util
sda        0.00  4531.80  0.00 144.40    0.00   37427.20   0.00  18713.60   259.19  8243.19  1003.57   6.93 100.02

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.05    0.00    0.45   64.50   35.00
Device:  rrqm/s   wrqm/s   r/s    w/s  rsec/s     wsec/s  rkB/s     wkB/s avgrq-sz avgqu-sz    await  svctm  %util
sda        0.00  3044.40  0.00  90.20    0.00   25142.60   0.00  12571.30   278.74  8246.19  1498.60  11.09 100.02

Comparisons of the relevant metrics show that during the spike:
* write requests merged per second shot up (as did the other write metrics, including writes per second)
* average request size stayed the same
* average queue size stayed the same
* average wait time shot up in the record **preceding** the spike in write output, and stayed relatively high during the spike -- indicating some long-ago-queued requests were finally serviced -- this may be most directly related to the cause of the spike, since it preceded it
* service time went way down
* cpu utilization stayed at 100%

Recall from my preceding BZ entry that during the final sync, while the queue was slowly draining, the
average wait times go way, way up, as does the number of writes per second -- these circumstances seem similar to the spike I saw during the dd.

The 3Ware Escalade 8506-8 only has 2MB of SRAM (static RAM). With that small amount of SRAM and lots of RAID-5 writes, expect massive delays as the cache "overflows" and isn't able to keep up with the incoming data writes. The Escalade 7000/8000 series are "storage switches" with 0 wait state SRAM, _not_ "buffering controllers" with higher latency but far more DRAM. They switch with extremely low latencies using an ASIC + SRAM, like a high-end Layer 3 switch. But that also means they have a very, very small amount of "costly" SRAM, instead of [S]DRAM (SRAM should _not_ be confused with SDRAM). That's why RAID-0, RAID-1 and RAID-10 performance will be much better than RAID-5. You have to tweak the kernel to stage writes so it's not waiting on the 3Ware controller to finish XORs and writes. The 3Ware controller will queue up a massive number of I/O operations, which is great for RAID-0, RAID-1 and RAID-10, but not RAID-5 with its small SRAM.

Different kernels might tweak different parameters that will affect RAID-5 operations on 3Ware Escalade 7000/8000 series controllers. Finding a good set (3Ware's site has excellent recommendations) and putting those in your /etc/rc.d/rc.local or similar is _crucial_ for a production server, to minimize any change in defaults with any new kernel. The 3Ware Escalade 9000 series adds DRAM buffering to the existing ASIC+SRAM design, although the drivers are still maturing from what I've seen. Unless you absolutely need the maximum storage, or your primary application is lots and lots of reads (e.g., a MySQL server for web content), use RAID-10 on 3Ware 7000/8000 series cards.

I can't benchmark it since it's a live system, but the following helped greatly with the iowait problem under FC2:

echo 512 > /sys/block/sda/queue/nr_requests

It was suggested on LKML that nr_requests be double queue_depth.
I get much more even performance now. I also use:

blockdev --setra 16384 /dev/sda

Regarding comment #259: We are now running with the latest firmware on the 3ware controller. It did not help at all. We then changed the io elevator from "anticipatory" to "deadline". When running bonnie we have pretty good read performance (above 90000 kB/s) on intelligent reads, but write performance still sucks (around 20000 kB/s) compared to 3ware's own benchmarks ( http://www.3ware.com/LinuxWP_0701.pdf ). I then tried to increase the readahead with blockdev --setra 16384. Now the read performance sucked as well (around 3000 kB/s). I decreased the readahead to 8192, 4096, 2048 ... etc. down to 128. After each decrease I ran some bonnie tests. Only when the ra was 128 did I get good "intelligent read" performance. While running bonnie, the system becomes very unresponsive. I have also tried the suggestions in comment #264 without any change in responsiveness.

The specs for this machine are:
Dual Xeon 2.4 GHz
Tyan S2723 motherboard (Intel E7501)
1024 MB RAM
3ware 7500-8 with 7 120 GB disks in hardware RAID 5 and 1 hot spare
LVM on top of RAID5
ext3 file systems
Upgraded from RH 9 -> RHEL 3 -> RHEL 4

We are experiencing the same behaviour (unresponsiveness) on another RHEL 4 server. The specs for this one are:
Single Xeon 2.66 GHz
2048 MB RAM
Intel SE7501CW2 motherboard
2 200 GB ATA disks in software raid 1
LVM on top of raid 1
ext3 file systems
Fresh install of RHEL 4 (not an upgrade from RHEL 3)

The only hardware in common between the 2 machines is the E7501 chipset. From this, it seems that there are 4 possible sources for the unresponsiveness:
1: The driver for the E7501 might be broken (I don't think so)
2: LVM is broken (possible)
3: Ext3 (I don't think so)
4: The bug is somewhere in the kernel io handling (seems possible)
I can only agree that it seems there are 2 separate problems: a performance problem on 3ware raid controllers, and an unresponsiveness problem not related to 3ware raid controllers. Kristian Sørensen

Hi, I believe we have just observed the same problem in our new server:
P4 3.0 GHz Intel - hyperthreading on
1GB RAM
3ware 7000 series controller (SATA) with 8x250GB Maxtor SATA drives
RHEL 3.0 (installed via ROCKS 3.3.0 Makalu)
I have had to rebuild the raid hardware once (it was believed to be a disk failure). However, now when I try to mkfs.ext3 on the /dev/sda2 partition, the entire system freezes at the ~half-way point (I/O wait at 199.9% during the entire process). The strange thing is that I can mkfs.ext3 on /dev/sda1 successfully - although again the iowait is at maximum for this process as well. If anyone has suggestions/ideas for me to try and fix this I would really appreciate it. Thanks! Darren

Hello - (In reply to comment #256)
> Andrew, Doug,
>
> As you know this BZ became a catch-all for performance issues in RHEL 3.
...
> It has become clear that there remains a 3ware-specific problem that is still
> not fixed.
...
> Yes, we are still working on this bug.
Is there any further progress on this?
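The tunings reported a few comments back (nr_requests set to double the controller queue depth, per the LKML suggestion, plus the blockdev readahead) can be applied together at boot; a sketch, where the device name and the queue depth of 254 are assumptions, not values from this thread, and the sysfs write is skipped unless the path exists and is writable:

```shell
# Sketch combining the tunings discussed above. DEV and QUEUE_DEPTH are
# assumptions -- adjust to your controller. The sysfs write only
# happens when the file is present and writable (i.e. 2.6 kernel, run
# as root).
DEV=${DEV:-sda}
QUEUE_DEPTH=${QUEUE_DEPTH:-254}
NR_REQUESTS=$((QUEUE_DEPTH * 2))    # "nr_requests double queue_depth"
echo "target nr_requests for $DEV: $NR_REQUESTS"

NRQ="/sys/block/$DEV/queue/nr_requests"
if [ -w "$NRQ" ]; then
    echo "$NR_REQUESTS" > "$NRQ"
fi

# Readahead as used in the comments above (units are 512-byte sectors,
# so 16384 = 8 MiB of readahead):
if [ -w "/dev/$DEV" ] && command -v blockdev >/dev/null 2>&1; then
    blockdev --setra 16384 "/dev/$DEV"
fi
```

As the 3Ware/SRAM comment above suggests, putting such settings in /etc/rc.d/rc.local keeps them stable across kernel upgrades.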
We have the same problem (FC2 2.6.6 kernel w/SNARE on a dual Opteron system with the 3ware 8506 card). A (perhaps) useful data point is that we have the same OS running on other dual Opteron systems but with the 9000 series cards, and none of them have the problem. At this point, I think we're going to just upgrade the 8506 card to a 9000 series one so that we stop having problems.
Debbie
PS: Through our vendor a 3ware engineer has been looking at this for us, but has not yet been able to replicate the problem (sound familiar? :-)

You might be interested in this test of 9 SATA RAID 5 adapters: http://www.tweakers.net/reviews/557/23 3ware scored the absolute lowest in server performance, while the Areca adapters excelled over all the others.

Here are interesting results WRT tweaking kernel 2.6 on 3ware hardware, obtained by Gaspar Bakos (posted here with permission from the author):

-------- Original Message --------
Subject: 3ware + RAID5 under 2.6.*
Date: Mon, 25 Jul 2005 21:16:25 -0400 (EDT)
From: Gaspar Bakos <gbakos.edu>
Reply-To: gbakos.edu
To: aleksander.adamowski.redhat

Dear all,
I am forwarding this report I sent to the FC and RAID lists, because at one time you were involved in the "3ware raid under linux" issue, so you may be interested, or have ideas. Apologies for mass emailing.
Cheers, Gaspar
-------------
The purpose of this email is twofold:
- to share the results of the many tests I performed with a 3ware RAID card + RAID-5 + XFS, pushing for better file I/O,
- and to initiate some brainstorming on what parameters can be tuned to get good performance out of this hardware under 2.6.* kernels.
I started all these tests because the performance was quite poor: the write speed was slow, the read speed was barely acceptable, and the system load went very high (10.0) during bonnie++ tests. My questions are marked below with "Q".

1. There are many useful links related to the 3ware card and related anomalies.
The bugzilla page: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121434 contains some 260 comments. It is mostly 2.4 kernel and RHEL specific.

2. A newer description of the problem can be found in these threads:
http://lkml.org/lkml/2005/4/20/110
http://openlab-debugging.web.cern.ch/openlab-debugging/raid/
by Andreas Hirstius. There was a nasty fls() bug, which was eliminated recently and improved performance and stability.

3. There are recommendations by 3ware, which can be summarized in one line: "blockdev --setra 16384".
http://www.3ware.com/reference/techlibrary.asp
"Maximum Performance for Linux Kernel 2.6 Combined with XFS File System", which actually leads to a PDF with a different title: "Benchmarking the 9000 controller with linux 2.6".

Q: Any other useful links?

Briefly, the hardware setup I use
=================================
- Tyan S2882 Thunder K8S Pro motherboard
- Dual AMD Opteron CPUs
- 4 GB RAM
- 3ware 9500-8S 8-port serial ATA controller
- 8 x 300 GB ST3300831AS SATA Seagate disks in hardware RAID-5
More details at the end of this email.

OS/setup
========
- Red Hat FC3, first with the 2.6.9-1.667smp kernel, then with all the upgrades, and finally a self-compiled 2.6.12.3 x86_64 kernel
- XFS filesystem
- RAID stripe size = 64k, write cache enabled
Kernel config attached.

==========================================================================
Tunable parameters
==================
1. The kernel itself. I tried 2.6.9-1.667smp, 2.6.11-1.14_FC3smp, and 2.6.12.3 (self-compiled)
1.a Kernel config (NUMA system, etc.)
2. RAID setup on the card:
- Write cache enabled? (I use "YES")
- RAID stripe size
- firmware, BIOS, etc. on the card
- staggered spinup (I use "YES", but the drives may not support it. I always "warm up" the unit before the tests.)
3. 3ware driver version:
- 3w-9xxx_2.26.02.002, the older version in the kernels
- 3w-9xxx_2.26.03.015fw from the 3ware website, containing the firmware as well.
4.
Run-time kernel parameters (my device is /dev/sde):
4.a /sys/class/scsi_host/host6/ : cmd_per_lun, can_queue
4.b /sys/block/sde/queue/ : e.g. iosched, max_sectors_kb, read_ahead_kb, max_hw_sectors_kb, nr_requests, scheduler
4.c /sys/block/sde/device/ : e.g. queue_depth
4.d Other params from the 2.4 kernel, if they have an alternative in 2.6: /proc/sys/vm/max-readahead
Q: Anything else?
5. blockdev --setra
This possibly belongs with the points mentioned under 4.
6. For non-raw I/O (dd), the XFS filesystem parameters.
7. Q: Any crucial parameter I am missing?

==========================================================================
Tests
=====
I changed the following during the tests. It is not an orthogonal set of parameters, and I did not try everything with every combination:
- kernel
- RAID stripe size: 64K and 256K
- 3ware driver and firmware
- /sys/block/sde/queue/nr_requests
- blockdev --setra xxx /dev/sde
- XFS filesystem parameters

I used 5 bonnie++ commands to do not only simple I/O, but also combined filesystem performance tests:

MOUNT=/mnt/3w1/un0
SIZE=20480
echo "Bonnie test for IO performance"
sync; time bonnie++ -m cfhat5 -n 0 -u 0 -r 4092 -s $SIZE -f -b -d $MOUNT
echo "Testing with zero size files"
sync; time bonnie++ -m cfhat5 -n 50:0:0:50 -u 0 -r 4092 -s 0 -b -d $MOUNT
echo "Testing with tiny files"
sync; time bonnie++ -m cfhat5 -n 20:10:1:20 -u 0 -r 4092 -s 0 -b -d $MOUNT
echo "Testing with 100Kb to 1Mb files"
sync; time bonnie++ -m cfhat5 -n 10:1000000:100000:10 -u 0 -r 4092 -s 0 -b -d $MOUNT
echo "Testing with 16Mb size files"
sync; time bonnie++ -m cfhat5 -n 1:17000000:17000000:10 -u 0 -r 4092 -s 0 -b -d $MOUNT

==========================================================================
System information during the tests
===================================
This is just to make sure the system is behaving OK, and to catch some errors. Done only outside the recorded tests, so as not to affect the results.
1. top, or cat /proc/loadavg, to see the load
2.
iostat, iostat -x
3. vmstat
4. ps -eaf, if the system behaves strangely, as if locked.
Q: Anything else recommended that can be useful to check for healthy system behaviour?

==========================================================================
Other testing tools?
====================
1. iozone - the mention of an Excel table in the man page made me uncertain whether to try it...
2. dd for raw I/O.
Q: What else?

==========================================================================
Conclusions in a nutshell
=========================
1. With any of the kernels below 2.6.12.3, on the ___ x86_64 ___ architecture, the performance is poor. The load becomes huge, the system unresponsive, with kswapd0 and kswapd1 running at the top of "top".
2. blockdev --setra 16384 does little more than increase the read speed from the disks, while also consuming much more CPU time. The write and rewrite speeds do not change considerably. It is not really a solution when a system is run on hardware RAID based on an expensive card precisely to save CPU cycles for other tasks. (Then we could just use software RAID-5 on JBOD, which is much faster at the cost of more CPU usage.)
3. The best I got during normal operation (no kswapd anomaly and no unresponsive system) was about 80 Mb/s write, 40 Mb/s rewrite and 350 Mb/s read. However, this was with "blockdev --setra 4092" and 43% CPU usage. I would rather quote a more conservative 180 Mb/s at setra 256 and 20% CPU.
4. Migration from a 64 kB to a 256 kB stripe size on a 2 Tb array would take forever. The performance during this migration is really bad, regardless of what I/O priority is set in the 3ware interface: 50 Mb/s write, 8 Mb/s rewrite (!) and 12 Mb/s read. As I had no data yet to lose, it was much faster to reboot, delete the unit, create one with a 256 kB stripe size, and initialize it.
5. The performance of the 3ware card seemed worse with the 256k stripe size: write 68, rewrite 21, read 60 Mb/s.
6.
Changing /sys/block/sde/queue/nr_requests from 128 to 512 gives a moderate improvement. Going to higher numbers, such as 1024, does not make it any better.

==========================================================================
QUESTIONS:
==========
Q: Where is useful information on how to tune the various /sys/* parameters? What are recommended values for a 2 Tb array running on a 3ware card? What are the relations between these parameters? Notably: nr_requests, can_queue, cmd_per_lun, max-readahead, etc.
Q: Are there any benchmarks showing better (re)write performance on an eight-disk SATA RAID-5 with similar capacity (2 Tb)?
Q: (mostly to 3ware/AMCC inc.) Why is the 256K stripe size so inefficient compared to the 64k?

==========================================================================
TEST RESULTS
============
---------------------------------------------------------------------------
TEST 2.1
--------
raid stripe size = 64k
blockdev --setra 256 /dev/sde
/sys/block/sde/queue/nr_requests = 128
mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2

xfs_info /mnt/3w1/un0/
meta-data=/mnt/3w1/un0 isize=1024 agcount=32, agsize=16021136 blks
         =             sectsz=512
data     =             bsize=4096 blocks=512676288, imaxpct=25
         =             sunit=16 swidth=112 blks, unwritten=1
naming   =version 2    bsize=4096
log      =internal     bsize=4096 blocks=32768, version=2
         =             sectsz=512 sunit=16 blks
realtime =none         extsz=65536 blocks=0, rtextents=0

Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
100/100 577 5 +++++ +++ 914 5 763 6 +++++ +++ 97 0
real 24m32.187s user 0m0.365s sys 0m32.705s

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max       /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
100:10:0/100 125 2
103182 100 824 7 127 2 84106 99 82 1
real 49m47.104s user 0m0.494s sys 1m5.833s

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 42 5 75 5 685 11 41 5 24 1 212 4
real 18m29.176s user 0m0.240s sys 0m45.138s

16Mb files:
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000 4 14 7 14 461 39 4 15 5 10 562 43

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000 3 14 7 14 522 40 4 14 6 11 493 39
real 13m43.331s user 0m0.455s sys 1m53.656s
-----------------------------------------------------------------------------
TEST 2.2
--------
-> change inode size
Stripe size 64Kb
blockdev --setra 256 /dev/sde
/sys/block/sde/queue/nr_requests = 128
mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=2k -l version=2 /dev/sde1
meta-data=/dev/sde1    isize=2048 agcount=32, agsize=16021136 blks
         =             sectsz=512
data     =             bsize=4096 blocks=512676288, imaxpct=25
         =             sunit=16 swidth=112 blks, unwritten=1
naming   =version 2    bsize=4096
log      =internal log bsize=4096 blocks=32768, version=2
         =             sectsz=512 sunit=16 blks
realtime =none         extsz=65536 blocks=0, rtextents=0

Disk IO
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 57019 97 75887 16 47033 10 35907 61 192411 22 311.6 0

Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random
Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 655 6 +++++ +++ 944 5 717 6 +++++ +++ 112 0
real 10m58.033s user 0m0.182s sys 0m16.954s

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:10:1/20 111 2 +++++ +++ 805 7 107 2 +++++ +++ 126 1
real 9m23.056s user 0m0.105s sys 0m12.835s

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 44 5 221 13 504 7 43 5 22 1 164 2
real 17m25.308s user 0m0.207s sys 0m42.914s
==> Seq. read speed increased to 3x, seq. delete decreased

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 4 14 10 20 450 34 4 14 5 9 419 34
real 13m24.856s user 0m0.483s sys 1m53.478s
==> Delete speed decreased. Seq. read speed somewhat increased.
==> No significant difference compared to smaller inode size.
-----------------------------------------------------------------------------
TEST 2.3
--------
Tests done while migrating from stripe 64kB to stripe 256kB.
/sys/block/sde/queue/nr_requests = 128
blockdev --setra 256 /dev/sde
Extremely slow.
Bonnie test for IO performance
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 53072 11 8848 1 12039 1 139.3 0

Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 289 3 +++++ +++ 603 3 444 4 +++++ +++ 77 0
real 17m19.235s user 0m0.186s sys 0m17.566s

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:10:1/20 86 1 +++++ +++ 564 5 86 1 +++++ +++ 90 0
real 12m16.227s user 0m0.099s sys 0m12.125s

Testing with 100Kb to 1Mb files
Delete files in random order...done.
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 29 3 13 0 466 6 25 3 11 0 125 2
real 41m4.151s user 0m0.255s sys 0m42.095s

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 2 9 2 5 273 20 2 8 1 3 258 19
real 29m20.672s user 0m0.469s sys 1m49.345s
===> Disk IO becomes extremely slow while the array is migrating stripe size
-----------------------------------------------------------------------------
TEST 2.4
--------
Tests done with 256Kb RAID stripe size
blockdev --setra 256 /dev/sde
/sys/block/sde/queue/nr_requests = 128
mkfs.xfs -f -b size=4k -d su=256k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1
meta-data=/dev/sde1    isize=1024 agcount=32, agsize=16021184 blks
         =             sectsz=512
data     =             bsize=4096 blocks=512676288, imaxpct=25
         =             sunit=64 swidth=448 blks, unwritten=1
naming   =version 2    bsize=4096
log      =internal log bsize=4096 blocks=32768, version=2
         =             sectsz=512 sunit=64 blks
realtime =none         extsz=65536 blocks=0, rtextents=0

top - 11:54:04 up 11:31, 2 users, load average: 8.52, 7.56, 5.07
Tasks: 104 total, 1 running, 102 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.3% us, 4.0% sy, 0.0% ni, 0.7% id, 94.5% wa, 0.0% hi, 0.5% si
Mem:  4010956k total, 3988284k used, 22672k free, 0k buffers
Swap: 7823576k total, 224k used, 7823352k free, 3789640k cached
  PID USER  PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
30821 root  18  0 8312 916 776 D  5.3  0.0 1:21.60 bonnie++
  175 root  15  0    0   0   0 D  1.3  0.0 0:16.35 kswapd1
  176 root  15  0    0   0   0 S  1.0  0.0 0:18.38 kswapd0

Bonnie test for IO performance
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 68990 14 21157 5 60837 7 250.2 0
real 27m58.805s user 0m1.118s sys 1m58.749s

Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 255 3 +++++ +++ 247 2 252 3 +++++ +++ 61 0
real 23m59.997s user 0m0.186s sys 0m26.721s
==> Much slower than 64kb size with setra=256

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:10:1/20 110 3 +++++ +++ 243 3 112 3 +++++ +++ 77 1
real 11m57.399s user 0m0.100s sys 0m17.356s
==> Much slower than 64kb size with setra=256

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------
--------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 36 5 77 5 232 4 40 5 35 2 92 2
real 18m25.701s user 0m0.238s sys 0m45.724s
==> Somewhat slower than 64kb size with setra=256

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 4 15 3 6 227 18 3 14 2 4 155 13
real 20m11.168s user 0m0.508s sys 1m55.892s
==> Somewhat slower than 64kb size with setra=256
==> Definitely inferior to the 64kb raid stripe size
------------------------------------------------------------------------------
TEST 2.5
--------
raid stripe size = 256K
Change su to 64k
blockdev --setra 256 /dev/sde
/sys/block/sde/queue/nr_requests = 128
mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1

Bonnie test for IO performance
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 72627 15 23325 5 63101 7 272.0 0
real 25m56.324s user 0m1.097s sys 1m57.267s
===> General IO was slightly faster with su=64k than su=256k

Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 788 7 +++++ +++ 989 6 781 7 +++++ +++ 93 0
real 12m8.633s user 0m0.158s sys 0m16.578s
===> Filesystem is much faster with su=64k

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec
%CP /sec %CP /sec %CP
20:10:1/20 135 2 +++++ +++ 818 7 133 2 +++++ +++ 145 1
real 7m51.365s user 0m0.091s sys 0m12.182s
===> Filesystem is somewhat faster with su=64k

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 41 5 91 5 787 12 41 5 24 1 224 4
real 18m6.138s user 0m0.243s sys 0m42.042s
===> For larger files, it becomes almost indifferent whether we use su=64k or su=256k

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 4 14 3 6 476 34 3 11 2 5 546 40
real 19m37.665s user 0m0.548s sys 1m49.408s
===> For larger files, it becomes almost indifferent whether we use su=64k or su=256k
------------------------------------------------------------------------------
TEST 2.6
--------
Tests done with 256Kb RAID stripe size
blockdev --setra 1024 /dev/sde
/sys/block/sde/queue/nr_requests = 128
mkfs.xfs -f -b size=4k -d su=256k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1
meta-data=/dev/sde1    isize=1024 agcount=32, agsize=16021184 blks
         =             sectsz=512
data     =             bsize=4096 blocks=512676288, imaxpct=25
         =             sunit=64 swidth=448 blks, unwritten=1
naming   =version 2    bsize=4096
log      =internal log bsize=4096 blocks=32768, version=2
         =             sectsz=512 sunit=64 blks
realtime =none         extsz=65536 blocks=0, rtextents=0

Bonnie test for IO performance
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 68794 14 26139 6 118452 14 255.5 0
real 22m2.101s user 0m1.268s sys 1m58.232s
=> Speed increased
compared to TEST 2.4 (setra 256). CPU % didn't increase.

Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 253 3 +++++ +++ 247 2 251 3 +++++ +++ 60 0
real 24m14.398s user 0m0.178s sys 0m27.186s
=> No change compared to 2.4

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:10:1/20 112 3 +++++ +++ 241 3 109 3 +++++ +++ 71 1
real 12m21.663s user 0m0.089s sys 0m17.502s
=> No change.

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 39 5 90 5 237 4 37 5 32 1 82 1
real 18m47.223s user 0m0.260s sys 0m45.430s
=> No change.

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 4 13 6 12 215 16 4 14 5 9 171 13
real 14m21.865s user 0m0.474s sys 1m49.301s
==> Improved.
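The tests above change blockdev --setra by hand and re-run the benchmark each time; the sweep can be scripted. A sketch under assumptions (the device path and the DRY_RUN guard are mine, not from the thread); with DRY_RUN=1 it only prints the commands it would run:

```shell
# Try a series of readahead values on a device.  With DRY_RUN=1 the
# commands are printed instead of executed; the benchmark step is a
# placeholder -- substitute bonnie++ or dd as in the tests above.
sweep_setra() {
    dev="$1"; shift
    for ra in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "blockdev --setra $ra $dev"
        else
            blockdev --setra "$ra" "$dev"
            sync
            # run the benchmark of choice here, e.g. bonnie++ ...
        fi
    done
}

DRY_RUN=1 sweep_setra /dev/sde 256 1024 4092 16384
```

Running it dry first makes it easy to review the exact settings a sweep will apply before touching a production array.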
------------------------------------------------------------------------------
TEST 2.6
--------
Back to raid stripe = 64k
/sys/block/sde/queue/nr_requests = 128
mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1
blockdev --setra 256 /dev/sde

top - 10:51:03 up 8:06, 3 users, load average: 9.69, 4.18, 1.63
Tasks: 128 total, 1 running, 127 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2% us, 5.0% sy, 0.0% ni, 5.2% id, 88.5% wa, 0.0% hi, 1.2% si
Mem:  4010956k total, 3987456k used, 23500k free, 52k buffers
Swap: 7823576k total, 224k used, 7823352k free, 3677224k cached
System stays responsive despite the giant load.
 PID USER  PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
5757 root  18  0 8308 916 776 D  6.3  0.0 0:35.69 bonnie++
 176 root  15  0    0   0   0 D  1.3  0.0 0:05.27 kswapd0
 175 root  15  0    0   0   0 S  1.0  0.0 0:05.64 kswapd1

Bonnie test for IO performance
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 65322 14 46177 10 183637 21 293.2 0
real 15m23.264s user 0m1.118s sys 1m58.544s

Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 701 6 +++++ +++ 983 5 733 6 +++++ +++ 111 0
real 10m56.735s user 0m0.171s sys 0m15.877s

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:10:1/20 109 2 +++++ +++ 824 7 108 2 +++++ +++ 147 1
real 8m58.359s user 0m0.107s sys 0m12.546s

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read---
-Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 45 5 214 13 642 9 45 5 22 1 211 3
real 16m59.573s user 0m0.230s sys 0m42.618s

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 4 13 11 20 467 32 4 13 5 9 416 30
real 13m15.243s user 0m0.534s sys 1m47.777s
------------------------------------------------------------------------------
TEST 2.7
--------
Change setra: blockdev --setra 4092 /dev/sde
raid stripe = 64k
/sys/block/sde/queue/nr_requests = 128
mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1

[root@cfhat5 diskio]# iostat -x /dev/sde
Linux 2.6.12.3-GB2 (cfhat5)  07/25/2005
avg-cpu: %user %nice %sys %iowait %idle
          0.29  0.04 1.00    4.88 93.80
Device: rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await svctm %util
sde       0.04 903.28 19.74 44.03 4757.48 8632.40 2378.74 4316.20   209.94     7.73 121.17  1.96 12.51

Bonnie test for IO performance
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 66303 13 41254 9 345730 41 274.7 0
real 15m21.055s user 0m1.114s sys 1m57.199s
==> Write does not change. Rewrite decreases. Read increases.
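One way to read the iostat -x output above: avgrq-sz is reported in 512-byte sectors, so the average request size can be restated in KiB. A small awk sketch (the input value is taken from the sde line above):

```shell
# iostat -x reports avgrq-sz in 512-byte sectors; convert to KiB.
avgrq_kib() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s * 512 / 1024 }'
}

avgrq_kib 209.94   # the sde line above: roughly 105 KiB per request
```

With a 64k stripe, an average request of about 105 KiB means most requests span more than one stripe unit, which is one thing to keep in mind when comparing stripe sizes.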
Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 624 6 +++++ +++ 904 5 727 6 +++++ +++ 113 0
real 10m59.528s user 0m0.189s sys 0m16.520s

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:10:1/20 111 2 +++++ +++ 798 7 102 2 +++++ +++ 143 1
real 9m12.536s user 0m0.120s sys 0m12.467s

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 46 6 323 20 686 10 43 5 30 1 207 3
real 14m42.960s user 0m0.262s sys 0m42.090s

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 4 14 20 40 524 38 4 13 11 21 492 35
real 10m42.784s user 0m0.453s sys 1m51.078s
------------------------------------------------------------------------------
TEST 2.8
--------
echo 512 > /sys/block/sde/queue/nr_requests
raid stripe = 64k
mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1
blockdev --setra 4092 /dev/sde

Bonnie test for IO performance
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 78573 16 42444 9 353894 42 284.6 0
real 14m14.938s user 0m1.213s sys 1m55.382s

Testing with zero size files
Version 1.03    ------Sequential
Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 623 6 +++++ +++ 894 5 739 6 +++++ +++ 123 0
real 10m25.379s user 0m0.186s sys 0m16.846s

Testing with tiny files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:10:1/20 107 2 +++++ +++ 835 7 100 1 +++++ +++ 159 1
real 9m7.268s user 0m0.104s sys 0m12.589s

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 47 6 324 19 697 10 44 5 35 2 232 4
real 13m41.706s user 0m0.234s sys 0m42.614s

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 4 14 19 38 448 32 4 13 11 21 506 36
real 10m40.404s user 0m0.469s sys 1m51.098s
------------------------------------------------------------------------------
TEST 2.9
--------
echo 1024 > /sys/block/sde/queue/nr_requests
raid stripe = 64k
mkfs.xfs -f -b size=4k -d su=64k,sw=7 -i size=1k -l version=2 -L cfhat5_1_un0 /dev/sde1
blockdev --setra 4092 /dev/sde

Bonnie test for IO performance
Version 1.03    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size   K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cfhat5 20G 79546 16 41227 9 351637 43 285.0 0
real 14m26.609s user 0m1.136s sys 1m57.398s
==> No improvement

Testing with zero size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files           /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
50/50 616 5 +++++ +++ 880 5 748 6 +++++ +++ 123 0
real 10m25.469s user 0m0.186s sys 0m16.723s

Testing with tiny files
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:10:1/20 99 2 +++++ +++ 779 7 104 2 +++++ +++ 165 1
real 9m12.385s user 0m0.111s sys 0m12.947s

Testing with 100Kb to 1Mb files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
10:1000000:100000/10 47 6 316 20 616 9 47 6 36 2 248 4
real 13m22.360s user 0m0.231s sys 0m43.679s

Testing with 16Mb size files
Version 1.03    ------Sequential Create------ --------Random Create--------
cfhat5          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min   /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
1:17000000:17000000/10 3 13 16 31 386 27 4 13 11 22 558 40
real 11m1.018s user 0m0.464s sys 1m49.534s

============================================================================
Hardware info
=============
[root@cfhat5 diskio]# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 246
stepping        : 10
cpu MHz         : 1991.008
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 3915.77
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 246
stepping        : 10
cpu MHz         : 1991.008
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 3973.12
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
-----------------------------------------------------------
[root@cfhat5 diskio]# cat /sys/class/scsi_host/host6/stats
3w-9xxx Driver version: 2.26.03.015fw
Current commands posted:   0
Max commands posted:       79
Current pending commands:  0
Max pending commands:      1
Last sgl length:           2
Max sgl length:            32
Last sector count:         0
Max sector count:          256
SCSI Host Resets:          0
AEN's:                     0
--------------------------
3ware card info
Model             9500S-8
Serial #          L19403A5100293
Firmware          FE9X 2.06.00.009
Driver            2.26.03.015fw
BIOS              BE9X 2.03.01.051
Boot Loader       BL9X 2.02.00.001
Memory Installed  112 MB
# of Ports        8
# of Units        1
# of Drives       8
Write cache enabled
Auto-spin up enabled, 2 sec between spin-up
Drives, however, probably do not support spinup.
-------------------------------
Disks: Drive Information (Controller ID 6)
Port  Model        Capacity   Serial #  Firmware  Unit  Status
0     ST3300831AS  279.46 GB  3NF0BZYJ  3.02      0     OK
1     ST3300831AS  279.46 GB  3NF0AC04  3.01      0     OK
2     ST3300831AS  279.46 GB  3NF0A7JE  3.01      0     OK
3     ST3300831AS  279.46 GB  3NF0ABT1  3.01      0     OK
4     ST3300831AS  279.46 GB  3NF0A63J  3.01      0     OK
5     ST3300831AS  279.46 GB  3NF0ACC5  3.01      0     OK
6     ST3300831AS  279.46 GB  3NF09FLP  3.01      0     OK
7     ST3300831AS  279.46 GB  3NF046WY  3.01      0     OK
----------------------------------
[root@cfhat5 diskio]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0    380 3781540      0  58004    0    0  2712  3781  243   216  0  2 91  7
[root@cfhat5 diskio]# free
             total       used       free     shared    buffers     cached
Mem:       4010956     229532    3781424          0          0      58004
-/+ buffers/cache:     171528    3839428
Swap:      7823576        380    7823196
============================================================================
Kernel config
See http://www.cfa.harvard.edu/~gbakos/diskio/

+ ------------------------------------------------------------------------ +
Dr. Gaspar A. Bakos
Hubble Fellow, Solar, Stellar and Planetary Sciences (SSP) Division
Harvard-Smithsonian Center for Astrophysics
60 Garden Street, Cambridge, MA 02138 (USA)
+ ------------------------------------------------------------------------ +

The massive load under 2.6.x goes away with the following changes for me.

vm.dirty_expire_centisecs = 1000
vm.dirty_ratio = 5

Hello,

About the Linux 2.4 driver: we realized that we have a big problem with the 9.2 driver. The 9.1.5.2 was fine. In fact, the 9.2 firmware delivered with the driver (FE9X 2.06.00.009) is buggy if you use more than 1 unit (hot spares are counted as a unit). We got new firmware from 3ware support and it solves the problem. With some tweaking and 5 x RAID 5 volumes on 3 x 3ware cards (capacity around 5TB), we got 160MB/s write and 270MB/s read with the iozone test.
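For reference, the vm.dirty settings quoted above can be made persistent across reboots via /etc/sysctl.conf. A minimal fragment, using the exact values the commenter reported (these are one user's workaround, not tested general recommendations):

```
# /etc/sysctl.conf fragment -- flush dirty pages sooner and cap the amount
# of dirty memory, so large writes cannot pile up and stall the system.
# Values are those reported above; tune for your own workload.
vm.dirty_expire_centisecs = 1000   # consider data dirty for writeback after 10 s
vm.dirty_ratio = 5                 # force synchronous writeback at 5% of RAM dirty

# Apply without rebooting:  sysctl -p
```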
If you want more detailed information, contact me.

Hi,

I am experiencing a similar problem without 3ware RAID. I previously used EL3 Update 4, which was fine. Now I have installed EL3 Update 5, made the machine an NFS server, and I am trying to check its performance. Clients hang easily, especially with big rsize/wsize (16KB and 32KB). Commands like df and ls do not answer, and the clients and the server are quiet.

Yoshihiro Tsuchiya

Yoshihiro,

Please open a separate bug report for the NFS performance problem you described in comment 277.

Thanks,
Tom

I'm suffering with a 3ware 7500-4 (3 drives in a RAID 5 + a hot spare) on RHEL4. Something I noticed is that 3ware completely rewrote the Linux driver (from 1.26.00.039 to 1.26.02.001). Has anyone tried the new driver to see if it improves the write performance?

I am rsyncing a directory from a small P4/SCSI RAID server (Mylex AcceleRAID 170 with 10krpm disks) to a big bi-amd64/SATA2 RAID server (3ware 9550SX with 4 SATA2 disks). Both are RAID-5. The load is huge on the bi-amd64, while it is about 0.30 on the "small" one ...

top - 22:46:15 up 56 min,  2 users,  load average: 7.57, 7.80, 7.80
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu0 :  0.0% us,  0.0% sy,  0.0% ni,  0.0% id, 100.0% wa,  0.0% hi,  0.0% si
Cpu1 :  0.0% us,  0.0% sy,  0.0% ni,  0.0% id, 100.0% wa,  0.0% hi,  0.0% si
Mem:   3090480k total,  3071604k used,    18876k free,    13776k buffers
Swap:  3895752k total,     2656k used,  3893096k free,  2833656k cached

Created attachment 121349 [details]
Activating write cache seems to help a lot
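The %wa figures quoted throughout these reports come from the kernel's cumulative counters in /proc/stat. As background, here is a minimal sketch of how a tool like top derives an iowait percentage from two samples of the aggregate cpu line (field order per proc(5); the sample values below are made up for illustration):

```python
def iowait_percent(sample1: str, sample2: str) -> float:
    """Compute %wa between two snapshots of the '/proc/stat' cpu line.

    Fields after the 'cpu' label are cumulative jiffies:
    user nice system idle iowait irq softirq ...
    """
    t1 = [int(x) for x in sample1.split()[1:]]
    t2 = [int(x) for x in sample2.split()[1:]]
    deltas = [b - a for a, b in zip(t1, t2)]
    total = sum(deltas)
    iowait = deltas[4]  # 5th field after the label is iowait
    return 100.0 * iowait / total

# Hypothetical snapshots from a 2-CPU box stuck waiting on I/O:
before = "cpu  100 0 50 800 1000 0 0"
after  = "cpu  100 0 51 801 1198 0 0"
print(iowait_percent(before, after))  # -> 99.0
```

On a live system the two samples would be read from /proc/stat a fixed interval apart; a box that is "idle" yet shows 98-99% wa is one where nearly every jiffy in that interval was charged to the iowait counter.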
I am seeing this problem also with RHEL 4.2 and a 3ware 7504 card with three drives in a hardware RAID 5 config.

Can someone from Red Hat please give us an update on this call, as it has been open for over 18 months now?

Is it anything to do with this issue, perchance?
https://www.redhat.com/archives/linux-lvm/2004-February/msg00141.html

I've read this Bugzilla twice, as I find it very similar to what we experience, but it's totally different: we have RH AS 4.0 and an Emulex FC HBA connected to a DMX3000. We experience an I/O rate of 30-40 I/Os per sec from a host with MySQL doing ~4KB I/O. Not very impressive for a high-end SAN storage solution (other Solaris hosts perform up to 3000 I/Os, so there is no problem with the DMX).

Has anyone tried to align the file system, or is that totally unnecessary for the host controllers mentioned here? I'm a SAN person, so this might be totally irrelevant, but as I know this is a problem in SAN environments, it might cause the bad performance with RAID 5 and the local controllers as well. I have tried to get my Linux people to read this Bugzilla, as we see bad performance on a local HP SCSI controller as well as on FC HBAs connected to SAN storage (they say "it's not a Linux problem... it's your SAN storage..."). I don't have any Linux server to try this on, so I post it here; if it makes sense to align the I/Os so they don't need to cross I/O boundaries set by the SCSI or whatever controller, it's worth a try. I also believe this isn't only for RAID 5 devices, so aligning RAID 1 would also be a good thing to try, to get aligned to the cache.

I would give file system alignment a try for the RAID 5 volumes to see if it has any positive effect. If the controller cache has 32KB cache slots, a partition start alignment could be 128 sectors (or 64 sectors), both for RAID 1 and RAID 5, depending on how the RAID controller handles the RAID devices and cache. This is an example of alignment that works on FC SAN:
1. Execute "fdisk /dev/sd<x>"
2. Type "n" to create a new partition
3. Type "p" to create a primary partition
4. Type "1" to create partition #1
5. Select the defaults to use the complete disk
6. Type "x" to get into expert mode
7. Type "b" to specify the starting block for partitions
8. Type "1" to select partition #1
9. Type "128" to make partition #1 align on a 64KB boundary (if that boundary exists on the controller)
10. Type "r" to return to the main menu
11. Type "t" to change the partition type
12. Type "1" to select partition 1
13. Type "fb" to set the type to fb
14. Type "w" to write the label and partition information to disk

Might help, might not?

Hi. Is this normal behavior for dd? I ran the following. It's an ext3 file system on an internal SCSI device. During this time all 4 CPUs go up to ~99% system time.

dd if=/dev/zero of=/opt/laban_test/fs1/file01 bs=8192 count=100000 &
dd if=/dev/zero of=/opt/laban_test/fs2/file01 bs=8192 count=100000 &
dd if=/dev/zero of=/opt/laban_test/fs3/file01 bs=8192 count=100000 &
dd if=/dev/zero of=/opt/laban_test/fs4/file01 bs=8192 count=100000 &

[root@se3108 ~]# strace -c -tt -T -p 19090
Process 19090 attached - interrupt to quit
Process 19090 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.92   14.260186         196     72870           write
  5.08    0.762567          10     72870           read
  0.00    0.000071          10         7         6 open
  0.00    0.000036          12         3           close
  0.00    0.000025          25         1           munmap
  0.00    0.000015          15         1           mmap2
  0.00    0.000010          10         1           fstat64
------ ----------- ----------- --------- --------- ----------------
100.00   15.022910                145753         6 total
[root@se3108 ~]#

Finally I had a chance to test 3ware's 9500S under Windows 2000 Server. (Un)fortunately I was also able to reproduce the same kind of symptoms with it.

3GHz Pentium 4, 2*160GB SATA disks, RAID-1

1. Send a 5GB file through an SMB share to the win2000 server.
2.
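As a side note on the strace summary above, the per-call figure and an implied per-stream throughput can be recomputed from the totals (8192-byte writes, per the dd command lines; this is only the throughput during the traced window, and the strace overhead itself inflates the numbers somewhat):

```python
# Recompute averages from the strace -c totals for the write() syscall.
write_seconds = 14.260186   # total time in write(), from the summary
write_calls = 72870         # number of write() calls
block_size = 8192           # bytes per write, from the dd bs= argument

usecs_per_call = write_seconds / write_calls * 1e6
throughput_mb_s = write_calls * block_size / write_seconds / 1e6

print(round(usecs_per_call))      # -> 196, matching the usecs/call column
print(round(throughput_mb_s, 1))  # -> 41.9 MB/s for this one dd stream
```

So each 8KB write is taking ~196 microseconds inside the kernel, and with four such streams running in parallel the box spends essentially all of its time in write() and system time, consistent with the ~99% system time reported.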
The win2000 server slowly becomes unresponsive to any other processes that try to access the hard disk, until even mouse movements become unresponsive.
3. After the server has written the whole file, everything goes back to the way it was.

So it seems to me that 3ware's cards just suck, and their marketing lies quite a lot about their cards' performance. This case is closed for me. =)

btw. some benchmark: http://www.tweakers.net/ext/i.dsp/1110264565.png

Has this bug been resolved? We have the Enterprise 3 Update 4 kernel installed on HP DL 360 G4 & DL 380 G4 machines. With moderate work, the iowait + load go unnecessarily high, and performance also seems very jittery. When a tar backup was done from the disk to a USB hard drive, the iowait went as high as 99%, the load went as high as 30~50, and it managed to crash the server one time. If Red Hat says the iowait numbers in top are not a problem, why did it crash the system? The OS should not crash if the backup fails. And it's not limited to backups; we have seen a hanging IMAP server with stale TCP/IP connections under a fairly low number of connections. How do we solve this problem? Will the new RH ES 4 solve these issues? It seems like this bug is not solved. We tried ES 4 on a moderate server, only to notice the same high iowaits & poor response for remote terminal users. Going to try it on an HP DL 380 to see if there's a difference. Any insight into this would greatly help.

And there are no 3ware cards inside. The RAID card is a Smart Array 6i/5i.

This BZ has become a catch-all for many unrelated performance problems. As a result, it is impossible to update the status, because there is no status that applies to them all. The way I propose to address this is to strictly limit this BZ to performance problems with the 3ware driver/adapter in RHEL 3. Other problems need a separate BZ. As you can see, some of the 3ware problems have been addressed in RHEL 3 updates; some have been deemed to be hardware limitations. Others may remain.

Denzel, if you are experiencing a crash, that is clearly a different problem than this one. You should open a separate BZ. Please use a serial console to capture the output at the time of the crash. Also provide /var/log/messages showing the boot messages and the messages leading up to the crash. (Or just provide a sysreport.) Depending on the situation, we may ask you to provide a vmcore, using netdump or disk dump.

The "hanging IMAP server with stale TCP/IP connections" sounds like a different problem. Please open a separate BZ.

If you are having a problem with I/O performance on a 3ware adapter, please provide more information about how it is configured and enough information to allow us to reproduce the problem. If you are having an I/O performance problem with another HBA, please open a separate BZ with the information requested here.
I think the I/O performance problems are related to the following note from the 3ware 9.3.0.2 firmware release:

  Write performance and read performance balancing issue
  With the 3ware 9550SX controller, you might experience slow read
  performance when you have lots of write operations going to the
  controller at the same time. With the current firmware, the firmware
  is maximizing the write performance and gives it higher priority than
  read performance. A future firmware update will rectify this issue.

No idea if this applies to all the other controllers, but I suspect so. This can explain the system unresponsiveness under write load, since reads are starved. Some kernel tuning might help a lot in this case. Having said that, I/O performance under RHEL4 is a lot better for the same hardware, so the kernel isn't entirely blameless here.
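On the read-starvation point, one kernel-side mitigation available on 2.6 kernels is to switch the device to the deadline elevator and tighten its read deadline, so queued writes cannot starve reads indefinitely. A sketch, assuming /dev/sde as in the tests above and a kernel built with the deadline scheduler (the values are illustrative starting points, not validated settings for this hardware):

```
# Illustrative only -- substitute your own device for sde.
echo deadline > /sys/block/sde/queue/scheduler

# Service queued reads within ~100 ms even under heavy write load
# (the default read_expire is 500 ms):
echo 100 > /sys/block/sde/queue/iosched/read_expire

# Dispatch reads more aggressively relative to writes (default 2):
echo 1 > /sys/block/sde/queue/iosched/writes_starved
```

This only changes how the kernel orders requests; if the controller firmware itself prioritizes writes, as the release note above says, the effect will be limited.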