Bug 502499 - Poor write performance on some { sata controller, disk} combinations
Summary: Poor write performance on some { sata controller, disk} combinations
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-05-25 14:28 UTC by Vali Dragnuta
Modified: 2011-08-17 13:45 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-17 13:45:51 UTC
Target Upstream Version:
Embargoed:


Attachments
output of: hdparm -i /dev/sd[a-g] (4.96 KB, text/plain), 2009-06-09 10:08 UTC, Steven Haigh
Output of: lspci -v (6.17 KB, text/plain), 2009-06-09 10:08 UTC, Steven Haigh
Output of: dmesg for kernel 2.6.18-128.1.10.el5.centos.plus (22.93 KB, text/plain), 2009-06-10 16:12 UTC, Steven Haigh


Links
CentOS 3640

Description Vali Dragnuta 2009-05-25 14:28:37 UTC
It seems that certain hardware combinations result in very poor write performance.
System information:
Intel server platform, S5000PAL
An assortment of SATA disks connected to it:
one 2.5" Seagate ST980813AS connected directly to the mainboard
two 3.5" Seagate ST3500320NS
one WD WD6400AAKS-65A7B2
Kernels tested:
2.6.18-150.el5 (test kernel from dzickus) and
2.6.18-128.1.10.el5

Facts:
1. Only the 500G Seagates have decent write performance (50-60 MB/sec; quite poor for those drives, but acceptable).
2. The 640G Western Digital, on the other hand, is barely able to sustain 20 MB/sec writes.
3. The small 80G disk only manages 3.5 MB/sec writes.
4. However, read performance reaches about 90-100 MB/sec for both the 500G Seagates and the 640G Western Digital, so the problem seems to be write-related.
5. The 640G disk has good write performance in a different server (S3000-based), so the disk is not defective.
6. The disks were rotated between ports during the tests, so a defective SATA port is also excluded.
Could it be an insufficiently supported SATA controller?



Additional info :

- The active driver is ahci.
- If I disable AHCI in the BIOS setup to force the use of ata_piix instead, performance goes from bad to abysmal, even on the usually better-performing Seagates.
- I have three 640G WD disks. All perform fine on an S3000 mainboard but very poorly on the S5000 system.
- The I/O scheduler was cfq and is now deadline; no significant difference.
- Write tests were done by (roughly along the lines of the sketch below):
   1) writing a big file on an ext3 filesystem
   2) writing a big file on an xfs filesystem
   3) writing from /dev/zero directly to the device with dd
- I attached an archive with dmesg, lspci and dmidecode.
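
For reference, a minimal sketch of this kind of test (the mount point, device name and sizes are only placeholders, and writing directly to the raw device destroys its contents):

# time sh -c 'dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=2000 && sync'    (filesystem test, timing the final flush as well)
# dd if=/dev/zero of=/dev/sdX bs=1M count=2000                                  (raw device test; destructive)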

Comment 1 Steven Haigh 2009-05-26 06:46:32 UTC
I am seeing similar issues on an all SATA RAID5 array. Array write performance will vary over a network from 35MB/sec to 200KB/sec. Read performance from the same RAID5 array is in excess of 45MB/sec.

From what I have managed to find out so far, data going over a network to this machine via FTP or Samba runs at around 20MB/sec until the %wa value in 'top' on the server goes above 40%. At that point, the transfer slows to around 200KB/sec. It seems that the system is putting all data into cache, and the cache is then being flushed to disk - possibly in a blocking manner, leading to the performance issues described.
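
(As a debugging hint, the thresholds that decide when background and blocking writeback kick in are the standard vm dirty-page knobs; I am noting them here only as a place to look, not as something already tuned:

# cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
)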

Interestingly enough, removing (rm) a 4Gb ISO from the RAID5 array will take 3-4 minutes. Even more interestingly, these issues do not occur on the RAID1 array on the same machine.

The three SATA controllers on the machine use sata_promise, sata_sil and sata_via.

Any more information is available upon request.

# lspci
00:00.0 Host bridge: Intel Corporation 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
00:01.0 PCI bridge: Intel Corporation 82865G/PE/P PCI to AGP Controller (rev 02)
00:03.0 PCI bridge: Intel Corporation 82865G/PE/P PCI to CSA Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV11DDR [GeForce2 MX200] (rev b2)
02:01.0 Ethernet controller: Intel Corporation 82547EI Gigabit Ethernet Controller
03:01.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
03:04.0 SCSI storage controller: Adaptec AHA-2944UW / AIC-7884U (rev 01)
03:0b.0 Mass storage controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)

# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md1 : active raid5 sdg1[3] sdf1[2] sde1[1] sdd1[4] sdc1[0]
      1172134144 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
      
md2 : active raid1 sdb2[1] sda2[0]
      1967872 blocks [2/2] [UU]
      
md0 : active raid1 sdb1[1] sda1[0]
      76180096 blocks [2/2] [UU]
      
unused devices: <none>

Comment 2 Steven Haigh 2009-05-26 06:49:43 UTC
Oh, and I forgot to add... I've tested this under kernels 2.6.18-128.1.10.el5 and 2.6.18-128.1.6.el5.

Comment 3 Vali Dragnuta 2009-05-27 10:29:52 UTC
Same behavior using the RHEL 5.3 kernel instead of the CentOS kernel.
RHEL kernel version tested: 2.6.18-128.1.10.el5.x86_64

Comment 4 Steven Haigh 2009-05-29 10:41:31 UTC
Some more info that will probably help... 

The RAID5 array I have created has been running for a LONG time.

mdadm shows:
  Creation Time : Sun Sep 18 00:42:54 2005
     Raid Level : raid5
     Array Size : 1172134144 (1117.83 GiB 1200.27 GB)

The speed issues, however, are new and tied to newer kernel versions. This means the problem is not with the creation of the array or anything else related to the RAID setup itself, as these speed issues have only been noticed recently, after many years of this array running.

Sadly, I can't tell you what kernel version started these issues.

Comment 5 Steven Haigh 2009-05-29 10:43:32 UTC
More info: I've contacted Seagate and provided them with serial numbers of the drives affected. They have confirmed there are no firmware updates available for the drives in question.

Comment 6 Steven Haigh 2009-05-29 13:26:08 UTC
Testing write speeds with various settings:

$ dd bs=1M count=200 if=/dev/zero of=/mnt/raid/data/text.data
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.95399 seconds, 220 MB/s

$ dd bs=1M count=400 if=/dev/zero of=/mnt/raid/data/text.data
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 1.80456 seconds, 232 MB/s

$ dd bs=1M count=600 if=/dev/zero of=/mnt/raid/data/text.data
600+0 records in
600+0 records out
629145600 bytes (629 MB) copied, 3.61397 seconds, 174 MB/s

$ dd bs=1M count=800 if=/dev/zero of=/mnt/raid/data/text.data
800+0 records in
800+0 records out
838860800 bytes (839 MB) copied, 35.6739 seconds, 23.5 MB/s

$ dd bs=1M count=1000 if=/dev/zero of=/mnt/raid/data/text.data
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 59.2112 seconds, 17.7 MB/s

$ dd bs=1M count=1500 if=/dev/zero of=/mnt/raid/data/text.data
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 130.644 seconds, 12.0 MB/s

$ dd bs=1M count=2000 if=/dev/zero of=/mnt/raid/data/text.data
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 167.748 seconds, 12.5 MB/s

Comment 7 Steven Haigh 2009-06-03 08:39:35 UTC
Rolled back to kernel 2.6.18-8.el5 from CentOS released as kernel-2.6.18-8.el5.i686.rpm

# dd bs=1M count=200 if=/dev/zero of=/mnt/raid/data/text.data
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.92904 seconds, 226 MB/s

# dd bs=1M count=400 if=/dev/zero of=/mnt/raid/data/text.data
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 2.01037 seconds, 209 MB/s

# dd bs=1M count=600 if=/dev/zero of=/mnt/raid/data/text.data
600+0 records in
600+0 records out
629145600 bytes (629 MB) copied, 2.81386 seconds, 224 MB/s

# dd bs=1M count=800 if=/dev/zero of=/mnt/raid/data/text.data
800+0 records in
800+0 records out
838860800 bytes (839 MB) copied, 8.39298 seconds, 99.9 MB/s

# dd bs=1M count=1000 if=/dev/zero of=/mnt/raid/data/text.data
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.2183 seconds, 51.9 MB/s

# dd bs=1M count=1500 if=/dev/zero of=/mnt/raid/data/text.data
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 39.831 seconds, 39.5 MB/s

# dd bs=1M count=2000 if=/dev/zero of=/mnt/raid/data/text.data
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 59.1052 seconds, 35.5 MB/s

This means between kernel versions 2.6.18-8 and 2.6.18-128.1.10 there has been a change that has drastically reduced write speeds to SATA/RAID drives. No other hardware or software changes have been made between tests.

Comment 8 Steven Haigh 2009-06-03 08:59:51 UTC
Changing to kernel 2.6.18-92.1.22.el5.

# dd bs=1M count=200 if=/dev/zero of=/mnt/raid/data/text.data
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 1.03884 seconds, 202 MB/s

# dd bs=1M count=400 if=/dev/zero of=/mnt/raid/data/text.data
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 1.76567 seconds, 238 MB/s

# dd bs=1M count=600 if=/dev/zero of=/mnt/raid/data/text.data
600+0 records in
600+0 records out
629145600 bytes (629 MB) copied, 2.48733 seconds, 253 MB/s

# dd bs=1M count=800 if=/dev/zero of=/mnt/raid/data/text.data
800+0 records in
800+0 records out
838860800 bytes (839 MB) copied, 8.21672 seconds, 102 MB/s

# dd bs=1M count=1000 if=/dev/zero of=/mnt/raid/data/text.data
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 18.2674 seconds, 57.4 MB/s

# dd bs=1M count=1500 if=/dev/zero of=/mnt/raid/data/text.data
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 43.9399 seconds, 35.8 MB/s

# dd bs=1M count=2000 if=/dev/zero of=/mnt/raid/data/text.data
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 61.8338 seconds, 33.9 MB/s

This seems to indicate the issue is between 2.6.18-92.1.22 and 2.6.18-128.1.10.

Comment 9 Vali Dragnuta 2009-06-05 10:04:39 UTC
Steven :
1. What hardware are you running on ? (mainboard, ram,disks, controller...)
2. The first writes always look better, as you are just filling the kernel's buffers. You should use sync mode with dd, or time a sync right after dd and count that time in the transfer rate too (see the sketch after this list).
3. If you can, you should disassemble the array and try to test the speed individually on each disk. In my case, a noticeable difference can be seen between various types of disks or even between SATA ports, so I do not exclude some weird controller firmware problem either.
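
Something along these lines, for example (file name and size are placeholders):

# dd if=/dev/zero of=/mnt/raid/data/test.data bs=1M count=2000; time sync
or, wrapping both so the flush is included in a single timing:
# time sh -c 'dd if=/dev/zero of=/mnt/raid/data/test.data bs=1M count=2000 && sync'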

Comment 10 Steven Haigh 2009-06-09 10:07:24 UTC
The CPU is a P4 3GHz on an older Gigabyte mainboard. It runs 1.5GB of DDR400 RAM in dual channel mode. I have 2 SATA HDDs hooked up to the onboard ICH5 SATA controller, plus 2 Silicon Image controllers: the SiI3112, which has 2 SATA connectors and sits on the mainboard, and a PCI SiI3114 4-port SATA controller.

I just did a complete teardown and rebuild of the RAID array on this system and creating the RAID5 array took well over 10 hours for 5 x 300Gb drives with no other processes accessing this RAID array.

Looking at the stats given to me by iostat, it seems the maximum speed I can get via the RAID5 array is ~6-9MB/sec per disk.

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
md0               0.00         0.00         0.00          0          0
sdc              43.22         7.22         0.00         14          0
sdd              43.22         7.22         0.00         14          0
sde              23.62         0.00         7.04          0         14
sdf              42.71         7.16         0.00         14          0
sdg              43.72         7.29         0.00         14          0
md1               0.00         0.00         0.00          0          0

Interestingly, when I test md0, which is made up of the two drives connected to the ICH5 SATA controller, I get 40+MB/sec to EACH DRIVE.

I've been doing multiple tests with iostat to watch IO statistics, and it seems the maximum I get out of the sata_sil-connected devices is 7MB/sec per device. This gives the following total speeds while duplicating files on the RAID array (sequential READ and WRITE operations to the same array):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.49    0.00   22.39   31.84    0.00   44.28

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
md0               0.00         0.00         0.00          0          0
sdc              60.70         6.54         5.36         13         10
sdd              66.17         7.00         5.75         14         11
sde              53.23         6.40         4.52         12          9
sdf              75.62         7.38         5.53         14         11
sdg              74.13         7.15         5.94         14         11
md1             329.85        19.24        17.16         38         34

Sadly, I don't have any other NON Silicon Image SATA controllers to test to see if the slow speeds are caused by the sata_sil controller.

Playing around, I managed to get data transfer to PEAK at ~20MB/sec; however, this only lasted a few seconds and I believe it was a caching effect, as things soon settled back down to the same old slow speeds.

I have attached the output of lspci -v and a hdparm -i on all attached HDDs.

Comment 11 Steven Haigh 2009-06-09 10:08:26 UTC
Created attachment 346999 [details]
output of: hdparm -i /dev/sd[a-g]

Comment 12 Steven Haigh 2009-06-09 10:08:59 UTC
Created attachment 347000 [details]
Output of: lspci -v

Comment 13 Yeechang Lee 2009-06-10 15:52:33 UTC
I can confirm what Steven and Vali have been reporting.

I have run an mdadm RAID 6 JFS array on 16 Western Digital RE2 500GB drives on a HighPoint RocketRAID 2240 (in JBOD, using the sata_mv kernel module) for 2.5 years, with Fedora and (for the past eight months) CentOS. When I boot with 2.6.18-128.1.10.el5.x86_64 (albeit the centosplus version), writes to the array are so slow that they more or less hang it; 'iostat -xk 1' shows %util for certain drives (at least one, sometimes two or three) in the array at 100%, with the affected drive changing at random. Reverting to 2.6.18-92.1.22.el5.x86_64 reliably eliminates the problem.

Also, could #470623 for Fedora be related?

Comment 14 Steven Haigh 2009-06-10 16:11:35 UTC
Hi Yeechang,

I'm not sure if the Fedora bug could be related, as all my drives are 1.5Gbit drives. That said, it doesn't rule out the underlying libata code as the cause of the problem. Looking at the same iostat command as you, however, I notice that some of the drives in the RAID5 array show peaks of 80% utilisation at an average write speed of 6MB/sec. This doesn't seem right.

Comment 15 Steven Haigh 2009-06-10 16:12:41 UTC
Created attachment 347251 [details]
Output of: dmesg for kernel 2.6.18-128.1.10.el5.centos.plus

Comment 16 Steven Haigh 2009-06-10 16:41:26 UTC
This is also interesting bonnie++ output from the RAID5 array:
# bonnie++ -d /mnt/raid/data/ -u 501 -m zeus
Using uid:501, gid:501.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.94       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zeus             3G   591  88 11382   1  8497   2   561  61 90163  16 229.2   7
Latency             17729us   32078ms    2537ms   50973us     161ms     201ms
Version  1.94       ------Sequential Create------ --------Random Create--------
zeus                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   361   2 +++++ +++   314   1   341   2 +++++ +++   352   2
Latency               560ms     963us     673ms     924ms     894us     846ms
1.93c,1.94,zeus,1,1244631220,3G,,591,88,11382,1,8497,2,561,61,90163,16,229.2,7,16,,,,,361,2,+++++,+++,314,1,341,2,+++++,+++,352,2,17729us,32078ms,2537ms,50973us,161ms,201ms,560ms,963us,673ms,924ms,894us,846ms

It shows sequential block reads at 90MB/sec but writes at 11MB/sec across the RAID5 array.

Comment 17 Steven Haigh 2009-06-10 17:15:31 UTC
In further testing, I have also found the dd tests to be unreliable as a comparison between kernel versions. For some reason (which escapes me... caching? different IO profiles?), I can replicate the faster results on the latest kernel version, with a few exceptions.

The bonnie++ tests are in line with what I see in reality.

Comment 18 Doug Ledford 2009-06-10 21:04:18 UTC
To Vali:

If you can install and use the sg3_utils package to test the drive performance, that would actually isolate whether it's happening in the filesystem/block layer or in the SCSI/libata driver layer.  Using sg_dd on the scsi devices in blk_sgio=1 mode gets around the block layer and goes straight to scsi commands.
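
Roughly something like the following (device names and sizes are placeholders; the write form overwrites the disk, so it should only be run against a scratch drive):

# sg_dd if=/dev/sdc of=/dev/null bs=512 count=2097152 blk_sgio=1 time=1
# sg_dd if=/dev/zero of=/dev/sdX bs=512 count=2097152 blk_sgio=1 time=1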

To Steven:

OK, first off, the write performance on your machine is abysmal, and that's regardless of which kernel you use. The problem kernels make abysmal look even worse, but you are definitely getting poor performance even on a good day. Your write speed should not be so far below your read speed, and whenever it is, that's a good indication that something in your setup is bad for RAID5 operation. I'll address your RAID issues in email since they are separate from the original reporter's problem of a kernel-specific slowdown (you are seeing the kernel-specific slowdown too, but you also have overall bad performance that isn't related to it).

Comment 19 Vali Dragnuta 2009-06-12 13:04:39 UTC
@Doug :
Hello. 
For the moment I returned the hardware to the vendor to make their own evaluation. I hope the resolution will be "hardware problem" - it would simplify things. I'll be back next week when the resolution from the vendor is due.

Thank you.

Comment 20 Vali Dragnuta 2009-06-23 08:06:55 UTC
Ok, so I finally found what the problem was (still is, actually).
The system I am testing on is a 1U Intel platform. It is very compact, and to cool the whole system it has about 10 small fans. The higher the RPM of these fans, the slower the transfer rate. It seems that the vibration produced by the fans dramatically affects disk performance. At the lowest fan RPM (just after startup) the transfer rate is about 80-100 megabytes/second.
One step up in fan RPM and the transfer rate drops to 40 megabytes/second. One more step in fan RPM and the transfer rate drops to 11 megabytes/second or lower.
Depending on the type of disk, vibration affects performance to a greater or lesser degree.
For standard SATA drives, like the Western Digitals in my first setup, the vibration impact is very high: performance drops to 10-11 megabytes/sec or below.
If the disk is connected directly to the mainboard without touching the case, the transfer rate seems to return to normal, further confirming the vibration hypothesis. Electrical interference in the backplane seems improbable, as another disk connected directly to the mainboard works fine when suspended above the case and performs poorly when it touches the case and thus vibrates along with the server chassis.
For the more enterprise-ish SATA drives, like the 500G Seagate (Barracuda ES.2/NS) from the first setup, or the WD RE3 I tested with yesterday, the performance drop exists and is very noticeable, but it is not as dramatic (it does not go below 10 megabytes/sec).
Given that fan RPM naturally rises during normal operation at peak load, disk performance can also be expected to drop at those moments. This is actually a design flaw of the Intel platform, as neither the fans nor the disk trays are rubber-mounted/insulated to reduce vibration, and vibration is a certainty with these very high RPM fans. For the moment the temporary workaround is to use enterprise-ish SATA disks, as they are a little more resilient to vibration.

Comment 21 Maxim Egorushkin 2009-08-01 15:05:50 UTC
(In reply to comment #8)
> Changing to kernel 2.6.18-92.1.22.el5.
> 
> # dd bs=1M count=200 if=/dev/zero of=/mnt/raid/data/text.data
> 200+0 records in
> 200+0 records out
> 209715200 bytes (210 MB) copied, 1.03884 seconds, 202 MB/s

[...]

> # dd bs=1M count=1500 if=/dev/zero of=/mnt/raid/data/text.data
> 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 43.9399 seconds, 35.8 MB/s
> 
> # dd bs=1M count=2000 if=/dev/zero of=/mnt/raid/data/text.data
> 2000+0 records in
> 2000+0 records out
> 2097152000 bytes (2.1 GB) copied, 61.8338 seconds, 33.9 MB/s

You are not really measuring disk write performance here, but rather the performance of the buffer cache. Your writes are asynchronous and go into the buffer cache. Once the cache fills up to a certain level, the pdflush daemon kicks in and starts writing to the disk in the background. When the cache fills up to the second, higher watermark, your writes become synchronous. More details here: http://www.westnet.com/~gsmith/content/linux-pdflush.htm

To measure the speed of direct writes to the disk, use the oflag=direct dd option.
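
For instance, repeating one of the earlier runs with the page cache bypassed (same file and block size; just an illustration of the flag, not a result):

# dd bs=1M count=2000 if=/dev/zero of=/mnt/raid/data/text.data oflag=direct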

Comment 22 Steven Haigh 2010-09-06 16:39:50 UTC
This should probably be closed NOTABUG...

Comment 23 John Feeney 2011-08-17 13:45:51 UTC
Closing per comment #22.

