Red Hat Bugzilla – Bug 139910
Poor SCSI read performance caused by fragmentation of user requests
Last modified: 2007-11-30 17:07:05 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041001
Description of problem:
RH 7.3 seemed to max-out at about 60MB/s over FC.
RHEL 3 maxes out at about 20MB/s.
The problem is confined to the SCSI layer. Using sg_dd (SCSI generic
implementation of dd), the FC and SAN max-out at about 160MB/s on
reads (FC-2). This means that, if the SCSI layer would send "read"
commands with larger "read" lengths, the performance would increase.
The SAN shows this problem nicely: the DDN S2A8500 "stats length"
command shows a graph of the "read" and "write" lengths it receives.
When performing I/O operations with large "read" and "write" lengths,
the DDN shows that the "write" requests are large, but the "read"
requests are fragmented:
S2A 8000: stats length
Command Length Statistics

Length     Port 1            Port 2           Port 3          Port 4
Kbytes   Reads    Writes   Reads   Writes   Reads   Writes   Reads
>   0  2359622  1687DD9  1A07EF3   E0351  1B50060    C555   B0AD97
>  16    C6332   44FF0D    226E5    6589    1FFF8    1CAC     6358
>  32    14BB4   26DA31     58CD    2BDA     4664    11CD      16A
>  48    17542   186DF8     8B13    1557     81BA     96D        0
>  64     7FA2   119010     5E52    1A2E     5C20     B94        2
>  80     4768    D408F     3807     C48     33B0     589        0
>  96     B2BE    A4BA6     7F07     B6C     7DBE     59D        0
> 112     2F7D    818FF     3057     E14     2FC0     902        0
> 128    3710D   2E40F9    3467F   84337    35BCB   852B5        0
> 144        0        0        0       0        0       0        0
> 160        0        0        0       0        0       0        0
> 176        0        0        0       0        0       0        0
> 192        0        0        0       0        0       0        0
> 208        0        0        0       0        0       0        0
> 224        0        0        0       0        0       0        0
> 240        0        0        0       0        0       0        0
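The skew is easier to see in decimal. A quick sketch (counter values copied from the Port 1 read column above; the S2A prints them in hex, and I'm reading the "> 0" bucket as requests under 16 Kbytes):

```shell
# Port 1 read counters from the "stats length" table above, one per
# length bucket; the first bucket ("> 0") is requests under 16 KB.
reads="0x2359622 0xC6332 0x14BB4 0x17542 0x7FA2 0x4768 0xB2BE 0x2F7D 0x3710D"
total=0
for c in $reads; do total=$((total + c)); done
echo "total reads: $total"
# Per-mille of reads smaller than 16 KB:
echo $((0x2359622 * 1000 / total))    # -> 965: 96.5% of reads are tiny
```

By contrast, summing the write column the same way shows the writes clustering in the large buckets, which is exactly the asymmetry the table displays.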
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Time "dd" commands with large block sizes (bs), both reading and
writing to SCSI FC devices.
2. Time "sg_dd" commands with the same block sizes against the same
SCSI FC devices.
3. Note the performance difference.
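As a concrete sketch of steps 1-3 (device names here are hypothetical; sg_dd comes from the sg3_utils package, and its bpt= "blocks per transfer" option fixes the size of each SCSI READ it issues):

```shell
DEV=/dev/sdc    # hypothetical: the FC disk as a block device
SGDEV=/dev/sg2  # hypothetical: the same disk via the sg driver
if [ -b "$DEV" ]; then
    time dd if="$DEV" of=/dev/null bs=1M count=2048
    sg_dd if="$SGDEV" of=/dev/null bs=512 bpt=2048 count=4194304 time=1
fi
# Both commands read the same 2 GiB; bpt=2048 makes every sg_dd READ
# 1 MiB, regardless of how the block layer would have split it:
echo $((512 * 2048))                 # -> 1048576 bytes per SCSI READ
echo $((4194304 * 512 / 1048576))    # -> 2048 MiB transferred in total
```

Because sg_dd bypasses the block layer entirely, any throughput gap between the two runs points at request fragmentation rather than at the disk or the fabric.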
Actual Results: FC-2 read/write performance using sg_dd with large
block sizes will get about 160MB/s (given a good SAN like a DDN S2A8500).
Using "dd" with the same large block sizes over the same device gets
much worse performance.
Expected Results: I need 160MB/s "read" and "write". I'd settle for
a 10% performance degradation in "reads".
Larry Woodman and a few other people have been working this problem
more than me. It is not necessarily a SCSI issue. You're correct that
sg_dd will skip anything that could fragment the requests, but when
using dd the fragmentation is likely a result of the I/O elevator and
VM subsystems more than it is the SCSI subsystem. Larry, if you
disagree, then simply toss this back at me.
Created attachment 107910 [details]
Graphs comparing RHEL to RH7.3 SCSI/FC performance
I'm adding some data I've collected; see the attached PDF.
The left column of three graphs shows RHEL3U2 results, the right column of
three graphs shows RH7.3 (I forget which kernel... I think 2.4.19) results. All
on the same hardware: DDN S2A8000 8-port SANs, 3GHz I/O and Compute nodes, I/O
nodes w/ QLogic FC2 cards, channel bonded GigE, and running GFS; the Compute
Nodes exporting via NFS the GFS partitions in a ratio of 8 Compute Nodes per
I/O Node. A Foundry FastIron 1500 switch is used for GigE/NFS (the Foundry
switch firmware has also been upgraded in this time frame).
The left and right graphs are comparable, even if the captions sound a bit
different.
The top two graphs show I/O (GFS) node perspective (GigE/NFS not involved).
The recent graphs with RHEL (left top) show up to 16 I/O nodes, the RH7.3 graph
(right top) only shows eight. Write performance has doubled in RHEL vs. 7.3...
from a maximum at 550MB/s to 1.1GB/s! This is pushing the limits of the DDN
S2A8000 (sgp_dd can get about 1.3GB/s writes).
The read performance, on the other hand, has decreased significantly. In RH7.3
I had been getting an aggregate performance of ~350MB/s. In RHEL, I'm getting
200MB/s. The sgp_dd read performance aggregate (across 8 FC2 ports) is
~900MB/s for the DDN S2A8000.
The middle two graphs show the Compute Node perspective. The Compute Nodes are
NFS clients for the GFS partitions served by the I/O nodes. The 16 and 32
client nodes on the X-axis (of the left RHEL graph) map Compute Nodes to I/O
nodes 1-to-1 and 2-to-1 (one NFS client per NFS/GFS server, and two NFS Clients
per NFS/GFS server). These two sets of columns are comparable to the X-axis on
the right graph at 8 and 16 Compute Nodes (as the RH7.3 system only had 8 I/O
nodes).
The middle graphs only show that NFS carries through the performance seen on
the I/O nodes.
The bottom two graphs show the scalability from one to eight Compute Nodes that
are all NFS clients of one I/O node.
In the RH7.3 case (right graph) you could expect the same NFS performance for
both reads and writes, comparable to the GFS performance of the I/O node.
In RHEL, the I/O node write performance of 170MB/s slumps to a Compute node
aggregate of 100MB/s, even though the I/O nodes use channel-bonded (802.3ad)
GigE.
Also in RHEL, the read performance doesn't reach 90% of the I/O node's read
performance until all eight NFS client Compute Nodes are active. This
compounds the already low maximum.
All numbers were gathered using IOzone with 512K-byte blocks, and file sizes
of twice memory size.
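For reference, a sketch of that measurement setup (the mount point and file name are hypothetical; -i 0 and -i 1 select IOzone's write and read tests, and the file size is doubled against RAM so reads cannot be served from the page cache):

```shell
# Size the test file at twice physical memory, per the methodology
# above, then print the IOzone invocation that would be used.
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
file_kb=$((mem_kb * 2))
echo iozone -r 512k -s "${file_kb}k" -i 0 -i 1 -f /gfs/iozone.tmp
```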
I was going back through the data on the attached PDF and found an
error.
The charts shown for RH7.3 were prior to channel bonding the I/O (GFS)
nodes, and prior to using 9K MTU's. Once those features were enabled,
for example, the RH7.3 numbers on the center right chart peaked at
650MB/s for writes, and over 400MB/s for reads.
So, half the read speed was lost in the move to RHEL.
I have just upgraded a fileserver from RH7.3 to RHEL3.0 and have
discovered a huge read performance hit.
I have a Dell PE2650 attached via two Qlogic QLA2310F FCAL cards to
the storage device which is a Dell/EMC FC4500. I have installed the
latest BIOS 1.42 into the FCAL cards and have used the Qlogic 7.0.3
driver instead of the Redhat supplied one. I am running kernel
I used to get about 50MB/s write and 60MB/s read speeds.
I am now getting about 50MB/s write and 20MB/s read speeds.
Read speed is now only a third of what it was.
It would help to get sgp_dd performance numbers, as a difference
between sgp_dd and dd implicates the SCSI layer and not the SAN.
You'll probably drool when you see what you could be getting!
I think the main problem is that the Linux SCSI layer has been tuned
for SCSI adapters and left FC considerations in the dust. While FC is
not the culprit (which sgp_dd shows), with a SCSI adapter and a JBOD I
can get great dd read/write performance (e.g. 300MB/s)... but
whatever the kernel is doing to boost SCSI adapters affects FC adversely.
How do I go about getting sgp_dd performance numbers?
sg_dd is like dd, while sgp_dd shows the effect of multiple threads.
You'll need "scsi generic" support either built into your kernel, or the sg.o
module loaded.
I can report that I have now overcome this problem - look at bug
I came across this before - basically, the max-readahead value is set
to 31, which is more suited to random access. For sequential
reads, set it to 256.
I have done this and my read performance is back up to circa 60mb/s.
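For the record, a sketch of how that tunable is set on these 2.4-era kernels (the /proc path is the RHEL 3 location, and to the best of my knowledge the unit is pages; the tunable moved elsewhere in later kernels):

```
# One-off, at runtime (lost on reboot):
echo 256 > /proc/sys/vm/max-readahead

# Persistent, via /etc/sysctl.conf:
vm.max-readahead = 256
```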
Created attachment 112856 [details]
Problem partially resolved in RHEL3U4 and GFS 6
Same system has been upgraded to RHEL3U4 and GFS 6, and the performance numbers
are compared to GFS 5.2/RHEL3U2 in the attached PDF.
The SCSI performance problems are solved. If you compare the top two charts in
their lower left corner, you'll see the read speed is much better than it was.
This is a single threaded test, so 100MB/s is the max. In multi-threaded tests
I can get 140MB/s reading on an I/O node.
Good Job RedHat!
But, "read" scalability is still an issue in GFS. While the peak "read" speed
has improved slightly, it's not reflecting the available hardware (and now OS)
capability. So, RH7.3 GFS still beats RHEL in "read" scalability.
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
For more information on the RHEL errata support policy, please visit:
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.