Bug 139910

Summary: Poor SCSI read performance caused by fragmentation of user requests
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: Chris Worley <chrisw>
Assignee: Larry Woodman <lwoodman>
CC: coughlan, danderso, gary.mansell, kanderso, kpreslan, petrides, riel, sct
Doc Type: Bug Fix
Last Closed: 2007-10-19 19:13:55 UTC
Attachments:
  Graphs comparing RHEL to RH7.3 SCSI/FC performance
  Problem partially resolved in RHEL3U4 and GFS 6

Description Chris Worley 2004-11-18 19:03:08 UTC

Description of problem:
RH 7.3 seemed to max out at about 60MB/s over FC.

RHEL 3 maxes out at about 20MB/s.

The problem is confined to the SCSI layer.  Using sg_dd (the SCSI generic
implementation of dd), the FC link and SAN max out at about 160MB/s on
reads (FC-2).  This means that if the SCSI layer sent "read" commands
with larger transfer lengths, performance would increase.

The SAN shows this problem nicely: the DDN S2A8500 "stats length"
command shows a histogram of the "read" and "write" lengths it receives.
When performing I/O operations with large "read" and "write" lengths,
the DDN shows that the "write" requests are large, but the "read"
requests are fragmented:

S2A 8000[1]: stats length

                         Command Length Statistics

 Length       Port 1            Port 2            Port 3            Port 4
 Kbytes    Reads   Writes    Reads   Writes    Reads   Writes    Reads   Writes
 >    0  2359622  1687DD9  1A07EF3    E0351  1B50060     C555   B0AD97   27B19F
 >   16    C6332   44FF0D    226E5     6589    1FFF8     1CAC     6358    2E863
 >   32    14BB4   26DA31     58CD     2BDA     4664     11CD      16A     10D6
 >   48    17542   186DF8     8B13     1557     81BA      96D        0      B66
 >   64     7FA2   119010     5E52     1A2E     5C20      B94        2      65B
 >   80     4768    D408F     3807      C48     33B0      589        0      640
 >   96     B2BE    A4BA6     7F07      B6C     7DBE      59D        0      49A
 >  112     2F7D    818FF     3057      E14     2FC0      902        0      440
 >  128    3710D   2E40F9    3467F    84337    35BCB    852B5        0    37F00
 >  144        0        0        0        0        0        0        0        0
 >  160        0        0        0        0        0        0        0        0
 >  176        0        0        0        0        0        0        0        0
 >  192        0        0        0        0        0        0        0        0
 >  208        0        0        0        0        0        0        0        0
 >  224        0        0        0        0        0        0        0        0
 >  240        0        0        0        0        0        0        0        0


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Time "dd" commands with large block sizes (bs) both reading from and
   writing to SCSI FC devices.
2. Time "sg_dd" commands with the same block sizes against the same SCSI FC
   devices.
3. Note the performance difference (example commands below).
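
For illustration, a comparison along these lines (the device names are
examples only: an FC LUN is assumed at /dev/sda with its SCSI generic alias
at /dev/sg0, and bs=512 assumes a 512-byte logical block size):

  # buffered read through the block/SCSI layers, 1MB per dd read (4GB total)
  time dd if=/dev/sda of=/dev/null bs=1024k count=4096

  # direct SCSI READs via the sg driver, 128KB per command (256 * 512 bytes)
  time sg_dd if=/dev/sg0 of=/dev/null bs=512 bpt=256 count=8388608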
    

Actual Results:  FC-2 read/write performance using sg_dd with large
block sizes reaches about 160MB/s (given a good SAN like a DDN S2A8500).

Using "dd" with the same large block sizes over the same device gets
much worse performance.

Expected Results:  I need 160MB/s "read" and "write".  I'd settle for
a 10% performance degradation in "reads".

Additional info:

Comment 1 Doug Ledford 2004-12-03 11:20:13 UTC
Larry Woodman and a few other people have been working on this problem
more than I have.  It is not necessarily a SCSI issue.  You're correct that
sg_dd will skip anything that could fragment the requests, but when
using dd the fragmentation is likely a result of the I/O elevator and
VM subsystems more than the SCSI subsystem.  Larry, if you
disagree, then simply toss this back at me.

Comment 2 Chris Worley 2004-12-05 05:35:31 UTC
Created attachment 107910 [details]
Graphs comparing RHEL to RH7.3 SCSI/FC performance

I'm adding some data I've collected; see the attached PDF.

The left column of three graphs shows RHEL3U2 results, the right column of
three graphs shows RH7.3 (I forget which kernel... I think 2.4.19) results. All
on the same hardware: DDN S2A8000 8-port SANs, 3GHz I/O and Compute nodes, I/O
nodes w/ QLogic FC2 cards, channel bonded GigE, and running GFS; the Compute
Nodes exporting via NFS the GFS partitions in a ratio of 8 Compute Nodes per
I/O Node. A Foundry FastIron 1500 switch is used for GigE/NFS (the Foundry
switch firmware has also been upgraded in this time frame).

The left and right graphs are comparable, even if the captions sound a bit
different.

The top two graphs show I/O (GFS) node perspective (GigE/NFS not involved). 
The recent graphs with RHEL (left top) show up to 16 I/O nodes; the RH7.3 graph
(right top) only shows eight.  Write performance has doubled in RHEL vs. 7.3...
from a maximum of 550MB/s to 1.1GB/s!  This is pushing the limits of the DDN
S2A8000 (sgp_dd can get about 1.3GB/s writes).

The read performance, on the other hand, has decreased significantly.  In RH7.3
I had been getting an aggregate performance of ~350MB/s. In RHEL, I'm getting
200MB/s.  The sgp_dd read performance aggregate (across 8 FC2 ports) is
~900MB/s for the DDN S2A8000.

The middle two graphs show the Compute Node perspective.  The Compute Nodes are
NFS clients for the GFS partitions served by the I/O nodes.    The 16 and 32
client nodes on the X-axis (of the left RHEL graph) map Compute Nodes to I/O
nodes 1-to-1 and 2-to-1 (one NFS client per NFS/GFS server, and two NFS Clients
per NFS/GFS server).  These two sets of columns are comparable to the X-axis on
the right graph at 8 and 16 Compute Nodes (as the RH7.3 system only had 8 I/O
nodes).

The middle graphs only show that NFS carries through the performance seen on
the I/O nodes.

The bottom two graphs show the scalability from one to eight Compute Nodes that
are all NFS clients of one I/O node.

In the RH7.3 case (right graph) you could expect the same NFS performance for
both reads and writes, comparable to the GFS performance of the I/O node.

In RHEL, the I/O node write performance of 170MB/s slumps to a Compute node
aggregate of 100MB/s, even though the I/O nodes use channel-bonded (802.3ad)
GigE.

Also in RHEL, the read performance doesn't reach 90% of the I/O node's read
performance until all eight NFS client Compute Nodes are invoked.  This
compounds the already low maximum.

All numbers were gathered using IOzone with 512KB blocks, and file sizes
of twice the memory size.
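
For reference, an IOzone run of that shape would look roughly like the
following (the file path is a placeholder and the 8g size stands in for twice
the node's RAM; -i 0 is the write test, -i 1 the read test):

  iozone -i 0 -i 1 -r 512k -s 8g -f /mnt/gfs/iozone.tmp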

Comment 3 Chris Worley 2004-12-06 14:34:49 UTC
I was going back through the data on the attached PDF and found an
error...

The charts shown for RH7.3 were taken prior to channel bonding the I/O (GFS)
nodes, and prior to using 9K MTUs.  Once those features were enabled,
the RH7.3 numbers on the center-right chart, for example, peaked at
650MB/s for writes, and over 400MB/s for reads.

So, half the read speed was lost in the move to RHEL.
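
For anyone reproducing that setup, the 802.3ad bonding and 9K MTU are the
usual RHEL-era config-file changes; roughly as follows (interface names and
the miimon value are illustrative, not taken from this system):

  # /etc/modules.conf
  alias bond0 bonding
  options bond0 mode=4 miimon=100    # mode 4 = 802.3ad

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BOOTPROTO=none
  ONBOOT=yes
  MTU=9000

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (one per slave NIC)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none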

Comment 4 Gary Mansell 2004-12-30 15:39:32 UTC
I have just upgraded a fileserver from RH7.3 to RHEL3.0 and have
discovered a huge read performance hit.

I have a Dell PE2650 attached via two QLogic QLA2310F FCAL cards to
the storage device, which is a Dell/EMC FC4500. I have installed the
latest BIOS 1.42 on the FCAL cards and am using the QLogic 7.0.3
driver instead of the Red Hat supplied one. I am running kernel
2.4.21-9.0.1.ELsmp.

I used to get about 50MB/s write and 60MB/s read speeds.

I am now getting about 50MB/s write and 20MB/s read speeds.

Read speed is now only a third of what it was.

Comment 5 Chris Worley 2004-12-30 15:58:21 UTC
Gary:

It would help to get sgp_dd performance numbers, as the difference
between sgp_dd and dd implicates the SCSI layer rather than the SAN.

You'll probably drool when you see what you could be getting!

I think the main problem is that the Linux SCSI layer has been tuned
for SCSI adapters and left FC considerations in the dust.  While FC is
not the culprit (which sgp_dd shows), with a SCSI adapter and a JBOD I
can get great dd read/write performance (i.e. 300MB/s)... but
whatever the kernel is doing to boost SCSI adapters affects FC adversely.

Comment 6 Gary Mansell 2004-12-30 16:09:45 UTC
Chris,

How do I go about getting sgp_dd performance numbers?

Regards

Gary

Comment 7 Chris Worley 2004-12-30 16:13:22 UTC
See:

http://sg.torque.net/sg/u_index.html

sg_dd is like dd, while sgp_dd shows the effect of multiple threads.

You'll need "scsi generic" either built-in to your kernel, or the sg.o
module loaded.
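
For example, something along these lines (the device name and thread count
are illustrative; /dev/sg0 is assumed to be the sg alias of the FC LUN, and
bs=512 assumes a 512-byte logical block size):

  modprobe sg
  # single-threaded, 128KB per SCSI command (256 * 512-byte blocks)
  time sg_dd  if=/dev/sg0 of=/dev/null bs=512 bpt=256 count=8388608
  # the same transfer issued from 4 worker threads
  time sgp_dd if=/dev/sg0 of=/dev/null bs=512 bpt=256 count=8388608 thr=4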

Comment 8 Gary Mansell 2004-12-30 16:32:32 UTC
Hi,

I can report that I have now overcome this problem - look at bug
report 106771.

I came across this before - basically, the max-readahead value defaults
to 31, which is more suited to random access. For sequential
reads, set this to 256 (see the example below).
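
A minimal sketch of that change on a RHEL 3 (2.4) kernel, where the tunable
lives under /proc/sys/vm:

  # one-off change
  echo 256 > /proc/sys/vm/max-readahead

  # or, to persist across reboots, add to /etc/sysctl.conf:
  vm.max-readahead = 256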

I have done this and my read performance is back up to circa 60MB/s.

Comment 9 Chris Worley 2005-04-08 15:12:42 UTC
Created attachment 112856 [details]
Problem partially resolved in RHEL3U4 and GFS 6

Same system has been upgraded to RHEL3U4 and GFS 6, and the performance numbers
are compared to GFS 5.2/RHEL3U2 in the attached PDF.

The SCSI performance problems are solved.  If you compare the top two charts
in their lower left corners, you'll see the read speed is much better than it
was.  This is a single-threaded test, so 100MB/s is the max.  In multi-threaded
tests I can get 140MB/s reading on an I/O node.

Good Job RedHat!

But, "read" scalability is still an issue in GFS.  While the peak "read" speed
has improved slightly, it's not reflecting the available hardware (and now OS)
available performance.

So, RH7.3 GFS still beats RHEL in "read" scalability.

Comment 10 RHEL Program Management 2007-10-19 19:13:55 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.