Bug 599264

Summary: [Intel 6.0 Bug] RHEL, low fcoe throughput (~20MB) on out of box rhel6.0 kernel
Product: Red Hat Enterprise Linux 6
Reporter: john.r.fastabend
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: low
Version: 6.0
CC: agospoda, eric.w.multanen, jack.morgan, jane.lv, john.r.fastabend, john.ronciak, jvillalo, luyu, mchristi, mzywusko, nhorman, robert.w.love, rpacheco, yi.zou
Target Milestone: rc
Target Release: 6.0
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-06-09 19:12:46 UTC
Bug Blocks: 551128

Description john.r.fastabend 2010-06-03 00:53:40 UTC
Description of problem:

FCoE and iSCSI write throughput is extremely low, approximately 20-40MB/s, while read throughput appears more acceptable at ~200MB/s.  This is seen with the default kernel and default configuration in RHEL6.0 Beta snap5.  The tests are simple dd commands, yet even with these basic tests 200+MB/s can be seen on the scsi-misc kernel with a custom kernel config.

# dd if=/dev/zero of=/dev/sdj bs=512k count=900 oflag=direct

Steps to Reproduce:
1. configure iscsi/fcoe
2. dd if=/dev/zero of=/dev/sdj bs=512k count=900 oflag=direct
3. review throughput numbers
  
Actual results: 
throughput to remote disk ~20MB/s

Expected results:
throughput to remote disk expected 200+ MB/s or roughly 1.6Gbps.
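
For reference, here is a rough sketch of the reproduction steps above.  The interface name eth2, the iSCSI portal address, and /dev/sdj are placeholders for whatever the local environment uses; the write test is destructive, so /dev/sdj must be a scratch LUN.

# FCoE path (assumes fcoe-utils/lldpad are installed; copy the sample config for the CNA interface)
cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-eth2
service lldpad start
service fcoe start

# iSCSI path (alternative to FCoE; portal address is a placeholder)
iscsiadm -m discovery -t sendtargets -p 192.168.1.100
iscsiadm -m node --login

# write test (dd prints the throughput when it completes)
dd if=/dev/zero of=/dev/sdj bs=512k count=900 oflag=direct
# read test for comparison
dd if=/dev/sdj of=/dev/null bs=512k count=900 iflag=direct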

Comment 2 RHEL Program Management 2010-06-03 14:23:20 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Neil Horman 2010-06-03 14:52:09 UTC
email update I sent last night with status on this:

Hey all-

        On today's call I promised to give you all a status update on our
investigation of the performance issues we encountered at the plugfest, so here
it is.

        A few days ago Mike Christie had tried to recreate the problem locally,
and indicated that he was able to.  Performance testing here locally by gospo
and myself disagreed with that, which left us confused, to say the least.  Today,
mattz and I reinstalled the Intel system we were using (due to various
accessibility issues) and retried our test.  On a whim I extended the netperf
test time to 200 seconds, and I got some interesting results, shown below:


[root@intel-s3ea2-08 x86_64]# netperf -H 10.16.47.27 -D 5,5 -l 200 -t TCP_STREAM
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.16.47.27
(10.16.47.27) port 0 AF_INET : histogram : interval : demo
Interim result: 0.51 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.01 seconds
Interim result: 0.66 GBytes/s over 5.03 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.04 seconds
Interim result: 0.62 GBytes/s over 5.35 seconds
Interim result: 0.63 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.03 GBytes/s over 109.03 seconds   <===!!!!!!!NOTE
Interim result: 0.65 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.01 seconds
Interim result: 0.66 GBytes/s over 5.03 seconds
Interim result: 0.66 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.00 seconds
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    GBytes/sec

 87380  16384  16384    200.00      0.31

Excluding outliers we passed approximately .66 GBytes/s, which is just over
5Gbit/s, and on an untuned system that seems to be pretty reasonable
throughput.  Note, however, that we have one sample that took 109 seconds to
record, and over that interval we only averaged .03 GBytes/s.  This drags our
average _way_ down, and I think it is a factor in our performance issue.  I was
able to recreate this condition consistently.  It seems that periodically the
ixgbe card simply stops passing traffic for some amount of time.  I assert
that because, during the periods when netperf sees this 'hiccup', all other
hosts reached over this interface are unreachable as well.  Eventually the
interface recovers and normal throughput resumes, but if we leave the test
running, another hiccup will be encountered, further dragging down our average
throughput.
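
A quick way to spot these stalls automatically is to flag any interim sample whose interval runs far past the requested 5 seconds.  This is just a rough sketch; the awk field positions assume the "Interim result: X GBytes/s over Y seconds" format shown above.

netperf -H 10.16.47.27 -D 5,5 -l 200 -t TCP_STREAM -f G | \
    awk '/^Interim result:/ && $6 + 0 > 10 { print "HICCUP -> " $0; next } { print }'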


I was speaking with gospo about this, and while he was poking around on the
system we were using, he noted the following in the dmesg log:

[root@intel-s3ea2-08 x86_64]# dmesg | tail -20
pcieport 0000:00:03.0: PCIE Bus Error: severity=Corrected, type=Data Link Layer,
id=0018(Transmitter ID)
pcieport 0000:00:03.0:   device [8086:340a] error status/mask=00001000/00002000
pcieport 0000:00:03.0:    [12] Replay Timer Timeout
pcieport 0000:00:03.0: AER: Corrected error received: id=0018
pcieport 0000:00:03.0: PCIE Bus Error: severity=Corrected, type=Data Link Layer,
id=0018(Transmitter ID)
pcieport 0000:00:03.0:   device [8086:340a] error status/mask=00001000/00002000
pcieport 0000:00:03.0:    [12] Replay Timer Timeout
pcieport 0000:00:03.0: AER: Corrected error received: id=0018
pcieport 0000:00:03.0: PCIE Bus Error: severity=Corrected, type=Data Link Layer,
id=0018(Transmitter ID)
pcieport 0000:00:03.0:   device [8086:340a] error status/mask=00001000/00002000
pcieport 0000:00:03.0:    [12] Replay Timer Timeout
pcieport 0000:00:03.0: AER: Corrected error received: id=0018
pcieport 0000:00:03.0: PCIE Bus Error: severity=Corrected, type=Data Link Layer,
id=0018(Transmitter ID)
pcieport 0000:00:03.0:   device [8086:340a] error status/mask=00001000/00002000
pcieport 0000:00:03.0:    [12] Replay Timer Timeout
pcieport 0000:00:03.0: AER: Corrected error received: id=0018
pcieport 0000:00:03.0: PCIE Bus Error: severity=Corrected, type=Data Link Layer,
id=0018(Transmitter ID)
pcieport 0000:00:03.0:   device [8086:340a] error status/mask=00001000/00002000
pcieport 0000:00:03.0:    [12] Replay Timer Timeout


It seems this correlates to the above 'hiccups'.  So while we can't say with any
certainty what's going on yet, it seems that (at least in our test
environment) these errors are responsible for our bad throughput.
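
One simple way to test that correlation (a rough sketch, matching only the corrected AER message shown above) is to count the Replay Timer Timeouts before and after a run:

before=$(dmesg | grep -c 'Replay Timer Timeout')
netperf -H 10.16.47.27 -l 200 -t TCP_STREAM -f G > /dev/null
after=$(dmesg | grep -c 'Replay Timer Timeout')
echo "Replay Timer Timeouts logged during the run: $((after - before))"

If the count only climbs during runs that show a hiccup, that is decent evidence the bus errors and the stalls are the same event.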

We'll keep digging here of course, but given this information I think it would
be good if:


1) The systems used at plugfest were brought back online and retested to see if
these errors were occurring there as well (to validate that we're seeing the same
problem)

2) Someone from Intel could provide some comment as to why we might be
getting bus errors like the above, and what (if anything) we can do about it


JohnF will check on the condition of the plugfest systems when they are available again, and has queried the PCI people at Intel for ideas as to what might cause the above errors.  We will try to verify whether the systems here are pre-production, and attempt to find some production-grade hw to test with further.

Comment 4 Neil Horman 2010-06-04 13:20:18 UTC
mattz contacted me yesterday to let me know that Intel had been out to the office and replaced the Niantics in intel-s3ea2-08.  With the new CNAs I am able to consistently see >5Gbit/s on the cards using netperf:

[root@intel-s3ea2-08 ~]# netperf -H 10.16.47.27 -D 5,5 -l 200 -t TCP_STREAM -f G
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.16.47.27 (10.16.47.27) port 0 AF_INET : histogram : interval : demo
Interim result: 0.38 GBytes/s over 5.00 seconds
Interim result: 0.62 GBytes/s over 5.36 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.04 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.04 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.01 seconds
Interim result: 0.63 GBytes/s over 5.32 seconds
Interim result: 0.66 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.05 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.01 seconds
Interim result: 0.66 GBytes/s over 5.03 seconds
Interim result: 0.66 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.03 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.04 seconds
Interim result: 0.66 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.05 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.64 GBytes/s over 5.17 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.66 GBytes/s over 5.04 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.64 GBytes/s over 5.21 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Interim result: 0.67 GBytes/s over 5.00 seconds
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    GBytes/sec  

 87380  16384  16384    200.00      0.65


That goes to >6Gbit/s if I perform the UDP_STREAM test.
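
The exact UDP invocation isn't captured here; the line below is simply the TCP command above with the test type swapped.  Note that UDP_STREAM reports both local send and remote receive throughput, so the receive-side figure is the one to compare.

netperf -H 10.16.47.27 -D 5,5 -l 200 -t UDP_STREAM -f G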

I can reproduce this throughput consistently.  So it would seem the performance impact that we saw previously is confirmed to be due to the pre-production hardware.  At this point we need to:

1) See if block I/O (i.e. fcoe and iscsi) performance improves as a result

2) Confirm that the plugfest systems encountered the same perf-impacting issue that we did

JohnF is looking into (2); I'm going to set up fcoe and iscsi to test (1) using iozone.

I know Mike Christie had noted that he saw degraded performance on CNAs from multiple vendors, which suggests that something else may be going on.  Mike, if you could post access info for an exemplar system set that I could test with as well, that would be great.

Comment 5 Neil Horman 2010-06-04 19:40:13 UTC
Note: minor delay; after the Niantic upgrade and re-install of the latest snapshot, the fcoe config has disappeared.  Mattz and I are looking into restoring it now.

Comment 6 Neil Horman 2010-06-08 17:49:48 UTC
mattz, any update on the fcoe config for these systems?

John F: Do we have any update on the hw condition of the systems at plugfest?  Can their throughput be attributed to back-level hw?
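
If it helps with that comparison, a quick sketch for capturing the adapter and firmware revision on the plugfest boxes (eth2 is a placeholder for the CNA interface name):

lspci -nn | grep -i ethernet              # PCI device and revision IDs of the CNAs
ethtool -i eth2                           # driver and firmware-version strings
dmesg | grep -c 'Replay Timer Timeout'    # whether the AER errors from the earlier system appear here too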

Comment 7 john.r.fastabend 2010-06-09 15:33:43 UTC
The plugfest hw just got in today; I'll look at it when it gets unpacked.  However, it is production HW.

Comment 8 Matt Zywusko 2010-06-09 16:02:18 UTC
Neil: Still trying to get EngOps to give one of the new Niantics access to the VLAN and add the ports to the Cisco zoneset. I'm hoping this will be done today, but I'll escalate it this afternoon.

Comment 9 Neil Horman 2010-06-09 17:52:09 UTC
Barry Donahue pointed me at an alternate system (storageqe3) that had up-to-date ixgbe hw in place that I could use while the original test bed gets sorted out.  I'd still like to run some netperf tests on it to see raw networking throughput, but this is the result of this iozone command:
iozone -R -l 5 -u 5 -r 512k -s 100m -F ./1 ./2 ./3 ./4 ./5

It was run on an ext4-formatted partition on an fcoe-exported LUN.  The results were:
"Throughput report Y-axis is type of test X-axis is number of processes"
"Record size = 512 Kbytes "
"Output is in Kbytes/sec"

"  Initial write "  845263.18 

"        Rewrite "  720649.84 

"           Read " 6181272.56 

"        Re-read " 6281995.94 

"   Reverse Read " 6184136.69 

"    Stride read " 6366550.94 

"    Random read " 6410678.44 

" Mixed workload " 5770597.31 

"   Random write "  880244.67 


I'm able to reproduce these results consistently.  The slowest test was the rewrite test, which clocked in at 720649 kilobytes/second; that works out to roughly 5.9 Gbit/sec, or well over half of the CNA's 10Gbit capacity.  I'm not sure what sort of throughput the back-end storage saturates at, so I'm not sure how much better we can do, but this is clearly superior to the ~20MB/s throughput that this bug reports.  I'll do some netperf testing to confirm, but at this point I think the original report can be attributed to pre-production hardware.
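
For reference, the conversion from iozone's Kbytes/sec to line rate (a quick sketch, assuming iozone's Kbytes are 1024-byte units):

echo 720649 | awk '{ printf "%.1f Gbit/s\n", $1 * 1024 * 8 / 1e9 }'
# prints roughly 5.9 Gbit/s for the rewrite result above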

Comment 10 Neil Horman 2010-06-09 19:12:46 UTC
I've done some raw netperf testing on some new test systems here, and here are the results:

[root@storageqe-03 ~]# netperf -H 10.16.47.187 -D 5,5 -l 200 -t TCP_STREAM -f G
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.16.47.187 (10.16.47.187) port 0 AF_INET : histogram : interval : demo
Interim result: 0.42 GBytes/s over 5.28 seconds
Interim result: 0.44 GBytes/s over 5.00 seconds
Interim result: 0.42 GBytes/s over 5.28 seconds
Interim result: 0.42 GBytes/s over 5.30 seconds
Interim result: 0.42 GBytes/s over 5.04 seconds
Interim result: 0.45 GBytes/s over 5.28 seconds
Interim result: 0.45 GBytes/s over 5.26 seconds
Interim result: 0.42 GBytes/s over 5.44 seconds
Interim result: 0.45 GBytes/s over 5.00 seconds
Interim result: 0.41 GBytes/s over 5.46 seconds
Interim result: 0.42 GBytes/s over 5.11 seconds
Interim result: 0.50 GBytes/s over 5.00 seconds
Interim result: 0.44 GBytes/s over 5.66 seconds
Interim result: 0.43 GBytes/s over 5.16 seconds
Interim result: 0.44 GBytes/s over 5.35 seconds
Interim result: 0.44 GBytes/s over 5.00 seconds
Interim result: 0.43 GBytes/s over 5.08 seconds
Interim result: 0.44 GBytes/s over 5.45 seconds
Interim result: 0.44 GBytes/s over 5.06 seconds
Interim result: 0.43 GBytes/s over 5.06 seconds
Interim result: 0.41 GBytes/s over 5.26 seconds
Interim result: 0.43 GBytes/s over 5.00 seconds
Interim result: 0.44 GBytes/s over 5.36 seconds
Interim result: 0.42 GBytes/s over 5.18 seconds
Interim result: 0.42 GBytes/s over 5.06 seconds
Interim result: 0.41 GBytes/s over 5.08 seconds
Interim result: 0.46 GBytes/s over 5.08 seconds
Interim result: 0.43 GBytes/s over 5.31 seconds
Interim result: 0.45 GBytes/s over 5.00 seconds
Interim result: 0.43 GBytes/s over 5.31 seconds
Interim result: 0.41 GBytes/s over 5.18 seconds
Interim result: 0.46 GBytes/s over 5.00 seconds
Interim result: 0.44 GBytes/s over 5.26 seconds
Interim result: 0.44 GBytes/s over 5.00 seconds
Interim result: 0.43 GBytes/s over 5.10 seconds
Interim result: 0.43 GBytes/s over 5.05 seconds
Interim result: 0.45 GBytes/s over 5.00 seconds
Interim result: 0.45 GBytes/s over 5.07 seconds
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    GBytes/sec  

 87380  16384  16384    200.00      0.43

It's not perfect (I likely need to do some tuning on the systems in question), but regardless we're consistently getting roughly 3.5Gbit/s through the NIC, which is a vast improvement over the originally reported 20Mbyte/sec throughput.

As such, I think we can close this as NOTABUG, unless someone is able to reproduce the degraded throughput on a system I can investigate on.

Comment 11 john.r.fastabend 2010-06-09 19:20:20 UTC
Neil, let me verify the production hw we had at plugfest before we close this, if that is alright.  I should get to this today, thanks.

Comment 12 john.r.fastabend 2010-06-09 19:22:25 UTC
Looks like it's already been closed; that's fine, I'll retest here and go from there.