Bug 102679 - LTC3931-[perf][tpch][BETA] raw IO on RHEL3 B1 degrades
Summary: LTC3931-[perf][tpch][BETA] raw IO on RHEL3 B1 degrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Tom Coughlan
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 101028 103278
 
Reported: 2003-08-19 20:05 UTC by IBM Bug Proxy
Modified: 2007-11-30 22:06 UTC (History)
5 users

Fixed In Version: RHEL 3 gold
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-11-03 18:26:50 UTC
Target Upstream Version:
Embargoed:


Attachments
attach-oprofile_231_4G_dd.kernel (314.63 KB, text/plain) - 2003-08-19 20:09 UTC, IBM Bug Proxy
attach-oprofile_389_4G_dd.kernel (326.20 KB, text/plain) - 2003-08-19 20:10 UTC, IBM Bug Proxy
attach-config.as30b1_389 (18.90 KB, text/plain) - 2003-08-19 20:12 UTC, IBM Bug Proxy

Description IBM Bug Proxy 2003-08-19 20:05:49 UTC
The following has been reported by the IBM LTC:  
[perf][tpch][BETA] raw IO on RHAS30 B1 degrades
Please fill in each of the sections below.

Hardware Environment: IBM x440 8way, 16GB Ram, (8) qlogic 2310s, 8 FastT200s

Software Environment: RHAS30 Beta 1 build 389


Steps to Reproduce:
1. setup raw devices to partitions on qlogic attached disks
2. Run DDs to raw partitions (20 raw devices used):
         dd if=/dev/raw/raw<x> of=/dev/null bs=262144 count=4000
3. Run vmstat to see BI rates of ~285000/second, CPU time sys=0, kernel=1, 
idle=0, iowait=99

On RHAS30 alpha4 (build 231), IO rates using the same steps to reproduce had BI 
rates of ~385000/sec, CPU time sys=0, kernel=7, idle=97, iowait 0.

Attached are kernel oprofiles for both the 231 and 389 runs.

Created an attachment (id=1387): oprofile kernel info for build 231
Created an attachment (id=1388): oprofile kernel info for build 389
Created an attachment (id=1389): config used to build kernel 389

Glen/Greg - please submit this RHEL 3 Beta1 bug to Red Hat.  Thanks.

Comment 1 IBM Bug Proxy 2003-08-19 20:09:31 UTC
Created attachment 93762 [details]
attach-oprofile_231_4G_dd.kernel

Comment 2 IBM Bug Proxy 2003-08-19 20:10:54 UTC
Created attachment 93763 [details]
attach-oprofile_389_4G_dd.kernel

Comment 3 IBM Bug Proxy 2003-08-19 20:12:08 UTC
Created attachment 93764 [details]
attach-config.as30b1_389

Comment 4 Stephen Tweedie 2003-08-28 16:36:14 UTC
Why is there a config attached here --- did you create your own private build
rather than using a Red Hat one?  We can't support custom builds, especially of
beta kernels.

Please repeat the tests with a recent kernel from the RHN sushi channel --- the
B1 kernels had 4g/4g enabled plus debug enabled, which slowed stuff down.  It
looks like those were off in the config you attached, but please retry with a
recent -smp (not hugemem) kernel so that it's unambiguous --- there are also
some other fixes in later kernels which might be relevant.

Comment 5 IBM Bug Proxy 2003-08-30 17:03:43 UTC
------ Additional Comments From mksully.com  2003-30-08 10:29 -------
Just an update on the reconfirmation of this problem on build 399. 
 
Because the qlogic driver doesn't complete its Inquiry correctly on a sparse 
lun configuration without CONFIG_SCSI_MULTI_LUN set, it was necessary to 
rebuild the kernel. I used the i686-smp config from /configs as a base, changing 
only the minimum required to build. 
 
Unfortunately even with this kernel IO rates are about 120MB/sec (where we 
normally get > 700 MB/sec). This is lower than the build we originally 
reported the problem on (389). 
 
If you have a specific binary for us to try could you make sure 
CONFIG_SCSI_MULTI_LUN is enabled? 
 
Other Notes: 
1. qlogics still fail (even on single CEC) during install (defect 3761 
updated). 
2. Needed to turn off mod versions to get the qla2300 module to build 
successfully. 

Comment 6 IBM Bug Proxy 2003-08-30 17:06:30 UTC
------ Additional Comments From mksully.com  2003-30-08 11:50 -------
I added options scsi_mod max_scsi_luns=255 to modules.conf and rebuilt the 
initrd file. With this I was able to boot the binary entsmp kernel and see all 
of the qlogic disks. Unfortunately the dd's to the devices still had a read 
throughput of about 125MB/sec where we normally see > 700MB/sec. This is a 
severe degradation from the a4 version of the kernel. 

Comment 9 IBM Bug Proxy 2003-09-05 13:12:33 UTC
------ Additional Comments From mksully.com  2003-04-09 23:26 -------
I updated to the 414 build and the problem still occurs. I also noticed that I 
don't see the same degradation when doing raw reads to the disks attached to 
the aic7xxx on-board controller. I dug out an older (6.05.60) version 
of the qla2xxx driver and integrated it into the build tree. With this earlier 
version of the driver the problem no longer occurs on the qlogic 2310 attached 
disks. Maybe the 6.06 version of the driver provided as an addon is buggy? 

Comment 10 Stephen Tweedie 2003-09-05 14:53:31 UTC
We now enable IRQ mitigation on the qla2300 driver by default, and we suspect
that that's the changed factor here.

IRQ mitigation is a significant performance win under load, but the extra
latency it adds to individual IOs means it will show up as a performance
degradation under single-client sequential raw IO testing.

You can change the controller's IRQ delay when you load the qla2300.o module,
though: give it a "ql2xintrdelaytimer=<n>" module parameter to set the IRQ
latency to n*100usec.  The default is currently 3 (ie. 300usec); for testing raw
IO bandwidth you should be able to set it to 0 to disable IRQ mitigation.

We suspect that on a more realistic performance test, you'll get better
performance with the mitigation turned on, though.

Can you verify that this helps, please?
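
For reference, a minimal sketch (illustration only, not qla2300 driver code) of
the n*100 usec relationship described above, using the three parameter values
exercised later in this bug:

/* Hypothetical back-of-the-envelope sketch, not driver code: it only
 * restates the n * 100 usec semantics of ql2xintrdelaytimer. */
#include <stdio.h>

int main(void)
{
    int settings[] = { 0, 1, 3 };          /* module parameter values discussed here */
    for (int i = 0; i < 3; i++) {
        /* extra delay before the HBA raises a completion interrupt */
        int delay_usec = settings[i] * 100;
        printf("ql2xintrdelaytimer=%d -> up to %d usec of added completion latency\n",
               settings[i], delay_usec);
    }
    return 0;
}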

Comment 11 IBM Bug Proxy 2003-09-05 16:39:45 UTC
------ Additional Comments From mksully.com  2003-05-09 11:52 -------
I set ql2xintrdelaytimer to zero and it actually degraded a bit more (~9%). I 
verified by the message in /var/log/messages that the parameter was used. 

I'm also trying this on a large database benchmark and it also performs poorly 
with very low I/O rates similar to what we saw in the plain dds. 

Comment 12 IBM Bug Proxy 2003-09-05 16:46:49 UTC
------ Additional Comments From mksully.com  2003-05-09 12:00 -------
The degradation occurs when using the 6.06.00b11 version of the driver shipped 
with the beta. We are driving eight qla2310 adapters each attached to an IBM 
FastT200 enclosure with 10 physical disks, 80 disks in all.  Just using dd to do 
raw reads exposes the problem.  As a test, I replaced the shipped driver with 
the 6.06.00b12 version built against the beta tree and the degradation goes 
away. 

Comment 13 IBM Bug Proxy 2003-09-05 18:24:31 UTC
I am looking into why QA Contact is being removed.

Comment 14 IBM Bug Proxy 2003-09-05 20:15:18 UTC
------ Additional Comments From mksully.com  2003-05-09 15:45 -------
The full database run completed successfully with good performance on build 414 
with the 6.06.00b12 driver substituted for the one included in the distro. 

Comment 15 Rik van Riel 2003-09-05 21:00:23 UTC
ok, good to know that this isn't a bug in Taroon

thank you for testing

Comment 16 IBM Bug Proxy 2003-09-05 21:44:37 UTC
------ Additional Comments From mksully.com  2003-05-09 17:24 -------
It appears that RH inadvertently closed this bug. Please reopen with the 
following response:

But it is a bug. The qlogic driver shipped in taroon (6.06.0011b) appears to be 
broken. The use of 6.06.0012b was just to demonstrate that the earlier version 
of the driver was probably the culprit. 

Comment 17 IBM Bug Proxy 2003-09-05 21:45:55 UTC
------ Additional Comments From mksully.com  2003-05-09 17:35 -------
But it is a bug. The qlogic driver shipped in taroon (6.06.0011b) appears to be 
broken. The use of 6.06.0012b was just to demonstrate that the earlier version 
of the driver was probably the culprit. 

Comment 18 Matt Wilson 2003-09-11 20:21:47 UTC
Note, RHEL 3 not RHAS30

Comment 19 Stephen Tweedie 2003-09-11 20:42:37 UTC
Could you try with ql2xintrdelaytimer=1?  The firmware may be interpreting "0"
as a wrapped "infinity" value.


Comment 20 IBM Bug Proxy 2003-09-12 04:29:04 UTC
------ Additional Comments From mksully.com  2003-11-09 21:45 -------
I tried ql2xintrdelaytimer=1 on the 6.06.00b11 driver and it didn't help. I/O 
was still low. 

Comment 21 Stephen Tweedie 2003-09-12 08:07:34 UTC
But how did it compare against =3 and =0?  I'm trying to work out how much of
the observed performance is down to the IRQ mitigation, and how much might be
another problem.

Comment 22 IBM Bug Proxy 2003-09-12 16:05:56 UTC
------ Additional Comments From mksully.com  2003-12-09 11:32 -------
Using "dd if=/dev/raw/raw<x> of=/dev/null bs=262144 count=4000&" to 40 raw 
devices I've attached the vmstat data for ql2xinterdelaytimer=0, 1, and 3.
Looks like 
0=~138MB/sec
1=~153MB/sec
3=~124MB/sec

[root@sambaperf tmp]# insmod qla2300 ql2xintrdelaytimer=0
Using /lib/modules/2.4.21-1.1931.2.399.entsmp/kernel/drivers/addon/qla2200/qla2300.o
[root@sambaperf db2inst1]# ./10dd_mln0.sh; ./10dd_mln1.sh
[root@sambaperf db2inst1]# vmstat 5 | tee vmstat_0.output
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 41      0 16378172   2204  23848    0    0    36     0    8     2  0  0 99  0
 0 40      0 16377940   2208  23852    0    0 138945     2 3200  1117  0  3 12 85
 0 40      0 16377940   2208  23852    0    0 138240     0 3199  1090  0  2 12 86
 1 40      0 16374248   2208  23852    0    0 140083     2 3226  1123  0  3 12 85
 0 40      0 16377776   2208  23852    0    0 138342     0 3201  1092  0  3 12 85
 1 39      0 16377776   2208  23852    0    0 137472     0 3175  1085  0  3 12 85
 0 40      0 16377748   2208  23852    0    0 138035    27 3205  1110  0  3 12 85
 0 40      0 16377748   2208  23852    0    0 137165     8 3164  1082  0  3 12 85
 0 40      0 16377560   2208  23852    0    0 137626     2 3185  1106  0  3 12 86
 0 40      0 16377560   2208  23852    0    0 138138     0 3202  1088  0  3  9 88


[root@sambaperf db2inst1]# insmod qla2300 ql2xintrdelaytimer=1
Using /lib/modules/2.4.21-1.1931.2.399.entsmp/kernel/drivers/addon/qla2200/qla2300.o
[root@sambaperf db2inst1]# ./10dd_mln0.sh; ./10dd_mln1.sh
[root@sambaperf db2inst1]# vmstat 5 | tee vmstat_1.output
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 40      0 16376248   2208  23892    0    0   135     0   11     3  0  0 98  1
 0 40      0 16376248   2212  23892    0    0 153857     0 3790  1210  0  3  0 97
 0 40      0 16374432   2212  23892    0    0 151757     2 3757  1214  0  3  0 97
 0 40      0 16374432   2212  23892    0    0 152730     0 3786  1202  0  4  0 96
 0 40      0 16374340   2212  23892    0    0 155443    53 3814  1245  0  4  0 96
 0 40      0 16374420   2212  23892    0    0 153907     0 3796  1216  0  3  0 97
 0 40      0 16374448   2212  23892    0    0 150067     2 3748  1207  0  4  0 96
 0 40      0 16374448   2212  23892    0    0 152576    26 3777  1202  0  3  0 97


[root@sambaperf db2inst1]# insmod qla2300 ql2xintrdelaytimer=3
Using /lib/modules/2.4.21-1.1931.2.399.entsmp/kernel/drivers/addon/qla2200/qla2300.o
[root@sambaperf db2inst1]# ./10dd_mln0.sh; ./10dd_mln1.sh
[root@sambaperf db2inst1]# vmstat 5 | tee vmstat_3.output
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 40      0 16371784   2212  23908    0    0   239     0   13     4  0  0 97  2
 0 40      0 16371784   2212  23908    0    0 122829    26 2728   970  0  2 13 85
 0 40      0 16371608   2212  23908    0    0 125491     2 2751  1009  0  2 13 85
 1 39      0 16371608   2212  23908    0    0 121651    51 2729   960  0  2 13 85
 0 40      0 16371420   2212  23908    0    0 124621     2 2747  1004  0  3  3 94
 1 39      0 16371420   2212  23908    0    0 124570     0 2739   982  0  2  0 98
 0 40      0 16369560   2212  23908    0    0 124467     2 2742  1003  0  2  0 98
 1 39      0 16369560   2212  23908    0    0 124979    23 2748   986  0  2  0 98
 0 41      0 16369560   2212  23908    0    0 123187     1 2727   973  0  2  0 98
 0 40      0 16371340   2212  23908    0    0 122573     1 2717   987  0  2  0 98
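
For readers skimming the raw output, a rough sketch of how the vmstat bi column
relates to the MB/sec summary above, assuming bi is approximately KB/sec (an
assumption, but it reproduces the quoted ~138/153/124 MB/sec figures):

/* Rough conversion of representative vmstat "bi" samples to MB/sec;
 * illustration only, assuming bi ~ KB/sec. */
#include <stdio.h>

int main(void)
{
    long bi[]    = { 138945, 153857, 122829 };  /* one sample per delay setting */
    int  delay[] = { 0, 1, 3 };
    for (int i = 0; i < 3; i++)
        printf("ql2xintrdelaytimer=%d: bi=%ld -> ~%ld MB/sec\n",
               delay[i], bi[i], bi[i] / 1000);
    return 0;
}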

Comment 23 IBM Bug Proxy 2003-09-17 14:37:21 UTC
------ Additional Comments From mksully.com  2003-17-09 10:35 -------
The execution_throttle parameter is being set to an invalid value in the SCSI 
host structure. 
1. During init they did:
   nv->execution_throttle = __constant_cpu_to_le16(16); (0x100 value).
2. During setup of the host structure they did:
   ha->execution_throttle = le16_to_cpu(nv->execution_throttle); (the 0x100 
   value remains). It doesn't appear that the macro did what was intended.
3. In qla2x00_device_queue_depth() they set
        int default_depth = max((int)64, (int)p->execution_throttle);
   (resulting in a value of 0x100 for default_depth)
4. Later, when they assign it with
	device->queue_depth = default_depth;
   it gets truncated to 0, since queue_depth is only 8 bits (a standalone 
   sketch of this truncation appears at the end of this comment).

This results in the slowdown. Since execution throttle doesn't seem to be a 
settable parameter, I suggest the following patch:

--- qla2x00.c.org	2003-09-18 11:48:30.000000000 -0500
+++ qla2x00.c	2003-09-18 11:48:42.000000000 -0500
@@ -4853,7 +4853,7 @@
 void
 qla2x00_device_queue_depth(scsi_qla_host_t *p, Scsi_Device *device)
 {
-	int default_depth = max((int)64, (int)p->execution_throttle);
+	int default_depth = 64;
 
 	device->queue_depth = default_depth;
 	if (device->tagged_supported) {

I tested this on my setup and IO was restored to expected levels. 
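
Here is the standalone sketch of the truncation referenced in step 4 above; the
struct is a stand-in with an 8-bit queue_depth field, not the real Scsi_Device
definition:

/* Stand-in illustration of steps 3-4 above; not the real driver structures. */
#include <stdio.h>

struct fake_scsi_device {
    unsigned char queue_depth;            /* 8-bit field, as described in step 4 */
};

int main(void)
{
    int execution_throttle = 0x100;       /* value observed coming out of NVRAM */
    int default_depth = (execution_throttle > 64) ? execution_throttle : 64;

    struct fake_scsi_device dev;
    dev.queue_depth = (unsigned char)default_depth;   /* 0x100 wraps to 0 */

    printf("default_depth=0x%x -> queue_depth=%u\n", default_depth, dev.queue_depth);
    return 0;
}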

Comment 24 Stephen Tweedie 2003-09-17 16:46:55 UTC
I don't follow the analysis:

1. During init they did:
nv->execution_throttle  = __constant_cpu_to_le16(16); (0x100 value).
2. During setup of the host structure they did:
ha->execution_throttle = le16_to_cpu(nv->execution_throttle); (0x100 value 
remains). It doesn't appear that the macro did what was intended.

But i386 is little-endian already; "__constant_cpu_to_le16(16)" should evaluate
to 16, not "0x100 value".  And the file in question is already including
<asm/byteorder.h>, which should set up the right endian definitions for all the
cpu_to_le* macros.


Comment 25 IBM Bug Proxy 2003-09-17 17:07:47 UTC
------ Additional Comments From mksully.com  2003-17-09 13:04 -------
I agree that it should work but it doesn't. To verify I rebuilt the 414 tree 
with printk output of the execution_throttle value and it shows it to be 0x100. 
I guess I can dig deeper into the endian macros but since the max assignment in 
qla2x00_device_queue_depth will always be 64 should we bother? 

Comment 26 Stephen Tweedie 2003-09-17 17:17:55 UTC
Well, a byte-swap of 16 is 0x1000, not 0x100, so that's not what's happening.

And the assignment

int default_depth = max((int)64, (int)p->execution_throttle);

enforces a _minimum_ of 64, not a maximum, so clipping it there may harm
performance.

Arjan has suggested that clipping the queue depth to a maximum of 255 might be a
better solution.

Comment 27 IBM Bug Proxy 2003-09-17 17:27:19 UTC
------ Additional Comments From mksully.com  2003-17-09 13:26 -------
Good points. Are you suggesting a direct assignment of 255, like this?

int default_depth = 255;

or this?

int default_depth = max((int)255, (int)p->execution_throttle); 

Comment 28 Arjan van de Ven 2003-09-17 17:29:36 UTC
+       int default_depth = min(max((int)64, (int)p->execution_throttle), 255);


Comment 29 IBM Bug Proxy 2003-09-17 18:12:16 UTC
------ Additional Comments From mksully.com  2003-17-09 14:10 -------
Yep. Works for me. I dug deeper on the original error, and it appears
that in qla2x00_nvram_config() the value of 0x100 is being read directly out of 
nvram into the nvram22_t buffer. You're right, no endianness issues are involved 
(the code path that used the endianness macros would only have been executed if 
the nvram data was in error). The truncation that was occurring when the 
integer default_depth was being assigned into the 8-bit 
device->queue_depth was happening on the raw nvram value. 
Wrapping it in the min statement should take care of that.

+int default_depth = min(max((int)64, (int)p->execution_throttle), 255); 
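
For completeness, a small sketch (stand-in code, not the driver) of how the
proposed min/max clamp behaves for representative throttle values: the bogus
0x100 maps to 255, small values are raised to 64, and the subsequent 8-bit
assignment can no longer wrap to 0.

/* Behavior of the proposed min(max(64, x), 255) clamp for sample inputs. */
#include <stdio.h>

static int clamp_depth(int execution_throttle)
{
    int d = (execution_throttle > 64) ? execution_throttle : 64;   /* max(64, x)  */
    return (d < 255) ? d : 255;                                    /* min(d, 255) */
}

int main(void)
{
    int samples[] = { 16, 64, 128, 0x100 };
    for (int i = 0; i < 4; i++)
        printf("execution_throttle=0x%x -> default_depth=%d\n",
               samples[i], clamp_depth(samples[i]));
    return 0;
}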

Comment 30 Tom Coughlan 2003-09-17 19:58:40 UTC
This fix is checked in to RHEL 3.

Comment 31 IBM Bug Proxy 2005-03-02 21:01:11 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          QAContact|khoa.com             |corryk.com




------- Additional Comments From corryk.com (prefers email via kevcorry.com)  2005-03-02 15:57 EST -------
Hi Mike,

Please verify that this fix is included in the latest RHEL3 Update. If so, go
ahead and close this bug. Thanks! 

Comment 32 mark wisner 2005-11-03 17:16:16 UTC
This bug is closed on the IBM side.

Comment 33 Ernie Petrides 2005-11-03 20:51:25 UTC
A fix for this problem was committed to the RHEL3 U3 patch pool
on 17-Jun-2004 (in kernel version 2.4.21-15.13.EL).

It was released in U3 with the following Errata System message:

 "An errata has been issued which should help the problem 
  described in this bug report. This report is therefore being 
  closed with a resolution of ERRATA. For more information
  on the solution and/or where to find the updated files, 
  please follow the link below. You may reopen this bug report 
  if the solution does not work for you.

  http://rhn.redhat.com/errata/RHBA-2004-433.html"


Obviously, now it would be more appropriate to upgrade to U6, which is here:

  http://rhn.redhat.com/errata/RHSA-2005-663.html


