Description of problem: The customer is running into performance problems when running Informix. They report that a RHEL 3.0 host (2.4.21-4.EL and 2.4.21-9.0.1.EL) slows down and pauses when running Informix I/O over a Fibre Channel attach, but performance is acceptable when the application runs on local disk. They have tried this with both the v6.06.00b11 and v6.07.02 QLogic drivers, and both with and without PowerPath. The summary is as follows: when running PowerPath with AS 2.1, starting an Informix archive had a permanent impact on the performance of the host. When running PowerPath with 3.0, system performance returns to normal once the archiving process finishes. Lastly, and most importantly, the customer removed PowerPath from the 3.0 environment and the SAME THING HAPPENED: system performance returns to normal once the archiving process finishes. This is very readily reproducible at the customer site. Does this appear to Red Hat to be related to the __make_request bug that is being fixed in RHEL 3.0 U2? Version-Release number of selected component (if applicable): kernel 2.4.21-4.EL and 2.4.21-9.0.1.EL; QLogic drivers v6.06.00b11 and v6.07.02. How reproducible: Very readily reproducible at the customer site.
Hello, Heather. I'm a bit confused about this bug report. It seems that once archiving completes, performance returns to normal. Isn't that expected/desirable? Also, I don't have any record of a "__make_request bug" that is being fixed in RHEL3 U2. Could you please refer to a specific bugzilla id or changelog entry to identify this? (I'm tentatively reassigning this to Doug Ledford.) Thanks. -ernie
Created attachment 99603 [details] diff file between ll_rw_blk.c in v2.4.21-9.0.1.EL vs. v2.4.21-13.EL The changes referred to are between the ll_rw_blk.c file in v2.4.21-9.0.1.EL and v2.4.21-13.EL, specifically in the areas of the request queue CPU masks and when the queue gets plugged/unplugged.
The changelog refers to disabling the linux-2.4.21-scsi-affine-queue patch.
Doug/Ernie: On the EMC status call today, EMC asked for any insight from us on whether the changes made to ll_rw_blk.c between 2.4.21-9.0.1.EL and -13.EL could be related to this issue. Thanks!
Per the customer, they have downloaded and installed the RPMs for RHEL 3.0 U2. Unfortunately, the new kernel does not appear to have resolved the problem. Do you have any idea what the issue may be and how to resolve it? Please let me know if additional information is required. Thanks. Heather
Any suggestions/comments on what may be the issue at this customer site? Any feedback is appreciated. Thanks.
OK, the problem statement here is not clear, so let me ask for clarification. Is the problem that AS2.1 has permanently poor performance after starting an Informix archive operation and is therefore unusable, while RHEL3 U2 has temporarily poor performance during an Informix archive operation and is likewise unusable? If so, am I correct in assuming that RHEL3 U2 is closer to what they want, i.e. no noticeable performance impact during an archive operation, but not close enough?

Finally, you say that with local disks the problem doesn't exist, but can you tell me what "local disks" means? Are they actual SCSI disks in the machine, and if so, attached to which SCSI controller, what class of disks, etc.? For the FC attachment, is it going through a fibre switch? If so, have they tried direct FC attachment to see if it behaves similarly to direct SCSI attached disks?

Since they are using the QLogic driver, can you get the output of dmesg that shows the QLogic driver's configuration, including things like the tagged queue depth on the attached device(s), and attach it to this bugzilla? The contents of the /proc/scsi/qla*/* files might be useful as well.
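As a side note, here is a minimal sketch of one way to gather the requested output into a single file for attaching here. The /proc/scsi/qla*/* paths come from the comment above; the file names and the helper itself are only illustrative, not something that was asked to be run.

```python
import glob
import subprocess

def collect_qla_info(output="qla-info.txt"):
    """Dump dmesg plus the per-adapter qla* proc files into one report."""
    with open(output, "w") as out:
        # dmesg shows the QLogic driver banner, attached LUNs and tagged queue depths
        out.write("===== dmesg =====\n")
        out.write(subprocess.run(["dmesg"], capture_output=True, text=True).stdout)
        # per-adapter state exported by the QLogic driver under /proc/scsi/qla*/
        for path in sorted(glob.glob("/proc/scsi/qla*/*")):
            out.write(f"\n===== {path} =====\n")
            with open(path) as f:
                out.write(f.read())

if __name__ == "__main__":
    collect_qla_info()
```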
Created attachment 99759 [details] requested files in tarball are attached
RHEL 2.1 is known to have performance issues in general and is used here only as a comparison to RHEL 3.0. RHEL 3.0 does perform better than RHEL 2.1 for the customer, but the performance is still not acceptable. There are no noticeable improvements from RHEL 3.0 to RHEL 3.0 U2. The local disks are attached to the internal SCSI controller, which in this case is a MegaRAID controller. Yes, the fibre attach is via a switch. They do not have the time, nor at this point the willingness, to try a direct attach. I do not have the most recent output from the customer's system using U2, so I will attach the files from the prior v2.4.21-4.0.1 config.
Heather, could you please ask the customer to try decreasing the read latency with elvtune? It may be that the I/O queues are simply too big for the Fibre Channel disks, leading to high latencies and the bad read performance that goes with long queues.
Would you please provide a recommendation as to what value the read latency should be decreased to?
It would be worth trying some different values, to get the right latency/throughput tradeoff for your configuration. I would probably try 8, 32 and 128, maybe even a higher value.
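As an illustration of the suggestion above, here is a rough sketch of sweeping those read-latency values and timing a sequential read after each change. The device path is hypothetical, and elvtune (from util-linux) is assumed to be present and run as root.

```python
import subprocess
import time

DEVICE = "/dev/sdb"              # hypothetical FC-attached device
CANDIDATES = [8, 32, 128, 512]   # read-latency values to try

def timed_read(device, megabytes=256, offset_mb=0):
    """Read a region of the device sequentially and return the elapsed time."""
    start = time.time()
    with open(device, "rb") as dev:
        dev.seek(offset_mb * 1024 * 1024)
        for _ in range(megabytes):
            dev.read(1024 * 1024)
    return time.time() - start

for idx, latency in enumerate(CANDIDATES):
    # set the elevator read latency for this device
    subprocess.run(["elvtune", "-r", str(latency), DEVICE], check=True)
    # read a different 256 MB region each pass so the page cache does not skew results
    elapsed = timed_read(DEVICE, offset_mb=idx * 256)
    print(f"read_latency={latency}: {elapsed:.1f}s for 256 MB")
```

The point is simply to see where the latency/throughput tradeoff mentioned above lands for this configuration: smaller values keep the queues shorter and should reduce the pauses, at some cost in throughput.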
I looked over the files and didn't see anything overly odd, except that the EMC PowerPath stuff is loaded. We've had lots of reports of poor performance when that's in use; I'm not positive whether it's related or not. However, as of RHEL3 U2 there is a new service available to admins, mdmpd (the MD Multipath Daemon), which works similarly to EMC's PowerPath but without being a kernel module. It's part of the mdadm-1.4.0-1 and later mdadm packages. If they are willing to test with it instead of PowerPath, it would be useful. Let me know if they'll give it a shot, and if so I'll post directions on how to get it set up.
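For context only (these are not the setup directions offered above), a rough sketch of the kind of MD multipath array that mdmpd monitors; the device names are hypothetical stand-ins for two paths to the same LUN.

```python
import subprocess

PATHS = ["/dev/sdc", "/dev/sdd"]   # hypothetical: two paths to the same FC LUN
MD_DEVICE = "/dev/md0"

# build a multipath MD array spanning the two paths
subprocess.run(
    ["mdadm", "--create", MD_DEVICE,
     "--level=multipath",
     f"--raid-devices={len(PATHS)}", *PATHS],
    check=True,
)

# mdmpd (from mdadm-1.4.0-1 and later) watches the array and handles failed paths
subprocess.run(["service", "mdmpd", "start"], check=True)
```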
Heather, I should note that I'm not trying to step on EMC's toes with this suggestion. I don't know whether you sell PowerPath or just give it away with your external boxes. But we do have a generic solution for the multipath problem now (still a bit young; the 2.6 version of the solution is nicer), and it doesn't require any additional kernel modules beyond the MD multipath support that's been in the kernel for a long time. The main thing is that it doesn't sit in the middle of every I/O at all, so testing with it would help identify whether the processing in PowerPath is involved in the high latency.
We have not tested the mdmpd and therefore won't recommend that a customer attempt to use it with our storage as we are not prepared to appropriately support the customer. Please note that the problem being reported by the customer occurs both with and without PowerPath.
OK, so the general problem description right now is that the slowdown happens with Fibre Channel through a switch, with and without PowerPath, and does not happen on locally connected disks. Basically, this should exonerate the SCSI mid layer, the block layer, and the sd driver. It leaves open questions about the QLogic driver vs. the MegaRAID driver, and about direct attach vs. going through a fibre switch. The __make_request changes could affect this; any kernel version later than 2.4.21-12.EL has those changes in it. If these later kernels still show the same slowdown, then my first suspect would be some interaction between the switch fabric and the load patterns coming out of the QLogic driver.
I believe v2.4.21-15.EL has resolved the problem at the customer site so I am closing this issue. Thanks for your suggestions and help. Heather