Bug 643236

Summary:	iscsi: get nopout and conn errors.
Product:	Red Hat Enterprise Linux 6	Reporter:	Mike Christie <mchristi>
Component:	kernel	Assignee:	Mike Christie <mchristi>
Status:	CLOSED ERRATA	QA Contact:	Gris Ge <fge>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	6.0	CC:	coughlan, fge, qcai
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-2.6.32-112.el6	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-05-19 12:31:55 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Mike Christie 2010-10-15 03:18:30 UTC

Description of problem:


The scsi layer is sending too many commands to the iscsi layer (more than target->can_queue). The iscsi layer can then end up using all of its IO structs for scsi command IO. If the target then sends us a nop as a ping, we will not have a struct left to use for the reply. In /var/log/messages you will see:

Could not send nopout

This may be followed by a conn error 1011 or 1020 if the target then decides to drop the session because the nop was dropped.
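A quick way to check whether a host has hit this is to count the nopout failures in the log. The sketch below is only an illustration: the function name count_nopout_failures is made up here, and it assumes the message text appears in the log exactly as quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count "Could not send nopout" lines in a log file.
# The log path defaults to /var/log/messages, the file named in this report.
count_nopout_failures() {
    log="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps the helper usable under "set -e".
    grep -c 'Could not send nopout' "$log" || true
}
```

Calling count_nopout_failures with no argument reads /var/log/messages; a non-zero count suggests the initiator ran out of IO structs at least once.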



Additional info:


This can be worked around by setting the node.session.cmds_max value higher than the target's command window.

Log out of the targets. Set the value in /etc/iscsi/iscsid.conf, then rerun the discovery command and log back in to the targets.

Or log out of the targets and run:

iscsiadm -m node -o update -n node.session.cmds_max -v $NEW_VALUE

Then log back in.
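The logout, update, login sequence for a single node record could be scripted roughly as follows. This is a sketch, not a tested procedure: the helper names run and raise_cmds_max, the DRY_RUN switch, and the target and portal values are all placeholders introduced here. Only the iscsiadm node-mode options (-T, -p, -u, -l, -o update, -n, -v) come from iscsiadm itself.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 the command is printed instead of run,
# which makes it possible to review the sequence before touching a session.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: raise node.session.cmds_max for one node record.
# $1 = target IQN, $2 = portal (ip:port), $3 = new cmds_max value.
raise_cmds_max() {
    target="$1"; portal="$2"; new_value="$3"
    # Log out, update the stored node record, then log back in.
    # The chain stops at the first command that fails.
    run iscsiadm -m node -T "$target" -p "$portal" -u &&
    run iscsiadm -m node -T "$target" -p "$portal" -o update \
        -n node.session.cmds_max -v "$new_value" &&
    run iscsiadm -m node -T "$target" -p "$portal" -l
}
```

For example, DRY_RUN=1 raise_cmds_max iqn.2010-10.example:disk1 192.0.2.10:3260 256 prints the three iscsiadm commands without running them. Pick a value higher than the target's command window, as described above.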

Comment 2 RHEL Program Management 2010-10-15 03:28:45 UTC

Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.

If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.

Comment 4 RHEL Program Management 2010-10-21 14:29:52 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 6 Aristeu Rozanski 2011-02-03 17:24:53 UTC

Patch(es) available on kernel-2.6.32-112.el6

Comment 10 Gris Ge 2011-03-09 03:24:19 UTC

Mike,

I failed to reproduce this issue by reducing node.session.cmds_max to 16 and running heavy IO against that iscsi disk for about 2 hours.

The network latency between target and initiator is about 200ms.
10 'perl -e while(1){}' processes were running to consume CPU.

I never saw any 'Could not send nopout' error.

Can you advise me on how to reproduce this problem?

Comment 11 Mike Christie 2011-03-10 00:41:22 UTC

It is a little difficult.

What target are you using? It is easiest to hit with an EqualLogic target.


You need to use bnx2i or cxgb3i (iscsi_tcp does not show the problem). Make sure your IO test is set to send more than cmds_max IOs; your target also has to support that many IOs.

I think most targets support at least 32 cmds, so set

node.session.cmds_max = 16
node.session.queue_depth = 32

(Either set these in iscsid.conf and then rerun the iscsiadm discovery command so the new iscsid.conf values are used, or run iscsiadm -m node -o update -n $NAME_OF_SETTING_ABOVE -v $VALUE_ABOVE on an existing target portal record.)


Then we want to send more than cmds_max IOs. With this command we would send about 64:

disktest -PT -T130 -h1 -K64 -B256k -ID /dev/sdXYZ



If you cannot find an EQL target, then use the settings above, run the command above, and do

cat /sys/class/scsi_host/hostX/host_busy

That value should always be less than cmds_max if the problem is fixed. If the value is larger than cmds_max, then you have hit the problem.
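The host_busy comparison can be automated with a small check like the one below. It is a sketch under assumptions: the function name host_busy_ok and its output format are invented here, and hostX still has to be replaced with the real host number. It treats any reading that is not below cmds_max as suspect, following the "less than" wording above.

```shell
#!/bin/sh
# Hypothetical helper: compare one host_busy reading against cmds_max.
# $1 = path to the host_busy file, e.g. /sys/class/scsi_host/hostX/host_busy
# $2 = the cmds_max value configured for the session, e.g. 16
# Returns 0 when the reading is below cmds_max, 1 when it is not,
# and 2 when the file cannot be read.
host_busy_ok() {
    busy_file="$1"
    cmds_max="$2"
    busy=$(cat "$busy_file") || return 2
    if [ "$busy" -lt "$cmds_max" ]; then
        echo "ok: host_busy=$busy cmds_max=$cmds_max"
        return 0
    fi
    echo "exceeded: host_busy=$busy cmds_max=$cmds_max"
    return 1
}
```

While the disktest run is in progress, the check can be repeated in a loop, for example: while host_busy_ok /sys/class/scsi_host/hostX/host_busy 16; do sleep 1; done. The loop stops at the first reading that is not below cmds_max.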

Comment 12 Gris Ge 2011-04-28 08:29:47 UTC

Mike,

We have an Emulex be2iscsi at hand (not configured).
Would that driver hit this problem?

Comment 13 Mike Christie 2011-04-29 02:30:17 UTC

Yeah you would hit the problem with that driver.

I think you can just do a sanity check. I tested it here and I believe Chelsio tested it too (They are the ones that pinged me about merging the patch and tested it upstream).

Getting the timing right is really hard. I do not think it is worth your time to try to replicate the problem. As long as there are no regressions it should be ok.

Comment 14 Gris Ge 2011-04-29 03:53:18 UTC

Code reviewed.
Patch applied.

Basic iscsi function was tested by errata and it passed the test.

No hardware available, so Sanity Only.

Comment 15 errata-xmlrpc 2011-05-19 12:31:55 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html