504338 – iscsi losing paths when heavily utilized

Summary: iscsi losing paths when heavily utilized

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	iscsi-initiator-utils
Sub Component:
Version:	5.3
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Chris Leech
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-06-05 16:39 UTC by Joshua Meppiel
Modified:	2018-12-04 14:17 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-04-04 20:42:58 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
log excerpt from server experiencing iscsi issue (78.16 KB, text/plain) 2009-06-05 19:48 UTC, Joshua Meppiel
log excerpt from Dell Equallogic SAN device (PS5000XV) (12.13 KB, text/plain) 2009-06-05 19:49 UTC, Joshua Meppiel

Description Joshua Meppiel 2009-06-05 16:39:26 UTC

Description of problem: iscsid loses one or more paths when the iSCSI fabric is flooded with I/O.  Discovered during benchmarks using Bonnie++.


Version-Release number of selected component (if applicable): iscsi-initiator-utils-6.2.0.868-0.18.el5


How reproducible: Reproducible every time the benchmark is run.  More likely to happen when the benchmark dataset is large (>32GB).


Steps to Reproduce:
1. Create an iSCSI mount to the SAN device (Dell Equallogic PS5000)
2. Start Bonnie++ or run other I/O-intensive tasks
  
Actual results:
Multiple paths begin disconnecting and reconnecting. We are currently using 3 paths, and eventually all 3 disconnect at the same time (anywhere from 5 minutes to 1 hour after the benchmark starts).

Expected results: 
Benchmark should complete without errors and report statistics. 


Additional info: Many of the following errors flood syslog:

Jun  4 18:06:56 stl-dt-sls-006 iscsid: received iferror -38
Jun  4 18:53:49 stl-dt-sls-006 kernel:  connection3:0: iscsi: detected conn error (1011)
Jun  4 18:53:49 stl-dt-sls-006 iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
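
A rough way to see which connections are failing, and how often, is to tally the "detected conn error" lines per connection. The following is a minimal Python sketch and not part of the original report: the regular expression is inferred only from the kernel line format shown above, and the function name count_conn_errors is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the kernel line quoted in this report (assumption);
# other kernel versions may format the message differently.
CONN_ERROR = re.compile(
    r"kernel:\s+connection(\d+:\d+): iscsi: detected conn error \((\d+)\)"
)


def count_conn_errors(lines):
    """Count (connection, error code) pairs in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        match = CONN_ERROR.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The three lines quoted in the report above.
sample = [
    "Jun  4 18:06:56 stl-dt-sls-006 iscsid: received iferror -38",
    "Jun  4 18:53:49 stl-dt-sls-006 kernel:  connection3:0: iscsi: detected conn error (1011)",
    "Jun  4 18:53:49 stl-dt-sls-006 iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)",
]

for (conn, code), n in sorted(count_conn_errors(sample).items()):
    print(f"connection {conn}: error {code} seen {n} time(s)")
```

To run it against a full log, pass an open file object instead of the sample list, for example count_conn_errors(open("/var/log/messages")), assuming that is where syslog writes on the host.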

Comment 1 Mike Christie 2009-06-05 19:38:07 UTC

Could you send a little more of the log? I am looking for something about an iSCSI ping or nop timing out, or something saying a logout was requested by the target. Also, on the target, could you check for something about it requesting a login or dropping a session/connection or doing load balancing (load balancing initiated from the target will cause the errors above too, because it forces us to log out of one portal and into another)?

Comment 2 Mike Christie 2009-06-05 19:39:11 UTC

(In reply to comment #1)
> Could you send a little more of the log? I am looking for something about a
> iscsi ping or nop timing out, or something about a logout from the target was
> requested. Also on the target could you check for something about it requesting
> a login or dropping a session/connection or doing load balancing (load

I mean a logout not login there..

> balancing initiated from the target will cause the errors above too because it
> forces us to logout of one portal and into another)?
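
One way to narrow a large syslog down to the events asked about here is a simple keyword filter. The sketch below is an illustration only: the keywords are guesses based on the wording of the comment above ("ping", "nop", "logout"), not the exact messages that iscsid or the kernel emit, and find_candidate_lines is a hypothetical helper.

```python
import re

# Keywords are assumptions taken from the comment text, not exact log
# messages; adjust them to whatever actually appears in your log.
KEYWORDS = re.compile(r"\b(ping|no?op\w*|logout)\b", re.IGNORECASE)


def find_candidate_lines(lines):
    """Return iSCSI-related lines that mention a ping, a nop, or a logout."""
    hits = []
    for line in lines:
        if "iscsi" in line.lower() and KEYWORDS.search(line):
            hits.append(line.rstrip("\n"))
    return hits


# The first entry is a real line from this report; the other two are
# made-up placeholders that only exercise the filter.
sample = [
    "Jun  4 18:53:49 stl-dt-sls-006 kernel:  connection3:0: iscsi: detected conn error (1011)",
    "made-up line: iscsi nop-out timed out",
    "made-up line: iscsi target requested logout",
]

for hit in find_candidate_lines(sample):
    print(hit)
```

Lines that only report the generic connection error are filtered out, which leaves the entries that hint at why the connection was dropped.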

Comment 3 Joshua Meppiel 2009-06-05 19:48:59 UTC

Created attachment 346708
log excerpt from server experiencing iscsi issue

This is an excerpt from syslog on one of the hosts experiencing this iSCSI disconnect issue under load.

Comment 4 Joshua Meppiel 2009-06-05 19:49:41 UTC

Created attachment 346709
log excerpt from Dell Equallogic SAN device (PS5000XV)

Comment 5 Mike Christie 2009-06-08 15:14:17 UTC

Joshua,

It looks like the target might be trying to load balance the sessions. When it does this, it asks us to log out and log back in, and when we try to log back in it redirects us to what it believes is the optimal portal. It looks like some other connections may be failing too, but let's try to cut some stuff out.

Could you try turning the target's load balancing off first? Here are the instructions I got from Equallogic:

You can turn off load balancing in the command line interface.  Telnet
to the array's group address, log in as grpadmin, and do:

> > grpparams conn-balancing disable

Comment 13 RHEL Program Management 2012-09-18 17:39:21 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 14 RHEL Program Management 2012-10-30 06:09:41 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 16 Wadie Guizani 2014-02-06 07:32:45 UTC

Hello,

I know this case is a bit old, but we are experiencing the same kind of issue with RHEL 5.8 and an Equallogic PS6100XV.

We get the same connection error messages, which result in the filesystem being unmounted and remounted in read-only mode. This is very annoying, as the filesystem hosts the ArchiveLog for an Oracle database.

We are able to bring the server back up with a simple reboot, but we hit the issue once last week and already twice this week.

regards

Comment 18 Chris Williams 2017-04-04 20:42:58 UTC

Red Hat Enterprise Linux 5 shipped its last minor release, 5.11, on September 14th, 2014. On March 31st, 2017 RHEL 5 exits Production Phase 3 and enters Extended Life Phase. For RHEL releases in the Extended Life Phase, Red Hat will provide limited ongoing technical support. No bug fixes, security fixes, hardware enablement or root-cause analysis will be available during this phase, and support will be provided on existing installations only.  If the customer purchases the Extended Life-cycle Support (ELS), certain critical-impact security fixes and selected urgent priority bug fixes for the last minor release will be provided.  The specific support and services provided during each phase are described in detail at http://redhat.com/rhel/lifecycle

This BZ does not appear to meet ELS criteria so is being closed WONTFIX. If this BZ is critical for your environment and you have an Extended Life-cycle Support Add-on entitlement, please open a case in the Red Hat Customer Portal, https://access.redhat.com, provide a thorough business justification and ask that the BZ be re-opened for consideration of an errata. Please note, only certain critical-impact security fixes and selected urgent priority bug fixes for the last minor release can be considered.
