640111 – [Broadcom 5.6 bug] Race condition in the INVALID_HOST path


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	iscsi-initiator-utils
Sub Component:
Version:	5.6
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	5.6
Assignee:	Mike Christie
QA Contact:	Storage QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	640115

Reported:	2010-10-04 20:36 UTC by Eddie Wai
Modified:	2011-01-13 22:59 UTC
CC List:	12 users
Fixed In Version:	iscsi-initiator-utils-6.2.0.872-5
Doc Type:	Bug Fix
Doc Text:	A host removal could become suspended when the bnx2i, cxgb3i, or be2iscsi drivers were used and iSCSI sessions could not be cleaned up. With this update, the iSCSI daemon has been corrected to handle the error event, and a host removal behaves as expected.
Clone Of:
Clones:	640115
Environment:
Last Closed:	2011-01-13 22:59:21 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
ISCSID: Fixed a race condition in the INVALID_HOST path (2.46 KB, patch) 2010-10-04 20:39 UTC, Eddie Wai	no flags
Patch for uIP-0.6.2.2.1 (6.99 KB, text/plain) 2010-11-09 06:01 UTC, Benjamin Li	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0072	0	normal	SHIPPED_LIVE	iscsi-initiator-utils bug fix update	2011-01-12 17:22:06 UTC

Description Eddie Wai 2010-10-04 20:36:40 UTC

Description of problem:
A race condition can be observed in the ISCSI_ERR_INVALID_HOST handling in open-iscsi when that event is issued immediately after iscsid receives the ISCSI_ERR_CONN_FAILED netlink (nl) message that triggers the re-open path. The race was found to be related to how the single-threaded scheduler handles the INVALID_HOST message asynchronously from within the re-open path.

The end result is that the actor scheduled to handle INVALID_HOST is flushed by the re-open path handling and never runs.
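
The interaction described above can be sketched with a small Python model. This is only an illustration of the scheduling behaviour, not the iscsid code: the names Scheduler, simulate, "reopen", "teardown" and protect_teardown are hypothetical, and the protect_teardown switch is an assumed stand-in for a fix, since the attached patch is not reproduced in this report.

```python
from collections import deque


class Scheduler:
    """Single-threaded actor queue (hypothetical model, not iscsid's actor.c)."""

    def __init__(self):
        self.actors = deque()

    def schedule(self, session_id, kind, callback):
        # Queue an actor; it runs later, in FIFO order, on the same thread.
        self.actors.append((session_id, kind, callback))

    def flush(self, session_id, keep_kinds=()):
        # Drop pending actors of one session, except the kinds in keep_kinds.
        self.actors = deque(
            a for a in self.actors
            if a[0] != session_id or a[1] in keep_kinds
        )

    def run(self):
        while self.actors:
            _, _, callback = self.actors.popleft()
            callback()


def simulate(num_sessions, protect_teardown):
    """Return the number of sessions left after CONN_FAILED then INVALID_HOST."""
    sched = Scheduler()
    sessions = set(range(num_sessions))

    def reopen(sid):
        # Assumption: the re-open path flushes the session's pending actors.
        keep = ("teardown",) if protect_teardown else ()
        sched.flush(sid, keep_kinds=keep)

    def teardown(sid):
        # Assumption: INVALID_HOST handling removes the session.
        sessions.discard(sid)

    # Step 1: ISCSI_ERR_CONN_FAILED queues a re-open actor per session.
    for sid in sorted(sessions):
        sched.schedule(sid, "reopen", lambda sid=sid: reopen(sid))
    # Step 2: ISCSI_ERR_INVALID_HOST queues a teardown actor per session.
    for sid in sorted(sessions):
        sched.schedule(sid, "teardown", lambda sid=sid: teardown(sid))

    sched.run()
    return len(sessions)
```

In this model, simulate(3, protect_teardown=False) leaves all three sessions outstanding because each re-open actor flushes the teardown actor queued behind it, which corresponds to the session count never converging to 0. With protect_teardown=True the teardown actors survive the flush and the count reaches 0.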

Version-Release number of selected component (if applicable):
open-iscsi-2.0.871.1 - From inbox

How reproducible:
Reproducible within a few iterations, depending on the timing of the procedure

Steps to Reproduce:
1. For every active session, issue an ISCSI_ERR_CONN_FAILED nl msg
   (This will eventually put all active connections into the actor_list ready
   to execute the session_conn_reopen procedure)
2. Asynchronously after a few seconds, call the iscsi_host_remove procedure
   (This will notify iscsid with the ISCSI_ERR_INVALID_HOST nl message)
3.
  
Actual results:
The number of outstanding sessions will not converge to 0 after the INVALID_HOST handling.

Expected results:
The number of outstanding sessions should converge to 0 after the INVALID_HOST handling.

Additional info:
The iscsi_host_remove() in libiscsi will wait indefinitely if this race problem occurs.

Comment 1 Eddie Wai 2010-10-04 20:39:35 UTC

Created attachment 451527
ISCSID: Fixed a race condition in the INVALID_HOST path

This should fix the race condition presented.

Comment 2 Mike Christie 2010-10-26 00:59:40 UTC

Note: this should only affect offload drivers, and only their shutdown path, so the patch should be pretty safe regression-wise.

Comment 3 edwardn 2010-10-27 22:32:15 UTC

Broadcom QA has done extensive testing and verification with this patch on RH5.5 using our latest OOB driver set.  We are also attempting to verify this patch in RH5.6 with the inbox driver.  However, we are running into what looks like some unrelated problems and are in the process of debugging/troubleshooting.

Comment 4 Andrius Benokraitis 2010-10-28 01:26:13 UTC

Great, keep us posted if you find out anything definite.

Comment 6 Mike Christie 2010-11-02 22:38:14 UTC

Thanks for the patch, Eddie. This is merged in iscsi-initiator-utils-6.2.0.872-5.el5. You can download it here:
http://people.redhat.com/mchristi/iscsi/rhel5.6/iscsi-initiator-utils/

Comment 9 Benjamin Li 2010-11-09 06:01:35 UTC

Created attachment 458985
Patch for uIP-0.6.2.2.1

While testing, we also saw that the uIP userspace daemon would segfault when uIP took an error path after some of the sysfs iSCSI entries had been removed. The fault is a NULL dereference in which the incorrect errno value is being used. The global errno should be used and not the errno from the file descriptor. This only occurs in an error condition with the default logging level.

Note that this is a very small one-line fix and only affects the bnx2i driver.

Comment 10 Andrius Benokraitis 2010-11-09 14:11:02 UTC

Mike, I'll leave it to you if this needs another BZ or not.

Comment 11 Mike Christie 2010-11-09 18:26:18 UTC

Andrius, talked to Barry in QE. He said it was ok with him to just update the fix for this with this fix.

Fixed in iscsi-initiator-utils-6.2.0.872-5.el5. You can download it here:
http://people.redhat.com/mchristi/iscsi/rhel5.6/iscsi-initiator-utils/

Comment 12 Andrius Benokraitis 2010-11-09 18:42:20 UTC

(In reply to comment #11)
> Andrius, talked to Barry in QE. He said it was ok with him to just update the
> fix for this with this fix.
> 
> Fixed in iscsi-initiator-utils-6.2.0.872-5.el5. You can download it here:
> http://people.redhat.com/mchristi/iscsi/rhel5.6/iscsi-initiator-utils/

OK, setting back to ASSIGNED then.

Comment 14 edwardn 2010-11-11 01:10:18 UTC

It appears that BZ651287 is preventing this problem from being correctly verified using RHEL5.6-Server-20101029.0-x86_64-DVD.iso and iscsi-initiator-utils-6.2.0.872-6.el5.  The fix had previously been verified with the patch along with Broadcom OOB drivers using RH5.5.

Comment 15 Chris Ward 2010-11-30 13:55:22 UTC

@Broadcom, it appears BZ651287 has been confirmed as fixed with kernel-2.6.18-233.el5.

Please let us know the latest test results for this bug (#640111) once they're available. Thanks!

Comment 16 Chris Ward 2010-12-02 15:27:12 UTC

Reminder! There should be a fix present for this BZ in snapshot 3 -- unless otherwise noted in a previous comment.

Please test and update this BZ with test results as soon as possible.

Comment 17 edwardn 2010-12-02 17:11:21 UTC

Testing has been ongoing with kernel-2.6.18-233.el5 and there are some conditions in which this problem still surfaces, for example when there are 128 iSCSI sessions active and ifup/ifdown is issued repeatedly. The same test was run overnight with 35 iSCSI sessions and was working OK. Perhaps the BRCM engineering team can comment on this. Thanks.

Comment 19 Jaromir Hradilek 2010-12-08 10:58:08 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A host removal could become suspended when the bnx2i, cxgb3i, or be2iscsi drivers were used and iSCSI sessions could not be cleaned up. With this update, the iSCSI daemon has been corrected to handle the error event, and a host removal behaves as expected.

Comment 21 errata-xmlrpc 2011-01-13 22:59:21 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0072.html
