Description of problem:
A race condition can be observed in the ISCSI_ERR_INVALID_HOST handling in open-iscsi when that message is issued immediately after iscsid receives the ISCSI_ERR_CONN_FAILED nl message that triggers the re-open path. The race is related to how the single-threaded scheduler handles the INVALID_HOST message asynchronously from within the re-open path. The end result is that the actor which handles INVALID_HOST gets flushed by the re-open path handling.

Version-Release number of selected component (if applicable):
open-iscsi-2.0.871.1 - From inbox

How reproducible:
A few iterations, depending on the timing of the procedure.

Steps to Reproduce:
1. For every active session, issue an ISCSI_ERR_CONN_FAILED nl msg (this will eventually put all active connections into the actor_list, ready to execute the session_conn_reopen procedure).
2. Asynchronously, after a few seconds, call the iscsi_host_remove procedure (this will notify iscsid with the ISCSI_ERR_INVALID_HOST nl message).

Actual results:
The number of outstanding sessions does not converge to 0 after the INVALID_HOST handling.

Expected results:
The number of outstanding sessions should converge to 0 after the INVALID_HOST handling.

Additional info:
iscsi_host_remove() in libiscsi will wait indefinitely if this race occurs.
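To illustrate the failure mode described above, here is a minimal toy model of a single-threaded actor queue. It is only a sketch under stated assumptions: the types and functions (struct sched, receive_event, simulate, the host_invalid flag) are hypothetical and are not iscsid's real data structures, and the "fixed" branch is one plausible mitigation, not the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds; the names mirror the report, not iscsid's types. */
enum ev_kind { EV_CONN_FAILED, EV_INVALID_HOST };

struct session {
    bool alive;        /* still counted as an outstanding session */
    bool host_invalid; /* sticky marker, set only in the "fixed" variant */
};

struct actor {
    enum ev_kind kind;
    struct session *sess;
};

#define MAX_ACTORS 16

struct sched {
    struct actor q[MAX_ACTORS];
    size_t len;
};

/* Queue an actor for an incoming event. In the "fixed" variant the
 * INVALID_HOST state is also recorded on the session at receive time,
 * so it survives a later flush of the queue. */
static void receive_event(struct sched *s, enum ev_kind kind,
                          struct session *sess, bool fixed)
{
    assert(s->len < MAX_ACTORS);
    if (fixed && kind == EV_INVALID_HOST)
        sess->host_invalid = true;
    s->q[s->len].kind = kind;
    s->q[s->len].sess = sess;
    s->len++;
}

/* Drop every queued actor at index >= from that belongs to sess. */
static void flush_session(struct sched *s, size_t from, struct session *sess)
{
    size_t out = from;
    for (size_t i = from; i < s->len; i++)
        if (s->q[i].sess != sess)
            s->q[out++] = s->q[i];
    s->len = out;
}

/* Queue CONN_FAILED for every session, then INVALID_HOST for every
 * session, run the queue once, and return how many sessions remain. */
static int simulate(size_t nsess, bool fixed)
{
    struct session sessions[MAX_ACTORS / 2];
    struct sched s = { .len = 0 };
    int outstanding = 0;

    assert(nsess <= MAX_ACTORS / 2);
    for (size_t i = 0; i < nsess; i++) {
        sessions[i].alive = true;
        sessions[i].host_invalid = false;
    }
    for (size_t i = 0; i < nsess; i++)
        receive_event(&s, EV_CONN_FAILED, &sessions[i], fixed);
    for (size_t i = 0; i < nsess; i++)
        receive_event(&s, EV_INVALID_HOST, &sessions[i], fixed);

    for (size_t i = 0; i < s.len; i++) {
        struct actor a = s.q[i];
        if (a.kind == EV_CONN_FAILED) {
            /* Re-open path: flushes pending actors for this session,
             * which also discards a queued INVALID_HOST actor. */
            flush_session(&s, i + 1, a.sess);
            if (a.sess->host_invalid)
                a.sess->alive = false; /* tear down instead of re-opening */
        } else {
            a.sess->alive = false;     /* INVALID_HOST: tear the session down */
        }
    }
    for (size_t i = 0; i < nsess; i++)
        outstanding += sessions[i].alive ? 1 : 0;
    return outstanding;
}
```

In this model, simulate(4, false) returns 4 (every INVALID_HOST actor is flushed before it runs, so no session is ever torn down), while simulate(4, true) returns 0.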
Created attachment 451527
ISCSID: Fixed a race condition in the INVALID_HOST path

This should fix the race condition presented.
Note: this should only affect offload drivers, and only their shutdown path, so the patch should be pretty safe regression-wise.
Broadcom QA has done extensive testing and verification with this patch on RH5.5 using our latest OOB driver set. We are also attempting to verify this patch in RH5.6 with the inbox driver. However, we are running into what looks like some unrelated problems and are in the process of debugging/troubleshooting.
Great, keep us posted if you find out anything definite.
Thanks for the patch, Eddie. This is merged in iscsi-initiator-utils-6.2.0.872-5.el5. You can download it here: http://people.redhat.com/mchristi/iscsi/rhel5.6/iscsi-initiator-utils/
Created attachment 458985
Patch for uIP-0.6.2.2.1

While testing, we also saw that the uIP userspace daemon would segfault when uIP took an error path after some of the sysfs iSCSI entries had been removed. The fault is a NULL dereference in which the incorrect errno value is being used. The global errno should be used, not the errno from the file descriptor. This only occurs in an error condition with the default logging level. Note that this is a very small one-line fix and only affects the bnx2i driver.
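The errno point above can be sketched as follows. This is an illustration only, not the uIP code or the attached patch: describe_open_failure is a hypothetical helper, and it assumes the faulty path took its error code from the failed descriptor rather than from the global errno. The sketch saves errno immediately after the failing open() and formats the message from that saved value.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: try to open `path` read-only. On failure, write a
 * log-style message into buf using the global errno (saved right away so
 * later library calls cannot overwrite it) and return that errno value.
 * On success, close the descriptor and return 0. */
static int describe_open_failure(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);
    buf[0] = '\0';

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        int saved = errno; /* correct source of the error code */
        /* Passing fd (which is -1 here) to strerror() instead of the
         * saved errno would describe the wrong error. */
        snprintf(buf, len, "Could not open %s: %s", path, strerror(saved));
        return saved;
    }
    close(fd);
    return 0;
}
```

A caller on the error path would log buf and then use the returned code to decide whether the sysfs entry has simply gone away (ENOENT) or something else failed.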
Mike, I'll leave it to you if this needs another BZ or not.
Andrius, talked to Barry in QE. He said it was ok with him to just update the fix for this with this fix. Fixed in iscsi-initiator-utils-6.2.0.872-5.el5. You can download it here: http://people.redhat.com/mchristi/iscsi/rhel5.6/iscsi-initiator-utils/
(In reply to comment #11)
> Andrius, talked to Barry in QE. He said it was ok with him to just update the
> fix for this with this fix.
>
> Fixed in iscsi-initiator-utils-6.2.0.872-5.el5. You can download it here:
> http://people.redhat.com/mchristi/iscsi/rhel5.6/iscsi-initiator-utils/

OK, setting back to ASSIGNED then.
It appears that BZ651287 is preventing this problem from being correctly verified using RHEL5.6-Server-20101029.0-x86_64-DVD.iso and iscsi-initiator-utils-6.2.0.872-6.el5. The fix had previously been verified with the patch along with Broadcom OOB drivers using RH5.5.
@Broadcom, it appears BZ651287 has been confirmed as fixed with kernel-2.6.18-233.el5. Please let us know the latest test results for this bug (#640111) once they're available. Thanks!
Reminder! There should be a fix present for this BZ in snapshot 3 -- unless otherwise noted in a previous comment. Please test and update this BZ with test results as soon as possible.
Testing has been ongoing with kernel-2.6.18-233.el5, and there are some conditions in which this problem still surfaces, for example when there are 128 iSCSI sessions active and ifup/down is issued repeatedly. The same test was run with 35 iSCSI sessions overnight and was working ok. Perhaps the BRCM engineering team can comment on this. Thanks.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: A host removal could become suspended when the bnx2i, cxgb3i, or be2iscsi drivers were used and iSCSI sessions could not be cleaned up. With this update, the iSCSI daemon has been corrected to handle the error event, and a host removal behaves as expected.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0072.html