Bug 576709

Summary: [Cisco 5.6 bug] fnic: flush Tx queue bug fix
Product: Red Hat Enterprise Linux 5 Reporter: Abhijeet Joglekar <abjoglek>
Component: kernel Assignee: Mike Christie <mchristi>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 5.6 CC: andriusb, coughlan, cward, james.brown, jwest, mchristi, mgahagan, savbu-lnx-drivers, Stuart.Kirk, tao, vbhamidi
Target Milestone: rc Keywords: OtherQA, ZStream
Target Release: 5.6   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
A host could crash during a SAN (storage area network) installation when using the Cisco 'fnic' driver. During driver initialization, an error in the 'fnic' driver caused it to flush the wrong queue. The flush code could then access memory incorrectly and crash the host. With this update, the error in the 'fnic' driver has been fixed and crashes no longer occur.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-01-13 21:21:31 UTC Type: ---
Bug Depends On: 578328    
Bug Blocks: 557597, 580828, 580829    
Attachments:
Description: Patch for fnic flush transmit queue issue Flags: none

Description Abhijeet Joglekar 2010-03-24 21:31:36 UTC
Description of problem:
The fnic driver in 5.5 has a bug where, once the fabric login has completed, it flushes the Rx queue instead of the intended Tx queue. This can cause a crash during SAN boot or at other times, such as when using an "fcc reset" to re-login to the fabric.

Version-Release number of selected component (if applicable):
1.4.0.98

How reproducible:


Steps to Reproduce:
1. fcc reset
2. SAN boot
Actual results:


Expected results:


Additional info:

Comment 1 Abhijeet Joglekar 2010-03-24 21:35:58 UTC
This fix should also be included in 5.5-z series kernel. Will upload patch soon, and also submit to upstream.

Comment 2 Tom Coughlan 2010-03-25 12:04:57 UTC
(In reply to comment #1)
> This fix should also be included in 5.5-z series kernel. Will upload patch
> soon, and also submit to upstream.    

Okay, since it can cause a crash, I will propose it for .z.

Comment 3 Andrius Benokraitis 2010-03-25 16:14:37 UTC
Mike, it would be great if this gets committed to the 5.6 tree very early (right when it opens?) so that we have enough time to get this in the first 5.5.z. How is your workload to get this POSTed in the next week or two once we get it?

Comment 4 Mike Christie 2010-03-26 19:22:43 UTC
Is this upstream (in James's tree at least)? If so could you send a git commit link?

Comment 5 Mike Christie 2010-03-26 19:23:53 UTC
We need this for RHEL6 too right?

If so then we should ping Rob since he has the fnic RHEL6.0 update (I traded that bug with him for some fc class one).

Comment 6 Andrius Benokraitis 2010-03-26 19:37:20 UTC
Right, Cisco: Assuming this fix isn't in bug 570693 since this isn't upstream yet.

Comment 7 Mike Christie 2010-03-26 20:32:20 UTC
(In reply to comment #5)
> We need this for RHEL6 too right?
> 
> If so then we should ping Rob since he has the fnic RHEL6.0 update

Ah shoot. Rob sent this patch already. I guess we need to clone this for RHEL6.

Comment 8 Mike Christie 2010-03-26 20:33:03 UTC
(In reply to comment #4)
> Is this upstream (in James's tree at least)? If so could you send a git commit
> link?    

You can also at the very least send me a link to the posting on linux-scsi.

Comment 9 Venkata Siva Vijayendra Bhamidipati 2010-03-26 21:18:13 UTC
Created attachment 402956
Patch for fnic flush transmit queue issue

Comment 10 Mike Christie 2010-03-27 03:25:29 UTC
Adding links to the submission so I can track it later.

http://www.open-fcoe.org/pipermail/devel/2010-March/010116.html
http://www.open-fcoe.org/pipermail/devel/2010-March/010117.html

Comment 11 Abhijeet Joglekar 2010-03-30 21:29:28 UTC
Sorry, didn't get a chance to reply to this earlier. Yes, we need to include this in 6.0 too. Will create a bugzilla for that.

thanks.

Comment 13 Andrius Benokraitis 2010-04-05 03:52:34 UTC
Created 6.0 bug 578328

Comment 16 Abhijeet Joglekar 2010-04-06 20:07:31 UTC
Symptom: Customer may see a host crash during SAN install using fnic driver

Problem: ELS frames generated by libfc are queued by the fnic driver in a TX queue, until the fabric login is done and the adapter is set up in the correct mode. Once fabric login completes, and the FCID received from the fabric is programmed, the TX queue should be flushed and frames sent out on the wire.

The issue is that the driver is incorrectly flushing another queue instead of the TX queue. The buffers in these queues are aligned differently, so the flush code can access memory incorrectly and crash the host.

Fix: Flush the TX queue instead of the other queue. This fix will be present in the 5.5-z series. It will also be provided by Cisco on a driver disk to replace the in-kernel driver during SAN install if the customer does not want to upgrade to the errata kernel.
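
For readers who want the shape of the problem in code, here is a minimal user-space sketch of the flush step described above. It is not the fnic source: struct model_hba, struct frame_queue, queue_push(), queue_pop(), send_frame() and flush_tx() are hypothetical names made up for illustration, and the real driver uses kernel frame buffers rather than this toy ring. The only point it makes is that the flush must drain the TX queue, where the ELS frames were parked, and must leave the other queue alone.

```c
#include <stddef.h>

/* Hypothetical, simplified model of the two queues; not the fnic driver code. */
#define QCAP 8

struct frame {
    int id;
};

struct frame_queue {
    struct frame *slots[QCAP];
    size_t head;
    size_t count;
};

struct model_hba {
    struct frame_queue tx_queue; /* ELS frames parked until fabric login completes */
    struct frame_queue rx_queue; /* the other queue; must not be touched by the flush */
    int sent[QCAP];              /* ids of frames "sent on the wire", in order */
    size_t nsent;
};

/* Append a frame to the tail of a queue; returns -1 if the queue is full. */
static int queue_push(struct frame_queue *q, struct frame *f)
{
    if (q->count == QCAP)
        return -1;
    q->slots[(q->head + q->count) % QCAP] = f;
    q->count++;
    return 0;
}

/* Remove and return the frame at the head of a queue, or NULL if it is empty. */
static struct frame *queue_pop(struct frame_queue *q)
{
    struct frame *f;

    if (q->count == 0)
        return NULL;
    f = q->slots[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return f;
}

/* Stand-in for putting a frame on the wire: just record its id. */
static void send_frame(struct model_hba *hba, const struct frame *f)
{
    if (hba->nsent < QCAP)
        hba->sent[hba->nsent++] = f->id;
}

/*
 * Called once fabric login has completed: send every parked frame in FIFO
 * order. The bug described above corresponds to draining rx_queue here
 * instead of tx_queue.
 */
static void flush_tx(struct model_hba *hba)
{
    struct frame *f;

    while ((f = queue_pop(&hba->tx_queue)) != NULL)
        send_frame(hba, f);
}
```

In this model, pushing two frames onto tx_queue and one onto rx_queue and then calling flush_tx() sends the two TX frames in order and leaves the rx_queue entry where it was.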

Comment 17 Tom Coughlan 2010-04-06 20:25:17 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Ryan, this is for the on-line 5.5 release notes. 

This is a shorter version of comment 16. 

Symptom: Possible host crash during SAN install using Cisco fnic driver. 

Problem: During driver initialization an error in the fnic driver causes it to flush the wrong queue. The flush
code can access memory incorrectly and crash the host.

Fix: There is a plan to ship a fixed fnic driver in a RHEL 5.5 errata.  This fix will also be provided by Cisco on a driver disk, if needed.

Comment 27 Jarod Wilson 2010-04-21 19:42:11 UTC
in kernel-2.6.18-197.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 28 Andrius Benokraitis 2010-04-22 13:40:57 UTC
Abhijeet - URGENT - please test ASAP, your results are blocking the inclusion of this into RHEL 5.5.z. Thanks!

Comment 29 Abhijeet Joglekar 2010-04-22 15:10:59 UTC
Already forwarded the kernel link to QA yesterday, will forward them the urgent timeline requirement. Thanks!

Comment 30 Abhijeet Joglekar 2010-04-27 18:09:30 UTC
This fix was verified by QA. They ran the following test:

1) Install RHEL 5.5 GA bits (kernel 194) on a SAN array. Use a RHEL 5.5 driver disk that has the fnic driver 1.4.0.145 with the bug fix. Install goes through fine, driver goes into updates/fnic/fnic.ko and takes precedence over the inbox driver in kernel 194 (which had the tx queue flush bug)

2) Install kernel rpms 197. Then reboot the system in 197; the in-box driver in 197 now shows up as 1.4.0.145 from drivers/scsi/fnic/fnic.ko. Continue testing with this driver.

Comment 35 Martin Prpič 2010-12-22 17:23:14 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,10 +1 @@
-Ryan, this is for the on-line 5.5 release notes. 
+A host could crash during a SAN (storage area network) installation when using the Cisco 'fnic' driver. During driver initialization, an error in the 'fnic' driver caused it to flush the wrong queue. The flush code could then access memory incorrectly and crash the host. With this update, the error in the 'fnic' driver has been fixed and crashes no longer occur.
-
-This is a shorter version of comment 16. 
-
-Symptom: Possible host crash during SAN install using Cisco fnic driver. 
-
-Problem: During driver initialization an error in the fnic driver causes it to flush the wrong queue. The flush
-code can access memory incorrectly and crash the host.
-
-Fix: There is a plan to ship a fixed fnic driver in a RHEL 5.5 errata.  This fix will also be provided by Cisco on a driver disk, if needed.

Comment 37 errata-xmlrpc 2011-01-13 21:21:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html