Bug 646513 - HP_GETHOSTINFO ioctl always causes mpt controller reset
Summary: HP_GETHOSTINFO ioctl always causes mpt controller reset
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: i686
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Tomas Henzl
QA Contact: Gris Ge
URL:
Whiteboard:
Depends On:
Blocks: 684128 707606
Reported: 2010-10-25 14:28 UTC by David Jeffery
Modified: 2018-11-14 16:40 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A call to the HP_GETHOSTINFO ioctl (I/O Control) in the mptctl module could result in the MPT (Message Passing Technology) fusion driver being reset due to erroneous detection of completed ioctl commands. With this update, the message context sent to the mptctl module is stored (previously, it was zeroed). When an ioctl command completes, the saved message context is used to recognize the completion of the message, thus resolving the faulty detection.
Clone Of:
Environment:
Last Closed: 2011-07-21 10:00:32 UTC
Target Upstream Version:
Embargoed:


Attachments
Save and restore the MsgContext around the memset (840 bytes, patch)
2010-10-25 14:32 UTC, David Jeffery
no flags
Save and restore the MsgContext around the memset - new version (1.14 KB, patch)
2011-02-23 12:07 UTC, Tomas Henzl
thenzl: review? (djeffery)
ioctl HP_GETHOSTINFO test code. (1.93 KB, text/x-csrc)
2011-05-31 03:33 UTC, Gris Ge
no flags


Links
System: Red Hat Product Errata
ID: RHSA-2011:1065
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update
Last Updated: 2011-07-21 09:21:37 UTC

Description David Jeffery 2010-10-25 14:28:38 UTC
Description of problem:

With the mpt fusion driver update in 5.5, any call to the HP_GETHOSTINFO ioctl in the mptctl module results in the controller being reset.

The newer driver contains changes to the method used to detect the completion of ioctl commands.  In the 5.5 driver, the variable ioc->ioctl_cmds.msg_context is used to store the context ID of the command message whose completion the ioctl is waiting for.  But the command's completion isn't being detected.

A systemtap script was attached to log context IDs for commands sent and completed by the mptctl module.  It produced the output:

1286534058478249 cmascsid:4905 entering mptctl_hp_hostinfo mptctl_id=6
1286534058478379 cmascsid:4905 mpt_put_msg_frame: 0xf7a0d000 cb_idx:6 req msg_context:0
1286534058478599 swapper:0 mptctl_reply: 0xf7a0d000 mptctl_id:6 ioctl msg_context:0 req msg_context:393266
1286534068477163 cmascsid:4905 entering mptctl_timeout_expired mptctl_id=6
1286534074062524 cmascsid:4905 leaving mptctl_timeout_expired
1286534074062588 cmascsid:4905 leaving mptctl_hp_hostinfo

The message context the mptctl_hp_hostinfo() function is waiting on is 0!  Later, when the command completes, we see a message context of 393266.

mptctl_hp_hostinfo() watches for a message context of 0 due to a memset used to initialize the message structure.  Only after zeroing out the entire structure is the MsgContext field referenced.  mptctl_reply() then watches for a message completion with a MsgContext of 0.  But the call to mpt_put_msg_frame() to send the message to the adapter restores the message's context ID.  Thus, the message's completion doesn't notify mptctl_hp_hostinfo().  When mptctl_hp_hostinfo() then times out, it calls mptctl_timeout_expired(), which then resets the adapter.
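
For illustration only (this is not driver code), the sequence described above can be modelled in a few lines of userspace C. The struct and function names are hypothetical stand-ins, and the step that puts the context back merely imitates what this report says mpt_put_msg_frame() does, so treat it as a sketch of the failure mode under those assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an MPT request frame. */
struct sketch_request {
    uint8_t  Function;
    uint8_t  Tool;
    uint32_t MsgContext;
};

/*
 * Models the unpatched flow.
 * Returns 1 if the reply would be matched to the waiting ioctl, 0 if not.
 */
int sketch_unpatched_reply_matches(uint32_t frame_context)
{
    struct sketch_request req;
    uint32_t waiting_for;
    uint32_t reply_context;

    /* The frame arrives with a context ID already assigned. */
    req.Function = 0;
    req.Tool = 0;
    req.MsgContext = frame_context;

    /* Zeroing the whole frame also wipes MsgContext. */
    memset(&req, 0, sizeof(req));

    /* The ioctl path records the (now zero) context to wait on. */
    waiting_for = req.MsgContext;

    /* Assumption: sending the frame puts the real context back. */
    req.MsgContext = frame_context;

    /* The reply carries the context that was actually sent. */
    reply_context = req.MsgContext;

    return waiting_for == reply_context;
}
```

Called with the context ID from the trace above, sketch_unpatched_reply_matches(393266) returns 0: the waiter is looking for 0 while the reply carries 393266, so the completion is never recognized and the timeout path runs.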


Version-Release number of selected component (if applicable):
kernel 2.6.18-194 and higher.

How reproducible:
The reset always occurs when the HP_GETHOSTINFO ioctl is called.

Expected results:
The HP_GETHOSTINFO ioctl should complete normally and not cause resets when the adapter is responding.

Additional info:

By itself, this issue will only cause a temporary I/O interruption.  However, the resets were triggering another bug, BZ640839, for the customer.

Comment 1 David Jeffery 2010-10-25 14:32:10 UTC
Created attachment 455550 [details]
Save and restore the MsgContext around the memset

The patch saves the MsgContext value before the memset call, and then restores it after the memset completes.  The code watching for the message's completion will then watch for the correct context ID instead of watching for 0.
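
A minimal userspace sketch of the save-and-restore idea follows. It is not the attached patch: the struct, the function name, and the local variable msgcontext are hypothetical stand-ins chosen for illustration, and the reply is assumed to carry the context ID the frame was originally given.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an MPT request frame. */
struct sketch_request {
    uint8_t  Function;
    uint8_t  Tool;
    uint32_t MsgContext;
};

/*
 * Models the patched flow.
 * Returns 1 if the reply would be matched to the waiting ioctl and the
 * other fields were still zeroed by the memset, 0 otherwise.
 */
int sketch_patched_reply_matches(uint32_t frame_context)
{
    struct sketch_request req;
    uint32_t msgcontext;
    uint32_t waiting_for;

    /* The frame arrives with stale contents and an assigned context ID. */
    req.Function = 0xff;
    req.Tool = 0xff;
    req.MsgContext = frame_context;

    /* Save the context, zero the frame, then restore the context. */
    msgcontext = req.MsgContext;
    memset(&req, 0, sizeof(req));
    req.MsgContext = msgcontext;

    /* The ioctl path now records the real context to wait on. */
    waiting_for = req.MsgContext;

    /* Assumption: the reply carries the originally assigned context. */
    return waiting_for == frame_context &&
           req.Function == 0 && req.Tool == 0;
}
```

Here sketch_patched_reply_matches(393266) returns 1: the frame is still cleared, but the waiter and the reply agree on the context ID.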

Comment 3 Tomas Henzl 2011-02-18 12:41:47 UTC
(In reply to comment #1)
> The patch saves the MsgContext value before the memset call, and then restores
> it after the memset completes.  The code watching for the message's completion
> will then watch for the correct context ID instead of watching for 0.


David,
thanks for the analysis and the patch. Are you going to post it upstream?

Comment 4 Tomas Henzl 2011-02-18 12:44:50 UTC
Hi Kashyap,
is this solution OK for you? If yes, would you care about upstream posting?
-tomash

Comment 5 kashyap 2011-02-18 14:27:43 UTC
I reviewed the patch. The logic is understood.

Can we do one optimization?

+	IstwiRWRequest->MsgContext = msgcontext;
 	IstwiRWRequest->Function = MPI_FUNCTION_TOOLBOX;
 	IstwiRWRequest->Tool = MPI_TOOLBOX_ISTWI_READ_WRITE_TOOL;
 	IstwiRWRequest->MsgContext = mpi_hdr->MsgContext;  <--We can remove this.

Do you agree? I will post this patch upstream.

Comment 9 Tomas Henzl 2011-02-23 12:07:29 UTC
Created attachment 480432 [details]
Save and restore the MsgContext around the memset - new version

(In reply to comment #5)
> I review the patch. Logic is understood.
> 
> Can we do one optimization ?
> 
> + IstwiRWRequest->MsgContext = msgcontext;
>   IstwiRWRequest->Function = MPI_FUNCTION_TOOLBOX;
>   IstwiRWRequest->Tool = MPI_TOOLBOX_ISTWI_READ_WRITE_TOOL;
>   IstwiRWRequest->MsgContext = mpi_hdr->MsgContext;  <--We can remove this.
> 
> Do you agree ? I will post this patch to upstream.

I agree, and I have asked David for review.
I think we can remove other lines too, as mpi_hdr is no longer used.

Comment 10 Tomas Henzl 2011-02-25 16:27:04 UTC
Kashyap,
please go ahead and post upstream whichever version you want and add me to cc; I'll then continue with the needed steps here.

Comment 11 Tomas Henzl 2011-03-10 14:34:37 UTC
Posted for internal review - a corresponding upstream post - http://www.spinics.net/lists/linux-scsi/msg50852.html

Comment 14 Jarod Wilson 2011-03-16 18:01:33 UTC
Patch(es) available in kernel-2.6.18-248.el5
Detailed testing feedback is always welcomed.

Comment 16 Martin Prpič 2011-04-14 10:14:50 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
A call to the HP_GETHOSTINFO ioctl (I/O Control) in the mptctl module could result in the MPT (Message Passing Technology) fusion driver being reset due to erroneous detection of completed ioctl commands. With this update, the message context sent to the mptctl module is stored (previously, it was zeroed). When an ioctl command completes, the saved message context is used to recognize the completion of the message, thus resolving the faulty detection.

Comment 17 Gris Ge 2011-05-30 09:55:54 UTC
I have tried to make an ioctl call to /dev/mptctl, but compilation failed with the error:
mptctl.h:104: error: expected ‘:’, ‘,’, ‘;’, ‘}’ or ‘__attribute__’ before ‘*’ token

Is there any sample code for testing HP_GETHOSTINFO?

Comment 18 Gris Ge 2011-05-31 03:33:18 UTC
Created attachment 501903 [details]
ioctl HP_GETHOSTINFO test code.

Test code attached.

On the -194 kernel, we get these dmesg messages:
===
mptctl: ioc0: WARNING - mptctl_hp_hostinfo: failed
mptbase: ioc0: Initiating recovery
===
which indicate a controller reset.

On the -264 kernel, no reset is performed.

Comment 19 Gris Ge 2011-05-31 03:34:16 UTC
VERIFY this bug.

Comment 20 Tomas Henzl 2011-05-31 12:07:46 UTC
(In reply to comment #17)
> I have tried to make a ioctl call to /dev/mptctl, but failed with error:
> mptctl.h:104: error: expected ‘:’, ‘,’, ‘;’, ‘}’ or ‘__attribute__’ before ‘*’
> token
Was this a compiler error?
> Any sample code for testing HP_GETHOSTINFO?
Sorry, I wanted to verify it on a mptsas box, so I'm now too late with my test code.

Comment 21 Gris Ge 2011-05-31 13:42:49 UTC
Tomas,
Thanks. I already solved that error by pulling all the related macros out of mptctl.h.

My test code is attached; I think comment #18 is sufficient to verify this bug.

If you have time, could you explain why I got the kernel warning "mptctl: ioc0: WARNING - mptctl_hp_hostinfo: failed"?

Comment 22 Tomas Henzl 2011-05-31 14:06:10 UTC
(In reply to comment #21)
> If you have time, you might explain why I got kernel warning: "mptctl: ioc0:
> WARNING - mptctl_hp_hostinfo: failed"
This indicates that a timeout happened in mptctl_hp_hostinfo(), which is what the patch is about:
(from comment #0)
> message to the adapter restores The message's context ID.  Thus, the message's
> completion doesn't notify mptctl_hp_hostinfo().  When mptctl_hp_hostinfo then
> times out, it calls mptctl_timeout_expired() which then resets the adapter.
By the way, the firmware version and some other information are read fine regardless of this timeout; the request that times out is used only to fill the rsvd field.

Comment 23 Gris Ge 2011-05-31 14:37:35 UTC
(In reply to comment #22)
> (In reply to comment #21)
> Btw. the fw and some other informations are read fine regardless of this
> timeout - it's used only to fill the rsvd field.
Yes. The test code confirmed that by reading hp_host_info_t. But the timeout warning is still there after patching. Is that by design, or is it my test code's fault?
(Sorry, I don't have time to check mptctl_hp_hostinfo().)

Comment 24 Tomas Henzl 2011-05-31 15:38:05 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > (In reply to comment #21)
> > Btw. the fw and some other informations are read fine regardless of this
> > timeout - it's used only to fill the rsvd field.
> Yes. The test code confirmed that by reading hp_host_info_t. But the timeout
> warning is still there after patched. That is by design or my test code's
> fault?
That shouldn't be there. I'm sure it wasn't in my tests: with the patch the program responded immediately; without it, the program took several seconds and the message appeared in dmesg.
I'll try it with your test program and will let you know.

Comment 25 Tomas Henzl 2011-05-31 15:51:36 UTC
(In reply to comment #24)
> > Yes. The test code confirmed that by reading hp_host_info_t. But the timeout
> > warning is still there after patched. That is by design or my test code's
> > fault?
> That shouldn't be there - I'm sure with my tests it wasn't, the program
> responded immediately with the patch, without it took several several seconds
> and there was the message in dmesg.
> I'll try it with your test program and will let you know.

I'm testing with your code right now: it returns immediately and there are no messages in dmesg.
That is with kernel 2.6.18-264; a timeout occurs with 2.6.18-238.
...
host_no 0
fw_version 01.27.00.00
serial_number

Comment 26 Gris Ge 2011-06-01 02:14:10 UTC
I will re-test on the server hp-ml310g5-01.rhts.eng.bos.redhat.com (currently used by someone else, I am waiting in the queue).
Will let you know.

Comment 27 Gris Ge 2011-06-02 02:11:12 UTC
As you said, the test code returns right after it is executed.
======
host_no 0
fw_version 01.23.34.00
serial_number P62190AGKUZ0AG
======

No timeout error on 2.6.18-264.el5 x86_64

OK to verify.

Comment 28 errata-xmlrpc 2011-07-21 10:00:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

