Description of problem:
With the mpt fusion driver update in 5.5, any call to the HP_GETHOSTINFO ioctl in the mptctl module results in the controller being reset. The newer driver changes the method used to detect the completion of ioctl commands. In the 5.5 driver, the variable ioc->ioctl_cmds.msg_context stores the context ID of the command message whose completion the ioctl is waiting for. But the command's completion isn't being detected.

A systemtap script was attached to log context IDs for commands sent and completed by the mptctl module. It produced the following output:

1286534058478249 cmascsid:4905 entering mptctl_hp_hostinfo mptctl_id=6
1286534058478379 cmascsid:4905 mpt_put_msg_frame: 0xf7a0d000 cb_idx:6 req msg_context:0
1286534058478599 swapper:0 mptctl_reply: 0xf7a0d000 mptctl_id:6 ioctl msg_context:0 req msg_context:393266
1286534068477163 cmascsid:4905 entering mptctl_timeout_expired mptctl_id=6
1286534074062524 cmascsid:4905 leaving mptctl_timeout_expired
1286534074062588 cmascsid:4905 leaving mptctl_hp_hostinfo

The message context the mptctl_hp_hostinfo() function is waiting on is 0! Later, when a command completes, we see a message context of 393266.

mptctl_hp_hostinfo() watches for a message context of 0 due to a memset that initializes the message structure. Only after zeroing out the entire structure is the MsgContext field referenced. mptctl_reply() then watches for a message completion with a MsgContext of 0. But the call to mpt_put_msg_frame() to send the message to the adapter restores the message's context ID. Thus, the message's completion doesn't notify mptctl_hp_hostinfo(). When mptctl_hp_hostinfo() then times out, it calls mptctl_timeout_expired(), which resets the adapter.

Version-Release number of selected component (if applicable):
kernel 2.6.18-194 and higher.

How reproducible:
The reset always occurs when the HP_GETHOSTINFO ioctl is called.
Expected results:
The HP_GETHOSTINFO ioctl should complete normally and not cause resets when the adapter is responding.

Additional info:
By itself, this issue will only cause a temporary I/O interruption. However, the resets were triggering the other bug BZ640839 for the customer.
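The mismatch described in this report can be modelled in a few lines of user-space C. This is only a sketch, not the driver code: struct model_frame, struct model_waiter, model_buggy_setup() and model_reply() are hypothetical names invented for illustration, and the model assumes the reply carries the context ID the frame had before it was zeroed (393266 in the systemtap log).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a request frame; only the context ID matters. */
struct model_frame {
    uint8_t  function;
    uint8_t  tool;
    uint32_t msg_context;   /* pre-assigned when the frame is obtained */
};

/* Hypothetical stand-in for the ioctl wait state (ioc->ioctl_cmds). */
struct model_waiter {
    uint32_t msg_context;   /* context ID the ioctl is waiting for */
    int      completed;     /* set to 1 when a matching reply arrives */
};

/*
 * Models the faulty setup: the memset wipes the pre-assigned context ID,
 * so the waiter records 0. Returns the value the waiter will watch for.
 */
static uint32_t model_buggy_setup(struct model_frame *f, struct model_waiter *w)
{
    assert(f != NULL && w != NULL);
    memset(f, 0, sizeof(*f));
    w->msg_context = f->msg_context;   /* already 0 at this point */
    w->completed = 0;
    return w->msg_context;
}

/* Models the reply path: completion is flagged only on a context match. */
static int model_reply(struct model_waiter *w, uint32_t reply_context)
{
    assert(w != NULL);
    if (w->msg_context == reply_context)
        w->completed = 1;
    return w->completed;
}
```

Calling model_buggy_setup() on a frame whose context is 393266 and then model_reply() with 393266 leaves completed at 0, which corresponds to the ioctl waiting until its timeout expires.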
Created attachment 455550
Save and restore the MsgContext around the memset

The patch saves the MsgContext value before the memset call, and then restores it after the memset completes. The code watching for the message's completion will then watch for the correct context ID instead of watching for 0.
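The save-and-restore idea can be sketched in isolation as follows. This is a user-space illustration under assumptions, not the attached patch: struct model_frame and zero_frame_keep_context() are hypothetical names, and the frame layout is reduced to three fields.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a request frame; not the real MPI layout. */
struct model_frame {
    uint8_t  function;
    uint8_t  tool;
    uint32_t msg_context;   /* pre-assigned when the frame is obtained */
};

/*
 * Zero the frame but keep its pre-assigned context ID:
 * save the value, clear everything, then put the value back.
 */
static void zero_frame_keep_context(struct model_frame *f)
{
    uint32_t saved;

    assert(f != NULL);
    saved = f->msg_context;        /* save before the memset */
    memset(f, 0, sizeof(*f));      /* clears every field, context included */
    f->msg_context = saved;        /* restore so the waiter sees the real ID */
}
```

After this call, any code that reads msg_context to decide which completion to wait for sees the original ID rather than 0, while all other fields start from a clean state.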
(In reply to comment #1)
> The patch saves the MsgContext value before the memset call, and then restores
> it after the memset completes. The code watching for the message's completion
> will then watch for the correct context ID instead of watching for 0.

David, thanks for the analysis and the patch. Are you going to post it upstream?
Hi Kashyap, is this solution OK for you? If yes, would you take care of posting it upstream? -tomash
I reviewed the patch. The logic is understood.

Can we do one optimization?

+ IstwiRWRequest->MsgContext = msgcontext;
  IstwiRWRequest->Function = MPI_FUNCTION_TOOLBOX;
  IstwiRWRequest->Tool = MPI_TOOLBOX_ISTWI_READ_WRITE_TOOL;
  IstwiRWRequest->MsgContext = mpi_hdr->MsgContext; <-- We can remove this.

Do you agree? I will post this patch upstream.
Created attachment 480432
Save and restore the MsgContext around the memset - new version

(In reply to comment #5)
> I review the patch. Logic is understood.
>
> Can we do one optimization ?
>
> + IstwiRWRequest->MsgContext = msgcontext;
> IstwiRWRequest->Function = MPI_FUNCTION_TOOLBOX;
> IstwiRWRequest->Tool = MPI_TOOLBOX_ISTWI_READ_WRITE_TOOL;
> IstwiRWRequest->MsgContext = mpi_hdr->MsgContext; <--We can remove this.
>
> Do you agree ? I will post this patch to upstream.

I agree, and I have asked David for review. I think we can remove other lines too, as mpi_hdr is no longer used.
Kashyap, please go ahead and post upstream whichever version you want, and add me to cc. I'll then continue with the needed steps here.
Posted for internal review - a corresponding upstream post - http://www.spinics.net/lists/linux-scsi/msg50852.html
Patch(es) available in kernel-2.6.18-248.el5 Detailed testing feedback is always welcomed.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: A call to the HP_GETHOSTINFO ioctl (I/O Control) in the mptctl module could result in the MPT (Message Passing Technology) fusion driver being reset due to erroneous detection of completed ioctl commands. With this update, the message context sent to the mptctl module is stored (previously, it was zeroed). When an ioctl command completes, the saved message context is used to recognize the completion of the message, thus resolving the faulty detection.
I tried to make an ioctl call to /dev/mptctl, but it failed with this error:

mptctl.h:104: error: expected ‘:’, ‘,’, ‘;’, ‘}’ or ‘__attribute__’ before ‘*’ token

Is there any sample code for testing HP_GETHOSTINFO?
Created attachment 501903
ioctl HP_GETHOSTINFO test code.

Test code attached. On the -194 kernel, we get these dmesg messages:
===
mptctl: ioc0: WARNING - mptctl_hp_hostinfo: failed
mptbase: ioc0: Initiating recovery
===
These indicate a controller reset. On the -264 kernel, no reset is performed.
VERIFY this bug.
(In reply to comment #17)
> I have tried to make a ioctl call to /dev/mptctl, but failed with error:
> mptctl.h:104: error: expected ‘:’, ‘,’, ‘;’, ‘}’ or ‘__attribute__’ before ‘*’
> token

This was a compiler error?

> Any sample code for testing HP_GETHOSTINFO?

Sorry, I wanted to verify it on a mptsas box, so I'm now too late with my test code.
Tomas, thanks. I already solved that error by pulling all the related macros out of mptctl.h. My test code is attached, and I think comment #18 is sufficient to verify this bug. If you have time, could you explain why I got the kernel warning "mptctl: ioc0: WARNING - mptctl_hp_hostinfo: failed"?
(In reply to comment #21)
> If you have time, you might explain why I got kernel warning: "mptctl: ioc0:
> WARNING - mptctl_hp_hostinfo: failed"

This indicates that a timeout happened in mptctl_hp_hostinfo, which is what the patch is about (from comment #0):

> message to the adapter restores The message's context ID. Thus, the message's
> completion doesn't notify mptctl_hp_hostinfo(). When mptctl_hp_hostinfo then
> times out, it calls mptctl_timeout_expired() which then resets the adapter.

Btw. the fw and some other information are read fine regardless of this timeout - it's used only to fill the rsvd field.
(In reply to comment #22)
> (In reply to comment #21)
> Btw. the fw and some other informations are read fine regardless of this
> timeout - it's used only to fill the rsvd field.

Yes. The test code confirmed that by reading hp_host_info_t. But the timeout warning is still there after patching. Is that by design, or is it my test code's fault? (Sorry, I don't have time to check mptctl_hp_hostinfo().)
(In reply to comment #23)
> (In reply to comment #22)
> > (In reply to comment #21)
> > Btw. the fw and some other informations are read fine regardless of this
> > timeout - it's used only to fill the rsvd field.
> Yes. The test code confirmed that by reading hp_host_info_t. But the timeout
> warning is still there after patched. That is by design or my test code's
> fault?

That shouldn't be there. I'm sure it wasn't in my tests: with the patch the program responded immediately; without it, it took several seconds and the message appeared in dmesg. I'll try it with your test program and will let you know.
(In reply to comment #24)
> > Yes. The test code confirmed that by reading hp_host_info_t. But the timeout
> > warning is still there after patched. That is by design or my test code's
> > fault?
> That shouldn't be there - I'm sure with my tests it wasn't, the program
> responded immediately with the patch, without it took several several seconds
> and there was the message in dmesg.
> I'll try it with your test program and will let you know.

I'm testing with your code right now - it returns immediately and there are no messages in dmesg. That is with kernel 2.6.18-264; a timeout occurs with 2.6.18-238.

...
host_no 0
fw_version 01.27.00.00
serial_number
I will re-test on the server hp-ml310g5-01.rhts.eng.bos.redhat.com (currently used by someone else, I am waiting in the queue). Will let you know.
Like you said, the test code returns right after being executed.
======
host_no 0
fw_version 01.23.34.00
serial_number P62190AGKUZ0AG
======
No timeout error on 2.6.18-264.el5 x86_64. OK to verify.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1065.html