Bug 602714
Summary: | megaraid_sas: fix physical disk handling | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Bryn M. Reeves <bmr> | ||||||
Component: | kernel | Assignee: | Tomas Henzl <thenzl> | ||||||
Status: | CLOSED DUPLICATE | QA Contact: | Storage QE <storage-qe> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 5.5 | CC: | andriusb, bdonahue, bo.yang, bubrown, coughlan, dhoward, jpirko, jwest, ltroan, martin.wilck, mchristi, moshiro, revers, tao, vgoyal | ||||||
Target Milestone: | rc | ||||||||
Target Release: | 5.6 | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | 577178 | Environment: | |||||||
Last Closed: | 2010-07-29 10:53:59 UTC | Type: | --- | ||||||
Description
Bryn M. Reeves
2010-06-10 14:46:18 UTC
This issue is already fixed in our latest driver (4.27 or 4.30). We have already submitted the patch to rhel5.6 and rhel6.0 (the one that comes with the online controller reset, OCR, added).

This is open as a separate bug because it has been reported on earlier releases; since we cannot include a wholesale driver update in EUS packages (targeted fixes only), it needs to be tracked separately.

Event posted on 06-16-2010 12:07pm JST by moshiro

Hi Bo,

Following is from Tokunaga-san, Fujitsu. Could you please kindly reply to him?

---
Looking at the patchset provided by LSI for 5.6 (bug 564249 Comment 17), it turns out it's not a whole version-up patchset but a minimized one for 5.6, thanks to Bo's efforts on this. So we suppose it won't be really difficult to identify the pieces for a 5.3 hotfix.

Bo,
Could you please identify the pieces for a 5.3 hotfix as soon as possible? One of Fujitsu's customers has requested a 5.3 hotfix and it is quite urgent.

Kei Tokunaga
---

This event sent from IssueTracker by moshiro
issue 1000913

Bo,

Please post the specific fix for the bug described in this BZ, backported from your latest driver (4.27 or 4.30), ready to go in 5.5.z, 5.4.z, and 5.3.z.

Tom

Created attachment 429116
add the online controller reset to the driver
1. Add the fix for the kernel panic that occurs if application cmds take too long (> 3 mins).
2. Add the online controller reset (OCR) to the driver (this is needed for fixing item 1).
A. For online controller reset, the driver adds the chip reset functions.
B. In the driver's ISR function, the driver receives the FW state change interrupt;
that interrupt plus FW in failed state triggers the driver to do the Online Controller Reset.
C. During the FW reset, the driver saves the pending cmds to an internal
queue and re-fires those cmds after the OCR has finished.
D. In the driver's ioctl routine, cmds from applications wait for the OCR
to finish before they are issued.
E. If the driver's timeout routine gets called during the OCR, the driver reports
a reset to the OS.
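Steps B to E of the patch description can be sketched as a small user-space model. This is only an illustration under stated assumptions: `AdapterModel` and all of its method names are hypothetical and do not correspond to functions in megaraid_sas, and the real driver works with interrupts, locks and firmware registers that are left out here.

```python
from collections import deque


class AdapterModel:
    """Toy model of OCR steps B-E; every name here is hypothetical."""

    def __init__(self):
        self.in_reset = False
        self.outstanding = []   # cmds currently handed to the "firmware"
        self.pending = deque()  # cmds held back while the reset runs
        self.fire_log = []      # every fire, including re-fires

    def _fire(self, cmd):
        self.outstanding.append(cmd)
        self.fire_log.append(cmd)

    def submit(self, cmd):
        # Step D: cmds arriving during OCR wait in the internal queue.
        if self.in_reset:
            self.pending.append(cmd)
        else:
            self._fire(cmd)

    def on_fw_state_change(self, fw_failed):
        # Step B: state-change interrupt plus failed FW starts the reset.
        if fw_failed and not self.in_reset:
            self.in_reset = True
            # Step C: save cmds the firmware never completed.
            self.pending.extend(self.outstanding)
            self.outstanding.clear()

    def on_timeout(self):
        # Step E: a timeout during OCR is reported as a reset.
        return "reset" if self.in_reset else "timed_out"

    def finish_reset(self):
        # Step C: re-fire the saved cmds in their original order.
        self.in_reset = False
        while self.pending:
            self._fire(self.pending.popleft())


if __name__ == "__main__":
    adapter = AdapterModel()
    adapter.submit("io1")
    adapter.on_fw_state_change(fw_failed=True)
    adapter.submit("ioctl1")
    print(adapter.on_timeout(), list(adapter.pending))
    adapter.finish_reset()
    print(adapter.outstanding)
```

The model keeps arrival order when re-firing, which is a simplifying assumption; the patch description only says that pending cmds are saved and re-fired after the OCR.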
(In reply to comment #10)

Bo,

These changes appear to be on top of other changes currently missing from the 5.3(z) version of the driver. For example, none of the megasas*skinny routines exist in the 5.3z driver. There are some other changes that indicate other missing patches are needed prior to applying these changes.

Is there a more comprehensive patch available?

bud brown

I am using rhel5.5 as the base code. If you want to apply it to rhel5.3 or rhel5.4, can you send me the base src by e-mail? I will then create the patches (which will be fast). Otherwise, I need to download the src for rhel5.3 and rhel5.4 to make the patches. Just send me megaraid_sas.c and megaraid_sas.h.

Created attachment 429157
submit the patch based on rhel5.3z
recreate the patch based on 5.3z
Event posted on 07-05-2010 02:14pm JST by moshiro

Hi,

Could you please answer the question from FJ?

---
From bug 602714:
> bo yang 2010-07-02 16:27:21 EDT
>
> Created an attachment (id=429157)
> submit the patch based on rhel5.3z
>
> recreate the patch based on 5.3z

We built a kernel with the patch applied and started a test on it. It's been about two days and no kernel panics have been seen. What we did was create 32 processes, each of which issued an ioctl continuously. We saw no adapter resets during the test.

However, there is still one thing we'd like to clarify. If a hardware failure happens with a megaraid_sas adapter, the adapter is likely unable to send a response back to the OS. In such a case, the OS has to detect it and notify the apps so that the apps can either retry the ioctl or initiate a fail-over process. But with the patch, it looks like there is no way for the OS to detect the failure. (Before, with wait_event_timeout(), the OS was able to detect a hardware failure.) Could you explain how to handle such a scenario after applying the patch?

We didn't do any normal IOs during the test. If we had, because some of the ioctls occasionally waited for a response for 20 minutes or more, the normal IOs would have slept as well due to the stacked ioctls. Would the common SCSI layer then detect it and initiate an adapter reset? If that's the case, we will do the test again with normal IOs.

> We have created a test package. Could you please verify it and give us your feedback?
>
> http://people.redhat.com/moshiro/1000913/

Thank you. We'll use it for testing from next time.
---

Best Regards,
M Oshiro

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by moshiro
issue 1000913

setting needinfo by a request from FJ.

setting needinfo - bo.yang.
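The concern about losing wait_event_timeout() can be illustrated with a user-space analogy. The sketch below is not kernel code: it uses Python's `threading.Event` to stand in for a wait queue, and `wait_for_reply` is a hypothetical helper. It shows only the general point that a bounded wait lets the caller notice a device that never answers, whereas an unbounded wait would block forever.

```python
import threading


def wait_for_reply(done, timeout=None):
    """Return True if the reply arrived, False if a bounded wait expired.

    With timeout=None the call blocks until the event is set, so a device
    that never answers is never reported to the caller.
    """
    return done.wait(timeout)


if __name__ == "__main__":
    # A "device" that answers after a short delay.
    healthy = threading.Event()
    threading.Timer(0.05, healthy.set).start()
    print("healthy device replied:", wait_for_reply(healthy, timeout=1.0))

    # A "device" that never answers: only the bounded wait returns.
    dead = threading.Event()
    print("dead device replied:", wait_for_reply(dead, timeout=0.1))
```

In this analogy the False return is the point at which an application could retry the ioctl or start a fail-over, which is the behaviour FJ asks about.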
From FJ:
=================================================================
Hi Bo,

We downloaded the test kernel Red Hat made, including your fix, and installed it on our machine. We ran the test again on that kernel, this time with a relatively large number of normal IOs running continuously. But the results are the same: we saw no adapter resets. (We modified the kernel to add an interface that triggers an adapter reset and tried it. An adapter reset took about 30 secs to complete, and no IO-related work could be performed during the reset. Also, a reset outputs some printk messages, so it's easy to notice when a reset runs.)

We had a chance to talk with our hardware team. They gave us some information about the OCR feature. They explained that OCR was originally introduced for relatively old adapters that have a bug in their chipset (BZ563083). They also informed us that MegaRAID SAS 8880EM2, which is installed in the customer's systems, doesn't have the bug, and therefore it disables the adapter reset feature in its firmware by default. That means even the megaraid_sas driver with the OCR feature won't initiate OCR on the customer systems.

That leads us to the question again: what happens when something goes wrong with the firmware or hardware on the adapter and a response to an ioctl never comes back? We thought OCR would save such a situation, but that might not be the case?

Kei Tokunaga
=================================================================

Bo-san, could you please kindly reply?

Another comment from FJ:

We've been doing some testing and investigation for a fix and have some questions. We'd like to obtain information/answers from LSI. (This list includes some questions Fujitsu asked before.)

1) We had a chance to talk with our hardware team. They gave us some information about the OCR feature. They explained that OCR was originally introduced for relatively old adapters that have a bug in their chipset (BZ563083).
They also informed us that MegaRAID SAS 8880EM2, which is installed in the customer's systems, doesn't have the bug, and therefore it disables the adapter reset feature in its firmware by default. That means even the megaraid_sas driver with the OCR feature won't initiate OCR on the customer systems. Is that true? This question breaks down into two technical questions.

1-1) megasas_wait_for_outstanding() checks instance->disableOnlineCtrlReset, which we believe is a firmware setting, to determine whether or not it should call megasas_do_ocr(). Is it set on MegaRAID SAS 8880EM2?

1-2) If instance->disableOnlineCtrlReset is set, what happens when we call megasas_do_ocr() directly? The firmware won't do a reset?

2) OCR will be initiated when one of the following conditions is met.
a) The SCSI common layer detects a timeout.
b) The firmware reports a failure (the firmware is in failure state) to the driver.
In what kind of situations will the firmware report a failure to the driver?

3) What is the scenario when a response to an ioctl never comes back due to a firmware/hardware failure?

4) Per our investigation, MegaRAID SAS 8880EM2 uses the ppc functions (not xscale or gen2). Per the source code, megasas_adp_reset_ppc() does nothing. Is adapter reset (OCR) not supported on ppc machines such as MegaRAID SAS 8880EM2?

---
Best Regards,
Moritoshi Oshiro

I was on vacation and have only now had the chance to see the questions:

>1) We had a chance to talk with our hardware team. They gave us
> an information about OCR feature. They explained that OCR was
> originally introduced for relatively old adapters that have a
> bug in their chipset (BZ563083). They also informed us that
> MegaRAID SAS 8880EM2, which is installed in the customer's
> systems, doesn't have the bug, and therefore it disables the
> adapter reset feature in its firmware by default. That means
> even the megaraid_sas driver with OCR feature won't initiate
> OCR anyways on the customer systems. Is that true?
> And, this question will be broke down to two technical questions.

In the driver and FW, OCR is implemented for the XScale and 2108 (Gen2) chips. For the skinny and PPC chips, FW does not support OCR for now.

>1-1) megasas_wait_for_outstanding() sees
> instance->disableOnlineCtrlReset, which we believe is
> firmware setting, to determine whether or not it should
> call megasas_do_ocr(). Is it set on MegaRAID SAS 8880EM2?

FW needs to set the flag to support OCR. MegaRAID SAS 8880EM2 uses a PPC chip, which does not have OCR support in FW.

> 1-2) If instance->disableOnlineCtrlReset is set, what happens
> when we call megasas_do_ocr() directly? The firmware
> won't do a reset?

If the flag is set and megasas_do_ocr is called, FW will do a reset.

>2) OCR will be initiated when one of the following conditions is
> met.
> a) SCSI common layer detects timeout.
> b) The firmware reports a failure (the firmware is in failure
> state) to the driver.

OCR will be initiated if:
a) FW has set the OCR flag, has generated the FW state change interrupt, and is in failed state (MFI state).
b) The SCSI layer detected a timeout, FW has set the OCR flag, and the driver detected FW in failed state (MFI state).

>3) What is the scenario when a response for ioctl never comes
> back due to a firmware/hardware failure?

For XScale and Gen2 chips, the driver will do OCR to try to bring the HW/FW back. ioctl cmds may take a long time to return. For the PPC chip, if FW/HW has failed, the controller will be killed.

>4) Per our investigation, MegaRAID SAS 8880EM2 uses ppc functions
> (not xscale, and gen2). Per the source code,
> megasas_adp_reset_ppc() does nothing. Adapter reset (OCR) is
> not supported on ppc machines such as MegaRAID SAS 8880EM2?

For MegaRAID SAS 8880EM2, OCR will not be supported by FW and driver.

For rhel 5.4z and rhel5.5z, does the customer want to apply this patch only to the MegaRAID SAS 8880EM2 controller (PPC), or to the other types of controllers (XScale and Gen2) as well?
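The initiation conditions given in the answers above can be condensed into a small decision function. This is a reading aid rather than driver logic: the `Chip` enum, the `next_action` function, the `fw_ocr_enabled` parameter and the returned labels are all hypothetical names, and the case of an OCR-capable chip whose firmware has not enabled OCR is left as "no_ocr" because the thread does not say what the driver does then.

```python
from enum import Enum


class Chip(Enum):
    XSCALE = "xscale"
    GEN2 = "gen2"      # 2108
    PPC = "ppc"        # e.g. MegaRAID SAS 8880EM2
    SKINNY = "skinny"


# Per the answers above, only these chip families have OCR in driver and FW.
OCR_CAPABLE = {Chip.XSCALE, Chip.GEN2}


def next_action(chip, fw_ocr_enabled, fw_failed, trigger):
    """Return a hypothetical label for what happens on a trigger.

    trigger is "fw_interrupt" (FW state change interrupt) or
    "scsi_timeout" (SCSI layer detected a timeout).
    """
    if trigger not in ("fw_interrupt", "scsi_timeout"):
        raise ValueError("unknown trigger: %r" % (trigger,))
    if not fw_failed:
        # Both conditions a) and b) require FW in failed state.
        return "none"
    if chip not in OCR_CAPABLE:
        # The thread states this outcome for PPC; applying it to
        # skinny as well is an assumption of this sketch.
        return "kill_adapter"
    if fw_ocr_enabled:
        return "ocr"
    # Not covered by the answers in this thread.
    return "no_ocr"


if __name__ == "__main__":
    print(next_action(Chip.GEN2, True, True, "fw_interrupt"))
    print(next_action(Chip.PPC, True, True, "scsi_timeout"))
```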
If it is only applied to MegaRAID SAS 8880EM2, we can create the minimal patch.

Thanks,
Bo Yang

Event posted on 07-21-2010 02:55pm JST by moshiro

Hi Bo-san,

FJ has replied to your last comment as below:

---
> For rhel 5.4z and rhel5.5z, does the customer only want to
> apply this patch to MegaRAID SAS 8880EM2 controller (PPC)
> or other type of controllers (XScale and Gen2) also will
> be applied?
>
> If it is only applied to MegaRAID SAS 8880EM2, we can create
> the minimal patch.

Some customers are using XScale and Gen2 controllers, so please include the OCR feature in the patch.

Kei Tokunaga
---

Thanks.

Best Regards,
M Oshiro

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by moshiro
issue 1000913

Event posted on 07-22-2010 12:24pm JST by moshiro

Hi Bo-san,

Following is from FJ:

---
>> For rhel 5.4z and rhel5.5z, does the customer only want to
>> apply this patch to MegaRAID SAS 8880EM2 controller (PPC)
>> or other type of controllers (XScale and Gen2) also will
>> be applied?
>>
>> If it is only applied to MegaRAID SAS 8880EM2, we can create
>> the minimal patch.
>
> Some customers are using XScale and Gen2 controllers, so please include the OCR feature in the patch.

Bo,

Reading your comments again, I noticed you only talked about 5.4.z and 5.5.z there, but not 4.9 or 4.8.z explicitly. Could you please add the OCR feature to a fix for 4.9 and 4.8.z as well?

Kei Tokunaga
---

Thanks.
Moritoshi

This event sent from IssueTracker by moshiro
issue 1000913

Hi Bo, sorry to be a pain, but this is hot - could you post a comment on comment#25?

Kei,

I am in China now and am traveling back. Just to let you know: to port OCR to 5.4z, 5.5z and 4.xz, we need to spend at least one and a half days on each (if no other interruptions come). I may start the porting after I am back in the office next Monday or Tuesday. rhel 4.xz will be delayed until 5.x is finished.

Bo Yang

Event posted on 2010-07-26 17:12 JST by myamazak

Hello Bo-san,

I've got a response from FJ.
----------------------------------------------------------------------
Fujitsu already confirmed the hotfix works fine.

Fujitsu is waiting for errata that are supposed to come out on 10th Aug.
If there are any concerns with the date, please let us know.
----------------------------------------------------------------------

Best regards,
M Yamazaki

This event sent from IssueTracker by myamazak
issue 1000913

(In reply to comment #32)
> ----------------------------------------------------------------------
> Fujitsu already confirmed the hotfix works fine.
>
> Fujitsu is waiting for errata that are supposed to come out on 10th Aug.
> If there are any concerns with the date, please let us know.
> ----------------------------------------------------------------------
>
> Best regards,
> M Yamazaki

Based on comment#25, I think Fujitsu still wants the OCR support for all controllers. Is it so?

Event posted on 2010-07-27 12:38 JST by myamazak

Hi all,

FJ gave us a summary of the status of this issue. If anyone has any comments, please let me know.

----------------------------------------------------------------------
Here is a summary of the status of this issue.

- This ticket has been used for 5.3hotfix and 5.3.z, and is now used for 5.5.z and 5.6 as well.
- 5.3hotfix was provided and Fujitsu confirmed it works fine.
- 5.3.z errata release is planned for 10th Aug.
- Fujitsu requested LSI to provide a fix patch with OCR for 5.5.z, 5.6, 4.8.z, and 4.9, and LSI acknowledged it. (4.8.z and 4.9 are handled on IT604473)

Please correct it if there is anything inaccurate.

Kei Tokunaga
----------------------------------------------------------------------

Regards,
M Yamazaki

This event sent from IssueTracker by myamazak
issue 1000913

Can Fujitsu confirm rhel5.3z works fine?

Also for rhel5.4z, rhel5.5z, rhel4.8z and rhel4.9z, we are waiting for the feedback from Fujitsu.
Thanks,
Bo Yang

The test kernel is posted on http://people.redhat.com/thenzl/602714/
If you want builds for other archs, please let me know.

Setting NEEDINFO=tmuneda per comment #40 and comment #41 above.

Event posted on 2010-07-29 09:47 JST by myamazak

Here is a response from FJ.

----------------------------------------------------------------------
Bo wrote:
> Can Fujitsu confirm rhel5.3z works fine?

We sure will.

> Also for rhel5.4z, rhel5.5z, rhel4.8z and rhel4.9z, we are
> waiting for the feedback from Fujitsu.

Do you mean you need to wait for feedback from Fujitsu on 5.3.z before starting development of 5.5.z, 5.4.z, 4.8.z and 4.9? Or do you mean you are looking for some other information?

Tomas wrote:
> The test kernel is posted on http://people.redhat.com/thenzl/602714/

Thank you for the packages. We will start testing. Just one thing to make sure: they are test packages of 5.3.z, right?

Kei Tokunaga
----------------------------------------------------------------------

This event sent from IssueTracker by myamazak
issue 1000913

This issue is fixed by driver update - bz#564249.

*** This bug has been marked as a duplicate of bug 564249 ***

(In reply to comment #43)
>
> Tomas wrote:
> > The test kernel is posted on http://people.redhat.com/thenzl/602714/
>
> Thank you for the packages. We will start testing. Just one thing to
> make sure: they are test packages of 5.3.z, right?

Yes, it's 5.3.z.

(In reply to comment #40)
> Can Fujitsu confirm rhel5.3z works fine?
>
> Also for rhel5.4z, rhel5.5z, rhel4.8z and rhel4.9z, we are waiting for the
> feedback from Fujitsu.
>

Bo, once the patch is accepted for 5.3.z, it must also go into 5.4.z and 5.5.z. Not having it there would create a regression. On the other hand, chances are good that we can use the 5.3.z patch for that.
For 5.4.z there is bz#619363; for 5.5.z, bz#619365.