Bug 159869
Summary: | Diskdump fails through ipr driver | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | David Milburn <dmilburn> | ||||||
Component: | kernel | Assignee: | Nobuhiro Tachino <ntachino> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 4.0 | CC: | aimamura, ktokunag, tao, tburke | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | powerpc | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | RHSA-2006-0132 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2006-03-07 19:07:20 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 168429 | ||||||||
Attachments: |
|
Description
David Milburn
2005-06-08 18:30:25 UTC
Created attachment 115233 [details]
Patch to fix problem
Could you please explain how the patch fixes the problem? Thank you. The patch keeps the adapter from being reset before the dump is created. Customer believes that reenabling interrupts when resetting the adapter is leaving a window for other interrupts to come in, causing the system to hang. With this patch, ipr_eh_host_rest() returns without calling ipr_reset_reload() when crashdump_mode() is on. So I think the first part of patch is not required. Is it correct? Correct, I think they were cleaning up ipr_reset_reload(), removing the check for crashdump_mode() since this check was moved to ipr_eh_host_reset(), which as you stated, does not call ipr_reset_reload() when crashdump_mode() is on. Created attachment 116128 [details]
patch2.diff
I tried to modify the patch. The original patch does not wait for completion of
reset and I think waiting for completion is better if it is possible to do.
With this patch, ipr_eh_host_reset() calls ipr_reset_reload() and
ipr_reset_reload() calls spin_unlock() instead of spin_unlock_irq() if
crashdump_mode() is on. So the new code can wait for completion fo reset
without enabling interrupt. If the customer agree with this, could you please
request the customer to test the patch?
> The patch does not work. The ipr driver requires interrupts to be
> functioning in order for its host reset processing to work. The patch would
> result in a hang as well.
I'm afraid allowing interrupts during diskdump process makes the possibility of
failure increase because it also allows unexpected other interrupts. Is there
any way to detect the completion without allowing interrupts?
Please confirm your next comment is written in this BZ in public comment. If the
comment is private, I cannot read it.
Nobuhiro, here is the customer's response to your question: I agree that enabling interrupts during diskdump is not the right thing to do. It would be possible to make ipr's adapter reset routine work without needing interrupts, but to do so would be complex and, I think, unnecessary. The problem is not so much just providing a way to detect the completion of the adapter reset with interrupts disabled. The problem is that the process of resetting an ipr adapter such that it can accept new commands is a multi-stage process. At a high level, what ipr needs to do is to run BIST on the card, then wait for 2 minutes. During this time, if ipr were to attempt to talk to the adapter, it would result in an EEH error and the reset would fail. It must then write to a register and wait for an interrupt to occur. This interrupt could take as long as 5 minutes to occur (usually it takes around 30 seconds or less). Then it needs to send several commands to the adapter, all of which are timed and require interrupts in the current implementation. To implement what you are proposing would require somone calling ipr_poll repeatedly to make forward progress on the reset job and also would require a way for ipr to time events which could take as long as 5 minutes. There is no reason for diskdump to reset the ipr adapter before executing a dump. There should be no need to quiesce the host before the dump as other commands that may be completing should have no effect on the dump. If there is a desire to prevent other scsi ops that are in progress at the time the dump is started from having their done function from being called, it would be much simpler to accomplish this. To make ipr's reset job work without interrupts and without timer interrupts is complex and adds lots of potential for deadlock if not properly implemented, which is why the initial patch was suggested, which essentially no-ops the host reset when in diskdump. In my tests, diskdump through ipr worked with the patch, but did not work consistently without the patch. Brian says modifying the driver to do a reset with interrupts disabled is not trivial. Besides, he also says it's unnecessary to reset the adapter before the dump. Since he is the ipr driver maintainer, I give his opinion added weight. > There is no reason for diskdump to reset the ipr adapter before executing a
> dump. There should be no need to quiesce the host before the dump as other
> commands that may be completing should have no effect on the dump. If there is a
> desire to prevent other scsi ops that are in progress at the time the dump is
> started from having their done function from being called, it would be much
> simpler to accomplish this.
I'm not sure the ipr adapter has no need to reset the host. But given that it is
correct, ipr adapter can just return SUCCESS as the result of its host reset
handler when crashdump_mode() is set. And ipr adapter can do anything in its
quiesce handler to dispose of its scsi commands in progress. IMHO this is more
robust than allowing interrupts.
> Are you expecting a new patch from IBM that incorporates the suggestions in comment #22? Yes, I'm expecting a new patch. If host reset before diskdump is useless for ipr as you said, I recommend skipping to do host reset. I think that makes diskdump on ipr more robust because it does not open a window for interrupts. The revised patch which was based on the patch of comment#1 was posted. Got one ACK. committed in -22.25 An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html |