Bug 513472
| Summary: | cciss_scan00 doesn't stop during suspend/resume on intel machine hp-dl580g5-01.rhts.bos.redhat.com | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Prarit Bhargava <prarit> |
| Component: | kernel | Assignee: | Tomas Henzl <thenzl> |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.4 | CC: | coughlan, dzickus, jarod, jfeeney, jtluka, mike.miller, prarit, revers, rlerch, syeghiay |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: |
A change to the cciss driver in Red Hat Enterprise Linux 5.4 made it incompatible with the "echo disk > /sys/power/state" suspend-to-disk operation. Consequently, the system will not suspend properly, returning messages such as:
<screen>
Stopping tasks:
======================================================================
stopping tasks timed out after 20 seconds (1 tasks remaining):
cciss_scan00
Restarting tasks...<6> Strange, cciss_scan00 not stopped
done
</screen>
|
Story Points: | --- |
| Clone Of: | 511211 | Environment: | |
| Last Closed: | 2009-11-30 13:15:00 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 513501, 525215, 533192 | | |
| Attachments: | | | |
Description
Prarit Bhargava
2009-07-23 19:34:29 UTC
Binary search led me to this group of commits:
commit 1435a23505d04d799de2cf54eae5acb0fbe4961c
Author: Tomas Henzl <thenzl>
Date: Wed Apr 22 15:19:04 2009 +0300
[scsi] cciss: change in discovering memory bar
Message-id: 49EF0B38.9030109
O-Subject: [RHEL5.4 PATCH3/4] cciss change in discovering memory bar
Bugzilla: 474392
Upstream in e143858104e318263689c551543dfc3f186cea12
HP: Change the way we discover the first memory BAR
Add a method for discovering the first memory BAR. All Smart Array
controllers to date have always had the memory BAR as the first BAR.
A new controller to be released later this year breaks that model.
commit 7845b91861809f87a8443ad3b87bda698c362739
Author: Tomas Henzl <thenzl>
Date: Wed Apr 22 15:18:58 2009 +0300
[scsi] cciss: version change for RHEL-5.4
Message-id: 49EF0B32.5080600
O-Subject: [RHEL5.4 PATCH4/4] cciss version change
Bugzilla: 474392
cciss version change
commit f3ca70f7cae5a1604aad39125efce1c23c374c24
Author: Tomas Henzl <thenzl>
Date: Wed Apr 22 15:18:54 2009 +0300
[scsi] cciss: thread to detect config changes on MSA2012
Message-id: 49EF0B2E.9020401
O-Subject: [RHEL5.4 PATCH2/4] cciss Kernel thread to detect config changes o
Bugzilla: 474392
Upstream:
0a9279cc7cbe726e995c44a1acae81d446775816
HP:
The MSA2012 cannot inform the driver of configuration changes since all
management is out of band. This is a departure from any storage we have
supported in the past. We need some way to detect changes on the topology
so we implement this kernel thread. In some instances there's nothing we
can do from the driver (like LUN failure) so just print out a message. In
the case where logical volumes are added or deleted we call
rebuild_lun_table to refresh the driver's view of the world.
commit 4d7769e7763e5a42e1a4e77e40e7904402930738
Author: Tomas Henzl <thenzl>
Date: Wed Apr 22 15:18:50 2009 +0300
[scsi] cciss: changes in config functions
Message-id: 49EF0B2A.9090707
O-Subject: [RHEL5.4 PATCH1/4] cciss changes in config functions
Bugzilla: 474392
RH-Acked-by: Mike Christie <mchristi>
Upstream in
6ae5ce8e8d4de666f31286808d2285aa6a50fa40
a72da29b6cbc5cf918567f2a0d76df6871e94b01
HP:
This patch removes redundant code wherever logical volumes are added or
removed. It adds 3 new functions that are called instead of having the same
code spread throughout the driver. It also removes the cciss_getgeometry
function.
It also makes the rebuild_lun_table smart enough to not rip a logical
volume out from under the OS. Without this fix if a customer is running
hpacucli to monitor their storage the driver will blindly remove and re-add
the disks whenever the utility calls the CCISS_REGNEWD ioctl. Unfortunately,
both hpacucli and ACUXE call the ioctl repeatedly. Customers have reported
IO coming to a standstill. Calling the ioctl is the problem, this patch is
the fix.
FWIW this works in RHEL5.3, so this is a regression.
P.
Created attachment 354920 [details]
Try to freeze in the scan thread
Does this help?
(In reply to comment #2)
> Created an attachment (id=354920) [details]
> Try to freeze in the scan thread
>
> Does this help?
Yup, now it works :)
P.

(In reply to comment #3)
> (In reply to comment #2)
> > Created an attachment (id=354920) [details]
> > Try to freeze in the scan thread
> >
> > Does this help?
>
> Yup, now it works :)
>
> P.
Prarit, does this mean that this patch passed your QA and is good for 5.4?

Mike,
Can you take a look at the attached patch and let me know if you agree with this? It is not in scsi-misc. If you agree, would you push this upstream?
I didn't see any use of try_to_freeze in or under the scsi directory for rhel5.4. I'm still looking at its use in other drivers.
This is to fix a regression in rhel5.4 and time is very short.
Thanks,
Rob

try_to_freeze() is only required if there are still driver threads running after the suspend() method in the driver has been called. The bug arose in this case because the driver update for 5.4 added a thread to check for reconfiguration events.
I think we need to add a suspend method to the driver. I also thought this had been deferred because suspending/sleeping/hibernation is not that much of an issue in server environments.

Although this patch appears to be correct for the thread in question, I was not able to get a dl585 system w/ cciss as the boot device to suspend and resume as I would have expected.
After applying the patch, and unloading qla2xxx.ko, I induced a suspend, which appeared to start off normally. A short while after the suspend started, the host appeared to reboot.
Perhaps I did something wrong, or as you suggest Mike, there is a suspend method that needs to be implemented.
Rob

*** Bug 513423 has been marked as a duplicate of this bug. ***

At this stage in the 5.4 release, it seems best to defer this patch to 5.5.
- The failure of kernel suspend is not considered a blocker issue for 5.4 on cciss systems. These systems tend to be servers where the kernel is not normally suspended.
- Although Prarit tested this successfully on two Intel-based HP machines, it did not resume properly on the AMD DL585 system. This requires more investigation.
- It would be best to get this included upstream before it goes into RHEL.
I have drafted a 5.4 release note. Please review.
Tom

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
A change in the cciss driver in RHEL 5.4 makes it incompatible with the suspend-to-disk operation:
echo disk > /sys/power/state
The system will not suspend properly. Instead, messages such as the following will be displayed:
Stopping tasks:
==============================================================================
stopping tasks timed out after 20 seconds (1 tasks remaining):
cciss_scan00
Restarting tasks...<6> Strange, cciss_scan00 not stopped
done
A solution to this problem is being investigated.

(In reply to comment #10)
> Although this patch appears to be correct for the thread in question, I was not
> able to get a dl585 system w/ cciss as the boot device to suspend and resume as
> I would have expected.
>
> After applying the patch, and unloading qla2xxx.ko, I induced a suspend, which
> appeared to start off normally. A short while after the suspend started, the
> host appeared to reboot.
Mike,
I tested this patch on a system with '5i' controller and was able to suspend/resume the machine, so for me the patch is OK. But with the test result above it really looks like we need a better patch.

(In reply to comment #16)
> Mike,
> I tested this patch on a system with '5i' controller and was able to
> suspend/resume the machine, so for me the patch is OK. But with the test result
> above it really looks like we need a better patch.
Tomas, keep in mind that aside from the cciss issue, there could be other issues that are problematic for suspend/resume on large systems.
P.

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1,14 +1,9 @@
-A change in the cciss driver in RHEL 5.4 makes it incompatible with the suspend-to-disk operation:
-
-echo disk > /sys/power/state
-
-The system will not suspend properly. Instead, messages such as the following will be displayed:
-
+A change to the cciss driver in Red Hat Enterprise Linux 5.4 made it incompatible with the "echo disk > /sys/power/state" suspend-to-disk operation. Consequently, the system will not suspend properly, returning messages such as:
+<screen>
 Stopping tasks:
-==============================================================================
+======================================================================
 stopping tasks timed out after 20 seconds (1 tasks remaining):
 cciss_scan00
 Restarting tasks...<6> Strange, cciss_scan00 not stopped
 done
-
+</screen>
-A solution to this problem is being investigated.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Mike, have you found a better solution for this, better than the patch posted here?

(In reply to comment #9)
> I think we need to add a suspend method to the driver. I also thought this had
> been deferred because suspending/sleeping/hibernation is not that much of an
> issue in server environments.
Support for suspend/resume is a low priority for the type of large servers that contain cciss. Mike confirmed this.

I'm closing this as won't fix now; this will probably be solved by a cciss update in 5.6.
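
For readers following the try_to_freeze() discussion in this thread, the failure mode can be illustrated with a toy model. This is a userspace Python sketch, not kernel code and not the contents of attachment 354920: the names Task, model_try_to_freeze, scan_step and freeze_tasks are all hypothetical, and the 20-iteration limit merely stands in for the 20-second timeout seen in the log. It assumes only what the comments state: the freezer asks every task to stop, and a thread that never checks for that request is reported as "not stopped".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Hypothetical stand-in for a kernel thread; not a kernel structure."""
    name: str
    checks_freeze: bool            # does the loop call the try_to_freeze() analog?
    freeze_requested: bool = False
    frozen: bool = False


def model_try_to_freeze(task: Task) -> bool:
    """Analog of try_to_freeze(): park the task if a freeze was requested."""
    if task.freeze_requested:
        task.frozen = True
    return task.frozen


def scan_step(task: Task) -> None:
    """One iteration of a periodic scan loop."""
    if task.frozen:
        return
    if task.checks_freeze and model_try_to_freeze(task):
        return
    # ... the periodic configuration check would run here ...


def freeze_tasks(tasks: List[Task], max_ticks: int = 20) -> List[str]:
    """Request a freeze, let every task run up to max_ticks iterations,
    and return the names of the tasks that never parked."""
    for task in tasks:
        task.freeze_requested = True
    for _ in range(max_ticks):
        for task in tasks:
            scan_step(task)
        if all(task.frozen for task in tasks):
            break
    return [task.name for task in tasks if not task.frozen]


if __name__ == "__main__":
    unpatched = Task("cciss_scan00", checks_freeze=False)
    patched = Task("cciss_scan00", checks_freeze=True)
    print("unpatched, still running:", freeze_tasks([unpatched]))
    print("patched, still running:", freeze_tasks([patched]))
```

In this model the unpatched task is still listed when the tick budget runs out, which mirrors the "1 tasks remaining" message, while the patched task parks on its first iteration. The model says nothing about the resume failure reported on the DL585, which the thread attributes to a possibly missing suspend method.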