+++ This bug was initially created as a clone of Bug #511211 +++
Description of problem:
On Intel machine hp-dl580g5-01.rhts.bos.redhat.com, after suspend/resume, some of the /sys/devices/system/cpu/cpu*/cpufreq files disappear.
Version-Release number of selected component (if applicable):
cpuspeed-1.2.1-8.el5
How reproducible:
always
Steps to Reproduce:
1. echo disk > /sys/power/state
2. resume the machine
3. ls /sys/devices/system/cpu/cpu*/cpufreq
Actual results:
/sys/devices/system/cpu/cpu0/cpufreq:
affected_cpus ondemand scaling_cur_freq scaling_max_freq cpuinfo_max_freq scaling_available_frequencies scaling_driver scaling_min_freq cpuinfo_min_freq scaling_available_governors scaling_governor
/sys/devices/system/cpu/cpu12/cpufreq:
affected_cpus ondemand scaling_cur_freq scaling_max_freq cpuinfo_max_freq scaling_available_frequencies scaling_driver scaling_min_freq cpuinfo_min_freq scaling_available_governors scaling_governor
/sys/devices/system/cpu/cpu1/cpufreq:
affected_cpus ondemand scaling_cur_freq scaling_max_freq cpuinfo_max_freq scaling_available_frequencies scaling_driver scaling_min_freq cpuinfo_min_freq scaling_available_governors scaling_governor
/sys/devices/system/cpu/cpu2/cpufreq:
affected_cpus ondemand scaling_cur_freq scaling_max_freq cpuinfo_max_freq scaling_available_frequencies scaling_driver scaling_min_freq cpuinfo_min_freq scaling_available_governors scaling_governor
/sys/devices/system/cpu/cpu3/cpufreq:
affected_cpus ondemand scaling_cur_freq scaling_max_freq cpuinfo_max_freq scaling_available_frequencies scaling_driver scaling_min_freq cpuinfo_min_freq scaling_available_governors scaling_governor
/sys/devices/system/cpu/cpu4/cpufreq:
affected_cpus ondemand scaling_cur_freq scaling_max_freq cpuinfo_max_freq scaling_available_frequencies scaling_driver scaling_min_freq cpuinfo_min_freq scaling_available_governors scaling_governor
/sys/devices/system/cpu/cpu8/cpufreq:
affected_cpus ondemand scaling_cur_freq scaling_max_freq cpuinfo_max_freq scaling_available_frequencies scaling_driver scaling_min_freq cpuinfo_min_freq scaling_available_governors scaling_governor
Expected results:
all cpus (cpu0~cpu15) should list cpufreq
Additional info:
The AMD machine dell-per905-01.rhts.bos.redhat.com does not have this problem.
--- Additional comment from kzhang on 2009-07-14 05:03:48 EDT ---
Created an attachment (id=351564)
cpuinfo of rhts machine hp-dl580g5-01.rhts.bos.redhat.com
--- Additional comment from kzhang on 2009-07-14 05:05:45 EDT ---
for all following kernels I installed, same thing happens.
kernel-2.6.18-128.el5
kernel-2.6.18-157.el5
kernel-2.6.18-141.el5
--- Additional comment from jwilson on 2009-07-14 09:19:03 EDT ---
This sounds more like a kernel issue than a cpuspeed package issue.
--- Additional comment from jwilson on 2009-07-14 09:19:39 EDT ---
...the heck? I swear I said "kernel", not "kdepim"... fixing...
--- Additional comment from prarit on 2009-07-22 09:01:34 EDT ---
>1.echo disk > /sys/power/state
>2.resume the machine
Kexin, what did you do to resume the machine?
Thanks, P.
--- Additional comment from kzhang on 2009-07-22 22:30:02 EDT ---
click on the "reboot" button on the http://lab.rhts.bos.redhat.com/cgi-bin/rhts/system.cgi?id=1112
it is through power switch?
--- Additional comment from prarit on 2009-07-23 08:25:50 EDT ---
(In reply to comment #6)
> click on the "reboot" button on the
> http://lab.rhts.bos.redhat.com/cgi-bin/rhts/system.cgi?id=1112
> it is through power switch?
Yes, I always thought that did a hard reset of the system. I'll ask in #rhts ...
P.
--- Additional comment from prarit on 2009-07-23 13:29:57 EDT ---
Kexin, The sequence is:
1. use console to access system console
2.
[root@hp-dl580g5-01 sys]# find ./ -name *cpufreq* > /tmp/before
[root@hp-dl580g5-01 sys]# cat /tmp/before
./module/cpufreq_ondemand
./module/acpi_cpufreq
./module/cpufreq
./devices/system/cpu/cpu15/cpufreq
./devices/system/cpu/cpu14/cpufreq
./devices/system/cpu/cpu13/cpufreq
./devices/system/cpu/cpu12/cpufreq
./devices/system/cpu/cpu11/cpufreq
./devices/system/cpu/cpu10/cpufreq
./devices/system/cpu/cpu9/cpufreq
./devices/system/cpu/cpu8/cpufreq
./devices/system/cpu/cpu7/cpufreq
./devices/system/cpu/cpu6/cpufreq
./devices/system/cpu/cpu5/cpufreq
./devices/system/cpu/cpu4/cpufreq
./devices/system/cpu/cpu3/cpufreq
./devices/system/cpu/cpu2/cpufreq
./devices/system/cpu/cpu1/cpufreq
./devices/system/cpu/cpu0/cpufreq
3. echo disk > /sys/power/state
4. let system go to sleep
5. resume by pressing a few keys on the system console
6. [root@hp-dl580g5-01 sys]# find ./ -name *cpufreq*
./module/cpufreq_ondemand
./module/acpi_cpufreq
./module/cpufreq
./devices/system/cpu/cpu12/cpufreq
./devices/system/cpu/cpu8/cpufreq
./devices/system/cpu/cpu4/cpufreq
./devices/system/cpu/cpu3/cpufreq
./devices/system/cpu/cpu2/cpufreq
./devices/system/cpu/cpu1/cpufreq
./devices/system/cpu/cpu0/cpufreq
I'll have to put some code in to determine why the files didn't "return" after the resume.
P.
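The before/after comparison in this comment can be scripted so the missing CPUs are listed directly. The following is a minimal sketch in Python, not something posted in this bug: the function names `cpus_with_cpufreq` and `missing_after_resume` are made up for illustration, and it assumes the sysfs layout shown in the `find` output (one `cpuN/cpufreq` directory per CPU under `/sys/devices/system/cpu`).

```python
import os
import re


def cpus_with_cpufreq(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return the sorted CPU numbers whose cpuN directory has a cpufreq subdirectory.

    The default path is the one quoted in this bug; pass another root to
    inspect a copy of the tree.
    """
    found = []
    for name in os.listdir(sysfs_cpu_root):
        match = re.fullmatch(r"cpu(\d+)", name)  # skips entries such as "cpuidle"
        if match and os.path.isdir(os.path.join(sysfs_cpu_root, name, "cpufreq")):
            found.append(int(match.group(1)))
    return sorted(found)


def missing_after_resume(before, after):
    """Return the CPU numbers present before the suspend attempt but absent after it."""
    return sorted(set(before) - set(after))
```

Calling `cpus_with_cpufreq()` once before step 3 and once after step 5, then passing both results to `missing_after_resume`, gives the lost CPUs. With the two listings in this comment (cpu0 to cpu15 before; cpu0 to cpu4, cpu8 and cpu12 after) the result is cpu5, 6, 7, 9, 10, 11, 13, 14 and 15.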
--- Additional comment from prarit on 2009-07-23 13:40:53 EDT ---
Actually, it doesn't look like the system is entering suspend at all:
CPU6 is down
CPU 7 is now offline
CPU7 is down
CPU 8 is now offline
CPU8 is down
CPU 9 is now offline
CPU9 is down
CPU 10 is now offline
CPU10 is down
CPU 11 is now offline
CPU11 is down
CPU 12 is now offline
CPU12 is down
Breaking affinity for irq 59
CPU 13 is now offline
CPU13 is down
Breaking affinity for irq 14
CPU 14 is now offline
CPU14 is down
Breaking affinity for irq 75
CPU 15 is now offline
CPU15 is down
Stopping tasks: ======================================================================
stopping tasks timed out after 20 seconds (1 tasks remaining):
cciss_scan00
Restarting tasks...<6> Strange, cciss_scan00 not stopped
done
Enabling non-boot CPUs ...
Initializing CPU#1
Intel(R) Xeon(R) CPU E7340 @ 2.40GHz stepping 0b
CPU1 is up
P.
--- Additional comment from prarit on 2009-07-23 13:47:37 EDT ---
Suspend works in -128.el5 (RHEL5.3 kernel). I'm going to do a bit of debugging to figure out why the cciss_scan00 thread isn't stopping.
P.
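When checking many console logs for this failure, the task that refused to freeze can be pulled out mechanically. This is a rough Python sketch, assuming only the two message shapes quoted in the comment above ("stopping tasks timed out after N seconds (M tasks remaining):" and "Strange, NAME not stopped"); the function name `unfrozen_tasks` is hypothetical, and other kernel versions may word these messages differently.

```python
import re

# Patterns are based solely on the console messages quoted in this bug.
TIMEOUT_RE = re.compile(
    r"stopping tasks timed out after (\d+) seconds \((\d+) tasks remaining\):"
)
NOT_STOPPED_RE = re.compile(r"Strange, (\S+) not stopped")


def unfrozen_tasks(log_text):
    """Return (timeout_seconds, [task names]) for a freezer timeout, else None."""
    timeout = TIMEOUT_RE.search(log_text)
    if timeout is None:
        return None  # no freezer timeout reported in this log
    names = NOT_STOPPED_RE.findall(log_text)
    return int(timeout.group(1)), names
```

Fed the console output above, this reports a 20 second timeout and the single task `cciss_scan00`. A log without the timeout message returns `None`.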
Binary search led me to this group of commits:
commit 1435a23505d04d799de2cf54eae5acb0fbe4961c
Author: Tomas Henzl <thenzl>
Date: Wed Apr 22 15:19:04 2009 +0300
[scsi] cciss: change in discovering memory bar
Message-id: 49EF0B38.9030109
O-Subject: [RHEL5.4 PATCH3/4] cciss change in discovering memory bar
Bugzilla: 474392
Upstream in e143858104e318263689c551543dfc3f186cea12
HP: Change the way we discover the first memory BAR Add a method for discovering the first memory BAR. All Smart Array controllers to date have always had the the memory BAR as the first BAR. A new controller to be released later this year breaks that model.
commit 7845b91861809f87a8443ad3b87bda698c362739
Author: Tomas Henzl <thenzl>
Date: Wed Apr 22 15:18:58 2009 +0300
[scsi] cciss: version change for RHEL-5.4
Message-id: 49EF0B32.5080600
O-Subject: [RHEL5.4 PATCH4/4] cciss version change
Bugzilla: 474392
cciss version change
commit f3ca70f7cae5a1604aad39125efce1c23c374c24
Author: Tomas Henzl <thenzl>
Date: Wed Apr 22 15:18:54 2009 +0300
[scsi] cciss: thread to detect config changes on MSA2012
Message-id: 49EF0B2E.9020401
O-Subject: [RHEL5.4 PATCH2/4] cciss Kernel thread to detect config changes o
Bugzilla: 474392
Upstream: 0a9279cc7cbe726e995c44a1acae81d446775816
HP: The MSA2012 cannot inform the driver of configuration changes since all management is out of band. This is a departure from any storage we have supported in the past. We need some way to detect changes on the topology so we implement this kernel thread. In some instances there's nothing we can do from the driver (like LUN failure) so just print out a message. In the case where logical volumes are added or deleted we call rebuild_lun_table to refresh the driver's view of the world.
commit 4d7769e7763e5a42e1a4e77e40e7904402930738
Author: Tomas Henzl <thenzl>
Date: Wed Apr 22 15:18:50 2009 +0300
[scsi] cciss: changes in config functions
Message-id: 49EF0B2A.9090707
O-Subject: [RHEL5.4 PATCH1/4] cciss changes in config functions
Bugzilla: 474392
RH-Acked-by: Mike Christie <mchristi>
Upstream in 6ae5ce8e8d4de666f31286808d2285aa6a50fa40 a72da29b6cbc5cf918567f2a0d76df6871e94b01
HP: This patch removes redundant code where ever logical volumes are added or removed. It adds 3 new functions that are called instead of having the same code spread throughout the driver. It also removes the cciss_getgeometry function. It also makes the rebuild_lun_table smart enough to not rip a logical volume out from under the OS. Without this fix if a customer is running hpacucli to monitor their storage the driver will blindly remove and re-add the disks whenever the utility calls the CCISS_REGNEWD ioctl. Unfortunately, both hpacucli and ACUXE call the ioctl repeatedly. Customers have reported IO coming to a standstill. Calling the ioctl is the problem, this patch is the fix.
FWIW this works in RHEL5.3, so this is a regression.
P.
Created attachment 354920 [details] Try to freeze in the scan thread Does this help?
(In reply to comment #2) > Created an attachment (id=354920) [details] > Try to freeze in the scan thread > > Does this help? Yup, now it works :) P.
(In reply to comment #3) > (In reply to comment #2) > > Created an attachment (id=354920) [details] > > Try to freeze in the scan thread > > > > Does this help? > > Yup, now it works :) > > P. Prarit, does this mean that this patch passed your QA and is good for 5.4?
Mike, Can you take a look at the attached patch and let me know if you agree with this? It is not in scsi-misc. If you agree, would you push this upstream? I didn't see any use of try_to_freeze in or under the scsi directory for rhel5.4. I'm still looking at its use in other drivers. This is to fix a regression in rhel5.4 and time is very short. Thanks, Rob
try_to_freeze() is only required if there are still driver threads running after the suspend() method in the driver has been called. The bug arose in this case because the driver update for 5.4 added a thread to check for reconfiguration events.
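To illustrate why a thread that never reaches a freeze point stalls the "Stopping tasks" phase, here is a userspace analogy in Python. It is not kernel code and it is not the attached patch (the contents of attachment 354920 are not shown in this thread). The `Freezer` class, `scan_loop`, `run` and the thread name "scan00" are all made up for the illustration, and the real kernel freezer works differently in detail.

```python
import threading
import time


class Freezer:
    """Toy stand-in for the kernel freezer (illustration only)."""

    def __init__(self):
        self.freezing = threading.Event()  # set while a freeze is requested
        self.thawed = threading.Event()    # set when parked workers may resume
        self.frozen = set()
        self.lock = threading.Lock()

    def try_to_freeze(self, name):
        """Called by a worker at a safe point; parks it while a freeze is pending."""
        if not self.freezing.is_set():
            return False
        with self.lock:
            self.frozen.add(name)
        self.thawed.wait()
        with self.lock:
            self.frozen.discard(name)
        return True

    def freeze_all(self, names, timeout):
        """Request a freeze; return the names still running when the timeout expires."""
        self.thawed.clear()
        self.freezing.set()
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self.lock:
                if not set(names) - self.frozen:
                    return []
            time.sleep(0.01)
        with self.lock:
            return sorted(set(names) - self.frozen)

    def thaw_all(self):
        self.freezing.clear()
        self.thawed.set()


def scan_loop(freezer, name, stop, cooperative):
    """Periodic worker; only the cooperative variant ever offers to freeze."""
    while not stop.is_set():
        if cooperative:
            freezer.try_to_freeze(name)
        time.sleep(0.01)  # stands in for one scan pass


def run(cooperative, timeout=0.3):
    """Start one worker, attempt a freeze, and return the tasks that did not stop."""
    freezer, stop = Freezer(), threading.Event()
    worker = threading.Thread(
        target=scan_loop, args=(freezer, "scan00", stop, cooperative)
    )
    worker.start()
    stuck = freezer.freeze_all(["scan00"], timeout)
    stop.set()
    freezer.thaw_all()
    worker.join()
    return stuck
```

In this model `run(cooperative=True)` freezes cleanly, while `run(cooperative=False)` times out with "scan00" still running, which mirrors the "1 tasks remaining" message earlier in this bug.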
I think we need to add a suspend method to the driver. I also thought this had been deferred because suspending/sleeping/hibernation is not that much of an issue in server environments.
Although this patch appears to be correct for the thread in question, I was not able to get a dl585 system w/ cciss as the boot device to suspend and resume as I would have expected. After applying the patch, and unloading qla2xxx.ko, I induced a suspend, which appeared to start off normally. A short while after the suspend started, the host appeared to reboot. Perhaps I did something wrong, or as you suggest Mike, there is a suspend method that needs to be implemented. Rob
*** Bug 513423 has been marked as a duplicate of this bug. ***
At this stage in the 5.4 release, it seems best to defer this patch to 5.5.
- The failure of kernel suspend is not considered a blocker issue for 5.4 on cciss systems. These systems tend to be servers where the kernel is not normally suspended.
- Although Prarit tested this successfully on two Intel-based HP machines, it did not resume properly on the AMD DL585 system. This requires more investigation.
- It would be best to get this included upstream before it goes into RHEL.
I have drafted a 5.4 release note. Please review.
Tom
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
A change in the cciss driver in RHEL 5.4 makes it incompatible with the suspend-to-disk operation:

echo disk > /sys/power/state

The system will not suspend properly. Instead, messages such as the following will be displayed:

Stopping tasks:
==============================================================================
stopping tasks timed out after 20 seconds (1 tasks remaining):
cciss_scan00
Restarting tasks...<6> Strange, cciss_scan00 not stopped
done

A solution to this problem is being investigated.
(In reply to comment #10) > Although this patch appears to be correct for the thread in question, I was not > able to get a dl585 system w/ cciss as the boot device to suspend and resume as > I would have expected. > > After applying the patch, and unloading qla2xxx.ko, I induced a suspend, which > appeared to start off normally. A short while after the suspend started, the > host appeared to reboot. Mike, I tested this patch on a system with '5i' controller and was able to suspend/resume the machine, so for me the patch is OK. But with the test result above it really looks like we need a better patch.
(In reply to comment #16) > Mike, > I tested this patch on a system with '5i' controller and was able to > suspend/resume the machine, so for me the patch is OK. But with the test result > above it really looks like we need a better patch. Tomas, keep in mind that aside from the cciss issue, there could be other issues that are problematic for suspend/resume on large systems. P.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1,14 +1,9 @@
-A change in the cciss driver in RHEL 5.4 makes it incompatible with the suspend-to-disk operation:
-
-echo disk > /sys/power/state
-
-The system will not suspend properly. Instead, messages such as the following will be displayed:
-
+A change to the cciss driver in Red Hat Enterprise Linux 5.4 made it incompatible with the "echo disk > /sys/power/state" suspend-to-disk operation. Consequently, the system will not suspend properly, returning messages such as:
+<screen>
 Stopping tasks:
-==============================================================================
+======================================================================
 stopping tasks timed out after 20 seconds (1 tasks remaining):
 cciss_scan00
 Restarting tasks...<6> Strange, cciss_scan00 not stopped
 done
-
+</screen>
-A solution to this problem is being investigated.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Mike, have you found a better solution for this, better than the patch posted here? (In reply to comment #9) > I think we need to add a suspend method to the driver. I also thought this had > been deferred because suspending/sleeping/hibernation is not that much of an > issue in server environments.
Support for suspend/resume is a low priority for the type of large servers that contain cciss. Mike confirmed this. I'm closing this as won't fix for now; this will probably be solved by a cciss update in 5.6.