Bug 513472 - cciss_scan00 doesn't stop during suspend/resume on intel machine hp-dl580g5-01.rhts.bos.redhat.com
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Tomas Henzl
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Duplicates: 513423
Depends On:
Blocks: 5.4, TechnicalNotes 525215 533192
 
Reported: 2009-07-23 19:34 UTC by Prarit Bhargava
Modified: 2013-01-10 08:00 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A change to the cciss driver in Red Hat Enterprise Linux 5.4 made it incompatible with the "echo disk > /sys/power/state" suspend-to-disk operation. Consequently, the system will not suspend properly, returning messages such as:
<screen>
Stopping tasks: ======================================================================
 stopping tasks timed out after 20 seconds (1 tasks remaining):
  cciss_scan00
Restarting tasks...<6> Strange, cciss_scan00 not stopped
 done
</screen>
Clone Of: 511211
Environment:
Last Closed: 2009-11-30 13:15:00 UTC
Target Upstream Version:
Embargoed:


Attachments
Try to freeze in the scan thread (293 bytes, patch)
2009-07-23 20:02 UTC, Matthew Garrett

Description Prarit Bhargava 2009-07-23 19:34:29 UTC
+++ This bug was initially created as a clone of Bug #511211 +++

Description of problem:
On the Intel machine hp-dl580g5-01.rhts.bos.redhat.com, some of the /sys/devices/system/cpu/cpu*/cpufreq directories disappear after suspend/resume.

Version-Release number of selected component (if applicable):
cpuspeed-1.2.1-8.el5

How reproducible:
always

Steps to Reproduce:
1. echo disk > /sys/power/state
2. resume the machine
3. ls /sys/devices/system/cpu/cpu*/cpufreq
  
Actual results:

/sys/devices/system/cpu/cpu0/cpufreq:
affected_cpus     ondemand                       scaling_cur_freq  scaling_max_freq
cpuinfo_max_freq  scaling_available_frequencies  scaling_driver    scaling_min_freq
cpuinfo_min_freq  scaling_available_governors    scaling_governor

/sys/devices/system/cpu/cpu12/cpufreq:
affected_cpus     ondemand                       scaling_cur_freq  scaling_max_freq
cpuinfo_max_freq  scaling_available_frequencies  scaling_driver    scaling_min_freq
cpuinfo_min_freq  scaling_available_governors    scaling_governor

/sys/devices/system/cpu/cpu1/cpufreq:
affected_cpus     ondemand                       scaling_cur_freq  scaling_max_freq
cpuinfo_max_freq  scaling_available_frequencies  scaling_driver    scaling_min_freq
cpuinfo_min_freq  scaling_available_governors    scaling_governor

/sys/devices/system/cpu/cpu2/cpufreq:
affected_cpus     ondemand                       scaling_cur_freq  scaling_max_freq
cpuinfo_max_freq  scaling_available_frequencies  scaling_driver    scaling_min_freq
cpuinfo_min_freq  scaling_available_governors    scaling_governor

/sys/devices/system/cpu/cpu3/cpufreq:
affected_cpus     ondemand                       scaling_cur_freq  scaling_max_freq
cpuinfo_max_freq  scaling_available_frequencies  scaling_driver    scaling_min_freq
cpuinfo_min_freq  scaling_available_governors    scaling_governor

/sys/devices/system/cpu/cpu4/cpufreq:
affected_cpus     ondemand                       scaling_cur_freq  scaling_max_freq
cpuinfo_max_freq  scaling_available_frequencies  scaling_driver    scaling_min_freq
cpuinfo_min_freq  scaling_available_governors    scaling_governor

/sys/devices/system/cpu/cpu8/cpufreq:
affected_cpus     ondemand                       scaling_cur_freq  scaling_max_freq
cpuinfo_max_freq  scaling_available_frequencies  scaling_driver    scaling_min_freq
cpuinfo_min_freq  scaling_available_governors    scaling_governor

Expected results:
All CPUs (cpu0 through cpu15) should list a cpufreq directory.

Additional info:
The AMD machine dell-per905-01.rhts.bos.redhat.com does not have this problem.
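
A quick way to see which CPUs lack a cpufreq directory, instead of reading the `ls` output by eye, might look like the sketch below. The function name `missing_cpufreq` and its optional root argument are illustrative assumptions and are not part of the original report; the root defaults to the sysfs path used in the steps above.

```shell
# Hypothetical helper (not from the report): print every cpuN directory
# under a sysfs-style root that has no cpufreq subdirectory.
missing_cpufreq() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*; do
        # If the glob matched nothing, the literal pattern is not a directory.
        [ -d "$dir" ] || continue
        # Report the CPU name (last path component) when cpufreq is absent.
        [ -d "$dir/cpufreq" ] || echo "${dir##*/}"
    done
}
```

Calling `missing_cpufreq` with no argument checks the live system; no output means that every CPU directory it found has a cpufreq entry.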

--- Additional comment from kzhang on 2009-07-14 05:03:48 EDT ---

Created an attachment (id=351564)
cpuinfo of rhts machine hp-dl580g5-01.rhts.bos.redhat.com

--- Additional comment from kzhang on 2009-07-14 05:05:45 EDT ---

The same thing happens with all of the following kernels that I installed:

kernel-2.6.18-128.el5
kernel-2.6.18-157.el5
kernel-2.6.18-141.el5

--- Additional comment from jwilson on 2009-07-14 09:19:03 EDT ---

This sounds more like a kernel issue than a cpuspeed package issue.

--- Additional comment from jwilson on 2009-07-14 09:19:39 EDT ---

...the heck? I swear I said "kernel", not "kdepim"... fixing...

--- Additional comment from prarit on 2009-07-22 09:01:34 EDT ---

>1.echo disk > /sys/power/state
>2.resume the machine

Kexin, what did you do to resume the machine?

Thanks,

P.

--- Additional comment from kzhang on 2009-07-22 22:30:02 EDT ---

click on the "reboot" button on the http://lab.rhts.bos.redhat.com/cgi-bin/rhts/system.cgi?id=1112
it is through power switch?

--- Additional comment from prarit on 2009-07-23 08:25:50 EDT ---

(In reply to comment #6)
> click on the "reboot" button on the
> http://lab.rhts.bos.redhat.com/cgi-bin/rhts/system.cgi?id=1112
> it is through power switch?  

Yes, I always thought that did a hard reset of the system.  I'll ask in #rhts ...

P.

--- Additional comment from prarit on 2009-07-23 13:29:57 EDT ---

Kexin,

The sequence is:

1. use console to access system console
2.

[root@hp-dl580g5-01 sys]# find ./ -name *cpufreq* > /tmp/before
[root@hp-dl580g5-01 sys]# cat /tmp/before 
./module/cpufreq_ondemand
./module/acpi_cpufreq
./module/cpufreq
./devices/system/cpu/cpu15/cpufreq
./devices/system/cpu/cpu14/cpufreq
./devices/system/cpu/cpu13/cpufreq
./devices/system/cpu/cpu12/cpufreq
./devices/system/cpu/cpu11/cpufreq
./devices/system/cpu/cpu10/cpufreq
./devices/system/cpu/cpu9/cpufreq
./devices/system/cpu/cpu8/cpufreq
./devices/system/cpu/cpu7/cpufreq
./devices/system/cpu/cpu6/cpufreq
./devices/system/cpu/cpu5/cpufreq
./devices/system/cpu/cpu4/cpufreq
./devices/system/cpu/cpu3/cpufreq
./devices/system/cpu/cpu2/cpufreq
./devices/system/cpu/cpu1/cpufreq
./devices/system/cpu/cpu0/cpufreq

3.  echo disk > /sys/power/state
4.  let system go to sleep
5.  resume by pressing a few keys on the system console
6.

[root@hp-dl580g5-01 sys]# find ./ -name *cpufreq*
./module/cpufreq_ondemand
./module/acpi_cpufreq
./module/cpufreq
./devices/system/cpu/cpu12/cpufreq
./devices/system/cpu/cpu8/cpufreq
./devices/system/cpu/cpu4/cpufreq
./devices/system/cpu/cpu3/cpufreq
./devices/system/cpu/cpu2/cpufreq
./devices/system/cpu/cpu1/cpufreq
./devices/system/cpu/cpu0/cpufreq

I'll have to put some code in to determine why the files didn't "return" after the resume.

P.
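
The before/after comparison above can be automated roughly as follows. This is a sketch only: the function name `lost_entries` is hypothetical, and it assumes the second `find` listing was also saved to a file (for example /tmp/after, which the session above does not actually create).

```shell
# Hypothetical helper (not from the report): print the lines that appear
# in the "before" listing but are missing from the "after" listing.
lost_entries() {
    before=$1
    after=$2
    b=$(mktemp) && a=$(mktemp) || return 1
    # comm requires sorted input; LC_ALL=C keeps sort and comm consistent.
    LC_ALL=C sort "$before" > "$b"
    LC_ALL=C sort "$after" > "$a"
    # -23 suppresses lines unique to the second file and lines common to both.
    LC_ALL=C comm -23 "$b" "$a"
    rm -f "$b" "$a"
}
```

With both listings saved, `lost_entries /tmp/before /tmp/after` would print only the cpufreq paths that did not come back. When saving the listings, quoting the pattern (`find ./ -name '*cpufreq*'`) avoids accidental expansion by the shell.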

--- Additional comment from prarit on 2009-07-23 13:40:53 EDT ---

Actually, it doesn't look like the system is entering suspend at all:

CPU6 is down
CPU 7 is now offline
CPU7 is down
CPU 8 is now offline
CPU8 is down
CPU 9 is now offline
CPU9 is down
CPU 10 is now offline
CPU10 is down
CPU 11 is now offline
CPU11 is down
CPU 12 is now offline
CPU12 is down
Breaking affinity for irq 59
CPU 13 is now offline
CPU13 is down
Breaking affinity for irq 14
CPU 14 is now offline
CPU14 is down
Breaking affinity for irq 75
CPU 15 is now offline
CPU15 is down
Stopping tasks: ======================================================================
 stopping tasks timed out after 20 seconds (1 tasks remaining):
  cciss_scan00
Restarting tasks...<6> Strange, cciss_scan00 not stopped
 done
Enabling non-boot CPUs ...
 
Initializing CPU#1
Intel(R) Xeon(R) CPU           E7340  @ 2.40GHz stepping 0b
CPU1 is up

P.
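
To pull the names of tasks that refused to freeze out of a saved console log, something like the sketch below may help. The function name `unfrozen_tasks` is hypothetical, and the pattern assumes the exact "Strange, <task> not stopped" wording shown in the log above; other kernel versions may word this message differently.

```shell
# Hypothetical helper (not from the report): print the task name from each
# "Strange, <task> not stopped" line. Reads the named files, or stdin if
# no file is given.
unfrozen_tasks() {
    sed -n 's/.*Strange, \(.*\) not stopped.*/\1/p' "$@"
}
```

For example, `unfrozen_tasks console.log` (with console.log standing in for wherever the console output was captured) would list one task name per matching line.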

--- Additional comment from prarit on 2009-07-23 13:47:37 EDT ---

Suspend works in -128.el5 (RHEL5.3 kernel).  I'm going to do a bit of debugging to figure out why the cciss_scan00 thread isn't stopping.

P.

Comment 1 Prarit Bhargava 2009-07-23 19:38:33 UTC
Binary search led me to this group of commits:



commit 1435a23505d04d799de2cf54eae5acb0fbe4961c
Author: Tomas Henzl <thenzl>
Date:   Wed Apr 22 15:19:04 2009 +0300

    [scsi] cciss: change in discovering memory bar
    
    Message-id: 49EF0B38.9030109
    O-Subject: [RHEL5.4 PATCH3/4] cciss change in discovering memory bar
    Bugzilla: 474392
    
    Upstream in e143858104e318263689c551543dfc3f186cea12
    
    HP: Change the way we discover the first memory BAR
    Add a method for discovering the first memory BAR.  All Smart Array
    controllers to date have always had the the memory BAR as the first BAR.
    A new controller to be released later this year breaks that model.

commit 7845b91861809f87a8443ad3b87bda698c362739
Author: Tomas Henzl <thenzl>
Date:   Wed Apr 22 15:18:58 2009 +0300

    [scsi] cciss: version change for RHEL-5.4
    
    Message-id: 49EF0B32.5080600
    O-Subject: [RHEL5.4 PATCH4/4] cciss version change
    Bugzilla: 474392
    
    cciss version change

commit f3ca70f7cae5a1604aad39125efce1c23c374c24
Author: Tomas Henzl <thenzl>
Date:   Wed Apr 22 15:18:54 2009 +0300

    [scsi] cciss: thread to detect config changes on MSA2012
    
    Message-id: 49EF0B2E.9020401
    O-Subject: [RHEL5.4 PATCH2/4] cciss Kernel thread to detect config changes o
    Bugzilla: 474392
    
    Upstream:
    0a9279cc7cbe726e995c44a1acae81d446775816
    
    HP:
    The MSA2012 cannot inform the driver of configuration changes since all
    management is out of band.  This is a departure from any storage we have
    supported in the past.  We need some way to detect changes on the topology
    so we implement this kernel thread.  In some instances there's nothing we
    can do from the driver (like LUN failure) so just print out a message.  In
    the case where logical volumes are added or deleted we call
    rebuild_lun_table to refresh the driver's view of the world.

commit 4d7769e7763e5a42e1a4e77e40e7904402930738
Author: Tomas Henzl <thenzl>
Date:   Wed Apr 22 15:18:50 2009 +0300

    [scsi] cciss: changes in config functions
    
    Message-id: 49EF0B2A.9090707
    O-Subject: [RHEL5.4 PATCH1/4] cciss changes in config functions
    Bugzilla: 474392
    RH-Acked-by: Mike Christie <mchristi>
    
    Upstream in
    6ae5ce8e8d4de666f31286808d2285aa6a50fa40
    a72da29b6cbc5cf918567f2a0d76df6871e94b01

    HP:
    This patch removes redundant code where ever logical volumes are added or
    removed. It adds 3 new functions that are called instead of having the same
    code spread throughout the driver. It also removes the cciss_getgeometry
    function.
    It also makes the rebuild_lun_table smart enough to not rip a logical
    volume out from under the OS. Without this fix if a customer is running
    hpacucli to monitor their storage the driver will blindly remove and re-add
    the disks whenever the utility calls the CCISS_REGNEWD ioctl. Unfortunately,
    both hpacucli and ACUXE call the ioctl repeatedly. Customers have reported
    IO coming to a standstill. Calling the ioctl is the problem, this patch is
    the fix.


FWIW this works in RHEL5.3, so this is a regression.

P.

Comment 2 Matthew Garrett 2009-07-23 20:02:36 UTC
Created attachment 354920 [details]
Try to freeze in the scan thread

Does this help?

Comment 3 Prarit Bhargava 2009-07-23 23:24:36 UTC
(In reply to comment #2)
> Created an attachment (id=354920) [details]
> Try to freeze in the scan thread
> 
> Does this help?  

Yup, now it works :)

P.

Comment 5 Tomas Henzl 2009-07-30 22:35:24 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > Created an attachment (id=354920) [details]
> > Try to freeze in the scan thread
> > 
> > Does this help?  
> 
> Yup, now it works :)
> 
> P.  

Prarit,
does this mean that this patch passed your QA and is good for 5.4?

Comment 6 Rob Evers 2009-07-31 12:32:54 UTC
Mike,

Can you take a look at the attached patch and let me know if you agree with this?  It is not in scsi-misc.  If you agree, would you push this upstream?

I didn't see any use of try_to_freeze in or under the scsi directory for rhel5.4.  I'm still looking at its use in other drivers.

This is to fix a regression in rhel5.4 and time is very short.

Thanks, Rob

Comment 7 Matthew Garrett 2009-07-31 13:08:41 UTC
try_to_freeze() is only required if there are still driver threads running after the suspend() method in the driver has been called. The bug arose in this case because the driver update for 5.4 added a thread to check for reconfiguration events.

Comment 9 Mike Miller (OS Dev) 2009-08-03 14:57:54 UTC
I think we need to add a suspend method to the driver. I also thought this had been deferred because suspending/sleeping/hibernation is not that much of an issue in server environments.

Comment 10 Rob Evers 2009-08-03 17:36:12 UTC
Although this patch appears to be correct for the thread in question, I was not able to get a dl585 system w/ cciss as the boot device to suspend and resume as I would have expected.

After applying the patch, and unloading qla2xxx.ko, I induced a suspend, which appeared to start off normally.  A short while after the suspend started, the host appeared to reboot.

Perhaps I did something wrong, or as you suggest Mike, there is a suspend method that needs to be implemented.

Rob

Comment 11 Tom Coughlan 2009-08-03 19:08:48 UTC
*** Bug 513423 has been marked as a duplicate of this bug. ***

Comment 12 Tom Coughlan 2009-08-03 19:48:03 UTC
At this stage in the 5.4 release, it seems best to defer this patch to 5.5. 

- The failure of kernel suspend is not considered a blocker issue for 5.4 on cciss systems. These systems tend to be servers where the kernel is not normally suspended.

- Although Prarit tested this successfully on two Intel-based HP machines, it did not resume properly on the AMD DL585 system. This requires more investigation. 

- It would be best to get this included upstream before it goes into RHEL. 

I have drafted a 5.4 release note. Please review. 

Tom

Comment 13 Tom Coughlan 2009-08-03 19:48:03 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
A change in the cciss driver in RHEL 5.4 makes it incompatible with the suspend-to-disk operation:

echo disk > /sys/power/state

The system will not suspend properly. Instead, messages such as the following will be displayed:

Stopping tasks:
==============================================================================
 stopping tasks timed out after 20 seconds (1 tasks remaining):
  cciss_scan00
Restarting tasks...<6> Strange, cciss_scan00 not stopped
 done

A solution to this problem is being investigated.

Comment 16 Tomas Henzl 2009-08-04 10:48:09 UTC
(In reply to comment #10)
> Although this patch appears to be correct for the thread in question, I was not
> able to get a dl585 system w/ cciss as the boot device to suspend and resume as
> I would have expected.
> 
> After applying the patch, and unloading qla2xxx.ko, I induced a suspend, which
> appeared to start off normally.  A short while after the suspend started, the
> host appeared to reboot.

Mike,
I tested this patch on a system with '5i' controller and was able to suspend/resume the machine, so for me the patch is OK. But with the test result above it really looks like we need a better patch.

Comment 18 Prarit Bhargava 2009-08-04 12:08:16 UTC
(In reply to comment #16)

> Mike,
> I tested this patch on a system with '5i' controller and was able to
> suspend/resume the machine, so for me the patch is OK. But with the test result
> above it really looks like we need a better patch.  

Tomas, keep in mind that aside from the cciss issue, there could be other issues that are problematic for suspend/resume on large systems.

P.

Comment 23 Ryan Lerch 2009-08-19 00:15:47 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,14 +1,9 @@
-A change in the cciss driver in RHEL 5.4 makes it incompatible with the suspend-to-disk operation:
-
-echo disk > /sys/power/state
-
-The system will not suspend properly. Instead, messages such as the following will be displayed:
-
+A change to the cciss driver in Red Hat Enterprise Linux 5.4 made it incompatible with the "echo disk > /sys/power/state" suspend-to-disk operation. Consequently, the system will not suspend properly, returning messages such as:
+<screen>
 Stopping tasks:
-==============================================================================
+======================================================================
  stopping tasks timed out after 20 seconds (1 tasks remaining):
   cciss_scan00
 Restarting tasks...<6> Strange, cciss_scan00 not stopped
  done
-
+</screen>
-A solution to this problem is being investigated.

Comment 24 RHEL Program Management 2009-09-25 17:41:03 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 25 Tomas Henzl 2009-10-14 14:00:35 UTC
Mike,
have you found a better solution for this, better than the patch posted here?

(In reply to comment #9)
> I think we need to add a suspend method to the driver. I also thought this had
> been deferred because suspending/sleeping/hibernation is not that much of an
> issue in server environments.

Comment 26 Tomas Henzl 2009-11-30 13:15:00 UTC
Support for suspend/resume is a low priority for the type of large servers that contain cciss. Mike confirmed this. I'm closing this as WONTFIX for now; this will probably be solved by a cciss update in 5.6.

