Bug 690561

Summary: Warn the user when dumping to certain SmartArray (hpsa/cciss) adapters
Product: Red Hat Enterprise Linux 6
Reporter: Cong Wang <amwang>
Component: system-config-kdump
Assignee: Roman Rakus <rrakus>
Status: CLOSED NOTABUG
QA Contact: Chao Ye <cye>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.0
CC: czhang, donhoover, rkhan, rvokal, thenzl, tsmetana, vgoyal, vincent
Target Milestone: rc
Keywords: Rebase
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-02-13 12:52:41 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Bug Depends On: 682239
Bug Blocks: 767187

Description Cong Wang 2011-03-24 12:33:34 EDT
Description of problem:
Some SmartArray (hpsa/cciss) adapters are known to be non-resettable in kdump, so we should not allow the user to dump to such devices. kexec-tools already detects such devices and emits an error in these cases; s-c-kdump should do the same.

Additional info:
See also Bug 674893.
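
A minimal sketch of the kind of check requested above, assuming the driver reports a per-controller flag whose text is "0" (not resettable) or "1" (resettable). The function name, the message wording and the treatment of a missing flag as "resettable" are hypothetical and are not taken from s-c-kdump or kexec-tools.

```python
from typing import Optional


def resettable_warning(device: str, resettable: Optional[str]) -> Optional[str]:
    """Return a warning string for a non-resettable dump target, else None.

    `resettable` is the raw text of a driver-provided flag ("0" or "1"),
    or None when the driver exposes no such flag. Treating a missing flag
    as "resettable" is an assumption made for this sketch.
    """
    if resettable is None:
        return None
    if resettable.strip() == "0":
        return ("Warning: %s is on a controller that cannot be reset in the "
                "kdump kernel; saving the vmcore to it may fail." % device)
    return None
```

A caller would show the returned string to the user when it is not None and otherwise continue silently.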
Comment 2 RHEL Product and Program Management 2011-03-24 13:08:10 EDT
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 4 Roman Rakus 2011-04-18 08:19:00 EDT
Is there any system I can test this on? If so, which one?
Comment 6 RHEL Product and Program Management 2011-07-05 20:06:29 EDT
Comment 7 Vincent S. Cojot 2011-08-09 09:28:55 EDT
Could we know which HBAs are deemed 'not resettable', please?
Comment 8 Tomas Henzl 2011-08-15 08:18:43 EDT
(In reply to comment #7)
> Could we know what HBA's are deemed 'not resettable', please?

There has been good progress in this area: almost all cards are now resettable via either a soft or a hard reset. Right now the non-resettable cards are '6400', '6400 EM' and '5i'.
On some cards firmware changes were needed, so please update your firmware to the latest version.
Comment 9 Don Hoover 2011-09-14 09:59:59 EDT
HP has a bulletin saying that they have fixed this in THEIR custom kernel
modules (drivers).

I DO NOT WANT to use any custom kernel modules and pollute my build...can we
try and see what fixes were made in their modules and backport them to the RHEL
distro kernel modules?



From HP DOC:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02758009&lang=en&cc=us&taskId=101&prodSeriesId=4268682&prodTypeId=3709945
----------------------
SUPPORT COMMUNICATION - CUSTOMER ADVISORY
Document ID: c02758009
Version: 2
Advisory: (Revision) HP Smart Array Controllers - CUSTOMER ACTION REQUIRED for
Certain HP Smart Array Controllers to Ensure Pending Writes to Storage Devices
Complete Properly if Attempting to Use the Linux Kdump Facility on ProLiant
Servers


Pending writes to storage devices may not complete properly as detailed below
if the kdump facility is used. Using the kdump facility could leave the server
in an unstable condition, which could potentially result in an inconsistent
filesystem state. By disregarding this notification, the customer accepts the
risk of incurring potential related errors.
The Linux kdump facility fails to execute properly when used on HP ProLiant
servers running Linux and configured with certain Smart Array controllers using
certain versions of the cciss device drivers.

Kdump is a facility for capturing a system memory image when a kernel panic
occurs. These memory images are useful for debugging. Kdump loads a special
kernel into a reserved memory area. When a panic occurs in the main kernel,
control is transferred to the kdump kernel. As the memory image dump process
begins, storage device drivers reset their associated controllers to clear all
pending I/O activity before starting the memory dump activity, including
pending writes to storage devices.

When kdump is executing on the affected Smart Array controllers, the reset
process does not work as expected, and commands that were issued to the
controller prior to the kdump kernel beginning execution may complete during
the kdump process, potentially disrupting the kdump kernel's I/O, and
potentially leading to a corrupt kdump image or inconsistent filesystem state.

This problem occurs on Smart Array controllers that do not support the PCI
power management reset method. In these cases, the needed reset never occurs.
In some cases, if no I/O was pending at the time of the system panic, the
missed reset may go unnoticed, and the kdump process may appear to work
normally.


Messages similar to the following may be displayed on some but not all
controllers having this problem:

<4>cciss 0000:19:08.0: Unable to successfully reset controller. . .

SCOPE
Any HP ProLiant server configured with any of the following HP Smart Array
controllers:

HP Smart Array P400 controller
HP Smart Array P400i controller
HP Smart Array P800 controller
HP Smart Array E500 controller
HP Smart Array P700m controller
HP Smart Array E200 controller
HP Smart Array E200i controller

And running any of the following versions of the HP Smart Array driver (cciss):

For Red Hat Enterprise Linux 5, Version 3.6.28-7 (or earlier)
For SUSE Linux Enterprise Server 10, Version 3.6.28-6 (or earlier)
For Red Hat Enterprise Linux 6, Version 4.6.28-6 (or earlier)
For SUSE Linux Enterprise Server 11, Version 4.6.28-6 (or earlier)

RESOLUTION
To ensure this issue does not occur, do not use the kdump facility to dump the
memory image on the affected Smart Array controllers when using the cciss
drivers listed in the Scope section.

To eliminate the potential for lost data if the kdump process is used on the
affected Smart Array controllers, use the following versions of the HP Smart
Array driver (cciss):

For Red Hat Enterprise Linux 5, Version 3.6.28-12 (or later)
For Red Hat Enterprise Linux 6, Version 4.6.28-12 (or later)

----------------------
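
The Red Hat version thresholds in the advisory above can be expressed as a small check. This is only a sketch: it assumes HP-style version strings with purely numeric fields (such as 3.6.28-12), and keying the thresholds by the driver's major version (3.x for RHEL 5, 4.x for RHEL 6) is my own simplification. The names `classify` and `parse_version` are hypothetical. Versions between the "affected" and "fixed" thresholds are reported as "unknown", because the advisory does not say anything about them.

```python
import re
from typing import Tuple

# Thresholds copied from the advisory, keyed by the major version of the
# cciss driver (assumption: 3.x is the RHEL 5 line, 4.x the RHEL 6 line).
AFFECTED_UP_TO = {3: "3.6.28-7", 4: "4.6.28-6"}
FIXED_FROM = {3: "3.6.28-12", 4: "4.6.28-12"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.6.28-12' into (3, 6, 28, 12) so versions compare as tuples.

    Raises ValueError for strings with non-numeric fields.
    """
    return tuple(int(part) for part in re.split(r"[.-]", text.strip()))


def classify(version: str) -> str:
    """Return 'affected', 'fixed' or 'unknown' for a cciss driver version."""
    v = parse_version(version)
    major = v[0]
    if major not in FIXED_FROM:
        return "unknown"
    if v >= parse_version(FIXED_FROM[major]):
        return "fixed"
    if v <= parse_version(AFFECTED_UP_TO[major]):
        return "affected"
    return "unknown"
```

For example, `classify("4.6.28-6")` falls in the affected range for RHEL 6, while `classify("4.6.28-12")` is at the fixed threshold.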
Comment 10 Tomas Henzl 2011-09-14 11:24:59 EDT
(In reply to comment #9)
> HP has a bulletin saying that they have fixed this in THEIR custom kernel
> module(drivers).
> 
> I DO NOT WANT to use any custom kernel modules and pollute my build...can we
> try and see what fixes were made in their modules and backport them to the RHEL
> distro kernel modules?

It has been backported to RHEL 6.1; please check the resettable parameter on your system (updated firmware is also needed).
Only a very small set of cards is not resettable, see comment #8; those cards are not mentioned in the HP document.
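
One way to check the resettable parameter from a script, as a rough sketch: it assumes the driver exposes the flag as a file named `resettable` in some sysfs directory at or above the directory of the dump device. The exact location differs between drivers and is not spelled out in this bug, so the helper simply walks upward from a starting directory. `find_resettable` and its arguments are hypothetical names.

```python
import os
from typing import Optional


def find_resettable(start_dir: str, stop_dir: str = "/") -> Optional[str]:
    """Walk from start_dir up to stop_dir looking for a file named
    'resettable'; return its stripped contents, or None if none is found."""
    current = os.path.abspath(start_dir)
    stop = os.path.abspath(stop_dir)
    while True:
        candidate = os.path.join(current, "resettable")
        if os.path.isfile(candidate):
            with open(candidate) as handle:
                return handle.read().strip()
        parent = os.path.dirname(current)
        # Stop at the requested boundary or at the filesystem root.
        if current == stop or parent == current:
            return None
        current = parent
```

A returned value of "0" would indicate a controller that cannot be reset; None means no such file was found on the way up.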


> http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02758009&lang=en&cc=us&taskId=101&prodSeriesId=4268682&prodTypeId=3709945
...
> HP Smart Array P400 controller
> HP Smart Array P400i controller
> HP Smart Array P800 controller
> HP Smart Array E500 controller
> HP Smart Array P700m controller
> HP Smart Array E200 controller
> HP Smart Array E200i controller
Comment 11 Roman Rakus 2012-02-13 10:35:39 EST
So this bug can be closed? I'm not sure - is it fixed in firmware (or elsewhere)?
Comment 12 Tomas Henzl 2012-02-13 12:07:19 EST
(In reply to comment #11)
> So this bug can be closed? I'm not sure - is it fixed in firmware (or
> elsewhere)?
It is fixed in both the firmware and the driver.

There has been no response for several months, so I think you can close it.