Bug 722780 - RHEL 5.6: cciss driver giving "CHECK CONDITION sense key = 0x3" errors
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: x86_64 Linux
Priority: medium, Severity: medium
Target Milestone: rc
Assigned To: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
Reported: 2011-07-17 14:05 EDT by Charles Butterfield
Modified: 2015-03-30 08:53 EDT
Doc Type: Bug Fix
Last Closed: 2014-06-03 08:48:13 EDT
Description Charles Butterfield 2011-07-17 14:05:05 EDT
Description of problem:

While running disk performance benchmarks on several HP DL165 rackmount servers, the kernel starts logging the same message over and over, and eventually the system has to be power-cycled (it is no longer possible to log in).  The error message is:

cciss cmd ffff810037e00000 had CHECK CONDITION sense key = 0x3

Each machine has an "HP Smart Array P400" controller with 4 1-TB drives, with the hardware RAID controller just presenting them as 4 separate drives (i.e. no hardware RAID).  So far I have only seen the problem when testing software RAID with /dev/md, but the problem is sporadic so I don't know if the "md" software is part of the problem.


Version-Release number of selected component (if applicable):

Kernel = 2.6.18-238.el5

How reproducible: Sporadic

Steps to Reproduce:
1.  Generate sustained I/O load on /dev/md0, e.g.

# below is conceptual, actual script had more noise in it
<create RAID on /dev/md0, varying level, numdevices, stripes, chunksize>
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt
while true; do
    # dd prints its "copied" summary on stderr, so merge it into stdout for grep
    dd if=/dev/zero of=/mnt/foo bs=100M count=140 2>&1 | grep copied
done
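
A slightly more concrete sketch of the array-creation step, for anyone trying to reproduce this. It only prints the mdadm command for one configuration rather than running it. The helper name md_create_cmd and the member device names are assumptions and are not taken from the original script; the mdadm options used (--create, --level, --chunk, --raid-devices) are the standard ones.

```shell
# Hypothetical helper: print (do not run) the mdadm command for one test
# configuration. Usage: md_create_cmd LEVEL CHUNK_KB DEVICE...
md_create_cmd() {
    local level=$1 chunk=$2
    shift 2
    # After the shift, $# is the number of member devices and $* lists them.
    echo "mdadm --create /dev/md0 --level=$level --chunk=$chunk --raid-devices=$# $*"
}

# Member names below are assumptions; cciss logical drives normally show up
# as /dev/cciss/cXdY, but the actual numbering on the test machines is unknown.
md_create_cmd 0 64 /dev/cciss/c0d1 /dev/cciss/c0d2 /dev/cciss/c0d3
```

Varying the level, chunk size and device list in a loop and feeding each printed command to a shell would approximate the "varying level, numdevices, stripes, chunksize" part of the original script.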

Actual results:

Usually no error message, but sometimes a flood of error messages and a hung system.

Expected results:
No error messages and one line for each dd invocation


Additional info:
Comment 1 Anders 2011-10-07 04:21:45 EDT
Not sure if this will help, but "sense key = 0x3" is the SCSI sense key for MEDIUM ERROR. I've also seen suggestions that BIOS/firmware upgrades help, so that may be worth a try.

/Anders
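
For reference, the sense key printed by the driver can be decoded with a small lookup like the one below. The function name is made up for illustration; the names follow the standard SCSI sense key table.

```shell
# Map a SCSI sense key (hex or decimal, e.g. 0x3) to its standard name.
sense_key_name() {
    case $(( $1 )) in
        0)  echo "NO SENSE" ;;
        1)  echo "RECOVERED ERROR" ;;
        2)  echo "NOT READY" ;;
        3)  echo "MEDIUM ERROR" ;;
        4)  echo "HARDWARE ERROR" ;;
        5)  echo "ILLEGAL REQUEST" ;;
        6)  echo "UNIT ATTENTION" ;;
        7)  echo "DATA PROTECT" ;;
        8)  echo "BLANK CHECK" ;;
        9)  echo "VENDOR SPECIFIC" ;;
        10) echo "COPY ABORTED" ;;
        11) echo "ABORTED COMMAND" ;;
        13) echo "VOLUME OVERFLOW" ;;
        14) echo "MISCOMPARE" ;;
        *)  echo "RESERVED/UNKNOWN" ;;
    esac
}

sense_key_name 0x3   # the value reported in this bug
```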
Comment 2 Simon J Mudd 2012-08-06 07:10:35 EDT
Also seen on similar hardware on a MySQL database server:
ProLiant BL460c G7, Processors: 24, Memory: 188.92 GB

Smart Array P410i in Slot 0 (Embedded)
   array A (Solid State SAS, Unused Space: 0 MB)


      logicaldrive 1 (745.2 GB, RAID 0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SAS, 400 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SAS, 400 GB, OK)

   SEP (Vendor ID PMCSIERA, Model  SRC 8x6G) 250 (WWID: XXXXXXXXXXXXXXXX)

The application server (mysqld 5.6.5) reported this and eventually crashed:

InnoDB: Error: tried to read 16384 bytes at offset 992739328
InnoDB: Was only able to read 8192.
120804  3:04:51  InnoDB: Operating system error number 11 in a file operation.
InnoDB: Error number 11 means 'Resource temporarily unavailable'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
InnoDB: Error: tried to read 16384 bytes at offset 992739328
InnoDB: Was only able to read 8192.
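
One way to check whether the same page can be read back outside of mysqld is sketched below. It reads a single 16 KiB page (the 16384 bytes in the log) at a page-aligned byte offset and prints how many bytes came back. The function name and the path in the usage comment are placeholders, and the read goes through the page cache, so this is only a rough check and may not match how InnoDB issues its I/O.

```shell
# Read one 16 KiB page at a page-aligned byte offset and print the number
# of bytes actually returned. A short count means EOF or a failed read.
# Usage: read_page_len FILE BYTE_OFFSET
read_page_len() {
    local file=$1 offset=$2 page=16384
    dd if="$file" bs="$page" skip=$(( offset / page )) count=1 2>/dev/null \
        | wc -c | tr -d ' '
}

# Example (placeholder path; offset taken from the InnoDB message above):
# read_page_len /path/to/datafile 992739328
```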

My firmware version is:

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: HP
        Version: I27
        Release Date: 05/05/2011
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
                ...
        Firmware Revision: 1.26

Still investigating for other info.

Kernel is the same (this is running CentOS not RHEL but symptoms look identical):

> select * from v_host_rpm where hostname like 'xxxxxxxxxxxxxxx.%' and uninstall_date is null and name ='kernel';
+-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
| hostname_id | hostname        | rpm_id | name   | version | rpm_release | arch   | last_check          | install_date        | uninstall_date |
+-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
|      312417 | xxxxxxxxxxxxxxx |   4757 | kernel | 2.6.18  | 238.el5     | x86_64 | 2012-08-06 02:44:03 | 2012-02-28 12:49:44 | NULL           |
+-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
1 row in set (0.01 sec)
Comment 3 Simon J Mudd 2012-08-06 07:13:55 EDT
I should have said: I am using LVM with an xfs filesystem.
Comment 4 RHEL Product and Program Management 2012-10-30 01:56:14 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 5 Simon J Mudd 2013-01-17 02:23:34 EST
Note: Also seen on different, but again similar hardware:
Hardware: ProLiant DL380 G6, with 96GB of RAM
Kernel: 2.6.18-238.19.1.el5

Disks:

Smart Array P410i in Slot 0 (Embedded)    (sn: XXXXXXXXXXXX)

   array A (SAS, Unused Space: 0 MB)


      logicaldrive 1 (410.1 GB, RAID 1+0, OK)

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 146 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 146 GB, OK)
      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 146 GB, OK)
      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 146 GB, OK)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 146 GB, OK)
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK, spare)

   SEP (Vendor ID PMCSIERA, Model  SRC 8x6G) 250 (WWID: XXXXXXXXXXXX)


Error message reported by the application (MySQL 5.5.16):

InnoDB: Error: tried to read 16384 bytes at offset 0 4011769856.
InnoDB: Was only able to read 12288.
130117  1:30:28  InnoDB: Operating system error number 0 in a file operation.
InnoDB: Error number 0 means 'Success'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/operating-system-error-codes.html
InnoDB: File operation call: 'read'.
InnoDB: Cannot continue operation.

Seen in /var/log/messages:

Jan 17 01:30:27 xxxxxxxx kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3
Jan 17 01:30:28 xxxxxxxx kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3
...
Jan 17 01:30:33 xxxxxxxx cmaeventd[5143]: Logical drive 1 of Embedded Array Controller: I/O request fatal error.
Jan 17 01:30:33 xxxxxxxx cmaeventd[5143]: Logical drive 1 of Embedded Array Controller: I/O request fatal error.
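
To see how many of these messages were logged, and with which sense keys, something like the following can be run over the log file. This is a rough sketch: the function name is arbitrary and it assumes the cciss message format shown above.

```shell
# Count cciss CHECK CONDITION messages per sense key in a syslog-style file.
# Prints one "<sense key> <count>" line per distinct key.
count_cciss_sense() {
    awk '/cciss/ && /CHECK CONDITION/ {
             if (match($0, /sense key = 0x[0-9a-fA-F]+/))
                 # skip the 12 characters of "sense key = " to keep only 0x..
                 n[substr($0, RSTART + 12, RLENGTH - 12)]++
         }
         END { for (k in n) print k, n[k] }' "$1" | sort
}

# Example:
# count_cciss_sense /var/log/messages
```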

and Firmware version:

BIOS Information
        Vendor: HP
        Version: P62
        Release Date: 12/01/2010
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                5.25"/360 kB floppy services are supported (int 13h)
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Function key-initiated network boot is supported
                Targeted content distribution is supported
        Firmware Revision: 2.5

This happened after 330 days of MySQL uptime, so the server normally runs fine. I'll see if updating the firmware helps: despite the read error, no disks were marked as failed, and things worked fine afterwards.
Comment 6 Simon J Mudd 2013-01-17 05:24:28 EST
Note: on rebooting the server I saw this:

Slot 0 HP Smart Array P410i Controller (256MB, v3.00) 1 Logical Drive
1716-Slot 0 Drive Array - Unrecoverable Media Errors Detected on Drives 
during previous Rebuild or Background Surface Analysis (ARM) scan. 
Errors will be fixed automatically when the sector(s) are overwritten. 
Backup and Restore recommended.

So it does look like the controller is aware of the problem. I'm not inclined to trust the data on the disks now, and will upgrade the RAID controller firmware to see if that helps.
Comment 7 RHEL Product and Program Management 2014-03-07 07:11:50 EST
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.
Comment 8 RHEL Product and Program Management 2014-06-03 08:48:13 EDT
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
Comment 9 Charles Butterfield 2015-03-30 08:53:30 EDT
Adding this comment merely to clear the NEEDINFO flag (which is still generating sporadic emails to me).  I no longer have access to this hardware.
