Red Hat Bugzilla – Bug 722780
RHEL 5.6: cciss driver giving "CHECK CONDITION sense key = 0x3" errors
Last modified: 2015-03-30 08:53:30 EDT
Description of problem:
While running disk performance benchmarks on several HP DL165 rackmount servers, the kernel began spewing the same message repeatedly, and eventually the system had to be power-cycled (could not log in). The error message is:

    cciss cmd ffff810037e00000 had CHECK CONDITION sense key = 0x3

Each machine has an "HP Smart Array P400" controller with four 1-TB drives, with the hardware RAID controller just presenting them as 4 separate drives (i.e. no hardware RAID). So far I have only seen the problem when testing software RAID with /dev/md, but the problem is sporadic, so I don't know whether the "md" software is part of the problem.

Version-Release number of selected component (if applicable):
Kernel = 2.6.18-238.el5

How reproducible:
Sporadic

Steps to Reproduce:
1. Generate sustained I/O load on /dev/md0, e.g.

    # below is conceptual, actual script had more noise in it
    <create RAID on /dev/md0, varying level, numdevices, stripes, chunksize>
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt
    while true; do
        # dd prints its summary on stderr, so redirect it into the pipe
        dd if=/dev/zero of=/mnt/foo bs=100M count=140 2>&1 | grep copied
    done

Actual results:
Usually no error message, but sometimes a flood of error messages and a hung system.

Expected results:
No error messages and one line for each dd invocation.

Additional info:
Not sure if this will help, but "sense key = 0x3" maps to Medium error and I've also seen suggestions about BIOS/firmware upgrades helping, so maybe worth a try. /Anders
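For reference, the sense key in the message can be decoded with a small lookup table. This is only a sketch: the names follow the sense key table in the SCSI Primary Commands (SPC) standard, and `describe_sense_key` is a hypothetical helper, not part of the cciss driver or any tool mentioned in this report.

```python
# Sense key names as listed in the SCSI Primary Commands (SPC) standard.
# Values not listed here are reserved or obsolete in the revisions I know of.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}


def describe_sense_key(key: int) -> str:
    """Return the SPC name for a 4-bit sense key (hypothetical helper)."""
    return SENSE_KEYS.get(key & 0xF, "RESERVED/OBSOLETE")


print(describe_sense_key(0x3))  # prints "MEDIUM ERROR"
```

A MEDIUM ERROR generally points at the disk surface or flash media rather than at the driver, which would fit the suggestion to look at firmware and the drives themselves.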
Also seen on similar hardware on a MySQL database server:

    ProLiant BL460c G7, Processors: 24, Memory: 188.92 GB

    Smart Array P410i in Slot 0 (Embedded)
       array A (Solid State SAS, Unused Space: 0 MB)
          logicaldrive 1 (745.2 GB, RAID 0, OK)
          physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SAS, 400 GB, OK)
          physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SAS, 400 GB, OK)
       SEP (Vendor ID PMCSIERA, Model SRC 8x6G) 250 (WWID: XXXXXXXXXXXXXXXX)

The application server (mysqld 5.6.5) reported this and eventually crashed:

    InnoDB: Error: tried to read 16384 bytes at offset 992739328
    InnoDB: Was only able to read 8192.
    120804 3:04:51 InnoDB: Operating system error number 11 in a file operation.
    InnoDB: Error number 11 means 'Resource temporarily unavailable'.
    InnoDB: Some operating system error numbers are described at
    InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
    InnoDB: Error: tried to read 16384 bytes at offset 992739328
    InnoDB: Was only able to read 8192.

My firmware version is:

    Handle 0x0000, DMI type 0, 24 bytes
    BIOS Information
        Vendor: HP
        Version: I27
        Release Date: 05/05/2011
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
            ...
        Firmware Revision: 1.26

Still investigating for other info.
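The InnoDB messages above describe a short read: 16384 bytes were requested and 8192 came back. At the system-call level a short read is a successful call that returns fewer bytes than asked for, so it does not set errno on its own. The "error number 11" in the log may therefore be a stale value rather than the real cause, though that is an assumption about how InnoDB reports errors, not something confirmed in this bug. The sketch below (Linux, Python 3) shows how a caller can tell a short read from a hard I/O error; `read_block` and `demo` are hypothetical names, and the temporary file only stands in for a data file that returns less than a full page.

```python
import os
import tempfile


def read_block(fd: int, offset: int, length: int) -> bytes:
    """Read up to `length` bytes at `offset`, retrying after short reads.

    Stops early only at end of file; real I/O errors surface as OSError.
    """
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = os.pread(fd, remaining, offset)
        if not chunk:  # zero bytes back means end of file
            break
        chunks.append(chunk)
        offset += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def demo() -> int:
    """Ask for a 16384-byte page from a file that only holds 8192 bytes."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"\0" * 8192)
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            data = read_block(fd, 0, 16384)
        finally:
            os.close(fd)
    return len(data)


print(demo())  # prints 8192: a short read, reported without any OSError
```

On a disk that returns a medium error, the `os.pread` call would be expected to raise `OSError` (typically with errno EIO) instead of returning data, which is the case the loop deliberately does not hide.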
Kernel is the same (this is running CentOS not RHEL but symptoms look identical):

    > select * from v_host_rpm where hostname like 'xxxxxxxxxxxxxxx.%' and uninstall_date is null and name ='kernel';
    +-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
    | hostname_id | hostname        | rpm_id | name   | version | rpm_release | arch   | last_check          | install_date        | uninstall_date |
    +-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
    |      312417 | xxxxxxxxxxxxxxx |   4757 | kernel | 2.6.18  | 238.el5     | x86_64 | 2012-08-06 02:44:03 | 2012-02-28 12:49:44 | NULL           |
    +-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
    1 row in set (0.01 sec)
I should have said: I am using LVM with an xfs filesystem.
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Note: Also seen on different, but again similar hardware:

    Hardware: ProLiant DL380 G6, with 96GB of RAM
    Kernel: 2.6.18-238.19.1.el5
    Disks:

    Smart Array P410i in Slot 0 (Embedded) (sn: XXXXXXXXXXXX)
       array A (SAS, Unused Space: 0 MB)
          logicaldrive 1 (410.1 GB, RAID 1+0, OK)
          physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
          physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 146 GB, OK)
          physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 146 GB, OK)
          physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 146 GB, OK)
          physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 146 GB, OK)
          physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 146 GB, OK)
          physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK, spare)
       SEP (Vendor ID PMCSIERA, Model SRC 8x6G) 250 (WWID: XXXXXXXXXXXX)

Error message reported by the application (MySQL 5.5.16):

    InnoDB: Error: tried to read 16384 bytes at offset 0 4011769856.
    InnoDB: Was only able to read 12288.
    130117 1:30:28 InnoDB: Operating system error number 0 in a file operation.
    InnoDB: Error number 0 means 'Success'.
    InnoDB: Some operating system error numbers are described at
    InnoDB: http://dev.mysql.com/doc/refman/5.5/en/operating-system-error-codes.html
    InnoDB: File operation call: 'read'.
    InnoDB: Cannot continue operation.

Seen in /var/log/messages:

    Jan 17 01:30:27 xxxxxxxx kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3
    Jan 17 01:30:28 xxxxxxxx kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3
    ...
    Jan 17 01:30:33 xxxxxxxx cmaeventd[5143]: Logical drive 1 of Embedded Array Controller: I/O request fatal error.
    Jan 17 01:30:33 xxxxxxxx cmaeventd[5143]: Logical drive 1 of Embedded Array Controller: I/O request fatal error.
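In case it helps others compare affected hosts, here is a minimal sketch for counting these kernel messages in a log excerpt. The regular expression is an assumption based only on the two message variants quoted in this bug ("cciss cmd ... had" and "cciss: cmd ... has"); `summarize` and `CCISS_RE` are hypothetical names, and other kernel versions may format the line differently.

```python
import re
from collections import Counter

# Matches both variants quoted in this report:
#   "cciss cmd <addr> had CHECK CONDITION sense key = 0x3"
#   "cciss: cmd <addr> has CHECK CONDITION sense key = 0x3"
CCISS_RE = re.compile(
    r"cciss:?\s+cmd\s+(?P<cmd>[0-9a-f]+)\s+ha[sd]\s+CHECK CONDITION\s+"
    r"sense key = (?P<key>0x[0-9a-fA-F]+)"
)


def summarize(lines):
    """Count CHECK CONDITION messages per (command address, sense key)."""
    counts = Counter()
    for line in lines:
        match = CCISS_RE.search(line)
        if match:
            counts[(match.group("cmd"), int(match.group("key"), 16))] += 1
    return counts


sample = [
    "Jan 17 01:30:27 xxxxxxxx kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3",
    "Jan 17 01:30:28 xxxxxxxx kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3",
    "Jan 17 01:30:33 xxxxxxxx cmaeventd[5143]: Logical drive 1 of Embedded Array Controller: I/O request fatal error.",
]

for (cmd, key), count in summarize(sample).items():
    print(f"cmd {cmd} sense key {key:#x}: {count} message(s)")
```

To run it against a real log, pass an open file object instead of `sample`, for example `summarize(open("/var/log/messages"))`, assuming the process has permission to read that file.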
and Firmware version:

    BIOS Information
        Vendor: HP
        Version: P62
        Release Date: 12/01/2010
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
            PCI is supported
            PNP is supported
            BIOS is upgradeable
            BIOS shadowing is allowed
            ESCD support is available
            Boot from CD is supported
            Selectable boot is supported
            EDD is supported
            5.25"/360 kB floppy services are supported (int 13h)
            5.25"/1.2 MB floppy services are supported (int 13h)
            3.5"/720 kB floppy services are supported (int 13h)
            Print screen service is supported (int 5h)
            8042 keyboard services are supported (int 9h)
            Serial services are supported (int 14h)
            Printer services are supported (int 17h)
            CGA/mono video services are supported (int 10h)
            ACPI is supported
            USB legacy is supported
            BIOS boot specification is supported
            Function key-initiated network boot is supported
            Targeted content distribution is supported
        Firmware Revision: 2.5

This happened after 330 days of MySQL uptime, so the server normally runs fine. Despite the read error, no disks were marked as failed and things later worked fine, so I'll see if updating the firmware helps.
Note: on rebooting the server I saw this:

    Slot 0 HP Smart Array P410i Controller (256MB, v3.00) 1 Logical Drive
    1716-Slot 0 Drive Array - Unrecoverable Media Errors Detected on Drives during previous Rebuild or Background Surface Analysis (ARM) scan. Errors will be fixed automatically when the sector(s) are overwritten. Backup and Restore recommended.

So it does look like the controller is aware of the problem. I'm not inclined to trust the data on the disks now, and will upgrade the RAID controller firmware to see if that helps.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
Adding this comment merely to clear the NEEDINFO flag (which is still generating sporadic emails to me). I no longer have access to this hardware.