Bug 722780
| Summary: | RHEL 5.6: cciss driver giving "CHECK CONDITION sense key = 0x3" errors | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Charles Butterfield <cb20777> |
| Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 5.6 | CC: | akarlsso, cb20777, sjmudd |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Last Closed: | 2014-06-03 12:48:13 UTC | Type: | --- |
Not sure if this will help, but "sense key = 0x3" maps to Medium error and I've also seen suggestions about BIOS/firmware upgrades helping, so maybe worth a try.
/Anders

Also seen on similar hardware on a MySQL database server:
ProLiant BL460c G7, Processors: 24, Memory: 188.92 GB
Smart Array P410i in Slot 0 (Embedded)
array A (Solid State SAS, Unused Space: 0 MB)
logicaldrive 1 (745.2 GB, RAID 0, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SAS, 400 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SAS, 400 GB, OK)
SEP (Vendor ID PMCSIERA, Model SRC 8x6G) 250 (WWID: XXXXXXXXXXXXXXXX)
The application server (mysqld 5.6.5) reported this and eventually crashed:
InnoDB: Error: tried to read 16384 bytes at offset 992739328
InnoDB: Was only able to read 8192.
120804 3:04:51 InnoDB: Operating system error number 11 in a file operation.
InnoDB: Error number 11 means 'Resource temporarily unavailable'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
InnoDB: Error: tried to read 16384 bytes at offset 992739328
InnoDB: Was only able to read 8192.
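For anyone trying to place the failure within the data file, the numbers in the InnoDB message are enough for a little arithmetic. The sketch below is only an illustration: the name `locate_short_read` is made up here, and it assumes the default 16384-byte InnoDB page size (which matches the read size in the log) and that the bytes that did come back are contiguous from the start of the request. Under those assumptions the read above was for page 60592, and the data stopped at file offset 992747520.

```python
from typing import NamedTuple


class ShortRead(NamedTuple):
    page_number: int          # index of the page the read was for
    first_unread_offset: int  # file offset of the first byte not returned
    missing_bytes: int        # how far the read fell short


def locate_short_read(offset: int, requested: int, got: int,
                      page_size: int = 16384) -> ShortRead:
    """Turn the numbers from an InnoDB short-read message into a location.

    page_size defaults to 16384, which is an assumption (the InnoDB default).
    """
    if not 0 <= got <= requested:
        raise ValueError("got must be between 0 and requested")
    return ShortRead(
        page_number=offset // page_size,
        first_unread_offset=offset + got,
        missing_bytes=requested - got,
    )


# Values taken from the log lines above.
print(locate_short_read(offset=992739328, requested=16384, got=8192))
```

The first unread offset is only a starting point for further checks (for example, mapping it through LVM to a block on the logical drive); it does not by itself prove which sector is bad.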
My firmware version is:
Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: HP
Version: I27
Release Date: 05/05/2011
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 8192 kB
Characteristics:
...
Firmware Revision: 1.26
Still investigating for other info.
Kernel is the same (this is running CentOS not RHEL but symptoms look identical):
> select * from v_host_rpm where hostname like 'xxxxxxxxxxxxxxx.%' and uninstall_date is null and name ='kernel';
+-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
| hostname_id | hostname | rpm_id | name | version | rpm_release | arch | last_check | install_date | uninstall_date |
+-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
| 312417 | xxxxxxxxxxxxxxx | 4757 | kernel | 2.6.18 | 238.el5 | x86_64 | 2012-08-06 02:44:03 | 2012-02-28 12:49:44 | NULL |
+-------------+-----------------+--------+--------+---------+-------------+--------+---------------------+---------------------+----------------+
1 row in set (0.01 sec)
I should have said: I am using LVM with an xfs filesystem.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Note: Also seen on different, but again similar hardware:
Hardware: ProLiant DL380 G6, with 96GB of RAM
Kernel: 2.6.18-238.19.1.el5
Disks:
Smart Array P410i in Slot 0 (Embedded) (sn: XXXXXXXXXXXX)
array A (SAS, Unused Space: 0 MB)
logicaldrive 1 (410.1 GB, RAID 1+0, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 146 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 146 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 146 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 146 GB, OK)
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 146 GB, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK, spare)
SEP (Vendor ID PMCSIERA, Model SRC 8x6G) 250 (WWID: XXXXXXXXXXXX)
Error message reported by the application (MySQL 5.5.16):
InnoDB: Error: tried to read 16384 bytes at offset 0 4011769856.
InnoDB: Was only able to read 12288.
130117 1:30:28 InnoDB: Operating system error number 0 in a file operation.
InnoDB: Error number 0 means 'Success'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/operating-system-error-codes.html
InnoDB: File operation call: 'read'.
InnoDB: Cannot continue operation.
Seen in /var/log/messages:
Jan 17 01:30:27 xxxxxxxx kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3
Jan 17 01:30:28 xxxxxxxx kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3
...
Jan 17 01:30:33 xxxxxxxx cmaeventd[5143]: Logical drive 1 of Embedded Array Controller: I/O request fatal error.
Jan 17 01:30:33 xxxxxxxx cmaeventd[5143]: Logical drive 1 of Embedded Array Controller: I/O request fatal error.
and Firmware version:
BIOS Information
Vendor: HP
Version: P62
Release Date: 12/01/2010
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 8192 kB
Characteristics:
PCI is supported
PNP is supported
BIOS is upgradeable
BIOS shadowing is allowed
ESCD support is available
Boot from CD is supported
Selectable boot is supported
EDD is supported
5.25"/360 kB floppy services are supported (int 13h)
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
CGA/mono video services are supported (int 10h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Function key-initiated network boot is supported
Targeted content distribution is supported
Firmware Revision: 2.5
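Going back to the /var/log/messages lines quoted earlier in this comment: the sense key can be decoded mechanically when scanning logs from several hosts. The following is a rough sketch, not anything the cciss driver or HP tools provide. The regular expression is an assumption based only on the two message variants quoted in this bug ("has" and "had"), and `decode_cciss_line` is a hypothetical helper name. The sense key names follow the standard SCSI sense key table.

```python
import re

# Standard SCSI sense keys (0xC and 0xF are omitted here as obsolete/reserved).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Assumed message shape, inferred from the lines quoted in this report.
CCISS_RE = re.compile(
    r"cciss:? cmd (?P<cmd>[0-9a-f]+) ha[sd] CHECK CONDITION "
    r"sense key = 0x(?P<key>[0-9a-fA-F]+)"
)


def decode_cciss_line(line):
    """Return (command address, sense key, sense key name), or None if no match."""
    match = CCISS_RE.search(line)
    if match is None:
        return None
    key = int(match.group("key"), 16)
    return match.group("cmd"), key, SENSE_KEYS.get(key, "UNKNOWN")


sample = ("Jan 17 01:30:27 xxxxxxxx kernel: cciss: cmd ffff810037e00000 "
          "has CHECK CONDITION sense key = 0x3")
print(decode_cciss_line(sample))
```

A MEDIUM ERROR result points at the media rather than at the driver, which would be consistent with the read failures MySQL reported, though it does not rule out a firmware problem.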
This happened after 330 days MySQL uptime, so normally the server is running fine. I'll see if updating the firmware helps as despite the read error no disks were marked as failed and later things worked fine.
Note: on rebooting the server I saw this:

Slot 0 HP Smart Array P410i Controller (256MB, v3.00) 1 Logical Drive
1716-Slot 0 Drive Array - Unrecoverable Media Errors Detected on Drives during previous Rebuild or Background Surface Analysis (ARM) scan. Errors will be fixed automatically when the sector(s) are overwritten. Backup and Restore recommended.

So it does look like the controller is aware of the problem. I'm not inclined to trust the data on the disks now, and will upgrade the RAID controller firmware to see if that helps.

Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.

Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).

Adding this comment merely to clear the NEEDINFO flag (which is still generating sporadic emails to me). I no longer have access to this hardware.
Description of problem:
While running disk performance benchmarks on several HP DL165 rackmount servers, the kernel began spewing the same message over and over, and eventually the system had to be power-cycled (could not log in). The error message is:

    cciss cmd ffff810037e00000 had CHECK CONDITION sense key = 0x3

Each machine has an "HP Smart Array P400" controller with four 1 TB drives, with the hardware RAID controller just presenting them as four separate drives (i.e. no hardware RAID). So far I have only seen the problem when testing software RAID with /dev/md, but the problem is sporadic, so I don't know if the "md" software is part of the problem.

Version-Release number of selected component (if applicable):
Kernel = 2.6.18-238.el5

How reproducible:
Sporadic

Steps to Reproduce:
1. Generate sustained I/O load on /dev/md0, e.g.

    # below is conceptual, actual script had more noise in it
    <create RAID on /dev/md0, varying level, numdevices, stripes, chunksize>
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt
    while true; do
        # dd prints its summary on stderr, hence the 2>&1
        dd if=/dev/zero of=/mnt/foo bs=100M count=140 2>&1 | grep copied
    done

Actual results:
Usually no error message, but sometimes a flood of error messages and a hung system.

Expected results:
No error messages and one line for each dd invocation.

Additional info:
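For anyone who wants to reproduce the load without the shell loop, the dd step from the steps to reproduce can be approximated in Python. This is a minimal sketch under several assumptions: it needs Python 3 (which is not what RHEL 5 ships), the function names `write_pass` and `run_load` are made up here, and unlike plain dd it calls fsync at the end of each pass so that the timing covers data actually reaching the device. It has not been run against the hardware in this report.

```python
import os
import time


def write_pass(path, block_size, count):
    """Write `count` zero-filled blocks of `block_size` bytes to `path`.

    Returns (bytes_written, elapsed_seconds).
    """
    block = bytes(block_size)  # zero-filled buffer, like reading /dev/zero
    start = time.monotonic()
    written = 0
    with open(path, "wb") as f:
        for _ in range(count):
            written += f.write(block)
        f.flush()
        os.fsync(f.fileno())  # assumption: force data to disk; plain dd does not
    return written, time.monotonic() - start


def run_load(path, block_size, count, passes=None):
    """Repeat write_pass; passes=None loops forever, like `while true`."""
    done = 0
    while passes is None or done < passes:
        written, elapsed = write_pass(path, block_size, count)
        rate = written / max(elapsed, 1e-9) / 1e6
        print(f"{written} bytes copied, {elapsed:.1f} s, {rate:.1f} MB/s")
        done += 1


# Equivalent of the dd loop above (bs=100M count=140), left commented out
# because it runs until interrupted and overwrites /mnt/foo:
# run_load("/mnt/foo", 100 * 1024 * 1024, 140)
```

Printing one line per pass mirrors the expected result described above, so a stall or a burst of kernel errors can be lined up against the pass during which it occurred.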