Bug 795397

Summary: Server frequently goes into read-only mode on Intel Corporation 5 Series/3400 Series Chipset
Product: Red Hat Enterprise Linux 5 Reporter: yolte <burak>
Component: kernel Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 5.7   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2013-02-14 16:07:03 UTC Type: ---
Attachments:
Description Flags
lspci output none

Description yolte 2012-02-20 12:20:35 UTC
Description of problem:
We have 250+ Fujitsu RX100S6 servers running CentOS 5.7 x86_64. These servers go into read-only mode, I think under high load.


Version-Release number of selected component (if applicable):
2.6.18-274.12.1.el5

How reproducible:
It happens on all CentOS 5.5, 5.6, and 5.7 based servers. These are web hosting servers running the Plesk, DirectAdmin, or cPanel control panels.

Steps to Reproduce:
1. Not sure. I think it happens under high server load, for example when running a backup task or copying or moving files, so it seems to be related to disk I/O.
  
Actual results:
Feb 11 14:37:51 server kernel: ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 11 14:37:51 server kernel: ata1.00: irq_stat 0x40000008
Feb 11 14:37:51 server kernel: ata1.00: cmd 60/08:00:e7:4f:55/00:00:16:00:00/40 tag 0 ncq 4096 in
Feb 11 14:37:51 server kernel:          res 41/40:00:e7:4f:55/00:00:16:00:00/40 Emask 0x409 (media error) <F>
Feb 11 14:37:51 server kernel: ata1.00: status: { DRDY ERR }
Feb 11 14:37:51 server kernel: ata1.00: error: { UNC }
Feb 11 14:37:51 server kernel: ata1.00: configured for UDMA/133
Feb 11 14:37:51 server kernel: ata1: EH complete
Feb 11 14:37:51 server kernel: SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
Feb 11 14:37:51 server kernel: sda: Write Protect is off
Feb 11 14:37:51 server kernel: SCSI device sda: drive cache: write back
Feb 11 14:38:56 server kernel: ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Feb 11 14:38:56 server kernel: ata1.00: cmd 61/a8:00:4f:5d:02/03:00:00:00:00/40 tag 0 ncq 479232 out
Feb 11 14:38:56 server kernel:          res 40/00:00:e7:4f:55/00:00:16:00:00/40 Emask 0x4 (timeout)
Feb 11 14:38:56 server kernel: ata1.00: status: { DRDY }
Feb 11 14:38:56 server kernel: ata1: hard resetting link
Feb 11 14:38:57 server kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 11 14:38:57 server kernel: ata1.00: configured for UDMA/133
Feb 11 14:38:57 server kernel: ata1: EH complete
Feb 11 14:38:57 server kernel: SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
Feb 11 14:38:57 server kernel: sda: Write Protect is off
Feb 11 14:38:57 server kernel: SCSI device sda: drive cache: write back
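
To help read the cmd/res lines in the log above, here is a minimal Python sketch that decodes a libata taskfile dump. It assumes the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), NCQ commands 0x60 and 0x61 carrying the sector count in the feature bytes, and 512-byte sectors as reported for sda. The function name decode_taskfile and the dictionary keys are hypothetical, chosen for illustration only.

```python
import re

# Assumed libata taskfile layout:
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
TASKFILE_RE = re.compile(
    r"\b(cmd|res) "
    r"([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)

SECTOR_BYTES = 512                  # assumption: 512-byte sectors, as the log reports
NCQ_COMMANDS = (0x60, 0x61)         # READ / WRITE FPDMA QUEUED


def decode_taskfile(line):
    """Decode one 'cmd' or 'res' line; return None if the line has no taskfile."""
    m = TASKFILE_RE.search(line)
    if m is None:
        return None
    kind, first, low, high, _device = m.groups()
    feature, _nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    # 48-bit LBA: the "hob" bytes are the high-order half.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    first_byte = int(first, 16)     # command opcode for cmd, status register for res
    ncq_bytes = None
    if kind == "cmd" and first_byte in NCQ_COMMANDS:
        # For NCQ commands the sector count travels in the feature bytes.
        ncq_bytes = (hob_feature << 8 | feature) * SECTOR_BYTES
    return {
        "kind": kind,
        "first_byte": first_byte,
        "feature_or_error": feature,  # error register when kind == "res"
        "lba": lba,
        "ncq_bytes": ncq_bytes,
    }
```

Applied to the 14:37:51 res line, this gives status 0x41, error 0x40 and LBA 374689767, which lies within the 488397168 sectors reported for sda. The two cmd lines decode to 4096 and 479232 bytes, matching the "ncq 4096 in" and "ncq 479232 out" fields.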

Expected results:
The server should not go into read-only mode.

Additional info:
As you can see in the attached lspci output, these servers have this SATA controller: Intel Corporation 5 Series/3400 Series Chipset 6 port SATA AHCI Controller. Maybe this controller has a problem with CentOS.
I also tried to turn off NCQ on the servers with the command below, but it does not help:
echo 1 > /sys/block/sda/device/queue_depth (also added to rc.local)
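
To confirm that the setting above actually took effect, the value can be read back from sysfs. This is a minimal Python sketch and not part of the original report. It assumes a queue depth of 1 means NCQ is effectively off for that device. The function names and the sysfs_root parameter are hypothetical; the parameter exists only so the check can be pointed at a test directory.

```python
from pathlib import Path


def read_queue_depth(device="sda", sysfs_root="/sys/block"):
    """Return the current queue depth the kernel reports for a block device."""
    path = Path(sysfs_root) / device / "device" / "queue_depth"
    return int(path.read_text().strip())


def ncq_disabled(device="sda", sysfs_root="/sys/block"):
    """Assumption: a queue depth of 1 means commands are no longer queued."""
    return read_queue_depth(device, sysfs_root) == 1
```

The value written with echo is not persistent across reboots, which is why the command was also added to rc.local.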

Comment 1 yolte 2012-02-20 12:21:36 UTC
Created attachment 564423
lspci output

lspci

Comment 2 Jes Sorensen 2013-02-14 16:07:03 UTC
You're getting media errors from the disk drive(s). This is a hardware issue,
not a software issue.
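
For triage across many servers, the error kinds libata prints in parentheses after the result Emask can be tallied from the kernel log. This is a minimal Python sketch, not part of the original comment. The function name tally_error_kinds is hypothetical, and the pattern assumes the "Emask 0x... (kind)" form seen in the log in the description.

```python
import re
from collections import Counter

# Matches e.g. "Emask 0x409 (media error)" or "Emask 0x4 (timeout)".
# The "exception Emask 0x0 SAct ..." lines have no parenthesised kind and are skipped.
EMASK_RE = re.compile(r"Emask 0x[0-9a-f]+ \(([^)]+)\)")


def tally_error_kinds(lines):
    """Count how often each libata error kind appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = EMASK_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

On the excerpt in the description this counts one "media error" and one "timeout". Repeated media error entries for the same drive point at the disk itself, as opposed to the controller or the kernel.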