Description of problem:
On (at least) the modern Intel Cougar Point SATA AHCI controller, the chipset support is buggy and can cause (at least) false drive-failure reports.

Detail of problem:
I installed Fedora 14 (I'm not liking 15 with GNOME 3 yet) on my new motherboard, which carries the chipset above, using 5 drives in md RAID 5 (4 in the array, adding a fifth). I began to grow my RAID array on the new, untested hardware. Within seconds the rebuild stopped and dmesg reported that 2 drives had failed (reset on sdc1 and sdd1). Thinking my data was gone forever, I ordered 4 new 2 TB drives, began a new md RAID 5 build, and within seconds 2 drives had again failed (reset on sdc1 and sdd1). After moving these drives to another controller (Marvell) on the same motherboard, with the same install, the array built just fine, with no complaints of failed disks. All cables had been checked and rechecked. I destroyed that array and built another just to be sure: it failed on the Intel chipset and built on the Marvell without issue. Before RMAing the motherboard, I decided to make sure it wasn't a kernel issue, so I compiled a 3.0.4 kernel using the Fedora 14 kernel config file, making only the needed changes (a rough build sketch is at the end of this comment). I rebooted into the new kernel, and the array builds fine on the Intel chipset.

Version-Release number of selected component (if applicable):
kernel-2.6.35.14-97.fc14.x86_64

How reproducible:
Seems very reproducible on my system. Every time I attempted to construct RAID 5 on the Intel chipset, it failed within seconds, reporting that the same 2 drives had failed (reset on sdc1 and sdd1).

Steps to Reproduce:
1. Boot the stock Fedora kernel 2.6.35.14-97.fc14.x86_64.
2. Using any drives (old, new, known working), build an array; mine was RAID 5 (a rough command sketch is at the end of this comment).
3. Watch with horror?

Actual results:
2 drives were reported as failed with a reset and were kicked from the array. 2 drives actually failing at exactly the same time is pretty unlikely; for 4 brand-new drives to then show the same 2 drives (relative to the controller and its ports) failing, the odds start to get astronomical.

Expected results:
Clean growth of the RAID array.

Additional info:
lspci output of the relevant controllers:

00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI Controller (rev 05)
03:00.0 SATA controller: Marvell Technology Group Ltd. Device 9120 (rev 12)

lspci -vv:

00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI Controller (rev 05) (prog-if 01 [AHCI 1.0])
	Subsystem: ASRock Incorporation Device 1c02
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 53
	Region 0: I/O ports at f070 [size=8]
	Region 1: I/O ports at f060 [size=4]
	Region 2: I/O ports at f050 [size=8]
	Region 3: I/O ports at f040 [size=4]
	Region 4: I/O ports at f020 [size=32]
	Region 5: Memory at fbf05000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee0100c  Data: 41c1
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Capabilities: [b0] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: ahci

03:00.0 SATA controller: Marvell Technology Group Ltd. Device 9120 (rev 12) (prog-if 01 [AHCI 1.0])
	Subsystem: ASRock Incorporation Device 9120
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 54
	Region 0: I/O ports at d040 [size=8]
	Region 1: I/O ports at d030 [size=4]
	Region 2: I/O ports at d020 [size=8]
	Region 3: I/O ports at d010 [size=4]
	Region 4: I/O ports at d000 [size=16]
	Region 5: Memory at fbd10000 (32-bit, non-prefetchable) [size=2K]
	Expansion ROM at fbd00000 [disabled] [size=64K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee1100c  Data: 41c9
	Capabilities: [70] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Kernel driver in use: ahci

I wish I could get the actual dmesg output from the log history, but it is gone. If I absolutely must, I can boot into the old kernel and reproduce it yet again to capture it, but I think I have reproduced it enough: 5 old drives known good and 5 new drives, failures on the stock Fedora kernel, happy on the 3.0.4 kernel.

I am assigning this high Severity because, although it doesn't seem to cause problems when the drives aren't being accessed (or are accessed at low volume), someone could lose data on an array such as RAID 5, and possibly RAID 10, if it were rebuilding or growing at the time. (And by someone, I mean me.) :P

Since this is so reproducible right in front of me, I am willing to run whatever tests you want, as I have no data on my new drives. I am not sure whether I can recover anything on my old ones (a sketch of how I would check them is below), so they will stay in the closet for now. I can pull the new drives and move them, or boot into the old kernel and fail them again, or whatever else you deem a good move.
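
For reference, here is a minimal sketch of roughly what I was doing when the failures hit. The device names (/dev/sd[b-f]1) and the array name /dev/md0 are illustrative only; my real layout differs:

# create a 4-drive RAID 5 (partition names are illustrative)
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

# ...or grow an existing 4-drive array onto a 5th drive, as I was doing
mdadm --add /dev/md0 /dev/sdf1
mdadm --grow /dev/md0 --raid-devices=5

# watch the rebuild/reshape and the kernel log in parallel
watch cat /proc/mdstat
dmesg | grep -iE 'ata|reset|fail'

On the stock F14 kernel the reshape dies within seconds with the two members on the Intel ports kicked out; with the 3.0.4 kernel (or on the Marvell ports) the same commands complete cleanly.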
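
To convince myself that the "failed" drives are actually fine (i.e. that these were controller/link resets rather than real drive failures), this is roughly what I would run against the old drives. smartctl comes from the smartmontools package, and the device names are again illustrative:

# SMART health summary and the usual failure indicators for a kicked drive
smartctl -H /dev/sdc
smartctl -a /dev/sdc | grep -iE 'reallocated|pending|uncorrect'

# inspect the md superblocks on the kicked members
mdadm --examine /dev/sdc1 /dev/sdd1

# look for link resets in the kernel log rather than medium errors
dmesg | grep -iE 'ata[0-9]+.*(reset|link|fail)'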
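
And for anyone landing here who wants the same workaround, this is roughly how I built the 3.0.4 kernel with the Fedora 14 config. Paths are illustrative (the tarball is from kernel.org), and the "needed changes" are just whatever new options make oldconfig asks about:

cd /usr/src
tar xf linux-3.0.4.tar.bz2        # upstream tarball from kernel.org
cd linux-3.0.4

# start from the running Fedora 14 config and answer only the new options
cp /boot/config-2.6.35.14-97.fc14.x86_64 .config
make oldconfig

make -j4
make modules_install
make install        # installs the kernel and adds a boot entry via installkernel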
At this point in the F14 lifecycle, we are not going to fix this in the 2.6.35 F14 kernel. Since you have a working solution with the 3.0.4 kernel, we suggest you stick with that, or alternatively upgrade to F15, which is based on the 3.0.6 kernel (packaged as 2.6.40.6).
I wasn't requesting a fix; I was only reporting the issue, since it can result in data loss under the right circumstances. How important this is to the Red Hat/Fedora team is not my decision. My reason for sticking with Fedora 14 is that GNOME 3 is still an infant (not to be derogatory). My desktop preference can stay out of this discussion, except to note that there may still be many people who prefer the last GNOME 2 release of Fedora. Hopefully they will find this page and know they need to build a newer kernel for it.
Edit: that is, if they have this chipset.