Description of problem:
An old Athlon system of mine is not liking Severn. The ata code reports errors followed by severe fs corruption. This occurs with and without acpi. This particular system has run lots of 2.4 kernels with no major problems, and was successfully running RHL9 up until a few weeks ago. The drive is healthy.

Here's an example of what happens when it fails:

hde: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hde: task_no_data_intr: error=0x04 { DriveStatusError }
...
EXT3-fs error (...) in start_transaction: Journal has aborted

Version-Release number of selected component:
2.4.21-20.1.2024.2.1.nptl (athlon rpm)

Hardware info:
A7V mobo (KT133 chipset, onboard VIA ata66 and Promise ata100 controllers)
hdc is an old Sony burner
hde: QUANTUM FIREBALLP AS40.0, ATA DISK drive
(see attachments for more details)

How reproducible:
~10 mins of idling seems to reliably reproduce it. I've also seen it occur during bootup, and occasionally during normal use. I cannot reproduce it with RHL9 errata kernel (2.4.20-19.9) + Severn userland.

Additional info:
I also had a similar problem with the installer kernel, but it's much harder to reproduce. It took a few reboots for anaconda to successfully mount the existing ext3 partition. (IIRC it reported lost irq, disabled DMA, and PIO mode didn't work)
Also, I see similar 0x51/0x04 errors with hdc after smartd starts.
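For anyone reading the log lines above: the two hex values are the ATA status and error registers, and the names in braces are just the set bits. The sketch below decodes them. The bit tables are reconstructed from the ATA register layout and the labels the kernel prints, so treat the less common names (bits 5 and 3 of the error register in particular) as approximations rather than the driver's exact strings. `decode` is a hypothetical helper name.

```python
# Illustrative bit tables, highest bit first. Not copied from kernel source.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

ERROR_BITS = [
    (0x80, "BadCRC/BadSector"),
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),           # approximate label
    (0x10, "SectorIdNotFound"),
    (0x08, "MediaChangeRequested"),   # approximate label
    (0x04, "DriveStatusError"),       # ATA ABRT: the drive aborted the command
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in table if value & mask]


if __name__ == "__main__":
    print("status=0x51:", " ".join(decode(0x51, STATUS_BITS)))
    print("error=0x04: ", " ".join(decode(0x04, ERROR_BITS)))
```

So 0x51 is a ready, idle drive flagging an error, and 0x04 is the abort bit: the drive rejected a command it was sent, rather than reporting a bad sector or a CRC failure.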
Created attachment 93286: lspci
What happens if you use the i686 kernel instead of the athlon kernel?
Same result with the i686 version. (acpi=off) I'll try some recent vanilla and -ac kernels later.
I could not reproduce it with 2.4.21, 2.4.22-pre6-ac1, or with Arjan's 2.6 RPMs (2.6.0-0.test1.1.26 and 2.6.0-0.test2.1.28).
What happens if you turn off the loading of smartd ("chkconfig --level 35 smartd off") and reboot the machine? On one of my machines smartd does a devicescan, and because of this my ide-tape drive does funny things.
It stops the errors about hdc. No change with hde.
It's something in the severn stuff - I've seen multiple reports and even with ACPI and "all the usual suspects" enabled it only happens with the RH tree. It's really quite weird and I really don't know what severn is doing here.
I've just tried moving the drive from the Promise to the VIA controller - same result. BTW here's another report - (the only hardware in common is the harddrive) http://www.redhat.com/archives/rhl-beta-list/2003-July/msg00962.html
I tried the Severn2 kernel (2.4.22-1.2061.nptl) with the Severn1 installation and the same errors occur. I have just noticed what has changed - when using the Severn kernels the hard drive spins down after 5-10 mins. (I'd really like to know why)
Bugzilla has lost the last few comments, so here is a summary. laptop_mode is disabled. "HDD power down" is disabled in BIOS. After a fresh install of Fedora 0.94 it still occurs. (0x51/04 errors, ide and ext3 failures, reset, manually fsck if required)
We're starting to suspect DMA problems with Fireball drives, as this is the third report I've been able to find and the drive is the only common factor. (Different chipsets each time.) If you feel motivated to investigate this, can you paste the boot messages of both a RHL9 and a cambridge kernel so we can see how they differ? Additionally, booting with ide=nodma may work around the corruption if our guesses are correct.
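If it helps with the comparison, here is a rough sketch of one way to diff just the IDE-related lines of two saved boot logs using Python's `difflib`. The keyword list and the names `ide_lines` and `diff_boot` are my own assumptions, not anything the kernel or Bugzilla provides, and the filter is a crude substring match that will need tuning for a real dmesg.

```python
import difflib

# Assumed keywords for picking out IDE-related boot messages (heuristic).
IDE_KEYWORDS = ("hd", "ide", "VP_IDE", "PDC", "DMA")


def ide_lines(dmesg_text):
    """Keep only lines that mention one of the assumed IDE keywords."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if any(key in line for key in IDE_KEYWORDS)
    ]


def diff_boot(old_text, new_text, old_name="rhl9", new_name="cambridge"):
    """Return a unified diff of the IDE-related lines of two boot logs."""
    return list(
        difflib.unified_diff(
            ide_lines(old_text),
            ide_lines(new_text),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


if __name__ == "__main__":
    import sys

    # Usage: python bootdiff.py dmesg-rhl9.txt dmesg-cambridge.txt
    with open(sys.argv[1]) as old, open(sys.argv[2]) as new:
        print("\n".join(diff_boot(old.read(), new.read())))
```

Filtering before diffing keeps unrelated driver noise out of the comparison, at the cost of possibly hiding a relevant line that doesn't match any keyword.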
You might want to add that Quantum drive to the local blacklist for the PDC202xx - not sure why it should bite just the Quantum though
It'll need adding in multiple places if that's the case, as this has been seen on at least 3 different controllers now.
Also #91932 looks very similar (same hardware, also seeing corruption). Disabling DMA didn't help in that case, so it's back to the drawing board.
Are you using LVM ?
I'm interested to hear if this fares any better... http://people.redhat.com/davej/2.4.22-1.2086.nptl/
No LVM, and ide=nodma didn't help much. (btw I can't reproduce it with the Taroon kernel - 2.4.21-3.EL)
2.4.22-1.2086.nptl is looking good so far.
Any update on this ? Is it behaving now ?
With the limited amount of testing I've been able to do, 2.4.22-1.2086.nptl seems to fix the problem. 2086 lasts for over 6 hours - previous Severn kernels would fail within 20 mins. I'll do some further tests, but I believe the problem is fixed.
Sounds promising. Looks like the acoustic management patch doesn't play well with these drives. Thanks for chasing this.
Can you paste the output of hdparm -I /dev/hd? from that Quantum Fireball please ?
/dev/hde:

ATA device, with non-removable media
        Model Number:       QUANTUM FIREBALLP AS40.0
        Serial Number:      194034230190
        Firmware Revision:  A1Y.1300
Standards:
        Used: ATA/ATAPI-5 T13 1321D revision 1
        Supported: 5 4 3 2 & some of 6
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:   78177792
        device size with M = 1024*1024:       38172 MBytes
        device size with M = 1000*1000:       40027 MBytes (40 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        bytes avail on r/w long: 4      Queue depth: 1
        Standby timer values: spec'd by Vendor, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 254, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
                Security Mode feature set
           *    SMART feature set
           *    Automatic Acoustic Management feature set
           *    DOWNLOAD MICROCODE cmd
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
        not     supported: enhanced erase
        24min for SECURITY ERASE UNIT. 8min for ENHANCED SECURITY ERASE UNIT.
HW reset results:
        CBLID- above Vih
        Device num = 0 determined by CSEL
Checksum: correct
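As a sanity check on the geometry and size figures in that output, the numbers are self-consistent. A small sketch of the arithmetic, assuming 512-byte sectors (the output does not state the sector size, so that is my assumption); `chs_sectors` and `size_in_units` are hypothetical helper names:

```python
SECTOR_BYTES = 512  # assumption: standard sector size for an ATA disk of this era


def chs_sectors(cylinders, heads, sectors_per_track):
    """Sectors addressable through CHS geometry."""
    return cylinders * heads * sectors_per_track


def size_in_units(lba_sectors, unit_bytes):
    """Device size in whole units of unit_bytes, truncated as hdparm shows it."""
    return lba_sectors * SECTOR_BYTES // unit_bytes


if __name__ == "__main__":
    lba = 78177792
    print("CHS sectors:     ", chs_sectors(16383, 16, 63))
    print("MBytes (2^20):   ", size_in_units(lba, 1024 * 1024))
    print("MBytes (10^6):   ", size_in_units(lba, 1000 * 1000))
```

The CHS figure is capped at 16383 cylinders, which is why it is far below the LBA sector count; only the LBA value reflects the real 40 GB capacity.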