Bug 91932

Summary: IDE Errors
Product: [Retired] Red Hat Linux Beta Reporter: Thornton Prime <thornton>
Component: kernel Assignee: Dave Jones <davej>
Status: CLOSED NEXTRELEASE QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: alpha 2 CC: djh, pfrields
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2003-10-16 01:25:34 UTC Type: ---
Bug Blocks: 100643    
Attachments:
Description Flags
Severn Boot Messages none
RedHat 9 Boot Messages none

Description Thornton Prime 2003-05-29 23:07:12 UTC
I've been having some serious problems under Cambridge, first with alpha1,
then alpha2, and now alpha2+rawhide kernel. The same system is fine under
RH8.0 and 9.

After moderate use the system starts reporting repeated hard disk errors 
...

  hda: task_no_data_intr: error=0x04 { DriveStatusError }
  hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }

Rebooting usually finds the system unusable ...

  Kernel panic: No init found.  Try passing init= option to kernel.

The system also hangs when I boot without adding "apm=off acpi=off", but I 
suspect that's unrelated.

The machine is a rather normal 1U Penguin Computing system with a P3. The 
IDE is reported as an Intel 82801AA rev2 chipset. The drive is a QUANTUM 
FIREBALLP AS30.0.

Comment 1 Thornton Prime 2003-06-19 01:59:10 UTC
Still seeing problems in Cambridge alpha3, but I think they are isolated to LVM.


Comment 2 Thornton Prime 2003-06-19 03:09:16 UTC
Spoke too soon.

Happened on plain ext3 partition (no LVM), though it took a lot longer ... after
about 50 passes of bonnie++, the filesystem became unreadable, with the same
errors ...

hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete}
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }


Comment 3 Warren Togami 2003-09-21 10:57:16 UTC
Have you gone back to a stable release, and did these problems go away?
Do these problems persist with the latest kernel in rawhide?


Comment 4 Thornton Prime 2003-09-21 16:07:53 UTC
I did go back to a stable release for this machine because the machine was
worthless as a test machine since the IDE problems would crop up within minutes.

I will try the latest rawhide, though.

Comment 5 Thornton Prime 2003-09-30 03:47:27 UTC
Problems look solved with Fedora Severn2/2.4.22-1.2061.nptl.



Comment 6 Thornton Prime 2003-09-30 04:28:38 UTC
Spoke too soon ... my 5th pass of bonnie++ gave this:

# bonnie++ -u root -d .
-bash: /usr/sbin/bonnie++: /lib/ld-linux.so.2: bad ELF interpreter: No such
filySegmentation fault
journal_bmap_R16ad4e4d: journal block not found at offset 116)Aborting journal
on device lvm(58,0).
journal_bmap_R16ad4e4d: journal block not found at offset 269 on lvm(58,1)
Aborting journal on device lvm(58,1).
ext3_abort called.
EXT3-fs abort (device lvm(58,0)): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete}hda:
task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }
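
Because the console output above is partly fused and wrapped (for example "SeekComplete}hda:" followed by a line break), counting the IDE register dumps by eye is error-prone. The following is a minimal sketch, not part of the original report, that tallies them from a captured log. The function name summarize_ide_dumps and the sample text are illustrative assumptions; the pattern only covers the task_no_data_intr messages quoted in this bug.

```python
import re
from collections import Counter

# Match a register dump wherever it occurs in the text. \s+ also matches a
# line break, so an entry wrapped after "hda:" by the console still counts.
IDE_DUMP = re.compile(
    r"(hd[a-z]):\s+task_no_data_intr:\s+(status|error)=0x([0-9a-fA-F]{2})"
)


def summarize_ide_dumps(log_text):
    """Count (drive, register, value) triples found in a captured console log."""
    return Counter(
        (drive, register, int(value, 16))
        for drive, register, value in IDE_DUMP.findall(log_text)
    )


# Sample shaped like the log in this comment, including one fused/wrapped entry.
sample = (
    "hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete}hda:\n"
    "task_no_data_intr: error=0x04 { DriveStatusError }\n"
    "hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }\n"
    "hda: task_no_data_intr: error=0x04 { DriveStatusError }\n"
)

counts = summarize_ide_dumps(sample)
for (drive, register, value), n in sorted(counts.items()):
    print("%s %s=0x%02x seen %d time(s)" % (drive, register, value, n))
```

Matching on the whole text rather than line by line is the design choice here: it tolerates the line fusion that a serial console can introduce.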


Comment 7 Dave Jones 2003-10-02 14:24:43 UTC
(updating with lost bug reports from bugzilla crash).

==============================================================================
------- Additional Comments From christoph.wickert  2003-09-30 17:58 -------
Depends on your Kernelconfig!
Quote (help to
>CONFIG_IDEDISK_MULTI_MODE:
>
> If you get this error, try to say Y here:
>
> hda: set_multmode: status=0x51 { DriveReady SeekComplete Error }
> hda: set_multmode: error=0x04 { DriveStatusError }
>
> If in doubt, say N.
================================================================================

Current Fedora kernels already set this option. The help text is out of date,
and those warnings can occur from other parts of the IDE code.

It's the drive saying it doesn't understand a command it was passed,
which is quite easy to hit if you use an old drive. The triggers for these
commands need to be found so that some of these messages can be silenced.

They are, however, very likely to be unrelated to the corruption problem reported
here before bugzilla ate the original reporter's posting.
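
For readers unfamiliar with these messages, here is a rough sketch of how the two bytes decode. It is an illustration under stated assumptions, not the kernel's code: only "DriveReady", "SeekComplete", "Error" and "DriveStatusError" are confirmed by the messages in this bug; the other status bit names are recalled from the 2.4 IDE driver and the ATA register layout and should be treated as assumptions. In ATA terms, error bit 0x04 is ABRT (command aborted), which matches the "drive doesn't understand the command" reading above.

```python
# ATA status register bits, highest bit first. Names other than DriveReady,
# SeekComplete and Error are assumptions (not shown in this bug's logs).
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Only the error bit seen in this report is listed: 0x04 is ABRT,
# which the driver prints as "DriveStatusError".
ERROR_BITS = [
    (0x04, "DriveStatusError"),
]


def decode(value, table):
    """Return the names of the bits set in value, plus any unlisted bits."""
    names = [name for mask, name in table if value & mask]
    known = sum(mask for mask, _ in table)
    unknown = value & ~known & 0xFF
    if unknown:
        names.append("unknown(0x%02x)" % unknown)
    return names


print("status=0x51 {", " ".join(decode(0x51, STATUS_BITS)), "}")
print("error=0x04 {", " ".join(decode(0x04, ERROR_BITS)), "}")
```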

Comment 8 Dave Jones 2003-10-02 15:20:02 UTC
We're starting to suspect DMA problems with Fireball drives, as this is the
third report I've been able to find, and the drive is the only common factor
(different chipsets each time).

If you feel motivated to investigate this, can you paste the boot messages
of both a RHL9 and a Cambridge kernel so we can see how they differ?

Additionally, booting with ide=nodma may work around the corruption if our
guesses are correct.


Comment 9 Thornton Prime 2003-10-03 01:19:19 UTC
I am testing now with ide=nodma

Here are boot messages from a Severn2 (I'll post RH9 once I'm done testing):

Linux version 2.4.22-1.2061.nptl (bhcompile.redhat.com) (gcc
version3BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 0000000007ec0000 (usable)
 BIOS-e820: 0000000007ec0000 - 0000000007ef8000 (ACPI data)
 BIOS-e820: 0000000007ef8000 - 0000000007f00000 (ACPI NVS)
 BIOS-e820: 00000000ffb80000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
0MB HIGHMEM available.
126MB LOWMEM available.
On node 0 totalpages: 32448
zone(0): 4096 pages.
zone(1): 28352 pages.
zone(2): 0 pages.
ACPI disabled because your bios is from 2000 and too old
You can enable it with acpi=force
ACPI: RSDP (v000 AMI                                       ) @ 0x000ff980
ACPI: RSDT (v001 CAYMAN 8C1A100A 0x20000210 MSFT 0x00000097) @ 0x07ef0000
ACPI: FADT (v001 CAYMAN 8C1A100A 0x20000210 MSFT 0x00000097) @ 0x07ef1000
ACPI: DSDT (v001 CAYMAN CA81020A 0x00000012 MSFT 0x0100000b) @ 0x00000000
Kernel command line: ro root=/dev/vg00/lv00 console=tty0 console=ttyS0,9600n81
eide_setup: ide0=nodma,notune -- BAD OPTION
Initializing CPU#0
Detected 697.900 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1392.64 BogoMIPS
Memory: 124212k/129792k available (1509k kernel code, 5192k reserved, 1114k
dat)Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode cache hash table entries: 8192 (order: 4, 65536 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer cache hash table entries: 4096 (order: 2, 16384 bytes)
Page-cache hash table entries: 32768 (order: 5, 131072 bytes)
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel Pentium III (Coppermine) stepping 03
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch.au)
mtrr: detected mtrr type: Intel
ACPI: Subsystem revision 20030916
ACPI: Interpreter disabled.
PCI: PCI BIOS revision 2.10 entry at 0xfda95, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
Transparent bridge - Intel Corp. 82801AA PCI Bridge
PCI: Using IRQ router PIIX/ICH [8086/2410] at 00:1f.0
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
apm: BIOS version 1.2 Flags 0x0b (Driver version 1.16)
apm: disabled on user request.
Starting kswapd
VFS: Disk quotas vdquot_6.5.1
Asus Laptop ACPI Extras version 0.24a
  Couldn't get the DSDT table header
  Error registering Asus Laptop ACPI Extras Driver
        -0420: *** Error: Could not allocate an object descriptor
Detected PS/2 Mouse Port.
pty: 2048 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ
SEdttyS0 at 0x03f8 (irq = 4) is a 16550A
ttyS1 at 0x02f8 (irq = 3) is a 16550A
Real Time Clock Driver v1.10e
NET4: Frame Diverter 0.46
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ICH: IDE controller at PCI slot 00:1f.1
ICH: chipset revision 2
ICH: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
hda: QUANTUM FIREBALLP AS30.0, ATA DISK drive
blk: queue c040f3a0, I/O limit 4095Mb (mask 0xffffffff)
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: attached ide-disk driver.
hda: host protected area => 1
hda: 58633344 sectors (30020 MB) w/1902KiB Cache, CHS=3649/255/63, UDMA(66)
Partition check:
 hda: hda1 hda2
ide: late registration of driver.
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
Initializing Cryptographic API
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 8192 bind 16384)
Linux IP multicast router 0.06 plus PIM-SM
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0


Comment 10 Thornton Prime 2003-10-03 01:45:32 UTC
ide=nodma resulted in new errors when running bonnie++ tests (these are
repeating endlessly). So far no corruption (fingers crossed). I can try again
without LVM, but in the past I've seen corruption regardless of LVM ...

Writing with putc()...EXT3-fs error (device lvm(58,2)): ext3_new_block:
Allocat4EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in
system zo5EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in
system zo7EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in
system zo9EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in
system zo3EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in
system zo5EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in
system zo7EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in
system zo8EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in 
...



Comment 11 Thornton Prime 2003-10-03 02:02:29 UTC
OK ... even with ide=nodma, it still looks like I have problems.

I ran a few bonnie++ runs. After rebooting, the system couldn't find init. This
time I was able to boot with init=/bin/sh and repair the system.



Comment 12 Thornton Prime 2003-10-03 11:16:03 UTC
Created attachment 94910
Severn Boot Messages

My previous boot log was taken with a bad kernel parameter, so ide=nodma
wasn't taking effect.
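
A quick way to catch this kind of mistake is to scan the boot log for options the kernel rejected. The sketch below assumes the "-- BAD OPTION" wording seen in the earlier Severn2 boot messages (the eide_setup line); the function name rejected_options is hypothetical, and other kernel subsystems may report bad options differently.

```python
import re

# A rejected option is reported as "<setup routine>: <option> -- BAD OPTION".
BAD_OPTION = re.compile(r"^(\w+): (\S+) -- BAD OPTION$", re.MULTILINE)


def rejected_options(boot_log):
    """Return (setup_routine, option) pairs the kernel flagged as bad."""
    return BAD_OPTION.findall(boot_log)


# Two lines taken from the Severn2 boot messages posted earlier in this bug,
# followed by the next log line for context.
boot_log = (
    "Kernel command line: ro root=/dev/vg00/lv00 console=tty0 "
    "console=ttyS0,9600n81\n"
    "eide_setup: ide0=nodma,notune -- BAD OPTION\n"
    "Initializing CPU#0\n"
)

for routine, option in rejected_options(boot_log):
    print("rejected by %s: %s" % (routine, option))
```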
 
I rebuilt and rebooted (with the correct parameters -- boot messages attached)
and started over ... I am still getting filesystem corruption. After a few
dozen passes of bonnie++, I got the errors below. The interesting thing is that
bonnie++ was writing to /var on one logical volume, and only reading /usr from
another logical volume, but it was /usr that got corrupted ... this certainly
points to something beneath the filesystem as the source of the corruption.
 
Rebooting, the /usr volume was pretty hosed. Most of my shared libraries were
unrecoverable.

I'll re-run the same test without LVM, but with ide=nodma.

----------
 
# EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in0EXT3-fs error
(device lvm(58,1)): ext3_readdir: bad entry in directory #80003: 0EXT3-fs error
(device lvm(58,1)): ext3_add_entry: bad entry in directory #800030INIT: version
2.85 reloading
EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003:
0EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003:
0EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003:
0									       
 
[root@vajra root]# bonnie++
bonnie++: error while loading shared libraries: libstdc++.so.5: cannot open
shay[root@vajra root]# ldconfig
EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003:
0

Comment 13 Thornton Prime 2003-10-03 11:24:30 UTC
Created attachment 94911
RedHat 9 Boot Messages

Here are the boot messages from a RH9 install.

Comment 14 Dave Jones 2003-10-03 14:52:29 UTC
I'm interested to hear if this fares any better...
http://people.redhat.com/davej/2.4.22-1.2086.nptl/


Comment 15 Dave Jones 2003-10-09 15:02:25 UTC
Any news on this ?


Comment 16 Thornton Prime 2003-10-10 00:40:24 UTC
Sorry, my testing windows on this machine are rather limited ... but I will
hopefully get another one very soon.

Comment 17 Thornton Prime 2003-10-16 01:25:34 UTC
Sorry, I never got a chance to load 2086, but I am now running 2088 (Severn3).

So far, my same tests (bonnie++ plus some large finds) have been working great.
It has been running almost 12 hours straight with only one error in bonnie and
no kernel errors to speak of. No file system corruption.

This works for me! Thanks.

Comment 18 Dave Jones 2003-10-16 15:16:18 UTC
Looks like it was due to the AAM patch.
Can you paste the output of hdparm -I /dev/hda (or whatever drive that Quantum
Fireball is)?



Comment 19 Thornton Prime 2003-10-17 01:39:10 UTC
# hdparm -I /dev/hda
 
/dev/hda:
 
ATA device, with non-removable media
        Model Number:       QUANTUM FIREBALLP AS30.0
        Serial Number:      193036239076
        Firmware Revision:  A1Y.1300
Standards:
        Used: ATA/ATAPI-5 T13 1321D revision 1
        Supported: 5 4 3 2 & some of 6
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:   58633344
        device size with M = 1024*1024:       28629 MBytes
        device size with M = 1000*1000:       30020 MBytes (30 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        bytes avail on r/w long: 4      Queue depth: 1
        Standby timer values: spec'd by Vendor, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 254, current value: 128
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 *udma4 udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
                Security Mode feature set
           *    SMART feature set
           *    Automatic Acoustic Management feature set
           *    DOWNLOAD MICROCODE cmd
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
        not     supported: enhanced erase
        18min for SECURITY ERASE UNIT. 8min for ENHANCED SECURITY ERASE UNIT.
HW reset results:
        CBLID- above Vih
        Device num = 0 determined by CSEL
Checksum: correct
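
As a cross-check, the size figures in this output follow from the sector counts. This is a small worked example, assuming 512-byte sectors; the hdparm output itself does not state the sector size.

```python
SECTOR_BYTES = 512  # assumption: not stated in the hdparm output

# Values copied from the hdparm -I output above.
cylinders, heads, sectors_per_track = 16383, 16, 63
lba_sectors = 58633344

# CHS addressing is capped by the geometry fields, LBA is not.
chs_sectors = cylinders * heads * sectors_per_track

size_bytes = lba_sectors * SECTOR_BYTES
size_mib = size_bytes // (1024 * 1024)  # "M = 1024*1024"
size_mb = size_bytes // (1000 * 1000)   # "M = 1000*1000"

print("CHS addressable sectors:", chs_sectors)
print("device size (M = 1024*1024):", size_mib, "MBytes")
print("device size (M = 1000*1000):", size_mb, "MBytes")
```

The three printed values are 16514064, 28629 and 30020, matching the figures reported by hdparm.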


Comment 20 Need Real Name 2006-05-19 17:10:41 UTC
I've gotten similar errors and system lockups from an NEC-6500A DVD burner in
two different Thinkpad laptops under RH9 (not sure what kernel) and FC4 (the
stock kernel, 2.6.9, I believe).