Bug 142248

Summary: Possible ext3 filesystem corruption with 667 or 681 kernel
Product: [Fedora] Fedora
Component: kernel
Version: 3
Hardware: i686
OS: Linux
Status: CLOSED CANTFIX
Severity: high
Priority: medium
Target Milestone: ---
Target Release: ---
Reporter: Philippe Rigault <prigault>
Assignee: Alan Cox <alan>
QA Contact: Brian Brock <bbrock>
CC: alan, davej, jesus.salvo, sct, wtogami
Doc Type: Bug Fix
Last Closed: 2005-10-03 00:41:33 UTC

Description Philippe Rigault 2004-12-08 15:32:44 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.3; Linux) (KHTML, like Gecko)

Description of problem:
Dell Inspiron 5100 (i686)
60GB Hard Disk

1. Fresh install of FC3 from CDs.
   All filesystems are ext3
   Formatted: /, /boot, /usr
   No formatting (left as is): /home, /opt
   Install proceeds OK
   System boots with kernel-2.6.9-1.667.i686
2. Download updates, install kernel-2.6.9-1.681_FC3.i686
   Reboot into kernel-2.6.9-1.681_FC3.i686
3. Machine boots fine.
   It can be accessed remotely via ssh.
   Started disk- and CPU-intensive activity on /opt (compiling KDE).
   
4. After a few hours, the machine became unavailable through ssh.

On the console, I got:

Dec  8 04:03:59 mybox kernel: hdc: dma_timer_expiry: dma status == 0x21
Dec  8 04:04:09 mybox kernel: hdc: DMA timeout error
Dec  8 04:04:09 mybox kernel: hdc: dma timeout error: status=0xd0 { Busy }
Dec  8 04:04:09 mybox kernel:
Dec  8 04:04:09 mybox kernel: ide: failed opcode was: unknown
Dec  8 04:04:09 mybox kernel: hdc: DMA disabled
Dec  8 04:04:10 mybox kernel: ide1: reset: success
Dec  8 04:04:55 mybox init: Trying to re-exec init

5. I then rebooted and got the following messages:

Dec  8 09:20:46 mybox rpc.statd: rpc.statd: error while loading shared libraries: /usr/lib/libwrap.so.0: invalid ELF header
Dec  8 09:20:54 mybox sshd: /usr/sbin/sshd: error while loading shared libraries: /usr/lib/libwrap.so.0: invalid ELF header
Dec  8 09:20:54 mybox xinetd: xinetd: error while loading shared libraries: /usr/lib/libwrap.so.0: invalid ELF header

6. Suspecting corrupted files, I tried to reinstall the RPMs:

   Mounted the CD and issued the command:
rpm -ivh --replacepkgs --replacefiles tcp_wrappers-7.6-37.2.i386.rpm

ldconfig then complained that other libraries were not ELF files, in
libcroco, libxfcegui4, libxfce4util, xffm and binutils.

I also saw this on the console:

Dec  8 09:36:43 mybox kernel: hda: irq timeout: status=0xd0 { Busy }
Dec  8 09:36:43 mybox kernel: hda: irq timeout: error=0xd0LastFailedSense 0x0d
Dec  8 09:36:43 mybox kernel: hda: DMA disabled
Dec  8 09:36:43 mybox kernel: hda: ATAPI reset complete

At that point I saw bug #131822 and decided to fsck all partitions.

7. Rebooted with /forcefsck
   *No* errors were reported; fsck completed fine.

8. Reinstalled the following packages with
   rpm -ivh --replacepkgs --replacefiles
   libcroco, libxfcegui4, libxfce4util, xffm, binutils

No more ldconfig complaints and no more hd{a,c} timeouts (for now, at least).
Services like ssh restarted fine.

I have several questions:

a) If there was indeed file corruption, why did fsck not report any errors?
b) How can I check that _all_ installed packages are sane (ELF format OK, etc.)?
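For question b), the "invalid ELF header" errors above point at files whose first bytes were damaged. A minimal sketch in Python 3 (the function names and the directories to scan are illustrative assumptions, not an existing tool) that walks a directory and lists files with `.so` in their name that do not start with the 4-byte ELF magic `\x7fELF`. It only inspects those four bytes: the dynamic loader checks more header fields than this, and damage further inside a file will not be caught, which is why the `rpm -V` approach in Comment 1 is the more complete check. Some `.so` files (for example `/usr/lib/libc.so` on glibc systems) are legitimately plain-text linker scripts, so every hit needs a manual look.

```python
import os
import sys

# The first four bytes of every ELF file (e_ident[EI_MAG0..EI_MAG3]).
ELF_MAGIC = b"\x7fELF"


def has_elf_magic(path):
    """Return True if the file at `path` starts with the ELF magic number."""
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


def find_non_elf_libraries(root):
    """Yield regular files under `root` whose name contains '.so' but whose
    first four bytes are not the ELF magic. Symlinks are skipped so each
    library is reported once; unreadable files are reported as well."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if ".so" not in name:
                continue
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if not has_elf_magic(path):
                    yield path
            except OSError:
                # The loader could not read it either, so report it.
                yield path


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 find_bad_elf.py /usr/lib /lib
    for root in sys.argv[1:]:
        for path in find_non_elf_libraries(root):
            print(path)
```

Run against /usr/lib before the tcp_wrappers reinstall, a file like /usr/lib/libwrap.so.0 from the log above would be the kind of hit this reports.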

Thanks,

Philippe Rigault

Version-Release number of selected component (if applicable):
kernel-2.6.9-1.667.i686 and kernel-2.6.9-1.681_FC3.i686

How reproducible:
Didn't try

Steps to Reproduce:
1.
2.
3.
    

Additional info:

Comment 1 Stephen Tweedie 2004-12-09 22:45:14 UTC
fsck can only check filesystem metadata.  It has no idea what file contents
should look like, so any corruption that has only hit internal file data blocks
will not be visible to fsck.
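The point about metadata versus contents can be shown with a toy model. This sketch (Python 3; all names are hypothetical) reduces "metadata" to the file size and "contents" to an MD5 digest, the kind of per-file digest rpm packages of this era record. A write that zeroes a file's first four bytes (the ELF magic) leaves the size unchanged, so the metadata-style check passes while the digest check fails. It is a simplification: real fsck checks inodes, block allocation and directories, but none of that looks at the bytes stored inside data blocks.

```python
import hashlib


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


def metadata_check(data, recorded_size):
    # Stand-in for what a metadata-only check can see: the file's length.
    return len(data) == recorded_size


def content_check(data, recorded_md5):
    # What a digest-based verifier compares: a hash over the file's bytes.
    return md5_hex(data) == recorded_md5


# Hypothetical "installed" library and the values recorded at install time.
original = b"\x7fELF" + bytes(range(256)) * 16
recorded_size, recorded_md5 = len(original), md5_hex(original)

# Simulate a bad write that zeroes the ELF magic without changing the size.
corrupted = b"\x00" * 4 + original[4:]

print("metadata check passes:", metadata_check(corrupted, recorded_size))
print("content check passes:", content_check(corrupted, recorded_md5))
```

Running it prints `True` for the metadata check and `False` for the content check.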

"rpm -V" has access to file checksums so is able to check the correctness of
file contents.  "rpm -Va" to check all packages (though be aware that things
like config files are expected to have changed since install time.)
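The caveat about config files can be applied mechanically when reading `rpm -Va` output. A rough filter in Python 3, assuming the usual line layout: a flag field such as `S.5....T` (a `5` means the file digest differs) or the word `missing`, an optional attribute letter (`c` marks a config file), then the path. The script and file names in the usage comment are hypothetical, and the regular expression may need adjusting for other rpm versions.

```python
import re
import sys

# Assumed layout of one `rpm -V` line, e.g. "S.5....T c /etc/ssh/sshd_config":
# an 8- or 9-character flag field (or the word "missing"), an optional
# attribute letter (c = config, d = doc, g = ghost, l = license, r = readme),
# then the absolute path.
VERIFY_LINE = re.compile(
    r"^(?P<flags>[SM5DLUGTP.?]{8,9}|missing)\s+"
    r"(?:(?P<attr>[cdglr])\s+)?"
    r"(?P<path>/.*)$"
)


def suspicious_files(lines):
    """Return paths that are missing or whose digest changed ('5' flag),
    skipping config files, which are expected to change after install."""
    hits = []
    for line in lines:
        match = VERIFY_LINE.match(line.rstrip("\n"))
        if match is None or match.group("attr") == "c":
            continue
        flags = match.group("flags")
        if flags == "missing" or "5" in flags:
            hits.append(match.group("path"))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: rpm -Va > verify.txt; python3 rpmva_suspects.py verify.txt
    with open(sys.argv[1]) as report:
        for path in suspicious_files(report):
            print(path)
```

Changed contents in a non-config library, like the libwrap.so.0 case in the description, are exactly what this keeps; the package owning each path can then be found with `rpm -qf` and reinstalled.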



Comment 3 Dave Jones 2004-12-09 23:06:47 UTC
Can you paste the output of 'lspci', please?
Also 'hdparm -i /dev/hda', and any IDE-related messages from 'dmesg'?

Might also be worth looking through /var/log/messages to see if there are any other
nasty-looking IDE messages above the ones you pasted.
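Searching /var/log/messages for IDE errors can be scripted. A sketch in Python 3 that assumes the syslog line format shown in this report (`... kernel: hdc: ...`) and matches only error strings that appear in the messages quoted here; the marker list and function names are assumptions, not a complete catalogue of IDE error messages. Reading /var/log/messages normally requires root.

```python
import re
import sys
from collections import Counter

# Assumption: markers taken from the kernel messages quoted in this report,
# matched case-insensitively; not an exhaustive list of IDE errors.
IDE_ERROR_MARKERS = ("dma_timer_expiry", "dma timeout", "irq timeout",
                     "dma disabled", "reset")

# Device prefix as it appears in syslog, e.g. "kernel: hdc:" or "kernel: ide1:".
DEVICE = re.compile(r"kernel: (hd[a-z]|ide\d+):")


def ide_errors(lines):
    """Yield (device, line) for kernel log lines that name an IDE device and
    contain one of IDE_ERROR_MARKERS."""
    for line in lines:
        match = DEVICE.search(line)
        lowered = line.lower()
        if match and any(marker in lowered for marker in IDE_ERROR_MARKERS):
            yield match.group(1), line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 ide_scan.py /var/log/messages
    with open(sys.argv[1], errors="replace") as log:
        hits = list(ide_errors(log))
    # Per-device totals first, then the matching lines in log order.
    for device, count in Counter(dev for dev, _ in hits).most_common():
        print(f"{device}: {count} matching line(s)")
    for _, line in hits:
        print(line)
```

Lines from `dmesg` lack the `kernel:` prefix, so this sketch only fits the syslog copy.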


Comment 5 Philippe Rigault 2004-12-10 00:04:41 UTC
I forgot to mention that hda is a CD-RW and hdc is the hard disk.        
        
> can you paste the output of 'lspci' please ?                
                
00:00.0 Host bridge: Intel Corp. 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 03)
00:01.0 PCI bridge: Intel Corp. 82845G/GL[Brookdale-G]/GE/PE Host-to-AGP Bridge (rev 03)
00:1d.0 USB Controller: Intel Corp. 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corp. 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corp. 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corp. 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev 82)
00:1f.0 ISA bridge: Intel Corp. 82801DB/DBL (ICH4/ICH4-L) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corp. 82801DB (ICH4) IDE Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corp. 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 02)
00:1f.6 Modem: Intel Corp. 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Modem Controller (rev 02)
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon Mobility M7 LW [Radeon Mobility 7500]
02:01.0 Ethernet controller: Broadcom Corporation BCM4401 100Base-T (rev 01)
02:02.0 Network controller: Broadcom Corporation BCM4309 802.11a/b/g (rev 02)
02:04.0 CardBus bridge: Texas Instruments PCI4510 PC card Cardbus Controller (rev 02)
02:04.1 FireWire (IEEE 1394): Texas Instruments PCI4510 IEEE-1394 Controller
                 
                 
>hdparm -i /dev/hda                 
                 
/dev/hda:                 
                 
 Model=HL-DT-STCD-RW/DVD-ROM GCC-4240N, FwRev=E112, SerialNo=                 
 Config={ Fixed Removeable DTR<=5Mbs DTR>10Mbs nonMagnetic }                 
 RawCHS=0/0/0, TrkSize=0, SectSize=0, ECCbytes=0                 
 BuffType=unknown, BuffSize=0kB, MaxMultSect=0                 
 (maybe): CurCHS=0/0/0, CurSects=0, LBA=yes, LBAsects=0                 
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}                 
 PIO modes:  pio0 pio1 pio2 pio3 pio4                 
 DMA modes:  sdma0 sdma1 sdma2 mdma0 mdma1 mdma2                 
 UDMA modes: udma0 udma1 *udma2                 
 AdvancedPM=no                 
 Drive conforms to: device does not report version:                 
                 
 * signifies the current active mode                 
 
While I am at it, here is hdc:
                 
/dev/hdc: 
 
 Model=IC25N060ATMR04-0, FwRev=MO3OAD0A, SerialNo=MRG357K3HPJBYH 
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } 
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 
 BuffType=DualPortCache, BuffSize=7884kB, MaxMultSect=16, MultSect=16 
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117210240 
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} 
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 
 AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled 
 Drive conforms to: ATA/ATAPI-6 T13 1410D revision 3a: 
 
 * signifies the current active mode 
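Both hdparm listings mark the active transfer mode with `*`. A small sketch in Python 3 (function and table names are hypothetical) that pulls the starred token out of each "modes:" line and maps UDMA modes to their nominal rates. hdc's `*udma5` and hda's `*udma2` line up with the `UDMA(100)` and `UDMA(33)` reported in the dmesg output that follows.

```python
# Nominal Ultra DMA transfer rates in MB/s (UDMA modes 0 through 6).
UDMA_SPEED_MB_S = {
    "udma0": 16.7, "udma1": 25.0, "udma2": 33.3, "udma3": 44.4,
    "udma4": 66.7, "udma5": 100.0, "udma6": 133.0,
}


def active_modes(hdparm_output):
    """Map each 'XXX modes:' line of `hdparm -i` output to its active mode,
    i.e. the token with a leading '*'. Lines without a '*' are skipped."""
    active = {}
    for line in hdparm_output.splitlines():
        label, sep, modes = line.partition("modes:")
        if not sep:
            continue
        for token in modes.split():
            if token.startswith("*"):
                active[label.strip()] = token.lstrip("*")
    return active


if __name__ == "__main__":
    hdc = " UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5"
    mode = active_modes(hdc)["UDMA"]
    print(mode, UDMA_SPEED_MB_S[mode])  # udma5 100.0
```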
 
 
> and any ide related messages from 'dmesg' ?              
             
Indeed, I don't like the 'Wait for ready failed before probe' ones:          
          
PCI: Enabling device 0000:00:1f.1 (0005 -> 0007)             
ACPI: PCI interrupt 0000:00:1f.1[A] -> GSI 11 (level, low) -> IRQ 11             
ICH4: chipset revision 2             
ICH4: not 100% native mode: will probe irqs later             
    ide0: BM-DMA at 0xbfa0-0xbfa7, BIOS settings: hda:DMA, hdb:pio             
    ide1: BM-DMA at 0xbfa8-0xbfaf, BIOS settings: hdc:DMA, hdd:pio             
Probing IDE interface ide0...             
hda: HL-DT-STCD-RW/DVD-ROM GCC-4240N, ATAPI CD/DVD-ROM drive             
Using cfq io scheduler             
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14             
Probing IDE interface ide1...             
hdc: IC25N060ATMR04-0, ATA DISK drive             
ide1 at 0x170-0x177,0x376 on irq 15             
Probing IDE interface ide2...             
ide2: Wait for ready failed before probe !             
Probing IDE interface ide3...             
ide3: Wait for ready failed before probe !             
Probing IDE interface ide4...             
ide4: Wait for ready failed before probe !             
Probing IDE interface ide5...             
ide5: Wait for ready failed before probe !             
hdc: max request size: 1024KiB             
hdc: 117210240 sectors (60011 MB) w/7884KiB Cache, CHS=16383/255/63, UDMA(100)             
hdc: cache flushes supported             
 hdc: hdc1 hdc2 hdc3 hdc4 < hdc5 hdc6 hdc7 hdc8 hdc9 hdc10 >             
hda: ATAPI 24X DVD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)             
Uniform CD-ROM driver Revision: 3.20             
ide-floppy driver 0.99.newide             
             
> Might also be worth looking through /var/log/messages to see if theres any
> other nasty looking IDE messages above the ones you pasted.
       
I am sending you the complete /var/log/messages privately (17k compressed).
      
The first thing I noticed is that the first time the machine booted after the          
install (667 kernel), there were no messages like "Wait for ready failed          
before probe !":          
          
Dec  7 15:02:27 mybox kernel: RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Dec  7 15:02:27 mybox kernel: Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
Dec  7 15:02:27 mybox kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Dec  7 15:02:27 mybox kernel: ICH4: IDE controller at PCI slot 0000:00:1f.1
Dec  7 15:02:27 mybox kernel: PCI: Enabling device 0000:00:1f.1 (0005 -> 0007)
Dec  7 15:02:27 mybox kernel: ACPI: PCI interrupt 0000:00:1f.1[A] -> GSI 11 (level, low) -> IRQ 11
Dec  7 15:02:27 mybox kernel: ICH4: chipset revision 2
Dec  7 15:02:27 mybox kernel: ICH4: not 100%% native mode: will probe irqs later
Dec  7 15:02:27 mybox kernel:     ide0: BM-DMA at 0xbfa0-0xbfa7, BIOS settings: hda:DMA, hdb:pio
Dec  7 15:02:27 mybox kernel:     ide1: BM-DMA at 0xbfa8-0xbfaf, BIOS settings: hdc:DMA, hdd:pio
Dec  7 15:02:27 mybox kernel: hda: HL-DT-STCD-RW/DVD-ROM GCC-4240N, ATAPI CD/DVD-ROM drive
Dec  7 15:02:27 mybox kernel: Using cfq io scheduler
Dec  7 15:02:27 mybox kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Dec  7 15:02:27 mybox kernel: hdc: IC25N060ATMR04-0, ATA DISK drive
Dec  7 15:02:27 mybox kernel: ide1 at 0x170-0x177,0x376 on irq 15
Dec  7 15:02:27 mybox kernel: hdc: max request size: 1024KiB
Dec  7 15:02:27 mybox kernel: hdc: 117210240 sectors (60011 MB) w/7884KiB Cache, CHS=16383/255/63, UDMA(100)
Dec  7 15:02:27 mybox kernel:  hdc: hdc1 hdc2 hdc3 hdc4 < hdc5 hdc6 hdc7 hdc8 hdc9 hdc10 >
Dec  7 15:02:27 mybox kernel: hda: ATAPI 24X DVD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)
Dec  7 15:02:27 mybox kernel: Uniform CD-ROM driver Revision: 3.20
Dec  7 15:02:27 mybox kernel: ide-floppy driver 0.99.newide
          
  
           
 

Comment 6 Alan Cox 2004-12-10 14:12:19 UTC
The probe messages are just escaped irrelevant debug. Ignore those.

The rest seems in itself like the machine had problems with drives - is this a
laptop that is getting suspended/restored ?

The two sets of traces are:
   DMA fails (still running)
   We whack the hard disk to reset it
   The disk stays busy
   We try PIO
   [end of log]

The other one is quite similar: the IDE CD decided it was still busy after our
timeout. There was no sense data available, and we tried to switch down to PIO.
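This reading of the traces can be checked by decoding the two status bytes from the first log. A sketch in Python 3: the bit positions for the bus-master DMA status register follow the SFF-8038i bus-master IDE interface (the bit labels here are illustrative), and the ATA status decoding mimics the kernel's `{ Busy }` printout, which reports only Busy while BSY is set because the other bits are not valid then. `dma status == 0x21` decodes to Active plus Drive0DMACapable with no Interrupt bit, i.e. the DMA transfer was still running when the timer expired (hdc is drive 0, the master, on ide1). `status=0xd0` decodes to Busy only.

```python
# ATA status register bits, in the order the 2.6 kernel prints them.
ATA_BUSY = 0x80
ATA_STATUS_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Bus-master IDE DMA status register bits (SFF-8038i); labels are illustrative.
BMDMA_STATUS_BITS = [
    (0x01, "Active"),
    (0x02, "Error"),
    (0x04, "Interrupt"),
    (0x20, "Drive0DMACapable"),
    (0x40, "Drive1DMACapable"),
    (0x80, "SimplexOnly"),
]


def decode_ata_status(stat):
    # While BSY is set the other bits are not meaningful, so report only Busy.
    if stat & ATA_BUSY:
        return ["Busy"]
    return [name for bit, name in ATA_STATUS_BITS if stat & bit]


def decode_bmdma_status(stat):
    return [name for bit, name in BMDMA_STATUS_BITS if stat & bit]


if __name__ == "__main__":
    print("hdc dma status 0x21:", decode_bmdma_status(0x21))
    print("hdc status 0xd0:", decode_ata_status(0xD0))
```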


Comment 7 Philippe Rigault 2004-12-10 16:13:59 UTC
> The rest seems in itself like the machine had problems with drives - is this a
> laptop that is getting suspended/restored ?
 
Not since the install of FC3. 
BUT _prior_ to the FC3 install, it ran FC2 with a custom kernel (vanilla
2.6.9 patched with swsusp2, which worked brilliantly, btw) and had been
suspended/restored a few times.

The partition that had the corruption problem was formatted during the FC3
install, though.
 
 

Comment 8 Jesus Salvo Jr. 2005-01-14 04:43:02 UTC
I had this problem as well, with the last 2.6.9 FC3 kernel before 2.6.10 was
released (I recall updating the kernel last week, and the last 2.6.9 kernel
was 724).

Hardware: Dell PowerEdge 750. No LVM and no RAID (neither hardware nor
software).

I recall seeing "ext3 journal aborted" several times on the console.

Since it's all hosed anyway, I am reinstalling FC3; this time I'll update to
the latest 2.6.10 kernel.
 
 
 

Comment 9 rickyrockrat 2005-04-20 14:00:27 UTC
I have similar issues with a fresh FC3 install on all-new hardware. The drive is
an IDE UDMA 100 Seagate, and the machine is an AMD desktop. I don't have the
specifics, but the messages are similar to those above, and I see "ext3 journal
aborted" in a steady stream. I also seem to have had similar problems with an
AMD 64-bit system using a Serial ATA drive with a Promise controller and a VIA
chipset. I installed the latest FC3 updates, and that seems to have fixed the
64-bit system.

This is a major bug, though I have no clue where it is. As a side note for
those of us used to older distros: be sure to update HAL when you update the
kernel.

Comment 10 Dave Jones 2005-07-15 18:13:42 UTC
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 11 Dave Jones 2005-10-03 00:41:33 UTC
This bug has been automatically closed as part of a mass update.
It had been in NEEDINFO state since July 2005.
If this bug still exists in current errata kernels, please reopen this bug.

There are a large number of inactive bugs in the database, and this is the only
way to purge them.

Thank you.