Description of problem:
When Fedora 8 is installed on my PC (PIII on Abit BX133, HDDs connected to the on-board HPT370 controller, 768MB of ECC RAM), data read from disk is corrupted. Here and there some 4 bytes of data from disk are placed twice (occupying 8 bytes), then some kilobytes later some 4 bytes are deleted. Any activity which rewrites data causes more and more damage. prelink destroys the system beyond repair quite quickly. e2fsck causes immediate damage to the filesystem.
The corrupted data is placed in the buffer cache; after the cache is cleared (echo 1 > /proc/sys/vm/drop_caches or echo 3 > /proc/sys/vm/drop_caches) the data corruption disappears, only to appear in some other piece of data. Clearing the cache does not seem to fix filesystem data corruption, e.g. directory attributes.
Please note that the system has been running with no problems on Fedora Core 4 and Fedora Core 6. This makes Fedora 8 on this system unusable and self-destroying even after prelink is disabled.
Version-Release number of selected component (if applicable):
kernels kernel-2.6.23.1-42.fc8.i686 and kernel-2.6.23.15-137.fc8.i686. The latter kernel seems to be better, but the bug is definitely still there.
How reproducible:
On my system, always.
Steps to Reproduce:
1. disable prelink
2. rpm -Va
3. while rpm -V is running, copy files which are marked with 5 but not T to another filesystem
4. clear cache by writing
Actual results:
Sometimes the RPM database is reported to be corrupted (esp. with the older kernel). Random files fail MD5 verification. After the buffer cache is cleared and rpm -Va is run again, some other files fail MD5 verification.
Expected results:
The kernel should read data correctly from disk, as FC6 did.
Additional info:
Two hard disks are connected to the HPT-370 on-board controller, one for each channel. No mirroring or RAID0 employed. /boot was on a separate extended partition, the remaining partitions on LVM logical volumes. I will include additional information as attachments.
lspci result for the system:
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
00:0b.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev 08)
00:0d.0 USB Controller: NEC Corporation USB (rev 43)
00:0d.1 USB Controller: NEC Corporation USB (rev 43)
00:0d.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
00:0f.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 07)
00:0f.1 Input device controller: Creative Labs SB Live! Game Port (rev 07)
00:13.0 Mass storage controller: Triones Technologies, Inc. HPT366/368/370/370A/372/372N (rev 04)
01:00.0 VGA compatible controller: nVidia Corporation NV6 [Vanta/Vanta LT] (rev 15)
lsmod result includes the following lines:
pata_hpt3x2n 10561 0
pata_hpt366 10817 0
pata_hpt37x 15681 9
libata 99633 4 ata_piix,pata_hpt3x2n,pata_hpt366,pata_hpt37x
sd_mod 27329 11
scsi_mod 119757 4 sg,sr_mod,libata,sd_mod
Created attachment 295357 [details] output of dmesg command, including segfaults info
I am including output from the dmesg command. At the end of the file there are some segfaults caused by data corruption.
Created attachment 295361 [details] archive containing 'od' diffs which show data corruption
The archive file (tar.bz2 format) contains the following directories, each holding test results:
080211_F8 - original kernel
080212_F8DVD_rescue - system started from F8 DVD in rescue mode, then chrooted to disk
080212_F8HD_100MHzFSB - decreased FSB from 121MHz to 100MHz, just in case
080213_FC8 - tests with updated kernel
While running rpm -Va I was copying files reported to fail MD5 verification (but without T, as the data corruption I experience does not normally change the modification time of a file). Then, using Fedora Core 6, I compared the corrupted version of each file with the original.
Each of the directories contains:
.flist - file names list with respect to /
.flist.ls-l--full* - result of ls -l --full-time
.diffs*bash - loop run to get the results in .diffs
.diffs - result of running the .diffs*bash commands; in general, the files are processed by the od command to display the data without offsets in hex, one byte per line, then the results of 'od' are compared with the diff program.
Please note that I am not including the damaged or original files in the archive, to keep the size reasonable.
Can you test kernel-2.6.24.2-7 from the updates-testing repository?
(In reply to comment #4)
> Can you test kernel-2.6.24.2-7 from the updates-testing repository?
Unfortunately my Fedora 8 installation got corrupted and I do not know how to fix it. I would prefer to install it again using the kernel from updates-testing. Could you possibly point me to some documentation on how this can be done (a single machine, so no PXE, NFS, etc.)?
I tried the test kernel kernel-2.6.24.2-7 on Fedora 8. The problem with duplicated 4 bytes on the HPT370 is still present and happens very often. I will include dmesg output, since there are some new messages which did not occur before. Also, when trying to md5sum 2 copies of a file slightly under 2GiB on two disks at the same time (both connected to the same HPT370 controller on different channels; on Fedora Core 6 they are named hde and hdg), the system became completely unresponsive. I could not change text console, and even Alt-SysRq did not work. I had to press the reset key.
Created attachment 295755 [details] dmesg output with test kernel
The dmesg output shows system startup, which was very slow indeed (2nd attempt; the first one locked up during udev creation). For comparison, dmesg on Fedora Core 6 shows both disks running at UDMA66 (with no problems); I believe on some earlier Linux version one of the disks ran at UDMA100. dmesg on FC6 shows:
hde: max request size: 512KiB
hde: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=16383/255/63, UDMA(66)
hde: cache flushes supported
 hde: hde1 hde2 hde3 hde4 < hde5 hde6 hde7 hde8 hde9 hde10 hde11 hde12 hde13 >
hdg: max request size: 128KiB
hdg: 60036480 sectors (30738 MB) w/1916KiB Cache, CHS=59560/16/63, UDMA(66)
hdg: cache flushes not supported
Created attachment 295757 [details] comparing two copies of ~2GB file
While running F8 with the test kernel, after the usual game with rpm -V to see which files fail verification and save them, I ran another test. I created a file of size slightly below 2GiB on /home (sda) using compressed data, and copied it to /var/tmp (on sdb). When I tried to compute the md5sum of both at the same time, the system locked solid very quickly (HD LED always on, top showed only one md5sum consuming some CPU time). After I tried to run dmesg the system locked up, with no response to Alt-Fx or to Alt-SysRq, and no message on screen.
After rebooting the system back to FC6 I compared the two files. I split them into 64KiB pieces, and those pieces that differed between the two files were compared after being filtered with od. The bash command to produce the diffs is in the file __compare.bash. The pieces are named according to the dd parameter used for extracting the parts of the test file.
Please note that some offsets occur many times:
$ cat *diff | grep '^@@' | sort | uniq -cd
30 @@ -24441,16 +24441,20 @@
2 @@ -24442,16 +24442,20 @@
2 @@ -49137,24 +49145,16 @@
35 @@ -49141,20 +49145,16 @@
Perhaps this would give some hint as to what is going on?
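For anyone who wants to repeat this kind of piecewise comparison without dd, the following is a minimal Python sketch. It is not the contents of __compare.bash (that script is only in the attachment); the function name `differing_pieces` and the use of in-memory streams are assumptions made for illustration. It only finds which 64 KiB pieces differ; the od/diff step is not reproduced here.

```python
import io

PIECE_SIZE = 64 * 1024  # 64 KiB, the piece size mentioned in the comment above


def differing_pieces(stream_a, stream_b, piece_size=PIECE_SIZE):
    """Return the indices of fixed-size pieces that differ between two binary streams.

    Hypothetical helper: both streams are read piece by piece, so two large
    files are never held in memory at once.
    """
    mismatches = []
    index = 0
    while True:
        piece_a = stream_a.read(piece_size)
        piece_b = stream_b.read(piece_size)
        if not piece_a and not piece_b:
            break  # both streams exhausted
        if piece_a != piece_b:
            mismatches.append(index)
        index += 1
    return mismatches


# Usage with in-memory data; real files would be opened with open(path, "rb").
good = bytes(range(256)) * 1024          # 256 KiB of sample data
bad = bytearray(good)
bad[70000] ^= 0xFF                       # flip one byte inside the second piece
pieces = differing_pieces(io.BytesIO(good), io.BytesIO(bytes(bad)))
```

Multiplying a reported piece index by the piece size gives the byte offset at which that piece starts, which corresponds to the dd skip parameter used to name the pieces.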
4-byte shifts look like a FIFO handling bug or a chipset problem. What chipset is the BX133 - a VIA KL133 or similar?
No, the chipset is Intel 440BX; lspci -v returns:
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
    Flags: bus master, medium devsel, latency 32
    Memory at d0000000 (32-bit, prefetchable) [size=64M]
    Capabilities: [a0] AGP version 1.0
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03) (prog-if 00 [Normal decode])
    Flags: bus master, 66MHz, medium devsel, latency 64
    Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
    Memory behind bridge: d4000000-d5ffffff
    Prefetchable memory behind bridge: d6000000-d7ffffff
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02)
    Flags: bus master, medium devsel, latency 0
[...]
00:13.0 Mass storage controller: Triones Technologies, Inc. HPT366/368/370/370A/372/372N (rev 04)
    Subsystem: Triones Technologies, Inc. HPT370A
    Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 11
    I/O ports at d400 [size=8]
    I/O ports at d800 [size=4]
    I/O ports at dc00 [size=8]
    I/O ports at e000 [size=4]
    I/O ports at e400 [size=256]
    Expansion ROM at 40100000 [disabled by cmd] [size=128K]
    Capabilities: [60] Power Management version 2
01:00.0 VGA compatible controller: nVidia Corporation NV6 [Vanta/Vanta LT] (rev 15) (prog-if 00 [VGA])
    Subsystem: nVidia Corporation Unknown device 0072
    Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
    Memory at d4000000 (32-bit, non-prefetchable) [size=16M]
    Memory at d6000000 (32-bit, prefetchable) [size=32M]
    [virtual] Expansion ROM at d5000000 [disabled] [size=64K]
    Capabilities: [60] Power Management version 1
    Capabilities: [44] AGP version 2.0
https://bugzilla.redhat.com/attachment.cgi?id=295357 contains dmesg output.
Mainboard information as reported by dmidecode is:
Handle 0x0002, DMI type 2, 8 bytes.
Base Board Information
    Manufacturer: <<http:\\www.abit.com.tw>>
    Product Name: I440BX-W977(BX133-RAID/BE6-II v2.0)
    Version:
    Serial Number:
Possible fix sent upstream. It fixes the mode masking.
That fix was merged in F8 kernel 2.6.24.3-22, but bugzilla wasn't updated. The kernel has been submitted to updates-testing. Please test.
(In reply to comment #12)
> That fix was merged in F8 kernel 2.6.24.3-22, but bugzilla wasn't updated.
> The kernel has been submitted to updates-testing. Please test.
The kernel is still not on any of the rsync mirrors I looked at. It is also not in ftp://download.fedora.redhat.com/pub/fedora/linux/updates/testing/8
Am I missing something?
Wojtek
Is it possible to switch a Fedora 8 system to use the ide-disk driver for HPT-connected disks?
On Fedora Core 6 I have:
$ cat /proc/ide/ide2/hde/driver
ide-disk version 1.18
$ cat /proc/ide/ide3/hdg/driver
ide-disk version 1.18
On Fedora 8 I have:
$ cat /mnt/fc8/etc/modprobe.conf
alias scsi_hostadapter libata
alias scsi_hostadapter1 pata_hpt37x
alias scsi_hostadapter2 ata_piix
alias snd-card-0 snd-emu10k1
options snd-card-0 index=0
options snd-emu10k1 index=0
Is it possible to modify modprobe.conf and/or the initrd to use the ide-disk driver for hde and hdg?
Wojtek
I got kernel 2.6.24.3-22.fc8.i686 today and installed it into my test Fedora 8 system (using FC6 and chroot; otherwise the rpm database would most probably have become corrupted). The kernel behaves somewhat differently from 2.6.24.2-7:
On my Seagate disk ST380011A the data read is still getting corrupted (some bytes added, then some removed).
On my IBM-DTLA-307030 disk (the one on the blacklist) things are more interesting. I was not able to get any data corruption. On the other hand, a lot of errors are generated, and recovery from the errors takes long minutes. During the recovery the process trying to read from the disk is not killable at all, and any other process trying to access the same disk is stalled.
During one of the tests, when reading the large file from the IBM disk and writing 3 to /proc/sys/vm/drop_caches several times, the system locked solid; even Alt-Ctrl-Fx or Alt-SysRq did not work. I will attach portions of dmesg output.
Created attachment 297997 [details] dmesg output + part of /var/log/messages, kernel 2.6.24.3-22.fc8.i686
This is the output of dmesg saved a few minutes before the system locked solid (Alt-SysRq not working at all). Please note that almost all filesystems are on sda-located LVs (ST disk); /var/tmp is on sdb. While reading a large file from /var/tmp the system displayed a number of errors, then locked solid. Since not all ata2 errors were included in the dmesg output, I have also appended part of /var/log/messages.
Created attachment 297999 [details] dmesg from another test of kernel 2.6.24.3-22.fc8 on HPT-370 attached disks
Please note that all I/O errors are fake (the result of a buggy driver) - the sectors read perfectly well from the FC6 system. The file which was being read was about 190 MiB long. Reading the file using the same tool on FC6 takes 8 seconds.
sdb: On FC8, reading it from sdb took from 10 to 20 minutes or more, but no data corruption took place on sdb.
sda: On sda, reading under FC8 was fast (10-20 s, compared to 7 s on FC6), but data corruption was evident. The size of the part of the file with moved data (between insertion and deletion) was 5*4K - 16*4K, with 7*4K being most common.
Tests were done by generating an MD5 for each 4K chunk of file data and comparing with a reference MD5 sequence. Before each test I ran echo 3 > /proc/sys/vm/drop_caches
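The per-chunk MD5 test described above can be sketched roughly as follows. This is a hedged illustration, not the tool that was actually used: the names `chunk_digests` and `mismatching_runs` are hypothetical, and the sample data is synthetic (a counter pattern with one 4-byte word duplicated and a later word removed, mimicking the corruption reported in this bug).

```python
import hashlib

CHUNK = 4096  # 4K chunks, as in the test described above


def chunk_digests(data, chunk=CHUNK):
    """MD5 hex digest of every fixed-size chunk of a bytes object."""
    return [hashlib.md5(data[i:i + chunk]).hexdigest()
            for i in range(0, len(data), chunk)]


def mismatching_runs(reference, observed):
    """Return (first_chunk, chunk_count) for each run of consecutive differing digests."""
    runs = []
    start = None
    common = min(len(reference), len(observed))
    for i in range(common):
        if reference[i] != observed[i]:
            if start is None:
                start = i          # a new run of bad chunks begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, common - start))
    return runs


# Synthetic data: 8192 distinct 4-byte words = 32 KiB = 8 chunks.
good = b"".join(n.to_bytes(4, "little") for n in range(8192))
# Duplicate the word before offset 5000, drop the word at offset 13000:
# everything in between is shifted by 4 bytes, the total length is unchanged.
bad = good[:5000] + good[4996:5000] + good[5000:13000] + good[13004:]
runs = mismatching_runs(chunk_digests(good), chunk_digests(bad))
```

Each run length, multiplied by 4K, gives the size of the shifted region rounded out to chunk boundaries, which is the kind of figure quoted above (5*4K to 16*4K).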
Can you attach an FC6 boot dmesg for comparison ?
Created attachment 298598 [details] dmesg output from FedoraCore6 on the same machine I am attaching dmesg output from Fedora Core 6 (kernel Linux version 2.6.22.14-72.fc6 (brewbuilder.redhat.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-13)) I am also including below excerpt from /proc/ide/*/hd[eg]/* on FC6 Wojtek "head -3 /proc/ide/ide2/hde/*" result: ==> /proc/ide/ide2/hde/cache <== 2048 ==> /proc/ide/ide2/hde/capacity <== 156301488 ==> /proc/ide/ide2/hde/driver <== ide-disk version 1.18 ==> /proc/ide/ide2/hde/geometry <== physical 16383/16/63 logical 16383/255/63 ==> /proc/ide/ide2/hde/identify <== 0c5a 3fff c837 0010 0000 0000 003f 0000 0000 0000 354a 5633 584e 5330 2020 2020 2020 2020 2020 2020 0000 1000 0004 332e ==> /proc/ide/ide2/hde/media <== disk ==> /proc/ide/ide2/hde/model <== ST380011A ==> /proc/ide/ide2/hde/settings <== name value min max mode ---- ----- --- --- ---- acoustic 0 0 254 rw ==> /proc/ide/ide2/hde/smart_thresholds <== 000a 0601 0000 0000 0000 0000 0000 0003 0000 0000 0000 0000 0000 1404 0000 0000 0000 0000 0000 2405 0000 0000 0000 0000 ==> /proc/ide/ide2/hde/smart_values <== 000a 0f01 4300 d23f 8e4b 000a 0000 0303 6200 0062 0000 0000 0000 3204 6400 0464 0000 0000 0000 3305 6400 0064 0000 0000 "head -3 /proc/ide/ide3/hdg/*" result: ==> /proc/ide/ide3/hdg/cache <== 1916 ==> /proc/ide/ide3/hdg/capacity <== 60036480 ==> /proc/ide/ide3/hdg/driver <== ide-disk version 1.18 ==> /proc/ide/ide3/hdg/geometry <== physical 16383/16/63 logical 59560/16/63 ==> /proc/ide/ide3/hdg/identify <== 045a 3fff 37c8 0010 0000 0000 003f 0000 0000 0000 2020 2020 2020 2020 2059 4b44 594b 5638 3031 3935 0003 0ef8 0028 5458 ==> /proc/ide/ide3/hdg/media <== disk ==> /proc/ide/ide3/hdg/model <== IBM-DTLA-307030 ==> /proc/ide/ide3/hdg/settings <== name value min max mode ---- ----- --- --- ---- acoustic 0 0 254 rw ==> /proc/ide/ide3/hdg/smart_thresholds <== 0010 3c01 0000 0000 0000 0000 0000 3202 0000 0000 0000 0000 0000 1803 0000 0000 0000 0000 0000 0004 0000 0000 
0000 0000 ==> /proc/ide/ide3/hdg/smart_values <== 0010 0b01 6400 0064 0000 0000 0000 0502 8400 5584 0001 0000 0000 0703 7b00 b07b b300 0400 0000 1204 6400 2464 000c 0000
Not clear what is going on and the published info on the errata is minimal. Might be interesting to try changing both hpt370 and hpt370a_filter to read if (1 || hpt_dma_blacklisted(adev, "UDMA100", bad_ata100_5)) just as a test on this box.
Closing - apparent underlying bug now fixed upstream, no activity
The bug is still present on Fedora 12 on the same hardware. There are two hard disks connected to the HPT370A (lspci -v displays: Subsystem: HighPoint Technologies, Inc. HPT370A): ST380011A and IBM-DTLA-307030. The system works perfectly well with CentOS 5.4, with the ST380011A configured for UDMA/100 and the IBM-DTLA-307030 configured for UDMA/66.
On Fedora 12, kernel 2.6.31.5-127.fc12.i686, access to the ST disk works but data are corrupted when reading. The IBM disk writes OK, but when reading there are problems (resets, etc.; programs trying to read large files never complete - see the dmesg file included).
The corruptions are as in Fedora 8: here and there some 4 bytes are duplicated, and a few kilobytes later another 4 bytes are removed - between those two places the file contents are shifted by 4 bytes.
I have done tests by running rpm -Va and copying files reported to have a non-matching digest to /dev/shm. Then I compared the results of od -An -vw4 -x1 of the original, good file and of the corrupted file. The diff files showing exactly what has changed are attached in a tar.gz file.
Created attachment 377165 [details] dmesg from Fedora 12, kernel 2.6.31.5-127.fc12.i686
There are "lost interrupt", "drained 32768 bytes to clear DRQ.", and "device reported invalid CHS sector 0" messages here. Please note that the disks work perfectly well on CentOS 5.4 (and they used to work well on Fedora Core 6, as well).
Created attachment 377167 [details] Compressed tar of diffs showing data corruptions
This is the result of comparing files with corrupted data read from the ST disk against the original data (no offsets are shown; they can be calculated using the formula 4 * (line number - 1)). In the diffs, lines starting with "-" refer to good data and lines starting with "+" refer to corrupted data (in the area between insertion and deletion of 4-byte strings, all data is shifted by 4 bytes). All files were copied from the filesystems / and /var, created on LVM logical volumes on PVs located on the ST disk (sda).
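To make the offset formula concrete, here is a small Python sketch that dumps data as one 4-byte word per line, diffs the two dumps, and converts positions back to byte offsets with 4 * (line number - 1). It is an illustration under assumptions, not the procedure used to build the attachment: it uses Python's difflib rather than od and diff, and the helper names `word_lines`, `byte_offset` and `word_changes` are hypothetical.

```python
import difflib

WORD = 4  # bytes per dump line


def word_lines(data, word=WORD):
    """One line of space-separated hex bytes per 4-byte word, without offsets."""
    return [" ".join(f"{b:02x}" for b in data[i:i + word])
            for i in range(0, len(data), word)]


def byte_offset(line_number, word=WORD):
    """Byte offset of a 1-based dump line: 4 * (line number - 1)."""
    return word * (line_number - 1)


def word_changes(good, bad, word=WORD):
    """List (kind, byte offset in the good data, words) for every difference."""
    a = word_lines(good, word)
    b = word_lines(bad, word)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changes.append(("delete", word * i1, a[i1:i2]))
        if tag in ("insert", "replace"):
            changes.append(("insert", word * i1, b[j1:j2]))
    return changes


# Synthetic data: 64 distinct words; duplicate the word at offset 36,
# then remove the word at offset 120.
good = b"".join(n.to_bytes(4, "little") for n in range(64))
bad = good[:40] + good[36:40] + good[40:120] + good[124:]
changes = word_changes(good, bad)
```

Because the inserted word is a copy of its neighbour, a diff cannot tell which of the two identical lines is the extra one, so the reported insertion offset may be one word before or after the true position. The deletion offset is unambiguous in this example.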
I booted Fedora-13-KDE alpha yesterday on the same machine (from image F13-Alpha-i686-Live-KDE.iso, kernel 2.6.33-0.52.rc8.git6.fc13.i686), then chrooted to my F12 test installation on the ST380011A hard disk and ran rpm -Va several times. During some of these runs I also read some data from the DTLA disk and cleared the cache by writing "3" to /proc/sys/vm/drop_caches. rpm reported MD5 mismatches on random files (different files in different runs) and a lot of messages were written to /var/log/messages (excerpt attached). Transfer from both disks was much slower than normal. lsmod indicated that the pata_hpt37x driver is being used. The computer normally runs CentOS 5 with no problems (no libata on hard disks).
Created attachment 402241 [details] selected parts of /var/log/messages, kernel 2.6.33-0.52.rc8.git6.fc13.i686
The attached file contains selected lines from /var/log/messages (compressed with xz; NetworkManager and USB lines removed) from the Fedora 13 alpha CD (i686 KDE), kernel 2.6.33-0.52.rc8.git6.fc13.i686.
How about Fedora 13 final?
This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.