Bug 433557
Description
Wojciech Pilorz
2008-02-19 23:46:55 UTC
lspci result for the system:
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
00:0b.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev 08)
00:0d.0 USB Controller: NEC Corporation USB (rev 43)
00:0d.1 USB Controller: NEC Corporation USB (rev 43)
00:0d.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
00:0f.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 07)
00:0f.1 Input device controller: Creative Labs SB Live! Game Port (rev 07)
00:13.0 Mass storage controller: Triones Technologies, Inc. HPT366/368/370/370A/372/372N (rev 04)
01:00.0 VGA compatible controller: nVidia Corporation NV6 [Vanta/Vanta LT] (rev 15)

lsmod result includes the following lines:
pata_hpt3x2n 10561 0
pata_hpt366 10817 0
pata_hpt37x 15681 9
libata 99633 4 ata_piix,pata_hpt3x2n,pata_hpt366,pata_hpt37x
sd_mod 27329 11
scsi_mod 119757 4 sg,sr_mod,libata,sd_mod

Created attachment 295357 [details]
output of dmesg command, including segfaults info
I am including output from the dmesg command.
At the end of the file there are some segfaults caused by data corruption.
Created attachment 295361 [details]
archive containing 'od' diffs which show data corruption
The archive file (tar.bz2 format) contains the following directories:
080211_F8
080212_F8DVD_rescue
080212_F8HD_100MHzFSB
080213_FC8
Each contains test results.
While running rpm -Va I was copying files reported to fail MD5 verification (but without T, as the data corruption I experience does not normally change the modification time of a file).
Then, using Fedora6, I compared the corrupted version of each file with the original:
080211_F8 - original kernel
080212_F8DVD_rescue - system started from F8 DVD in rescue mode, then chrooted to disk
080212_F8HD_100MHzFSB - decreased FSB from 121MHz to 100MHz, just in case
080213_FC8 - tests with updated kernel
Each of the directories contains:
.flist - file names list with respect to /
.flist.ls-l--full* - result of ls -l --full-time
.diffs*bash - loop run to get the results in .diffs
Please note that I am not including the damaged or original files in the archive, to keep the size reasonable.
.diffs - result of running .diffs*bash commands; in general, the files are processed by the od command to display the data without offsets in hex, one byte per line, then the results of 'od' are compared with the diff program.

Can you test kernel-2.6.24.2-7 from the updates-testing repository?

(In reply to comment #4)
> Can you test kernel-2.6.24.2-7 from the updates-testing repository?
Unfortunately my Fedora8 installation got corrupted and I do not know how to fix it. I would prefer to install it again using the kernel from updates-testing. Could you possibly point me to some documentation on how it can be done (a single machine, so no PXE, NFS, etc.)?

I tried test kernel kernel-2.6.24.2-7 on Fedora8. The problem with duplicating 4 bytes on HPT370 is still present and happens very often. I will include dmesg output, since there are some new messages which did not occur before.
Also, when trying to md5sum 2 copies of a file slightly under 2GiB on two disks (both connected to the same HPT370 controller on different channels; on FedoraCore6 they are named hde and hdg) at the same time, the system became completely unresponsive: I could not change text console, and even Alt-SysRq did not work. I had to press the reset key.

Created attachment 295755 [details]
dmesg output with test kernel
dmesg output shows system startup - which was very slow indeed (2nd attempt,
the first one locked during udev creation).
For comparison with FedoraCore 6: the dmesg on FC6 shows both disks running at
UDMA66 (with no problems); I believe on some earlier Linux version one of the
disks ran at UDMA100.
So dmesg on FC6 shows:
hde: max request size: 512KiB
hde: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=16383/255/63, UDMA(66)
hde: cache flushes supported
hde: hde1 hde2 hde3 hde4 < hde5 hde6 hde7 hde8 hde9 hde10 hde11 hde12 hde13 >
hdg: max request size: 128KiB
hdg: 60036480 sectors (30738 MB) w/1916KiB Cache, CHS=59560/16/63, UDMA(66)
hdg: cache flushes not supported
Created attachment 295757 [details]
comparing two copies of ~2GB file
While running F8 with the test kernel, after the usual game with rpm -V
to see which files fail verification and save them,
I ran another test.
I created a file of size slightly below 2GiB on /home (sda)
using compressed data, and copied it to /var/tmp (on sdb).
When I tried to compute the md5sum of both at the same time, the system
locked solid very quickly (HD LED always on, top showed
only one md5sum consuming some CPU time); after I tried to run dmesg
the system locked, with no response to Alt-Fx or to Alt-SysRq,
and no message on screen.
After rebooting the system back to FC6 I compared the two files.
I split them into 64KiB pieces, and those pieces that differed between
the two files were compared after being filtered with od.
The bash command to produce diffs is in file
__compare.bash
The pieces are named according to the dd parameters used
for extracting the parts of testfile.
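
The piece-by-piece comparison described above could be sketched roughly as follows. This is only an illustration in Python, not the contents of __compare.bash; the function names and the 64 KiB default are assumptions made for the example.

```python
from itertools import zip_longest

CHUNK = 64 * 1024  # 64 KiB pieces, as in the description above


def read_chunks(path, size=CHUNK):
    """Yield successive fixed-size pieces of a file."""
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                return
            yield block


def differing_pieces(path_a, path_b, size=CHUNK):
    """Return the indices of the pieces whose contents differ (hypothetical helper)."""
    pairs = zip_longest(read_chunks(path_a, size), read_chunks(path_b, size))
    return [i for i, (a, b) in enumerate(pairs) if a != b]
```

A piece with index i starts at byte offset i * 65536, so it would correspond to a dd extraction with bs=64k skip=i count=1.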
Please note that some offsets occur many times:
$ cat *diff | grep '^@@' | sort | uniq -cd
30 @@ -24441,16 +24441,20 @@
2 @@ -24442,16 +24442,20 @@
2 @@ -49137,24 +49145,16 @@
35 @@ -49141,20 +49145,16 @@
Perhaps this would give some hint what is going on?
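
As a rough aid for reading the hunk headers above, the sketch below tallies identical headers (similar in spirit to the sort | uniq -c pipeline) and computes how many lines each hunk gains or loses. It assumes unified-diff headers that carry explicit counts, and that one diff line is one byte, as in the earlier od-based attachment; the function names are made up for this example.

```python
import re
from collections import Counter

# Matches headers such as "@@ -24441,16 +24441,20 @@" (explicit counts only).
HUNK = re.compile(r"^@@ -(\d+),(\d+) \+(\d+),(\d+) @@")


def tally_hunks(diff_lines):
    """Count how often each hunk header occurs in the given diff lines."""
    return Counter(m.group(0) for line in diff_lines if (m := HUNK.match(line)))


def net_lines(header):
    """Lines gained (+) or lost (-) by a hunk: new count minus old count."""
    m = HUNK.match(header)
    old_count, new_count = int(m.group(2)), int(m.group(4))
    return new_count - old_count
```

Under that one-byte-per-line assumption, net_lines gives 4 for "@@ -24441,16 +24441,20 @@" and -4 for "@@ -49141,20 +49145,16 @@", which is consistent with 4 bytes being inserted at one place and 4 bytes being dropped further on.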
4 byte shifts look like a FIFO handling bug or a chipset problem. What chipset is the BX133 - a VIA KL133 or similar ?

No, the chipset is Intel 440BX; lspci -v returns:
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
        Flags: bus master, medium devsel, latency 32
        Memory at d0000000 (32-bit, prefetchable) [size=64M]
        Capabilities: [a0] AGP version 1.0
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03) (prog-if 00 [Normal decode])
        Flags: bus master, 66MHz, medium devsel, latency 64
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
        Memory behind bridge: d4000000-d5ffffff
        Prefetchable memory behind bridge: d6000000-d7ffffff
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02)
        Flags: bus master, medium devsel, latency 0
[...]
00:13.0 Mass storage controller: Triones Technologies, Inc. HPT366/368/370/370A/372/372N (rev 04)
        Subsystem: Triones Technologies, Inc. HPT370A
        Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 11
        I/O ports at d400 [size=8]
        I/O ports at d800 [size=4]
        I/O ports at dc00 [size=8]
        I/O ports at e000 [size=4]
        I/O ports at e400 [size=256]
        Expansion ROM at 40100000 [disabled by cmd] [size=128K]
        Capabilities: [60] Power Management version 2
01:00.0 VGA compatible controller: nVidia Corporation NV6 [Vanta/Vanta LT] (rev 15) (prog-if 00 [VGA])
        Subsystem: nVidia Corporation Unknown device 0072
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
        Memory at d4000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d6000000 (32-bit, prefetchable) [size=32M]
        [virtual] Expansion ROM at d5000000 [disabled] [size=64K]
        Capabilities: [60] Power Management version 1
        Capabilities: [44] AGP version 2.0
https://bugzilla.redhat.com/attachment.cgi?id=295357 contains dmesg output.
Mainboard information as reported by dmidecode is
Handle 0x0002, DMI type 2, 8 bytes.
Base Board Information
        Manufacturer: <<http:\\www.abit.com.tw>>
        Product Name: I440BX-W977(BX133-RAID/BE6-II v2.0)
        Version:
        Serial Number:

Possible fix sent upstream. Fixes the mode masking.

That fix was merged in F8 kernel 2.6.24.3-22, but bugzilla wasn't updated. The kernel has been submitted to updates-testing. Please test.

(In reply to comment #12)
> That fix was merged in F8 kernel 2.6.24.3-22, but bugzilla wasn't updated.
> The kernel has been submitted to updates-testing. Please test.
The kernel is still not on any of the rsync mirrors I looked at. It is also not in ftp://download.fedora.redhat.com/pub/fedora/linux/updates/testing/8
Am I missing something?
Wojtek

Is it possible to switch a Fedora8 system to use the ide-disk driver for HPT-connected disks?
On FedoraCore6 I have
$ cat /proc/ide/ide2/hde/driver
ide-disk version 1.18
$ cat /proc/ide/ide3/hdg/driver
ide-disk version 1.18
On Fedora8 I have
$ cat /mnt/fc8/etc/modprobe.conf
alias scsi_hostadapter libata
alias scsi_hostadapter1 pata_hpt37x
alias scsi_hostadapter2 ata_piix
alias snd-card-0 snd-emu10k1
options snd-card-0 index=0
options snd-emu10k1 index=0
Is it possible to modify modprobe.conf and/or initrd to use the ide-disk driver for hde and hdg?
Wojtek

I got kernel 2.6.24.3-22.fc8.i686 today and installed it into my test Fedora8 system (using FC6 and chroot; otherwise the rpm database would most probably have become corrupted). The kernel behaves somewhat differently from 2.6.24.2-7:
On my Seagate disk ST380011A the data read is still getting corrupted (some bytes added, then some removed).
On my IBM-DTLA-307030 disk (the one on the blacklist) things are more interesting. I was not able to get any data corruption. On the other hand, a lot of errors are generated, and recovery from the errors takes long minutes. During the recovery the process trying to read from disk is not killable at all, and any other process trying to access the same disk is stalled.
During one of the tests, when reading the large file from the IBM disk and writing 3 to /proc/sys/vm/drop_caches several times, the system locked solid; even Alt-Ctrl-Fx or Alt-SysRq did not work. I will attach portions of dmesg output.

Created attachment 297997 [details]
dmesg output + part of /var/log/messages, kernel 2.6.24.3-22.fc8.i686
This is output of dmesg saved a few minutes before system
has locked solid (Alt-SysRq not working at all)
Please note that almost all filesystems are on sda-located LVs (ST disk);
/var/tmp is on sdb. While reading a large file from /var/tmp the system
displayed a number of errors, then locked solid.
Since not all ata2 errors were included in dmesg output,
I have also appended part of /var/log/messages
Created attachment 297999 [details]
dmesg from another test of kernel 2.6.24.3-22.fc8 on HPT-370 attached disks
Please note that all I/O errors are fake (result of buggy driver) - the sectors
read perfectly well from FC6 system.
The file which was being read was about 190 MiB long;
Reading the file using the same tool on FC6 takes 8 seconds.
sdb:
On FC8 reading it from sdb took 10 minutes to 20 minutes or more.
But no data corruption took place on sdb
sda:
On sda, reading under FC8 was fast (10-20 s, compared to 7 s on FC6), but data
corruption was evident; the size of the part of the file with moved data (between
insertion and deletion) was 5*4K - 16*4K, with 7*4K being most common.
Tests were done by generating an MD5 for each 4K chunk of file data and comparing
with the reference MD5 sequence.
Before each test I run
echo 3 > /proc/sys/vm/drop_caches
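
The per-chunk MD5 test described above can be illustrated with a small Python sketch. It is not the tool that was actually used; the helper names are hypothetical, and the reference digests are assumed to come from a known-good read of the same file (for example under FC6).

```python
import hashlib

BLOCK = 4096  # 4K chunks, as in the test described above


def chunk_digests(data, size=BLOCK):
    """MD5 hex digest of each consecutive `size`-byte chunk of `data`."""
    return [hashlib.md5(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def mismatched_runs(reference, observed):
    """Return (first_chunk, length) for each run of consecutive mismatches."""
    runs, start = [], None
    compared = min(len(reference), len(observed))
    for i in range(compared):
        differs = reference[i] != observed[i]
        if differs and start is None:
            start = i            # a run of bad chunks begins here
        elif not differs and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:        # run extends to the end of the comparison
        runs.append((start, compared - start))
    return runs
```

With this kind of check, a region shifted by 4 bytes between an insertion and a later deletion shows up as one run of mismatching chunks, and the run length corresponds to the 5*4K - 16*4K sizes quoted above.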
Can you attach an FC6 boot dmesg for comparison ?

Created attachment 298598 [details]
dmesg output from FedoraCore6 on the same machine
I am attaching dmesg output from Fedora Core 6 (kernel Linux version
2.6.22.14-72.fc6 (brewbuilder.redhat.com) (gcc version 4.1.2
20070626 (Red Hat 4.1.2-13))).
I am also including below an excerpt from /proc/ide/*/hd[eg]/* on FC6.
Wojtek
"head -3 /proc/ide/ide2/hde/*" result:
==> /proc/ide/ide2/hde/cache <==
2048
==> /proc/ide/ide2/hde/capacity <==
156301488
==> /proc/ide/ide2/hde/driver <==
ide-disk version 1.18
==> /proc/ide/ide2/hde/geometry <==
physical 16383/16/63
logical 16383/255/63
==> /proc/ide/ide2/hde/identify <==
0c5a 3fff c837 0010 0000 0000 003f 0000
0000 0000 354a 5633 584e 5330 2020 2020
2020 2020 2020 2020 0000 1000 0004 332e
==> /proc/ide/ide2/hde/media <==
disk
==> /proc/ide/ide2/hde/model <==
ST380011A
==> /proc/ide/ide2/hde/settings <==
name value min max mode
---- ----- --- --- ----
acoustic 0 0 254 rw
==> /proc/ide/ide2/hde/smart_thresholds <==
000a 0601 0000 0000 0000 0000 0000 0003
0000 0000 0000 0000 0000 1404 0000 0000
0000 0000 0000 2405 0000 0000 0000 0000
==> /proc/ide/ide2/hde/smart_values <==
000a 0f01 4300 d23f 8e4b 000a 0000 0303
6200 0062 0000 0000 0000 3204 6400 0464
0000 0000 0000 3305 6400 0064 0000 0000
"head -3 /proc/ide/ide3/hdg/*" result:
==> /proc/ide/ide3/hdg/cache <==
1916
==> /proc/ide/ide3/hdg/capacity <==
60036480
==> /proc/ide/ide3/hdg/driver <==
ide-disk version 1.18
==> /proc/ide/ide3/hdg/geometry <==
physical 16383/16/63
logical 59560/16/63
==> /proc/ide/ide3/hdg/identify <==
045a 3fff 37c8 0010 0000 0000 003f 0000
0000 0000 2020 2020 2020 2020 2059 4b44
594b 5638 3031 3935 0003 0ef8 0028 5458
==> /proc/ide/ide3/hdg/media <==
disk
==> /proc/ide/ide3/hdg/model <==
IBM-DTLA-307030
==> /proc/ide/ide3/hdg/settings <==
name value min max mode
---- ----- --- --- ----
acoustic 0 0 254 rw
==> /proc/ide/ide3/hdg/smart_thresholds <==
0010 3c01 0000 0000 0000 0000 0000 3202
0000 0000 0000 0000 0000 1803 0000 0000
0000 0000 0000 0004 0000 0000 0000 0000
==> /proc/ide/ide3/hdg/smart_values <==
0010 0b01 6400 0064 0000 0000 0000 0502
8400 5584 0001 0000 0000 0703 7b00 b07b
b300 0400 0000 1204 6400 2464 000c 0000
Not clear what is going on and the published info on the errata is minimal. Might be interesting to try changing both hpt370 and hpt370a_filter to read
if (1 || hpt_dma_blacklisted(adev, "UDMA100", bad_ata100_5))
just as a test on this box.

Closing - apparent underlying bug now fixed upstream, no activity

The bug is still present on Fedora12 and the same hardware.
There are two hard disks connected to the HPT370A
(lspci -v displays: Subsystem: HighPoint Technologies, Inc. HPT370A):
ST380011A and IBM-DTLA-307030
The system works perfectly well with CentOS 5.4, with ST380011A configured for UDMA/100 and IBM-DTLA-307030 configured for UDMA/66.
On Fedora 12, kernel 2.6.31.5-127.fc12.i686, access to the ST disk works but data are corrupted when reading; the IBM disk writes OK, but when reading there are problems (resets, etc.; programs trying to read large files never complete - see dmesg file included).
The corruptions are as in Fedora8: here and there some 4 bytes are duplicated, and a few kilobytes later another 4 bytes are removed - between those two places the file contents are shifted by 4 bytes.
I have done tests by running rpm -Va and copying files reported to have a non-matching digest to /dev/shm; then I compared the results of od -An -vw4 -x1 of the original, good file and of the corrupted file. The diff files showing exactly what has changed are attached in a tar.gz file.

Created attachment 377165 [details]
dmesg from Fedora 12, kernel 2.6.31.5-127.fc12.i686
There are "lost interrupt", "drained 32768 bytes to clear DRQ.",
"device reported invalid CHS sector 0"
messages here. Please note the disks work perfectly well
on CentOS 5.4 (and they used to work well on Fedora 6, as well).
Created attachment 377167 [details]
Compressed tar of diffs showing data corruptions
This is a result of comparing files with corrupted data
read from ST disk with the original data
( no offsets shown, can be calculated using
formula 4 * (line number - 1)
)
In the diffs, lines starting with "-" refer to good data,
lines starting with "+" refer to corrupted data
(in the area between insertion and deletion of 4-byte strings,
all data is shifted by 4 bytes).
All files were copied from filesystems / and /var created on LVM logical drives
created on PVs located on the ST disk (sda).
I booted Fedora-13-KDE alpha yesterday on the same machine (from image F13-Alpha-i686-Live-KDE.iso, kernel 2.6.33-0.52.rc8.git6.fc13.i686), then chrooted to my F12 test installation on the ST380011A hard disk and ran rpm -Va several times.
During some of these runs I also read some data from the DTLA disk and cleared the cache by writing "3" to /proc/sys/vm/drop_caches.
rpm reported MD5 mismatches on random files (different files in different runs) and a lot of messages were written to /var/log/messages (excerpt attached).
Transfer from both disks was much slower than normal.
lsmod indicated that the pata_hpt37x driver is being used.
The computer is normally running CentOS5 with no problems (no libata on hard disks).

Created attachment 402241 [details]
selected parts of /var/log/messages, kernel 2.6.33-0.52.rc8.git6.fc13.i686
The file attached contains selected lines from /var/log/messages
(compressed with xz; NetworkManager and USB lines removed)
from Fedora13 alpha CD (i686 KDE)
kernel 2.6.33-0.52.rc8.git6.fc13.i686
How about Fedora 13 final?

This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.