From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.9) Gecko/20050711 Firefox/1.0.5

Description of problem:
I copy ~30G of data from one directory to another directory on the same filesystem. When I then compare the two trees, one or more files, small or large, turn out to be corrupted.

Version-Release number of selected component (if applicable):
kernel-2.6.9-11.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. rsync -a /home/XXX/ /home/XXX2
2. diff -r /home/XXX /home/XXX2

Actual Results:
The diff output shows:
/home/XXX/ftp/pub/Linux/StarOffice/staroffice-es-5.2-6mdk.i586.rpm and /home/XXX2/ftp/pub/Linux/StarOffice/staroffice-es-5.2-6mdk.i586.rpm differ
/home/XXX/ftp/pub/mp3/double_house-vol_2.track04.mp3 and /home/XXX2/ftp/pub/mp3/double_house-vol_2.track04.mp3 differ

Expected Results:
All files should be identical.

Additional info:
I'll post additional information below because Bugzilla seems to refuse my large report.
I have lots of additional information because I have been investigating this problem for some weeks now, and I have already spent more than $1500 on new hardware which, in the end, turned out not to be the culprit :( Please bear in mind that every single test I ran takes around 4 to 12 hours until the problem shows up.

Here is some more info:
- "rsync" is not to blame here; exactly the same happens with "cp".
- It's not an IDE (U)DMA issue. My first idea was that this is another DMA transfer corruption, but it seems not to be. I have two very different servers where I can reproduce exactly the same corruption problem.

The corruption is not a single bit or byte but large parts of the files, and it happens in small and large files alike.

Server1:
- ASUS P2B-S board
- CPU PIII 450MHz
- Intel 440BX/ZX/DX Chipset
- 4x128M memory (ECC enabled)
- 2x IDE disks Seagate Barracuda 400G, connected to onboard "Intel PIIX4 Ultra 33"
- Promise Ultra100TX2 adapter for additional tests

Server2:
- DELL PowerEdge 1400
- CPU PIII 800MHz
- ServerWorks OSB4 Chipset
- 4x256M memory (ECC enabled)
- 2x U320 SCSI disks Maxtor Atlas 10K 146G
- onboard Adaptec aic7899 Ultra160 SCSI adapter

System config is the same for both, except the IDE/SCSI difference.
Config is shown for Server2:

[root@crash ~]# cat /proc/mdstat
Personalities : [raid1]
md9 : active raid1 sdb13[1] sda13[0]
      15631104 blocks [2/2] [UU]
md8 : active raid1 sdb12[1] sda12[0]
      15631104 blocks [2/2] [UU]
md7 : active raid1 sdb11[1] sda11[0]
      15631104 blocks [2/2] [UU]
md6 : active raid1 sdb10[1] sda10[0]
      15631104 blocks [2/2] [UU]
md5 : active raid1 sdb9[1] sda9[0]
      15631104 blocks [2/2] [UU]
md4 : active raid1 sdb8[1] sda8[0]
      15631104 blocks [2/2] [UU]
md3 : active raid1 sdb7[1] sda7[0]
      15631104 blocks [2/2] [UU]
md2 : active raid1 sdb6[1] sda6[0]
      15631104 blocks [2/2] [UU]
md10 : active raid1 sdb14[1] sda14[0]
      2787136 blocks [2/2] [UU]
md1 : active raid1 sdb5[1] sda5[0]
      15631104 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
      192640 blocks [2/2] [UU]
unused devices: <none>

[root@crash ~]# pvscan
  PV /dev/md2    VG VolGroup01   lvm2 [14.91 GB / 0 free]
  PV /dev/md3    VG VolGroup01   lvm2 [14.91 GB / 0 free]
  PV /dev/md4    VG VolGroup01   lvm2 [14.91 GB / 0 free]
  PV /dev/md5    VG VolGroup01   lvm2 [14.91 GB / 0 free]
  PV /dev/md6    VG VolGroup01   lvm2 [14.91 GB / 0 free]
  PV /dev/md7    VG VolGroup01   lvm2 [14.91 GB / 0 free]
  PV /dev/md8    VG VolGroup01   lvm2 [14.91 GB / 4.34 GB free]
  PV /dev/md9    VG VolGroup01   lvm2 [14.91 GB / 14.91 GB free]
  PV /dev/md10   VG VolGroup01   lvm2 [2.66 GB / 2.66 GB free]
  PV /dev/md1    VG VolGroup00   lvm2 [14.91 GB / 4.09 GB free]
  Total: 10 [136.81 GB] / in use: 10 [136.81 GB] / in no VG: 0 [0 ]

[root@crash ~]# vgscan
  Reading all physical volumes. This may take a while...
  Found volume group "VolGroup01" using metadata type lvm2
  Found volume group "VolGroup00" using metadata type lvm2

[root@crash ~]# lvscan
  ACTIVE '/dev/VolGroup01/LogVol05' [100.00 GB] inherit
  ACTIVE '/dev/VolGroup00/LogVol00' [1.00 GB] inherit
  ACTIVE '/dev/VolGroup00/LogVol04' [1.00 GB] inherit
  ACTIVE '/dev/VolGroup00/LogVol02' [2.94 GB] inherit
  ACTIVE '/dev/VolGroup00/LogVol03' [3.91 GB] inherit
  ACTIVE '/dev/VolGroup00/LogVol01' [1.97 GB] inherit

[root@crash ~]# df
Filesystem                      1K-blocks     Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   1032088   114776    864884  12% /
/dev/md0                           186555     9151    167772   6% /boot
none                               517956        0    517956   0% /dev/shm
/dev/mapper/VolGroup00-LogVol04   1032088    34128    945532   4% /tmp
/dev/mapper/VolGroup00-LogVol02   3031760   983580   1894172  35% /usr
/dev/mapper/VolGroup00-LogVol03   4031680  1502532   2324348  40% /var
ftp:/home/ftp/pub               251997952 54477344 197258464  22% /mnt/nfs
/dev/mapper/VolGroup01-LogVol05 103212320 47521248  54642496  47% /home

[root@crash ~]# tune2fs -l /dev/VolGroup01/LogVol05
tune2fs 1.35 (28-Feb-2004)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          4b442f42-2b59-4386-8b13-145b9a6e9c07
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              13107200
Block count:              26214400
Reserved block count:     262144
Free blocks:              25779512
Free inodes:              13107189
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Sat Jul 30 10:44:30 2005
Last mount time:          Sat Jul 30 10:45:12 2005
Last write time:          Sat Jul 30 10:45:12 2005
Mount count:              1
Maximum mount count:      -1
Last checked:             Sat Jul 30 10:44:30 2005
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      046fa8f6-3811-4754-89b7-9ec12180da3a
Journal backup:           inode blocks

[root@crash ~]# du -sh /home/XXX/*
8.5G /home/XXX/ISO
17G /home/XXX/Linux
15G /home/XXX/mp3

[root@crash ~]# cat mktest
LOGF=/root/diff.log
umount /home
mkfs.ext3 -m 1 -O dir_index /dev/VolGroup01/LogVol05
tune2fs -c 0 -i 0 /dev/VolGroup01/LogVol05
mount /dev/VolGroup01/LogVol05 /home
rsync -a /mnt/nfs/ /home/XXX
while true; do
    rm -rf /home/XXX2
    rsync -a /home/XXX/ /home/XXX2
    date >> $LOGF
    diff -r /home/XXX /home/XXX2 >> $LOGF
done

[root@crash ~]# cat diff.log
Sat Jul 30 12:58:24 CEST 2005
/home/XXX/ftp/pub/Linux/StarOffice/staroffice-es-5.2-6mdk.i586.rpm and /home/XXX2/ftp/pub/Linux/StarOffice/staroffice-es-5.2-6mdk.i586.rpm differ
/home/XXX/ftp/pub/mp3/double_house-vol_2.track04.mp3 and /home/XXX2/ftp/pub/mp3/double_house-vol_2.track04.mp3 differ

I have run the same test on plain RAID1 and on plain LVM2 with a volume striped across both disks. I couldn't reproduce the problem there, so it really seems to be a problem that occurs when LVM2 is stacked on top of RAID1. Going back to LVM2 on RAID1 made the problem show up again. I didn't test filesystems other than EXT3.

One of the biggest problems is that a single test takes so much time, so it could also be that the problem shows up on plain LVM2 or plain RAID1 after waiting long enough. I simply don't know.

This problem is very critical to me, which is why I have already spent so much time analyzing it as thoroughly as possible. I hope Red Hat and other people will reproduce it and help find a solution.
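Since the corruption is described above as affecting large parts of a file rather than single bits or bytes, a small helper along the following lines could be used to quantify it for any pair of files that diff reports. This is only a sketch: the function name diff_extent is hypothetical and not part of the mktest script, and it assumes a cmp that supports the -l option (as GNU cmp does) plus a standard awk.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original mktest script.
# Prints how many bytes differ between two files and the first and last
# differing byte offsets (1-based, as reported by "cmp -l").
# Limitation: a pure length mismatch is not reported, because cmp only
# writes its EOF notice to stderr, which is discarded here.
diff_extent() {
    cmp -l "$1" "$2" 2>/dev/null | awk '
        NR == 1 { first = $1 }   # offset of the first differing byte
        { last = $1 }            # offset of the most recent differing byte
        END {
            if (NR == 0)
                print "0 bytes differ"
            else
                printf "%d bytes differ, offsets %d-%d\n", NR, first, last
        }'
}
```

For example, diff_extent /home/XXX/mp3/energy_trance-vol6.track18.mp3 /home/XXX2/mp3/energy_trance-vol6.track18.mp3 would print one summary line for that pair. A large byte count over a contiguous offset range would be consistent with whole blocks being wrong rather than isolated bit flips.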
I have now tried kernel-2.6.12-1.1398_FC4 from Fedora Core 4, but the results are the same. The following output shows that I get at least one corrupt file with every iteration of the test script:

Sat Jul 30 17:12:23 CEST 2005
Binary files /home/XXX/mp3/energy_trance-vol6.track18.mp3 and /home/XXX2/mp3/energy_trance-vol6.track18.mp3 differ
Sat Jul 30 19:05:55 CEST 2005
Binary files /home/XXX/Linux/rh-7.1/i386/XFree86-ISO8859-9-100dpi-fonts-4.1.0-15.i386.rpm and /home/XXX2/Linux/rh-7.1/i386/XFree86-ISO8859-9-100dpi-fonts-4.1.0-15.i386.rpm differ
Binary files /home/XXX/mp3/gamma_ray-heading_for_tomorrow.track08.mp3 and /home/XXX2/mp3/gamma_ray-heading_for_tomorrow.track08.mp3 differ
Sat Jul 30 20:58:38 CEST 2005
Binary files /home/XXX/Linux/Doc/RDBMS/rdbms.pdf and /home/XXX2/Linux/Doc/RDBMS/rdbms.pdf differ
Sat Jul 30 22:53:42 CEST 2005
Binary files /home/XXX/Linux/FAUmachine/faumachine-redhat-9-20031110-1.i386.rpm and /home/XXX2/Linux/FAUmachine/faumachine-redhat-9-20031110-1.i386.rpm differ
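With many iterations the log gets long, so a per-iteration count of corrupt files may be easier to read. The sketch below assumes the log layout produced by the test script (a date line followed by zero or more diff lines ending in " differ"); the function name summarize_log is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: counts the "... differ" lines logged after each
# timestamp in a log such as /root/diff.log.
# Assumption: every non-empty line that neither ends in " differ" nor
# starts with "Only in " (another message diff -r can emit) is a
# timestamp written by "date".
summarize_log() {
    awk '
        / differ$/  { count[stamp]++; next }   # one corrupt file
        /^Only in / { next }                   # missing file, not counted
        NF          { stamp = $0; order[++n] = stamp; count[stamp] += 0 }
        END {
            for (i = 1; i <= n; i++)
                printf "%s: %d\n", order[i], count[order[i]]
        }
    ' "$1"
}
```

Called as summarize_log /root/diff.log, it prints one line per iteration, so an iteration without corruption would show up with a count of 0.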
As this appears to be something for which you have some urgency, please contact our support team about it. In order to file a support issue, please either contact Red Hat's Technical Support line at 888-GO-REDHAT or file a web ticket at http://www.redhat.com/apps/support/. Bugzilla is not an official support channel, has no response guarantees, and may not route your issue to the correct area to assist you. Using the official support channels above will guarantee that your issue is handled appropriately and routed to the individual or group which can best assist you with this issue and will also allow Red Hat to track the issue, ensuring that any applicable bug fix is included in all releases and is not dropped from a future update or major release.
Just some more info:
- The problem is not limited to the RHEL4 kernel; it also happens with a vanilla 2.6.12.3 kernel as well as with NOVELL/SUSE 9.3.
- The filesystem doesn't matter; the problem also exists with EXT2 and XFS.

I ran some new tests on a brand new DELL PE1850 with the same results. I've put some config files together here at our site: http://www.invoca.ch/bugs/linux-2.6-corruption-on-lvm2-on-raid1/

I have also posted to the kernel ML because I consider this issue very critical now.
Chris Adams pointed out on the kernel ML that this may be the same bug as the one he filed here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=152162#c43

Indeed, I've built updated kernel rpms with the bio_clone fix included, and all corruption problems are gone. I strongly suggest that Red Hat release updated kernel rpms immediately to protect its customers/users from data corruption, one of the worst things that can happen in an enterprise environment.
*** This bug has been marked as a duplicate of 152162 ***
Reopening, this is RHEL4, not FC3. Patch is at: http://bugzilla.kernel.org/attachment.cgi?id=5394&action=view