|Summary:||ext2 and device dm-0 beyond 2Terabyte causes /var/log/messages file size to crash system|
|Product:||Red Hat Enterprise Linux 4||Reporter:||David Grifffith <lownoisefloor>|
|Component:||kernel||Assignee:||Stephen Tweedie <sct>|
|Status:||CLOSED ERRATA||QA Contact:||Brian Brock <bbrock>|
|Version:||4.0||CC:||agk, bpeck, coughlan, davej, jbaron, k.georgiou, lockhart, lownoisefloor, sct, shillman, tburke|
|Fixed In Version:||RHBA-2005-298||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2005-06-08 15:13:14 UTC||Type:||---|
Description David Grifffith 2005-01-01 03:36:40 UTC
Description of problem:
Server is set up with RHEL Beta4 Version2. Fourteen 300GB SCSI drives are configured with LVM to present a 3.8 terabyte file system to the user. The EXT2 file system type is used. EXT2 is capable of a 32 terabyte partition size but only a 2 terabyte individual file size. When an application tries to write continuously to the 3.8 terabytes available and the 2 terabyte limit is reached, the following error message is logged:

---- begin error message ----
Dec 30 23:48:52 localhost kernel: EXT2-fs error (device dm-0): ext2_free_blocks: Freeing blocks in system zones - Block = 16843009, count = 1
Dec 30 23:48:52 localhost kernel: EXT2-fs error (device dm-0): ext2_free_blocks: bit already cleared for block 16843009
---
The error message repeats endlessly until the /var/log/messages file fills the disk.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 4 Beta 2 (220.127.116.11-648smp)

How reproducible:
1. You must have an array greater than 2 terabytes. The system under test has 3.8 terabytes.
2. Set up all disks with LVM 2 to achieve the full 3.8 terabytes.
3. Begin writing a file that grows until the 2 terabyte limit is reached.
.... error occurs here

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
It is a guess that the device dm-0 or the device mapper function is the issue here. Handling limits and exiting to tell the user that the file size limit has been reached is desirable.
Comment 2 Tom Coughlan 2005-01-19 14:24:09 UTC
I tried the following test on a system with two 3 TB hardware RAID 0 logical units. It did not reproduce the problem, although there are some issues.

The two storage devices are sda and sdb:

parted /dev/sda mklabel gpt
parted /dev/sdb mklabel gpt
parted /dev/sda mkpart primary ext3 0 3050352
parted /dev/sdb mkpart primary ext3 0 3050352
pvcreate /dev/sda
pvcreate /dev/sdb
vgcreate bigvg00 /dev/sda /dev/sdb

vgdisplay
  --- Volume group ---
  VG Name               bigvg00
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               5.82 TB
  PE Size               4.00 MB
  Total PE              1525174
  Alloc PE / Size       0 / 0
  Free PE / Size        1525174 / 5.82 TB
  VG UUID               aw4hbU-xljx-8bB8-6wyy-7Cup-MCnX-jSXGRJ

lvcreate -i 2 -L 3TB bigvg00

lvdisplay
  --- Logical volume ---
  LV Name                /dev/bigvg00/lvol0
  VG Name                bigvg00
  LV UUID                7SXMuz-7sw6-pHon-wuys-H2eD-ajgU-F4cF4d
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                3.00 TB
  Current LE             786432
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:0

mke2fs -T largefile4 -j /dev/bigvg00/lvol0
fsck -f /dev/bigvg00/lvol0

No problem. Then I mounted the fs and ran the "dt" test, to write a 2.5 TB file:

# ./dt of=/mnt/scratch/testfile bs=200m limit=2500GB dispose=keep log=bigfile.log

The test ended with the error:

File size limit exceeded

# ls -lh /mnt/scratch/testfile
-rw-r--r-- 1 root root 2.0T Jan 18 23:07 /mnt/scratch/testfile

There were no errors in the log.

There are three differences between this test and the problem report: this test uses kernel 2.6.9-5.EL, it uses ext3, and the storage configuration is slightly different.

Although there were no failures during the test, I have seen some problems since. Parted "print" says that the primary GPT table is corrupt:

parted /dev/sda print
Error: The primary GPT table is corrupt, but the backup appears ok, so that will be used.
OK/Cancel? ok
Disk geometry for /dev/sda: 0.000-3050352.000 megabytes
Disk label type: gpt
Minor    Start       End     Filesystem  Name                  Flags
1          0.017 3050351.983

But when I ran parted later it did not report this error.

Also pvdisplay, lvdisplay and vgdisplay are no longer working.

# pvdisplay /dev/sda
No physical volume label read from /dev/sda
Failed to read physical volume "/dev/sda"
# pvdisplay /dev/sdb
No physical volume label read from /dev/sdb
Failed to read physical volume "/dev/sdb"
# vgdisplay
No volume groups found
# lvdisplay
No volume groups found

I am still investigating. Do you think ext2 vs. ext3 is likely to matter for this test?
Comment 3 Tom Coughlan 2005-01-19 16:12:48 UTC
Okay, Alasdair pointed out my mistake. I should have used "pvcreate /dev/sda1" not "pvcreate /dev/sda". Duh. That explains the LVM and parted problems. Is "File size limit exceeded" the expected result when attempting to write an ext3 file > 2 TB?
Comment 4 Stephen Tweedie 2005-01-20 12:15:38 UTC
ext2 vs. ext3 should not matter in theory, but it would be responsible to test both! The 2.6 kernel currently caps file sizes at 2TB on both ext2 and ext3. There is unfortunately a 32-bit limit on statbuf->st_blocks, which caps the file size we can reliably return "df" info on to 2TB.
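For readers checking the 2TB figure, here is a minimal arithmetic sketch in Python. It assumes the counter described above is an unsigned 32-bit field counting 512-byte units (the unit Linux uses for st_blocks); the constant and function names are illustrative only.

```python
# Illustrative arithmetic only, not kernel code.
# Assumption: st_blocks counts 512-byte units and is limited to 32 bits.
ST_BLOCK_UNIT = 512      # bytes represented by one st_blocks unit
ST_BLOCKS_BITS = 32      # width of the counter discussed in the comment
TIB = 2 ** 40            # bytes in one tebibyte

max_units = 2 ** ST_BLOCKS_BITS
max_bytes = max_units * ST_BLOCK_UNIT


def st_blocks_cap_tib() -> float:
    """Largest file size (in TiB) whose block count still fits the counter."""
    return max_bytes / TIB


if __name__ == "__main__":
    print(f"{max_units} units x {ST_BLOCK_UNIT} B = {max_bytes} B "
          f"= {st_blocks_cap_tib():.1f} TiB")
```

Under those assumptions the cap works out to 2 TiB, which matches the size at which both the "dt" and "dd" runs in this report stopped.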
Comment 5 Tom Coughlan 2005-01-20 12:55:59 UTC
I fixed up the parted/LVM mistake and ran a second test, this time with ext2 and "dd" instead of "dt". Same result:

# dd if=/dev/zero of=/mnt/scratch/testfile bs=512M count=5000
File size limit exceeded

# ls -lh /mnt/scratch/testfile
-rw-r--r-- 1 root root 2.0T Jan 19 21:36 /mnt/scratch/testfile

# mount
/dev/hda2 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/hda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/mapper/bigvg01-lvol0 on /mnt/scratch type ext2 (rw)

# df
Filesystem                 1K-blocks       Used  Available Use% Mounted on
/dev/hda2                   37373192    6298844   29175868  18% /
/dev/hda1                      46633       7955      36270  18% /boot
none                          257868          0     257868   0% /dev/shm
/dev/mapper/bigvg01-lvol0 3220913576 2149672988  910179316  71% /mnt/scratch

David, can you update to RHEL 4 RC and re-run your test?
Comment 6 Tom Coughlan 2005-01-22 17:43:54 UTC
Status update:

After further testing I reproduced the filesystem corruption reported in this BZ. I found that after I used "dd" to create a 2TB file, as described above, then created a second file and attempted to delete the second file, tens of thousands of the following message were generated in the log:

-----
kernel: EXT2-fs error (device dm-0): ext2_free_blocks: Freeing blocks not in datazone - block = 969120825, count = 1
-----

Then I did: umount and e2fsck -n . This reports thousands of the following:

-----
Inode 12 is too big. Truncate? no
Block #536349691 (536947491) causes file to be too big. IGNORED.
-----

as well as other messages.

Alasdair identified a missing cast in dm-stripe that could cause data corruption on devices with stripes > ~1TB. He proposed the attached patch. Unfortunately, this patch did not fix the problem. The test scenario described above still fails. :^(

Prior to running with the patch, I also ran a test to check for data corruption on the LVM /dev, in an effort to exclude the filesystem. This consisted of a write pass:

./lmdd opat=1 of=/dev/mapper/bigvg01-lvol0 &> testoutlog.txt
write: No space left on device
write: wanted=8192 got=-1
3298534.8833 MB in 31084.7642 secs, 106.1142 MB/sec

followed by a read/check pass (I stopped this command before it finished, so I could test Alasdair's patch, but the output shows that it got well past the 2TB boundary without errors):

./lmdd ipat=1 if=/dev/mapper/bigvg01-lvol0 &> testinlog.txt
2650585.6942 MB in 41003.5009 secs, 64.6429 MB/sec

It is possible that the lmdd data pattern wraps in a way that masks the data corruption.

I am continuing to investigate both the original problem and the impact of the bug that Alasdair found. Attempts to use dd and hexdump to test and examine regions of the storage device have so far produced only unreasonable results. All suggestions welcome.
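As a generic illustration of how a missing widening cast can cause this class of corruption, the Python sketch below emulates a multiplication carried out in 32 bits versus 64 bits. It is not the dm-stripe code and does not reproduce the attached patch; the function names, the chunk size of 128 sectors, and the chunk index are all hypothetical values chosen to make the wrap visible.

```python
# Generic illustration of 32-bit truncation; NOT the actual dm-stripe code.
# All names and values here are hypothetical.
MASK32 = 0xFFFFFFFF


def sector_without_cast(chunk_index: int, chunk_sectors: int) -> int:
    """Emulate a product computed in 32-bit arithmetic (no widening cast)."""
    return (chunk_index * chunk_sectors) & MASK32


def sector_with_cast(chunk_index: int, chunk_sectors: int) -> int:
    """Emulate the same product after widening an operand to 64 bits."""
    return (chunk_index * chunk_sectors) & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    chunk_sectors = 128          # hypothetical: 64 KiB chunks of 512-byte sectors
    chunk_index = 2 ** 25        # hypothetical chunk whose start is 2**32 sectors in
    print("32-bit result:", sector_without_cast(chunk_index, chunk_sectors))
    print("64-bit result:", sector_with_cast(chunk_index, chunk_sectors))
```

In this toy model the truncated result lands back at sector 0, so an I/O aimed far into the device would alias onto the start of it, which is the kind of damage that shows up later as filesystem errors.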
Comment 7 Tom Coughlan 2005-01-22 17:45:46 UTC
Created attachment 110092 [details] fix dm-stripe cast
Comment 8 Stephen Tweedie 2005-01-28 14:42:38 UTC
Update: We've been able to reproduce this on LVM, but not on an equivalent-sized raw SCSI partition. I've written a simple program to write and verify a pattern with 64-bit offsets at various locations over the disk, specifically to test for block aliasing patterns. Running it every 128G over the LVM partition shows an odd pattern of corruption starting at 2TB. (The block at 2TB is fine for 8192 bytes, but then misses the next 8192 bytes on subsequent read.) Again, plain SCSI does not show the problem. So it looks like we can easily reproduce this in a matter of seconds by using the right test load, which will help as we attempt to fix the problem. I've also determined that "lmdd"'s pattern is only 32 bits long, so it will wrap and will not be able to properly reproduce this problem.
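To show why a 32-bit pattern can hide aliasing while a 64-bit one exposes it, here is a small Python sketch. The stamping scheme (tagging each block with its block number) is an assumption made for illustration and is not claimed to be lmdd's actual pattern; stamp32 and stamp64 are hypothetical helpers.

```python
# Illustration only: the stamping scheme is assumed, not taken from lmdd.
import struct

MASK32 = 0xFFFFFFFF


def stamp32(block_no: int) -> bytes:
    """Tag a block with only the low 32 bits of its block number."""
    return struct.pack("<I", block_no & MASK32)


def stamp64(block_no: int) -> bytes:
    """Tag a block with its full 64-bit block number."""
    return struct.pack("<Q", block_no)


if __name__ == "__main__":
    near, far = 5, 5 + 2 ** 32   # two blocks exactly 2**32 apart
    print("32-bit stamps identical:", stamp32(near) == stamp32(far))
    print("64-bit stamps identical:", stamp64(near) == stamp64(far))
```

If a device bug redirects the far block onto the near one, a verifier comparing 32-bit stamps reads back the value it expects and reports no error, whereas the 64-bit stamp differs and flags the mismatch.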
Comment 9 Stephen Tweedie 2005-01-28 14:47:43 UTC
Created attachment 110353 [details]
Test program for reproducing 2TB device aliasing problems

Run with, for example:

./verify-data /dev/$DEV 128g

to perform a 1MB data write and verify every 128G throughout the device.
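The attachment itself is not reproduced in this report. The following is a rough Python sketch of the same general idea (write an offset-derived pattern at a fixed stride, then verify it in a separate pass); every name in it is hypothetical, and it assumes a seekable file or block device path. A real device test would likely also need to bypass or drop the page cache, which this sketch does not attempt.

```python
# Hypothetical sketch of a stride write/verify test; NOT the attached program.
import os
import struct
import tempfile


def make_block(offset: int, length: int) -> bytes:
    """Fill a block with the 64-bit absolute byte offset of each 8-byte word."""
    return b"".join(struct.pack("<Q", offset + i) for i in range(0, length, 8))


def write_pass(path: str, total: int, stride: int, length: int) -> None:
    """Write one pattern block at every stride across the target."""
    with open(path, "r+b") as f:
        for off in range(0, total - length + 1, stride):
            f.seek(off)
            f.write(make_block(off, length))
        f.flush()
        os.fsync(f.fileno())


def verify_pass(path: str, total: int, stride: int, length: int) -> list:
    """Re-read every block and return the offsets whose contents are wrong."""
    bad = []
    with open(path, "rb") as f:
        for off in range(0, total - length + 1, stride):
            f.seek(off)
            if f.read(length) != make_block(off, length):
                bad.append(off)
    return bad


def demo():
    """Run against a small temporary file, then simulate one aliased write."""
    total, stride, length = 1 << 20, 64 << 10, 4096
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.truncate(total)
        tmp.flush()
        write_pass(tmp.name, total, stride, length)
        clean = verify_pass(tmp.name, total, stride, length)
        with open(tmp.name, "r+b") as f:      # block 0's data lands at `stride`
            f.seek(stride)
            f.write(make_block(0, length))
        aliased = verify_pass(tmp.name, total, stride, length)
    return clean, aliased


if __name__ == "__main__":
    print(demo())
```

Writing everything before verifying anything is deliberate: an aliased write only becomes visible once a later write has overwritten an earlier location, so an immediate read-back after each write would miss it.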
Comment 10 Tom Coughlan 2005-05-24 19:53:22 UTC
David, We were never able to reproduce the exact problem you reported in this BZ. In trying, though, we fixed several things in U1, most notably LVM on large devices. Would you please try to reproduce your problem with U1? QA would appreciate it if you can do this this week, so this BZ can be closed out wrt the official U1 errata. Thanks, Tom
Comment 11 Tim Powers 2005-06-08 15:13:14 UTC
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-420.html