Bug 143907

Summary: ext2 on device dm-0 beyond 2 terabytes causes /var/log/messages growth that crashes the system
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: David Grifffith <lownoisefloor>
Assignee: Stephen Tweedie <sct>
QA Contact: Brian Brock <bbrock>
CC: agk, bpeck, coughlan, davej, jbaron, k.georgiou, lockhart, lownoisefloor, sct, shillman, tburke
Fixed In Version: RHBA-2005-298
Doc Type: Bug Fix
Bug Blocks: 137160
Last Closed: 2005-06-08 15:13:14 UTC

Attachments:
  fix dm-stripe cast (flags: none)
  Test program for reproducing 2TB device aliasing problems (flags: none)

Description David Grifffith 2005-01-01 03:36:40 UTC
Description of problem: The server is set up with RHEL 4 Beta 2. Fourteen 
300 GB SCSI drives are configured with LVM to present a 3.8 terabyte 
file system to the user.  The file system is ext2, which supports 
partition sizes up to 32 terabytes but individual file sizes of only 
2 terabytes.  When an application writes continuously to the 3.8 terabytes 
available and the 2 terabyte limit is reached, the following error 
message appears:
---- begin error message ----
Dec 30 23:48:52 localhost kernel: EXT2-fs error (device dm-0): 
ext2_free_blocks: Freeing blocks in system zones - Block = 16843009, 
count = 1
Dec 30 23:48:52 localhost kernel: EXT2-fs error (device dm-0): 
ext2_free_blocks: bit already cleared for block 16843009
---- end error message ----
The error message repeats endlessly until /var/log/messages fills the 
remaining disk space.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 4 Beta 2  (2.6.9.1-648smp)


How reproducible:
  1. You must have an array greater than 2 terabytes.
     The system under test has 3.8 terabytes.
  2. Set up all disks with LVM2 to achieve the full 3.8 terabytes.
  3. Begin writing a file that grows until the 2 terabyte limit is
     reached (a minimal sketch of such a writer follows below).
  .... the error occurs here
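
For illustration only, a minimal sketch (in C) of the kind of writer step 3
describes; the file path is hypothetical, and the actual tests later in this
report used dt and dd instead:

/* grow-file.c: illustrative only; keeps appending until the kernel stops us */
#define _FILE_OFFSET_BITS 64   /* large-file support on 32-bit i686 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (1 << 20)        /* write 1 MiB per call */

int main(void)
{
    static char buf[CHUNK];
    unsigned long long total = 0;
    int fd = open("/mnt/scratch/growfile", O_WRONLY | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(buf, 0xAA, sizeof(buf));
    for (;;) {
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n <= 0) {
            /* At the 2 TB cap the write fails with EFBIG, or the process
             * is killed with SIGXFSZ ("File size limit exceeded"). */
            fprintf(stderr, "write stopped after %llu bytes: %s\n",
                    total, n < 0 ? strerror(errno) : "short write");
            break;
        }
        total += (unsigned long long)n;
    }
    close(fd);
    return 0;
}

Build with, for example, gcc -O2 -o grow-file grow-file.c and point it at a
file on the large filesystem.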


Additional info: It is a guess that device dm-0 or the device-mapper 
function is the issue here. Handling the limit and exiting with a message 
telling the user that the file size limit has been reached would be 
desirable.

Comment 2 Tom Coughlan 2005-01-19 14:24:09 UTC
I tried the following test on a system with two 3 TB hardware RAID 0 logical
units. It did not reproduce the problem, although there are some issues.

The two storage devices are sda and sdb: 
 
parted /dev/sda mklabel gpt
parted /dev/sdb mklabel gpt
parted /dev/sda mkpart primary ext3 0 3050352
parted /dev/sdb mkpart primary ext3 0 3050352
pvcreate /dev/sda
pvcreate /dev/sdb
vgcreate bigvg00 /dev/sda /dev/sdb
vgdisplay
  --- Volume group ---
  VG Name               bigvg00
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               5.82 TB
  PE Size               4.00 MB
  Total PE              1525174
  Alloc PE / Size       0 / 0
  Free  PE / Size       1525174 / 5.82 TB
  VG UUID               aw4hbU-xljx-8bB8-6wyy-7Cup-MCnX-jSXGRJ

lvcreate -i 2 -L 3TB bigvg00
lvdisplay
  --- Logical volume ---
  LV Name                /dev/bigvg00/lvol0
  VG Name                bigvg00
  LV UUID                7SXMuz-7sw6-pHon-wuys-H2eD-ajgU-F4cF4d
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                3.00 TB
  Current LE             786432
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:0

mke2fs -T largefile4 -j  /dev/bigvg00/lvol0
fsck -f /dev/bigvg00/lvol0

No problem.

Then I mounted the fs and ran the "dt" test, to write a 2.5 TB file:
  
# ./dt of=/mnt/scratch/testfile bs=200m limit=2500GB dispose=keep log=bigfile.log

The test ended with the error:

File size limit exceeded

# ls -lh /mnt/scratch/testfile
-rw-r--r--  1 root root 2.0T Jan 18 23:07 /mnt/scratch/testfile

There were no errors in the log.

There are three differences between this test and the problem report: this test
uses kernel 2.6.9-5.EL, it uses ext3, and the storage configuration is slightly
different.

Although there were no failures during the test, I have seen some problems since.
Parted "print" says the primary GPT table is corrupt:

parted /dev/sda print
Error: The primary GPT table is corrupt, but the backup appears ok, so that will
be used.
OK/Cancel? ok
Disk geometry for /dev/sda: 0.000-3050352.000 megabytes
Disk label type: gpt
Minor    Start       End     Filesystem  Name                  Flags
1          0.017 3050351.983

But when I ran parted later it did not report this error.

Also pvdisplay, lvdisplay and vgdisplay are no longer working.

# pvdisplay /dev/sda
  No physical volume label read from /dev/sda
  Failed to read physical volume "/dev/sda"
# pvdisplay /dev/sdb
  No physical volume label read from /dev/sdb
  Failed to read physical volume "/dev/sdb"
# vgdisplay
  No volume groups found
# lvdisplay
  No volume groups found

I am still investigating. Do you think ext2 vs. ext3 is likely to matter for
this test?

Comment 3 Tom Coughlan 2005-01-19 16:12:48 UTC
Okay, Alasdair pointed out my mistake. I should have used "pvcreate /dev/sda1" not
"pvcreate /dev/sda". Duh. 

That explains the LVM and parted problems. 

Is "File size limit exceeded" the expected result when attempting to write an
ext3 file > 2 TB?

Comment 4 Stephen Tweedie 2005-01-20 12:15:38 UTC
ext2 vs. ext3 should not matter in theory, but it would be responsible to test both!

The 2.6 kernel currently caps file sizes at 2TB on both ext2 and ext3.  There is
unfortunately a 32-bit limit on statbuf->st_blocks, which caps the file size for
which we can reliably return "df" information at 2TB.
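
As a quick sanity check on that figure (illustrative only, not taken from the
kernel sources): st_blocks counts 512-byte sectors, so a 32-bit counter tops
out at 2^32 * 512 bytes, which is exactly 2 TiB:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* st_blocks is counted in 512-byte sectors; with only 32 bits it
     * cannot represent a file larger than 2^32 sectors. */
    uint64_t max_bytes = ((uint64_t)1 << 32) * 512;

    printf("32-bit st_blocks limit: %llu bytes = %llu GiB (2 TiB)\n",
           (unsigned long long)max_bytes,
           (unsigned long long)(max_bytes >> 30));
    return 0;
}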


Comment 5 Tom Coughlan 2005-01-20 12:55:59 UTC
I fixed up the parted/LVM mistake and ran a second test, this time with ext2 and
"dd" instead of "dt". Same result:

# dd if=/dev/zero of=/mnt/scratch/testfile bs=512M count=5000
File size limit exceeded

# ls -lh /mnt/scratch/testfile
-rw-r--r--  1 root root 2.0T Jan 19 21:36 /mnt/scratch/testfile

# mount
/dev/hda2 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/hda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/mapper/bigvg01-lvol0 on /mnt/scratch type ext2 (rw)

# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda2             37373192   6298844  29175868  18% /
/dev/hda1                46633      7955     36270  18% /boot
none                    257868         0    257868   0% /dev/shm
/dev/mapper/bigvg01-lvol0
                     3220913576 2149672988 910179316  71% /mnt/scratch

David, can you update to RHEL 4 RC and re-run your test?

Comment 6 Tom Coughlan 2005-01-22 17:43:54 UTC
Status update:

After further testing I reproduced the filesystem corruption reported in this BZ. 

I found that after I used "dd" to create a 2TB file, as described above, then
created a second file and attempted to delete it, tens of thousands of the
following message were generated in the log:

-----
kernel: EXT2-fs error (device dm-0): ext2_free_blocks: Freeing blocks not in
datazone - block = 969120825, count = 1
-----

Then I ran umount and e2fsck -n. This reported thousands of messages like the following:

-----
Inode 12 is too big.  Truncate? no

Block #536349691 (536947491) causes file to be too big.  IGNORED.
-----

as well as other messages. 

Alasdair identified a missing cast in dm-stripe that could cause data corruption
on devices with stripes > ~1TB. He proposed the attached patch. Unfortunately,
this patch did not fix the problem. The test scenario described above still
fails. :^(
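
For readers unfamiliar with this class of bug, the sketch below (illustrative
only, not the actual dm-stripe code or Alasdair's patch) shows how a 32-bit
multiply in sector arithmetic wraps modulo 2^32, so offsets past roughly 2 TB
alias back onto low offsets:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t sector_t;   /* 64-bit sectors, as with CONFIG_LBD */

int main(void)
{
    uint32_t chunk_sectors = 128;        /* hypothetical 64 KiB stripe chunk */
    uint32_t chunk_index   = 34000000;   /* a chunk that lies beyond 2 TB */

    /* Buggy: both operands are 32-bit, so the product wraps modulo 2^32
     * before it is widened to sector_t. */
    sector_t wrong = chunk_index * chunk_sectors;

    /* Fixed: cast one operand first so the multiply happens in 64 bits. */
    sector_t right = (sector_t)chunk_index * chunk_sectors;

    printf("wrong = %llu sectors, right = %llu sectors\n",
           (unsigned long long)wrong, (unsigned long long)right);
    return 0;
}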

Prior to running with the patch, I also ran a test to check for data corruption
on the LVM /dev, in an effort to exclude the filesystem. This consisted of a
write pass:

./lmdd opat=1 of=/dev/mapper/bigvg01-lvol0  &>  testoutlog.txt
write: No space left on device
write: wanted=8192 got=-1
3298534.8833 MB in 31084.7642 secs, 106.1142 MB/sec

followed by a read/check pass (I stopped this command before it finished, so I
could test Alasdair's patch, but the output shows that it got well past the 2TB
boundary without errors.)

./lmdd ipat=1 if=/dev/mapper/bigvg01-lvol0  &>  testinlog.txt
2650585.6942 MB in 41003.5009 secs, 64.6429 MB/sec

It is possible that the lmdd data pattern wraps in a way that masks the data
corruption. I am continuing to investigate both the original problem and the
impact of the bug that Alasdair found. Attempts to use dd and hexdump to test
and examine regions of the storage device have so far produced only
unreasonable results. All suggestions welcome.

Comment 7 Tom Coughlan 2005-01-22 17:45:46 UTC
Created attachment 110092 [details]
fix dm-stripe cast

Comment 8 Stephen Tweedie 2005-01-28 14:42:38 UTC
Update:

We've been able to reproduce this on LVM, but not on an equivalent-sized raw
SCSI partition.

I've written a simple program to write and verify a pattern with 64-bit offsets
at various locations over the disk, specifically to test for block aliasing
patterns.  Running it every 128G over the LVM partition shows an odd pattern of
corruption starting at 2TB.  (The first 8192 bytes of the block at 2TB are fine,
but the next 8192 bytes are missing on a subsequent read.)  Again, plain SCSI
does not show the
problem.  So it looks like we're easily able to reproduce this in a matter of
seconds by using the right test load, which will help as we attempt to fix the
problem.

I've also determined that "lmdd"'s data pattern is only 32 bits long, so it will
wrap and will not be able to properly reproduce this problem.


Comment 9 Stephen Tweedie 2005-01-28 14:47:43 UTC
Created attachment 110353 [details]
Test program for reproducing 2TB device aliasing problems

Run with, for example:

./verify-data /dev/$DEV 128g

to perform a 1MB data write and verify every 128G throughout the device.
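
The attached source is not reproduced here, but as a rough sketch of the same
idea (assumed, not copied from the attachment; the device path and step size
are just examples), each probe location is stamped with its own 64-bit byte
offset on a write pass and then compared on a read pass, so any aliasing
between offsets shows up as a mismatch:

/* verify-sketch.c: offset-stamping write/verify probe every `step` bytes.
 * Illustrative only; this is not the attached verify-data program. */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define PROBE (1 << 20)                      /* 1 MiB written per probe */

static void stamp(uint64_t *p, uint64_t off)
{
    for (size_t i = 0; i < PROBE / sizeof(*p); i++)
        p[i] = off + i * sizeof(*p);         /* each word records its offset */
}

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/mapper/bigvg01-lvol0";
    uint64_t step = 128ULL << 30;            /* probe every 128 GiB */
    static uint64_t wbuf[PROBE / 8], rbuf[PROBE / 8];
    uint64_t off, end = 0;
    int fd = open(dev, O_RDWR);              /* a real test would add O_DIRECT
                                              * (with aligned buffers) so the
                                              * verify pass hits the disk, not
                                              * the page cache */
    if (fd < 0) { perror("open"); return 1; }

    /* Write pass: stamp every probe location with its own offsets. */
    for (off = 0; ; off += step) {
        stamp(wbuf, off);
        if (pwrite(fd, wbuf, PROBE, (off_t)off) != PROBE)
            break;                           /* ran off the end of the device */
        end = off;
    }
    fsync(fd);

    /* Verify pass: a wrong value means another offset aliased onto this one. */
    for (off = 0; off <= end; off += step) {
        stamp(wbuf, off);
        if (pread(fd, rbuf, PROBE, (off_t)off) != PROBE)
            break;
        if (memcmp(wbuf, rbuf, PROBE) != 0)
            printf("MISMATCH at byte offset %llu\n", (unsigned long long)off);
    }
    close(fd);
    return 0;
}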

Comment 10 Tom Coughlan 2005-05-24 19:53:22 UTC
David,

We were never able to reproduce the exact problem you reported in this BZ. In
trying, though, we fixed several things in U1, most notably LVM on large devices. 

Would you please try to reproduce your problem with U1? QA would appreciate it
if you can do this this week, so this BZ can be closed out wrt the official U1
errata.

Thanks,

Tom

Comment 11 Tim Powers 2005-06-08 15:13:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-420.html