Bug 624691

Summary: fsck.gfs2 deletes directories if they get too big
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.0
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Reporter: Robert Peterson <rpeterso>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
CC: adas, bmarzins, cluster-maint, ddumas, fdinitto, lhh, rpeterso, rwheeler, ssaha, swhiteho
Target Milestone: rc
Target Release: 6.0
Keywords: Regression
Doc Type: Bug Fix
Fixed In Version: cluster-3.0.12-23.el6
Clone Of: 624689
Cloned As: 628013
Last Closed: 2010-11-10 20:00:34 UTC
Bug Depends On: 575968, 620384, 624689
Bug Blocks: 622576
Attachments:
- Proposed patch for STABLE3
- fsck.gfs2 core dump

Description Robert Peterson 2010-08-17 14:00:14 UTC
+++ This bug was initially created as a clone of Bug #624689 +++
+++ which was initially created as a clone of Bug #620384 +++

Yesterday I did more rigorous testing for bug #620384 and
discovered this bug: basically, the latest and greatest
fsck.gfs2 can't handle directories that grow really big (i.e.
have lots of entries).  This happens more easily with a small
block size such as 512B.
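
For a sense of scale (assuming an 8-byte block pointer and a
24-byte GFS2 metadata header, so treat the exact figures as
approximate): a 512-byte block holds (512 - 24) / 8 = 61 block
pointers, while a 4096-byte block holds (4096 - 24) / 8 = 509.
Each level of the metadata tree therefore fans out by only ~61
with 512B blocks instead of ~509, so far fewer directory entries
are needed before the hash table outgrows one level and the tree
gains an extra level of indirection.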

For almost all normal directories, the metadata structure looks
like this:

height  structure
------  -------------------------------------------------
0.      dinode
1.      journaled data block (hash table block pointers)
2.      directory leaf blocks

When directories get really big their metadata structure gets
more complex and ends up looking like this:

height  structure
------  -------------------------------------------------
0.      dinode
1.      indirect block (block pointers to block pointers)
2.      journaled data block (hash table block pointers)
3.      directory leaf blocks

If there are enough directory entries, the structure can
reach greater heights, with level 2 becoming another level of
indirect blocks:

height  structure
------  -------------------------------------------------
0.      dinode
1.      indirect block (block pointers to block pointers)
2.      indirect block (block pointers to block pointers)
3.      journaled data block (hash table block pointers)
4.      directory leaf blocks

Right now, fsck.gfs2 can only handle directories of the first
form.  Large directories whose trees contain four different
metadata block types are flagged as errors and their data is
destroyed.  This is very serious and needs to be fixed ASAP.
I've written a patch for this issue and I'm testing it now.
So far the patch has passed a simple unit test using a
four-level directory.
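
To illustrate the shape of the problem (a minimal sketch, not the
actual patch and not the real libgfs2 API; read_ptr(),
check_hash_table_block(), the block numbers, and the 61-pointer
fan-out for 512B blocks are all assumptions): the traversal must
honor the dinode's height rather than hard-coding a single level
of hash-table pointers.

#include <stdint.h>
#include <stdio.h>

#define PTRS_PER_INDIRECT 61	/* (512-byte block - 24-byte header) / 8 */

/* Hypothetical stand-in for a libgfs2 buffer read: fetch pointer
 * 'slot' out of metadata block 'blk'. */
static uint64_t read_ptr(uint64_t blk, int slot)
{
	(void)blk; (void)slot;
	return 0;
}

/* Hypothetical stand-in for the hash-table and leaf checks fsck
 * performs once it reaches the journaled data blocks. */
static void check_hash_table_block(uint64_t blk)
{
	printf("hash table block: %llu\n", (unsigned long long)blk);
}

/* Walk a directory's metadata tree.  'height' counts the levels
 * below the dinode: at height 1 the pointers reference the
 * journaled hash-table blocks directly; at height > 1 they
 * reference further indirect blocks.  Code that effectively
 * assumes height == 1 will misread taller trees as corruption. */
static void walk_dir_metadata(uint64_t blk, int height)
{
	int slot;

	for (slot = 0; slot < PTRS_PER_INDIRECT; slot++) {
		uint64_t ptr = read_ptr(blk, slot);
		if (ptr == 0)
			continue;		/* unallocated slot */
		if (height > 1)
			walk_dir_metadata(ptr, height - 1);
		else
			check_hash_table_block(ptr);
	}
}

int main(void)
{
	/* e.g. the second diagram above: dinode -> indirect ->
	 * hash table -> leaves, i.e. two levels below the dinode */
	walk_dir_metadata(398 /* hypothetical dinode block */, 2);
	return 0;
}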

Comment 1 Robert Peterson 2010-08-17 15:25:38 UTC
I did some testing and discovered this bug does not exist in
gfs1's fsck (gfs_fsck).  I also tested gfs2-utils-0.1.60-1.el5
and it does not have this problem either, so this is, in fact,
a regression.

Here's how to recreate the failure and what it looks like:

[root@kool ~]# mkfs.gfs2 -O -b512 -p lock_nolock -t "kool:bob" -j1 /dev/kool_vg/kool_bob 
Device:                    /dev/kool_vg/kool_bob
Blocksize:                 512
Device Size                40.00 GB (83886080 blocks)
Filesystem Size:           40.00 GB (83886078 blocks)
Journals:                  1
Resource Groups:           160
Locking Protocol:          "lock_nolock"
Lock Table:                "kool:bob"
UUID:                      05060249-E9BB-1DCF-C9C8-112EF09BD56C

You have new mail in /var/spool/mail/root
[root@kool ~]# sync
[root@kool ~]# mount -tgfs2 /dev/kool_vg/kool_bob  /mnt/bob
[root@kool ~]# mkdir /mnt/bob/bob
[root@kool ~]# for i in `seq 1 10000` ; do touch /mnt/bob/bob/file_name_$i ; done
[root@kool ~]# !umo
umount /mnt/bob
[root@kool ~]# /sbin/fsck.gfs2 -V
GFS2 fsck DEVEL.1274286054 (built May 19 2010 11:22:48)
Copyright (C) Red Hat, Inc.  2004-2006  All rights reserved.
[root@kool ~]# fsck.gfs2 /dev/kool_vg/kool_bob 
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Starting pass1
Block 287425 (0x462c1) seems to be free space, but is marked as data in the bitmap.
Okay to fix the bitmap? (y/n)y
The bitmap was fixed.
Block 287426 (0x462c2) seems to be free space, but is marked as data in the bitmap.
Okay to fix the bitmap? (y/n)y
The bitmap was fixed.
Error: inode 269038 (0x41aee) has unrecoverable errors; invalidating.
Block 269038 (0x41aee) seems to be free space, but is marked as inode in the bitmap.
Okay to fix the bitmap? (y/n)y
The bitmap was fixed.
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Directory entry 'bob' referencing inode 269038 (0x41aee) in dir inode 398 (0x18e) block type 0: was deleted or is not an inode.
Clear directory entry to non-inode block? (y/n) 

Obviously, the fsck of the file system should come up clean.

Comment 3 Robert Peterson 2010-08-17 17:20:53 UTC
Created attachment 439170 [details]
Proposed patch for STABLE3

Here is the STABLE3 crosswrite patch.  I'm testing it now.

Comment 4 Robert Peterson 2010-08-17 17:41:58 UTC
I tested this patch and found it to be correct on system roth-08.
I pushed the patch to the master branch of the gfs2-utils git tree,
and the STABLE3, RHEL6, and RHEL60 branches of the cluster git
tree for inclusion into 6.0.  Changing status to POST until a new
cluster package gets built.

Comment 5 Fabio Massimo Di Nitto 2010-08-17 18:11:08 UTC
Now built with cherry pick from RHEL60 branch.

Comment 7 Nate Straz 2010-08-25 22:37:02 UTC
Created attachment 441066 [details]
fsck.gfs2 core dump

I'm trying to verify this but I think I'm missing something from Bob's description.  I did the mkfs w/ 512 byte block size:

mkfs -t gfs2 -O -b 512 -j 1 -p lock_dlm -t west:brawl0 /dev/brawl/brawl0

/dev/brawl/brawl0 is a 1G LV.

Then I ran a loop to fill a directory with entries:

i=0	# entry counter; must be initialized before the loop
# bigdir_ino: disk address of the bigdir dinode (determined beforehand)
while touch bigdir/a_long_filename_to_make_quick_use_of_dentry_blocks_$i; do
	let i=i+1
	if ((i % 10000 == 0)); then
		echo "$i"
		height=`gfs2_edit -p $bigdir_ino field di_height /dev/brawl/brawl0`
		if [ "$height" -eq 4 ]; then
			break
		fi
	fi
done

This loop ran until the file system was filled and did not reach a di_height of 4.
...
760000
touch: cannot touch `bigdir/a_long_filename_to_make_quick_use_of_dentry_blocks_767252': No space left on device

[root@west-01 ~]# fsck.gfs2 -y /dev/brawl/brawl0; echo $?
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Starting pass1
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Entries is 767253 - should be 442972 for inode block 269009 (0x41ad1)
Pass2 complete      
Starting pass3
Pass3 complete      
Starting pass4
Found unlinked inode at 508636 (0x7c2dc)
Added inode #508636 (0x7c2dc) to lost+found dir
Found unlinked inode at 508665 (0x7c2f9)
Added inode #508665 (0x7c2f9) to lost+found dir
Found unlinked inode at 519865 (0x7eeb9)
Added inode #519865 (0x7eeb9) to lost+found dir
Found unlinked inode at 520069 (0x7ef85)
Segmentation fault (core dumped)
139

<abrt deleted core dump>

Rerunning fsck.gfs2:
[root@west-01 ~]# fsck.gfs2 -y /dev/brawl/brawl0; echo $?
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Starting pass1
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete      
Starting pass3
Pass3 complete      
Starting pass4
Found unlinked inode at 508636 (0x7c2dc)
Added inode #508636 (0x7c2dc) to lost+found dir
Found unlinked inode at 508665 (0x7c2f9)
Added inode #508665 (0x7c2f9) to lost+found dir
Found unlinked inode at 519865 (0x7eeb9)
Added inode #519865 (0x7eeb9) to lost+found dir
Found unlinked inode at 520069 (0x7ef85)
Segmentation fault (core dumped)

#0  bwrite (bh=0xcc03070000000000)
    at /usr/src/debug/cluster-3.0.12/gfs2/libgfs2/buf.c:63
#1  0x000000000041d80d in dir_make_exhash (dip=0x534c1d0, 
    filename=0x7fff14898640 "lost_file_520069", len=16, inum=0xd0d2e0, type=8)
    at /usr/src/debug/cluster-3.0.12/gfs2/libgfs2/fs_ops.c:1193
#2  dir_l_add (dip=0x534c1d0, filename=0x7fff14898640 "lost_file_520069", 
    len=16, inum=0xd0d2e0, type=8)
    at /usr/src/debug/cluster-3.0.12/gfs2/libgfs2/fs_ops.c:1202
#3  dir_add (dip=0x534c1d0, filename=0x7fff14898640 "lost_file_520069", 
    len=16, inum=0xd0d2e0, type=8)
    at /usr/src/debug/cluster-3.0.12/gfs2/libgfs2/fs_ops.c:1220
#4  0x0000000000406c3d in add_inode_to_lf (ip=0xd0d2c0)
    at /usr/src/debug/cluster-3.0.12/gfs2/fsck/lost_n_found.c:168
#5  0x000000000041549c in scan_inode_list (sbp=0x7fff14898820)
    at /usr/src/debug/cluster-3.0.12/gfs2/fsck/pass4.c:130
#6  pass4 (sbp=0x7fff14898820)
    at /usr/src/debug/cluster-3.0.12/gfs2/fsck/pass4.c:195
#7  0x0000000000407a27 in main (argc=<value optimized out>, 
    argv=<value optimized out>)
    at /usr/src/debug/cluster-3.0.12/gfs2/fsck/main.c:311

Comment 8 Nate Straz 2010-08-27 15:20:59 UTC
I reproduced this again w/o running out of space on the device.  Once I get above 500,000 entries it truncates the directory and segfaults while moving stuff into lost+found.

[root@morph-04 ~]# fsck.gfs2 -y /dev/morph-cluster/bigdir
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
root inode 398 (0x18e): Entries is 516033 - should be 408582
Entries updated
Pass2 complete
Starting pass3
Pass3 complete
Starting pass4
Found unlinked inode at 377457 (0x5c271)
Added inode #377457 (0x5c271) to lost+found dir
Found unlinked inode at 377486 (0x5c28e)
Added inode #377486 (0x5c28e) to lost+found dir
Found unlinked inode at 388686 (0x5ee4e)
Added inode #388686 (0x5ee4e) to lost+found dir
Found unlinked inode at 388890 (0x5ef1a)
Segmentation fault (core dumped)

Comment 14 Nate Straz 2010-08-27 16:45:38 UTC
I ran the new test against the old fsck.gfs2 and confirmed that the failure in comment #7 is different from the original report.  I was able to get much further with gfs2-utils-3.0.12-23.el6 than with gfs2-utils-3.0.12-21.el6.  I consider the fix good and this bug verified.  I moved the issue from comment #7 to bug 628013.

Comment 15 releng-rhel@redhat.com 2010-11-10 20:00:34 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.