Alexey Kuznetsov <kuznet.ac.ru> from Virtuozzo/OpenVz linux kernel team has discowered following issue: [GFS] bug in gfs truncate GFS enforces a strange invariant: tree of indirect blocks must be constructed for all the blocks potentially allocateable inside inode size. Technically this is a stupid requirement, but both gfs kernel and user space gfs_fsck consider inodes with bad size as totally invalid and destroy them. It means that GFS has to be especially careful to maintain correct inode size. One point of failure is truncate(): if quota does not allow to grow inode size, it leaves i_size at wrong value. Subsequent write to inode moves wrong i_size to disk size and height of tree remains not updated. Such inode is invalid and it is destroyed by subsequent fsck. diff -urp ../gfs-kmod-0.1.16-orig/gfs-kernel-0.1.16/src/gfs/ops_inode.c gfs-kernel-0.1.16/src/gfs/ops_inode.c --- ../gfs-kmod-0.1.16-orig/gfs-kernel-0.1.16/src/gfs/ops_inode.c 2007-05-20 19:18:19.000000000 +0400 +++ gfs-kernel-0.1.16/src/gfs/ops_inode.c 2007-05-20 00:43:03.000000000 +0400 @@ -1395,8 +1507,11 @@ gfs_setattr(struct dentry *dentry, struc } error = gfs_truncatei(ip, attr->ia_size, gfs_truncator_page); - if (error) + if (error) { + if (inode->i_size != ip->i_di.di_size) + i_size_write(inode, ip->i_di.di_size); goto fail; + } if ((sdp->sd_vfs->s_flags & MS_SYNCHRONOUS) && !gfs_is_jdata(ip))
I looked at the code and I don't see how this can be a problem. The suggested code is fixing the inode size on error, but this is the wrong place to do it. All the functions, like gfs_truncatei have error paths that are supposed to clean up after themselves and as far as I can see, the code is doing that properly. I ran this quick test to see if my theory was correct. I put a 1M quota on a gfs file system, then exceeded that quota. The inode information seemed to be correct. Here's what I did: [root@trin-10 root]# mount -tgfs /dev/trin_vg/hell /mnt/hell/ [root@trin-10 root]# gfs_quota limit -l 1 -u testmonkey -f /mnt/hell [root@trin-10 root]# chmod 777 /mnt/hell [root@trin-10 root]# su testmonkey [testmonkey@trin-10 root]$ cd /mnt/hell [testmonkey@trin-10 hell]$ mkdir monkey [testmonkey@trin-10 hell]$ cd monkey/ [testmonkey@trin-10 monkey]$ dd if=/dev/zero of=/mnt/hell/1000K bs=1K count=1000 1000+0 records in 1000+0 records out 1024000 bytes (1.0 MB) copied, 0.20244 seconds, 5.1 MB/s [testmonkey@trin-10 monkey]$ cp /home/devel/gfs2-2.6.git.tgz /mnt/hell/monkey/ GFS: fsid=bob_cluster2:test_gfs.0: quota exceeded for user 500 cp: writing `/mnt/hell/monkey/gfs2-2.6.git.tgz': Disk quota exceeded [testmonkey@trin-10 monkey]$ exit [root@trin-10 root]# ls -lr /mnt/hell/monkey/gfs2-2.6.git.tgz -rw-r--r-- 1 testmonkey testmonkey 16384 Jun 26 09:27 /mnt/hell/monkey/gfs2-2.6.git.tgz [root@trin-10 ../gfs-kernel/src/gfs]# stat !$ stat /mnt/hell/monkey/gfs2-2.6.git.tgz File: `/mnt/hell/monkey/gfs2-2.6.git.tgz' Size: 16384 Blocks: 40 IO Block: 4096 regular file Device: fe03h/65027d Inode: 30 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/testmonkey) Gid: ( 500/testmonkey) Access: 2007-06-26 09:27:16.000000000 -0500 Modify: 2007-06-26 09:27:16.000000000 -0500 Change: 2007-06-26 09:27:16.000000000 -0500 Maybe I misunderstood what the problem is. Please give an example of how to recreate the quota problem and what bad value(s) you are seeing.
>All the functions, like gfs_truncatei >have error paths that are supposed to clean up after themselves and >as far as I can see, the code is doing that properly. Actually error path just not exist. So we have to fix i_size explicitly. Simple test case: /* gcc trunc.c -otrunc */ #include <unistd.h> #include <unistd.h> #include <fcntl.h> int main(int argc, char** argv) { int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0666); lseek(fd, 1024*1024*1024, SEEK_SET); /* i_size = 0, di_size = 0 */ ftruncate(fd, 1024*1024*1024); /* this have to fail because of quota*/ /* after trucate failed we have:i_size = 1073741824, di_size = 0 */ write(fd, "", 1); close(fd); } # mount /dev/vzvg/gfs /mnt/ su testmonkey; $ touch /mnt/file # create file for later test $ dd if=/dev/zero of=/mnt/BIG_FILE # write until -EDQUOT $ ./trunc /mnt/file # test itself $ exit # sync;sync;sync # force commit journal #umount /mnt/; gfs_fsck /dev/vzvg/gfs Initializing fsck Clearing journals (this may take a while)..... Journals cleared. Starting pass1 Found unused inode marked in-use Clear unused inode at block 30? (y/n) y Pass1 complete Starting pass1b Pass1b complete Starting pass1c Pass1c complete Starting pass2 Pass2 complete Starting pass3 Pass3 complete Starting pass4 Pass4 complete Starting pass5 ondisk and fsck bitmaps differ at block 30 Fix bitmap for block 30? (y/n) y Succeeded. RG #1 free count inconsistent: is 63398 should be 63399 RG #1 used inode count inconsistent: is 8 should be 7 Update resource group counts? (y/n) y Resource group counts updated Pass5 complete Writing changes to disk
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
After looking at the code path and how gfs calls vmtruncate, I agree that this is the correct fix. Crosswrite for gfs2 has already been proposed, but not yet in the git tree yet. The suggested patch was tested on system trin-10 and fixes the problem. I'll try to squeeze it into RHEL5.1 but we're right at the cutoff, so I don't know if it will make it there or 5.2.
The fix was tested on system roth-01 using information in the bugzilla. The fix was committed to RHEL5 and RHEL51 branches. Changing the status to modified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0577.html