Description of problem:
[ext4/xfstests] Test 011 caused filesystem corruption after being run many times in a loop on ppc64.

# while ./check 011; do echo ok................; done
...
011 2s ... 1s
Ran: 011
Passed all 1 tests
ok................
FSTYP         -- ext4
PLATFORM      -- Linux/ppc64 ibm-js12-vios-01-lp3 2.6.18-236.el5
MKFS_OPTIONS  -- /dev/loop1
MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:nfs_t:s0 /dev/loop1 /mnt/testarea/scratch

011 1s ... 1s
Ran: 011
Passed all 1 tests
ok................
FSTYP         -- ext4
PLATFORM      -- Linux/ppc64 ibm-js12-vios-01-lp3 2.6.18-236.el5
MKFS_OPTIONS  -- /dev/loop1
MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:nfs_t:s0 /dev/loop1 /mnt/testarea/scratch

011 1s ... 1s
_check_generic_filesystem: filesystem on /dev/loop0 is inconsistent (see 011.full)
Ran: 011
Passed all 1 tests

011.full shows:

_check_generic filesystem: filesystem on /dev/loop0 is inconsistent
*** fsck.ext4 output ***
fsck 1.39 (29-May-2006)
e4fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences: -132265
Fix? no

Free inodes count wrong for group #16 (8192, counted=8191).
Fix? no

Free inodes count wrong (327669, counted=327668).
Fix? no

/dev/loop0: ********** WARNING: Filesystem still has errors **********

/dev/loop0: 11/327680 files (0.0% non-contiguous), 55902/1310720 blocks
*** end fsck.ext4 output
*** mount output ***
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda2 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/loop1 on /mnt/testarea/scratch type ext4 (rw,acl,user_xattr,context="system_u:object_r:nfs_t:s0")
*** end mount output

Version-Release number of selected component (if applicable):
# uname -rm
2.6.18-236.el5 ppc64
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)

How reproducible:
I encountered this problem three times: first in https://beaker.engineering.redhat.com/jobs/39580, then while manually running 011 by itself in a loop.

Steps to Reproduce:
1. Install and configure xfstests (see README under the xfstests directory)
2. while ./check 011; do echo ok................; done

Actual results:
Filesystem corruption found.

Expected results:
No filesystem corruption from running test case 011.

Additional info:
The host on which I triggered this problem manually is ibm-js12-vios-01-lp3.rhts.eng.bos.redhat.com.
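In the log above, the run that reported the inconsistency still ended with "Passed all 1 tests", so when looping it may help to watch the output for the inconsistency message rather than rely on the exit status alone. A minimal sketch of such a loop follows. The helper name run_until_inconsistent, the iteration cap, and the temporary log path are illustrative assumptions and are not part of xfstests.

```shell
#!/bin/sh
# Sketch: rerun a command until its output contains the
# "is inconsistent" message seen in this report, or until the
# given number of runs is reached.
# run_until_inconsistent is a hypothetical helper, not an xfstests tool.
run_until_inconsistent() {
    max_runs=$1
    shift
    log=${TMPDIR:-/tmp}/check-loop.$$.log
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # Capture stdout and stderr of this iteration; the exit
        # status is ignored on purpose, only the output is inspected.
        "$@" >"$log" 2>&1 || true
        if grep -q 'is inconsistent' "$log"; then
            echo "inconsistency reported on run $i"
            cat "$log"
            rm -f "$log"
            return 0
        fi
        i=$((i + 1))
    done
    rm -f "$log"
    echo "no inconsistency in $max_runs runs"
    return 1
}

# Possible usage from the xfstests directory (not executed here):
#   run_until_inconsistent 1000 ./check 011
```

The function returns 0 as soon as the message appears and prints the captured output of that run, so the failing iteration is not scrolled away by later ones.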
After more runs of the same loop on x86_64, I hit this problem there as well.
Created attachment 473938
fix patch

Hello,

this issue was fixed upstream, broken, and then fixed again, so it is now _fixed_ upstream. Quoting Eric Sandeen:

This bug was introduced in:
393418676a7602e1d7d3f6e560159c65c8cbd50e ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()

I fixed it in:
7ce9d5d1f3c8736511daa413c64985a05b2feee3 ext4: fix ext4_free_inode() vs. ext4_claim_inode() race

it got broken again in:
955ce5f5be67dfe0d1d096b543af33fe8a1ce3dd ext4: Convert ext4_lock_group to use sb_bgl_lock

and ultimately fixed again in:
d17413c08cd2b1dd2bf2cfdbb0f7b736b2b2b15c ext4: clean up inode bitmaps manipulation in ext4_free_inode

This patch should fix the issue in RHEL5.6. It has been tested on 2.6.18-239.el5 i386 with the expected result (no corruption during the test).

Igor, could you please give it a try on other architectures?

Thanks!
-Lukas
On x86_64 with kernel 2.6.18-239.el5, the problem still exists:

# uname -a
Linux intel-s3e36-01.rhts.eng.nay.redhat.com 2.6.18-239.el5 #1 SMP Tue Jan 4 13:13:58 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

FSTYP         -- ext4
PLATFORM      -- Linux/x86_64 intel-s3e36-01 2.6.18-239.el5
MKFS_OPTIONS  -- /dev/loop1
MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:nfs_t:s0 /dev/loop1 /mnt/testarea/scratch

011 1s ... 1s
_check_generic_filesystem: filesystem on /dev/loop0 is inconsistent (see 011.full)
Ran: 011
Passed all 1 tests

_check_generic filesystem: filesystem on /dev/loop0 is inconsistent
*** fsck.ext4 output ***
fsck 1.39 (29-May-2006)
e4fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences: -134175
Fix? no

Free inodes count wrong for group #16 (8192, counted=8191).
Fix? no

Free inodes count wrong (327669, counted=327668).
Fix? no

/dev/loop0: ********** WARNING: Filesystem still has errors **********

/dev/loop0: 11/327680 files (0.0% non-contiguous), 55902/1310720 blocks
*** end fsck.ext4 output
*** mount output ***
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/loop1 on /mnt/testarea/scratch type ext4 (rw,acl,user_xattr,context="system_u:object_r:nfs_t:s0")
*** end mount output
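The two fsck reports name different inodes (132265 on ppc64, 134175 on x86_64) but the same block group. Since ext4 inode numbers start at 1, the group of an inode is (inode - 1) / inodes_per_group. The sketch below checks this for both inodes. It assumes 8192 inodes per group, which is inferred from the "group #16 (8192, counted=8191)" line and is not stated explicitly in the report; inode_group and inode_index are illustrative helper names.

```shell
#!/bin/sh
# Block group that holds a given inode number.
# $1 = inode number (1-based), $2 = inodes per group (assumed 8192 here).
inode_group() {
    echo $(( ($1 - 1) / $2 ))
}

# Zero-based position of the inode inside its group's inode bitmap.
inode_index() {
    echo $(( ($1 - 1) % $2 ))
}

inode_group 132265 8192   # inode from the ppc64 run
inode_group 134175 8192   # inode from the x86_64 run
```

Both calls print 16, which agrees with the "Free inodes count wrong for group #16" line in each fsck output.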
(In reply to comment #5)
> On x86_64 with kernel 2.6.18-239.el5, this problem still existed:

Are you sure you have applied the patch above? I am sorry, but this is not clear from your comment.

Anyway, with that patch I am not able to reproduce the corruption you have seen. However, I am seeing a different corruption:

EXT4-fs error (device sdb): file system corruption: inode #3538947 logical block 2 mapped to 925434301 (size 1)

which is kind of worrisome, but I guess this is for a new BZ. I need to look at it more closely.

-Lukas
(In reply to comment #8)
> (In reply to comment #5)
> > On x86_64 with kernel 2.6.18-239.el5, this problem still existed:
>
> Are you sure you have applied the patch above ? I am sorry but this is not
> clear from your comment.
>
> Anyway, with that patch I am not able to reproduce the corruption you have
> seen, however I am seeing different corruption:
>
> EXT4-fs error (device sdb): file system corruption: inode #3538947 logical
> block 2 mapped to 925434301 (size 1)
>
> which is kind of worrisome, but this it is for new BZ I guess. Need to look at
> it more closely.
>
> -Lukas

Quoting your comment 4: "This patch should fix the issue in RHEL5.6. It has been tested on 2.6.18-239.el5 i386 with expected result..."

I took this to mean that the issue had already been fixed in kernel 2.6.18-239.el5, so I simply tested against that kernel. According to http://intranet.corp.redhat.com/ic/intranet/RHEL5ChangeLog2#23X.el5, the patch you mention has not been included in any build since kernel 2.6.18-236.el5.

I'll retest this problem when a new kernel build containing the fix is released.
The patch mentioned in Comment 4 definitely fixes the bug. The other corruption I saw is not related to this bug and is not reproducible outside my environment.

Thanks!
-Lukas
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Requesting blocker since this is fs corruption.
*** Bug 667762 has been marked as a duplicate of this bug. ***
The patch is in kernel-2.6.18-245.el5.

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcome.
Reproduced in 2.6.18-244.el5 and verified in 2.6.18-245.el5.
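To check whether a given RHEL 5 kernel is at or past the build in which this was verified, the build number following "2.6.18-" can be compared against 245. A minimal sketch, assuming release strings of the form 2.6.18-NNN[...].el5 as they appear in this report; has_bitmap_fix is a hypothetical helper name. The comparison looks only at the main build number, so it says nothing about whether a separate update stream carries the patch.

```shell
#!/bin/sh
# Return 0 if the release string names build 245 or later of the
# 2.6.18 RHEL 5 kernel, 1 if it names an earlier build, and 2 if the
# string does not look like 2.6.18-NNN...
# has_bitmap_fix is a hypothetical helper for illustration only.
has_bitmap_fix() {
    rel=$1
    case $rel in
        2.6.18-*) ;;
        *) echo "unrecognized release: $rel" >&2; return 2 ;;
    esac
    build=${rel#2.6.18-}    # e.g. "245.el5"
    build=${build%%.*}      # e.g. "245"
    case $build in
        ''|*[!0-9]*) echo "unrecognized build in: $rel" >&2; return 2 ;;
    esac
    [ "$build" -ge 245 ]
}

# The kernel builds mentioned in this report.
for rel in 2.6.18-236.el5 2.6.18-239.el5 2.6.18-244.el5 2.6.18-245.el5; do
    if has_bitmap_fix "$rel"; then
        echo "$rel: build 245 or later"
    else
        echo "$rel: earlier than build 245"
    fi
done
```

On a live system the argument could be the output of uname -r instead of a fixed string.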
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1065.html