663563 – [ext4/xfstests] 011 caused filesystem corruption after running many times in a loop


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.6
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	beta
Target Release:	---
Assignee:	Lukáš Czerner
QA Contact:	Petr Beňas
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	667762
Depends On:
Blocks:

Reported:	2010-12-16 08:09 UTC by Igor Zhang
Modified:	2015-01-04 23:00 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-07-21 10:10:54 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
fix patch (2.51 KB, patch) 2011-01-17 21:59 UTC, Lukáš Czerner

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:1065	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update	2011-07-21 09:21:37 UTC

Description Igor Zhang 2010-12-16 08:09:30 UTC

Description of problem:
[ext4/xfstests] 011 caused filesystem corruption after running many times in a loop on ppc64
# while ./check 011; do echo ok................; done
...
011 2s ... 1s
Ran: 011
Passed all 1 tests
ok................
FSTYP         -- ext4
PLATFORM      -- Linux/ppc64 ibm-js12-vios-01-lp3 2.6.18-236.el5
MKFS_OPTIONS  -- /dev/loop1
MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:nfs_t:s0 /dev/loop1 /mnt/testarea/scratch

011 1s ... 1s
Ran: 011
Passed all 1 tests
ok................
FSTYP         -- ext4
PLATFORM      -- Linux/ppc64 ibm-js12-vios-01-lp3 2.6.18-236.el5
MKFS_OPTIONS  -- /dev/loop1
MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:nfs_t:s0 /dev/loop1 /mnt/testarea/scratch

011 1s ... 1s
_check_generic_filesystem: filesystem on /dev/loop0 is inconsistent (see 011.full)
Ran: 011
Passed all 1 tests


011.full showed:
_check_generic filesystem: filesystem on /dev/loop0 is inconsistent
*** fsck.ext4 output ***
fsck 1.39 (29-May-2006)
e4fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences:  -132265
Fix? no

Free inodes count wrong for group #16 (8192, counted=8191).
Fix? no

Free inodes count wrong (327669, counted=327668).
Fix? no


/dev/loop0: ********** WARNING: Filesystem still has errors **********

/dev/loop0: 11/327680 files (0.0% non-contiguous), 55902/1310720 blocks
*** end fsck.ext4 output
*** mount output ***
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda2 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/loop1 on /mnt/testarea/scratch type ext4 (rw,acl,user_xattr,context="system_u:object_r:nfs_t:s0")
*** end mount output
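
The inode number reported by fsck can be mapped back to its block group with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the value of 8192 inodes per group is an assumption inferred from the "Free inodes count wrong for group #16 (8192, counted=8191)" line in the fsck output above. Inode numbers start at 1, hence the subtraction.

```shell
#!/bin/sh
# Hypothetical helpers, not part of xfstests or e2fsprogs.
# $1 = inode number as printed by fsck, $2 = inodes per group (assumed 8192 here).

# Block group that holds the inode.
inode_group() {
    echo $(( ($1 - 1) / $2 ))
}

# Zero-based bit position of the inode inside that group's inode bitmap.
inode_index_in_group() {
    echo $(( ($1 - 1) % $2 ))
}

inode_group 132265 8192            # prints 16
inode_index_in_group 132265 8192   # prints 1192
```

Under that assumption, inode 132265 falls in group 16, which matches the group whose free inode count fsck flagged.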


Version-Release number of selected component (if applicable):
# uname -rm
2.6.18-236.el5 ppc64
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)

How reproducible:
I encountered this problem three times.
First in https://beaker.engineering.redhat.com/jobs/39580
Then by manually running 011 on its own in a loop.

Steps to Reproduce:
1. Install and configure xfstests (see the README in the xfstests directory)
2. while ./check 011; do echo ok................; done
3.
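
Because the corruption only shows up after many iterations, it can help to record which iteration failed and to cap the number of runs. The following is a minimal sketch, not part of xfstests: `loop_until_failure` is a hypothetical helper, and it assumes, as the loop in step 2 does, that `./check 011` exits non-zero when the filesystem check fails.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times and stop at the first failure.
# Prints the 1-based iteration that failed, or "none" if every run passed.
# The command's own output is sent to stderr so stdout carries only the result.
loop_until_failure() {
    max_runs=$1
    shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        if ! "$@" >&2; then
            echo "$i"
            return 1
        fi
        i=$((i + 1))
    done
    echo none
    return 0
}
```

From the xfstests directory this could be used as `loop_until_failure 500 ./check 011`; the limit of 500 is arbitrary.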
  
Actual results:
Filesystem corruption found.

Expected results:
No filesystem corruption by running test case 011.

Additional info:
The host on which I triggered this problem manually is ibm-js12-vios-01-lp3.rhts.eng.bos.redhat.com.

Comment 1 Igor Zhang 2010-12-16 08:44:22 UTC

After more runs of the same loop on x86_64, I hit this problem there as well.

Comment 4 Lukáš Czerner 2011-01-17 21:59:52 UTC

Created attachment 473938
fix patch

Hello,

this issue has been fixed upstream, broken and then fixed again. So now it is _fixed_ upstream.

Quoting Eric Sandeen:

This bug was introduced in:
393418676a7602e1d7d3f6e560159c65c8cbd50e ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
I fixed it in:
7ce9d5d1f3c8736511daa413c64985a05b2feee3 ext4: fix ext4_free_inode() vs. ext4_claim_inode() race
it got broken again in:
955ce5f5be67dfe0d1d096b543af33fe8a1ce3dd ext4: Convert ext4_lock_group to use sb_bgl_lock
and ultimately fixed again in:
d17413c08cd2b1dd2bf2cfdbb0f7b736b2b2b15c ext4: clean up inode bitmaps manipulation in ext4_free_inode

This patch should fix the issue in RHEL5.6. It has been tested on 2.6.18-239.el5 i386 with the expected result (no corruption during the test).

Igor, could you please give it a try on other architectures?

Thanks!
-Lukas
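
The fixed, broken, fixed-again history quoted in comment 4 can be sketched as a small state machine over the four upstream commits. This is an illustration only: `race_state` is a hypothetical helper, it applies to upstream git history and not to RHEL kernels (which carry backports under different identifiers), and it assumes the hashes arrive oldest first.

```shell
#!/bin/sh
# Hypothetical helper: read full commit hashes on stdin, oldest first,
# and print the state of the inode bitmap race after the last one.
race_state() {
    state=unaffected
    while read -r hash; do
        case "$hash" in
            393418676a7602e1d7d3f6e560159c65c8cbd50e) state=broken ;;  # race introduced
            7ce9d5d1f3c8736511daa413c64985a05b2feee3) state=fixed ;;   # first fix
            955ce5f5be67dfe0d1d096b543af33fe8a1ce3dd) state=broken ;;  # broken again
            d17413c08cd2b1dd2bf2cfdbb0f7b736b2b2b15c) state=fixed ;;   # final fix
        esac
    done
    echo "$state"
}
```

In an upstream kernel checkout, something like `git log --reverse --format=%H | race_state` would feed it the history, with the caveat that `--reverse` on a history containing merges is only approximately chronological.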

Comment 5 Igor Zhang 2011-01-18 05:36:16 UTC

On x86_64 with kernel 2.6.18-239.el5, this problem still existed:
# uname -a
Linux intel-s3e36-01.rhts.eng.nay.redhat.com 2.6.18-239.el5 #1 SMP Tue Jan 4 13:13:58 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

FSTYP         -- ext4
PLATFORM      -- Linux/x86_64 intel-s3e36-01 2.6.18-239.el5
MKFS_OPTIONS  -- /dev/loop1
MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:nfs_t:s0 /dev/loop1 /mnt/testarea/scratch

011 1s ... 1s
_check_generic_filesystem: filesystem on /dev/loop0 is inconsistent (see 011.full)
Ran: 011
Passed all 1 tests


_check_generic filesystem: filesystem on /dev/loop0 is inconsistent
*** fsck.ext4 output ***
fsck 1.39 (29-May-2006)
e4fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences:  -134175
Fix? no

Free inodes count wrong for group #16 (8192, counted=8191).
Fix? no

Free inodes count wrong (327669, counted=327668).
Fix? no


/dev/loop0: ********** WARNING: Filesystem still has errors **********

/dev/loop0: 11/327680 files (0.0% non-contiguous), 55902/1310720 blocks
*** end fsck.ext4 output
*** mount output ***
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/loop1 on /mnt/testarea/scratch type ext4 (rw,acl,user_xattr,context="system_u:object_r:nfs_t:s0")
*** end mount output

Comment 8 Lukáš Czerner 2011-01-20 18:30:03 UTC

(In reply to comment #5)
> On x86_64 with kernel 2.6.18-239.el5, this problem still existed:

Are you sure you have applied the patch above? I am sorry, but this is not clear from your comment.

Anyway, with that patch I am not able to reproduce the corruption you have seen; however, I am seeing a different corruption:

EXT4-fs error (device sdb): file system corruption: inode #3538947 logical block 2 mapped to 925434301 (size 1)

which is kind of worrisome, but this is for a new BZ, I guess. I need to look at it more closely.

-Lukas

Comment 9 Igor Zhang 2011-01-21 02:29:15 UTC

(In reply to comment #8)
> (In reply to comment #5)
> > On x86_64 with kernel 2.6.18-239.el5, this problem still existed:
> 
> Are you sure you have applied the patch above? I am sorry, but this is not
> clear from your comment.
> 
> Anyway, with that patch I am not able to reproduce the corruption you have
> seen; however, I am seeing a different corruption:
> 
> EXT4-fs error (device sdb): file system corruption: inode #3538947 logical
> block 2 mapped to 925434301 (size 1)
> 
> which is kind of worrisome, but this is for a new BZ, I guess. I need to look at
> it more closely.
> 
> -Lukas

Quoting from your comment 4:
"This patch should fix the issue in RHEL5.6. It has been tested on
2.6.18-239.el5 i386 with expected result..."
I took this to mean that the fix was already in kernel 2.6.18-239.el5, so I simply tested against that kernel.

According to http://intranet.corp.redhat.com/ic/intranet/RHEL5ChangeLog2#23X.el5, the patch you mentioned has not been included in any kernel build since 2.6.18-236.el5.

I'll retest this problem when a new kernel build containing the fix is released.
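
One way to avoid this kind of confusion is to search the installed kernel package's changelog for the bug number before retesting. The sketch below is a hedged illustration: `changelog_mentions_bug` is a hypothetical helper, and it assumes that the kernel RPM changelog lists the Bugzilla ID next to each backported patch, which is not confirmed anywhere in this report.

```shell
#!/bin/sh
# Hypothetical helper: succeed if changelog text on stdin mentions bug ID $1
# as a standalone number (so 663563 does not match inside 1663563).
changelog_mentions_bug() {
    grep -Eq "(^|[^0-9])$1([^0-9]|\$)"
}
```

On an installed system it could be fed with `rpm -q --changelog kernel | changelog_mentions_bug 663563 && echo "patch present"`.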

Comment 10 Lukáš Czerner 2011-01-25 07:37:38 UTC

The patch mentioned in Comment 4 definitely fixes the bug. The other problems I have seen are not related to this bug and are not reproducible outside my environment.

Thanks!
-Lukas

Comment 11 RHEL Program Management 2011-02-01 16:56:46 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Eric Sandeen 2011-02-03 17:58:23 UTC

Requesting blocker since this is fs corruption.

Comment 13 Eric Sandeen 2011-02-03 17:58:44 UTC

*** Bug 667762 has been marked as a duplicate of this bug. ***

Comment 18 Jarod Wilson 2011-02-21 20:57:29 UTC

The fix is included in kernel-2.6.18-245.el5.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 21 Petr Beňas 2011-02-22 14:26:31 UTC

Reproduced in 2.6.18-244.el5 and verified in 2.6.18-245.el5.
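
Given that the bug reproduces in 2.6.18-244.el5 and is verified fixed in 2.6.18-245.el5, a test host can be screened by its kernel release string. This is a rough sketch: `has_fix` is a hypothetical helper, it only looks at the first numeric field after `2.6.18-`, and it assumes a mainline RHEL 5 build numbering, so it may misjudge z-stream kernels that carry the fix as a backport.

```shell
#!/bin/sh
# Hypothetical helper: succeed if release string $1 (as printed by `uname -r`)
# is a 2.6.18-based build numbered 245 or later.
# Returns 2 if the string is not a 2.6.18 kernel at all.
has_fix() {
    rel=${1#2.6.18-}                 # strip the base version
    [ "$rel" != "$1" ] || return 2   # prefix did not match
    build=${rel%%.*}                 # first field, e.g. 245 from 245.el5
    [ "$build" -ge 245 ]
}
```

A typical call would be `has_fix "$(uname -r)" || echo "kernel predates the fix"`.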

Comment 23 errata-xmlrpc 2011-07-21 10:10:54 UTC

An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html
