588599 – Kernel BUG at fs/ext3/super.c:425

Summary: Kernel BUG at fs/ext3/super.c:425

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Josef Bacik
QA Contact:	Igor Zhang

Reported:	2010-05-04 03:10 UTC by Tao Ma
Modified:	2018-11-14 14:14 UTC
CC List:	5 users
Doc Type:	Bug Fix
Last Closed:	2011-01-13 21:30:44 UTC

Attachments
a temp fix. (478 bytes, patch) 2010-05-04 03:30 UTC, Tao Ma	no flags
potential fix (2.18 KB, patch) 2010-05-04 18:47 UTC, Josef Bacik	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0017	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update	2011-01-13 10:37:42 UTC

Description Tao Ma 2010-05-04 03:10:09 UTC

Description of problem:
__journal_remove_journal_head: freeing b_frozen_data 
ext3_abort called. 
EXT3-fs error (device loop0): ext3_put_super: Couldn't clean up the journal 
sb orphan head is 49156 
sb_info orphan list: 
  inode loop0:49156 at ffff880017156a78: mode 100644, nlink 1, next 49153 
  inode loop0:49153 at ffff88001f59f488: mode 100644, nlink 1, next 0 
Assertion failure in ext3_put_super() at fs/ext3/super.c:426: 
"list_empty(&sbi->s_orphan)" 
----------- [cut here ] --------- [please bite here ] --------- 
Kernel BUG at fs/ext3/super.c:425 


Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 5.5

How reproducible:
It depends; the bug does not trigger on every run.

Steps to Reproduce:
1. Create a large sparse file on an ext3 volume. The file size should be larger than the free space left on the volume.
2. Set up this file as a loopback device.
3. Make a filesystem on the loop device, mount it, and do direct I/O to files on it (see comment 7).
4. If step 3 ends with ENOSPC, unmount the loop-mounted volume; the kernel BUG shows up.
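Step 1 can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the reporter's script: it assumes Linux, the helper name `make_oversized_sparse_file()` is hypothetical, and it takes "free space left" to mean the `f_bavail` figure that `fstatvfs()` reports. Because `ftruncate()` only sets the file size, no data blocks are allocated and the file stays sparse.

```c
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t even on 32-bit builds */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Hypothetical helper (assumption, not from the report): create `path` as a
 * sparse file whose apparent size exceeds the space still available on its
 * filesystem by `extra_bytes`. Returns the resulting apparent size in bytes,
 * or -errno on failure. */
static long long make_oversized_sparse_file(const char *path, long long extra_bytes)
{
    assert(path != NULL && extra_bytes > 0);

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    struct statvfs vfs;
    struct stat st;
    int err = 0;

    if (fstatvfs(fd, &vfs) != 0) {
        err = errno;
    } else {
        /* Space available to unprivileged writers, in bytes. */
        long long free_bytes = (long long)vfs.f_bavail * (long long)vfs.f_frsize;
        /* ftruncate() only sets the size; no data blocks are allocated,
         * so the file is sparse. */
        if (ftruncate(fd, (off_t)(free_bytes + extra_bytes)) != 0 ||
            fstat(fd, &st) != 0)
            err = errno;
    }
    close(fd);
    return err ? -err : (long long)st.st_size;
}
```

The file can then be attached with `losetup /dev/loop0 <file>`, as in the commands in comments 5 and 7.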
  
Actual results:


Expected results:


Additional info:

Comment 1 Tao Ma 2010-05-04 03:18:45 UTC

Actually, I have investigated the bug: it is caused by ext3_direct_IO(). For a write, we first add the inode to the orphan list and then, after blockdev_direct_IO(), remove it again. But there is a corner case: what if adding succeeds but removing fails? Then the inode is left on the superblock's orphan list (sbi->s_orphan), so unmounting triggers the kernel BUG.
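To make the corner case concrete, here is a simplified userspace model in C. It is not kernel code: the counter `orphans` stands in for `sbi->s_orphan`, a flag stands in for an aborted journal, and every `model_*` name is hypothetical. It follows the order described in comment 1 (add to the orphan list, do the I/O, then start a second transaction to remove the inode) and shows that an early return when that second transaction fails leaves the list non-empty, which is what the `list_empty(&sbi->s_orphan)` assertion in `ext3_put_super()` catches.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified userspace model, not kernel code. All model_* names are
 * hypothetical. `orphans` stands in for the in-memory list sbi->s_orphan;
 * `journal_aborted` stands in for a journal shut down by ext3_abort(). */
struct model_sb {
    int orphans;
    bool journal_aborted;
};

struct model_inode {
    struct model_sb *sb;
    bool on_orphan_list;
};

/* Stand-in for starting a transaction: fails once the journal is aborted. */
static int model_journal_start(struct model_sb *sb)
{
    return sb->journal_aborted ? -EROFS : 0;
}

static void model_orphan_add(struct model_inode *inode)
{
    assert(!inode->on_orphan_list);
    inode->on_orphan_list = true;
    inode->sb->orphans++;
}

static void model_orphan_del(struct model_inode *inode)
{
    assert(inode->on_orphan_list);
    inode->on_orphan_list = false;
    inode->sb->orphans--;
}

/* Orphan the inode, do the I/O, then start a second transaction to remove
 * it. The early return on the second start skips the removal. */
static int model_direct_write_buggy(struct model_inode *inode,
                                    bool io_aborts_journal)
{
    int err = model_journal_start(inode->sb);
    if (err)
        return err;
    model_orphan_add(inode);

    /* The direct I/O would run here; in this model a failure during it
     * aborts the journal (the log above shows ext3_abort being called). */
    if (io_aborts_journal)
        inode->sb->journal_aborted = true;

    err = model_journal_start(inode->sb);
    if (err)
        return err;            /* inode stays on the orphan list */
    model_orphan_del(inode);
    return 0;
}

/* Stand-in for the unmount-time check: true if the orphan list is empty. */
static bool model_put_super_ok(const struct model_sb *sb)
{
    return sb->orphans == 0;
}
```

With a healthy journal the model leaves the list empty; once the I/O step aborts the journal, `model_put_super_ok()` returns false, mirroring the assertion failure in the report.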

Comment 2 Tao Ma 2010-05-04 03:30:56 UTC

Created attachment 411173 [details]
a temp fix.

Here is my temporary fix; I used what ext4_ind_direct_IO does as a reference.
But we could also call ext3_truncate there and change ext3_truncate as in
mainline commit ef43618a47179b41e7203a624f2c7445e7da488c. That way we would need to change two places and a bit more code.
I am not familiar enough with ext3, so I chose the simplest way to work around it.
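The workaround direction described here can be modeled in the same hypothetical userspace terms. This is not the attached patch and not kernel code; all `model_*` names are invented. The idea sketched is: when the second transaction cannot be started, still take the inode off the in-memory orphan list (similar in spirit to the ext4_ind_direct_IO error path cited above), so the unmount-time check sees an empty list.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model, not kernel code: `orphans` stands in for
 * sbi->s_orphan and `journal_aborted` for a journal shut down by
 * ext3_abort(). */
struct model_sb { int orphans; bool journal_aborted; };
struct model_inode { struct model_sb *sb; bool on_orphan_list; };

/* Stand-in for starting a transaction: fails once the journal is aborted. */
static int model_journal_start(struct model_sb *sb)
{
    return sb->journal_aborted ? -EROFS : 0;
}

/* Removal from the in-memory list needs no transaction in this model. */
static void model_orphan_del_nohandle(struct model_inode *inode)
{
    if (inode->on_orphan_list) {
        inode->on_orphan_list = false;
        inode->sb->orphans--;
    }
}

static int model_direct_write_fixed(struct model_inode *inode,
                                    bool io_aborts_journal)
{
    int err = model_journal_start(inode->sb);
    if (err)
        return err;
    assert(!inode->on_orphan_list);
    inode->on_orphan_list = true;      /* orphan add */
    inode->sb->orphans++;

    if (io_aborts_journal)             /* I/O failure aborts the journal */
        inode->sb->journal_aborted = true;

    err = model_journal_start(inode->sb);
    /* Whether or not the transaction started, drop the inode from the
     * in-memory list; the error (if any) is still reported to the caller.
     * The model does not track the on-disk orphan chain. */
    model_orphan_del_nohandle(inode);
    return err;
}
```

The write still reports the error, but the orphan list is empty either way, so the unmount-time assertion no longer fires in the model.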

Comment 3 Josef Bacik 2010-05-04 14:45:11 UTC

Your patch seems to be correct, but it's not upstream.  Please post it upstream and then we can pull it into RHEL.  Thanks,

Josef

Comment 4 Josef Bacik 2010-05-04 14:47:07 UTC

Duh, of course I then went to look at upstream and saw that we call ext3_truncate there.  I'll backport the upstream patches.

Comment 5 Josef Bacik 2010-05-04 18:47:26 UTC

Created attachment 411368 [details]
potential fix

I tried to reproduce your problem, but I couldn't.  This is what I did

mkfs.ext3 -b 4096 /dev/sdb1 (500m partition)
mount /dev/sdb1 /mnt/test
dd if=/dev/zero of=/mnt/test/file bs=1M seek=1000 count=1
losetup /dev/loop0 /mnt/test/file
dd if=/dev/zero of=/dev/loop0 bs=4k oflag=direct count=262144

Here is a backport of a couple of upstream patches that should fix the problem.  It's a little odd that it is failing in journal_start, but if that's what's happening then this fix is definitely needed.  Can you verify that this patch fixes the problem?  Thanks,

Josef

Comment 6 Tao Ma 2010-05-04 23:56:00 UTC

(In reply to comment #4)
> Duh, of course I then go look at upstream and see what's done there and we call
> ext3_truncate there.  I'll backport the upstream patches.    

So you have chosen the second way I described above: use ext3_truncate and backport commit ef43618a47179b41e7203a624f2c7445e7da488c. ;)
I have no objection to it.

Comment 7 Tao Ma 2010-05-05 00:02:08 UTC

(In reply to comment #5)
> Created an attachment (id=411368) [details]
> potential fix
> I tried to reproduce your problem, but I couldn't.  This is what I did
Yes, it is not very easy to trigger. Sorry for not pasting the test script at the beginning.
> mkfs.ext3 -b 4096 /dev/sdb1 (500m partition)
> mount /dev/sdb1 /mnt/test
> dd if=/dev/zero of=/mnt/test/file bs=1M seek=1000 count=1
> losetup /dev/loop0 /mnt/test/file
> dd if=/dev/zero of=/dev/loop0 bs=4k oflag=direct count=262144
Here the loop volume needs to be formatted, mounted, and then written into:
mkfs.ext3 /dev/loop0
mount /dev/loop0 /mnt/test1
cd /mnt/test1
dd if=/dev/zero of=testfile1 bs=1024k & 
dd if=/dev/zero of=testfile2 bs=1024k oflag=direct & 
dd if=/dev/zero of=testfile3 bs=24k  & 
dd if=/dev/zero of=testfile4 bs=1024k oflag=direct seek=10000 & 

Sometimes you will trigger it.
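The `oflag=direct` writers in this script can be approximated in C. This is an assumption-laden stand-in for `dd ... oflag=direct`, not part of the original test: the helper name `direct_write_blocks()` is hypothetical, it assumes Linux and that a 4096-byte buffer alignment satisfies the device's O_DIRECT requirements, and it covers a single writer without the `seek=` or the concurrent buffered writers of the script. It writes zero-filled blocks until ENOSPC (the condition step 4 of the description waits for) or until a block limit.

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (assumption): write up to max_blocks zero-filled blocks
 * of block_size bytes to path with O_DIRECT. Stops early on ENOSPC or a short
 * write. Returns the number of full blocks written, or -errno on any other
 * failure. */
static long long direct_write_blocks(const char *path, size_t block_size,
                                     long long max_blocks)
{
    assert(block_size > 0 && block_size % 512 == 0);

    /* O_DIRECT needs buffer, length and offset aligned to the device's
     * logical block size; 4096 is assumed to be sufficient here. */
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, block_size) != 0)
        return -ENOMEM;
    memset(buf, 0, block_size);

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        int err = errno;
        free(buf);
        return -err;
    }

    long long written = 0;
    while (written < max_blocks) {
        ssize_t n = write(fd, buf, block_size);
        if (n < 0) {
            int err = errno;
            if (err == ENOSPC)
                break;              /* out of space: stop like dd would */
            close(fd);
            free(buf);
            return -err;
        }
        if ((size_t)n < block_size)
            break;                  /* short write: treat as out of space */
        written++;
    }
    close(fd);
    free(buf);
    return written;
}
```

On a filesystem that does not support O_DIRECT, `open()` fails with EINVAL and the helper returns -EINVAL.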
> Here is a backport of a couple of upstream bz's that should fix the problem. 
> It's a little odd that the thing is failing in the journal_start, but if thats
> what's happening then this fix is definitely needed.  Can you verify this patch
> fixes the problem?  Thanks,
OK, I will test it.
> Josef

Comment 8 Tao Ma 2010-05-05 14:25:16 UTC

Yeah, we have tested the patch, and it works. Thanks.

Comment 9 Josef Bacik 2010-05-05 15:24:00 UTC

Great, thanks for doing all of the actual work :).

Josef

Comment 10 Josef Bacik 2010-05-05 15:33:51 UTC

BTW this is a backport of

ef43618a47179b41e7203a624f2c7445e7da488c
7eb4969e04060dcf3fbd46af9c21b1059b853068

Comment 11 RHEL Program Management 2010-08-06 05:50:10 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Jarod Wilson 2010-08-11 00:12:26 UTC

in kernel-2.6.18-211.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 17 errata-xmlrpc 2011-01-13 21:30:44 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html
