303671 – "EXT3-fs unexpected failure: !buffer_revoked(bh);"

Bug 303671 - "EXT3-fs unexpected failure: !buffer_revoked(bh);"

Summary: "EXT3-fs unexpected failure: !buffer_revoked(bh);"

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Josef Bacik
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-09-24 17:27 UTC by Jordan Russell
Modified:	2008-05-13 14:59 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-13 14:59:31 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
fsck screen shots [1/3]: Pass 1, Pass 2 (45.37 KB, image/png) 2007-10-04 18:12 UTC, Jordan Russell	no flags	Details
fsck screen shots [2/3]: more Pass 2 (54.42 KB, image/png) 2007-10-04 18:13 UTC, Jordan Russell	no flags	Details
fsck screen shots [3/3]: more Pass 2, Pass 3, Pass 4 (39.06 KB, image/png) 2007-10-04 18:14 UTC, Jordan Russell	no flags	Details
View All

Description Jordan Russell 2007-09-24 17:27:09 UTC

Description of problem:
This morning a server unexpectedly stopped responding. Checked it and found that
the root partition had been remounted read-only due to a failure in the ext3
journal code. dmesg reports:

EXT3-fs unexpected failure: !buffer_revoked(bh);
inconsistent data on disk
ext3_forget: aborting transaction: IO failure in __ext3_journal_revoke
ext3_abort called.
EXT3-fs error (device sda2): ext3_forget: error -5 when attempting revoke
Remounting filesystem read-only
Aborting journal on device sda2.
EXT3-fs error (device sda2) in ext3_free_blocks_sb: Journal has aborted
EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda2) in ext3_truncate: IO failure
EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda2) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda2) in ext3_delete_inode: IO failure
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error


There were no out-of-the-ordinary messages prior to that. No ata errors, no
SMART errors, nothing to indicate any sort of HW issue.

This happened at 4:02 AM, apparently just after cron.daily jobs began to kick
off, some of which are fairly disk-intensive.

The only non-standard mount option I'm using is "noatime":

/dev/root / ext3 rw,noatime,data=ordered 0 0

Version-Release number of selected component (if applicable):
kernel-2.6.18-8.1.10.el5.x86_64
(stock, no third-party modules)

How reproducible:
Unknown. This is the first time I've ever seen this error (server was built just
two weeks ago). Never had any troubles like this with RHEL3 or RHEL4.

Steps to Reproduce:
1. Heavy disk load?

Comment 3 Josef Bacik 2007-10-03 13:49:09 UTC

that message points to corruption on the disk.  I would recommend running an 
fsck on the volume and see if the problem occurs again.  If it does please 
give me information about what type of operations you are doing during the 
time of the failure so I can try and reproduce in house.

Comment 4 Jordan Russell 2007-10-04 17:19:27 UTC

(In reply to comment #3)
> that message points to corruption on the disk.

Hrm, that's unsettling. The machine was never shut down uncleanly, and the
partition was never written to outside of the ext3 mount. And just prior to
putting the machine into production -- about a week before getting this error --
I ran a full fsck on all partitions (via 'touch /forcefsck'); no problems were
found.

> I would recommend running an fsck on the volume

I rebooted (cleanly) after the error was discovered, and was forced to run a
manual fsck. On Pass 1 it reported:

Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes
Inode 19664102 was part of the orphaned inode list.  FIXED.

On Pass 2 it found about 15 entries with "deleted/unused inodes". They all
appeared to belong to files that should no longer exist (e.g. old log files,
expired news articles), so I told it to clear them.

On Pass 4 it found a few inodes with incorrect ref counts ("2, should be 1").

In the end, it appears no data was actually lost or corrupted.

> and see if the problem occurs again.  If it does please 
> give me information about what type of operations you are doing during the 
> time of the failure so I can try and reproduce in house.

From the files listed in fsck's Pass 2, I suspect it may have been running the
cron.daily script for INN (an NNTP news server), which, among other things,
looks for and deletes expired news articles (each of which is kept in a separate
file). Nothing really fancy going on there...

Thus far the error hasn't occurred again. If it does, I'll try to collect more
details. Suggestions for specific things to check would be helpful.

Comment 5 Eric Sandeen 2007-10-04 17:34:14 UTC

Do you have the actual full output of fsck, and can you attach it?

Thanks,
-Eric

Comment 6 Jordan Russell 2007-10-04 18:12:52 UTC

Created attachment 216181 [details]
fsck screen shots [1/3]: Pass 1, Pass 2

Comment 7 Jordan Russell 2007-10-04 18:13:41 UTC

Created attachment 216191 [details]
fsck screen shots [2/3]: more Pass 2

Comment 8 Jordan Russell 2007-10-04 18:14:40 UTC

Created attachment 216201 [details]
fsck screen shots [3/3]: more Pass 2, Pass 3, Pass 4

Comment 9 Jordan Russell 2007-10-04 18:28:20 UTC

I wasn't able to capture the actual text output, but I did get some KVM-over-IP
screen shots of the fsck process. Passes 1 through 3 are shown in their entirety
(with some overlap). I neglected to take a screen shot of the final fsck output,
but IIRC there were no further errors beyond what's shown in the last image.

Comment 10 Eric Sandeen 2007-10-04 18:29:47 UTC

Thanks.  I asked because I wondered if this might be related to the corruption
Stephen mentioned when he converted this sort of error from a BUG() to a
filesystem shutdown:

        http://lkml.org/lkml/2004/11/19/391

Thanks,
-Eric

Comment 11 Josef Bacik 2007-10-05 14:35:42 UTC

ok since disk corruption isn't likely I'll come up with a testcase that will 
create a bunch of files, walk through them all and stat them and then delete 
randomly and see if I can reproduce this problem.  Let me know if this sounds 
close to what your application is doing, and if not any suggestions on what to 
do better would be appreciated.

Comment 12 Jordan Russell 2008-04-05 04:57:41 UTC

Sorry about the lack of updates.

I never did find any way to reproduce this, and there have been no reoccurrences
of the error on the system in the 5 months since I filed this report.

Feel free to close...

Comment 13 Josef Bacik 2008-05-13 14:59:31 UTC

Ok, you can reopen it if you find a way to reproduce.

Note You need to log in before you can comment on or make changes to this bug.