Bug 155162

Summary: ext3 journal aborts
Product: [Fedora] Fedora
Reporter: David Juran <djuran>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: high
Docs Contact:
Priority: medium
Version: 4
CC: davej, pfrields, sct
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-05-04 13:41:37 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Juran 2005-04-17 11:22:48 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050325 Firefox/1.0.2 (Debian package 1.0.2-1)

Description of problem:
Today when I powered on my dual CPU system, at startup I got these error messages:

ext3_free_blocks_sb: bit already cleared for block 6644165
Journal has aborted
ext3_reserve_inode_write: Journal has aborted
ext3_truncate: Journal has aborted
ext3_reserve_inode_write: Journal has aborted
ext3_orphan_del: Journal has aborted
ext3_reserve_inode_write: Journal has aborted
ext3_delete_inode: Journal has aborted

ext3_abort_called

ext3_journal_start_sb: Detected aborted Journal
Remounting filesystem read only.


Looking back at the log files, I found the following entry from yesterday, which might be relevant (hdb2 is my root fs):

Apr 16 11:30:23 c83-248-2-72 kernel: hdb: dma_timer_expiry: dma status == 0x60
Apr 16 11:30:23 c83-248-2-72 kernel: hdb: DMA timeout retry
Apr 16 11:30:23 c83-248-2-72 kernel: hdb: timeout waiting for DMA
Apr 16 11:30:28 c83-248-2-72 kernel: hdb: status timeout: status=0xd0 { Busy }
Apr 16 11:30:28 c83-248-2-72 kernel:
Apr 16 11:30:28 c83-248-2-72 kernel: ide: failed opcode was: unknown
Apr 16 11:30:28 c83-248-2-72 kernel: hda: DMA disabled
Apr 16 11:30:28 c83-248-2-72 kernel: hdb: drive not ready for command
Apr 16 11:30:28 c83-248-2-72 kernel: ide0: reset: success


Also, last night, running 'rpm -qa' as root gave a weird error message indicating that the rpmdb was locked, while running the same command as an unprivileged user worked just fine. I could, however, shut the system down cleanly (to the best of my knowledge).

Version-Release number of selected component (if applicable):
kernel-smp-2.6.11-1.1231_FC4

How reproducible:
Didn't try

Steps to Reproduce:
1. Run with kernel-smp-2.6.11-1.1231_FC4 for a couple of days?

  

Additional info:

Comment 1 Stephen Tweedie 2005-04-18 13:14:03 UTC
The journal abort message simply means that something happened that the
filesystem considered too serious to continue to write to the disk.  In this case,

ext3_free_blocks_sb: bit already cleared for block 6644165

you've got corruption in either a bitmap or indirect block: you'll need to force
a full fsck on the filesystem (the journal abort should record an error status
in the journal that will force a full fsck automatically on the next boot.)

But that doesn't tell us where the original error came from; the

Apr 16 11:30:23 c83-248-2-72 kernel: hdb: dma_timer_expiry: dma status == 0x60
Apr 16 11:30:23 c83-248-2-72 kernel: hdb: DMA timeout retry
Apr 16 11:30:23 c83-248-2-72 kernel: hdb: timeout waiting for DMA
Apr 16 11:30:28 c83-248-2-72 kernel: hdb: status timeout: status=0xd0 { Busy }

errors indicate that the root cause of this problem is probably in the IDE
layer, not in the filesystem at all.
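
A minimal sketch of how one might check a saved kernel log for this pattern, that is, IDE DMA timeouts on a drive appearing alongside ext3 errors. The regular expressions are assumptions modelled only on the messages quoted in this report, and `scan_kernel_log` is a hypothetical helper name, not a tool shipped with the kernel or Fedora.

```python
import re

# Patterns modelled on the messages quoted in this report (assumptions,
# not an exhaustive list of what the IDE or ext3 code can print).
IDE_ERROR = re.compile(
    r"\b(hd[a-z]): (dma_timer_expiry|DMA timeout retry|"
    r"timeout waiting for DMA|status timeout)"
)
EXT3_ERROR = re.compile(
    r"EXT3-fs error|Journal has aborted|bit already cleared|"
    r"Detected aborted [Jj]ournal"
)


def scan_kernel_log(lines):
    """Return (ide_hits, ext3_hits), each a list of (line_number, text)."""
    ide_hits, ext3_hits = [], []
    for number, line in enumerate(lines, start=1):
        if IDE_ERROR.search(line):
            ide_hits.append((number, line.strip()))
        elif EXT3_ERROR.search(line):
            ext3_hits.append((number, line.strip()))
    return ide_hits, ext3_hits


if __name__ == "__main__":
    sample = [
        "Apr 16 11:30:23 c83-248-2-72 kernel: hdb: dma_timer_expiry: dma status == 0x60",
        "Apr 16 11:30:23 c83-248-2-72 kernel: hdb: timeout waiting for DMA",
        "ext3_free_blocks_sb: bit already cleared for block 6644165",
        "ext3_reserve_inode_write: Journal has aborted",
    ]
    ide, ext3 = scan_kernel_log(sample)
    print(len(ide), "IDE DMA messages;", len(ext3), "ext3 messages")
```

Seeing IDE hits at earlier line numbers than the ext3 hits would be consistent with the I/O layer failing first, though a log scan alone cannot prove causation.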




Comment 2 David Juran 2005-04-18 18:06:18 UTC
Well, I realize this report is a bit thin on detail )-: And yes, I managed to
do a full recovery by running fsck manually.
I have the log from the fsck run, but I doubt that it would give you
substantially more to go on: a couple of illegal blocks cleared, an 'Extended
attribute block with reference count 15 instead of 16', a 'free blocks count
wrong for group #202' ...
I've now upgraded to kernel-smp-2.6.11-1.1240_FC4, but if the problem
recurs, is there anything I could log from a running kernel that would help you
pinpoint the error if I notice that the filesystem is acting up again?

Comment 3 Stephen Tweedie 2005-04-18 18:14:18 UTC
The incorrect xattr refcounts are most likely the result of a reference counting
bug in the initial FC3 release which we've since fixed.  It is conceivable that
the "bit already cleared" also resulted from that same problem, as an incorrect
xattr refcount could in theory lead to such a block being released early.  Can I
assume you have SELinux enabled on this filesystem?
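
To illustrate the reasoning only: a toy model, not ext3 code, of how an under-counted reference count on a shared xattr block could lead to the same block being freed twice. The class `ToyFilesystem`, its methods, and block number 1000 are all hypothetical; the counts 15 and 16 are borrowed from the fsck message quoted in Comment 2.

```python
class ToyFilesystem:
    """Toy model: a block bitmap plus one shared xattr block with a refcount."""

    def __init__(self, xattr_block, stored_refcount):
        self.allocated = {xattr_block}    # blocks marked in-use in the bitmap
        self.xattr_block = xattr_block
        self.refcount = stored_refcount   # recorded count, possibly too low

    def free_block(self, block):
        # Freeing a block whose bitmap bit is already clear is the error case.
        if block not in self.allocated:
            raise RuntimeError(f"bit already cleared for block {block}")
        self.allocated.remove(block)

    def drop_xattr_reference(self):
        """Called when one inode sharing the xattr block is deleted."""
        if self.refcount == 1:
            # The recorded count says this is the last user: free the block.
            self.free_block(self.xattr_block)
        else:
            self.refcount -= 1


if __name__ == "__main__":
    # 16 inodes really share the block, but only 15 references are recorded.
    fs = ToyFilesystem(xattr_block=1000, stored_refcount=15)
    try:
        for _ in range(16):
            fs.drop_xattr_reference()
    except RuntimeError as err:
        print(err)
```

In this model the fifteenth deletion frees the block while one inode still points at it, and the sixteenth deletion then tries to free it again. Whether that is what actually happened on this filesystem is not established by the report.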

Comment 4 David Juran 2005-04-18 18:21:52 UTC
Yes, I have SELinux enabled (though I did briefly turn it off at boot time just
before installing this kernel). This was also the first kernel I've been
running that deviated from the ones released for FC3.

Comment 5 Stephen Tweedie 2005-04-18 22:11:06 UTC
OK, then it's possible that the filesystem/fsck complaints were just due to the
old xattr bug.  That still leaves the ATA DMA complaints, though.

Can you please report back if you see any further filesystem problems even with
recent kernels?


Comment 7 David Juran 2005-07-28 16:04:25 UTC
A very similar thing happened again, this time with
kernel-smp-2.6.12-1.1398_FC4. One note that might be of value is that this
happened under heavy filesystem stress while running yum, copying a DVD and a
couple of other things.
Below is an excerpt from dmesg:

hdb: dma_timer_expiry: dma status == 0x60
hdb: DMA timeout retry
hdb: timeout waiting for DMA
hdb: status timeout: status=0xd0 { Busy }

ide: failed opcode was: unknown
hda: DMA disabled
hdb: drive not ready for command
ide0: reset: success
UDF-fs INFO UDF 0.9.8.1 (2004/29/09) Mounting volume 'DVDVolume', timestamp
2036/02/07 10:58 (1000)
SELinux: initialized (dev hdd, type udf), uses genfs_contexts
EXT3-fs error (device hdb2): ext3_add_entry: bad entry in directory #230378:
rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Aborting journal on device hdb2.
EXT3-fs error (device hdb2) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device hdb2) in ext3_dirty_inode: Journal has aborted
ext3_abort called.
EXT3-fs error (device hdb2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device hdb2) in start_transaction: Journal has aborted
EXT3-fs error (device hdb2) in ext3_create: IO failure
__journal_remove_journal_head: freeing b_committed_data
.
.
.
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
.
.
.
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error
journal commit I/O error


Comment 8 Dave Jones 2005-09-30 06:45:44 UTC
Mass update to all FC4 bugs:

An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream
kernel (2.6.13.2). As there were ~3500 changes upstream between this and the
previous kernel, it's possible your bug has been fixed already.

Please retest with this update, and update this bug if necessary.

Thanks.


Comment 9 Dave Jones 2005-11-10 19:51:27 UTC
2.6.14-1.1637_FC4 has been released as an update for FC4.
Please retest with this update, as a large amount of code has been changed in
this release, which may have fixed your problem.

Thank you.


Comment 10 Dave Jones 2006-02-03 07:02:06 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in Bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 11 John Thacker 2006-05-04 13:41:37 UTC
Closing per previous comment.