Description of problem: This morning a server unexpectedly stopped responding. Checked it and found that the root partition had been remounted read-only due to a failure in the ext3 journal code. dmesg reports: EXT3-fs unexpected failure: !buffer_revoked(bh); inconsistent data on disk ext3_forget: aborting transaction: IO failure in __ext3_journal_revoke ext3_abort called. EXT3-fs error (device sda2): ext3_forget: error -5 when attempting revoke Remounting filesystem read-only Aborting journal on device sda2. EXT3-fs error (device sda2) in ext3_free_blocks_sb: Journal has aborted EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted EXT3-fs error (device sda2) in ext3_truncate: IO failure EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted EXT3-fs error (device sda2) in ext3_orphan_del: Journal has aborted EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted EXT3-fs error (device sda2) in ext3_delete_inode: IO failure __journal_remove_journal_head: freeing b_committed_data __journal_remove_journal_head: freeing b_committed_data __journal_remove_journal_head: freeing b_committed_data journal commit I/O error There were no out-of-the-ordinary messages prior to that. No ata errors, no SMART errors, nothing to indicate any sort of HW issue. This happened at 4:02 AM, apparently just after cron.daily jobs began to kick off, some of which are fairly disk-intensive. The only non-standard mount option I'm using is "noatime": /dev/root / ext3 rw,noatime,data=ordered 0 0 Version-Release number of selected component (if applicable): kernel-2.6.18-8.1.10.el5.x86_64 (stock, no third-party modules) How reproducible: Unknown. This is the first time I've ever seen this error (server was built just two weeks ago). Never had any troubles like this with RHEL3 or RHEL4. Steps to Reproduce: 1. Heavy disk load?
that message points to corruption on the disk. I would recommend running an fsck on the volume and see if the problem occurs again. If it does please give me information about what type of operations you are doing during the time of the failure so I can try and reproduce in house.
(In reply to comment #3) > that message points to corruption on the disk. Hrm, that's unsettling. The machine was never shut down uncleanly, and the partition was never written to outside of the ext3 mount. And just prior to putting the machine into production -- about a week before getting this error -- I ran a full fsck on all partitions (via 'touch /forcefsck'); no problems were found. > I would recommend running an fsck on the volume I rebooted (cleanly) after the error was discovered, and was forced to run a manual fsck. On Pass 1 it reported: Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes Inode 19664102 was part of the orphaned inode list. FIXED. On Pass 2 it found about 15 entries with "deleted/unused inodes". They all appeared to belong to files that should no longer exist (e.g. old log files, expired news articles), so I told it to clear them. On Pass 4 it found a few inodes with incorrect ref counts ("2, should be 1"). In the end, it appears no data was actually lost or corrupted. > and see if the problem occurs again. If it does please > give me information about what type of operations you are doing during the > time of the failure so I can try and reproduce in house. From the files listed in fsck's Pass 2, I suspect it may have been running the cron.daily script for INN (an NNTP news server), which, among other things, looks for and deletes expired news articles (each of which is kept in a separate file). Nothing really fancy going on there... Thus far the error hasn't occurred again. If it does, I'll try to collect more details. Suggestions for specific things to check would be helpful.
Do you have the actual full output of fsck, and can you attach it? Thanks, -Eric
Created attachment 216181 [details] fsck screen shots [1/3]: Pass 1, Pass 2
Created attachment 216191 [details] fsck screen shots [2/3]: more Pass 2
Created attachment 216201 [details] fsck screen shots [3/3]: more Pass 2, Pass 3, Pass 4
I wasn't able to capture the actual text output, but I did get some KVM-over-IP screen shots of the fsck process. Passes 1 through 3 are shown in their entirety (with some overlap). I neglected to take a screen shot of the final fsck output, but IIRC there were no further errors beyond what's shown in the last image.
Thanks. I asked because I wondered if this might be related to the corruption Stephen mentioned when he converted this sort of error from a BUG() to a filesystem shutdown: http://lkml.org/lkml/2004/11/19/391 Thanks, -Eric
ok since disk corruption isn't likely I'll come up with a testcase that will create a bunch of files, walk through them all and stat them and then delete randomly and see if I can reproduce this problem. Let me know if this sounds close to what your application is doing, and if not any suggestions on what to do better would be appreciated.
Sorry about the lack of updates. I never did find any way to reproduce this, and there have been no reoccurrences of the error on the system in the 5 months since I filed this report. Feel free to close...
Ok, you can reopen it if you find a way to reproduce.