Bug 166151

Summary:	Short read causes journal corruption in ext3 fs
Product:	[Fedora] Fedora	Reporter:	Ken Presser <capnlinux>
Component:	kernel	Assignee:	Stephen Tweedie <sct>
Status:	CLOSED NOTABUG	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4	CC:	davej, hafflys, wtogami
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
URL:	http://forums.fedoraforum.org/forum/showthread.php?t=72860&highlight=short+read
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-08-30 11:36:09 UTC	Type:	---

Description Ken Presser 2005-08-17 15:23:35 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
I am getting the following message when my system boots. /dev/hda1 is the boot partition. From dmesg:

Buffer I/O error on device hda1, logical block 526
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=1115, sector=1115
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 1115
JBD: IO error reading journal superblock
EXT3-fs: error loading journal.

The system does boot up OK, but /dev/hda1 is not mounted. Attempting to mount it gives the above errors again. /dev/hda2 mounts as swap fine as does /dev/hda3 as /tmp. / is mounted on md0, which is mirrored (RAID1) drives /dev/hdb1 and /dev/hde1.

fsck reports the following:

[root@linserv /]# fsck -V /dev/hda1
fsck 1.37 (21-Mar-2005)
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 /dev/hda1
e2fsck 1.37 (21-Mar-2005)
/boot1: Attempt to read block from filesystem resulted in short read while reading block 526

/boot1: Attempt to read block from filesystem resulted in short read reading journal superblock

fsck.ext3: Attempt to read block from filesystem resulted in short read while checking ext3 journal for /boot1

On checking FedoraForum I found another user with exactly the same problem, which also started with a short read. Because at least two people have hit this since upgrading to the current kernel release, I felt it would be a good idea to have a developer check whether any new bugs in the handling of journalling in ext3 have been introduced.

Version-Release number of selected component (if applicable):
kernel version 2.6.12-1.1398_FC4

How reproducible:
Didn't try

Steps to Reproduce:
1. This is a disk corruption which I cannot reproduce.

Actual Results: n/a

Expected Results: n/a

Additional info:

Following the steps in the Red Hat Enterprise Linux Admin Guide to revert the filesystem to ext2 (which removes the journal) and then convert it back to ext3 (which recreates the journal) works around the problem.
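
For readers who have not done this conversion before, here is a minimal sketch of the command sequence it usually involves. It only prints the commands and does not run them. The function name is made up for illustration, the exact steps in the Admin Guide may differ, and the sketch assumes the filesystem is unmounted and the commands are run as root.

```python
def journal_rebuild_commands(device):
    """Return the usual ext3 -> ext2 -> ext3 sequence as argv lists (nothing is executed)."""
    return [
        ["tune2fs", "-O", "^has_journal", device],  # drop the journal, leaving plain ext2
        ["e2fsck", "-f", device],                   # force a full check of the ext2 filesystem
        ["tune2fs", "-j", device],                  # add a fresh journal, making it ext3 again
    ]


# Print the commands for review; /dev/hda1 is the partition named in this report.
for argv in journal_rebuild_commands("/dev/hda1"):
    print(" ".join(argv))
```

Printing rather than executing is deliberate: these commands rewrite filesystem metadata, so they are better reviewed and typed by hand.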

Comment 1 Ken Presser 2005-08-17 15:28:52 UTC

Please see the URL listed for a more complete discussion of the problem on
FedoraForum.org.

Comment 2 Stephen Tweedie 2005-08-30 11:36:09 UTC

Buffer I/O error on device hda1, logical block 526
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=1115, sector=1115

is a sign of a hardware error.  I don't think there's a bug here --- just a bad
sector which, due to seriously bad luck, landed bang in the middle of the
journal.  So yes, recreating the journal is the workaround; but the underlying
problem is hardware, not software.
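
As a rough cross-check, the two numbers in the log (logical block 526 on hda1 and disk sector 1115) can be related with simple arithmetic. This sketch assumes a 1 KiB filesystem block size and a partition that starts at sector 63. Neither value is stated in this report, but both were common for a small /boot partition on disks of that era, and together they map block 526 onto sector 1115. The function name is illustrative only.

```python
SECTOR_SIZE = 512  # bytes per sector on IDE disks of this period


def block_to_disk_sectors(block, block_size, partition_start):
    """Map a filesystem block number to the absolute disk sectors that hold it."""
    sectors_per_block = block_size // SECTOR_SIZE
    first = partition_start + block * sectors_per_block
    return list(range(first, first + sectors_per_block))


# Assumed values: 1024-byte blocks, partition starting at sector 63.
print(block_to_disk_sectors(526, block_size=1024, partition_start=63))  # [1115, 1116]
```

Under those assumptions the failing sector is the first half of block 526, which is the block both the kernel and e2fsck complained about.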

Comment 3 Stephen Haffly 2005-08-30 15:16:00 UTC

Does this then raise the question of whether, when the journal was first
created during the formatting of the partition, there were inadequate checks
for bad sectors, or whether bad sectors should be mapped out before the
journal and other structures are written?

Hardware problems may have caused this issue in one case, but there are more of
us that have experienced this problem than just Mr. Tweedie.

I believe there might be something else going on here, but I don't have the
skill to say what it is.  I just know that all of a sudden, my partition became
read-only, and since I didn't know how to recreate the journal, I wound up
wiping the whole installation and reinstalling.  Fortunately, I had a backup of
my home partition so it didn't take too long to get back to where I was.

Stephen

Comment 4 Stephen Tweedie 2005-08-30 15:42:00 UTC

Bad sector checking is by-and-large just not useful on modern disk drives.  If
there has been an error on a sector, it gets remapped transparently on the next
write; the O/S never sees it.  

Certainly, having code in e2fsprogs to re-write the journal automatically if it
detects this sort of thing could be useful.  But the 

dma_intr: error=0x40 { UncorrectableError }

error is just the kernel reporting what the disk drive told us about a bad
sector --- it's not something that the kernel can handle on its own.
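
To illustrate how the bracketed names relate to the hex values, here is a small sketch that decodes the status and error bytes into the flag names the old IDE driver printed. The bit tables are written from memory of that driver and of the ATA register layout, so treat them as an assumption rather than a copy of the kernel source.

```python
# ATA status register bits, using the names the old IDE driver printed (assumed).
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# ATA error register bits (assumed; the media-change bits 0x20 and 0x08 are omitted).
ERROR_BITS = [
    (0x80, "BadSector"),
    (0x40, "UncorrectableError"),
    (0x10, "SectorIdNotFound"),
    (0x04, "DriveStatusError"),
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """Return the names of all bits in `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x40, ERROR_BITS))   # ['UncorrectableError']
```

Both values are filled in by the drive itself, which is why the message points at the hardware rather than at the filesystem code.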

Comment 5 Stephen Haffly 2005-08-31 03:39:36 UTC

The problem is, I didn't get this error message when my partition became
read-only.  I didn't get any error message at all.  I rebooted, and tried to do
an fsck, but didn't know how to answer the questions.  That is when I wiped and
reinstalled.

Does the problem I had differ enough that it should be filed as a separate bug,
leaving this one closed?  In the thread at the URL listed for this bug, at
least two of us had this problem, and it does not look to be connected with the
bad sector error message that capnlinux had.

If it was a kernel error, it might be a moot point since kernel
2.6.12-1.1447_FC4 just came through on Fedora Updates.

I am just a bit paranoid about this since it did take the better part of a
week to get all the programs set back up the way they were before.  I don't
want to have to go through this again.

Respectfully,

Stephen

Comment 6 Stephen Tweedie 2005-08-31 10:24:34 UTC

The kernel *always* emits an error when turning the partition read-only.  The
only code in the whole of the kernel capable of making a partition read-only in
response to an error (as opposed to in response to an explicit user request)
unconditionally emits an error to say that it is doing so.

Now, you may have missed it --- if you were running under X, and the kernel
error logs were on the root filesystem, and the root fs itself became readonly,
then obviously you'd miss the console message and the syslog copy would not be
writable.  It's quite common not to *see* the error due to this combination. 
But it will still be produced.  The question is how to capture it; serial or
network console is the recommended mechanism.
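
As one possible way to capture such messages over the network, here is a minimal sketch of a UDP listener meant to run on a second machine. It assumes the affected machine is sending its kernel messages with netconsole to UDP port 6666, which is believed to be netconsole's usual default target port. The function names are made up for illustration.

```python
import socket
import sys

NETCONSOLE_PORT = 6666  # assumption: the usual default target port for netconsole


def format_record(sender_ip, payload):
    """Turn one received datagram into log lines prefixed with the sender's address."""
    text = payload.decode("utf-8", errors="replace")
    return [f"{sender_ip}: {line}" for line in text.splitlines() if line.strip()]


def listen(port=NETCONSOLE_PORT, out=sys.stdout):
    """Receive datagrams forever and write each kernel message line to `out`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
        for line in format_record(sender_ip, payload):
            print(line, file=out, flush=True)


# On the logging host, call listen() and leave it running.
```

Because the messages are stored on a different machine, they survive even when the sender's root filesystem has gone read-only.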

Comment 7 Ken Presser 2005-09-01 12:18:33 UTC

I guess I can accept that this was just a hardware problem and not a bug, and
that it was just a coincidence that 3 people had the exact same problem running
the exact same code.

At least it has been pointed out, so that future occurrences might be suspected
of indicating a real problem.

My system has been running fine for several weeks now since recovering the journal.