From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
I am getting the following messages when my system boots. /dev/hda1 is the boot partition. From dmesg:

Buffer I/O error on device hda1, logical block 526
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=1115, sector=1115
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 1115
JBD: IO error reading journal superblock
EXT3-fs: error loading journal.

The system does boot up OK, but /dev/hda1 is not mounted. Attempting to mount it gives the above errors again. /dev/hda2 mounts fine as swap, as does /dev/hda3 as /tmp. / is mounted on md0, which is a mirror (RAID1) of /dev/hdb1 and /dev/hde1.

fsck reports the following:

[root@linserv /]# fsck -V /dev/hda1
fsck 1.37 (21-Mar-2005)
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 /dev/hda1
e2fsck 1.37 (21-Mar-2005)
/boot1: Attempt to read block from filesystem resulted in short read while reading block 526
/boot1: Attempt to read block from filesystem resulted in short read reading journal superblock
fsck.ext3: Attempt to read block from filesystem resulted in short read while checking ext3 journal for /boot1

On checking FedoraForum I discovered another user with the exact same problem, which also started out with a short read. Since at least two people have had the problem since upgrading to the current kernel release, I felt it would be a good idea to have a developer check whether any new bugs in the handling of journalling in ext3 have been introduced.

Version-Release number of selected component (if applicable):
kernel version 2.6.12-1.1398_FC4

How reproducible:
Didn't try

Steps to Reproduce:
1. This is a disk corruption which I cannot reproduce.

Actual Results: n/a

Expected Results: n/a

Additional info:
The problem can be worked around by following the steps in the Red Hat Enterprise Linux Admin Guide: revert the filesystem to ext2 and remove the journal, then convert it back to ext3 and recreate the journal.
Please see the url listed for a more complete discussion of the problem on FedoraForum.org.
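For anyone who lands here without knowing how to recreate the journal, the workaround under "Additional info" can be sketched roughly as below. This is only a sketch, not the exact procedure from the Red Hat guide, which is not quoted in this report. The function names are made up for illustration; the tune2fs and e2fsck options are the standard e2fsprogs ones. It assumes the partition is unmounted, and tune2fs may refuse to drop a journal that still needs recovery. By default it only prints the commands.

```python
import subprocess


def journal_rebuild_commands(device):
    """Return the commands that drop and recreate an ext3 journal.

    Hypothetical helper for illustration; the device must be unmounted.
    """
    return [
        ["tune2fs", "-O", "^has_journal", device],  # remove the journal (ext3 -> ext2)
        ["e2fsck", "-f", device],                   # force a full check of the ext2 fs
        ["tune2fs", "-j", device],                  # add a new journal (ext2 -> ext3)
    ]


def rebuild_journal(device, dry_run=True):
    """Print the commands (dry run, the default) or actually run them."""
    commands = journal_rebuild_commands(device)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
            continue
        result = subprocess.run(argv)
        # e2fsck exits with 1 when it corrected errors, which is fine here;
        # anything else non-zero stops the sequence.
        allowed = {0, 1} if argv[0] == "e2fsck" else {0}
        if result.returncode not in allowed:
            raise RuntimeError("command failed: " + " ".join(argv))
    return commands
```

Calling rebuild_journal("/dev/hda1") prints the three commands without touching the disk. Since the new journal may not land on the same blocks as the old one, this hides the bad sector rather than repairing it.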
Buffer I/O error on device hda1, logical block 526
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=1115, sector=1115

is a sign of a hardware error. I don't think there's a bug here, just a bad sector which, due to seriously bad luck, landed bang in the middle of the journal. So yes, recreating the journal is the workaround; but the underlying problem is hardware, not software.
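To make those two register values easier to read, here is a small sketch that decodes them and checks that the failing sector and the failing filesystem block are the same place on disk. The bit names are the ones the old IDE driver prints. The partition start (sector 63) and the 1 KiB filesystem block size are assumptions, since neither is given in the report; they are typical for a small /boot partition of that era and they do reproduce the numbers in the log.

```python
# ATA status register bits, named as the IDE driver prints them.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# ATA error register bits (only meaningful when the Error status bit is set).
ERROR_BITS = {
    0x80: "BadSector",
    0x40: "UncorrectableError",
    0x10: "SectorIdNotFound",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table.items() if value & mask]


def sector_to_block(sector, partition_start, block_size, sector_size=512):
    """Map an absolute disk sector to a filesystem block inside a partition."""
    return (sector - partition_start) * sector_size // block_size


print(decode(0x51, STATUS_BITS))        # status from the log
print(decode(0x40, ERROR_BITS))         # error from the log
# Assumed layout: hda1 starts at sector 63 and uses 1024-byte blocks.
print(sector_to_block(1115, 63, 1024))  # compare with "logical block 526"
```

Under those assumptions, sector 1115 maps to block 526, the block that both the kernel and e2fsck failed to read. The error comes from the drive itself, and the kernel is only passing it on.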
Doesn't this raise the question of whether, when the journal was first created during formatting of the partition, there were adequate checks to make sure that there are no bad sectors, or that bad sectors are mapped out before the journal and other structures are written? Hardware problems may have caused this issue in one case, but there are more of us that have experienced this problem than just Mr. Tweedie. I believe there might be something else going on here, but I don't have the skill to say what it is. I just know that all of a sudden my partition became read-only, and since I didn't know how to recreate the journal, I wound up wiping the whole installation and reinstalling. Fortunately, I had a backup of my home partition, so it didn't take too long to get back to where I was.

Stephen
Bad sector checking is, by and large, just not useful on modern disk drives. If there has been an error on a sector, it gets remapped transparently on the next write; the O/S never sees it. Certainly, having code in e2fsprogs to rewrite the journal automatically if it detects this sort of thing could be useful. But the dma_intr: error=0x40 { UncorrectableError } error is just the kernel reporting what the disk drive told us about a bad sector. It's not something that the kernel can handle on its own.
The problem is, I didn't get this error message when my partition became read-only. I didn't get any error message at all. I rebooted and tried to run fsck, but didn't know how to answer the questions. That is when I wiped and reinstalled. Does the problem I had differ enough that it should be filed as another bug, leaving this one closed? Going by the URL mentioned in the Additional Bug Information, at least two of us had this problem, and it does not look to be connected with the bad sector error message that caplinux had. If it was a kernel error, it might be a moot point, since kernel 2.6.12-1.1447_FC4 just came through on Fedora Updates. I am just a bit paranoid about this, since it took the better part of a week to get all the programs set back up the way they were before. I don't want to have to go through this again.

Respectfully,
Stephen
The kernel *always* emits an error when turning the partition read-only. The only code in the whole of the kernel capable of making a partition read-only in response to an error (as opposed to in response to an explicit user request) unconditionally emits an error to say that it is doing so. Now, you may have missed it: if you were running under X, and the kernel error logs were on the root filesystem, and the root fs itself became read-only, then obviously you'd miss the console message and the syslog copy would not be writable. It's quite common not to *see* the error due to this combination. But it will still be produced. The question is how to capture it; a serial or network console is the recommended mechanism.
I guess I can accept that this was just a hardware problem and not a bug, and that it was just a coincidence that 3 people had the exact same problem running the exact same code. At least it has been pointed out, so that future occurrences can be treated as a possible sign of a real problem. My system has been running fine for several weeks now since recovering the journal.