From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040116

Description of problem:
From time to time my dmesg output fills with the following message:
EXT3-fs error (device hda6) in start_transaction: Journal has aborted.
There is no info in the system logs because hda6 is /var, and when the error occurs it is remounted read-only. It has already happened twice since the upgrade from FC1 to FC2; each time the system had been left idle over the weekend. The only way to "repair" the system is to reboot the machine. Then everything seems OK.

The system is a standard PC without any special hardware: Microstar motherboard with a VIA vt82c686a IDE controller and a Pentium III 800 MHz. The disk is divided into the following partitions:
/dev/hda1 on / type ext3 (rw)
/dev/hda3 on /usr type ext3 (rw)
/dev/hda5 on /tmp type ext3 (rw)
/dev/hda6 on /var type ext3 (rw)
/dev/hda7 on /home type ext3 (rw)

I got this error only from /dev/hda6. This is probably because the system was idle, so only writes to /var had been occurring.

Version-Release number of selected component (if applicable):
2.6.5-1.358

How reproducible:
Didn't try

Steps to Reproduce:

Additional info:
I found some info regarding this error on Kernel Traffic, but it concerns previous kernel versions (2.6.0). Now it occurs with 2.6.5-1.358.
I have this issue as well after upgrading from FC1 to FC2 on my Dell PowerEdge 1650 with SCSI. Each morning when I come in, the system is displaying that error to the screen over and over. I'm unable to reboot the machine and have to power-cycle it to get it back up.
The "journal is aborted" is just ext3's normal reaction to certain error conditions. An IO failure on a critical part of the disk (eg. the journal) can be enough to trigger it. So the root cause for this error can just as easily be a hardware or driver fault as anything to do with ext3. Without more complete logs, there's just no way to diagnose why it happened --- I really need to see the first message from ext3 that mentions the abort, as that will tell us _why_ ext3 decided to abort the journal in the first place.
I get the same error scrolling on the console but it never shows up in the messages file, nor do I ever see a reason for the Journal aborted in the messages log.
If the messages file is on the same filesystem as the abort occurred on, you won't be able to capture it that way --- by definition, the filesystem goes readonly on abort. A serial console cable is often the best way to trap the exact error in these cases.
Mine turned out to be a bad hard drive.
I've had this intermittently on several occasions since "upgrading" from FC1 to FC2. At first I thought it must be a hardware fault with one of the hard discs, but by now it has happened to me with ext3 filesystems on four different discs on two different machines.

Machine 1: Compaq DeskPro EN 866MHz (Intel 815e chipset?), IDE hard discs
Machine 2: Intel 440BX motherboard, IDE hard discs

Both machines are running FC2, kernel 2.6.5-1.358 (standard i686 build as shipped with the distro). The fault last occurred on machine 2 while moving and deleting some old files and directories from the /var partition on /dev/hdc1. Logged in through ssh, my first indicator of a problem was the filesystem going read-only. dmesg output produced immediately afterwards:

EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339721: rec_len % 4 != 0 - offset=0, inode=572662306, rec_len=8738, name_len=34
Aborting journal on device hdc1.
ext3_abort called.
EXT3-fs abort (device hdc1): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339723: rec_len % 4 != 0 - offset=0, inode=572662306, rec_len=8738, name_len=34
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339725: rec_len % 4 != 0 - offset=0, inode=3995472896, rec_len=61439, name_len=237
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339727: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339729: directory entry across blocks - offset=0, inode=1038630372, rec_len=15856, name_len=244
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339731: rec_len % 4 != 0 - offset=0, inode=954087615, rec_len=14563, name_len=3
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339733: rec_len % 4 != 0 - offset=0, inode=1025719571, rec_len=15719, name_len=137
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339735: rec_len % 4 != 0 - offset=0, inode=839659960, rec_len=12866, name_len=138
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339739: directory entry across blocks - offset=0, inode=1048788601, rec_len=16024, name_len=104
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339742: rec_len % 4 != 0 - offset=0, inode=850342558, rec_len=12987, name_len=199
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339744: rec_len % 4 != 0 - offset=0, inode=2535210785, rec_len=60019, name_len=48
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339746: rec_len % 4 != 0 - offset=0, inode=16843009, rec_len=257, name_len=1
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339748: rec_len % 4 != 0 - offset=0, inode=4294967295, rec_len=15621, name_len=69
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339750: rec_len is smaller than minimal - offset=84, inode=4784204, rec_len=0, name_len=0
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339752: rec_len % 4 != 0 - offset=0, inode=267112565, rec_len=18059, name_len=12
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339754: rec_len % 4 != 0 - offset=0, inode=4294791301, rec_len=60415, name_len=29
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339756: directory entry across blocks - offset=0, inode=15493130, rec_len=49152, name_len=233
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339758: directory entry across blocks - offset=0, inode=3221225626, rec_len=18408, name_len=34
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339760: rec_len % 4 != 0 - offset=0, inode=2312715081, rec_len=2125, name_len=117
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339764: directory entry across blocks - offset=0, inode=1190966301, rec_len=23296, name_len=195
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339766: rec_len % 4 != 0 - offset=0, inode=1374456661, rec_len=33617, name_len=101
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339768: rec_len % 4 != 0 - offset=0, inode=3354686857, rec_len=63557, name_len=158
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory #339770: rec_len % 4 != 0 - offset=0, inode=2784166633, rec_len=51339, name_len=131
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted

Rebooted the machine and had to run fsck manually - multiple errors requiring manual confirmation, but the machine was able to boot normally again afterwards. Will report details if it happens again (limiting it to just the first few lines if that's all that's relevant).
This happens to me every time I try to upgrade to the latest kernel rpm (435-smp) (on 3 different, identical machines!), and now it has happened to a machine that had been running for a week (on 427). (See my post on the mailing list.) All the machines that had this problem were: Dell 2450, dual 866 MHz CPUs, 1GB RAM, PERC 3/Si RAID, 2 drives configured as a hardware mirror. I created a custom boot CD that does a kickstart network install, so all the installs were exactly the same as well. I had syslog going to a different machine, and here are some log entries from where it broke:

Jun 20 19:06:06 dc17 syslogd: /var/log/messages: Read-only file system
Jun 20 19:06:06 dc17 kernel: EXT3-fs error (device sda2): ext3_free_blocks: Freeing blocks not in datazone - block = 1297821449, count = 1
Jun 20 19:06:06 dc17 kernel: Aborting journal on device sda2.
Jun 20 19:06:06 dc17 kernel: EXT3-fs error (device sda2): ext3_free_blocks: Freeing blocks not in datazone - block = 3494284090, count = 1
Jun 20 19:06:07 dc17 kernel: ext3_abort called.
Jun 20 19:06:07 dc17 kernel: EXT3-fs abort (device sda2): ext3_journal_start: Detected aborted journal
Jun 20 19:06:07 dc17 kernel: Remounting filesystem read-only
Jun 20 19:06:07 dc17 kernel: EXT3-fs error (device sda2) in start_transaction: Journal has aborted
Jun 20 19:06:07 dc17 kernel: EXT3-fs error (device sda2): ext3_free_blocks: Freeing blocks not in datazone - block = 4015646881, count = 1
Jun 20 19:06:07 dc17 kernel: EXT3-fs error (device sda2): ext3_free_blocks: Freeing blocks not in datazone - block = 910005768, count = 1
Jun 20 19:06:07 dc17 kernel: EXT3-fs error (device sda2): ext3_free_blocks: Freeing blocks not in datazone - block = 569029524, count = 1

.. several pages of these, then a few of these ..

Jun 20 19:06:16 dc17 kernel: EXT3-fs error (device sda2) in start_transaction: Journal has aborted

.. many more of these ..

Jun 20 19:07:27 dc17 kernel: EXT3-fs error (device sda2): ext3_free_blocks: Freeing blocks not in datazone - block = 2512934586, count = 1

.. then this ..

Jun 20 19:07:28 dc17 kernel: ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_journal_get_write_access<2>EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
Jun 20 19:07:28 dc17 kernel: EXT3-fs error (device sda2) in ext3_truncate: Journal has aborted
Jun 20 19:07:28 dc17 kernel: ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_journal_get_write_access<2>EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
Jun 20 19:07:28 dc17 kernel: EXT3-fs error (device sda2) in ext3_orphan_del: Journal has aborted
Jun 20 19:07:28 dc17 kernel: ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_journal_get_write_access<2>EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
Jun 20 19:07:29 dc17 kernel: EXT3-fs error (device sda2) in ext3_delete_inode: Journal has aborted
Jun 20 19:07:29 dc17 kernel: __journal_remove_journal_head: freeing b_committed_data
Jun 20 19:07:30 dc17 kernel: EXT3-fs error (device sda2) in start_transaction: Journal has aborted
All I can see from these is that there's some data corruption on disk. The problems are not consistent between the reports here; some of the reports are UP, some SMP. I suspect we're going to need more data to figure out the common issue here, whether it's something stomping on kernel memory, or driver issues, or filesystem issues, or possibly several different problems altogether. Are these cases using ACLs, SELinux, quotas? Exactly what problems (if any) does fsck report afterwards? Are there any other driver errors in the logs? Is there any pattern of workload that seems to trigger the problem? The "every time I try to upgrade to the latest kernel rpm" one looks really weird. I wonder if there's a PERC driver bug in the previous kernel version there. Trying the upgrade with a UP kernel would be one thing worth doing.
One other thing --- whenever this occurs, we need to look carefully for the last kernel messages *before* the journal goes offline. Journal aborts are usually due to some form of IO error, either an IO failure or corrupted data from an IO, and it is important to know whether or not there are any driver log messages prior to the abort which could indicate a problem.
Usually the only way to take this sort of bug report forward is either just to collect enough data that patterns become apparent, or to find a specific reproducer for the problem. The only obviously reproducible error here is the "every time I try to upgrade to the latest kernel rpm (435-smp)" one. From the fedora-list email:

>>>>
All I'm doing is 'yum update', and it downloads the latest kernel (single and smp), then does a test transaction, then updates the single cpu rpm, then it tries to update the SMP kernel rpm, and it blows up, with a whole bunch of these:
scsi1 (0:0): rejecting I/O to offline device
>>>>

which is pretty clearly not a filesystem problem at root --- if the filesystem is getting its I/O rejected because the underlying device is offline, then the journal abort is definitely its expected response! Have you been able to capture any more from the logs? We really need to know why the device went offline to take this particular case any further.
Created attachment 101432 [details]
FC2 ks.cfg file used to install all my machines

This will tell you exactly what I've got installed on my machine when I then do a yum upgrade and it causes the FS to go south on a kernel update. Before the kernel update, I had already applied most of the other updates available.
I meant to add this comment before the one above, so here goes..

Unfortunately I don't think I was remotely logging the machines at the time I was having the kernel-upgrade crashes. I have since put FC1 on most of the machines, because I need to actually do something with them, but I do have 1 or 2 that still have FC2 installed, so I'll try to do the upgrade again, save the output from a console, and set up remote syslog. Is there anything special you want me to do before/when I try the upgrade?

Looking through the logs I do have for the crash on Jun 20th (which was random, not from a kernel upgrade attempt), here are the first messages to appear in the log in several hours (same as above):

Jun 20 19:06:06 dc17 syslogd: /var/log/messages: Read-only file system
Jun 20 19:06:06 dc17 kernel: EXT3-fs error (device sda2): ext3_free_blocks: Freeing blocks not in datazone - block = 1297821449, count = 1
Jun 20 19:06:06 dc17 kernel: Aborting journal on device sda2.

There were no other messages from that host logged before the above entries. All of the machines I had the problem with went through the exact same install via the same remote ks file. Nothing special, pretty well stripped down except for a webserver, basic X (not even window managers), gcc, etc.

Actually, why don't I just upload the ks.cfg file and you can take a look for yourself.

-ray
The "ext3_free_blocks: Freeing blocks not in datazone - block = 1297821449, count = 1" seems to be a different pattern from the "death on kernel upgrade." But in both cases I've really got no idea what the underlying problem is --- really, capturing as much log output as possible is going to be the best way to narrow things down right now. Thanks.
I have this issue too - the original "EXT3-fs error (device hda2) in start_transaction: Journal has aborted." - happening regularly after approx 3 days of uptime. Could you tell me the best way to troubleshoot this once it has happened? I still have access to xterms/console (although other applications have been killed off, and any remote sessions are unavailable). Some commands are still available to me while the machine is in this error state without producing an I/O error, while others (commands like 'more', for example) do produce I/O errors. Many thanks.
I have some additional info that might be helpful:
- the error occurs on different filesystems (my disk has: /, /usr, /tmp, /var, /home)
- the disk was SMART checked and it did not report any errors
- before the "journal aborted" there is a message: "EXT3-fs error (device hda7): ext3_free_blocks: bit already cleared for block 8939931"
Created attachment 103633 [details] /var/log/messages file with error on hda3
Created attachment 103634 [details] /var/log/messages file with error on hda5
Created attachment 103635 [details] /var/log/messages file with error on hda7
Do current rawhide kernels show the same problem? Does memtest86 show hardware problems? Right now, all I can tell is that you're getting memory or IO corruption somewhere. There's nothing conclusive showing whether it's a kernel or hardware fault, or whether it's memory or disk or something in between.
1. Well, I did not try rawhide kernel. 2. Memtest86 does not show any errors (but I will run it once more)
I am also experiencing these problems on IDE drives on a uniprocessor machine under kernels as recent as kernel-2.6.8-1.521. I can reproduce the problem within about 24 hours of runtime by doing full amanda backups. The error messages are either "freeing blocks not in datazone", "directory #NNNNN contains a hole at offset XXXX" or "attempt to access beyond end of device". I previously experienced this problem over the past 9 months on this machine under RH9 as well, so it has probably been around for a while; it was the problem that led me to try upgrading to FC2 in the first place. Suspecting "bit rot" on the drive, I have swapped the hard drive and IDE cable, which did not fix the problem. I do not see file data corruption, only problems with the filesystem structure metadata being corrupted, which also leads me to suspect that this is not a hardware problem. Oddly, the files that are corrupted are rarely ones we actually use, and certainly aren't ones we are modifying (e.g. a Makefile for an M68K software package we have never run, or a file in a graphics package -- and we don't run X on this machine).
Created attachment 103726 [details] output of dmesg
Created attachment 103727 [details] /proc/pci configuration for machine of kent
Created attachment 103729 [details] /proc/cpuinfo for machine of kent
Created attachment 103730 [details] /proc/iomem for machine of kent
Created attachment 103731 [details] /proc/ioports for machine of kent
Created attachment 103732 [details] /proc/modules for machine of kent
Created attachment 103733 [details] /sbin/lspci output for machine of kent
But are you actually checking for file data corruption? Also, if there are memory errors and the files concerned are mostly being accessed for read, it's entirely possible to get errors in memory which don't get propagated back to disk. That can easily result in a system passing a simple data verify, but still failing with data errors under load. Re memtest comment above: running memtest86 once is not a good test of memory. I usually recommend at least a full overnight run. http://www.bit-net.com/~rmiller/dt.html is a useful test program that verifies the data it writes, when there's a suspicion of bad hardware. A problem like this that's reproducible on a wide range of kernels definitely sounds like a hardware problem.
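For reference, the basic idea behind a tool like dt is just write-then-verify: write a known pattern to the suspect storage, read it back, and compare. The fragment below is not dt itself, only a minimal C sketch of that idea with an arbitrary scratch-file path and sizes; a serious test should also defeat the page cache (e.g. via O_DIRECT, discussed further down) and run for many passes.

/* Minimal write-then-verify sketch (what dt automates). Point it at a
 * scratch file on the suspect filesystem, never at live data. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t bufsz = 64 * 1024;     /* 64 KB test buffer */
    const int blocks = 1024;            /* 64 MB of test data */
    unsigned char *wbuf = malloc(bufsz), *rbuf = malloc(bufsz);
    FILE *f;
    size_t j;
    int i;

    if (!wbuf || !rbuf)
        return 1;
    for (j = 0; j < bufsz; j++)         /* simple repeating pattern */
        wbuf[j] = (unsigned char)(j & 0xff);

    f = fopen("/tmp/dt-scratch.dat", "w+b");
    if (!f) { perror("fopen"); return 1; }

    for (i = 0; i < blocks; i++)        /* write pass */
        if (fwrite(wbuf, 1, bufsz, f) != bufsz) { perror("fwrite"); return 1; }
    fflush(f);

    rewind(f);
    for (i = 0; i < blocks; i++) {      /* read-back and compare pass */
        if (fread(rbuf, 1, bufsz, f) != bufsz) { perror("fread"); return 1; }
        if (memcmp(wbuf, rbuf, bufsz) != 0)
            fprintf(stderr, "data mismatch in block %d\n", i);
    }

    fclose(f);
    remove("/tmp/dt-scratch.dat");
    printf("verify pass complete\n");
    return 0;
}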
Of course, before running any disk check program, make double-sure that the filesystem you're running on is already free from errors by fscking it (or just run on a newly-mkfs'ed filesystem). You can "touch /forcefsck" then reboot to force a full check of all filesystems.
To check for file corruption, I run tripwire each night, which performs an MD5 checksum (among other tests) over most (but not all) of the files on my system. If there were data corruption, the checksums would differ. Thus far (i.e. in the last 9 months going back to RH9), the only files that tripwire reports as bad are the ones where the metadata has gone bad.

One thing I noticed yesterday while logging messages from an fsck session is that many (but not all) of the bad files have illegal blocks as blocks 8 and/or 9 (almost never 1-7). Looking back through my console log, I see that this is consistent. Often, fsck also winds up changing the number of allocated blocks in the inode from 8 down to 1 (or 0).

To try to make progress, I've copied all of my ext3 filesystems onto new ext2 filesystems on another drive, and have booted from the ext2 drive. This is a drive which had gotten these types of errors when its filesystems were ext3. If the problem doesn't re-occur over the next few days on the ext2 filesystems, I'm going to suspect even more strongly that the problem is specific to ext3.

As for running fsck, I wind up doing this almost daily, which makes it seem to me that I can reproduce the problem in about 24 hours. Amanda kicks off at 1 am. As long as I don't run amanda full backups, I rarely see any problems; when I do run amanda full backups, I generally find errors in the morning logs that occurred while the backups were running. I suspect that the backups generate enough disk traffic to cause the problem to occur. Tripwire normally runs at 4 am each day, and I check the logs for ext3 errors and the tripwire report at about 9 am each day.

Thanks for the suggestion about dt. I've looked at it. I suppose I ought to run it on a raw disk partition (or drive). Is /dev/raw/* no longer included in the kernel by default? i.e. What's the best way to name a raw disk partition in FC2? Specifying (for example) /dev/hdc7 to dt gets (not surprisingly) the block device. Do I need to re-gen the kernel to get raw disk device support?
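For anyone unfamiliar with tripwire, the core of what it does per file is just computing a digest and comparing it against a stored baseline. Below is a minimal C sketch of that idea using OpenSSL's MD5 routines - this is not tripwire's own code, the file path is an arbitrary example, and you'd link with -lcrypto.

/* Not tripwire itself -- a minimal sketch of checksum-based integrity
 * checking: hash a file so the digest can be compared to a baseline. */
#include <openssl/md5.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/etc/hosts";    /* example file to checksum */
    unsigned char buf[8192], digest[MD5_DIGEST_LENGTH];
    MD5_CTX ctx;
    size_t n;
    int i;
    FILE *f = fopen(path, "rb");

    if (!f) { perror("fopen"); return 1; }

    MD5_Init(&ctx);
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        MD5_Update(&ctx, buf, n);       /* hash the file in chunks */
    MD5_Final(digest, &ctx);
    fclose(f);

    for (i = 0; i < MD5_DIGEST_LENGTH; i++)
        printf("%02x", digest[i]);      /* compare this against the baseline */
    printf("  %s\n", path);
    return 0;
}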
The raw driver had a ton of limitations, such as an upper limit of 255 raw devices active at once. The new way to achieve the same thing on 2.6 kernels is to open the device with O_DIRECT instead.

Is the problem _always_ in blocks 8/9? Or just usually? That's a strange symptom: there's just nothing special about those blocks in the fs code (the first "special" block in a file is block 12, when we start using indirect blocks.) Are there any particular apps you've got that might be doing something unusual at that offset in a file? Is there any pattern to the files which show corruption?

Are you using xattrs? Quotas? What other modules are in use -- there could be one stomping on ext3's memory here. Running a kernel-debug kernel would be one way to try to catch that.

Could you capture a log of the fsck output at some point, showing the exact errors it detects, please?

As for running dt, I think there's value both in running it on a fs itself _and_ on the raw device. Each provides different info.

Sorry, but there are lots of questions here. There's simply no obvious pattern at present; we have tons of users using ext3 _very_ heavily without these problems being reported, and in our internal testing I've never seen them reported either. So it's going to be a bit of a hunt to find what's special in your case.
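To illustrate what "open the device with O_DIRECT" looks like in practice, here is a minimal, hedged C sketch. The device path /dev/hdc7 is just the partition mentioned earlier in this report; the 512-byte alignment and 4 KB read size are assumptions (O_DIRECT requires sector-aligned buffers and transfer sizes), and real code should query the device rather than hard-code them.

/* Minimal sketch: reading a block device with O_DIRECT on a 2.6 kernel. */
#define _GNU_SOURCE                     /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t blksz = 4096;          /* read size, multiple of sector size */
    void *buf;
    ssize_t n;
    int fd;

    /* O_DIRECT requires the user buffer to be suitably aligned */
    if (posix_memalign(&buf, 512, blksz) != 0) {
        perror("posix_memalign");
        return 1;
    }

    fd = open("/dev/hdc7", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    n = read(fd, buf, blksz);           /* bypasses the page cache */
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes directly from the device\n", n);

    close(fd);
    free(buf);
    return 0;
}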
Here's an update: I've had the problem appear on ext2 filesystems as well. I can reproduce it using rsync to copy my /home filesystem from one partition to another. I'll attach a "typical" fsck output for /home. I am not running xattrs or quotas. The list of modules I use is included in the prior /proc/modules attachment, I believe. I'll try dt with "flags=direct" on a filesystem, and on a raw drive and report on what I find. I've run memtest86 overnight in extended tests mode with no errors.
Created attachment 104413 [details]
fsck output of /home

Scenario: In single-user mode, ran fsck cleanly on /home. Ran "rsync -uax --verbose /home/ /mnt/home_copy/". Re-ran fsck on /home and got the errors in this attachment. In theory the only writes to /home should have been updates to the file access times of the inodes in /home. /home is /dev/hda6. /mnt/home_copy is /dev/hdc6.
Kent,

Some notes:

First, tripwire does not help detect data corruption if the corruption is occurring on writes (it will only end up checksumming the incorrect data!)

Does fsck run cleanly if you run it again after fixing up the problems?

This particular problem has never felt like an ext3 problem; if you can reproduce it with ext2, that's confirmed. It might still be hardware; it might be some other part of the system scribbling on the inode structure in memory (ext2 and ext3 keep direct blocks 8/9 at the same offset into their inode structs, although the structs are rather different sizes.)

The "dt" program (http://www.bit-net.com/~rmiller/dt.html) is an excellent way to start hunting for the first possibility. Building a debug kernel (especially one with CONFIG_DEBUG_SLAB set) would be a good first step towards isolating the second.
Agreed about tripwire - it wouldn't detect data changing on writes. My point was that it would detect "bit rot" in the data, i.e. errors on re-reading. Fsck runs clean after fixing up any problems. I saw some single-bit errors with dt after a few hundred passes writing to a raw disk partition. I'm going to conclude based on this that it's a hardware defect, or else that I need to use hdparm to back down the UDMA mode. You can close this bug as far as I'm concerned... Thanks for the help.
OK, will do --- thanks for helping to chase this down.