From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
About three days after upgrading to kernel 2.4.21-20, I received hundreds of EXT3 error messages in /var/log/messages:

Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)): ext3_new_block: Allocating block in system zone - block = 96
Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)): ext3_new_block: Allocating block in system zone - block = 160
Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)): ext3_new_block: Allocating block in system zone - block = 161
Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)): ext3_new_block: Allocating block in system zone - block = 162

At the time, I was using the SMP kernel on an Intel P4 with hyperthreading. Hardware: brand new HP server, Compaq RAID SCSI controller with two RAID1 disks. This system showed no errors and was up for over a year without problems.

When I noticed the errors, the machine was still running. An 'ls' in the root directory showed no files. The hard disks showed no indication of a problem. I brought the machine to run level 1 to run fsck. Fsck repaired hundreds of file system errors. When it was finished, there were no files left. The entire system had been destroyed and it would not boot.

I was forced to do a bare metal restore from tape, and I downgraded the kernel to the one that shipped on the RHEL 3 CDs. I also ran the UP kernel instead of SMP. Since the kernel downgrade, the server has been up for a month with no sign of the previous errors. I speculate that the problem is in the ext3 code and gets triggered when using the SMP kernel.

Version-Release number of selected component (if applicable):
3

How reproducible:
Didn't try

Steps to Reproduce:
1. use kernel vmlinuz-2.4.21-20.ELsmp with ext3
2.
3.

Additional info:
Unfortunately there's not enough information to even begin to diagnose this. Without the e2fsck logs or full kernel logs, all I can see is that there was bitmap corruption in the first part of the filesystem. Now, corruption near the start of the filesystem would also be consistent with losing the root directory. If that happened, e2fsck would recover most of your files, but would place them into /lost+found. "No files left" is not a situation that e2fsck normally results in, but if the root directory is gone, all files ending up in /lost+found is expected. So it looks as if we lost some data at the beginning of the disk, including the initial block bitmap and root directory. But even that is just a guess; as for guessing *why*, I've no idea, but it could easily be bad hardware, some other component of the kernel or a driver stomping on memory, a RAID fault, etc. The problem description also sounds odd: if an "ls" showed no files in the root directory, how did you successfully manage to bring it to runlevel 1? This sounds undiagnosable as it stands, but I'll ask whether anyone knows about problems with cciss in that release.
Even though an "ls" showed no files in the root directory, I could "cd" into a directory that I knew was there, like "cd bin". Then I could "ls" and see all the files in /bin. Very strange. Somehow, it got to runlevel 1. After the e2fsck, the only thing left was the lost+found directory, but there were no files or partial files in lost+found. It was empty. I only found this out after booting from a rescue CD, because the reboot after the e2fsck failed: the boot loader could not find the kernel. I found an old bug from 2002 where the ext3_new_block error showed up when many small files were written to disk at the same time. From reading the kernel mailing list archives, it was not clear that the problem was completely resolved, only mitigated. It is also possible that the bug is in the HP SCSI RAID driver (cciss). I would gladly provide you with logs and core dumps if I had them, but everything was blown away after the e2fsck. I only have logs from the backup the night before it died, showing all the ext3 errors.
The ability to "cd" through the empty directory was probably because "ls" was trying to read the directory, which was broken, whereas "cd" was simply traversing the existing in-core directory cache tree (which would have been pinned in memory because executable files from /bin were still running). The ext3_new_block error simply indicates that ext3 has detected certain on-disk data corruption. On its own it tells us nothing about how the filesystem got into that state. Running e2fsck on a live, still-mounted filesystem is extremely dangerous, and may have contributed to the problem. ext3 is tested very extensively, and I've got no reason to believe it still has massive fs-destroying bugs. About all I can suggest is that you open a support ticket to find out if there are any known problems with the particular combination of CCISS driver version, firmware version, disks, and other software you may have running.
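
For anyone collecting evidence for such a support ticket, a short script can pull the affected block numbers out of the saved log so they can be reported compactly. The following is only a minimal sketch: the regular expression is inferred from the four sample lines quoted in the original report, the names `LINE_RE`, `summarize`, and `SAMPLE` are hypothetical, and other EXT3-fs messages may use a different layout that this pattern will not match.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample lines in the report (an assumption, not an
# official log format). The device name itself contains parentheses, e.g.
# "cciss0(104,1)", so the non-greedy group runs up to the first "): ".
LINE_RE = re.compile(
    r"EXT3-fs error \(device (?P<device>.+?)\): "
    r"(?P<func>\w+): .*block = (?P<block>\d+)"
)

def summarize(lines):
    """Group reported block numbers by (device, reporting function)."""
    blocks = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match.group("device"), match.group("func"))
            blocks[key].add(int(match.group("block")))
    # Sort so that the lowest and highest affected blocks are easy to read off.
    return {key: sorted(values) for key, values in blocks.items()}

# The four lines quoted in the report, used here as sample input.
SAMPLE = [
    "Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)): "
    "ext3_new_block: Allocating block in system zone - block = 96",
    "Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)): "
    "ext3_new_block: Allocating block in system zone - block = 160",
    "Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)): "
    "ext3_new_block: Allocating block in system zone - block = 161",
    "Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)): "
    "ext3_new_block: Allocating block in system zone - block = 162",
]

if __name__ == "__main__":
    for (device, func), block_list in summarize(SAMPLE).items():
        print(device, func, block_list)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/log/messages"))`. A summary like this only shows where the errors were reported; as noted above, it says nothing about how the filesystem got into that state.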