Description of problem:
Observed in RHEL 4.5, 2.6.9-55.ELsmp (mpt version 3.02.73rh) on an IBM HS21 Blade, which uses the LSI SAS1064E controller. 2 drives configured as RAID 1. Filesystem is ext3.

On reboot, a host may boot with a very dirty filesystem. Fsck tends to fail, requiring manual intervention. After cleanup, files will be missing in seemingly random locations, though /var/log was reasonably consistent in having missing files. The only consistent observation is that boxes exhibiting this issue appear to have greater than 200 days of uptime. We have not yet found a manual trigger for this behavior. IBM case 12RG9KK also opened on this issue.

Particulars: IBM HS21 blade, SAS1064E controller, fw 0.10.15.00 as reported by LSI's cfggen utility. Red Hat 4.5, 2.6.9-55.ELsmp kernel, mpt version 3.02.73rh, ext3 filesystem. Bugzilla case opened on advice of jscalia for tracking purposes.

How reproducible:
Unfortunately, we haven't deduced a reliable, on-demand test case yet. Reboot and hope.
How often does this happen? Do any errors appear in the logs prior to reboot, are there any problems reported as the system is shutting down? How does fsck fail? Do you mean that fsck -a fails, but ultimately fsck is able to repair the fs (or at least run to completion) when manually invoked? Are the missing files those which fsck reported moving to lost+found/ ? What problems were found with the filesystem? There's just not a lot to go on at this point. Any more information you can gather would be helpful. -Eric
Agreed, we've been trying to capture as much information as possible. Once a host is recovered, we've seen nothing unusual in the logs (assuming the logfile survived the recovery). We've seen a total of 8 examples so far, 6 in the last month.

When I say that fsck fails (I wish I had captured one of those), it gets to a state during the boot where it no longer continues and drops into single user for repair. At that point we've been rebooting into rescue mode and executing fsck. The fsck at that time completes successfully, but leaves us in a state where we're not sure what files might have been moved into lost+found. Fsck primarily complained that the number of inodes found for a file was greater than it expected for the file, which smells like directory corruption to us.

We're actively working to try to replicate this somehow (without waiting 200 days). I would leave this ticket at low priority and have it remain a collection point for data as we discover it.
Sounds like a plan, thanks.
Here's what a failure looks like on boot. The fsck error (duplicate blocks shared across 4 files) isn't one of the usual ones I come across, and I'm rather curious what sort of event or action could cause it. Transcription is from the remote console window (actually... multiple screenshots... it keeps rebooting because apparently /etc/passwd has bit the dust).

Checking root filesystem
/ contains a file system with errors, check forced.
/: Duplicate or bad block in use!
/: Duplicate/bad block(s) in inode 7733410: 15495780 15495781 15495810 15495811
/: Duplicate/bad block(s) in inode 7734079: 15495781
/: Duplicate/bad block(s) in inode 7735469: 15495780
/: Duplicate/bad block(s) in inode 7735473: 15495810
/: Duplicate/bad block(s) in inode 7735519: 15495811
/: File /var/cache/man/whatis (inode #7733410, mod time Sun Apr 6 04:22:38 2008) has 4 duplicate block(s), shared with 4 file(s):
/: /var/spool/clientmqueue/qfm3G501Ck000622 (inode #7735519, mod time Wed Apr 16 01:00:01 2008)
/: /var/spool/clientmqueue/qfm388229w004126 (inode #7735473, mod time Tue Apr 8 04:02:02 2008)
/: /var/spool/clientmqueue/dfm3G501Ck000622 (inode #7734079, mod time Wed Apr 16 01:00:01 2008)
/: /var/spool/clientmqueue/dfm388229w004126 (inode #7735469, mod time Tue Apr 8 04:02:02 2008)
/:
/: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

That's the point at which it attempts to drop to a single-user shell; sulogin can't find the password db and it reboots. I can boot this in rescue mode, fsck the system and collect the errors/warnings, but I'll wait and see if there is more data collection desired by Red Hat before I make filesystem changes. Just let me know. I've got 2 blades doing this particular event; both went belly up at the same time, after they were rebooted following a change in hostname. This is the same sort of behavior that we saw on the other 6 blade servers.
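For reference, the overlap fsck is reporting here amounts to a set intersection over per-inode block lists. A minimal Python sketch of that check, using the inode-to-block mapping copied from the transcript above (illustrative data, not read from the actual image):

```python
# Block claims as reported by e2fsck for the affected inodes
# (data transcribed from the boot console, not read from disk).
inode_blocks = {
    7733410: {15495780, 15495781, 15495810, 15495811},  # /var/cache/man/whatis (the 4 duplicated blocks)
    7734079: {15495781},  # dfm3G501Ck000622
    7735469: {15495780},  # dfm388229w004126
    7735473: {15495810},  # qfm388229w004126
    7735519: {15495811},  # qfm3G501Ck000622
}

def find_shared_blocks(inode_blocks):
    """Map each block claimed by more than one inode to its set of claimants,
    roughly what e2fsck's duplicate-block pass computes."""
    claims = {}
    for ino, blocks in inode_blocks.items():
        for b in blocks:
            claims.setdefault(b, set()).add(ino)
    return {b: inos for b, inos in claims.items() if len(inos) > 1}

shared = find_shared_blocks(inode_blocks)
# Each of the four blocks ends up claimed by the whatis inode plus one mqueue file.
```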
The boxes are totally fine for the 200(-ish) days prior to the reboot, and this only shows up when you actually bring it up again? Above, the files in question were last modified in the last week or two (Apr 6, 8, 16)...

Before you repair it, perhaps you can run dumpe2fs, or even make an e2image -r image, to preserve the current state. Depending on whether or not the filenames are sensitive, perhaps the image could be provided for further analysis.

Is there anything else "unique" about these blades? I've not seen this sort of thing anywhere else, so perhaps something unique about your setup could be triggering it. I suppose I should ask: are there any 3rd-party modules in use?

Thanks,
-Eric
Yep, that seems to be the attendant behavior. Our first thought was write-cache bugs or other issues with the RAID controller, but problem blades have had both the SAS1064E and LSI53C1020/1030 controllers (SAS and SCSI). Nothing consistent. Also, I hate to say it, but all of the Windows blades have been just fine.

I saw the file modification times, but they don't match any human intervention. The hosts had been retired from previous work and had simply been sitting around for a bit before being re-purposed. I'm guessing regular cron maintenance tasks.

On the second host that's broken now, the boot-time fsck output is similar:

/1: Duplicate/bad block(s) in inode 426133: 873016
/1: Duplicate/bad block(s) in inode 427934: 873016
/1: (There are 2 inodes containing duplicate/bad blocks.)
/1: File /var/log/prelink.log (inode #426133, mod time Thu Feb 21 04:02:03 2008) has 1 duplicate block(s), shared with 1 file(s):
/1: /var/spool/clientmqueue/dfm2160166001562 (inode #427934, mod time Sat Mar 1 01:00:01 2008)

I'll create filesystem images today. Can those be attached to the bugzilla ticket?
Let's see how big the images are (bzipped per the e2image manpage instructions...). I'm not sure how big an attachment you can make, or how public you want the images to be, if they have any sensitive/private information. FTP or whatnot would be OK too. Hmm; the 2nd host also has a file in clientmqueue with duplicately-mapped blocks? That's... interesting...
compressed e2image file mailed to you. Compressed down, it's about a meg.
New host crashed today in the middle of a tar operation. We grabbed the e2image file and will append errors to this ticket.
fsck output as attachment. There was so much of it, in fact, that we couldn't even keep up with the capture (mental note, typescript next time). Disk image is compressing.
Created attachment 304033 [details] fsck output from IBM Blade running Redhat ES 4 update 4
So:

Illegal block #12 (192503584) in inode 1393785. CLEARED.
Illegal block #13 (1734438497) in inode 1393785. CLEARED.
Illegal block #14 (540701285) in inode 1393785. CLEARED.
Illegal block #15 (1987208558) in inode 1393785. CLEARED.
Illegal block #16 (1952526384) in inode 1393785. CLEARED.
Illegal block #17 (1952525932) in inode 1393785. CLEARED.
Illegal block #18 (1663971436) in inode 1393785. CLEARED.
Illegal block #19 (1870226786) in inode 1393785. CLEARED.
Illegal block #20 (1848534126) in inode 1393785. CLEARED.
Illegal block #21 (1091204197) in inode 1393785. CLEARED.
Illegal block #22 (153391975) in inode 1393785. CLEARED.
...

So it appears that the first indirect block for this inode is not an indirect block at all, but contains file data (the above "block numbers" actually translate into an email address at cbeyond.net...!). (The first 12 blocks (0..11) are stored directly in the inode; for the 12th and beyond, they are gathered from indirect blocks.) So either the inode points to the wrong indirect block, or the indirect block was never updated with metadata, or it got overwritten by data; pretty hard to say which from the information at hand.

I'll look at the image you mailed now... but wow, that fsck output really is a mess. Can you give me any more information about the machines; is the kernel tainted? How is storage set up (lvm? md? simple partitions?) etc...

Thanks,
-Eric
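The "block numbers are really file data" observation can be checked by reinterpreting each 32-bit value as four little-endian bytes (the byte order ext3 uses for indirect-block entries on x86). A quick sketch; the sample values are copied from the fsck output above:

```python
import struct

def blocks_to_text(block_numbers):
    """Reinterpret each 32-bit 'block number' as 4 little-endian bytes and
    render it, mapping non-printable bytes to '.'."""
    raw = b"".join(struct.pack("<I", n) for n in block_numbers)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

# Two consecutive 'illegal block' values from the transcript decode to
# readable ASCII text, i.e. ordinary file contents, not block pointers.
print(blocks_to_text([1734438497, 540701285]))
```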
Stock kernel, right from the kickstart. Storage is onboard LSI hardware raid, 2 drives in RAID 1. Two partitions on /dev/sda, one swap and one root.
I wonder if there would be any value at all in checking each of the 2 legs of the mirror independently? Oh... if it's hardware raid, is that even possible?

Looking at the image you sent, no clues yet as to *why* things went wrong, but I can characterize it. /var/cache/man/whatis shares blocks with several files in /var/spool/clientmqueue/*. The "whatis" file is 51 blocks, timestamped Apr 6; it came first. The mqueue files are each 1 block, timestamped Apr 8 and 16, so they came later. Each file has a single value for atime, mtime, & ctime. The single-block mqueue files seem to have "stolen" blocks from the whatis file, which is laid out like:

BLOCKS:
(0-11):15495780-15495791, (IND):15495792, (12-15):15495793-15495796, (16-23):15495798-15495805, (24-49):15495810-15495835
TOTAL: 51

The first 2 blocks (15495780, 15495781) are duplicated, as are the blocks in the last extent (15495810, 15495811). Because the first 2 blocks are stored in the inode itself, which looks perfectly fine, and because they are contiguous with a nice string of blocks in the file, it seems like the mqueue files that came along later were somehow wrongly given those same 2 blocks (rather than, say, the whatis inode getting corrupted).
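For context on the layout above: ext3 keeps the first 12 block pointers in the inode itself and reaches everything beyond via indirect blocks. A minimal sketch of that mapping, assuming a 4096-byte block size (so 1024 four-byte pointers per indirect block):

```python
BLOCK_SIZE = 4096
PTRS = BLOCK_SIZE // 4          # 32-bit block pointers per indirect block
DIRECT = 12                     # direct pointers held in the inode itself

def block_path(n):
    """Classify logical block n of an ext3 file: held directly in the inode,
    or reached via the single/double/triple indirect trees."""
    if n < DIRECT:
        return "direct"
    n -= DIRECT
    if n < PTRS:
        return "single-indirect"
    n -= PTRS
    if n < PTRS * PTRS:
        return "double-indirect"
    return "triple-indirect"

# The 51-block whatis file fits this: blocks 0-11 sit in the inode, and
# blocks 12-50 all hang off the one indirect block (15495792 above).
```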
Oh, also, the filesystem was marked as having errors. There should be something in the logs as to why... any chance of finding that?

-/bin/sh-3.00# fsck -y /dev/sda1
fsck 1.35 (28-Feb-2004)
WARNING: couldn't open /etc/fstab: No such file or directory
e2fsck 1.35 (28-Feb-2004)
contains a file system with errors, check forced.
One other question on this; did the machines lose power prior to the reboot? Are you still experiencing these problems? Thanks, -Eric
In NEEDINFO since 2008, and RHEL4 is now very late in life. Closing WONTFIX unless something new comes up here.