From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; es-ES; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc4 Firefox/1.0.6

Description of problem:
On 3 occasions a freeze or power failure has left my '/tmp' LVM volume hosed. In fact, it was corrupted to the extent that on reboot, fsck was unable to repair the damage. Luckily, I was able to do the following from the single-user mode shell I got dropped into:

    grep /dev/VolGroup00/LogVol02 /etc/fstab

... and establish that the FS that was corrupted was '/tmp'. Since nobody should be relying on the contents of '/tmp', I was then able to do the following to recover the situation:

    mke2fs -j /dev/VolGroup00/LogVol02
    <control-D_to_reboot>

On reboot, the situation was resolved.

Thinking about it, could we not automate this process by modifying /etc/rc.sysinit such that if '/tmp' (and only '/tmp'! ;-) gets corrupted (i.e. fsck is unable to recover it), we either:

1) display a message showing what's going on, and then automatically reformat the FS for the user.
2) offer the user a chance to reformat the FS.

We could display a message like:

    Recoverable error(s) detected on temporary filesystem possibly caused by crash/power failure. Repair now? ([y]/n)?

Version-Release number of selected component (if applicable):
initscripts-8.11.1-1

How reproducible:
Didn't try

Steps to Reproduce:
1. force a crash when apps have got files open in /tmp
2. reboot
3. watch as fsck fails to repair the damage.

Actual Results:
System drops into single-user mode with lots of scary messages about fsck errors, etc.

Expected Results:
If I were a newbie user, I would expect that if the damage was repairable, it would be repaired for me. Let's face it, when fsck fails, the console looks very scary and the situation looks very bad indeed. If a newbie had been sitting at the terminal, they probably would have fainted. And once they recovered, they might have thought their entire machine was dead and felt forced to re-install it (probably the only option available to them, unless they had another box on the net to ask for help on the forums/google).

Additional info:
I think this would be a very useful feature. Obviously, we need to be *very* sure that we are reformatting the '/tmp' filesystem and no other filesystem.

I suggest that if this feature is implemented, we fsck each volume in turn and keep track of the fsck return codes for all filesystems. If the _only_ volume that fails fsck is '/tmp', we offer the user the chance to reformat it. If 2 FS's are hosed, and these FS's are '/tmp' and '/var', maybe we could reformat both (can just about live with '/var' being wiped). If more than one FS is hosed and they are not '/tmp' or '/var', we're in deep trouble, so probably just give up at this point.

I guess an alternative to this whole issue is to have a tmpfs-based '/tmp' à la Solaris! ;-)
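To make the proposed policy concrete, here is a minimal sketch of the decision step only, not actual initscripts code. The function names `mount_point_for` and `decide_action`, the `DISPOSABLE_MOUNTS` variable and the three result strings are all hypothetical. It assumes the caller has already run fsck on each volume and collected the mount points that could not be repaired (fsck documents an exit status with the value 4 set as "errors left uncorrected").

```shell
#!/bin/sh
# Hypothetical sketch of the policy described above; nothing here is taken
# from /etc/rc.sysinit.

# Mount points considered safe to recreate (space-separated).
# Assumption: only /tmp unless the administrator says otherwise.
DISPOSABLE_MOUNTS="${DISPOSABLE_MOUNTS:-/tmp}"

# mount_point_for DEVICE [FSTAB]
# Prints the mount point listed for DEVICE, the same lookup as the manual
# "grep <device> /etc/fstab" step, but matching the first field exactly.
mount_point_for() {
    awk -v dev="$1" '$1 == dev { print $2 }' "${2:-/etc/fstab}"
}

# decide_action MOUNTPOINT...
# Arguments are the mount points whose filesystems fsck could not repair.
# Prints one of: continue, offer-reformat, give-up
decide_action() {
    if [ "$#" -eq 0 ]; then
        echo continue            # nothing failed, boot normally
        return 0
    fi
    for failed in "$@"; do
        case " $DISPOSABLE_MOUNTS " in
            *" $failed "*) ;;    # disposable, keep checking the rest
            *) echo give-up      # something we must never reformat failed
               return 0 ;;
        esac
    done
    echo offer-reformat          # every failed filesystem is disposable
}
```

With the default setting, `decide_action /tmp` prints `offer-reformat` and `decide_action /tmp /home` prints `give-up`. Setting `DISPOSABLE_MOUNTS="/tmp /var"` would reproduce the second case in the report. Keeping the allowed set explicit and defaulting it to '/tmp' alone is one way to stay *very* sure nothing else is ever reformatted.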
/var contains the package database. That would be bad to wipe. :)

Realistically, I'd say that this is a lot of complication just for the one special case of /tmp. Those who aren't able to handle fsck failures and corruption messages normally don't have /tmp separate in any case, so I'm not sure this solves a big usage problem for those users. Hence, I don't think this will be worked on in the near future.
Hi Bill,

I agree that wiping /var would be bad, but if the FS is corrupt beyond repair anyway....?

<ASIDE>
It would be a "nice-to-have" feature if we could create something like "/var/backup-data/last-backup-metadata.dat" and recommend that all software that handles backups updates this file through a well-defined interface. This would allow us to keep track of what had been backed up and when, and also to remind users in a standard way that they hadn't done a backup for 'n' days. Maybe a window could pop up in GNOME and suggest the user clicks "create backup now" to be taken to their favourite backup software ("/etc/alternative/backup-software" or summut), be it K3B or whatever. Part of that backup procedure would allow them to back up their RPM database in an easy manner.

_If_ /var then gets corrupted, having recreated the /var FS, we could prompt for some backup media to restore the RPM db. (If that is not available, we could in theory go wild and heuristically determine the set of packages on the system by comparing "well-known" files/directories to the yum repository data, and then rebuild the RPM db based on that data ;-)
</ASIDE>

I also agree that my suggestion adds complication for the special case of '/tmp'. However, from what I have observed, this FS is the one most likely to be damaged by power-downs / kernel lock-ups. I admit I'd forgotten that the default Fedora install only creates a single FS for /, /home, /etc, /var, /tmp, etc. (I always partition by hand), but maybe now is the time for this policy to change? After all, there are many good reasons for spreading the system across separate FS's. In the old days I can see that it made a lot of sense for the installer to create 1 huge FS for everything (much simpler logic), but now that we've got LVM2 with resizeable volumes, and 500 GB disks appearing in desktop machines, I'd question that decision.

Suppose the installer were changed to create '/tmp' as, say, 2 GB (which is _more_ than enough for most people), leaving the other 498 GB free for other FS's. If the system then locks up hard and /tmp gets corrupted, I'd vote for a system where we could recover /tmp and allow the user to continue working. If instead we dedicate the entire 500 GB to '/' and files in /tmp cause fsck to barf, the entire system is dead. Am I missing something?
Re: your first suggestion, it would probably be best to *pick* a backup software by default, and then start down that road. It's certainly an idea, though. I'm still at a loss to understand *why* /tmp would be unrecoverable so often - what makes it more likely to be corrupted?
Hi Bill,

I'm slightly confused too. I can understand that apps may have files open in '/tmp', but as to why that occasionally causes the FS to get trashed, I don't know. My '/tmp' is ext3 (no fancy ext3 options enabled), and I've checked my HDD for errors using smartctl. The memory isn't the problem (memtest86 is happy). The problem I'm experiencing (thankfully!) only ever affects '/tmp'.

I'm starting to think the problem could lie with Bastille Linux. I haven't had a chance to look into it in any great detail, but Bastille creates some sort of "safe" temporary area (as a subdirectory of '/tmp') and fires off a shell script that looks for "suspicious" activity in that area. Whatever it is doing shouldn't cause the FS to be irrecoverable though...

I'll post more info here as and when I get it.

Cheers,
James.