166838 – /etc/rc.sysinit could be improved wrt fsck failure on /tmp

Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	initscripts
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Bill Nottingham
QA Contact:	Brock Organ

Reported:	2005-08-26 11:25 UTC by James Hunt
Modified:	2014-03-17 02:55 UTC
Last Closed:	2005-08-26 18:58:06 UTC

Description James Hunt 2005-08-26 11:25:58 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; es-ES; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc4 Firefox/1.0.6

Description of problem:
I've been in a situation on 3 occasions when I've had a freeze / power failure cause my '/tmp' LVM volume to get hosed. In fact, it was corrupted to the extent that on reboot, fsck was unable to repair the damage. Luckily, I was able to do the following from the single-user mode shell I got dropped into:

grep /dev/VolGroup00/LogVol02 /etc/fstab

... and establish that the FS that was corrupted was '/tmp'. Since nobody should be relying on the contents of '/tmp', I was then able to do the following to recover the situation...

mke2fs -j /dev/VolGroup00/LogVol02
<control-D_to_reboot>

On reboot, the situation was resolved.
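
The lookup step above could be wrapped in a small helper. This is only a sketch: `fstab_mountpoint` is a hypothetical name, and it assumes the device appears literally in the first field of the fstab file (it will not match `LABEL=` or `UUID=` entries).

```shell
# Print the mount point that an fstab file lists for a given device.
# Usage: fstab_mountpoint DEVICE [FSTAB]   (FSTAB defaults to /etc/fstab)
# Hypothetical helper; only matches a literal device path in field 1.
fstab_mountpoint() {
    dev=$1
    fstab=${2:-/etc/fstab}
    # Skip comment lines, compare field 1 exactly, print field 2 once.
    awk -v dev="$dev" '$1 !~ /^#/ && $1 == dev { print $2; exit }' "$fstab"
}
```

On the system described above, `fstab_mountpoint /dev/VolGroup00/LogVol02` should print `/tmp`. Unlike a plain `grep`, the exact match on the first field avoids hits on commented-out lines or on longer device names that share the same prefix.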

Thinking about it, could we not automate this process by modifying /etc/rc.sysinit such that if /tmp (and only '/tmp'! ;-) gets corrupted (ie where fsck is unable to recover it), we either:

1) display a message showing what's going on, and then automatically
reformat the FS for the user.

2) offer the user a chance to reformat the FS

We could display a message like:

Recoverable error(s) detected on temporary filesystem possibly
caused by crash/power failure. Repair now? ([y]/n)?
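
A prompt of that `([y]/n)` shape, where pressing Enter means "yes", might look roughly like the sketch below. `confirm_default_yes` is a hypothetical name, not an existing initscripts function, and treating end-of-input as "no" is my own cautious assumption so that nothing gets reformatted when no one is at the console.

```shell
# Ask a yes/no question where an empty answer counts as "yes".
# Returns 0 for yes, 1 for no. The answer is read from stdin.
# Hypothetical helper; not part of rc.sysinit.
confirm_default_yes() {
    printf '%s ([y]/n)? ' "$1" >&2
    # End of input (nobody answering) is treated as "no" to stay safe.
    read -r answer || return 1
    case "$answer" in
        ""|[yY]|[yY][eE][sS]) return 0 ;;
        *) return 1 ;;
    esac
}
```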

Version-Release number of selected component (if applicable):
initscripts-8.11.1-1

How reproducible:
Didn't try

Steps to Reproduce:
1. force a crash when apps have got files open in /tmp
2. reboot
3. watch as fsck fails to repair the damage.

Actual Results: System drops into single-user mode with lots of scary messages about fsck errors, etc.

Expected Results: If I were a newbie user, I would expect that if the damage was repairable, it would be repaired for me.

Let's face it, when fsck fails, the console looks very scary and the situation looks very bad indeed. If a newbie had been sitting at the terminal, they probably would have fainted. However, when they recovered, they may have thought their entire machine was dead and been forced to re-install it (probably the only option available to them, unless they had another box on the net to use to ask for help from the forums/Google).

Additional info:

I think this would be a very useful feature. Obviously, we need to be *very* sure that we are reformatting the '/tmp' filesystem and no other filesystem.

I suggest that if this feature is implemented, we fsck each volume in turn. If the _only_ volume that fails fsck is '/tmp' (ie we keep track of all the fsck return codes on all the other filesystems), we offer the user the chance to reformat it. If 2 FS's are hosed, and these FS's are '/tmp' and '/var', maybe we could reformat both (can just about live with '/var' being wiped).

If >1 FS's are hosed and they are not '/tmp' or '/var', we're in deep trouble so probably just give up at this point.
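
The bookkeeping could be sketched as below. This is an illustration rather than a patch against rc.sysinit: `fsck_decision` and its output keywords are hypothetical, and the sketch deliberately treats only '/tmp' as disposable. It relies on the fsck(8) exit status being a bit mask (1 = errors corrected, 2 = reboot advised, 4 = errors left uncorrected, 8 = operational error), so any status of 4 or more is counted as a failure.

```shell
# Decide what to do from per-filesystem fsck results.
# Reads lines of "MOUNTPOINT EXIT_STATUS" on stdin and prints one of:
#   ok              no filesystem failed its check
#   offer-reformat  /tmp is the only filesystem that failed
#   give-up         anything else failed (alone or alongside /tmp)
# Hypothetical helper; names and keywords are illustrative only.
fsck_decision() {
    failed=""
    while read -r mnt rc; do
        [ -n "$mnt" ] || continue
        # Status >= 4 means uncorrected errors or a worse failure.
        if [ "$rc" -ge 4 ]; then
            failed="$failed $mnt"
        fi
    done
    case "$failed" in
        "")      echo ok ;;
        " /tmp") echo offer-reformat ;;
        *)       echo give-up ;;
    esac
}
```

For example, feeding it the three lines `/ 0`, `/tmp 4` and `/home 1` prints `offer-reformat`, while `/ 4` together with `/tmp 4` prints `give-up`.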

I guess an alternative to this whole issue is to have a tmpfs-based '/tmp' à la Solaris! ;-)
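
For reference, a tmpfs-backed '/tmp' is normally just one /etc/fstab entry. The line below is a sketch, not a tested configuration for Fedora Core 4; the 512m size cap is an arbitrary value picked for illustration.

```text
# /etc/fstab fragment: keep /tmp in memory (size cap is an assumed example value)
tmpfs   /tmp   tmpfs   size=512m,mode=1777   0 0
```

With this setup '/tmp' is recreated empty on every boot, so there is no on-disk filesystem for fsck to check or fail on. The trade-off is that its contents consume RAM and swap.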

Comment 1 Bill Nottingham 2005-08-26 18:58:06 UTC

/var contains the package database. That would be bad to wipe. :)

Realistically, I'd say that this is a lot of complication just for the one
special case of /tmp. Those that aren't able to handle fsck failures and
corruption messages normally don't have /tmp separate in any case, so I'm not
sure this solves a big usage problem for those users.

Hence, I don't think this will be worked on in the near future.

Comment 2 James Hunt 2005-08-27 08:39:34 UTC

Hi Bill, I agree that wiping /var would be bad, but if the FS is corrupt beyond
repair anyway....?

<ASIDE>

It would be a "nice-to-have" feature if we could maybe create something like,
"/var/backup-data/last-backup-metadata.dat" such that we could recommend that
all software that handles backups updates this file using a well-defined
interface. This would allow us to keep track of what had been backed up and
when, but also to remind users that they hadn't done a backup for 'n' days in a
standard way - maybe a window could pop up in gnome, and suggest the user clicks
"create backup now" to be taken to their favourite backup software
("/etc/alternative/backup-software" or summut) be it K3B, or whatever. Part of
that backup procedure would allow them to back up their RPM database in an easy
manner.

_If_ /var then gets corrupted, having recreated the /var FS, we could prompt for
some backup media maybe to restore the rpm db. (If that is not available, we
could in theory go wild and heuristically determine the set of packages on the
system by comparing "well-known" files/directories to the yum repository data,
and then rebuild the RPM db based on that data ;-)

</ASIDE>

I also agree that my suggestion is complication for the special case of '/tmp'.
However, from what I have observed, this FS is the one most likely to be damaged
by power down / kernel locks. I admit I'd forgotten that the default Fedora
install only creates a single FS for /, /home, /etc, /var, /tmp, etc (I always
partition by hand), but maybe now is the time for this policy to change? After
all, there are many good reasons for spreading the system across separate FS's.
In the old days I can see that it made a lot of sense for the installer to
create 1 huge FS for everything (much simpler logic), but now we've got LVM2
with resizable volumes, and 500GB disks appearing in desktop machines, I'd
question that decision. If the installer was changed to create '/tmp' as say 2GB
(which is _more_ than enough for most people), leaving the other 498GB free for
other FS's, if the system locks up hard and /tmp gets corrupted I'd vote for a
system where we could recover /tmp, and allow the user to continue working. If
we dedicated the entire 500GB to '/', and files in /tmp cause fsck to barf, the
entire system is dead. Am I missing something?

Comment 3 Bill Nottingham 2005-08-29 01:56:40 UTC

Re: your first suggestion, it would probably be best to *pick* a backup software
by default, and then start down that road. It's certainly an idea, though.

I'm still at a loss to understand *why* /tmp would be unrecoverable so often -
what makes it more likely to be corrupted?

Comment 4 James Hunt 2005-08-30 15:40:15 UTC

Hi Bill,

I'm slightly confused too. I can understand that apps may have files open in
'/tmp', but as to why that causes the FS to get trashed occasionally, I don't know.

My '/tmp' is ext3 (no fancy ext3 options enabled), and I've checked my HDD for
errors using smartctl. The memory isn't the problem (memtest86 is happy). The
problem I'm experiencing (thankfully!) only ever affects '/tmp'.

I'm starting to think the problem could lie with Bastille Linux. I haven't had a
chance to look into it in any great detail, but Bastille creates some sort of
"safe" temporary area (as a subdirectory of '/tmp') and fires off a shell script
that looks for "suspicious" activity in that area. Whatever it is doing
shouldn't cause the FS to be irrecoverable though... I'll post more info here as
and when I get it.

Cheers,

James.
