Bug 649089

Summary:	anaconda should not disable automatic filesystem checks on journaled ext3/4
Product:	[Fedora] Fedora	Reporter:	James Ralston <ralston>
Component:	anaconda	Assignee:	David Lehman <dlehman>
Status:	CLOSED RAWHIDE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	low
Version:	rawhide	CC:	esandeen, jasonmc, jcm, jonathan, oliver.henshaw, sct, vanmeeuwen+fedora
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-02-15 22:19:41 UTC	Type:	---

Description James Ralston 2010-11-02 22:39:10 UTC

(I am filing this against Fedora rawhide, but as far as I can tell, this applies to all versions of both Fedora and RHEL.)

The default behavior of mke2fs is to set the interval-between-checks option to 6 months, and the max-mount-counts option to a value near 30. (The default value for max-mount-counts will vary slightly across successive invocations of recent versions of mke2fs, in an attempt to avoid the scenario where the max-mount-counts value is hit on every filesystem at the same time.)

The reasoning for these filesystem parameters is described in the tune2fs man page:

You should strongly consider the consequences of disabling
mount-count-dependent checking entirely. Bad disk drives, cables,
memory, and kernel bugs could all corrupt a filesystem without
marking the filesystem dirty or in error. If you are using
journaling on your filesystem, your filesystem will never be
marked dirty, so it will not normally be checked. A filesystem
error detected by the kernel will still force an fsck on the next
reboot, but it may already be too late to prevent data loss at
that point.

It is strongly recommended that either -c (mount-count-dependent)
or -i (time-dependent) checking be enabled to force periodic full
e2fsck(8) checking of the filesystem. Failure to do so may lead
to filesystem corruption (due to bad disks, cables, memory, or
kernel bugs) going unnoticed, ultimately resulting in data loss or
corruption.
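
As a rough illustration of the two triggers the man page describes, the following Python sketch models when a periodic check would be forced. The function and parameter names are hypothetical, and this is a simplified model rather than e2fsck's actual code; it assumes that a max mount count of 0 or less, or an interval of zero, means that trigger is disabled.

```python
from datetime import datetime, timedelta

def periodic_check_due(mount_count, max_mount_count,
                       last_checked, check_interval, now):
    """Return True if either periodic trigger has fired (simplified model)."""
    # Mount-count trigger: assumed disabled when the maximum is 0 or negative.
    by_count = max_mount_count > 0 and mount_count >= max_mount_count
    # Time trigger: assumed disabled when the interval is zero.
    by_time = (check_interval > timedelta(0)
               and now - last_checked >= check_interval)
    return by_count or by_time

now = datetime(2010, 11, 2)
six_months = timedelta(days=180)  # approximation of "6 months"

# Roughly the mke2fs defaults: the 31st mount exceeds a limit of 30.
print(periodic_check_due(31, 30, now - timedelta(days=10), six_months, now))
# Both triggers disabled: never due, however old the last check is.
print(periodic_check_due(500, 0, now - timedelta(days=900), timedelta(0), now))
```

With both triggers disabled, the second call never reports a check as due, no matter how many mounts or days have passed.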

However, when anaconda creates an ext2/3/4 filesystem, if the filesystem has a journal, anaconda deliberately disables mount-count-based and time-based forced filesystem checks:

    try:
        rc = iutil.execWithRedirect("tune2fs",
                                    ["-c0", "-i0",
                                     "-ouser_xattr,acl", self.device],
                                    stdout = "/dev/tty5",
                                    stderr = "/dev/tty5")

Not only is there no way to override this action via kickstart options and/or the graphical installer, I have not been able to find any documentation anywhere that even bothers to mention that anaconda does this.
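
Once the system is installed, the checks can be turned back on by hand with the same -c and -i options of tune2fs. A minimal sketch that only builds the argument list and runs nothing; the helper name, the device path, and the chosen values (30 mounts, 6 months) are assumptions for illustration, not values that anaconda or mke2fs would necessarily pick:

```python
def tune2fs_periodic_args(device, max_mounts=30, interval="6m"):
    """Build a tune2fs command line that re-enables periodic checks."""
    # -c sets the maximum mount count; -i sets the interval between checks
    # (suffix d, w, or m for days, weeks, or months).
    return ["tune2fs", "-c", str(max_mounts), "-i", interval, device]

# /dev/sda1 is a placeholder device name.
print(" ".join(tune2fs_periodic_args("/dev/sda1")))
```

This has to be repeated for every filesystem the installer created, which is part of why an installer-level option would be preferable.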

I assert that the reasoning behind anaconda's behavior is incorrect. Journaled filesystems are *more* susceptible to damage than non-journaled filesystems, not less, and therefore it is important that periodic filesystem checks are NOT disabled on journaled filesystems.

To understand why this is so, it is important to keep in mind exactly what the journal in a journaled filesystem does. Filesystem operations that appear to be atomic to programs (e.g., deleting a file) can actually require the filesystem driver to perform multiple I/O operations on the filesystem. In a nutshell, the journal simply records the recent operations (both pending and committed) to the filesystem.

Because of the journal, in the event that the filesystem was not unmounted cleanly (e.g., the kernel crashed, there was a loss of system power), and there were filesystem changes whose underlying I/O operations were only partially complete, the OS can bring the filesystem back into a consistent state simply by replaying the journal, without having to exhaustively walk the entire filesystem looking for partially-complete filesystem changes. For large filesystems, this is a fantastic time-saver.
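
To make the replay step concrete, here is a toy model in Python. It is not the on-disk journal format used by ext3/4; the record layout, block names, and function name are all invented for illustration. It shows only the core idea: writes belonging to a committed transaction are redone, and writes from a transaction that never committed are ignored.

```python
def replay(disk, journal):
    """Apply only the transactions that reached their commit record."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _txn, block, data = rec
            disk[block] = data  # redoing a write is harmless if it already landed
    return disk

# Deleting a file touches several blocks; the crash hit before txn 2 committed.
journal = [
    ("write", 1, "dir", "entry removed"),
    ("write", 1, "inode", "freed"),
    ("commit", 1),
    ("write", 2, "bitmap", "half-updated"),
]
disk = {"dir": "entry present", "inode": "in use", "bitmap": "old"}
print(replay(disk, journal))
```

Note that replay only ever touches the blocks named in the journal; everything else on the disk is taken on trust.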

However, this advantage can also be a disadvantage. Consider the scenario where corruption occurs in a filesystem, unbeknown to the filesystem driver. Such corruption could happen for any number of reasons, including (but not limited to)...

1. a bug with the kernel or filesystem driver code,

2. a physical disk that is failing,

3. a hardware RAID device that experiences problems, or

4. a SAN device that experiences problems (firmware problems, network connectivity problems, etc.).

First, if the system can be brought down cleanly, the filesystems will unmount cleanly, and neither a journaled nor non-journaled filesystem will detect the corruption.

But if the system crashes as a result of the same problems that caused the filesystem corruption (which isn't uncommon), consider what happens when the system reboots, in the case of a journaled filesystem versus a non-journaled filesystem:

1. For the non-journaled filesystem, because there is no journal to replay, the only way the OS can bring the filesystem into a consistent state is to perform an exhaustive check of the filesystem. Performing the exhaustive check will discover the corruptions that exist in the filesystem.

2. For the journaled filesystem, the OS (filesystem driver) simply replays the journal, which brings the filesystem into a consistent state (from the point of view of the OS). Because the journal obviates the need to perform an exhaustive check of the filesystem, the corruptions are not discovered.
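
The contrast between the two boot paths can be sketched with a toy model. The data structures below (a dictionary of directory entries and a dictionary of inodes) and both function names are hypothetical simplifications, not how e2fsck or the kernel represent a filesystem:

```python
def full_check(dirents, inodes):
    """Walk every directory entry and report names pointing at missing inodes."""
    return sorted(name for name, ino in dirents.items() if ino not in inodes)

def journal_replay(dirents, journal):
    """Redo journaled directory updates; looks at nothing else."""
    for name, ino in journal:
        dirents[name] = ino
    return []  # replay has no notion of "problems found"

dirents = {"a.txt": 1, "b.txt": 2, "c.txt": 3}
inodes = {1: "data", 3: "data"}   # inode 2 was silently lost
journal = [("d.txt", 1)]          # the only pending change at crash time

print(journal_replay(dirents, journal))  # journaled boot path
print(full_check(dirents, inodes))       # non-journaled boot path
```

The replay path reports nothing, while the exhaustive walk flags the entry whose inode has disappeared.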

Thus, the strength of a journaled filesystem (avoiding an exhaustive filesystem check if a filesystem isn't unmounted cleanly) is also its weakness: in the event that a crash is hardware-related, you absolutely, positively want to perform an exhaustive filesystem check, because it is the only way to find corruption.

When filesystem corruption is caught immediately, it can often be repaired with minimal damage. But when filesystem corruption is not caught immediately, it has a nasty tendency to "cascade"; that is, the corruptions compound, and create more and more errors. By the time the corruption reaches the point where it is noticed by people/programs using the filesystem, the corruption is often extensive enough that the filesystem cannot be saved.

This is why periodic forced filesystem checks are critically important for journaled filesystems: since a journaled filesystem will NEVER be marked as dirty, periodic forced filesystem checks are the only possible mechanism to find corruption in the filesystem before it can cascade.

For Fedora, I find it difficult to envision a convincing argument for disabling periodic forced filesystem checks. The only argument I could see is that users will consider it to be a bug if it occasionally takes a long[er] amount of time to boot their Fedora install, because Windows doesn't do that. But the counterargument is that Fedora should automatically use cron.weekly or cron.monthly to apply the "online fsck via LVM snapshot" technique described in the tune2fs man page (under the "-T" option).

For RHEL, the situation is different. Periodic forced filesystem checks may be unacceptable to some admins, because they can unexpectedly vary the time it takes a server to reboot, and multi-terabyte filesystems can take hours to check:

http://blog.ronnyegner-consulting.de/2009/10/01/ext3-beware-of-periodic-file-system-checks/

But even if this is the case, it is still wrong for anaconda to disable periodic forced filesystem checks. Rather, mke2fs itself should be patched to default max-mount-counts to -1 and interval-between-checks to 0. This is because admins can (and frequently do, in the case of RHEL) create new filesystems after the initial install.

Furthermore, there is no reason why the "online fsck via LVM snapshot" technique isn't applicable to RHEL.

So, in summary, anaconda's behavior of disabling periodic forced filesystem checks on the journaled ext3/4 filesystems it creates is wrong because...

1. it is undocumented,

2. it cannot be overridden,

3. periodic forced filesystem checks are more important for journaled filesystems (in contrast to non-journaled filesystems), and

4. if this behavior is truly necessary, mke2fs itself is the proper place to implement it, not anaconda.

Given how long this behavior has persisted in anaconda, it's probably not reasonable to change the default at this time. Therefore, I think the best way to address this issue is to:

1. Patch mke2fs to default max-mount-counts to -1 and interval-between-checks to 0, and document this difference both in the mke2fs man page and in mke2fs's output.

2. Provide anaconda kickstart options to set the max-mount-counts and interval-between-checks for filesystems that anaconda creates. Also provide a way to set these options via the interactive installer. Finally, document these options, including the trade-offs of enabling/disabling periodic forced filesystem checks.

3. Implement a utility to periodically check filesystems via the "online fsck via LVM snapshot" technique.

(Actually, I am working on a utility to perform #3, and will release it publicly when I am finished.)
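
For reference, the "online fsck via LVM snapshot" technique boils down to a handful of commands. The sketch below only builds the command lines and executes nothing; the function name, the volume group and logical volume names, the snapshot name, and the snapshot size are placeholders, and a real utility would need error handling and a sensible way to size the snapshot.

```python
def snapshot_check_commands(vg, lv, snap="fsck-snap", size="1G"):
    """Return the command lines for one snapshot-based check, in order."""
    origin = f"/dev/{vg}/{lv}"
    snapshot = f"/dev/{vg}/{snap}"
    return {
        "create": ["lvcreate", "-s", "-L", size, "-n", snap, origin],
        # -f forces a full check; -n opens read-only and answers "no" to fixes.
        "check": ["e2fsck", "-f", "-n", snapshot],
        "remove": ["lvremove", "-f", snapshot],
        # Only if the check was clean: reset mount count and last-check time.
        "mark_checked": ["tune2fs", "-C", "0", "-T", "now", origin],
    }

# "vg0" and "root" are placeholder names.
for step, argv in snapshot_check_commands("vg0", "root").items():
    print(step, " ".join(argv))
```

Because the check runs against the snapshot, the origin filesystem stays mounted and writable throughout; the final tune2fs step is what lets a clean result postpone the boot-time check.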

Comment 1 David Lehman 2010-11-03 22:00:34 UTC

I am inclined to think that anaconda should not be in the business of overriding default filesystem settings. It doesn't matter at all to me how long we have been doing it this way.

Stephen, what do you think? This tune2fs call to remove the forced fsck was added per your request in August of 2001.

Comment 2 Stephen Tweedie 2010-11-08 11:21:17 UTC

There are two completely different questions here... what should the default be, and how should we set the default.

mke2fs.conf was only added in 2006, so our changes before that predate the ability to set meaningful system-wide defaults in the config file.  In general, mke2fs.conf would seem to be a more appropriate place to be doing this these days.

As for what the default is, current behaviour is widely expected and easy enough to override if users want to. Online fs checking via snapshots is possible but can be slow; online correction is not yet possible, of course. And desktop users are not immune from the "too slow to boot" issue mentioned above for servers; these days desktops with TB-capacity disks are common, and an unexpected slow boot can still be a serious problem (e.g., booting a laptop to run a presentation and finding it takes half an hour to fsck... not good!)

Seems like the sort of change that would be better discussed on the fedora lists, though, as it's likely to garner a wide variety of opinions.

Comment 3 David Lehman 2010-11-09 02:26:30 UTC

Eric, I'm adding you in since you maintain e2fsprogs. We have code in anaconda
that calls 'tune2fs -c0 -i0' on all new ext[234] filesystems. It doesn't make
sense for anaconda to override filesystem defaults, so I'm going to remove it
(from rawhide and F15). If you think this should be default behavior for new
ext[234] filesystems, please add something to /etc/mke2fs.conf or wherever you
think is appropriate.

Comment 4 Stephen Tweedie 2010-11-09 10:05:48 UTC

Please don't make the change until we've at least had a chance to discuss it more widely; it would have a huge impact on the end user if we ended up with the tuning gone from anaconda but not added to mke2fs.conf.

Comment 5 Eric Sandeen 2010-11-12 19:42:46 UTC

Sorry for the late reply, was travelling a lot.  Let me try to tackle some of these ..

I'm sympathetic to the argument that anaconda shouldn't be overriding defaults; that's a good guiding principle.

To be honest I'd like to remove the forced fsck upstream as well, and have talked with Ted about it.  The rationale in the original comment on this bug includes things like:

> Thus, the strength of a journaled filesystem (avoiding an exhaustive filesystem
> check if a filesystem isn't unmounted cleanly) is also its weakness: in the
> event that a crash is hardware-related, you absolutely, positively want to
> perform an exhaustive filesystem check, because it is the only way to find
> corruption.

and I totally agree - but it doesn't follow that therefore extN should be your nanny and (eventually) do it for you.  In the case above you probably want to -immediately- run fsck, not wait until 6 months or 30 mounts have expired.

As for the creeping corruption argument, extN should be good at finding corruption at runtime; for example, we exhaustively check htree directories on every access - almost too often I think, at a performance penalty. If we don't catch existing on-disk data corruption on access, then we have a filesystem bug.

I'm not sold on the notion that semi-random forced full filesystem checks of journaling filesystems are a good thing.  extN is the only one I know of which has this interesting feature.

Comment 6 Eric Sandeen 2010-11-12 19:45:13 UTC

Another note:

> This is why periodic forced filesystem checks are critically important for
> journaled filesystems: since a journaled filesystem will NEVER be marked as
> dirty, periodic forced filesystem checks are the only possible mechanism to
> find corruption in the filesystem before it can cascade.

This isn't quite correct.  Any error which would trip the error handling (i.e. errors=remount-ro) behavior (for example the aforementioned directory tree consistency checking) will mark the fs as being in an error state, and the next fsck -will- do a full run.
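
Roughly, in toy form (the class, attribute, and function names here are made up for illustration and do not correspond to kernel or e2fsprogs structures):

```python
class ToyFs:
    def __init__(self):
        self.clean = True       # journaled fs: still "clean" after a crash
        self.error = False      # set when the driver trips over corruption
        self.read_only = False

    def on_corruption_detected(self):
        """Model of errors=remount-ro: record the error, stop writing."""
        self.error = True
        self.read_only = True

def boot_time_fsck_runs_fully(fs):
    """A full check is skipped only when the fs is clean and error-free."""
    return fs.error or not fs.clean

fs = ToyFs()
print(boot_time_fsck_runs_fully(fs))   # nothing detected yet
fs.on_corruption_detected()
print(boot_time_fsck_runs_fully(fs))   # error flag forces the full run
```

The model ignores the periodic triggers entirely; it only shows that the error flag is a second path to a full check, separate from the dirty bit.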

Comment 7 Jon Masters 2010-11-16 07:52:58 UTC

A far better user experience would be to pop up a notification (or whatever will replace that in gnome-shell...) telling the user that they might want to do an fsck and offering them an option. You could create a flag file in /boot that indicates on next boot a full fsck should be performed. There. No need to check on some random interval or whatever, just if the user chooses to do so.

Comment 8 Eric Sandeen 2010-12-03 21:00:24 UTC

Jon, that'd be as awesome as Windows XP constantly asking me if I really want to do <insert many random things here>  ;)

If the fs detects corruption it'll shut down and fsck on next reboot.  Why do we need more than this?

Comment 9 David Lehman 2010-12-07 21:02:58 UTC

One of these days I'm going to go ahead and remove this from anaconda. If you guys want to get something else in place before then, I'd suggest you get to it.

Comment 10 Eric Sandeen 2010-12-07 21:07:49 UTC

Ted doesn't want to drop it, so I guess it is what it is.

Comment 11 Oliver Henshaw 2011-01-18 15:44:41 UTC

Is it feasible to add lvcheck (for regular cron-scheduled checking of snapshots) and revert to the default fsck intervals? That way, boot-time fscks are indefinitely delayed by successful snapshot-time fscks; a failed snapshot fsck triggers a real fsck on the next boot; finally, a boot-time fsck eventually happens even if the cron-scheduled fscks repeatedly fail to start, or complete.
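
The policy I have in mind, as a small sketch (the function name, result strings, and action names are hypothetical; this is not a description of how lvcheck itself is implemented):

```python
def after_snapshot_check(result):
    """Map the outcome of a cron-scheduled snapshot fsck to a follow-up action."""
    if result == "clean":
        return "reset-counters"        # boot-time fsck pushed further out
    if result == "errors":
        return "force-fsck-next-boot"  # a real fsck must repair the origin
    # Check never started or never finished: leave the counters alone, so
    # the default interval still triggers a boot-time fsck eventually.
    return "leave-counters"

for outcome in ("clean", "errors", "did-not-run"):
    print(outcome, "->", after_snapshot_check(outcome))
```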

Comment 12 David Lehman 2011-02-15 22:19:41 UTC

I have committed and pushed a patch for rawhide (not F15) that removes all non-default setting of options via tune2fs. The non-default settings we applied were disabling time- and mount-based fsck intervals and enabling posix acls and user-defined xattrs.

Comment 13 David Lehman 2011-02-16 22:40:20 UTC

(In reply to comment #11)
> Is it feasible to add lvcheck (for regular cron-scheduled checking of
> snapshots) and revert to the default fsck intervals? That way, boot-time
> fscks are indefinitely delayed by successful snapshot-time fscks; a failed
> snapshot fsck triggers a real fsck on the next boot; finally, a boot-time fsck
> eventually happens even if the cron-scheduled fscks repeatedly fail to start,
> or complete.

lvcheck is not a viable solution since not all filesystems are on lvm storage. Perhaps I misunderstand or the name is misleading?

Comment 14 Eric Sandeen 2011-02-16 22:55:16 UTC

It'd work on any snapshottable storage, in theory, or could be expanded to do so, but it is only useful with snapshots.  It's not a complete solution.

Oliver, lvcheck in Fedora would be nice; someone should champion it as a feature in a future release, hint hint...

Comment 15 David Lehman 2011-02-16 23:06:13 UTC

I agree that the lvcheck script is a very nice idea and is worthy of inclusion in Fedora -- it just doesn't fill _all_ of our fscking needs.