Bug 719952

Summary:	reboot or shutdown commands unresponsive during systemd-fsck
Product:	[Fedora] Fedora	Reporter:	Russ <admin>
Component:	systemd	Assignee:	systemd-maint
Status:	CLOSED WONTFIX	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rawhide	CC:	erappleman, esandeen, fche, harald, johannbg, lpoetter, metherid, mschmidt, mzdunek, plautrba, rvokal, zbyszek
Target Milestone:	---	Keywords:	FutureFeature, Reopened
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-05-31 12:11:19 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Russ 2011-07-08 13:58:04 UTC

Description of problem:
An fsck of a large filesystem can take a very long time. Users therefore often need to reboot or shut down the system during an unscheduled, long-running maximal-mount-count filesystem check to avoid lengthy delays when booting. In the standard SystemV boot process, this reboot/shutdown was accomplished simply by pressing Ctrl-Alt-Del or the power button. Systemd does not respond to reboot/shutdown requests from Ctrl-Alt-Del or the power button while an fsck is in progress. Using the power button to shut down the system thus forces a hard shutdown and always results in corruption of the root filesystem. It also frequently corrupts the Yum database. This bug is filed as high severity because of the corruption of the root filesystem and the Yum database when attempting a reboot/shutdown.

Version-Release number of selected component (if applicable):
26-5

How reproducible:
Always

Steps to Reproduce:
1. Boot system with a file system at maximal mount count.
2. Attempt reboot or shutdown during systemd-fsck.
3. Note if systemd responds to reboot/shutdown requests with a full reboot or shutdown.
  
Actual results:
Reboot/shutdown requests during systemd-fsck result in a hung system.

Expected results:
Systemd should respond to reboot/shutdown requests without waiting for other processes to complete.

Additional info:

Comment 1 Fedora Admin XMLRPC Client 2011-10-20 16:28:28 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 2 Jóhann B. Guðmundsson 2012-01-25 13:54:59 UTC

Is this still an issue or can this bug be closed?

Comment 3 Jóhann B. Guðmundsson 2012-01-26 22:19:57 UTC

Closing this bug; you can reopen it if this is still an issue.

Arguably users should not be allowed to reboot or shut down a system while it's performing fsck...

Comment 4 Russ 2012-05-05 23:03:56 UTC

"Closing this bug; you can reopen it if this is still an issue."

This is still a very problematic issue for us with 37-13. Therefore I am reopening the bug.

"Arguably users should not be allowed to reboot or shut down a system while it's
performing fsck..."

Not interrupting an fsck would be the ideal thing in an ideal world. But unfortunately this is not an ideal world. Here is a scenario where the inability to bypass the fsck is a major problem:

1) User boots (or reboots) the computer.
2) The file system has reached the maximal mount count.
3) Systemd dutifully performs an fsck as required.
4) Filesystem is > 500GB. Therefore the fsck WILL normally take over an hour.
5) This is a system that is needed immediately. Possibly it is a critical server. Management will not tolerate waiting for the fsck to complete.
6) User has no method to interrupt the fsck and perform it later.
7) The only option to get the system running again is to pull the plug & reboot.
8) The power loss corrupts the filesystem.
9) To prevent future such issues the user disables maximal mount checks.

The ability to bypass the fsck was present in SystemV by pressing Ctrl-Alt-Del or the power button. Many people need that ability back. There was a reason why it was present. We are only requesting reinstatement of an existing feature.

The feature is needed primarily for administrators. The reason for the fsck is to avoid filesystem corruption. Which is better: to ALLOW the user to GRACEFULLY bypass the fsck and shut down the system, and then perform the check later, or to REQUIRE them to hit the reset button or pull the plug, thus risking corruption of the filesystem?

Comment 5 Michal Schmidt 2012-05-06 06:48:00 UTC

Interruptible fsck is an item on the upstream TODO list:

* There's currently no way to cancel fsck (used to be possible via C-c or c on the console)

http://cgit.freedesktop.org/systemd/systemd/tree/TODO?id=eecd1362f7f4de432483b5d77c56726c3621a83a#n127

Comment 6 Michal Schmidt 2012-05-06 06:49:50 UTC

A fix is not likely to get into F15. Setting as rawhide+FutureFeature to avoid autoclosing.

Comment 7 Michal Schmidt 2012-05-06 07:08:07 UTC

By the way...

(In reply to comment #4)
> 4) Filesystem is > 500GB. Therefore the fsck WILL normally take over an hour.

What filesystem type do you use? fsck of ext4 is usually much faster than that. 

> 8) The power loss corrupts the filesystem.

This is odd. fsck mostly reads the filesystem. Writes are rare. So I don't see why a power loss at this point would be likely to cause corruption.

Comment 8 Eric Sandeen 2012-05-07 19:08:02 UTC

Slightly OT but regarding the forced-fsck scenario:

> 2) The file system has reached the maximal mount count.

commit 3daf592646b668133079e2200c1e776085f2ffaf
Author: Eric Sandeen <sandeen>
Date:   Thu Feb 17 15:55:15 2011 -0600

    e2fsprogs: turn off enforced fsck intervals by default

has been there since 1.42, so filesystems mkfs'd after that won't run into an unexpected forced fsck.

> 4) Filesystem is > 500GB. Therefore the fsck WILL normally take over an hour.

e2fsck on 500G ext4 filesystems should take nowhere near an hour to complete, in general.

> 8) The power loss corrupts the filesystem.

Power loss will not corrupt a journaling filesystem on properly configured storage, i.e. storage which properly handles write barriers.

Now, I agree that it might be nice to cancel out of fsck, but I'm not sure the above rationale, specifically, holds up well.

Comment 9 Russ 2012-05-08 14:18:29 UTC

> Interruptible fsck is an item on the upstream TODO list:

Excellent!

The issue is not just to be able to interrupt the fsck with ctrl-c, however, but to also have a response to the power button and/or ctrl-alt-del. That alone would solve the problem by allowing a graceful shutdown/reboot.

In response to the other comments:

> What filesystem type do you use? fsck of ext4 is usually much faster than that.

Unfortunately, most of our systems are still using ext3. We are slowly upgrading them, but many of the systems are servers, and management is slow to change.

>This is odd. fsck mostly reads the filesystem. Writes are rare. So I don't see
>why a power loss at this point would be likely to cause corruption.

It's not the big data partitions that get corrupted, fortunately. But the root partition has already been mounted while /home is being fscked. When the guys pull the plug to reboot the systems to get them back online the root partition gets corrupted. For some reason it is usually the yum database that gets impacted. I assume there are some writes going on there at boot time.

>e2fsprogs: turn off enforced fsck intervals by default
>has been there since 1.42, so filesystems mkfs'd after that won't run into an
>unexpected forced fsck.

Thanks for the patch, Eric! Another good reason for us to upgrade the filesystems, and also justification to disable maximal-mount-count on the remaining servers which have not already had it done.
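
For the remaining servers, disabling the forced checks could look roughly like the sketch below. It only builds the tune2fs command line and prints it rather than running it. The `-c 0` and `-i 0` options are the ones documented in tune2fs(8) for ignoring the mount count and disabling time-based checks. The function name and the device `/dev/sdb1` are placeholders, not taken from this report.

```python
import shlex


def tune2fs_disable_forced_checks(device):
    """Build a tune2fs command that turns off periodic forced checks.

    -c 0: e2fsck and the kernel disregard the mount count (tune2fs(8)).
    -i 0: disables the time-dependent checking (tune2fs(8)).
    """
    return ["tune2fs", "-c", "0", "-i", "0", device]


if __name__ == "__main__":
    # /dev/sdb1 is a placeholder device name used only for illustration.
    print(shlex.join(tune2fs_disable_forced_checks("/dev/sdb1")))
```

Running the printed command needs root, and it only changes when a check is forced; it does not remove the need to run fsck on a filesystem that is flagged as having errors.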

>Power loss will not corrupt a journaling filesystem on properly configured
>storage, i.e. storage which properly handles write barriers.

Thus far no MAJOR corruption has occurred. But, on some earlier occasions where the plug was pulled on a system during an fsck, it seems there were some early writes to the yum database which did not get completed when the plug was pulled. The yum database needed to be rebuilt when that occurred. It was on F15 when something like that occurred. Hopefully that issue has been corrected in the current Fedora release. 

But no matter how well the journaling works, and even if no corruption occurs, the system should never become so unresponsive that it ignores the power button or ctrl-alt-del, if they are enabled.

Comment 10 Michal Schmidt 2012-05-11 10:59:30 UTC

It turns out the non-responsiveness at least to CTRL+ALT+DEL should be easy to fix.
Try adding "Conflicts=shutdown.target" to the [Unit] section of fsck@.service.
I cannot think of a reason for not putting it there by default.
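
As a sketch of that suggestion, assuming a systemd version that supports unit drop-in directories, the setting could be added without editing the packaged unit file. The drop-in file name below is hypothetical, and on systemd versions where the unit is named systemd-fsck@.service the directory name would have to match that instead.

```ini
# Hypothetical drop-in: /etc/systemd/system/fsck@.service.d/10-shutdown.conf
[Unit]
# Starting shutdown.target stops units that conflict with it (systemd.unit(5)).
Conflicts=shutdown.target
```

After adding the file, `systemctl daemon-reload` makes systemd pick it up. This only makes a shutdown request stop the running check; it does not address whether stopping the check at that point is safe.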

Comment 11 Lennart Poettering 2012-10-30 18:11:42 UTC

*** Bug 799574 has been marked as a duplicate of this bug. ***

Comment 12 Zbigniew Jędrzejewski-Szmek 2015-03-13 04:50:11 UTC

It should be possible to cancel shutdown under plymouth, with the code added post systemd-219.

Comment 13 Zbigniew Jędrzejewski-Szmek 2015-04-29 03:39:07 UTC

With recent changes, reporting is improved compared to v219, but there's no way to cancel things. It's unclear what the solution should be.

Comment 14 Lennart Poettering 2015-05-31 12:11:19 UTC

In general: I am pretty sure that cancelling an fsck is not really something we should do by default in our codepaths, since it is simply not clear whether that's a safe operation. If the user chooses to do so manually with tools like kill, after authenticating as root, then that's OK. However, I am pretty sure that cancelling fsck should not be default or easily UI-exposed behaviour, and should not be available without authentication. Hence, providing a boot-time UI for cancelling fsck via C-c is nothing we want to support in systemd. Moreover, cancelling it on ctrl-alt-del just like that isn't really an option either.

I will hence close this bug now. I understand that not providing this is inconvenient in many cases, but I think we should really make sure to be safe and secure by default, and not expose unsafe behaviour in the UI, especially not without authentication.

Comment 15 Eric Sandeen 2015-06-01 19:42:08 UTC

> it is simply not clear whether that's a safe operation

e2fsck is cancelable; there is a specific exit code (32) to indicate this, and there is a specific option to make that transparent to the boot process and exit with (0) instead; see 

commit eb065ccf181d49cd1a3709bf607c25d07a6322f1
Author: Theodore Ts'o <tytso>
Date:   Sat Dec 31 00:52:23 2005 -0500

    Add allow_cancellation config option
    
    If the e2fsck configuration file sets the allow_cancellation option to be
    true, then if the filesystem does not have any known problems, and was
    known to be cleanly unmounted, then let e2fsck exit with a status code of 0
    instead of 32 (FSCK_CANCELED) so that the bootup scripts will continue
    without stopping the boot.  (Addresses Debian Bug: #150295)
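
For illustration, the option described in this commit message would be set roughly as follows, assuming the `[options]` stanza layout documented in e2fsck.conf(5) and the usual /etc/e2fsck.conf location:

```ini
# /etc/e2fsck.conf
[options]
# Exit with 0 instead of 32 when a check is interrupted on a filesystem
# that is not flagged as containing errors.
allow_cancellation = true
```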

Other work has been done to explicitly support cancelation:

commit bee24f364ca921e10cefa0d3241b61383aa62866
Author: Valerie Aurora Henson <vaurora>
Date:   Tue Aug 4 22:48:15 2009 -0400

    e2fsck: Allow cancellation during group descriptor checks.
    
    Signed-off-by: Valerie Aurora Henson <vaurora>
    Signed-off-by: "Theodore Ts'o" <tytso>

...

+               /* If the user aborts e2fsck by typing ^C, stop right away */
+               if (ctx->flags & E2F_FLAG_SIGNAL_MASK)

xfs_repair is safe to cancel as well (from xfs_repair(8): "Interrupting a stuck xfs_repair is safe.")  I can't speak to btrfsck.

But in general, the ability to cancel a running fsck is supported by the major filesystem repair tools.  Canceling the run on a clean filesystem should have no effect whatsoever, canceling it on a corrupted filesystem should leave it in no worse shape than it started.
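
To make the exit-code discussion concrete, here is a minimal sketch that decodes an fsck exit status into the bit meanings listed in fsck(8) and e2fsck(8). The function names are hypothetical, and the sketch does not reflect how systemd-fsck itself handles the status.

```python
# Exit status bits as documented in fsck(8) and e2fsck(8).
FSCK_STATUS_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}

FSCK_CANCELED = 32


def describe_fsck_status(status):
    """Return the list of conditions encoded in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in FSCK_STATUS_BITS.items() if status & bit]


def was_canceled(status):
    """True if the check was canceled (bit 32 set)."""
    return bool(status & FSCK_CANCELED)


if __name__ == "__main__":
    for status in (0, 32, 33):
        print(status, describe_fsck_status(status), was_canceled(status))
```

With allow_cancellation set, an interrupted check on a filesystem not flagged as having errors would report 0 rather than 32, so a caller like this would see it as a clean run.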