Bug 719952
| Summary: | reboot or shutdown commands unresponsive during systemd-fsck | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Russ <admin> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Priority: | unspecified |
| Version: | rawhide | CC: | erappleman, esandeen, fche, harald, johannbg, lpoetter, metherid, mschmidt, mzdunek, plautrba, rvokal, zbyszek |
| Keywords: | FutureFeature, Reopened | Doc Type: | Enhancement |
| Hardware: | Unspecified | OS: | Unspecified |
| Last Closed: | 2015-05-31 12:11:19 UTC | | |
Description
Russ
2011-07-08 13:58:04 UTC
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

Is this still an issue or can this bug be closed?

Closing this bug. You can reopen it if this is still an issue.

Arguably users should not be allowed to reboot or shut down a system while it's performing fsck...

"Closing this bug. You can reopen it if this is still an issue."

This is still a very problematic issue for us with 37-13. Therefore I am reopening the bug.

"Arguably users should not be allowed to reboot or shut down a system while it's performing fsck..."

Not interrupting an fsck would be the ideal thing in an ideal world. But unfortunately this is not an ideal world. Here is a scenario where the inability to bypass the fsck is a major problem:

1) User boots (or reboots) the computer.
2) The file system has reached the maximal mount count.
3) Systemd dutifully performs an fsck as required.
4) Filesystem is > 500GB. Therefore the fsck WILL normally take over an hour.
5) This is a system that is needed immediately. Possibly it is a critical server. Management will not tolerate waiting for the fsck to complete.
6) User has no method to interrupt the fsck and perform it later.
7) The only option to get the system running again is to pull the plug & reboot.
8) The power loss corrupts the filesystem.
9) To prevent future such issues the user disables maximal mount checks.

The ability to bypass the fsck was present in SystemV by pressing Ctrl-Alt-Del or the power button. Many people need that ability back. There was a reason why it was present. We are only requesting reinstatement of an existing feature. The feature is needed primarily for administrators.

The reason for the fsck is to avoid filesystem corruption. Which is better: to ALLOW the user to GRACEFULLY bypass the fsck, shut down the system, and perform the check later, or to REQUIRE them to hit the reset button or pull the plug, thus risking corruption of the filesystem?

Interruptible fsck is an item on the upstream TODO list:

* There's currently no way to cancel fsck (used to be possible via C-c or c on the console)

http://cgit.freedesktop.org/systemd/systemd/tree/TODO?id=eecd1362f7f4de432483b5d77c56726c3621a83a#n127

A fix is not likely to get into F15. Setting as rawhide+FutureFeature to avoid autoclosing.

By the way...

(In reply to comment #4)
> 4) Filesystem is > 500GB. Therefore the fsck WILL normally take over an hour.

What filesystem type do you use? fsck of ext4 is usually much faster than that.

> 8) The power loss corrupts the filesystem.

This is odd. fsck mostly reads the filesystem. Writes are rare. So I don't see why a power loss at this point would be likely to cause corruption.

Slightly OT but regarding the forced-fsck scenario:

> 2) The file system has reached the maximal mount count.

commit 3daf592646b668133079e2200c1e776085f2ffaf
Author: Eric Sandeen <sandeen>
Date: Thu Feb 17 15:55:15 2011 -0600
e2fsprogs: turn off enforced fsck intervals by default

has been there since 1.42, so filesystems mkfs'd after that won't run into an unexpected forced fsck.

> 4) Filesystem is > 500GB. Therefore the fsck WILL normally take over an hour.

e2fsck on 500G ext4 filesystems should take nowhere near an hour to complete, in general.

> 8) The power loss corrupts the filesystem.

Power loss will not corrupt a journaling filesystem on properly configured storage, i.e. storage which properly handles write barriers.

Now, I agree that it might be nice to cancel out of fsck, but I'm not sure the above rationale, specifically, holds up well.

> Interruptible fsck is an item on the upstream TODO list:

Excellent! The issue is not just to be able to interrupt the fsck with ctrl-c, however, but to also have a response to the power button and/or ctrl-alt-del. That alone would solve the problem by allowing a graceful shutdown/reboot.

In response to the other comments:

> What filesystem type do you use? fsck of ext4 is usually much faster than
> that.

Unfortunately most of our systems are still using ext3. We are trying to get them upgraded slowly, but many systems are servers, and mgmt. is slow to change.

> This is odd. fsck mostly reads the filesystem. Writes are rare. So I don't see
> why a power loss at this point would be likely to cause corruption.

It's not the big data partitions that get corrupted, fortunately. But the root partition has already been mounted while /home is being fscked. When the guys pull the plug to reboot the systems to get them back online, the root partition gets corrupted. For some reason it is usually the yum database that gets impacted. I assume there are some writes going on there at boot time.

> e2fsprogs: turn off enforced fsck intervals by default
> has been there since 1.42, so filesystems mkfs'd after that won't run into an
> unexpected forced fsck.

Thanks for the patch, Eric! Another good reason for us to upgrade the filesystems, and also justification to disable maximal-mount-count on the remaining servers which have not already had it done.

> Power loss will not corrupt a journaling filesystem on properly configured
> storage, i.e. storage which properly handles write barriers.

Thus far no MAJOR corruption has occurred. But on some earlier occasions where the plug was pulled on a system during an fsck, it seems there were some early writes to the yum database which did not get completed. The yum database needed to be rebuilt when that occurred. It was on F15 when something like that occurred. Hopefully that issue has been corrected in the current Fedora release.

But no matter how well the journaling works, and even if no corruption occurs, the system should never get so unresponsive as to ignore the power button or ctrl-alt-del, if they are enabled.

It turns out the non-responsiveness, at least to CTRL+ALT+DEL, should be easy to fix. Try adding "Conflicts=shutdown.target" to the [Unit] section of fsck@.service. I cannot think of a reason for not putting it there by default.

*** Bug 799574 has been marked as a duplicate of this bug. ***

It should be possible to cancel shutdown under plymouth, with the code added post systemd-219.

With recent changes, reporting is improved compared to v219, but there's no way to cancel things. It's unclear what the solution should be.

In general: I am pretty sure that cancelling an fsck is not really something we should do by default in our codepaths, since it is simply not clear whether that's a safe operation. If the user chooses to do so manually with tools like kill and after authenticating as root, then that's ok. However, I am pretty sure that cancelling fsck should not be default or easy UI-exposed behaviour, and should not be available without authentication. Providing a boot-time UI for cancelling fsck via C-c is hence nothing we want to support in systemd. Moreover, cancelling it on ctrl-alt-del just like that isn't really an option either.

I will hence close this bug now. I understand that not providing this is inconvenient in many cases, but I think we should really make sure to be safe and secure by default, and not expose unsafe behaviour in the UI, especially not without authentication.

> it is simply not clear whether that's a safe operation
e2fsck is cancelable; there is a specific exit code (32) to indicate this, and there is a specific option to make that transparent to the boot process and exit with (0) instead; see
commit eb065ccf181d49cd1a3709bf607c25d07a6322f1
Author: Theodore Ts'o <tytso>
Date: Sat Dec 31 00:52:23 2005 -0500
Add allow_cancellation config option
If the e2fsck configuration file sets the allow_cancellation option to be
true, then if the filesystem does not have any known problems, and was
known to be cleanly unmounted, then let e2fsck exit with a status code of 0
instead of 32 (FSCK_CANCELED) so that the bootup scripts will continue
without stopping the boot. (Addresses Debian Bug: #150295)
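For illustration, a minimal sketch of how the option described in that commit message could be enabled. It assumes the setting lives in the [options] stanza of /etc/e2fsck.conf; check e2fsck.conf(5) on the target system before relying on it.

```
# /etc/e2fsck.conf (assumed location; see e2fsck.conf(5))
[options]
	# Exit with 0 instead of 32 when the check is interrupted and the
	# filesystem is not flagged as containing errors.
	allow_cancellation = true
```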
Other work has been done to explicitly support cancelation:
commit bee24f364ca921e10cefa0d3241b61383aa62866
Author: Valerie Aurora Henson <vaurora>
Date: Tue Aug 4 22:48:15 2009 -0400
e2fsck: Allow cancellation during group descriptor checks.
Signed-off-by: Valerie Aurora Henson <vaurora>
Signed-off-by: "Theodore Ts'o" <tytso>
...
+ /* If the user aborts e2fsck by typing ^C, stop right away */
+ if (ctx->flags & E2F_FLAG_SIGNAL_MASK)
xfs_repair is safe to cancel as well (from xfs_repair(8): "Interrupting a stuck xfs_repair is safe.") I can't speak to btrfsck.
But in general, the ability to cancel a running fsck is supported by the major filesystem repair tools. Canceling the run on a clean filesystem should have no effect whatsoever; canceling it on a corrupted filesystem should leave it in no worse shape than it started.
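As a rough illustration of the exit-code point above, here is a minimal Python sketch of how a boot-time wrapper might tell a canceled check apart from a real failure. The bit values are the ones documented in fsck(8); the function names are hypothetical and not part of systemd or e2fsprogs.

```python
# Exit-status bits as documented in fsck(8); the status is a bitwise OR.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}

FSCK_CANCELED = 32  # the status quoted in the e2fsck commit message above


def describe_fsck_status(status):
    """Return a list of human-readable meanings for an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in FSCK_EXIT_BITS.items() if status & bit]


def was_canceled(status):
    """True if the canceled bit is set in the exit status."""
    return bool(status & FSCK_CANCELED)


if __name__ == "__main__":
    for status in (0, 32, 33, 4):
        print(status, describe_fsck_status(status), was_canceled(status))
```

A caller could treat a status with only the canceled bit set as "check skipped, retry at next boot" rather than as an error, which is the behaviour requested in this bug.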