Bug 1213778

Summary:

drops into emergency mode without any error message if it cannot find a filesystem in /etc/fstab

Product:

Red Hat Enterprise Linux 7

Reporter:

Martin Steigerwald <martin.steigerwald>

Component:

systemd

Assignee:

systemd-maint

Status:

CLOSED WORKSFORME

QA Contact:

qe-baseos-daemons

Severity:

unspecified

Priority:

unspecified

Version:

7.3

CC:

christoph.buchmann, devel, elliot.li.tech, giulioo, james.xiong, mgandhi, mschmidt, msekleta, robert.coner3, rodolfocasas, systemd-maint-list, vedran, vorpal, wibrown, yaplej

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

Bug Fix

Last Closed:

2015-06-10 12:24:24 UTC

Type:

Bug

Attachments:

Description	Flags
screenshot of failed boot	none

Description Martin Steigerwald 2015-04-21 09:32:52 UTC

Created attachment 1016755 [details]
screenshot of failed boot

This originally happened with CentOS 7.1, but I think it will happen in RHEL as well. I have seen bug 1188864, "Entry in fstab to mount ISO causes boot into emergency mode", which was CLOSED NOTABUG,

but… this report is about systemd not even printing an error message in case of a failed boot. I think this clearly is a bug. At least it is a regression from sysvinit.



Description of problem:

/etc/fstab contained this entry for /boot:

LABEL=bootfs            /boot                   xfs     defaults        1 2

but the filesystem didn't have that label, due to an omission when configuring it.
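
A mismatch like this can be caught before rebooting. The following is a minimal sketch, not a complete checker: it assumes that udev exposes labels and UUIDs as links under /dev/disk/by-label and /dev/disk/by-uuid, it does not handle quoted values or labels containing characters that udev escapes, and both the function name and the DEV_DISK_DIR override are hypothetical.

```shell
#!/bin/sh
# Report fstab entries whose LABEL=/UUID= source has no matching device link.
# DEV_DISK_DIR is a hypothetical override for testing; it defaults to the
# udev-managed /dev/disk directory.
check_fstab_sources() {
    fstab="${1:-/etc/fstab}"
    disk_dir="${DEV_DISK_DIR:-/dev/disk}"
    while read -r src mnt _rest; do
        case "$src" in
            LABEL=*) link="$disk_dir/by-label/${src#LABEL=}" ;;
            UUID=*)  link="$disk_dir/by-uuid/${src#UUID=}" ;;
            *)       continue ;;   # comments, blank lines and device paths are skipped
        esac
        [ -e "$link" ] || echo "missing: $src (mount point $mnt)"
    done < "$fstab"
}
```

Calling `check_fstab_sources` with no argument reads /etc/fstab; any line it prints names an entry that would currently fail to resolve.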


Version-Release number of selected component (if applicable):

systemd-208-20.el7_1.2.x86_64


How reproducible:

Always


Steps to Reproduce:
1. Have an entry like above in /etc/fstab,
2. but no filesystem with that label
3. Reboot

Actual results:
Drops into emergency mode without any explanation as to why it does so.


Expected results:
At least prints an error message as to why it is dropping me into emergency mode. Ideally gives option to try booting anyway.


Additional info:
See attached screenshot for the result of boot with that configuration and without "quiet" kernel command line argument.

journalctl -xb output that finally told me what happened:

-- The start-up result is done.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Job dev-disk-by\x2dlabel-bootfs.device/start timed out.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Timed out waiting for device dev-disk-by\x2dlabel-bootfs.device.
-- Subject: Unit dev-disk-by\x2dlabel-bootfs.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit dev-disk-by\x2dlabel-bootfs.device has failed.
-- 
-- The result is timeout.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Dependency failed for /boot.
-- Subject: Unit boot.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit boot.mount has failed.
-- 
-- The result is dependency.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Dependency failed for Local File Systems.
-- Subject: Unit local-fs.target has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit local-fs.target has failed.
-- 
-- The result is dependency.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Dependency failed for Relabel all filesystems, if necessary.
-- Subject: Unit rhel-autorelabel.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The result is dependency.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Dependency failed for Mark the need to relabel after reboot.
-- Subject: Unit rhel-autorelabel-mark.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit rhel-autorelabel-mark.service has failed.
-- 
-- The result is dependency.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Dependency failed for /boot/efi.
-- Subject: Unit boot-efi.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit boot-efi.mount has failed.
-- 
-- The result is dependency.
Apr 21 10:13:27 lintraincentos7 systemd[1]: Dependency failed for File System Check on /dev/disk/by-label/bootfs.
-- Subject: Unit systemd-fsck@dev-disk-by\x2dlabel-bootfs.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit systemd-fsck@dev-disk-by\x2dlabel-bootfs.service has failed.
-- 
-- The result is dependency. 


Why didn't I see a *failed* for this unit?

Comment 2 Michal Schmidt 2015-04-22 17:44:05 UTC

That the system enters emergency mode is not a bug. systemd cannot know if continuing without the mount would be safe, so it fails safe. systemd would ignore the failure to mount if the /etc/fstab line had the "nofail" option.
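
To see which entries are affected by this policy, the fstab options can be audited. This is a rough sketch under stated assumptions: it treats every entry that has neither "nofail" nor "noauto" as one that can block the boot, which matches the behavior described here for local filesystems but ignores other cases such as network mounts. The function name is hypothetical.

```shell
#!/bin/sh
# Print "source mountpoint" for fstab entries that have neither the
# "nofail" nor the "noauto" option. Defaults to /etc/fstab.
list_blocking_mounts() {
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }   # skip comments and incomplete lines
        {
            n = split($4, opts, ",")
            skip = 0
            for (i = 1; i <= n; i++)
                if (opts[i] == "nofail" || opts[i] == "noauto") skip = 1
            if (!skip) print $1, $2
        }
    ' "${1:-/etc/fstab}"
}
```

The output will normally include the root filesystem, which is expected; the entries worth reviewing are the ones the machine could safely boot without.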

It is a valid usability bug that the emergency message is not more specific. As you could see, the actual error can indeed be discovered using the suggested journalctl command, but it is not very convenient.
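
Until the message improves, the relevant lines can be picked out of the journal by hand. The sketch below filters journal text for the two message patterns that appear in the output quoted in the description; the function name is hypothetical, and other failure causes will use other wording.

```shell
#!/bin/sh
# Keep only the journal lines that name a timed-out device or a failed
# dependency. Reads journal text on stdin, for example:
#   journalctl -b | boot_failure_lines
boot_failure_lines() {
    grep -E 'Timed out waiting for device|Dependency failed for'
}
```

Independently of this filter, journalctl's -p option restricts output by priority, so `journalctl -b -p err` shows only messages of priority "err" and more severe.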

Comment 3 Martin Steigerwald 2015-04-23 08:57:01 UTC

Michal, I understand that this is a policy decision. And I know about nofail. And yes, it can be argued either way. It's a policy change, though, from the actual sysvinit (and also upstart?) way of doing things, so an error message is even more important.

Bonus points for adding a way to override the policy in order to bring the system up and running. It would clearly help the following scenario, where this policy decision causes issues:

- Server needs to be rebooted to activate kernel update or whatever.
- Admin reboots server and expects the machine to be up again in say 5 minutes.
- Admin then checks and the server is not up.
- Admin sees that the server has dropped into emergency mode.
- Admin currently does not see why.
- Admin has to browse journalctl and understand systemd's dependency model well enough to find out.
- Maybe the remote console has issues with the keyboard layout and the admin cannot type "journalctl -xb" at all (this was so in my case, due to other broken or misconfigured software).

This can turn an expected server downtime of 5 minutes into a downtime of half an hour or more. Now imagine this on a production server. Do you see this as an improvement compared to just booting the system and logging an error about the failed mount?

So my expected behavior would be:

Cannot boot system because home.mount cannot be initialized.

Proceed anyway or drop into emergency mode?

Say proceed anyway, be done with it, fix the mount in /etc/fstab then. Then you have SSH without broken keyboard layout and everything available. Or say "drop into emergency mode" if this is a very important mount you want to fix before booting.

From a usability point of view this is much saner than saying "systemd cannot know and thus we *enforce* policy on the user, without even telling him so", breaking learned workflows by doing so. Enforcing policy on the grounds of "we cannot know" without giving an easy option to opt out… in my opinion worsens the user experience.

Also: if an important mount is missing, I have usually found out, because then something important does not work on the server. So even that policy decision is questionable. I understand the reason for it: it makes a missing mount more visible. So if you decide on that policy, I think it would be sane to give the admin the option to opt out in order to bring up the machine and fix things later. And if /boot is not there, I will surely find out on the next grub update.

It's important to think about things from the point of view of the users of this system. A user who is faced with such behavior usually doesn't care about technically founded decisions (like those on service dependencies) at all. Don't be arrogant to the user / admin. There are some who know what they are doing.

Comment 4 Martin Steigerwald 2015-04-23 09:04:55 UTC

Ok, I understand how having /boot mounted can be important. If it's not mounted, a grub or kernel upgrade will get installed into the wrong location. That would be a reason for your policy decision.

So let this bug be about the error reporting; the policy is arguable. Please at least provide a clear error message indicating a way to opt out in order to bring up the machine quickly again. I think for a good admin / user experience this would be a huge improvement over the current situation.

Dropping into emergency mode is a *major* error condition. So provide an error message.

Comment 5 Lars Kruse 2015-05-18 23:09:22 UTC

Sadly the boot process is not only interrupted for important mountpoints (e.g. /boot), but also for any other line in /etc/fstab that fails for any reason (be it a network filesystem, a human error, a technical failure, ...).

I just came back from visiting, in person, a server that I had not seen for years. This server stopped booting due to an outdated entry in /etc/fstab.
This behaviour may be acceptable for local computers but it is clearly a dangerous (and maybe expensive) decision for remote servers (so it was for me).

Just assume the situation of a bunch of servers somewhere that just recovered from a power outage. Every single server would drop dead (not being reachable remotely) if the NFS server is not quick enough during the boot procedure (maybe its filesystem check takes a bit longer due to its bigger size).

Thus I doubt that the systemd policy of "enter emergency boot" is suitable for most non-virtual server setups.

Regarding "nofail": my "fstab" man page says: "do not report errors for this device if it does not exist". "Emergency mode" or "fails to boot" is not mentioned.
This also indicates that the current behaviour (stopping and not reporting any error at all) is not wanted and certainly not expected.

Comment 6 bob coner 2015-06-08 18:14:31 UTC

Note that I was hit with this same bug/feature when I recently upgraded from CentOS 7.0 to 7.1. The workaround I found was to add the options noauto,x-systemd.automount to the filesystems in /etc/fstab that you want systemd to defer mounting until later.
Credit to nous in the following thread:  https://bbs.archlinux.org/viewtopic.php?id=147478
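
Adding such options can be scripted. The following is a minimal sketch with a hypothetical function name; it writes the modified fstab to standard output rather than editing the file in place, and because awk rebuilds the changed line, that line's column alignment collapses to single spaces.

```shell
#!/bin/sh
# Print a copy of an fstab file with extra options appended to the entry
# for one mount point.
# Usage: add_mount_options FSTAB MOUNTPOINT OPTIONS
add_mount_options() {
    awk -v mp="$2" -v extra="$3" '
        !/^[[:space:]]*#/ && NF >= 4 && $2 == mp { $4 = $4 "," extra }
        { print }
    ' "$1"
}
```

For example, `add_mount_options /etc/fstab /data noauto,x-systemd.automount` (with /data as a placeholder mount point) prints the result for review before it replaces the real file.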

Comment 7 Martin Steigerwald 2015-06-10 12:44:59 UTC

Michal, why did you close this? You didn't provide any explanation other than "WORKSFORME", which I think is not sufficient to close the bug.

I am not keen to add the non-standard "x-systemd.automount" in order to get sane behavior. I think it is important to be able to start an ssh service when some filesystem could not be mounted, without adding a non-standard option to fstab.

Comment 8 Michal Sekletar 2015-06-10 13:09:54 UTC

Yes, this is a policy decision, and the current recommendation is to use nofail to get what you want. Please discuss upstream if you disagree.

Comment 9 Martin Steigerwald 2015-06-10 15:45:23 UTC

Well, I disagree. But from my past experience with discussing things with upstream I conclude it would likely be a total waste of my and their energy to discuss it with them. I hope when I ever find this in any Debian setup that the maintainers will not blindly follow upstream decision in a case like this.

As I do not use RHEL/CentOS on a regular basis and you are one of the maintainers of these packages, that's all from me for now.

(That said, I think this kind of handling of bug reports from users is one of the reasons systemd triggers so much polarization in the Linux world. It's not the first time I have seen this pattern.)

Comment 10 Michal Sekletar 2015-06-10 21:42:59 UTC

I agree that closing the bug was maybe a bit harsh, sorry about that. However, please understand that I just wanted to end a discussion which was going a bit off track IMHO.

Also I think that now more than ever it is *very* important to discuss such policy decision upstream because systemd was adopted by all major distros.

Next, please understand that we will not digress from upstream behavior for no good reason. The arguments you made were basically your interpretation of the expected behavior, which is just not enough. Hence we (downstream maintainers) will not propose this upstream. Thus there is no point in keeping the bug open if we already know we would just defer it indefinitely.

And yes, man page can explain clearly that all nofail mount points will cause emergency mode in case they can't be mounted during boot. But this is something to discuss in separate bug.

Comment 11 Michal Sekletar 2015-06-10 21:45:14 UTC

(In reply to Michal Sekletar from comment #10)

> And yes, man page can explain clearly that all nofail mount points will
> cause emergency mode in case they can't be mounted during boot. But this is
> something to discuss in separate bug.

I mean the other way around of course.

Comment 12 Lars Kruse 2015-06-11 04:34:14 UTC

(In reply to Michal Sekletar from comment #10)
> Next, please understand that we will not digress from upstream behavior for
> no good reason. Arguments you made were basically your interpretation of the
> expected behavior.

No.
The above arguments (and thus this bug report) were not based on personal taste, but on:
1) the current documentation (see man fstab)
2) the behaviour of all previous RHEL releases

Besides this I agree with your words.

Comment 13 Martin Steigerwald 2015-06-11 07:52:52 UTC

Michal, thank you for your explanation. I can understand your point of view.

For now I won't discuss this with upstream. My last attempts to discuss anything with upstream ended in disaster: despite my (I think successful) attempts to stay away from any personal attacks myself, I received personal attacks. I am not willing to go through any of this again.

Comment 14 giulioo 2016-01-02 08:30:18 UTC

The current behavior is very user-unfriendly, to put it mildly.

To add to what has already been said:

1) The emergency message advises you to run
    journalctl -xb
but this is not good advice. Good advice would be to run
    journalctl -xb -p3
so you can see immediately what went wrong.


2) Let's say you realize fstab was the problem. You then correct the problem in fstab and press Ctrl-D, and what happens? This happens:

    Error getting authority. Error initializing authority: Could not connect:    
    No such file or directory  (g-io-error-quark, 1)
    
I understand this is a message from polkit, and surely there's a reason or policy that states this is the right thing to happen at this very moment, but it's definitely not helping at all.

The system then stalls for another 60 secs, and then you are again presented with the emergency message.

You then hopefully realize that maybe fstab-generator early in the boot created unit files from fstab and systemd is using the "old" unit files instead of the updated fstab.

So you try
   systemctl daemon-reload
and you get
    Error getting authority. Error initializing authority: Could not connect: No such file or directory  (g-io-error-quark, 1)

This time the system won't stall, so you press Ctrl-D again and you get
    Error getting authority. Error initializing authority: Could not connect: No such file or directory  (g-io-error-quark, 1)

but this time after a few seconds the system will complete the boot successfully.
========================================================================
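
The sequence that eventually worked above can be summarized as a sketch. It assumes that `systemctl default` has the same effect as pressing Ctrl-D at the emergency prompt; the function name and its `--apply` flag are hypothetical, and without that flag the commands are only printed.

```shell
#!/bin/sh
# After correcting /etc/fstab in emergency mode: reload systemd so the mount
# units generated from fstab are rebuilt, then continue to the default target.
recover_after_fstab_fix() {
    run=echo                       # dry run by default: only print the commands
    if [ "${1:-}" = "--apply" ]; then
        run=                       # hypothetical flag: actually execute them
    fi
    $run systemctl daemon-reload   # regenerate units from the edited fstab
    $run systemctl default         # leave emergency mode
}
```

As reported above, both steps may still print the "Error getting authority" message on the affected version.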

This is very bad behavior. It leaves you with the impression that there are unhandled issues everywhere.

You are basically saying "it's upstream, everybody is in the same boat, you talk to upstream".

I think you as the maintainer should state whether you think the current behavior is OK or not.

If you think it's OK, then fine.

If you think it's not OK, maybe you should be the one to talk to upstream since as the maintainer for a big distro you have more chance to get some results.

Comment 15 Justin 2016-02-05 14:28:17 UTC

Hi there,

I just ran into this same problem and was able to fix it by changing "defaults" to "defaults,nofail" in my boot entry in /etc/fstab.

What I noticed is that systemd was trying to mount the device as "dev-disk-by\x2duuid..."

When I checked "/dev/disk/by-*" there is no "by\x2duuid" directory, only "by-uuid". Is this right/correct? The UUID listed matched the one that was listed in /etc/fstab.
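
For context on this question: the "\x2d" is systemd's escaping in unit names, not part of the device path. When a path is turned into a unit name, "/" becomes "-", so a literal "-" has to be written as "\x2d". Below is a minimal sketch of that mapping which handles only these two characters; the function name is hypothetical, and systemd versions that ship the systemd-escape tool implement the full rules.

```shell
#!/bin/sh
# Convert a device path to the corresponding systemd .device unit name.
# Simplified: only "-" and "/" are handled, other special characters are not.
path_to_device_unit() {
    printf '%s\n' "$1" | sed \
        -e 's|^/*||' \
        -e 's|/*$||' \
        -e 's|-|\\x2d|g' \
        -e 's|/|-|g' \
        -e 's|$|.device|'
}
```

With this mapping, /dev/disk/by-label/bootfs corresponds to the unit dev-disk-by\x2dlabel-bootfs.device that appears in the journal output in the description.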

Comment 16 Jakub Jelen 2016-04-28 14:23:05 UTC

*** Bug 1213781 has been marked as a duplicate of this bug. ***

Comment 17 BugMasta 2016-10-28 19:54:08 UTC

Giulio is spot-on:

(In reply to giulioo from comment #14)

The way this bug has been handled is a disgrace.

At a minimum, if Red Hat has any respect for its customers, and if systemd has any respect for its users, someone needs to follow Giulioo's suggestion and change the advice given when dropping to emergency mode so it suggests:

journalctl -xb -p3

The current suggestion, journalctl -xb, is NOT GOOD ENOUGH.

With the -p3 option, we can immediately see what the issue is, i.e. what causes the drop to emergency mode. Without -p3, IT IS IMPOSSIBLE TO SEE WHAT HAS CAUSED THE PROBLEM.

This bug report appears to be the only source on the entire internet which suggests the additional -p3 option in order to directly find the cause of the problem. People could google around for hours and still never find this vital piece of advice.

Have some respect for users and modify the message upstream, so it sticks, and so it can save thousands of people much anguish & gnashing of teeth.

Comment 18 Christoph Buchmann 2017-06-14 10:06:07 UTC

(In reply to giulioo from comment #14)

Giulio, 

thanks for providing the procedure to restore the system. A simple reboot after correcting /etc/fstab did not work on my system.
In general I can only confirm that the provided error message is neither clear nor helpful.
So don't call it a bug, call it a feature request. BUT CHANGE IT!

Regards
Christoph