Created attachment 1016756
failure to start ssh when a non critical filesystem cannot be found
Happens with CentOS 7.1, but I bet this will be in RHEL as well.
Description of problem:
I cannot start ssh service with
service sshd start
and I think also, not tested
when systemd cannot find a filesystem in /etc/fstab to mount
see Bug 1213778 - drops into emergency mode without any error message if it cannot find a filesystem in /etc/fstab
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Have LABEL= in /etc/fstab, but no fs with that label
3. Log into emergency shell
4. service sshd start
Actual results:
It doesn't start the ssh service.
Expected results:
It does start the ssh service.
See attached screenshot
It fails because it attempts to start the requirement dependencies of sshd.service, one of which is (indirectly via basic.target, sysinit.target, local-fs.target) the failing mount.
You could either fix the reason for getting into emergency mode first before starting services that require basic.target, or perhaps try "systemctl --ignore-dependencies start sshd", which would likely work, but in general ignoring dependencies is not advisable.
There is one unfortunate aspect of the current behaviour. It does not merely fail the start of sshd, it also re-enters the emergency mode. This is because the OnFailure stanza of local-fs.target gets to run again. This is ugly. And the situation would be even worse if local-fs.target did not fail this time, because then the user would have a system with sshd.service running (with basic.target active), but *not much else*. The emergency shell would stop because it's conflicted by sysinit.target.
For this reason I think it would be better if the services' default dependency "Requires=basic.target" were a Requisite (maybe RequisiteOverridable?) dependency instead. basic.target is still required by multi-user.target, so it would be pulled into the boot transaction, but attempting to start individual services from emergency/rescue targets would no longer pull in a cascade of units. It might be a risky change to do now though.
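The dependency chain described in this comment (sshd.service, then basic.target, sysinit.target and local-fs.target, then the failing mount) can be inspected from the emergency shell. The following is only a minimal sketch, assuming a systemd-based system; the function name inspect_sshd_deps is an illustrative helper, and the exact output differs between systemd versions.

```shell
#!/bin/sh
# Sketch: look at why starting sshd.service pulls in the failing mount.
# inspect_sshd_deps is a hypothetical helper name, not part of systemd.
inspect_sshd_deps() {
    # Units that failed so far (the missing mount should be listed here).
    systemctl --failed --no-pager
    # Dependency tree of sshd.service; basic.target, sysinit.target and
    # local-fs.target are expected to show up in it.
    systemctl list-dependencies --no-pager sshd.service
    # Direct requirement and ordering dependencies of the unit.
    systemctl show -p Requires,After sshd.service
}

if command -v systemctl >/dev/null 2>&1; then
    inspect_sshd_deps || echo "could not query systemd (is it running as PID 1?)"
else
    echo "systemctl not found; nothing to inspect"
fi

# One-off workaround mentioned above (not run here, and not advisable in general):
#   systemctl --ignore-dependencies start sshd
```

The workaround is left as a comment on purpose, since the comment above notes that ignoring dependencies is not advisable in general.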
There are several issues with it:
1) It again doesn't tell me the reason why it doesn't start SSH.
2) SSH does *not* depend on /boot *ever*. The SSH daemon does not need /boot to be mounted. The SSH daemon is completely ignorant about whether a /boot is there or not, it just does not care at all.
3) It does not hint at a way to override the broken dependency (something like -f), even just one time.
4) It tries to mount again and runs into the timeout again, so I even have to wait for a minute or two for it to drop me into emergency mode.
What did I do as a user?
I called /usr/bin/sshd, running ssh by circumventing systemd's service management facility. This worked. But still, ssh is such a basic service, with so few requirements to be fully operational, that it should not depend on any non-essential filesystems in order to be started. I know about nofail, but still: if anything, it would make sense for ssh to check whether it can find /usr/bin, /var/log and /run, and maybe a few others, and if they are there, just run the service. Or stop pretending to know better than me as a user who starts the service in emergency mode in order to provide proper debug information for bug reports.
I understand the technical reasons you stated, Michal, but from a user, usability and common sense point of view this behavior IMO is just terminally broken. Or put otherwise: if you pretend to know *better* than the user, then be very, very sure of it and only make sure that the *actual* requirements for SSH are met. Otherwise do not pretend so. Because with the current systemd configuration it does so, while it obviously doesn't have a clue. It's just arrogant to the user to have software behave like that.
Let's move this to openssh. Maybe they should use DefaultDependencies=no and require only a subset of units (network and root mount?).
Documentation about DefaultDependencies in systemd.unit(5) is quite broad but not specific about _what_ will be disabled.
> [...] It is highly recommended to leave this option enabled for the majority of common units. If set to false, this option does not disable all implicit dependencies, just non-essential ones.
I hope at least you, people around systemd, know what are non-essential ones ...
I can't evaluate impact of such change from this paragraph and from my knowledge about systemd. This should be properly evaluated and tested more system-wide.
As I read through the original report and the other bug, it looks like it is a problem with systemd's behaviour. Skipping fstab errors is a possibility, but what if /home/ directories are not mounted? Or other required ones? Then you will leave sshd to start and die at runtime?
Having sshd service as unique and working even as "emergency login" makes sense, since it is sometimes the most appropriate way to access servers. But I would prefer some system change, than ad-hoc adding options to service file.
(In reply to Jakub Jelen from comment #5)
> Documentation about DefaultDependencies in systemd.unit(5) is quite broad
> but not specific about _what_ will be disabled.
Since this is specific to the unit type, for services it is described in systemd.service(5):
Unless DefaultDependencies= is set to false, service units will implicitly have dependencies of type Requires= and After= on basic.target as well as dependencies of type Conflicts= and Before= on shutdown.target. These ensure that normal service units pull in basic system initialization, and are terminated cleanly prior to system shutdown. Only services involved with early boot or late system shutdown should disable this option.
> I would prefer some system change, than ad-hoc adding options to service
> file.
I actually agree.
> > I would prefer some system change, than ad-hoc adding options to service
> > file.
> I actually agree.
Hmm I am not sure what kind of change it should be.
Maybe can we make sshd part of rescue target?
> Maybe can we make sshd part of rescue target?
That will not solve the issue with its dependencies.
Ok, after closing the bz1213778, what to do with this one you actively moved to openssh?
I was reproducing this behaviour from such sparse description and just because of this one line in fstab I ended up with almost non-bootable RHEL7.1 (of course also without ssh access).
Steps to reproduce:
1) $ echo "LABEL=TestLabel /mnt/label xfs defaults 1 1" >> /etc/fstab
2) $ shutdown -r now
3) wait for systemd to time out on the filesystem, log in as root (turn off selinux if selogin fails)
4) Add DefaultDependencies=no to sshd service (/usr/lib/systemd/system/sshd.service)
5) $ systemctl daemon-reload
6) $ systemctl start sshd
... losing shell, waiting for systemd to time out and returning to 3)
so no progress for me this way. Any other ideas how to handle this? Or can anyone test it after me if I did something wrong?
> 4) Add DefaultDependencies=no to sshd service (/usr/lib/systemd/system/sshd.service)
Sorry. Putting it into [Unit] block works for me. I'm able to start sshd, but it is listening only on localhost (network is not up?) so I'm still unable to access machine remotely and this doesn't solve anything.
Any more ideas?
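The [Unit] placement that worked above can also be expressed as a drop-in instead of editing /usr/lib/systemd/system/sshd.service directly. This is only a sketch under assumptions: the drop-in file name is made up for illustration, and the script writes to a scratch directory by default so it can be tried without touching a real system. As reported above, sshd then starts but listens only on localhost, so this alone does not restore remote access.

```shell
#!/bin/sh
# Sketch: stage a drop-in that sets DefaultDependencies=no for sshd.service.
# DROPIN_DIR defaults to a scratch directory; on a real system it would be
# /etc/systemd/system/sshd.service.d. The file name is a hypothetical choice.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)/sshd.service.d}"
mkdir -p "$DROPIN_DIR"

cat > "$DROPIN_DIR/10-no-default-deps.conf" <<'EOF'
[Unit]
# Removes the implicit Requires=/After= on basic.target and the implicit
# Conflicts=/Before= on shutdown.target described in systemd.service(5).
DefaultDependencies=no
EOF

echo "wrote $DROPIN_DIR/10-no-default-deps.conf"

# After installing the file into the real drop-in directory:
#   systemctl daemon-reload
#   systemctl start sshd
```

A drop-in has the advantage of surviving package updates that replace the unit file under /usr/lib, which is why it is used here instead of the direct edit from step 4.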
(In reply to Jakub Jelen from comment #10)
> Any more ideas?
If I get this right you are basically asking for connection to network and sshd running if boot ends up in emergency target?
Basically yes. With this tweak SSHD is started during boot (or rather during emergency target?), but it listens only on localhost, so it is no more helpful than if it were not running.
That's my main issue with it, as Michal Sekletar closed bz1213778 noting that this is a policy decision from systemd upstream:
The current state causes the following severe regression:
From: sysvinit and I bet upstart as well: System boots when filesystem other than / and other critical filesystems cannot be mounted. Machine is nicely accessible via network.
To: The system does not boot in that case, is not accessible via network, and probably cannot easily be brought onto the network without manually running ip addr, ip link set up, ip route and /usr/bin/sshd. Thus, if all means of contacting it via out-of-band management also fail, a technician needs to come to the physical location of the system. This can mean a downtime of several hours, where otherwise the server would be up and running again in minutes and any issue could be easily fixed via SSH access.
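The manual recovery mentioned above (ip link, ip addr, ip route, then starting sshd by hand) could look roughly like the following sketch. All values are placeholders: the interface name, the address and the gateway are from the documentation range and not from this report, and run and bring_up_ssh are illustrative helper names. The comment above writes /usr/bin/sshd; the sketch defaults to /usr/sbin/sshd, which is where the daemon normally lives on CentOS/RHEL 7 as far as I know, and lets SSHD override it. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch: bring up networking by hand from the emergency shell, then start
# sshd outside of systemd. IFACE, ADDR, GATEWAY and SSHD are assumptions.
IFACE="${IFACE:-eth0}"
ADDR="${ADDR:-192.0.2.10/24}"
GATEWAY="${GATEWAY:-192.0.2.1}"
SSHD="${SSHD:-/usr/sbin/sshd}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

bring_up_ssh() {
    run ip link set "$IFACE" up                # activate the interface
    run ip addr add "$ADDR" dev "$IFACE"       # assign a static address
    run ip route add default via "$GATEWAY"    # add the default route
    run "$SSHD"                                # start the daemon directly
}

bring_up_ssh
```

Running it with DRY_RUN=0 as root would execute the four commands for real; this still requires console or out-of-band access in the first place, which is exactly the point made above.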
A documentation as manpage may help a bit, but IMHO this requires a HUGE, BIG, FAT warning in release notes or even on installing RHEL 7 / CentOS 7 or upgrading to it.
I stick to my opinion: the current state is broken in several severe ways. I don't use RHEL 7 or CentOS 7 as my main system, so until I find this happening on a Debian machine (it may have the same issue, but I have not tested it yet), I won't care that much about it.
Booting up nicely enough to have the system running, i.e. a robust and failure tolerant boot process is a huge value in itself. For me it counts more than 100% correctness of the boot process. This can still be monitored / diagnosed with systemctl is-system-running or whatever it was.
It's your decision whether to stick to upstream policy decisions, use your own common sense or, as an alternative, document this in a way that makes this grave change *very obvious* to any sysadmins using RHEL / CentOS 7. Or deal with any support requests and dissatisfied customers who run into that issue.
Right now, I won't bring this upstream, because I am not into any "You are wrong, because we are right" kind of discussions with systemd developers. And I have seen this pattern with them way too often to feel comfortable with a discussion like that. Maybe it's a personal incompatibility, but right now, I won't do this. I have better ways to spend the time in my life.
It's your distro. What kind of quality do you want it to have?
For now moving to 7.3. This is a really unfortunate use case, and seeing how the systemd maintainers are dealing with the problem, I don't believe it will ever get fixed.
On the other side, putting rows into fstab without an appropriate nofail means that if they fail to mount, they will take the whole system with them. I know that it is an extra burden for users to write such a field to their fstab, but they are the ones who should decide if the filesystem is mandatory for their application or not.
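To make the nofail point concrete, here is a small sketch that lists fstab entries lacking nofail, i.e. the ones that can take the boot down with them if the device is missing. It is an illustration only: missing_nofail is a made-up helper name, the sample file is invented (apart from the LABEL=TestLabel line from the reproduction steps), and the script reads that sample unless FSTAB points elsewhere.

```shell
#!/bin/sh
# Sketch: list mount points whose fstab entry has no "nofail" option.
# FSTAB defaults to a small sample file so the script is safe to try.
if [ -z "${FSTAB:-}" ]; then
    FSTAB="$(mktemp)"
    cat > "$FSTAB" <<'EOF'
# device            mountpoint   type  options          dump pass
/dev/mapper/root    /            xfs   defaults         0 0
LABEL=TestLabel     /mnt/label   xfs   defaults         1 1
LABEL=Backup        /mnt/backup  xfs   defaults,nofail  0 2
EOF
fi

# missing_nofail FILE: print the mount point of every non-root entry
# whose options field (4th column) does not contain "nofail".
missing_nofail() {
    awk '
        /^[ \t]*(#|$)/ { next }                 # skip comments and blank lines
        $2 == "/"      { next }                 # the root filesystem is required anyway
        $4 !~ /(^|,)nofail(,|$)/ { print $2 }   # no nofail in the options field
    ' "$1"
}

missing_nofail "$FSTAB"
```

For the line from the reproduction steps, the adjusted entry would be `LABEL=TestLabel /mnt/label xfs defaults,nofail 1 1`. Pointing the script at the real file is a matter of running it with FSTAB=/etc/fstab.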
Seems like we will not solve anything more here. This bug is closely related to the systemd bug #1213778 and proposed solutions didn't work here.
There are more reports from people failing in this case, but it is on systemd to make a change (if system-wide, then in upstream) or for users to get used to the new (documented) behaviour.
*** This bug has been marked as a duplicate of bug 1213778 ***
This sounds a bit like: We don't care if upstream breaks important data center availability functionality that was available in sysvinit. No one these days seems to have the courage to speak up about this with systemd upstream anymore, and frankly I can even understand that considering the discussions I had on systemd-devel, after which I unsubscribed there for good.
> We don't care if upstream breaks important data center availability functionality that was available in sysvinit.
For keeping maximum compatibility, we have minor updates (6.7 -> 6.8). Behavior between major updates (6.x -> 7.x) can differ and is documented. In this case you are exaggerating.
Jakub, I am not talking about the "nofail" option there or about systemd going into rescue mode. While I do not agree with the upstream decision, one can argue about the best policy there.
I am talking about the fact that systemd does not even provide ssh access in the rescue system and even actively prevents starting SSH as a service in there unless you insist on starting "/usr/bin/sshd" manually. I.e. I am talking about what *this* bug is about (not the other one; my two bug reports have never been duplicates, although you marked them as such).
This is a clear robustness regression over SysVInit. Sure, it doesn't provide SSH access in rescue mode either, but it didn't go into rescue mode unless you had very serious issues, and thus in the case of a missing filesystem you still had SSH access to fix things easily.
Of course you are totally free to ignore this regression, but consider whether doing so is in service of your users and enterprise customers. Even though it is good to have some remote management, being able to fix things quickly via SSH reduces downtimes.
Martin is correct.
Anyone who supports remote systems can see this is a massive ballsup.
Dropping into rescue mode on any failed mount is a huge pitfall. Yes, we know nofail is the solution. But if you add a new mount to a remote server and forget to add the nofail option, then that is small consolation.
If systems are going to drop into emergency mode for reasons as trivial as this, then we must modify emergency mode so that if ssh is set to start in the default target, then we also attempt to start ssh in emergency mode. This is the only sane behaviour.
Anything else is effectively setting a trap for your users, and brutally kicking them in the gonads when they slip up & fall for it.
So, this bug is not a duplicate of the other bug: (https://bugzilla.redhat.com/show_bug.cgi?id=1213778). This bug is different, and it has a different solution.
I repeat, the solution is: modify emergency mode so that if ssh is set to start in the default target, then we also attempt to start ssh in emergency mode.
Simple, and painless.
PS: obfuscating the solution with talk of systemd dependencies is not acceptable. Systemd is a flawless wonder, which makes everything easier and better. It is therefore inconceivable that systemd could make this bug hard to fix.
(In reply to BugMasta from comment #21)
> I repeat, the solution is: modify emergency mode so that if ssh is set to
> start in the default target, then we also attempt to start ssh in emergency
> mode.
> Simple, and painless.
Patches welcome. You are the Bug Masta, after all.
Let me quote you from the bug 1213778 as well:
> The way this bug has been handled is a disgrace.
> At a minimum, if redhat has any respect for its customers ...
I get it now, you have no respect for Red Hat. You spelled Red Hat wrong.
Any chance you're the original reporter who trolls in bugzillas that were closed months ago? I'm *really* dying to know! Thank you again.
You think i'm trolling, because i'm commenting on a bugzilla that was closed 6 months ago?
I'm here commenting on this issue, because the bug was raised A YEAR AND A HALF AGO, and was then ignored and obstructed for a year, before being inappropriately closed. And now, 6 months after being closed, i've just been bitten by it (again, because I have been bitten by this before). My time has been wasted, again, because you closed the bug, but DID NOT FIX THE PROBLEM.
When you close a bug which describes a real issue, which affects real people, then people are likely to come along later and point out the fact that you've made an error, and your error has continued to cause people grief.
How about you just FIX THE PROBLEM, or at least, HELP. Instead of obstructing any proper resolution.
(In reply to Jan Synacek from comment #23)
> (In reply to BugMasta from comment #21)
> I get it now, you have no respect for Red Hat. You spelled Red Hat wrong.
> Any chance you're the original reporter who trolls in bugzillas that were
> closed months ago? I'm *really* dying to know! Thank you again.
That was uncalled for. I am not BugMasta although I agree with him.
I just saw now point in telling you again that I still think very similar about this issue than when I reported it.
You are of course free to call my feedback "trolling" – its actually a quite easy approach to avoid actually *considering* it. I just ask you to consider whether doing you serves you and… Red Hat.
Sorry for typos, it seems I cannot edit my comment, so:
I just saw no point in telling you again that I still think very similar about this issue than when I reported it.
You are of course free to call my feedback "trolling" – its actually a quite easy approach to avoid actually *considering* it. I just ask you to consider whether doing so serves you and… Red Hat.
(In reply to Martin Steigerwald from comment #25)
> Hello Jan.
> (In reply to Jan Synacek from comment #23)
> > (In reply to BugMasta from comment #21)
> > I get it now, you have no respect for Red Hat. You spelled Red Hat wrong.
> > Any chance you're the original reporter who trolls in bugzillas that were
> > closed months ago? I'm *really* dying to know! Thank you again.
> That was uncalled for. I am not BugMasta although I agree with him.
I'm sorry, my comment wasn't meant to offend you in any way, even though it may look like it.
(In reply to Jan Synacek from comment #27)
> (In reply to Martin Steigerwald from comment #25)
> > Hello Jan.
> > (In reply to Jan Synacek from comment #23)
> > > (In reply to BugMasta from comment #21)
> > […]
> > > I get it now, you have no respect for Red Hat. You spelled Red Hat wrong.
> > >
> > > Any chance you're the original reporter who trolls in bugzillas that were
> > > closed months ago? I'm *really* dying to know! Thank you again.
> > That was uncalled for. I am not BugMasta although I agree with him.
> I'm sorry, my comment wasn't meant to offend you in any way, even though it
> may look like it.
I accept your apology.
I also see how the way BugMasta wrote his comment invited such a reaction. BugMasta, I do think the tone makes the music, while at the same time I see how challenging it can be to make one's own point clearly without attacking others personally. I do believe that Red Hat employees have no interest in being disrespectful to customers (it would be doing them a disservice), so I ask you to keep that out of here.
I think the first step forward here is to agree to disagree. I do hope that there will be a good fix for this issue at some time in the future, but honestly I am not willing to discuss this with upstream at the moment given my past experiences on systemd-devel mailing list.
Dropping the stale needinfo. If our input is still needed, please set the needinfo again.