Bug 1352214 - sshd sometimes starts too early, before the network is ready, and fails
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: openssh
Version: 24
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Jakub Jelen
QA Contact: Fedora Extras Quality Assurance
Blocks: network-online.target
Reported: 2016-07-02 13:52 UTC by Sam Varshavchik
Modified: 2017-08-08 15:15 UTC
Last Closed: 2017-08-08 15:15:25 UTC


Description Sam Varshavchik 2016-07-02 13:52:28 UTC
Description of problem:

systemd may sometimes start openssh before the network interfaces are ready. If openssh-server is configured to listen on one or more specific IP addresses, it will temporarily fail to bind to them.

Version-Release number of selected component (if applicable):

openssh-7.2p2-7.fc24.x86_64

How reproducible:

Fairly reliably.

Steps to Reproduce:
1. Configure /etc/ssh/sshd_config to listen on a specific IP address
2. Reboot
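
One way to confirm which addresses sshd will try to bind at boot is to list the ListenAddress lines in the configuration. The following is only a minimal sketch: the helper name listen_addresses is hypothetical, and it ignores Match blocks and Include directives.

```shell
#!/bin/sh
# Print every address that sshd is configured to bind to.
# Usage: listen_addresses [path-to-sshd_config]
# (listen_addresses is a hypothetical helper, not part of openssh.)
listen_addresses() {
    # Match uncommented ListenAddress lines (keyword compared case-insensitively)
    # and print only the second field, which is the address itself.
    awk 'tolower($1) == "listenaddress" { print $2 }' "${1:-/etc/ssh/sshd_config}"
}

# Demonstration against a throwaway file rather than the live configuration.
sample=$(mktemp)
printf 'Port 22\nListenAddress 192.168.0.1\n#ListenAddress ::\n' > "$sample"
listen_addresses "$sample"   # prints 192.168.0.1
rm -f "$sample"
```

On a real host, calling listen_addresses with no argument reads /etc/ssh/sshd_config, which may require root.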

Actual results:

The output from journalctl -l /usr/sbin/sshd:

-- Reboot --
Jul 02 09:35:38 shorty.email-scan.com sshd[1007]: WARNING: 'UseLogin yes' is not supported in Fedora and may cause several problems.
Jul 02 09:35:38 shorty.email-scan.com sshd[1077]: error: Bind to port 22 on 192.168.0.1 failed: Cannot assign requested address.
Jul 02 09:35:38 shorty.email-scan.com sshd[1077]: fatal: Cannot bind any address.
Jul 02 09:36:20 shorty.email-scan.com sshd[1494]: WARNING: 'UseLogin yes' is not supported in Fedora and may cause several problems.
Jul 02 09:36:20 shorty.email-scan.com sshd[1495]: Server listening on 192.168.0.1 port 22.

sshd, in this case, initially fails to bind to 192.168.0.1.

Looks like sshd bails out and exits initially. After 40 seconds systemd starts it again. By this time the network interfaces have come up, and sshd successfully starts.
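
The exact gap between the failed start and the successful one can be read off the journal timestamps above. A small sketch of the arithmetic (the helper name to_seconds is made up for illustration):

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp to seconds since midnight.
# awk is used so that fields such as "09" are read as decimal numbers.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

failed=$(to_seconds 09:35:38)     # "fatal: Cannot bind any address."
restarted=$(to_seconds 09:36:20)  # "Server listening on 192.168.0.1 port 22."

# prints "42 seconds between failure and restart"
echo "$((restarted - failed)) seconds between failure and restart"
```

The journal only has one-second resolution, so the result is approximate to within a second; it agrees with the RestartSec=42s value cited in comment 3.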

Expected results:

sshd should get started by systemd only after all network interfaces are configured.

Additional info:

This is the same bug as bug 1350097 for privoxy, except that sshd eventually recovers and binds to the IP address. So, the impact is low. Still, it's better to get this right.

The fix is the same fix as in bug 1350097. Instead of:

After=network.target sshd-keygen.target
Wants=sshd-keygen.target

sshd.service should be changed to:

After=network-online.target sshd-keygen.target
Wants=network-online.target sshd-keygen.target

Comment 1 Jakub Jelen 2016-07-04 08:19:24 UTC
Is there any reference in the Fedora packaging guidelines to what network-online.target means? I have the same naive understanding: that it should wait for NetworkManager (NM) to report that it is online in some sense. The upstream systemd documentation leaves this to the downstream distributions.

This question comes up from time to time, and there is a reason why the change is not yet in openssh. From the systemd documentation [1]:

> network-online.target: [...] It is strongly recommended not to pull in this target too liberally: for example network server software should generally not pull this in (since server software generally is happy to accept local connections even before any routable network interface is up), it's primary purpose is network client software that cannot operate without network.


Using this target delays boot (we usually want to boot fast). We still have tracker bug #1119787 for this, but network.target works in most cases (and the remaining ones are handled by the restart on failure that you describe, which effectively delays the start of sshd while speeding up the overall boot).

Note that I don't have a problem applying it, but I don't see any advantage in doing so by default. Also, if we were to apply it, the Fedora packaging guidelines should say so (as already discussed in the referenced tracker bug).

[1] https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

Comment 2 Sam Varshavchik 2016-07-04 12:27:57 UTC
The status quo adds a dependency on systemd's current behavior of restarting failed services. Currently, the default timeout between restart attempts -- looks like it's 40 seconds -- keeps things somewhat tolerable.

It would not surprise me if a future release of systemd changed the default to, maybe, 5 minutes, or stopped restarting failed services by default. That would become quite problematic.

Perhaps adding an explicit

Restart=always
RestartSec=30

will result in more or less deterministic behavior.

Incidentally, I find that NetworkManager-wait-online does not need to be enabled. The dependency on the network-online target is sufficient. It looks to me like the referenced delays refer to waiting for IP addresses to be acquired via DHCP. There is no delay with configuring static IP addresses, which happens before the target is reached, so scheduling a service after the target only rearranges the startup sequence.

Comment 3 Jakub Jelen 2016-07-04 12:50:44 UTC
(In reply to Sam Varshavchik from comment #2)
> The status quo adds a dependency on systemd's current behavior of restarting
> failed services. Currently, the default timeout between restart attempts --
> looks like it's 40 seconds -- keeps things somewhat tolerable.

Precisely 42.

> It would not surprise me if a future release of systemd changed the
> default to, maybe, 5 minutes, or stopped restarting failed services by
> default. That would become quite problematic.
> 
> Perhaps adding an explicit
> 
> Restart=always
> RestartSec=30
> 
> will result in more or less deterministic behavior.

It is not default systemd behavior but a perfectly engineered constant in the service file. A variation of this has been there for years [1]:

    Restart=on-failure
    RestartSec=42s

> Incidentally, I find that NetworkManager-wait-online does not need to be
> enabled. The dependency on the network-online target is sufficient.

AFAIK it is basically the same thing. Pulling network-online.target into the boot sequence also pulls in NetworkManager-wait-online.

 [1] http://pkgs.fedoraproject.org/cgit/rpms/openssh.git/tree/sshd.service

Comment 4 Edgar Hoch 2016-07-06 12:18:52 UTC
I think the default works for most of the use cases.

In your special case, where you listen on a specific network address, I suggest creating the directory /etc/systemd/system/sshd.service.d/ and putting a file with the suffix .conf (e.g. wait.conf) in it with the lines

[Unit]
After=network-online.target

Then reboot the system, or reload systemd with "systemctl daemon-reload" (and check the journal for errors).
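
The drop-in described above could be created along these lines. This is a sketch under assumptions, not a packaged fix: the function name write_sshd_dropin is hypothetical, the file name wait.conf follows the example above, and the Wants= line is an addition borrowed from comment 11 so that the target is actually pulled into the boot transaction.

```shell
#!/bin/sh
# Write a drop-in that orders sshd.service after network-online.target.
# $1 is the systemd configuration root; /etc/systemd/system on a real host.
# (write_sshd_dropin is a hypothetical helper name.)
write_sshd_dropin() {
    dropin_dir="$1/sshd.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/wait.conf" <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
}

# Dry run into a scratch directory so nothing on the live system changes.
scratch=$(mktemp -d)
write_sshd_dropin "$scratch"
cat "$scratch/sshd.service.d/wait.conf"
rm -rf "$scratch"
```

On a real system the call would be write_sshd_dropin /etc/systemd/system, run as root and followed by "systemctl daemon-reload".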

Comment 5 Fedora Update System 2016-07-27 10:40:08 UTC
openssh-7.2p2-11.fc24 selinux-policy-3.13.1-191.8.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-99191c4aab

Comment 6 Lukas Vrabec 2016-07-27 10:52:18 UTC
Reverting to NEW state. BZ was switched to MODIFIED due to wrong bodhi update.

Comment 7 Matthew Miller 2017-01-19 13:34:47 UTC
Since this is a requirement in a specific non-default configuration, I think Edgar's suggestion in comment #4 is probably the right thing, but it'd be nice if we documented this somewhere other than this bug. Do we have a good place for it?

A comment in the config file or sshd_config man page (or both) might make sense, but I'm not sure how that would be taken upstream.

Comment 8 Jakub Jelen 2017-01-20 08:42:44 UTC
Hello Matt,
based on bug #1289175, we modified the RHEL7 Sysadmin guide to cover this case. I believe we have a similar guide for Fedora, which could mention a similar use case. Should we reassign this to the Fedora documentation?

Documenting things in configuration files is not a best practice, and upstream acceptance in their manual pages does not look very feasible. So I don't think there is much to do about it in OpenSSH.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/System_Administrators_Guide/s1-ssh-configuration.html

Comment 9 Sam Varshavchik 2017-03-11 16:19:01 UTC
Good news, everyone!

openssh-7.4p1-4.fc25.src.rpm no longer retries when it fails to bind to its IP address. The ssh service fails permanently, leaving the server completely inaccessible.

/var/log/secure:

Mar 11 10:56:58 shorty sshd[979]: error: Bind to port 22 on 192.168.0.1 failed: Cannot assign requested address.
Mar 11 10:56:58 shorty sshd[979]: fatal: Cannot bind any address.

/var/log/messages:

Mar 11 10:56:58 shorty systemd: sshd.service: Main process exited, code=exited, status=255/n/a
Mar 11 10:56:58 shorty systemd: sshd.service: Unit entered failed state.
Mar 11 10:56:58 shorty systemd: sshd.service: Failed with result 'exit-code'.
Mar 11 11:04:48 shorty systemd: Removed slice system-sshd\x2dkeygen.slice.
Mar 11 11:04:48 shorty systemd: Stopped target sshd-keygen.target.
Mar 11 11:07:17 shorty systemd: Reached target sshd-keygen.target.

It appears that openssh-7.4p1-4.fc25.src.rpm added something extra to sshd.service:

RestartPreventExitStatus=255

That's new. I don't recall seeing that before. And, it seems that failing to bind to the IP address results in an exit code of 255, disabling the ssh service permanently, instead of having it retry 42 seconds later.
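
As the systemd.service documentation describes it, an exit status listed in RestartPreventExitStatus= suppresses automatic restarts regardless of the Restart= setting, which would explain the behavior above. A hedged sketch for pulling the relevant settings out of a unit file follows; the helper name restart_settings is hypothetical, and the sample input only repeats the three settings quoted in this report, not the full packaged unit.

```shell
#!/bin/sh
# Print the restart-related settings of a unit file.
# Reads the files named as arguments, or standard input when none are given.
# (restart_settings is a hypothetical helper name.)
restart_settings() {
    # cat "$@" falls back to stdin when called without arguments.
    cat "$@" | grep -E '^(Restart|RestartSec|RestartPreventExitStatus)='
}

# Sample input: only the settings mentioned in this bug report.
printf '[Service]\nRestart=on-failure\nRestartSec=42s\nRestartPreventExitStatus=255\n' \
    | restart_settings
```

On a live system the input could come from "systemctl cat sshd.service" piped into restart_settings.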

Comment 10 Jakub Jelen 2017-03-13 09:00:51 UTC
Hello,
yes, this is new, and it arrived with the SD_NOTIFY support. But I am not sure it is the right thing, since it breaks exactly the use case this bug is about (starting too early, before slow DHCP is ready).

Comment 11 Eric Lajoie 2017-06-05 11:58:43 UTC
Works for me in Fedora 25 after doing this per bug 1289175:

with openssh-7.4p1-4.fc25.x86_64


mkdir /etc/systemd/system/sshd.service.d
vi /etc/systemd/system/sshd.service.d/wait.conf
     [Unit]
     Wants=network-online.target
     After=network-online.target

Comment 12 Fedora End Of Life 2017-07-25 21:31:00 UTC
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '24'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 13 Fedora End Of Life 2017-08-08 15:15:25 UTC
Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

