Bug 1381997
Summary: | Systemctl reload sshd caused inactive service even if the service is running | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Radka Brychtova <rskvaril> | |
Component: | openssh | Assignee: | Jakub Jelen <jjelen> | |
Status: | CLOSED ERRATA | QA Contact: | Stanislav Zidek <szidek> | |
Severity: | unspecified | Docs Contact: | Mirek Jahoda <mjahoda> | |
Priority: | urgent | |||
Version: | 7.3 | CC: | carl, chris.daigle, derek.rodger, devel, fsumsal, gcerami, jfenal, jjelen, jon.dufresne, jpichon, kari.hautio, michele, mjahoda, mjtrangoni, mlinden, mschmidt, nmavrogi, pasteur, rbeyel, salmy, sdordevi, systemd-maint, szidek, vorpal | |
Target Milestone: | rc | Keywords: | ZStream | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | openssh-7.4p1-1.el7 | Doc Type: | Bug Fix | |
Doc Text: |
Previously, a race condition between systemd starting the OpenSSH server daemon and sshd writing a PID file could occur. As a consequence, systemd kept the sshd service restarting endlessly. With this update, the sd_notify() function has been added to the sshd code, and systemd is now able to properly detect already running sshd.service.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1426603 (view as bug list) | Environment: | ||
Last Closed: | 2017-08-01 18:42:47 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1298243, 1426603 |
Comment 2
Jakub Jelen
2016-10-05 14:38:59 UTC
From the bug, it is triggered by the reload-or-restart phase (restarting changes the PID, but not so fast): sshd[11765]: Received SIGHUP; restarting. systemd[1]: PID 11765 read from file /var/run/sshd.pid does not exist or is a zombie. [...] sshd[11800]: debug1: Bind to port 22 on 0.0.0.0. SSHD receives SIGHUP and ends. The error message from systemd comes after that. Therefore the service is running, but the systemd thinks it is not (even though the PID file is properly set). (In reply to Jakub Jelen from comment #2) > The same service file (except the sshd-keygen) is in current Fedora and I > don't see this behavior. It's a race condition, so it may take more than one try to reproduce. I can reproduce it in Fedora 24 and 25. Try it in a loop: while :; do systemctl reload sshd; done It starts spitting out "sshd.service is not active, cannot reload." very soon with a sshd process running inside an inactive sshd.service. The verbs in systemd were modelled by the LSB initscript actions. The specification http://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html says: > reload cause the configuration of the service to be reloaded without > actually stopping and restarting the service When sshd receives SIGHUP, it closes its listening sockets, then re-execs itself. I'd say that counts as stopping and restarting the service, so this action should not be exposed as ExecReload. To see why this causes systemd to see the service as inactive, consider what happens after sshd re-execs. sshd daemonizes itself by calling "daemon(0, 0)". This is a fork and an immediate exit of the parent process. Later the child process writes a new PID file. By the time systemd receives a SIGCHLD from the dying original sshd process, the new sshd process may not have written the PID file yet. So systemd sees the main process of sshd.service has exited. That makes the service inactive. systemd does actually have support for daemons that re-fork and move to a new PID. However this can only work if the new PID is written to the PID file before the original process exits. Possible fixes, any one of these would avoid this bug: A) When re-executing sshd, pass it an additional option ("-D") to stop it from daemonizing again. B) Rewrite sshd's SIGHUP handling so that it only re-reads the config file, but don't re-exec. C) Rewrite sshd's daemonization to follow the recommendations in man 7 daemon. D) Use "-D", implement sd_notify() in sshd and use Type=notify in sshd.service. In my opinion B), C), D) would be welcome improvements regardless of this bug. Thank you for the explanation and references. The reload worked with init scripts, because they were not preserving the state (process PID) across different invocations. I am not sure if systemd should do that either. We have PID file with the correct PID most of the time, but if systemd retrieves the PID in wrong time, it fails hard (not saying that being more strict about reload does not mean bad). We should probably say that we do not provide reload for sshd rather than trying to workaround that somehow. The reload is optional according to LSB. Would removing ExecReload= option work or is there something more needed? From the A) option we moved based on the bug #1291172, because systemd failed to report errors from the failed startups (there is deterministic reproducer for RHEL7.2 -- might be some bug in systemd too). Option B) and C) would require significant changes in OpenSSH source code, not only for re-reading the configuration files and doing so systemd-way. The last option with sd_notify() was a thing we wanted to avoid, though, if there would be no other choice, there is already patch implementing this in openssh for Debian [1]. [1] https://anonscm.debian.org/cgit/pkg-ssh/openssh.git/commit/?id=fe97848e044743f0bac019a491ddf0138f84e14a (In reply to Jakub Jelen from comment #5) > We should probably say that we do not provide reload for sshd rather than > trying to workaround that somehow. The reload is optional according to LSB. > Would removing ExecReload= option work or is there something more needed? As long as the reload operation cannot operate reliably, it should not be exposed as ExecReload. However, nothing prevents administrators from sending SIGHUP to the sshd process manually, so they could still get into the trouble even if we don't define ExecReload. > From the A) option we moved based on the bug #1291172, because systemd > failed to report errors from the failed startups (there is deterministic > reproducer for RHEL7.2 -- might be some bug in systemd too). I should clarify what I meant in option A). I did not mean to add "-D" to the ExecStart line in the unit file. I meant adding "-D" to the arguments of the execv() call in sighup_restart(). sshd would still fork on normal start. It just would not bother to fork again when handling SIGHUP. > Option B) and C) would require significant changes in OpenSSH source code, > not only for re-reading the configuration files and doing so systemd-way. > > The last option with sd_notify() was a thing we wanted to avoid, though, if > there would be no other choice, there is already patch implementing this in > openssh for Debian [1]. > > [1] > https://anonscm.debian.org/cgit/pkg-ssh/openssh.git/commit/ > ?id=fe97848e044743f0bac019a491ddf0138f84e14a Why do you want to avoid sd_notify()? Do you know if Debian developers proposed this patch to upstream? (In reply to Michal Schmidt from comment #6) > As long as the reload operation cannot operate reliably, it should not be > exposed as ExecReload. > However, nothing prevents administrators from sending SIGHUP to the sshd > process manually, so they could still get into the trouble even if we don't > define ExecReload. Well ... not that it would not operate reliably, but it does a bit different way than systemd expects it to do and therefore it fails to track it. > > From the A) option we moved based on the bug #1291172, because systemd > > failed to report errors from the failed startups (there is deterministic > > reproducer for RHEL7.2 -- might be some bug in systemd too). > > I should clarify what I meant in option A). I did not mean to add "-D" to > the ExecStart line in the unit file. I meant adding "-D" to the arguments of > the execv() call in sighup_restart(). sshd would still fork on normal start. > It just would not bother to fork again when handling SIGHUP. So that the before-reload and after-reload process would behave differently? AFAIK this is a bad idea and we would get burnt again with the same bug, but not in case of restart, but in reload if the configuration file would have typo. > Why do you want to avoid sd_notify()? Why to avoid this? Well, it is adding more systemd-specific bits and new build dependency to something that always worked well under other inits without any problems for years. > Do you know if Debian developers proposed this patch to upstream? I didn't find any trace of Debian maintainers or the authors of the patch trying to upstream this change. Would you like to do that? I am wondering what is missing in the C) option, if we use -D switch and why we are not getting the logs nor the failure from the service. Regardless the solution from the list, the other related problem is that unless we will add RestartPreventExitStatus=255 systemd is trying to restart the daemon, reports failure, but does not say what is wrong with the service (typo in sshd_config). But adding this option basically prevents restart when the network is not online or in other valid cases, if I remember well (the same exit status). Anyway, do you consider this as high priority to get it fixed in 7.3.Z (whatever way) or do you think it will wait for RHEL7.4? [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=778913 *** Bug 1395836 has been marked as a duplicate of this bug. *** also reported https://bugs.centos.org/view.php?id=12757 As this is a regression in the single most important service on any networked server, IMHO it should be fixed ASAP... Also I do not understand why the solution in https://bugzilla.redhat.com/show_bug.cgi?id=1291172#c2 was not selected in the first place instead of removing -D ? Type=forking will always be racy and unreliable, whereas Type=simple (and Type=notify) will let systemd have exact knowledge of what happens to the process. We have patched all EL7 services shipped as Type=forking on our systems (currently sshd, acpid, chronyd and vsftpd) to run in the foreground so that systemd can keep track of them properly. *** Bug 1415218 has been marked as a duplicate of this bug. *** We will have to use a bit different patch [1], because I missed bogus messages that are written to the journal (from Fedora bug #1427526): Feb 28 08:39:59 wildwest systemd: sshd.service: Got notification message from PID 2959, but reception only permitted for main PID 1438 I will respin the z-stream soon. [1] https://anonscm.debian.org/cgit/pkg-ssh/openssh.git/commit/?id=0fd4134 (In reply to Jakub Jelen from comment #18) This would be avoided if using -D as suggested by https://bugzilla.redhat.com/show_bug.cgi?id=1291172#c2 Type=notify can still be used and is a nice extra feature. "As this is a regression in the single most important service on any networked server, IMHO it should be fixed ASAP..." Fixed in openssh-7.4p1 [root@rhevm-host ~]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 7.3 (Maipo) [root@rhevm-host ~]# rpm -q openssh openssh-6.6.1p1-33.el7_3.x86_64 Nice one. Is this going to be pushed upstream? (In reply to Mario Trangoni from comment #22) > Is this going to be pushed upstream? You can follow up on the upstream bug attached to this report. OpenSSH team is not very eager to accept this change/extension so far. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2029 |