Bug 1643176
Summary: | SSH reload failed with "fatal: Cannot bind any address" | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | SHAURYA <sshaurya> |
Component: | openssh | Assignee: | Jakub Jelen <jjelen> |
Status: | CLOSED WONTFIX | QA Contact: | BaseOS QE Security Team <qe-baseos-security> |
Severity: | high | Priority: | high |
Version: | 7.5 | Target Milestone: | rc |
Hardware: | x86_64 | OS: | Linux |
Last Closed: | 2019-02-11 15:39:14 UTC | Type: | Bug |
Description
SHAURYA
2018-10-25 16:45:35 UTC
Seems like I can reproduce the issue with sshd under high load.

1. Configure public key authentication
2. for i in `seq 1 100`; do ssh -oBatchMode=yes localhost true & done
3. systemctl reload sshd

(running "systemctl reload sshd" once more might help trigger the issue)

The logs with LogLevel DEBUG3:

Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: drop connection #41 from [::1]:33610 on [::1]:22 past MaxStartups
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: Received SIGHUP; restarting.
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug3: oom_adjust_restore
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug1: Set /proc/self/oom_score_adj to 0
[...]
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug3: already daemonized
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug3: oom_adjust_setup
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug1: Set /proc/self/oom_score_adj from 0 to -1000
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug2: fd 3 setting O_NONBLOCK
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug1: Bind to port 22 on 0.0.0.0.
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug2: fd 3 setting O_NONBLOCK
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug3: sock_set_v6only: set socket 3 IPV6_V6ONLY
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: debug1: Bind to port 22 on ::.
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: error: Bind to port 22 on :: failed: Address already in use.
Oct 26 14:38:30 jjelen-rhel-7.3 sshd[10777]: fatal: Cannot bind any address.

Otherwise, everything else was already mentioned in the customer case: the service will restart itself after 42 seconds. If you want to achieve more "reliability", the interval can be configured to a lower number. I will investigate what is going on there and keep you updated.

This is interesting ...
the code after receiving SIGHUP is perfectly linear: once the signal is received and the part of the main loop handling this case is executed, the sockets are closed and the sshd process re-execs.

sshd.c:

    static void
    sighup_restart(void)
    {
        logit("Received SIGHUP; restarting.");
        if (options.pid_file != NULL)
            unlink(options.pid_file);
        platform_pre_restart();
        close_listen_socks();     <--- server closes the listen sockets here
        close_startup_pipes();
        alarm(0);  /* alarm timer persists across exec */
        signal(SIGHUP, SIG_IGN); /* will be restored after exec */
        execv(saved_argv[0], saved_argv);     <--- the new server is re-execed here

The only short window in which we can hit this race condition is before a forked child closes its copy of the listening sockets. That can realistically happen only on a busy system, where some child is held up in these few lines of code while sshd is restarted.

sshd.c:

    if ((pid = fork()) == 0) {
        /*
         * Child. Close the listening and
         * max_startup sockets. Start using
         * the accepted socket. Reinitialize
         * logging (since our pid has changed).
         * We break out of the loop to handle
         * the connection.
         */
        platform_post_fork_child();
        startup_pipe = startup_p[1];
        close_startup_pipes();
        close_listen_socks();     <--- child after fork closes listen sockets here

I am not sure whether there is a good solution for this, or at least one better than restarting on failure in the systemd unit file. One option might be postponing the restart until all the children have signaled that their sockets are closed, but that would again block the daemon from accepting new connections for an undefined time.

My suggestion would be to use a standard restart rather than a reload. In theory, it could have the same issue, but it is less likely to happen, because it involves creating a brand new process rather than just running exec.
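The race described above can be illustrated outside sshd with a small standalone sketch. It is not sshd code: the helper names `listen_on_loopback` and `rebind_errno_while_child_holds_listener` are hypothetical, the port is an ephemeral loopback port rather than 22, and `SO_REUSEADDR` is set on the assumption that this mirrors what a typical server does before `bind()`. A forked child that has not yet closed its inherited copy of the listening socket keeps the port in the LISTEN state, so a fresh `bind()` by the parent should fail with `EADDRINUSE` on Linux.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind and listen on 127.0.0.1:port (port 0 lets the
 * kernel pick one). Returns the socket fd, or -1 with errno set.
 */
static int listen_on_loopback(unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    /* Assumption: mirrors a server that sets SO_REUSEADDR before bind(). */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/*
 * Returns the errno of the second bind attempt, 0 if it unexpectedly
 * succeeded, or -1 if the setup itself failed.
 */
int rebind_errno_while_child_holds_listener(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd, fd2, result;
    pid_t pid;

    fd = listen_on_loopback(0);
    if (fd < 0)
        return -1;
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: stands in for a connection child that is stalled before
         * it closes its inherited listening socket. */
        pause();
        _exit(0);
    }

    /* Parent: close our copy (as the restart path does), then try to
     * bind the same port again (as the re-executed server would). */
    close(fd);
    fd2 = listen_on_loopback(ntohs(addr.sin_port));
    result = (fd2 < 0) ? errno : 0;
    if (fd2 >= 0)
        close(fd2);

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return result;
}
```

Calling `rebind_errno_while_child_holds_listener()` from a small `main` and comparing the result with `EADDRINUSE` shows the same failure mode as the "Address already in use" lines in the log. In real sshd the window is only the few lines between `fork()` and `close_listen_socks()` in the child, which is why the sketch has to stall the child artificially.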
This will in theory affect any RHEL, Fedora, or even upstream OpenSSH version, but in the end it does not cause any non-recoverable issue. It only delays the start of the service, so I would not consider this a high priority issue, but rather a configuration issue.

This issue was not selected for inclusion in Red Hat Enterprise Linux 7.7 because it is seen as having either low or moderate impact on a small number of use cases. The next release will be in Maintenance Support 1 Phase, which means that qualified Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released as they become available. We will now close this issue, but if you believe that it qualifies for the Maintenance Support 1 Phase, please re-open; otherwise we recommend moving the request to Red Hat Enterprise Linux 8 if applicable.