Created attachment 1423064
ssh troubleshooting session output
Description of problem:
We are experiencing SSH issues on the ESS storage cluster at LLNL during scaling tests of the ESS. Commands and scripts that operate on multiple nodes error out with one of the following:
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: read: Connection reset by peer
They occur on different systems each time, so we know it is not host related.
These errors are easily reproducible and even occur when fanout is set to a low number (16 in this example):
date; time xdsh gss_ppc64 -f 16 ssh -v sierra4459 hostname
Version-Release number of selected component (if applicable):
[root@sierra4459:~]# ssh -V
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
[root@sierra4459:~]# rpm -qa | grep ssh
[root@sierra4459:~]# rpm -qa --last | grep -i openssh
openssh-server-7.4p1-13.el7_4.ppc64le Fri Dec 1 15:25:41 2017
openssh-clients-7.4p1-13.el7_4.ppc64le Fri Dec 1 15:20:47 2017
openssh-7.4p1-13.el7_4.ppc64le Fri Dec 1 15:20:47 2017
Steps to Reproduce:
1. xdsh gss_ppc64 -f 16 ssh -v sierra4459 hostname
Actual results:
sierra4329: ssh_exchange_identification: read: Connection reset by peer
sierra4330: ssh_exchange_identification: Connection closed by remote host
Expected results:
No closes or resets.
Additional debug output attached.
An additional detail I forgot to add: we experience the same issue over both the high-speed network and the management network.
We initially found that sshd_config had been overwritten on December 20th, with two parameters added (KeyRegenerationInterval 0 and MaxStartups 1024). These parameters are not set in my GA ESS config, so we removed them and restarted sshd on all ionodes and the ems. We still encountered the closed connections and peer resets, and we additionally saw Host Key Authentication errors. We restored the original settings and the Host Key Authentication errors went away, but the ssh issue remains.
This sounds like a case of too many connections being opened to a single host at once. The default MaxStartups setting is 10:30:100, which means that once 10 connections are pending authentication, sshd starts dropping new connections at random: 30% of them at first, rising linearly to 100% once 100 unauthenticated connections are pending. This is not a bug, but a feature.
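To make the random early drop behaviour concrete, here is a minimal Python sketch of the rule as sshd_config(5) describes it for a `start:rate:full` value. The function names are hypothetical, the arithmetic is a floating-point model rather than sshd's own code, and `pending` is assumed to mean the number of unauthenticated connections already open when a new one arrives.

```python
def parse_max_startups(value):
    """Parse a MaxStartups value into (start, rate, full).

    Accepts 'start:rate:full' or a single number. A single number is
    modelled as a hard limit with no random early drop.
    """
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        return parts[0], 100, parts[0]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError("expected 'N' or 'start:rate:full', got %r" % value)


def drop_probability(pending, value):
    """Return the modelled chance (0.0 to 1.0) that a new connection is refused."""
    start, rate, full = parse_max_startups(value)
    if pending < start:
        return 0.0  # below the threshold nothing is dropped
    if pending >= full:
        return 1.0  # at or above 'full' everything is dropped
    # Linear ramp from rate% at 'start' up to 100% at 'full'.
    return (rate + (100 - rate) * (pending - start) / (full - start)) / 100.0


if __name__ == "__main__":
    # Compare an example start:rate:full setting with a plain limit of 1024.
    for setting in ("10:30:100", "1024"):
        for pending in (5, 10, 16, 50, 100):
            p = drop_probability(pending, setting)
            print("MaxStartups %-9s pending=%3d drop chance=%.2f" % (setting, pending, p))
```

Under this model, a fanout of 16 against one host can push the number of pending connections past 10, so some attempts are refused, whereas a plain limit of 1024 refuses nothing until 1024 connections are pending.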
The server logs can probably confirm this theory, but you need to run sshd with at least "LogLevel VERBOSE" in sshd_config.
If you set this configuration option to MaxStartups 1024, you should no longer see this issue. The KeyRegenerationInterval is unrelated.
Also make sure the sshd services are restarted after configuration changes. Please provide logs from the server, and ideally also from the client, so we can see what is really going on.
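As a sketch, the two settings discussed in this comment would look like this in sshd_config. The file path in the comment line is an assumption (the usual location on RHEL), and the change takes effect only after sshd is restarted on each node.

```
# /etc/ssh/sshd_config (assumed path)
# Allow up to 1024 unauthenticated connections before refusing new ones
MaxStartups 1024
# More detailed server-side logging while troubleshooting
LogLevel VERBOSE
```

Running `sshd -T` as root prints the effective configuration, which can help confirm that the values were picked up.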
Thank you, that resolved the problem.
Setting MaxStartups 1024 in sshd_config resolved the issue and allows more than 10 parallel threads.