Bug 1298760

Summary: ssh_exchange_identification: read: Connection reset by peer
Product: Fedora
Version: 23
Component: openssh
Hardware: x86_64
OS: Linux
Severity: medium
Priority: unspecified
Reporter: Bogdan <bogdan>
Assignee: Jakub Jelen <jjelen>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: jjelen, mattias.ellert, mgrepl, plautrba, tmraz
Status: CLOSED INSUFFICIENT_DATA
Type: Bug
Last Closed: 2016-01-18 15:10:22 UTC

Description Bogdan 2016-01-14 23:41:40 UTC
Description of problem:
The SSH server refuses connections after the machine runs a big rsync job.

Version-Release number of selected component (if applicable):
openssh-server-7.1p1-6.fc23.x86_64

How reproducible:
Run a big rsync job with thousands of files from crontab; while the rsync runs, no ssh connections can be established.

Actual results:
SSH connections are refused: ssh_exchange_identification: read: Connection reset by peer

Additional info:
It is no longer possible to get into the box over ssh; the system needs to be rebooted. The system is otherwise operational, but SSH is unavailable. The cockpit daemon dies along with the SSH server. This happens on Fedora 23 x86_64, fully updated.

Comment 1 Bogdan 2016-01-14 23:46:07 UTC
This is not caused by a firewall, as port 22 is open on the box. Before the rsync run, everything is fine. I also noticed the same SSH issue when toggling the panels in mc with Ctrl+O. This looks to me like a memory overlapping/sharing issue where rsync overwrites the memory segment used by sshd; once the daemon's memory is corrupted, the daemon stops functioning.

Comment 2 Jakub Jelen 2016-01-15 09:07:51 UTC
Without more information from the server (logs, core dumps, retraces) or a reliable reproducer, we are not able to investigate this issue any further.
At the very least, a verbose log from the client might be helpful, and a debug log from the server should give us some clue. Can you provide something like that?

From this description, it sounds to me like disk space exhaustion or a similar resource problem. If it were a memory issue, it would be a more serious problem than openssh alone.
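
To capture both sides, something along these lines should work (host names and file names are just placeholders):

# on the client: the full protocol debug goes to stderr, save it to a file
ssh -vvv root@fileserver 2> ssh-client-debug.log

# on the server: run a one-off debug instance on a spare port, without
# touching the regular sshd; -ddd sends DEBUG3-level output to its stderr
# and the instance exits after handling a single connection
/usr/sbin/sshd -ddd -p 2222

# then connect to that debug instance from the client
ssh -vvv -p 2222 root@fileserver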

Comment 3 Bogdan 2016-01-15 23:19:56 UTC
I just managed to reproduce it. After the rsync run, I got locked out. This is what ssh -vvv root@fileserver gives me:

[sysadmin@gateway ~]$ ssh -vvv root@fileserver
OpenSSH_7.1p1, OpenSSL 1.0.2e-fips 3 Dec 2015
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 56: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to fileserver [192.168.100.253] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/sysadmin/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/sysadmin/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/sysadmin/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/sysadmin/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/sysadmin/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/sysadmin/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/sysadmin/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/sysadmin/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.1
ssh_exchange_identification: read: Connection reset by peer

Comment 4 Bogdan 2016-01-15 23:22:59 UTC
I was logged in the whole time the rsync was running. Memory usage sits at around 900 MB out of 6 GB or so. The memory cache is at around 500 MB max, and I am emptying the cache after every rsync to make sure the memory is fine. There is plenty of disk space. The rsync syncs from /home to /storage, which is a second physical drive, and to /nas, which is a Samba share mounted locally.

[root@fileserver scripts]# df
Filesystem                                1K-blocks      Used  Available Use% Mounted on
devtmpfs                                    3523556         4    3523552   1% /dev
tmpfs                                       3535452         0    3535452   0% /dev/shm
tmpfs                                       3535452       760    3534692   1% /run
tmpfs                                       3535452         0    3535452   0% /sys/fs/cgroup
/dev/mapper/fedora-root                    52403200   5696356   46706844  11% /
tmpfs                                       3535452        64    3535388   1% /tmp
/dev/sda1                                    487652    160792     297164  36% /boot
/dev/mapper/fedora-home                  1863643824 251882132 1517070980  15% /home
/dev/sdb                                 2884285104 420890532 2316881244  16% /storage
//nas.domain/Administration/backup 2870309064 479849064 2390357600  17% /nas
tmpfs                                        707092         0     707092   0% /run/user/0
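
The cache emptying I mentioned is the usual drop_caches step, roughly like this (the actual cron script is not reproduced here, so take this as an approximation):

sync                               # flush dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes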

Comment 5 Bogdan 2016-01-15 23:28:04 UTC
If I try to reload sshd, I get this:

[root@fileserver scripts]# systemctl reload sshd
Job for sshd.service failed because the control process exited with error code. See "systemctl status sshd.service" and "journalctl -xe" for details.

[root@fileserver scripts]# systemctl status sshd.service
● sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
   Active: active (running) (Result: exit-code) since Fri 2016-01-15 10:01:17 MST; 6h ago
     Docs: man:sshd(8)
           man:sshd_config(5)
  Process: 12618 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=208/STDIN)
 Main PID: 932 (sshd)
   CGroup: /system.slice/sshd.service
           └─932 /usr/sbin/sshd -D

Jan 15 15:27:24 fileserver.domain sshd[11428]: pam_succeed_if(sshd:auth): requirement "uid >= 1000" not met by user "root"
Jan 15 15:27:26 fileserver.domain sshd[11428]: Failed password for root from 192.168.100.1 port 54808 ssh2
Jan 15 15:27:33 fileserver.domain sshd[11428]: Accepted password for root from 192.168.100.1 port 54808 ssh2
Jan 15 15:37:06 fileserver.domain sshd[11620]: Accepted password for root from 192.168.100.1 port 54852 ssh2
Jan 15 15:39:15 fileserver.domain sshd[11693]: Accepted password for root from 192.168.100.1 port 58822 ssh2
Jan 15 16:12:39 fileserver.domain sshd[932]: Couldn't open /dev/null: No such file or directory
Jan 15 16:13:07 fileserver.domain sshd[932]: Couldn't open /dev/null: No such file or directory
Jan 15 16:13:31 fileserver.domain sshd[932]: Couldn't open /dev/null: No such file or directory
Jan 15 16:23:40 fileserver.domain systemd[1]: sshd.service: Control process exited, code=exited status=208
Jan 15 16:23:40 fileserver.domain systemd[1]: Reload failed for OpenSSH server daemon.
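
For reference, status=208 in the last journal lines is systemd's EXIT_STDIN: the ExecReload control process could not set up its standard input, which systemd connects to /dev/null by default, so it is consistent with the "Couldn't open /dev/null" messages above. A quick way to check whether /dev/null is still a proper device node when this happens (expected output shown as comments):

ls -l /dev/null                 # crw-rw-rw-. 1 root root 1, 3 ... /dev/null
stat -c '%F %t,%T' /dev/null    # character special file 1,3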

Comment 6 Bogdan 2016-01-15 23:36:14 UTC
The strange part is that SSH comes back after about 30 minutes of refusing connections...

Comment 7 Jakub Jelen 2016-01-18 10:07:58 UTC
This looks really weird.

> Jan 15 16:12:39 fileserver.domain sshd[932]: Couldn't open /dev/null: No such file or directory
> Jan 15 16:13:07 fileserver.domain sshd[932]: Couldn't open /dev/null: No such file or directory
> Jan 15 16:13:31 fileserver.domain sshd[932]: Couldn't open /dev/null: No such file or directory

Another idea that came to my mind is that it might be a hardware problem, with some part of memory getting corrupted, but if you managed to reproduce it again that sounds quite unlikely (though running a memtest would not hurt).
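
If you want to rule that out, something like the following sets up memtest86+ on Fedora (assuming a BIOS/legacy boot setup, since memtest86+ does not run under UEFI; the helper script comes from the Fedora memtest86+ package):

dnf install memtest86+
memtest-setup                            # adds a memtest entry under /etc/grub.d
grub2-mkconfig -o /boot/grub2/grub.cfg   # regenerate the GRUB menu
# reboot, choose the memory test entry and let it run at least one full pass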

To understand the errors: did you make three connection attempts corresponding to the three quoted log lines above, or are they from a single connection?

/dev/null should always be available, unless the data segment holding the _PATH_DEVNULL value is overwritten by something else (as you suggested) and a different file is actually being opened. That would be a serious issue with the kernel.

I made a scratch build that also logs which file is actually being opened, so we can see whether the failure occurs in `sanitise_stdfd` and whether the constant is malformed:
http://koji.fedoraproject.org/koji/taskinfo?taskID=12590686

diff --git a/misc.c b/misc.c
index c5d2686..bcc5722 100644
--- a/misc.c
+++ b/misc.c
@@ -728,8 +728,8 @@ sanitise_stdfd(void)
 	int nullfd, dupfd;
 
 	if ((nullfd = dupfd = open(_PATH_DEVNULL, O_RDWR)) == -1) {
-		fprintf(stderr, "Couldn't open /dev/null: %s\n",
-		    strerror(errno));
+		fprintf(stderr, "Couldn't open %s: %s\n",
+		    _PATH_DEVNULL, strerror(errno));
 		exit(1);
 	}
 	while (++dupfd <= 2) {

Could you try to install this build on the server and re-reproduce the issue, to see whether the path in the error message changes? It would also be useful to have a more verbose log from the server (the `LogLevel` option in `sshd_config`) showing how we get to the point of this error. Then it should be clearer from `/var/log/secure` or the journal.
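
Roughly like this on the server (the RPM file names depend on the exact scratch-build NVR, so treat them as placeholders):

# after downloading the x86_64 RPMs from the koji task above
dnf install ./openssh-7.1p1-*.rpm ./openssh-server-7.1p1-*.rpm ./openssh-clients-7.1p1-*.rpm

# raise the server verbosity: set "LogLevel DEBUG3" in /etc/ssh/sshd_config, then
systemctl restart sshd

# watch the server side while reproducing; the same messages also end up in /var/log/secure
journalctl -u sshd -f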

Comment 8 Bogdan 2016-01-18 15:10:22 UTC
It is very possible that the bug I am seeing is caused by the kernel itself. I have a script that redirects errors to /dev/null. As soon as I took the redirection out, I no longer need to drop the disk caches and the system seems to be stable. So, for some reason, redirecting output to /dev/null causes the issue. On the other hand, there is not much error output being redirected, so I do not understand why /dev/null would become inaccessible. To get back on track: this does not seem to be an SSH bug after all; ssh is just affected by it.

As I am not going to go through the effort of filing this as a kernel bug, I will close this issue, but the problem is still present and is triggered by the 2> /dev/null redirection.
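
One possible mechanism, offered only as a guess rather than a confirmed diagnosis: if something deletes /dev/null while the job runs, sshd starts failing with "No such file or directory" exactly as in the logs above, and the next shell redirection to /dev/null quietly recreates it as a regular file instead of a device node. Checking and restoring the node would look like this:

stat -c '%F' /dev/null            # must print: character special file
# if it is missing or has become a regular file, recreate the device node
rm -f /dev/null
mknod -m 666 /dev/null c 1 3
restorecon /dev/null              # fix the SELinux label on Fedora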