Description of problem: Per https://rhn.redhat.com/errata/RHSA-2015-2088.html : > Due to a race condition between auditing messages and answers when using ControlMaster multiplexing, one session in the shared connection randomly and unexpectedly exited the connection. This update fixes the race condition in the auditing code, and multiplexing connections now work as expected even with a number of sessions created at once. (BZ#1240613) Unfortunately, the issue appears to remain - even in the packages and versions listed as correcting the issue. Version-Release number of selected component (if applicable): openssh-7.1p2-3.fc23.x86_64 How reproducible: Consistently (though with the current sample size in the referenced script, multiple client runs may be required to hit the race condition). Steps to Reproduce: Refer to https://gist.github.com/ziesemer/93d64a074abdd8e6f632 . Actual results: The ControlMaster session unexpectedly and abruptly terminates, and the associated child session also exits with a failure (return code 255, and an error of "Write failed: Broken pipe"). Expected results: No failures should occur. Additional info: While running the server with LogLevel VERBOSE, a sampling of observed buffer errors are shown below: sshd[116589]: Starting session: command for ssh-test from ::1 port 54179 sshd[116586]: error: buffer_get_ret: trying to get more bytes 8 than in buffer 4 sshd[116586]: fatal: buffer_get_int: buffer error sshd[116586]: pam_unix(sshd:session): session closed for user ssh-test sshd[116589]: Starting session: command for ssh-test from ::1 port 54179 sshd[116589]: fatal: mm_request_send: write: Broken pipe sshd[116589]: fatal: mm_request_send: write: Broken pipe sshd[18516]: Starting session: command for ssh-test from ::1 port 54190 sshd[18513]: error: buffer_get_string_ret: bad string length 2080374784 sshd[18513]: fatal: buffer_get_string: buffer error sshd[18513]: pam_unix(sshd:session): session closed for user ssh-test sshd[18516]: Starting session: command for ssh-test from ::1 port 54190 sshd[18516]: fatal: mm_request_receive: read: Connection reset by peer sshd[18516]: fatal: mm_request_send: write: Broken pipe sshd[18768]: Starting session: command for ssh-test from ::1 port 56670 sshd[18765]: error: buffer_get_ret: trying to get more bytes 8 than in buffer 0 sshd[18765]: fatal: buffer_get_int: buffer error sshd[18765]: pam_unix(sshd:session): session closed for user ssh-test sshd[18768]: fatal: mm_request_receive: read: Connection reset by peer sshd[18768]: fatal: mm_request_send: write: Broken pipe RHSA-2015:2088-5 refers to BZ#1240613 (which is apparently restricted). As of 2016-02-13, my current testing shows: - CentOS 7.2.1511, openssh-6.6.1p1-23.el7_2.x86_64 - broken. - CentOS 7.2.1511, downgraded to openssh-6.6.1p1-22.el7.x86_64 - still broken (despite this is the supposed fix release per RHSA-2015-2088). - Fedora 23, openssh-7.1p2-3.fc23.x86_64 - broken. - Ubuntu 15.10, OpenSSH_6.9p1 Ubuntu-2ubuntu0.1 - Works without issue. - CentOS 6.6, openssh-5.3p1-104.el6_6.1.i686 - Works without issue. - Fedora 20, openssh-6.4p1-8.fc20.x86_64 - Works without issue. Additional research and notes: - https://ahwhattheheck.wordpress.com/2015/07/02/debugging-sporadically-encountered-ssh-encountered-an-unknown-error-in-ansible-runs/ - Bypassed the issue by effectively ensuring that no ControlMaster would be concurrently accessed by multiple client sessions, at the expense of increasing the number of ControlMasters used. - http://www.zenoss.org/forum/10136 - Posts indicated that the new "UsePrivilegeSeparation sandbox" could be a problem here - but I am able to consistently reproduce with or without this enabled.
Thank you for verbose report. We have private tracker for this bug on RHEL7 (bug #1252318), but we were not able to reproduce the issue reliably and even from the verbose logs that shows occasionally it was not clear where the problem was. We have collected logs with DEBUG3 log level, but they were not also much helpful. But I will check it during this week. Given your analysis, it looks like related to our downstream patches. > - Posts indicated that the new "UsePrivilegeSeparation sandbox" could be a problem here - but I am able to consistently reproduce with or without this enabled. This will not be the issue. RHEL7 is using rlimit sandbox and Fedora 23 is using seccomp one. > - http://www.zenoss.org/forum/10136 > Jun 26 01:06:53 hypervisor.neumann.local sshd[6089]: fatal: mm_request_receive_expect: read: rtype 115 != type 125 I can confirm these messages are related to the already fixed problem, both in RHEL7 and Fedora 22+.
Created attachment 1128271 [details] sychronize audit messages during privilege separation I hit the similar issue during the run of upstream testsuite of openssh-7.2 with ControlMaster (might be even different issue?). Current log [process id]: [9559] debug3: mm_request_send entering: type 120 [9557] debug3: mm_request_receive entering [9560] debug3: mm_request_send entering: type 124 [9557] debug3: monitor_read: checking request 120 buffer_get_string_ret: string is too large buffer_get_string: buffer error [9557] debug1: do_cleanup [9560] debug3: mm_request_receive_expect entering: type 125 [9560] debug3: mm_request_receive entering [9559] debug3: mm_request_receive_expect entering: type 121 [9559] debug3: mm_request_receive entering mm_request_receive: read: Connection reset by peer [9560] debug1: do_cleanup [9560] debug3: mm_request_send entering: type 122 mm_request_send: write: Broken pipe [9559] debug1: do_cleanup [9559] debug3: mm_request_send entering: type 124 mm_request_send: write: Broken pipe This is another race condition between user process sending rekey audit message and the children process auditing server keys removal. Both of the messages are written to the same pipe and they might get mixed up somehow and it was pretty reproducible. I tried to introduce some synchronization between these three processes to avoid issue with auditing. Implemented idea: Postauth child: wait for message from monitor (until the other child destroys keys so monitor will not get "confused") Monitor: wait for the message, that all the server keys are destroyed Child: destroy keys, audit them, inform monitor Monitor: receive destroy keys messages; after all of them; inform the postauth child to continue Postauth child: receive message, continue But I wonder how does it work on Fedora 20 and RHEL6. The auditing was still the same even in Fedora 20. Could you give it a try with my proposed patch and scratch build (openssh-7.2 pre-release for Fedora 23)? The selftest passed for me several times and I will try our stress test: http://koji.fedoraproject.org/koji/taskinfo?taskID=13036097
I wasn't exactly sure how to test. I downloaded all of the above RPMs from https://kojipkgs.fedoraproject.org//work/tasks/6099/13036099/ . Trying to install them all resulted in failed dependencies needed by openssh-askpass. I tried to install openssh-7.2*.rpm openssh-clients-*.rpm openssh-server-*.rpm, but would not succeed unless using `rpm -i --force` due to conflicts. At this point: $ rpm -q openssh openssh-7.1p2-3.fc23.x86_64 openssh-7.2p1-1.fc23.x86_64 $ ssh -V OpenSSH_7.1p2, OpenSSL 1.0.2f-fips 28 Jan 2016 $ ll /usr/bin/ssh -rwxr-xr-x. 1 root root 733032 Feb 18 10:20 /usr/bin/ssh $ sha1sum /usr/bin/ssh 0d58281daf92e3b0f901d7bd148c5cdc4b3c10a5 /usr/bin/ssh Any pointers to how to better apply such a scratch build would be much appreciated. Regardless, upon re-running my tests after the above installation, I am no longer able to reproduce after 10+ executions of the test I had written - so 10 x (10 threads x 100 iterations).
You would probably need to update instead of install. With RPM or DNF, it should not matter. I usually do dnf update *.rpm with all the downloaded RPMS in the working directory. Thanks for feedback. Glad that it worked for you, but I ran into another issues during another regression tests. Non-multiplexed tasks break the audit, because monitor pipes are non-blocking (?). Therefore events from cleanup are send before all the auditing is done and then the messages gets (again) mixed up. It is wrong, but it does not affect the general usability (the fatal error appears during connection cleanup and does not affect session-only exits). I will give it some more tries later. And as soon as I will have working patch, I will update also Fedora 23.
Created attachment 1129778 [details] patch: forward audit messages I made another scratch build addressing this issue (with attached patch). Currently privileged child takes care of forwarding messages from forked child to the monitor, avoiding any other race condition during delivering this message. During my tests I didn't notice any http://koji.fedoraproject.org/koji/taskinfo?taskID=13107809
openssh-7.1p2-4.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-25e3f1c255
openssh-7.1p2-4.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-25e3f1c255
openssh-7.1p2-4.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.
Just curious if or when this fix might be made available for EL 7?
Further testing required - but this now appears fixed as of openssh-6.6.1p1-31.el7.x86_64 under CentOS 7.3.1611! I even cranked my test script up to 1,000 iterations x 50 threads, and was unable to cause a ControlMaster failure. Looks like the fix was actually in -26 (which was never yet released for 7.2): > * Fri Apr 01 2016 Jakub Jelen <jjelen> 6.6.1p1-26 + 0.9.3-9 > ... > - Fix race condition between audit messages from different processes (#1310684) > ...