| Summary: | Unresolved OpenSSH ControlMaster multiplexing race condition (RHSA-2015:2088-5). | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Mark Ziesemer <bugs> |
| Component: | openssh | Assignee: | Jakub Jelen <jjelen> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 23 | CC: | jjelen, mattias.ellert, mgrepl, plautrba, tmraz |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openssh-7.1p2-4.fc23 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-28 12:19:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |

Description
Mark Ziesemer
2016-02-14 02:59:36 UTC
Thank you for the verbose report. We have a private tracker for this bug on RHEL7 (bug #1252318), but we were not able to reproduce the issue reliably, and even from the verbose logs that occasionally show it, it was not clear where the problem was. We have collected logs at the DEBUG3 log level, but they were not much help either. I will check it during this week. Given your analysis, it looks related to our downstream patches.

> - Posts indicated that the new "UsePrivilegeSeparation sandbox" could be a problem here - but I am able to consistently reproduce with or without this enabled.

This will not be the issue. RHEL7 is using the rlimit sandbox and Fedora 23 is using the seccomp one.

> - http://www.zenoss.org/forum/10136
> Jun 26 01:06:53 hypervisor.neumann.local sshd[6089]: fatal: mm_request_receive_expect: read: rtype 115 != type 125

I can confirm these messages are related to the already fixed problem, both in RHEL7 and Fedora 22+.

Created attachment 1128271 [details]
synchronize audit messages during privilege separation

I hit a similar issue while running the upstream testsuite of openssh-7.2 with ControlMaster (might it even be a different issue?).

Current log [process id]:

[9559] debug3: mm_request_send entering: type 120
[9557] debug3: mm_request_receive entering
[9560] debug3: mm_request_send entering: type 124
[9557] debug3: monitor_read: checking request 120
buffer_get_string_ret: string is too large
buffer_get_string: buffer error
[9557] debug1: do_cleanup
[9560] debug3: mm_request_receive_expect entering: type 125
[9560] debug3: mm_request_receive entering
[9559] debug3: mm_request_receive_expect entering: type 121
[9559] debug3: mm_request_receive entering
mm_request_receive: read: Connection reset by peer
[9560] debug1: do_cleanup
[9560] debug3: mm_request_send entering: type 122
mm_request_send: write: Broken pipe
[9559] debug1: do_cleanup
[9559] debug3: mm_request_send entering: type 124
mm_request_send: write: Broken pipe

This is another race condition, between the user process sending the rekey audit message and the child process auditing the removal of the server keys. Both messages are written to the same pipe, so they can get mixed up, and it was quite reproducible. I tried to introduce some synchronization between these three processes to avoid the issue with auditing.

Implemented idea:

- Postauth child: wait for a message from the monitor (until the other child destroys the keys, so the monitor will not get "confused")
- Monitor: wait for the message that all the server keys are destroyed
- Child: destroy keys, audit them, inform monitor
- Monitor: receive the destroy-keys messages; after all of them, inform the postauth child to continue
- Postauth child: receive message, continue

But I wonder how it works on Fedora 20 and RHEL6. The auditing was the same even in Fedora 20.

Could you give it a try with my proposed patch and scratch build (openssh-7.2 pre-release for Fedora 23)? The selftest passed for me several times and I will try our stress test:

http://koji.fedoraproject.org/koji/taskinfo?taskID=13036097

I wasn't exactly sure how to test.
I downloaded all of the above RPMs from https://kojipkgs.fedoraproject.org//work/tasks/6099/13036099/ . Trying to install them all resulted in failed dependencies needed by openssh-askpass. I tried to install openssh-7.2*.rpm openssh-clients-*.rpm openssh-server-*.rpm, but that would not succeed without `rpm -i --force`, due to conflicts.

At this point:

$ rpm -q openssh
openssh-7.1p2-3.fc23.x86_64
openssh-7.2p1-1.fc23.x86_64

$ ssh -V
OpenSSH_7.1p2, OpenSSL 1.0.2f-fips 28 Jan 2016

$ ll /usr/bin/ssh
-rwxr-xr-x. 1 root root 733032 Feb 18 10:20 /usr/bin/ssh

$ sha1sum /usr/bin/ssh
0d58281daf92e3b0f901d7bd148c5cdc4b3c10a5 /usr/bin/ssh

Any pointers to how to better apply such a scratch build would be much appreciated.

Regardless, upon re-running my tests after the above installation, I am no longer able to reproduce after 10+ executions of the test I had written - so 10 x (10 threads x 100 iterations).

You would probably need to update instead of install. With RPM or DNF, it should not matter. I usually do

dnf update *.rpm

with all the downloaded RPMs in the working directory.

Thanks for the feedback. Glad that it worked for you, but I ran into another issue during further regression tests. Non-multiplexed tasks break the audit, because monitor pipes are non-blocking (?). Therefore events from cleanup are sent before all the auditing is done, and then the messages get (again) mixed up. It is wrong, but it does not affect general usability (the fatal error appears during connection cleanup and does not affect session-only exits).

I will give it some more tries later. As soon as I have a working patch, I will also update Fedora 23.

Created attachment 1129778 [details]
patch: forward audit messages

I made another scratch build addressing this issue (with the attached patch). Currently the privileged child takes care of forwarding messages from the forked child to the monitor, avoiding any further race condition in delivering this message.
During my tests I didn't notice any

http://koji.fedoraproject.org/koji/taskinfo?taskID=13107809

openssh-7.1p2-4.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-25e3f1c255

openssh-7.1p2-4.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-25e3f1c255

openssh-7.1p2-4.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Just curious if or when this fix might be made available for EL 7?

Further testing required - but this now appears fixed as of openssh-6.6.1p1-31.el7.x86_64 under CentOS 7.3.1611! I even cranked my test script up to 1,000 iterations x 50 threads, and was unable to cause a ControlMaster failure.
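
For anyone wanting to repeat this kind of check, a stress loop of the shape described above (N threads x M iterations over a shared ControlMaster connection) might look roughly like the sketch below. This is not the reporter's actual test script, which is not attached to this report. The helper names `build_command` and `run_stress`, the host `user@example.test`, and the control path are all hypothetical; only the `ssh` options themselves (`ControlMaster`, `ControlPath`, `ControlPersist`, `BatchMode`) are standard.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(host, control_path):
    """Build one multiplexed ssh invocation that runs a remote no-op.

    `host` and `control_path` are placeholders chosen by the caller.
    """
    return [
        "ssh",
        "-o", "ControlMaster=auto",
        "-o", f"ControlPath={control_path}",
        "-o", "ControlPersist=60",
        "-o", "BatchMode=yes",
        host,
        "true",
    ]


def run_stress(command, threads, iterations, runner=subprocess.run):
    """Run `command` `iterations` times in each of `threads` workers.

    Returns the total number of invocations that exited non-zero.
    `runner` is injectable so the loop can be exercised without a server.
    """
    def worker(_index):
        failures = 0
        for _attempt in range(iterations):
            result = runner(command, capture_output=True)
            if result.returncode != 0:
                failures += 1
        return failures

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(worker, range(threads)))
```

Against a real host this could be called as `run_stress(build_command("user@example.test", "~/.ssh/cm-%r@%h:%p"), threads=10, iterations=100)`; a non-zero return value means at least one multiplexed session failed.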
Looks like the fix was actually in -26 (which was never yet released for 7.2):
> * Fri Apr 01 2016 Jakub Jelen <jjelen> 6.6.1p1-26 + 0.9.3-9
> ...
> - Fix race condition between audit messages from different processes (#1310684)
> ...
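
For reference, the ordering proposed earlier in this report for the first patch (postauth child waits, the other child destroys and audits the keys, the monitor then releases the postauth child) can be modelled with threads and a single queue standing in for the shared monitor pipe. This is only an illustration of the ordering under those assumptions, not the code from either attached patch; the function name `run_handshake` and the message names are made up for the example.

```python
import queue
import threading


def run_handshake(num_keys=3):
    """Simulate the proposed ordering; return the messages the monitor saw."""
    to_monitor = queue.Queue()    # stands in for the shared monitor pipe
    go_ahead = threading.Event()  # monitor -> postauth child: "continue"
    seen = []

    def key_child():
        # Destroy each key, audit it, then tell the monitor we are done.
        for index in range(num_keys):
            to_monitor.put(("audit_key_destroyed", index))
        to_monitor.put(("all_keys_destroyed", None))

    def postauth_child():
        # Do not write to the pipe until the monitor says it is safe.
        go_ahead.wait()
        to_monitor.put(("audit_rekey", None))
        to_monitor.put(("done", None))

    def monitor():
        while True:
            kind, _payload = to_monitor.get()
            seen.append(kind)
            if kind == "all_keys_destroyed":
                go_ahead.set()  # release the postauth child
            elif kind == "done":
                return

    workers = [threading.Thread(target=fn)
               for fn in (postauth_child, key_child, monitor)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return seen
```

Because the postauth child only writes after the monitor has processed the "all keys destroyed" message, its audit message can never interleave with the key-destruction audit messages in this model.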