Bug 1308295 - Unresolved OpenSSH ControlMaster multiplexing race condition (RHSA-2015:2088-5).
Summary: Unresolved OpenSSH ControlMaster multiplexing race condition (RHSA-2015:2088-5).
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: openssh
Version: 23
Hardware: Unspecified
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Jakub Jelen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-02-14 02:59 UTC by Mark Ziesemer
Modified: 2016-12-14 17:05 UTC (History)
5 users (show)

Fixed In Version: openssh-7.1p2-4.fc23
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-28 12:19:59 UTC


Attachments (Terms of Use)
sychronize audit messages during privilege separation (4.98 KB, patch)
2016-02-18 16:27 UTC, Jakub Jelen
no flags Details | Diff
patch: forward audit messages (3.95 KB, patch)
2016-02-23 14:59 UTC, Jakub Jelen
no flags Details | Diff


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1240613 None None None 2019-06-28 01:36:06 UTC
Red Hat Bugzilla 1252318 None None None 2019-06-28 01:36:06 UTC

Internal Links: 1252318

Description Mark Ziesemer 2016-02-14 02:59:36 UTC
Description of problem:

Per https://rhn.redhat.com/errata/RHSA-2015-2088.html :

> Due to a race condition between auditing messages and answers when using
ControlMaster multiplexing, one session in the shared connection randomly
and unexpectedly exited the connection. This update fixes the race
condition in the auditing code, and multiplexing connections now work as
expected even with a number of sessions created at once. (BZ#1240613)

Unfortunately, the issue appears to remain - even in the packages and versions listed as correcting the issue.

Version-Release number of selected component (if applicable):

openssh-7.1p2-3.fc23.x86_64

How reproducible:

Consistently (though with the current sample size in the referenced script, multiple client runs may be required to hit the race condition).

Steps to Reproduce:

Refer to https://gist.github.com/ziesemer/93d64a074abdd8e6f632 .

Actual results:

The ControlMaster session unexpectedly and abruptly terminates, and the associated child session also exits with a failure (return code 255, and an error of "Write failed: Broken pipe").

Expected results:

No failures should occur.

Additional info:

While running the server with LogLevel VERBOSE, a sampling of observed buffer errors are shown below:

sshd[116589]: Starting session: command for ssh-test from ::1 port 54179
sshd[116586]: error: buffer_get_ret: trying to get more bytes 8 than in buffer 4
sshd[116586]: fatal: buffer_get_int: buffer error
sshd[116586]: pam_unix(sshd:session): session closed for user ssh-test
sshd[116589]: Starting session: command for ssh-test from ::1 port 54179
sshd[116589]: fatal: mm_request_send: write: Broken pipe
sshd[116589]: fatal: mm_request_send: write: Broken pipe

sshd[18516]: Starting session: command for ssh-test from ::1 port 54190
sshd[18513]: error: buffer_get_string_ret: bad string length 2080374784
sshd[18513]: fatal: buffer_get_string: buffer error
sshd[18513]: pam_unix(sshd:session): session closed for user ssh-test
sshd[18516]: Starting session: command for ssh-test from ::1 port 54190
sshd[18516]: fatal: mm_request_receive: read: Connection reset by peer
sshd[18516]: fatal: mm_request_send: write: Broken pipe

sshd[18768]: Starting session: command for ssh-test from ::1 port 56670
sshd[18765]: error: buffer_get_ret: trying to get more bytes 8 than in buffer 0
sshd[18765]: fatal: buffer_get_int: buffer error
sshd[18765]: pam_unix(sshd:session): session closed for user ssh-test
sshd[18768]: fatal: mm_request_receive: read: Connection reset by peer
sshd[18768]: fatal: mm_request_send: write: Broken pipe

RHSA-2015:2088-5 refers to BZ#1240613 (which is apparently restricted).

As of 2016-02-13, my current testing shows:
  - CentOS 7.2.1511, openssh-6.6.1p1-23.el7_2.x86_64 - broken.
  - CentOS 7.2.1511, downgraded to openssh-6.6.1p1-22.el7.x86_64 - still broken
      (despite this is the supposed fix release per RHSA-2015-2088).
  - Fedora 23, openssh-7.1p2-3.fc23.x86_64 - broken.
  - Ubuntu 15.10, OpenSSH_6.9p1 Ubuntu-2ubuntu0.1 - Works without issue.
  - CentOS 6.6, openssh-5.3p1-104.el6_6.1.i686 - Works without issue.
  - Fedora 20, openssh-6.4p1-8.fc20.x86_64 - Works without issue.

Additional research and notes:
  - https://ahwhattheheck.wordpress.com/2015/07/02/debugging-sporadically-encountered-ssh-encountered-an-unknown-error-in-ansible-runs/
    - Bypassed the issue by effectively ensuring that no ControlMaster would be concurrently accessed by multiple client sessions,
        at the expense of increasing the number of ControlMasters used.
  - http://www.zenoss.org/forum/10136
    - Posts indicated that the new "UsePrivilegeSeparation sandbox" could be a problem here - but I am able to consistently
        reproduce with or without this enabled.

Comment 1 Jakub Jelen 2016-02-15 09:33:53 UTC
Thank you for verbose report. We have private tracker for this bug on RHEL7 (bug #1252318), but we were not able to reproduce the issue reliably and even from the verbose logs that shows occasionally it was not clear where the problem was. We have collected logs with DEBUG3 log level, but they were not also much helpful. But I will check it during this week. Given your analysis, it looks like related to our downstream patches.


> - Posts indicated that the new "UsePrivilegeSeparation sandbox" could be a problem here - but I am able to consistently reproduce with or without this enabled.

This will not be the issue. RHEL7 is using rlimit sandbox and Fedora 23 is using seccomp one.


> - http://www.zenoss.org/forum/10136
>  Jun 26 01:06:53 hypervisor.neumann.local sshd[6089]: fatal: mm_request_receive_expect: read: rtype 115 != type 125

I can confirm these messages are related to the already fixed problem, both in RHEL7 and Fedora 22+.

Comment 2 Jakub Jelen 2016-02-18 16:27:01 UTC
Created attachment 1128271 [details]
sychronize audit messages during privilege separation

I hit the similar issue during the run of upstream testsuite of openssh-7.2 with ControlMaster (might be even different issue?).

Current log [process id]:

[9559] debug3: mm_request_send entering: type 120
[9557] debug3: mm_request_receive entering
[9560] debug3: mm_request_send entering: type 124
[9557] debug3: monitor_read: checking request 120
buffer_get_string_ret: string is too large
buffer_get_string: buffer error
[9557] debug1: do_cleanup
[9560] debug3: mm_request_receive_expect entering: type 125
[9560] debug3: mm_request_receive entering
[9559] debug3: mm_request_receive_expect entering: type 121
[9559] debug3: mm_request_receive entering
mm_request_receive: read: Connection reset by peer
[9560] debug1: do_cleanup
[9560] debug3: mm_request_send entering: type 122
mm_request_send: write: Broken pipe
[9559] debug1: do_cleanup
[9559] debug3: mm_request_send entering: type 124
mm_request_send: write: Broken pipe

This is another race condition between user process sending rekey audit message and the children process auditing server keys removal. Both of the messages are written to the same pipe and they might get mixed up somehow and it was pretty reproducible.

I tried to introduce some synchronization between these three processes to avoid issue with auditing. Implemented idea:

Postauth child:   wait for message from monitor (until the other child destroys keys so monitor will not get "confused")
Monitor:          wait for the message, that all the server keys are destroyed
Child:            destroy keys, audit them, inform monitor
Monitor:          receive destroy keys messages; after all of them; inform the postauth child to continue
Postauth child:   receive message, continue

But I wonder how does it work on Fedora 20 and RHEL6. The auditing was still the same even in Fedora 20.

Could you give it a try with my proposed patch and scratch build (openssh-7.2 pre-release for Fedora 23)? The selftest passed for me several times and I will try our stress test:
http://koji.fedoraproject.org/koji/taskinfo?taskID=13036097

Comment 3 Mark Ziesemer 2016-02-18 19:15:08 UTC
I wasn't exactly sure how to test.  I downloaded all of the above RPMs from https://kojipkgs.fedoraproject.org//work/tasks/6099/13036099/ .  Trying to install them all resulted in failed dependencies needed by openssh-askpass.  I tried to install openssh-7.2*.rpm openssh-clients-*.rpm openssh-server-*.rpm, but would not succeed unless using `rpm -i --force` due to conflicts.  At this point:

$ rpm -q openssh
openssh-7.1p2-3.fc23.x86_64
openssh-7.2p1-1.fc23.x86_64

$ ssh -V
OpenSSH_7.1p2, OpenSSL 1.0.2f-fips  28 Jan 2016

$ ll /usr/bin/ssh
-rwxr-xr-x. 1 root root 733032 Feb 18 10:20 /usr/bin/ssh

$ sha1sum /usr/bin/ssh
0d58281daf92e3b0f901d7bd148c5cdc4b3c10a5  /usr/bin/ssh

Any pointers to how to better apply such a scratch build would be much appreciated.

Regardless, upon re-running my tests after the above installation, I am no longer able to reproduce after 10+ executions of the test I had written - so 10 x (10 threads x 100 iterations).

Comment 4 Jakub Jelen 2016-02-19 16:32:39 UTC
You would probably need to update instead of install. With RPM or DNF, it should not matter. I usually do   dnf update *.rpm   with all the downloaded RPMS in the working directory.

Thanks for feedback. Glad that it worked for you, but I ran into another issues during another regression tests. Non-multiplexed tasks break the audit, because monitor pipes are non-blocking (?). Therefore events from cleanup are send before all the auditing is done and then the messages gets (again) mixed up.

It is wrong, but it does not affect the general usability (the fatal error appears during connection cleanup and does not affect session-only exits). I will give it some more tries later. And as soon as I will have working patch, I will update also Fedora 23.

Comment 5 Jakub Jelen 2016-02-23 14:59:56 UTC
Created attachment 1129778 [details]
patch: forward audit messages

I made another scratch build addressing this issue (with attached patch). 

Currently privileged child takes care of forwarding messages from forked child to the monitor, avoiding any other race condition during delivering this message. During my tests I didn't notice any 

http://koji.fedoraproject.org/koji/taskinfo?taskID=13107809

Comment 6 Fedora Update System 2016-02-25 10:32:59 UTC
openssh-7.1p2-4.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-25e3f1c255

Comment 7 Fedora Update System 2016-02-26 20:53:16 UTC
openssh-7.1p2-4.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-25e3f1c255

Comment 8 Fedora Update System 2016-02-28 12:19:49 UTC
openssh-7.1p2-4.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 9 Mark Ziesemer 2016-03-18 14:35:31 UTC
Just curious if or when this fix might be made available for EL 7?

Comment 10 Mark Ziesemer 2016-12-14 17:05:47 UTC
Further testing required - but this now appears fixed as of openssh-6.6.1p1-31.el7.x86_64 under CentOS 7.3.1611! I even cranked my test script up to 1,000 iterations x 50 threads, and was unable to cause a ControlMaster failure.

Looks like the fix was actually in -26 (which was never yet released for 7.2):

> * Fri Apr 01 2016 Jakub Jelen <jjelen@redhat.com> 6.6.1p1-26 + 0.9.3-9
> ...
> - Fix race condition between audit messages from different processes (#1310684)
> ...


Note You need to log in before you can comment on or make changes to this bug.