Bug 1226209 - segfault in ssleay_rand_bytes due to locking regression
Summary: segfault in ssleay_rand_bytes due to locking regression
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: openssl
Version: 6.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Tomas Mraz
QA Contact: Hubert Kario
URL:
Whiteboard:
Depends On: 1225994 1226204 1227734
Blocks: CVE-2015-3216
TreeView+ depends on / blocked
 
Reported: 2015-05-29 08:30 UTC by Tomas Mraz
Modified: 2015-07-22 07:31 UTC
CC List: 8 users

Fixed In Version: openssl-1.0.1e-39.el6
Doc Type: Bug Fix
Doc Text:
Cause: A refactoring of the locking of the RAND subsystem in the OpenSSL library introduced a regression.
Consequence: A multithreaded application using the OpenSSL library could crash if multiple threads pulled random numbers from the OpenSSL RNG simultaneously.
Fix: The regression has been fixed and the locking is now handled correctly.
Result: Multithreaded applications no longer crash when multiple threads pull random numbers from the OpenSSL RNG simultaneously.
Clone Of: 1226204
Environment:
Last Closed: 2015-07-22 07:31:26 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:1398 0 normal SHIPPED_LIVE openssl bug fix and enhancement update 2015-07-20 18:07:24 UTC

Description Tomas Mraz 2015-05-29 08:30:26 UTC
+++ This bug was initially created as a clone of Bug #1226204 +++

+++ This bug was initially created as a clone of Bug #1225994 +++

Originally found in:
openssl-1.0.1e-30.el6_6.4.src.rpm

Apparently introduced in:
openssl-1.0.1e-25.el7.src.rpm
openssl-1.0.1e-34.fc20.src.rpm

Accidentally(?) fixed in:
openssl-1.0.1i-1.fc21.src.rpm

Still present (I believe) in RH6/RH7/FC20. (Raising against Fedora because I currently use FC20 as my main desktop OS, though others will consider the RHEL packages the more important ones.)

This patch from Tomáš Mráz in November 2013, amongst other things, refactored the PRNG locking into a single function, private_RAND_lock():

https://lists.fedoraproject.org/pipermail/scm-commits/Week-of-Mon-20131118/1146865.html

Neither the provenance nor the need for this particular part of the patch is entirely clear. It doesn't appear to have ever been part of upstream OpenSSL. AFAICT it originated in either RH or Fedora around that time and was duplicated into both distros (and therefore picked up automatically by downstream distros such as CentOS). The particular problem is the change from ssleay_rand_bytes():

        if (!do_not_lock)
                {
                CRYPTO_w_lock(CRYPTO_LOCK_RAND);
                
                /* prevent ssleay_rand_bytes() from trying to obtain the lock again */
                CRYPTO_w_lock(CRYPTO_LOCK_RAND2);
                CRYPTO_THREADID_cpy(&locking_threadid, &cur);
                CRYPTO_w_unlock(CRYPTO_LOCK_RAND2);
                crypto_lock_rand = 1;
                }

To private_LOCK_rand():

        if (do_lock)
                {
                CRYPTO_w_lock(CRYPTO_LOCK_RAND);
                crypto_lock_rand = 1;
                CRYPTO_w_lock(CRYPTO_LOCK_RAND2);
                CRYPTO_THREADID_current(&locking_threadid);
                CRYPTO_w_unlock(CRYPTO_LOCK_RAND2);
                }

(RAND is the main lock that protects the real critical section. RAND2 isn't a critical section in its own right; it just forces readers and writers of locking_threadid to access it atomically, since thread IDs are now a multi-word struct. crypto_lock_rand and locking_threadid are used to build a recursive mutex on top of the external lock API, which provides only non-recursive rwlocks.)
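
For reference, the caller-side check that this scheme relies on looks roughly like the following (a paraphrase of the 1.0.1-era md_rand.c logic, not verbatim source); this is the check that step 3 below exploits:

        /* Paraphrase of the recursion check at the top of ssleay_rand_bytes();
         * not verbatim source. */
        int do_not_lock = 0;
        CRYPTO_THREADID cur;

        CRYPTO_THREADID_current(&cur);
        if (crypto_lock_rand)
                {
                /* Someone appears to hold the RAND lock already - is it us? */
                CRYPTO_r_lock(CRYPTO_LOCK_RAND2);
                do_not_lock = !CRYPTO_THREADID_cmp(&locking_threadid, &cur);
                CRYPTO_r_unlock(CRYPTO_LOCK_RAND2);
                }
        /* If do_not_lock ends up 1, the function skips taking CRYPTO_LOCK_RAND
         * entirely, assuming it is a recursive call from the thread that
         * already holds it. */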

Excitingly, this reintroduces a really old upstream bug from 2001:

https://www.mail-archive.com/openssl-dev@openssl.org/msg09018.html

In a multi-threaded program with multiple threads calling RAND_bytes(), the following can happen:

1) thread-A calls RAND_bytes() and completes normally. LOCK_RAND is left unlocked, crypto_lock_rand is 0, and locking_threadid remains set to thread-A's id.

2) thread-B calls RAND_bytes(), and gets as far as setting crypto_lock_rand=1 before being pre-empted just before it takes the RAND2 lock.

3) thread-A calls into RAND_bytes() again, sees that crypto_lock_rand==1, takes RAND2 before thread-B can, sees its own thread id in locking_threadid, and enters the main critical section without bothering to take the RAND lock.

4) thread-B takes RAND2, sets its own thread id, and also enters the main critical section.

5) Within the critical section in ssleay_rand_bytes() both threads execute:

        st_idx=state_index;
        st_num=state_num;
[...]
        state_index+=num_ceil;
        if (state_index > state_num)
                state_index %= state_num;

The += and %= are performed as two separate memory writes, so one thread can read state_index (into st_idx) in between the other thread's two writes, and can therefore see st_idx > st_num.

6) If st_idx+(MD_DIGEST_LENGTH/2) <= st_num, we feed that part of state[] into the hash function as a single block. If, however, it is > st_num, we feed it in as two separate blocks: the part from st_idx to the end of state[], then the part from state[0] up to the remaining number of bytes. If st_idx itself is > st_num, the former length calculation results in a "negative" length being passed to the hash (see the sketch below), which causes it to start reading memory from past the end of state[] and continue through the address space until it wraps round to the actual end of state[]. During this loop it eventually hits an unmapped page and segfaults. (The exact final effects may depend on which optimised version of the hash function OpenSSL chooses; in our case, with the default SHA-1, we were running on SSSE3-capable x86_64 hardware.)
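
The length calculation in step 6 looks roughly like the following (a paraphrase of the md_rand.c mixing step, not verbatim source); since the length argument of the digest-update call is unsigned, a "negative" st_num - st_idx turns into a huge value:

        /* Paraphrase of the state[] mixing in ssleay_rand_bytes(); not
         * verbatim source.  The update length is an unsigned quantity. */
        if (st_idx + MD_DIGEST_LENGTH/2 > st_num)
                {
                /* Wrap around the end of state[].  If the race has left
                 * st_idx > st_num, then st_num - st_idx underflows to a
                 * huge unsigned length and the hash reads far past the
                 * end of state[]. */
                MD_Update(&m, &(state[st_idx]), st_num - st_idx);
                MD_Update(&m, &(state[0]), MD_DIGEST_LENGTH/2 - (st_num - st_idx));
                }
        else
                MD_Update(&m, &(state[st_idx]), MD_DIGEST_LENGTH/2);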

In a multi-threaded server handling TLS connections, for appropriate ciphersuites, we read one block's worth of random for each TLS record transmitted to form the IV for the block cipher. With multiple threads fielding TLS connections, under moderate load, this causes repeated crashes anywhere from minutes to hours apart. (This isn't specifically a security bug: completely non-malicious traffic at entirely reasonable load levels triggers it.)
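
Not part of the original report, but for completeness: a stand-alone reproducer along the following lines (names and structure are mine) should show the same crash - several threads hammering RAND_bytes() concurrently, with the application-supplied locking callbacks that 1.0.1-era OpenSSL needs for thread safety:

/* Hypothetical reproducer sketch, not from this bug report: hammer
 * RAND_bytes() from several threads.  Build against 1.0.1-era OpenSSL,
 * e.g.  gcc repro.c -lcrypto -lpthread */
#include <pthread.h>
#include <stdlib.h>
#include <openssl/crypto.h>
#include <openssl/rand.h>

static pthread_mutex_t *locks;

/* Locking callback required by pre-1.1.0 OpenSSL for thread safety. */
static void locking_cb(int mode, int n, const char *file, int line)
        {
        (void)file;
        (void)line;
        if (mode & CRYPTO_LOCK)
                pthread_mutex_lock(&locks[n]);
        else
                pthread_mutex_unlock(&locks[n]);
        }

static void *worker(void *arg)
        {
        unsigned char buf[16];  /* one block's worth, as for a TLS record IV */
        (void)arg;
        for (;;)
                RAND_bytes(buf, sizeof(buf));
        return NULL;
        }

int main(void)
        {
        int i, nlocks = CRYPTO_num_locks();
        pthread_t tids[8];

        locks = malloc(nlocks * sizeof(*locks));
        for (i = 0; i < nlocks; i++)
                pthread_mutex_init(&locks[i], NULL);
        CRYPTO_set_locking_callback(locking_cb);

        for (i = 0; i < 8; i++)
                pthread_create(&tids[i], NULL, worker, NULL);
        for (i = 0; i < 8; i++)
                pthread_join(tids[i], NULL);  /* runs until the segfault */
        return 0;
        }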

This part of the patch was actually removed in FC21, as part of the upgrade from 1.0.1h to 1.0.1i. There appears to be no reason given for why it was deemed no longer necessary (other parts of the patch remain), which makes me think fixing this bug was not the motivation. Distros that are still on 1.0.1e (FC20/RH* and others) remain affected. If the refactor is still required for those versions (and, this issue aside, I think it's actually better code), the fix is simple enough: set crypto_lock_rand only after locking_threadid has been updated, so that a re-entering thread can never observe crypto_lock_rand == 1 paired with a stale thread id:

--- openssl-1.0.1e/crypto/rand/rand_lib.c.randlock	2015-05-14 14:48:26.073062716 +0100
+++ openssl-1.0.1e/crypto/rand/rand_lib.c	2015-05-14 14:50:02.907762929 +0100
@@ -208,10 +208,10 @@
 	if (do_lock)
 		{
 		CRYPTO_w_lock(CRYPTO_LOCK_RAND);
-		crypto_lock_rand = 1;
 		CRYPTO_w_lock(CRYPTO_LOCK_RAND2);
 		CRYPTO_THREADID_current(&locking_threadid);
 		CRYPTO_w_unlock(CRYPTO_LOCK_RAND2);
+		crypto_lock_rand = 1;
 		}
 	return do_lock;
 	}

--- Additional comment from Tomas Mraz on 2015-05-29 10:22:44 CEST ---

Thank you for the report and analysis. We will need to fix this everywhere it is present. The refactoring was done because of the need to fix RAND locking issues in FIPS mode - deadlocks, or no locking where it was needed. Unfortunately I made this mistake in the process. The refactoring was later dropped because upstream fixed the FIPS RNG locking in a different way.

--- Additional comment from Tomas Mraz on 2015-05-29 10:27:57 CEST ---

We should fix this in RHEL 6.7. I need to discuss with Stephan whether this is regarded as touching the crypto implementation. Although it does not touch the DRBG algorithm implementation itself, it is a fix in support code that is called from the DRBG.

Comment 11 errata-xmlrpc 2015-07-22 07:31:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1398.html

