Bug 1225994 - segfault in ssleay_rand_bytes due to locking regression
Summary: segfault in ssleay_rand_bytes due to locking regression
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: openssl
Version: 20
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Tomas Mraz
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1226204 1226209
 
Reported: 2015-05-28 16:58 UTC by John Sullivan
Modified: 2019-07-22 13:13 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1226204
Environment:
Last Closed: 2015-06-30 01:38:55 UTC
Type: Bug
Embargoed:



Description John Sullivan 2015-05-28 16:58:30 UTC
Originally found in:
openssl-1.0.1e-30.el6_6.4.src.rpm

Apparently introduced in:
openssl-1.0.1e-25.el7.src.rpm
openssl-1.0.1e-34.fc20.src.rpm

Accidentally(?) fixed in:
openssl-1.0.1i-1.fc21.src.rpm

Still present (I believe) in RH6/RH7/FC20. (Raising against Fedora because I currently use FC20 as my main desktop OS, though others will consider RH the more important packages.)

This patch from Tomáš Mráz in November 2013, amongst other things, refactored the PRNG locking into a single function, private_RAND_lock():

https://lists.fedoraproject.org/pipermail/scm-commits/Week-of-Mon-20131118/1146865.html

Neither the provenance nor the need for this particular part of the patch is entirely clear. It doesn't appear to have ever been part of upstream openssl. AFAICT it originated in either RH or Fedora around that time and was duplicated into both distros (and therefore picked up automatically by downstream distros such as CentOS). The particular problem is the change from ssleay_rand_bytes():

        if (!do_not_lock)
                {
                CRYPTO_w_lock(CRYPTO_LOCK_RAND);
                
                /* prevent ssleay_rand_bytes() from trying to obtain the lock again */
                CRYPTO_w_lock(CRYPTO_LOCK_RAND2);
                CRYPTO_THREADID_cpy(&locking_threadid, &cur);
                CRYPTO_w_unlock(CRYPTO_LOCK_RAND2);
                crypto_lock_rand = 1;
                }

To private_LOCK_rand():

        if (do_lock)
                {
                CRYPTO_w_lock(CRYPTO_LOCK_RAND);
                crypto_lock_rand = 1;
                CRYPTO_w_lock(CRYPTO_LOCK_RAND2);
                CRYPTO_THREADID_current(&locking_threadid);
                CRYPTO_w_unlock(CRYPTO_LOCK_RAND2);
                }

(RAND is the main lock that protects the real critical section. RAND2 doesn't guard a real critical section as such, but forces reads and writes of locking_threadid to be atomic, since thread ids are now a multi-word struct. crypto_lock_rand/locking_threadid are used to build a recursive mutex on top of the external lock API, which provides only non-recursive rwlocks.)
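
For context, a minimal sketch of the recursion-detection pattern these variables implement, with the correct ordering (variable and lock names follow the OpenSSL source; the wrapper function rand_lock_maybe() is a hypothetical illustration, not an actual OpenSSL function):

#include <openssl/crypto.h>

static CRYPTO_THREADID locking_threadid;
static int crypto_lock_rand;

/* Take the RAND lock unless this thread already holds it. */
static int rand_lock_maybe(void)
{
  CRYPTO_THREADID cur;
  int do_lock = 1;

  CRYPTO_THREADID_current(&cur);
  if (crypto_lock_rand) {
    /* RAND2 only makes the read of the multi-word thread id atomic. */
    CRYPTO_w_lock(CRYPTO_LOCK_RAND2);
    if (CRYPTO_THREADID_cmp(&locking_threadid, &cur) == 0)
      do_lock = 0;                        /* recursive entry by the holder */
    CRYPTO_w_unlock(CRYPTO_LOCK_RAND2);
  }
  if (do_lock) {
    CRYPTO_w_lock(CRYPTO_LOCK_RAND);
    CRYPTO_w_lock(CRYPTO_LOCK_RAND2);
    CRYPTO_THREADID_cpy(&locking_threadid, &cur);
    CRYPTO_w_unlock(CRYPTO_LOCK_RAND2);
    crypto_lock_rand = 1;                 /* set only after the id is recorded */
  }
  return do_lock;                         /* caller unlocks only if this is 1 */
}

The crucial property is that crypto_lock_rand must only become visible once locking_threadid already holds the current holder's id; the buggy reordering breaks exactly that.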

Excitingly, this reintroduces a really old upstream bug from 2001:

https://www.mail-archive.com/openssl-dev@openssl.org/msg09018.html

In a multi-threaded program with multiple threads calling RAND_bytes(), the following can happen:

1) thread-A calls RAND_bytes() and completes normally. LOCK_RAND is left unlocked, crypto_lock_rand is 0, and locking_threadid remains set to thread-A's id.

2) thread-B calls RAND_bytes(), and gets as far as setting crypto_lock_rand=1 before being pre-empted just before it takes the RAND2 lock.

3) thread-A calls into RAND_bytes() again, sees that crypto_lock_rand==1, takes RAND2 before thread-B can, sees its own thread id in locking_threadid, and enters the main critical section without bothering to take the RAND lock.

4) thread-B takes RAND2, sets its own thread id, and also enters the main critical section.

5) Within the critical section in ssleay_rand_bytes() both threads execute:

        st_idx=state_index;
        st_num=state_num;
[...]
        state_index+=num_ceil;
        if (state_index > state_num)
                state_index %= state_num;

The += and %= are done as two separate memory writes, so one thread can read state_index into st_idx in between the other thread's two writes, and can therefore see st_idx > st_num.

6) If st_idx+(MD_DIGEST_LENGTH/2)<=st_num, we feed that part of state[] into the hash function as a single block. If however it is > st_num, we feed it in as two separate blocks, the part from st_idx to the end of state[], then the part from state[0] up to the remaining number of bytes. If st_idx itself is > st_num, the former calculation results in a "negative" length passed to the hash, which causes it to start reading memory from past the end of state[] and continue around the entire address space until it wraps round to the actual end of state[]. During this loop it eventually hits an unmapped page and segfaults. (The actual final effects may depend on which optimised version of the hash function openssl chooses, in our case with the default SHA1 we were running on SSSE3 capable x86_64 hardware.)
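
To make step 6 concrete, here is a paraphrase of the split (names as in md_rand.c; this is a sketch of the logic, not a verbatim excerpt):

/* Mix the next chunk of state[] into the hash. With the race,
 * st_idx can exceed st_num, so the first length below goes
 * negative; interpreted as a size it becomes huge and the hash
 * reads far past the end of state[] until it hits an unmapped
 * page. */
if (st_idx + MD_DIGEST_LENGTH / 2 > st_num)
  {
  MD_Update(&m, &state[st_idx], st_num - st_idx);
  MD_Update(&m, &state[0], MD_DIGEST_LENGTH / 2 - (st_num - st_idx));
  }
else
  MD_Update(&m, &state[st_idx], MD_DIGEST_LENGTH / 2);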

In a multi-threaded server handling TLS connections, for appropriate ciphersuites, we read one block's worth of random for each TLS record transmitted to form the IV for the block cipher. With multiple threads fielding TLS connections, under moderate load, this causes repeated crashes anywhere from minutes to hours apart. (This isn't specifically a security bug: completely non-malicious traffic at entirely reasonable load levels triggers it.)

This part of the patch was actually removed during FC21, as part of the upgrade from 1.0.1h to 1.0.1i. No reason appears to be given for why it was deemed no longer necessary (other parts of the patch remain), which makes me think fixing this bug was not the motivation. Distros that are still on 1.0.1e (FC20/RH* and others) remain affected. If the refactor is still required for those versions (and apart from this issue I think it's actually better code), the fix is simple enough: set crypto_lock_rand only after locking_threadid has been updated under RAND2, as the original ssleay_rand_bytes() code did:

--- openssl-1.0.1e/crypto/rand/rand_lib.c.randlock	2015-05-14 14:48:26.073062716 +0100
+++ openssl-1.0.1e/crypto/rand/rand_lib.c	2015-05-14 14:50:02.907762929 +0100
@@ -208,10 +208,10 @@
 	if (do_lock)
 		{
 		CRYPTO_w_lock(CRYPTO_LOCK_RAND);
-		crypto_lock_rand = 1;
 		CRYPTO_w_lock(CRYPTO_LOCK_RAND2);
 		CRYPTO_THREADID_current(&locking_threadid);
 		CRYPTO_w_unlock(CRYPTO_LOCK_RAND2);
+		crypto_lock_rand = 1;
 		}
 	return do_lock;
 	}

Comment 1 Tomas Mraz 2015-05-29 08:22:44 UTC
Thank you for the report and analysis. We will need to fix this everywhere it is present. The refactoring was done due to the need to fix RAND locking issues in FIPS mode - deadlocks, or no locking where it was needed. Unfortunately I made this mistake in the process. The refactoring was later dropped because upstream fixed the FIPS RNG locking in a different way.

Comment 2 Tomas Mraz 2015-05-29 08:58:46 UTC
The only thing I am curious about is why nobody reported the regression earlier - especially on RHEL.

Comment 3 John Sullivan 2015-05-29 10:06:13 UTC
I can only imagine that using openssl to implement multi-threaded TLS servers is somewhat of a rare use case. (Most of the off-the-shelf uses prefer to run multi-process single-threaded.)

(The other possibility is that even in that case, as I hinted above, the exact optimised form of the hash may be important. In the SSSE3 case the inner hash block loop is formulated as a loop from a start pointer to a specific end pointer value, so the negative length causes an unchecked wrap around the address space. Other forms may check the length value directly against an index, so under signed arithmetic we simply terminate the loop early. That might mean less good randomness is produced, but won't crash.)
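
As a schematic illustration of that difference (this is not OpenSSL's actual assembly, just a sketch of the two loop shapes; consume_block() is a hypothetical stand-in for one hash block round):

#include <stddef.h>

static void consume_block(const unsigned char *p) { (void) p; }

/* "Run until an end pointer" formulation: with a negative len the end
 * pointer sits before the start, the condition never becomes false,
 * and the walk continues through the address space until it faults. */
static void end_pointer_loop(const unsigned char *p, ptrdiff_t len)
{
  const unsigned char *end = p + len;
  while (p != end) {
    consume_block(p);
    p += 64;
  }
}

/* Length-checked formulation: under signed arithmetic a negative len
 * fails the test immediately, so the loop simply exits early. */
static void counted_loop(const unsigned char *p, ptrdiff_t len)
{
  for (ptrdiff_t i = 0; i + 64 <= len; i += 64)
    consume_block(p + i);
}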

Comment 4 Fedora End Of Life 2015-05-29 13:51:20 UTC
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 5 Tomas Mraz 2015-06-02 08:06:08 UTC
Fedora 22 is not affected.

Comment 6 Fedora Update System 2015-06-02 08:37:51 UTC
openssl-1.0.1e-43.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/openssl-1.0.1e-43.fc20

Comment 7 Florian Weimer 2015-06-02 09:54:00 UTC
(In reply to John Sullivan from comment #0)
> Still present (I believe) in RH6/RH7/FC20. (Raising against Fedora because I
> currently use FC20 as my main desktop OS, though others will consider RH the
> more important packages.)

I'm trying to reproduce this with the stress tester below.  I don't see the issue.  Is there something else I need to configure to expose it?

I do see data races with Helgrind, though, even on FC22, which has supposedly been fixed.


#include <openssl/crypto.h>
#include <openssl/rand.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

static pthread_mutex_t *locks;

static void
lock_callback(int mode, int type, const char *file, int line)
{
  if (mode & CRYPTO_LOCK) {
    if (pthread_mutex_lock(locks + type) != 0) {
      abort();
    }
  } else {
    if (pthread_mutex_unlock(locks + type) != 0) {
      abort();
    }
  }
}

static void
init_locks(void)
{
  int count = CRYPTO_num_locks();
  locks = calloc(count, sizeof(*locks));
  for (int i = 0; i < count; ++i) {
    if (pthread_mutex_init(locks + i, NULL) != 0) {
      abort();
    }
  }
  CRYPTO_set_locking_callback(lock_callback);
}

static void *
runner(void *closure)
{
  uintptr_t size = (uintptr_t) closure;
  unsigned char *buffer = malloc(size);
  if (buffer == NULL) {
    abort();
  }
  while (true) {
    RAND_bytes(buffer, size);
  }
}



int
main(int argc, char **argv)
{
  int threads = atoi(argv[1]);
  int start_size = atoi(argv[2]);
  int size_increment = atoi(argv[3]);
  init_locks();

  for (int i = 1; i < threads; ++i) {
    pthread_t thread;
    if (pthread_create(&thread, NULL, runner,
                       (void *)(uintptr_t)(start_size + i * size_increment)) != 0) {
      abort();
    }
  }
  runner((void *)(uintptr_t) start_size);
}
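
(For reference, one plausible way to build and run the tester above, assuming it is saved as rand-stress.c - the file name and argument values are only illustrative:

gcc -O2 -pthread rand-stress.c -o rand-stress -lcrypto
./rand-stress 8 16 4

The three arguments are the number of threads, the starting buffer size, and the per-thread size increment.)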

Comment 8 Florian Weimer 2015-06-02 11:37:16 UTC
(In reply to Florian Weimer from comment #7)
> (In reply to John Sullivan from comment #0)
> > Still present (I believe) in RH6/RH7/FC20. (Raising against Fedora because I
> > currently use FC20 as my main desktop OS, though others will consider RH the
> > more important packages.)
> 
> I'm trying to reproduce this with the stress tester below.  I don't see the
> issue.  Is there something else I need to configure to expose it?

I have been able to reproduce it with openssl-1.0.1e-30.el6.ppc64 and openssl-1.0.1e-30.el6.x86_64.  Thanks.

Comment 9 Huzaifa S. Sidhpurwala 2015-06-03 04:03:14 UTC
This issue has been classified as a security flaw and has hence been assigned CVE-2015-3216.

Comment 10 Fedora Update System 2015-06-04 20:20:54 UTC
Package openssl-1.0.1e-43.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing openssl-1.0.1e-43.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-9371/openssl-1.0.1e-43.fc20
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2015-06-18 13:24:02 UTC
Package openssl-1.0.1e-44.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing openssl-1.0.1e-44.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-9371/openssl-1.0.1e-44.fc20
then log in and leave karma (feedback).

Comment 12 Fedora Update System 2015-06-24 16:00:04 UTC
Package openssl-1.0.1e-45.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing openssl-1.0.1e-45.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-9371/openssl-1.0.1e-45.fc20
then log in and leave karma (feedback).

Comment 13 Fedora End Of Life 2015-06-30 01:38:55 UTC
Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

