1727912 – Weird valgrind-openssl interaction

Summary: Weird valgrind-openssl interaction

Keywords:
Status:	CLOSED DUPLICATE of bug 1717438
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	valgrind
Sub Component:
Version:	8.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	8.0
Assignee:	Mark Wielaard
QA Contact:	qe-baseos-tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-08 13:50 UTC by Tomas Mraz
Modified:	2021-09-17 14:44 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-08 14:12:44 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Reproducer code (340 bytes, text/x-csrc) 2019-07-08 13:50 UTC, Tomas Mraz	no flags	Details

Description Tomas Mraz 2019-07-08 13:50:16 UTC

Created attachment 1588375
Reproducer code

The attached reproducer shows an invalid read when it is run under valgrind and terminated by a signal (e.g. SIGINT).

make reproducer LDFLAGS='-lcrypto'
cc   -lcrypto  reproducer.c   -o reproducer
[root@ci-vm-10-0-137-203 bz1226209-segfault-in-ssleay-rand-bytes-due-to-locking]# valgrind ./reproducer
==9542== Memcheck, a memory error detector
==9542== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9542== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==9542== Command: ./reproducer
==9542== 
^C==9542== 
==9542== Process terminating with default action of signal 2 (SIGINT)
==9542==    at 0x53E44E8: nanosleep (nanosleep.c:28)
==9542==    by 0x53E441D: sleep (sleep.c:55)
==9542==    by 0x4006E2: runner (in /mnt/tests/CoreOS/openssl/Regression/bz1226209-segfault-in-ssleay-rand-bytes-due-to-locking/reproducer)
==9542==    by 0x4006F8: main (in /mnt/tests/CoreOS/openssl/Regression/bz1226209-segfault-in-ssleay-rand-bytes-due-to-locking/reproducer)
==9542== Invalid read of size 8
==9542==    at 0x58F7539: check_free (dlerror.c:188)
==9542==    by 0x58F7A65: free_key_mem (dlerror.c:221)
==9542==    by 0x58F7A65: __dlerror_main_freeres (dlerror.c:239)
==9542==    by 0x5489029: __libc_freeres (in /usr/lib64/libc-2.28.so)
==9542==    by 0x4A2B71E: _vgnU_freeres (vg_preloaded.c:77)
==9542==    by 0x53E441D: sleep (sleep.c:55)
==9542==    by 0x4006E2: runner (in /mnt/tests/CoreOS/openssl/Regression/bz1226209-segfault-in-ssleay-rand-bytes-due-to-locking/reproducer)
==9542==    by 0x4006F8: main (in /mnt/tests/CoreOS/openssl/Regression/bz1226209-segfault-in-ssleay-rand-bytes-due-to-locking/reproducer)
==9542==  Address 0x5d21228 is 12 bytes after a block of size 12 alloc'd
==9542==    at 0x4C30EDB: malloc (vg_replace_malloc.c:309)
==9542==    by 0x4FC708C: CRYPTO_zalloc (mem.c:230)
==9542==    by 0x4FC0C45: ossl_init_get_thread_local (init.c:66)
==9542==    by 0x4FC0C45: ossl_init_get_thread_local (init.c:59)
==9542==    by 0x4FC0C45: ossl_init_thread_start (init.c:465)
==9542==    by 0x4FED449: RAND_DRBG_get0_public (drbg_lib.c:1123)
==9542==    by 0x4FED483: drbg_bytes (drbg_lib.c:968)
==9542==    by 0x4006D8: runner (in /mnt/tests/CoreOS/openssl/Regression/bz1226209-segfault-in-ssleay-rand-bytes-due-to-locking/reproducer)
==9542==    by 0x4006F8: main (in /mnt/tests/CoreOS/openssl/Regression/bz1226209-segfault-in-ssleay-rand-bytes-due-to-locking/reproducer)
==9542== 
==9542== 
==9542== HEAP SUMMARY:
==9542==     in use at exit: 15,170 bytes in 21 blocks
==9542==   total heap usage: 42 allocs, 21 frees, 54,830 bytes allocated
==9542== 
==9542== LEAK SUMMARY:
==9542==    definitely lost: 0 bytes in 0 blocks
==9542==    indirectly lost: 0 bytes in 0 blocks
==9542==      possibly lost: 0 bytes in 0 blocks
==9542==    still reachable: 15,170 bytes in 21 blocks
==9542==         suppressed: 0 bytes in 0 blocks
==9542== Rerun with --leak-check=full to see details of leaked memory
==9542== 
==9542== For lists of detected and suppressed errors, rerun with: -s
==9542== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

However, I do not think this is a bug in OpenSSL. Memory is deallocated in the atexit handler, which should correctly free the block allocated by the CRYPTO_zalloc call at 0x4FC708C (mem.c:230). The deallocation is performed by a pthread key destructor routine, which probably confuses valgrind.

Comment 1 Mark Wielaard 2019-07-08 13:57:42 UTC

I'll have a closer look, but this looks like a bug in glibc:
https://bugzilla.redhat.com/show_bug.cgi?id=1717438
__libc_freeres (under valgrind) triggers bad free in libdl if dlerror was not used

If so, adding a dlerror () call to the program might be a temporary workaround.
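The workaround suggested above could look roughly like the following. This is only a sketch: the helper name prime_dlerror is hypothetical, and the attached reproducer is not shown in this report, so where exactly the call belongs in it is an assumption. The only thing taken from the glibc bug title is that dlerror should have been used at least once before __libc_freeres runs.

```c
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the attached reproducer): call dlerror()
 * once so that it has been "used" before valgrind runs __libc_freeres.
 * dlerror() returns NULL when no dynamic-loader error is pending, so this
 * returns 1 in the normal case and 0 if an error message was pending.
 */
int prime_dlerror(void)
{
    return dlerror() == NULL;
}
```

Calling prime_dlerror() at the top of main(), before the first libcrypto call, would be one place to put it. On older glibc versions the program may additionally need to be linked with -ldl.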

Comment 2 Tomas Mraz 2019-07-08 14:05:04 UTC

Yes, adding a dlerror() call to the program fixes the issue. Should I mark it as a duplicate?

Comment 3 Florian Weimer 2019-07-08 14:08:59 UTC

Mark, does valgrind somehow detect whether the SIGINT arrives in an async-signal-safe context?  Calling __libc_freeres in such a context *will* result in bogus reports (and even crashes) because __libc_freeres calls free etc. and is therefore not async-signal-safe.

(This may not be related to this bug, but the backtrace made me think of this issue.)

Comment 4 Mark Wielaard 2019-07-08 14:12:44 UTC

(In reply to Tomas Mraz from comment #2)
> Yes, adding a dlerror() call to the program fixes the issue. Should I mark it
> as a duplicate?

Thanks for testing. Yes, let's mark this as a duplicate of glibc bug #1717438.

*** This bug has been marked as a duplicate of bug 1717438 ***

Comment 5 Mark Wielaard 2019-07-08 14:16:40 UTC

(In reply to Florian Weimer from comment #3)
> Mark, does valgrind somehow detect whether the SIGINT arrives in an
> async-signal-safe context?  Calling __libc_freeres in such a context *will*
> result in bogus reports (and even crashes) because __libc_freeres calls free
> etc. and is therefore not async-signal-safe.
> 
> (This may not be related to this bug, but the backtrace made me think of
> this issue.)

I don't think valgrind does. And that might indeed be the underlying bug for some different (upstream) issues:
https://bugs.kde.org/show_bug.cgi?id=409141
https://bugs.kde.org/show_bug.cgi?id=409367

Comment 6 Tomas Mraz 2019-07-08 14:25:55 UTC

Please note that all this in OpenSSL happens from an atexit() handler. It is not called directly from a signal handler. (I am not sure whether that makes any difference though.)

Comment 7 Mark Wielaard 2019-07-08 17:50:06 UTC

(In reply to Tomas Mraz from comment #6)
> Please note that all this in OpenSSL happens from an atexit() handler. It is
> not called directly from a signal handler. (I am not sure whether that makes
> any difference though.)

That doesn't really make a difference for this specific bug.
Given that adding dlerror () fixes it, this is obviously glibc bug #1717438.

In theory valgrind should not have any trouble running atexit() handlers, since those look like normal code execution as far as valgrind is concerned (the program hasn't actually exited yet). valgrind does do some more work after the process is actually exiting. If a signal comes in after that, it might in theory confuse valgrind and/or the code in the signal handler if it was actually run (it shouldn't be). But that isn't the issue in this case.

Comment 8 Tomas Mraz 2019-07-09 08:30:59 UTC

I meant: is calling free in an atexit handler, when the atexit handler is triggered as a consequence of the SIGTERM or SIGINT default action, safe in general or not?

Comment 9 Mark Wielaard 2019-07-09 11:06:24 UTC

(In reply to Tomas Mraz from comment #8)
> I meant: is calling free in an atexit handler, when the atexit handler is
> triggered as a consequence of the SIGTERM or SIGINT default action, safe in
> general or not?

As long as the atexit handler isn't called in the signal context itself, then yes.
But I am not sure an atexit handler is called when the process dies because of a signal.
The manual page implies it is not called.
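One way to check this behaviour outside valgrind is the small experiment below. It is a sketch under assumptions: the function names atexit_handler_ran and on_exit_report are made up for illustration, and the child process stands in for the reproducer rather than reproducing it. The child registers an atexit handler that writes one byte into a pipe, then ends either through exit(0) or through the default action of SIGINT; the parent reports whether the byte arrived.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;  /* write end of the pipe, set only in the child */

/* atexit handler: write one byte so the parent can see that it ran. */
static void on_exit_report(void)
{
    if (report_fd >= 0 && write(report_fd, "x", 1) != 1) {
        /* nothing useful can be done about a failed write here */
    }
}

/*
 * Hypothetical experiment: returns 1 if the child's atexit handler ran,
 * 0 if it did not, and -1 on setup failure.
 * use_signal == 0: the child ends with exit(0).
 * use_signal != 0: the child ends through the default action of SIGINT.
 */
int atexit_handler_ran(int use_signal)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                      /* child */
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(on_exit_report) != 0)
            _exit(2);
        if (use_signal) {
            signal(SIGINT, SIG_DFL);     /* make sure no handler is installed */
            raise(SIGINT);               /* default action: terminate */
            _exit(3);                    /* only reached if SIGINT is blocked */
        }
        exit(0);                         /* normal exit runs atexit handlers */
    }

    close(fds[1]);                       /* parent */
    char c;
    ssize_t n = read(fds[0], &c, 1);     /* 0 means EOF: nothing was written */
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return n == 1 ? 1 : 0;
}
```

Comparing atexit_handler_ran(0) with atexit_handler_ran(1) shows whether the handler runs in each case.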

Comment 10 Florian Weimer 2019-07-09 11:23:11 UTC

It depends on what the signal handler does.  If it calls exit (not _exit), then the handlers are run.

Of course, calling exit (as opposed to _exit) from an asynchronous signal handler is not safe.
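A common way to still get the atexit handlers run on SIGINT without calling exit in signal context is to let the handler only set a flag and to call exit from the normal program flow. The following is a minimal sketch of that pattern, not code from the reproducer; the names stop_flag, on_sigint, install_sigint_flag_handler and stop_requested are illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set from the signal handler, read from normal code. */
static volatile sig_atomic_t stop_flag = 0;

/* The handler only stores to a sig_atomic_t, which is async-signal-safe. */
static void on_sigint(int signo)
{
    (void)signo;
    stop_flag = 1;
}

/* Installs the handler for SIGINT; returns 0 on success, -1 on failure. */
int install_sigint_flag_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, NULL);
}

/* Polled from the main loop; nonzero once SIGINT has been delivered. */
int stop_requested(void)
{
    return stop_flag != 0;
}
```

A main loop could then poll stop_requested() between units of work and call exit(EXIT_SUCCESS) once it returns nonzero, so that the atexit handlers (and any free calls in them) run outside the signal context.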

Comment 11 Tomas Mraz 2019-07-09 11:23:45 UTC

Ah, you're right. I'm sorry for all the confusion. There is no signal handler registered, so there is nothing that libcrypto would execute in the signal context itself.
