Bug 1952836

Summary:	Regression in RAND_bytes() due to locking
Product:	Red Hat Enterprise Linux 9	Reporter:	Alicja Kario <hkario>
Component:	valgrind	Assignee:	Mark Wielaard <mjw>
valgrind sub component:	system-version	QA Contact:	Jesus Checa <jchecahi>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	low	CC:	dbelyavs, fche, fweimer, jakub, mjw, ohudlick, sahana
Version:	9.0	Keywords:	Bugfix, Triaged
Target Milestone:	beta
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	valgrind-3.18.1-1.el9	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-05-17 12:48:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Alicja Kario 2021-04-23 10:26:19 UTC

Description of problem:
Application that calls RAND_bytes() over and over crashes with openssl 3.0.0

Version-Release number of selected component (if applicable):
openssl-3.0.0-0.alpha13.1.el9.x86_64

How reproducible:
most of the time

Steps to Reproduce:
1. run reproducer from bug 1226209

Actual results:
==40067== Memcheck, a memory error detector
==40067== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==40067== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==40067== Command: ./reproducer 40 32 0
==40067== 
==40067== 
==40067== Process terminating with default action of signal 15 (SIGTERM)
==40067==    at 0x4C739E9: __futex_abstimed_wait_common64 (in /usr/lib64/libpthread-2.33.so)
==40067==    by 0x4C6C0AE: pthread_rwlock_wrlock (in /usr/lib64/libpthread-2.33.so)
==40067==    by 0x49FF9FC: CRYPTO_THREAD_write_lock (threads_pthread.c:94)
==40067==    by 0x4ABE868: ossl_drbg_lock_parent (drbg.c:69)
==40067==    by 0x4AC39E5: get_parent_reseed_count (drbg.c:116)
==40067==    by 0x4AC3FF2: ossl_prov_drbg_generate.part.0 (drbg.c:669)
==40067==    by 0x49E1C47: evp_rand_generate_locked (evp_rand.c:545)
==40067==    by 0x49E1D5F: EVP_RAND_generate (evp_rand.c:567)
==40067==    by 0x4A2C168: RAND_bytes_ex (rand_lib.c:349)
==40067==    by 0x4012DE: runner (in /tmp/tmp.TH3Y0SipCN/reproducer)
==40067==    by 0x40138F: main (in /tmp/tmp.TH3Y0SipCN/reproducer)
==40067== 
==40067== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==40067==  General Protection Fault
==40067==    at 0x4C6EC72: __pthread_once_slow (in /usr/lib64/libpthread-2.33.so)
==40067==    by 0x4DAF56E: __rpc_thread_variables.part.0 (in /usr/lib64/libc-2.33.so)
==40067==    by 0x4DF1A4C: free_mem (in /usr/lib64/libc-2.33.so)
==40067==    by 0x4DF16A1: __libc_freeres (in /usr/lib64/libc-2.33.so)
==40067==    by 0x48351E7: _vgnU_freeres (vg_preloaded.c:74)
==40067==    by 0x104020D722BC5AD7: ???
==40067==    by 0x632C1283BF8D86B4: ???
==40067==    by 0xCE4A201EC93C07D1: ???
==40067==    by 0xE5AC22417815E1CC: ???
==40067==    by 0xB2430986DB8012F7: ???
==40067==    by 0x503412B: ???
==40067== 
==40067== HEAP SUMMARY:
==40067==     in use at exit: 669,538 bytes in 4,215 blocks
==40067==   total heap usage: 7,390 allocs, 3,175 frees, 1,484,772 bytes allocated
==40067== 
==40067== LEAK SUMMARY:
==40067==    definitely lost: 0 bytes in 0 blocks
==40067==    indirectly lost: 0 bytes in 0 blocks
==40067==      possibly lost: 10,608 bytes in 39 blocks
==40067==    still reachable: 658,930 bytes in 4,176 blocks
==40067==         suppressed: 0 bytes in 0 blocks
==40067== Rerun with --leak-check=full to see details of leaked memory
==40067== 
==40067== For lists of detected and suppressed errors, rerun with: -s
==40067== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Expected results:
No crash

Additional info:

Comment 9 Florian Weimer 2021-09-17 13:46:02 UTC

Could the issue be that __libc_freeres is not async-signal-safe?

Comment 11 Mark Wielaard 2021-09-17 14:15:20 UTC

(In reply to Florian Weimer from comment #9)
> Could the issue be that __libc_freeres is not async-signal-safe?

You mean that if control-C is hit while __libc_freeres is running (on the last terminating thread) that would explain the issue?
Is the program stuck in __libc_freeres when it is being terminated?
Does this mean, it is only an issue when running this program under valgrind?

Comment 13 Dmitry Belyavskiy 2021-09-17 14:22:47 UTC

(In reply to Mark Wielaard from comment #11)
> (In reply to Florian Weimer from comment #9)
> > Could the issue be that __libc_freeres is not async-signal-safe?
> 
> You mean that if control-C is hit while __libc_freeres is running (on the
> last terminating thread) that would explain the issue?
> Is the program stuck in __libc_freeres when it is being terminated?
> Does this mean, it is only an issue when running this program under valgrind?

Yes, I always got this error running the program under valgrind and didn't get any issue approx. 10 times running without valgrind. Our tests also don't show any problems without valgrind.

Comment 14 Florian Weimer 2021-09-17 14:24:11 UTC

(In reply to Mark Wielaard from comment #11)
> (In reply to Florian Weimer from comment #9)
> > Could the issue be that __libc_freeres is not async-signal-safe?
> 
> You mean that if control-C is hit while __libc_freeres is running (on the
> last terminating thread) that would explain the issue?
> Is the program stuck in __libc_freeres when it is being terminated?
> Does this mean, it is only an issue when running this program under valgrind?

No, SIGINT interrupts pthread_rwlock_wrlock, and valgrind invokes __libc_freeres essentially from a signal handler context to clean up the process. That's how I interpret the backtraces in comment 8.

Comment 17 Mark Wielaard 2021-09-17 14:33:23 UTC

(In reply to Florian Weimer from comment #14)
> (In reply to Mark Wielaard from comment #11)
> > (In reply to Florian Weimer from comment #9)
> > > Could the issue be that __libc_freeres is not async-signal-safe?
> > 
> > You mean that if control-C is hit while __libc_freeres is running (on the
> > last terminating thread) that would explain the issue?
> > Is the program stuck in __libc_freeres when it is being terminated?
> > Does this mean, it is only an issue when running this program under valgrind?
> 
> No, SIGINT interrupts pthread_rwlock_wrlock, and valgrind invokes
> __libc_freeres essentially from a signal handler context to clean up the
> process. That's how I interpret the backtraces in comment 8.

OK, so the question is if this SIGINT is part of the issue or if it is just some after effect because the test failed some other way (doesn't want to terminate/has a deadlock?) and it gets explicitly killed. I am trying to figure out if we are just debugging the valgrind termination/cleanup sequence after a real issue was found, or if this SIGINT is part of the original observed buggy (?) behavior.

Comment 19 Dmitry Belyavskiy 2021-09-17 14:37:27 UTC

(In reply to Mark Wielaard from comment #17)
> (In reply to Florian Weimer from comment #14)
> > (In reply to Mark Wielaard from comment #11)
> > > (In reply to Florian Weimer from comment #9)
> > > > Could the issue be that __libc_freeres is not async-signal-safe?
> > > 
> > > You mean that if control-C is hit while __libc_freeres is running (on the
> > > last terminating thread) that would explain the issue?
> > > Is the program stuck in __libc_freeres when it is being terminated?
> > > Does this mean, it is only an issue when running this program under valgrind?
> > 
> > No, SIGINT interrupts pthread_rwlock_wrlock, and valgrind invokes
> > __libc_freeres essentially from a signal handler context to clean up the
> > process. That's how I interpret the backtraces in comment 8.
> 
> OK, so the question is if this SIGINT is part of the issue or if it is just
> some after effect because the test failed some other way (doesn't want to
> terminate/has a deadlock?) and it gets explicitly killed. I am trying to
> figure out if we are just debugging the valgrind termination/cleanup
> sequence after a real issue was found, or if this SIGINT is part of the
> original observed buggy (?) behavior.

SIGINT looks like a part of issue to me. The program is not terminated itself.

Comment 20 Frank Ch. Eigler 2021-09-27 20:48:51 UTC

Can you rerun with

% valgrind --run-libc-freeres=no [...]

It may well be that __libc_freeres is just not safe to invoke from this context,
and valgrind should ideally detect the situation and not even try it.  But the
above flag should let us know either way.

(I am not able to find a workable reproducer among the bug reference tree.)

Comment 21 Frank Ch. Eigler 2021-09-27 22:10:23 UTC

OK, had a chance to try the comment #8 reproducer here.  The SEGV is intermittent with SIGINT in valgrind's default mode, but does not appear to happen with --run-libc-freeres=no at all.  I don't see any explicit signal handling code in the reproducer.c, nor do I see anything special in an strace.  That suggests that valgrind may be improvising the SIGINT signal handler rather than emulating a signal event and dispatching to the program's SIGINT handler.  If so, if valgrind is interjecting a __libc_freeres() call from random program state, then this does appear to be a bug.

The simplest solution could be to have valgrind flip the default from =yes to =no.
Second simplest could be to check if there are any glibc blocking operations in progress in any traced thread, and if so, not do the __libc_freeres().

Mark, should we file this upstream also?

Comment 23 Mark Wielaard 2021-09-29 12:21:25 UTC

(In reply to Frank Ch. Eigler from comment #21)
> OK, had a chance to try the comment #8 reproducer here.

Note that comment #8 and the reproducer mentioned there are private.

> The SEGV is
> intermittent with SIGINT in valgrind's default mode, but does not appear to
> happen with --run-libc-freeres=no at all.

Which valgrind, glibc and openssl version was this against?
It looks like this hasn't been an issue in the past and I am unclear if it is caused by an valgrind, glibc or openssl upgrade.

>  I don't see any explicit signal
> handling code in the reproducer.c, nor do I see anything special in an
> strace.  That suggests that valgrind may be improvising the SIGINT signal
> handler rather than emulating a signal event and dispatching to the
> program's SIGINT handler.

If the program doesn't handle SIGINT itself valgrind will catch it and defaults to terminate the program.
Note that you can use --trace-signals=yes to see a bit more details how valgrind handles signals.

>  If so, if valgrind is interjecting a
> __libc_freeres() call from random program state, then this does appear to be
> a bug.

Yes, that would be a bug. __libc_freeres () is supposed to only run on the last terminating thread after every other program action has terminated (valgrind has its own thread scheduler so should know when all other threads are done). You should be able to follow the valgrind thread scheduling with --trace-sched=yes.

> The simplest solution could be to have valgrind flip the default from =yes
> to =no.

At least for this reproducer this seems like a good workaround.
But in general this would produce leaks in programs because glibc by default doesn't clean up at program termination.

> Second simplest could be to check if there are any glibc blocking operations
> in progress in any traced thread, and if so, not do the __libc_freeres().

Yes, we should figure out what those blocking operations are and make sure the valgrind scheduler knows about them before switching to the program death thread.

> Mark, should we file this upstream also?

Yes, if we can figure out a public reproducer.

Comment 24 Mark Wielaard 2021-09-29 12:28:08 UTC

(In reply to Mark Wielaard from comment #23)
> > Second simplest could be to check if there are any glibc blocking operations
> > in progress in any traced thread, and if so, not do the __libc_freeres().
> 
> Yes, we should figure out what those blocking operations are and make sure
> the valgrind scheduler knows about them before switching to the program
> death thread.

Or maybe the best thing might be to distinguish between normal program termination and abnormal program termination. Where abnormal program termination is when the program is killed by an signal that it didn't catch itself. When the program termination is abnormal valgrind shouldn't invoke __libc_freeres because it could be in any random state.

Comment 25 Alicja Kario 2021-09-29 12:54:01 UTC

(In reply to Mark Wielaard from comment #23)
> (In reply to Frank Ch. Eigler from comment #21)
> > OK, had a chance to try the comment #8 reproducer here.
> 
> Note that comment #8 and the reproducer mentioned there are private.
> 
> > The SEGV is
> > intermittent with SIGINT in valgrind's default mode, but does not appear to
> > happen with --run-libc-freeres=no at all.
> 
> Which valgrind, glibc and openssl version was this against?
> It looks like this hasn't been an issue in the past and I am unclear if it
> is caused by an valgrind, glibc or openssl upgrade.

valgrind-1:3.17.0-12.el9.x86_64
openssl-3.0.0-1.el9.x86_64
glibc-2.34-2.el9.x86_64

That being said, I've also noticed that it happens on Fedora 34 but not on Fedora 33.

That's with:
openssl-1.1.1l-2.fc34.x86_64
valgrind-3.17.0-1.fc34.x86_64
glibc-2.33-16.fc34.x86_64

and with:
openssl-1.1.1l-1.fc33.x86_64
valgrind-3.16.1-18.fc33.x86_64
glibc-2.32-4.fc33.x86_64


> > Mark, should we file this upstream also?
>
> Yes, if we can figure out a public reproducer.

The reproducer is public: https://bugzilla.redhat.com/show_bug.cgi?id=1225994#c7

Comment 26 Frank Ch. Eigler 2021-09-29 14:01:09 UTC

% valgrind --trace-sched=yes --trace-signals=yes ~/reproducer 40 32 0
[wait awhile]
[hit ^C]
--3158864--   SCHED[12]: releasing lock (VG_(vg_yield)) -> VgTs_Yielding
--3158864--   SCHED[12]:  acquired lock (VG_(vg_yield))
--3158864--   SCHED[12]: releasing lock (VG_(vg_yield)) -> VgTs_Yielding
--3158864--   SCHED[12]:  acquired lock (VG_(vg_yield))
--3158864--   SCHED[12]: releasing lock (VG_(vg_yield)) -> VgTs_Yielding
--3158864--   SCHED[12]:  acquired lock (VG_(vg_yield))
--3158864--   SCHED[12]: releasing lock (VG_(vg_yield)) -> VgTs_Yielding
--3158864--   SCHED[11]:  acquired lock (thread_wrapper(starting new thread))
--3158864--   SCHED[11]: entering VG_(scheduler)
--3158864--   SCHED[11]: exiting VG_(scheduler)
--3158864--   SCHED[11]: release lock in VG_(exit_thread)
--3158864--   SCHED[12]:  acquired lock (VG_(vg_yield))
--3158864--   SCHED[12]: exiting VG_(scheduler)
--3158864-- Caught __NR_exit; running __gnu_cxx::__freeres and __libc_freeres() wrapper
--3158864-- sys_sigaction: sigNo 11, new 0x100fdfcea0, old 0x0, new flags 0x0
--3158864-- sys_sigaction: sigNo 7, new 0x100fdfcea0, old 0x0, new flags 0x0
--3158864-- sys_sigaction: sigNo 4, new 0x100fdfcea0, old 0x0, new flags 0x0
--3158864-- sys_sigaction: sigNo 8, new 0x100fdfcea0, old 0x0, new flags 0x0
--3158864--   SCHED[12]: entering VG_(scheduler)
--3158864-- delivering signal 11 (SIGSEGV):128 to thread 12
--3158864-- delivering 11 (code 128) to default handler; action: terminate+core
==3158864== 
==3158864== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3158864==  General Protection Fault
==3158864==    at 0x4B6CC82: __pthread_once_slow (pthread_once.c:114)
==3158864==    by 0x4CAD03E: __rpc_thread_variables.part.0 (rpc_thread.c:59)
==3158864==    by 0x4CEF61C: free_mem (in /usr/lib64/libc-2.33.so)
==3158864==    by 0x4CEF271: __libc_freeres (in /usr/lib64/libc-2.33.so)
==3158864==    by 0x48351E7: _vgnU_freeres (vg_preloaded.c:74)
==3158864==    by 0x100000002F: ???
==3158864==    by 0xE0B00AF623FA69FF: ???
==3158864==    by 0xF: ???
==3158864==    by 0x49D7620: UnknownInlinedFun (drbg_ctr.c:94)
==3158864==    by 0x49D7620: ctr_BCC_update.part.0 (drbg_ctr.c:139)
==3158864==    by 0x49D79FF: UnknownInlinedFun (drbg_ctr.c:119)
==3158864==    by 0x49D79FF: UnknownInlinedFun (drbg_ctr.c:192)
==3158864==    by 0x49D79FF: ctr_update (drbg_ctr.c:254)
==3158864==    by 0x49D7EBD: drbg_ctr_generate (drbg_ctr.c:329)
==3158864==    by 0x49D89E8: RAND_DRBG_generate (drbg_lib.c:642)
--3158864--   SCHED[12]: exiting VG_(scheduler)
==3158864== 
==3158864== HEAP SUMMARY:
==3158864==     in use at exit: 6,908 bytes in 38 blocks
==3158864==   total heap usage: 52 allocs, 14 frees, 7,844 bytes allocated
==3158864== 
==3158864== LEAK SUMMARY:
==3158864==    definitely lost: 0 bytes in 0 blocks
==3158864==    indirectly lost: 0 bytes in 0 blocks
==3158864==      possibly lost: 2,992 bytes in 11 blocks
==3158864==    still reachable: 3,916 bytes in 27 blocks
==3158864==         suppressed: 0 bytes in 0 blocks
==3158864== Rerun with --leak-check=full to see details of leaked memory
==3158864== 
==3158864== For lists of detected and suppressed errors, rerun with: -s
==3158864== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
[1]    3158864 segmentation fault (core dumped)  valgrind --trace-sched=yes --trace-signals=yes ~/reproducer 40 32 0

Comment 27 Mark Wielaard 2021-10-11 16:38:28 UTC

The upstream bug has a proposed patch that seems to resolve this issue:
https://bugs.kde.org/show_bug.cgi?id=443605

Comment 28 Mark Wielaard 2021-10-18 13:37:07 UTC

valgrind 3.18.1 has been released with the proposed patch included

Comment 35 errata-xmlrpc 2022-05-17 12:48:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: valgrind), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2401