Bug 1952836
| Summary: | Regression in RAND_bytes() due to locking | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Alicja Kario <hkario> |
| Component: | valgrind | Assignee: | Mark Wielaard <mjw> |
| valgrind sub component: | system-version | QA Contact: | Jesus Checa <jchecahi> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | dbelyavs, fche, fweimer, jakub, mjw, ohudlick, sahana |
| Version: | 9.0 | Keywords: | Bugfix, Triaged |
| Target Milestone: | beta | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | valgrind-3.18.1-1.el9 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-17 12:48:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Alicja Kario, 2021-04-23 10:26:19 UTC)
Could the issue be that __libc_freeres is not async-signal-safe?

(In reply to Florian Weimer from comment #9)
> Could the issue be that __libc_freeres is not async-signal-safe?

You mean that if control-C is hit while __libc_freeres is running (on the last terminating thread) that would explain the issue? Is the program stuck in __libc_freeres when it is being terminated? Does this mean it is only an issue when running this program under valgrind?

(In reply to Mark Wielaard from comment #11)
> (In reply to Florian Weimer from comment #9)
> > Could the issue be that __libc_freeres is not async-signal-safe?
>
> You mean that if control-C is hit while __libc_freeres is running (on the
> last terminating thread) that would explain the issue?
> Is the program stuck in __libc_freeres when it is being terminated?
> Does this mean it is only an issue when running this program under valgrind?

Yes, I always got this error running the program under valgrind and didn't get any issue in roughly 10 runs without valgrind. Our tests also don't show any problems without valgrind.

(In reply to Mark Wielaard from comment #11)
> (In reply to Florian Weimer from comment #9)
> > Could the issue be that __libc_freeres is not async-signal-safe?
>
> You mean that if control-C is hit while __libc_freeres is running (on the
> last terminating thread) that would explain the issue?
> Is the program stuck in __libc_freeres when it is being terminated?
> Does this mean it is only an issue when running this program under valgrind?

No, SIGINT interrupts pthread_rwlock_wrlock, and valgrind invokes __libc_freeres essentially from a signal handler context to clean up the process. That's how I interpret the backtraces in comment 8.

(In reply to Florian Weimer from comment #14)
> (In reply to Mark Wielaard from comment #11)
> > (In reply to Florian Weimer from comment #9)
> > > Could the issue be that __libc_freeres is not async-signal-safe?
> >
> > You mean that if control-C is hit while __libc_freeres is running (on the
> > last terminating thread) that would explain the issue?
> > Is the program stuck in __libc_freeres when it is being terminated?
> > Does this mean it is only an issue when running this program under valgrind?
>
> No, SIGINT interrupts pthread_rwlock_wrlock, and valgrind invokes
> __libc_freeres essentially from a signal handler context to clean up the
> process. That's how I interpret the backtraces in comment 8.

OK, so the question is if this SIGINT is part of the issue or if it is just some after-effect because the test failed some other way (doesn't want to terminate/has a deadlock?) and it gets explicitly killed. I am trying to figure out if we are just debugging the valgrind termination/cleanup sequence after a real issue was found, or if this SIGINT is part of the originally observed buggy (?) behavior.

(In reply to Mark Wielaard from comment #17)
> (In reply to Florian Weimer from comment #14)
> > (In reply to Mark Wielaard from comment #11)
> > > (In reply to Florian Weimer from comment #9)
> > > > Could the issue be that __libc_freeres is not async-signal-safe?
> > >
> > > You mean that if control-C is hit while __libc_freeres is running (on the
> > > last terminating thread) that would explain the issue?
> > > Is the program stuck in __libc_freeres when it is being terminated?
> > > Does this mean it is only an issue when running this program under valgrind?
> >
> > No, SIGINT interrupts pthread_rwlock_wrlock, and valgrind invokes
> > __libc_freeres essentially from a signal handler context to clean up the
> > process. That's how I interpret the backtraces in comment 8.
>
> OK, so the question is if this SIGINT is part of the issue or if it is just
> some after-effect because the test failed some other way (doesn't want to
> terminate/has a deadlock?) and it gets explicitly killed.
> I am trying to
> figure out if we are just debugging the valgrind termination/cleanup
> sequence after a real issue was found, or if this SIGINT is part of the
> originally observed buggy (?) behavior.

SIGINT looks like part of the issue to me. The program does not terminate on its own.

Can you rerun with

  % valgrind --run-libc-freeres=no [...]

It may well be that __libc_freeres is just not safe to invoke from this context, and valgrind should ideally detect the situation and not even try it. But the above flag should let us know either way. (I am not able to find a workable reproducer among the bug reference tree.)

OK, had a chance to try the comment #8 reproducer here. The SEGV is intermittent with SIGINT in valgrind's default mode, but does not appear to happen with --run-libc-freeres=no at all. I don't see any explicit signal handling code in the reproducer.c, nor do I see anything special in an strace. That suggests that valgrind may be improvising the SIGINT signal handler rather than emulating a signal event and dispatching to the program's SIGINT handler. If so, if valgrind is interjecting a __libc_freeres() call from random program state, then this does appear to be a bug.

The simplest solution could be to have valgrind flip the default from =yes to =no. Second simplest could be to check if there are any glibc blocking operations in progress in any traced thread, and if so, not do the __libc_freeres().

Mark, should we file this upstream also?

(In reply to Frank Ch. Eigler from comment #21)
> OK, had a chance to try the comment #8 reproducer here.

Note that comment #8 and the reproducer mentioned there are private.

> The SEGV is
> intermittent with SIGINT in valgrind's default mode, but does not appear to
> happen with --run-libc-freeres=no at all.

Which valgrind, glibc and openssl version was this against? It looks like this hasn't been an issue in the past and I am unclear whether it is caused by a valgrind, glibc or openssl upgrade.
> I don't see any explicit signal
> handling code in the reproducer.c, nor do I see anything special in an
> strace. That suggests that valgrind may be improvising the SIGINT signal
> handler rather than emulating a signal event and dispatching to the
> program's SIGINT handler.

If the program doesn't handle SIGINT itself, valgrind will catch it and defaults to terminating the program. Note that you can use --trace-signals=yes to see more detail on how valgrind handles signals.

> If so, if valgrind is interjecting a
> __libc_freeres() call from random program state, then this does appear to be
> a bug.

Yes, that would be a bug. __libc_freeres() is supposed to run only on the last terminating thread, after every other program action has terminated (valgrind has its own thread scheduler, so it should know when all other threads are done). You should be able to follow the valgrind thread scheduling with --trace-sched=yes.

> The simplest solution could be to have valgrind flip the default from =yes
> to =no.

At least for this reproducer this seems like a good workaround. But in general this would produce leaks in programs, because glibc by default doesn't clean up at program termination.

> Second simplest could be to check if there are any glibc blocking operations
> in progress in any traced thread, and if so, not do the __libc_freeres().

Yes, we should figure out what those blocking operations are and make sure the valgrind scheduler knows about them before switching to the program death thread.

> Mark, should we file this upstream also?

Yes, if we can figure out a public reproducer.

(In reply to Mark Wielaard from comment #23)
> > Second simplest could be to check if there are any glibc blocking operations
> > in progress in any traced thread, and if so, not do the __libc_freeres().
>
> Yes, we should figure out what those blocking operations are and make sure
> the valgrind scheduler knows about them before switching to the program
> death thread.
Or maybe the best thing would be to distinguish between normal and abnormal program termination, where abnormal termination is when the program is killed by a signal that it didn't catch itself. When program termination is abnormal, valgrind shouldn't invoke __libc_freeres, because the program could be in any random state.

(In reply to Mark Wielaard from comment #23)
> (In reply to Frank Ch. Eigler from comment #21)
> > OK, had a chance to try the comment #8 reproducer here.
>
> Note that comment #8 and the reproducer mentioned there are private.
>
> > The SEGV is
> > intermittent with SIGINT in valgrind's default mode, but does not appear to
> > happen with --run-libc-freeres=no at all.
>
> Which valgrind, glibc and openssl version was this against?
> It looks like this hasn't been an issue in the past and I am unclear whether
> it is caused by a valgrind, glibc or openssl upgrade.

valgrind-1:3.17.0-12.el9.x86_64
openssl-3.0.0-1.el9.x86_64
glibc-2.34-2.el9.x86_64

That being said, I've also noticed that it happens on Fedora 34 but not on Fedora 33. That's with:

openssl-1.1.1l-2.fc34.x86_64
valgrind-3.17.0-1.fc34.x86_64
glibc-2.33-16.fc34.x86_64

and with:

openssl-1.1.1l-1.fc33.x86_64
valgrind-3.16.1-18.fc33.x86_64
glibc-2.32-4.fc33.x86_64

> > Mark, should we file this upstream also?
>
> Yes, if we can figure out a public reproducer.
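The termination-policy idea proposed above can be sketched as a small decision function. This is a hedged illustration with hypothetical names, not valgrind's actual internals or option plumbing; the only real element is the `--run-libc-freeres` option it models:

```c
#include <stdbool.h>

/* Hypothetical sketch of the proposed policy: run the __libc_freeres()
 * cleanup only on normal termination. When the process is killed by a
 * signal it did not catch, its threads may be parked inside glibc
 * (e.g. blocked in pthread_rwlock_wrlock), so cleanup code that walks
 * the same internal state can crash, as in the backtrace above. */
enum termination_kind {
    TERM_EXIT,            /* exit() or return from main */
    TERM_CAUGHT_SIGNAL,   /* fatal signal the program handled itself */
    TERM_UNCAUGHT_SIGNAL  /* e.g. SIGINT with the default action */
};

static bool should_run_libc_freeres(enum termination_kind kind,
                                    bool run_freeres_option /* --run-libc-freeres */)
{
    if (!run_freeres_option)
        return false;  /* user opted out explicitly */
    /* Abnormal termination: program state is arbitrary, skip cleanup. */
    return kind != TERM_UNCAUGHT_SIGNAL;
}
```

With this policy the reproducer's SIGINT case (an uncaught fatal signal) would skip `__libc_freeres`, while normal exits keep the leak-check-friendly cleanup.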
The reproducer is public: https://bugzilla.redhat.com/show_bug.cgi?id=1225994#c7

```
% valgrind --trace-sched=yes --trace-signals=yes ~/reproducer 40 32 0
[wait awhile]
[hit ^C]
--3158864-- SCHED[12]: releasing lock (VG_(vg_yield)) -> VgTs_Yielding
--3158864-- SCHED[12]: acquired lock (VG_(vg_yield))
--3158864-- SCHED[12]: releasing lock (VG_(vg_yield)) -> VgTs_Yielding
--3158864-- SCHED[12]: acquired lock (VG_(vg_yield))
--3158864-- SCHED[12]: releasing lock (VG_(vg_yield)) -> VgTs_Yielding
--3158864-- SCHED[12]: acquired lock (VG_(vg_yield))
--3158864-- SCHED[12]: releasing lock (VG_(vg_yield)) -> VgTs_Yielding
--3158864-- SCHED[11]: acquired lock (thread_wrapper(starting new thread))
--3158864-- SCHED[11]: entering VG_(scheduler)
--3158864-- SCHED[11]: exiting VG_(scheduler)
--3158864-- SCHED[11]: release lock in VG_(exit_thread)
--3158864-- SCHED[12]: acquired lock (VG_(vg_yield))
--3158864-- SCHED[12]: exiting VG_(scheduler)
--3158864-- Caught __NR_exit; running __gnu_cxx::__freeres and __libc_freeres() wrapper
--3158864-- sys_sigaction: sigNo 11, new 0x100fdfcea0, old 0x0, new flags 0x0
--3158864-- sys_sigaction: sigNo 7, new 0x100fdfcea0, old 0x0, new flags 0x0
--3158864-- sys_sigaction: sigNo 4, new 0x100fdfcea0, old 0x0, new flags 0x0
--3158864-- sys_sigaction: sigNo 8, new 0x100fdfcea0, old 0x0, new flags 0x0
--3158864-- SCHED[12]: entering VG_(scheduler)
--3158864-- delivering signal 11 (SIGSEGV):128 to thread 12
--3158864-- delivering 11 (code 128) to default handler; action: terminate+core
==3158864==
==3158864== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3158864==  General Protection Fault
==3158864==    at 0x4B6CC82: __pthread_once_slow (pthread_once.c:114)
==3158864==    by 0x4CAD03E: __rpc_thread_variables.part.0 (rpc_thread.c:59)
==3158864==    by 0x4CEF61C: free_mem (in /usr/lib64/libc-2.33.so)
==3158864==    by 0x4CEF271: __libc_freeres (in /usr/lib64/libc-2.33.so)
==3158864==    by 0x48351E7: _vgnU_freeres (vg_preloaded.c:74)
==3158864==    by 0x100000002F: ???
==3158864==    by 0xE0B00AF623FA69FF: ???
==3158864==    by 0xF: ???
==3158864==    by 0x49D7620: UnknownInlinedFun (drbg_ctr.c:94)
==3158864==    by 0x49D7620: ctr_BCC_update.part.0 (drbg_ctr.c:139)
==3158864==    by 0x49D79FF: UnknownInlinedFun (drbg_ctr.c:119)
==3158864==    by 0x49D79FF: UnknownInlinedFun (drbg_ctr.c:192)
==3158864==    by 0x49D79FF: ctr_update (drbg_ctr.c:254)
==3158864==    by 0x49D7EBD: drbg_ctr_generate (drbg_ctr.c:329)
==3158864==    by 0x49D89E8: RAND_DRBG_generate (drbg_lib.c:642)
--3158864-- SCHED[12]: exiting VG_(scheduler)
==3158864==
==3158864== HEAP SUMMARY:
==3158864==     in use at exit: 6,908 bytes in 38 blocks
==3158864==   total heap usage: 52 allocs, 14 frees, 7,844 bytes allocated
==3158864==
==3158864== LEAK SUMMARY:
==3158864==    definitely lost: 0 bytes in 0 blocks
==3158864==    indirectly lost: 0 bytes in 0 blocks
==3158864==      possibly lost: 2,992 bytes in 11 blocks
==3158864==    still reachable: 3,916 bytes in 27 blocks
==3158864==         suppressed: 0 bytes in 0 blocks
==3158864== Rerun with --leak-check=full to see details of leaked memory
==3158864==
==3158864== For lists of detected and suppressed errors, rerun with: -s
==3158864== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
[1]    3158864 segmentation fault (core dumped)  valgrind --trace-sched=yes --trace-signals=yes ~/reproducer 40 32 0
```

The upstream bug has a proposed patch that seems to resolve this issue: https://bugs.kde.org/show_bug.cgi?id=443605

valgrind 3.18.1 has been released with the proposed patch included.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: valgrind), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2401