Bug 2128412

Summary:	stunnel consumes high amount of memory when pestered with TCP connections without a TLS handshake
Product:	Red Hat Enterprise Linux 9	Reporter:	Sven Hoexter <sven>
Component:	openssl	Assignee:	Dmitry Belyavskiy <dbelyavs>
Status:	CLOSED ERRATA	QA Contact:	Alicja Kario <hkario>
Severity:	medium	Docs Contact:
Priority:	low
Version:	CentOS Stream	CC:	bstinson, cllang, dbelyavs, hkario, jwboyer, quentin, ruben, ssorce
Target Milestone:	rc	Keywords:	Triaged, ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openssl-3.0.7-1.el9	Doc Type:	Bug Fix
Doc Text:	Cause: A flag value conflict in the OpenSSL headers caused a memory leak in TLS services with the OpenSSL library if a TCP connection was opened and closed without a TLS handshake. Consequence: A small amount of memory leaked for every connection without a TLS handshake. Fix: Backport the fix for value conflict. Result: No memory leaks when TCP connections are closed without a TLS handshake.	Story Points:	---
Clone Of:
Clones:	2144008 2144009	Environment:
Last Closed:	2023-05-09 08:20:36 UTC	Type:	Bug
Bug Blocks:	2144008, 2144009

Description Sven Hoexter 2022-09-20 13:27:01 UTC

Description of problem:
While running stunnel on CentOS Stream 9 in a Google Cloud VM fronted by a TCP load balancer, we noticed high memory consumption and ultimately an OOM kill of stunnel.
Looking into the issue, it seems the TCP health check is leading to a memory leak. Switching to an SSL-based health check, which completes the TLS handshake, works around it.
We can reproduce the memory leak with nc.

Version-Release number of selected component (if applicable):
stunnel-5.62-2.el9.x86_64

How reproducible:
I can reproduce it with Vagrant + Virtualbox using
https://app.vagrantup.com/generic/boxes/centos9s

$ cat stunnel.conf
[foo]
accept = 2600
connect = 8000
cert = /etc/stunnel/cert/cert.crt
key = /etc/stunnel/cert/key.key

There is no service running on port 8000.
Key and cert were generated with minica, I can provide those if required. They are not secret and were just generated to reproduce the issue.

Steps to Reproduce:
1. Set up a CentOS 9 machine; dnf update; dnf install stunnel netcat
2. Install a very simple stunnel.conf as shown above; systemctl start stunnel
3. run:
while true; do nc -z localhost 2600; sleep 1; done
4. Check the memory consumption and watch it grow. According to what I can see in /proc/<pid>/smaps, one segment of the heap keeps growing. Without the sleep in the loop above, stunnel often crashes quite soon.
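
The reproduction loop above can also be sketched in Python, which makes it easy to record the resident set size after every probe. This is only a rough equivalent of the nc loop, not part of the original report: the function names (probe, rss_kib, run) are made up for illustration, and reading VmRSS from /proc/<pid>/status assumes Linux.

```python
import socket
import time


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Open a TCP connection and close it at once, sending no TLS data.

    Roughly what `nc -z` does: returns True if the connect succeeded.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def rss_kib(pid: int) -> int:
    """Return the VmRSS value (in KiB) from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError(f"no VmRSS entry for pid {pid}")


def run(pid: int, host: str = "localhost", port: int = 2600,
        count: int = 100, delay: float = 1.0) -> list[int]:
    """Probe `count` times and collect the RSS of `pid` after each probe."""
    samples = []
    for _ in range(count):
        probe(host, port)
        samples.append(rss_kib(pid))
        time.sleep(delay)
    return samples
```

Calling run() with the PID of the stunnel process returns one RSS sample per probe; on an affected system the samples would be expected to trend upwards.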

Actual results:
Growing memory consumption, sometimes even crashes.

Expected results:
Memory consumption should be stable and stunnel should not crash.

Additional info:
I rebuilt stunnel 5.66 from FC on CentOS 9 as well and disabled almost all patches, but that does not fix the issue. So the problem might be rooted somewhere else.
I also ran the same test case on Debian/11 with stunnel 5.60 from bullseye-backports and did not experience this issue.

Comment 1 Clemens Lang 2022-09-20 17:08:21 UTC

Thank you for the report.

Just to make sure: Do you have PKCS#11 configured in OpenSSL?

Comment 2 Sven Hoexter 2022-09-21 06:40:09 UTC

I'm not 100% sure what you mean by configured in this context.

Installed are:
openssl-pkcs11-0.4.11-7.el9.x86_64
openssl-libs-3.0.1-41.el9.x86_64
openssl-3.0.1-41.el9.x86_64

The pkcs11 engine in openssl is available
$ openssl engine pkcs11 -t
(pkcs11) pkcs11 engine
     [ available ]

That's true for the system we originally experienced the issue on, and the one I used to reproduce it.

Comment 3 Sven Hoexter 2022-09-21 08:03:30 UTC

When stunnel crashes I see the following in dmesg:
stunnel[8093] general protection fault ip:7fd0ab7aa0e0 sp:7fd0ab2db6c0 error:0 in libcrypto.so.3.0.1[7fd0ab5af000+257000]

Right now I fail to record a core dump with systemd-coredump. According to coredumpctl list, it notices the crash but does not write a core file:
Wed 2022-09-21 07:58:49 UTC 8092   0   0 SIGSEGV none     /usr/bin/stunnel  n/a
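
Without a core file, the dmesg line above still gives the faulting instruction pointer and the start and size of the libcrypto mapping it fell into. A small sketch of the arithmetic follows; the helper name fault_offset is made up for illustration. Note that the result is an offset relative to the mapping reported by the kernel, which is not necessarily the same as an offset into the library file.

```python
import re

DMESG_LINE = (
    "stunnel[8093] general protection fault ip:7fd0ab7aa0e0 "
    "sp:7fd0ab2db6c0 error:0 in libcrypto.so.3.0.1[7fd0ab5af000+257000]"
)

# ip:<hex> ... in <name>[<mapping start hex>+<mapping size hex>]
FAULT_RE = re.compile(
    r"ip:([0-9a-f]+).* in ([^\s\[]+)\[([0-9a-f]+)\+([0-9a-f]+)\]"
)


def fault_offset(line: str) -> tuple[str, int]:
    """Return (mapped object name, offset of ip from the mapping start)."""
    match = FAULT_RE.search(line)
    if match is None:
        raise ValueError("not a recognised fault line")
    ip = int(match.group(1), 16)
    start = int(match.group(3), 16)
    size = int(match.group(4), 16)
    offset = ip - start
    if not 0 <= offset < size:
        raise ValueError("ip lies outside the reported mapping")
    return match.group(2), offset


name, offset = fault_offset(DMESG_LINE)
print(name, hex(offset))  # libcrypto.so.3.0.1 0x1fb0e0
```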

Comment 4 Clemens Lang 2022-09-21 09:59:10 UTC

By configured I mean changes to /etc/pki/tls/openssl.cnf that would cause any user of OpenSSL to load a pkcs11 implementation, e.g.:

> openssl_conf = pkcs11_conf
>
> [...]
>
> [pkcs11_conf]
> engines = engine_section
> ssl_conf = ssl_module
>
> [engine_section]
> pkcs11 = pkcs11_section
>
> [pkcs11_section]
> engine_id = pkcs11
> MODULE_PATH = /usr/lib64/pkcs11/libsofthsm2.so
> init = 0

The reason I'm asking is that we have previously seen memory leaks for each request handled by stunnel when pkcs11 modules are used with OpenSSL. I'd just like to check whether this is potentially the same issue.

Comment 5 Sven Hoexter 2022-09-21 10:40:44 UTC

Ok, none of that is configured.
[vagrant@centos9s ~]$ grep pkcs11 /etc/pki/tls/openssl.cnf
[vagrant@centos9s ~]$

Comment 6 Sven Hoexter 2022-09-21 11:22:47 UTC

Not sure if it helps, but I got out of my comfort zone and installed gdb + debug headers. Along the way I removed the openssl-pkcs11 package to be sure. I attached gdb to stunnel and provoked a SIGSEGV; this is the backtrace.

Thread 2 "stunnel" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fbe07487640 (LWP 184215)]
0x00007fbe07956dc4 in bn_sqr8x_internal () at crypto/bn/x86_64-mont5.s:1746
1746            leaq    (%rdi,%r9,1),%rdi
(gdb) bt
#0  0x00007fbe07956dc4 in bn_sqr8x_internal () at crypto/bn/x86_64-mont5.s:1746
#1  0x00007fbe079550c5 in bn_sqr8x_mont () at crypto/bn/x86_64-mont.s:795
#2  0x00007fbe07793b7b in bn_mul_mont_fixed_top (r=r@entry=0x7fbe00003298, a=a@entry=0x7fbe00003298, b=b@entry=0x7fbe00003298, 
    mont=mont@entry=0x7fbe00007ae0, ctx=ctx@entry=0x7fbe00001fb0) at crypto/bn/bn_mont.c:48
#3  0x00007fbe07799f4b in BN_mod_exp_mont (rr=rr@entry=0x7fbe00003250, a=a@entry=0x7fbe00003268, p=p@entry=0x7fbe00003238, 
    m=m@entry=0x7fbe00001dd0, ctx=ctx@entry=0x7fbe00001fb0, in_mont=in_mont@entry=0x7fbe00007ae0) at crypto/bn/bn_exp.c:427
#4  0x00007fbe0779bf0e in ossl_bn_miller_rabin_is_prime (w=w@entry=0x7fbe00001dd0, iterations=<optimized out>, iterations@entry=1, 
    ctx=ctx@entry=0x7fbe00001fb0, cb=cb@entry=0x7fbe00001340, enhanced=enhanced@entry=0, status=status@entry=0x7fbe07486c14)
    at crypto/bn/bn_prime.c:405
#5  0x00007fbe0779c2d2 in ossl_bn_miller_rabin_is_prime (status=0x7fbe07486c14, enhanced=0, cb=0x7fbe00001340, ctx=0x7fbe00001fb0, 
    iterations=1, w=0x7fbe00001dd0) at crypto/bn/bn_prime.c:345
#6  bn_is_prime_int (cb=0x7fbe00001340, do_trial_division=0, ctx=0x7fbe00001fb0, checks=1, w=0x7fbe00001dd0) at crypto/bn/bn_prime.c:311
#7  bn_is_prime_int (w=w@entry=0x7fbe00001dd0, checks=checks@entry=1, ctx=ctx@entry=0x7fbe00001fb0, 
    do_trial_division=do_trial_division@entry=0, cb=cb@entry=0x7fbe00001340) at crypto/bn/bn_prime.c:266
#8  0x00007fbe0779cac5 in BN_generate_prime_ex2 (ret=ret@entry=0x7fbe00001dd0, bits=bits@entry=2048, safe=safe@entry=1, 
    add=add@entry=0x7fbe00001bd0, rem=rem@entry=0x7fbe00001be8, cb=cb@entry=0x7fbe00001340, ctx=<optimized out>)
    at crypto/bn/bn_prime.c:186
#9  0x00007fbe0779ce87 in BN_generate_prime_ex (ret=0x7fbe00001dd0, bits=2048, safe=1, add=0x7fbe00001bd0, rem=0x7fbe00001be8, 
    cb=0x7fbe00001340) at crypto/bn/bn_prime.c:222
#10 0x00007fbe077d50e6 in dh_builtin_genparams (cb=0x7fbe00001340, generator=2, prime_len=2048, ret=0x7fbe00001860)
    at crypto/dh/dh_gen.c:216
#11 DH_generate_parameters_ex (ret=0x7fbe00001860, prime_len=2048, generator=2, cb=0x7fbe00001340) at crypto/dh/dh_gen.c:124
#12 0x000055b676dc8c89 in cron_dh_param (bn_gencb=0x7fbe00001340) at /usr/src/debug/stunnel-5.62-2.el9.x86_64/src/cron.c:205
#13 cron_worker () at /usr/src/debug/stunnel-5.62-2.el9.x86_64/src/cron.c:158
#14 cron_thread (arg=<optimized out>) at /usr/src/debug/stunnel-5.62-2.el9.x86_64/src/cron.c:100
#15 0x00007fbe07543802 in start_thread (arg=<optimized out>) at pthread_create.c:443
#16 0x00007fbe074e3450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Actually I've done it twice and it always happened in bn_sqr8x_internal().

Comment 7 Sven Hoexter 2022-09-21 11:42:58 UTC

I believe the memory leak and the crash are two separate issues. Generating DH parameters and attaching them to the certificate, thus preventing stunnel from generating them itself, seems to prevent the crashes from happening. The memory leak stays. I also only see the crashes in the local Virtualbox based VM, and could not reproduce them inside a Google Cloud Compute Engine instance.

Comment 8 Clemens Lang 2022-09-21 12:20:42 UTC

I found the easiest way to identify where the memory leaks in stunnel is running valgrind with gdb stub support and periodically interrupting stunnel to trigger a valgrind leak scan:

1. Run valgrind with gdbserver support using valgrind --vgdb=yes --vgdb-error=0 --leak-check=full --num-callers=60 --track-origins=yes /usr/bin/stunnel stunnel.conf
2. Start gdb /usr/bin/stunnel
3. Attach gdb to the running valgrind instance using "target remote | vgdb"
4. Issue a gdb "continue" command until stunnel has successfully started up
5. Interrupt program execution using ^C, and issue a valgrind leak check using "monitor leak_check full possibleleak changed" (most of these leaks happen during startup and are not the ones we're after)
6. Issue a gdb "continue" command and send 100 netcat requests to stunnel in a separate shell
7. Repeat steps (5) and (6) to ignore leaks that happen when a particular stunnel server is first used
8. Repeat step (5); this now shows only leaks that happened in relation to the additional requests, i.e., this memory likely leaks for every request.

Comment 9 Sven Hoexter 2022-09-22 08:39:42 UTC

I had to use stunnel with "foreground = yes"; I am not sure how far that alters the behaviour.
Everything executed on a fresh google cloud compute instance.

(gdb) monitor leak_check full possibleleak changed
==194994== 314 (+120) bytes in 1 (+0) blocks are possibly lost in loss record 893 of 1,215
==194994==    at 0x48496AF: realloc (vg_replace_malloc.c:1437)
==194994==    by 0x115501: str_realloc_internal_debug.lto_priv.0 (str.c:340)
==194994==    by 0x4AFDFC1: sk_reserve (stack.c:210)
==194994==    by 0x4AFE262: OPENSSL_sk_insert (stack.c:254)
==194994==    by 0x4AB2223: UnknownInlinedFun (initthread.c:45)
==194994==    by 0x4AB2223: UnknownInlinedFun (initthread.c:164)
==194994==    by 0x4AB2223: UnknownInlinedFun (initthread.c:109)
==194994==    by 0x4AB2223: UnknownInlinedFun (initthread.c:93)
==194994==    by 0x4AB2223: ossl_init_thread_start (initthread.c:378)
==194994==    by 0x4A6B5B4: ossl_err_get_state_int (err.c:667)
==194994==    by 0x4A6125C: ERR_clear_error (err.c:319)
==194994==    by 0x48BC26C: state_machine.part.0 (statem.c:326)
==194994==    by 0x116AA4: ssl_start (client.c:580)
==194994==    by 0x11A7B2: UnknownInlinedFun (client.c:404)
==194994==    by 0x11A7B2: client_run (client.c:301)
==194994==    by 0x1230B0: client_thread (client.c:130)
==194994==    by 0x4DCA801: start_thread (pthread_create.c:443)
==194994==    by 0x4D6A313: clone (clone.S:100)
==194994== 
==194994== 5,440 (+2,720) bytes in 20 (+10) blocks are possibly lost in loss record 1,159 of 1,215
==194994==    at 0x4849464: calloc (vg_replace_malloc.c:1328)
==194994==    by 0x4016732: UnknownInlinedFun (rtld-malloc.h:44)
==194994==    by 0x4016732: allocate_dtv (dl-tls.c:375)
==194994==    by 0x4017151: _dl_allocate_tls (dl-tls.c:634)
==194994==    by 0x4DCB4C4: allocate_stack (allocatestack.c:429)
==194994==    by 0x4DCB4C4: pthread_create@@GLIBC_2.34 (pthread_create.c:648)
==194994==    by 0x12E55A: create_client.constprop.0 (sthreads.c:599)
==194994==    by 0x113EBF: UnknownInlinedFun (stunnel.c:447)
==194994==    by 0x113EBF: UnknownInlinedFun (stunnel.c:382)
==194994==    by 0x113EBF: UnknownInlinedFun (stunnel.c:356)
==194994==    by 0x113EBF: UnknownInlinedFun (ui_unix.c:114)
==194994==    by 0x113EBF: main (ui_unix.c:58)
==194994== 
==194994== 3,331,600 (+1,665,800) bytes in 200 (+100) blocks are definitely lost in loss record 1,215 of 1,215
==194994==    at 0x4849464: calloc (vg_replace_malloc.c:1328)
==194994==    by 0x1152F4: str_alloc_detached_debug (str.c:295)
==194994==    by 0x48A79BF: ssl3_setup_write_buffer (ssl3_buffer.c:119)
==194994==    by 0x48BD1FE: UnknownInlinedFun (ssl3_buffer.c:148)
==194994==    by 0x48BD1FE: UnknownInlinedFun (ssl3_buffer.c:142)
==194994==    by 0x48BD1FE: state_machine.part.0 (statem.c:402)
==194994==    by 0x116AA4: ssl_start (client.c:580)
==194994==    by 0x11A7B2: UnknownInlinedFun (client.c:404)
==194994==    by 0x11A7B2: client_run (client.c:301)
==194994==    by 0x1230B0: client_thread (client.c:130)
==194994==    by 0x4DCA801: start_thread (pthread_create.c:443)
==194994==    by 0x4D6A313: clone (clone.S:100)
==194994== 
==194994== LEAK SUMMARY:
==194994==    definitely lost: 3,331,600 (+1,665,800) bytes in 200 (+100) blocks
==194994==    indirectly lost: 0 (+0) bytes in 0 (+0) blocks
==194994==      possibly lost: 1,775,537 (+2,840) bytes in 7,647 (+10) blocks
==194994==    still reachable: 5,745 (+0) bytes in 23 (+0) blocks
==194994==         suppressed: 0 (+0) bytes in 0 (+0) blocks
==194994== Reachable blocks (those to which a pointer was found) are not shown.
==194994== To see them, add 'reachable any' args to leak_check
==194994==
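
The "definitely lost" record above can be turned into a per-connection figure. The sketch below parses that one line and divides the byte delta by the block delta. It assumes that 100 requests were sent between the two leak scans (as in step 6 of comment 8) and that each leaked block corresponds to one connection; parse_record is a made-up helper name.

```python
import re

RECORD = (
    "==194994== 3,331,600 (+1,665,800) bytes in 200 (+100) blocks "
    "are definitely lost in loss record 1,215 of 1,215"
)

# <total bytes> (+<delta bytes>) bytes in <total blocks> (+<delta blocks>) blocks
RECORD_RE = re.compile(
    r"([\d,]+) \(\+([\d,]+)\) bytes in ([\d,]+) \(\+([\d,]+)\) blocks"
)


def parse_record(line: str) -> tuple[int, int, int, int]:
    """Return (total bytes, delta bytes, total blocks, delta blocks)."""
    match = RECORD_RE.search(line)
    if match is None:
        raise ValueError("not a loss record line")
    total_bytes, delta_bytes, total_blocks, delta_blocks = (
        int(group.replace(",", "")) for group in match.groups()
    )
    return total_bytes, delta_bytes, total_blocks, delta_blocks


total_bytes, delta_bytes, total_blocks, delta_blocks = parse_record(RECORD)
if delta_blocks == 0:
    raise SystemExit("no new blocks since the previous scan")
bytes_per_block = delta_bytes // delta_blocks
print(bytes_per_block)  # 16658
```

Under those assumptions, each connection without a TLS handshake leaks one write buffer of 16,658 bytes, which matches the totals of the record (3,331,600 bytes in 200 blocks).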

Comment 10 Clemens Lang 2022-10-14 09:28:24 UTC

This may actually be a problem in OpenSSL. See bug 2134754 and https://github.com/acassen/keepalived/issues/2199#issuecomment-1277175404.

Comment 11 Clemens Lang 2022-10-14 10:37:40 UTC

I can confirm that this problem does occur with openssl-libs-1:3.0.1-41.el9_0.x86_64, but is not reproducible with openssl-libs-1:3.0.5-5.fc38.x86_64.

I'm moving this bug to openssl.

Comment 12 Clemens Lang 2022-11-08 13:23:24 UTC

*** Bug 2134754 has been marked as a duplicate of this bug. ***

Comment 13 Clemens Lang 2022-11-08 13:25:51 UTC

3d046c4d047a55123beeceffe9f8bae09159445e is the first fixed commit
commit 3d046c4d047a55123beeceffe9f8bae09159445e
Author: yangyangtiantianlonglong <yangtianlong1224>
Date:   Wed Jan 19 11:19:52 2022 +0800

    Fix the same BIO_FLAGS macro definition

    Also add comment to the public header to avoid
    making another conflict in future.

    Fixes #17545

    Reviewed-by: Paul Dale <pauli>
    Reviewed-by: Tomas Mraz <tomas>
    (Merged from https://github.com/openssl/openssl/pull/17546)

    (cherry picked from commit e278f18563dd3dd67c00200ee30402f48023c6ef)

 include/internal/bio.h   | 2 +-
 include/openssl/bio.h.in | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)


See https://github.com/openssl/openssl/commit/3d046c4d047a55123beeceffe9f8bae09159445e and https://github.com/openssl/openssl/issues/17545.
I guess upstream might not actually be aware that this fixed a memory leak.

I confirmed in bug 2134754 comment 3 that backporting this change fixes the leak.
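
To illustrate why two flag macros with the same value are a problem, here is a generic sketch of a bit collision. The names and values are hypothetical and chosen only for the example; they are not the actual definitions from the OpenSSL headers.

```python
# Hypothetical flags, for illustration only (not the real OpenSSL macros).
FLAG_PUBLIC = 0x800          # imagined flag defined in a public header
FLAG_INTERNAL_OLD = 0x800    # imagined internal flag, accidentally the same bit
FLAG_INTERNAL_NEW = 0x4000   # the same internal flag moved to an unused bit


def has_flag(flags: int, flag: int) -> bool:
    """Return True if every bit of `flag` is set in `flags`."""
    return (flags & flag) == flag


flags = 0
flags |= FLAG_PUBLIC  # only the public flag is ever set

# With the colliding value, code testing for the internal flag gets a false
# positive and may take a code path that was never meant for this object.
print(has_flag(flags, FLAG_INTERNAL_OLD))  # True
# With distinct values, the two flags no longer interfere.
print(has_flag(flags, FLAG_INTERNAL_NEW))  # False
```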

Comment 14 Dmitry Belyavskiy 2022-11-10 12:23:52 UTC

*** Bug 2134754 has been marked as a duplicate of this bug. ***

Comment 24 errata-xmlrpc 2023-05-09 08:20:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: openssl security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2523