Bug 2012249

Summary: gnutls_priority_set_direct occasionally fails with "The request is invalid"
Product: Red Hat Enterprise Linux 9
Reporter: Richard W.M. Jones <rjones>
Component: gnutls
Assignee: Daiki Ueno <dueno>
Status: CLOSED ERRATA
QA Contact: Alexander Sosedkin <asosedki>
Severity: high
Docs Contact:
Priority: high
Version: 9.0
CC: asosedki, berrange, eblake, jeckersb, kkiwi, michele, ssorce
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: gnutls-3.7.2-9.el9
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-17 15:52:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                          Flags
log file without gnutls debugging    none
synch-parallel-tls.sh.log            none
Log with GNUTLS_DEBUG_LEVEL=10       none
tlsthread.c                          none

Description Richard W.M. Jones 2021-10-08 16:22:26 UTC
Description of problem:

In RHEL 9, possibly when using lots of threads, calling
gnutls_priority_set_direct can fail with the error

error: failed to set TLS session priority to @NBDKIT,SYSTEM:+ECDHE-PSK:+DHE-PSK:+PSK: The request is invalid.

Version-Release number of selected component (if applicable):

gnutls-3.7.2-4.el9.x86_64

How reproducible:

Rare

Steps to Reproduce:

$ git clone https://gitlab.com/nbdkit/libnbd
$ sudo dnf builddep libnbd
$ cd libnbd
$ ./configure
$ make
$ while make -C tests check TESTS=synch-parallel-tls.sh  >& /tmp/log; do echo -n . ; done

Eventually it should fail.  For the log see tests/synch-parallel-tls.sh.log

Comment 1 Richard W.M. Jones 2021-10-08 16:23:55 UTC
Created attachment 1830898
log file without gnutls debugging

Comment 2 Richard W.M. Jones 2021-10-08 17:15:18 UTC
Only seems to be reproducible in RHEL 9.  I cannot reproduce it in Fedora.

Some think it might be connected to this non-upstream change
which is only in RHEL 9:
https://gitlab.com/gnutls/gnutls/-/merge_requests/1427

Comment 3 Daiki Ueno 2021-10-14 12:57:39 UTC
3.7.2-4 is the package that re-introduced LTO enablement after a long time. As it created several obscure issues in tests when running on aarch64 and ppc64le, we disabled LTO on those arches in 3.7.2-6. As far as I can tell from the original thread, the failure seems to happen only on aarch64 with 3.7.2-4, so I would suggest building with the latest gnutls package (3.7.2-7).

Comment 4 Richard W.M. Jones 2021-10-25 08:03:09 UTC
Created attachment 1836609
synch-parallel-tls.sh.log

My local testing is on x86-64.

The bug still happens (perhaps less often?) with gnutls-3.7.2-7.el9.x86_64

Attached latest log of the failure.

Comment 5 Daniel Berrangé 2021-10-25 08:21:41 UTC
(In reply to Daiki Ueno from comment #3)
> 3.7.2-4 is the package that re-introduced LTO enablement after a long time.
> As it created several obscure issues in tests when running on aarch64 and
> ppc64le, we disabled LTO on those arches in 3.7.2-6. 

(In reply to Richard W.M. Jones from comment #4)
> My local testing is on x86-64.
>
> The bug still happens (perhaps less often?) with gnutls-3.7.2-7.el9.x86_64

Perhaps it is worth doing a gnutls scratch build with LTO disabled on x86_64 too, and seeing if that solves it, as LTO has been a source of many weird non-deterministic bugs.

Comment 6 Richard W.M. Jones 2021-10-25 08:39:15 UTC
Oh interesting, I thought LTO had been disabled on all architectures.
I did a scratch build with LTO disabled on x86-64 too which I will
test once it has finished:
https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=746280

Comment 7 Richard W.M. Jones 2021-10-25 09:13:13 UTC
(In reply to Richard W.M. Jones from comment #6)
> https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=746280

This did *not* fix the problem, so it's not LTO.

Comment 8 Richard W.M. Jones 2021-10-25 10:08:03 UTC
(In reply to Richard W.M. Jones from comment #7)
> (In reply to Richard W.M. Jones from comment #6)
> > https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=746280
> 
> This did *not* fix the problem, so it's not LTO.

My apologies, I was reading the wrong log file.  In fact this package
does fix the problem, so it is a problem related to LTO on x86-64.

Comment 9 Richard W.M. Jones 2021-10-25 10:11:49 UTC
Oh I hate intermittent errors!  Just as I hit submit on that comment, the
test which had run successfully for 100+ cycles failed again with the
same problem.

This is with LTO disabled, so the problem still seems to be present and
NOT related to LTO after all.

I'm going to try this with upstream gnutls, and also see if I can get a
more reliable test case.

Comment 10 Richard W.M. Jones 2021-10-25 11:01:17 UTC
It would be really nice if gnutls got rid of the requirement for autogen.
This is not available on RHEL 9 and almost impossible to build on RHEL 9
because it depends on both itself and gnulib.

Comment 11 Richard W.M. Jones 2021-10-25 17:31:35 UTC
Created attachment 1836943
Log with GNUTLS_DEBUG_LEVEL=10

Comment 12 Daiki Ueno 2021-10-25 18:25:26 UTC
(In reply to Richard W.M. Jones from comment #11)
> Created attachment 1836943
> Log with GNUTLS_DEBUG_LEVEL=10

OK, thank you so much for looking into this; it does indeed look like a race condition: the resolved priority string is stored in the global variable system_wide_priority_string, while other threads may independently update that variable without holding a lock. A similar race seems to exist in _gnutls_unload_system_priorities() for system_wide_priority_strings, though that function is not called as frequently, because the caller checks the mtime of the config file first.  I'll create a patch shortly.
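
For readers unfamiliar with this class of bug, here is a minimal sketch of the general pattern: a process-wide cached string that several threads may replace and read. It is not the gnutls code and not the eventual patch. The names cached_priority, cached_lock, cache_set and cache_get are hypothetical, and a plain pthread mutex is assumed as the locking primitive.

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup() under strict -std modes */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a process-wide cached string; not gnutls code. */
static char *cached_priority = NULL;
static pthread_mutex_t cached_lock = PTHREAD_MUTEX_INITIALIZER;

/* Replace the cached string.  Without the lock, two threads could both
 * free the same old pointer, or a reader could copy memory that another
 * thread has just freed.  Returns 0 on success, -1 if out of memory. */
int cache_set(const char *value)
{
    char *copy = strdup(value);
    if (copy == NULL)
        return -1;

    pthread_mutex_lock(&cached_lock);
    free(cached_priority);          /* free(NULL) is a no-op */
    cached_priority = copy;
    pthread_mutex_unlock(&cached_lock);
    return 0;
}

/* Return a private copy, so the caller never holds a pointer that another
 * thread may free.  The caller frees the result.  Returns NULL if nothing
 * is cached or if memory allocation fails. */
char *cache_get(void)
{
    char *copy = NULL;

    pthread_mutex_lock(&cached_lock);
    if (cached_priority != NULL)
        copy = strdup(cached_priority);
    pthread_mutex_unlock(&cached_lock);
    return copy;
}
```

If the lock and unlock calls are removed, concurrent cache_set calls can free the same pointer twice, and cache_get can read freed memory. Either would be consistent with both an intermittent parse failure and heap corruption, though that link to the symptoms in this bug is an inference, not something established in the thread.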

Comment 13 Richard W.M. Jones 2021-10-25 18:31:36 UTC
Created attachment 1836965
tlsthread.c

This is a reproducer.  It fails for me reliably and in a couple
of interesting ways.  It does not require anything except RHEL 9
and gnutls-devel.

$ gcc -O2 -Wall -pthread tlsthread.c -o tlsthread -lgnutls
$ while ./tlsthread ; do echo -n . ; done
.............................................................................................................................................../tlsthread: gnutls_priority_set_direct: The request is invalid.

Sometimes it fails indicating memory corruption:

$ while ./tlsthread ; do echo -n .; done
tcache_thread_shutdown(): unaligned tcache chunk detected
Aborted (core dumped)

(The stack trace from this was not very interesting)

Comment 14 Daiki Ueno 2021-10-26 12:24:23 UTC
Thank you for the reproducer. I've created a scratch build with the proposed fix:
https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=748121

Comment 21 errata-xmlrpc 2022-05-17 15:52:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: gnutls), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:3937