Bug 2112125

Summary: Multithreaded clnt_create() may deadlock.
Product: Red Hat Enterprise Linux 9
Reporter: Steve Dickson <steved>
Component: libtirpc
Assignee: Steve Dickson <steved>
Status: CLOSED DUPLICATE
QA Contact: Zhi Li <yieli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 9.1
CC: attipaci, xzhou, yoyang
Target Milestone: rc
Keywords: Patch, Triaged
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2112116
Environment:
Last Closed: 2022-08-16 14:02:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2112116
Bug Blocks:

Description Steve Dickson 2022-07-28 21:43:07 UTC
+++ This bug was initially created as a clone of Bug #2112116 +++

Description of problem:

When clnt_create() or one of its related functions is called concurrently from multiple threads, the call may occasionally deadlock, and the calling program hangs.

The bug may affect NFS (remote file systems), and hence Kubernetes infrastructure as well, along with any application that manages RPC clients in parallel.

Version-Release number of selected component (if applicable):

The bug is definitely present in libtirpc versions 1.1.4 to 3.2.1. However, it likely affected at least some earlier versions of the library also.

How reproducible:

Making clnt_create calls on a multi-core system in parallel threads will produce the deadlock sooner or later. In our case on a 4-core x86_64 VM, with 8 parallel threads calling clnt_create() nearly simultaneously to 8 different RPC hosts, the deadlock typically occurs after a few dozen attempts.


Steps to Reproduce:

1. Make clnt_create() calls in multiple threads on a multicore Linux PC. Assume you have server nodes ('server1' through 'server8') running some RPC service (SOMEPROG, SOMEVERS), and you want to talk to these servers asynchronously in parallel threads, with each thread making its own RPC client connection. Here is an example C test program for that scenario:

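 #include <stdio.h>
 #include <stdlib.h>
 #include <rpc/rpc.h>

 /* SOMEPROG and SOMEVERS are placeholders: define them to the RPC
    program and version numbers of the service under test. */

 int main() {
   int i;

   /* Each of the 8 threads connects to a different host. */
   #pragma omp parallel for num_threads(8)
   for (i = 1; i<=8; i++) {
     char hostname[40];
     snprintf(hostname, sizeof(hostname), "server%d", i);
     clnt_create(hostname, SOMEPROG, SOMEVERS, "tcp");
   }

   /* Reached only if no thread deadlocked inside clnt_create(). */
   fprintf(stderr, "Success!!!\n");
   return 0;
 }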

2. Adapt the above program to an RPC service running on your own cluster of nodes, substituting the host names and the RPC program and version numbers as appropriate.

3. Compile with -fopenmp -lpthread -ltirpc -lrt


Actual results:

The program will mostly run fine, printing "Success!!!" to stderr, and returning to the shell prompt. However, after several (few dozen) attempts, it will eventually just hang without printing anything.

Expected results:

The program should ALWAYS print "Success!!!" and ALWAYS return to the prompt. Crucially, it should never hang. 

Additional info:

The expected behavior (no hangs in an MT environment) was in fact the behavior of the original SunRPC library, such as the one we use on some very old LynxOS 3.1.0 PowerPCs from the 1990s. The hanging is a regression that was introduced in libtirpc sometime after it was cloned from the original SunRPC.

--- Additional comment from Steve Dickson on 2022-07-28 21:41:42 UTC ---

commit 667ce638454d0995170dd8e6e0668ada733d72e7
Author: Attila Kovacs <attila.kovacs.edu>
Date:   Thu Jul 28 09:14:24 2022 -0400

    SUNRPC: mutexed access blacklist_read state variable.

commit 3f2a5459fb00c2f529d68a4a0fd7f367a77fa65a
Author: Attila Kovacs <attila.kovacs.edu>
Date:   Tue Jul 26 15:24:01 2022 -0400

    thread safe clnt destruction.

commit 7a6651a31038cb19807524d0422e09271c5ffec9
Author: Attila Kovacs <attila.kovacs.edu>
Date:   Tue Jul 26 15:20:05 2022 -0400

    clnt_dg_freeres() uncleared set active state may deadlock.


Author: Attila Kovacs <attila.kovacs.edu>
Date:   Wed Jul 20 17:03:28 2022 -0400

    Eliminate deadlocks in connects with an MT environment

Comment 4 Steve Dickson 2022-08-01 18:16:07 UTC
This patch is also needed:

commit fa153d634228216fc162e5d6583a7035af2c40ba (HEAD -> master, tag: libtirpc-1-3-3-rc5)
Author: Attila Kovacs <attila.kovacs.edu>
Date:   Mon Aug 1 11:28:43 2022 -0400

    SUNRPC: MT-safe overhaul of address cache management in rpcb_clnt.c

(In reply to Steve Dickson from comment #0)
> 
> commit 667ce638454d0995170dd8e6e0668ada733d72e7
> Author: Attila Kovacs <attila.kovacs.edu>
> Date:   Thu Jul 28 09:14:24 2022 -0400
> 
>     SUNRPC: mutexed access blacklist_read state variable.
> 
> commit 3f2a5459fb00c2f529d68a4a0fd7f367a77fa65a
> Author: Attila Kovacs <attila.kovacs.edu>
> Date:   Tue Jul 26 15:24:01 2022 -0400
> 
>     thread safe clnt destruction.
> 
> commit 7a6651a31038cb19807524d0422e09271c5ffec9
> Author: Attila Kovacs <attila.kovacs.edu>
> Date:   Tue Jul 26 15:20:05 2022 -0400
> 
>     clnt_dg_freeres() uncleared set active state may deadlock.
> 
> 
> Author: Attila Kovacs <attila.kovacs.edu>
> Date:   Wed Jul 20 17:03:28 2022 -0400
> 
>     Eliminate deadlocks in connects with an MT environment

Comment 5 Steve Dickson 2022-08-16 14:02:12 UTC

*** This bug has been marked as a duplicate of bug 2118157 ***