Bug 2231559

Summary: CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0)
Product: Red Hat Enterprise Linux 9 Reporter: Viktor Ashirov <vashirov>
Component: 389-ds-baseAssignee: LDAP Maintainers <idm-ds-dev-bugs>
Status: NEW --- QA Contact: LDAP QA Team <idm-ds-qe-bugs>
Severity: unspecified Docs Contact:
Priority: medium    
Version: 9.3CC: idm-ds-dev-bugs, jachapma, tbordaz
Target Milestone: rcKeywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
pstack after the test was stopped
none
pstack of unresponsive server
none
pstack with turbo mode disabled
none
pstack with turbo mode removed none

Description Viktor Ashirov 2023-08-11 21:33:04 UTC
Description of problem:
While running https://github.com/389ds/389-ds-base/blob/main/dirsrvtests/tests/perf/ltest.py with nsslapd-numlisteners=4, test hangs with:
20000 connections are open. Starting the latency test using Namespace(test_duration=None, wait_time=10, uri='ldap://192.168.122.63:389', basedn='ou=people, dc=example, dc=com', binddn='cn=directory manager', bindpw='password', nbconn=20000, verbose=0) as parameters

After test is stopped, CRIT messages are logged:
[11/Aug/2023:16:11:41.072307556 -0400] - CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0)
[11/Aug/2023:16:11:41.396395450 -0400] - CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0)
[11/Aug/2023:16:11:41.919739558 -0400] - CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0)
[11/Aug/2023:16:11:46.958568482 -0400] - CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0)

With 4 listeners and smaller number of connections test is successful.

Version-Release number of selected component (if applicable):
389-ds-base-2.3.4-3.el9.x86_64

Comment 1 Viktor Ashirov 2023-08-12 11:37:11 UTC
Created attachment 1983126 [details]
pstack after the test was stopped

Comment 2 Viktor Ashirov 2023-08-12 11:38:32 UTC
Created attachment 1983127 [details]
pstack of unresponsive server

I left the server running after the test was stopped. After some time I found it unresponsive and consuming 100% CPU.

Comment 3 Jamie Chapman 2023-08-15 10:45:42 UTC
Hi Viktor, 

I would be very interested to know the size of the connection sub tables on the system you used for this issue.

cat  /var/log/dirsrv/[YOUR INSTANCE]/errors | grep connection_table_new
eg
[15/Aug/2023:06:32:52.962909140 -0400] - INFO - connection_table_new - Number of connection sub-tables 4, each containing 15984 slots.

Thanks

Comment 4 Viktor Ashirov 2023-08-15 10:58:26 UTC
Hi Jamie,

I have the same number:
[11/Aug/2023:17:08:23.654787326 -0400] - INFO - connection_table_new - Number of connection sub-tables 4, each containing 15984 slots.

Thanks

Comment 5 thierry bordaz 2023-08-16 11:31:47 UTC
Looking at 'pstack of the unresponsive server', it is looking like the test of the turbo mode may contribute. For example all workers may entered in turbo mode and entering in infinite loop to evaluate its rank. Suprisingly all connection should be close so active list should be empty.

Thread 27 (Thread 0x7f9b309ff640 (LWP 24356) "ns-slapd"):
#0  table_iterate_function (arg=<synthetic pointer>, conn=<optimized out>) at ldap/servers/slapd/connection.c:1410
#1  connection_table_iterate_active_connections (f=<optimized out>, arg=<synthetic pointer>, ct=0x7f9bb9c32870) at ldap/servers/slapd/conntable.c:343
#2  connection_find_our_rank (conn=<optimized out>, conn=0x7f9b3b90ed70, our_rank=<synthetic pointer>, connection_count=<synthetic pointer>) at ldap/servers/slapd/connection.c:1425
#3  connection_enter_leave_turbo (new_turbo_flag=<synthetic pointer>, current_turbo_flag=0, conn=0x7f9b3b90ed70) at ldap/servers/slapd/connection.c:1458
#4  connection_threadmain (arg=<optimized out>) at ldap/servers/slapd/connection.c:1661
#5  0x00007f9bbede9c34 in _pt_root () at target:/lib64/libnspr4.so
#6  0x00007f9bbe69f822 in start_thread () at target:/lib64/libc.so.6
#7  0x00007f9bbe63f450 in clone3 () at target:/lib64/libc.so.6

If reproduced, samples of top -H may say if only one worker is active or all of them (more likely).
Could you try to reproduce with turbo mode disabled ?

Comment 6 Viktor Ashirov 2023-08-16 12:02:16 UTC
Created attachment 1983591 [details]
pstack with turbo mode disabled

With turbo mode disabled the server still hangs.

Comment 7 Viktor Ashirov 2023-08-16 12:03:10 UTC
I'll make another build with https://github.com/389ds/389-ds-base/pull/5893 to see if it makes any difference.

Comment 8 Viktor Ashirov 2023-08-16 13:49:19 UTC
Created attachment 1983610 [details]
pstack with turbo mode removed

Test still hands with turbo mode completely removed.