Description of problem: While running https://github.com/389ds/389-ds-base/blob/main/dirsrvtests/tests/perf/ltest.py with nsslapd-numlisteners=4, test hangs with: 20000 connections are open. Starting the latency test using Namespace(test_duration=None, wait_time=10, uri='ldap://192.168.122.63:389', basedn='ou=people, dc=example, dc=com', binddn='cn=directory manager', bindpw='password', nbconn=20000, verbose=0) as parameters After test is stopped, CRIT messages are logged: [11/Aug/2023:16:11:41.072307556 -0400] - CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0) [11/Aug/2023:16:11:41.396395450 -0400] - CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0) [11/Aug/2023:16:11:41.919739558 -0400] - CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0) [11/Aug/2023:16:11:46.958568482 -0400] - CRIT - connection_table_move_connection_out_of_active_list - conn 0 is already OUT of the active list (refcnt is 0) With 4 listeners and smaller number of connections test is successful. Version-Release number of selected component (if applicable): 389-ds-base-2.3.4-3.el9.x86_64
Created attachment 1983126 [details] pstack after the test was stopped
Created attachment 1983127 [details] pstack of unresponsive server I left the server running after the test was stopped. After some time I found it unresponsive and consuming 100% CPU.
Hi Viktor, I would be very interested to know the size of the connection sub tables on the system you used for this issue. cat /var/log/dirsrv/[YOUR INSTANCE]/errors | grep connection_table_new eg [15/Aug/2023:06:32:52.962909140 -0400] - INFO - connection_table_new - Number of connection sub-tables 4, each containing 15984 slots. Thanks
Hi Jamie, I have the same number: [11/Aug/2023:17:08:23.654787326 -0400] - INFO - connection_table_new - Number of connection sub-tables 4, each containing 15984 slots. Thanks
Looking at 'pstack of the unresponsive server', it is looking like the test of the turbo mode may contribute. For example all workers may entered in turbo mode and entering in infinite loop to evaluate its rank. Suprisingly all connection should be close so active list should be empty. Thread 27 (Thread 0x7f9b309ff640 (LWP 24356) "ns-slapd"): #0 table_iterate_function (arg=<synthetic pointer>, conn=<optimized out>) at ldap/servers/slapd/connection.c:1410 #1 connection_table_iterate_active_connections (f=<optimized out>, arg=<synthetic pointer>, ct=0x7f9bb9c32870) at ldap/servers/slapd/conntable.c:343 #2 connection_find_our_rank (conn=<optimized out>, conn=0x7f9b3b90ed70, our_rank=<synthetic pointer>, connection_count=<synthetic pointer>) at ldap/servers/slapd/connection.c:1425 #3 connection_enter_leave_turbo (new_turbo_flag=<synthetic pointer>, current_turbo_flag=0, conn=0x7f9b3b90ed70) at ldap/servers/slapd/connection.c:1458 #4 connection_threadmain (arg=<optimized out>) at ldap/servers/slapd/connection.c:1661 #5 0x00007f9bbede9c34 in _pt_root () at target:/lib64/libnspr4.so #6 0x00007f9bbe69f822 in start_thread () at target:/lib64/libc.so.6 #7 0x00007f9bbe63f450 in clone3 () at target:/lib64/libc.so.6 If reproduced, samples of top -H may say if only one worker is active or all of them (more likely). Could you try to reproduce with turbo mode disabled ?
Created attachment 1983591 [details] pstack with turbo mode disabled With turbo mode disabled the server still hangs.
I'll make another build with https://github.com/389ds/389-ds-base/pull/5893 to see if it makes any difference.
Created attachment 1983610 [details] pstack with turbo mode removed Test still hands with turbo mode completely removed.